Unsupervised medical image segmentation using edge mapping and adversarial learning
We develop an approach to unsupervised semantic medical image segmentation that extends previous work with generative adversarial networks. We use existing edge detection methods to construct simple edge diagrams, train a generative model to convert them into synthetic medical images, and construct a dataset of synthetic images with known segmentations using variations on the extracted edge diagrams. This synthetic dataset is then used to train a supervised image segmentation model. We test our approach on a clinical dataset of kidney ultrasound images and the benchmark ISIC 2018 skin lesion dataset. We show that our unsupervised approach is more accurate than previous unsupervised methods and performs reasonably compared to supervised image segmentation models. All code and trained models are available at https://github.com/kiretd/Unsupervised-MIseg.
In vivo medical imaging is one of the primary technologies available for clinical evaluation, diagnosis, and treatment planning. The physical challenge of imaging internal tissues is reflected in the low resolution, low signal-to-noise ratio, and high degree of occlusion seen with many common medical imaging technologies. Using medical images to make accurate and meaningful clinical decisions requires substantial training and experience combined with a large body of medical knowledge. As a result, current medical practice places a significant burden on highly trained clinicians specialized in interpreting medical images [46, 55].
A fundamental step in medical image analysis is to identify a region of interest (ROI), i.e., segmentation. This typically means identifying a bounding region that separates an organ or abnormality from other tissue in the image. For human readers, segmentation allows the extraction of clinically important metrics, such as volume, and supports the planning of radiation therapy or surgical removal. In computer-aided diagnosis (CAD), organ and tissue segmentation allows computer vision models to focus their feature extraction or feature learning on the clinically relevant tissue, yielding more computationally efficient models that are better able to avoid extraneous information in the data. Manually performing these segmentations is time-consuming, expensive, and subjective, which has motivated a major research effort to develop algorithms that can perform accurate and reliable semantic segmentation of medical images (i.e., segmentation that associates pixels or regions of the image with a classification label).
We present an approach to organ and tissue segmentation based on the use of Generative Adversarial Networks (GANs) to generate a labelled synthetic training set in the absence of ground truth labels for real medical images. We circumvent the need for labelled real images by generating medical images from simplistic and arbitrary edge diagrams. We then use the synthetic training set to train supervised segmentation models, which are then applied to real images. We evaluate our approach using two datasets: a dataset of ultrasound images in which the task is to segment the kidney, and the ISIC 2018 Skin Lesion Analysis competition dataset, in which the task is to segment skin lesions in dermoscopic images.
The main contributions of this work are as follows:
We demonstrate a novel form of data augmentation by using GANs to generate labelled training data from edge diagrams for applications in which we can exploit a common geometry that is inherent to the semantic segmentation task itself.
We show that GANs can generate reasonable synthetic medical images with corresponding organ segmentation maps from just edge diagrams.
By generating data using edge diagrams, we show that we can obtain accurate and reliable organ segmentation in a fully unsupervised way, with the option of semi-supervised training if labelled data are available.
Traditional approaches to algorithmic image segmentation were largely unsupervised, i.e., they did not rely on ground truth (clinician-supplied) segmentations to train a model. A variety of such methods were developed in previous decades (for example, methods based on edge detection, region growing, contour modelling, and texture analysis); however, these typically relied on built-in constraints about object appearance or differences in contrast or intensity between regions of interest and background pixels. Such constraints do not always work well for medical images, particularly for imaging modalities that produce lower quality images (e.g., ultrasound imaging), or for regions of the body where multiple organ and tissue types are imaged together.
To overcome the shortcomings of these earlier approaches, modern image segmentation techniques often rely on supervised learning with deep neural networks and large amounts of labelled training data. These models are capable of performing semantic segmentation, and thus can categorize regions of images based on meaningful labels provided by a clinician. The most common approach of the last few years has been based on convolutional neural networks (CNNs), which have been widely demonstrated to be successful for many kinds of computer vision tasks [59, 64, 17, 37].
Key developments in semantic segmentation have been based on variations of CNNs. The Fully Convolutional Network (FCN) omitted the fully connected layers used in standard CNNs and instead used deconvolution layers to obtain pixel-wise segmentation probability maps for images. A similar idea based on encoder-decoder networks was developed by deconvolving VGG16, a CNN pretrained on the ImageNet dataset that is sometimes used as a starting point for specific medical imaging problems (e.g., [36, 29]). To take greater advantage of the spatial correlations between pixels that should be grouped together, CNNs have also been combined with Conditional Random Fields (CRFs).
Different forms of these CNN approaches have dominated the field of medical image segmentation as well [40, 66, 32, 16, 6, 11]. In particular, a specific instance of FCNs, U-net, has performed well for a variety of medical imaging segmentation tasks. Due to its success, it has since been extended in many ways: for 3D images, with an attention mechanism, with a pretrained VGG11 encoder, and so on.
Two major limitations reduce the utility of the CNN approaches described above: 1) they are trained explicitly to minimize pixel-wise segmentation error and therefore typically require significant post-processing of their outputs in order to obtain solutions that are spatially contiguous, and 2) they require ground truth segmentations for training, which can be very difficult and expensive to obtain on the scale that is required for effective deep learning. While some recent methods have been able to address the first limitation by training for scalable spatial coherence using patch learning with multi-scale loss functions [21, 27, 44], they do not address the need for manually segmented training images. Here we propose the use of a GAN to create synthetic training data that can be used to train supervised image segmentation models when no labelled training data are available, thus allowing for unsupervised medical image segmentation.
GANs have been formulated as image-to-image translation architectures that take paired images as input, and thus have been successfully applied to semantic segmentation by training them on pairs of images and their corresponding ground truth segmentations. This has been done in a fully supervised manner [38, 62, 52, 63, 40] and in a semi-supervised or weakly supervised manner. Most interestingly, researchers have taken advantage of the fact that GANs, by their nature, can be used to generate synthetic data as a form of data augmentation. Using this approach, GANs can be trained with a relatively low number of image-segmentation pairs to generate additional training data for a DualGAN semantic segmentation model [53, 19], or for a fully supervised model like U-net. However, these approaches, like the previous CNN-based models, are limited by the fact that ground truth segmentations are required to train the GANs in the first place.
Different approaches have been taken to overcome the need for segmentation labels during training. W-net pairs two U-nets to form a deep auto-encoder that can be used in combination with a CRF algorithm for scene decomposition. In contrast, co-segmentation approaches exploit feature similarity across multiple instances of same-class objects in an image, which is suitable for certain kinds of segmentation tasks with distinct ROIs. Recent recomposition approaches based on generative modelling (e.g., SEIGAN) segment foreground objects by moving them to similar background images. Perhaps most similar to ours is a very recent approach, ReDO, that performs scene decomposition via region-wise composition using a GAN, based on the assumption that the different objects composing a scene are statistically independent with respect to certain properties, such as colour and texture.
All of the above approaches assume that the target ROI for segmentation is easily distinguishable from the rest of the image along some feature dimensions, such as brightness or colour, and therefore try to define or learn the properties that distinguish regions of the image. In many medical imaging applications, this is extremely difficult to do, as there may not be a set of learnable properties that support the task. In the case of organ segmentation, as demonstrated with the kidney ultrasound dataset presented here, a clinical expert would typically rely heavily on prior anatomical knowledge and experience, which provides an expectation of the contours of the kidney in the absence of a clear boundary. For this reason, non-expert humans are likely to fail at this particular task (see Figure 5). We overcome this challenge using a generative process to learn an expectation of the shape of the ROI in the data generation phase, as described below.
Here we propose an extension of this previous work that generates synthetic training data using GANs in a fully unsupervised way, for applications in which an expected segmentation geometry can serve as a prior. It is based on the assumption that there exists a simple template structure that can be exploited to generate simple diagrams with known segmentations (what we call edge diagrams), from which a GAN can generate sufficiently realistic (and similarly challenging) training images. As long as reasonable edge diagrams can be extracted from the original images to train the GAN, and new edge diagrams can be constructed using variations on the template structure as the ground truth segmentations, synthetic training data can be generated with known segmentations.
Our approach follows a simple recipe. First we generate simple edge diagrams from real unlabelled training images using available computer vision techniques. We use the corresponding image-diagram pairs to train a GAN to produce synthetic medical images from the edge diagrams. We then use a simple algorithm to generate variations of these edge diagrams with known ROIs, and use the trained GAN to synthesize new images from these new edge diagrams. Finally, we use these new purely synthetic image-segmentation pairs to train a supervised image segmentation model that can be used to identify ROIs in real medical images. The entire approach is illustrated in Figure 6.
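To make the recipe concrete, the sketch below strings the four steps together. It is a minimal skeleton under stated assumptions, not our released implementation: the component functions are passed in as arguments, and every name here is a placeholder for the models described in the following sections.

```python
def unsupervised_segmentation_pipeline(
    real_images,
    extract_edge_diagram,    # step 1: real image -> edge diagram
    train_gan,               # step 2: (diagrams, images) -> generator function
    make_synthetic_diagram,  # step 3: () -> (diagram, ground-truth mask)
    train_segmenter,         # step 4: (images, masks) -> segmentation model
    n_synthetic=2000,
):
    # Step 1: edge diagrams from unlabelled real images (no ground truth needed).
    diagrams = [extract_edge_diagram(img) for img in real_images]

    # Step 2: a GAN learns diagram -> image translation from (diagram, image) pairs.
    generate_image = train_gan(diagrams, real_images)

    # Step 3: synthetic diagrams come with known masks by construction,
    # and the trained generator renders them into synthetic medical images.
    pairs = [make_synthetic_diagram() for _ in range(n_synthetic)]
    synth_diagrams = [d for d, _ in pairs]
    masks = [m for _, m in pairs]
    synth_images = [generate_image(d) for d in synth_diagrams]

    # Step 4: a supervised segmenter trained on purely synthetic pairs
    # is then applied to real medical images.
    return train_segmenter(synth_images, masks)
```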
We use a dataset of renal ultrasound images developed for studying prenatal hydronephrosis, a congenital kidney disorder marked by excessive and potentially dangerous fluid retention in the kidneys. The dataset consists of 2492 2D sagittal kidney ultrasound images from 773 patients across multiple hospital visits. This is a difficult dataset for image segmentation due to poor image quality, unclear contours of the kidneys, and the large variation introduced by different degrees of hydronephrosis (see Figure 5). In addition, a major challenge of this dataset is that its two most salient boundaries, the outer ultrasound cone inherent to imaging with an ultrasound probe and the dark inner region of the kidney caused by fluid retention in hydronephrosis, are both misleading with respect to segmenting the kidney.
We follow a methodology similar to that described in  for preprocessing renal ultrasound images for deep learning. We crop the images to remove white borders, despeckle them to remove speckle noise caused by interference with the ultrasound probe during imaging, and re-scale them to 256×256 pixels for consistency. We remove text annotations made by clinicians using the pre-trained Efficient and Accurate Scene Text Detector (EAST). We then normalize the pixel intensity of each image to the range 0 to 1 after trimming intensities to the 2nd-98th percentile range of the original image. In addition, we enhance the contrast of each image using Contrast Limited Adaptive Histogram Equalization (CLAHE) with a clip limit of 0.03. Finally, we normalize the images by the mean and standard deviation of the training set during cross-validation. The results of preprocessing can be seen in the example given in Figure 6.
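A minimal sketch of this preprocessing in Python, assuming scikit-image is available. Border cropping and the EAST-based text removal are omitted, and the median filter is a stand-in assumption for the unspecified despeckling step; the percentile trimming and CLAHE settings follow the text above.

```python
import numpy as np
from skimage import exposure, filters, transform

def preprocess_ultrasound(img):
    # Despeckle: a median filter is one common choice for ultrasound
    # speckle (the exact despeckling method is an assumption here).
    img = filters.median(img)

    # Re-scale to 256x256 pixels for consistency.
    img = transform.resize(img, (256, 256), anti_aliasing=True)

    # Trim intensities to the 2nd-98th percentile range, then map to [0, 1].
    lo, hi = np.percentile(img, (2, 98))
    img = (np.clip(img, lo, hi) - lo) / (hi - lo + 1e-8)

    # CLAHE with a clip limit of 0.03, as described above. Mean/std
    # normalization is applied later, per training fold.
    return exposure.equalize_adapthist(img, clip_limit=0.03)
```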
We perform no preprocessing on the ISIC skin lesion images other than resizing them to 256×256 pixels.
To obtain edge diagrams from real medical images, we start with a rough edge map given by a pre-trained edge detector that uses richer convolutional features (RCF) with the VGG16 architecture, which we then refine using non-maximum suppression with Structured Forests for edge thinning (as recommended by the authors of RCF). To simplify the edge map and remove non-zero pixels that do not belong to the ROI, we downscale the image to 32×32 pixels, remove any regions with an area smaller than 3 pixels, and skeletonize the image.
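The simplification steps after edge detection map directly onto scikit-image primitives. The sketch below assumes a raw edge map from any detector (e.g., RCF) as input; the 0.5 binarization threshold is an assumption.

```python
import numpy as np
from skimage.morphology import remove_small_objects, skeletonize
from skimage.transform import resize

def simplify_edge_map(edge_map):
    # Downscale the raw edge map to 32x32; anti-aliasing averages the
    # edge responses into the coarser grid.
    small = resize(edge_map.astype(float), (32, 32), anti_aliasing=True)

    # Binarize (0.5 threshold is an assumption) and drop connected
    # regions smaller than 3 pixels, which are treated as noise.
    binary = remove_small_objects(small > 0.5, min_size=3)

    # Skeletonize so every surviving edge is a single pixel wide.
    return skeletonize(binary)
```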
Since the edge diagrams are simplistic, synthetic edge diagrams can be generated in a variety of ways (e.g., they can be drawn by hand if desired). For the results presented here, we train a Variational Autoencoder (VAE) to learn a latent space representing edge diagrams obtained from real images. While this model can generate synthetic edge diagrams, it does not directly provide a known ground truth segmentation. We therefore use Otsu's method [43, 48] to extract just the outer profile of the edge diagram, which corresponds to the ultrasound cone (the outer profile of ultrasound images produced by the ultrasound probe). We then generate a ground truth segmentation inside the cone of the synthetic edge diagram, ensuring that we know every pixel belonging to the desired segmentation mask.
To generate the ground truth segmentation representing the kidney ROI, we compute a random ellipse with a random origin, rotation, and major and minor axes within the bounds of the 32×32 pixel edge diagram. We draw randomly selected arcs from the ellipse so as to leave gaps in the kidney outline, simulating occlusion of the kidney boundary. We then also draw an arc inside the ellipse roughly parallel to the major axis to represent the renal pelvis. Finally, we add some noise in the form of random pixels inside the ellipse. Both the extracted and synthetic edge diagrams are rescaled up to 256×256 pixels for training the GAN.
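A sketch of this ground-truth generator using OpenCV. The parameter ranges (ellipse axes, arc lengths, gap probability, noise count, renal pelvis arc angles) are illustrative assumptions rather than the exact values we used.

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)

def synthetic_kidney_diagram(size=32):
    diagram = np.zeros((size, size), np.uint8)
    mask = np.zeros((size, size), np.uint8)

    # Random ellipse: origin, rotation, and major/minor axes kept inside
    # the 32x32 frame (ranges are illustrative).
    center = (int(rng.integers(10, 22)), int(rng.integers(10, 22)))
    axes = (int(rng.integers(6, 12)), int(rng.integers(3, 7)))
    angle = float(rng.uniform(0.0, 180.0))

    # The filled ellipse is the known ground-truth segmentation.
    cv2.ellipse(mask, center, axes, angle, 0, 360, 1, -1)

    # Outline drawn as randomly selected arcs, leaving gaps that
    # simulate occlusion of the kidney boundary.
    start = 0.0
    while start < 360.0:
        arc = float(rng.uniform(40.0, 120.0))
        if rng.random() < 0.75:  # ~25% of the outline left as gaps
            cv2.ellipse(diagram, center, axes, angle, start, start + arc, 255, 1)
        start += arc

    # Inner arc roughly parallel to the major axis: the renal pelvis.
    inner = (max(axes[0] - 3, 1), max(axes[1] - 2, 1))
    cv2.ellipse(diagram, center, inner, angle, 200, 340, 255, 1)

    # A few random noise pixels inside the ellipse.
    ys, xs = np.nonzero(mask)
    for i in rng.choice(len(ys), size=3):
        diagram[ys[i], xs[i]] = 255

    return diagram, mask
```

Both outputs would then be rescaled to 256×256 before GAN training, as described above.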
Note that in many medical imaging applications, the entire process involving the VAE may be skipped and only the ground truth segmentation is needed (as we do with the ISIC 2018 dataset). We specifically include the cone and the segmentation in the synthetic edge diagrams for ultrasound images to ensure the GAN generates synthetic ultrasound images with cone profiles, thus preventing the later segmentation model from learning to only segment the outer cone.
The same process was used to generate synthetic edge diagrams for the skin lesion images, with two notable exceptions: no cone was created, and random lines, arcs, and smaller ellipses were added to some synthetic edge diagrams to mimic the presence of rulers, pen marks, hairs, and other objects that sometimes appeared in the real images.
In principle, any method that produces edge diagrams with known segmentations and enough variation can be used. The methods described here are included for reproducibility rather than methodological necessity.
The conventional GAN uses the loss function

$$\min_{\theta_G}\max_{\theta_D} \mathcal{L}(\theta_G,\theta_D) = \mathbb{E}_{x\sim P_X}[\log(D(x))] + \mathbb{E}_{z\sim P_Z}[\log(1-D(G(z)))],$$

where $\theta_G$ and $\theta_D$ are the parameters of the generator $G$ and discriminator $D$, $x$ is a real image from our set of real ultrasound images with unknown distribution $P_X$, and $z$ is a random noise vector drawn from some defined probability distribution $P_Z$ (in this case, a Gaussian distribution). Training the GAN sets the generator and discriminator in competition with one another: the generator is trained to minimize the objective function by generating images that are indistinguishable from the real training images, and the discriminator is trained to maximize the objective function by learning to distinguish the images synthesized by the generator from real training images. For this work we use the pix2pixHD architecture.
This architecture uses two subnetworks to create a coarse-to-fine generator that can upscale image quality during image-to-image translation, and three multiscale discriminators to address the need to discriminate between high-resolution synthetic images and real images while keeping the network size and memory requirements relatively low. The full network is trained with a loss function that extends the conventional GAN loss above to multiple discriminators by summing over them:

$$\min_{\theta_G}\left[\left(\max_{\theta_{D_1},\theta_{D_2},\theta_{D_3}} \sum_{k=1}^{3} \mathcal{L}(\theta_G,\theta_{D_k})\right) + \lambda \sum_{k=1}^{3} \mathcal{L}_{FM}(\theta_G,\theta_{D_k})\right],$$

where $\lambda$ is a parameter used to balance the influence of each term of the loss function. Here, $\mathcal{L}_{FM}$ is the layer-wise feature matching loss, incorporated to account for the fact that the generator must now model data distributions at multiple scales:

$$\mathcal{L}_{FM}(\theta_G,\theta_{D_k}) = \mathbb{E}_{(z,x)} \sum_{i=1}^{T} \frac{1}{N_i} \left\lVert D_k^{(i)}(x) - D_k^{(i)}(G(z)) \right\rVert_1,$$

where $T$ is the number of layers, $N_i$ is the number of units in layer $i$, and $D_k^{(i)}$ denotes the activations of layer $i$ of discriminator $k$. In this work we are not upscaling the resolution of images, but we find pix2pixHD to be valuable for translating from a simple image (our edge diagrams) to more complex images (medical images).
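For concreteness, a PyTorch sketch of the generator-side objective under stated assumptions: each discriminator is assumed to return its list of layer activations with the final entry serving as the real/fake logit map (an interface assumption, not a fixed API), a non-saturating BCE term stands in for the adversarial loss (pix2pixHD itself uses a least-squares variant), and `lam=10.0` is a common default rather than a value specified here.

```python
import torch
import torch.nn.functional as F

def generator_loss(discriminators, diagram, real_image, fake_image, lam=10.0):
    adv = fake_image.new_zeros(())
    fm = fake_image.new_zeros(())
    for D in discriminators:  # the three multiscale discriminators
        fake_feats = D(diagram, fake_image)  # list of layer activations
        real_feats = D(diagram, real_image)

        # Adversarial term: fool this discriminator. The final activation
        # is treated as the real/fake logit map.
        logits = fake_feats[-1]
        adv = adv + F.binary_cross_entropy_with_logits(
            logits, torch.ones_like(logits))

        # Feature-matching term: mean L1 distance between layer
        # activations; the mean reduction supplies the 1/N_i factor.
        for rf, ff in zip(real_feats[:-1], fake_feats[:-1]):
            fm = fm + F.l1_loss(ff, rf.detach())

    # lambda balances the adversarial and feature-matching terms.
    return adv + lam * fm
```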
For our ultrasound images, a trained surgical urologist provided segmentations for 491 images (approximately evenly split by class; range: 96-100 per class). We reserve these images for evaluation (i.e., they are not used to train any model). We additionally remove any training images taken from patients who are also represented in the evaluation set, to avoid overfitting to subject-specific characteristics. In total, we use 918 images to train the GAN, with 20% used for validation. From these, we create a synthetic training set of 2000 image-segmentation pairs.
For the ISIC 2018 dataset, 2075 images are used for training and 519 are used for evaluation. Using these data, we generate 3000 synthetic image-segmentation pairs for training and 750 for validation. For both datasets, we generated as many images as required until segmentation accuracy on the validation set no longer improved.
We train our implementation of pix2pixHD using the same settings given in , and choose the parameters corresponding to the epoch that minimizes the Fréchet Inception Distance (FID) with respect to the validation data: the 90th epoch for the ultrasound dataset and the 100th epoch for the skin lesion dataset.
We use the U-net architecture defined in  to train a segmentation model for the ultrasound dataset, but with the sum of the pixel-wise binary cross-entropy and the Dice coefficient as our loss function. We use Adam for optimization with a batch size of 1. Finally, we perform data augmentation with horizontal flips (50% probability) and horizontal and vertical translations of up to 26 pixels (10%).
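A minimal PyTorch version of this combined loss. The soft-Dice smoothing constant is an assumed implementation detail; the text specifies only the sum of binary cross-entropy and a Dice term.

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, smooth=1.0):
    # Pixel-wise binary cross-entropy on the raw logits.
    bce = F.binary_cross_entropy_with_logits(logits, target)

    # Soft Dice coefficient on the predicted probabilities; `smooth`
    # guards against empty masks (an assumption).
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum()
    dice = (2.0 * intersection + smooth) / (probs.sum() + target.sum() + smooth)

    # Adding (1 - dice) to the BCE term penalizes low Dice overlap.
    return bce + (1.0 - dice)
```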
We use the Mask-RCNN implementation provided here , adjusted for the ISIC 2018 dataset. We use anchor sizes of  and 32 training ROIs per image. Other hyperparameters were kept at their default values. We perform data augmentation with both horizontal and vertical flips (50% probability), rotation of  or , and a Gaussian blur of up to 5 standard deviations.
We evaluate our model using three standard metrics: the F1 score, mean intersection over union (mIoU), and pixel-wise classification accuracy (pACC).
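These metrics reduce to confusion-matrix counts over binary masks. A reference NumPy sketch is below, including the thresholded mIoU variant used by the ISIC 2018 submission system (described in the Results); edge cases such as empty masks would need additional guarding.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    # Confusion-matrix counts for one pair of binary masks.
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    tn = np.sum(~pred & ~truth)

    f1 = 2 * tp / (2 * tp + fp + fn)   # F1 score (assumes non-empty union)
    iou = tp / (tp + fp + fn)          # average over images to get mIoU
    pacc = (tp + tn) / pred.size       # pixel-wise classification accuracy
    return f1, iou, pacc

def thresholded_miou(per_image_ious, thresh=0.65):
    # th-mIoU (ISIC 2018): per-image IoUs below 0.65 are zeroed
    # before taking the mean.
    return float(np.mean([iou if iou >= thresh else 0.0
                          for iou in per_image_ious]))
```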
The clinician-provided segmentations were drawn as imprecise outlines on the ultrasound images (see Figure 10), and therefore could not be used to generate masks in a simple and direct way. We therefore use OpenCV to convert these segmentations to masks.
For each clinician-provided segmentation, we first compute its difference with the original unsegmented ultrasound image. Since some background noise is retained in most images, we use an adaptive threshold to convert the difference image $d$ to a binary image $b$ using the following formula:

$$b(x,y) = \begin{cases} 1 & \text{if } d(x,y) > \mu(x,y) \\ 0 & \text{otherwise,} \end{cases}$$

where $\mu(x,y)$ is the mean of the pixel neighbourhood around pixel $(x,y)$ computed from the difference image $d$.
We then use a border-following algorithm with Teh-Chin chain approximation to identify contours in the binary images. Contours with an area of less than 25 pixels are removed as noise. We compute and fill the convex hull of the remaining contours using the Sklansky algorithm. We use the resulting masks as the ground truth for evaluating segmentation performance.
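This outline-to-mask procedure maps onto standard OpenCV calls, as in the illustrative sketch below (the neighbourhood size and threshold offset are assumptions; inputs are assumed to be 8-bit grayscale). `cv2.findContours` implements border following, `CHAIN_APPROX_TC89_L1` gives the Teh-Chin approximation, and `cv2.convexHull` uses the Sklansky algorithm.

```python
import cv2
import numpy as np

def outline_to_mask(segmented_img, original_img):
    # Difference image isolates the drawn outline plus residual noise.
    diff = cv2.absdiff(segmented_img, original_img)

    # Adaptive mean threshold: a pixel is kept if it exceeds the mean of
    # its local neighbourhood (11x11 block and zero offset are assumptions).
    binary = cv2.adaptiveThreshold(
        diff, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 11, 0)

    # Border following with Teh-Chin chain approximation.
    contours, _ = cv2.findContours(
        binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_TC89_L1)

    # Remove contours smaller than 25 px, then fill the convex hull of
    # the survivors.
    mask = np.zeros_like(diff)
    for c in contours:
        if cv2.contourArea(c) >= 25:
            cv2.fillConvexPoly(mask, cv2.convexHull(c), 255)
    return mask
```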
Following this procedure, the masked images are visually inspected and compared to the clinician-provided segmentations, and masks that deviate significantly from the clinician's segmentations (e.g., because additional annotations were added to the segmented images, as seen in Figure 10) are omitted from further analysis. In total, 53 images are removed (438 are used for evaluation), and the class distribution remains relatively even (range: 83-93 per class).
To illustrate the similarity between real and synthetic images, we show a random sample of real and generated kidney ultrasound images in Figure 11, and a random sample of real and generated dermoscopic images in Figure 12. Since our goal is to generate images that are merely similar enough for generalizable training of a segmentation model, our approach does not produce state-of-the-art synthetic images; instead, it produces images that have similar segmentation properties.
In Figure 13 we show the kidney segmentation masks learned through our fully unsupervised approach (with U-net as the segmentation model), compared with a purely supervised U-net and a purely unsupervised W-net. In Table 1 we show the corresponding segmentation performance metrics. For the semi-supervised extensions of our approach, we train a U-net using real and synthetic ultrasound images in a standard training protocol (U-net), and we also train a U-net using just the synthetic data followed by supervised fine-tuning with 45 of the real images with clinician segmentations, which were then removed from the evaluation set (U-net+).
| | Model | F1 | Specificity | Sensitivity | mIoU | pACC |
|---|---|---|---|---|---|---|
| Unsup. | Ours (U-net) | 0.81 (0.09) | 0.92 (0.05) | 0.87 (0.14) | 0.69 (0.12) | 0.90 (0.05) |
| | W-net | 0.46 (0.10) | 0.20 (0.05) | 0.98 (0.02) | 0.41 (0.07) | 0.41 (0.07) |
| Semi-Sup. | Ours (U-net) | 0.87 (0.11) | 0.97 (0.04) | 0.86 (0.13) | 0.78 (0.13) | 0.93 (0.05) |
| | Ours (U-net+) | 0.88 (0.08) | 0.97 (0.03) | 0.88 (0.09) | 0.80 (0.11) | 0.94 (0.04) |
| Sup. | U-net | 0.91 (0.09) | 0.97 (0.04) | 0.90 (0.10) | 0.84 (0.10) | 0.95 (0.03) |

Table 1: Segmentation performance on the kidney ultrasound dataset; values are mean (SD).
Performance metrics on the ISIC 2018 dataset using our unsupervised approach are shown in Table 2, along with results obtained by the competition winner and the current top submission. Here we use the metrics given by the online submission system, which include a thresholded mIoU (th-mIoU); this metric sets all per-image IoU scores less than 0.65 to 0 before computing the mean IoU. Semi-supervised results are not available because the ISIC 2018 test submission page was removed while this manuscript was being prepared, and the test set is not currently available. Examples of the output masks on validation images are shown in Figure 14.
| Ali et al. 2019 | 0.543 | n/a | n/a | 0.440 | n/a | n/a |
We present an unsupervised approach to semantic medical image segmentation that takes advantage of recent advances in image synthesis and generative modelling by making assumptions about the common geometry inherent to an object of interest. This method performs better than some previous unsupervised methods that fit the problem definition (e.g., W-net) or for which results are available (e.g., the CNN-based approach in ). For example, W-net performs poorly on the kidney segmentation task because it identifies only the ultrasound cone itself, rather than the kidney. We also show that our approach performs nearly as well as supervised methods for most images. Importantly, we show that with just a few training examples for supervised fine-tuning (here, only 10% of the data used for the supervised models), we approach the segmentation performance of purely supervised models.
Our method tends towards identifying larger ROIs that contain the desired ROI, which results in high specificity (0.92 and 0.947 for the kidney dataset and ISIC 2018 respectively) and only moderate sensitivity (0.87 and 0.835). For both datasets, the model fails for a small subset of the images. In the case of the kidney dataset, we find no clear pattern to explain the failed images. However, in the case of ISIC 2018 images, the unsupervised model does poorly with images that contain a lens or film placed on top of the skin lesion (in these cases, the model incorrectly segments the lens instead of the skin lesion underneath).
Interestingly, even though we construct edge diagrams based on smooth and convex shapes for image synthesis, the resulting segmentation models are able to fit non-smooth and non-convex boundaries. It is possible that alternative methods for generating edge diagrams with greater complexity may lead to a more flexible model that can adapt to more complex geometries. We are currently exploring the utility of this method in segmenting organs with more complex geometries, segmenting multiple objects per image, and performing 3D segmentation. We are also currently exploring adaptations of our approach that make it more end-to-end, e.g., by using multiple GANs.