Pathological Retinal Region Segmentation From OCT Images Using Geometric Relation Based Augmentation

03/31/2020 ∙ by Dwarikanath Mahapatra, et al. ∙ EPFL

Medical image segmentation is an important task for computer aided diagnosis. Pixelwise manual annotation of large datasets requires high expertise and is time consuming. Conventional data augmentation has limited benefit because it does not fully represent the underlying distribution of the training set, which affects model robustness when testing on images captured from different sources. Prior work leverages synthetic images for data augmentation but ignores the interleaved geometric relationship between different anatomical labels. We propose improvements over previous GAN-based medical image synthesis methods by jointly encoding the intrinsic relationship of geometry and shape. Sampling the latent space variables yields diverse images generated from a base image and improves robustness. Given the augmented images generated by our method, we train a segmentation network to enhance segmentation performance on retinal optical coherence tomography (OCT) images. The proposed method outperforms state-of-the-art segmentation methods on the public RETOUCH dataset, whose images were captured with different acquisition procedures. Ablation studies and visual analysis also demonstrate the benefits of integrating geometry and diversity.




1 Introduction

Medical image segmentation is an important task for healthcare applications like disease diagnosis, surgical planning, and disease progression monitoring. While deep learning (DL) methods demonstrate state-of-the-art results for medical image analysis tasks

[39], their robustness depends upon the availability of a diverse training dataset to learn different disease attributes such as appearance and shape characteristics. Large scale dataset annotation for segmentation requires image pixel labels, which is time consuming and demands a high degree of clinical expertise. The problem is particularly acute for pathological images, since it is difficult to obtain diverse images for less prevalent disease conditions, necessitating data augmentation. We propose a generative adversarial network (GAN) based approach for pathological image augmentation and demonstrate its efficacy in pathological region segmentation. Figure 1 summarizes the image generation results of our approach and [43], and highlights our superior performance from incorporating geometric information.

Figure 1: (a) Base image (red contour denotes segmentation mask); Example of generated images using: (b) Our proposed method; (c) Zhao et al. [43]; (d) method by [1]; (e) method by [25].
Figure 2: Example of normal and fluid filled OCT images: (a) example control subject image without any abnormalities (taken from [9]); (b) images with accumulated fluid due to diabetic macular edema and AMD from our dataset. The fluid areas are marked with red arrows.

Traditional augmentations such as image rotations or deformations have limited benefit, as they do not fully represent the underlying data distribution of the training set and are sensitive to parameter choices. Recent works [15, 43, 14, 30] propose to solve this issue by using synthetic data for augmentation to increase diversity in the training samples. However, certain challenges have not been satisfactorily addressed by these methods.

Zhao et al. [43] proposed a learning-based registration method that registers images to an atlas and uses the corresponding deformation field to deform a segmentation mask and obtain new data. This approach presents the following challenges: 1) since registration errors propagate to subsequent stages, inaccurate registration can adversely affect the data generation process; 2) with an atlas of a normal subject, it is challenging to register images from diseased subjects due to appearance or shape changes. This is particularly relevant for layer segmentation in retinal optical coherence tomography (OCT) images, where there is a drastic difference in layer shape between normal and diseased cases. Figure 2 (a) shows the retinal layers of a normal subject, and Figure 2 (b) shows two cases of retinal fluid build-up due to diabetic macular edema (DME) and age-related macular degeneration (AMD). The retinal layers are severely distorted compared to Figure 2 (a), and registration approaches have limited impact in generating accurate images.

Recent methods for data augmentation [14, 30, 6, 7] using a generative adversarial network (GAN) [13] have shown moderate success for medical image classification. However, they have limited relevance for segmentation, since they do not model the geometric relation between different organs, and most augmentation approaches do not differentiate between normal and diseased samples. Experiments in Section 4.5 show that segmentation methods trained on normal subject images (Figure 2 (a)) are not equally effective for diseased cases due to significant shape changes between the two types. Hence there is a need for augmentation methods that consider the geometric relation between different anatomical regions and generate distinct images for diseased and normal cases. Another limitation of current augmentation approaches is that they do not incorporate diversity in a principled manner. In [25] the shape mask was perturbed manually for image generation, which is not practical and may lead to unrealistic deformations.

2 Related Work

2.1 Deep Models for Retinal OCT Segmentation

One of the first works to use multi-scale convolutional neural nets (CNNs) on OCT images [36] employed patch-based voxel classification to detect intraretinal fluid (IRF) and subretinal fluid (SRF) in fully supervised and weakly supervised settings. Fully convolutional networks and UNets were used in [40, 12] to segment IRF, and in [34] to segment both the retinal layers and the fluid. Explicit fluid segmentation methods such as [41] also achieve high classification performance.

2.2 Data Augmentation (DA)

While conventional augmentation approaches are easy to implement and can generate a large database, their capability to induce data diversity is limited. They are also sensitive to parameter values [11] and to variation in image resolution, appearance, and quality [22].

Recent DL based methods trained with synthetic images outperform those trained with standard DA on classification and segmentation tasks. Antoniou et al. [1] proposed DAGAN for image generation in few shot learning systems. Bozorgtabar et al. [8] used a GAN objective for domain transformation by aligning the feature distributions of target and source domains. Mahapatra et al. [25] used a conditional GAN (cGAN) to generate informative synthetic chest X-ray images conditioned on a perturbed input mask. GANs have also been used for generating synthetic retinal images [44] and brain magnetic resonance images (MRI) [14, 37], for facial expression analysis [5], for super resolution [21, 24, 31], for image registration [28, 27, 26], and for generating higher strength MRI from their low strength acquisition counterparts [42]. Generated images have implicit variations in intensity distribution, but there is no explicit attempt to model attributes such as shape variations that are important to capture different conditions across a population. Milletari et al. [29] augmented medical images with simulated anatomical variations but demonstrate varying performance based on transformation functions and parameter settings.

2.3 Image Generation Using Uncertainty

Kendall et al. [17] used approximate Bayesian inference for parameter uncertainty estimation in scene understanding, but did not capture complex correlations between different labels. Lakshminarayanan et al. [20] proposed generating different samples using an ensemble of networks, while Rupprecht et al. [35] presented a single network with multiple heads for image generation. Sohn et al. [38] proposed a method based on conditional variational autoencoders (cVAE) to model segmentation masks, which improves the quality of generated images. The probabilistic UNet [19] combines a cVAE with a UNet [33] to generate multiple segmentation masks, although with limited diversity, since randomness is introduced only at the highest resolution. Baumgartner et al. [2] introduced a framework that generates images with greater diversity by injecting randomness at multiple levels.

2.4 Our Contribution

Based on the premise that improved data augmentation yields better segmentation performance in a DL system, we hypothesize that improved generation of synthetic images is possible by considering the intrinsic relationships between the shape and geometry of anatomical structures [4]. In this paper we present a Geometry-Aware Shape Generative Adversarial Network (GeoGAN) that learns to generate plausible images of the desired anatomy (e.g., retinal OCT images) while preserving learned relationships between geometry and shape. We make the following contributions:

  1. Incorporating geometry information contributes to the generation of realistic and qualitatively different medical images and shapes compared to standard DA. Other works such as [25, 44] do not incorporate this geometric relationship between anatomical parts.

  2. Use of uncertainty sampling and conditional shape generation on class labels to introduce diversity in the mask generation process. Compared to previous methods, we introduce diversity at different stages (different from [25, 44, 19]) and introduce an auxiliary classifier (different from [2, 38]) to improve the quality and accuracy of generated images.

3 Method

Our augmentation method: 1) models the geometric relationship between multiple segmentation labels; 2) preserves the disease class label of the original image to learn disease specific appearance and shape characteristics; and 3) introduces diversity into the image generation process through uncertainty sampling. Figure 3 shows the training workflow using a modified UNet based generator network. The set of images and segmentation masks is used to train the generator, while the discriminator provides feedback to improve the generator output. Figure 4 depicts the generation of synthetic images from the validation image set and their subsequent use in training a UNet for image segmentation at test time.

Figure 3: Overview of the steps in the training stage of our method. The images (x) and corresponding segmentation masks (s) are input to a STN whose output is fed to the generator network. The generator network is based on the UNet architecture, and diversity through uncertainty sampling is injected at different levels. The generated mask is fed to the discriminator, which evaluates its accuracy based on x, s, and the class label c. The provided feedback is used for weight updates to obtain the final model.
Figure 4: Depiction of mask generation. The trained generator network is used on validation set base images to generate new images that are used to train a segmentation network (UNet or Dense UNet). The model then segments retinal layers from test images.

3.1 Geometry Aware Shape Generation

Let us denote an input image as x, the corresponding manual segmentation mask as s, and the disease class label of x as c. Our method learns to generate a new image and segmentation label map from a base image and its corresponding manual mask. The first stage is a spatial transformer network (STN) [16] that transforms the base mask s into a new shape s' with different attributes of location, scale and orientation. The transformations used to obtain the new segmentation mask s' are applied to x to get the corresponding transformed image x'. Since the primary aim of our approach is to learn contours and other shape specific information of anatomical regions, a modified UNet architecture as the generator network effectively captures hierarchical information of shapes. It also makes it easier to introduce diversity at different levels of image abstraction.
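As a rough illustration of this first stage, the sketch below applies one affine transformation to a toy label mask with nearest-neighbour (label-preserving) resampling; the same matrix would also be applied to the image. This is a minimal numpy stand-in for the STN (which, in the method, learns to predict the affine parameters), not the paper's implementation:

```python
import numpy as np

def affine_warp(img, A):
    """Warp a 2D array with a 2x3 affine matrix A (inverse mapping,
    nearest-neighbour sampling so label masks stay integer-valued)."""
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # homogeneous
    src = A @ coords                         # source coordinate per output pixel
    sx = np.clip(np.rint(src[0]), 0, W - 1).astype(int)
    sy = np.clip(np.rint(src[1]), 0, H - 1).astype(int)
    return img[sy, sx].reshape(H, W)

# Toy base mask with two "layers" (labels 1 and 2)
mask = np.zeros((64, 64), dtype=int)
mask[20:30] = 1
mask[30:40] = 2

# Small rotation plus translation; the SAME matrix transforms image and mask
theta = np.deg2rad(5)
A = np.array([[np.cos(theta), -np.sin(theta),  2.0],
              [np.sin(theta),  np.cos(theta), -1.0]])
new_mask = affine_warp(mask, A)
print(np.unique(new_mask).tolist())   # labels preserved: [0, 1, 2]
```

Nearest-neighbour sampling matters here: bilinear interpolation would blend label indices into meaningless intermediate values.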

The generator G takes as input x and a desired label vector c of the output mask, and outputs an affine transformation matrix A via the STN, i.e., A = G(x, c). A is used to generate s' and x'. The discriminator D determines whether the output image preserves the desired label or not, and is tasked with ensuring that the generated masks and images are realistic. Let the minimax criterion between G and D be L(G, D). The loss function has three components:

L(G, D) = L_adv + λ1 L_cls + λ2 L_shape,   (1)

where 1) L_adv is an adversarial loss that ensures G outputs realistic deformations; 2) L_cls ensures that the generated image has the characteristics of the target output class label (disease or normal); and 3) L_shape ensures that new masks have realistic shapes. The weights λ1 and λ2 balance each term's contribution.
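The composition of the three terms can be sketched as below. The discriminator outputs are random placeholders and all names (adv_loss, cls_loss, shape_loss, lam1, lam2) are illustrative conventions for this sketch, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder discriminator outputs for a batch of 8 samples
d_real  = rng.uniform(0.7, 0.99, size=8)   # D(x, s)   on real pairs
d_fake  = rng.uniform(0.01, 0.3, size=8)   # D(x', s') on generated pairs
p_cls   = rng.uniform(0.6, 0.99, size=8)   # D_cls(c | x'): prob. of target class
p_shape = rng.uniform(0.5, 0.99, size=8)   # pairwise label-geometry probabilities

def adv_loss(d_real, d_fake):
    # standard GAN discriminator objective, written as a negated log-likelihood
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def cls_loss(p_cls):
    # auxiliary-classifier cross-entropy on the target class
    return -np.mean(np.log(p_cls))

def shape_loss(p_shape):
    # penalise geometrically unlikely label arrangements
    return -np.mean(np.log(p_shape))

lam1, lam2 = 1.0, 1.0   # in the paper these weights are set by grid search
total = adv_loss(d_real, d_fake) + lam1 * cls_loss(p_cls) + lam2 * shape_loss(p_shape)
print(round(total, 3))
```

Each term is non-negative here, so a falling total indicates joint improvement on realism, class fidelity, and label geometry.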

Adversarial loss - L_adv: The STN outputs A, a prediction of the affine parameters conditioned on x and c, from which the new semantic map s' is generated. L_adv is defined as the standard adversarial objective:

L_adv = E_{x,s}[log D(x, s)] + E_{x',s'}[log (1 − D(x', s'))].   (2)
Classification Loss - L_cls: The affine transformation A is applied to the base image x to obtain the generated image x'. We add an auxiliary classifier when optimizing both G and D and define the classification loss as

L_cls = E_{x',c}[−log D_cls(c | x')],   (3)

where the term D_cls(c | x') represents a probability distribution over classification labels computed by D.
Shape Loss - L_shape: We intend to preserve the relative geometric arrangement between the different labels. The generated mask has regions with different assigned segmentation labels because the base mask (from which the image was generated) already has labeled layers. Let us denote by m_i the image region (or pixels) in s' assigned label l_i, and consider another set of pixels, m_j, assigned label l_j. We calculate p(l_i | m_i, m_j), which is, given regions m_i and m_j, the pairwise probability of m_i being label l_i. If n denotes the total number of labels, for every label we calculate n − 1 such probability values and repeat this for all labels. Thus

L_shape = −(1 / (n(n − 1))) Σ_{i≠j} log p(l_i | m_i, m_j).   (4)

The probability value is determined by a pre-trained modified VGG16 architecture used to compute p(l_i | m_i, m_j), where the input has two separate maps corresponding to the label pair. Each map's foreground contains only the region of the corresponding label, with all other labels treated as background. The conditional probability between the pair of label maps enables the classifier to implicitly capture geometrical relationships and volume information without the need to define explicit features. The geometric relation between different layers varies between disease and normal cases, which is effectively captured by our approach.
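The construction of the two-map input fed to such a pairwise classifier can be sketched as follows (a numpy illustration under our own naming; the VGG16 classifier itself is omitted):

```python
import numpy as np

def label_pair_maps(mask, li, lj):
    """Build the two binary foreground maps for label pair (li, lj); a
    pre-trained classifier would take these as a 2-channel input and
    output the pairwise probability p(li | region_i, region_j)."""
    return (mask == li).astype(np.float32), (mask == lj).astype(np.float32)

def all_label_pairs(mask):
    """Every ordered pair of distinct foreground labels in the mask."""
    labels = [l for l in np.unique(mask) if l != 0]   # skip background
    return [(li, lj) for li in labels for lj in labels if li != lj]

# Toy mask with three "layers" (labels 1, 2, 3)
mask = np.zeros((32, 32), dtype=int)
mask[5:10] = 1; mask[10:15] = 2; mask[15:20] = 3

pairs = all_label_pairs(mask)
print(len(pairs))              # n labels -> n*(n-1) ordered pairs: 6
m_i, m_j = label_pair_maps(mask, 1, 2)
print(m_i.sum(), m_j.sum())    # 160.0 160.0  (5 rows x 32 cols each)
```

Keeping the pairs ordered matches the text: p(l_i | m_i, m_j) and p(l_j | m_j, m_i) are evaluated separately.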

3.2 Sample Diversity From Uncertainty Sampling

The generated mask is obtained by fusing L levels of the generator (as shown in Figure 3), each of which is associated with a latent variable z_l. We use probabilistic uncertainty sampling to model the conditional distribution of segmentation masks, with separate latent variables at multiple resolutions to factor the inherent uncertainties. The hierarchical approach introduces diversity at different stages and influences different features (e.g., low level features at the early layers and abstract features in the later layers). Denoting the generated mask as s for simplicity, we obtain the conditional distribution for the L latent levels as:

p(s | x) = ∫ p(s | z_1, …, z_L, x) p(z_1, …, z_L | x) dz_1 … dz_L.   (5)

Latent variable z_l models diversity at resolution 2^{−l} of the original image (e.g., z_0 and z_1 denote the original and half image resolutions). A variational approximation q(z | s, x) approximates the posterior p(z | s, x), where z = {z_1, …, z_L}, and training maximizes the evidence lower bound:

log p(s | x) ≥ ELBO(q) = E_q[log p(s | z, x)] − KL(q(z | s, x) || p(z | x)),   (6)

where KL is the Kullback-Leibler divergence. The prior and posterior distributions are parameterized as normal distributions. Figure 3 shows an example implementation, with the latent variables forming skip connections in a UNet architecture such that information between the image and the segmentation output goes through a sampling step. The latent variables are not mapped to a 1-D vector, in order to preserve the structural relationship between them, and this substantially improves segmentation accuracy. The dimensionality of z_l is (H / 2^l) × (W / 2^l), where H and W are the image dimensions.
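The per-level reparameterised sampling and the KL term of Eqn. 6 can be sketched as follows; the shapes and distribution parameters are arbitrary illustrations, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(mu, log_var):
    """Reparameterised draw z = mu + sigma * eps at one latent level."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def kl_diag_gauss(mu_q, lv_q, mu_p, lv_p):
    """KL(q || p) for diagonal Gaussians, summed over dimensions."""
    return 0.5 * np.sum(lv_p - lv_q
                        + (np.exp(lv_q) + (mu_q - mu_p) ** 2) / np.exp(lv_p)
                        - 1.0)

# Three latent levels at decreasing spatial resolution (shapes illustrative);
# note each z keeps its 2D structure rather than being flattened to a vector
shapes = [(16, 16), (8, 8), (4, 4)]
zs = [sample(np.zeros(s), np.zeros(s)) for s in shapes]      # draws from the prior
kl = sum(kl_diag_gauss(0.1 * np.ones(s), np.zeros(s),        # posterior vs prior
                       np.zeros(s), np.zeros(s)) for s in shapes)
print([z.shape for z in zs], round(kl, 3))   # [(16, 16), (8, 8), (4, 4)] 1.68
```

Because the KL term sums over all latent levels, diversity injected at coarse resolutions is regularised jointly with the fine-resolution levels.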

4 Experimental Results

4.1 Dataset Description

We apply our method to OCT images since retinal disease leads to significant change of the retinal layers, while changes due to disease in other modalities, such as X-ray or MRI, are not so obvious for mildly severe cases. Moreover, in retinal OCT there is greater interaction between different layers (segmentation labels), which is a good use case to demonstrate the effectiveness of our attempt to model the geometric relation between different anatomical regions. The publicly available RETOUCH challenge dataset [3] is used for our experiments. It has images of the following pathologies: 1) Intraretinal Fluid (IRF): contiguous fluid-filled spaces containing columns of tissue; 2) Subretinal Fluid (SRF): accumulation of a clear or lipid-rich exudate in the subretinal space; 3) Pigment Epithelial Detachment (PED): detachment of the retinal pigment epithelium (RPE), along with the overlying retina, from the remaining Bruch's membrane (BM) due to the accumulation of fluid or material in the sub-RPE space; it is common in age-related macular degeneration (AMD).

OCT volumes were acquired with spectral-domain OCT (SD-OCT) devices from three different vendors: Cirrus HD-OCT (Zeiss Meditec), Spectralis (Heidelberg Engineering), and T-1000/T-2000 (Topcon). There were pathological OCT volumes from each vendor. Each Cirrus OCT consists of B-scans of pixels, each Spectralis OCT of B-scans with pixels, and each Topcon OCT of B-scans of (T-2000) or (T-1000) pixels. All OCT volumes cover a macular area of mm with axial resolutions of m (Cirrus), m (Spectralis), and m (Topcon T-2000/T-1000). We use an additional dataset of normal subjects, derived equally () from the three device types, who had no incidence of retinal disease. The training set consists of OCT volumes, with and diseased volumes acquired with Cirrus, Spectralis, and Topcon, respectively, plus extra normal subjects ( from each device). The test set has volumes: 14 diseased volumes from each device vendor and normal subjects ( from each device type). The distribution of different fluid pathologies (IRF, SRF, PED) and diseases (AMD, RVO) is almost equal in the training and test sets.

The total number of images is as follows: training images (2D scans of the volumes) - diseased and normal; test images - diseased and normal. Segmentation layers and fluid regions (in pathological images) were manually annotated in each of the B-scans. Manual annotations were performed by graders, and the final annotation was based on consensus.

4.2 Experimental Setup, Baselines and Metrics

Our method has the following steps: 1) split the dataset into training (), validation (), and test () folds, such that images of any patient are in one fold only; 2) use the training images to train the image generator; 3) generate shapes from the validation set and train a UNet segmentation network [33] on the generated images; 4) use the trained UNet to segment the test images; 5) repeat the above steps for each data augmentation method. We trained all models using the Adam optimiser [18] with a learning rate of and batch-size of . Batch-normalisation was used. The values of the parameters λ1 and λ2 in Eqn. 1 were set by a detailed grid search on a separate dataset of volumes ( from each device) that was not used for training or testing. Both were varied in equal steps over a fixed range, holding one fixed while varying the other over the whole range and repeating for all values; the pair giving the best segmentation accuracy was taken as our final parameter values.
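The grid-search procedure for the two loss weights can be sketched as below. Here val_dice is a placeholder standing in for training a model and evaluating it on the held-out tuning volumes, and the grid range is an assumption for illustration:

```python
import itertools

# Hypothetical search grid for (lam1, lam2); range assumed for this sketch
grid = [0.1 * k for k in range(1, 11)]          # 0.1, 0.2, ..., 1.0

def val_dice(lam1, lam2):
    # placeholder validation score: peaks at an interior grid point;
    # in practice this means "train with (lam1, lam2), evaluate Dice"
    return 1.0 - (lam1 - 0.7) ** 2 - (lam2 - 0.9) ** 2

# Exhaustive search: every lam1 is tried against every lam2
best = max(itertools.product(grid, grid), key=lambda p: val_dice(*p))
print(tuple(round(v, 1) for v in best))   # (0.7, 0.9)
```

Crucially, the tuning volumes are disjoint from both the training and the test folds, so the chosen weights are not fitted to the test data.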

We denote our method as GeoGAN (Geometry Aware GAN) and compare its performance against: 1) rotation, translation and scaling (denoted DA - Data Augmentation); 2) the data augmentation GAN (DAGAN) of [1]; 3) the conditional GAN (cGAN) based method of [25]; and 4) the atlas registration method of [43]. Segmentation performance is evaluated in terms of the Dice Metric (DM) [10] and the Hausdorff Distance (HD) [32]. A DSC of 1 indicates perfect overlap and 0 indicates no overlap, while lower values of HD (in mm) indicate better segmentation performance.
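Both metrics can be computed directly from binary masks; a minimal numpy sketch (brute-force Hausdorff over pixel coordinates, fine for small masks but not the evaluation code used in the challenge):

```python
import numpy as np

def dice(a, b):
    """Dice similarity between two binary masks (1 = perfect overlap)."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def hausdorff(a, b):
    """Symmetric Hausdorff distance between the two masks' pixel sets."""
    pa, pb = np.argwhere(a), np.argwhere(b)
    d = np.sqrt(((pa[:, None, :] - pb[None, :, :]) ** 2).sum(-1))
    return max(d.min(axis=1).max(), d.min(axis=0).max())

gt = np.zeros((32, 32), bool);   gt[8:16, 8:16] = True
pred = np.zeros((32, 32), bool); pred[9:17, 8:16] = True   # shifted by one row
print(round(dice(gt, pred), 3))   # 0.875
print(hausdorff(gt, pred))        # 1.0
```

The one-row shift illustrates how the two metrics complement each other: Dice reports the overlap fraction, while HD reports the worst-case boundary error (here one pixel; the paper reports HD in mm after scaling by the pixel spacing).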

Algorithm Baselines.

The following variants of our method were used for ablation studies:

  1. GeoGAN without the classification loss (Eqn. 3).

  2. GeoGAN without the shape relationship modeling term (Eqn. 4).

  3. GeoGAN without uncertainty sampling for injecting diversity, to determine sampling's relevance to the final network performance.

  4. GeoGAN using only the classification loss (Eqn. 3) and the adversarial loss (Eqn. 2), to determine the classification term's relevance to GeoGAN's performance.

  5. GeoGAN using only the shape loss (Eqn. 4) and the adversarial loss (Eqn. 2), to determine the shape term's contribution to GeoGAN's performance.

  6. GeoGAN using only uncertainty sampling and the adversarial loss (Eqn. 2). This baseline quantifies the contribution of sampling to the image generation process.

4.3 Segmentation Results And Analysis

We hypothesize that a good image augmentation method should capture the complex relationships between the anatomy and the generated images, leading to an improvement in segmentation accuracy. Average DSC values for pathological images from all device types on the RETOUCH test dataset are reported in Table 1. Figure 5 shows the segmentation results using a UNet trained on images from the different methods. Figure 5 (a) shows the test image with the manual mask overlaid as a red contour, and Figure 5 (b) shows the manual mask. Figures 5 (c)-(g) show, respectively, the segmentation masks obtained by GeoGAN and the competing methods.

Our method outperforms the baseline conventional data augmentation and the other competing methods by a significant margin. Results of the other methods are taken from [3]. GeoGAN's DSC is higher than that of the best performing competing method (obtained on the Spectralis images of the dataset). While GeoGAN's average performance is equally good across all three device images, the competing methods rank differently for different devices. GeoGAN's superior segmentation accuracy is attributed to its capacity to learn the geometrical relationship between different layers (through the shape loss L_shape) much better than competing methods. Thus our attempt to model the intrinsic geometrical relationships between different labels generates superior quality masks.

In a separate experiment we train GeoGAN with images of one device and segment images of the other devices, repeating for all device types. The average DSC value was , and the HD was mm. The decrease in performance compared to GeoGAN in Table 1 is expected, since the training and test images are from different devices. However, we still do better than [43] and the other competing methods on the same dataset.

We repeat the set of experiments in Table 1 using a Dense UNet [23] instead of a UNet as the segmentation network. GeoGAN again gives the best average DSC, indicating its better performance irrespective of the backbone segmentation framework.

Method    DA            DAGAN [1]     cGAN [25]     Zhao [43]     GeoGAN (Proposed)
DM        0.793 (0.14)  0.825 (0.10)  0.851 (0.07)  0.884 (0.09)  0.906 (0.04)
HD (mm)   14.3 (4.2)    12.9 (3.8)    10.6 (3.0)    8.8 (3.3)     7.9 (2.2)
Table 1: Segmentation results for pathological OCT images from the RETOUCH database. Mean and standard deviation (in brackets) are shown. Best results per metric are shown in bold.

Figure 5: Segmentation results on the RETOUCH challenge dataset: (a) cropped image with manual segmentation mask (red contour); segmentation masks from (b) ground truth (manual); (c) GeoGAN; (d) Zhao [43]; and (e)-(g) the competing methods cGAN [25], DAGAN [1], and conventional DA.

Ablation Studies.

Table 2 shows the segmentation results for the different ablation studies. Figure 6 shows the segmentation masks obtained by the different baselines for the image shown in Figure 5 (a). The segmentation outputs are quite different from the ground truth and from the one obtained by GeoGAN. In some cases normal regions of the layers are included in the pathological area, while parts of the fluid region are not segmented as pathological. Either case is undesirable for disease diagnosis and quantification. Thus, the different components of our cost function are integral to the method's performance, and excluding one or more of the classification, geometric and sampling terms adversely affects segmentation performance.

Figure 6: Ablation study segmentation results for the six GeoGAN baseline variants listed in Section 4.2 (panels (a)-(f)).
DM 0.867(0.07) 0.864(0.09) 0.862(0.09)
HD 9.4(3.0) 9.5(3.3) 9.9(3.2)
DM 0.824(0.08) 0.825(0.07) 0.818(0.06)
HD 11.2(2.9) 11.1(3.0) 12.5(2.8)
Table 2: Mean and standard deviation (in brackets) of segmentation results from ablation studies on pathological OCT images from the RETOUCH database. HD is in mm.

4.4 Realism of Synthetic Images

The preceding results show that GeoGAN generates more diverse images, which enables the corresponding UNet to achieve better segmentation accuracy. Figure 1 shows examples of synthetic images generated by GeoGAN and the other image generation methods except DA (since DA involves rotation and scaling only), while Figure 7 shows examples from the ablation models. The base image is the same in both figures. Visual examination shows that GeoGAN generated images respect the boundaries of adjacent layers in most cases, while the other methods tend not to.

Only GeoGAN, and to some extent its ablation variants, generate images with consistent layer boundaries. Images generated by the other methods suffer from the following limitations: 1) they tend to be noisy; 2) multiple artifacts give them an unrealistic appearance; 3) over-smoothing distorts the layer boundaries; 4) different retinal layers tend to overlap with the fluid area. Training segmentation models on such images hampers their ability to produce accurate segmentations.

Two trained ophthalmologists, having and years of experience in examining retinal OCT images for abnormalities, assessed the realism of the generated images. We presented them with a common set of synthetic images from each method and asked them to classify each image as realistic or not. The evaluation sessions were conducted separately, with each ophthalmologist blinded to the other's answers as well as to the image generation model. For GeoGAN, both experts agreed that 440 (88.0%) of the images were realistic ("Both Experts" in Table 3). Considering the feedback of either expert, a total of 473 (94.6%) unique images were identified as realistic ("At least 1 Expert" in Table 3), while 27 (5.4%) of the images were not identified as realistic by either expert ("No Expert" in Table 3). Agreement statistics for the other methods are summarized in Table 3.

The highest agreement between the two ophthalmologists is obtained for images generated by our method. For all the other methods, the difference from GeoGAN is significant. Zhao et al. [43] performs best amongst them, but still shows an agreement difference of more than 6% (for "At least 1 Expert") compared to GeoGAN (88.2% vs 94.6%). The numbers in Table 3 show an even larger difference for the other methods, highlighting the importance of modeling geometric relationships in pathological region segmentation.

Agreement     Both       At least 1    No
Statistics    Experts    Expert        Expert
GeoGAN 88.0 (440) 94.6 (473) 5.4 (27)
Zhao et al. [43] 84.8 (424) 88.2 (441) 11.8 (59)
cGAN ([25]) 83.2 (416) 85.4 (427) 14.6 (73)
DAGAN([1]) 82.2 (411) 84.2 (421) 15.8 (79)
DA 80.4 (402) 82.4 (412) 17.6 (88)
GeoGAN 83.6 (418) 86.4 (432) 13.6 (68)
GeoGAN 83.0 (415) 85.6 (428) 14.4 (72)
GeoGAN 82.8 (414) 85.0 (425) 15.0 (75)
GeoGAN 82.2 (411) 84.0 (420) 16.0 (80)
GeoGAN 81.2 (406) 83.4 (417) 16.6 (83)
GeoGAN 80.4 (402) 82.8 (414) 17.2 (86)
Table 3: Agreement statistics for the different image generation methods amongst 2 ophthalmologists. Numbers in bold indicate agreement percentages, while numbers within brackets indicate actual counts out of 500 images.
Figure 7: Generated images for the six ablation study variants of Section 4.2 (panels (a)-(f)).

4.5 Combining Disease And Normal Dataset

Section 4.3 showed results of training the UNet on diseased population shapes to segment diseased shapes. In this section we show the opposite scenario, where training was performed on normal images and the network was subsequently used to generate images from diseased base images and segment test images of a diseased population. Table 4 shows the corresponding results, as well as the scenario where the training images were a mix of diseased and normal populations while the test images were from the diseased population. All reported results are for the same set of test images.

Comparing these with the results in Table 1, the superior performance of training separate networks for the diseased and normal populations is obvious. Figure 8 (a) shows the segmentation output when the training and test images are from the diseased population, while Figure 8 (b) shows the scenario where the training images are from the normal population and the test images are diseased cases. Red contours show the outline of the manual segmentation, while green contours show the output of our method. When the training images are from the normal population, it is more challenging to segment an image from the diseased population. Inaccurate segmentation of the fluid layers can have grave consequences for subsequent diagnosis and treatment plans. Figure 8 (c) shows the results when the training database is a mix of diseased and normal populations, which is a more accurate representation of real world scenarios. A mixture of normal and diseased population images in the training set leads to acceptable performance. However, training a network exclusively on disease cases improves segmentation accuracy of pathological regions, which is certainly more critical than segmenting normal anatomical regions. Since it is challenging to obtain large numbers of annotated images, especially for diseased cases, our proposed image augmentation method is a significant improvement over existing methods.

Train on Normal, Test on Diseased
Method    DA      DAGAN [1]   cGAN [25]   Zhao [43]   GeoGAN
DM        0.741   0.781       0.802       0.821       0.856
HD        15.3    14.5        13.7        11.3        9.9
Train on Mix, Test on Diseased
Method    DA      DAGAN [1]   cGAN [25]   Zhao [43]   GeoGAN
DM        0.762   0.798       0.820       0.848       0.873
HD        14.8    14.0        13.2        10.8        9.2
Table 4: Segmentation results for a mix of diseased and normal OCT images. Best results per metric are shown in boldface. HD is in mm.
Figure 8: Segmentation results of test images for different training data sources: (a) diseased population only; (b) normal population only; (c) mix of diseased and normal population.

5 Conclusion

We propose a novel approach to generate plausible retinal OCT images by incorporating the relationship between segmentation labels to guide the shape generation process. Diversity is introduced into the image generation process through uncertainty sampling. Comparative results show that the augmented dataset from GeoGAN outperforms standard data augmentation and other competing methods when applied to segmentation of pathological regions (fluid filled areas) in retinal OCT images. We show that the synergy between the shape, classification and sampling terms leads to improved segmentation and greater visual agreement from experienced ophthalmologists. Each of these terms is equally important in generating realistic shapes. Our approach can be used for other medical imaging modalities without major changes to the workflow.

Despite the good performance of our method, we observe failure cases when the base images are noisy due to inherent characteristics of the image acquisition procedure, and when the fluid areas greatly overlap with other layers. Although the second scenario is not very common, it can be critical in the medical context. In future work we aim to evaluate our method's robustness on a wider range of medical imaging modalities such as MRI, X-ray, etc. Our method is also useful for generating realistic images for educating clinicians, where targeted synthetic images (e.g., generation of complex cases or disease mimickers) can be used to speed up training. Similarly, the proposed approach could be used in quality control of deep learning systems to identify potential weaknesses through targeted high-throughput synthetic image generation and testing.


  • [1] A. Antoniou, A. Storkey, and H. Edwards (2017) Data augmentation generative adversarial networks. In arXiv preprint arXiv:1711.04340, Cited by: Figure 1, §2.2, §4.2, Table 1, Table 3, Table 4.
  • [2] C. F. Baumgartner, K. C. Tezcan, K. Chaitanya, A. M. Hötker, U. J. Muehlematter, K. Schawkat, A. S. Becker, O. Donati, and E. Konukoglu (2019) PHiSeg: capturing uncertainty in medical image segmentation.. In Proc. MICCAI(2), pp. 119–127. Cited by: item 2, §2.3.
  • [3] H. Bogunovic and et. al., (2019) RETOUCH: the retinal oct fluid detection and segmentation benchmark and challenge. IEEE Trans. Med. Imag. 38 (8), pp. 1858–1874. Cited by: §4.1, §4.3.
  • [4] F. L. Bookstein (2015) Integration, disintegration, and self-similarity: characterizing the scales of shape variation in landmark data. Evolutionary Biology 42 (4), pp. 395–426. Cited by: §2.4.
  • [5] B. Bozorgtabar, D. Mahapatra, and J. Thiran (2020) ExprADA: adversarial domain adaptation for facial expression analysis. In Press Pattern Recognition 100, pp. 15–28. Cited by: §2.2.
  • [6] B. Bozorgtabar, D. Mahapatra, H. von Teng, A. Pollinger, L. Ebner, J. Thiran, and M. Reyes (2019) Informative sample generation using class aware generative adversarial networks for classification of chest xrays.. Computer Vision and Image Understanding 184, pp. 57–65. Cited by: §1.
  • [7] B. Bozorgtabar, M. S. Rad, H. K. Ekenel, and J. Thiran (2019) Learn to synthesize and synthesize to learn. Computer Vision and Image Understanding 185, pp. 1–11. Cited by: §1.
  • [8] B. Bozorgtabar, M. S. Rad, D. Mahapatra, and J. Thiran (2019) SynDeMo: synergistic deep feature alignment for joint learning of depth and ego-motion. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4210–4219. Cited by: §2.2.
  • [9] S. J. Chiu, X. T. Li, P. Nicholas, C. A. Toth, J. A. Izatt, and S. Farsiu (2010) Automatic segmentation of seven retinal layers in sdoct images congruent with expert manual segmentation. Opt. Express 18 (18), pp. 19413–19428. Cited by: Figure 2.
  • [10] L. R. Dice (1945) Measures of the amount of ecologic association between species. Ecology 26 (3), pp. 297–302. Cited by: §4.2.
  • [11] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox (2016) Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Trans. Patt. Anal. Mach. Intell. 38 (9), pp. 1734–1747. Cited by: §2.2.
  • [12] G. N. Girish, B. Thakur, S. R. Chowdhurya, A. R. Kothari, and J. Rajan (2018) Segmentation of intra-retinal cysts from optical coherence tomography images using a fully convolutional neural network model. IEEE J. Biomed. Health Inform. 23 (1), pp. 296–304. Cited by: §2.1.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • [14] C. Han, H. Hayashi, L. Rundo, R. Araki, W. Shimoda, S. Muramatsu, Y. Furukawa, G. Mauri, and H. Nakayama (2018) GAN-based synthetic brain mr image generation. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 734–738. Cited by: §1, §1, §2.2.
  • [15] S. Huang, C. Lin, S. Chen, Y. Wu, P. Hsu, and S. Lai (2018) Auggan: cross domain adaptation with gan-based data augmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 718–731. Cited by: §1.
  • [16] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu (2015) Spatial transformer networks. In NIPS. Cited by: §3.1.
  • [17] A. Kendall, V. Badrinarayanan, and R. Cipolla (2015) Bayesian segnet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding.. In arXiv:1511.02680, Cited by: §2.3.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In arXiv preprint arXiv:1412.6980, Cited by: §4.2.
  • [19] S. A. A. Kohl, B. Romera-Paredes, C. Meyer, J. D. Fauw, J. R. Ledsam, K. H. Maier-Hein, S. M. A. Eslami, D. J. Rezende, and O. Ronneberger (2018) A probabilistic u-net for segmentation of ambiguous images.. In Proc. NIPS, pp. 6965–6975. Cited by: item 2, §2.3.
  • [20] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles.. In Proc. NIPS, pp. 6402–6413. Cited by: §2.3.
  • [21] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi (2016) Photo-realistic single image super-resolution using a generative adversarial network. CoRR abs/1609.04802. Cited by: §2.2.
  • [22] K. K. Leung, M. J. Clarkson, J. W. Bartlett, S. Clegg, C. R. J. Jr, M. W. Weiner, N. C. Fox, S. Ourselin, and A. D. N. Initiative (2010) Robust atrophy rate measurement in alzheimer’s disease using multi-site serial mri: tissue-specific intensity normalization and parameter selection. Neuroimage 50 (2), pp. 516–523. Cited by: §2.2.
  • [23] X. Li, H. Chen, X. Qi, Q. Dou, C. Fu, and P. Heng (2018) H-DenseUNet: hybrid densely connected unet for liver and tumor segmentation from ct volumes. IEEE Trans. Med. Imag. 37 (12), pp. 2663–2674. Cited by: §4.3.
  • [24] D. Mahapatra, B. Bozorgtabar, and S. Hewavitharanage (2017) Image super resolution using generative adversarial networks and local saliency maps for retinal image analysis. In MICCAI, pp. 382–390. Cited by: §2.2.
  • [25] D. Mahapatra, B. Bozorgtabar, J. Thiran, and M. Reyes (2018) Efficient active learning for image classification and segmentation using a sample selection and conditional generative adversarial network. In MICCAI, pp. 580–588. Cited by: Figure 1, §1, item 1, item 2, §2.2, §4.2, Table 1, Table 3, Table 4.
  • [26] D. Mahapatra, Z. Ge, S. Sedai, and R. Chakravorty (2018) Joint registration and segmentation of xray images using generative adversarial networks. In Proc. MICCAI-MLMI, pp. 73–80. Cited by: §2.2.
  • [27] D. Mahapatra and Z. Ge (2019) Training data independent image registration with gans using transfer learning and segmentation information. In Proc. IEEE ISBI, pp. 709–713. Cited by: §2.2.
  • [28] D. Mahapatra and Z. Ge (2020) Training data independent image registration using generative adversarial networks and domain adaptation. In Press Pattern Recognition 100, pp. 1–14. Cited by: §2.2.
  • [29] F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In Proc. Int. Conf. on 3D Vision, pp. 565–571. Cited by: §2.2.
  • [30] C. Nielsen and M. Okoniewski (2019) GAN data augmentation through active learning inspired sample acquisition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 109–112. Cited by: §1, §1.
  • [31] M. S. Rad, B. Bozorgtabar, U. Marti, M. Basler, H. K. Ekenel, and J. Thiran (2019) Srobb: targeted perceptual loss for single image super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2710–2719. Cited by: §2.2.
  • [32] J. Ribera, D. Güera, Y. Chen, and E. Delp (2018) Weighted hausdorff distance: a loss function for object localization.. In arXiv preprint arXiv:1806.07564, Cited by: §4.2.
  • [33] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Proc. MICCAI, pp. 234–241. Cited by: §2.3, §4.2.
  • [34] A. G. Roy, S. Conjeti, S. P. K. Karri, D. Sheet, A. Katouzian, C. Wachinger, and N. Navab (2017) ReLayNet: retinal layer and fluid segmentation of macular optical coherence tomography using fully convolutional networks. Biomed. Opt. Express 8 (8), pp. 3627–3642. Cited by: §2.1.
  • [35] C. Rupprecht, I. Laina, R. DiPietro, M. Baust, F. Tombari, N. Navab, and G. D. Hager (2017) Learning in an uncertain world: representing ambiguity through multiple hypotheses.. In Proc. CVPR, pp. 3591–3600. Cited by: §2.3.
  • [36] T. Schlegl, S. M. Waldstein, W. Vogl, U. Schmidt-Erfurth, and G. Langs (2015) Predicting semantic descriptions from medical images with convolutional neural networks. In Proc. Int. Conf. Inform. Process. Med. Imag. (IPMI), pp. 437–438. Cited by: §2.1.
  • [37] H. Shin, N. A. Tenenholtz, J. K. Rogers, C. G. Schwarz, M. L. Senjem, J. L. Gunter, K. Andriole, and M. Michalski (2018) Medical Image Synthesis for Data Augmentation and Anonymization using Generative Adversarial Networks. In Proc. MICCAI-SASHIMI, Cited by: §2.2.
  • [38] K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models.. In Proc. NIPS, pp. 3483–3491. Cited by: item 2, §2.3.
  • [39] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang (2016) Convolutional neural networks for medical image analysis: full training or fine tuning?. IEEE Trans. Med. Imag. 35 (5), pp. 1299–1312. Cited by: §1.
  • [40] F. G. Venhuizen, B. van Ginneken, B. Liefers, F. van Asten, V. Schreur, S. Fauser, C. Hoyng, T. Theelen, and C. I. Sanchez (2018) Deep learning approach for the detection and quantification of intraretinal cystoid fluid in multivendor optical coherence tomography. Biomed. Opt. Express 9 (4), pp. 1545–1569. Cited by: §2.1.
  • [41] X. Xu, K. Lee, L. Zhang, M. Sonka, and M. D. Abramoff (2015) Stratified sampling voxel classification for segmentation of intraretinal and sub-retinal fluid in longitudinal clinical oct data. IEEE Trans. Med. Imag. 34 (7), pp. 1616–1623. Cited by: §2.1.
  • [42] X. Yi, E. Walia, and P. Babyn (2019) Generative adversarial network in medical imaging: a review. Med. Imag. Anal. 58. Cited by: §2.2.
  • [43] A. Zhao, G. Balakrishnan, F. Durand, J. V. Guttag, and A. V. Dalca (2019) Data augmentation using learned transforms for one-shot medical image segmentation. In Proc. CVPR, pp. 8543–8552. Cited by: Figure 1, §1, §1, §1, Figure 5, §4.2, §4.3, §4.3, §4.4, Table 1, Table 3, Table 4.
  • [44] H. Zhao, H. Li, S. Maurer-Stroh, and L. Cheng (2018) Synthesizing retinal and neuronal images with generative adversarial nets. Med. Imag. Anal. 49, pp. 14–26. Cited by: item 1, item 2, §2.2.