On the Evaluation and Real-World Usage Scenarios of Deep Vessel Segmentation for Funduscopy

09/09/2019 ∙ by Tim Laibacher, et al. ∙ Idiap Research Institute

We identify and address three research gaps in the field of vessel segmentation for funduscopy. The first focuses on the task of inference on high-resolution fundus images, for which only a limited set of ground-truth data is publicly available. Notably, we highlight that simple rescaling and padding or cropping of lower-resolution datasets is surprisingly effective. Additionally, we explore the effectiveness of semi-supervised learning for better domain adaptation. Our results show competitive performance on a set of common public retinal vessel datasets using a small and light-weight neural network. For HRF, the only very high-resolution dataset currently available, we reach new state-of-the-art performance by relying solely on training images from lower-resolution datasets. The second topic concerns evaluation metrics. We investigate the variability of the F1-score on the existing datasets and report results for recent SOTA architectures. Our evaluation shows that most SOTA results are actually comparable to each other in performance. Lastly, we address the issue of reproducibility by open-sourcing our complete pipeline.




I Introduction

The accurate and automatic segmentation of the retinal vasculature has several important applications in ophthalmology, such as the diagnosis of diabetic retinopathy and wet age-related macular degeneration. This work identifies and addresses three research areas of interest in the field with high relevance for practical deployments.

The first concerns the availability of high-resolution fundus images. Owing to the popularity and ease of use of lower-resolution fundus datasets such as DRIVE [staal_ridge-based_2004] and STARE [hoover_locating_2000] for convolutional neural network training, the majority of previous works still focuses on images that are orders of magnitude smaller than the fundus images taken today by clinics and practitioners around the world. To the best of our knowledge, only the HRF dataset [budai_robust_2013] has a resolution that reaches or comes close to that of modern fundus cameras.

Since manual annotation by experts is time consuming and costly, we propose a set of methods to leverage existing public low-resolution datasets with annotated ground-truth vessel labels to train convolutional neural networks that perform well on unseen high-resolution images.

The second gap in existing research is the lack of detail in terms of how evaluation metrics are calculated and presented. We outline two averaging methods and introduce a set of plots containing standard deviation bands that provide additional insights into the robustness of models.

Finally, we address the issue of reproducibility. We found that the degree of reproducibility in previous work is unsatisfactory. Additionally, individual data-loading pipelines for each of the public datasets are necessary, since they do not come with a coherent folder structure or API. This means that new researchers looking to enter the field often face a large amount of engineering work before actual research can take place. To address this, we introduce the bob.ip.binseg package (https://gitlab.idiap.ch/bob/bob.ip.binseg) that integrates with the Bob framework [anjos_bob_2017], can be easily extended, and allows for the reproduction of our experiments.

II Related Work

II-A High-Resolution Images

In this section we describe related work that touches or focuses on the high-resolution fundus dataset HRF. Previous contributions can be roughly summarized as follows:

  1. Small fully convolutional neural networks, trained on full-resolution images [laibacher_m2u_2018].

  2. Large fully convolutional neural networks, trained on downsampled images [yan_joint_2018].

  3. Patch-based training of fully convolutional neural networks with deformable convolutions [jin_dunet_2019].

  4. Approaches with Generative Adversarial Networks  [goodfellow_gan_2014] (GANs) using downsampled images  [zhao_supervised_2019].

Due to their proven record across various segmentation domains, fully convolutional networks (FCNs) with a VGG16 [simonyan_vgg_2015] encoder are employed in the majority of works, such as [jin_dunet_2019, meyer_deep_2017, yan_joint_2018, zhao_supervised_2019]. The M2U-Net introduced in [laibacher_m2u_2018] adopts a structure similar to the U-Net [ronneberger_u-net_2015] used in [yan_joint_2018], but relies on MobileNetV2 in the encoder and proposes light-weight inverted residual blocks in the decoder. Similarly, the DUNet by Jin et al. [jin_dunet_2019] adopts the U-shaped structure but uses Deformable Convolutions [dai_deformable_2016] in parts of the network.

Method Year F1 GAN Patch-based Parameters
Orlando et al. [orlando_discriminatively_2017]* 2017 0.7158 No No -
Yan et al. [yan_joint_2018]* 2018 0.7212 No No  25.85M
Laibacher et al. [laibacher_m2u_2018]* 2018 0.7814 No No 0.55M
Jin et al. [jin_dunet_2019]* 2019 0.7988 No Yes 0.88M
Zhao et al. [zhao_supervised_2019] 2019 0.7659 Yes No 14.94M
DRIU [maninis_deep_2016] (our impl.)* 2019 0.7865 No No 14.94M
*Same train-test split
TABLE I: Previous works on the high-resolution HRF dataset

The different methods come with a computational/pipeline-complexity and inference-time vs. segmentation-quality trade-off, as indicated in Table I. The deformable convolutions in DUNet, while light in terms of parameter count, impose a great reduction in inference speed as reported by the authors (47.7s vs. 9.7s for a 999 x 960 image). Additionally, patch-based inference pipelines are estimated to be slower at inference than methods that utilize full-resolution [laibacher_m2u_2018] or downsampled images [yan_joint_2018]. The GAN-based approach by Zhao et al. [zhao_supervised_2019] requires a two-step training procedure: in the first step, a synthesized target dataset is constructed using a modified GAN; in the second step, DRIU [maninis_deep_2016] is trained on the synthesized images.

Taking these trade-offs into account, we argue that the proposed methods remain largely comparable in performance and that the minor reported improvements do not represent significant breakthroughs.

II-B Evaluation Metrics

A coherent and transparent standard for evaluation metrics is necessary to allow for a fair comparison of methods. In the field of vessel segmentation, however, the reported metrics frequently differ both in kind and in quantity. While the F1-score has emerged as one of the dominant metrics, exact details on its calculation are often not clearly stated.

FCNs commonly output probability maps, attaching a vessel probability score to each pixel in the image. Since the ground-truth labels are binary, the probability map has to be thresholded. In [maninis_deep_2016, laibacher_m2u_2018, jin_dunet_2019, zhao_supervised_2019] metrics on the test set are evaluated at all thresholds, presumably ranging from 0 to 1 in steps of 0.01. Metrics at only the optimal test-set threshold are reported in [fraz_ensemble_2012, marin_new_2011, meyer_deep_2017]. Li et al. [li_cross-modality_2016] utilize the threshold determined on the training set, a scheme also adopted by Yan et al. [yan_joint_2018].

The metrics Precision (Pr), Recall (Re), Specificity (Sp), Accuracy (Acc) and F1-score (F1) are derived from the number of True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN) that are calculated for each test image/ground-truth pair:

Pr = TP / (TP + FP),   Re = TP / (TP + FN),   Sp = TN / (TN + FP),
Acc = (TP + TN) / (TP + FP + TN + FN),   F1 = 2 · Pr · Re / (Pr + Re)
The average F1-score can then either be calculated based on individual F1-scores for each test image or on average Precision and Recall.
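As an illustrative sketch (not the released bob.ip.binseg implementation; all names are ours), the per-image counts and the derived metrics can be computed from a thresholded probability map as follows:

```python
import numpy as np

def confusion_counts(pred_prob, gt, threshold=0.5):
    """Binarize a vessel probability map at the given threshold and count
    TP/FP/TN/FN against a binary ground-truth mask."""
    pred = pred_prob >= threshold
    gt = gt.astype(bool)
    tp = int(np.sum(pred & gt))
    fp = int(np.sum(pred & ~gt))
    tn = int(np.sum(~pred & ~gt))
    fn = int(np.sum(~pred & gt))
    return tp, fp, tn, fn

def derived_metrics(tp, fp, tn, fn, eps=1e-12):
    """Precision, Recall, Specificity, Accuracy and F1 from the counts."""
    pr = tp / (tp + fp + eps)
    re = tp / (tp + fn + eps)
    sp = tn / (tn + fp + eps)
    acc = (tp + tn) / (tp + fp + tn + fn + eps)
    f1 = 2 * pr * re / (pr + re + eps)
    return pr, re, sp, acc, f1
```

In the evaluation described here, these quantities are computed once per test image and per threshold.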

The differences in evaluation metrics are identified as the first barrier to fair comparison and evaluation of competing methods. The second is the difference in training and test splits. Out of the five considered datasets, only DRIVE [staal_ridge-based_2004] defines a train-test split. The remaining datasets leave it to the authors to define an appropriate split. As will become evident in Section VII, while in some cases a dominant split has emerged in the literature, in other cases the utilized splits differ considerably, further hindering fair comparisons. In this work both barriers are addressed: we clearly describe metric calculations and train-test splits, and hope to inspire future work to adopt a similar approach.

II-C Reproducibility

Machine learning experiments are becoming increasingly complex, making it harder to reproduce them [anjos_bob_2017]. This problem is especially pronounced in niche computer vision fields like vessel segmentation, which receive less attention than popular image-classification or object-detection tasks, in which reproducible work is found more frequently. The maskrcnn_benchmark [massa_mrcnn_2018], for example, has since its introduction seen several independent contributions and publications building on top of it [tian_fcos_2019, fu_retinamask_2019]. Besides being good practice, reproducible research has been shown to be beneficial to the impact of publications [vandewalle_reproducible_2009]. Vandewalle et al. [vandewalle_reproducible_2009] distinguish six degrees of reproducibility, ranging from easily reproducible (5) to not reproducible (0); we refer to the paper for the exact definitions.

We found that existing works either provide no source code at all [fraz_ensemble_2012, li_cross-modality_2016, liskowski_segmenting_2016, orlando_discriminatively_2017], which we estimate to require extreme effort to reproduce (2), provide source code that lacks documentation and instructions on how to set up the training environment and datasets [maninis_deep_2016, zhao_supervised_2019, jin_dunet_2019], or provide only parts of the training pipeline [yan_joint_2018], and therefore require considerable effort to reproduce (3).

This highlights the need for a "class 5" reproducible work, which we provide in the form of the comprehensive software package bob.ip.binseg.

III Datasets

The five most commonly used datasets are DRIVE [staal_ridge-based_2004], STARE [hoover_locating_2000], CHASE_DB1 [owen_measuring_2009], HRF [budai_robust_2013] and IOSTAR [abbasi_iostar_2015] with the order being indicative of their appearance in the literature.

III-A Train-Test Splits

For DRIVE we use the train-test split as proposed by the authors of the dataset. For STARE we follow Maninis et al. [maninis_deep_2016] and Zhao et al. [zhao_supervised_2019] with a 10/10 split. The split adopted for CHASE_DB1 was first proposed by Fraz et al. [fraz_ensemble_2012], which uses the first 8 images for training and the last 20 for testing. For HRF we adopt the split proposed by Orlando et al. [orlando_discriminatively_2017] and adopted in [laibacher_m2u_2018] and [jin_dunet_2019], whereby the first five images of each category (healthy, diabetic retinopathy and glaucoma) are used for training and the remaining 30 for testing. For IOSTAR we select the 20/10 split introduced by Meyer et al. [meyer_deep_2017]. Table II provides a compact overview of dataset sizes, resolutions, splits and references.

III-B Combined Vessel Dataset

Since models are trained on a combination of the above-mentioned datasets, we refer to their combination as COVD (Combined Vessel Dataset). Whenever we exclude the dataset used for testing from training, we indicate it with a "–" sign: e.g., COVD– tested on the target dataset HRF means that we include all datasets for training except HRF; similarly, COVD– evaluated on the target dataset CHASE_DB1 means that we include all datasets for training except CHASE_DB1.

This way we simulate real-world cases where there often is no ground-truth data available.

In cases where we use semi-supervised learning, we utilize the training images, but not the ground-truth data, of the target dataset: e.g., for COVDSSL evaluated on HRF, we utilize all datasets except HRF for training with ground-truth pairs, and for SSL we use only the images of the HRF training set.
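A minimal sketch of this leave-one-out composition (dataset names as strings; the helper name is ours, not part of bob.ip.binseg):

```python
ALL_DATASETS = ["DRIVE", "STARE", "CHASE_DB1", "IOSTAR", "HRF"]

def covd_split(target, ssl=False):
    """Return the datasets used with ground-truth pairs (all except the
    target) and, for SSL, the target whose training images are used
    without labels."""
    labeled = [d for d in ALL_DATASETS if d != target]
    unlabeled = [target] if ssl else []
    return labeled, unlabeled
```

For COVDSSL with target HRF, for instance, the four other datasets supply image/label pairs while HRF supplies unlabeled training images only.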

Dataset H x W Imgs. Train Test Reference
DRIVE 584 x 565 40 20 20 [staal_ridge-based_2004]
STARE 605 x 700 20 10 10 [maninis_deep_2016]
CHASE_DB1 960 x 999 28 8 20 [fraz_ensemble_2012, laibacher_m2u_2018]
IOSTAR 1024 x 1024 30 20 10 [meyer_deep_2017]
HRF 2336 x 3504 45 15 30 [orlando_discriminatively_2017, laibacher_m2u_2018, jin_dunet_2019]
TABLE II: Overview of retina vessel segmentation datasets.

IV Baseline Benchmarks

Before investigating potential approaches to vessel segmentation on high-resolution images, we ran a set of baseline benchmarks that compare the performance of four popular convolutional neural networks used for retinal vessel segmentation (number of trainable network parameters in brackets): DRIU (14.94M) [maninis_deep_2016], HED (14.73M) [xie_holistically-nested_2015], M2U-Net (0.55M) [laibacher_m2u_2018] and U-Net (25.85M) [ronneberger_u-net_2015]. The results are shown in Table III. The fact that the considerably smaller M2U-Net almost reaches the performance of larger models like DRIU and U-Net, especially on the high-resolution dataset HRF, hints at overparametrization of the pretrained ImageNet [imagenet_cvpr09] part of those models. Similar observations were made by Raghu et al. [raghu_transfusion_2019] on the RETINA [gulshan_development_2016] classification dataset.

Out of the four models, we picked DRIU and M2U-Net to train on COVD and to apply semi-supervised learning: M2U-Net because of the aforementioned properties, and DRIU since it came very close to the heavy U-Net while requiring fewer parameters.

F1 (std) DRIU HED M2U-Net U-Net
CHASEDB1 0.810 (0.021) 0.810 (0.022) 0.802 (0.019) 0.812 (0.020)
DRIVE 0.820 (0.014) 0.817 (0.013) 0.803 (0.014) 0.822 (0.015)
HRF 0.783 (0.055) 0.783 (0.058) 0.780 (0.057) 0.788 (0.051)
IOSTAR 0.825 (0.020) 0.825 (0.020) 0.817 (0.020) 0.818 (0.019)
STARE 0.827 (0.037) 0.823 (0.037) 0.815 (0.041) 0.829 (0.042)
TABLE III: Baseline benchmark results; models are trained and tested on the same dataset using the splits indicated in Table II.

V Methods

In this section we describe two approaches to vessel segmentation for high-resolution images. We first outline the rescaling, padding and cropping scheme, followed by our implementation of semi-supervised learning. Here we propose a simple scheme whereby three guesses for each unlabeled image, created by the network during training, are averaged and incorporated into a combined loss function via a weighting factor. Finally, we describe other implementation details and hyper-parameters.

V-A Rescaling, Cropping, Padding

Whenever we train a model for a target dataset with a specific resolution and spatial composition, we transform the source dataset so that it matches the resolution and approximate spatial composition of the target dataset. This is best illustrated by an example: treating HRF as the target dataset and CHASE_DB1 as the source dataset, we first crop the latter and then resize it, as depicted in Figure 4.

Fig. 4: Illustration of upscaling and cropping from source dataset CHASE_DB1 to target dataset HRF. (a): CHASE_DB1 image, (b): upscaled and cropped to HRF resolution, (c): HRF image for comparison

This is in contrast to approaches where the high-resolution target dataset is downscaled to the resolution of the source dataset, fed through the network, and the predictions upsampled again to the target resolution [zhao_supervised_2019, yan_joint_2018].
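The crop-then-resize step can be sketched in NumPy as below; the nearest-neighbour index mapping stands in for the bilinear interpolation a real pipeline would use, and the helper name and crop-box values are illustrative:

```python
import numpy as np

def to_target_geometry(img, crop_box, target_hw):
    """Crop a source image (H x W [x C] array) to approximate the target's
    spatial composition, then rescale it to the target resolution via
    nearest-neighbour index mapping."""
    top, left, bottom, right = crop_box
    cropped = img[top:bottom, left:right]
    th, tw = target_hw
    rows = np.arange(th) * cropped.shape[0] // th  # source row per target row
    cols = np.arange(tw) * cropped.shape[1] // tw  # source col per target col
    return cropped[np.ix_(rows, cols)]
```

For the example above, a 960 x 999 CHASE_DB1 image would be cropped square and then rescaled up to the HRF geometry.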

V-B Semi-Supervised Learning

Semi-Supervised Learning (SSL) [chapelle_ssl_2010] has seen increased interest in the image-classification domain, with recent works including [berthold_mixmatch_2019, olivier_contrastivessl_2019]. In this work we adopt the approach of Berthelot et al. [berthold_mixmatch_2019] of using unlabeled and labeled examples in separate loss terms that are combined by a weighting factor. Given a batch of unlabeled examples from the target dataset, for each unlabeled image in the batch, three guessed probability vessel labels are generated via forward passes through the model using the unlabeled image, a horizontally flipped version of it and a vertically flipped version of it, which are averaged to form the guessed label ŷ:

ŷ = (1/3) · (ỹ + ỹ_h + ỹ_v)

where ỹ, ỹ_h and ỹ_v denote the (un-flipped) predictions for the original, horizontally flipped and vertically flipped image; ŷ is then used in the SSL loss covered in the following section.
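The label-guessing step can be sketched as follows in NumPy (the actual training code operates on framework tensors); `model` is any callable mapping a batch to per-pixel vessel probabilities:

```python
import numpy as np

def guess_labels(model, batch):
    """Average the model's predictions on an unlabeled batch, its horizontal
    flip and its vertical flip; the flipped predictions are flipped back
    before averaging. batch has shape (N, C, H, W)."""
    p = model(batch)
    p_h = np.flip(model(np.flip(batch, axis=-1)), axis=-1)  # horizontal flip
    p_v = np.flip(model(np.flip(batch, axis=-2)), axis=-2)  # vertical flip
    return (p + p_h + p_v) / 3.0
```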

V-B1 Loss Functions

For standard supervised learning without SSL we utilize the Jaccard loss [iglovikov_ternausnetv2_2018], a combination of the Binary Cross-Entropy loss H and the Jaccard score J, weighted by a factor α:

L_J = α · H + (1 − α) · (1 − J)

We adopt the value of α suggested by Iglovikov et al. [iglovikov_ternausnetv2_2018].

The Binary Cross-Entropy loss, where ŷ_i is the predicted probability of pixel i belonging to the vessel class and y_i is the corresponding ground-truth binary value, forms the first part of the equation and is defined as:

H = −(1/n) Σ_i [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]
An adaptation of the Jaccard coefficient for continuous pixel-wise probabilities forms the second part of the combined loss function:

J = Σ_i ŷ_i y_i / ( Σ_i ŷ_i + Σ_i y_i − Σ_i ŷ_i y_i )
We note that the Jaccard coefficient has a monotonically increasing relation with the F1-score (also known as the Dice coefficient) [pont-tuset_supervised_2016], so it can act as an appropriate loss function even though our actual evaluation metric is the F1-score.
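Putting the two terms together, a sketch of the combined loss could look as follows. The convex α-weighting and its default value here are our reading of the description above, not a verbatim transcription of [iglovikov_ternausnetv2_2018]:

```python
import numpy as np

def jaccard_loss(probs, targets, alpha=0.7, eps=1e-7):
    """Weighted combination of binary cross-entropy and a soft (continuous)
    Jaccard term over flattened per-pixel probabilities and binary targets."""
    probs = np.clip(probs, eps, 1.0 - eps)
    bce = -np.mean(targets * np.log(probs)
                   + (1.0 - targets) * np.log(1.0 - probs))
    inter = np.sum(probs * targets)
    union = np.sum(probs) + np.sum(targets) - inter
    soft_jaccard = inter / (union + eps)
    # minimizing (1 - J) maximizes the soft Jaccard overlap
    return alpha * bce + (1.0 - alpha) * (1.0 - soft_jaccard)
```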

During SSL training, we further combine Equation 7 for labeled examples and Equation 8 for unlabeled examples with guessed labels ŷ, weighted by a factor w:

L_SSL = L_labeled + w · L_unlabeled
Instead of using a constant w, we use a quadratic ramp-up schedule illustrated in Figure 5.

Fig. 5: Quadratic ramp-up of w during training

The intuition is that the further we are into the training process, the better the semi-supervised predictions should become.
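Such a schedule can be sketched as below (the ramp length and maximum weight are illustrative, not the exact values used in our experiments):

```python
def quadratic_rampup(epoch, rampup_epochs=900, max_weight=1.0):
    """Quadratically ramp the SSL weight w from 0 to max_weight over the
    first rampup_epochs epochs, then hold it constant."""
    t = min(epoch / float(rampup_epochs), 1.0)
    return max_weight * t * t
```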

V-C Implementation Details

The following details apply to both "normal" training and SSL training. We deploy AdaBound [luo_adaptive_2019] as optimizer, using the default parameters suggested in the original paper with a learning rate of 0.001. During the training phase we apply the following random augmentations: horizontal flipping, vertical flipping, rotation, and changes in brightness, contrast, saturation and hue. Training is conducted for 1000 epochs with a reduced learning rate of 0.0001 after 900 epochs. For all datasets except HRF, we use the original resolution and perform the necessary cropping/padding so that the resolution is a multiple of 32, a requirement for the U-Net and M2U-Net variants. Since our training hardware setup did not have enough GPU memory for training on full-resolution HRF images (and the upscaled COVD source datasets), we use half-resolution images for training (1168 x 1648) but run inference on the full-resolution images (2336 x 3296). We refer to the released bob package and documentation for all details necessary to reproduce our results.
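The multiple-of-32 requirement can be satisfied with a small helper like the following (a hypothetical illustration, not the package's API):

```python
def pad_to_multiple(height, width, k=32):
    """Amount of extra rows and columns needed so that both image
    dimensions become multiples of k (a U-Net/M2U-Net requirement)."""
    return (-height) % k, (-width) % k
```

For a DRIVE image (584 x 565), for example, this yields 24 extra rows and 11 extra columns.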

VI Metrics and Evaluation

In this section we describe in more detail the two ways to calculate the F1-score and introduce an extended precision vs. recall plot.

As mentioned in Section II-B, the average F1-score over all test images can either be calculated on a micro level, where the individual per-image F1-scores are averaged, or on a macro level, where the F1-score is calculated from the average Precision and Recall:

F1_micro = (1/n) Σ_i F1_i,   F1_macro = 2 · mean(Pr) · mean(Re) / ( mean(Pr) + mean(Re) )

While previously published work uses the latter, which leads to slightly higher scores, the former allows additional insight into the variability of a model's performance, since standard deviations of the F1-score can be included.
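Following the definitions above (naming follows this paper's usage), the two averages can be sketched as:

```python
import numpy as np

def f1_micro(per_image_pr, per_image_re, eps=1e-12):
    """Mean and standard deviation of the per-image F1-scores."""
    pr = np.asarray(per_image_pr)
    re = np.asarray(per_image_re)
    f1 = 2.0 * pr * re / (pr + re + eps)
    return f1.mean(), f1.std()

def f1_macro(per_image_pr, per_image_re, eps=1e-12):
    """F1 computed from Precision and Recall averaged over test images."""
    pr = np.mean(per_image_pr)
    re = np.mean(per_image_re)
    return 2.0 * pr * re / (pr + re + eps)
```

Since F1 is a harmonic-mean-type (concave) function of precision and recall, the macro variant is never smaller than the mean of per-image F1-scores, which is consistent with the slightly higher scores noted above.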

An alternative representation of the results is shown in Figure 6, in the form of an extended precision vs. recall curve. Here, the mean Precision and Recall are plotted for every threshold, together with the standard deviation in both precision and recall. In addition, iso-F curves are plotted in light green and the point along the curve with the highest F1-score is highlighted in black. In this case, in order to have a consistent representation within the plot, the F1-score is macro-averaged (Equation 12). This setup allows for an easy visual comparison of model performance and its variability across test images.

E.g., in Figure 6 it can be observed that the variability across CHASE_DB1 test images is higher for the annotations made by the 2nd human than for our models: the models' standard-deviation bands are narrower than the standard deviation of the 2nd human annotator, depicted by a single line in light red. Put simply, the models make more consistent predictions across test images than the second annotator.

Fig. 6: M2U-Net precision vs. recall curves on CHASE_DB1. Bands in light colors depict standard deviations.

VII Experiments

To evaluate the performance of our rescaling, padding and cropping scheme, we treated each of the datasets in Table II in turn as the target dataset. Here we only report the results for M2U-Net and refer to Appendix A for the results with DRIU.

On all tested datasets we found that training on COVD yields competitive results that come close to the performance of the baselines where the model was trained and tested on the same dataset. This is encouraging, given the large differences in illumination, contrast, color and resolution of the source datasets. For HRF we can report performance improvements of almost 2 p.p. compared to the baseline.

Further applying SSL, we gain additional improvements of around 1 p.p. for CHASE_DB1 and STARE, in the latter case now narrowly beating the baseline. For DRIVE we found only marginal improvements, and worse performance for HRF and IOSTAR. We therefore cannot make conclusive statements about the viability of SSL for domain adaptation and leave the investigation, mitigation and improvement of this method to future work.

Table IV summarizes the results for COVD and COVDSSL. Additional precision vs. recall curves are shown in Appendix B.

Target \ Source Target COVD COVDSSL
DRIVE 0.803 (0.014) 0.789 (0.018) 0.791 (0.014)
STARE 0.815 (0.041) 0.812 (0.046) 0.820 (0.044)
CHASEDB1 0.802 (0.019) 0.788 (0.024) 0.799 (0.026)
HRF 0.780 (0.057) 0.802 (0.045) 0.797 (0.044)
IOSTAR 0.817 (0.020) 0.793 (0.015) 0.785 (0.018)
TABLE IV: F1-score (std) for M2U-Nets trained on Target, COVD and COVDSSL.

VIII Evaluation

To put our results into perspective, Tables V, VI, VII, VIII and IX show previous works. We report the F1-score for our results. Overall, M2U-Net trained on COVD is competitive, with the best performance on the high-resolution dataset HRF, where a new state-of-the-art F1-score could be reached. In addition to the available public datasets, we trained M2U-Net on COVD for a private target dataset with a resolution of 1920x1920 for which no ground-truth data is available. The predicted vessel probability maps are displayed in Figure 13.

Target: DRIVE (584x565)
Method Year F1 Acc Pr Re (Se) Sp
2nd human observer 0.7931 0.7881 0.8072 0.7796 0.9717
Unsupervised - - - -
Bibiloni et al. [bibiloni_real-time_2018] 2018 0.7521 0.938 0.786 0.721 0.970
Fraz et al. [fraz_ensemble_2012] 2012 0.7929 0.9480 0.8532 0.7406 0.9807
Jin et al. [jin_dunet_2019] 2019 0.8237 0.9566 0.8529 0.7963 0.9800
Laibacher et al. [laibacher_m2u_2018] 2018 0.8091 - - - -
Li et al. [li_cross-modality_2016] 2016 - 0.9527 - 0.7569 0.9816
Liskowski et al. [liskowski_segmenting_2016] 2016 - 0.9535 - 0.7811 0.9807
Maninis et al. [maninis_deep_2016] 2016 0.8220 - - - -
Marin et al. [marin_new_2011] 2011 0.8134 0.9452 0.9582 0.7067 0.9801
Orlando et al. [orlando_discriminatively_2017] 2017 0.7857 - 0.7854 0.7897 0.9684
Yan et al. [yan_joint_2018] 2018 0.8183 0.9529 0.8124 0.8242 0.9720
Zhao et al. [zhao_supervised_2019] 2019 0.7882 - - - -
M2U-Net DRIVE 0.8030 0.9619 0.8103 0.8000 0.9797
M2U-Net COVD – 0.7885 0.9592 0.7990 0.7824 0.9787
M2U-Net COVD – SSL 0.7913 0.9598 0.8016 0.7862 0.9789
All supervised methods use the same train-test split
TABLE V: Comparison with previous works on DRIVE
Target: STARE (605x700)
Method Year F1 Acc Pr Re (Se) Sp
2nd human observer - 0.9347 0.6432 0.8955 0.9382
Unsupervised - - - - -
Bibiloni et al. [bibiloni_real-time_2018] 2018 0.752 0.938 0.786 0.721 0.970
Fraz et al. [fraz_ensemble_2012] 2012 0.7747 0.9347 0.7956 0.7548 0.9763
Jin et al. [jin_dunet_2019] 2019 0.8143 0.9641 0.8777 0.7595 0.9878
Li et al. [li_cross-modality_2016] 2016 - 0.9628 - 0.7726 0.9844
Maninis et al. [maninis_deep_2016]* 2016 0.831 - - - -
Marin et al. [marin_new_2011] 2011 0.8080 0.9526 0.9659 0.6944 0.9819
Orlando et al. [orlando_discriminatively_2017] 2017 0.7644 - 0.7740 0.7680 0.9738
Yan et al. [yan_joint_2018] 2018 - 0.9612 - 0.7581 0.9846
Zhao et al. [zhao_supervised_2019]* 2019 0.7960 - - - -
M2U-Net STARE 0.8150 0.9727 0.8090 0.8257 0.9848
M2U-Net COVD – 0.8117 0.9724 0.8128 0.8114 0.9851
M2U-Net COVD – SSL 0.8196 0.9734 0.8164 0.8282 0.9847
*Same train-test split as adopted in this work
TABLE VI: Comparison with previous works on STARE
Target: IOSTAR (1024x1024)
Method Year F1 Acc Pr Re (Se) Sp
Abbasi-Sureshjani et al. [abbasi_iostar_2015] 2015 - 0.9501 0.7863 0.9747
Meyer et al. [meyer_deep_2017]* 2017 - 0.9695 - 0.8038 0.9801
Zhang et al. [zhang_robust_2016] 2016 - 0.9514 - 0.7545 0.9740
Zhao et al. [zhao_supervised_2019] 2019 0.7707 - - - -
DRIU [maninis_deep_2016] (our impl.) 0.8273 0.9721 0.8173 0.8376 0.9839
M2UNet IOSTAR 0.8173 0.9708 0.8081 0.8311 0.9831
M2U-Net COVD – 0.7928 0.9665 0.7755 0.8161 0.9798
M2U-Net COVD – SSL 0.7845 0.9644 0.7544 0.8221 0.9770
*Same train-test split as adopted in this work
TABLE VII: Comparison with previous works on IOSTAR
Target: CHASE_DB1
Method Year F1 Acc Pr Re (Se) Sp
2nd human observer 0.7686 0.9538 - - -
Azzopardi et al. [azzopardi_trainable_2015] 2015 - 0.9387 - 0.7585 0.9587
Zhang et al. [zhang_robust_2016] 2016 - 0.9452 - 0.7626 0.9661
Fraz et al. [fraz_ensemble_2012]* 2012 0.7566 0.9469 0.7415 0.7224 0.9711
Jin et al. [jin_dunet_2019] 2019 0.7883 0.9610 0.7630 0.8155 0.9752
Laibacher et al. [laibacher_m2u_2018]* 2018 0.8006 - - - -
Li et al. [li_cross-modality_2016] 2016 - 0.9581 - 0.7507 0.9793
Orlando et al. [orlando_discriminatively_2017] 2017 0.7332 - 0.7438 0.7277 0.9712
Roychowdhury et al. [roychowdhury_blood_2015] 2015 - 0.9530 - 0.7201 0.9824
Yan et al. [yan_joint_2018] 2018 - 0.9610 - 0.7633 0.9809
DRIU [maninis_deep_2016] (our impl.)* 0.8114 0.9716 0.8068 0.8160 0.9842
M2U-Net CHASE_DB1 0.8022 0.9704 0.7985 0.8086 0.9835
M2U-Net COVD – 0.7884 0.9678 0.7710 0.8095 0.9807
M2U-Net COVD – SSL 0.7988 0.9694 0.7819 0.8189 0.9816
*Same train-test split as adopted in this work
TABLE VIII: Comparison with previous works on CHASE_DB1
Target: HRF (2336x3504)
Method Year F1 Acc Pr Re (Se) Sp
Annunziata et al. [annunziata_leveraging_2016] 2016 0.7578 0.9581 0.8089 0.7128 0.9836
Budai et al. [budai_robust_2013] 2013 - 0.9610 - 0.669 0.985
Odstrcilik et al. [odstrcilik_retinal_2013] 2013 0.7324 0.9494 0.7741 0.9669
Zhang et al. [zhang_robust_2016] 2016 - 0.9556 - 0.7978 0.9710
Orlando et al. [orlando_discriminatively_2017]* 2017 0.7158 - 0.6630 0.7874 0.9584
Yan et al. [yan_joint_2018]* 2018 0.7212 0.9437 0.6647 0.7881 0.9592
Laibacher et al. [laibacher_m2u_2018]* 2018 0.7814 0.9635 - - -
Jin et al. [jin_dunet_2019]* 2019 0.7988 0.9651 0.8593 0.7464 0.9874
Zhao et al. [zhao_supervised_2019] 2019 0.7659 - - - -
DRIU [maninis_deep_2016] (our impl.)* 0.7865 0.9646 0.7863 0.7868 0.9806
M2U-Net HRF 0.7800 0.9641 0.7798 0.7880 0.9798
M2U-Net COVD - 0.8020 0.9669 0.7889 0.8188 0.9802
M2U-Net COVD - SSL 0.7972 0.9659 0.7898 0.8021 0.9807
*Same train-test split as adopted in this work
TABLE IX: Comparison with previous works on HRF

IX Conclusion

In this work we showed that simple transformation techniques like rescaling, padding and cropping of lower-resolution source datasets to the resolution and spatial composition of a higher-resolution target dataset can be a surprisingly effective way to improve segmentation quality. Our experiments with semi-supervised learning show first promising results but require further investigation and work. We emphasized the need for a more rigorous and detailed focus on evaluation metrics and proposed a set of plots and metrics that give additional insights into model performance. Lastly, we provide open-source code and documentation for future researchers to build upon and hope to inspire future work in the field.

Appendix A DRIU and DRIU BN Results

In addition to the M2U-Net architecture, we also evaluated the larger DRIU network and a variation of it that contains batch normalization (DRIU BN) on COVD and COVDSSL. Perhaps surprisingly, for the majority of combinations, the performance of the DRIU variants is roughly equal to or worse than that of the M2U-Net. We suspect that one reason for this could be the aforementioned overparameterization of large VGG16 models pretrained on ImageNet. The results are listed in Table X.

Source Target DRIU DRIU BN M2U-Net
COVD DRIVE 0.788 (0.018) 0.797 (0.019) 0.789 (0.018)
COVDSSL DRIVE 0.785 (0.018) 0.783 (0.019) 0.791 (0.014)
COVD STARE 0.778 (0.117) 0.778 (0.122) 0.812 (0.046)
COVDSSL STARE 0.788 (0.102) 0.811 (0.074) 0.820 (0.044)
COVD CHASE_DB1 0.796 (0.027) 0.791 (0.025) 0.788 (0.024)
COVDSSL CHASE_DB1 0.796 (0.024) 0.798 (0.025) 0.799 (0.026)
COVD HRF 0.799 (0.044) 0.800 (0.045) 0.802 (0.045)
COVDSSL HRF 0.799 (0.044) 0.784 (0.048) 0.797 (0.044)
COVD IOSTAR 0.791 (0.021) 0.777 (0.032) 0.793 (0.015)
COVDSSL IOSTAR 0.791 (0.021) 0.811 (0.074) 0.785 (0.018)
TABLE X: Comparison of F1-scores (std) of DRIU and M2U-Net on COVD and COVDSSL. Standard deviation across test images in brackets.

Appendix B Qualitative Results and PR Curves for M2U-Net

Precision vs. recall curves for each evaluated dataset are shown in Figures 7, 8, 9, 10 and 11. Additionally an illustration of predicted vessel maps vs ground truths for M2U-Net on HRF is provided in Figure 12.

Fig. 7: M2U-Net precision vs. recall curves on HRF
Fig. 8: M2U-Net precision vs. recall curves on CHASE_DB1
Fig. 9: M2U-Net precision vs. recall curves on IOSTAR
Fig. 10: M2U-Net precision vs. recall curves on STARE
Fig. 11: M2U-Net precision vs. recall curves on DRIVE
Fig. 12: Illustration of predicted vessel maps vs ground truths for M2U-Net evaluated on the HRF test-set. The source dataset is indicated in the first column. True positives, false positives and false negatives are displayed in green, blue and red respectively
Fig. 13: Illustration of predicted vessel maps for M2U-Net applied on the private high-resolution dataset (1920x1920) for which no ground-truth data is available.