Dynamic Deep Networks for Retinal Vessel Segmentation

03/19/2019 ∙ by Aashis Khanal, et al. ∙ Georgia State University 0

Segmenting the retinal vasculature entails a trade-off between how much of the overall vascular structure we identify vs. how precisely we segment individual vessels. In particular, state-of-the-art methods tend to under-segment faint vessels, as well as pixels that lie on the edges of thicker vessels. Thus, they underestimate the width of individual vessels, as well as the ratio of large to small vessels. More generally, many crucial bio-markers---including the artery-vein (AV) ratio, branching angles, number of bifurcation, fractal dimension, tortuosity, vascular length-to-diameter ratio and wall-to-lumen length---require precise measurements of individual vessels. To address this limitation, we propose a novel, stochastic training scheme for deep neural networks that better classifies the faint, ambiguous regions of the image. Our approach relies on two key innovations. First, we train our deep networks with dynamic weights that fluctuate during each training iteration. This stochastic approach forces the network to learn a mapping that robustly balances precision and recall. Second, we decouple the segmentation process into two steps. In the first half of our pipeline, we estimate the likelihood of every pixel and then use these likelihoods to segment pixels that are clearly vessel or background. In the latter part of our pipeline, we use a second network to classify the ambiguous regions in the image. Our proposed method obtained state-of-the-art results on five retinal datasets---DRIVE, STARE, CHASE-DB, AV-WIDE, and VEVIO---by learning a robust balance between false positive and false negative rates. In addition, we are the first to report segmentation results on the AV-WIDE dataset, and we have made the ground-truth annotations for this dataset publicly available.



There are no comments yet.


page 2

page 7

page 15

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Retinal vessels provide the only non-invasive view of the cardiovascular system. Thus, they are a key diagnostic feature for a number of diseases, including diabetic retinopathy [1], coronary heart disease [2], and atherosclerosis [3]. However, the current standard of care requires manual inspection by an ophthalmologist, which makes it more challenging for people in developing nations and low-income communities to receive care. For example, only 30% of African Americans in southern Los Angeles reported being screened for diabetic retinopathy, despite being the most at-risk ethnicity [4]. Therefore, it is vital to develop automatic retinal analysis methods to improve screening rates and public health outcomes.

The first step in a retinal analysis pipeline is to segment the regions in the image that correspond to the vasculature. Formally, vessel segmentation is a binary classification problem, but it presents unique challenges. First, existing datasets are small—typically on the order of 20-40 images—because an ophthalmologist has to manually trace every vessel in each fundus image. In addition, we need to classify hundreds of thousands or even millions of pixels per image, and the labels of nearby pixels are correlated.

Deep neural networks are the state of the art for a wide range of classification problems, but, due to the aforementioned challenges, training a deep network to segment retinal images is not straightforward. The two main strategies for applying deep learning to this domain are: (1) dividing each image into small patches to maximize the number of training samples

[5] or (2) combining traditional convolutional layers with upsampling to learn both local and global features [6]. In particular, the U-net architecture [6] and its extensions (e.g., Recurrent U-net [7]) have achieved state-of-the-art results by applying the latter strategy.

However, U-net and other deep learning-based methods favor precision over recall, i.e., they tend to classify ambiguous pixels as background. This strategy is statistically sound since retinal images have about a nine to one ratio of background to vessel pixels, but it under-segments faint vessels, as well as pixels that lie on the edges of thicker vessels. Thus, current methods underestimate the width of individual vessels, as well as the ratio of large to small vessels. This, in turn, compromises later diagnoses since many crucial bio-markers—including the artery-vein (AV) ratio, branching angles, number of bifurcation, fractal dimension, tortuosity, vascular length-to-diameter ratio and wall-to-lumen length—require precise measurements of individual vessels.

Figure 1: Dynamic Vessel Segmentation Pipeline: We separate vessel segmentation into simpler tasks. The steps in gray boxes are fixed, while the ones in blue require training. In (a), we first extract the green channel and preprocess it with contrast adaptive histogram equalization. Then, in (b)

we train a U-net architecture using a dynamic loss function. This network outputs a likelihood map across the entire image. In

(c), we use two thresholds, support and resistance, (SRS) to separate easy (top image) from ambiguous (bottom image) pixels. We assign the final labels to easy pixels at this stage. To classify ambiguous pixels, in (d) we use the support threshold to generate a high-recall estimate of the vascular structure (right image) and then skeletonize this mask (left image). We then intersect this skeleton with a lattice mask (shown in (e)) to find a set of seed points (shown in (f). The left image in (f) shows the seed locations for the entire image, while the right image is a close-up of the square shown in red. We then define a small patch around each seed point and train a small network (mini-U-net) on these patches (g). Each patch has two channels. The channel highlighted with the purple border is extracted from the original likelihood map, while the channel with the green border includes only the ambiguous pixels in that region. In (h), we merge the mini-U-net predictions for ambiguous pixels with the labels for the easy pixels to obtain our complete segmentation result (shown in (i)). Figure best viewed on-screen.

In this paper, we propose a novel, two-stage segmentation method that combines both upsampling and patch-based processing. Our approach also leverages loss functions with stochastic weights to better detect faint and ambiguous vessel pixels. As our experiments show, our proposed pipeline achieves state-of-the-art results on five vessel segmentation datasets, including conventional fundus images [8, 9, 10], wide field-of-view (FOV) images [11], and low-resolution video stills and mosaics [12].

As Fig. 1 shows, we separate the overall vessel segmentation problem into two stages: (1) likelihood estimation and (2) targeted prediction. In the first half of our pipeline, we train a U-net architecture to estimate the vessel likelihood of each pixel using a stochastic cross entropy loss. That is, we randomly assign a weight to each class on each training iteration, which force the network to alternate between higher false positive vs. higher false negative rates during training, preventing it from getting stuck in a single minimum. In the second half of our pipeline, we first separate potential vessel pixels into two categories—easy and ambiguous—and then train a patch-based network to classify the latter. (Easy pixels can be classified with a simple threshold.) This second network, which we refer to as mini-U-net, is a scaled-down version of the network used in the first stage. To the best of our knowledge, we are the first to apply stochastic weights for vessel segmentation, as well as the first to combine both upsampling and patch-based learning into a single pipeline.

The rest of this paper is organized as follows. First, we discuss related work on vessel segmentation, particularly deep learning-based approaches. Then, we detail our complete methodology, including our stochastic loss function and targeted prediction steps. We then present and discuss experimental results on five datasets and explore avenues for future work.

2 Related work

Vessel segmentation has a long history, although the advent of deep neural networks has yielded significant improvements in recent years. Earlier techniques relied primarily on handcrafted features, including matched filters [13], quadrature filters [14], and Gabor filters [15, 16]. The latter approach, in particular, uses a predefined kernel bank to incorporate all vessel widths and orientations. However, handcrafted features are limited by our ability to analytically model the segmentation process. Other classic techniques, as surveyed further in  [17], include piece-wise thresholding [9], region growing [18], and concavity measurements [19]. In addition to handcrafted features, some prior approaches used graph-theoretic techniques to trace the vascular structure, including shortest-path tracking  [12] and the fast marching algorithm [20].

Since the breakthrough by convolutional neural networks (CNNs) on the ImageNet dataset in 2012  


, deep learning has achieved tremendous success across a wide array of machine learning tasks  

[21, 22, 23, 24]. Currently, deep networks are the state of the art in vessel segmentation. In particular, the most successful deep architecture in this domain—U-net [6]—combines traditional convolutional layers with feature upsampling to estimate both local and global image features. As noted above, earlier deep learning approaches [25, 26] divided an image into a series of patches and applied a classifier to each patch independently. For instance, Dasgupta et al. divided the full image into small patches to train a CNN  [25], while Ciresan et al. defined a patch around each pixel and used those patches to train their network  [26]. One of the drawbacks of patch-based approaches is that we need to train a CNN with a large number of patches, most of which have redundant information. The upsampling in U-net, on the other hand, allows a network to train with fewer, significantly larger patches.

There are numerous extensions to the original U-net architecture. Alom et al. increased the performance of U-net by introducing residual connection, and a recursive training strategy  

[22], albeit at the expense of increased training time due to the extra connections and recursion. Zhuang et al. arranged two U-net architectures in serial fashion, enabling the architecture to learn more abstract features  [27]. This serialization significantly increases the training time and requires heavy computational infrastructure to be able to train well. Jin et al. used deformable CNNs to construct better vascular features [28], but also with added training complexity. Finally, Yun et al. combined conditional generative adversarial networks (GANs) with U-net [23], achieving results comparable to the other U-net extensions. In contrast to these extensions, our approach not only yielded better results, but is simpler and faster to train, as detailed in the following section.

3 Methodology

As noted above, we separate vessel segmentation into two stages. Fig. 1 details each step of our pipeline. In the first stage, we use a stochastic loss function to obtain an overall likelihood map for every pixel in the image. Then, we identify those pixels that are most likely to be misclassified—generally faint vessels and pixels at the periphery of thick vessels—and apply a second, smaller network to these ambiguous regions. Below, we first present and motivate the use of stochastic loss functions and then discuss each step in our pipeline in turn.

3.1 Dynamic loss functions

A weighted loss function penalizes some types of errors more than others. They are useful for problems with unbalanced datasets or in which we want to prioritize certain outcomes (e.g., minimize false positives for a particular class). Traditionally, these weights have been fixed a priori by estimating them using a training set.

In this work, however, we propose using dynamic or stochastic weights. In this case, each weight is defined not as a single value but as a distribution. During training, every time we feed a new mini-batch to the network, we first sample weights for each class and then use those sampled weights to compute the loss. Since the weights will vary for each training iteration, the same error will be penalized more or less heavily at different points during training. Empirically, this forces the network to optimize across all ranges of false positive vs. false negative rates, leading to more robust classification. For concreteness, below we discuss the use of dynamic weights for the cross-entropy loss, but one can randomize any weighted loss function.

The cross-entropy loss is a natural choice for binary classification since it encourages values to converge towards zero or one. This loss is defined as follows:


where is the true distribution and

is the predicted distribution, i.e., the network’s outputs. The training process aims to make these two distributions as close as possible. However, when the data is highly skewed, we can trivially obtain a high accuracy by always assigning the most common class to every pixel. In the case of vessel segmentation, very few pixels are part of a vessel, so always assigning a label of

background will yield an accuracy of around 90% (but a recall of 0%). A straightforward way to ameliorate this imbalance is by applying different weights to each class:


where are the class weights and are the class labels ( for binary segmentation). By using weights, we force the network to care about both positive and negative samples equally during training. An easy way to define these weights is to estimate them from the training set, as was done for the original U-net architecture [6].

A major drawback of the above strategy is that it imposes a fixed ratio of vessel-to-background. As our experiments show, this ratio is a major determinant of whether ambiguous pixels will be mapped to either vessel or background. In other words, when faced with an ambiguous pixel, the network will use this ratio to "guess" which label to assign to it. We can see in Figs. 3(b) and 3(c) that a network trained with fixed weights will preferentially over- or under-segment ambiguous pixels.

Instead, we propose using stochastic weights which are randomly generated at each training iteration:


where function assigns random class weight within a range of to and is the step by which the random class weights are taken. For example, = 1 would generate random integers between . Intuitively, dynamic weights prevent the network from getting stuck in a low-value local minimum.

When we use fixed class weights, the neural network will always penalize all pixels by the same factor. As noted earlier and as Figs. 3(b) and 3(c) show, fixed weights fail to account for fainter vessels and are insensitive to noise. Stochastic costs reduce this bias because the network must vary in how aggressively it segments ambiguous pixels throughout the training process. Even if completely different weights are given in a subsequent iteration, some of the information learned in the previous step is preserved, leading to a more balanced segmentation of ambiguous regions. As Fig. 3(d) shows, dynamic weights capture fainter vessels more accurately than fixed weights.

We now discuss the various steps of our proposed pipeline below.

3.2 Likelihood map

The first step in our pipeline consists of training a U-net architecture using stochastic weights to derive a likelihood map over the entire image. The goal of this map is not to segment the image directly but to indicate how likely a particular pixel is to be part of a vessel. We do not segment the image at this stage because the up-sampling used by U-net blurs the original vessels, making it difficult to correctly label high-frequency elements (e.g., thin vessels or pixels near the edge of a thick vessel).

Figure 3 contrast the outputs of our stochastic U-net vs. two fixed-weight U-nets on the datasets used in our experiments (see figure for details). Overall, networks trained with fixed weights struggle to capture faint vessels, as well as the pixels at the edges of thicker vessels. Interestingly, our stochastic weights are also better able to ignore vessel-like artifacts, e.g., the rim of a lens as seen in Fig. 3.

Figure 2: U-net and mini-U-net architectures: As originally detailed in [6]

, the first half of the U-net architecture is a standard sequence of convolution and max-pooling layers. The layers in the second half, on the other hand, receive two inputs: the features from the previous layer and the features from the corresponding layer in the first half of the network (i.e., the one with the same dimensionality). The features from previous layers are upsampled to match the larger dimensions of subsequent layers. Intuitively, the upsampled features provide more global information, while the features transferred from the earlier layer give more local information. Our mini-U-net architecture (highlighted by the light blue square) consists of the middle layers of the full U-net.

(a) (b) (c) (d)
Figure 3: Likelihoods maps generated by U-net with different weights variations: The rows correspond to a sample from the DRIVE, STARE, AV-WIDE, VEVIO (Mosaics), VEVIO (Frames), and CHASE-DB datasets, resp. (a) Original image. (b) Likelihood map when trained with class weights estimated from the training data. (c) Likelihood map when using equal weights for both classes. (d) Likelihood map when using dynamic weights (see text for details). Using dynamic weights allows the network to capture more information about the vascular structure without introducing additional noise. Figure best viewed on-screen.

3.3 Support-resistance segmentation (SRS) for easy pixels

The likelihood map generated in the previous step ranges from [0,1]. Pixels with a higher probability of being vessels have values closer to one (and closer to zero otherwise). Intuitively, the closer a value is to either extreme, the more confident the network is in its classification. Thus, we use this likelihood map to separate pixels into either

easy or ambiguous classes. The top and bottom of Fig. 1(c) show the easy and ambiguous pixels, respectively, of the likelihood map in Fig. 1(b).

We use two thresholds to classify pixels into these two classes. We refer to the lower threshold as support and the upper threshold as resistance. All pixels between resistance and support are considered ambiguous. Pixels below support (those with values near zero), which account for the majority of pixels, are background pixels whereas pixels above the resistance (those with value near one) are considered vessel pixels.

The choice of support and resistance thresholds affects the trade-off between precision and recall. Thresholds close to one and zero, respectively, lead to higher precision for easy pixels but cause more pixels to fall in the intermediate, ambiguous band. Thresholds closer to the middle lead to higher recall, but affect precision. For our experiments, we empirically set a support of 20 and a resistance of 235.

3.4 Targeted prediction for ambiguous pixels

The last stage of our pipeline segments ambiguous pixels, i.e., those that fall within the support-resistance band described above. Figs. 1(d)-(g) illustrate the various steps involved in this targeted prediction. Generally, most vessel pixels will lie above the support threshold, so they have already been predicted with high precision. To classify the pixels with intermediate values, we train a second, patch-based network only on those pixels. As Fig. 2 shows, this mini-U-net network is a scaled-down version that uses only the middle layers of the full U-net architecture. We use two channels as inputs to this network: the full likelihood map outputted by the first U-net (Fig. 1(b)) and a version of this likelihood map which only contains ambiguous pixels (the bottom image in Fig. 1(c)).

Unlike prior patch-based approaches (e.g., Ciresan et al. [5]), we do not extract a patch for every pixel. Instead, we use a sparse set of patches centered at the intersections between ambiguous pixels and a regular lattice, as described below:

  1. Raw estimate: We first produce a high-recall binary mask by considering all pixels above the support threshold as vessels. Intuitively, this mask should contain almost no false negatives, but will have numerous false positives. We then skeletonize the largest connected component in this mask (Fig. 1(d)).

  2. Patch selection: We then define a regular lattice of width (Fig. 1(e)) and determine the intersection points between this lattice and the skeleton from the previous step (Fig. 1(d)). We empirically verified that the union of the patches defined by these intersections always covered every ambiguous pixel.

  3. Mini-U-net classification: Finally, we classify the ambiguous pixels in each patch using our second network (Fig. 1(g)). For each patch, we extract the corresponding regions in the full likelihood map and the map with only ambiguous pixels, resp. Intuitively, the second channel indicates which pixels are most crucial to classify correctly. Finally, we use the outputs from the second network as the final labels for ambiguous pixels (Fig. 1(h)).

4 Experiments and Results

We carried out experiments on five different retinal vessel datasets to verify the effectiveness of our proposed approach. Below, we first detail our experimental protocol and then discuss our results.

4.1 Materials and methods

Hardware: All our experiments were conducted on a Dell Precision 7920R server with two Intel Xeon Silver 4110 CPUs, two GeForce GTX 1080 Ti graphics cards, and 128 GBs of RAM.

Datasets: We used five datasets for our experiments (see the first column of Fig. 3 for a sample image from each dataset):

  • DRIVE [8]: 40, 565584 color fundus images centered on the macula. Each image includes a circular field-of-view (FOV) mask. This dataset is pre-divided into 20 training and 20 test images, but we further divided the training set randomly into 15 training and 5 validation images.

  • STARE [9]: 20, 700 605 color images with no FOV masks, also centered on the macula. Many images exhibit some degree of pathology (e.g., exudates).

  • AV-WIDE [11, 29]: 30 wide-FOV images. Image sizes vary, but are around 1300800 pixels for most images. Images are loosely centered on the macula. As explained below, we used the existing, graph-based annotations to manually segment these images.

  • VEVIO [12]: This dataset has two types of images: 16, 640480 frames from 16 different video indirect ophthalmoscopy (VIO) videos and 16 corresponding mosaics (around 600500 pixels) obtained by fusing different frames into a single image [30]. Due to the difficult image acquisition process, these images are blurry and have lens artifacts.

  • CHASE-DB [10]: 14 left-to-right pairs of color images centered on the optic nerve (28 images total). Images are 999960 pixels.

Data preparation: We used only the green channel of each image. We then applied contrast adaptive histogram equalization [31] as a preprocessing step to enhance contrast. Given the limited number of images in each dataset, we used image augmentation techniques to increase our data size. Specifically, we flipped the input image horizontally and vertically randomly in each iteration.

Ground truth preparation: All datasets except AV-WIDE had preexisting ground-truth pixel labels. The original AV-WIDE dataset had only graph-based labels, so we manually traced the vessels in Adobe Photoshop CS (Adobe Systems Inc., San Jose, CA) using a Wacom Intuous3 graphics tablet (Wacom Co. Ltd, Kazo-shi, Saitama, Japan) to obtain pixel-level segmentations. The original, graph-based annotations had been carried out by an expert ophthalmologist, so we followed them exactly.

Network architectures: As noted above, we used a full-sized U-net architecture for our initial probability map and a smaller, mini-U-net network for targeted prediction. The U-net architecture consist of a series of down-sampling steps followed by the same number of up-sampling steps. Each up-sampling step receives, crops to match the size, and concatenates the feature map from the down-sample step on the same level, as shown in Fig. 2. We used 33-pixel kernels except in the last layer; here, we used 11 kernels to produce two outputs, which are then passed through a softmax layer to obtain two probabilities: one for that particular pixel being a vessel and another for it being part of the background. Depending on which of the probabilities is higher, a pixel is labeled as either vessel or background. As in the original U-net architecture [6], this network receives inputs of size 572572 pixels, but only predicts values for the internal 388388 pixels. The outer band of 9292 pixel allows the U-net a wider view of the vascular structure. As in [6], we shift the U-net to obtain a label for every pixel and mirror the 388 388 region if the outer band extends beyond the image.

Our second network, the mini-U-net architecture, is identical to the middle three layers of the full-size U-net (see Fig. 2). Thus, its input size is 140 140 and it produces labels for the middle 100100 pixels. Similar to the first network, we expand and mirror patches as needed. However, we train this network using a dice loss, rather than cross entropy because our goal is to maximize the F1 score of our pipeline. The weighted dice loss is given by the following formula:


, where controls how much we want to favor precision vs. recall. Higher beta yields higher precision and vice versa. As with the full-size U-net, we used a stochastic loss in which we sampled for each training iteration:


, where is picked randomly within range - with a stepsize of .

Training parameters: We trained our networks using Adam optimization [32]

with a learning rate of 0.001 and a mini-batch size of 4. We trained the full U-net for 250 epochs and the mini-U-net for 60 epochs. For the full U-net, we used stochastic weights of

, i.e., where weights randomly oscillated within a range of with a step size of . For the mini-U-net, we used a dice loss with stochastic weights of (1, 2, 0.1), where randomly oscillated between with a step size of .

Model validation: We used 5-fold cross validation with training, validation, and test sets on all datasets except DRIVE, since this dataset already had a predefined test set. For each fold, we randomly split the training set into 75% training and 25% validation images. In our tables, we report the performance on the test set across all the folds. Note that, using this protocol, each image is only included once in a test set.

4.2 Pipelines tested

To better understand the impact of each stage of our pipeline, we tested two versions of our approach, as well as two baseline methods:

  • Dynamic weights with targeted prediction: Our proposed approach.

  • Dynamic weights only: We trained the full U-net using dynamic class weights, but omitted the targeted prediction step.

  • Fixed weights only: We trained the full U-net using fixed class weights based on the ratio of vessel to background pixels in the training set. No targeted prediction step.

  • Fixed weights with targeted prediction: We applied both fixed weights and targeted prediction.

The parameters for the Mini-U-net at the end of the full pipelines (variations 1 and 4) were the same for both configurations. In total, we carried out twenty experiments across the five datasets.

4.3 Results

We quantified our method’s performance in terms of precision, recall and F1 score (also called F-measure). For comparison to the state of the art, we also calculated the accuracy of our model, even though this value can be misleading for unbalanced datasets [12]. The precision, recall, and accuracy values in our tables are the mean values across all test images. The reported F1 score is the harmonic average of the corresponding mean precision and recall values. As Forman et al. showed [33], if the positive cases constitute around 7% or more of the samples, then this is an unbiased way to calculate the mean F1 score for -fold cross validation. Across the different datasets, vessel pixels constituted around 10% of all pixels.

Method Precision Recall F-Score Accuracy
U-net[22] 0.8852 0.7537 0.8142 0.9531
Residual U-net[22] 0.8614 0.7726 0.8149 0.9553
Recurrent U-net[22] 0.8603 0.7751 0.8155 0.9556
R2 U-net[22] 0.8589 0.7792 0.8171 0.9556
Conditional GAN [23] 0.8143 0.8274 0.8208 0.9608
LadderNet[27] 0.8593 0.7856 0.8208 0.9561
DUNet[28] 0.8537 0.7894 0.8203 0.9697
Fixed weights only 0.7657 0.8410 0.8015 0.9633
Fixed weights with targeted prediction 0.7823 0.8246 0.8028 0.9643
Dynamic weights only 0.8323 0.8163 0.8242 0.9692
Dynamic weights with targeted prediction 0.8284 0.8235 0.8259 0.9693
Table 1: Results on DRIVE dataset

Tables 1 to 6 summarize our results on the various datasets. A value in bold represents the best score for that particular column. For convenience, we also included those of current state-of-the-art approaches. Note, though, that we only list methods that reported F1 scores (or, equivalently, both precision and recall values). Due to the strong class imbalance between the two classes, accuracy alone is not an informative measure of performance. As the various tables show, our proposed pipeline consistently outperformed existing methods. To the best of our knowledge, our F1 scores and most of our accuracy values are state-of-the-art results across the five datasets.

Overall, dynamic weights outperformed fixed weights across the various datasets, including the highly studied DRIVE and STARE datasets. Specifically, stochastic weights led to more balanced precision and recall scores. For example, in the DRIVE dataset, the full, stochastic pipeline obtained a precision of 0.8284, and a recall of 0.8235 while the fixed-weight pipeline gave a more unbalanced results (precision of 0.7823 and recall of 0.8246). This pattern was replicated across datasets. Compared to prior methods, we consistently achieved better recall with little or no sacrifice in precision. Existing methods with high precision and low recall seldom misclassify a background pixel as a vessel, but they struggle to detect faint vessels.

Notably, our method outperformed other extensions of the original U-net architecture, including LadderNet  [27], R2Unet  [22], and DUNet [28]. The difference is particularly large for the CHASE-DB dataset (Table 3). Here, our method score was higher than the previous state of the art (LadderNet).

Method Precision Recall F-Score Accuracy
U-net[22] 0.8475 0.8270 0.8373 0.9690
Residual U-net[22] 0.8581 0.8203 0.8388 0.9700
Recurrent U-net[22] 0.8705 0.8108 0.8396 0.9706
R2 U-net[22] 0.8659 0.8298 0.8475 0.9712
Conditional GAN [23] 0.8466 0.8538 0.8502 0.9771
DUNet[28] 0.8856 0.7428 0.8079 0.9729
Fixed weights only 0.8138 0.8538 0.8480 0.9739
Fixed weights with targeted prediction 0.7979 0.8692 0.8320 0.9732
Dynamic weights only 0.8413 0.8424 0.8418 0.9758
Dynamic weights with targeted prediction 0.8559 0.8541 0.8549 0.9780
Table 2: Results on STARE dataset
Method Precision Recall F-Score Accuracy
U-net[22] 0.7336 0.8288 0.7783 0.9578
Residual U-net[22] 0.7857 0.7726 0.7800 0.9553
Recurrent U-net[22] 0.8195 0.7459 0.7810 0.9622
R2 U-net[22] 0.81072 0.0.7756 0.7928 0.9634
LadderNet [27] 0.8084 0.7978 0.8031 0.9656
DUNet[28] 0.7510 0.8229 0.7853 0.9724
Fixed weights only 0.8089 0.8271 0.8179 0.9744
Fixed weights with targeted prediction 0.8266 0.8085 0.8174 0.9749
Dynamic weights only 0.8175 0.8296 0.8235 0.9753
Dynamic weights with targeted prediction 0.8550 0.8143 0.8245 0.9759
Table 3: Results on CHASE-DB dataset
Method Precision Recall F-Score Accuracy
Fixed weights only 0.7611 0.7898 0.7751 0.9706
Fixed weights with targeted prediction 0.7439 0.8083 0.7747 0.9698
Dynamic weights only 0.8154 0.7751 0.7947 0.9742
Dynamic weights with targeted prediction 0.8283 0.7815 0.8042 0.9755
Table 4: Results on AV-WIDE dataset

Table  4 shows our results for the AV-WIDE dataset. Our paper is the first assessment of this dataset as we manually created the ground truth for it. As with the other datasets, we can observe a similar pattern of unbalanced precision and recall for fixed class weights. In contrast, our stochastic technique has more balanced precision and recall. Moreover, the score and accuracy are significantly higher for our method, particularly for the full pipeline with targeted prediction.

Method Precision Recall F-Score Accuracy
Forest-based Dijkstra [12] - - 0.5053 0.9573
Fixed weights only 0.5537 0.6832 0.6093 0.9637
Fixed weights with targeted prediction 0.5472 0.6984 0.6136 0.9632
Dynamic weights only 0.6147 0.6355 0.6249 0.9683
Dynamic weights with targeted prediction 0.6573 0.6739 0.6654 0.9719
Table 5: Results on VEVIO dataset (Mosaics)
Method Precision Recall F-Score Accuracy
Forest-based Dijkstra [12] - - 0.5403 0.9101
Fixed weights only 0.5212 0.6580 0.5817 0.9569
Fixed weights with targeted prediction 0.5328 0.6482 0.5849 0.9581
Dynamic weights only 0.5924 0.5896 0.5910 0.9629
Dynamic weights with targeted prediction 0.60052 0.5435 0.5706 0.9628
Table 6: Results on VEVIO dataset (Frames)

Finally, our method also achieved state-of-the-art results in the highly challenging VEVIO dataset, which has two sets of images: composite mosaics and individual frames taken from a low-resolution video [30]. As with the other datasets, fixed class weights gave highly biased scores in terms of precision and recall, while dynamic weights gave more balanced results (Tables 5 and 6).

Dataset and class weights Precision Recall F-Score Accuracy
DRIVE {1, 1} 0.8418 0.8023 0.8216 0.9694
DRIVE class-weighted 0.7657 0.8410 0.8015 0.9633
STARE {1, 1} 0.8559 0.8208 0.8379 0.9757
STARE class-weighted 0.8138 0.8538 0.8480 0.9739
CHASE-DB {1, 1} 0.8332 0.8135 0.8232 0.9757
CHASE-DB class-weighted 0.8089 0.8271 0.8179 0.9744
AV-WIDE {1, 1} 0.8231 0.7552 0.7876 0.9737
AV-WIDE class-weighted 0.7611 0.7898 0.7751 0.9706
VEVIO Mosaic {1, 1} 0.6507 0.5564 0.5998 0.9690
VEVIO Mosaic class-weighted 0.5537 0.6832 0.6093 0.9637
VEVIO Frame {1, 1} 0.6288 0.5257 0.5726 0.9643
VEVIO Frame class-weighted 0.5212 0.6580 0.5817 0.9569
Table 7: Equal weights {1,1} vs. class-based weights across datasets. Results shown are for U-net without targeted prediction.
Dataset and class weights Precision Recall F-Score Accuracy
DRIVE class-weighted (about {1, 10}) 0.7657 0.8410 0.8015 0.9633
DRIVE {1, 5} 0.7806 0.8264 0.8028 0.9642
DRIVE {1, 1} 0.8418 0.8023 0.8216 0.9694
DRIVE {5, 1} 0.8722 0.7270 0.7930 0.9665
DRIVE {10, 1} 0.8712 0.7137 0.7867 0.9654
DRIVE 0.8508 0.7965 0.8227 0.9696
DRIVE (our approach) 0.8323 0.8163 0.8242 0.9692
Table 8: Results of different combination of fixed weights in cross entropy loss function without targeted prediction for DRIVE dataset {, }={weight for class 0 or background pixels, and weight for class 1 or vessel pixel}.
(a) (b) (c) (d) (e) (f)
Figure 4: Color map of recall vs precision for different weight variations: The left column shows the progression of recall (x-axis) vs. precision (y-axis) during training, while the right column shows the corresponding values for the validation set. Lighter dots indicate later training epochs. The first row, (a, b), corresponds to dynamic weights (our approach), (c, d) to weights estimated from the training set, and (e, f) to equal weights for both classes. The scale of precision and recall is [0.5, 1.0] in all plots. Figure best viewed in color.

5 Discussion

The key novelties of our technique are two-fold. First, we use dynamic, as opposed to fixed, weights to allow our networks to learn a more robust balance between false positives and false negatives. Second, we separate different types of tasks involved in retinal vessel segmentation. The likelihood estimation step only focuses on capturing the vascular structure without worrying about the final segmentation. Conversely, the targeted prediction step relies on having a precomputed likelihood map with which to make the final predictions. We discuss specific properties of our approach below.

5.1 Dynamic weights yield smoother likelihood maps

In Fig. 3

(columns b and c), we can see that using class weights computed from the training set almost binarizes the image. All the likelihoods are very close to either 0 (background) or 1 (vessel); thus, we have no any useful information with which to refine our estimate in the targeted prediction step. Any information about fainter pixels was lost. However, if we use dynamic weights, a higher percentage of pixels will be assigned non-trivial likelihood values (Fig. 

3(d)), which will allow our second network to better classify ambiguous pixels.

5.2 Dynamic weights yields more balanced precision and recall

As we can observe from the results (Tables 1 to 6), our method consistently achieves a good balance between precision and recall. This implies that our approach handles fainter or more ambiguous pixels, rather than ignoring them. In contrast, the state-of-the-art results for the DRIVE dataset [27] report precision and recall values of and , respectively. Our corresponding, more balanced values were and . We can see even more of a difference on the VEVIO dataset (Tables 5 and 6). For the mosaics, the precision and recall for the U-net with class weights were and , respectively. However, our full, dynamic pipeline yielded values of and .

Our dynamic weights allow a network to settle on a local optimum that better balances precision and recall. As noted earlier, retinal images have a much higher percentage of background pixels (over 90%). Thus, a network trained on this type of data will be more reticent to label an ambiguous pixel a vessel (since the prior probability is so much lower). This imbalance makes the network settle on a local optimum that is skewed towards high precision and low recall. Previous approaches have ameliorated this effect by assigning fixed class weights in order to re-balance the classification problem. For example, if only 10% of pixels are vessels, then misclassifying those pixels will be penalized by a factor of 9. However, not all faint vessels are similar or distributed equally across different images, so a one-size-fits-all treatment is suboptimal. Our technique, on the other hand, applies

stress tests during the training process to ensure that the network is robust across different ratios of precision and recall. As our experiments show, this approach allows the network to settle on a a better optimum.

In more detail, Fig. 4 illustrates the effects of dynamic vs. fixed weights during training for the DRIVE dataset. Here, each subplot on the left-hand side shows precision vs. recall during training, while the subplots on the right show this value for the validation set. As Fig. 4(c) shows, a network that uses weights calculated from the training set will begin by strongly favoring recall, since a false negative is penalized more heavily than a false positive. In contrast, an unweighted network (Fig. 4(e)) will favor precision at first, since the probability that an ambiguous pixel is a vessel is very low, a priori. Over time, both networks learned to classify the training set, but their initial bias lead to poor results on the validation set, suggesting overfitting (Fig. 4(d and f)). This behavior implies that the network was very confident about easy pixels and therefore settled on an optimum that favored this class of pixels; however, this choice made it ignore fainter pixels, in turn. Our method, on the other hand, (Fig. 4(a, b)), had more balanced scores throughout the training process. The optimum on which it settled was more robust and thus achieved better scores on the validation and test sets. Our results suggest that the network was able to learn features that separate faint vessels from noise in the background.

5.3 Effect of different weighing schemes

Finally, we carried out additional analyses to investigate the impact of different weighing schemes on the final segmentation result. First, we trained a U-net architecture (without targeted prediction) with equal weights for the two classes. In other words, each misclassification, whether false positive or false negative, was weighted equally. Table 7 shows the results of this equal weighing on our different datasets (we also show the class-weighted results for each dataset for comparison). Equal weights yielded better precision but worst recall. Due to the roughly 9 to 1 class imbalance between background and vessel pixels, a network trained with equal weights will tend to label ambiguous pixels as background. We verified this trade-off further by systematically varying the class weights for the DRIVE dataset and recording the resulting precision and recall values. Table 8

lists the results when training with background-to-vessel weights of {1,5}, {1,1} (equal weights), {5,1}, and {10,1}. For comparison, the class-based weights were roughly 1 to 10. We also tested a U-net with dynamic weights, but with a range of weights spanning only 10 to 1 (our main experiments used a range of 100 to 1). In this case, the dynamic weights has less variance across training epochs.

Interestingly, the best fixed-weight result for the DRIVE dataset used equal weights (this result was not consistent across datasets, though, as Table 7 shows). Overall, we can see that the trade-off of precision vs. recall is driven by the ratio of class weights. Also, the dynamic weights with a 10-to-1 range did not balance the precision vs. recall as much as the 100-to-1 weights, so their results more closely resembled equal weights. We observed a similar pattern for the other datasets (we omitted these results for conciseness). Skewed weights tend to favor either precision or recall and reduce the other score accordingly. By effectively balancing these two scores, our dynamic weights achieve the best result.

6 Conclusion

An accurate segmentation of the vascular structure is crucial for retinal disease diagnosis. However, this task is not easy because the quality and features of a retinal image depend on many factors, including the imaging device, lighting conditions, and an individual’s anatomy. Thus, trying to enforce a single, static approach for all retinal images is suboptimal. Instead, by separating retinal vessel segmentation into sub-tasks, we have more degrees of freedom with which to adapt our processing to the current image. Furthermore, dynamic weights allow a network to learn to classify all types of vessels, regardless of their colour or intensity, which in turn generates a likelihood map with clear distinctions between vessel, background, and noise. Our targeted prediction step then uses this likelihood map to better classify ambiguous pixels. Our technique gives better results than state-of-the-art techniques, many of which are much complex and less intuitive. Instead of optimizing a single, black box, our approach breaks down the problem into more manageable steps and makes these steps more robust by using dynamic weights. In future work, we plan to apply our pipeline to other medical imaging domains, including CT and MRI scans. We also intend to investigate the theoretical properties of our stochastic weights further.


The authors declare that there are no conflicts of interest related to this article.


  • [1] M. Y. Claudia Kondermann, Daniel Kondermann, “Blood vessel classification into arteries and veins in retinal images,” Proc. SPIE 6512 (2007).
  • [2] E. Rochtchina, G. Burlutsky, G. Liew, P. Mitchell, J. J. Wang, B. E. Klein, M. D. Knudtson, R. Klein, and T. Y. Wong, “Retinal vessel diameter and cardiovascular mortality: pooled data analysis from two older populations,” European Heart Journal 28, 1984–1992 (2007).
  • [3] M. K. Ikram, F. J. de Jong, J. R. Vingerling, J. C. M. Witteman, A. Hofman, M. M. B. Breteler, and P. T. V. M. de Jong, “Are Retinal Arteriolar or Venular Diameters Associated with Markers for Cardiovascular Disorders? The Rotterdam Study,” Investigative Ophthalmology & Visual Science 45, 2129–2134 (2004).
  • [4] Y. Lu, L. Serpas, P. Genter, C. Mehranbod, D. Campa, and E. Ipp, “Disparities in diabetic retinopathy screening rates within minority populations: Differences in reported screening rates among african american and hispanic patients,” Diabetes Care 39, e31–e32 (2016).
  • [5]

    D. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, “Deep neural networks segment neuronal membranes in electron microscopy images,” in

    Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds. (Curran Associates, Inc., 2012), pp. 2843–2851.
  • [6] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, (2015).
  • [7] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation,” CoRR abs/1802.06955 (2018).
  • [8] J. Staal, M. Abramoff, M. Niemeijer, M. Viergever, and B. van Ginneken, “Ridge based vessel segmentation in color images of the retina,” IEEE Transactions on Medical Imaging 23, 501–509 (2004).
  • [9] A. D. Hoover, V. Kouznetsova, and M. Goldbaum, “Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response,” IEEE Transactions on Medical Imaging 19, 203–210 (2000).
  • [10] M. M. Fraz, P. Remagnino, A. Hoppe, B. Uyyanonvara, A. R. Rudnicka, C. G. Owen, and S. A. Barman, “An ensemble classification-based approach applied to retinal blood vessel segmentation,” IEEE Transactions on Biomedical Engineering 59, 2538–2548 (2012).
  • [11] R. Estrada, C. Tomasi, S. C. Schmidler, and S. Farsiu, “Tree topology estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 1688–1701 (2015).
  • [12] R. Estrada, C. Tomasi, M. T. Cabrera, D. K. Wallace, S. F. Freedman, and S. Farsiu, “Exploratory dijkstra forest based automatic vessel segmentation: applications in video indirect ophthalmoscopy (vio),” Biomed Opt Express 3, 327–339 (2012). 22312585[pmid].
  • [13] S. Chaudhuri, S. Chatterjee, N. Katz, M. Nelson, and M. Goldbaum, “Detection of blood vessels in retinal images using two-dimensional matched filters,” IEEE Transactions on Medical Imaging 8, 263–269 (1989).
  • [14]

    G. Läthén, J. Jonasson, and M. Borga, “Blood vessel segmentation using multi-scale quadrature filtering,” Pattern Recognition Letters

    31, 762 – 767 (2010). Award winning papers from the 19th International Conference on Pattern Recognition (ICPR).
  • [15] Z. Yavuz and C. Köse, “Retinal blood vessel segmentation using gabor filter and top-hat transform,” in 2011 IEEE 19th Signal Processing and Communications Applications Conference (SIU), (2011), pp. 546–549.
  • [16] F. Farokhian and H. Demirel, “Blood vessels detection and segmentation in retina using gabor filters,” in 2013 High Capacity Optical Networks and Emerging/Enabling Technologies, (2013), pp. 104–108.
  • [17] D. Kaur and Y. Kaur, “Various image segmentation techniques : A review,” (2014).
  • [18] M. E. Martínez-Pérez, A. D. Hughes, A. V. Stanton, S. A. Thom, A. A. Bharath, and K. H. Parker, “Retinal blood vessel segmentation by means of scale-space analysis and region growing,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI’99, C. Taylor and A. Colchester, eds. (Springer Berlin Heidelberg, Berlin, Heidelberg, 1999), pp. 90–97.
  • [19] B. S. Y. Lam, Y. Gao, and A. W. Liew, “General retinal vessel segmentation using regularization-based multiconcavity modeling,” IEEE Transactions on Medical Imaging 29, 1369–1381 (2010).
  • [20]

    F. Benmansour and L. D. Cohen, “Tubular structure segmentation based on minimal path method and anisotropic enhancement,” International Journal of Computer Vision

    92, 192–210 (2011).
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds. (Curran Associates, Inc., 2012), pp. 1097–1105.
  • [22] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation,” CoRR abs/1802.06955 (2018).
  • [23] Y. Jiang and N. Tan, “Retinal vessel segmentation based on conditional deep convolutional generative adversarial networks,” CoRR abs/1805.04224 (2018).
  • [24] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,” CoRR abs/1703.06870 (2017).
  • [25] A. Dasgupta and S. Singh, “A fully convolutional neural network based structured prediction approach towards the retinal vessel segmentation,” in 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), (2017), pp. 248–251.
  • [26] D. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, “Deep neural networks segment neuronal membranes in electron microscopy images,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds. (Curran Associates, Inc., 2012), pp. 2843–2851.
  • [27] J. Zhuang, “Laddernet: Multi-path networks based on u-net for medical image segmentation,” CoRR abs/1810.07810 (2018).
  • [28] Q. Jin, Z. Meng, T. D. Pham, Q. Chen, L. Wei, and R. Su, “DUNet: A deformable network for retinal vessel segmentation,” arXiv e-prints arXiv:1811.01206 (2018).
  • [29] R. Estrada, M. J. Allingham, P. S. Mettu, S. W. Cousins, C. Tomasi, and S. Farsiu, “Retinal artery-vein classification via topology estimation,” IEEE Transactions on Medical Imaging 34, 2518–2534 (2015).
  • [30] R. Estrada, C. Tomasi, M. T. Cabrera, D. K. Wallace, S. F. Freedman, and S. Farsiu, “Enhanced video indirect ophthalmoscopy (vio) via robust mosaicing,” Biomed. Opt. Express 2, 2871–2887 (2011).
  • [31] K. Zuiderveld, “Graphics gems iv,” (Academic Press Professional, Inc., San Diego, CA, USA, 1994), chap. Contrast Limited Adaptive Histogram Equalization, pp. 474–485.
  • [32] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” arXiv e-prints arXiv:1412.6980 (2014).
  • [33] G. Forman and M. Scholz, “Apples-to-apples in cross-validation studies: Pitfalls in classifier performance measurement,” SIGKDD Explor. Newsl. 12, 49–57 (2010).