Learn to Segment Retinal Lesions and Beyond

12/25/2019 ∙ by Qijie Wei, et al.

Towards automated retinal screening, this paper makes an endeavor to simultaneously achieve pixel-level retinal lesion segmentation and image-level disease classification. Such a multi-task approach is crucial for accurate and clinically interpretable disease diagnosis. Prior art is insufficient due to three challenges, that is, lesions lacking objective boundaries, clinical importance of lesions irrelevant to their size, and the lack of one-to-one correspondence between lesion and disease classes. This paper attacks the three challenges in the context of diabetic retinopathy (DR) grading. We propose L-Net, a new variant of fully convolutional networks, with its expansive path re-designed to tackle the first challenge. A dual loss that leverages both semantic segmentation and image classification losses is devised to resolve the second challenge. We propose Side-Attention Net (SiAN) as our multi-task framework. Harnessing L-Net as a side-attention branch, SiAN simultaneously improves DR grading and interprets the decision with lesion maps. A set of 12K fundus images is manually segmented by 45 ophthalmologists for 8 DR-related lesions, resulting in 290K manual segments in total. Extensive experiments on this large-scale dataset show that our proposed approach surpasses the prior art for multiple tasks including lesion segmentation, lesion classification and DR grading.




1 Introduction

Figure 1: Fundus versus natural images in a multi-task setting. Unlike natural images where segmentation results can be easily converted to image labels, the fact that retinal lesions and diseases lack one-to-one correspondence makes it nontrivial to exploit lesion segmentation for disease prediction.

The goal of this paper is to simultaneously resolve two computer vision tasks, i.e., semantic segmentation and image classification, in the context of color fundus images. Given the increasing demand for retinal screening and the clear shortage of experienced ophthalmologists, fundus-image-based retinal disease diagnosis is crucial for the well-being of many [referable, miccai19-multidis, miccai19-amd]. Previous studies on fundus image segmentation concentrate on anatomical structures in the retina, including the optic disc / cup and vessels [tmi18-mnet, tmi19-cenet, cvprw19-m2u-net]. By contrast, this paper aims for retinal lesions, which are symptoms of ocular fundus diseases manifested in color fundus images. By answering the question of what lesions are in a fundus image and where in the image they are located, lesion segmentation has the potential to enable clinical interpretability of disease classes predicted at the image level. Attacking lesion segmentation and retinal disease classification in a unified framework is thus valuable.

Note that for natural images, as in PASCAL VOC-like tasks [wildcat, Ge_2018_CVPR, Zhou_2018_CVPR], the semantic segmentation task and the image classification task typically share the same class vocabulary. Consequently, developing a multi-task approach seems to be relatively straightforward, e.g., by converting classes predicted at the pixel level to the image level by max or mean pooling, as illustrated in Fig. 1 (a). For fundus images, however, lesion labels and disease classes are distinct and lack one-to-one correspondence. See, for instance, the lesions used in the clinical practice guidelines for diabetic retinopathy (DR) grading in Table 1. (Diabetic retinopathy is a complication of diabetes mellitus caused by damage to blood vessels of the light-sensitive tissue at the retina [dc17-dr].) This means lesion segmentation cannot be directly converted to image-level DR grades. Hence, a unified framework that effectively segments lesions and exploits the segmentation for accurate disease classification is in demand.

Grade | Sufficient lesion evidence | Indirect lesion evidence
DR1 | Microaneurysm (MA), exclusively ✓ |
DR2 | Intraretinal hemorrhage (iHE) ✓ | Hard exudate (HaEx) ✓
DR3 | Any of the following: over 20 iHEs in each of 4 quadrants; venous beading in 2+ quadrants; IrMA in 1+ quadrants | Cotton-wool spot (CWS) ✓
DR4 | Any of the following: neovascularization (NV) ✓; vitreous hemorrhage (vHE) ✓; preretinal hemorrhage (pHE) ✓ | Fibrous proliferation (FiP) ✓
Table 1: A summary of the American Academy of Ophthalmology (AAO) preferred practice pattern guidelines for diabetic retinopathy grading. As venous beading and IrMA are very difficult to recognize, even for ophthalmologists, and occur rarely, we exclude them from this study. The eight lesions studied in this work are indicated by ✓.
Figure 2: Retinal lesions with manual / automated segmentations. For better viewing, we show only one lesion per image. For lesions with relatively clear boundaries, i.e., microaneurysm (MA), intraretinal hemorrhage (iHE), hard exudate (HaEx), and cotton-wool spot (CWS), the proposed L-Net-16s is on par with DeepLabv3+. As for lesions lacking clear boundaries, i.e., vitreous hemorrhage (vHE), preretinal hemorrhage (pHE), neovascularization (NV) and fibrous proliferation (FiP), L-Net-16s achieves better segmentation. Best viewed on screen.

Given a fundus image, instances of a specific lesion class occupy a specific region or multiple regions with diverse visual appearance, see Fig. 2. With the advent of fully convolutional networks (FCN) [fcn], exciting progress has been made in semantic segmentation, especially for natural scenes [dialated, segnet, deeplabv3p, cvpr19-danet, iccv19-ccnet]. However, directly applying the state-of-the-art to retinal lesion segmentation is problematic. Unlike objects in natural images, retinal lesions lack clear boundaries against the background. It is practically impossible for ophthalmologists to segment lesions with consistent precision, meaning an FCN has to learn from annotations with imprecise boundaries. Meanwhile, for diagnosis, it is mostly the presence and location of specific lesions that matter, see Table 1. Extremely precise segmentation is not only difficult to achieve but also unnecessary from a clinical view. We thus hypothesize that cutting-edge FCNs, e.g., DeepLabv3+ [deeplabv3p], are over-designed for lesion segmentation.

Moreover, while the importance of an object in a natural image is largely reflected by its size [cvpr12-berg-import], the importance of a lesion in a fundus image does not depend on the number of pixels it occupies. Diabetic retinopathy, depending on what lesions are present, is categorized into five levels, from DR0 (i.e., no DR) to DR4 (i.e., proliferative DR). The presence of a preretinal hemorrhage, even a relatively small one, means DR4. Such a property cannot be well addressed by current segmentation losses including cross entropy [fcn, unet, cvpr18-seg-every], focal loss [iccv17-focal, Yu_2018_CVPR], and Dice [vnet, tmi18-mnet].

To address the aforementioned challenges, this paper makes the first endeavor to solve retinal lesion segmentation and disease classification in a joint and end-to-end framework. We choose DR, a leading cause of blindness [dc17-dr], as our target disease. Our main contributions are as follows:

  • We study eight lesions including microaneurysm (MA), intraretinal hemorrhage (iHE), hard exudate (HaEx), cotton-wool spot (CWS), vitreous hemorrhage (vHE), preretinal hemorrhage (pHE), neovascularization (NV), and fibrous proliferation (FiP) that support the full range of DR grades. This is a new state-of-the-art in terms of quantity, complexity and clinical usability.

  • We propose L-Net for retinal lesion segmentation. While inheriting FCN’s classical contracting-and-expansive structure, L-Net has a re-designed expansive path whose length is adjustable and whose upsampling operation is lightweight yet trainable. We devise a novel dual loss that combines both semantic segmentation and image classification losses. These two designs enable L-Net to effectively learn from lesion annotations with imprecise boundaries and to substantially reduce false alarms of small-size lesions.

  • We propose Side-Attention Net (SiAN) that effectively harnesses lesion segmentation maps, as side information, for improving DR grading. Such an attention mechanism conceptually differs from prevalent self-attention mechanisms [nips17-atten, abn, cvpr19-danet]. Once trained, SiAN performs three tasks, i.e., lesion segmentation, lesion classification and DR grading, all in one forward pass.

  • We conduct extensive experiments on 12K color fundus images collected from Kaggle [kaggle] and local hospitals. With 290K expert-labeled pixel-level lesion segments, the dataset is the largest of its kind. The experiments confirm the superiority of both L-Net and SiAN against the prior art including FCN [fcn], U-Net [unet], DANet [cvpr19-danet], and DeepLabv3+ [deeplabv3p] for lesion segmentation, Inception-v3 [inception-v3] and ABN [abn] for DR grading. To promote related research, the Kaggle part of our test data, containing 1,593 images and 34,268 expert-labeled lesion segments, will be released.

2 Related Work

Models for semantic segmentation. Since Long et al.[fcn], FCNs have been the de facto standard technique for semantic segmentation. An FCN can be conceptually decomposed into a contracting path and an expansive path. The contracting path progressively extracts and downsamples feature maps from an input image. The expansive path, by transforming and upsampling, produces a full-resolution segmentation map of the same size as the input image. Towards more precise segmentation, novel designs are continuously proposed either in the contracting path, or in the expansive path or in both. For instance, dilated convolutions are introduced in [dialated], so the contracting path can produce feature maps with higher resolutions to preserve more detailed spatial information. In U-Net [unet], the contracting path and the expansive path are carefully designed to be symmetrical. Skip connections from the contracting path to the expansive path are added, again for the purpose of preserving spatial information to generate more accurate segmentation boundaries. In order to capture long-range contextual information in both spatial and channel dimensions, DANet [cvpr19-danet] introduces a position attention module and a channel attention module in the expansive path. The state-of-the-art DeepLabv3+ uses both dilated convolutions and spatial pyramid pooling in its contracting path [deeplabv3p]. Its expansive path uses multiple skip connections to exploit features from lower levels. As identifying the precise boundary of a retinal lesion is secondary to the practical use of lesion segmentation, a new FCN is required.

Losses for semantic segmentation. While standard cross entropy remains popular [fcn, unet, cvpr18-seg-every], new losses are being invented, mainly for addressing class imbalance. Exemplars are weighted cross entropy [gdice], Focal Loss [iccv17-focal], and Dice [vnet], all computed at the pixel-level. A multi-scale Dice is used in [tmi18-mnet], where an image is resized to multiple scales and pixel-level Dice losses computed at each scale are combined. Meanwhile, image classification losses are used for weakly supervised semantic segmentation [Ge_2018_CVPR, wildcat, cvpr16-cam, Wang_2018_CVPR], where only image-level annotations are given. To the best of our knowledge, the joint use of both semantic segmentation and image classification losses remains new for fully supervised semantic segmentation.

Retinal lesion segmentation. While earlier works for retinal lesion segmentation use traditional image processing techniques [fleming2010role, niemeijer2007automated], current works mostly take a patch-based deep learning approach [tmi16-hemorrhage-detection, ma-he-exudates, high-risk-lesions, miccai18-lesion-detction]. In [tmi16-hemorrhage-detection], for instance, a customized CNN is used to segment iHE by patch classification. Similarly, in [high-risk-lesions], a patch-trained CNN is applied in a sliding-window manner, classifying every grid cell in a test image into five classes, i.e., normal, MA, iHE, HaEx and high-risk lesion. Because they predict only whether a given patch contains a specific lesion, segmentation maps obtained by the above works tend to be sparse and imprecise. A more fundamental drawback is that the approach lacks a holistic view. Consider MA and iHE for instance. The two lesions are visually close, as both are small lesions that look like dark dots. However, MA occurs around vessels. Also, an image with no other lesion is more likely to have MA than iHE. For a model looking only at local areas, modeling these kinds of holistic clues is difficult.

Lesion-enhanced DR grading. While the above works consider lesion segmentation as a standalone task, some initial efforts have been made towards lesion-enhanced DR grading. A two-step method is developed in [lesion-dr1], where an input image is first converted to a weight map by using a CNN to classify all patches of the image as normal, MA or iHE. Then the image, multiplied by the weight map, is fed into a DR grading network. A lesion-guided attention mechanism is introduced in [lesion-dr2] to weigh specific regions in the input image. Three lesions are considered, i.e., MA, iHE and HaEx. Neither of these works considers severe lesions such as pHE, vHE, and NV. In [zoom-in-net, aaai19-pathlogoical], activation maps in a DR detection network are exploited to localize potential lesions. As their goal is to visualize pathological evidence used by the network, what lesions and to what extent they can be segmented are not quantitatively evaluated.

Attention-enhanced image classification. The state-of-the-art is Attention Branch Network (ABN) [abn], which extends a response-based visual explanation model [cvpr16-cam] by introducing an attention branch into a specific CNN. Consequently, ABN not only improves image classification but also produces an attention map to interpret the decision. Note that the attention is self-generated. Our attention mechanism exploits the output of the semantic segmentation network as side information, and thus conceptually differs from ABN.

3 Approach

Given a color fundus image, we aim to perform lesion segmentation, lesion classification and subsequently DR grading in a unified framework. We use $x$ to denote a specific image, which contains an array of $h \times w$ pixels $\{x_{i,j}\}$. Let $\{l_1, \ldots, l_k\}$ be the $k$ lesions in consideration. Regions of distinct lesions, e.g., HaEx and iHE, often overlap partially, meaning a pixel can be assigned multiple labels. So the goal of lesion segmentation is to automatically assign to each pixel a $k$-dimensional probabilistic vector, $p_{i,j} = (p_{i,j,1}, \ldots, p_{i,j,k})$, where $p_{i,j,l}$ indicates the probability of the pixel belonging to the $l$-th lesion. Lesion classification is to predict lesions at the image level. Given the probabilistic segmentation map $\{p_{i,j}\}$, the probability of the presence of a specific lesion $l$, denoted as $p_l$, is naturally obtained by global max pooling on the map,

$p_l = \max_{i,j} p_{i,j,l}.$

For both lesion segmentation and classification, hard labels are obtained by thresholding the predicted probabilities. As for DR grading, the goal is to exclusively assign one of the following labels, i.e., {DR0, DR1, DR2, DR3, DR4}, to the given image.
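The step from segmentation maps to image-level lesion labels can be sketched in a few lines of plain Python. This is only an illustration: the function names are hypothetical, the maps are toy data, and the 0.5 cutoff is an assumption, since the threshold value is not given in this text.

```python
from typing import List

ProbMap = List[List[float]]  # one h x w probability map for a single lesion


def lesion_presence(seg_maps: List[ProbMap]) -> List[float]:
    """Global max pooling: one image-level probability per lesion map."""
    return [max(max(row) for row in lesion_map) for lesion_map in seg_maps]


def hard_labels(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Binarize probabilities. The 0.5 default is an assumed cutoff."""
    return [1 if p >= threshold else 0 for p in probs]


# Two hypothetical lesion maps on a 2 x 3 image.
maps = [
    [[0.1, 0.2, 0.1], [0.0, 0.9, 0.3]],  # a single confident pixel
    [[0.1, 0.0, 0.2], [0.3, 0.1, 0.0]],  # no pixel above the cutoff
]
probs = lesion_presence(maps)
labels = hard_labels(probs)
```

Because the maximum is taken over all pixels, a single confident pixel is enough to flag a lesion as present, regardless of how large the lesion is.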

Next, we depict the proposed lesion segmentation network, followed by the multi-task SiAN.

3.1 L-Net for Retinal Lesion Segmentation

Our variant of FCN, as Fig. 3 shows, has a shorter expansive path, making the overall shape resemble the letter L, which is also the initial of the word lesion. We thus term the new segmentation network L-Net.

Figure 3: Conceptual diagram of the proposed multi-task Side-Attention Network (SiAN) for (1) lesion segmentation, (2) lesion classification and (3) DR grading. Given a color fundus image, L-Net (the lower branch) generates probabilistic segmentation maps for eight lesions. Lesion classification is accomplished by global max pooling on the maps. For lesion-enhanced DR grading, a side-attention branch is used to fuse the segmentation maps with an array of 2,048 feature maps from Inception-v3 in the upper branch. Compared with directly weighing the feature maps with the segmentation maps, the trainable side-attention is more effective.
Figure 4: Variants of L-Net. The variable-length expansive path enables learning from lesion annotations with imprecise boundaries.

Network architecture. For the contracting path of L-Net, we use convolutional blocks of Inception-v3 [inception-v3] for its outstanding feature extraction ability. Note that other state-of-the-art CNNs [resnet, densenet, icml19-efficientnet] can, in principle, be used.

Our task-specific design lies in the expansive path, where we leverage the effectiveness of U-Net [unet] for re-using information from the contracting path and the flexibility of the original FCN [fcn] in cutting off the expansive path for preventing over-precise segmentation.

Concretely, in order to re-use feature maps from the contracting path, we adopt U-Net’s copy-and-merge strategy instead of the addition operation used in FCN. For upsampling, we replace U-Net’s deconvolution by a convolution that adjusts the number of feature maps, followed by a parameter-free bilinear interpolation that enlarges the feature maps. Such a tactic reduces the number of parameters. Moreover, by applying an element-wise sigmoid activation, the output of the convolution is naturally transformed to probabilistic maps with respect to the lesions.

The fact that retinal lesions lack accurate boundaries makes it unnecessary to seek very precise segmentation. While the symmetry between the contracting and expansive paths in U-Net is useful in its original context of cell segmentation, we argue that such a constraint is unnecessary for the current task. In fact, the extra parameters introduced by the symmetry into the expansive path increase the difficulty of training the network. Therefore, we make the length of L-Net’s expansive path adjustable. If the expansive path is cut at an early stage, where the feature maps are 1/32 of the input size in each dimension, the maps need to be upsampled by a factor of 32 to produce the final segmentation maps. Following the convention of [fcn], we term this variant L-Net-32s. By contrast, L-Net-2s exploits all the intermediate feature maps. The models that fall in between are L-Net-16s, L-Net-8s and L-Net-4s. Fig. 4 shows L-Net with distinct expansive paths.

Loss function. Training L-Net is nontrivial due to the following two issues. First, while the area of a specific lesion varies, the importance of the lesion does not depend on its size. This property cannot be well reflected in a pixel-wise loss, to which a smaller blob contributes less. As exemplified in Table 2, misclassifying a small blob does not lead to a significant increase in the segmentation loss, and is thus difficult to correct during training. Such a small misclassification, even though negligible from the viewpoint of semantic segmentation, can be crucial for proper diagnosis of related diseases. Second, the data is extremely imbalanced, making commonly used loss functions such as cross entropy ineffective. Our study on a set of 12K expert-labeled fundus images shows that pixels of lesions account for less than 1%. By contrast, for PASCAL VOC2012 [ijcv15-voc], a popular benchmark set for natural image segmentation, the proportion of pixels corresponding to objects is about 25%. We find in preliminary experiments that with the cross-entropy loss, the lesion segmentation model easily gets trapped in a local optimum, predicting all pixels as negative, albeit with a very low training loss.

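Ground-truth | Prediction | Segmentation loss | Classification loss | Dual loss
(image) | (image) | 0.0736 | 0.0 | 0.0589
(image) | (image) | 0.0797 | 0.3325 | 0.1303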
Table 2: Behavior of different losses. With the image classification loss added, the proposed dual loss is more sensitive to small-sized errors (bright spots in the second row).
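A quick arithmetic check helps in reading Table 2. Assuming the three numbers in each row are the segmentation loss, the classification loss and the dual loss (the column headers did not survive in this text, so this reading is an assumption), the values are consistent with a weighted sum of 0.8 times the segmentation loss plus 0.2 times the classification loss. The weights below are inferred from the table, not quoted from the paper.

```python
# Rows of Table 2, read as (segmentation loss, classification loss, dual loss).
# This column reading is an assumption; the header row is missing in the source.
rows = [
    (0.0736, 0.0, 0.0589),
    (0.0797, 0.3325, 0.1303),
]

# Weights inferred from the numbers above, not stated in the text.
W_SEG, W_CLF = 0.8, 0.2


def weighted_sum(seg: float, clf: float) -> float:
    """Combine the two sub-losses with the inferred weights."""
    return W_SEG * seg + W_CLF * clf


# Absolute gap between the reconstructed and the tabulated dual loss.
residuals = [abs(weighted_sum(seg, clf) - dual) for seg, clf, dual in rows]
```

Both residuals stay below the rounding precision of the table. In the second row the segmentation loss barely moves (0.0736 to 0.0797) while the dual loss more than doubles, which is the sensitivity to small errors that the caption describes.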

To jointly address the two issues, we introduce a new dual loss that combines a semantic segmentation loss and an image classification loss, with a hyper-parameter weight that strikes a balance between the two sub-losses. In principle, the segmentation term can be instantiated with any existing segmentation loss, and the classification term with any classification loss. We adopt the Dice loss, previously used for segmenting prostate MRI [vnet] and optic disc / cup [tmi18-mnet]. Our ablation study in Section 4.2.3 shows that Dice is more effective than Weighted Cross Entropy [gdice] and Focal Loss [iccv17-focal]. The weight is set empirically based on a held-out validation set, a common practice for selecting hyper-parameters.

Given a mini-batch of images, we compute the Dice version of as


where is ground truth of the -th pixel with respect to the -th lesion. In the extreme case where all pixels are predicted as negative, the dice loss is close to .

We compute the Dice version of as


where is the ground-truth label indicating whether the -th lesion is present in the -th image in the given batch. Recall that both and are obtained by global max pooling on the pixel-level labels, so , and accordingly are all invariant to the lesion size.
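The idea of pairing a pixel-level Dice term with an image-level Dice term can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact Dice variant, the smoothing constant `eps`, the form of the weighting and the default `lam = 0.2` are all assumptions, and the maps are toy data.

```python
from typing import List

Map = List[List[float]]  # h x w values for one lesion in one image


def dice_loss(pred: List[float], target: List[float], eps: float = 1e-6) -> float:
    """1 - Dice coefficient over flat lists (soft Dice, an assumed variant)."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)


def flatten(m: Map) -> List[float]:
    return [v for row in m for v in row]


def dual_loss(preds: List[Map], targets: List[Map], lam: float = 0.2) -> float:
    """Weighted sum of pixel-level and image-level Dice (weighting is assumed)."""
    # Pixel level: every pixel of every map contributes.
    seg = dice_loss([v for m in preds for v in flatten(m)],
                    [v for m in targets for v in flatten(m)])
    # Image level: global max pooling makes each map count once, whatever its size.
    clf = dice_loss([max(flatten(m)) for m in preds],
                    [max(flatten(m)) for m in targets])
    return (1.0 - lam) * seg + lam * clf


# One large true lesion and one lesion that is absent from the image.
targets = [[[1.0] * 4, [1.0] * 4], [[0.0] * 4, [0.0] * 4]]
perfect = [[[1.0] * 4, [1.0] * 4], [[0.0] * 4, [0.0] * 4]]
# Same prediction plus a single false-positive pixel in the second map.
blobbed = [[[1.0] * 4, [1.0] * 4], [[0.0, 0.0, 0.0, 1.0], [0.0] * 4]]

seg_only = dual_loss(blobbed, targets, lam=0.0)  # pixel-level term alone
clf_only = dual_loss(blobbed, targets, lam=1.0)  # image-level term alone
both = dual_loss(blobbed, targets)
```

In this toy case the single false-positive pixel raises the pixel-level term to about 0.06 but the image-level term to about 0.33, so the combined loss penalizes the small error much more strongly than the pixel-level Dice alone.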

3.2 SiAN for Lesion-enhanced DR Grading

To predict DR grades, we choose Inception-v3 [inception-v3] as our baseline model. This model has established the state-of-the-art for predicting referable DR [referable], age-related eye diseases [amd-deep] and other retinal abnormalities [miccai19-multidis]. In fact, for all CNN models evaluated in this work, we use Inception-v3 as their backbone for fair comparison. To produce a probabilistic score per DR grade, we slightly modify Inception-v3 by adding, after the global average pooling (GAP) layer, a fully connected layer with one output per DR grade, followed by a softmax layer. Whereas previous works use a typical, smaller input resolution [referable], we use a much larger one, making our Inception-v3 a stronger baseline.

For lesion-enhanced DR grading, we propose Side-Attention Net (SiAN), with its overall architecture shown in Fig. 3. SiAN consists of two branches. At the top is its main branch, with Inception-v3 as the backbone, that performs DR grading. The side-attention branch, with L-Net as its backbone, is responsible for injecting semantic and spatial information contained in the lesion segmentation maps into the main branch. In particular, the injection is performed at the last feature maps, denoted as $\{F_1, \ldots, F_n\}$, in the main branch, with $n = 2{,}048$. To that end, the side-attention branch shall generate the same number of weight maps, denoted as $\{W_1, \ldots, W_n\}$. Multiplying the feature maps by the weight maps side by side generates new weighted feature maps as

$F'_i = F_i \odot W_i, \quad i = 1, \ldots, n,$

where $\odot$ indicates element-wise multiplication. The new feature maps then go through a GAP layer, followed by a classification block. It is worth pointing out that the weight information is essentially from the side-attention branch rather than generated by the main branch itself. Hence, SiAN is conceptually different from self-attention networks [nips17-atten, abn, cvpr19-danet].

To convert the lesion segmentation maps into the weight maps, an intuitive strategy is to let the main branch pay attention to regions with maximal lesion response. This is achieved by channel-wise max pooling (CW-MaxPool) over the segmentation maps, which are already down-sampled to the same size as the feature maps in the main branch.

However, a region deemed to be negative with respect to the lesions is not necessarily useless for DR grading. So we further consider a learning-based strategy, using a lightweight convolutional block consisting of two convolutional layers.
We train SiAN with the cross-entropy loss, commonly used for multi-class image classification.
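The CW-MaxPool variant of side-attention can be illustrated with nested lists. This is a sketch under assumptions: the helper names are hypothetical, the data is a toy example, and sharing one pooled weight map across all feature channels is an assumption about how the pooled map is replicated. The learned two-layer convolutional variant is not shown.

```python
from typing import List

Map = List[List[float]]  # one h x w map


def cw_max_pool(seg_maps: List[Map]) -> Map:
    """Channel-wise max: per-location maximum over all lesion maps."""
    h, w = len(seg_maps[0]), len(seg_maps[0][0])
    return [[max(m[i][j] for m in seg_maps) for j in range(w)] for i in range(h)]


def apply_attention(features: List[Map], weight: Map) -> List[Map]:
    """Element-wise product of every feature map with one shared weight map."""
    return [[[f * wgt for f, wgt in zip(f_row, w_row)]
             for f_row, w_row in zip(fmap, weight)]
            for fmap in features]


def global_avg_pool(features: List[Map]) -> List[float]:
    """One scalar per feature map, ready for a classification layer."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in features]


# Two hypothetical lesion maps, already at the feature-map resolution.
seg_maps = [[[0.0, 1.0], [0.0, 0.0]],
            [[0.0, 0.0], [0.5, 0.0]]]
# Two hypothetical feature maps from the main branch.
features = [[[2.0, 4.0], [6.0, 8.0]],
            [[1.0, 1.0], [1.0, 1.0]]]

weight = cw_max_pool(seg_maps)
pooled = global_avg_pool(apply_attention(features, weight))
```

Locations with no lesion response receive a weight of zero here, which is exactly the limitation that motivates the learning-based alternative described above.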

4 Evaluation

4.1 Experimental Setup

Lesion selection. According to the AAO guidelines [aao2017], there are seven lesions used as sufficient evidence for specific DR grades. Among them, venous beading and IrMA are very difficult to recognize, even for ophthalmologists, and occur rarely. So they are excluded from this study. We include three other lesions, i.e., HaEx, CWS, and FiP, indirectly related to DR grading. The three lesions alone shall not be considered as a sign of DR. However, they become informative once co-occurring with lesions directly related to DR. For instance, HaEx consists of lipids deposited as a consequence of hemorrhages, so images with HaEx and MA are likely to be graded as DR2. When an image shows much DR2-related evidence, the presence of a CWS shall lift the grade from DR2 to DR3. New vessels in the retina may undergo fibrosis and form FiPs. Therefore, a FiP is often considered to be a sign of DR4 when it does not occur alone. We compile a final list of eight lesions, see Table 1.

Ground-truth construction. Public datasets suited for our purpose do not exist. So we construct a large collection of 12,252 color fundus images with both pixel-level lesion annotations and image-level DR grades as follows. We initially collected 23K color fundus images of the posterior pole, consisting of 12K images from our hospital partners and 11K images randomly sampled from the Kaggle DR Detection task [kaggle]. While the images were from patients with diabetes, some of them show other eye diseases such as glaucoma, AMD and RVO. So DR0 does not necessarily mean a healthy eye. Such a characteristic makes the data close to the real scenario and thus challenging.

For expert labeling, a panel of 45 experienced ophthalmologists was formed. We developed a web-based annotation system, where an annotator marks out lesions in a given image using either ellipses or polygons and grades the image accordingly. Lesion annotation and DR grading from a single image are somewhat subjective. So for quality control, each image was assigned to at least three annotators. Images receiving consistent DR grades, i.e., a majority vote for a specific grade, are preserved. Accordingly, per image we cleaned lesion annotations so that they comply with the diagnostic guidelines. Eventually, we obtain 12,252 images with 290K expert-labeled lesion segments. We refer to the supplementary material for more details. We split the dataset at random into three disjoint subsets for training (70%), validation (10%) and test (20%).

Implementations. An input image is sized to , as small lesions can not be seen in lower resolution. We use SGD with a weight decay factor of and a momentum of . The initial learning rate is . Validation occurs every 1K batches. If the validation performance does not improve in 4 consecutive validations, the learning rate will be divided by . Early stop occurs once the performance does not improve in 10 consecutive validations.
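The validation-driven schedule described above can be sketched as a small state machine. The class name is hypothetical, and the initial learning rate and the divisor used in the example are placeholders, since those values are not available in this text. Reducing the rate again after every further four stale validations is also an assumption.

```python
class PlateauSchedule:
    """Divide the learning rate on plateaus and stop after a long plateau."""

    def __init__(self, lr: float, divisor: float,
                 patience: int = 4, stop_patience: int = 10) -> None:
        self.lr = lr
        self.divisor = divisor              # placeholder: value not given here
        self.patience = patience            # 4 consecutive validations
        self.stop_patience = stop_patience  # 10 consecutive validations
        self.best = float("-inf")
        self.stale = 0                      # validations since last improvement
        self.stopped = False

    def step(self, score: float) -> None:
        """Update the state with one validation score (higher is better)."""
        if score > self.best:
            self.best = score
            self.stale = 0
            return
        self.stale += 1
        if self.stale >= self.stop_patience:
            self.stopped = True
        elif self.stale % self.patience == 0:
            self.lr /= self.divisor


# Hypothetical run: two improving validations, then a long plateau.
sched = PlateauSchedule(lr=0.1, divisor=10.0)
validations_run = 0
for score in [0.5, 0.6] + [0.6] * 20:
    sched.step(score)
    validations_run += 1
    if sched.stopped:
        break
```

In this run the rate is divided after the 4th and 8th stale validations, and training stops at the 10th, i.e., after 12 validations in total.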

For training L-Net, we start with . Once the learning rate is reduced, is replaced by . For DR grading, a pre-trained L-Net is used for SiAN. We tried to train both branches, but found no improvement in DR grading yet an absolute decrease of 0.01 in the segmentation performance. So we did not go further in that direction. For the varied CNN models assessed in this paper, we use Inception-v3 pre-trained on ImageNet as their backbones. Random rotation, crop, flip and random changes in brightness, saturation and contrast are used for data augmentation. Training was performed using PyTorch on an NVIDIA Tesla P40 GPU. Once trained, running SiAN takes about 1.5GB of GPU memory.

Our criteria for choosing baseline methods are two-fold: they must be state-of-the-art in related tasks and open-source, allowing us to run them exactly as intended by their developers.

4.2 Experiments

4.2.1 Experiment 1. Lesion Segmentation

Baselines. We compare with four prior methods, i.e., FCN [fcn], U-Net [unet], DeepLabv3+ [deeplabv3p], and DANet [cvpr19-danet]. To let the comparison focus on the expansive paths, for all networks we use Inception-v3 as the backbone of their contracting paths. In addition, as the majority of the existing works utilize a patch-based sliding window approach to detect retinal lesions, we include a patch-based FCN-32s. To train the patch-based model, we uniformly divide each image into patches, each of size 224 × 224. Given a test image, the model is run in a sliding-window manner with a fixed window size and stride, and scores from overlapped areas are averaged. All the baselines are trained with the pixel-level Dice loss.

Performance metric. Pixel-wise F1 score is reported.
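For reference, a pixel-wise F1 score on binary masks can be computed as below. This is a generic sketch: the text does not say whether the score is accumulated per image or over the whole test set, and returning 1.0 when both masks are empty is an assumption.

```python
from typing import List

Mask = List[List[int]]  # binary h x w mask, 1 = lesion pixel


def pixel_f1(pred: Mask, target: Mask) -> float:
    """F1 = 2TP / (2TP + FP + FN), counted over pixels."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return 2 * tp / denom if denom else 1.0


# Toy masks: one hit, one false alarm, one miss.
pred = [[1, 1, 0], [0, 0, 0]]
target = [[1, 0, 0], [0, 1, 0]]
score = pixel_f1(pred, target)
```

True negatives do not enter the formula, so the score is not inflated by the large background area that dominates fundus images.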

Model Mean MA iHE HaEx CWS vHE pHE NV FiP
patch FCN-32s 0.553 0.209 0.583 0.714 0.535 0.622 0.549 0.554 0.659
FCN-32s 0.571 0.327 0.592 0.728 0.528 0.642 0.562 0.530 0.662
FCN-16s 0.587 0.369 0.608 0.737 0.575 0.639 0.515 0.581 0.671
FCN-8s 0.586 0.369 0.609 0.740 0.573 0.640 0.534 0.583 0.639
U-Net 0.570 0.384 0.598 0.730 0.565 0.547 0.604 0.538 0.592
DeepLabv3+ 0.553 0.367 0.612 0.732 0.558 0.550 0.477 0.498 0.631
DANet 0.585 0.351 0.608 0.733 0.560 0.623 0.589 0.543 0.671
L-Net-32s 0.573 0.289 0.590 0.730 0.539 0.632 0.536 0.582 0.687
L-Net-16s 0.591 0.377 0.612 0.740 0.565 0.645 0.590 0.571 0.623
L-Net-8s 0.603 0.377 0.617 0.740 0.575 0.648 0.616 0.580 0.667
L-Net-4s 0.592 0.394 0.614 0.743 0.577 0.633 0.588 0.570 0.616
L-Net-2s 0.581 0.381 0.614 0.744 0.569 0.634 0.569 0.565 0.572
Table 3: Lesion segmentation by different models.

Comparing L-Net with distinct settings. As Table 3 shows, the overall performance of L-Net increases first, from 0.573 (L-Net-32s) to 0.591 (L-Net-16s), and decreases later, from 0.592 (L-Net-4s) to 0.581 (L-Net-2s). The peak is obtained by L-Net-8s, with an F1 of 0.603. The result confirms our hypothesis that when the network parameters keep increasing, the additional layers can have a negative effect on the performance.

Comparing loss functions. We compare Dice [vnet], focal loss [iccv17-focal], and weighted cross-entropy (WCE) [gdice]. As the parameter in the focal loss is dataset-dependent, we set it to according to our validation set, with the parameter set to as suggested in the original paper. As shown in Table 4, Dice and the proposed dual loss outperform focal and WCE with a large margin. Correcting small-sized errors cannot be well reflected by the pixel-wise F1 score. This explains the relatively small difference between Dice and the dual loss for lesion segmentation.

Loss Lesion segmentation Lesion classification
Weighted cross-entropy 0.365 0.533
Focal loss 0.504 0.635
Dice 0.594 0.769
Dual loss (this paper) 0.591 0.801
Table 4: Performance of L-Net-16s with different losses.

Comparing with the baselines. L-Net outperforms the baselines. Patch-based FCN-32s is less effective than its full-resolution counterpart. As noted in Section 2, properly recognizing MA requires a holistic view, which is absent for the patch-based model. This explains its lowest performance (F1 of 0.209) on this lesion. Patch-based FCN-32s also has difficulty in segmenting large lesions such as vHE and pHE. Compared to DeepLabv3+, L-Net shows similar performance on MA, iHE, HaEx and CWS while noticeably better for vHE, pHE, NV and FiP. Comparing the two groups of lesions, the latter lack clear boundaries. The results confirm our hypothesis that DeepLabv3+ is over-designed for this task. In the meantime, the viability of the proposed L-Net for retinal lesion segmentation is justified.

4.2.2 Experiment 2. Lesion Classification

Baselines. We re-use the five baselines from Experiment 1, with lesion classification obtained by global max pooling on segmentations. We also compare with two segmentation-free models, i.e., Inception-v3 [inception-v3] and ABN [abn], both trained using image-level lesion annotations and Dice.

Performance metric. Image-wise F1 is used.

Comparing L-Net with distinct settings. As Table 5 shows, for lesion classification, L-Net with a shorter expansive path, e.g., L-Net-16s and L-Net-32s, is preferred. From Table 4 we see that L-Net trained with the dual loss is the best, suggesting that small misclassified blobs are reduced.

Model Mean MA iHE HaEx CWS vHE pHE NV FiP
patch FCN-32s 0.704 0.886 0.849 0.828 0.720 0.634 0.544 0.637 0.535
FCN-32s 0.769 0.900 0.858 0.856 0.771 0.722 0.683 0.694 0.669
FCN-16s 0.787 0.890 0.849 0.847 0.743 0.758 0.726 0.696 0.783
FCN-8s 0.778 0.891 0.858 0.854 0.749 0.766 0.711 0.671 0.725
U-Net 0.757 0.888 0.855 0.843 0.755 0.639 0.689 0.653 0.737
DeepLabv3+ 0.794 0.899 0.863 0.866 0.764 0.800 0.693 0.677 0.792
DANet 0.775 0.900 0.853 0.852 0.772 0.713 0.682 0.715 0.712
Inception-v3 0.716 0.895 0.893 0.865 0.766 0.500 0.540 0.594 0.678
ABN 0.726 0.900 0.900 0.871 0.761 0.519 0.552 0.627 0.678
L-Net-32s 0.792 0.899 0.881 0.857 0.778 0.720 0.773 0.669 0.762
L-Net-16s 0.801 0.902 0.882 0.866 0.792 0.733 0.726 0.701 0.807
L-Net-8s 0.780 0.900 0.881 0.861 0.771 0.687 0.693 0.711 0.733
L-Net-4s 0.781 0.904 0.883 0.862 0.791 0.678 0.660 0.719 0.748
L-Net-2s 0.787 0.900 0.893 0.868 0.793 0.706 0.706 0.667 0.764
Table 5: Lesion classification by different models.

Comparing with the baselines. Inception-v3 and ABN are less effective than the majority of the segmentation-based models. The results suggest the importance of lesions’ spatial information even for making image-level predictions. Different from its behavior for lesion segmentation, DeepLabv3+ becomes runner-up for lesion classification. For vHE, this model outperforms the others by a large margin. Note that DeepLabv3+ is specifically designed to capture multi-scale information by its parallel dilated convolutions. This design appears to be good at capturing the major pattern of vHE, which often occupies more than half of an image. Overall, L-Net-16s is the best.

4.2.3 Experiment 3. Lesions for DR Grading

Baseline. We again compare with Inception-v3 and ABN, both re-trained for DR grading. One might also consider a more straightforward method that enriches the output of the GAP layer by concatenating the lesion vector, with the size of the fully connected layer adjusted accordingly. Note that similar ideas have been exploited in the context of image captioning for obtaining semantically enhanced image features [wu2016value, tmm2019-cococn]. We term this baseline Lesion-Concat.
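
The Lesion-Concat baseline can be sketched as below. This is a minimal NumPy illustration under stated assumptions: the function name, the argument names and the example dimensions are hypothetical, and the real baseline operates inside a trained Inception-v3 rather than on standalone arrays.

```python
import numpy as np

def lesion_concat_logits(feature_maps, lesion_vector, fc_weight, fc_bias):
    """Concatenate GAP features with a lesion vector, then apply one FC layer.

    feature_maps:  (C, H, W) activations of the last convolutional block
    lesion_vector: (K,) image-level lesion scores
    fc_weight:     (num_grades, C + K); fc_bias: (num_grades,)
    """
    gap = np.asarray(feature_maps, dtype=float).mean(axis=(1, 2))  # (C,)
    enriched = np.concatenate([gap, np.asarray(lesion_vector, dtype=float)])
    return fc_weight @ enriched + fc_bias
```

The only architectural change relative to the plain classifier is the wider input of the fully connected layer; spatial information about where the lesions are is discarded before fusion.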

Performance metric. We report the quadratic weighted kappa, which measures inter-annotator agreement and is used by the Kaggle DR Detection task [kaggle].
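
As a reminder of how this metric is computed, here is a minimal NumPy sketch of the standard quadratic weighted kappa for integer grades. The function name is hypothetical, and the sketch does not handle the degenerate case where the expected disagreement is zero (for example, when all labels and all predictions are the same grade).

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Quadratic weighted kappa for integer labels in [0, num_classes)."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    # Observed confusion matrix
    observed = np.zeros((num_classes, num_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    # Expected counts if labels and predictions were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Quadratic penalty grows with the squared distance between grades
    idx = np.arange(num_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_classes - 1) ** 2
    return 1.0 - np.sum(weights * observed) / np.sum(weights * expected)
```

Unlike plain accuracy, the quadratic weights penalize a prediction that is several grades away from the reference much more heavily than one that is off by a single grade.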

Results. We use L-Net-16s in SiAN for its best overall performance in the previous experiments. As Table 6 shows, using a better lesion segmentation model results in more accurate DR grading, with SiAN (L-Net-16s) as the top performer. While ABN has a relatively close performance, it is single-task. More importantly, we observe that its attention maps lack correspondence to manually marked lesion regions; see Fig. 5 and more examples in the supplementary material.

DR model | Lesion model | Kappa
Inception-v3 () | - | 0.660
Inception-v3 () | - | 0.729
Inception-v3 | - | 0.774
Lesion-Concat | Inception-v3 | 0.780
SiAN | U-Net | 0.780
SiAN | FCN-8s | 0.787
SiAN | DeepLabv3+ | 0.787
Lesion-Concat | L-Net-16s | 0.788
ABN | - | 0.797
SiAN | L-Net-16s | 0.803
Table 6: DR grading. The number after Inception-v3 indicates the input resolution.
Figure 5: Visualizing DR grading results. Compared to ABN’s attention maps, lesion maps by SiAN better match ground truth.

An ablation study concerning the attention strategies is provided in Table 7. Fusion at the feature map level is better than fusion at the input image. The better performance of learned weights supports our statement that a region deemed to be negative with respect to the lesions does not necessarily mean it is useless for DR grading.

Fusion position | Attention weights: CW-MaxPool (Eq. 6) | Attention weights: Conv (Eq. 7)
Input image | 0.768 | 0.768
Feature maps | 0.781 | 0.803
Table 7: Ablation study on attention strategies in SiAN. L-Net-16s is used.
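
The contrast between fixed and learned attention weights can be illustrated with the sketch below. It is an assumption-laden illustration, not a reproduction of Eq. 6 and Eq. 7: the function names are hypothetical, CW-MaxPool is taken to be a per-pixel maximum over the lesion channels, the learned variant is taken to be a 1x1 convolution followed by a sigmoid, fusion is taken to be element-wise multiplication, and the lesion maps are assumed to be already resized to the resolution of the feature maps.

```python
import numpy as np

def cw_maxpool_attention(lesion_maps):
    """(K, H, W) lesion maps -> (H, W) attention via per-pixel channel max."""
    return np.asarray(lesion_maps, dtype=float).max(axis=0)

def conv1x1_attention(lesion_maps, weights, bias):
    """Learned 1x1 convolution over lesion channels, then a sigmoid.

    weights: (K,) one weight per lesion channel; bias: scalar.
    """
    lesion_maps = np.asarray(lesion_maps, dtype=float)
    logits = np.tensordot(weights, lesion_maps, axes=([0], [0])) + bias  # (H, W)
    return 1.0 / (1.0 + np.exp(-logits))

def apply_attention(feature_maps, attention):
    """Multiply every channel of (C, H, W) features by the (H, W) attention."""
    return np.asarray(feature_maps, dtype=float) * attention[None, :, :]
```

Under these assumptions, the max-pooled attention is zero wherever no lesion is predicted, so those regions are suppressed entirely, whereas the learned variant can keep a non-zero weight for lesion-negative regions through its bias and channel weights.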

5 Conclusions

We have developed a multi-task approach to lesion segmentation, lesion classification and disease classification for color fundus images. Extensive experiments justify the superiority of the proposed approach over the prior art. The proposed L-Net, with its re-designed expansive path and the proposed dual loss, is found to be effective for learning from retinal lesion annotations with imprecise boundaries. Exploiting L-Net as a side-attention branch, the multi-task SiAN model simultaneously improves DR grading and interprets the decision with lesion maps.

Although developed on fundus images, our work reveals good practices for developing a semantic segmentation network given training data with imprecise object boundaries and extremely imbalanced classes, and for converting attributes predicted at the pixel level to categories at a higher level. We believe the lessons learned extend beyond this specific domain.