1 Introduction

Footnotes: Corresponding author. This work was funded in part by "The Pearl River Talent Recruitment Program Innovative and Entrepreneurial Teams in 2017" under grant No. 2017ZT07X152 and the Shenzhen Fundamental Research Fund under grants No. KQTD2015033114415450 and No. ZDSYS201707251409055.
Diabetic retinopathy is now a common disease, especially among working-age people, and is considered a main cause of blindness. For clinical examination, ophthalmologists usually use fundus photography to capture the back of the eyeball in a very high-resolution image, on which retinal lesions can be clearly visualized. For example, microaneurysms appear as dark spots while hard exudates appear as abnormally bright regions. Carrying out such diagnosis manually is subjective, time-consuming, and highly dependent on expertise, so it is highly desirable to automate this procedure. To this end, many approaches have been proposed [1, 2], formulating the problem as segmenting lesion regions by pixel-wise binary labeling.
Recently, deep fully convolutional networks (FCNs) have gained much popularity in image segmentation for their ability to learn discriminative pixel-wise features. A sequence of convolution and pooling layers forms an encoder that converts the input image into feature maps, which are then decoded into a segmentation mask with another set of deconvolution layers. Based on this architecture, U-Net introduces skip concatenations between the encoder and decoder layers. This improvement reduces the dependence on large training sets and yields much better performance, and the work has stimulated many variants such as V-Net and SegNet. These deep architectures are so effective that they have been commonly used in various medical image segmentation applications [4, 5].
However, the methods above tend to fail in our setting. The very high-resolution fundus images (usually up to 3500×3500) with small target lesion regions burden computational resources and increase the difficulty of learning. By downsampling the input image and then rescaling the output as the final result, those architectures can be adopted, yet they hardly achieve fine segmentation due to the information lost in downsampling. Many attempts have been made [7, 8, 9] to avoid this: they split the image into patches and conduct patch-level segmentation on each. Although such methods preserve detailed information, they often mislabel lesion regions and produce inconsistencies across patches, as a result of poorly captured global context.
In this paper, a novel network architecture is proposed to overcome the shortcomings of both global-level and patch-level approaches by combining them in a unified learning framework. It consists of two streams: a global stream that performs segmentation on a downsampled version of the input image, producing low-resolution label maps, and a local stream that takes cropped patches as inputs and produces their corresponding segmentation results. The two streams, each using U-Net as the basic component, are integrated by concatenating the output feature maps of the global decoder to the local decoder, and are then jointly optimized. Once the whole network is trained, we conduct segmentation on patches and stitch the outputs together to obtain the final result. This mechanism benefits performance in two aspects: 1) the context features learned in the global stream are passed to the local stream, reducing ambiguities and correcting errors; 2) the losses in the local stream are back-propagated to the global stream, enhancing the learning of context features to further improve the local component. In this way, the local and global nets are mutually enhanced. We tested our approach on a public fundus image dataset, segmenting microaneurysms (MA), soft exudates (SE), hard exudates (EX) and hemorrhages (HE). Our experiments showed that the proposed model significantly outperforms the local-only and global-only nets for MA and EX, whereas for SE and HE the global-only net outperforms all other variants. We found that the global-only net is more suitable when the lesion regions are compact and large, while the proposed network does better when the lesion regions are scattered and small.
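The patch-then-stitch inference described above can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: `predict` is a hypothetical stand-in for the trained local stream, and the image size is assumed divisible by the patch size.

```python
import numpy as np

def stitch_patch_predictions(image, predict, patch=256):
    """Run patch-level inference on non-overlapping tiles and stitch the
    outputs back into a full-resolution probability map.

    Assumes height/width are divisible by `patch` and that `predict` maps
    an (patch, patch, C) array to a (patch, patch) probability map.
    """
    h, w = image.shape[:2]
    out = np.zeros((h, w), dtype=np.float32)
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            out[y:y + patch, x:x + patch] = predict(image[y:y + patch, x:x + patch])
    return out

# Toy stand-in predictor: mean over channels in place of a trained network.
toy = lambda p: p.mean(axis=2)
full = stitch_patch_predictions(np.random.rand(512, 512, 3), toy)
print(full.shape)  # (512, 512)
```

Because the tiles are non-overlapping, each output pixel is written exactly once; overlapping tiles with averaging would be a straightforward extension.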
2 Method

An overview of our proposed algorithm is shown in Fig. 1. There are three components in our model: GlobalNet, LocalNet, and the fusion module. GlobalNet accepts a downsampled version of the image as input and produces a coarse segmentation map of the same size as its input. LocalNet accepts cropped image patches as input and produces segmentation maps at the original resolution. The fusion module crops the feature maps from GlobalNet and concatenates them to LocalNet, so that both global and local information are captured.
2.1 Network Architecture
GlobalNet. We adopt U-Net as the backbone of our GlobalNet. The U-Net consists of an encoder and a decoder. The encoder is a stack of conv-bn-relu-conv-bn-relu basic blocks. Along the downsampling path of the encoder, the height and width of the feature maps are halved while the number of channels doubles. The decoder exactly mirrors the encoder: the spatial size of its feature maps doubles while the number of channels halves. In addition, feature maps with the same spatial size in the encoder and decoder are concatenated. We adopt U-Net with 6 pooling layers for EX and HE, 4 pooling layers for SE, and 3 pooling layers for MA.
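The encoder pattern above can be sketched in PyTorch. The channel widths and depth here are illustrative assumptions, not the paper's exact configuration; the point is the conv-bn-relu-conv-bn-relu block and the halve-spatial/double-channel rule.

```python
import torch
import torch.nn as nn

def basic_block(in_ch, out_ch):
    """The conv-bn-relu-conv-bn-relu basic block used along the encoder."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Each pooling step halves the spatial size while the channels double.
x = torch.randn(1, 3, 64, 64)
skips = []            # feature maps kept for the decoder's skip concatenation
ch, block = 32, basic_block(3, 32)
for _ in range(3):
    x = block(x)
    skips.append(x)
    x = nn.MaxPool2d(2)(x)
    block = basic_block(ch, ch * 2)
    ch *= 2
print(x.shape)  # torch.Size([1, 128, 8, 8])
```

The decoder would mirror this loop with upsampling (or transposed convolutions), concatenating the stored `skips` entry of matching spatial size at each level.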
LocalNet. We also adopt U-Net as the backbone of our LocalNet. Different from GlobalNet, the inputs of LocalNet are image patches with a smaller spatial size. In our experiments, we adopt U-Net with 3 pooling layers for EX and MA, and 6 pooling layers for HE and SE.
Feature Fusion. As shown in Fig. 1, LocalNet and GlobalNet are fused at the end of their decoders. In particular, the feature map at the end of the global decoder, just before the segmentation output, is first taken out. It is then concatenated to the feature map at the end of the local decoder, forming a new feature map. Since GlobalNet takes the downsampled original image as input while LocalNet takes cropped patches, the global feature map is rescaled and cropped before the concatenation to build the correspondence. Finally, two 3×3 and one 1×1 convolution layers transform the fused feature map into the patch segmentation map.
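The crop-rescale-concatenate step can be sketched as below. The channel counts, the intermediate width of the head, and the `box`/`scale` bookkeeping are illustrative assumptions; only the overall structure (crop the global map at the patch location, resize it to the patch resolution, concatenate, then apply two 3×3 and one 1×1 convolutions) follows the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    """Crop-and-concat fusion of global and local decoder features (sketch)."""

    def __init__(self, local_ch, global_ch):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(local_ch + global_ch, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),   # two 3x3 convs then one 1x1 conv
        )

    def forward(self, local_feat, global_feat, box, scale):
        # `box` = (y0, x0, h, w) of the patch in the original image;
        # `scale` = original size / global input size.
        y0, x0, h, w = (int(v / scale) for v in box)
        crop = global_feat[:, :, y0:y0 + h, x0:x0 + w]
        crop = F.interpolate(crop, size=local_feat.shape[-2:],
                             mode='bilinear', align_corners=False)
        return self.head(torch.cat([local_feat, crop], dim=1))

fuse = FusionModule(local_ch=32, global_ch=16)
out = fuse(torch.randn(1, 32, 256, 256), torch.randn(1, 16, 640, 640),
           box=(0, 0, 256, 256), scale=4.4)   # e.g. roughly 2816 / 640
print(out.shape)  # torch.Size([1, 1, 256, 256])
```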
Dataset. The dataset used in this paper is provided by the 2018 ISBI grand challenge on diabetic retinopathy segmentation and grading. We use the dataset of the segmentation sub-challenge, which consists of 81 color fundus images with signs of diabetic retinopathy (DR) and another 164 without signs of DR. We only adopt the images with DR: specifically, 81 images for MA, 81 for EX, 80 for HE, and 40 for SE. Each image with signs of DR may contain more than one abnormality. The dataset was split by the organizers into 54 training samples and 27 testing samples.
The resolution of the original images is 2848×4288, with zero padding on both sides. We first center-crop each image to 2816×3328 to eliminate the zero padding. For GlobalNet, we downsample the image to 640×640, while 256×256 patches are cropped uniformly for LocalNet.
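The crop-and-tile step can be sketched as follows (the downsampling to 640×640 would be done with any standard resize routine and is omitted here to keep the sketch dependency-free). The function name is our own.

```python
import numpy as np

def preprocess(img, crop_hw=(2816, 3328), patch=256):
    """Center-crop away the zero padding, then tile into uniform patches."""
    h, w = img.shape[:2]
    ch, cw = crop_hw
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    img = img[y0:y0 + ch, x0:x0 + cw]
    patches = [img[y:y + patch, x:x + patch]
               for y in range(0, ch, patch)
               for x in range(0, cw, patch)]
    return img, patches

img = np.zeros((2848, 4288, 3), dtype=np.uint8)
cropped, patches = preprocess(img)
print(cropped.shape, len(patches))  # (2816, 3328, 3) 143
```

Note that 2816 = 11 × 256 and 3328 = 13 × 256, so the cropped image tiles exactly into 143 patches with no remainder.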
Before training LocalNet and GlobalNet, the data are augmented by random rotation (up to 359 degrees), zooming, flipping, and adding random noise. When augmenting the data for training the fused net, only rotations of 90, 180, and 270 degrees are used, since the fusion module requires accurate alignment.
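The alignment constraint means image and label must be transformed identically, and only lossless right-angle rotations keep the global-to-local crop correspondence pixel-accurate. A minimal sketch (function name is our own):

```python
import numpy as np

def aligned_rot90(image, label, k):
    """Rotate image and label together by k * 90 degrees.

    Right-angle rotations are exact (no interpolation), so the pixel-wise
    correspondence required by the fusion module is preserved.
    """
    return (np.rot90(image, k, axes=(0, 1)).copy(),
            np.rot90(label, k, axes=(0, 1)).copy())

img = np.arange(12).reshape(3, 4)
lab = (img % 2).astype(np.uint8)
ri, rl = aligned_rot90(img, lab, 1)
print(ri.shape)  # (4, 3)
```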
Loss function. We keep the supervision of GlobalNet when training our fused network. Thus, the loss is defined as follows:

$$\mathcal{L} = \lambda_g \mathcal{L}_{global} + \lambda_l \mathcal{L}_{local} + \lambda_r \mathcal{R},$$

where $\lambda_g$, $\lambda_l$ and $\lambda_r$ are the weights for each part of the loss, and $\mathcal{R}$ is the regularization term, e.g. the $\ell_2$ norm. The definition of $\mathcal{L}_{global}$ is the same as that of $\mathcal{L}_{local}$. To handle the severe class imbalance, we adopt a weighted cross-entropy loss for them:

$$\mathcal{L}_{wce}(X, Y) = -\,w \sum_{i \in Y_+} \log p_i \;-\; \sum_{i \in Y_-} \log (1 - p_i), \qquad w = \gamma \, \frac{|Y_-|}{|Y_+|},$$

where $X$ is the input image, $Y$ is the pixel-wise binary label map for $X$, $Y_+$ and $Y_-$ are the sets of positive and negative label pixels, $p_i$ is the predicted lesion probability at pixel $i$, $w$ is the weight for the positive class, and $\gamma$ is a hyperparameter to adjust the weight scale.
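A direct NumPy sketch of this class-balanced cross entropy (the original inline symbols were lost in extraction, so the normalization by pixel count here is our assumption):

```python
import numpy as np

def weighted_bce(prob, label, gamma=1.0):
    """Class-balanced cross entropy: positives are up-weighted by
    w = gamma * |Y-| / |Y+|, with gamma adjusting the weight scale."""
    eps = 1e-7
    pos, neg = label == 1, label == 0
    w = gamma * neg.sum() / max(pos.sum(), 1)
    loss = -(w * np.log(prob[pos] + eps).sum()
             + np.log(1.0 - prob[neg] + eps).sum())
    return loss / label.size   # mean over pixels (our choice)

prob = np.array([[0.9, 0.1], [0.2, 0.8]])
label = np.array([[1, 0], [0, 1]])
print(weighted_bce(prob, label))
```

With lesion pixels typically well below 1% of a fundus image, the weight $w$ becomes large, preventing the trivial all-background solution from minimizing the loss.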
Training strategy. Training such a deep network is challenging. For each input of LocalNet, only one patch of the output from GlobalNet is used, so the gradient of GlobalNet is easily dominated by that patch, leading to unstable training. In addition, the augmentations available for the fused network are limited, since the fusion module requires pixel-wise alignment; this may degrade generalization. To address these issues, we first pre-train GlobalNet and LocalNet, then freeze the layers before the fusion module and train the fusion module alone until it converges. Finally, we unfreeze all layers and fine-tune the whole network.
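In PyTorch, the freeze/unfreeze schedule amounts to toggling `requires_grad` on the pre-trained branches. The tiny `nn.Linear` modules below are placeholders for the actual branches, used only to show the mechanism:

```python
import torch.nn as nn

def set_trainable(module, flag):
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = flag

# Placeholder modules standing in for the pre-trained branches and fusion head.
global_net, local_net = nn.Linear(4, 4), nn.Linear(4, 4)
fusion = nn.Linear(8, 1)

# Stage 1: freeze everything before the fusion module; train the fusion only.
set_trainable(global_net, False)
set_trainable(local_net, False)
set_trainable(fusion, True)
trainable = [p for m in (global_net, local_net, fusion)
             for p in m.parameters() if p.requires_grad]
print(len(trainable))  # 2  (fusion weight + bias)

# Stage 2: unfreeze all layers and fine-tune the whole network.
set_trainable(global_net, True)
set_trainable(local_net, True)
```

The optimizer for each stage would be built from only the currently trainable parameters, so frozen branches receive no updates and cannot be destabilized by single-patch gradients.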
3 Experimental Results
3.1 Implementation Details
Adam optimizers with an initial learning rate of 0.0002 are used for training both GlobalNet and LocalNet. For the fused net, we first train the fusion module for 10 epochs with a learning rate of 0.0002, then fine-tune the whole net with a learning rate of 1e-4 for 60 epochs, again using Adam. The models are trained and tested with PyTorch on two NVIDIA GeForce GTX 1080 GPUs. It takes about 1 hour to train GlobalNet, 4 hours to train LocalNet, and 4 to 8 hours to fine-tune the fused net. Inference on each patch takes around 10 to 30 ms.
In this paper, we utilize the Area Under the Precision-Recall curve (AUPR) as our evaluation metric, the same as that used in the 2018 ISBI grand challenge. Precision (PPV) and the true positive rate (TPR) are defined as follows:

$$PPV = \frac{TP}{TP + FP}, \qquad TPR = \frac{TP}{TP + FN},$$

where true positives (TP) are lesion pixels that are classified correctly, false positives (FP) are non-lesion pixels incorrectly classified as lesion, and false negatives (FN) are lesion pixels incorrectly classified as non-lesion. The precision-recall curve is obtained by plotting the precision-recall pairs at different thresholds, which are set to all the distinct values of the lesion probability map.
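A straightforward sketch of this metric, thresholding at every distinct probability value and integrating precision over recall with the trapezoidal rule (the exact interpolation used by the challenge evaluation may differ):

```python
import numpy as np

def aupr(prob, label):
    """Area under the precision-recall curve over all distinct thresholds."""
    pts = []
    for t in np.unique(prob):
        pred = prob >= t
        tp = np.logical_and(pred, label == 1).sum()
        fp = np.logical_and(pred, label == 0).sum()
        fn = np.logical_and(~pred, label == 1).sum()
        ppv = tp / max(tp + fp, 1)   # precision
        tpr = tp / max(tp + fn, 1)   # recall
        pts.append((tpr, ppv))
    pts.sort()                       # order by recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p1 + p0) / 2.0   # trapezoidal rule
    return area

prob = np.array([0.1, 0.4, 0.35, 0.8])
label = np.array([0, 0, 1, 1])
print(aupr(prob, label))
```

In practice one would flatten the probability maps and labels of all test images before computing the curve, as the challenge evaluates pixel-level AUPR per lesion class.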
Comparisons. We compared the performance of the proposed fused network against the two base models, LocalNet and GlobalNet. The quantitative results are shown in Table 1, from which we can see that our fused model outperforms the other two methods for EX and MA. For HE and SE, although our method is superior to LocalNet, it is still worse than GlobalNet. We discuss the reasons below. For MA and EX, as seen in Fig. 2, the segmentation results of GlobalNet are very coarse due to the loss of details; the proposed fusion strategy effectively compensates for this drawback, achieving better results. However, as can be seen from Fig. 3, the lesion areas of HE and SE are large and compact, so the details that GlobalNet loses through downsampling are negligible, and GlobalNet can capture more useful features than LocalNet. We therefore conclude that the proposed network improves segmentation performance when the target lesions are scattered and small. We also refer the readers to the leaderboard of the 2018 ISBI grand challenge (https://idrid.grand-challenge.org/), where our results exceed all of the reported ones. The margins over the leaderboard may not be large; this is because we use U-Net as our backbone rather than duplicating their networks as backbones. We believe our framework can also work with other backbones and improve their performance.
In terms of AUPR, our fused model achieves 0.889, 0.525, 0.703, and 0.679 on the four lesion types (Table 1).
For segmenting small lesions in high-resolution retinal fundus images, downsampling-based methods lose detailed information, while patch-based methods struggle to capture global context; both may thus suffer performance degradation. In this paper, we proposed end-to-end, mutually enhanced local-global U-Nets to solve this problem. The model consists of a global segmentation branch and a local (patch) segmentation branch, which are fused and jointly optimized to better capture both local details and global context. The experimental results demonstrated the efficacy of the proposed method.
Since there is currently no large similar dataset, we plan to collect more data ourselves and test the framework in future research. In addition, we believe the proposed fused model is not only applicable to retinal fundus lesion segmentation but can also be extended to other segmentation tasks.
-  J. Amin, M. Sharif, and M. Yasmin, “A review on recent developments for detection of diabetic retinopathy,” Scientifica, vol. 2016, 2016.
-  E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, April 2017.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds., Cham, 2015, pp. 234–241, Springer International Publishing.
-  F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571, 2016.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, Dec 2017.
-  C. K. Lam, C. Y. Yu, L. C. Huang, and D. L. Rubin, “Retinal lesion detection with deep learning using image patches,” in Investigative ophthalmology & visual science, 2018.
-  Y. Yang, T. Li, W. Li, H. Wu, W. Fan, and W. Zhang, “Lesion detection and grading of diabetic retinopathy via two-stages deep convolutional neural networks,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 533–540.
-  P. Chudzik, S. Majumdar, F. Caliva, B. Al-Diri, A. Hunter, et al., “Exudate segmentation using fully convolutional neural networks and inception modules,” SPIE, 2018.
-  P. Porwal, S. Pachade, R. Kamble, M. Kokare, G. Deshmukh, V. Sahasrabuddhe, and F. Meriaudeau, “Indian diabetic retinopathy image dataset (idrid),” 2018.
-  S. Xie and Z. Tu, “Holistically-nested edge detection,” in 2015 IEEE International Conference on Computer Vision (ICCV), Dec 2015, pp. 1395–1403.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
-  M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, “Denseaspp for semantic segmentation in street scenes,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS-W, 2017.