A Deep Gradient Boosting Network for Optic Disc and Cup Segmentation

11/05/2019 ∙ by Qing Liu, et al.

Segmentation of the optic disc (OD) and optic cup (OC) is critical in automated fundus image analysis systems. Existing state-of-the-art methods focus on designing deep neural networks with one or multiple dense prediction branches. Such designs ignore the connections among prediction branches, which limits their learning capacity. To build connections among prediction branches, this paper introduces the gradient boosting framework into deep classification models and proposes a gradient boosting network called BoostNet. Specifically, a deformable side-output unit and an aggregation unit with deep supervision are proposed to learn the base functions and expansion coefficients of the gradient boosting framework. By stacking aggregation units in a deep-to-shallow manner, the model's performance is gradually boosted from deep to shallow stages. BoostNet achieves superior results to existing deep OD and OC segmentation networks on the public dataset ORIGA.




1 Introduction

Figure 1: Structure of the optic disc (OD) in a retinal fundus image. It comprises two parts: the optic cup (OC) and the neuroretinal rim. (a) A fundus image from ORIGA [19]. (b) OD window cropped from (a).

Glaucoma is the leading cause of irreversible blindness. It is a chronic eye disease that degrades the optic nerves, resulting in a large ratio of optic cup (OC, see Fig. 1) to optic disc (OD, see Fig. 1) [9]. In clinical practice, the cup-to-disc ratio (CDR) is manually estimated by experienced ophthalmologists, which is labour-intensive and time-consuming. To automate accurate CDR quantification and assist glaucoma diagnosis, the segmentation of OD and OC is attracting a lot of attention.

Figure 2: Architecture comparison. From left to right: (a) Single dense prediction branch. (b) Multiple dense prediction branches. (c) Multiple dense prediction branches with boosting modules. The boosting modules are stacked in a deep-to-shallow manner and learn residuals to boost the previous classification model.
Figure 3: The architecture of the proposed BoostNet. The deformable side-output units (DSU) and aggregation units (AU) with deep supervision learn the residuals and the expansion coefficients to boost the dense classification models. The implementation details of DSU and AU are illustrated in Fig. 4. DSU together with AU builds the boosting module in Fig. 2 (c). The input is a polar OD window, the same as in the previous methods MNet [4] and SAN [11].
Figure 4: Implementation of (a) DSU and (b) AU. ‘dc’ and ‘fc’ denote a deformable convolutional layer [2] and a fully connected layer respectively. ‘C’ and ‘up’ denote the concatenation and upsampling operators respectively.

Current state-of-the-art methods for OD and OC segmentation are data-driven. The task is formulated as a dense classification problem and solved by deep neural networks consisting of a feature learning network and one or multiple prediction branches. Some works such as SAN [10] and CE-Net [5] directly append one prediction branch to the final output of a CNN architecture, as shown in Fig. 2 (a). This facilitates recognition since high-level features contain rich semantic information. However, they ignore the spatial details in shallow layers, which results in inaccurate classification around edges. To make use of spatial details, others such as MNet [4] and AG-Net [20] make predictions on multiple levels of features. The final outputs are obtained by element-wise summation. Fig. 2 (b) illustrates the architecture with multiple prediction branches. However, in this kind of design the prediction branches are independent, which limits the learning capacity of the segmentation model. Regarding each prediction branch as a classification model, how to ensemble them for better OD and OC segmentation remains unsolved. In traditional machine learning, a natural choice for model ensembling is boosting. This motivates us to solve the dense classification problem in a gradient boosting framework [3] [14] and develop an architecture with boosting modules to boost each branch's classification ability. Fig. 2 (c) illustrates the proposed boosting idea.

In detail, this paper explores a boosting deep neural network called BoostNet for OD and OC segmentation. It learns a sequence of base functions with deformable side-output units (DSUs) to boost the dense classification model in a stage-wise manner. Each base function learns the residual between the ground truth and the output of the classification model at the previous stage. Combining the classification model at the previous stage with the base function in an additive form, via the proposed aggregation unit (AU), yields a better classification model. By stacking the AUs with deep supervision in a deep-to-shallow manner, the classification models are boosted stage by stage. Fig. 3 illustrates the architecture of BoostNet. We note that BoostNet differs from BoostCNN [12], an existing combination of boosting and CNNs. BoostNet is designed for pixel-level classification and treats its inner sub-networks as weak learners, while BoostCNN [12] is proposed for image-level classification and treats the classification models obtained after different training iterations or trained with different inputs as weak learners.

The contributions of this paper are two-fold: (1) We introduce gradient boosting into convolutional neural networks and design BoostNet for OD and OC segmentation; (2) We show that BoostNet achieves state-of-the-art performance on the public OD and OC segmentation dataset ORIGA.



The proposed gradient boosting network (BoostNet) is rooted in gradient boosting [3] and is equipped with the designed Deformable Side-output Units (DSUs) and Aggregation Units (AUs). It can be trained end-to-end. Similar to the previous methods MNet [4] and SAN [11], BoostNet takes polar images as input and outputs polar segmentation maps.

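Problem Formulation. The goal of OD and OC segmentation is to find a function $F^*(x)$ that maps the input $x$ to an expected label $y$ such that some specified loss function $L(y, F(x))$ over the joint distribution of all $(x, y)$ is minimized:

$$F^* = \arg\min_{F} \mathbb{E}_{x,y}\left[ L(y, F(x)) \right] \qquad (1)$$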

By merging the predicted OC and rim masks, OD is obtained.
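The merging step can be sketched as follows. This is a minimal illustration only: the integer label codes (0 for background, 1 for rim, 2 for OC) and the function name `masks_from_labels` are assumptions, since the paper does not state how labels are encoded.

```python
# Hypothetical label codes; the paper does not specify the actual encoding.
BACKGROUND, RIM, CUP = 0, 1, 2


def masks_from_labels(label_map):
    """Split a 2-D label map into binary OC, rim and OD masks."""
    cup = [[int(v == CUP) for v in row] for row in label_map]
    rim = [[int(v == RIM) for v in row] for row in label_map]
    # The OD mask is the union of the cup mask and the rim mask.
    disc = [[c | r for c, r in zip(cup_row, rim_row)]
            for cup_row, rim_row in zip(cup, rim)]
    return cup, rim, disc


label_map = [
    [0, 1, 1, 0],
    [1, 2, 2, 1],
    [0, 1, 1, 0],
]
cup, rim, disc = masks_from_labels(label_map)
```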

Different from previous methods [11] [4] [20] [21] that directly estimate the function $F^*$ in Eq. 1, we solve it from the view of gradient boosting [3]. The key idea is to approximate $F^*$ by greedily learning a sequence of base functions in an additive expansion of the form [3]:

$$F(x) = \sum_{m=0}^{M} \beta_m h(x; a_m) \qquad (2)$$

where $h(x; a_m)$ are called base functions, parametrised by $a_m$ respectively, and $\beta_m$ are the expansion coefficients. Commonly, they are solved in a stage-wise way. First, the start base function $h(x; a_0)$ is optimised and $\beta_0 = 1$. The initial approximated function $F_0$ is determined by:

$$F_0(x) = h(x; a_0), \quad a_0 = \arg\min_{a} \mathbb{E}_{x,y}\left[ L(y, h(x; a)) \right] \qquad (3)$$

Then for $m = 1, \dots, M$, the approximated function $F_m$ is expressed as:

$$F_m(x) = F_{m-1}(x) + \beta_m h(x; a_m) \qquad (4)$$

The base function $h(x; a_m)$ and expansion coefficient $\beta_m$ are determined by:

$$(\beta_m, a_m) = \arg\min_{\beta, a} \mathbb{E}_{x,y}\left[ L(y, F_{m-1}(x) + \beta h(x; a)) \right] \qquad (5)$$

We note that the base function $h(x; a_m)$ here is the residual of $F_{m-1}$. It is similar to the residual learning in ResNet [6], except that base functions in the gradient boosting framework learn residuals of the models' outputs while the residual block in ResNet [6] learns the residual of features. By aggregating the residuals stage by stage under supervision, the models are boosted gradually and make predictions closer to the ground truth. Next we detail how to design a network to learn the base functions and the coefficients.
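To make the stage-wise procedure concrete, the following toy sketch boosts a one-dimensional regressor. It is not the BoostNet training procedure: it assumes a squared loss (for which the residual equals the negative gradient), regression stumps as base functions, a constant initial model, and a closed-form line search for the coefficient. All function names are hypothetical.

```python
def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, targets):
    """Least-squares regression stump: choose the threshold that best fits targets."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else left_mean
        err = sum((r - (left_mean if x <= t else right_mean)) ** 2
                  for x, r in zip(xs, targets))
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost(xs, ys, num_stages):
    """Stage-wise additive model: F_m = F_{m-1} + beta_m * h_m."""
    mean = sum(ys) / len(ys)
    preds = [mean] * len(xs)              # F_0: constant model (toy assumption)
    losses = [mse(ys, preds)]
    for _ in range(num_stages):
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)      # base function fitted to the residual
        hv = [h(x) for x in xs]
        denom = sum(v * v for v in hv)
        # Closed-form line search for beta under the squared loss.
        beta = sum(r * v for r, v in zip(residuals, hv)) / denom if denom else 0.0
        preds = [p + beta * v for p, v in zip(preds, hv)]
        losses.append(mse(ys, preds))
    return preds, losses


xs = [0, 1, 2, 3, 4, 5]
ys = [0.0, 0.0, 1.0, 1.0, 3.0, 3.0]
preds, losses = boost(xs, ys, num_stages=3)
```

Each stage can only keep or lower the training loss, because choosing a coefficient of zero would reproduce the previous model.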

Gradient Boosting Network (BoostNet). CNNs such as VGG [17], ResNet [6], DenseNet [7] and HRNet [18] produce feature maps in a stage-wise manner. It is in line with the way that gradient boosting learns base functions and coefficients. This naturally motivates us to leverage the CNN to solve the optimisation problem defined by Eq. 3 - Eq. 5.

Fig. 3 shows the proposed BoostNet. It consists of a backbone network, three deformable side-output units (i.e. DSU0, DSU1, DSU2) and two aggregation units (i.e. AU1 and AU2). The outputs of DSU0, DSU1 and DSU2 are denoted as $h_0$, $h_1$, $h_2$ and the outputs of AU1 and AU2 are denoted as $F_1$, $F_2$ respectively. We adopt HRNet [18] as the backbone network since it is able to learn powerful high-resolution representations. HRNet [18] consists of four stages. Each stage has multiple feature learning branches and produces multiple groups of features with different spatial resolutions. From the first to the fourth stage, the number of branches increases from one to four. For brevity, only feature maps from the second stage to the last are shown. The backbone network together with the DSUs is used to learn the base functions, and the AUs are used to learn the coefficients.

Specifically, to start the boosting, DSU0 is attached to the fourth stage of the backbone network to produce $h_0$. Let $F_0 = h_0$; with supervision on $F_0$, $F_0$ in Eq. 3 is determined. Similarly, DSU1 is appended to the third stage of the backbone network to learn the base function $h_1$. AU1, with supervision, aggregates $F_0$ and $h_1$, resulting in a boosted model $F_1$. By stacking AUs with supervision in a deep-to-shallow way, the model's performance is boosted gradually.

Deformable Side-output Unit (DSU). Fig. 4(a) shows the implementation of the proposed DSU. It is attached to the end of a stage of the backbone network. Its inputs are the feature maps produced by the feature learning branches of that stage. A deformable convolutional layer [2] is first attached to each feature learning branch, and the outputs are upsampled to the same spatial resolution. The upsampled features are then concatenated and followed by two fully connected layers: the first fuses the concatenated features and the second learns the residual. The parameters to be learned in a base function therefore include those of the backbone network, the deformable convolutional layers and the two fully connected layers.

Aggregation Unit with Deep Supervision. The AU is designed to learn the expansion coefficient in Eq. 4. Fig. 4(b) shows its implementation. It first concatenates the output of the classification model at the previous stage with the residual produced by the base function at the current stage, and then applies a fully connected layer. Under supervision, the coefficients are learned and a boosted segmentation model is obtained. In BoostNet, the cross-entropy loss is used.
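The aggregation step can be illustrated with a small framework-free sketch. It assumes that the fully connected layer acts independently on every pixel (equivalent to a 1x1 convolution), which is an interpretation rather than something the paper states; the names `aggregation_unit` and `additive_weights` are hypothetical. With weights of the form [I | beta * I], the unit reduces exactly to the additive update of gradient boosting, the previous output plus beta times the residual.

```python
def aggregation_unit(prev_logits, residual, weight, bias):
    """Per-pixel linear layer over the concatenation [prev_logits, residual].

    prev_logits, residual: lists of pixels, each pixel a list of C class scores.
    weight: C rows of length 2C; bias: list of length C.
    """
    out = []
    for f_prev, h in zip(prev_logits, residual):
        z = f_prev + h  # channel-wise concatenation of the two score vectors
        out.append([sum(w * v for w, v in zip(row, z)) + b
                    for row, b in zip(weight, bias)])
    return out


def additive_weights(num_classes, beta):
    """Weights [I | beta * I], for which the unit computes F_prev + beta * h."""
    return [[1.0 if j == i else (beta if j == i + num_classes else 0.0)
             for j in range(2 * num_classes)]
            for i in range(num_classes)]


prev = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.0]]   # two pixels, three classes
res = [[-1.0, 1.0, 0.0], [0.0, 2.0, -2.0]]   # residuals from the base function
boosted = aggregation_unit(prev, res, additive_weights(3, 0.5), [0.0] * 3)
```

In a trained network the weights are learned under supervision, so the layer is free to deviate from this purely additive special case.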

3 Experimental Results

To show the effectiveness of our proposed BoostNet, we validate it on the public dataset ORIGA [19] and compare it with seven state-of-the-art methods.

Data Preprocessing and Augmentation. The ORIGA dataset [19] contains 650 pairs of fundus images and expert annotations. Of these, 325 pairs are used for training and the rest for testing. Since we focus on OD regions, OD windows are cropped from the original images. Inspired by [15] [4] [11], a radial transform is first applied to the OD windows in the Cartesian coordinate system before feeding them into BoostNet. To increase the diversity of the training data, three tricks are used to augment it in the Cartesian system: (1) random cropping of OD windows near the OD centre (horizontal and vertical offsets within 20 pixels of the OD centre); (2) multi-scale augmentation (factors ranging from 0.8 to 1.2); (3) horizontal flipping. For training data, the OD centres are determined from the ground-truth OD masks. For testing data, we train an HRNet [18] for OD segmentation on the ORIGA training data and estimate the OD centres from the predicted OD masks. During the testing phase, vertical flipping augmentation is applied to the polar OD windows.
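A radial (polar) transform of an OD window can be sketched as below. This is a minimal nearest-neighbour version under assumed conventions: the window is square, the transform is centred on the window centre, output rows index the radius and output columns index the angle. The exact sampling scheme used in the paper is not specified, and the function name `to_polar` is hypothetical.

```python
import math


def to_polar(image, num_radii, num_angles):
    """Nearest-neighbour polar resampling of a square 2-D image about its centre."""
    size = len(image)
    centre = (size - 1) / 2.0
    max_radius = centre  # keeps every sampled coordinate inside the image
    polar = []
    for i in range(num_radii):
        r = max_radius * i / (num_radii - 1)
        row = []
        for j in range(num_angles):
            theta = 2.0 * math.pi * j / num_angles
            x = int(round(centre + r * math.cos(theta)))
            y = int(round(centre + r * math.sin(theta)))
            row.append(image[y][x])
        polar.append(row)
    return polar


# A centred disc of radius 2 in a 9x9 window becomes a horizontal band.
disc = [[1 if (x - 4) ** 2 + (y - 4) ** 2 <= 4 else 0 for x in range(9)]
        for y in range(9)]
polar = to_polar(disc, num_radii=5, num_angles=8)
```

Under these conventions, roughly circular structures such as the OD and OC boundaries map to roughly horizontal layers in the polar window.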

Experimental Setting. BoostNet is built on top of the implementation of HRNet [18] within the PyTorch framework. We initialize the weights in HRNet [18] with the model pre-trained on ImageNet [13] and the remaining weights with a zero-mean Gaussian distribution. Parameters are optimised by SGD on three GPUs. The hyper-parameters include: base learning rate (0.01, poly policy with power of 0.9), weight decay (0.0005), momentum (0.9), batch size (9) and number of epochs (200).
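For reference, the poly learning-rate policy is commonly written as base_lr * (1 - iter / max_iter) ^ power. The sketch below uses that common form with the base rate and power quoted above; whether the schedule is stepped per iteration or per epoch is not stated in the paper, so the total of 1000 steps is an arbitrary illustrative value and `poly_lr` is a hypothetical helper name.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Learning rate under the commonly used 'poly' decay policy."""
    return base_lr * (1.0 - iteration / max_iterations) ** power


# Learning rate at a few points of a hypothetical 1000-step schedule.
schedule = [poly_lr(0.01, it, 1000) for it in (0, 250, 500, 750, 1000)]
```

The rate starts at the base value and decays smoothly to zero at the final step.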

Comparison with State-of-the-arts. We compare the proposed BoostNet with seven state-of-the-art methods: lightweight U-Net [16], MNet [4], JointRCNN [8], AG-Net [20], FC-DenseNet [1], SAN [11], and HRNet [18]. The first six methods are designed for OD and OC segmentation and the last one was originally designed for natural scene image segmentation. The results of lightweight U-Net [16] and MNet [4] are taken from [4]. The results of JointRCNN [8], AG-Net [20], FC-DenseNet [1] and SAN [11] are taken from the original papers. The results of HRNet [18] are obtained by fine-tuning on ORIGA [19]. Following [4] [11], we use the overlapping error as the evaluation metric. Results are reported in Table 1. Our method achieves superior segmentation performance to the state-of-the-art methods. Fig. 5 and Fig. 6 show two examples of OC and OD segmentation respectively, in which the results of BoostNet are much closer to the ground truth than those of MNet [4] and SAN [11].
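The overlapping error is commonly defined as E = 1 - |S ∩ G| / |S ∪ G|, where S is the predicted mask and G the ground-truth mask. A minimal sketch on binary masks follows; returning 0 when both masks are empty is an assumption made here to avoid a division by zero, and `overlapping_error` is a hypothetical helper name.

```python
def overlapping_error(pred, gt):
    """Overlapping error 1 - |S ∩ G| / |S ∪ G| between two binary 2-D masks."""
    pairs = [(p, g) for pred_row, gt_row in zip(pred, gt)
             for p, g in zip(pred_row, gt_row)]
    intersection = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    if union == 0:
        return 0.0  # assumption: two empty masks count as a perfect match
    return 1.0 - intersection / union


pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
error = overlapping_error(pred, gt)  # intersection 2, union 4
```

Lower values are better: identical masks give 0 and disjoint masks give 1.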

Method                    OD      OC      Rim
lightweight U-Net [16]    0.115   0.287   0.303
MNet [4]                  0.071   0.230   0.233
JointRCNN [8]             0.063   0.209   -
AG-Net [20]               0.061   0.212   -
CE-Net [5]                0.058   -       -
SAN [11]                  0.059   0.208   0.215
HRNet [18]                0.057   0.209   0.208
ours                      0.051   0.202   0.196
Table 1: Performance (overlapping error) of different methods on ORIGA [19].
Figure 5: Examples of OC segmentation. From left to right: input OD window, results of ours, MNet [4] and SAN [11]. Pixels in yellow, red and green are correct, missed and erroneous detections respectively.
Figure 6: Examples of OD segmentation. From left to right: input OD window, results of ours, MNet [4] and SAN [11]. Pixels in yellow, red and green are correct, missed and erroneous detections respectively.

Ablation Study. Table 2 evaluates the effectiveness of boosting. After boosting once, the overlapping errors of OD, OC and rim segmentation are reduced from (0.056, 0.209, 0.206) to (0.054, 0.202, 0.200). After boosting twice, they reach (0.051, 0.202, 0.196). Fig. 7 shows the results of one example by different segmentation architectures. With boosting, pixels near edges are predicted more accurately, which results in lower overlapping errors.

Setting             OD      OC      Rim
without boosting    0.056   0.209   0.206
boosting once       0.054   0.202   0.200
boosting twice      0.051   0.202   0.196
Table 2: Ablation study on boosting on ORIGA [19].
Figure 7: Example illustrating the effectiveness of the proposed boosting module. From left to right: input image, results of architectures without boosting, with boosting once and with boosting twice. From top to bottom: results for rim and OC. Pixels in yellow, red and green are correct, missed and erroneous detections respectively.

4 Conclusion

This paper proposes BoostNet, which solves the segmentation problem in a gradient boosting framework. A deformable side-output unit (DSU) and an aggregation unit (AU) with deep supervision are designed to estimate the base functions and expansion coefficients of the gradient boosting framework. BoostNet can be trained end-to-end by standard back-propagation. Validation on ORIGA [19] demonstrates the effectiveness of the proposed BoostNet for the segmentation of OD and OC in fundus images.


  • [1] B. Al-Bander, B. M. Williams, W. Al-Nuaimy, M. A. Al-Taee, H. Pratt, and Y. Zheng (2018) Dense fully convolutional segmentation of the optic disc and cup in colour fundus for glaucoma diagnosis. Symmetry 10.
  • [2] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In ICCV, pp. 764–773.
  • [3] J. H. Friedman (2002) Stochastic gradient boosting. Computational Statistics & Data Analysis 38 (4), pp. 367–378.
  • [4] H. Fu, J. Cheng, Y. Xu, D. W. K. Wong, J. Liu, and X. Cao (2018) Joint optic disc and cup segmentation based on multi-label deep network and polar transformation. TMI 37 (7), pp. 1597–1605.
  • [5] Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, and J. Liu (2019) CE-net: context encoder network for 2d medical image segmentation. TMI.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
  • [7] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, pp. 2261–2269.
  • [8] Y. Jiang, L. Duan, J. Cheng, Z. Gu, H. Xia, H. Fu, C. Li, and J. Liu (2019) JointRCNN: a region-based convolutional neural network for optic disc and cup segmentation. TBME.
  • [9] J. B. Jonas, A. Bergua, P. Schmitz–Valckenberg, K. I. Papastathopoulos, and W. M. Budde (2000) Ranking of optic disc variables for detection of glaucomatous optic nerve damage. Investigative Ophthalmology & Visual Science 41 (7), pp. 1764.
  • [10] A. Li, Z. Niu, J. Cheng, F. Yin, D. W. K. Wong, S. Yan, and J. Liu (2018) Learning supervised descent directions for optic disc segmentation. Neurocomputing 275, pp. 350–357.
  • [11] Q. Liu, X. Hong, S. Li, Z. Chen, G. Zhao, and B. Zou (2019) A spatial-aware joint optic disc and cup segmentation method. Neurocomputing 359, pp. 285–297.
  • [12] M. Moghimi, S. J. Belongie, M. J. Saberian, J. Yang, N. Vasconcelos, and L. Li (2016) Boosted convolutional neural networks. In BMVC, pp. 24–1.
  • [13] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252.
  • [14] M. J. Saberian and N. Vasconcelos (2011) Multiclass boosting: theory and algorithms. In NIPS, pp. 2124–2132.
  • [15] H. Salehinejad, S. Valaee, T. Dowdell, and J. Barfett (2018) Image augmentation using radial transform for training deep neural networks. In ICASSP.
  • [16] A. Sevastopolsky (2017) Optic disc and cup segmentation methods for glaucoma detection with modification of u-net convolutional neural network. Pattern Recognition and Image Analysis 27 (3), pp. 618–624.
  • [17] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
  • [18] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In CVPR.
  • [19] Z. Zhang, F. S. Yin, J. Liu, W. K. Wong, N. M. Tan, B. H. Lee, J. Cheng, and T. Y. Wong (2010) ORIGA-light: an online retinal fundus image database for glaucoma analysis and research. In 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, pp. 3065–3068.
  • [20] Z. Zhang, H. Fu, H. Dai, J. Shen, Y. Pang, and L. Shao (2019) Attention guided network for retinal image segmentation. In MICCAI.
  • [21] Z. Zhang, H. Fu, H. Dai, J. Shen, Y. Pang, and L. Shao (2019) ET-net: a generic edge-attention guidance network for medical image segmentation. In MICCAI.