Glaucoma is the leading cause of irreversible blindness. It is a chronic eye disease that degrades the optic nerve, resulting in a large ratio of optic cup (OC, see Fig. 1) to optic disc (OD, see Fig. 1). In clinical practice, the cup-to-disc ratio (CDR) is estimated manually by experienced ophthalmologists, which is labour-intensive and time-consuming. To automate accurate CDR quantification and assist glaucoma diagnosis, the segmentation of OD and OC is attracting a lot of attention.
Current state-of-the-art methods for OD and OC segmentation are data-driven. The task is formulated as a dense classification problem and solved by deep neural networks consisting of a feature learning network and one or multiple prediction branches. Some works, such as SAN and CE-Net, directly append a single prediction branch to the final output of a CNN architecture, as shown in Fig. 2(a). This facilitates recognition, since high-level features contain rich semantic information. However, they ignore the spatial details in shallow layers, which results in inaccurate classification around edges. To make use of spatial details, others, such as MNet and AG-Net, make predictions on multiple levels of features and obtain the final output by element-wise summation. Fig. 2(b) illustrates this architecture with multiple prediction branches. However, in this kind of design the prediction branches are independent, which limits the learning capacity of the segmentation model. If each prediction branch is regarded as a classification model, how to ensemble them for better OD and OC segmentation remains unsolved. In traditional machine learning, an immediate choice for model ensembling is boosting. This motivates us to solve the dense classification problem in a gradient boosting framework and to develop an architecture with boosting modules that strengthen the classification ability of each branch. Fig. 2(c) illustrates the proposed boosting idea.
In detail, this paper explores a boosting deep neural network called BoostNet for OD and OC segmentation. It learns a sequence of base functions with deformable side-output units (DSUs) to boost the dense classification model in a stage-wise manner. Each base function learns the residual between the ground truth and the output of the classification model at the previous stage. The proposed aggregation unit (AU) combines the classification model at the previous stage with the base function in an additive form, which yields a better classification model. By stacking the AUs with deep supervision in a deep-to-shallow manner, the classification models are boosted stage by stage. Fig. 3 illustrates the architecture of BoostNet. We note that BoostNet differs from BoostCNN, an existing combination of boosting and CNNs. BoostNet is designed for pixel-level classification and treats its inner sub-networks as weak learners, whereas BoostCNN is proposed for image-level classification and treats the classification models obtained after different training iterations, or trained with different inputs, as weak learners.
2 Gradient Boosting Network
The proposed gradient boosting network (BoostNet) is rooted in gradient boosting and is equipped with the designed Deformable Side-output Units (DSUs) and Aggregation Units (AUs). It can be trained end-to-end. Like the previous methods MNet and SAN, BoostNet takes polar images as input and outputs polar segmentation maps.
Problem Formulation. The goal of OD and OC segmentation is to find a function $F^*$ that maps the input $x$ to an expected label $y$ such that some specified loss function $L(y, F(x))$ over the joint distribution of all $(x, y)$ is minimized:

$$F^* = \arg\min_{F} \mathbb{E}_{x,y}\left[ L\big(y, F(x)\big) \right]. \tag{1}$$

By merging the predicted OC and rim masks, OD is obtained.

Different from previous methods that directly estimate the function $F^*$ in Eq. 1, we solve it from the view of gradient boosting. The key idea is to approximate $F^*$ by greedily learning a sequence of base functions in an additive expansion of the form:

$$F(x) = \sum_{m=0}^{M} \beta_m f_m(x; \theta_m), \tag{2}$$

where $f_m$ are called base functions, parametrised by $\theta_m$ respectively, and $\beta_m$ are the expansion coefficients. Commonly, they are solved in a stage-wise way. First, the start base function $f_0(x; \theta_0)$ is optimised and $\beta_0 = 1$. The initial approximated function $F_0$ is determined by:

$$F_0(x) = f_0(x; \theta_0), \qquad \theta_0 = \arg\min_{\theta} \mathbb{E}_{x,y}\left[ L\big(y, f_0(x; \theta)\big) \right]. \tag{3}$$

Then for $m = 1, \dots, M$, the approximated function is expressed as:

$$F_m(x) = F_{m-1}(x) + \beta_m f_m(x; \theta_m). \tag{4}$$

The base function $f_m$ and expansion coefficient $\beta_m$ are determined by:

$$(\beta_m, \theta_m) = \arg\min_{\beta, \theta} \mathbb{E}_{x,y}\left[ L\big(y, F_{m-1}(x) + \beta f_m(x; \theta)\big) \right]. \tag{5}$$

We note that the base function $f_m$ here is the residual of $F_{m-1}$. It is similar to the residual learning in ResNet, except that base functions in the gradient boosting framework learn residuals of the models' outputs, while the residual block in ResNet learns the residual of features. By aggregating with residuals stage by stage under supervision, the models are boosted gradually and make predictions closer to the ground truth. Next we detail how to design a network that learns the base functions and the coefficients.
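The stage-wise additive scheme described above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a 1-D regression task with squared loss and decision stumps as base functions, whereas BoostNet uses network branches and cross-entropy. The names `fit_stump`, `mse` and `boost` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

BaseFn = Callable[[float], float]


def fit_stump(xs: Sequence[float], residuals: Sequence[float]) -> BaseFn:
    """Fit a least-squares decision stump (toy stand-in for a base function)."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not right:
            continue  # a split must leave samples on both sides
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # all inputs identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def mse(ys: Sequence[float], preds: Sequence[float]) -> float:
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs: Sequence[float], ys: Sequence[float],
          stages: int) -> Tuple[List[float], List[float]]:
    """Return final predictions and the loss recorded after every stage."""
    # Stage 0: the start base function is a constant (assumed choice).
    start = sum(ys) / len(ys)
    preds = [start] * len(xs)
    losses = [mse(ys, preds)]
    for _ in range(stages):
        # The next base function is fitted to the residual of the current model.
        residuals = [y - p for y, p in zip(ys, preds)]
        base = fit_stump(xs, residuals)
        outs = [base(x) for x in xs]
        # Expansion coefficient by exact line search under squared loss.
        denom = sum(o * o for o in outs)
        beta = sum(r * o for r, o in zip(residuals, outs)) / denom if denom else 0.0
        # Additive update: new model = previous model + beta * base function.
        preds = [p + beta * o for p, o in zip(preds, outs)]
        losses.append(mse(ys, preds))
    return preds, losses


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 1.0, 1.0, 4.0, 4.0]
preds, losses = boost(xs, ys, stages=3)
```

Because each coefficient is chosen by line search, the training loss in `losses` cannot increase from one stage to the next, which is the property the stage-wise supervision relies on.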
Gradient Boosting Network (BoostNet). CNNs such as VGG, ResNet, DenseNet and HRNet produce feature maps in a stage-wise manner. This is in line with the way gradient boosting learns base functions and coefficients, and it naturally motivates us to leverage a CNN to solve the optimisation problem defined by Eqs. 3 to 5.
Fig. 3 shows the proposed BoostNet. It consists of a backbone network, three deformable side-output units (DSU0, DSU1 and DSU2) and two aggregation units (AU1 and AU2). The DSU outputs serve as the base functions, and the outputs of AU1 and AU2 are the successively boosted models. We adopt HRNet as the backbone network since it learns powerful high-resolution representations. HRNet consists of four stages. Each stage has one or more feature learning branches and produces groups of features with different spatial resolutions. From the first to the fourth stage, the number of branches increases from one to four. For brevity, only the feature maps from the second stage to the last are shown. The backbone network together with the DSUs is used to learn the base functions, and the AUs are used to learn the coefficients.
Specifically, to start the boosting, DSU0 is attached to the fourth stage of the backbone network to produce the initial output. With supervision on this output, the initial approximated function in Eq. 3 is determined. Similarly, DSU1 is appended to the third stage of the backbone network to learn the next base function. An aggregation unit, AU1, aggregates the DSU0 output and this base function under supervision, resulting in a boosted model. By stacking AUs with supervision in a deep-to-shallow way, the models' performance is boosted gradually.
Deformable Side-output Unit (DSU). Fig. 4(a) shows the implementation of the proposed DSU. It is attached to the end of a stage of the backbone network and receives one group of features from each feature learning branch of that stage. A deformable convolutional layer is first attached to each feature learning branch, and the outputs are upsampled to the same spatial resolution. The upsampled features are then concatenated and followed by two fully connected layers. The first layer fuses the concatenated features, and the second layer learns the residual. The parameters to be learned in a base function therefore comprise those of the backbone network, the deformable convolutional layers and the two fully connected layers.
Aggregation Unit with Deep Supervision. The AU is designed to learn the expansion coefficient in Eq. 4. Fig. 4(b) shows its implementation. It first concatenates the output of the classification model at the previous stage with the residual produced by the base function at the current stage; a fully connected layer is then attached. With supervision, the coefficients are learned and a boosted segmentation model is obtained. In BoostNet, the cross-entropy loss is used.
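To make the aggregation step concrete, here is a minimal pure-Python sketch under stated assumptions: the fully connected layer is modelled as a per-pixel linear map over the concatenated channels (equivalent to a 1x1 convolution), and score maps are nested lists indexed `[class][row][col]`. The helper names `aggregate`, `additive_init` and `cross_entropy` are hypothetical, and `additive_init` is only an illustrative weight setting, not a procedure taken from the paper.

```python
import math
from typing import List, Tuple

ScoreMap = List[List[List[float]]]  # [class][row][col]


def aggregate(prev: ScoreMap, residual: ScoreMap,
              weight: List[List[float]], bias: List[float]) -> ScoreMap:
    """Concatenate two score maps along the class axis, then apply a per-pixel linear layer."""
    stacked = prev + residual  # 2C channels after concatenation
    rows, cols = len(prev[0]), len(prev[0][0])
    return [
        [
            [
                bias[o] + sum(weight[o][i] * stacked[i][r][c] for i in range(len(stacked)))
                for c in range(cols)
            ]
            for r in range(rows)
        ]
        for o in range(len(weight))
    ]


def additive_init(num_classes: int, beta: float = 1.0) -> Tuple[List[List[float]], List[float]]:
    """Weights [I | beta * I]: the layer then computes prev + beta * residual."""
    weight = [
        [1.0 if i == o else 0.0 for i in range(num_classes)]
        + [beta if i == o else 0.0 for i in range(num_classes)]
        for o in range(num_classes)
    ]
    return weight, [0.0] * num_classes


def cross_entropy(scores: ScoreMap, labels: List[List[int]]) -> float:
    """Mean per-pixel softmax cross-entropy."""
    rows, cols = len(labels), len(labels[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            pixel = [channel[r][c] for channel in scores]
            peak = max(pixel)  # subtract the maximum for numerical stability
            log_z = peak + math.log(sum(math.exp(s - peak) for s in pixel))
            total += log_z - pixel[labels[r][c]]
    return total / (rows * cols)


# Two classes on a 1x2 image: the previous model is undecided everywhere,
# and the residual pushes each pixel towards its true class.
prev = [[[0.0, 0.0]], [[0.0, 0.0]]]
residual = [[[2.0, -2.0]], [[-2.0, 2.0]]]
labels = [[0, 1]]
weight, bias = additive_init(num_classes=2, beta=1.0)
boosted = aggregate(prev, residual, weight, bias)
```

With the block-identity weights the layer reduces exactly to the additive update, so a learned layer of this shape can represent the additive form and also more general per-class mixtures of the two inputs.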
3 Experimental Results
To show the effectiveness of our proposed BoostNet, we validate it on the public dataset ORIGA and compare it with seven state-of-the-art methods.
Data Preprocessing and Augmentation. The ORIGA dataset contains 650 pairs of fundus images and expert annotations. Of these, 325 pairs are used for training and the rest for testing. Since we focus on OD regions, OD windows are cropped from the original images. Inspired by previous work, a radial transform is performed on the OD windows in the Cartesian coordinate system before they are fed into BoostNet. To increase the diversity of the training data, three augmentations are applied in the Cartesian system: (1) random cropping of OD windows near the OD centre (20-pixel horizontal and vertical offsets from the OD centre); (2) multi-scale augmentation (factors ranging from 0.8 to 1.2); (3) horizontal flipping. For training data, the OD centres are determined from the ground-truth OD masks. For testing data, we train an HRNet for OD segmentation on the ORIGA training data and estimate the OD centres from the predicted OD masks. During the testing phase, vertical flipping augmentation is applied to the polar OD windows.
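The Cartesian-to-polar step can be sketched as follows. This is an illustrative version only: the text does not specify the interpolation or the output resolution, so nearest-neighbour sampling around the window centre is an assumption here, and the function name `to_polar` and its parameters are hypothetical.

```python
import math
from typing import List

Image = List[List[float]]


def to_polar(window: Image, num_radii: int, num_angles: int) -> Image:
    """Resample a window on a polar grid around its centre.

    Output rows index the angle and output columns index the radius.
    """
    height, width = len(window), len(window[0])
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    max_r = min(cy, cx)  # largest radius that stays inside the window
    polar: Image = []
    for a in range(num_angles):
        theta = 2.0 * math.pi * a / num_angles
        row = []
        for k in range(num_radii):
            radius = max_r * k / (num_radii - 1) if num_radii > 1 else 0.0
            # Nearest-neighbour lookup of the Cartesian pixel (assumed choice).
            y = int(round(cy + radius * math.sin(theta)))
            x = int(round(cx + radius * math.cos(theta)))
            row.append(window[y][x])
        polar.append(row)
    return polar


# A centred disc of radius 5 in a 21x21 window becomes a band at small radii.
size, disc_r = 21, 5
window = [
    [1.0 if (y - 10) ** 2 + (x - 10) ** 2 <= disc_r ** 2 else 0.0 for x in range(size)]
    for y in range(size)
]
polar = to_polar(window, num_radii=11, num_angles=8)
```

In the polar image a roughly circular structure centred on the window turns into a roughly horizontal band, and flipping the polar image along the angle axis corresponds to mirroring the original window.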
Experimental Setting. BoostNet is built on top of the HRNet implementation within the PyTorch framework. We initialise the HRNet weights with the model pre-trained on ImageNet. Parameters are optimised by SGD on three GPUs. The hyper-parameters include: base learning rate (0.01, poly policy with a power of 0.9), weight decay (0.0005), momentum (0.9), batch size (9) and number of epochs (200).
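For reference, the poly learning-rate policy is commonly defined as the base rate multiplied by `(1 - step / total_steps) ** power`. The sketch below assumes that common definition; whether the decay is applied per iteration or per epoch is not stated in the text, and the function name `poly_lr` is hypothetical.

```python
def poly_lr(base_lr: float, step: int, total_steps: int, power: float = 0.9) -> float:
    """Polynomially decayed learning rate (common 'poly' definition, assumed here)."""
    if total_steps <= 0 or not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps] with total_steps > 0")
    return base_lr * (1.0 - step / total_steps) ** power


# Values from the setting above: base rate 0.01 and power 0.9,
# evaluated per epoch over 200 epochs as an illustration.
schedule = [poly_lr(0.01, epoch, 200) for epoch in range(201)]
```

The schedule starts at the base learning rate and decays smoothly to zero at the final step.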
Comparison with the State of the Art. We compare the proposed BoostNet with seven state-of-the-art methods: lightweight U-Net, MNet, JointRCNN, AG-Net, FC-DenseNet, SAN and HRNet. The first six methods are designed for OD and OC segmentation, and the last one was originally designed for natural scene image segmentation. The results of lightweight U-Net and MNet are taken from previous work. The results of JointRCNN, AG-Net, FC-DenseNet and SAN are taken from the original papers. The results of HRNet are obtained by fine-tuning on ORIGA. Following previous work, we use the overlapping error as the evaluation metric. Results are reported in Table 1. Our method achieves segmentation performance superior to the state-of-the-art methods. Fig. 5 and Fig. 6 show two examples of OC and OD segmentation respectively, in which the results of BoostNet are much closer to the ground truth than those of MNet and SAN.
| Method | OD | OC | Rim |
|---|---|---|---|
| lightweight U-Net | 0.115 | 0.287 | 0.303 |
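The overlapping error is not spelled out in the text. A common definition in the OD and OC segmentation literature is one minus the ratio of intersection area to union area between the predicted and ground-truth masks. The sketch below assumes that definition and binary masks stored as nested lists; it also shows how an OD mask can be obtained by merging the OC and rim masks. The function names `merge_disc` and `overlap_error` are hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary mask, 1 = foreground


def merge_disc(cup: Mask, rim: Mask) -> Mask:
    """Optic disc mask as the pixel-wise union of the cup and rim masks."""
    return [[1 if c or r else 0 for c, r in zip(cup_row, rim_row)]
            for cup_row, rim_row in zip(cup, rim)]


def overlap_error(pred: Mask, target: Mask) -> float:
    """1 - |pred AND target| / |pred OR target| (assumed definition)."""
    inter = sum(1 for p_row, t_row in zip(pred, target)
                for p, t in zip(p_row, t_row) if p and t)
    union = sum(1 for p_row, t_row in zip(pred, target)
                for p, t in zip(p_row, t_row) if p or t)
    if union == 0:
        return 0.0  # both masks empty: treated as a perfect match
    return 1.0 - inter / union


cup = [[0, 1, 0]]
rim = [[1, 0, 0]]
disc = merge_disc(cup, rim)            # [[1, 1, 0]]
error = overlap_error(disc, [[1, 1, 1]])  # intersection 2, union 3
```

Lower values are better: identical masks give an error of zero, and disjoint masks give an error of one.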
Ablation Study. Table 2 evaluates the effectiveness of boosting. After boosting once, the overlapping errors of OD, OC and rim segmentation are reduced, and after boosting twice they are reduced further. Fig. 7 shows the results of one example produced by the different segmentation architectures. With boosting, pixels near edges are predicted more accurately, which results in lower overlapping errors.
4 Conclusion
This paper proposes BoostNet, which solves the segmentation problem in a gradient boosting framework. The deformable side-output unit (DSU) and the aggregation unit (AU) with deep supervision are designed to estimate the base functions and expansion coefficients of the gradient boosting framework. BoostNet can be trained end-to-end by standard back-propagation. Validation on ORIGA demonstrates the effectiveness of the proposed BoostNet for the segmentation of OD and OC in fundus images.
- (2018) Dense fully convolutional segmentation of the optic disc and cup in colour fundus for glaucoma diagnosis. Symmetry 10.
- (2017) Deformable convolutional networks. In ICCV, pp. 764–773.
- (2002) Stochastic gradient boosting. Computational Statistics & Data Analysis 38 (4), pp. 367–378.
- (2018) Joint optic disc and cup segmentation based on multi-label deep network and polar transformation. TMI 37 (7), pp. 1597–1605.
- (2019) CE-Net: context encoder network for 2D medical image segmentation. TMI.
- (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
- (2017) Densely connected convolutional networks. In CVPR, pp. 2261–2269.
- (2019) JointRCNN: a region-based convolutional neural network for optic disc and cup segmentation. TBME.
- (2000) Ranking of optic disc variables for detection of glaucomatous optic nerve damage. Investigative Ophthalmology & Visual Science 41 (7), pp. 1764.
- (2018) Learning supervised descent directions for optic disc segmentation. Neurocomputing 275, pp. 350–357.
- (2019) A spatial-aware joint optic disc and cup segmentation method. Neurocomputing 359, pp. 285–297.
- (2016) Boosted convolutional neural networks. In BMVC, pp. 24–1.
- (2015) ImageNet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252.
- (2011) Multiclass boosting: theory and algorithms. In NIPS, pp. 2124–2132.
- (2018) Image augmentation using radial transform for training deep neural networks. In ICASSP.
- (2017) Optic disc and cup segmentation methods for glaucoma detection with modification of U-Net convolutional neural network. Pattern Recognition and Image Analysis 27 (3), pp. 618–624.
- (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
- Deep high-resolution representation learning for human pose estimation. In CVPR.
- (2010) ORIGA-light: an online retinal fundus image database for glaucoma analysis and research. In 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, pp. 3065–3068.
- (2019) Attention guided network for retinal image segmentation. In MICCAI.
- (2019) ET-Net: a generic edge-attention guidance network for medical image segmentation. In MICCAI.