Boundary-assisted Region Proposal Networks for Nucleus Segmentation

Nucleus segmentation is an important task in medical image analysis. However, machine learning models often perform poorly because of the large number of crowded nucleus clusters. To handle this problem, existing approaches typically resort to sophisticated hand-crafted post-processing strategies; therefore, they are vulnerable to the variation of post-processing hyper-parameters. Accordingly, in this paper, we devise a Boundary-assisted Region Proposal Network (BRP-Net) that achieves robust instance-level nucleus segmentation. First, we propose a novel Task-aware Feature Encoding (TAFE) network that efficiently extracts high-quality features for each of the semantic segmentation and instance boundary detection tasks. This is achieved by carefully considering the correlation and differences between the two tasks. Second, coarse nucleus proposals are generated based on the predictions of the above two tasks. Third, these proposals are fed into instance segmentation networks for more accurate prediction. Experimental results demonstrate that the performance of BRP-Net is robust to the variation of post-processing hyper-parameters. Furthermore, BRP-Net achieves state-of-the-art performance on both the Kumar and CPM17 datasets. The code of BRP-Net will be released at


1 Introduction

Nucleus segmentation is a crucial task in computational pathology, as it provides rich spatial and morphometric information regarding nuclei. However, automatic nucleus segmentation remains challenging for a number of reasons: first, a large number of nucleus clusters exist, which results in crowded and overlapping nuclei; second, the boundaries of nuclei in out-of-focus images tend to be blurry, which increases the difficulty of separating crowded instances; third, both nucleus appearance and shape exhibit dramatic variation, which makes the segmentation task more difficult.

Figure 1: An example that illustrates the essential difference between the semantic segmentation task and the instance boundary detection task. (a) original image; (b) ground-truth for semantic segmentation; (c) ground-truth for nucleus boundary. The boundaries that separate two overlapping instances, i.e. pixels colored in green in (c), cannot be directly inferred from the semantic segmentation results in (b).

Many approaches to nucleus segmentation have been proposed. One popular scheme is based on the use of boundary detection [1, 2, 3]. These approaches subtract instance boundaries from semantic segmentation results and then employ complex post-processing rules to obtain specific instances. In order to obtain the instance boundaries, DCAN [1] adopted two decoders for U-Net, one for semantic segmentation and another for instance boundary detection. No interactions take place between the two decoders. To make use of their correlation, BES-Net [2] and CIA-Net [3] further introduced uni-directional and bi-directional information transmission, respectively, which means one decoder obtains extra features from the other one. The above approaches have two key downsides. First, as they adopt a shared encoder for both tasks, they underestimate the essential differences between the tasks in feature learning; for example, the boundaries in Fig. 1(c) that separate two overlapping instances cannot be directly inferred from the semantic segmentation results in Fig. 1(b). Second, because these approaches adopt complex post-processing rules, their performance is sensitive to the variation of post-processing hyper-parameters.

Another popular strategy used to separate crowded instances is the distance-based approach [4, 5]. For example, DIST [4] predicted the distance between each foreground pixel and its nearest background pixel, while HoVer-Net [5] enriched prediction by considering distances in both the horizontal and vertical directions. Subsequently, these works apply the watershed algorithm to the predicted distance maps to obtain instances. However, one downside of this approach is that the watershed algorithm may be sensitive to the noise in the distance maps. Finally, clustering-based methods predict the spatial location of the associated instance for each foreground pixel [6]. These instances are separated by clustering the predicted location coordinates.

Figure 2: Overview of BRP-Net. BRP-Net comprises two stages: one stage to obtain instance proposals and another for proposal-wise segmentation.

In this paper, we propose a novel framework for nucleus segmentation, referred to as Boundary-assisted Region Proposal Network (BRP-Net). Similar to Mask R-CNN [7], BRP-Net comprises two stages: one stage to obtain instance proposals and another for proposal-wise segmentation. In the first stage, we implement the boundary detection-based scheme to obtain instance proposals. This can be contrasted with Mask R-CNN [7], which predicts rectangular proposals directly from feature maps. As was demonstrated in [16], crowded instances result in bounding boxes with significant overlap; this means a single bounding box can be associated with multiple instances, consequently affecting the optimization quality of the network. Moreover, we further propose the Task-aware Feature Encoding (TAFE) network, which efficiently extracts high-quality features for semantic segmentation and instance boundary detection tasks. TAFE aids BRP-Net in robustly obtaining instance proposals. The second stage refines the segmentation result for each proposal, which enables BRP-Net to be robust to the variation of post-processing hyper-parameters in TAFE. Extensive experiments are conducted on two publicly available nucleus segmentation datasets, from which we can conclude that BRP-Net consistently achieves state-of-the-art performance on both datasets.

2 Method

The overall framework of BRP-Net is presented in Fig. 2. This framework includes two stages: one for obtaining instance proposals and another for proposal-wise segmentation. The first stage adopts a similar pipeline to CIA-Net [3], and the second one aims to refine the segmentation results of the first stage in a proposal-wise manner.

2.1 Region Proposal Generation

Figure 3: Architecture details of TAFE. The number of channels in is set to 256 consistently. Feature maps produced by both encoders are fused in FFMs to make use of their correlation. For simplicity, only one FFM is shown and the other two FFMs are ignored in this figure. (Best viewed in color).

We adopt a boundary detection-based scheme to obtain high-quality region proposals. Following the post-processing rules outlined in [1, 3], instance boundaries are subtracted from the predictions of semantic segmentation. Subsequently, connected component analysis is applied to produce instance proposals. Extant approaches have integrated semantic segmentation and instance boundary detection tasks into one model [1, 2, 3]; however, as they adopt a shared encoder for both tasks, they may underestimate their essential differences regarding feature learning, as is analyzed in Sec. 1. One intuitive solution would be adopting independent encoders for the two tasks. However, this strategy increases the model complexity and also completely ignores their correlation. Accordingly, we propose a novel Task-aware Feature Encoding (TAFE) network capable of efficiently extracting high-quality features for each of these tasks.
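The proposal-generation rule described above (subtract the predicted boundaries from the predicted foreground, then run connected component analysis) can be sketched in a few lines. This is a minimal, dependency-free illustration and not the authors' implementation: the function name `proposals_from_maps`, the 0.5 thresholds, and the use of 4-connectivity are all assumptions made for the example.

```python
from collections import deque


def proposals_from_maps(sem_prob, bnd_prob, sem_thr=0.5, bnd_thr=0.5):
    """Label coarse instance proposals from two probability maps.

    sem_prob, bnd_prob: 2-D lists of floats with the same shape.
    Thresholds and 4-connectivity are illustrative assumptions.
    Returns (label_map, number_of_proposals); label 0 is background.
    """
    h, w = len(sem_prob), len(sem_prob[0])
    # Keep foreground pixels that are NOT predicted as instance boundary.
    inner = [[sem_prob[y][x] > sem_thr and bnd_prob[y][x] <= bnd_thr
              for x in range(w)] for y in range(h)]
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if not inner[y][x] or labels[y][x] != 0:
                continue
            # Start a new connected component and flood-fill it (BFS).
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and inner[ny][nx] and labels[ny][nx] == 0):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count
```

Each positive label in the returned map corresponds to one coarse proposal. The boundary pixels removed by the subtraction are typically recovered afterwards by dilation, which is the hyper-parameter studied in Sec. 3.2.2.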

Fig. 3 presents the architecture of TAFE. First, nucleus images are fed into a single backbone encoder to extract feature maps that are {1, 1/2, 1/4, 1/8} of the original image size. The structure of the backbone encoder is provided in the supplementary file. Subsequently, each of these feature maps is passed through one unshared convolutional layer per task to obtain two sets of task-specific features, which are fed into the Task-specific Encoders (TSEs) designed for the semantic segmentation and instance boundary detection tasks, respectively. In each encoder, the feature maps after down-sampling are merged with the task-specific input features of the same size via element-wise summation. The merged features are then passed through one convolutional layer to generate the encoder's output features at that scale. Similar to CIA-Net [3], deep supervision is applied, and the auxiliary classifiers take these output features as inputs. Moreover, inspired by the Information Aggregation Modules [3], we propose light-weight Feature Fusion Modules (FFMs), which are based on residual learning, to aggregate the information in the output features of the two encoders. In the experimentation section, we demonstrate the superiority of FFMs. FFMs help to exploit the correlation between the two tasks while preserving their differences. Outputs of each FFM are fed into two shallow decoders via element-wise summation. The two decoders are used for the semantic segmentation task and the instance boundary detection task. Each decoder contains three BN-ReLU-Conv layers.

2.2 Proposal-wise Segmentation

Figure 4: The two networks in the proposal-wise segmentation stage adopt the same architecture. Each layer in the network includes one dense block that consists of four convolutional layers. Growth rates of the four dense blocks are set to 16, 32, 64, and 128, respectively. The number below each group of feature maps denotes the number of channels. (Best viewed in color).

The first stage of BRP-Net, i.e. TAFE, adopts hand-crafted post-processing rules to obtain instance proposals. Accordingly, the quality of proposals is affected by post-processing hyper-parameters. To address this problem, we propose a second stage for BRP-Net to facilitate more robust segmentation.

We crop one square patch containing each proposal with a minimal margin of 12 pixels on each side. Because the patches vary dramatically in size, we group them into small and large patches by applying a threshold to their side length. Then, the small and large patches are resized to two fixed sizes, one for each group. Finally, we train one network for the small and another for the large patches. These two networks have the same architecture, the details of which are illustrated in Fig. 4. Inputs to the model include the patch and the probability maps that are predicted by the semantic segmentation and boundary detection tasks in the first stage. To reduce the influence of background, elements in the probability maps that fall outside of the dilated proposal are set to zero. The dilation rate is set to 2 pixels.
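The cropping rule above can be illustrated with a small helper that turns a proposal's bounding box into a square patch with at least a 12-pixel margin on every side. This is a sketch under stated assumptions: the names `square_patch_box` and `patch_group` are hypothetical, the box is centered on the proposal, the patch is allowed to extend past the image border (padding is left to the caller), and the grouping threshold is a required argument whose value this sketch does not assume.

```python
def square_patch_box(y0, x0, y1, x1, margin=12):
    """Return (top, left, side) of a square patch around a proposal.

    The proposal's bounding box is [y0, y1) x [x0, x1). The patch side is
    the longer box edge plus `margin` pixels on both sides, so every side
    keeps at least `margin` pixels of context. Coordinates may fall outside
    the image; handling that (e.g. by padding) is left to the caller.
    """
    height, width = y1 - y0, x1 - x0
    side = max(height, width) + 2 * margin
    # Center the square on the bounding box (integer arithmetic).
    top = (y0 + y1 - side) // 2
    left = (x0 + x1 - side) // 2
    return top, left, side


def patch_group(side, length_threshold):
    """Assign a patch to the 'small' or 'large' network by its side length.

    `length_threshold` has no default on purpose. Treating a side equal to
    the threshold as 'small' is an assumption of this sketch.
    """
    return "small" if side <= length_threshold else "large"
```

For a proposal spanning rows 10 to 20 and columns 30 to 50, the longer edge is 20 pixels, so the patch side is 44 pixels, with exactly 12 pixels of margin along the longer axis and more along the shorter one.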

During training, each proposal is matched to a ground-truth instance according to their Intersection over Union (IoU). For proposals with an IoU larger than a given threshold, the label maps are set with reference to the matched ground-truth instance; otherwise, the proposals are considered to be false-positive predictions, and all elements in their label maps are set to zero (denoting background).
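The matching rule can be sketched as follows. The example represents each instance as a set of (row, column) pixel coordinates, which is an assumption made for brevity; the function names are hypothetical, and the default threshold of 0.5 is the value selected in Sec. 3.2.3.

```python
def iou(a, b):
    """Intersection over Union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def training_label(proposal, gt_instances, iou_thr=0.5):
    """Return the foreground pixels of the label map for one proposal.

    The proposal is matched to the ground-truth instance with the highest
    IoU. If that IoU is larger than `iou_thr`, the matched instance defines
    the foreground; otherwise the proposal is treated as a false positive
    and the label map is all background (an empty set here).
    """
    best_iou, best_gt = 0.0, None
    for gt in gt_instances:
        value = iou(proposal, gt)
        if value > best_iou:
            best_iou, best_gt = value, gt
    if best_gt is not None and best_iou > iou_thr:
        return set(best_gt)
    return set()
```

Note that the label comes from the matched ground-truth instance and not from the proposal itself, which is what lets the second stage correct an imperfect first-stage proposal.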

2.3 Inference

During the inference process, nucleus images are fed into BRP-Net. Semantic segmentation and instance boundary detection results are produced by TAFE. Then, post-processing operations in [1, 3] are implemented to obtain instance proposals. Finally, patches containing these proposals are extracted and respectively fed into proposal-wise segmentation networks for robust instance segmentation.

3 Experiments

We conduct experiments on two publicly available datasets. The first is a multi-organ nucleus dataset [8, 9], referred to as Kumar, which contains 30 Hematoxylin and Eosin (H&E) stained images with resolution of . They are divided into a training set of 16 images and a testing set of 14 images according to the same protocol used in previous works [8, 3, 5, 6]. In the testing set, 8 images are from 4 organs in the training set (seen organ), and the remaining 6 images are from 3 organs that do not appear in the training set (unseen organ). The second dataset is the Computational Precision Medicine Dataset (CPM17) [10], which contains 32 images for training and 32 images for testing.

Evaluation metrics for the two datasets are different. In the Kumar dataset, the main metric is the Average Jaccard Index (AJI) [8]. We also report the F1-Score to measure the instance detection performance. In CPM17, we use the same metrics as used in [10], i.e. the DICE coefficient (DICE 1) and Ensemble Dice (DICE 2). DICE 1 measures the overall overlap between the predictions and the ground truth, and DICE 2 measures the average overlap between the predictions and their matched ground truth instances. Besides, in order to better compare with one state-of-the-art work [5], we also report AJI in the experiments.
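For readers unfamiliar with AJI, the sketch below shows how the metric is commonly computed following its description in [8]: each ground-truth instance is paired with the prediction of highest IoU, intersections and unions are accumulated, and predictions that were never paired are added to the union as a penalty. The set-of-pixels representation and the function name are assumptions of this example, and it is not the official evaluation code, which should be used for reported numbers.

```python
def aggregated_jaccard_index(gt_instances, pred_instances):
    """Sketch of the Aggregated Jaccard Index on pixel sets.

    gt_instances, pred_instances: lists of sets of (row, col) pixels.
    """
    used = set()      # indices of predictions paired with some ground truth
    inter_sum = 0     # accumulated intersection size
    union_sum = 0     # accumulated union size
    for gt in gt_instances:
        best_j, best_iou = None, 0.0
        for j, pred in enumerate(pred_instances):
            inter = len(gt & pred)
            if inter == 0:
                continue
            value = inter / len(gt | pred)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_j is None:
            # A missed nucleus contributes only to the union.
            union_sum += len(gt)
        else:
            inter_sum += len(gt & pred_instances[best_j])
            union_sum += len(gt | pred_instances[best_j])
            used.add(best_j)
    # Unpaired predictions (false positives) enlarge the union.
    for j, pred in enumerate(pred_instances):
        if j not in used:
            union_sum += len(pred)
    return inter_sum / union_sum if union_sum else 0.0
```

Because both missed nuclei and spurious predictions enlarge the denominator, AJI penalizes detection errors as well as imprecise masks, which is why it is used as the main metric on the Kumar dataset.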

3.1 Implementation Details

We first perform stain normalization [12] to reduce the color differences between the stained images. In the next step, we normalize each image by subtracting the mean and dividing by the standard deviation of the training set. Training data are augmented by random cropping, flipping, color jittering, blurring and elastic transformation. We crop images to a fixed size before using them as the input of BRP-Net.

In a similar way to CIA-Net [3], we adopt DenseNet [13] as TAFE’s backbone encoder and initialize its parameters using a single pretrained model. We also adopt both the Smooth Truncated Loss [3] and Soft Dice Loss [17] for the optimization of both tasks in TAFE. The weight of the Soft Dice Loss is set to 0.5. We use the AdamW [14] optimizer for training. The number of training epochs is set to 600. The learning rate is initially set to 0.0003 and decreases according to the cosine annealing schedule [14]. The learning rate decreases to zero in 40 epochs and is then reset. At each restart, the new starting learning rate is set to one half of the previous one, while the new period lasts twice as long as the previous one.
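The restart schedule described above can be written as a short function of the epoch index. This is a sketch under the assumption that the learning rate is updated once per epoch; the function name `lr_at_epoch` is hypothetical, while the initial rate (0.0003), the first period (40 epochs), the halving of the starting rate, and the doubling of the period come from the text.

```python
import math


def lr_at_epoch(epoch, base_lr=3e-4, first_period=40):
    """Cosine annealing with restarts, evaluated once per epoch.

    At every restart the starting learning rate is halved and the
    period is doubled.
    """
    start_lr, period, t = base_lr, first_period, epoch
    # Walk through completed cycles to find the current one.
    while t >= period:
        t -= period
        period *= 2
        start_lr *= 0.5
    # Cosine decay from start_lr towards zero within the current cycle.
    return 0.5 * start_lr * (1.0 + math.cos(math.pi * t / period))
```

With these settings the cycles last 40, 80, 160 and 320 epochs, which sum to the 600 training epochs, so training ends exactly at the end of the fourth cycle.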

Finally, and are set to be 48 and 176 pixels, respectively. Training settings for the proposal-wise segmentation networks are similar to those of TAFE. But we use Focal Loss [15] for optimization and the training lasts for only ten epochs. The learning rate is set to 0.0003 initially, and decreases according to the cosine annealing schedule without restart.

3.2 Ablation Study

3.2.1 Effectiveness of TAFE

Network          AJI (%)                  F1-Score (%)
                 seen    unseen  all      seen    unseen  all
Baseline         61.15   62.58   61.76    82.99   84.08   83.46
Baseline+FFMs    61.41   63.39   62.26    82.35   84.90   83.44
TAFE             61.96   63.84   62.77    82.81   84.34   83.47

Table 1: Performance comparisons between the baseline, baseline+FFMs, and TAFE.

Figure 5: Evaluation on different settings for BRP-Net. (a) The influence of different dilation radii in the post-processing step of TAFE. (b) The choice of IoU thresholds and different loss functions for the second stage of BRP-Net.

We compare the performance of TAFE with a baseline network that is similar to existing boundary detection-based methods [3]. In brief, the baseline shares one encoder between the semantic segmentation and instance boundary detection tasks, while the two tasks still have their own decoders equipped with IAMs. For fair comparison, the baseline has the same number of parameters as TAFE. Table 1 presents the performance of TAFE and the baseline. We also equip IAMs with the same residual learning scheme as FFMs and report the performance of the baseline again, which is referred to as ‘baseline+FFMs’ in the table. Architecture details of both the baseline and ‘baseline+FFMs’ are provided in the supplementary file. The results show that TAFE achieves a higher AJI on both the seen and unseen organ subsets. They also show that the residual learning scheme in FFMs is helpful. This may be because this scheme better highlights the differences between the two tasks, as illustrated in Fig. 3. The comparison justifies the effectiveness of TAFE and the FFMs.

3.2.2 Evaluation on Post-Processing Settings in TAFE

Existing boundary detection-based methods are sensitive to the post-processing hyper-parameters, particularly the dilation radius for recovering the subtracted instance boundaries [1, 3]. We conduct experiments to evaluate the influence of different dilation radii on both TAFE and the entire BRP-Net pipeline. Results are presented in Fig. 5(a). Owing to the proposal-wise segmentation stage, BRP-Net is highly robust to the value of the dilation radius. By contrast, the performance of the single-stage method, i.e. TAFE alone, is less stable.

3.2.3 Evaluations on Settings for Proposal-wise Segmentation

We evaluate the influence of IoU thresholds and different loss functions on the second stage of BRP-Net. Experimental results are presented in Fig. 5(b). They show that the performance of BRP-Net is generally robust to the value of the IoU threshold, and that focal loss [15] slightly outperforms cross-entropy loss. According to the evaluation results, we select focal loss for training and set the IoU threshold to 0.5 for the second stage.

Network           AJI (%)                  F1-Score (%)
                  seen    unseen  all      seen    unseen  all
CNN3 [8]          51.54   49.89   50.83    82.26   83.22   82.67
DIST [4]          55.91   56.01   55.95    -       -       -
Mask R-CNN [7]    59.78   55.31   57.86    81.07   82.91   81.86
CIA-Net [3]       61.29   63.06   62.05    82.44   84.58   83.36
HoVer-Net [5]     -       -       61.80    -       -       -
Spa-Net [6]       62.39   63.40   62.82    82.81   84.51   83.53
BRP-Net (ours)    63.07   65.75   64.22    83.46   85.26   84.23

(a) Comparisons on the Kumar database [8].

Network           Dice 1 (%)   Dice 2 (%)   AJI (%)
DRAN [10]         86.2         70.3         68.3
HoVer-Net [5]     86.9         -            70.5
Micro-Net [11]    85.7         79.6         -
BRP-Net (ours)    87.7         79.5         73.1

(b) Comparisons on the CPM17 database [10].

Table 2: Quantitative comparisons between BRP-Net and existing methods.

3.3 Comparisons with State-of-the-art Methods

Comparisons between BRP-Net and state-of-the-art methods on the Kumar database are reported in Table 2(a). It can be seen that BRP-Net achieves both the highest AJI and the highest F1-Score among all the methods. In particular, BRP-Net outperforms the previous best method, i.e. Spa-Net, by 0.68%, 2.35%, and 1.40% in AJI on the seen organ, unseen organ, and all testing data, respectively. We also provide qualitative comparisons in the supplementary file.

We further conduct comparisons on the CPM17 database [10] and summarize the results in Table 2(b). From the results, we can see that BRP-Net continues to achieve state-of-the-art performance. Its performance in Dice 1 and AJI exceeds that of existing approaches by 0.8% and 2.6%, respectively. The above comparisons demonstrate the effectiveness of BRP-Net.

4 Conclusion

In this paper, we propose the Boundary-assisted Region Proposal Network (BRP-Net) for nucleus segmentation. BRP-Net contains one stage designed for obtaining instance proposals and a second stage for proposal-wise segmentation. To separate crowded nuclei, we adopt a boundary detection-based scheme for the first stage. We further propose a novel Task-aware Feature Encoding (TAFE) network with Feature Fusion Modules to achieve this goal. The second stage is introduced to segment proposals of various sizes, and it enables BRP-Net to be robust to the variation of post-processing hyper-parameters in the first stage. Finally, BRP-Net achieves strong performance on both the Kumar and CPM17 datasets.


Changxing Ding is supported by NSF of China under Grant 61702193 and U1801262, the Science and Technology Program of Guangzhou under Grant 201804010272, the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant 2017ZT07X183, and the Fundamental Research Funds for the Central Universities of China under Grant 2019JQ01. Dacheng Tao is supported by Australian Research Council Project FL-170100117.  


  • [1] Chen, H., Qi, X., Yu, L., Heng, P.A.: DCAN: deep contour-aware networks for accurate gland segmentation. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2487–2496 (2016)

  • [2] Oda, H., Roth, H.R., Chiba, K., Sokolić, J., Kitasaka, T., Oda, M., Hinoki, A., Uchida, H., Schnabel, J.A., Mori, K.: Besnet: boundary-enhanced segmentation of cells in histopathological images. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 228–236. Springer (2018)
  • [3] Zhou, Y., Onder, O.F., Dou, Q., Tsougenis, E., Chen, H., Heng, P.A.: Cia-net: Robust nuclei instance segmentation with contour-aware information aggregation. In: International Conference on Information Processing in Medical Imaging. pp. 682–693. Springer (2019)
  • [4] Naylor, P., Laé, M., Reyal, F., Walter, T.: Segmentation of nuclei in histopathology images by deep regression of the distance map. IEEE Trans. Med. Imaging 38(2), 448–459 (2018)
  • [5] Graham, S., Vu, Q.D., Raza, S. E A., Azam, A., Tsang, Y. W., Kwak, J. T., Rajpoot, N.: Hover-Net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images. Medical Image Analysis, 58, 101563 (2019).
  • [6] Koohbanani, N.A., Jahanifar, M., Gooya, A., Rajpoot, N.: Nuclear instance segmentation using a proposal-free spatially aware deep learning framework. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 622–630. Springer (2019)

  • [7] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
  • [8] Kumar, N., Verma, R., Sharma, S., Bhargava, S., Vahadane, A., Sethi, A.: A dataset and a technique for generalized nuclear segmentation for computational pathology. IEEE Trans. Med. Imaging 36(7), 1550–1560 (2017)
  • [9] Kumar, N., et al.: A multi-organ nucleus segmentation challenge. IEEE Trans. Med. Imaging. 10.1109/TMI.2019.2947628
  • [10] Vu, Q.D., et al.: Methods for segmentation and classification of digital microscopy tissue images. Frontiers in bioengineering and biotechnology, 7, 53 (2019)
  • [11] Raza, S. E A., Cheung, L., Shaban, M., Graham, S., Epstein, D., Pelengaris, S., Khan, M., Rajpoot, N.M.: Micro-Net: A unified model for segmentation of various objects in microscopy images. Medical Image Analysis, 52, 160–173 (2019)
  • [12] Macenko M., Niethammer M., Marron J S, et al.: A method for normalizing histology slides for quantitative analysis. In: Proceedings of IEEE International Symposium on Biomedical Imaging, pp. 1107–1110 (2009)
  • [13] Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 4700-4708 (2017)
  • [14] Loshchilov, I., Hutter, F.: Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101 (2017)
  • [15] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980-2988 (2017).
  • [16] Ding, H., Qiao, S., Shen, W., Yuille, A.: Shape-aware Feature Extraction for Instance Segmentation. arXiv preprint arXiv:1911.11263 (2019).
  • [17] Milletari, F., Navab, N., Ahmadi, S.A.: V-net: fully convolutional neural networks for volumetric medical image segmentation. In: International Conference on 3D Vision (3DV), pp. 565–571. IEEE (2016)

1 Architecture Details of the Backbone Encoder

Figure 6: Architecture details of the backbone encoder in Sec. 2.1 of the main paper. We adopt DenseNet121 as the backbone. The four dense blocks contain 6, 12, 18 and 24 convolutional layers, respectively, and the growth rate is set to 32. The stride of the first convolutional layer is set to 1 and the first max-pooling layer is removed. Therefore, sizes of the obtained feature maps from the four dense blocks are 1, 1/2, 1/4 and 1/8 of the input image size, respectively. (Best viewed in color).

2 Architecture Details of the Baseline

Figure 7: Architecture details of the baseline and ‘baseline+FFMs’ in Table 1 of the main paper. Their difference is that the former adopts Information Aggregation Modules (IAMs), while the latter equips IAMs with the same residual learning scheme as Feature Fusion Modules (FFMs). ‘IAM Core’ refers to the core components for information aggregation in IAM, as illustrated in (b). The baseline and ‘baseline+FFMs’ have the same number of parameters as TAFE. (Best viewed in color).

3 Qualitative Comparisons

Figure 8: Qualitative comparisons between different models. From left to right in each row: the original image, ground truth segmentations, the predictions by CIA-Net [3], TAFE, and BRP-Net. (Best viewed in color).