Nucleus segmentation aims to detect and delineate each nucleus in microscopy images. It provides rich spatial and morphological information about nuclei and therefore plays an important role in many cell analysis applications, such as cell counting, cell tracking, phenotype classification, and treatment planning. Because manual nucleus segmentation is time-consuming, automatic nucleus segmentation methods have become increasingly necessary.
However, automatic nucleus segmentation remains challenging in terms of robustness because nuclei are crowded and their boundaries are blurry, as illustrated in Fig. 1. Unlike objects in natural images, nuclei tend to overlap with each other. As a result, the bounding box for one instance often covers other nuclei, which negatively impacts the robustness of traditional bounding box-based detection methods, such as Mask R-CNN. Another major challenge lies in the blurry boundary between touching nuclei, which makes it harder to infer their boundaries.
A large number of approaches have been proposed to handle the above challenges [3, 7, 8, 9, 11, 4, 12, 5, 17, 20, 21]. For example, Chen et al.  differentiate instances of nuclei according to their boundaries. Graham et al.  represent nucleus instances using pixel-to-centroid distance maps in both the horizontal and vertical directions. Koohbanani et al.  infer nucleus instances by clustering bounding boxes predicted on each pixel within nuclei. When attempting to finally obtain nucleus instances, the above approaches typically resort to complex post-processing operations, such as morphologic operations , watershed algorithms [11, 4, 12], and clustering . Several recent works [13, 29, 14] represent each instance using a polygon, which is realized by predicting a set of centroid-to-boundary distances. They require only light-weight post-processing operations, i.e., non-maximum suppression, to remove redundant proposals; therefore, their pipelines are more straightforward and efficient.
However, these approaches predict polygons using features of the centroid pixel for each instance only, whereas the centroid alone lacks contextual information [30, 31]. In particular, the centroid is located far away from boundary pixels for large-sized nuclei, which degrades the distance prediction accuracy. Moreover, supervision is imposed on each respective distance value and there is a lack of global constraint on the shape of each nucleus.
In this paper, we propose a Context-aware Polygon Proposal Network (CPP-Net) to improve the robustness of polygon-based methods  for nucleus segmentation. The contributions of this paper are made from three perspectives. First, CPP-Net explores more contextual information to improve the prediction accuracy for the centroid-to-boundary distances; specifically, it adopts the StarDist  model to conduct initial distance prediction along a set of pre-defined directions. It then samples a set of points between the centroid and the initially predicted boundary along each direction. As these points are closer to the boundary than the centroid pixel, their distance to the ground-truth boundary can be predicted much more accurately. Correspondingly, the initially predicted centroid-to-boundary distance value can be refined with reference to the predictions for those sampled points.
Second, the prediction confidence of these sampled points typically varies according to their feature quality. For example, the errors contained in the distances initially predicted by StarDist can be amplified in cases where some sampled points actually fall outside the nucleus. Accordingly, the weights of the sampled points should change depending on their prediction confidence. We therefore propose a Confidence-based Weighting Module (CWM) that adaptively fuses the predicted distances for these points. With the assistance of CWM, CPP-Net can more robustly utilize contextual information from the sampled points.
Third, we introduce a novel Shape-Aware Perceptual (SAP) loss, which constrains CPP-Net’s predictions regarding the nucleus shape. The original perceptual loss  penalizes the differences in the hidden feature maps of a pre-trained classification network between two input images. To encode the shape information of the nucleus into the perceptual loss, we train an encoder-decoder model that maps the representation of nucleus shape in CPP-Net, i.e., the pixel-to-boundary distance maps and the centroid probability map, to other shape representations, such as nucleus bounding boxes. By being trained in this way, this model is capable of extracting rich shape information related to nuclei. We then adopt the encoder part to extract feature maps for the predictions and the ground-truth output of CPP-Net, respectively. The SAP loss penalizes the differences between these extracted feature maps. In this way, the shapes of nuclei during training are constrained.
We conduct ablation studies on the proposed components of CPP-Net on the DSB2018 and BBBC006 databases. The experimental results justify the effectiveness of these components. Finally, we compare the performance of CPP-Net with state-of-the-art methods on DSB2018, BBBC006, and PanNuke [34, 35]; on these databases, CPP-Net consistently achieves state-of-the-art performance.
The remainder of this paper is organized as follows. Related works on nucleus segmentation are reviewed briefly in Section 2. The proposed methods are described in Section 3, while implementation details are presented in Section 4. Experimental results are presented in Section 5, along with their analysis. Finally, we conclude this paper in Section 6.
2 Related Works
A number of effective approaches for nucleus segmentation have been proposed. In this section, we divide recent research into two categories, namely traditional methods and deep-learning based methods.
Many traditional methods are based on the watershed algorithm [22, 23, 24]. For example, Malpica et al.  proposed a morphological watershed-based algorithm, which is assisted by means of empirically designed image processing operations. This approach utilizes both intensity and morphology information for nucleus segmentation. However, this is likely to cause over-segmentation, and also results in limitations in the processing of overlapping nuclei [23, 24]. Yang et al.  proposed a new marker extraction method based on condition erosion to alleviate the over-segmentation problem. Tareef et al.  proposed a Multi-Pass Fast Watershed method that adaptively and efficiently segments overlapping cervical cells. Moreover, the active contour model (ACM) has also been widely adopted for nucleus segmentation [25, 26]. For example, Molna et al.  proposed to promote the performance of ACM by exploring prior knowledge, specifically the understanding that nuclei usually have ellipse-shaped boundaries. Other traditional methods, such as level-set  and template-matching , have also been adopted for nucleus segmentation. The common downside of traditional methods is that they typically require hand-crafted features, which depend on human expertise and have limitations in terms of their representation power.
In recent years, deep-learning based approaches have achieved notable success on nucleus segmentation tasks [6, 3, 7, 8, 9, 10, 11, 15, 4, 12, 5, 17, 20, 21, 16, 14, 18, 13, 19]. These works can be further categorized into two-stage and one-stage methods.
Two-stage methods consist of a detection stage, which locates nucleus instances, and a segmentation stage, which predicts a foreground mask for each instance. One representative method of this kind is Mask R-CNN [2, 18], which detects nucleus instances using bounding boxes. However, the shape of nuclei tends to be elliptical, and severe occlusion typically exists between instances; this means that each bounding box may contain pixels representing two or more instances, indicating that bounding boxes may be ultimately sub-optimal for nucleus segmentation [13, 16]. To handle this problem, SpaNet  detects instance centroids and performs semantic segmentation in its first stage. In its second stage, it predicts the bounding box of the associated instance according to the feature of each foreground pixel. Finally, it separates overlapping nuclei by clustering the above pixel-wise predictions using the centroids as clustering centers. Moreover, BRP-Net  is also a two-stage network. It includes a detection stage, which generates region proposals based on instance boundaries, and a refinement stage, which refines the foreground area of each instance. Notably, neither SpaNet  nor BRP-Net  is designed in an end-to-end manner, which increases the complexity of the entire system.
By contrast, one-stage methods adopt a single network. Based on the network prediction, they utilize post-processing operations to obtain nucleus instances. Depending on the network prediction property being utilized, one-stage methods can be further subdivided into classification-based models and regression-based models.
As the name suggests, classification-based models output classification probability maps. Existing works in this sub-category include boundary-based [6, 9, 10, 3, 7, 8] and connectivity-based methods. Boundary-based methods typically include a boundary detection branch and a semantic segmentation branch [3, 7, 8]; for example, DCAN constructs two separate decoders for boundary detection and semantic segmentation, respectively. Because these two tasks are related, BES-Net and CIA-Net respectively introduce uni- and bi-directional connections between the two branches. These methods process images in the RGB color space. In comparison, Zhao et al. leveraged the optical characteristics of Hematoxylin and Eosin (H&E) staining, and proposed a Hematoxylin-aware Triplet U-Net, which makes predictions with reference to the extracted Hematoxylin component in the image. By subtracting instance boundaries from the segmentation maps, overlapping nuclei can be separated; the downside is that such a subtraction operation may reduce segmentation accuracy. Moreover, we categorize PatchPerPix as a connectivity-based method, since its prediction indicates whether a pixel is located in the same instance as each of its neighbors. Because it describes the local shape of instances in small patches, PatchPerPix is capable of segmenting instances with sophisticated shapes.
In comparison, the regression-based models output regression maps, e.g., distances or coordinate offsets for each pixel of the input image. For example, HoVer-Net  predicts the distances from each foreground pixel to its corresponding nucleus centroid in both the horizontal and vertical directions. It then employs the marker-controlled watershed algorithm as post-processing to obtain nucleus instances. The performance of these approaches is affected by the empirically designed post-processing strategies. Recently, Schmidt et al.  proposed the StarDist approach, which predicts both the centroid probability maps and distances from each foreground pixel to its associated instance boundary along a set of pre-defined directions. In the post-processing step, StarDist generates polygon proposals based on the set of predicted distances for each centroid pixel. Each polygon represents one nucleus instance. In this method, polygons are predicted using the features of the centroid pixel only; as a result, contextual information for large-sized nucleus instances is lacking, which affects the prediction accuracy.
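The polygon representation described above can be illustrated with a minimal sketch. It assumes that the pre-defined directions are evenly spaced rays around the centroid, which the text does not state explicitly; the function names `polygon_from_distances` and `polygon_area` are hypothetical and are not part of the StarDist code base.

```python
import math


def polygon_from_distances(cx, cy, distances):
    """Convert K centroid-to-boundary distances into polygon vertices.

    Assumption: direction k points at angle 2*pi*k/K (evenly spaced rays).
    """
    k_total = len(distances)
    vertices = []
    for k, d in enumerate(distances):
        angle = 2.0 * math.pi * k / k_total
        vertices.append((cx + d * math.cos(angle), cy + d * math.sin(angle)))
    return vertices


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    area = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        area += x0 * y1 - x1 * y0
    return abs(area) / 2.0
```

With four unit distances around the origin, the sketch yields a diamond of area 2; more directions give a closer approximation of a round nucleus.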
Our proposed CPP-Net is a one-stage method and relates closely to StarDist. CPP-Net improves the robustness of StarDist by integrating rich contextual information from a sampled point set for each centroid pixel. Moreover, CPP-Net adopts a novel Shape-Aware Perceptual loss that constrains CPP-Net’s predictions according to the shape prior of nuclei.
Fig. 2 presents the structure of CPP-Net for nucleus segmentation. The backbone of CPP-Net is a simple U-Net. Three parallel convolutional (Conv) layers are attached to the backbone. These layers predict the pixel-to-boundary distance maps , the confidence maps , and the centroid probability map , respectively. and represent the height and width of the image, respectively. For clarity, we denote the coordinate space of the input image as and the total number of elements in as . The same as , each element in the -th channel of refers to the distance between a foreground pixel and the boundary of its associated instance along the -th pre-defined direction. denotes the number of total directions. Elements in indicate the probability of each foreground pixel being the instance centroid.
In what follows, we first propose a Context Enhancement Module (CEM), which samples a point set to explore more contextual information for pixel-to-boundary distance prediction. We then design a Confidence-based Weighting Module (CWM) that adaptively combines the predictions from the sampled points. Finally, we introduce the Shape-Aware Perceptual (SAP) loss, which further improves the segmentation accuracy.
3.2 Context Enhancement Module
The nucleus segmentation task comprises two subtasks: instance detection and instance-wise segmentation. The recently developed StarDist approach  performs these two subtasks in parallel. The first detects the centroid of each nucleus, whereas the second segments each instance using a polygon, which is represented using the distances from the centroid pixel to the instance boundary along pre-defined directions. In , the distances are predicted using only the features of the centroid. However, the size of nuclei may vary dramatically, meaning that the centroid pixel alone may lack contextual information for precise distance predictions.
To handle the above problem, we propose CEM, which utilizes pixels that are closer to the boundaries to refine the distance prediction. To achieve this goal, CEM first samples points between each pixel and its predicted boundary position along each direction. It then merges the predicted pixel-to-boundary distances of the points, and adaptively updates the pixel-to-boundary distance of the initial pixel. Formally speaking, the refined pixel-to-boundary distance along the -th direction for one pixel can be obtained as follows:
where denotes the initially predicted pixel-to-boundary distance in along the -th direction for . , where indexes the sampling directions. is equal to . In this paper, we uniformly sample the points between the initial pixel and its predicted boundary along each specified direction. The coordinates for the -th sampled point are accordingly computed as follows:
Finally, in Eq. (1) denotes the weight of the -th sampled point. One simple weighting strategy for use is averaging, i.e., setting all to .
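The refinement idea can be sketched for a single pixel and a single direction as follows. This is an illustration under stated assumptions rather than the exact form of Eq. (1): each sampled point contributes its offset from the initial pixel plus its own predicted distance to the boundary, the points are read with nearest-neighbour sampling, and the initial pixel itself is not included among the sampled points. The name `refine_distance` and its arguments are hypothetical.

```python
def refine_distance(dist_map, x, y, direction, n_points, weights=None):
    """Refine the distance predicted at pixel (x, y) along one direction.

    dist_map[row][col] holds the initial pixel-to-boundary distance along
    `direction`, a unit vector (dx, dy). Assumption: nearest-neighbour lookup.
    """
    d0 = dist_map[y][x]
    dx, dy = direction
    height, width = len(dist_map), len(dist_map[0])
    estimates = []
    for i in range(1, n_points + 1):
        # Points are spread uniformly between the pixel and its predicted boundary.
        step = d0 * i / n_points
        xi = min(max(int(round(x + step * dx)), 0), width - 1)
        yi = min(max(int(round(y + step * dy)), 0), height - 1)
        # Offset travelled so far plus the distance predicted at the sampled point.
        estimates.append(step + dist_map[yi][xi])
    if not estimates:
        return d0  # no sampled points: keep the initial prediction
    if weights is None:
        weights = [1.0 / len(estimates)] * len(estimates)  # averaging strategy
    return sum(w * e for w, e in zip(weights, estimates))
```

In a toy row where the true boundary lies 10 pixels away but the initial pixel predicts 8, the sampled points (which are closer to the boundary and assumed accurate here) pull the refined value back to 10.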
3.3 Confidence-based Weighting Module
Although the averaging strategy is effective for Eq. (1), it is also sub-optimal as it neglects the impact of prediction quality on the sampled points. Prediction quality is affected by both image quality and the position of the sampled points. In particular, sampled points near to the boundary may actually lie outside of the nucleus, as in Eq. (1) contains errors. Therefore, the prediction accuracy on the sampled points is variable. Accordingly, we propose a Confidence-based Weighting Module (CWM) that adaptively fuses predictions on these sampled points.
As Fig. 2 illustrates, we attach an extra Conv layer to the backbone model in order to produce confidence maps , the sizes of which are the same as those of . Each element in measures the prediction confidence for the corresponding element in . We then perform sampling on both and using coordinates computed according to Eq. (2) and Eq. (3) along each sampling direction, respectively. Sizes of the resulting tensors are therefore for each direction. The tensor sampled from is fed into a Conv layer and a Softmax layer. The output dimension of the Conv layer is also . The Softmax layer outputs the normalized weights; these normalized weights are used as in Eq. (1). It is worth noting that the sampling directions share the parameters of the Conv layer.
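The effect of confidence-based weighting can be sketched in a simplified form. The sketch below applies a softmax directly to the sampled confidence scores and omits the learned Conv layer that precedes the Softmax in CWM, so it is only an approximation of the module; `softmax` and `fuse_with_confidence` are hypothetical helper names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_with_confidence(estimates, confidences):
    """Fuse per-point distance estimates using softmax-normalized confidences.

    Simplification: the learned Conv layer applied before the Softmax is omitted.
    """
    weights = softmax(confidences)
    fused = sum(w * e for w, e in zip(weights, estimates))
    return fused, weights
```

When all confidences are equal, this reduces to the averaging strategy; a sampled point with a much lower confidence, for example one that falls outside the nucleus, contributes almost nothing to the fused distance.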
3.4 Loss Functions
The StarDist model  utilizes two loss terms: the binary cross entropy loss for centroid probability prediction, and the weighted L1 loss for pixel-to-boundary distance regression. These two loss terms are formulated as follows:
where and represent elements in the ground-truth and predicted centroid probability maps, respectively. We follow the same process as that outlined in  to obtain the ground-truth centroid probability map, i.e., utilizing the normalized pixel-to-boundary distance map as centroid probability map. and denote elements of the ground-truth and predicted pixel-to-boundary maps respectively along the -th direction.
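As a rough illustration of these two loss terms, the following sketch computes a binary cross-entropy over centroid probabilities and an L1 distance loss weighted per pixel. The weighting by the ground-truth centroid probability and the normalization by the number of pixels and directions are assumptions made for this sketch; the function names are hypothetical.

```python
import math


def bce_loss(p_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy between ground-truth and predicted probabilities."""
    total = 0.0
    for t, p in zip(p_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(p_true)


def weighted_l1_loss(p_true, d_true, d_pred):
    """L1 loss over K directions per pixel.

    Assumption: each pixel is weighted by its ground-truth centroid probability.
    """
    total = 0.0
    for w, dt, dp in zip(p_true, d_true, d_pred):
        total += w * sum(abs(a - b) for a, b in zip(dt, dp)) / len(dt)
    return total / len(p_true)
```

With this weighting, background pixels (probability 0) do not contribute to the distance loss, however wrong their predicted distances are.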
For CPP-Net, there are two predicted distance maps, namely and . is predicted by the backbone model, while represents the final pixel-to-boundary distance prediction by CPP-Net. Accordingly, we modify Eq. (5) for CPP-Net as follows:
where denotes the refined pixel-to-boundary distance in along the -th direction for .
Eq. (5) and Eq. (7) penalize the prediction error in each respective pixel-to-boundary distance value, while the overall shapes of nucleus instances are ignored. In fact, nucleus instances typically have similar shapes; this can be utilized as prior knowledge to facilitate accurate nucleus segmentation. However, it is challenging to explicitly represent the overall shape of a single nucleus instance. To deal with this problem, we adopt an implicit approach inspired by the perceptual loss , which was proposed for style transfer and super-resolution tasks. In , a network pre-trained for image classification on ImageNet is used as a feature extractor, and the differences between the features extracted from a pair of images are penalized. This approach encourages the high-level information of the two images to be similar. Inspired by the original perceptual loss, we propose a Shape-Aware Perceptual (SAP) loss for nucleus segmentation.
The aim of the SAP loss is to penalize the differences in shape feature between the predicted and ground-truth nucleus representations. To encode the shape information in a deep model, we propose transforming the nucleus representations in CPP-Net, i.e., the pixel-to-boundary distance maps and the centroid probability map , to other representation forms [13, 8, 4, 5]. This transformation is accomplished using an encoder-decoder structure as illustrated in Fig. 3.
This paper mainly considers two nucleus representation strategies: first, the semantic segmentation and boundary detection maps in boundary-based approaches ; second, the location and the size of the associated bounding box for each nucleus. During training of the transformation model, we concatenate the ground-truth and for each image to create the inputs. The binary cross-entropy loss and L1 loss are adopted for the two target representation strategies, respectively.
After training is completed, we adopt the encoder of the transformation model for the SAP loss to train CPP-Net. The SAP loss can be formulated as follows:
where denotes the 2D coordinate space of the extracted shape-aware feature maps, while and are the vectors in and at the location of , respectively. Moreover, denotes the encoder of the pre-trained transformation model. The parameters of are fixed during the training of CPP-Net. Finally, the entire loss of CPP-Net is summarized as follows:
In the interests of simplicity, we adopt equal weights for the three terms in .
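The structure of the SAP loss can be sketched as a feature-space distance computed with a frozen encoder. The sketch assumes a squared Euclidean distance averaged over feature locations, which may differ from the norm used in the actual formulation; `toy_encoder` is a hypothetical stand-in for the encoder of the pre-trained transformation model and operates on a 1D list for brevity.

```python
def sap_loss(encoder, prediction, ground_truth):
    """Mean squared feature distance between predicted and ground-truth representations.

    `encoder` is treated as fixed: it is only called, never updated.
    Assumption: squared L2 distance, averaged over feature locations.
    """
    f_pred = encoder(prediction)
    f_true = encoder(ground_truth)
    total = 0.0
    for v_pred, v_true in zip(f_pred, f_true):
        total += sum((a - b) ** 2 for a, b in zip(v_pred, v_true))
    return total / len(f_pred)


def toy_encoder(rep):
    """Hypothetical encoder: a value and a local difference at each location."""
    return [(rep[i], rep[i] - rep[i - 1]) for i in range(1, len(rep))]
```

Because the loss is computed on encoder features rather than on individual distance values, an error that changes the local shape of the representation is penalized through several feature channels at once.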
4 Experimental Setup
Data Science Bowl 2018 (DSB2018)  is a nucleus detection and segmentation competition, for which a dataset of 670 images with manual annotations is available. To facilitate fair comparisons with existing approaches, we follow the evaluation protocol outlined in . In this protocol, the training, validation, and testing sets include 380, 67, and 50 images, respectively.
Images in BBBC006  were captured from one 384-well microplate containing stained U2OS cells. Two fields of view are selected for each well to obtain images. There are two images for each field of view: one Hoechst image and one phalloidin image. Accordingly, BBBC006 contains 1,536 images from 768 fields of view. In our experiments, we randomly divide the dataset into training, validation, and testing sets, which contain 924, 306, and 306 images, respectively.
PanNuke [34, 35] contains patches from a total of 19 different tissue types. The nuclei are classified into neoplastic, inflammatory, connective/soft tissue, dead, and epithelial cells. We follow the evaluation protocol outlined in , which divides the patches into three folds containing 2,657, 2,524, and 2,723 images, respectively. Three different dataset splits are then made based on these three folds. One fold of data is used for training, with the remaining two folds used as validation and testing sets, respectively.
4.2 Implementation Details
On DSB2018 and BBBC006, we adopt a U-Net backbone very similar to that used in  for CPP-Net to facilitate fair comparison. This backbone includes three down-sampling blocks in its encoder and three up-sampling blocks in its decoder. The only change is that we replace all Batch Normalization (BN) layers with Group Normalization (GN) layers , since we use a small batch size of 1 for training. On PanNuke, we make two changes to this backbone. First, to ensure fair comparison with existing approaches , we replace the encoder of this backbone with ResNet-50 , and initialize its weights with those pre-trained on ImageNet . Second, we attach another decoder to classify the nucleus type of each input image pixel. The loss function for this decoder is the sum of the cross-entropy loss and the Dice loss.
We adopt a deeper structure for the encoder-decoder model in the SAP loss. This model includes four down-sampling and four up-sampling blocks that are used to extract more high-level information. The other details of the architecture are the same as the U-Net backbone in CPP-Net, except that the encoder-decoder model does not utilize shortcuts.
The Adam algorithm  is employed for optimization. The initial learning rate is set to , and is multiplied by 0.5 whenever the validation loss stops decreasing. The training process halts if the learning rate falls below . We adopt online data augmentation of random rotation and horizontal flipping during training. For the encoder-decoder model, we use the same training settings as outlined above, except that data augmentation is not employed.
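The learning-rate schedule and stopping rule can be sketched as below. The sketch assumes that the rate is halved as soon as the validation loss fails to improve on its best value (that is, a patience of zero epochs), which the text does not specify; `initial_lr` and `min_lr` are left as parameters, and the values used to exercise the function are purely illustrative placeholders.

```python
def train_schedule(val_losses, initial_lr, min_lr, factor=0.5):
    """Return the learning rate used at each epoch until training halts.

    Assumption: the rate is reduced immediately whenever the validation loss
    does not improve on the best value seen so far (patience of zero).
    """
    lr = initial_lr
    best = float("inf")
    history = []
    for loss in val_losses:
        history.append(lr)
        if loss < best:
            best = loss
        else:
            lr *= factor
            if lr < min_lr:
                break  # learning rate fell below the threshold: stop training
    return history
```

For example, with placeholder values `initial_lr=1.0` and `min_lr=0.2`, three consecutive non-improving epochs reduce the rate to 0.125 and end training.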
4.3 Evaluation Metrics
For DSB2018 and BBBC006, we adopt the same evaluation metric as in and . According to this metric, the average precision (AP) is computed at IoU thresholds ranging from 0.5 to 0.9 with a step size of 0.05. For the PanNuke database, we adopt the Panoptic Quality (PQ) presented in  as the evaluation metric. PQ has been widely adopted in panoptic segmentation tasks and was introduced into nucleus segmentation in . We report the PQs of all 19 tissues. In addition, both multi-class PQ (mPQ) and binary PQ (bPQ) are computed for evaluation. The mPQ averages the PQ performance on each of the five nucleus categories, while the bPQ directly computes the overall performance on images of all five nucleus categories.
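A minimal sketch of both metrics is given below. It assumes that predictions have already been matched one-to-one with ground-truth instances, that AP at a threshold is defined as TP / (TP + FP + FN) as in the Data Science Bowl convention, and that PQ uses the standard definition (sum of IoUs of matched pairs with IoU above 0.5, divided by TP + 0.5 FP + 0.5 FN). These definitions are assumptions of the sketch, and the function names are hypothetical.

```python
def ap_at_threshold(pair_ious, n_pred, n_true, threshold):
    """AP = TP / (TP + FP + FN) at one IoU threshold (assumed definition)."""
    tp = sum(1 for iou in pair_ious if iou > threshold)
    fp = n_pred - tp
    fn = n_true - tp
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def mean_ap(pair_ious, n_pred, n_true):
    """Mean AP over IoU thresholds 0.50, 0.55, ..., 0.90."""
    thresholds = [t / 100 for t in range(50, 95, 5)]
    scores = [ap_at_threshold(pair_ious, n_pred, n_true, t) for t in thresholds]
    return sum(scores) / len(scores)


def panoptic_quality(pair_ious, n_pred, n_true):
    """PQ = sum of matched IoUs / (TP + 0.5 * FP + 0.5 * FN), matches need IoU > 0.5."""
    matched = [iou for iou in pair_ious if iou > 0.5]
    tp = len(matched)
    denom = tp + 0.5 * (n_pred - tp) + 0.5 * (n_true - tp)
    return sum(matched) / denom if denom else 0.0
```

Under these definitions, raising the IoU threshold turns loosely fitting matches into both a false positive and a false negative, which is why APs at high thresholds are sensitive to boundary accuracy.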
5 Experimental Results
In what follows, we first conduct experiments on two publicly available databases, DSB2018  and BBBC006 , to determine the optimal number of sampling points and demonstrate the effectiveness of the CEM module. We then justify the effectiveness of the CWM module and the SAP loss. Finally, we compare the performance of CPP-Net with other methods on all three databases.
5.1 Evaluation of CEM
In this experiment, we evaluate the optimal number of sampling points in CEM. To facilitate clean comparison, we remove the SAP loss for CPP-Net, and consistently adopt CWM as the weighting strategy in Eq. (1). We further change the number of sampling points, i.e., , from 0 to 7, and report the experimental results in Table 1. When is equal to 0, CPP-Net reduces to the StarDist model . As Table 1 shows, the performance of CPP-Net continues to improve as increases from 0 to 6; however, its performance saturates when exceeds 6. Therefore, we consistently set to 6 in the following experiments.
It is clear that a single sampling point alone is able to significantly boost the APs on both databases, especially for APs under high IoU thresholds. Moreover, when is equal to 6, CEM improves the mean APs by on the DSB2018 database and on the BBBC006 database. The above experiments justify the effectiveness of CEM.
5.2 Evaluation of CWM
The results of the ablation study on the CWM module are summarized in Table 2. In this table, ‘baseline’ refers to the StarDist model , i.e., setting in CPP-Net to 0. In addition to CWM, another two weighting strategies are evaluated. ‘Equal weights’ denotes the averaging weighting strategy for Eq. (1), while ‘Naïve attention’ represents learning fixed weights for the points in Eq. (1), using a trainable vector with elements.
It is shown that CEM consistently outperforms the baseline model by large margins, regardless of the specific weighting strategy in Eq. (1). Moreover, compared with the other two weighting strategies, CWM achieves the best mean AP performance. CWM’s advantage lies mainly in its APs under high IoU thresholds, which indicates that the instance segmentation accuracy is increased. This performance improvement can be ascribed to the superior flexibility of CWM. In short, unlike the two weighting strategies that adopt fixed weights, CWM can adaptively weigh each sampled point according to the quality of its features. The above experimental results justify the effectiveness of CWM.
5.3 Evaluation of the SAP Loss
In this experiment, we justify the effectiveness of the SAP loss. Utilizing the SAP loss requires pre-training an encoder-decoder model that transforms the instance representations in CPP-Net to other types of representations (as described in Section 3.4). Accordingly, we evaluate the following three types of representation strategies for the SAP loss. The first strategy is boundary-based, in that it predicts both semantic segmentation masks and instance boundaries [3, 7, 8]; the second strategy is bounding box-based, in that it regresses both the coordinates of nucleus centroids and bounding box positions for each pixel inside one instance . The third strategy predicts both the above mentioned representations. For simplicity, these three strategies are denoted as ‘seg & bnd’, ‘bbox’, and ‘both’ in Table 3.
In Table 3, we first show the performance of CPP-Net without using the SAP loss. On both datasets, the SAP loss promotes performance in terms of mean AP. Specifically, the SAP loss improves the mean AP by on DSB2018 and on BBBC006. Furthermore, it is also clear that the improvement is mainly from APs under high IoU thresholds: for example, , , , , and improvements on on DSB2018. For APs with lower IoU values, SAP loss does not introduce significant performance promotion. From this phenomenon, we can conclude that the SAP loss primarily penalizes the prediction error in nucleus shape, rather than the localization or detection errors.
We also train the CPP-Net with another variant of the SAP loss, in which the encoder-decoder model is trained to reconstruct its input representations, i.e., the ground-truth centroid probability and pixel-to-boundary distance maps. The results of CPP-Net trained with this variant are denoted as ‘recons.’ in Table 3. The results show that the proposed SAP loss achieves better performance than this variant. The advantage achieved by our proposed SAP loss can be attributed to the transformation between different representation strategies. Through the use of this transformation task, the encoder-decoder model is forced to extract essential information related to the nucleus shape. By contrast, the ‘recons.’ variant is likely to only memorize the input information. Accordingly, our proposed SAP loss achieves better overall performance than all other three variants. In the following, we adopt our proposed SAP loss to train CPP-Net.
5.4 Qualitative Comparisons
In this experiment, we conduct qualitative comparisons between StarDist , CPP-Net (w/o SAP loss), and CPP-Net trained with SAP Loss, the results of which are presented in Fig. 4. As is shown in the first and second rows, StarDist may mistakenly segment a single nucleus instance into multiple nuclei; for its part, CPP-Net achieves more robust segmentation. Results in the third and fourth rows further indicate that the predictions of CPP-Net are more accurate regarding instance boundaries (e.g., the concave areas along nucleus boundaries). This can be attributed to CEM’s ability to explore more contextual information for centroid-to-boundary distance prediction. Finally, the SAP loss further corrects nucleus shape prediction errors, e.g., the highlighted instances in the lower-left and upper-right corners of the first example image. The above qualitative comparisons justify the effectiveness of the CEM module and SAP loss, respectively.
5.5 Comparisons with State-of-the-Art Methods
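Table 4 (excerpt): AP at IoU thresholds from 0.50 to 0.90 and mean AP.
| Database | Method | AP0.50 | AP0.55 | AP0.60 | AP0.65 | AP0.70 | AP0.75 | AP0.80 | AP0.85 | AP0.90 | Mean AP |
| DSB2018 | Mask R-CNN | 0.8323 | 0.8051 | 0.7728 | 0.7299 | 0.6838 | 0.5974 | 0.4893 | 0.3525 | 0.1891 | 0.6058 |
Table 5 (excerpt): mPQ and bPQ on PanNuke, with one (mPQ, bPQ) column pair per compared method.
| Tissue | mPQ | bPQ | mPQ | bPQ | mPQ | bPQ | mPQ | bPQ | mPQ | bPQ | mPQ | bPQ | mPQ | bPQ |
| Head & Neck | 0.3946 | 0.5457 | 0.3668 | 0.5242 | 0.4530 | 0.6331 | 0.4613 | 0.6331 | 0.4596 | 0.6244 | 0.4768 | 0.6433 | 0.4667 | 0.6468 |
| Average across tissues | 0.3688 | 0.5528 | 0.4059 | 0.6053 | 0.4629 | 0.6596 | 0.4625 | 0.6485 | 0.4700 | 0.6609 | 0.4744 | 0.6692 | 0.4817 | 0.6767 |
| STD across splits | 0.0047 | 0.0076 | 0.0082 | 0.0050 | 0.00760 | 0.0036 | 0.0078 | 0.0054 | 0.0082 | 0.0062 | 0.0037 | 0.0014 | 0.0057 | 0.0018 |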
5.5.1 Comparisons on the DSB2018 database
We compare the performance of CPP-Net with Mask-RCNN [13, 2], KeypointGraph, HoVer-Net, PatchPerPix, and StarDist. The results of this comparison are tabulated in Table 4. Notably, some of the above-mentioned methods were evaluated using different training and testing data split protocols in their respective papers. In the interests of fair comparison, we evaluate the performance of HoVer-Net  and KeypointGraph  ourselves using the code released by the authors, under the same evaluation protocol as [13, 17]. We also reimplement the StarDist approach on DSB2018 and replace its BN layers with GN layers. As a result, we obtain better performance than that reported in .
As shown in Table 4, StarDist and PatchPerPix are two powerful approaches and have their own respective advantages. Specifically, StarDist achieves higher than PatchPerPix, but much lower APs under high IoU thresholds. We conjecture StarDist may be affected by prediction accuracy regarding the shape of nucleus boundaries. This is because StarDist adopts the features of centroid pixels only for shape prediction; however, the centroid pixel alone lacks contextual information. In comparison, CPP-Net consistently achieves better performance than StarDist; in particular, it significantly improves the performance at high IoU thresholds. Finally, CPP-Net achieves the best mean AP performance among all methods. The above comparison experiments justify the effectiveness of CPP-Net.
We further summarize the inference time of different models in Table 6. Here, inference time includes the network prediction time and the associated post-processing time. We compare the inference time under the same hardware conditions: one NVIDIA TITAN Xp GPU, Intel(R) Core(TM) i7-6850K CPU @3.60GHz, and 128GB RAM. As shown in Table 6, StarDist  is the fastest among all compared approaches, while CPP-Net increases costs by only around relative to StarDist. Compared with other approaches presented in Table 6, CPP-Net and StarDist are more efficient owing to their light-weight backbone and their simple post-processing operations.
5.5.2 Comparisons on the BBBC006 database
To facilitate fair comparison, we train StarDist , HoVer-Net , KeypointGraph , and InstanceEmbedding  using the same data split protocol as ours. Experimental results are summarized in Table 4. As the table shows, similar to the results on DSB2018, the StarDist model achieves a promising detection score but an unsatisfactory segmentation score. By contrast, the proposed CPP-Net improves the nucleus segmentation performance while maintaining its advantage in nucleus detection. It also continues to outperform all other state-of-the-art methods. Experimental results on this database further justify the effectiveness of CPP-Net. Moreover, it is worth noting that BBBC006 consists of two types of images, specifically Hoechst images and phalloidin images. The latter image type contains a significant amount of noise, which affects the performance of KeypointGraph  and InstanceEmbedding . In comparison, StarDist, CPP-Net, and HoVer-Net continue to achieve promising results, which shows their robustness when processing noisy images.
5.5.3 Comparisons on the PanNuke database
We report the performance of StarDist and CPP-Net with two different backbones. The first backbone adopts the same encoder as that used on the DSB2018 database, while the second employs ResNet-50 as the encoder. Their performance is compared with that of Mask-RCNN , Micro-Net , and HoVer-Net  in Table 5. We further adopt the same evaluation metrics as those in . In Table 5, both bPQ and mPQ are computed for each of the 19 tissues.
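For readers unfamiliar with these metrics, the following sketch assumes the standard panoptic quality definition, PQ = (sum of IoUs over matched pairs) / (TP + 0.5 FP + 0.5 FN) with matches at IoU > 0.5. It takes bPQ as the PQ with all nuclei treated as a single class and mPQ as the mean of the per-class PQ values. The function names and toy data are assumptions for illustration, and the official PanNuke evaluation may differ in details such as per-image or per-tissue averaging.

```python
import numpy as np

def pairwise_iou(gt, pred):
    """IoU between every ground-truth and predicted instance (label 0 = background)."""
    gt_ids = np.unique(gt)
    gt_ids = gt_ids[gt_ids != 0]
    pred_ids = np.unique(pred)
    pred_ids = pred_ids[pred_ids != 0]
    iou = np.zeros((len(gt_ids), len(pred_ids)))
    for i, g in enumerate(gt_ids):
        g_mask = gt == g
        for j, p in enumerate(pred_ids):
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            if inter:
                iou[i, j] = inter / np.logical_or(g_mask, p_mask).sum()
    return iou

def panoptic_quality(gt, pred):
    """PQ = sum of matched IoUs / (TP + 0.5 * FP + 0.5 * FN), matching at IoU > 0.5."""
    iou = pairwise_iou(gt, pred)
    matched = iou > 0.5
    tp = int(matched.sum())
    fp = iou.shape[1] - tp
    fn = iou.shape[0] - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    # NaN marks a class that is absent from both maps, so it can be skipped later.
    return float(iou[matched].sum() / denom) if denom else float("nan")

def binary_and_multiclass_pq(gt_inst, gt_cls, pred_inst, pred_cls, classes):
    """bPQ ignores class labels; mPQ averages PQ over the given classes."""
    bpq = panoptic_quality(gt_inst, pred_inst)
    per_class = [
        panoptic_quality(np.where(gt_cls == c, gt_inst, 0),
                         np.where(pred_cls == c, pred_inst, 0))
        for c in classes
    ]
    return bpq, float(np.nanmean(per_class))

# Toy example: two nuclei of classes 1 and 2; the second is predicted
# slightly too narrow and is also assigned the wrong class.
gt_inst = np.zeros((8, 8), dtype=int)
gt_inst[0:4, 0:4] = 1
gt_inst[4:8, 4:8] = 2
gt_cls = np.where(gt_inst == 2, 2, np.where(gt_inst == 1, 1, 0))
pred_inst = np.zeros((8, 8), dtype=int)
pred_inst[0:4, 0:4] = 1
pred_inst[4:8, 5:8] = 2
pred_cls = np.where(pred_inst > 0, 1, 0)

bpq, mpq = binary_and_multiclass_pq(gt_inst, gt_cls, pred_inst, pred_cls, classes=[1, 2])
print(bpq, mpq)
```

In the toy example the misclassified nucleus leaves bPQ high but lowers mPQ, which shows why the two metrics are reported together.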
As the experimental results in Table 5 demonstrate, CPP-Net consistently outperforms StarDist with each of the two backbones. Moreover, when CPP-Net is equipped with the same ResNet-50 backbone as HoVer-Net, it achieves better average performance than all other methods; for example, it outperforms StarDist in both mPQ and bPQ. The results of the above comparisons are consistent with those on the first two databases, which further justifies the effectiveness of CPP-Net.
In this paper, we improve the performance of StarDist in two respects. First, we propose a Context Enhancement Module (CEM) that enables us to explore more contextual information and accordingly predict the centroid-to-boundary distances more robustly, especially for large-sized nuclei. We further propose a Confidence-based Weighting Module that adaptively fuses the predictions of the sampled points in the CEM. Second, we propose a Shape-Aware Perceptual loss, which constrains the high-level shape information contained in the centroid probability and pixel-to-boundary distance maps. We conduct extensive ablation studies to justify the effectiveness of each proposed component. Finally, our proposed CPP-Net model is found to significantly outperform the StarDist model and achieve state-of-the-art performance on three popular datasets for nucleus segmentation.
- [1] J.C. Caicedo et al., “Nucleus segmentation across imaging experiments: the 2018 Data Science Bowl,” Nat. Methods, vol. 16, no. 12, pp. 1247-1253, Oct. 2019.
- [2] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 386-397, 2020.
- [3] H. Chen, X. Qi, L. Yu, and P.A. Heng, “DCAN: Deep contour-aware networks for accurate gland segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2487-2496.
- [4] S. Graham et al., “HoVer-Net: Simultaneous segmentation and classification of nuclei in multi-tissue histology,” Med. Image Anal., vol. 58, p. 101563, Dec. 2019.
- [5] N.A. Koohbanani, M. Jahanifar, A. Gooya, and N. Rajpoot, “Nuclear instance segmentation using a proposal-free spatially aware deep learning framework,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), Oct. 2019, pp. 622-630.
- [6] N. Kumar, R. Verma, S. Sharma, S. Bhargava, A. Vahadane, and A. Sethi, “A dataset and a technique for generalized nuclear segmentation for computational pathology,” IEEE Trans. Med. Imag., vol. 36, no. 7, pp. 1550-1560, Jul. 2017.
- [7] H. Oda et al., “BESNET: Boundary-enhanced segmentation of cells in histopathological images,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), Sep. 2018, pp. 228-236.
- [8] Y. Zhou, O.F. Onder, Q. Dou, E. Tsougenis, H. Chen, and P.A. Heng, “CIA-net: Robust nuclei instance segmentation with contour-aware information aggregation,” in Proc. IPMI, 2019, pp. 682-693.
- [9] B. Zhao et al., “Triple U-net: Hematoxylin-aware nuclei segmentation with progressive dense feature aggregation,” Med. Image Anal., vol. 65, p. 101786, Oct. 2020.
- [10] M.W. Lafarge, E.J. Bekkers, J.P.W. Pluim, R. Duits, and M. Veta, “Roto-translation equivariant convolutional networks: Application to histopathology image analysis,” Med. Image Anal., vol. 68, p. 101849, Feb. 2021.
- [11] P. Naylor, M. Laé, F. Reyal, and T. Walter, “Segmentation of nuclei in histopathology images by deep regression of the distance map,” IEEE Trans. Med. Imag., vol. 38, no. 2, pp. 448-459, Feb. 2019.
- [12] S. Wolf et al., “The mutex watershed algorithm for efficient segmentation without seeds,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 546-562.
- [13] U. Schmidt, M. Weigert, C. Broaddus, and G. Myers, “Cell detection with star-convex polygons,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), Sep. 2018, pp. 265-273.
- [14] F.C. Walter, S. Damrich, and F.A. Hamprecht, “MultiStar: Instance segmentation of overlapping objects with star-convex polygons,” arXiv:2011.13228, 2020.
- [15] S.E.A. Raza et al., “Micro-Net: A unified model for segmentation of various objects in microscopy images,” Med. Image Anal., vol. 52, pp. 160-173, Feb. 2019.
- [16] S. Chen, C. Ding, and D. Tao, “Boundary-assisted region proposal networks for nucleus segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), Sep. 2020, pp. 279-288.
- [17] P. Hirsch, L. Mais, and D. Kainmueller, “PatchPerPix for instance segmentation,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Aug. 2020, pp. 288-304.
- [18] A.O. Vuola, S.U. Akram, and J. Kannala, “Mask-RCNN and U-Net ensembled for nuclei segmentation,” in Proc. IEEE Int. Symp. Biomed. Imag. (ISBI), Apr. 2019, pp. 208-212.
- [19] J. Yi et al., “Multi-scale cell instance segmentation with keypoint graph based bounding boxes,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), Oct. 2019, pp. 369-377.
- [20] C. Long, M. Strauch, and D. Merhof, “Instance segmentation of biomedical images with an object-aware embedding learned with local constraints,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), Oct. 2019, pp. 451-459.
- [21] N. Dietler et al., “A convolutional neural network segments yeast microscopy images with high accuracy,” Nat. Commun., vol. 11, no. 1, p. 5723, 2020.
- [22] N. Malpica et al., “Applying watershed algorithms to the segmentation of clustered nuclei,” Cytometry, vol. 28, pp. 289-297, 1997.
- [23] X. Yang, H. Li, and X. Zhou, “Nuclei segmentation using marker-controlled watershed, tracking using mean-shift, and Kalman filter in time-lapse microscopy,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 11, pp. 2405-2414, Nov. 2006.
- [24] A. Tareef et al., “Multi-pass fast watershed for accurate segmentation of overlapping cervical cells,” IEEE Trans. Med. Imag., vol. 37, no. 9, pp. 2044-2059, Sep. 2018.
- [25] P. Bamford and B. Lovell, “Unsupervised cell nucleus segmentation with active contours,” Signal Process., vol. 71, no. 2, pp. 203-213, 1998.
- [26] C. Molnar et al., “Accurate morphology preserving segmentation of overlapping cells based on active contours,” Sci. Rep., vol. 6, p. 32412, 2016.
- [27] Z. Lu, G. Carneiro, and A.P. Bradley, “An improved joint optimization of multiple level set functions for the segmentation of overlapping cervical cells,” IEEE Trans. Image Process., vol. 24, no. 4, pp. 1261-1272, Apr. 2015.
- [28] C. Chen, W. Wang, J.A. Ozolek, and G.K. Rohde, “A flexible and robust approach for segmenting cell nuclei from 2D microscopy images using supervised learning and template matching,” Cytometry A, vol. 83A, no. 5, pp. 495-507, 2013.
- [29] E. Xie et al., “PolarMask: Single shot instance segmentation with polar representation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 12193-12202.
- [30] F. Wei, X. Sun, H. Li, J. Wang, and S. Lin, “Point-set anchors for object detection, instance segmentation and pose estimation,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Aug. 2020, pp. 527-544.
- [31] Y. Meng et al., “CNN-GCN aggregation enabled boundary regression for biomedical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), Sep. 2020, pp. 352-362.
- [32] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2016, pp. 694-711.
- [33] V. Ljosa, K.L. Sokolnicki, and A.E. Carpenter, “Annotated high-throughput microscopy image sets for validation,” Nat. Methods, vol. 9, no. 7, p. 637, Jun. 2012.
- [34] J. Gamper, N.A. Koohbanani, K. Benet, A. Khuram, and N. Rajpoot, “PanNuke: An open pan-cancer histology dataset for nuclei instance segmentation and classification,” in Proc. Eur. Congr. Digit. Pathol. (ECDP), 2019, pp. 11-19.
- [35] J. Gamper et al., “PanNuke dataset extension, insights and baselines,” arXiv:2003.10778, 2020.
- [36] A.Y. Ng et al., “On spectral clustering: Analysis and an algorithm,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2002, pp. 849-856.
- [37] J. Deng et al., “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 248-255.
- [38] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. Int. Conf. Mach. Learn. (ICML), Feb. 2015, pp. 448-456.
- [39] Y. Wu and K. He, “Group normalization,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 3-19.
- [40] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778.
- [41] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully convolutional neural networks for volumetric medical image segmentation,” in Proc. Int. Conf. 3D Vis., Oct. 2016, pp. 565-571.
- [42] D.P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learn. Representations (ICLR), 2015, pp. 1-15.