Instance segmentation is a prerequisite step for biomedical image processing: it not only assigns a class label to each pixel but also separates each object within the same class. By assigning a unique ID to every object, the morphology, spatial locations, and distribution of the objects can be studied further to analyze the biological behaviors in the given images. In the digital pathology domain, nuclear pleomorphism (size and shape) contributes to tumor and cancer grading, and the spatial arrangement of cancer nuclei facilitates cancer prognostic prediction [12, 8, 2, 46]. In plant and agriculture studies, analyzing each individual leaf in plant images enables experts to learn about the plant phenotype, including the number of leaves, maturity condition, and similar cultivars, which is a key factor in understanding plant function and growth condition [30, 42, 41]. Traditional manual assessment for biomedical image instance segmentation is not suitable for current practice, as it is labor-intensive and time-consuming. Additionally, limited objectivity and reproducibility are unavoidable due to intra- and inter-observer variability. To this end, automatic and accurate methods for instance segmentation in biology images are necessary and in high demand.
There still remain some challenges in instance segmentation tasks for biomedical images. First, some background structures have a similar appearance to the foreground object, such as cytoplasm or stroma in histopathology images. Therefore, methods relying on thresholding are ineffective. Second, within the same dataset, the objects in different images have large variability in size, shape, texture, and intensity. It is caused by the various biological structures and activities when acquiring different images [43, 33]. Third, there are clusters of objects overlapping with each other. The boundaries between these touching objects are ambiguous due to nonuniform staining absorption and similar object intensity. This might result in segmenting several objects into a single one. In order to tackle these issues, deep learning based methods are prevalent and effective by learning from feature representations.
CNN based instance segmentation methods can be categorized into two types: proposal-free and proposal-based methods. For the proposal-free instance segmentation methods, each pixel is firstly assigned a class label with a semantic segmentation model. The post-processing steps are then employed to separate each foreground object within the same category, according to their morphology characteristic, structures, and spatial arrangement [21, 31, 34, 9, 4]. Although post-processing among these methods is capable of separating the connected components, they still suffer from artificial boundaries during overlapping object segmentation. Even though [21, 4, 48] focus on boundaries learning at the semantic segmentation stage, the global contextual information is still not enough to separate the touching objects, especially when their borders become unclear. On the other hand, the proposal-based instance segmentation methods incorporate the detection task with the segmentation task [13, 27]. First, the spatial location for each object is detected as a bounding box. Then, a mask generator is further employed to segment each object within the corresponding predicted bounding box. By detecting and segmenting every single object separately, the proposal-based methods are capable of separating the touching objects. However, they are limited as there is a lack of global semantic information between the foreground and background.
For instance segmentation tasks, both the global semantic and the local instance information are important. The global semantic information provides useful clues from the scene context, such as the relationship between the foreground and background and the spatial distribution of all the foreground objects. The local-level instance information, on the other hand, describes the spatial location and detailed contour of every single object. To integrate the benefits of the global and local features, panoptic segmentation, a reconciliation of semantic and instance segmentation, was proposed.
Building on these ideas, we previously proposed Cell R-CNN and an extension of it for nuclei instance segmentation in histopathology images. In Cell R-CNN, we designed a panoptic architecture that enables the encoder of the instance segmentation model to learn global semantic-level features, by jointly training a semantic segmentation model with an instance segmentation model. To further facilitate semantic-level contextual learning in the instance segmentation model, our IJCAI work was proposed to induce the instance branch to learn the semantic-level features directly. As it is an extension of Cell R-CNN, we refer to this IJCAI work as Cell R-CNN V2 in the following sections. In Cell R-CNN V2, we first introduce a new semantic segmentation prediction from the instance branch. Then, a feature fusion mechanism is designed to induce semantic feature learning in the decoder of the instance segmentation branch, by integrating the mask prediction of the instance branch with the feature map of the semantic branch. In addition, a dual-model mask generator is proposed for instance mask segmentation, in order to prevent information loss.
In this work, we further extend our preliminary Cell R-CNN V2 by addressing several remaining problems. First, the feature fusion mechanism in Cell R-CNN V2 directly replaces part of the feature map in the semantic segmentation branch with the output of the mask generator. Although the mask predictions from the instance branch carry more instance-level features than the semantic branch, the global contextual features from the semantic segmentation prediction are also important. To this end, we propose a residual attention feature fusion mechanism (RAFF) in this work to replace the previous feature fusion mechanism. In our newly proposed RAFF, the local features from the instance branch are integrated with the global semantic features without discarding any semantic-level features. Second, two semantic segmentation tasks with the same ground truth are optimized together in the overall architecture of Cell R-CNN V2. To facilitate robust learning of the two segmentation tasks, we add a semantic consistency regularization between them that encourages the two semantic predictions from the two branches to be as similar as possible. In addition, the traditional Mask R-CNN still produces some low-quality mask predictions with an unexpectedly high classification score, as noted in prior work. Treating these poorly generated results as high-confidence predictions harms the segmentation accuracy. To this end, we propose a new mask quality branch in this work, which learns an auxiliary quality score for each mask prediction based on the Dice score and the Intersection-over-Union (IoU) score. During inference, the classification score of each mask is re-weighted through multiplication by its corresponding mask quality score.
The work described in this manuscript is an extension of Cell R-CNN V2 and is therefore named Cell R-CNN V3. In line with our previous Cell R-CNN V2 and Cell R-CNN, we are the first to employ the panoptic segmentation idea on biomedical image analysis, to the best of our knowledge. Overall, the contributions of this work compared with Cell R-CNN V2 are summarized as follows:
We design a residual attention feature fusion mechanism to integrate the features of each detected object in the semantic and instance levels.
We design a semantic task consistency mechanism to regularize the semantic segmentation tasks training for robustness.
We design an extra mask quality branch to ensure the mask segmentation quality for each object is compatible with its confidence score.
Our proposed Cell R-CNN V3 is validated on the instance segmentation tasks for various biomedical datasets, including histopathology images, fluorescence microscopy images, and plant phenotyping images. Our results for all metrics outperform the state-of-the-art methods by a large margin.
II Related work
Instance segmentation for biomedical images is widely studied, ranging from the handcrafted feature-based methods to the learning-based methods. In order to emphasize the contributions of our proposed method, we mainly focus on the literature of deep learning based instance segmentation methods, which can be grouped into two classes: the proposal-free and proposal-based methods.
II-A Proposal-free Instance Segmentation
Proposal-free instance segmentation methods are mainly based on the morphology and spatial relationships of all the objects in the images. For example, the object boundary is an important feature for separating touching objects. In [21, 4, 48, 1], the instances are separated according to the probability maps for the foreground objects and their boundaries. Similarly, instances can be separated according to the distance between two connected components. Additionally, post-processing methods are employed to separate touching objects based on the semantic segmentation predictions, such as the conditional region growing algorithm, morphological dynamics algorithms, and the watershed algorithm [48, 4]. In addition to the traditional classification-based segmentation methods, regression-based methods are also widely employed. For example, a distance transform map describing the distance between each pixel and its nearest background pixel can be predicted with a regression CNN architecture. To obtain the instance segmentation map directly, [34, 19, 9] apply a clustering algorithm to the high-dimensional embeddings predicted by a deep regression CNN model. Based on an adversarial learning architecture, Zhang et al. proposed an image-to-image translation method that yields a more accurate probability map than the classification-based methods.
II-B Proposal-based Instance Segmentation
Compared with proposal-free instance segmentation methods, the proposal-based methods predict the mask for each object based on the predicted location of that object in the whole image [36, 6]. One fundamental proposal-based instance segmentation method is Mask R-CNN. Based on the high-dimensional feature maps from the backbone CNN, Mask R-CNN first generates regions of interest (ROIs) containing the foreground objects with a region proposal network (RPN). After aligning the ROIs to the same size, a box branch and a mask branch are employed to predict the coordinates, class label, and mask for each ROI. With the help of the local-level information from the spatial locations of the instances, Mask R-CNN achieved state-of-the-art performance compared with the traditional box-free methods. Following Mask R-CNN, further methods were proposed with higher accuracy: one introduced a path aggregation backbone to preserve the feature maps at high resolutions, another added a branch that predicts the mask IoU score from the mask prediction of the original Mask R-CNN, and a third employed a cascade connection of several bounding box and mask prediction branches.
Although the proposal-based instance segmentation methods achieve higher performance than the proposal-free methods by processing each object separately, their effectiveness is still limited by the lack of semantic-level global information on the context of the whole image. To tackle this issue, panoptic segmentation was recently proposed to jointly process the foreground things and the background stuff, by incorporating semantic segmentation with instance segmentation. Inspired by this joint segmentation idea, one approach fuses the instance segmentation result for foreground objects with the semantic segmentation result for the background for urban scene semantic segmentation. However, the instance branch and the semantic branch are trained separately in that work. In [10, 17], the instance and semantic segmentation branches are trained together by sharing the same backbone module, and the losses of the two branches are summed for back propagation to optimize the parameters of the whole framework. Later, more methods for fusing the results of things and stuff were proposed. One uses an attention mechanism to fuse the proposals and masks from the instance branch with the feature map from the semantic branch. Another proposes a spatial ranking module that separates overlapping objects from different categories by fusing the semantic segmentation predictions with the instance segmentation ones.
Similar to the joint learning paradigm in panoptic segmentation, adding a semantic segmentation task to proposal-based instance segmentation also enables the model to achieve higher performance by learning auxiliary semantic-level contextual information. In the hybrid cascade instance segmentation architecture, the semantic segmentation prediction is fused into the cascade so that the architecture exploits the global semantic features, achieving state-of-the-art performance compared with previous instance segmentation methods. In medical analysis tasks, we previously proposed Cell R-CNN to induce the encoder of the instance segmentation model to learn semantic-level information by jointly training a semantic segmentation network and a Mask R-CNN with a shared backbone network. With the help of the semantic-level contextual information, Cell R-CNN outperforms Mask R-CNN in nuclei segmentation tasks on histopathology images. However, the decoder of Cell R-CNN only learns the semantic features indirectly, so the model still lacks global information during inference. In Cell R-CNN V2, we therefore designed a feature fusion module to incorporate the feature maps from the semantic segmentation branch and the instance segmentation branch during the training phase. By retaining semantic-level features in the encoder and decoder of the instance segmentation model, Cell R-CNN V2 achieved state-of-the-art performance on several nuclei instance segmentation tasks under both object- and pixel-level metrics.
III Cell R-CNN V3
In this section, we firstly introduce the overall architecture of the proposed Cell R-CNN V3. Then, the three newly proposed modules are described in detail. Finally, the training and inference details are presented.
III-A Overall architecture
Fig. 1 illustrates our proposed Cell R-CNN V3. For each input image, it first passes through a ResNet-101  backbone network to obtain the feature maps at different resolutions. Then, the feature maps are sent to a semantic segmentation branch to learn the global semantic-level feature and an instance segmentation branch to learn the object-level local features.
For the semantic segmentation branch, we employ the decoder of the global convolutional network (GCN), as shown in Fig. 3. Specifically, the multi-resolution feature maps from the ResNet101 backbone network are sent to a skip-connected decoder, which contains several large-kernel global convolutional modules. Each large-kernel global convolutional module is approximated by combining two 1D convolutional kernels applied in different orders. As a result, the model has a large receptive field while remaining memory efficient, and the semantic branch is capable of processing more global-level contextual features than CNN architectures with normal-sized convolutional kernels.
Our instance segmentation branch in Fig. 2 is based on that of Cell R-CNN V2. First, multi-resolution feature maps (see Fig. 2) are obtained by the feature pyramid network (FPN) connected after the backbone encoder. Along with anchors of different ratios and sizes, these feature maps then pass through a region proposal network (RPN) to generate ROIs, which represent the features of all possible foreground objects in the original images. As the ROIs after the RPN vary in size, an ROIAlign mechanism is further employed to reshape all the ROIs to the same size, which is in this work. Eventually, all the ROIs are sent to a bounding box branch to predict the locations and class scores, and to a dual-model mask generator for instance mask prediction. In order to induce semantic feature learning in the decoder of the instance segmentation branch, we further propose an attention-based feature fusion mechanism that incorporates the mask prediction and bounding box prediction for all the ROIs with the semantic segmentation feature map obtained from the top layer of the FPN. In addition, the mask segmentation result is fused with the ROI features in a newly proposed mask quality branch, which predicts the quality of the mask segmentation for each ROI according to the corresponding IoU and Dice score.
III-B Residual attention feature fusion mechanism
In Cell R-CNN V2, we proposed a feature fusion mechanism to incorporate the semantic-level contextual features with the local-level instance features by using the mask prediction from the instance branch to replace the subset of the semantic segmentation features according to the location of the bounding box branch. Although the fused feature map contains both semantic- and instance-level features, only the background features at the semantic level are learned by the instance segmentation branch, as the foreground features in the original semantic feature map are deprecated. However, the foreground feature for each object from the global view in the original semantic feature maps is still important, as it contains the relationship between each object and the whole background. In the instance segmentation branch, the mask prediction of each object is predicted according to the relationship between the foreground and the background within the corresponding ROI, instead of the background of the whole image. Moreover, part of the background feature in the semantic feature map is also replaced by that from the instance predictions after the feature fusion mechanism in the Cell R-CNN V2, which results in the contextual information loss in the semantic segmentation prediction. To this end, directly replacing the subset of the semantic feature map with the predictions of the instance branch is harmful to the semantic-level feature learning in the decoder of the instance branch.
To tackle this issue, we design an attention-based feature fusion mechanism, as illustrated in Fig. 4. The number of ROIs in the instance segmentation branch is denoted as $N$, and the mask and bounding box predictions for each ROI are defined as $M_i$ and $B_i$, respectively, where $i \in \{1, \dots, N\}$. Specifically, $B_i$ can be written as:
$$B_i = (x_i, y_i, w_i, h_i),$$
where $x_i$ and $y_i$ represent the coordinates of the bottom-left point of the rectangular ROI along the $x$ and $y$ axes, and $w_i$ and $h_i$ are its width and height. In addition, the semantic feature map before the attention-based feature fusion is defined as $F$, as illustrated in Fig. 2. During the attention-based feature fusion for each $M_i$, we first obtain its probability map $P_i$:
$$P_i = \sigma(M_i),$$
where $\sigma$ is the sigmoid operation. Then, we fuse each $P_i$ with the subset of $F$ selected by the corresponding $B_i$, as shown in Algorithm 1, where $R(\cdot)$ reshapes $P_i$ to $h_i \times w_i$ with bilinear interpolation, and $\odot$ is the element-wise multiplication.
The value at each coordinate of $P_i$ represents the probability of this pixel being foreground. Therefore, the proposed residual attention feature fusion mechanism highlights the foreground features on the original semantic feature map. By fusing the instance-level features into the semantic feature map while preserving all its contextual features, optimizing the semantic segmentation task of the instance branch enables the mask generator to learn accurate and sufficient semantic features.
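The fusion step described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' Algorithm 1: it assumes the residual form "feature + feature × attention" inside each box, works on a single-channel feature map stored as nested lists, and treats the box origin as the corner with the smallest array indices. The function names `bilinear_resize` and `residual_attention_fuse` are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def bilinear_resize(grid, out_h, out_w):
    """Resize a 2D list to (out_h, out_w) with corner-aligned bilinear interpolation."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for r in range(out_h):
        # Map the output row back to a fractional input row.
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            top = grid[y0][x0] * (1 - dx) + grid[y0][x1] * dx
            bottom = grid[y1][x0] * (1 - dx) + grid[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


def residual_attention_fuse(feature, mask_logits, box):
    """Return a copy of `feature` with one ROI fused in.

    box = (x, y, w, h): x is the first column, y the first row (assumption).
    Assumed residual form: F_box <- F_box + F_box * resize(sigmoid(mask)).
    """
    x, y, w, h = box
    prob = [[sigmoid(v) for v in row] for row in mask_logits]
    attention = bilinear_resize(prob, h, w)
    fused = [row[:] for row in feature]
    for r in range(h):
        for c in range(w):
            fused[y + r][x + c] = feature[y + r][x + c] * (1.0 + attention[r][c])
    return fused
```

Because the attention term is added to the original features rather than substituted for them, features outside and inside the box are never overwritten, which is the property the paragraph above emphasizes.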
III-C Mask quality branch
During the inference process of the traditional Mask R-CNN, the mask predictions are determined by the highest classification score. However, the classification scores for the mask predictions are not always correlated with their quality, such as the IoU between the mask prediction and the ground truth . In the testing phase of the Cell R-CNN and Cell R-CNN V2, if there remain two overlapping predictions, the overlapping part is assigned to the mask with a higher classification score. Therefore, low-quality mask predictions with high classification scores affect the performance when processing the overlapping objects during inference.
Inspired by , we propose a new mask quality branch to predict the quality of the mask predictions in the instance branch, as shown in Fig. 2. For each mask prediction in size , we select its foreground score map. Then, each score map is reshaped to size , to be concatenated with the ROI feature map. The fused feature map with size then passes through convolutional layers and fully connected layers to predict the quality of the mask, which is a float value in . Table I lists the detailed hyperparameter settings of the mask quality branch. As the Dice coefficient is an important metric in biomedical image segmentation, the mask quality score is determined by the IoU and Dice score between the predictions and the ground truth. During training, a mask prediction and its corresponding ground truth are denoted as and , respectively, and the mask quality score is defined as:
, where means the total number of pixels. Therefore, is in . Eventually, loss is employed between the and the mask quality prediction.
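To make the training target of the mask quality branch concrete, the sketch below computes the IoU and Dice score between a binary mask prediction and its ground truth. How the two scores are combined is not recoverable from the text here, so the plain average used below is purely an assumption for illustration; the function name `mask_quality_target` is hypothetical.

```python
def mask_quality_target(pred, gt):
    """Return (iou, dice, quality) for two binary masks given as 2D lists of 0/1.

    ASSUMPTION: quality is taken as the average of IoU and Dice. The source
    only states that the score depends on both; the exact formula may differ.
    """
    pred_flat = [p for row in pred for p in row]
    gt_flat = [g for row in gt for g in row]
    inter = sum(1 for p, g in zip(pred_flat, gt_flat) if p and g)
    pred_area = sum(1 for p in pred_flat if p)
    gt_area = sum(1 for g in gt_flat if g)
    union = pred_area + gt_area - inter
    if union == 0:
        # Both masks are empty: no overlap to measure.
        return 0.0, 0.0, 0.0
    iou = inter / union
    dice = 2.0 * inter / (pred_area + gt_area)
    return iou, dice, 0.5 * (iou + dice)
```

Since IoU and Dice both lie between 0 and 1, any average of them does as well, so the target can be regressed directly by a bounded output.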
III-D Semantic task consistency regularization
Our motivation for this module is from . When two tasks in a multi-task learning architecture focus on the same objective, adding a consistency regularization between their outputs enables robust learning of both. In our proposed architecture, both the semantic and the instance branches generate semantic segmentation predictions. Ideally, the semantic segmentation predictions from the two branches should be equal to each other and to the ground truth. Therefore, we propose a consistency regularization between these two semantic segmentation predictions to reduce the distance between them. The softmax semantic segmentation predictions of the semantic and instance branches are denoted as $S^{sem}$ and $S^{ins}$, respectively, which are both in the range $[0, 1]$. The semantic consistency regularization is:
$$L_{con} = \frac{1}{n} \sum_{k=1}^{n} \left( S^{sem}_k - S^{ins}_k \right)^2, \quad (4)$$
where $n$ is the total number of activations in $S^{sem}$.
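As a small worked example of the consistency term, the sketch below computes the mean squared difference between two softmax prediction maps, which is how the regularization is characterized in the training details. The inputs are flat lists of probabilities and the function name `semantic_consistency` is hypothetical; in practice this would be a tensor operation inside the training loop.

```python
def semantic_consistency(pred_semantic, pred_instance):
    """Mean squared difference between two equally sized lists of probabilities."""
    if len(pred_semantic) != len(pred_instance):
        raise ValueError("both predictions must have the same number of activations")
    if not pred_semantic:
        return 0.0
    squared = [(a - b) ** 2 for a, b in zip(pred_semantic, pred_instance)]
    return sum(squared) / len(squared)
```

The term is zero only when both branches agree on every activation, so minimizing it pulls the two semantic predictions toward each other in addition to pulling each toward the ground truth.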
III-E Training and inference details
As shown in Fig. 1, the total loss function of the Cell R-CNN V3 is defined as:
For the instance segmentation task, and are the smooth L1 regression loss and cross entropy classification loss for RPN, respectively. and are the bounding box regression and the classification loss of the box branch, is the binary cross entropy segmentation loss for the mask branch, and is the regression loss for the mask quality branch. On the other hand, and are the semantic segmentation losses for the semantic branch and instance branch. is the mean square loss for the semantic consistency regularization, as shown in Eq. 4. and are trade-off parameters to balance the importance of each task and are set as and , respectively, in our experiments.
During inference, the instance mask predictions from the mask generator of the instance branch are employed. A confidence threshold $\tau$ is first employed to discard the masks whose classification scores are smaller than $\tau$. Then, a mask confidence score $s_{conf}$ for each object is calculated from its classification score $s_{cls}$ and mask quality prediction $s_{mq}$:
$$s_{conf} = s_{cls} \cdot s_{mq}.$$
For any two touching predictions, the overlapping part belongs to the prediction with the higher $s_{conf}$.
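The inference procedure above can be sketched as follows: filter by classification score, re-weight by the predicted mask quality, and give contested pixels to the more confident mask. This is a minimal sketch, assuming masks are represented as sets of (row, column) pixels; the function name `rescore_and_resolve` and the threshold value used in any call are hypothetical.

```python
def rescore_and_resolve(masks, cls_scores, quality_scores, threshold):
    """Return {mask_index: (confidence, pixels)} after filtering and overlap resolution.

    masks: list of sets of (row, col) pixels.
    Masks with a classification score below `threshold` are discarded.
    Confidence is the classification score times the mask quality prediction.
    """
    kept = []
    for index, (cls, quality) in enumerate(zip(cls_scores, quality_scores)):
        if cls >= threshold:
            kept.append((cls * quality, index))
    # Visit masks from most to least confident so overlaps go to the higher score.
    kept.sort(reverse=True)
    claimed = set()
    result = {}
    for confidence, index in kept:
        pixels = masks[index] - claimed
        result[index] = (confidence, pixels)
        claimed |= pixels
    return result
```

Note that a mask with a high classification score but a low predicted quality can lose an overlap to a neighbour it would have won under classification score alone, which is the intended effect of the re-weighting.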
IV-A Dataset description
The TCGA-Kumar dataset contains histopathology images in size , obtained from The Cancer Genome Atlas (TCGA) at magnification . Each image is from one of seven organs: breast, bladder, colon, kidney, liver, prostate, and stomach. To compare with the state-of-the-art methods, we use the same data split as in [21, 31, 25]. images total from the breast, kidney, liver, and prostate are employed for training ( from each organ). During training, patches in size are randomly cropped from each image. Next, basic augmentation techniques are applied, including horizontal and vertical flipping and rotation of , , and . Due to the noise and color variability in the histopathology images, advanced augmentation including Gaussian blur, median blur, and Gaussian noise is then employed to ensure the robustness of the model. The validation set contains images from the breast, kidney, liver, and prostate. Of the remaining images, images from the same organs as the training set form the seen testing set, while from the other organs, which are unavailable during training, form the unseen testing set. During testing, each image is directly employed for nuclei instance segmentation.
Our second histopathology dataset is the Triple Negative Breast Cancer (TNBC) dataset from . The TNBC dataset contains histopathology images at magnification, collected from different patients of the Curie Institute. We conduct 3-fold cross-validation for all the experiments on this dataset. During training, patches are cropped from each image, followed by data augmentation including horizontal and vertical flipping, rotation of , , and , Gaussian blur, median blur, and Gaussian noise. For testing, each image is directly employed.
IV-A3 Fluorescence microscopy images
In addition to the histopathology images, we also validate our method on the fluorescence microscopy images analysis. We employ the BBBC039V1 dataset from , which contains images obtained from fluorescence microscopy. Each image focuses on the U2OS cells with a single field of view on the DNA channel, with various cell shape and density. In our experiment, we follow the official data split (https://data.broadinstitute.org/bbbc/BBBC039/), with images for training, for validation, and the rest for testing. For training data preparation, first, patches are randomly cropped from each image. As the background components in this dataset are not as complicated as the others, only basic data augmentation is employed, including horizontal and vertical flipping and rotation of , , and . During inference, each image is directly used.
IV-A4 Plant Phenotyping
To demonstrate the effectiveness of our proposed method on instance segmentation task for other biology images, we study the leaf instance segmentation task. We employ the Computer Vision Problems in Plants Phenotyping (CVPPP) dataset, which contains top-down view images of leaves with various shapes and complicated occlusions. In this work, we focus on the A1 subset with a total of images, which has been broadly studied for instance segmentation in several state-of-the-art works. Out of the training images provided by the challenge, we employed images for training and the remaining for validation. During training, each image is firstly reshaped to size . Then, data augmentation including horizontal and vertical flipping, rotation of , , and , Gaussian blur, median blur, and Gaussian noise are employed to avoid overfitting. During inference, the predictions are directly obtained from the images. To evaluate the performance, the predicted results are submitted to the official evaluation platform (https://competitions.codalab.org/competitions/18405).
IV-B Evaluation metrics
To evaluate the performance on nuclei segmentation in the histopathology images and cell segmentation in the fluorescence microscopy images, we employ the Aggregated Jaccard Index (AJI), object-level F1 score (F1), and pixel-level Dice score (Dice). AJI is an extended Jaccard Index for object-level segmentation evaluation, defined as:
$$\mathrm{AJI} = \frac{\sum_{i=1}^{n} |G_i \cap S_{j^*(i)}|}{\sum_{i=1}^{n} |G_i \cup S_{j^*(i)}| + \sum_{k \in U} |S_k|},$$
where $G_i$ is the $i$-th nucleus in a ground truth with a total of $n$ nuclei, $S_j$ is the $j$-th predicted nucleus, and $U$ is the set of false positive predictions without a corresponding ground truth. For each ground truth object $G_i$, $j^*(i)$ is the index of the prediction that overlaps it most, and each prediction can only be used once:
$$j^*(i) = \arg\max_{j} \frac{|G_i \cap S_j|}{|G_i \cup S_j|}.$$
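The AJI computation can be sketched as below. This is an illustrative sketch rather than the official evaluation code: objects are represented as sets of pixel coordinates, ground truth objects are matched greedily in the order given to the unused prediction with the highest Jaccard overlap, and a ground truth object with no overlapping prediction contributes only its own area to the denominator (an assumption about the unmatched case). The function name `aggregated_jaccard_index` is hypothetical.

```python
def aggregated_jaccard_index(gt_objects, pred_objects):
    """AJI for two lists of objects, each object a set of (row, col) pixels."""
    used = set()
    inter_sum = 0
    union_sum = 0
    for gt in gt_objects:
        best_index, best_iou = None, 0.0
        for index, pred in enumerate(pred_objects):
            if index in used:
                continue  # each prediction can be matched only once
            inter = len(gt & pred)
            if inter == 0:
                continue
            iou = inter / len(gt | pred)
            if iou > best_iou:
                best_index, best_iou = index, iou
        if best_index is None:
            # Missed object: nothing is added to the numerator.
            union_sum += len(gt)
        else:
            pred = pred_objects[best_index]
            inter_sum += len(gt & pred)
            union_sum += len(gt | pred)
            used.add(best_index)
    # Predictions never matched are false positives and enlarge the denominator.
    for index, pred in enumerate(pred_objects):
        if index not in used:
            union_sum += len(pred)
    return inter_sum / union_sum if union_sum else 0.0
```

Both missed objects and spurious predictions lower the score, which is why AJI penalizes detection and segmentation errors jointly.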
Object-level F1 score is the metric for the detection performance, defined based on the number of true and false detections:
$$\mathrm{F1} = \frac{2\,TP}{2\,TP + FN + FP},$$
where TP, FN, and FP represent the number of true positive (correctly detected objects), false negative (ignored objects), and false positive (detected objects without corresponding ground truth) detections, respectively. To evaluate the foreground and background segmentation accuracy, the pixel-level Dice score is computed between the binarized prediction and the ground truth:
$$\mathrm{Dice} = \frac{2\,|X \cap Y|}{|X| + |Y|},$$
where $X$ and $Y$ represent the binarized prediction and ground truth, respectively, and $|\cdot|$ means the total number of foreground pixels.
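Both scores are simple ratios, as the short sketch below shows. It assumes the detection counts are already available and that the binarized foreground regions are given as sets of pixel coordinates; the function names `object_f1` and `pixel_dice` are hypothetical.

```python
def object_f1(true_pos, false_neg, false_pos):
    """Object-level F1 from detection counts."""
    denominator = 2 * true_pos + false_neg + false_pos
    return 2 * true_pos / denominator if denominator else 0.0


def pixel_dice(pred_foreground, gt_foreground):
    """Pixel-level Dice between two sets of foreground pixel coordinates."""
    total = len(pred_foreground) + len(gt_foreground)
    if total == 0:
        return 1.0  # both empty: treated as perfect agreement (assumption)
    return 2 * len(pred_foreground & gt_foreground) / total
```

F1 only counts whether objects were found, while Dice ignores object identity and measures foreground coverage, so the two are complementary.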
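For the leaf segmentation task, we directly employ the official Symmetric Best Dice (SBD) score:
$$\mathrm{SBD}(L^{pred}, L^{gt}) = \min\left\{ \mathrm{BD}(L^{pred}, L^{gt}),\ \mathrm{BD}(L^{gt}, L^{pred}) \right\},$$
where $L^{pred}$ and $L^{gt}$ are the predictions and ground truth, respectively. $\mathrm{BD}(L^{a}, L^{b})$ is the best Dice between $L^{a}$ and $L^{b}$:
$$\mathrm{BD}(L^{a}, L^{b}) = \frac{1}{M} \sum_{i=1}^{M} \max_{1 \le j \le N} \frac{2\,|L^{a}_i \cap L^{b}_j|}{|L^{a}_i| + |L^{b}_j|},$$
where $M$ and $N$ are the numbers of objects in $L^{a}$ and $L^{b}$, and $|\cdot|$ means the total number of foreground pixels.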
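The symmetric score can be sketched directly from its definition: compute the best Dice in each direction and keep the smaller value. The sketch assumes each leaf is a non-empty set of pixel identifiers; the function names `best_dice` and `symmetric_best_dice` are hypothetical, and the official evaluation platform remains the reference implementation.

```python
def best_dice(objects_a, objects_b):
    """Average, over objects in A, of the best Dice against any object in B."""
    if not objects_a:
        return 0.0
    total = 0.0
    for a in objects_a:
        total += max(
            (2 * len(a & b) / (len(a) + len(b)) for b in objects_b),
            default=0.0,
        )
    return total / len(objects_a)


def symmetric_best_dice(pred_objects, gt_objects):
    """Minimum of the two directional best-Dice scores."""
    return min(best_dice(pred_objects, gt_objects), best_dice(gt_objects, pred_objects))
```

Taking the minimum matters: a prediction that covers one leaf perfectly but misses another scores well in one direction only, and the symmetric score reports the worse of the two.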
IV-C Implementation Details
For the network initialization, the weights of the ResNet101 backbone are pretrained on the ImageNet classification task, while the weights for other layers are initialized with “Kaiming” initialization. When training Cell R-CNN V3, stochastic gradient descent (SGD) is used to optimize the network, with a weight decay of and momentum of . The mini-batch size is , which is relatively small. We therefore employ group normalization layers with a group number of to replace the traditional batch normalization layers. The initial learning rate is set to , with a linear warm-up for the first iterations. The learning rate is then decreased to when it reaches the of the total training iterations. Our experiments are implemented on two Nvidia GeForce 1080Ti GPUs with PyTorch.
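The learning rate schedule described above (linear warm-up, constant phase, one drop late in training) can be sketched as a pure function of the iteration index. The concrete values are not recoverable from the text, so every argument below is a placeholder to be supplied by the reader, and the warm-up is assumed to ramp linearly from a fraction of the base rate up to the base rate. The function name `learning_rate` is hypothetical.

```python
def learning_rate(iteration, total_iters, base_lr, warmup_iters, decay_fraction, decayed_lr):
    """Learning rate at a given 0-based iteration.

    All arguments are placeholders; none of the values come from the paper.
    - Linear warm-up over the first `warmup_iters` iterations (assumed to end at base_lr).
    - `base_lr` until `decay_fraction` of `total_iters` has elapsed.
    - `decayed_lr` afterwards.
    """
    if iteration < warmup_iters:
        return base_lr * (iteration + 1) / warmup_iters
    if iteration >= decay_fraction * total_iters:
        return decayed_lr
    return base_lr
```

For example, with the hypothetical settings `total_iters=100`, `base_lr=0.01`, `warmup_iters=10`, `decay_fraction=0.75`, and `decayed_lr=0.001`, the rate climbs for ten iterations, stays at 0.01 through iteration 74, and drops from iteration 75 onward.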
IV-D Comparison with state-of-the-art methods
|Mask R-CNN ||avg|
|Cell R-CNN ||avg|
|Cell R-CNN V2 ||avg|
|Cell R-CNN V3||avg||0.5975||0.6282||0.6107||0.7967||0.8256||0.8091||0.8317||0.8383||0.8345|
Comparison of results on the TCGA-Kumar dataset. avg and std represent average and standard deviation, respectively. For DIST, the object-level F1 results are not available.
Our result is compared with several state-of-the-art nuclei instance segmentation methods, including CNN3, DIST, Mask R-CNN, Cell R-CNN, and Cell R-CNN V2. With the same data split, we directly compare with the performance reported in [21, 31]. For Mask R-CNN, Cell R-CNN, and Cell R-CNN V2, we re-implement them by adding group normalization with the same settings as our proposed Cell R-CNN V3, for a fair comparison. Therefore, their performance is slightly better than previously reported. Table II and Fig. 5 show our quantitative and qualitative comparison results, respectively. As shown in Table II, our proposed Cell R-CNN V3 outperforms all the other methods in all three metrics on the seen and unseen testing sets. This indicates that our method generalizes well to cases from unseen organs. To test the statistical significance of the differences between our Cell R-CNN V3 and the other methods, we employ a one-tailed paired t-test to calculate the p-value. As shown in Table III, our improvements under all three metrics are statistically significant (p-value ) except for the F1 of CNN3. However, F1 only relies on the number of correctly detected objects, regardless of the segmentation quality of each detected object. By outperforming CNN3 by a large margin in the other two segmentation metrics (over on and on ), our method still achieves better performance on nuclei segmentation tasks than CNN3. Fig. 6 is the box plot for all the comparison methods under the three metrics, which shows that our proposed method not only outperforms all the other methods but is also more stable and robust.
|Cell R-CNN V2|
|Cell R-CNN V2|
|Cell R-CNN V3||0.6313 (0.0750)||0.8037 (0.0557)||0.8600 (0.0849)|
We conducted comparison experiments on the second histopathology dataset with 3-fold cross-validation, and the results are shown in Table IV and Fig. 7. As shown in Table IV, our Cell R-CNN V3 outperforms its previous versions under all three metrics. Compared with Mask R-CNN, Cell R-CNN improves performance by a large margin. The background components in the TNBC dataset are complicated, and some background textures have a similar appearance to the foreground. Therefore, processing the semantic-level information is beneficial to the segmentation and detection accuracies. Compared with Cell R-CNN, the improvement of Cell R-CNN V2 is not as large as on the TCGA-Kumar dataset, especially under the object-level F1 score. Although the feature fusion mechanism in Cell R-CNN V2 facilitates semantic feature learning in the instance branch, contextual features around each object are lacking because part of the semantic feature map is discarded. Therefore, the detection accuracies are affected when the boundaries of two touching objects become ambiguous. Similar to the results on the TCGA-Kumar dataset, our proposed Cell R-CNN V3 outperforms the comparison methods by a large margin.
|Cell R-CNN V2|
|Cell R-CNN V3||0.8477 (0.0757)||0.9478 (0.0071)||0.9451 (0.0536)|
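To make the distinction between the object-level F1 score and the pixel-level Dice score concrete, the sketch below computes both on toy inputs. It is only an illustration under simplifying assumptions: the detection counts are taken as given (the matching rule that decides whether a prediction counts as a correct detection is not shown here), masks are flat lists of 0/1 labels, and the function names are hypothetical.

```python
def object_f1(tp, fp, fn):
    """Object-level F1 from counts of true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def dice(pred, target):
    """Pixel-level Dice between two flat binary masks (sequences of 0/1)."""
    inter = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2 * inter / total if total else 1.0


# Two hypothetical predictions that detect the same objects (same counts)
# but differ in how well the masks cover the ground truth.
target = [1, 1, 1, 1, 0, 0, 0, 0]
tight = [1, 1, 1, 1, 0, 0, 0, 0]
loose = [1, 1, 0, 0, 0, 0, 0, 0]

f1_same = object_f1(tp=8, fp=2, fn=2)
print(f1_same, dice(tight, target), dice(loose, target))
```

Both predictions share the same F1 because it depends only on the detection counts, while Dice separates them according to mask quality.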
In addition to the nuclei segmentation tasks in the histopathology images, our proposed method is also effective for cell instance segmentation in the fluorescence microscopy images. As illustrated in Table V, our method outperforms all the comparison methods. We notice that Cell R-CNN performs at the same level as Mask R-CNN, with only limited improvement. In the fluorescence microscopy images, the background components are not as complicated as in the histopathology images, as shown in Fig. 8. Therefore, Cell R-CNN is not capable of improving the accuracy, as it fails to process the contextual information about the foreground objects by learning the semantic features in the backbone encoder. On the other hand, Cell R-CNN V2 improves on Cell R-CNN by designing a dual-modal mask generator, which improves the mask segmentation accuracies in the instance branch and induces the mask generator to learn global semantic-level features. However, its improvement over Cell R-CNN in the pixel-level Dice score is still limited. Based on Cell R-CNN V2, our proposed Cell R-CNN V3 achieves large improvements in all three metrics.
IV-D4 CVPPP Challenge
|Cell R-CNN V2|
|Cell R-CNN V3||0.9062 (0.0269)||0.9338 (0.0268)|
|Deep coloring |
|Discriminative loss |
|Recurrent with attention |
|Data augmentation |
|Harmonic embeddings |
|Synthesis data |
|Cell R-CNN V3||91.1|
First, we compare with our previous work using 3-fold cross-validation on the training images. The results are shown in Table VI and Fig. 9. By outperforming our previous Cell R-CNN and Cell R-CNN V2 on the instance segmentation tasks for biology images as well as medical images, our proposed method is demonstrated to be effective.
To further demonstrate the effectiveness of our Cell R-CNN V3 on biology image analysis, we also conducted a comparison experiment with other previous work, using the leaf segmentation testing images. Table VII compares the performance of our work with that of the state-of-the-art methods, and our segmentation accuracy exceeds that of all the existing published work on this dataset. Among these methods, RIS , RNN , and Recurrent with attention processed one instance each time, with the help of the temporal chain from a recurrent neural network (RNN) or long short-term memory (LSTM). In addition, Recurrent with attention achieved better performance compared with the previous RIS and RNN due to the attention module and proposal-based architecture. As there is actually no temporal information in the leaf instance segmentation task, other methods focusing on the spatial relationship are more suitable for the task and achieve better performance. , , , and  are proposal-free instance segmentation methods.  divides all the objects into several groups of untouching instances and processes them separately. Instead of directly learning the instance mask prediction for each leaf, , , and  learned high-dimensional embedding maps projected from the original images. Then, clustering algorithms were employed to separate each instance during inference. Without focusing on each object, their performance is still limited due to the lack of local-level information. Similar to our work, the CNN architectures in  and  are proposal-based Mask R-CNN. With the help of the auxiliary synthesized images, these two methods outperform most of the previous state-of-the-art methods. However, the image synthesis methods in  and  are entirely based on the characteristics of the leaves in the given plant phenotype images, such as the texture, direction, and spatial relationship with other leaves. Therefore, the methods are task-specific and hard to fit to other datasets with different characteristics. With the help of panoptic-level features in a local and global view, our proposal-based Cell R-CNN V3 outperforms all the other methods on the CVPPP A1 dataset, without any task-specific design. In addition, the competitive performance on other instance segmentation tasks further demonstrates the generalization ability of our method.
IV-E Ablation study
In this section, we conduct ablation experiments on the TNBC, BBBC039V1, and CVPPP datasets, to test the effectiveness of the three newly proposed modules in the Cell R-CNN V3 on different types of images. For TNBC and BBBC039V1, we use the same data settings as in the previous experiments. For CVPPP, we conduct 3-fold cross-validation on the training images.
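As a minimal sketch of how a 3-fold cross-validation split over the training images could be organized, the generator below yields train/validation index lists for each fold. The shuffling seed, the fold assignment by striding, and the function name are illustrative assumptions, not the splits used in the experiments.

```python
import random


def k_fold_splits(n_items, k=3, seed=0):
    """Yield (train_indices, val_indices) for each of k folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducible folds
    folds = [idx[i::k] for i in range(k)]  # near-equal fold sizes
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val


# Hypothetical example with 10 training images.
splits = list(k_fold_splits(10, k=3))
for fold_id, (train, val) in enumerate(splits):
    print(fold_id, len(train), len(val))
```

Each image appears in exactly one validation fold, so averaging the metric over the folds uses every training image once for evaluation.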
IV-E1 Residual attention feature fusion mechanism
|w / o|
|w / o|
In this section, we first study the selection of the feature fusion mechanism. As shown in Fig. 10, we present four different selections: (a) replacement: replace the semantic feature map with the mask predictions, as employed in Cell R-CNN V2; (b) summation: sum the semantic feature map with the mask predictions; (c) attention: first obtain the mask probability maps with the sigmoid operation, then multiply them with the corresponding semantic features; (d) residual attention: add a residual connection to (c). For all the experiments, the rest of the model is the same as our proposed Cell R-CNN V3 except for the feature fusion mechanism. The comparison results are shown in Table VIII.
As discussed above, the previous feature fusion mechanism in Cell R-CNN V2 directly replaced a subset of the semantic feature map with the mask predictions for each object, which results in a loss of semantic-level information. Therefore, the models with the summation (b), attention (c), and residual attention (d) fusion mechanisms achieve better performance than the replacement fusion (a), as all of them preserve the original semantic features. Among (b), (c), and (d), the residual attention fusion mechanism (d) achieves the best performance on all datasets, and it is therefore employed in our Cell R-CNN V3.
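The four fusion options (a) to (d) can be written down compactly for a single object. The sketch below is an illustration under stated assumptions rather than the actual implementation: it uses NumPy instead of a deep learning framework, assumes the semantic features for one region have shape (C, H, W) and that the mask logits have already been resized to (H, W), and broadcasts the mask over the channel axis. The names `fuse`, `sem_roi`, and `mask_logits` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fuse(sem_roi, mask_logits, mode):
    """Fuse one object's mask prediction into its region of the semantic feature map.

    sem_roi: (C, H, W) semantic features cropped at the object's location.
    mask_logits: (H, W) mask prediction for the same region (assumed pre-resized).
    """
    m = mask_logits[None, :, :]  # add a channel axis so it broadcasts over C
    if mode == "replacement":  # (a) semantic features are overwritten
        return np.broadcast_to(m, sem_roi.shape).copy()
    if mode == "summation":  # (b)
        return sem_roi + m
    if mode == "attention":  # (c) mask probabilities gate the features
        return sem_roi * sigmoid(m)
    if mode == "residual_attention":  # (d) = features + (c)
        return sem_roi + sem_roi * sigmoid(m)
    raise ValueError(f"unknown fusion mode: {mode}")


# Toy input: constant features and zero logits (mask probability 0.5 everywhere).
sem = np.ones((2, 2, 2))
logits = np.zeros((2, 2))
for mode in ("replacement", "summation", "attention", "residual_attention"):
    print(mode, fuse(sem, logits, mode)[0, 0, 0])
```

Only option (a) discards the original semantic features; in (d) the features pass through unchanged via the residual path and are additionally amplified where the mask probability is high.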
IV-E2 Mask quality branch
To demonstrate the effectiveness of our proposed mask quality branch, we conducted an ablation study by removing the mask quality branch and comparing the performance with Cell R-CNN V3. As shown in Table IX, the accuracies under all the metrics decrease after removing the mask quality branch, especially under the object-level metrics. Without the mask quality branch, there exist low-quality mask predictions with a high classification score, which affect the segmentation of touching objects during inference and are eventually harmful to the object-level accuracies.
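The effect of re-weighting at inference can be illustrated with a small sketch. It assumes that the final score of each instance is the product of its classification score and its predicted mask quality; the multiplicative form, the dictionary keys, and the example values are assumptions made for illustration.

```python
def rescore(predictions):
    """Re-weight each prediction's classification score by its mask quality.

    predictions: list of dicts with 'cls_score' and 'mask_quality' in [0, 1].
    Returns new dicts with an added 'score', sorted from best to worst.
    """
    rescored = [dict(p, score=p["cls_score"] * p["mask_quality"]) for p in predictions]
    return sorted(rescored, key=lambda p: p["score"], reverse=True)


# Hypothetical predictions: "a" is confidently classified but poorly segmented.
preds = [
    {"id": "a", "cls_score": 0.95, "mask_quality": 0.40},
    {"id": "b", "cls_score": 0.80, "mask_quality": 0.90},
]
ranked = rescore(preds)
print([p["id"] for p in ranked])
```

With the classification score alone, prediction "a" would take precedence over "b"; after re-weighting, the better-segmented "b" ranks first, which matters when overlapping predictions compete for the same touching objects.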
IV-E3 Semantic task consistency regularization
Table X illustrates the effectiveness of the semantic consistency regularization by ablating it from the original Cell R-CNN V3. Although the regularization aims to facilitate the semantic information learning in the instance branch, we notice that the improvements under the object-level metrics are at the same level as those under the pixel-level metrics on all three datasets. Compared with Table VIII and Table IX, we notice that the model without the semantic consistency regularization outperforms the models without the residual attention fusion mechanism and without the mask quality branch, which indicates that the contribution of the semantic consistency regularization is the lowest among the three proposed modules. However, the semantic task consistency regularization is still a novel module that is implemented by only adding one more loss function, which makes it straightforward and easy to adapt to other related tasks.
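Because the regularization amounts to one extra loss term, it can be sketched in a few lines. The exact form of the loss is not restated in this section, so the version below is an assumption: it penalizes the mean squared difference between the class probabilities of two semantic segmentation predictions (for example, one from the semantic branch and one from the instance branch) and adds it to the other task losses with a hypothetical weight.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def semantic_consistency_loss(logits_a, logits_b):
    """Mean squared difference between two (num_classes, H, W) semantic predictions."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    return float(np.mean((pa - pb) ** 2))


def total_loss(task_losses, logits_a, logits_b, weight=1.0):
    """Sum of the existing task losses plus the weighted consistency term."""
    return sum(task_losses) + weight * semantic_consistency_loss(logits_a, logits_b)


# Toy logits for a 2-class, 3x3 prediction (illustrative values only).
a = np.zeros((2, 3, 3))
b = a.copy()
b[0] += 2.0  # the second prediction is more confident about class 0
print(semantic_consistency_loss(a, a), semantic_consistency_loss(a, b))
```

The term vanishes when the two predictions agree and grows as they diverge, so it can be attached to an existing training objective without any architectural change.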
In this work, we propose a novel panoptic Cell R-CNN V3 for instance segmentation in the biomedical images, which incorporates semantic- and instance-level features. By extending our previous Cell R-CNN  and Cell R-CNN V2 , our newly proposed Cell R-CNN V3 is improved with a residual attention feature fusion mechanism, mask quality branch, and semantic consistency regularization. With the help of the residual attention feature fusion mechanism, the semantic features of foreground objects are retained and the mask generators are able to learn more global contextual information from the semantic segmentation task in the instance branch. To alleviate the misalignment issue between the quality of each mask prediction and its classification score, our mask quality branch learns the mask prediction quality scores during training and employs the quality score to re-weigh the classification of each instance prediction during inference. For robust and accurate learning of the semantic features, the semantic consistency mechanism is proposed to regularize the two semantic segmentation tasks jointly. Furthermore, our methods have wide applicability on various biomedical datasets, including histopathology images, fluorescence microscopy images, and plant phenotype images, where we outperform several state-of-the-art methods by a large margin, including our previous Cell R-CNN and Cell R-CNN V2.
By fulfilling the future work in Cell R-CNN V2 , our Cell R-CNN V3 has been verified to be effective on various biomedical datasets. In future work, we will further adapt our method to general image processing tasks. As our method is effective for 2D image analysis, we can also extend it to 3D microscopy image instance segmentation, which is another important and interesting problem related to this work.
-  (2017) Deep watershed transform for instance segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 5221–5229. Cited by: §II-A.
-  (2011) Multi-field-of-view strategy for image-based outcome prediction of multi-parametric estrogen receptor-positive breast cancer histopathology: comparison to oncotype dx. J. Pathol. Inform. 2. Cited by: §I.
-  (2019) Cascade r-cnn: high quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. Cited by: §II-B.
-  (2017) DCAN: deep contour-aware networks for object instance segmentation from histology images. Med. Image Anal. 36, pp. 135–146. Cited by: §I, §II-A.
-  (2019) Hybrid task cascade for instance segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4974–4983. Cited by: §II-B.
-  (2018) Masklab: instance segmentation by refining object detection with semantic and direction features. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4013–4022. Cited by: §II-B.
-  (2018) Domain adaptive faster r-cnn for object detection in the wild. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3339–3348. Cited by: §III-D.
-  (1991) Pathologic correlates of survival in 378 lymph node-negative infiltrating ductal breast carcinomas. mitotic count is the best single predictor. Cancer 68 (6), pp. 1309–1317. Cited by: §I.
-  Cited by: §I, §II-A, §IV-D4, TABLE VII.
-  Cited by: §II-B.
-  (2009) Imagenet: a large-scale hierarchical image database. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 248–255. Cited by: §IV-C.
-  (1991) Pathological prognostic factors in breast cancer. i. the value of histological grade in breast cancer: experience from a large study with long-term follow-up. Histopathology 19 (5), pp. 403–410. Cited by: §I.
-  (2017) Mask r-cnn. In Proc. IEEE Int. Conf. Comput. Vis., pp. 2980–2988. Cited by: §I, §II-B, §III-A, Fig. 5, Fig. 7, Fig. 8, Fig. 9, §IV-D1, TABLE II.
-  (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proc. IEEE Int. Conf. Comput. Vis., pp. 1026–1034. Cited by: §IV-C.
-  (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 770–778. Cited by: §III-A.
-  (2019) Mask scoring r-cnn. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6409–6418. Cited by: §I, §II-B, §III-C, §III-C.
-  (2019) Panoptic feature pyramid networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6399–6408. Cited by: §I, §II-B.
-  (2019) Panoptic segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9404–9413. Cited by: §I, §I, §II-B.
-  Cited by: §II-A, §IV-D4, TABLE VII.
-  Cited by: §II-A, §IV-D4, TABLE VII.
-  (2017) A dataset and a technique for generalized nuclear segmentation for computational pathology. IEEE Trans. Med. Imaging 36 (7), pp. 1550–1560. Cited by: §I, §II-A, §IV-A1, §IV-B, §IV-D1, TABLE II.
-  (2019) Data augmentation for leaf segmentation and counting tasks in rosette plants. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, Cited by: §IV-D4, TABLE VII.
-  (2019) Attention-guided unified network for panoptic segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 7026–7035. Cited by: §II-B.
-  (2017) Feature pyramid networks for object detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Vol. 1, pp. 4. Cited by: §III-A.
-  (2019) Nuclei segmentation via a deep panoptic model with semantic feature fusion. In Proc. IJCAI, pp. 861–868. Cited by: §I, §I, §II-B, §III-A, Fig. 5, Fig. 7, Fig. 8, Fig. 9, §IV-A1, §IV-D1, TABLE II, §V, §V.
-  (2019) An end-to-end network for panoptic segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6172–6181. Cited by: §II-B.
-  (2018) Path aggregation network for instance segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 8759–8768. Cited by: §I, §II-B.
-  (2012) Annotated high-throughput microscopy image sets for validation. Nat. Methods 9 (7), pp. 637–637. Cited by: §IV-A3.
-  (2000) Observer variation, dysplasia grading, and hpv typing: a review. Pathol. Patterns Rev. 114 (suppl_1), pp. S21–S35. Cited by: §I.
-  (2016) Finely-grained annotated datasets for image-based plant phenotyping. Pattern Recognit. Lett. 81, pp. 80–89. Cited by: §I, §IV-A4.
-  (2018) Segmentation of nuclei in histopathology images by deep regression of the distance map. IEEE Trans. Med. Imaging. Cited by: §I, §II-A, §IV-A1, §IV-A2, §IV-D1, TABLE II.
-  (2017) Automatic differentiation in pytorch. In Proc. Conf. Neural Inf. Process. Syst. Autodiff Workshop, Cited by: §IV-C.
-  (2019) Segmenting and tracking cell instances with cosine embeddings and recurrent hourglass networks. Med. Image Anal. 57, pp. 106–119. Cited by: §I, §IV-D4.
-  (2018) Instance segmentation and tracking with cosine embeddings and recurrent hourglass networks. In Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent, pp. 3–11. Cited by: §I, §II-A, TABLE VII.
-  (2017) Large kernel matters—improve semantic segmentation by global convolutional network. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1743–1751. Cited by: §III-A.
-  (2017) End-to-end instance segmentation with recurrent attention. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6656–6664. Cited by: §II-B, §IV-D4, TABLE VII.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Proc. Conf. Neural Inf. Process. Syst., pp. 91–99. Cited by: §III-A.
-  (2016) Recurrent instance segmentation. In Proc. Eur. Conf. Comput. Vis., pp. 312–329. Cited by: §IV-D4, TABLE VII.
-  (2018) Effective use of synthetic data for urban scene semantic segmentation. In Proc. Eur. Conf. Comput. Vis., pp. 86–103. Cited by: §II-B.
-  Cited by: §IV-D4, TABLE VII.
-  (2014) Annotated image datasets of rosette plants. In Proc. Eur. Conf. Comput. Vis., pp. 6–12. Cited by: §I.
-  (2016) Leaf segmentation in plant phenotyping: a collation study. Mach. Vis. Appl. 27 (4), pp. 585–606. Cited by: §I.
-  (2018) Contour-seed pairs learning-based framework for simultaneously detecting and segmenting various overlapping cells/nuclei in microscopy images. IEEE Trans. Image Process. 27 (12), pp. 5759–5774. Cited by: §I.
-  (2018) Deep leaf segmentation using synthetic data. In Proc. BMVC, pp. 327–339. Cited by: §IV-D4, TABLE VII.
-  (2018) Group normalization. In Proc. Eur. Conf. Comput. Vis., pp. 3–19. Cited by: §IV-C.
-  (2016) Robust nucleus/cell detection and segmentation in digital pathology and microscopy images: a comprehensive review. IEEE Rev. Biomed. Eng. 9, pp. 234–263. Cited by: §I.
-  (2018) Panoptic segmentation with an end-to-end cell r-cnn for pathology image analysis. In Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent, pp. 237–244. Cited by: §I, §II-B, Fig. 5, Fig. 7, Fig. 8, Fig. 9, §IV-D1, TABLE II, §V.
-  (2018) Nuclei instance segmentation with dual contour-enhanced adversarial network. In Proc. IEEE Int. Symp. Biomed. Imag, pp. 409–412. Cited by: §I, §II-A.