Digital pathology nowadays plays a crucial role in accurate cellular estimation and cancer prognosis. Specifically, nuclei instance segmentation, which captures not only location and density information but also rich morphology features such as size and the cytoplasmic ratio, is critical in tumor diagnosis and subsequent treatment procedures. However, automatically segmenting nuclei at the instance level remains challenging for several reasons. First, the prevalence of nuclei occlusions and clusters can easily cause over- or under-segmentation, which impedes accurate morphological measurements of nuclei instances. Second, blurred borders and inconsistent staining inevitably cause images to contain indistinguishable instances, introducing subjective annotations and mislabeling that make it challenging to obtain robust and objective results. Third, the variability in cell appearance, size, and density across diverse cell types and organs requires a method with good generalization ability for robust analysis.
Conventional approaches, however, fail to find a reliable threshold in the complex background, while deep learning-based methods are generally more robust and have become the benchmark for medical image segmentation [21, 25, 11]. For example, Chen et al. proposed a deep contour-aware network (DCAN) for instance segmentation, which first harnesses the complementary information of contours and instances to separate attached objects. To utilize contour-specific features to assist nuclei prediction, BES-Net directly concatenates the output contour features with the nuclei features in the decoders. However, it only learns complementary information in the nuclei branch and ignores the potentially reversed benefit from nuclei to contour, which is more essential since contour appearance is more complicated and has larger intra-class variance than that of nuclei.
Another challenge is to eliminate the effect of inevitably noisy and subjective annotations. Different training strategies and loss functions have been proposed [9, 24, 6, 19]. A bootstrapped loss was proposed to rebalance the loss weight by taking the consistency between the label and a reliable output into account. However, when dealing with noisy labels, especially mislabeled nuclei, the network tends to predict with a high confidence score, where the negative log-likelihood magnitude is non-trivial and cannot be appropriately adjusted by the consistency term. As we will show later (Sec. 2.3), these outliers overwhelm others in the loss calculation and dominate the gradient.
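The failure mode above can be seen numerically with a minimal numpy sketch of the soft bootstrapped loss (following Reed et al.: the target is a convex mix of the label and the network's own prediction; the mixing weight `beta` and the probabilities below are illustrative, not the paper's settings):

```python
import numpy as np

def soft_bootstrapped_loss(p, y, beta=0.95):
    """Soft bootstrapped cross-entropy: the target is a convex mix of
    the hard label y and the network's own prediction p."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    target = beta * y + (1 - beta) * p
    return -(target * np.log(p) + (1 - target) * np.log(1 - p))

# A mislabeled pixel: the label says background (y=0) but the network is
# confidently (and arguably correctly) predicting nucleus (p=0.99).
loss_outlier = soft_bootstrapped_loss(0.99, 0.0)
# An informative pixel the network is still learning (p=0.7, y=1).
loss_inlier = soft_bootstrapped_loss(0.7, 1.0)
print(loss_outlier, loss_inlier)
```

Even with the consistency term, the confidently mispredicted pixel keeps a loss an order of magnitude larger than the informative one, so it still dominates the gradient.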
To address the issues mentioned above, we make the following contributions in this paper. 1) We propose an Information Aggregation Module (IAM) that enables the decoders to collaboratively refine the details of nuclei and contours by leveraging spatial and texture dependencies in bi-directional feature aggregation. 2) A novel Smooth Truncated loss is proposed to modulate outliers' perturbation in the loss calculation, which endows the network with the ability to robustly segment nuclei instances by focusing on learning informative samples. Moreover, eliminating outliers keeps the network from overfitting these noisy samples, eventually giving it better generalization capability. 3) We validate the effectiveness of the proposed Contour-aware Information Aggregation Network (CIA-Net), with the advantages of pyramidal information aggregation and robustness, on the Multi-Organ Nuclei Segmentation (MoNuSeg) dataset covering seven different organs, and achieved 1st place in the 2018 MICCAI challenge, demonstrating the superior performance of the proposed approach.
Fig. 1 presents an overview of CIA-Net, a fully convolutional network (FCN) consisting of one densely connected encoder and two task-specific information-aggregated decoders for refinement. To fully leverage the complementary information from the highly correlated tasks, instead of directly concatenating task-specific features, our method conducts a hierarchical refinement procedure by aggregating multi-level task-specific features between the decoders.
2.1 Densely Connected Encoder with Pyramidal Feature Extraction
To effectively train the deep FCN, dense connectivity is introduced in the encoder. In each Dense Module (DM), let $x_\ell$ denote the output of the $\ell$-th layer; dense connectivity can be described as $x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}])$, where $H_\ell(\cdot)$ is the transformation of the $\ell$-th bottleneck layer and $[\cdot]$ denotes concatenation. It sets up direct connections from any bottleneck layer to all subsequent layers, which not only reuses features effectively and efficiently but also benefits gradient back-propagation in the deep network. A Transition Module (TM) is added after each DM to reduce the spatial resolution and make the features more compact; it contains a convolution layer and an average pooling layer with a stride of 2. We hierarchically stack four DMs, each followed by a TM except the last one, where each DM consists of a number of bottleneck layers.
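The dense connectivity pattern can be sketched as follows (random channel-mixing matrices stand in for real convolutions; the layer count and growth rate are illustrative choices, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def bottleneck(x, out_channels, rng):
    """Stand-in for a bottleneck layer: a random 1x1 'convolution'
    (channel mix) followed by ReLU. x has shape (H, W, C)."""
    w = rng.standard_normal((x.shape[-1], out_channels)) * 0.1
    return np.maximum(x @ w, 0.0)

def dense_module(x0, num_layers=4, growth=12, rng=rng):
    """Dense connectivity: layer l sees the concatenation of all
    preceding feature maps, x_l = H_l([x_0, ..., x_{l-1}])."""
    features = [x0]
    for _ in range(num_layers):
        x = bottleneck(np.concatenate(features, axis=-1), growth, rng)
        features.append(x)
    return np.concatenate(features, axis=-1)

out = dense_module(rng.standard_normal((8, 8, 16)))
print(out.shape)  # channels grow by `growth` per layer: 16 + 4*12 = 64
```

Each layer's input width grows with depth, which is exactly what makes feature reuse cheap and gradients short-circuited.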
Inspired by the feature pyramid network, which takes advantage of multi-scale features for accurate object detection, we propose to make full use of pyramidal features hierarchically by building multi-level lateral connections between the encoder and the decoders. In this way, localization and texture information from earlier layers can complement the low-resolution but semantically strong features from deeper layers to refine the details. The encoder features are passed through the lateral connections, where a $1\times 1$ convolution reduces the number of feature maps, and are merged with the upsampled deeper features in the decoders by a summation operation, as shown in Fig. 2(a).
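The lateral-connection merge can be sketched in a few lines (nearest-neighbour upsampling and a matmul stand in for bilinear interpolation and a real 1x1 convolution; all shapes are illustrative):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def lateral_merge(encoder_feat, decoder_feat, w_lateral):
    """FPN-style merge: a 1x1 conv (here a channel-mixing matmul) on the
    encoder feature, summed with the 2x-upsampled decoder feature."""
    return encoder_feat @ w_lateral + upsample2x(decoder_feat)

rng = np.random.default_rng(0)
enc = rng.standard_normal((16, 16, 64))   # high-res encoder feature
dec = rng.standard_normal((8, 8, 32))     # coarser decoder feature
w = rng.standard_normal((64, 32)) * 0.1   # 1x1 conv: 64 -> 32 channels
merged = lateral_merge(enc, dec, w)
print(merged.shape)  # (16, 16, 32)
```

Summation (rather than concatenation) keeps the channel count fixed across pyramid levels, so the same decoder head can run at every scale.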
2.2 Bi-directional Feature Aggregation for Accurate Segmentation
Given that the contour region encases the corresponding nuclei, it is intuitive that nuclei and contours have high spatial and contextual relevance, which helps the decoders localize and focus on learning informative patterns. In other words, the neural response from a specific kernel in the nuclei branch can be considered an extra spatial or contextual cue for localizing contours to refine details, and vice versa. In this regard, we propose the Information Aggregation Module (IAM), which utilizes information from the highly correlated sub-tasks to bi-directionally aggregate task-specific features between the two decoders. Fig. 2(b) shows the details of the IAM structure: it takes the features after the lateral connections as inputs, then selects and aggregates informative features for each sub-task.
To start the iteration, we attach a convolution layer on top of the encoder to generate the coarsest feature maps. For each decoder, the feature maps from a higher level are upsampled by bilinear interpolation to double the resolution and added to the high-resolution feature maps from the encoder through the lateral connections (see Fig. 2(a)). After that, the IAM takes the merged maps as inputs and applies a convolution without nonlinear activation to smooth them and eliminate grid effects. The smoothed features are then fed into the classifier to predict multi-resolution score maps. Meanwhile, these task-specific features are concatenated along the channel dimension and passed through two parallel convolution layers to select and integrate the complementary informative features for further detail refinement in the next iteration.
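The bi-directional aggregation step can be sketched as follows (channel-mixing matmuls stand in for the parallel convolutions; channel counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """Stand-in for a 1x1 convolution + ReLU: a channel mix."""
    return np.maximum(x @ w, 0.0)

def iam(nuclei_feat, contour_feat, w_n, w_c):
    """Information Aggregation Module sketch: concatenate the two
    task-specific maps along channels, then two parallel convolutions
    select complementary features for each decoder branch."""
    joint = np.concatenate([nuclei_feat, contour_feat], axis=-1)
    return conv1x1(joint, w_n), conv1x1(joint, w_c)

C = 32
nuc = rng.standard_normal((8, 8, C))
con = rng.standard_normal((8, 8, C))
w_n = rng.standard_normal((2 * C, C)) * 0.1
w_c = rng.standard_normal((2 * C, C)) * 0.1
nuc_ref, con_ref = iam(nuc, con, w_n, w_c)
print(nuc_ref.shape, con_ref.shape)  # both (8, 8, 32)
```

Because each branch reads the *joint* feature map through its own learned projection, the selection of complementary features is learnable rather than a fixed one-way concatenation as in BES-Net.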
Besides, to prevent the network from relying on single-level discriminative features, a deep supervision mechanism is introduced at each stage to strengthen the learning of multi-level contextual information. This also benefits the training of deeper network architectures by shortening the back-propagation path.
2.3 Smooth Truncated Loss for Robust Nuclei Segmentation
The existence of blurred edges and inconsistent staining makes images inevitably contain indistinguishable instances, which leads to subjective annotations such as mislabeled objects and inaccurate boundaries. Additionally, to enhance the ability to split attached nuclei, the conventional practice is to preprocess the training ground truth by subtracting the dilated contour mask, which is also suboptimal and risks introducing noise. Both factors show that it is unavoidable for pixel-wise nuclei annotations to contain imperfect labels, which harms network training in at least two ways. First, the inaccurately labeled regions encountered during training tend to overwhelm other regions in the loss calculation and dominate the gradients. This phenomenon is observed from the sorted cumulative distribution function of the normalized loss in Fig. 3(b) using a converged model: a small fraction of top-loss samples accounts for a disproportionately large share of the total cross-entropy loss, which prevents the network from learning from informative samples during gradient back-propagation. Second, forcibly learning the subjective labels eventually pushes the network to fit them in particular and to overfit, which is even more pernicious when predicting nuclei of unseen organs. To handle noisy and incomplete labels, a bootstrapped loss was proposed to rebalance the loss weight by considering the consistency between the label and a reliable output. However, as can be seen in Fig. 3(b), when faced with errors of low predicted probability, it cannot easily compensate for the loss, whose magnitude is non-trivial.
To solve this problem, our insight is to reduce outliers' interference during training by modulating their contribution in the loss calculation. Under the premise of highly credible network predictions, the majority of outliers will lie in low-predicted-probability regions and yield large error values. Inspired by the Huber loss for robust regression, which is quadratic for small error values and linear for large ones to reduce the influence of outliers, we propose the prototype of our loss function, namely the Truncated loss $L_T$, which reduces the contribution of outliers. Let $p_t$ denote the predicted probability of the ground-truth class: $p_t = p$ if $y = 1$ and $p_t = 1 - p$ otherwise, in which $y \in \{0, 1\}$ specifies the ground-truth label. Formally, the loss is truncated when the corresponding $p_t$ is smaller than a threshold $\gamma$:

$$
L_T = \begin{cases} -\log \gamma, & p_t < \gamma \\ -\log p_t, & p_t \ge \gamma \end{cases} \quad (1)
$$
The Truncated loss only clips outliers with $p_t < \gamma$ while preserving the loss value for the others. Intuitively, this operation constrains the maximum contribution of each pixel to the loss and hence eases the gradient domination by outliers, benefiting the learning of informative samples. However, in Eq. (1) the derivative of $L_T$ at the clipping point is undefined. Meanwhile, perturbations at low predicted probability are not reflected in the loss if all values beyond the threshold are forced to a constant; therefore a smoothed version is preferred for optimization. In this regard, we propose the Smooth Truncated loss $L_{ST}$:

$$
L_{ST} = \begin{cases} -\log \gamma + \frac{1}{2}\left(1 - \frac{p_t^2}{\gamma^2}\right), & p_t < \gamma \\ -\log p_t, & p_t \ge \gamma \end{cases} \quad (2)
$$
A quadratic function with the same value and derivative as the negative log-likelihood at the truncation point is used to modulate the loss weight for outliers. By constraining the loss magnitude, it reduces the contribution of outliers: the smaller $p_t$, the more considerable the modulation. This, in turn, lets the network discard the indistinguishable parts and focus on informative and learnable regions. Furthermore, by reducing the influence of the outlier samples that interfere with training, it encourages the network to predict with higher confidence scores and narrows the uncertain regions, which helps alleviate over-segmentation.
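The Smooth Truncated loss described above (negative log-likelihood above the truncation point, a quadratic with matching value and slope below it) can be sketched directly in numpy; `gamma=0.3` here is just an illustrative choice:

```python
import numpy as np

def smooth_truncated_loss(p_t, gamma=0.3):
    """Smooth Truncated loss: NLL for p_t >= gamma, and a quadratic
    with matching value and derivative at p_t = gamma below it."""
    p_t = np.asarray(p_t, dtype=float)
    nll = -np.log(np.clip(p_t, 1e-7, 1.0))
    quad = -np.log(gamma) + 0.5 * (1.0 - p_t**2 / gamma**2)
    return np.where(p_t >= gamma, nll, quad)

# An outlier (p_t near 0) is capped near -log(gamma) + 0.5 instead of
# blowing up, while inliers keep the usual cross-entropy value.
print(smooth_truncated_loss([0.01, 0.3, 0.9], gamma=0.3))
```

One can verify the smoothness by checking that both branches give $-\log\gamma$ at $p_t = \gamma$, and that the quadratic's derivative $-p_t/\gamma^2$ equals the NLL's derivative $-1/p_t$ there.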
2.4 Overall Loss Function
Based on the proposed Smooth Truncated loss, we can derive the overall loss function. Note that contour prediction is much more difficult than nuclei prediction due to the irregularly curved shapes. In this case, the primary component of the high-loss regions is not the outliers but the inlier samples, and hence using a truncated loss may confuse the network. Instead, we use the Soft Dice loss to learn the shape similarity:

$$
L_{Dice} = 1 - \frac{2\sum_i p_i g_i}{\sum_i p_i + \sum_i g_i}
$$

where $p_i$ denotes the predicted probability of the $i$-th pixel and $g_i$ denotes the corresponding ground truth. In sum, the total loss function for training the proposed CIA-Net is:

$$
L_{total} = L_{Dice} + \lambda L_{ST} + \beta \|W\|_2^2
$$

where the first and second terms calculate the error of the contour and nuclei predictions respectively, and the third term is the weight decay; $\lambda$ and $\beta$ are hyper-parameters to balance the three components.
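The Soft Dice term can be sketched as follows (this is a common formulation of the Soft Dice loss; the paper's exact variant, e.g. any smoothing constant, may differ):

```python
import numpy as np

def soft_dice_loss(p, g, eps=1e-7):
    """Soft Dice loss over per-pixel probabilities p and binary
    ground truth g; 0 for a perfect match, approaching 1 otherwise."""
    p, g = np.ravel(p), np.ravel(g)
    return 1.0 - (2.0 * np.sum(p * g) + eps) / (np.sum(p) + np.sum(g) + eps)

p = np.array([[0.9, 0.8], [0.1, 0.0]])  # predicted contour probabilities
g = np.array([[1.0, 1.0], [0.0, 0.0]])  # ground-truth contour mask
print(soft_dice_loss(p, g))  # small value for a near-perfect prediction
```

Because the Dice ratio normalizes by the total foreground mass, a handful of mislabeled contour pixels cannot dominate it the way they dominate a per-pixel cross-entropy sum.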
3 Experimental Results
3.1 Dataset and Evaluation Metrics
We validated our proposed method on the MoNuSeg dataset of the 2018 MICCAI challenge, which contains 30 images extracted from whole slide images (WSIs) in The Cancer Genome Atlas (TCGA). The dataset covers breast, liver, kidney, prostate, bladder, colon, and stomach, containing both benign and malignant cases, and is divided into a training set (Train), a test set from the same organs as the training data (Test1), and a test set from unseen organs (Test2), with 14, 8, and 6 images, respectively. Train contains 4 organs (breast, kidney, liver, and prostate) with 4 images from each organ, Test1 includes 2 images per organ present in Train, and Test2 contains 2 images from each unseen organ, i.e., bladder, colon, and stomach.
We employed the Aggregated Jaccard Index (AJI) for comparison, which uses an aggregated intersection cardinality as the numerator and an aggregated union cardinality as the denominator over all ground-truth and segmented nuclei. Let $\mathcal{G} = \{G_1, \ldots, G_n\}$ denote the set of instance ground truths, $\mathcal{S}$ the set of segmented objects, and $\mathcal{U}$ the set of segmented objects with no intersection with any ground truth. Then

$$
\mathrm{AJI} = \frac{\sum_{i=1}^{n} |G_i \cap S_{j^*}|}{\sum_{i=1}^{n} |G_i \cup S_{j^*}| + \sum_{S_k \in \mathcal{U}} |S_k|},
\qquad j^* = \arg\max_{j} \frac{|G_i \cap S_j|}{|G_i \cup S_j|}.
$$

The F1-score is used to evaluate nuclei instance detection performance, and we also report it for reference.
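The AJI matching rule above can be sketched for labeled instance masks as follows (a minimal numpy implementation; the challenge's official evaluation code may differ in tie-breaking details):

```python
import numpy as np

def aji(gt, pred):
    """Aggregated Jaccard Index for labeled masks (0 = background).
    Each ground-truth instance is matched to the predicted object with
    the highest IoU; never-matched predictions inflate the denominator."""
    gt_ids = [i for i in np.unique(gt) if i != 0]
    pr_ids = [j for j in np.unique(pred) if j != 0]
    used = set()
    inter_sum, union_sum = 0, 0
    for i in gt_ids:
        g = gt == i
        best_iou, best_j = 0.0, None
        best_inter, best_union = 0, int(g.sum())  # unmatched: |G_i|
        for j in pr_ids:
            s = pred == j
            inter = int(np.logical_and(g, s).sum())
            if inter == 0:
                continue
            union = int(np.logical_or(g, s).sum())
            if inter / union > best_iou:
                best_iou, best_j = inter / union, j
                best_inter, best_union = inter, union
        inter_sum += best_inter
        union_sum += best_union
        if best_j is not None:
            used.add(best_j)
    for j in pr_ids:  # penalize spurious, never-matched predictions
        if j not in used:
            union_sum += int((pred == j).sum())
    return inter_sum / union_sum if union_sum else 0.0

# Tiny example: one perfectly matched nucleus, one missed nucleus, and
# one spurious prediction -> AJI = 4 / (4 + 4 + 4) = 1/3.
gt = np.zeros((4, 4), int)
gt[:2, :2], gt[2:, 2:] = 1, 2
pred = np.zeros((4, 4), int)
pred[:2, :2], pred[:2, 2:] = 1, 2
print(aji(gt, pred))  # 0.333...
```

Note how both the missed nucleus and the spurious object enlarge the denominator, which is why AJI punishes over- and under-segmentation harder than a detection F1-score does.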
3.2 Implementation Details
We implemented our network using TensorFlow (version 1.7.0). The default parameters provided at https://github.com/pudae/tensorflow-densenet were used for the DenseNet backbone. A stain normalization method was applied before training. Data augmentations including cropping, flipping, elastic transformation, and color jitter were utilized. The contour map was subtracted from the nuclei map, and the connected components were then detected to obtain the final results. The network was trained on one NVIDIA TITAN Xp GPU with a mini-batch size of three. We used a DenseNet model pre-trained on ImageNet to initialize the encoder. The hyper-parameters balancing the loss and the regularization were set to 0.42 and 0.0001, respectively. The AdamW optimizer was used to optimize the whole network; the learning rate was initialized to 0.001 and decayed according to the cosine annealing and warm restarts strategy.
3.3 Evaluation and Comparison
Effectiveness of the contour-aware information aggregation architecture. We first conduct a series of experiments to compare different methods and feature aggregation strategies: (1) Cell Profiler: a Python-based software for computational pathology employing an intensity-thresholding method. (2) Fiji: a Java-based software package using a watershed-transform nuclear segmentation method. (3) CNN3: a 3-class FCN without deep dense connectivity. (4) DCAN: a deep FCN with a multi-task learning strategy for objects and contours. (5) PA-Net: a modified path aggregation network, adding path augmentation in two independent decoders to enhance instance segmentation performance. (6) BES-Net: the original boundary-enhanced segmentation network, which concatenates contour features with nuclei features to enhance learning in the boundary region. (7) CIA-Net w/o IAM: the proposed architecture with two independent decoders for nuclei and contour prediction but without the Information Aggregation Module. (8) Proposed CIA-Net: our Contour-aware Information Aggregation Network with the Information Aggregation Module between the nuclei and contour decoders. Unless specified otherwise, we use the same encoder structure with the pyramidal feature extraction strategy and the same loss functions to establish a fair comparison.
| Method | AJI (Test1) | AJI (Test2) | F1 (Test1) | F1 (Test2) |
|---|---|---|---|---|
| (1) Cell Profiler | 0.1549 | 0.0809 | 0.4143 | 0.3917 |
| (7) CIA-Net w/o IAM | 0.6106 | 0.5817 | 0.8279 | 0.8356 |
All CNN-based approaches achieve much higher results on all evaluation criteria than the conventional approaches, highlighting the superiority of deep learning-based methods for segmentation tasks. Moreover, methods (4) to (8) show a striking improvement in AJI on both Test1 and Test2 compared with (3), validating the efficacy of the dense connectivity structure, which is more powerful in leveraging multi-level features and mitigating gradient vanishing when training deep neural networks. While methods (4) to (7) achieve comparable performance on Test1, the results of BES-Net and CIA-Net w/o IAM outperform the others significantly in AJI on Test2, demonstrating that exploiting the high spatial and contextual relevance between nuclei and contours generates task-specific features that assist feature refinement in both tasks, which helps enhance generalization to unseen data. Meanwhile, compared with BES-Net and CIA-Net w/o IAM, the proposed CIA-Net further outperforms both methods consistently in AJI, achieving the overall best performance and boosting results to 0.6306 on Test2 and 0.6129 on Test1. Different from BES-Net, which directly concatenates features from the contour decoder into the nuclei branch, the proposed CIA-Net with IAM bi-directionally aggregates the task-specific features and passes them through parallel convolutions to iteratively aggregate informative features in the decoders. It is therefore a learnable procedure for the network to find favorable features, which mutually benefits the two sub-tasks. Compared with the improvement in AJI, the improvement in F1-score is less significant, because AJI is a segmentation-based metric while the F1-score is a detection-based metric.
Effectiveness of the proposed Smooth Truncated loss. Toward potential clinical application, the method should be robust under numerous circumstances, especially for diffused-chromatin and attached nuclei in unseen organs, which is evaluated on the Test2 set. We compare the results of our proposed CIA-Net trained with four different loss functions: (1) the Binary Cross-Entropy loss; (2) the Soft Bootstrapped loss, which rebalances the loss weight; (3) the proposed Truncated loss without smoothing around the truncation point, i.e., Eq. (1); (4) the proposed Smooth Truncated loss, which uses a quadratic function as soft modulation, i.e., Eq. (2).
As can be seen in Table 4, the improvement of the Soft Bootstrapped loss over the Binary Cross-Entropy loss is limited. Compared with the first two rows, results from the Truncated and Smooth Truncated losses outperform the others by a large margin in AJI on Test2, which consists of unseen organs, and are analogous on Test1. The proposed Smooth Truncated loss achieves significant improvements over the Truncated loss on Test2, showing that it is less sensitive to $\gamma$ and has better generalization capability across organ images. The Smooth Truncated loss introduces one new hyper-parameter, the truncation threshold $\gamma$, which controls the starting point of down-weighting outliers. When $\gamma = 0$, the loss function degenerates into the Binary Cross-Entropy loss. As $\gamma$ increases, more examples with $p_t$ lower than $\gamma$ are considered outliers or less informative samples and are down-weighted in the loss calculation. Fig. 4 illustrates the influence of varying $\gamma$: both truncated losses yield a striking overall improvement compared with the Binary Cross-Entropy and Soft Bootstrapped losses, and, more importantly, the Smooth Truncated loss demonstrates less sensitivity to the choice of $\gamma$.
The predicted probability heatmaps (Fig. 5(b)) contain massive blur and noise, which is unfavorable for binarizing instances. As $\gamma$ increases, the heatmaps become more concrete with fewer uncertain areas, which is of great significance for preventing over-segmentation. However, setting $\gamma$ too large increases the risk of under-segmentation, as can be seen in Fig. 5(f), because over-suppressing the low-$p_t$ regions also penalizes learning from informative inlier samples, especially boundary regions where $p_t$ is relatively small.
2018 MICCAI MoNuSeg challenge results. We employed the entire dataset above for training and used 14 additional images provided by the organizers, with held-out ground truth, for independent evaluation (https://monuseg.grand-challenge.org). The top 20 results of 36 teams are shown in Fig. 6. Our submitted entry surpassed all other methods, highlighting the strength of the proposed CIA-Net and Smooth Truncated loss.
Qualitative analysis. Fig. 7 shows representative samples from Test1 and Test2 with challenging cases such as diffuse-chromatin nuclei and irregular shapes. Notice that our proposed CIA-Net (Fig. 7(e)) can generate segmentation results similar to the annotations of human experts, outperforming the others with less over- or under-segmentation on the prolific nuclei clusters and attached cases.
Instance-level nuclei segmentation is the pivotal step for cell estimation and further pathological analysis.
In this paper, we propose CIA-Net with the smooth truncated loss to tackle the challenges of prolific nuclei clusters and inevitable labeling noise in pathological images.
Our method can inherently be adapted to a wide range of medical image segmentation tasks, such as histology gland segmentation, to boost performance.
Acknowledgments. This work was supported by Hong Kong Innovation and Technology Fund (Project No. ITS/041/16), Guangdong province science and technology plan project (No.2016A020220013).
-  Carpenter, A.E., Jones, T.R., Lamprecht, M.R., Clarke, C., Kang, I.H., Friman, O., Guertin, D.A., Chang, J.H., Lindquist, R.A., Moffat, J., et al.: Cellprofiler: image analysis software for identifying and quantifying cell phenotypes. Genome biology 7(10), R100 (2006)
-  Chen, H., Qi, X., Yu, L., Dou, Q., Qin, J., Heng, P.A.: Dcan: Deep contour-aware networks for object instance segmentation from histology images. Medical image analysis 36, 135–146 (2017)
-  Cheng, J., Rajapakse, J.C., et al.: Segmentation of clustered nuclei with shape markers and marking function. IEEE Trans. Biomed. Eng. 56(3), 741–748 (2009)
-  Dou, Q., Yu, L., Chen, H., Jin, Y., Yang, X., Qin, J., Heng, P.A.: 3d deeply supervised network for automated segmentation of volumetric medical images. Medical image analysis 41, 40–54 (2017)
-  Friedman, J., Hastie, T., Tibshirani, R.: The elements of statistical learning, vol. 1. Springer series in statistics New York (2001)
-  Goldberger, J., Ben-Reuven, E.: Training deep neural-networks using a noise adaptation layer. In: ICLR 2017 (2017)
-  Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)
-  Irshad, H., Montaser-Kouhsari, L., Waltz, G., Bucur, O., A Nowak, J., Dong, F., Knoblauch, N., Beck, A.: Crowdsourcing image annotation for nucleus detection and segmentation in computational pathology: evaluating experts, automated methods, and the crowd. In: Pac Symp Biocomput. pp. 294–305. World Scientific (2014)
-  Jiang, L., Zhou, Z., Leung, T., Li, L.J., Fei-Fei, L.: Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In: ICML (2018)
-  Jung, C., Kim, C.: Segmenting clustered nuclei using h-minima transform-based marker extraction and contour parameterization. IEEE Trans. Biomed. Eng. 57(10), 2600–2604 (2010)
-  Kazeminia, S., Baur, C., Kuijper, A., van Ginneken, B., Navab, N., Albarqouni, S., Mukhopadhyay, A.: Gans for medical image analysis. arXiv preprint arXiv:1809.06222 (2018)
-  Kumar, N., Verma, R., Sharma, S., Bhargava, S., Vahadane, A., Sethi, A.: A dataset and a technique for generalized nuclear segmentation for computational pathology. IEEE Trans. Med. Imaging 36(7), 1550–1560 (2017)
-  Lin, T.Y., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. In: IEEE CVPR (2017)
-  Liu, S., Qi, L., Qin, H., Shi, J., Jia, J.: Path aggregation network for instance segmentation. In: IEEE CVPR (2018)
-  Loshchilov, I., Hutter, F.: Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101 (2017)
-  Macenko, M., Niethammer, M., Marron, J.S., Borland, D., Woosley, J.T., Guan, X., Schmitt, C., Thomas, N.E.: A method for normalizing histology slides for quantitative analysis. In: IEEE ISBI (2009)
-  Oda, H., Roth, H.R., Chiba, K., Sokolić, J., Kitasaka, T., Oda, M., Hinoki, A., Uchida, H., Schnabel, J.A., Mori, K.: Besnet: Boundary-enhanced segmentation of cells in histopathological images. In: MICCAI 2018. LNCS, vol. 11071, pp. 228–236 (2018)
-  Pantanowitz, L.: Digital images and the future of digital pathology. J Pathol Inform 1 (2010)
-  Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., Qu, L.: Making deep neural networks robust to label noise: A loss correction approach. In: IEEE CVPR (2017)
-  Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., Rabinovich, A.: Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596 (2014)
-  Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI 2015. LNCS, vol. 9351, pp. 234–241 (2015)
-  Schindelin, J., Arganda-Carreras, I., Frise, E., Kaynig, V., Longair, M., Pietzsch, T., Preibisch, S., Rueden, C., Saalfeld, S., Schmid, B., et al.: Fiji: an open-source platform for biological-image analysis. Nature methods 9(7), 676 (2012)
-  Veta, M., Kornegoor, R., Huisman, A., Verschuur-Maes, A.H., Viergever, M.A., Pluim, J.P., Van Diest, P.J.: Prognostic value of automatically extracted nuclear morphometric features in whole slide images of male breast cancer. Mod. Pathol. 25(12), 1559 (2012)
-  Xue, C., Dou, Q., Shi, X., Chen, H., Heng, P.A.: Robust learning at noisy labeled medical images: Applied to skin lesion classification. In: IEEE ISBI (2019)
-  Yi, X., Walia, E., Babyn, P.: Generative adversarial network in medical imaging: A review. arXiv preprint arXiv:1809.07294 (2018)