One of the main challenges in traditional manual pathology evaluation based on H&E (Haematoxylin and Eosin) stained slides is the significant time, effort and skill required for the visual assessment of each case. A massive number of samples is produced on a daily basis, all of which need to be examined, while an increasing shortage of subspecialised pathologists is being reported. Fortunately, recent advances in digitisation techniques for scanning whole slide images lay a good foundation for developing intelligent computer-aided histopathology assessment systems. Such systems are expected to augment pathologists' abilities by automating some fundamental, labour-intensive and relatively easy tasks, allowing the experts to focus on the most challenging parts of the assessment. The analysis of cell shape, size, distribution and other features is an essential task for both biologists and histopathologists in their visual analysis of histology data. Likewise, the automation of this task plays a critical role in subsequent analysis in computer-aided histopathology image assessment. The localisation and classification of cell types provide important clues in the diagnosis of some diseases; for example, the spatial distribution of cells can be utilised as a unique feature for tumour segmentation.
Recently, deep convolutional neural networks have achieved impressive performance on object detection and segmentation tasks, opening new opportunities for tackling the challenge of automatic nuclei detection and segmentation in histology images. Dozens of successful deep learning based object detection and segmentation methods have been proposed, including two-stage object detection methods like Fast R-CNN and Faster R-CNN, which cascade the features from two stages for better results. One-stage methods like the "single shot multibox detector" (SSD) and "you only look once" (YOLO) are faster, since all procedures are accomplished in one network, while keeping an accuracy comparable to that of two-stage networks. Among segmentation networks, U-net is prevalently used for biomedical image processing due to its concise and efficient structure compared to other segmentation networks like DeepLab. However, these networks are either too powerful and complex, or too simple and ineffective, to directly produce decent nuclei detection results. Specially designed networks are required to address the unique nature of the data in nuclei detection and segmentation, such as high density, occlusion, and a limited range of shapes and sizes.
Compared to natural objects, the detection and segmentation of nuclei may seem much easier due to their simple structures and homogeneous appearance. However, despite the fact that nuclei detection and segmentation have been studied for decades, there is still no publicly available trained model that supports universal nuclei detection across H&E slides from different labs and conditions. Before the broad adoption of deep neural networks, conventional nuclei detection methods often used statistical or geometric features of images to generate seeds. In most cases, colour deconvolution is a necessary pre-processing or normalisation step for guaranteeing coherent performance on different datasets.
In the era of deep learning, various networks have been proposed to solve this challenge. The work by Xie et al. proposes a fully convolutional regression network structure with good performance on overlapping and clumping cells. Another regression network employs bounding boxes for cell (nucleus) detection. To address the common problem of a lack of training data, other attempts take semi-supervised or unsupervised approaches. Xu et al. extract features of the nuclei with an unsupervised network, a stacked sparse auto-encoder, and then use the extracted features to classify foreground and background. Yet the task of automating nuclei detection and segmentation remains under-addressed, for various reasons including the lack of training sets and the high visual variance of data from different sources.
Therefore, in this paper we propose a robust model for nuclei detection and segmentation that can produce accurate and coherent results on independent H&E image datasets with varying conditions. This model, referred to as US-Net, benefits from a concise yet efficient architecture consisting of a nuclei detection network and a segmentation network. It involves a workflow that dynamically integrates the regression output of nuclei locations and the end-to-end output of semantic segmentation to enhance the performance of both networks.
The main contribution of this research is two-fold: i) a novel and robust deep neural network architecture for instance segmentation of nuclei in H&E stained histopathology images; and ii) an enhanced focal loss designed to deal with class imbalance and accelerate training.
2 US-Net for nuclei detection and segmentation
To tackle the task of precise, instant and generic nuclei detection and segmentation in H&E histology images, a specifically designed network architecture, US-Net, is proposed in this research. As shown in Fig. 1, the structure of US-Net is very compact, composed of segmentation and detection branches that share the same backbone network. In principle, the proposed network combines the powerful end-to-end semantic segmentation ability of the U-net structure with the excellent object detection and classification performance of SSD, more precisely RetinaNet, to achieve instance segmentation results with the help of a post-processing sub-network for refinement.
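At a structural level, the shared-backbone, two-branch data flow described above can be sketched as follows. All callables here are placeholders standing in for the real backbone and heads; the function names and signatures are our own illustrative assumptions, not the authors' code.

```python
def us_net_forward(image, backbone, seg_head, det_head):
    """Shared-backbone forward pass: one feature extraction, two heads."""
    features = backbone(image)
    seg_mask = seg_head(features)       # end-to-end semantic segmentation branch
    boxes, scores = det_head(features)  # SSD/RetinaNet-style detection branch
    return seg_mask, boxes, scores
```

The point of this structure is that both branches are trained against the same feature extractor, so improvements driven by either loss benefit the other.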
The overall objective of the network extends the MultiBox objective: given an input image $x$ with its corresponding segmentation mask $m$, location information $l$ and class information $c$, the loss is defined as:

$$L(x, m, c, l) = L_{seg}(x, m) + \alpha L_{conf}(x, c) + \beta L_{loc}(x, l)$$

where the parameters $\alpha$ and $\beta$ control the relative importance of the loss components.
The term $L_{seg}$, which is defined with the $L_2$ norm, helps to achieve the segmentation results:

$$L_{seg}(x, m) = \frac{1}{N} \sum_{i=1}^{N} \| s_i - m_i \|_2$$

where $s$ is the segmentation outcome of the segmentation branch, $m$ is the ground truth mask and $N$ is the total number of pixels in the input image. The loss $L_{conf}$ measures the confidence scores of the binary class (nucleus or not) of the detected boxes with an adapted version of the focal loss:
where $c$ denotes the ground truth classification information, and the focusing behaviour is associated with the output of the segmentation branch through the agreement ratio

$$r = \frac{N_s}{N_m}$$

where $N_m$ represents the number of pixels in the ground truth mask $m$, while $N_s$ denotes the number of pixels that equal $m$ in the output $s$ of the segmentation branch. The term $L_{loc}$ calculates the location regression loss of the multi-boxes given the ground truth location information $l$ with a smooth $L_1$ loss as defined in Fast R-CNN.
For the bounding box location information $l = (cx, cy, h, w)$, $(cx, cy)$ is the center of the box while $h$ and $w$ represent its height and width, with the smooth $L_1$ loss defined as:

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise.} \end{cases}$$
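The loss terms above can be sketched in a few lines of plain Python. The function and parameter names are our own illustrative choices, and the focal loss is shown in its standard form rather than the paper's segmentation-adapted version.

```python
import math

def seg_loss(pred, truth):
    """Pixel-wise squared-error (L2) segmentation loss, averaged over N pixels."""
    n = len(truth)
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / n

def focal_loss(p, target, weight=0.25, gamma=2.0):
    """Focal loss for one binary prediction p in (0, 1).

    The paper adapts the focusing behaviour using the segmentation
    output; only the standard form is shown here for clarity.
    """
    pt = p if target == 1 else 1.0 - p
    return -weight * (1.0 - pt) ** gamma * math.log(pt)

def smooth_l1(x):
    """Smooth L1 loss used for the box regression term."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def total_loss(l_seg, l_conf, l_loc, alpha=1.0, beta=0.1):
    """Weighted objective L = L_seg + alpha * L_conf + beta * L_loc."""
    return l_seg + alpha * l_conf + beta * l_loc
```

Note how the smooth L1 term is quadratic near zero (gentle gradients for small errors) and linear elsewhere, which is why it is preferred over plain L2 for box regression.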
The output of US-Net alone can only achieve a relatively rough segmentation result. Thus, post-processing steps are necessary to realise instance-level segmentation of the input image. For this purpose, another U-net style network consisting of 4 convolutional layers is built to further refine the detected regions.
3 Experiments and Evaluation
3.1 Dataset

The training dataset employed in the experiments comes from the Segmentation of Nuclei in Images Contest (SNIC) [“Digital pathology: Segmentation of nuclei in images.” Online: http://miccai.cloudapp.net/competitions/83] and the MICCAI MoNuSeg challenge [“Multi-organ nuclei segmentation challenge.” Online: https://monuseg.grand-challenge.org/Home/]. There are 32 patches of size 600×600 pixels from SNIC and 30 patches of size 1000×1000 pixels from MoNuSeg, both with instance-level annotation. The proposed model works with input patches of size 300×300. Hence, the images from the original datasets are cropped to size 400×400 with a fixed step size of 200. After pre-processing, 878 patches are acquired, of which 650 are used for training and 228 for evaluation. Nuclei from different organs are all treated as the same kind of nucleus, i.e. no category information is attached to each nucleus, since accurate detection and segmentation of the nuclei is the main focus. In addition to the 300×300 patches, another dataset with patch size 48×48 is needed for the refinement sub-network in the post-processing stage. These patches are cropped from the scaled bounding box areas and then resized to 48×48.
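The sliding-window cropping described above can be sketched as follows. The paper does not specify its border handling, so the edge-aligned final crop here is an assumption and the resulting patch counts are illustrative only.

```python
def crop_positions(img_size, crop=400, step=200):
    """Top-left coordinates of sliding-window crops along one axis."""
    starts = list(range(0, img_size - crop + 1, step))
    # Cover the right/bottom edge if the stride does not land exactly
    # on it (a common convention; an assumption, not from the paper).
    if starts[-1] + crop < img_size:
        starts.append(img_size - crop)
    return starts

# 1000x1000 MoNuSeg images -> starts at 0, 200, 400, 600 (16 crops per image)
# 600x600 SNIC images      -> starts at 0, 200 (4 crops per image)
```

A 2-D grid of crops is then the Cartesian product of these positions along each axis.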
3.2 Implementation Details
In the U-net part, there are six down-sampling layers and six up-sampling layers connected by a bottleneck layer. The block size for all layers is 4. In the SSD part, the detectable objects' (nuclei) size is constrained to the range of 20–128 pixels by using the feature maps from the last three layers of the base network, with corresponding feature map sizes of 38×38, 19×19 and 10×10. Two anchor box aspect ratios and two scales (0.8 and 1.2) are considered in the experiments, which makes four different anchor boxes for each point in the feature maps. This makes up the 7620 default anchor boxes for an input image.
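The anchor count can be sanity-checked from the configuration above. The feature-map sizes of 38, 19 and 10 are inferred here from the stated total of 7620 default boxes with four anchors per cell, since they were lost in the extracted text.

```python
# Four anchor boxes per feature-map cell: two aspect ratios x two scales.
feature_map_sizes = [38, 19, 10]
anchors_per_cell = 2 * 2
total_anchors = sum(s * s for s in feature_map_sizes) * anchors_per_cell
print(total_anchors)  # 7620
```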
In the training phase, the parameter $\alpha$ is set to 1 while $\beta$ is set to 0.1. Two different optimizers are employed for the two branches: for the SSD branch, the Adam optimizer with a learning rate of 0.001; for the U-net branch, SGD with a learning rate of 0.0001, momentum of 0.9 and weight decay of 0.0001.
The evaluation of the proposed network is divided into two parts: the evaluation of the semantic segmentation results and the evaluation of the detection results. For the segmentation part, pixel accuracy (PA) is calculated as the number of accurately predicted pixels out of the total number of pixels. The evaluation metric for the object detection part is the interpolated average precision (AP) used for the VOC dataset.
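The two metrics can be sketched as follows: pixel accuracy as described above, plus the intersection-over-union (IoU) used to match detections to ground truth in VOC-style AP. The full interpolated AP computation is omitted; these helper names are our own.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label matches the ground truth."""
    correct = sum(1 for p, t in zip(pred, truth) if p == t)
    return correct / len(truth)

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```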
In Fig. 2, the losses for the different branches of US-Net, as well as the AP/PA for evaluation, are shown along the training process. From the loss and AP/PA curves, it can be observed that US-Net performs much better than either branch individually. We also find that adding DenseNet or ResNet blocks to the networks does not necessarily improve performance, due to the low complexity of the features in the images. Furthermore, we apply the trained model to histopathology images from colorectal liver metastasis patients. From the results demonstrated in Fig. 3, we can visually observe that the trained network has decent robustness and transferability.
In summary, we have proposed a robust network architecture, US-Net, for nuclei detection and segmentation. The network incorporates the strengths of the U-net and SSD networks to realise a concise and powerful instance segmentation network for locating nuclei in H&E stained images. Comparison with other state-of-the-art methods demonstrates the efficiency of the proposed network.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 1, pp. 142–158, 2016.
-  R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 6, pp. 1137–1149, 2017.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on CVPR, 2016, pp. 779–788.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2018.
-  W. Xie, J. A. Noble, and A. Zisserman, “Microscopy cell counting and detection with fully convolutional regression networks,” Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, pp. 1–10, 2016.
-  S. U. Akram, J. Kannala, L. Eklund, and J. Heikkilä, “Cell segmentation proposal network for microscopy image analysis,” in International Workshop on Large-Scale Annotation of Biomedical Data and Expert Label Synthesis. Springer, 2016, pp. 21–29.
-  J. Xu, L. Xiang, Q. Liu, H. Gilmore, J. Wu, J. Tang, and A. Madabhushi, “Stacked sparse autoencoder (ssae) for nuclei detection on breast cancer histopathology images,” IEEE transactions on medical imaging, vol. 35, no. 1, pp. 119–130, 2016.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” IEEE transactions on pattern analysis and machine intelligence, 2018.
-  D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable object detection using deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2147–2154.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” Int. J. Comput. Vision, vol. 88, no. 2, pp. 303–338, Jun. 2010.