Accurately counting the number of cells in microscopic images greatly aids medical diagnoses and biological studies [1]. However, manual cell counting is slow, expensive, and prone to subjective errors, and is therefore not practical for high-throughput processes. An automatic and efficient solution with improved counting accuracy is highly desirable, but automatic cell counting is challenging due to the low image contrast, strong tissue background, and significant inter-cell occlusions in 2D microscopic images [2, 3, 4, 5].
Generally speaking, density-based methods employ machine-learning tools to learn a density regression model (DRM) that estimates the cell density distribution from the characteristics/features of a given image. The number of cells can subsequently be estimated by integrating the estimated density map. For example, Lempitsky et al. [6] proposed a supervised learning framework to learn a linear DRM and employed it for visual object counting tasks. In contrast, Xie et al. [7] utilized a fully convolutional regression network (FCRN) to learn a DRM for regressing cell spatial densities over the image. The FCRN, as a specific fully convolutional neural network (FCNN), integrates informative feature extraction and powerful function learning to estimate the density map of a given image. It demonstrates promising performance for cell counting tasks, especially for counting overlapped cells.
Generally, the layers in the FCRN are hierarchically constructed, and the output of each layer relies on the outputs of its previous layers. There are two potential shortcomings of this traditional network design. First, the intermediate layers are optimized based on the gradient information back-propagated only from the final layer of the network, not directly from the adjacent layer. Second, this design allows only adjacent layers to be connected, which limits the integration of multi-scale features (or information) and, in turn, the overall network performance. These two shortcomings might lead to sub-optimized intermediate layers and can eventually degrade the overall cell counting accuracy. Thus, there remains a need to improve the accuracy of automated cell counting methods.
Recently, to improve the effectiveness of learning the intermediate layers of deep neural networks, deeply-supervised learning (or deep supervision) has been proposed and has shown promising performance in various computer vision tasks, such as classification and segmentation [12, 13]. In addition, CNN frameworks with concatenation layers have also attracted great attention. These networks concatenate multi-scale features through shortcut connections between non-adjacent layers, and thereby achieve better results than traditional networks in computer vision tasks such as segmentation and detection.
Motivated by these works, this study proposes a novel density regression-based automatic cell counting method. An FCNN is used as a primary FCNN (PriCNN) to learn the density regression model (DRM) that performs an end-to-end mapping from a cell image to the corresponding density map. A set of auxiliary CNNs (AuxCNNs) is built to assess the features at the intermediate layers in the PriCNN and to directly supervise the training of these layers. In addition, by use of concatenation layers, the multi-scale features from non-adjacent layers are integrated to improve the granularity of the features extracted from the intermediate layers, further supporting the final density map estimation. Experimental results, evaluated on a set of immunofluorescent images of human embryonic stem cells (hESC), demonstrate the superior performance of the proposed deep supervision-based DRM method compared to other state-of-the-art methods.
2.1 Background: Density-Based Automatic Cell Counting
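The goal of density regression-based cell counting methods is to learn a density regression function $F$, which can be employed to estimate the density map of a given image [6, 7]. Given an image $X \in \mathbb{R}^{M \times N}$ which includes $N_c$ cells, the density map $Y \in \mathbb{R}^{M \times N}$ of $X$ can be considered as the superposition of a set of $N_c$ normalized 2D discrete Gaussian kernels that are placed at the centroids of the $N_c$ cells. Therefore, the number of cells can be counted by integrating the density map over the image.
Let $S = \{(s_{k_x}, s_{k_y}) : k = 1, \ldots, N_c\}$ represent the cell centroid positions in $X$. Each pixel $Y_{i,j}$ on the density map $Y$ can be expressed as:
$$Y_{i,j} = \sum_{k=1}^{N_c} G_{\sigma}(i - s_{k_x},\, j - s_{k_y}), \qquad (1)$$
where $G_{\sigma}(n_x, n_y) = C \cdot e^{-\frac{n_x^2 + n_y^2}{2\sigma^2}}$, $n_x, n_y \in \{-K_G, \ldots, K_G\}$, is a normalized 2D Gaussian kernel that satisfies $\sum_{n_x=-K_G}^{K_G} \sum_{n_y=-K_G}^{K_G} G_{\sigma}(n_x, n_y) = 1$. Here, $\sigma^2$ is the isotropic covariance, $(2K_G + 1) \times (2K_G + 1)$ is the kernel size, and $C$ is a normalization constant.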
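The construction of a ground-truth density map can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the image size, the centroid coordinates, and the values `sigma=3.0` and `half_width=10` are all assumptions chosen for the example.

```python
import numpy as np

def gaussian_kernel(sigma, half_width):
    """Normalized 2D discrete Gaussian on a (2*half_width+1)^2 grid; sums to 1."""
    offsets = np.arange(-half_width, half_width + 1)
    nx, ny = np.meshgrid(offsets, offsets, indexing="ij")
    kernel = np.exp(-(nx**2 + ny**2) / (2.0 * sigma**2))
    return kernel / kernel.sum()  # the division plays the role of the constant C

def density_map(shape, centroids, sigma=3.0, half_width=10):
    """Superpose one normalized kernel at each annotated centroid (row, col)."""
    kernel = gaussian_kernel(sigma, half_width)
    size = 2 * half_width + 1
    # Work on a padded canvas so kernels near the border can be added safely.
    padded = np.zeros((shape[0] + 2 * half_width, shape[1] + 2 * half_width))
    for row, col in centroids:
        padded[row:row + size, col:col + size] += kernel
    return padded[half_width:-half_width, half_width:-half_width]

# Hypothetical annotations: three cells, two of them overlapping.
centroids = [(20, 20), (22, 25), (40, 50)]
density = density_map((64, 64), centroids)
estimated_count = density.sum()  # integrating the map recovers the cell count
```

Because every kernel sums to one, the integral of the map equals the number of annotated cells as long as the kernels lie fully inside the image, even when neighbouring cells overlap.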
The density regression-based cell counting process generally includes three steps: (1) map an image to a feature map, (2) estimate a cell density map from the feature map, and (3) integrate the density map for automatic cell counting. In the first step, each pixel $X_{i,j}$ in an $M \times N$ image $X$ can be assumed to be associated with a real-valued feature vector $\phi(X_{i,j})$. The feature map of $X$ can be generated using specific feature extraction methods, such as the dense scale invariant feature transform (SIFT) descriptor [16], ordinary filter banks [17], or codebook learning [18]. In the second step, the estimated density $\hat{Y}_{i,j}$ of each pixel in $X$ can be obtained by applying a pre-trained density regression function $F$ on the given $\phi(X_{i,j})$:
$$\hat{Y}_{i,j} = F(\phi(X_{i,j}); \Theta), \qquad (2)$$
where $\Theta$ is a parameter vector that determines the function $F$. Finally, in the third step, the number of cells in $X$, $\hat{N}_c$, can be counted by integrating the estimated densities over the image region:
$$\hat{N}_c = \sum_{i=1}^{M} \sum_{j=1}^{N} \hat{Y}_{i,j}. \qquad (3)$$
A key task in density regression-based cell counting methods is learning the function $F$ by use of training datasets. The learning of $F$ and the related cell counting method proposed in this study are described below.
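The three steps can be illustrated with a deliberately simple linear regressor. The sketch below is a toy under stated assumptions, not the method of this paper or of the cited works: the hand-crafted features (bias, intensity, 3x3 local mean), the ordinary least-squares fit, the function names, and the synthetic data are all hypothetical stand-ins.

```python
import numpy as np

def pixel_features(image):
    """Step 1: per-pixel feature vector (bias, intensity, 3x3 local mean)."""
    rows, cols = image.shape
    padded = np.pad(image, 1, mode="edge")
    local_mean = sum(
        padded[dr:dr + rows, dc:dc + cols] for dr in range(3) for dc in range(3)
    ) / 9.0
    return np.stack([np.ones_like(image), image, local_mean], axis=-1)

def fit_linear_drm(images, densities):
    """Learn the parameter vector of a linear regressor by least squares."""
    features = np.concatenate([pixel_features(im).reshape(-1, 3) for im in images])
    targets = np.concatenate([d.ravel() for d in densities])
    theta, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return theta

def count_cells(image, theta):
    """Steps 2 and 3: estimate the per-pixel density, then integrate it."""
    estimated_density = pixel_features(image) @ theta
    return estimated_density.sum()

# Synthetic data for illustration only: the "density" is a scaled copy of the image.
rng = np.random.default_rng(0)
train_images = rng.random((4, 16, 16))
train_densities = 0.01 * train_images
theta = fit_linear_drm(train_images, train_densities)

test_image = rng.random((16, 16))
estimated = count_cells(test_image, theta)
```

A CNN-based DRM replaces both the hand-crafted features and the linear map with learned layers, but the final integration step is unchanged.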
2.2 The Proposed Automatic Cell Counting Framework
In this study, we propose a novel automatic cell counting method that employs a deeply-supervised density regression model. The framework, shown in Figure 1, includes two phases: 1) DRM training and 2) cell counting by use of the trained DRM. The network architecture of the proposed DRM is described in Section 2.2.1. The two phases in the framework are described in Sections 2.2.2 and 2.2.3, respectively.
2.2.1 The DRM Network Architecture
The DRM is built as a primary FCNN (PriCNN) with the purpose of estimating the density map $\hat{Y}$ of an image $X$, such that:
$$\hat{Y} = F(X; \Theta), \qquad (4)$$
where $F$ is a density regression function, and $\Theta$ is a parameter vector that determines $F$.
The PriCNN consists of eight blocks. Each of the first three blocks includes a convolutional (CONV) layer, a ReLU layer, and a max-pooling (Pool) layer; the fourth block includes a CONV layer and a ReLU layer; each of the fifth to seventh blocks includes an up-sampling (UP) layer, a CONV layer, and a ReLU layer; and the last block includes a chain of a CONV layer and a ReLU layer.
In addition, concatenation layers are employed in the PriCNN to integrate multi-scale features and thus improve the granularity of the features, which assists in the final density map estimation. This design is motivated by the network architecture described by Ronneberger et al. [14]. As shown in Figure 1, the outputs from each of the first three blocks are multi-resolution, low-dimension, and highly-representative feature maps. Three shortcut connections are established to connect the first and seventh blocks, the second and sixth blocks, and the third and fifth blocks, respectively. With these shortcut connections, multi-resolution features can be concatenated between non-adjacent layers. The integration of multi-scale features can further improve the performance of the network compared to the traditional FCRN, which allows only adjacent layers to be connected.
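To make the block layout concrete, the following sketch traces feature-map shapes through the eight blocks. It is bookkeeping only, under several assumptions that the text does not confirm: a pooling and up-sampling factor of 2, size-preserving convolutions, a single-channel input, and concatenation of each shortcut at the input of the fifth, sixth, and seventh blocks. The kernel counts are the ones listed in Section 3.2; the function name is hypothetical.

```python
def trace_pricnn(height, width, pool=2):
    """Return (block, input_channels, output_shape) for the eight PriCNN blocks."""
    kernels = [32, 64, 128, 512, 128, 64, 32, 1]  # kernel counts from Section 3.2
    h, w, channels = height, width, 1  # single-channel input is an assumption
    skips, trace = [], []

    # Blocks 1-3: CONV -> ReLU -> Pool (size-preserving convolution assumed).
    for block in (1, 2, 3):
        in_channels = channels
        channels = kernels[block - 1]
        h, w = h // pool, w // pool
        skips.append((h, w, channels))
        trace.append((block, in_channels, (h, w, channels)))

    # Block 4: CONV -> ReLU at the coarsest resolution.
    trace.append((4, channels, (h, w, kernels[3])))
    channels = kernels[3]

    # Blocks 5-7: concatenate the matching shortcut, then UP -> CONV -> ReLU.
    for block in (5, 6, 7):
        skip_h, skip_w, skip_c = skips.pop()
        assert (skip_h, skip_w) == (h, w), "shortcut and decoder sizes must match"
        in_channels = channels + skip_c
        h, w = h * pool, w * pool
        channels = kernels[block - 1]
        trace.append((block, in_channels, (h, w, channels)))

    # Block 8: CONV -> ReLU producing the one-channel density map.
    trace.append((8, channels, (h, w, kernels[7])))
    return trace

trace = trace_pricnn(64, 64)
for block, in_channels, out_shape in trace:
    print(f"block {block}: in_channels={in_channels}, out={out_shape}")
```

Under these assumptions the output density map has the same height and width as the input, provided both are divisible by 8.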
2.2.2 DRM Training Process
1. AuxCNN-supported DRM Training
Training the designed DRM (or PriCNN) with such a hierarchical structure is a challenging task. As described in Section 1, all the layers in the original FCRN [7] are learned based on the feedback from the final layer only. Therefore, the intermediate layers might be sub-optimized, which can significantly affect the accuracy of the final estimated density map.
To address this, three auxiliary FCNNs (AuxCNNs) are employed to provide additional supervision for learning the intermediate layers of the PriCNN. As shown in Figure 2, each AuxCNN contains two CONV-ReLU blocks for estimating a low-resolution density map from the feature map generated at an intermediate layer of the PriCNN. By jointly minimizing the errors between the estimated density maps and the corresponding ground truth density maps at different resolution levels, the optimization of the intermediate layers in the PriCNN can be improved, which eventually improves the overall performance of the PriCNN.
2. Jointly Training the PriCNN and AuxCNNs
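The parameters of the PriCNN can be learned by jointly training the PriCNN and the AuxCNNs with a set of given training data $\{(X_b, Y_b)\}_{b=1}^{B}$, where $X_b$ and $Y_b$ represent the $b$-th image and its associated ground-truth density map, respectively. The training is completed by minimizing the differences between the estimated density maps at different resolution levels and the ground truth density maps.
As shown in Figure 2, the PriCNN can be denoted mathematically as $F(X; \Theta)$, where $\Theta$ is the parameter vector of $F$ and $X$ is an input image. All the trainable parameters in the first four blocks, the 5th, the 6th, and the last two blocks can be denoted as $\Theta_1$, $\Theta_2$, $\Theta_3$, and $\Theta_4$, respectively. Therefore, $\Theta = (\Theta_1, \Theta_2, \Theta_3, \Theta_4)$ and $F(X; \Theta) = F(X; \Theta_1, \Theta_2, \Theta_3, \Theta_4)$. Also, the output feature maps of the 4th, 5th, and 6th blocks can be denoted as $\phi_1(X; \Theta_1)$, $\phi_2(X; \Theta_1, \Theta_2)$, and $\phi_3(X; \Theta_1, \Theta_2, \Theta_3)$, respectively. Similarly, the three AuxCNNs can be denoted as $A_k(\phi_k; \theta_k)$, $k = 1, 2, 3$, where $\theta_k$ is the parameter vector of the $k$-th AuxCNN, and $A_k(\phi_k; \theta_k)$ is the density map estimated by use of the $k$-th AuxCNN. Therefore, the cooperative training of the PriCNN and AuxCNNs is performed by jointly minimizing four loss functions, defined below:
$$L_0(\Theta) = \frac{1}{B} \sum_{b=1}^{B} \mathrm{MSE}\big(F(X_b; \Theta),\, Y_b\big), \qquad L_k(\Theta_1, \ldots, \Theta_k, \theta_k) = \frac{1}{B} \sum_{b=1}^{B} \mathrm{MSE}\big(A_k(\phi_k(X_b; \Theta_1, \ldots, \Theta_k); \theta_k),\, Y_b^{k}\big), \quad k = 1, 2, 3, \qquad (5)$$
where $Y_b^{k}$ is the ground truth low-resolution density map (GTLR) generated from $Y_b$. $Y_b^{k}$ is generated from the original ground-truth density map $Y_b$ by summing every adjacent local region in $Y_b$ (the region sizes used in this study are given in Section 3.2). $L_0$ is the average mean square error (MSE) between the estimated density maps and their ground truths. $L_k$ is the average MSE between the low-resolution density maps estimated by the $k$-th AuxCNN and their corresponding GTLR density maps.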
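Generating a low-resolution ground truth by summing adjacent local regions can be sketched as follows. The function name and the region sizes 8, 4, and 2 are assumptions made for illustration; they correspond to a pooling factor of 2 per block, which the text does not state explicitly.

```python
import numpy as np

def low_res_density(density, factor):
    """Sum every non-overlapping factor x factor region of a density map."""
    height, width = density.shape
    if height % factor or width % factor:
        raise ValueError("map size must be divisible by the region size")
    blocks = density.reshape(height // factor, factor, width // factor, factor)
    return blocks.sum(axis=(1, 3))

rng = np.random.default_rng(0)
full_res = rng.random((16, 16))  # stand-in for a ground-truth density map
gtlr_maps = [low_res_density(full_res, factor) for factor in (8, 4, 2)]
```

Summing, rather than averaging, keeps the integral of each low-resolution map equal to the original cell count, so every auxiliary target still encodes the same number of cells.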
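To improve the computational efficiency of the optimization of the PriCNN and AuxCNNs, we construct a combined loss function, defined as below:
$$L(\Theta, \theta_1, \theta_2, \theta_3) = L_0(\Theta) + \sum_{k=1}^{3} \lambda_k L_k(\Theta_1, \ldots, \Theta_k, \theta_k), \qquad (6)$$
where $L_0$ is the loss of the PriCNN, $L_k$ is the loss of the $k$-th AuxCNN, and $\lambda_k$ is a parameter that controls the relative strength of the supervision under the $k$-th AuxCNN for learning the intermediate layers in the PriCNN. Eqn. (6) is numerically minimized via stochastic gradient descent (SGD) methods [19].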
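A combined loss of this kind can be sketched as a weighted sum of one primary term and three auxiliary terms. The weighted-sum form, the function names, the weights of 0.5, and the array shapes below are assumptions for illustration; in practice the loss would be written in a deep learning framework so that SGD can differentiate it.

```python
import numpy as np

def mse(estimates, targets):
    """Mean squared error averaged over all pixels and all images in the batch."""
    return float(np.mean((estimates - targets) ** 2))

def combined_loss(primary_pred, primary_gt, aux_preds, aux_gts, weights):
    """Primary loss plus weighted auxiliary losses at lower resolutions."""
    total = mse(primary_pred, primary_gt)
    for weight, pred, target in zip(weights, aux_preds, aux_gts):
        total += weight * mse(pred, target)
    return total

# Hypothetical batch of two 8x8 maps with auxiliary maps at 1x1, 2x2, and 4x4.
primary_pred, primary_gt = np.zeros((2, 8, 8)), np.ones((2, 8, 8))
aux_preds = [np.zeros((2, s, s)) for s in (1, 2, 4)]
aux_gts = [np.full((2, s, s), 2.0) for s in (1, 2, 4)]
total = combined_loss(primary_pred, primary_gt, aux_preds, aux_gts, [0.5, 0.5, 0.5])
```

Setting a weight to zero removes the corresponding auxiliary supervision, which recovers training of the primary network alone.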
2.2.3 Density Estimation and Cell Counting
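During the cell counting phase of the framework (Figure 1), the number of cells in a to-be-tested image $X$ can be estimated by use of the trained DRM represented by $F(X; \Theta^{*})$, where $\Theta^{*}$ denotes the learned parameters:
$$\hat{N}_c = \sum_{i} \sum_{j} \big[F(X; \Theta^{*})\big]_{i,j}, \qquad (7)$$
where $[F(X; \Theta^{*})]_{i,j}$ is the estimated density at pixel $(i, j)$. In this step, the dimensions of the to-be-tested image can differ from those of the training images because arbitrary input image sizes are allowed by the trained PriCNN.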
3 Experimental Results
A set of immunofluorescent images of human embryonic stem cells (hESC) was employed in this study to test the performance of the proposed method. Each image was pixels, and the hESC images were manually annotated by identifying the centroid of each cell within each image. Statistically, the cell number among these images is about .
For each annotated image in the training dataset, the corresponding ground truth density map was generated by placing a normalized 2D discrete Gaussian kernel with isotropic covariance, , at each annotated cell centroid in the image (details shown in Section 2.1). The values of and were set to pixels and pixels, respectively. A pair consisting of an image and a density map was considered as a training sample for training the PriCNN and AuxCNNs. In this study, 5-fold cross validation was employed to evaluate the cell counting performance.
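A 5-fold split of the annotated images can be sketched as below. The helper name, the random seed, and the sample count of 50 are hypothetical; the text does not describe how the folds were formed.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, validation_indices) for each fold."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for held_out in range(n_folds):
        train = np.concatenate(
            [fold for index, fold in enumerate(folds) if index != held_out]
        )
        yield train, folds[held_out]

splits = list(k_fold_indices(n_samples=50))
for fold_number, (train, validation) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(train)} training, {len(validation)} validation")
```

Each image is used for validation exactly once, so the reported errors cover the whole dataset.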
3.2 Method Implementation
We compared the performance of the proposed method (denoted as PriCNN+AuxCNN) with a state-of-the-art method, FCRN [7]. The FCRN used the same network architecture as the PriCNN, but without concatenation layers. In addition, a PriCNN without AuxCNNs (or an FCRN with concatenation layers) was also compared to illustrate the performance improvement by use of AuxCNNs.
In the PriCNN, the convolution kernel size in the first seven blocks was set to , while that in the last block was set to . The numbers of kernels in the first to 8th CONV layers were set to 32, 64, 128, 512, 128, 64, 32, and 1, respectively. The pooling size in each Pool layer was set to , and the UP layers performed bilinear interpolation.
In the first block of the AuxCNN, the kernel size was set to and the number of kernels was , while the comparable values in the second block were and , respectively. In addition, the ground truth low-resolution density map (GTLR) was generated from the original ground-truth density map by summing local regions with size of , , and , respectively.
All three methods, PriCNN+AuxCNN, PriCNN-only, and FCRN, were trained under the same hyper-parameter configurations, including a learning rate of and a batch size of 100. In addition, all the parameters were orthogonally initialized [20].
In this study, the mean absolute error (MAE) and the standard deviation of absolute errors (STD) were employed to evaluate the cell counting performance. The MAE measures the mean of the absolute errors between the estimated cell counts and their ground truths for all images in the validation set. The STD measures the standard deviation of the absolute errors. Table 1 shows that the proposed method yields superior cell counting performance to the other two methods in terms of MAE and STD. In addition, Figure 3 presents the density maps of one example hESC image estimated by the three methods. The number of cells counted from each density map is indicated below it. From the figure, we can see that the proposed method (PriCNN+AuxCNN) estimates a density map that is more similar to the ground truth than those of the other two methods. Also, its estimated cell count is closer to the ground truth.
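The two metrics can be computed as in the following sketch. The function name and the counts are made up for illustration, and the use of the population standard deviation (NumPy's default) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np

def counting_errors(estimated_counts, true_counts):
    """Return the MAE and the standard deviation of the absolute errors."""
    estimated = np.asarray(estimated_counts, dtype=float)
    truth = np.asarray(true_counts, dtype=float)
    absolute_errors = np.abs(estimated - truth)
    # np.std defaults to the population estimator (ddof=0); an assumption here.
    return absolute_errors.mean(), absolute_errors.std()

# Hypothetical counts for three validation images.
mae, std = counting_errors([98, 105, 110], [100, 100, 100])
```

A low MAE with a high STD would indicate that the method is accurate on average but unreliable on individual images, which is why both are reported.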
Convolutional neural networks (CNNs) have succeeded in computer vision tasks, including image classification [21], segmentation [14, 22], and object detection [23]. This success arises because CNNs can integrate informative feature extraction and powerful nonlinear function learning. Furthermore, fully convolutional neural networks (FCNNs), such as FCN [24] and U-Net [14], allow flexible input image sizes, and have been employed to perform an efficient end-to-end mapping from an image (one domain) to a probability map (another domain). Both the PriCNN (the proposed DRM in this study) and the FCRN are specific FCNNs, which explains the decent cell counting accuracy they achieved in this study.
In this study, only a set of experimental immunofluorescent hESC images was employed for DRM training and validation. However, the applicability of the proposed method should not be limited to such images. In the future, we will evaluate the proposed method on image sets of other modalities. In addition, other competing general object counting methods will be compared with the proposed method.
In this study, for the first time, a deeply-supervised density regression framework is proposed for automatic cell counting. The results obtained on experimental hESC images demonstrate the superior cell counting performance of the proposed method, compared with the state of the art.
Acknowledgements. This work was supported in part by awards NIH R01EB020604, R01EB023045, R01NS102213, and R21CA223799.
[1] Coates, A. S., Winer, E. P., Goldhirsch, A., Gelber, R. D., Gnant, M., Piccart-Gebhart, M., Thürlimann, B., Senn, H.-J., Members, P., André, F., et al., “Tailoring therapies—improving the management of early breast cancer: St Gallen international expert consensus on the primary therapy of early breast cancer 2015,” Annals of oncology 26(8), 1533–1546 (2015).
[2] Matas, J., Chum, O., Urban, M., and Pajdla, T., “Robust wide-baseline stereo from maximally stable extremal regions,” Image and vision computing 22(10), 761–767 (2004).
[3] Barinova, O., Lempitsky, V., and Kohli, P., “On detection of multiple object instances using Hough transforms,” IEEE Transactions on Pattern Analysis and Machine Intelligence 34(9), 1773–1784 (2012).
[4] Arteta, C., Lempitsky, V., Noble, J. A., and Zisserman, A., “Learning to detect cells using non-overlapping extremal regions,” in [International Conference on Medical Image Computing and Computer-Assisted Intervention], 348–356, Springer (2012).
[5] Xing, F., Su, H., Neltner, J., and Yang, L., “Automatic Ki-67 counting using robust cell detection and online dictionary learning,” IEEE Transactions on Biomedical Engineering 61(3), 859–870 (2014).
[6] Lempitsky, V. and Zisserman, A., “Learning to count objects in images,” in [Advances in neural information processing systems], 1324–1332 (2010).
[7] Xie, W., Noble, J. A., and Zisserman, A., “Microscopy cell counting and detection with fully convolutional regression networks,” Computer methods in biomechanics and biomedical engineering: Imaging & Visualization 6(3), 283–292 (2018).
[8] Arteta, C., Lempitsky, V., Noble, J. A., and Zisserman, A., “Detecting overlapping instances in microscopy images using extremal region trees,” Medical image analysis 27, 3–16 (2016).
[9] Cireşan, D. C., Giusti, A., Gambardella, L. M., and Schmidhuber, J., “Mitosis detection in breast cancer histology images with deep neural networks,” in [International Conference on Medical Image Computing and Computer-assisted Intervention], 411–418, Springer (2013).
[10] Liu, F. and Yang, L., “A novel cell detection method using deep convolutional neural network and maximum-weight independent set,” in [Deep Learning and Convolutional Neural Networks for Medical Image Computing], 63–72, Springer (2017).
[11] Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z., “Deeply-supervised nets,” in [Artificial Intelligence and Statistics], 562–570 (2015).
[12] Zeng, G., Yang, X., Li, J., Yu, L., Heng, P.-A., and Zheng, G., “3D U-net with multi-level deep supervision: fully automatic segmentation of proximal femur in 3D MR images,” in [International Workshop on Machine Learning in Medical Imaging], 274–282, Springer (2017).
[13] Dou, Q., Yu, L., Chen, H., Jin, Y., Yang, X., Qin, J., and Heng, P.-A., “3D deeply supervised network for automated segmentation of volumetric medical images,” Medical image analysis 41, 40–54 (2017).
[14] Ronneberger, O., Fischer, P., and Brox, T., “U-net: Convolutional networks for biomedical image segmentation,” in [International Conference on Medical image computing and computer-assisted intervention], 234–241, Springer (2015).
[15] Dong, H., Yang, G., Liu, F., Mo, Y., and Guo, Y., “Automatic brain tumor detection and segmentation using U-net based fully convolutional networks,” in [Annual Conference on Medical Image Understanding and Analysis], 506–517, Springer (2017).
[16] Vedaldi, A. and Fulkerson, B., “Vlfeat: An open and portable library of computer vision algorithms,” in [Proceedings of the 18th ACM international conference on Multimedia], 1469–1472, ACM (2010).
[17] Fiaschi, L., Köthe, U., Nair, R., and Hamprecht, F. A., “Learning to count with regression forest and structured labels,” in [Pattern Recognition (ICPR), 2012 21st International Conference on], 2685–2688, IEEE (2012).
[18] Sommer, C., Straehle, C. N., Koethe, U., Hamprecht, F. A., et al., “Ilastik: Interactive learning and segmentation toolkit,” in [ISBI], 2(5), 8 (2011).
[19] Bottou, L., “Large-scale machine learning with stochastic gradient descent,” in [Proceedings of COMPSTAT’2010], 177–186, Springer (2010).
[20] Saxe, A. M., McClelland, J. L., and Ganguli, S., “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” arXiv preprint arXiv:1312.6120 (2013).
[21] He, K., Zhang, X., Ren, S., and Sun, J., “Deep residual learning for image recognition,” in [Proceedings of the IEEE conference on computer vision and pattern recognition], 770–778 (2016).
[22] He, S., Zheng, J., Maehara, A., Mintz, G., Tang, D., Anastasio, M., and Li, H., “Convolutional neural network based automatic plaque characterization for intracoronary optical coherence tomography images,” in [Medical Imaging 2018: Image Processing], 10574, 1057432, International Society for Optics and Photonics (2018).
[23] Ren, S., He, K., Girshick, R., and Sun, J., “Faster R-CNN: Towards real-time object detection with region proposal networks,” in [Advances in neural information processing systems], 91–99 (2015).
[24] Long, J., Shelhamer, E., and Darrell, T., “Fully convolutional networks for semantic segmentation,” in [Proceedings of the IEEE conference on computer vision and pattern recognition], 3431–3440 (2015).