1 Introduction
A fundamental problem in neuroscience research is automatic image segmentation of neuronal cells, which is the basis for various quantitative analyses of neuronal structures, such as tracing cell genesis in Danio rerio (zebrafish) brains [3] (e.g., using the EMD-based tracking model [5]). Fully convolutional networks (FCN) [11] have emerged as a powerful deep learning model for image segmentation. In this paper, we aim to study the problem of automatically segmenting neuronal cells in dual-color confocal microscopy images with deep learning.
In this problem, we face two major challenges, which also arise in other biomedical image segmentation applications. (1) Neuron segmentation is quite complicated, due to vanishing separation among cells in densely packed clusters, very obscure cell boundaries, irregular shape deformation, etc. (see Fig. 1). Even to biologists, it is difficult to correctly identify all individual cells visually. Since state-of-the-art FCN models may incur considerable errors in this difficult task, it is highly desirable to develop new effective models for it. (2) To train FCN-type models for per-pixel prediction, pixel-level supervision is commonly needed, using fully annotated images. However, in our problem, even experienced biologists can hardly determine per-pixel ground truth. For pixels near cell boundaries, even approximate ground truth is difficult to acquire. In fact, biologists only perceive instance-level information, namely, the presence or absence of cells. Thus, how to leverage instance-level annotation to train pixel-level FCN models is important.
In this paper, we propose a new FCN-type segmentation model, called deep Complete Bipartite Networks (CBNet). Its core macro-architecture is inspired by the structure of complete bipartite graphs. Our proposed CBNet explicitly employs multi-scale feature reuse and implicitly embeds deep supervision. Moreover, to overcome the lack of pixel-level annotation, we present a new scheme to train pixel-level deep learning models using approximate instance-wise annotation. Our essential idea is to extract reliable and discriminative samples from all pixels, based on instance-level annotation. We apply our model to segment neuronal cells in dual-color confocal microscopy images of zebrafish brains. Evaluated using 7 real datasets, our method produces high quality results, both quantitatively and qualitatively. Also, the experiments show that our CBNet can achieve much higher precision/recall than the state-of-the-art FCN-type models.
Related Work. In the literature, different strategies have been proposed to improve FCN-type segmentation models, most of which share some of the following three characteristics. First, FCN can be embedded into a multi-path framework, namely, applying multiple instances of FCNs through multiple paths for different sub-tasks [4]. An intuitive interpretation of this is to use one FCN for cell boundaries and another FCN for cell interiors, and finally fuse the information from these two paths as the cell segmentation results. Second, extra pre-processing and/or post-processing can be included to boost the performance of FCNs. One may apply classic image processing techniques to the input images and combine the results thus produced together with the input images as the input to FCNs [14]. Also, contextual post-processing (e.g., fully connected CRF [6] or topology aware loss [2]) can be applied to impose spatial consistency and obtain more plausible segmentation results. Third, FCN, as a backbone network, can be combined with an object detection sub-module [1] or be applied in a recurrent fashion [12] to improve instance-level segmentation accuracy.
In this paper, we focus on developing the CBNet model, bearing in mind that CBNet can be viewed as a backbone network and thus be seamlessly combined with the above-mentioned strategies for further improvement of segmentation.
2 Methodology
2.1 CBNet
Fig. 2 shows a schematic overview of CBNet. This model employs a generalized "complete bipartite graph" structure to consolidate feature hierarchies at different scales. Overall, CBNet works at five different scales (i.e., different resolutions of the feature plane). At each of scales 1 to 4, an encoder block is employed to distill contextual information and a decoder block is used to aggregate the abstracted information at that scale, while the bridge block performs abstraction at the highest scale/lowest resolution (i.e., scale 5).
There is one shortcut connection between each encoder and each decoder to implement the complete bipartite structure, which implicitly integrates the benefits of diversified depths, feature reuse, and deep supervision [9]. With the interacting paths between encoder blocks and decoder blocks, the whole network forms an implicit ensemble of a large set of sub-networks of different depths, which significantly improves the representation capacity of the network. In a forward pass, the encoded features at one scale are effectively reused to aid decoding at each scale. In a backward pass, the shortcut connections help the gradient flow back to each encoder block efficiently, so that the supervision through the prediction block can effectively have a deep impact on all encoder blocks.
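The complete bipartite wiring amounts to resizing every encoder output to a decoder's scale and concatenating the results along the channel axis. Below is a minimal NumPy sketch of this fusion step; the function names (`maxpool2`, `upsample`, `fuse`) are illustrative, and nearest-neighbour upsampling stands in for the bilinear interpolation used in the actual model.

```python
import numpy as np

def maxpool2(x, f):
    """Downsample a (C, H, W) feature map by factor f with max pooling."""
    c, h, w = x.shape
    return x[:, :h - h % f, :w - w % f].reshape(c, h // f, f, w // f, f).max(axis=(2, 4))

def upsample(x, f):
    """Nearest-neighbour upsampling by factor f (a stand-in for bilinear)."""
    return x.repeat(f, axis=1).repeat(f, axis=2)

def fuse(decoder_in, encoder_feats, scale):
    """Resize every encoder output to the decoder's scale and concatenate
    along channels -- the complete bipartite shortcut pattern.
    encoder_feats[s] has shape (C, H / 2**s, W / 2**s) for s = 0..3."""
    parts = [decoder_in]
    for s, feat in enumerate(encoder_feats):
        if s < scale:                          # finer than decoder: pool down
            parts.append(maxpool2(feat, 2 ** (scale - s)))
        elif s > scale:                        # coarser than decoder: upsample
            parts.append(upsample(feat, 2 ** (s - scale)))
        else:
            parts.append(feat)
    return np.concatenate(parts, axis=0)
```

Every decoder thus sees features from all four encoders, which is what makes each decoder's supervision reach every encoder in the backward pass.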
Core Blocks (Encoders and Decoders). Fig. 3 shows the structures of the encoder blocks and decoder blocks. A key component for feature extraction at a particular scale is the residual module [8], with two successive "batch normalization (BN) + ReLU + convolution" operations (see Fig. 3(A)). Since we do not pad the convolution output, the input to the first BN is trimmed in both the height and width dimensions before being added to the output of the second convolution. The width of each residual module (i.e., the number of feature maps processed in the module) follows the pyramid design [15], i.e., the width increases with the scale. The encoders consist of a residual module and a "ConvDown" layer for downsampling. Inspired by [16], we use a convolution with stride 2, instead of pooling, to make the downsampling learnable and thus scale-specific. The decoders first fuse the main decoding stream with reused features from the encoders at different scales. The concatenated features include the deconvolution result [11] from the previous decoder (or the bridge block) and 4 sets of resized feature maps, each from the output of a different encoder block with proper rescaling (bilinear interpolation for upsampling and max pooling for downsampling) and/or border cropping. Then, a spatial dropout [17] (rate = 0.5), namely, randomly dropping a subset of the concatenated feature maps during training, is applied to avoid overfitting to features from specific scales. Before feeding into the residual module, a convolution is applied for dimension casting.
Auxiliary Blocks. The transition block is a convolution and ReLU (with zero padding), which can be interpreted as a mapping from the input space (of dimension 2, the red/green channels, in our case) to a rich feature space [15] for the model to exercise its representation power. The bridge block, similar to an encoder but without downsampling, performs the highest-level abstraction and triggers the decoding stream. The prediction block is a convolution and LogSoftMax, whose output indicates the probability of each pixel belonging to a neuron.
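Because the convolutions in the residual module are unpadded, the identity path must be center-cropped before the residual addition. A minimal NumPy sketch of this cropping arithmetic follows, assuming 3×3 kernels (the kernel size is our assumption) and omitting BN and ReLU for brevity; `valid_conv3x3` and `residual_module` are illustrative names.

```python
import numpy as np

def valid_conv3x3(x, k):
    """Unpadded ('valid') 3x3 cross-correlation of a 2-D map x with kernel k;
    the output shrinks by 2 pixels in each dimension."""
    h, w = x.shape[0] - 2, x.shape[1] - 2
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += k[dy, dx] * x[dy:dy + h, dx:dx + w]
    return out

def residual_module(x, k1, k2):
    """Two successive unpadded convolutions; the identity path is
    center-cropped by 2 pixels per side to match the output size."""
    y = valid_conv3x3(valid_conv3x3(x, k1), k2)
    return x[2:-2, 2:-2] + y
```

With two unpadded 3×3 convolutions, each pass through the module removes a 2-pixel border, which is why the identity input is trimmed in both height and width before the addition.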
2.2 Leveraging Approximate Instance-wise Annotation
In our problem, per-pixel ground truth cannot be obtained, even by experienced biologists. Instead, human experts are asked to draw a solid shape within each cell to indicate the cell body approximately. (Note: By "approximate", we mean that we know neither the exact bounding box nor the exact shape of each instance.) Generally, the annotations are drawn in a conservative manner, namely, leaving uncertain pixels close to cell boundaries unannotated. But, when it is absolutely certain, the solid shapes are drawn as large as possible. In Fig. 4(C), all annotated regions are in white, and the remaining pixels are in black. Directly using this kind of annotation as per-pixel ground truth would cause considerably many positive samples (i.e., pixels of cells) to be falsely used as negative samples (i.e., background), due to such conservative annotation.
Our main idea for utilizing approximate instance-wise annotation for pixel-level supervision is to extract a sufficient number of more reliable and more effective samples from all pixels based on the available annotations. Specifically, (1) we prune the annotated regions to extract reliable ground truth pixels belonging to cells, and (2) we identify a subset of the unannotated pixels that are more likely to be background, especially in the gap areas among touching cells.
Let A be an annotated binary image. First, we perform erosion on A (with a disk template of radius 1); let E denote the resulting eroded regions. Second, we perform dilation on A (with a disk template of radius 4); let D denote the result. Third, we compute the outer medial axis of E (see Fig. 4(E)), denoted by M. Then, for each pixel p, we assign its label as: 1 (Cell), if p ∈ E; 2 (Background), if p ∉ D or p ∈ M; 3 (Fuzzy Boundary), otherwise. The "Fuzzy Boundary" (roughly a ring along the boundary of an annotated region, see Fig. 4(D)), where the pixel labels are the most uncertain, is ignored during training. A special scenario is that such ring shapes of proximal cells may overlap. Thus, the outer medial axis M of the eroded annotated regions is retained as the most representative background samples to ensure separation. Note that this scheme may also be applied to other applications by adjusting the parameters (e.g., larger erosion for less conservative annotation).
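The labeling scheme above can be sketched with `scipy.ndimage`. In this sketch the outer medial axis is approximated by the generalized Voronoi boundary between the eroded components, computed from a Euclidean distance transform; `make_training_labels` is an illustrative name, not from the paper.

```python
import numpy as np
from scipy import ndimage as ndi

def disk(r):
    """Boolean disk-shaped structuring element of radius r."""
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    return (yy * yy + xx * xx) <= r * r

def make_training_labels(annot, r_erode=1, r_dilate=4):
    """annot: 2-D bool array of approximate instance annotations.
    Returns uint8 labels: 1 = Cell, 2 = Background, 3 = Fuzzy Boundary."""
    eroded = ndi.binary_erosion(annot, structure=disk(r_erode))
    dilated = ndi.binary_dilation(annot, structure=disk(r_dilate))
    # Approximate the outer medial axis as the generalized Voronoi boundary
    # between eroded components: pixels whose nearest component differs
    # from a neighbour's nearest component.
    comp, _ = ndi.label(eroded)
    _, inds = ndi.distance_transform_edt(comp == 0, return_indices=True)
    nearest = comp[inds[0], inds[1]]
    ridge = np.zeros(annot.shape, dtype=bool)
    ridge[:, 1:] |= nearest[:, 1:] != nearest[:, :-1]
    ridge[1:, :] |= nearest[1:, :] != nearest[:-1, :]
    labels = np.full(annot.shape, 3, dtype=np.uint8)  # fuzzy by default
    labels[~dilated] = 2          # confidently background
    labels[ridge] = 2             # keep the separating ridge as background
    labels[eroded] = 1            # reliable cell pixels (applied last)
    return labels
```

Assigning the eroded cell pixels last guarantees that a ridge pixel bordering a component can never overwrite a reliable cell label.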
2.3 Implementation Details
Postprocessing. The output of CBNet can be viewed as a probability map, in which each pixel is given a probability of being in a cell (a value between 0 and 1). We produce the final binary segmentation by thresholding (at 0.75), two successive binary openings (with a disk template of radius 5 and a square template of size 3), and hole filling. We find that the CBNet prediction is of high accuracy, so the result is not sensitive to the threshold and simple morphological operations are sufficient to break the potentially tenuous connections among tightly touching cells (which is not common). Also, the template sizes of the morphological operations are determined based on our object shapes (i.e., cells), and should not be difficult to adjust for other applications (e.g., a larger template for larger round cells, or a smaller template for star-shaped cells with tenuous long "arms").
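The post-processing pipeline can be sketched in a few lines with `scipy.ndimage`; the function names are illustrative, and the parameters mirror the values stated above.

```python
import numpy as np
from scipy import ndimage as ndi

def disk(r):
    """Boolean disk-shaped structuring element of radius r."""
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    return (yy * yy + xx * xx) <= r * r

def postprocess(prob, thresh=0.75):
    """prob: 2-D probability map from the prediction block, values in [0, 1].
    Threshold, open twice (disk radius 5, then 3x3 square), fill holes."""
    mask = prob > thresh
    mask = ndi.binary_opening(mask, structure=disk(5))   # break tenuous links
    mask = ndi.binary_opening(mask, structure=np.ones((3, 3), bool))
    return ndi.binary_fill_holes(mask)
```

The disk-shaped opening removes thin bridges between touching cells and isolated noise pixels, while the final hole filling restores any small interior gaps left by thresholding.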
Data Augmentation. Since we have only 5 images with annotation, we perform intensive random data augmentation to enable effective training and reduce overfitting. In each iteration, an image patch is processed by (1) horizontal flip, (2) rotation by a random degree (an integer between 1 and 180), or (3) vertical flip. Each flip is applied randomly with a preset probability. Because the random rotation usually involves intensity interpolation, implicitly introducing lighting noise, no color jittering is employed.
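A minimal sketch of this augmentation with `scipy.ndimage` is shown below. The flip probability of 0.5 is an assumed placeholder (the exact value is not specified above), and labels are rotated with nearest-neighbour interpolation so that the class ids stay valid.

```python
import numpy as np
from scipy import ndimage as ndi

rng = np.random.default_rng(0)
P_FLIP = 0.5   # assumed placeholder flip probability

def augment(img, labels):
    """img: (H, W, C) float image patch; labels: (H, W) int label map."""
    if rng.random() < P_FLIP:                        # horizontal flip
        img, labels = img[:, ::-1], labels[:, ::-1]
    if rng.random() < P_FLIP:                        # vertical flip
        img, labels = img[::-1, :], labels[::-1, :]
    deg = int(rng.integers(1, 181))                  # random 1..180 degrees
    img = ndi.rotate(img, deg, reshape=False, order=1, mode='reflect')
    labels = ndi.rotate(labels, deg, reshape=False, order=0, mode='reflect')
    return img, labels
```

Using `order=1` (bilinear) on the image but `order=0` (nearest) on the label map is the standard way to avoid interpolated, invalid label values.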
Training. Learnable parameters are initialized as in [7] and optimized using the Adam scheme [10]. The key hyperparameters are determined empirically: (1) We use a batch size of 1, since a large image patch is preferred over a large batch size [13]. (2) We use higher learning rates for the first epochs (1e-5 for epochs 1–50 and 1e-6 for epochs 51–100), and fix a small learning rate, 1e-7, for all the remaining epochs. (3) We use a weighted negative log likelihood criterion (with weights 0.25, 0.75, and 0 for "Cell", "Background", and "Fuzzy Boundary", respectively). Thus, the fuzzy boundary is ignored by assigning it a zero weight. The background is associated with a higher weight to encourage separation among cells.
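The weighted criterion can be sketched in NumPy as follows. The normalization by the total weight mirrors common weighted-NLL conventions and is our assumption; the function name and channel convention (channel 0 = cell, channel 1 = background) are illustrative.

```python
import numpy as np

# per-class weights, indexed by label id:
# 1 = Cell (0.25), 2 = Background (0.75), 3 = Fuzzy Boundary (0 -> ignored)
CLASS_W = np.array([0.0, 0.25, 0.75, 0.0])

def weighted_nll(log_probs, labels):
    """log_probs: (2, H, W) LogSoftMax output; labels: (H, W) ints in {1,2,3}."""
    w = CLASS_W[labels]
    tgt = (labels == 2).astype(int)    # target channel; fuzzy pixels'
    rr, cc = np.indices(labels.shape)  # channel is irrelevant (weight 0)
    pixel_loss = -log_probs[tgt, rr, cc]
    return (w * pixel_loss).sum() / max(w.sum(), 1e-8)
```

Because fuzzy-boundary pixels carry zero weight, their predicted probabilities contribute nothing to the loss or its gradient.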
3 Experiments
Besides having 5 images for training, we use 7 in-house datasets for evaluation, each containing 55 dual-color microscopy images of a zebrafish brain. We use double transgenic fish where GCaMP6s, a green fluorescent protein (GFP) based genetically encoded calcium indicator, and H2B-RFP, a histone-fused red fluorescent protein (RFP), are driven by the elavl3 promoter. This yields dual-color images, in which all neurons in the double transgenic fish express green fluorescence in the cytosolic compartment and red fluorescence in the nucleus.
Our method is compared with U-Net [13], a state-of-the-art FCN-type model, which has achieved many successes in various biomedical image segmentation applications. For a fair comparison, we use the same training procedure to train U-Net as we do for CBNet. The numbers of learnable parameters for CBNet and U-Net are 9M and 31M, respectively. Due to the multi-scale feature reuse, a smaller width is sufficient for each residual module in CBNet. Consequently, CBNet contains fewer learnable parameters than U-Net.
Leave-one-out experiments are conducted to quantitatively assess the performance. The results of running 2000 training epochs are given in Fig. 5(A). One can observe that CBNet achieves better validation performance than U-Net, and overfitting is not a severe issue even when using only 5 annotated training images.
Performance on the real datasets was examined in a proofreading manner. This is because pixel-level ground truth is not available in our problem (see Section 1), and even approximate instance-level annotation took two experts over 20 hours in total for the 5 training images. Specifically, we presented the segmentation results to experienced biologists in order to (1) confirm true positives, (2) reject false detections, and (3) detect false negatives. Note that falsely merged or falsely separated cells are treated as false detections. If a segmented cell is much smaller (resp., larger) than the actual size, then it is classified as a false negative (resp., false detection). Finally, Precision and Recall are calculated. In fact, the proofreading evaluation for our problem is too time-consuming to allow extensive quantitative ablation studies in practice. Also, with a similar amount of effort, we chose to evaluate and compare with the most representative baseline model on many different datasets, instead of comparing with more baseline models on only a few datasets. The quantitative testing results are shown in Fig. 5(B), and qualitative results are presented in Fig. 6. It is clear that our CBNet achieves much better results than U-Net.
We observe that a large portion of the errors made by U-Net occurs in the following two situations: (1) confusion between noisy areas and cells with relatively weak fluorescent signals (see row 1 in Fig. 6), and (2) confusion between touching cells and large single cells (see rows 2 and 3 in Fig. 6). The higher representation capability of CBNet (due to the complete bipartite graph structure) enables it to extract features more effectively and gain deeper knowledge of the semantic context. Consequently, CBNet attains more accurate segmentation in these two difficult situations and achieves significant improvement over U-Net.
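The proofreading counts map directly to the reported metrics; a one-line helper (an illustrative name, using the standard definitions) makes the mapping explicit.

```python
def precision_recall(tp, fp, fn):
    """tp: confirmed detections; fp: rejected detections (including
    falsely merged / falsely separated cells); fn: missed cells.
    Returns (precision, recall) = (tp/(tp+fp), tp/(tp+fn))."""
    return tp / (tp + fp), tp / (tp + fn)
```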
4 Conclusions
In this paper, we proposed a new FCN model, CBNet, for biomedical image segmentation. The main advantage of CBNet is deep multi-scale feature reuse achieved by employing a complete bipartite graph structure. Moreover, we presented a new scheme for training a pixel-wise prediction model using only approximate instance-wise annotation. Qualitative and quantitative experimental results show that our new method achieves high quality performance in automatic segmentation of neuronal cells and outperforms U-Net, a state-of-the-art FCN model.
References
 [1] A. Arnab and P. Torr. Bottom-up instance segmentation using deep higher-order CRFs. arXiv preprint arXiv:1609.02583, 2016.
 [2] A. BenTaieb and G. Hamarneh. Topology aware fully convolutional networks for histology gland segmentation. In MICCAI, pages 460–468, 2016.
 [3] K. L. Cerveny, M. Varga, and S. W. Wilson. Continued growth and circuit building in the anamniote visual system. Developmental Neurobiology, 72(3):328–345, 2012.
 [4] H. Chen, X. Qi, L. Yu, and P.-A. Heng. DCAN: Deep contour-aware networks for accurate gland segmentation. arXiv preprint arXiv:1604.02677, 2016.
 [5] J. Chen, C. W. Harvey, M. Alber, and D. Z. Chen. A matching model based on earth mover’s distance for tracking Myxococcus xanthus. In MICCAI, pages 113–120, 2014.
 [6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062, 2014.

 [7] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, pages 1026–1034, 2015.
 [8] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027, 2016.
 [9] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
 [10] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [11] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
 [12] B. Romera-Paredes and P. Torr. Recurrent instance segmentation. arXiv preprint arXiv:1511.08250, 2015.
 [13] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241, 2015.

 [14] S. K. Sadanandan, P. Ranefall, and C. Wählby. Feature augmented deep neural networks for segmentation of cells. In ECCV, pages 231–243, 2016.
 [15] L. N. Smith and N. Topin. Deep convolutional neural network design patterns. arXiv preprint arXiv:1611.00847, 2016.
 [16] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
 [17] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In CVPR, pages 648–656, 2015.