The relationships between brain cells and microvessels are critical to brain function [andreone2015neuronal, tsai2009correlations] and a frequent target in neurodegenerative research [iadecola2004neurovascular]. However, disease-induced microstructural changes occur across large tissue regions, making comprehensive studies difficult due to limitations in microscope fields of view and imaging speed.
High-throughput microscopy aims to produce three-dimensional images of organ-scale tissue samples at sub-cellular resolution. Two such techniques, knife-edge scanning microscopy (KESM) [mayerich2008knife] and micro-optical sectioning tomography (MOST) [li2010micro], produce images containing hundreds of millions of cells alongside interconnected microvascular networks. While the resulting images have sub-micrometer spatial resolution, they are densely packed with a variety of microstructures, making manual annotation impractical and confounding automated algorithms. There is therefore a compelling need for tools to automate segmentation on massive images, making high-throughput microscopy practical for quantitative biomedical studies.
In this paper we introduce a framework for automated nuclear and microvascular segmentation in KESM images based on an efficient and accurate deep encoder-decoder architecture (Fig. 1). KESM and MOST images are of particular interest in neurovascular anatomy [mayerich2011fast, xiong2017precise] and have far-reaching applications for quantifying diseased tissue, including cancer grading and treatment efficacy. We propose two contributions critical to cellular and microvascular reconstruction at the macroscopic scale:
A memory efficient densely-connected convolutional neural network for 3D semantic segmentation. A fully-convolutional encoder-decoder network is modified with dense connections and transition factors to control the number of trainable parameters and mitigate “memory explosion.”
The cellular and vascular structure of a large tissue region is automatically segmented, demonstrating reliability, scalability, and accuracy exceeding prior methods.
We first present related work in Section 2. Section 3 describes the proposed network architecture, training regime, and post-processing methods. We then discuss details for leveraged data sets in Section 4 and experimental results are presented in Section 5. Section 6 concludes with a summary. Constraints and future improvements are discussed in Section 7.
2 Related Work
Nuclear segmentation is challenging due to the wide variability in size, morphology, and density. It is often a multi-step process, relying on seed selection such as the Laplacian of Gaussian [kong2013generalized], followed by segmentation steps, including watershed [zhang2010neutrosophic] and graph-cut [al2010improved] algorithms.
Automated seed selection for KESM and MOST data has been proposed using three-dimensional iterative voting [saadatifard2016three, saadatifard2018robust]; however, the complexity of chromatin structures limits accuracy. Staining heterogeneity and resolution limitations further reduce accuracy, causing over-seeding in poorly stained areas and under-seeding in dense cortical regions.
Vessel segmentation is challenging given the complexity of the embedded capillary network. Prior work uses a combined template-matching and predictor/corrector approach for KESM and MOST data [govyadinov2018robust], however these algorithms are less accurate with nuclear stains [xiong2017precise] (Fig. 5), which have lower contrast than vascular perfusion labels and exhibit additional structural complexity from chromatin features in the nucleus.
Deep neural networks (DNNs) have gained attention by outperforming state-of-the-art algorithms in pattern recognition [schmidhuber2015deep], and may be amenable to high-throughput microscopy. These models use a cascade of processing stages to learn abstract representations from high-dimensional data [lecun2015deep]. Convolutional neural networks (CNNs) were initially introduced for image processing [lecun1995convolutional] and have demonstrated excellent performance in signal processing [hershey2017cnn, ravindran2019assaying], including computer vision in images [krizhevsky2012imagenet] and video [karpathy2014large]. CNNs have been successfully applied to a wide range of tasks including classification [wang2016cnn, berisha2019deep, egmont2002image, shahraki2018graph], object detection [girshick2015fast, foroozandeh2017cyclist], digital staining [lotfollahi2019digital], and semantic segmentation [milletari2016v], and have been adopted for medical applications including x-ray pathology [rajpurkar2017chexnet] and computed tomography [liao2019evaluate, mobiny2017lung].
While powerful, CNNs have fundamental limitations [mobiny2019automated]. First, they require a substantial number of training samples to generalize, making data collection and annotation an expensive and time-consuming barrier. Second, many modern architectures rely on millions of parameters, leading to steep increases in computational and memory requirements. These architectures are therefore impractical for training with 3D images. New architectures are needed that provide top-tier performance while minimizing the number of trainable parameters.
2.1 CNNs for Semantic Segmentation
Semantic segmentation is a critical component of biomedical image analysis. Adding convolutional layers to produce deeper networks has achieved progressively better performance [simonyan2014very]; however, the vanishing gradient problem is a well-known issue for deep architectures [glorot2010understanding, srivastava2015training]. Residual networks were introduced [he2016deep] to mitigate the vanishing gradient problem and reduce convergence time. Alternatively, dense blocks [huang2017densely] alleviate the vanishing gradient problem by using feed-forward connections between convolution layers. Dense connections also reuse features, reducing the number of trainable parameters.
Fully-convolutional encoder-decoder neural networks, which map input data to an identically-sized output [goodfellow2016deep], are a pioneering method for semantic segmentation [ronneberger2015u]. These networks often use skip connections to copy features from the encoder to the decoder path. This architecture has been extended to three-dimensional data by adding residual connections at each stage [milletari2016v, cciccek20163d]. The Tiramisu network [jegou2017one] is a fully-convolutional dense network providing state-of-the-art performance on urban scene benchmarks [BrostowFC:PRL2008] by using dense blocks in the encoding path to extract semantic features. However, dense blocks in both encoder and decoder paths result in memory explosion. To address this problem, Tiramisu removes short connections and does not use a fully connected dense block. However, short connections convey useful information and do not add additional parameters [he2016deep]. We propose a fully convolutional auto-encoder using fully connected dense blocks in both the encoder and decoder paths. Two transition factors are used to reduce feature scaling for large three-dimensional images.
2.2 Cellular and Microvascular Segmentation Models
Three-dimensional microvascular modeling is challenging [bullmore2009complex] because most microscopes have a very limited field of view. The distribution of cells relative to the microvascular network is key to learning about brain haemodynamics and neurovascular architecture in both diseased and healthy tissue. Quantifying their correlation is therefore crucial to understanding brain function and metabolism [kleinfeld2011guide]. The regional density of cells and their distribution relative to the vascular network is important for understanding the efficiency and spatial localization of neurovascular reactions [tsai2009correlations]. Recent advances in imaging provide high resolution datasets capturing detailed information about neuronal and non-neuronal cell distributions and microvessels across the whole rodent brain. However, researchers lack a reliable set of algorithms to segment cells and vessels from complex and low-contrast thionine-stained data [tsai2009correlations, wu20143d]. The complexity of brain architecture at sub-micrometer resolution results in terabyte-scale volumetric datasets. Data at this scale is particularly challenging to both manage and process due to memory limitations and the computational complexity of existing imaging algorithms [akil2011challenges]. To address these issues, a fast and efficient framework is required that precisely maps the whole brain from raw Feulgen-stained three-dimensional samples. Although it is known that cell sizes and densities vary across cortical and subcortical regions, there are no specific values that accurately quantify them across the brain [keller2018cell].
Recent work uses expansion microscopy and tissue clearing to image the cellular structure of the mouse brain, including quantitative cell counting across brain regions [murakami2018three]. Although some studies measure cell densities in different mouse brain regions, there is substantial variation in the reported values, and even more uncertainty about the densities of specific cell types within brain regions [keller2018cell]. Most knowledge of regional cell density is estimated using stereology [mouton2013neurostereology]; a reliable approach is therefore required to quantify cell densities in the mouse brain.
Recent work on microvascular modeling has focused on light-sheet and multi-photon microscopy [lauwers2008morphometry, blinder2013cortical, haft2019deep]. These methods achieve voxel sizes small enough to allow reconstruction of arterioles and venules, as well as the capillary network, but are not able to precisely model the holistic 3D structure of the whole brain due to limited resolution or time constraints. Although capillaries play a vital role in neurovascular function, a comprehensive model of their network properties is unavailable [smith2019brain]. Given the availability of data sets comprising large regions of the brain [mayerich2011fast, xiong2017precise], this limitation is primarily due to the difficulty of segmenting and analyzing quantitative DNA-labeled brain images.
KESM images provide sufficient detail to quantify cellular and vascular structure and distributions. The tissue was stained using thionine perfusion [choe2011specimen], which acts as a Nissl or Feulgen stain primarily labeling nucleic acids. The dye is sufficiently quantitative to differentiate cell nuclei and chromatin with a dark purple stain. The surrounding neuropil is lightly labeled, while vessels and microvessels contain no tissue and are therefore unstained. The neuropil labeling allows differentiation between microvessels and cell nuclei. However, the low contrast between the labels, as well as the complex structure of the cell nucleus, makes cell localization and vascular segmentation with standard techniques challenging. Semantic segmentation forms the basis for explicit neurovascular modeling by classifying pixels into cellular, vascular, and background components.
We develop a deep fully-convolutional network that efficiently combines long skip connections proposed in U-Net [ronneberger2015u] with dense connections (short connections) proposed in the DenseNet CNN family [huang2017densely]. Long skip connections are typically used to pass feature maps from encoder to decoder, recovering spatial information lost during downsampling [cciccek20163d]. Short connections directly connect feature maps from preceding layers within a block, creating shortcuts for uninterrupted gradient flow [huang2017densely]. However, feature map concatenation through dense connections can quickly introduce memory explosion. We therefore introduce a trainable feature compression mechanism to constrain information flow.
We tested both two-dimensional and three-dimensional architectures, with two-dimensional performance tested on standard data sets. Our three-dimensional extension is used to differentiate cellular and microvascular structures from background in KESM images. Finally, the outputs of the CNN (i.e. segmentation masks) undergo post-processing to create the cellular and vascular model. We show that the cascade of semantic segmentation and post-processing provides a significant improvement in model accuracy over existing algorithms used to process KESM images.
3.1 Dense-VNet (DVNet)
The DVNet architecture contains densely-connected encoder and decoder paths joined by a linking unit (Fig. 2). The encoder path decomposes the input image into a feature hierarchy, while the decoder reconstructs spatial context while performing per-pixel classification. The linking unit (LU) provides the primary information path and contains the most compact representation of the input. Cascades of feature encoder (FE) and feature decoder (FD) units run in parallel to form the encoder and decoder paths respectively (see Fig. 2).
Dense Block (DB): is a sequence of multiple bottleneck blocks with direct short connections from any layer to all subsequent layers (referred to as dense connections). The feature maps for layer l are computed by:

x_l = H_l([x_0, x_1, ..., x_{l-1}])

where x_l is the feature map for layer l in the dense block, [x_0, x_1, ..., x_{l-1}] refers to the concatenation of the feature maps of layers 0 to l-1, and H_l calculates the feature map for layer l. Dense connection of the feature maps strengthens feature propagation, mitigates the vanishing gradient problem, and applies a regularizing effect that reduces over-fitting [huang2017densely, drozdzal2016importance]. In a dense block, short skip connections pass input features to the output (bold blue lines in Fig. 2).
Bottleneck block (BB): is the main component of a dense block. Each BB is composed of two convolutional layers: a 1×1 convolution layer followed by a 3×3 layer with k filters, where k is a hyper-parameter called the growth rate. The 1×1 convolution serves as the bottleneck layer, reducing the number of input feature maps and thus increasing computational efficiency [huang2017densely]. Batch normalization (BN) and a ReLU activation (Fig. 2) precede each convolutional layer.
A DB with n layers and a growth rate of k calculates m_0 + n·k feature maps, where m_0 is the input depth. Feature maps computed at each encoder level are concatenated to the inputs of the DB at the corresponding level of the decoder path using a long skip connection (the dashed lines in Figure 2). Long skip connections restore the spatial context lost through encoder downsampling [long2015fully, drozdzal2016importance], while short skip connections alleviate vanishing gradients and improve convergence [he2016deep, szegedy2017inception].
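The dense connectivity above can be illustrated with a short NumPy sketch. Here `bottleneck` is only a stand-in for the BN-ReLU-convolution sequence (random weights, no training); the point is how concatenation grows the feature depth by the growth rate k at every layer:

```python
import numpy as np

def bottleneck(x, k, rng):
    # Illustrative stand-in for BN -> ReLU -> 1x1 conv -> 3x3 conv:
    # a random linear map from all concatenated input channels down to
    # k new channels (spatial dimensions preserved).
    w = rng.standard_normal((k, x.shape[0]))
    return np.maximum(0.0, np.einsum('oc,chw->ohw', w, x))

def dense_block(x0, n_layers, k, rng):
    feats = [x0]
    for _ in range(n_layers):
        x_in = np.concatenate(feats, axis=0)    # [x_0, ..., x_{l-1}]
        feats.append(bottleneck(x_in, k, rng))  # x_l = H_l([x_0, ..., x_{l-1}])
    return np.concatenate(feats, axis=0)        # output depth: m_0 + n*k

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 8, 8))            # m_0 = 64 input feature maps
out = dense_block(x0, n_layers=4, k=16, rng=rng)
```

With 64 input maps, 4 layers, and k = 16 the block emits 64 + 4·16 = 128 feature maps, consistent with the accounting used in Table 1.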
Feature Compression: Employing both the short and long skip connections together rapidly increases the number of concatenated feature maps and causes memory explosion. In the Tiramisu network [jegou2017one], the short skip connections of the decoder path were removed to decrease the number of feature maps and mitigate the explosion. This, however, leads to information loss and may cause vanishing gradients, which eventually degrade convergence and prediction accuracy. In our proposed architecture, DVNet, we introduce compression factors that help control the number of feature maps, as explained below.
Transition block: is used along with a dense block to form a feature extraction unit (see Fig. 2) and changes the size of the feature maps. Transition down (TD) blocks use two convolution layers to downsample the feature maps by 2 at each level of the encoder path. Conversely, transition up (TU) blocks are used at each level of the decoder path to upsample the feature maps by 2 using two transposed-convolution layers [zeiler2010deconvolutional]. To further improve model compactness, we introduce two transition factors, θ_e and θ_d, to control the number of feature maps in the encoder and decoder paths respectively. If the output of a DB contains m feature maps, the following TD (TU) block will generate ⌊θ_e·m⌋ (⌊θ_d·m⌋) output feature maps using a 1×1 convolution layer, where 0 < θ ≤ 1 and ⌊·⌋ represents the floor function.
The encoder path starts with a convolution layer, followed by feature encoding (FE) blocks. Coarse information is extracted at the deepest encoder layers with broader receptive fields, while shallow blocks compute local information. Hierarchical representations are stored in dense block outputs and compressed through average pooling in TD blocks. Note that BN and ReLU operate on inputs for all convolutional layers in the encoder path.
The linking unit that joins encoder and decoder paths is composed of a single dense block and generates deep features with low spatial specificity.
The decoder receives semantic features at each spatial resolution and reconstructs pixel-wise masks for each class with the same spatial context as the input image. The decoder is composed of feature decoding (FD) blocks, followed by two convolutional layers. In each TU block, a 1×1 convolution decreases the feature map depth by a factor of θ_d. A transposed convolution layer then upsamples the spatial size by 2. Finally, the output of the TU block is concatenated with the output from the DB of the corresponding encoder level to form the input of the succeeding DB. At the end of the decoder path, two convolution layers operate on the feature maps computed by the decoder to generate the output masks for the semantic classes.
The final DVNet consists of 5 feature extraction levels (FEs and FDs) composed of increasing numbers of bottleneck blocks (4, 6, 8, 10, and 12). The linking unit has 16 layers, and the growth rate for the network is 16. The transition down and transition up factors are θ_e = 0.5 and θ_d = 0.3. The output mask has the same size as the input, with the number of channels equal to the number of predicted classes. The input layer computes a fixed number of feature maps (64) while preserving the input size X. The feature map depth expands and the spatial dimensions are halved when passing each level of the encoder. In the decoder path, the spatial dimensions are upsampled to finally generate masks of size X. The feature map depth also accumulates in the decoder, but the transition factor θ_d prevents it from exploding (Table 1).
| layer name       | feature depth | feature dimension |
|------------------|---------------|-------------------|
| input conv       | 64            | X                 |
| DB(4 BBs)        | 128           | X                 |
| TD + DB(6 BBs)   | 160           | X/2               |
| TD + DB(8 BBs)   | 208           | X/4               |
| TD + DB(10 BBs)  | 264           | X/8               |
| TD + DB(12 BBs)  | 324           | X/16              |
| TD + LU(16 BBs)  | 418           | X/32              |
| TU + DB(12 BBs)  | 641           | X/16              |
| TU + DB(10 BBs)  | 616           | X/8               |
| TU + DB(8 BBs)   | 520           | X/4               |
| TU + DB(6 BBs)   | 412           | X/2               |
| TU + DB(4 BBs)   | 315           | X                 |
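The channel bookkeeping of Table 1 can be reproduced with a short calculation. This sketch assumes the values implied by the table arithmetic: an input convolution producing 64 maps, growth rate k = 16, encoder blocks of 4-12 bottleneck layers, a 16-layer linking unit, and transition factors θ_e = 0.5 and θ_d = 0.3:

```python
from math import floor

def dvnet_depths(c0=64, k=16, enc_layers=(4, 6, 8, 10, 12),
                 lu_layers=16, theta_e=0.5, theta_d=0.3):
    """Walk the per-level feature-map depths of the DVNet encoder/decoder."""
    enc_out, c = [], c0
    for n in enc_layers:
        c += n * k               # dense block adds n*k feature maps
        enc_out.append(c)
        c = floor(theta_e * c)   # transition down compresses by theta_e
    c += lu_layers * k           # linking unit dense block
    lu_out = c
    dec_out = []
    for n, skip in zip(reversed(enc_layers), reversed(enc_out)):
        c = floor(theta_d * c)   # transition up compresses by theta_d
        c += skip + n * k        # concat long skip, then dense block
        dec_out.append(c)
    return enc_out, lu_out, dec_out

enc, lu, dec = dvnet_depths()
```

Running this reproduces every row of Table 1: encoder depths 128, 160, 208, 264, 324; linking unit 418; decoder depths 641, 616, 520, 412, 315.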
3.3 Training Procedure
Combinations of hyperparameters are used to evaluate network performance and memory limitations. A softmax activation function is applied to the output layer to calculate class probabilities for each pixel:

p_c(x) = exp(z_c(x)) / Σ_{c'=1}^{C} exp(z_{c'}(x))

where z_c(x) is the logit of channel c of the output at position x, C is the number of classes, and p_c(x) is the softmax probability for class c. The softmax output is used in backpropagation for computing the loss and optimizing the weights.
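A minimal, numerically stable softmax over the class channel might be sketched as follows (illustrative only; the max subtraction avoids overflow and does not change the result):

```python
import numpy as np

def softmax(logits, axis=0):
    # subtract the per-position max for numerical stability
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# three class logits for a single pixel
p = softmax(np.array([[2.0], [0.5], [-1.0]]))
```

The probabilities sum to one at each position, and the ordering of the logits is preserved.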
Loss function: pixel-wise cross entropy and the Dice coefficient are the two functions used for loss calculation. Pixel-wise cross entropy is the most common cost function for image segmentation:

CE(x) = -Σ_{c=1}^{C} g_c(x) log p_c(x)

where g_c(x) and p_c(x) are the ground truth and prediction values for class c at position x, C refers to the number of classes, and CE(x) is the cross-entropy error at that position. The Dice coefficient is also used in semantic segmentation tasks, and leverages the overlap between the prediction and ground truth:

D = 2|P ∩ G| / (|P| + |G|)

where P and G represent the prediction and ground truth masks, |·| denotes the number of elements, and P ∩ G stands for the elements common to both masks.
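Both loss terms can be sketched on one-hot masks as follows (a minimal sketch; the soft Dice form with products and sums is the usual differentiable surrogate for set overlap):

```python
import numpy as np

def cross_entropy(p, g, eps=1e-7):
    # mean over positions of -sum_c g_c(x) * log p_c(x); class axis first
    return float(-(g * np.log(p + eps)).sum(axis=0).mean())

def dice(p, g, eps=1e-7):
    # 2|P ∩ G| / (|P| + |G|), written for soft (real-valued) masks
    inter = (p * g).sum()
    return float(2.0 * inter / (p.sum() + g.sum() + eps))

# a tiny one-hot, two-class ground truth: class 1 in the top half
g = np.zeros((2, 4, 4))
g[1, :2] = 1
g[0] = 1 - g[1]
```

A perfect prediction gives a Dice score of 1 and a cross entropy near 0; disjoint masks give a Dice score of 0.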
Evaluation metrics: The network is trained and evaluated using pixel-wise accuracy and the intersection over union (IoU). IoU is computed for each class and evaluates class-wise performance:

IoU_c = |P_c ∩ G_c| / |P_c ∪ G_c|

where c is the current class, P_c and G_c are the sets of voxels from the prediction and ground truth masks that are labeled as class c, and ∩ and ∪ are the intersection and union operations. We evaluate DVNet performance using global accuracy and mean IoU.
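On labeled volumes, the per-class IoU reduces to a few boolean operations; a sketch:

```python
import numpy as np

def iou(pred, gt, cls):
    # |P_c ∩ G_c| / |P_c ∪ G_c| for one class over label arrays
    p, g = (pred == cls), (gt == cls)
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union else 1.0

pred = np.array([[0, 1], [1, 2]])
gt   = np.array([[0, 1], [2, 2]])
scores = [iou(pred, gt, c) for c in (0, 1, 2)]
```

The mean of the per-class scores is the mean IoU reported for DVNet.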
Optimizer: Trainable parameters are initialized using the Xavier initializer [glorot2010understanding]. The Adam optimizer [Kingma2014adam] updates the weights to minimize the loss and improve accuracy. The learning rate is decreased by an exponential decay after each iteration.
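The per-iteration decay schedule can be sketched as below; the rate and decay values here are placeholders, not the values used in the paper:

```python
def decayed_lr(lr0, decay, step):
    # exponential decay applied once per training iteration:
    # lr(step) = lr0 * decay^step
    return lr0 * (decay ** step)

# placeholder hyperparameters for illustration
lr_start = decayed_lr(1e-3, 0.99, 0)
lr_later = decayed_lr(1e-3, 0.99, 100)
```

The schedule is monotonically decreasing, so later iterations take smaller optimization steps.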
3.4 Post Processing
DVNet forms the basis for identifying cellular and microvascular surfaces, but an explicit model is required for many practical applications. We adapt the most effective cell localization [saadatifard2018robust] and vascular tracing [govyadinov2018robust] algorithms to segment cell nuclei and identify vascular surfaces and connectivity using the masks generated by the proposed DVNet (Figure 1). The performance of both algorithms is substantially improved by pre-processing the dataset with the proposed network.
Cell localization: iVote3 is a robust cell detection technique for 3D large volume datasets [saadatifard2018robust].
Iterative voting is an embarrassingly parallel but computationally intensive method; to improve performance, our implementation is accelerated using GPU shared memory and atomic operations. The method is based on radial symmetry and can detect cells of various sizes. The only input parameter is the maximum radius of the cells in the input dataset. iVote3 computes the gradient of the input and uses the gradient information to calculate, for each voxel, the probability that it is a cell center. Based on this probability mask, the gradient information is refined and a new mask is computed iteratively until the algorithm converges, outputting a center point for each cell. Figure 3 illustrates an overview of iVote3. The algorithm is fast, requires minimal tuning, and shows superior cell detection results on dense, low-contrast datasets such as KESM.
Centerline Segmentation: The segmentation is handled by a predictor-corrector algorithm that leverages texture sampling to extract the centerline, radius, and topology of the microvascular network. The algorithm utilizes a set of two rectangular templates perpendicular to one another to estimate the centerline over multiple steps [govyadinov2018robust]. The process is illustrated in Figure 4. The final output is a connected graph G = (V, E), where a vertex v ∈ V is the location of a branch point and an edge e ∈ E is a microvessel connecting two branch points.
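The graph output could be represented with a structure along these lines (a hypothetical sketch, not the authors' data layout; field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Vessel:
    u: int            # index of the branch point at one end
    v: int            # index of the branch point at the other end
    centerline: list  # sampled (x, y, z, radius) points along the vessel

@dataclass
class VascularGraph:
    nodes: list = field(default_factory=list)  # branch-point (x, y, z) positions
    edges: list = field(default_factory=list)  # Vessel segments connecting nodes

# a minimal two-node, one-vessel example
graph = VascularGraph()
graph.nodes += [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
graph.edges.append(Vessel(u=0, v=1, centerline=[(5.0, 0.0, 0.0, 2.1)]))
```

Storing per-sample radii on each edge makes network statistics such as capillary spacing and vessel size distributions straightforward to compute.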
4.1 Data Description
Knife-edge scanning microscopy (KESM) allows researchers to collect detailed images describing cell structure and vascular/neuronal connectivity across large volumes [mayerich2008knife]. Optimal stains, such as thionine, provide multiple structural features in a single channel. Thionine staining is common in neuroscience for labeling DNA and ribosomal RNA by binding to acidic proteins and nucleic acids. This label provides contrast for neurons, endothelial cells, and various glial cells. Figure 5 illustrates a cropped section of the thionine-stained mouse cortex imaged using KESM at sub-micrometer resolution.
4.2 Data Preparation
Three disjoint KESM volumes are randomly selected for training and evaluation: 20 volumes for training, and two sets of 6 volumes each for validation and testing. There is no overlap between sets. All volumes are annotated manually to segment cellular and vascular structures. A volume from the training set and its corresponding ground truth form one sample of an input batch. Training batches contain two samples due to the high memory requirements of the DVNet architecture.
Data augmentation: The training data are expanded by randomly cropping volumes and randomly rotating them about all three axes.
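An augmentation step of this kind can be sketched as below (a sketch only; cubic crops are used so that 90° rotations preserve the shape, and the crop size is a placeholder):

```python
import numpy as np

def augment(volume, labels, crop, rng):
    # random crop (the identical crop is applied to the label volume)
    starts = [int(rng.integers(0, s - c + 1)) for s, c in zip(volume.shape, crop)]
    sl = tuple(slice(s, s + c) for s, c in zip(starts, crop))
    v, m = volume[sl], labels[sl]
    # random 90-degree rotation about a randomly chosen pair of axes
    axes = [(0, 1), (0, 2), (1, 2)][int(rng.integers(3))]
    k = int(rng.integers(4))
    return np.rot90(v, k, axes).copy(), np.rot90(m, k, axes).copy()

rng = np.random.default_rng(1)
vol = rng.standard_normal((64, 64, 64))
lab = (vol > 0).astype(np.uint8)
v, m = augment(vol, lab, (32, 32, 32), rng)
```

Because image and label receive the same crop and rotation, the voxel-wise correspondence required for supervised training is preserved.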
5 Experimental Results
Both 2D and 3D implementations of DVNet are trained on the KESM dataset. For 2D training, each image and its ground truth is fed to the network as one input sample. Training loss is computed using the Dice function, and dropout is applied for regularization. The trained model's predictions on the test data are evaluated by per-class IoU and accuracy, which verify its performance relative to other 2D networks (Table 2).
Due to memory limitations, we build the 3D DVNet in three sizes using different hyperparameter combinations. The networks are trained on subvolume inputs with a small batch size. Dice and cross-entropy functions were both examined for computing the training loss and yield the same performance. V-Net [milletari2016v] and a 3D implementation of Tiramisu [jegou2017one] are trained on this dataset, optimizing their loss and accuracy on the validation set. To compare the performance of DVNet with the state of the art, we used the trained models to predict cellular and vascular masks from the test set, and evaluated those predictions against the manually labeled ground truth (Table 2).
The evaluation results confirm that the 3D implementation segments KESM data with higher accuracy than the 2D implementation. Among the three sizes, the largest structure performs best while still having fewer trainable parameters than the other networks (Table 2).
We used the largest trained DVNet model to segment the cellular and vascular structure of the KESM dataset. A large volume of the KESM mouse brain is divided into overlapping subvolumes. The trained model predicts masks for each subvolume, and the masks are then stitched together to model the full region (Figure 6).
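The overlapped tiling can be sketched by computing start offsets per axis (a sketch; the tile and overlap sizes here are placeholders, and the final tile is shifted backward so every voxel is covered):

```python
from itertools import product

def tile_starts(length, tile, overlap):
    # start offsets covering [0, length) with tiles of size `tile`
    # overlapping by `overlap` voxels; the last tile is shifted to fit
    step = tile - overlap
    starts = list(range(0, max(length - tile, 0) + 1, step))
    if starts[-1] + tile < length:
        starts.append(length - tile)
    return starts

starts = tile_starts(100, 32, 8)
# 3D tile origins for a hypothetical 256x256x128 volume, 64-voxel tiles
origins = list(product(*(tile_starts(s, 64, 8) for s in (256, 256, 128))))
```

Overlapping the tiles lets the stitching step discard border voxels, where convolutional predictions are least reliable.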
The 3D model is used to analyze the cellular and vascular architecture in a large region of the brain. To compute the number of cells and their distribution, iVote3 (Section 3.4) is used to localize cell positions. Using the generated model for cell segmentation significantly improves the performance of iVote3 compared with using the raw KESM images (Figure 7a). Analyzing the vascular structure of the brain, such as the distance between capillaries or the average distance between vessels of different sizes, requires vessel locations and sizes. The algorithm described in Section 3.4 computes the centerlines and radii of the vascular network using the vascular model. Figure 7b shows the F-measure improvement in comparison with using the raw KESM data. The cell detection and vessel tracking results are compared against a manually labeled ground truth.
6 Conclusion

KESM captures detailed information at the individual cell level across the brain, forming a volumetric dataset for the whole mouse brain. The diversity in cell size and shape, the low image contrast, and the massive dataset size are the challenges for segmenting brain structure from KESM images. Memory limitations, time performance, and the lack of a robust segmentation algorithm are the difficulties addressed in this paper.
Due to the detailed information in the KESM dataset, a deep and densely connected encoder-decoder structure (DVNet) is built to learn semantic features. The vanishing gradient problem, which is common in deep networks, is addressed by using residual connections during training. GPU memory limits three-dimensional models, and the limit becomes more confining with dense connections. The memory optimization in DVNet allowed us to train the 3D model. DVNet is trained end-to-end on the KESM dataset to segment the cellular and vascular structure of the brain. Compared to other published networks for semantic segmentation, DVNet requires fewer trainable parameters and performs better at predicting masks for thionine-stained images. Using DVNet to enhance the contrast of the thionine-stained KESM dataset improves the performance of the cell detection and vessel segmentation algorithms.
7 Discussion and Future Work
DVNet is a deep, densely connected CNN that is computationally intensive and memory limited. The number of trainable parameters and the required memory are determined by the number of levels, the number of convolutional layers, the growth rate, and the transition factors. GPU memory limits the depth of the network to 5 levels in the encoder and decoder. The 3D network trained on the KESM dataset achieved 96% and 93% training and validation accuracy, with 0.12 and 0.19 training and validation loss, respectively.
Hierarchical features extracted at each encoder level, the holistic information computed in the linking unit, and the dense connections in the decoder path allow DVNet to precisely segment multiple classes from different datasets. It leverages complex spatial features to build simplified models of microvascular and cellular structures. DVNet predicts comparable masks on public benchmarks while optimizing memory and the number of trainable parameters.
This work was funded in part by the National Institutes of Health / National Library of Medicine #4 R00 LM011390-02 and the Cancer Prevention and Research Institute of Texas (CPRIT) #RR140013.
DVNet results on open source datasets
We also demonstrate the application of DVNet on open-source datasets that have been used to evaluate semantic segmentation algorithms. A benchmark of road-scene images (CamVid) is used to evaluate the performance of the 2D implementation. Segmenting the prostate from the background of annotated MRI volumes (Promise2012) serves as a challenge for the 3D architecture.
The CamVid dataset (2D) is a sequence of images from urban scene videos and includes 32 semantic classes [BrostowSFC:ECCV08, BrostowFC:PRL2008]. We used the version of the CamVid database created by [badrinarayanan2015segnet], which consists of 12 semantic classes and includes 367 training, 101 validation, and 233 test RGB images at 360×480 resolution.
2D DVNet is trained on this dataset. Horizontal flips and random crops are the augmentation methods applied to the training set. Weighted cross-entropy is used as the loss function to compensate for the differing frequencies of pixels from specific classes in the training set. The network is trained for thousands of iterations using the Adam optimizer. The trained model is used to segment the test set (Figure 8). The intersection over union (IoU) for each class and the overall accuracy computed from DVNet predictions are used to compare performance with the state of the art (Table 3).
| model | SegNet [badrinarayanan2015segnet] | FCN8 [long2015fully] | Tiramisu [jegou2017one] | DVNet |
The Promise2012 dataset (3D) is a challenge for segmenting the prostate in MRI volumes collected with different equipment in different hospitals [litjens2014evaluation]. The dataset contains 50 volumes with manually annotated ground truth, used for the training and validation sets, and 30 volumes for the test set. Since the challenge is still open, labels for the test data are not available. The dataset includes volumes with different sizes and resolutions. A portion of the annotated volumes is assigned to the validation set and the rest is used for training. DVNet is trained with data augmentation applied during the training procedure, and cross entropy is used as the loss function. The trained model predicts prostate masks for the validation set in a feed-forward manner (Figure 9). The average accuracy and mean IoU are computed for predictions on the validation set.