Summary
^{†}^{†} Computer Science and Engineering Department, The Ohio State University, Columbus, OH, USA 43210. Cold Spring Harbor Laboratory, NY, USA 11724. Indian Institute of Technology, Chennai, Tamil Nadu, India 600036.Neuroscientific data analysis has traditionally involved methods for statistical signal and image processing, drawing on linear algebra and stochastic process theory. However, digitized neuroanatomical datasets containing labelled neurons, either individually or in groups labelled by tracer injections, do not fit into this classical framework. The treelike shapes of neurons cannot be adequately described as points in a vector space (e.g. the subtraction of two neuronal shapes is not a meaningful operation). There is therefore a need for new approaches, which has become more urgent given the growth in wholebrain datasets with sparsely labelled neurons or tracer injections.
Methods from computational topology and geometry are naturally suited to the analysis of neuronal shapes. In this paper we introduce methods from Discrete Morse Theory to extract treeskeletons of individual neurons from volumetric brain image data, and to summarize collections of neurons labelled by anterograde tracer injections. Since individual neurons are topologically trees, it is sensible to summarize the collection of neurons using a consensus treeshape. This consensus tree provides a richer information summary than the regional or voxelbased ”connectivity matrix” approach that has previously been used in the literature.
The algorithmic procedure for single neuron skeletonization includes an initial neuritedetection step to extract a density field from the raw volumetric image data, followed by Discrete Morse theoretic skeleton extraction from the density field using the 1(un)stable manifold of the density field. We apply the method to highresolution 3D volumes of sparselylabelled neurons to quantitatively extract the single neuron topology, and achieve better performance than stateoftheart algorithms using nontopological methods (over 10% improvement in precision, and significant speedups at the proofreading stage).
In image volumes of brains with neuronal tracer injections, a summarized consensus tree of the collective neuronal projection patterns is extracted to characterize brainwide neuronal connectivity. The procedure for building the tree skeleton assigns the weight associated with detected tracer projections locally to the nearest point on the skeleton. This ensures that the resulting weighted skeleton also provides a summary of the tracer projection data: integrating the weighted skeleton over a brain region, helps recover the regional ”connectivity matrix”. Thus the consensustree summary of tracer injections incorporates the traditional regional connectivity matrix information, but in addition captures the collective collateral branching patterns of the set of neurons connected to the injection site. We propose that this summary can provide a future means of characterizing tracer injection data, and also provide a bridge to a growing body of singleneuron morphology data.
We find that the DM method is able to trace a tree branch through regions of low intensity that pose challenges to baseline methods. This is a particular strength of the DM approach as it utilizes the global topological structure present in the data, whereas relevant literature methods are inherently spatially local. Additionally the DM approach is theoretically well principled and conceptually clean without multiple adhoc handengineered steps. There is a significant computational overhead to the topological approaches but we are able to mitigate the speed issues using parallelized implementations of the core algorithms.
Background
Topological Data Analysis and Discrete Morse Theory:
Within the last decade or so, topological data analysis (TDA) has grown in influence. Several new TDA methodologies have been proposed for analyzing complex highdimensional datasets. These approaches use a variety of topological concepts to characterize essential structure behind the data and algorithmic implementations have been developed [EH10, Carl09, CM17, LS13, Tie18]. Some of the ideas (in particular persistent homology [Edelsbrunner2002]) have been applied to multiple application domains [Buchet2018, dario, platt, Lamar, LBD17], including neuroscience [li2017, CG19, KDS2018].
The area of TDA that relates to the present work is the persistenceguided Discrete Morse theory based framework for reconstructing hidden graphs from observed data. Discrete Morse theory has been utilized to capture hidden structure in 2D or 3D volumetric data [DRS15, GDN07, RWS11]. Extraction of hidden graphs was formulated in [2011MNRAS], and the framework was simplified and theoretical guarantees provided in [DWW18].
Morse theory exploits the global (instead of local) topology of a scalar field. Thus the hidden graph skeleton of the scalar field can be extracted reliably through regions of weak signals. Classical Morse theory applies to continuous functions on manifolds. Discrete Morse theory [Forman_DM_1998] provides a discretized computational framework suitable for algorithmic implementation on digitized data. Persistent homology is used to separate signal from noise and remove potentially noiserelated structure from the graph. Such a framework has been applied previously to reconstructing hidden road networks from noisy GPS trajectories and satellite images [WWL15, DWW19]. Here we adapt and extend this approach to develop a computational methodology suitable for computational neuroanatomy and to address neuroscientific problems.
In this manuscript we introduce a data analysis framework entitled DMskeleton that uses TDA and the Discrete Morse approach to skeletonize individual neurons as well as groups of neurons labelled by tracer injections. For single neuron reconstruction, we show that DMskeleton significantly outperforms the best available automated approaches[APP2_Hang2013, Hang_GTree2018] (over improvements in F1 scores) and reduces proofreading times. For tracer injection skeletonization, DMskeleton provides a conceptually new route to the analysis of mesoscale projection data, and shows robust performance.
Single Neuron Reconstruction: Many methods have been previously proposed for singleneuron reconstruction from high resolution image stacks [AlKofahi:2002, Bas2011, Basu2014, Boykov2001, Choromanska2012, Chothani2011, activelearning2014, Lee2008, lee2012, Myatt2012, oh2014mesoscale, sironi2015, SCHMITT20041283, sui2014, Srinivasan2007, turetken2013, Turetken2011, VASILKOSKI2009197, Yang2013, ZHANG2007149, Zhao2011, Zhou1999, Zhou2015, Zhou2016, APP2_Hang2013, smarttracing, Quan_NeuroGPSTree_2016, Hang_GTree2018] (see also surveys and books [Acciai2016, DONOHUE201194, neuronsurvey, arenkiel2014neural] for more detailed discussions.) Most of the existing methods sequentially expand a tree from a collection of seed points that are often selected based on density information.
A popular line of sequential tracing algorithms use shortestpath based approaches, such as APP2 [APP2_Hang2013] and SmartTracing[smarttracing]. More recently, methods using a principalcurve to sequentially include more nodes in the reconstruction (e.g., NeuroGPSTree [Quan_NeuroGPSTree_2016] and GTree [Hang_GTree2018]) have shown high performance on single neuron reconstruction in mouse brain datasets. Our proposed DMskeleton approach shows robust performance gains over APP2 and GTree, and shortens proofreading time. Importantly, it also opens up a conceptually new direction in single neuron tracing by using global topological information in a manner that we feel will generalize beyond this specific work.
Tracer Injection Skeletonization: Brainwide tracer injection datasets are another area where we apply the DM skeletonization approach. In tracer injection data sets, thousands of neurons are collectively labelled and do not have the topological simplicity of individual trees. Conventionally, brainwide connectivity information is summarized in the form of regional connectivity matrices [ConnectivityMatrix_chapter3_2016].
Such a representation ignores the fundamental treelike morphology of neurons and contains no information relating to collateral branching patterns. In this manuscript we introduce a new approach to the analysis of tracer injection data, by skeletonizing the tracer injections using a summary tree. To the best of our knowledge, this is a new approach to conceptualizing tracerinjection data, and could provide a biologically bettergrounded approach to the study of mesoscale connectivity mapping using tracer injections. One important advantage of the treeskeletonization approach is that it connects more directly with singleneuron reconstructions.
It is possible to utilize methods such as APP2 or GTree to perform tracerinjection skeletonization (and we use these as baseline comparison techniques). However, these methods do not automatically interpolate through regions of weak label, a problem that is exacerbated for tracerinjection data. In contrast, the DM based approach utilizes global information and is more robust.
The DM pipeline for summarizing multipleneuron tracer injection datasets has multiple stages. First, the raw image stack is preprocessed to detect or highlight neuronal processes [Samik_DM++]. Then a variant of the DiscreteMorse module [WWL15, DWW18]
is employed to produce a graph skeleton containing all potential neuronal trajectories. This graph skeleton may contain false positives, so the next stage of our algorithm performs a persistent homology based simplification step which removes superfluous branches in lowdensity regions and branches that misalign with estimated “flow vectors”. This simplified graph is skeletonized into a minimum spanning tree, which is further simplified using a persistence threshold. During all simplification steps, weights assigned to points on the tree (summarizing the neighboring projection density) is reassigned so as to preserve the total weight. The resulting weighted treesummary provides a new way of characterizing tracer injection data, simultaneously capturing regional connectivity information as well as collateral branching patterns.
Results
Summary method overview
The highlevel workflow of our new Discrete Morsebased (DMSkeleton) pipeline is shown in Fig. 1. The detailed flows for highresolution single neuron data (e.g, fluorescent microoptical sectioning tomograph (fMOST)) and for wholebrain 3D densely labelled tracer injection data (e.g, serial twophoton tomography (STP)) are slightly different, as the types of input and goals are different.
Nevertheless, both workflows have three main components, namely preprocessing, skeletonization, and simplification. (For details, see Method section.)
DMSkeleton takes 3D image stacks as input. In Step 1, the input image is converted to a density field defined on a 3D grid .
For fMOST data, raw image intensity was treated as the density value subjected to Morse skeletonization. For the STP data, a processdetection step [Samik_DM++] was first applied to segment labelled axon fragments in the highresolution 2D images. A 3D volume was created to summarize the density of axon fragments within each voxel at lower resolution. This summary volume was the density field subjected to Morse analysis. Given the starting density field , the goal of the DMskeleton algorithm is to capture centerlines passing through (relatively) high density regions. By viewing this density map as a terrain (see Fig. 2 for a 2D example), this step corresponds to extracting the ’mountain ridges’ of this terrain. This extraction was achieved using 1stable manifolds from discrete Morse theory.
In Step 2, a variant of the persistenceguided discrete Morsebased framework of [WWL15, DWW18] was applied to the 3D density field . We call the output the Morse ’graph skeleton’ . Note that this approach takes into account the global topology of the density field: the 1stable manifold connects through lowdensity regions (e.g., gaps and weak signals around the Yjunction in Fig. 2a) reliably.
Finally, in Step 3, we extracted either a forest (e.g, each tree representing a single neuron in fMOST data) or a tree summary from the Morse graph skeleton . The forest output may have false positives, thus we developed a simplification module to further remove such branches (see Fig. 3; more details in Method).
fMOST sparse labeling dataset.
Three semimanually reconstructed neurons from fMOST dataset were taken as ground truth. A m neighborhood around the soma was taken as the test dataset. More details about the test dataset and the ground truth are provided in Methods.
The Morse graph skeleton obtained from the persistenceguided discrete Morsebased framework (Step 2) was fed to the simplification module (Step 3).
In Step 3, the positions of somata were taken as given and used as roots to extract a forest such that one tree is rooted at each soma.
Note that there exist many automatic methods for reliably detecting soma positions [SomaDetection_Yan_2013, SomaDetection_Cheng_2016]. Automatic methods can be applied to reduce human labor when a region has many somata.
The tree rooted at the soma of the target neuron was then simplified. In this step, each tree node was assigned a score based on density (see Supplementary Methods). Next, the RootGrower strategy was applied (see Method section) to grow a simplified neuron tree.
This strategy (instead of LeafBurner) is more suitable for the fMOST dataset because there is observable autofluorescence scattered in the extracted regions. A bottomup process (LeafBurner) would stall in such regions when the autofluorescence has highintensity values similar to the somata.
We evaluated the performance of the DMSkeleton pipeline on three fMOST singleneuron regions. We compared our output with the stateoftheart approaches GTree [Hang_GTree2018] and APP2 [APP2_Hang2013] methods. For GTree and APP2, we tested different parameters and used the outputs having the fewest gross errors for comparison. We compared the precision, recall and F1score of the outputs of these three methods over groundtruth reconstructions (Fig. 4c).
In all three regions, we note that both our DMSkeleton and the GTree method performed better than the APP2 method. Furthermore, on average, our DMSkeleton method outperformed GTree by around 14% in precision and 3% in recall. This suggests that our DMSkeleton has significantly fewer false positives than the output of GTree.
To further illustrate the potential use of our output, we proofread the outputs from GTree and DMSkeleton using 3D virtual finger tool [Peng_virtual_finger_2014] on fMOST regions. Proofreading was performed on both the GTree outputs and the DMSkeleton outputs by two independent annotators without domainspecific knowledge and with no prior knowledge of the ground truth.
The annotators also constructed the skeleton from scratch on the region of neuron1 using the same proofreading tool with an average time spent of roughly one hour. Supplementary Table 1 shows that the automatic methods reduce the human time of annotation by at least two thirds.
Precision, recall, and F1score were calculated to compare proofread results with the ground truth (Supplementary Table 1). All of these metrics demonstrate that proofread results even by naive annotators were consistent and of highquality. Importantly, the time spent proofreading the DMSkeleton outputs was much shorter than the time spent proofreading GTree outputs. The reason is that DMSkeleton outputs only have minor earlytermination issues, which can be corrected by extending corresponding traces. By contrast, GTree outputs usually have a considerable amount of false positives. They also occasionally miss important mainstream branches, which causes difficulties and prolongs the proofreading process.
To utilize our framework, one labeled dataset was used to tune the parameters (for persistence simplification and tree simplification). These parameters were then applied to other datasets with similar contrast and SNR. In the results reported above, Neuron1 from the fMOST dataset was used to tune the parameters. We chose multiple persistence and tree simplification thresholds. For each such pair of choices, we generated the final tree and computed its F1score compared with groundtruth tree (Supplementary Table 2). We then selected the thresholds that give the best F1score and apply the same thresholds to all other fMOST neurons. From Supplementary Table 2, we note that results were reasonably stable to perturbations of thresholds.
STP tracer injection dataset.
When applying the DMSkeleton pipeline on the tracer injection dataset, in Step 1, the original image data
(see Section Data Collection)
was passed through a hybrid deep CNN with topological priors[Samik_DM++] to detect axon fragments. The output of this automated detection process was then manually proofread and corrected. The proofread images were summarized into lower resolution to construct a manageable 3D volume of the entire brain. The remaining process is similar to the single neuron skeletonization algorithm above. However, there are some notable differences. On one hand, the preprocessed STP data was clean and did not have as much noise as the autofluorescence and bright background in fMOST data, making the LeafBurner simplification strategy preferred over the strategy RootGrower for STP dataset (see Fig. 3, and also Method section). On the other hand, the tracer injection data has more discontinuities and gaps. Therefore, a Gaussian filter (kernel radius 2) was applied to the image stack (after preprocessing) before further processing.
The persistence threshold was adjusted to account for bitdepth differences. The simplification threshold for the tracer data was taken to be the same as for the single neuron data set even though we changed the strategy for simplification. This parameter was robust to input image types because an average score of all tree nodes was used as the reference (Supplementary Methods).
The Morse graph skeleton obtained in Step 2 may have some redundant straight segments connecting signals to the boundary of the domain (Supplementary Fig. 2). This boundary effect is caused by the zerovalue background and degenerated gradient on those pixels in the cleaned STP data. Such redundant segments can be easily removed based on distance to the nearest nonzero voxel.
We can in principle further streamline the output using flow vectors estimated with weighted principal component analysis
[Vector_simp_Delchambre_2014]. Paths not well aligned with estimated flow vectors can be removed as long as doing so does not increase the number of connected components that make up the output. In the results shown below, we did not apply this flow vector simplification, but results with or without this flow vector simplification step are compared in Supplementary Fig. 4. The output with flow vector simplification is visually cleaner and still has good coverage rates.In Step 3, the root position was manually selected within the projection site. Then, a shortest path spanning tree was extracted and simplified. In contrast with the singleneuron skeletonization, the tracer injection skeleton has a weight assigned to each tree node after simplification. The weight of a given tree node represents the total weight of voxels whose nearest neighbor in the tree node set is (see Fig. 3, and also Methods section). This is a weightpreserving process where the weights of the tree nodes of the output preserve the total tracer density of points assigned to that node.
In addition, for visualization purposes we assigned a thickness value for each tree node. The thickness value uses the number of voxels instead of their weights to avoid excessively large tree nodes around the injection site (see Fig. 5a and Methods section). Keeping only top branches (based on length) can provide a more concise summary or skeleton. See Fig. 5a.
Visually, the output by APP2 method is oversimplified and misses many obvious signals (Supplementary Fig. 3).
GTree and DMSkeleton method have more plausible results (Fig. 5b).
In Fig. 5b, we further compared GTree and DMSkeleton outputs in several regions. From the raw data, we can see that these regions, corresponding to motor cortex and striatum in the contralateral hemisphere, contain visible signals. GTree, however, does not have enough branches covering these regions. This may be because GTree is strongly dependent on brightness and contrast, and is not robust enough to handle regions of images with relatively low brightness. This necessitates finetuning of the pixelvalue histogram prior to running GTree. In contrast, the Discrete Morsebased method is able to overcome these issues because only the relative order of pixelvalues matters and effectively there is a builtin histogram localization. Thus the method is automatically adaptive to different types of input images, as well as to different local pixel intensity distributions, with minimal image preprocessing requirements.
We also note that even the top few branches from our DMSkeleton output (the image at the left bottom corner in Fig. 5a) still capture key branches and present a good coverage of different brain regions (including those within orange boxes shown in Fig. 5b, which are missed in the outputs by GTree and APP2 methods).
To verify the biological interpretability of the skeletonization results, we first mapped the 3D volume of the simplified tree to the Allen mouse brain atlas [AllenMouseBrainAtlas] for regional projection strength analysis (Supplementary Methods). The intensity of projected pixels reflected the weights preserved by tree nodes. The projected segments were categorized into the corresponding brain regions based on the atlas map. We applied the same procedure to the proofread volume. A good summarization should cover most of the signals in the input data. Thus, the regions covered by the skeletonbased summary and the conventional connectivity matrix based summary should coincide. Moreover, the distributions of weights over different regions should be similar.
In Fig. 5c, we computed the percentages of (projected) weights within certain regions over the total weights for the proofread data and all three summarizations. Note that the output from GTree is merely a skeletonization without thickness.
We selected the top 5 regions with the highest weights based on the conventional connectivity matrix summary and found the ranks of these regions in the summarization outputs. DMSkeleton output has the identical top 5 regions as those calculated from the proofread data, while GTree or APP2 have difficulties in capturing the signals in SC and APN. The DMSkeleton output captured the projection into the contralateral areas, while
the GTree output did not cover many of these regions. The APP2 output completely missed the contralateral hemisphere. There are other regions where our weighted tree qualitatively matches the conventional summary, while GTree and APP2 have weight percentages close to zero.
The coverages for each hemisphere as well as the entire brain were calculated individually. Our DMSkeleton had the best coverage rates in all cases. In the contralateral hemisphere, it outperformed the GTree method by more than 40%. There is a small discrepancy in the weight distribution over regions between the conventional process density summary and the DMSkeleton based summary. The reason is that DMSkeleton did not capture some poorly labelled regions, and the voxels may be assigned to a tree node in a neighboring region.
In addition, we compared our DMSkeleton results with single neuron data from the Mouselight project [Mouselight]. Eleven neurons which have soma positions in the injection site of the STP dataset were selected and quantitatively summarized. Visually, the DMSkeleton output showed good correspondence with this collection of neurons, which were expected to cover similar regions as the STP volume (Supplementary Fig. 5).
Discussion
In this paper we introduced a new conceptual framework of data analysis utilizing Discrete Morse Theory for extracting underlying tree structures and applied it to two neuroscience problems of current importance: the automatic extraction of single neuron skeletons, and biologically meaningful analysis of mesoscale connectivity mapping data. Apart from being conceptually and mathematically well grounded, this approach significantly outperforms stateoftheart approaches relevant for these problems.
Current approaches to analyzing brainwide tracerinjection data all start from regional connectivity matrices, which lose the biologically important treelike structure of neurons composing those projections. Apart from respecting the collateral branching patterns of the underlying set of neurons and providing a natural bridge to single neuron data, the DMSkeleton method utilizes the global structure of the signal and is robust to noise and missing data, and naturally adaptive to signal intensity variations. We demonstrated the approach using a tracerinjected wholebrain data set. The resulting treeskeleton faithfully captured the regional ”connectivity matrix” information, and significantly outperformed baseline methods. In addition it captured the collateral branching patterns, consistent with corresponding single neuron skeletons with somata in the injection region. We expect that this approach will be useful in the future for mesoscale brain connectivity mapping using tracer injections and form a bridge to the single neuron data.
We also demonstrated significance performance improvements for single neuron reconstruction over existing best in class methods. Automated single neuron skeletonization is a wellstudied problem with a large associated literature and hundreds of algorithms. Our methodology breaks new ground by taking a different conceptual approach that leads to a simple and theoretically transparent algorithmic framework simultaneously with performance improvements. As in the case of tracerinjection skeletonization, the method takes into account the global structure of the data, is robust to noise and missing data, and is dataadaptive as construction of the Morse skeleton depends on the order relations between neighboring pixel intensities rather than their absolute values.
While our method achieves strong performance, it is computationally expensive compared with other methods. The computational bottleneck comes from the persistenceguided discrete Morsebased framework. Computing persistence pairings takes running time in the worst case (although usually significantly faster in practice), where is the number of cells that make up the input cell complex. To address this bottleneck, we utilized DIPHA [DIPHA_Bauer_2014] to compute persistence pairings. This algorithm is distributed and allows for a significant speedup compared to nondistributed persistence algorithms. Further code optimizations over our current implementation are possible and runtime can be further reduced in future work.
Despite the higher computational complexity, the conceptual elegance and theoretical transparency, performance improvements including significant reduction in human proofreading times, and incorporation of prior biological structure are strong arguments in favor of the approaches proposed here.
References
References
Data and code availability
The fMOST and STP data were collected as a part of Brain Initiative Cell Census Network and shared online. The raw images of the STP dataset are available from: ftp://download.brainimagelibrary.org:8811/biccn/huang/connectivity/anterograde/180830_JH_WG_Fezf2LSLflp_CFA_female_processed/. The raw images of the fMOST dataset are available from: ftp://download.brainimagelibrary.org:8811/biccn/zeng/luo/fMOST/732664811/. Single neuron reconstruction ground truth are available from ftp://download.brainimagelibrary.org:8811/biccn/zeng/luo/fMOST/cells/732664811/. Processed projection summary of STP data, subregions of fMOST data, the code and documentation are available on Github at https://github.com/wangdingkang/DiscreteMorse.
Acknowledgements
The authors thank Katherine Matho and Kathleen Kelly for helpful discussions on the test datasets.
This work is in part supported by National Science Foundation under grants CCF1740761, RI1815697, and DMS1547357, National Institute of Health under grant R01EB022899, MH114821 and MH114824. We would also like to thank CrickClay Professorship, Mathers Charitable Foundation and H N Mahabala Chair, IIT Madras for their support.
Author contributions
D. W., L. M., S. W., Y. W. and P. M designed the pipeline. D.W. and L.M. implemented the pipeline. B. H., S. B., J. J., K. R., X. L. and M. L. provided the test datasets. D. W. and L. M. collected results on all test datasets, and together with Y. W. and P. M. analyzed the results on fMOST dataset. D. W., B. H., X. L., Y. W. and P. M. analyzed the results on STP dataset. D. W., L. M., B. H., S. B., J. J., X. L., Y. W. and P. M. wrote the manuscript.
Competing interests
The authors declare no competing interests.
Methods
0.1 Data Collection.
The Serial TwoPhoton (STP) dataset presented in this paper was collected as a part of Brain Initiative Cell Census Network [Matho_BICCN]. Specific Credependent transgenic mouse lines were crossed with IslFlp reporter lines. Flpdependent AAV tracers were utilized to reveal cell typespecific axon connection [Neural_circuits_JoshHuang2013]. Each brain was prepared and imaged using STP tomography [Serial_two_photon_Ragan2012] with inplane resolution, and sectioned coronally every 50 . Two channels of 16bit data were collected, where Channel 1 collected the autofluorescence and Channel 2 collected the fluorescent tracer information. Only Channel 2 data were used in the subsequent analysis. One STP dataset was involved in the development and demonstration of methods in this paper (available from: ftp://download.brainimagelibrary.org:8811/biccn/huang/connectivity/anterograde/180830_JH_WG_Fezf2LSLflp_CFA_female_processed/).
The dualcolor fluorescent microoptical sectioning tomography (fMOST) data presented in this paper , both raw images and single neuron reconstruction data (“ground truth”), were collected as a part of Brain Initiative Cell Census Network and downloaded from Brain Image Library (available from http://www.brainimagelibrary.org/). Specific Credependent transgenic mouse lines with reporter transgenic mouse lines. Sparse labeling of the entire neuron was achieved by injecting small amount of viral tracer. The brain was fixed and embedded in resin before mounted for fMOST imaging at inplane resolution and sectioned serially at . Two channels of data were collected, with 16bit data in each channel [Dual_color_imaging_Gong2016]. Only green channel contained the neuron tracing information and therefore used in the current study. One fMOST dataset was involved in the development and demonstration of methods in this paper (available from: ftp://download.brainimagelibrary.org:8811/biccn/zeng/luo/fMOST/732664811/). Single neuron reconstruction was performed using a combination of automatic tracing and manual annotation [TeraVR_Wang2019]. Information of nodes and edges of individual neuron was stored in separate files in swc format (available from: ftp://download.brainimagelibrary.org:8811/biccn/zeng/luo/fMOST/cells/732664811/).
0.2 Data preprocessing.
For fMOST data, neighborhoods of 3 reconstructed neurons were taken as the test datasets. Each neighborhood volume was saved as one VTK file (simple legacy format). The associated single neuron reconstruction data, saved in SWC format, were taken as ground truth for the subsequent analysis. The VTK and SWC files are available on https://github.com/wangdingkang/DiscreteMorse.
The STP dataset was first processed with a topologically motivated convolutional neural network
[Samik_DM++] for the detection of the tracers. The network, termed as DM++, takes in whole STP sections and divides them into pixel tiles. These tiles are passed through a topological algorithm based on Discrete Morse [DWW17] and a CNN counterpart for determining the topological and neuronal priors, respectively. The topological priors capture the faint connectivity which is used to boost the performance of the CNN in a supervised Siamese setting for the dual priors incorporated into the DM++ framework. The final output likelihood map is converted into a binary mask for the neuronal processes using an optimal empirically determined threshold. This captures most of the processes in the tiles, which are then stitched back together to form a mask for an entire reference section of the brain.The preliminary outputs of process detection were manually verified for the entire brain by an experienced neuroanatomist using Fiji [FIJI_Schneider2012]. Briefly, the preliminary outputs consisting of detected signal were masked with the original brain section image and error corrected using the pixel painting tool. The filled processes were identified as those having a brighter intensity compared to the background.
The proofread brain from the previous step was annotated in the format of binary images. The images were further downsampled to the desired resolution using the sum method, which preserved all the signal information. Volumes of neuronal axons were summarized by counting the number of pixels ( resolution) containing fluorescent signal into in coronal sections while keeping the intersection spacing the same (). The resulting images are 8bit, and the value for each pixel indicates the signal strength for its corresponding voxel. The image stack was saved in VTK simple legacy format and is available on https://github.com/wangdingkang/DiscreteMorse.
0.3 DMSkeleton
This section provides intuition and descriptions for the DMSkeleton pipeline; recall the workflow in Figure 1.
0.4 Step 1: Preprocessing
Each image volume in VTK format was loaded as an image stack, and converted into a density field defined on the cubical complex (3D grid) , where each vertex corresponds to a voxel in the input image and has a density value. If the input is fMOST data, then the density value at each voxel is simply the pixel value in the input raw image stack. For wholebrain tracer injection STP data, a significant portion of the raw images is background (see the example in Figure 1). Hence we first apply the learningbased processdetection module [Samik_DM++] to remove the background and segment the foreground. We further apply a Gaussian filter to smooth the values across the domain.
0.5 Step 2: Skeletonization.
The Discrete Morse graph reconstruction algorithm [WWL15, DWW18] takes a density field of any dimension as input, and outputs a graph skeleton capturing centerlines passing through relatively high density regions. In our case, the input density field is defined at vertices of a cubical complex of the domain. In all subsequent operations, only the 2skeleton of this cubical complex is needed, that is, we assume consists of vertices, edges, and squares. Cubic cells are not needed in the algorithm.
To explain the main idea, consider first the smooth case where we have a smooth function over the domain . Consider the terrain of the density function values plotted over the domain (Fig. 2) where the terrain of a function defined on is given. The underlying graph skeleton of can be captured by the mountain ridges of this terrain (Fig. 2a). Locally, the density along these mountain ridges is higher than the density off of them. These ridges form the socalled 1stable manifold of the function in Morse theory, and are defined by the integral lines “connecting” local maxima to saddle points (Fig. 2b). (An integral line is a curve in the domain where at any point on it, its tangent vector coincides with the gradient of the density field. Integral lines are thus intuitively flow lines, following the steepest descending direction of the density fields.)
Inside the algorithm, roughly speaking, ridges are associated with certain persistence values, as captured by the socalled persistent homology [Edelsbrunner2002], which can be interpreted as an importance score. This makes it easy to filter out ridges of low importance, which are typically associated with noise, from the final output by providing the algorithm with a persistence threshold. An example of simplification for a 1D function is shown in Fig. 2c.
For the input to our algorithm, we have a density field defined at vertices of a cubical complex of the domain (a 3D region). Following [DWW18], discrete Morse theory [Forman_DM_1998] is used to capture the mountain ridges mentioned above, combined with the persistence algorithm to measure importance. See Supplementary Methods for a description of the algorithm. To improve the efficiency of the algorithm, we modified the algorithm of [DWW18] so that it works directly with cubical complexes and also uses DIPHA [DIPHA_Bauer_2014] to compute persistence pairs in a distributed manner.
These mountain ridges cover the neural branches as locally, points along these neural branches tend to have relatively higher density (signal strength) than off the branches. The global nature of the 1stable manifolds makes the output skeleton robust to small gaps in signal, and effective at capturing junctions; see e.g., Fig. 2a, where the global nature of 1manifolds connects through lowdensity region around the Yjunction.
In ideal circumstances, we would find a persistence threshold that would remove all of the noise and only keep the ridges that make up the true neuron tree. However, because of the noisy nature of biological data and also the Discrete Morse graph reconstruction algorithm will not necessarily output a tree, we cannot take the algorithm’s output as a final output. Instead, we first run the algorithm with a low persistence threshold such that we do not remove any ridges that would be part of an ideal output. Then we simplify the Morse graph skeleton in the next step.
0.6 Step 3: Simplification.
The output of the above persistenceguided Morsebased framework is a geometric graph , also referred to as the Morse graph skeleton. Next, is converted into a spanning forest . Prior to this, both boundary and estimated flow vector simplification can be applied to to limit the number of unnecessary branches in the spanning forest. (The estimation of flow vector at each graph node can be found Initial vectorbased Morse graph simplification in Supplementary Methods.) Further simplification strategy to remove false positives and an option to control the level of details presented in the final summarization.
0.7 Extraction of a forest.
The Morse graph skeleton extracted in Step 2 already serves as a good initial skeleton. Each arc in graph is realized by a polygonal path, consisting of edges from the input grid . Then, a forest (a collection of rooted trees) from (See Fig. 3a) is extracted, which is in fact a spanning forest of (that is, contains all vertices from , and edges of are from ). Here we assume we are given the set of tree roots . In our experiments, the positions of roots correspond to soma locations and injection sites for fMOST and tracer injection data, respectively. The injection site and soma positions can be provided by human annotators or automatically detected [SomaDetection_Yan_2013, SomaDetection_Cheng_2016].
With and roots , a weighted shortest path spanning forest algorithm is applied to produce the spanning forest . Intuitively, the weight of an edge depends on the density values of its endpoints, and edges in the regions of higher density values should have larger weights (i.e., smaller distances). The details of the weighted shortest path spanning forest algorithm is introduced in Supplementary Methods. We remark that it may also be reasonable to use the minimum spanning forest of . However, the shortest path spanning forest mimics the natural process of tracers spreading from the injection site, and has a better empirical performance than minimum spanning forest.
0.8 Tree simplification.
Now consider any tree from the initial spanning forest constructed above.
The simplification based on persistence in Step 2 can remove falsepositive branches to some extent, however, false positives with small density surrounded by background can still remain in each tree . We further develop the following simplification strategy to remove these false positives.
First, scores are assigned to each tree node in tree ; the details of calculating node scores are explained in Supplementary Methods. Next, one of the following two strategies is chosen to simplify based on the scores. The first strategy, called LeafBurner, starts from the leaves of and iteratively removes leaves with scores at most to obtain the final simplified tree . An originally internal tree node will become a leave if all of its children are already removed by LeafBurner in previous iterations. The second strategy, called RootGrower, instead expands a simplified tree from the given root and gradually includes tree nodes from with scores at least
. At any moment, only children of those nodes already included in the simplified tree from earlier iterations will be considered. Each strategy has its own merits and disadvantages. For
RootGrower, gaps (weak connections) might cause breaks, while LeafBurner can be potentially stuck at branches in remote noise with high scores and fail to remove them (See Fig. 3d).0.9 Weightpreserving summarization
In this process, our goal is to summarize voxels in the input proofread brain while conserving weight information. The weights of voxels in a certain region should be assigned to a tree node of the DMSkeleton output in the same region. We use a similar idea when assigning scores in the simplification step, but now only use the voxel weights. The summarization weight of a tree node is total weights of voxels whose nearest neighbor in the tree node set is . Additionally, there is an upper bound for the distance allowed. More specifically, . The upper bound is intentionally large to avoid missing weights. For the STP dataset, we used m.
0.10 Additional features.
For visualization purpose, a ”thickness value” can be assigned to each tree node in accordance with the SWC format (Fig. 3e). When calculating the density score for each tree node in the simplified tree (Supplementary Methods), voxel information is leveraged in the surrounding area of that tree node. Here a similar idea is applied, we associate each voxel in the segmented foreground (in the case that the input is not preprocessed in Step 1, such as the fMOST data, then we use a low threshold to remove background) to its nearest neighbor in within distance which is specified later. The thickness value of a node is proportional to the square root of the number of associated voxels. More specifically, , where is a certain constant, is the radius (thickness) assigned to node , and is the number of associated voxels. Any tree node with no voxel associated is set to have the minimum nonzero raidus, i.e., . Empirically, the default value of is . In the experiment on STP dataset, we set m for the best visualization (Fig. 5a).
An option for users to control the desired level of detail is provided. In particular, given the simplified tree rooted at , an “importance” value is assigned to each branch, and the top (a user provided threshold) branches are then greedily selected. Here, a branch is a unique path from the root node to a degree1 leaf node . It turns out that branch length works well as the importance value. An example on STP data is shown in Fig. 5a.
Supplementary Methods
0.11 Implementation in the discrete setting
The following provides a more detailed description of how the persistenceguide Morsebased graph framework is implemented. For even more specifics, please refer to [Forman_DM_1998] for an introduction into Discrete Morse Theory and to [DWW18] for more details on the algorithm we implement.
The Discrete Morse algorithm is traditionally given a triangulation of the domain and a density function given at the vertices of . Instead of a triangulation, we take to be the 2skeleton(vertices, edges, and squares) of the cubical complex of the domain. This does not change any part of the algorithm and reduces computation time. Additionally, the user provides the algorithm with a persistence threshold .
0.12 Step 1
The first step of the algorithm is to compute the persistence pairings P() by the lower star filtration of with respect to . In our implementation, we use DIPHA[DIPHA_Bauer_2014] to compute persistence because it is a distributed persistent homology algorithm that we found minimizes computation time.
0.13 Step 2
The second step of the algorithm is to compute the discrete gradient vector field. As shown in [DWW18], all that is needed is to calculate the spanning forest that is made up of all negative edges (edges that are paired with a vertex in P()) with persistence less than or equal to . Positive edges (edges paired with a square) and edges with persistence greater than are not part of the spanning forest. No explicit discrete gradient vector field needs to be computed nor maintained. This step takes linear time once the persistence pairings are computed in Step 1.
0.14 Step 3
The third step of the algorithm is to compute the 1unstable manifold of each critical edge. As shown in[DWW18], for each edge, the 1unstable manifold is equivalent to the union of the edge with the paths from both vertices to the sink of their corresponding tree in the spanning forest computed in Step 2. The union of all 1unstable manifolds is outputted by the algorithm. Please note that the 1unstable manifold of  is equivalent to the 1stable manifold of .
0.15 Initial vectorbased Morse graph simplification.
Given the Morse skeleton graph computed in Step 2 of the DMSkeleton pipeline, at the beginning of Step 3 of our pipeline, we can perform a flowvector based graph simplification to remove branches misaligned with underlying flow vectors. In particular, we first estimate flowvectors to get a sense of which direction true neuron branches will flow. To do this, we use a weighted principal component analysis [Vector_simp_Delchambre_2014]
. In standard principal component analysis, the principal component represents the direction which explains the most variance of a collection of data points. With weighted principal component analysis, the principal component represents the direction which explains the most variance of the weights of data points. For each vertex in the Morse graph output, we compute this flowvector for a cubic neighborhood, assigning each point in the neighborhood a weight equal to its intensity. We then apply Gaussian diffusion to all computed flowvectors for the sake of smoothing. Note that the flowvector estimation is only carried out for nodes in the Morse skeleton graph.
Next, each path from nondegree 2 to nondegree 2 node in the Morse skeleton graph is computed. Then, for each vertex in each path , we estimate a pathvector at w.r.t. by taking the difference between the two vertices that are 4hops from in the path. We compute the cosine between this pathvector and the estimated flowvector at and consider this to be the vector score at . The closer the value is to 1, the more the flowvectors and pathvectors are aligned, meaning the Morse graph skeleton is more aligned with the estimated flows. On the other hand, the closer the value is to 0, the less the vectors are aligned, indicating the Morse graph skeleton is perhaps significantly deviating from the estimated flows. Note that for each vertex with degree greater than two, receives a score for each path it is a part of.
In addition to these vector scores, a capped intensity value is calculated for each vertex in the Morse skeleton graph. This is simply the minimum of the corresponding voxel value of the vertex and a userprovided value. For our experiments, the value provided is 1.
Finally, a score is computed for each path using the vector scores and capped intensity values. Specifically, for each vertex in each path , let return the capped intensity value of , and let return the vector score of w.r.t. . The score of the path is equal to divided by the length of . is a user provided weight parameter (a value of zero is used in our experiments).
Once scores are computed for each path, the simplification process begins. Only paths below a userprovided threshold are removed. In increasing score order, paths are removed if doing so does not change the number of connected components in the Morse graph skeleton. If a path is removed, it is possible that paths with lower scores that were not removed will now not alter connectivity if removed, and thus are again tested for removal. The simplified Morse skeleton graph is then fed into a weighted shortest spanning forest algorithm.
0.16 Weighted shortest spanning forest algorithm
Given a graph where each node is a point in , let with denote the set of roots for spanning forests. We then compute a weighted shortestspanning forest for as follows: For any edge , its weight is defined as , where is the Euclidean distance between and , and is the density map. Using these weights (as distances), the algorithm computes the shortest path distances between each root in the rootset and nodes in . Then for each root , its shortest path tree is spanned by all nodes whose shortest path distance to is smaller than that to any other node in (ties are broken arbitrarily).
0.17 Score initialization and smoothing
We now describe how to compute a score for all tree nodes in a given tree (from the spanning forest), so as to carry out the tree simplification procedure as described in Methods section.
Consider a specific tree from the weighted shortestpath forest, with root . We first calculate two temporary scores carrying different types of information for each tree node in : one is a density score based on density information, and the other one is a vector score based on directional information. The weighted score of each tree node is the weighted sum of these two temporary scores, i.e., with . Empirically, a large is used because density is the major indicator for false positives. The details for computing and are stated in the following sections. After the weighted score is computed for all tree nodes, we further smooth this score to obtain the final score .
0.18 (1). Calculate density scores for tree/graph nodes.
The density score is calculated based on voxel density. For each voxel, we first find its nearest neighbor in the set of tree nodes and associate it with the nearest neighbor. To avoid associating a voxel to a tree node far away, we set an upper bound , thus a voxel will not be associated with any tree node if its distance to the nearest neighbor exceeds this upper bound. Empirically, on both injection tracer and fMOST singleneuron datasets, the default upper bound is set as the distance between adjacent brain slices (i.e., m for the fMOST dataset and m for the STP dataset). If a voxel has multiple nearest neighbors, we break the tie arbitrarily. The density score of a tree node is simply the sum of density values of all associated voxels, i.e.,
where denotes the set of all voxels, is the density map and is the node set of the simplified tree (Fig. 3b).
0.19 (2). Calculate vector scores for tree nodes.
Vector scores have already been computed as described in Initial vectorbased Morse graph simplification earlier.
0.20 (3). Score smoothing to obtain final score .
Recall that we obtain a combined weighted score from density scores and vector scores. This weighted score will be further smoothed before performing tree simplification to remove local noise. In particular, the score of a tree node will be smoothed by averaging those of its neighbors within hops (i.e., connected via at most tree edges to ). We further restrict only to neighbors that are ancestors or descendants of given the root : intuitively, we wish to consider only neighbors of along the “same” neural branch. An example with is shown in Fig. 3c. The final score of tree node is , where denotes the set of ’s ancestors and descendants (including itself) within hops. In practice, maximum hop is set to for both tracer injection and fMOST datasets.
0.21 Evaluation metrics
0.22 Precision, Recall and F1score for evaluating fMOST results.
For fMOST data, the single neuron reconstruction data were provided as groundtruth. The precision and recall metrics calculated for evaluating fMOST results depend on True Positives (TP), False Positives (FP) and False Negatives (FN). All the outputs were discretized before computing these metrics. We broke any segment with a length greater than 2 pixels so that segment lengths are all roughly 1 pixel. Then, for each node
from the discretized skeletonization, we labelled as either TP or FP. is considered as TP if its nearest neighbor in the human annotation is within . Otherwise, is a FN. Similarly, a node in the annotation is a FN if there is no predicted node within . Precision and recall are then routinely computed as and. The F1score is the harmonic mean of precision and recall, i.e.,
. The parameter was predetermined based on similar reasons as in [Quan_NeuroGPSTree_2016]. The thickness of dendrites near the soma is around and the curve of F1score against this parameter (Supplementary Fig. 1) also supports the choice. The F1score is stabilized when the parameter reaches .0.23 Region analysis
The brainwide interregional connectivity information was obtained through region analysis on the whole brain tracing data. The 3D volume of the data was registered with the Allen Mouse Brain Atlas at 10 resolution [AllenMouseBrainAtlas]. Based on the atlas, the 3D density fields (proofread STP dataset and projected summarizations) were segmented into individual brain regions. Regions were considered ”covered” when the total weights of the voxels in these regions were greater than zero. The covered brain regions were ranked based on the total weights of the voxels inside. The two hemispheres were separated and the above described region analysis was performed on each side in addition to the wholebrain analysis. To quantitatively evaluate performance, we calculate a coverage rate for each method on the two hemispheres, and the entire brain. For a given method, the coverage rate is equal to the sum of the weights for regions in which the method’s output covers divided by the sum of weights for all regions.
Method  Precision  Recall  F1 




Neuron1  GTree  0.846  0.906  0.875    0.500    
GTreea1  0.953  0.940  0.946  23m40s  2839.06  
GTreea2  0.954  0.929  0.941  21m40s  2599.14  
DM  0.900  0.940  0.920      
DMa1  0.957  0.923  0.940  15m43s  1885.37  
DMa2  0.948  0.941  0.944  15m56s  1911.37  
Neuron2  GTree  0.749  0.750  0.750    0.508    
GTreea1  0.818  0.748  0.781  20m00s  2399.20  
GTreea2  0.807  0.756  0.780  21m40s  2599.14  
DM  0.865  0.749  0.803      
DMa1  0.866  0.773  0.817  16m00s  1919.36  
DMa2  0.944  0.771  0.849  16m21s  1961.35  
Neuron3  GTree  0.621  0.729  0.670    0.640    
GTreea1  0.781  0.744  0.762  23m21s  2801.07  
GTreea2  0.726  0.746  0.736  22m45s  2729.09  
DM  0.890  0.789  0.836      
DMa1  0.939  0.771  0.847  18m20s  2199.27  
DMa2  0.925  0.789  0.852  17m46s  2131.29 
Proofreading is done on GTree and DMSkeleton (abbreviated as DM in this table) outputs on all three fMOST neuron regions. We further computed the evaluation metrics on those proofread results with the original ground truth. The suffixes “a1” and “a2” represents the two annotators for proofreading. The F1scores are improved after proofreading. The proofreading time spent on the DMSkeleton method is also significantly shorter.
[width=10em]SimplificationPersistence  128  256  512  768  1024 
0  0.352  0.368  0.468  0.503  0.445 
0.05  0.899  0.911  0.912  0.828  0.639 
0.10  0.909  0.918  0.910  0.825  0.627 
0.15  0.910  0.918  0.907  0.818  0.590 
0.20  0.912  0.920  0.899  0.791  0.584 
0.25  0.902  0.910  0.889  0.788  0.583 
0.30  0.897  0.902  0.872  0.776  0.582 
Comments
There are no comments yet.