I Introduction
A fundamental goal in neuroscience is to understand how information is represented, stored and modified in cortical networks. New experimental methods in neuroscience not only enable chronic, minimally invasive, recordings of large populations of neurons with cellular level resolution, but also allow recordings from identified neuronal subtypes [1]. The ability to acquire complex largescale detailed behavioral and neuronal datasets calls for the development of advanced data analysis tools, as commonly used techniques do not suffice to capture the spatiotemporal network complexity. Such a framework should deal effectively with the challenging characteristics of neuronal and behavioral data, namely connectivity structures between neurons and dynamic patterns at multiple timescales.
Due to natural and physical constraints, the accessible highdimensional data often exhibit geometric structures and lie on a lowdimensional manifold. Manifold learning is a class of data driven methods; these methods aim to find meaningful geometrybased nonlinear representations that parametrize the manifold underlying the data
[2, 3, 4, 5, 6]. Only very recently have we begun to witness seeds of its applicability to real biological data, and, in particular, to neuroscience (e.g., [7, 8]). Yet, most existing manifold learning methods are unable to deal with the complex data sets arising in neuroscience, since they do not account for several fundamental characteristics of the structures and patterns underlying such data. First, current methods are sensitive to noise and interferences. Second, to a large extent, they do not accommodate the dynamical patterns underlying the neuronal activity. Third, manifold learning does not take into account codependencies between neuronal connectivity structures and dynamical patterns.Previous work has addressed analysis of data exhibiting such codependencies. To exemplify the generality of such a problem, consider the Netflix prize [9], where it is desired to provide systematic suggestions and recommendations to viewers. A coorganization enables to both group together viewers based on their similar tastes and, at the same time, group together movies based on their similar ratings across viewers. This clustering of viewers or of movies can be highly dependent on the particular viewer, and on the particular movie; two viewers may be similar under one metric, since they both like similar adventure movies, but at the same time, quite different since they do not like the same comedies. Thus, the suggestion system needs different metrics for recommending different types of movies to different viewers.
Data arising in such settings can be viewed as a 2D matrix, where in the Netflix Prize the first dimension is the viewers (observations) and the second is the movies (variables). The need for matrix coorganization arises when observations are not independent and identically distributed, i.e., correlations exist among both observations and variables of the data matrix. Similar settings also arise in analysis of documents, psychological questionnaires, gene expression data, etc., where there is no particular reason to prefer treating one dimension as independent, while the other is treated as dependent [10, 11, 12, 13]. To address problems of this sort, Gavish and Coifman [14] proposed a methodology for matrix organization relying on the construction of a treebased biorganization of the data.
The analysis of natural data poses an even greater challenge, since such data may also depend on a massive number of marginally relevant variables, including distortions and unrelated measurements, requiring metrics, which are not sensitive to such variability, and which are capable of suppressing noise or irrelevant factors. In particular, it is insufficient to represent neuronal activity recordings, that were acquired in repetitive trials, as a 2D matrix; representing observations (time samples in one dimension), and variables (neurons in the other dimension), does not take into account the multiple scales of the time samples, since the time exhibits both local (within trial) and global (across trials) time scales. Therefore, the data are viewed as a 3D database whose dimensions are the neurons, the local time frames and the global trial indices.
In this paper, to accommodate the threedimensional nature of this data, we extend [14]
to a triplegeometry analysis obtaining a nonparametric model for data tensors. We propose a completely datadriven analysis of a given rank3 tensor that provides a coorganization of the data, i.e., each of the dimensions is organized so that it varies smoothly along the other two dimensions. Specifically, we focus on trialbased neuronal data, however, our approach is general and can be used to analyze other types of 3D datasets.
In addition to the challenge of organizing the data, applying manifold learning methods requires a “good” metric between points, which conveys local similarity, as in the Netflix example. Regular metrics do not perform well due to the high dimensionality and hierarchical structure of the trialbased data, as well as their inability to encompass the 3D nature of the data. For example, using the Euclidean distance or cosine similarity between two sensors, treats the neuronal recordings as a 1D vector, and does not take into account the combined local and global nature of the trialbased experiments. Using more sophisticated metrics such as the Mahalanobis or PCAbased distances proposed in
[15, 16, 17, 18, 19], requires a notion of locality, which is nontrivial in the given application, as the data do not necessarily follow a regular Euclidean 3D grid. Therefore, we also address the problem of defining a new multiscale metric, that incorporates the coupling between the dimensions based on the hierarchical structure of the data in each dimension.Broadly, our focus is on finding a good description of the data; our analysis enables us to build intrinsic datadriven multiscale hierarchical structures. In particular, our analysis builds three types of data structures, conveying a local to global representation, from hierarchical clustering of the data to a multiscale metric to a global embedding. These three structures are constructed in an iterative refinement procedure for each dimension, based on the other two dimensions. Thus, we exploit the coupling between the dimensions.
At the microscale, we learn a multiscale organization of the data, so that each point is organized in a bottomup hierarchical structure, using a partition tree. Thus, each point belongs to a set of nested folders, where each folder defines a coarser and coarser sense of locality/neighborhood.
At the intermediate scale, the hierarchical organization of the data is then used to define a novel 2D multiscale metric between points. This metric enables to organize each dimension based on a coarsetofine decomposition of the other two dimensions. Thus, the metric respects the hierarchy and compares points not only based on the raw measurements, but also based on their values across scales. It is based on a mathematical foundation, stemming from the approximation of the earthmover’s distance (EMD) proposed by Leeb [20]. We show that this metric is equivalent to the distance between points after applying a dataadaptive filterbank. We extend the treebased metric to a bitree multiscale metric and corresponding 2D filter bank.
At the macro scale, the local organization of the data and the multiscale metric enable the calculation of a global manifold representation of the data, via the construction of an affinity kernel and its eigendecomposition [6]. This representation can then be used to provided a single smooth organization of each dimension. The data can also be clustered based on this representation into meaningful groups.
Our trigeometry approach is applied to neuronal recordings and is used for exploratory analysis, interpretability and visualization of the data. This organization is needed to identify latent variables that control the activity in the brain and to develop the automated infrastructure necessary to recover complex structures, with less external information and without expert guidance. Our experimental results on neuronal recordings of headfixed mice demonstrate the capability of isolating and filtering regional activities and relating them to specific stimuli and physical actions, and of automatically extracting pathological dysfunction. Specifically neuronal groupings, temporal activity groupings and experimental condition scenarios are simultaneously extracted from the database, in a datadriven, modelfree and networkoriented manner.
The remainder of the paper is organized as follows. Section II briefly reviews related work regarding stateoftheart methods in neuroscience data analysis. In Section III, we formulate the problem. In Section IV, the proposed methodology for triorganization of trialbased data is presented, detailing the three components of our approach. Section V presents analysis of experimental neuronal data, in a motor foewpaw reach task in mice.
Ii Related Work
Current network analysis approaches in neuroscience can be divided into two main classes [21, 22]. The first class comprises methods, which aim to determine functional connectivity, defined in terms of statistical dependencies between measured elements (e.g., neurons or networks), by constructing direct statistical models from the data (e.g., Granger causality, transfer entropy, point process modeling and graph based methods [23, 24, 25, 22]. The second class of methods is often based on Latent Dynamical Systems (LDS), which accommodates effective connectivity characterizing the causal relations between elements through an underlying hidden dynamical system [21, 26, 27]
. Nonlinear and nonGaussian extensions of the Kalman filter, contemporary sequential Monte Carlo methods and particle filters, have also been introduced in neuroscience
[28, 29].These methods share significant drawbacks, as they are mostly heuristic, providing only an approximation of a largely unknown system, and their quality is often hard to assess
[8]. More importantly, they are all prone to the “curse of dimensionality”. On the one hand, designing a parametric generative model for truly complex highdimensional data, such as neuronal/behavioral recordings, requires considerable flexibility, resulting in a model with a large number of tunable parameters. On the other hand, estimating a large number of parameters, and fitting a predefined class of dynamical models to highdimensional data, is practically infeasible, thereby leading to poor data representations. Our approach is better designed to deal with dynamical systems and aims to alleviate the shortcomings present in current analysis methods. The proposed framework deviates from common recently used in neuroscience as it makes only very general smoothness assumptions, rather than postulating apriori specific structural models. In addition, we show that it takes into consideration the high dimensional spatiotemporal neuronal network structure.
Iii Problem Formulation
In the sequel we denote the three axes of the 3D data with a trialbased experiment in mind. However, our methodology can be applied to general 3D coordinates. Consider data acquired in a set of fixedlength trials, composed of measurements from a large number of sensors (specifically neurons). Mathematically, we have a database , depending on three variables. We collect at each neuron, or identified region of interest (ROI), denoted by , a time series of the neuronal activity (e.g., fluorescence intensity levels in identified ROIs along time). In general trialbased data, this dimension corresponds to the sensors that acquire the data, such as in EEG [30]. On a short time scale, denoted by , these time series can be viewed as a dynamic window profiling the neuron. These profiles vary on a long time scale as well, characterized by repetitive trials and denoted by , and should be organized according to global trends and similarity between trials that are not necessarily consecutive. This database can be separately organized into a triple of geometries involving each variable, , , and . However, the joint organization of all three variables leads to an organization of dynamic neuronal activity regimes, using a global representation via the diffusion maps embedding [6].
Let be a rank3 tensor, where is the number of neurons, is the number of time frames in an individual trial and is the number of trials. A point is indexed by the neuron , the fast (short scale) time index , and the trial (long scale time) index . Note that although both and are coupled as indicating time, there is no assumption on a connection between the two. Let denote the twodimensional matrix (slice) of all measurements for a given neuron throughout all trials. In similar fashion, is the 2D matrix of all measurements of all neurons, for a given time for all trials. Finally, is the 2D matrix of all measurements of all neurons throughout a single trial . A visualization of a 3D dataset and examples of 2D slices in each dimension is presented in Fig. 1.
Considering trialbased data, we assume the withintrial time index is smooth, and all trials are of the same length . It is easy to define neighbors in this dimension, as it is associated with a regular fixedlength grid. We assume the trials follow a repetitive protocol, controlled by the experimenters, yet the trial indices are discrete, not necessarily contiguous and describe a longer span of time, i.e., trials occurring on different dates. Thus, the trial index while being associated with the notion of time which is supposedly smooth, does not imply that two consecutive indices are similar. In the experimental results in Sec. V we show that trials are grouped logically based on behavioral similarity and not based on consecutive experiments. A global trend in the organization of the trials is evident only when introducing a pathological inhibitor, which has a long term effect on the test subject. Finally, we assume that a neuron index may be assigned randomly to the neuron, therefore, it does not impose any smoothness or structure, and two consecutive indices do not imply any similarity between neurons.
Thus, although the trialbased measurements are organized as a 3D database so they are supposedly associated with a regular Euclidean grid, in practice the data suffers from nonuniform sampling, and consecutive indices do not indicate actual proximity as in timeseries data (temporal smoothness) or a 2D image (spatial smoothness). Thus, conventional analysis methods, such as multiscale representations via wavelets, are not straightforward in the given application. In order to define a multiscale analysis of the data, it is necessary to be able to define neighborhoods and a sense of locality between points.
The notations in this paper follow these conventions: matrices and tensors are denoted by bold uppercase, sets are denoted by uppercase calligraphic, and vectors are denoted by uppercase italic.
Iv Trigeometry analysis
Our analysis is based on the assumption that an underlying “good” organization of the data exists, such that under a permutation of the indices in each dimension of the data, the resulting tensor is smooth in the three dimensions. Our aim is to recover this organization of the data, through a local to global processing of the data. We begin with learning the hierarchal structure of the data in each dimension via partition trees, which convey local clustering of the data. We then construct a new multiscale bitree metric for one dimension based on the coupled geometry between the other two dimensions. Finally, the treebased metric enables us to define an affinity between points from which we derive a global representation via manifold learning. Thus, our approach does not treat each dimension separately, but introduces a strong coupling between the dimensions. The threephase organization of each dimension is carried out in an iterative procedure, where each dimension is organized in turn, based on the other two.
An advantage of our approach is that it is based on modular components. We describe three methods fulfilling the motivation of each stage, but these methods can be replaced with others. For example, we propose flexible trees for the partition tree construction, but binary trees can be used instead. We expand on the three components of our approach in detail in the following subsections.
Iva Partition trees and Flexible trees
Following the assumption that under a proper organization the dataset is smooth, we aim to build a finetocoarse set of neighborhoods for each point, by constructing partition trees in each dimension. In the trigeometric organization, the neighborhoods are 3D cubes. Permuting the points in each dimension based on the constructed partition tree will recover the smooth structure respecting the coupling between the neurons and the time dimensions of the data.
Given a set of highdimensional points, we construct a hierarchical partitioning of the points, defined by a tree. In our setting, for each dimension, the set of points are the 2D slices of the data in that dimension. Without loss of generality, we will define the partition trees in this section with respect to partitioning the neurons, but this process is performed in the remaining two dimensions as well.
Let be the set of all 2D neuron slices. We define a finite partition tree on as follows. The partition tree is composed of levels, where a partition of the points is defined for each level . The partition at level consists of mutually disjoint nonempty subsets of indices in , termed folders and denoted by , :
(1) 
Note that we define the folders on the indices of the data points and not on the points themselves.
The partition tree has the following properties:

The finest partition () is composed of singleton folders, termed the “leaves”, where .

The coarsest partition () is composed of a single folder, , termed the “root” of the tree.

The partitions are nested such that if , then for some , i.e., each folder at level is a subset of a folder from level .
The partition tree is the set of all folders at all levels , and the number of all folders in the tree is denoted by .
Given a dataset, there are many methods to construct a hierarchical tree, including deterministic, random, agglomerative and divisive [31, 13, 32]. Partition trees can be constructed in a topdown or bottomup approach. In a topdown approach, the data are divided into few folders, then each of these folders is divided into subfolders, and so on until all folders at the bottom level consist of only one point. In a bottomup approach, we begin with the lowest level of the tree, clustering the points into small folders. Then these folders are merged into larger folders at higher levels of the tree, until all folders are merged at the root of the tree.
A simple approach to bottomup construction is a means based construction. The leaves of the tree are clustered via means into folders. Each folder is then assigned a centroid, and these centroids are then clustered again using means, with smaller . This process is repeated until all points are merged at the root with .
More sophisticated approaches take into account the geometric structure and multiscale nature of the data by incorporating affinity matrices on the data, and manifold embeddings. Gavish et al. [31]
propose constructing a partition tree via bottomup hierarchical clustering, given a symmetric affinity matrix
describing a weighted graph on the dataset. Ankenman [33] proposed “flexible trees”, whose construction requires an affinity matrix on the data, and is based on a lowdimensional diffusion embedding of the data, and not on the highdimensional points. The advantage of this construction, which uses the embedding rather than the highdimensional space is that distances between points in the diffusion space are meaningful and robust to noise, as opposed to highdimensional Euclidean distances. This tree construction enables us to integrate both the multiscale metric and the resulting global embedding. Since our approach is based on an iterative procedure of all three components, the tree construction is refined from iteration to iteration.Another important advantage of flexible trees is that there are relatively few levels and the level at which folders are joined is meaningful across the entire dataset. Thus, the tree structure is logically multiscale and follows the structure of the data. This also reduces the computational complexity of the metric calculation. The construction is controlled by a constant which affects the number of levels in the trees. Higher values of result in “tall” trees, while small values lead to flatter trees.
We briefly describe the flexible trees algorithm, given the set and an affinity matrix on the neurons denoted . For a detailed description see [33].

Input: The set of points , an affinity matrix , and a constant .

Init: Set partition , set .

Given an affinity on the data, we construct a lowdimensional embedding on the data [6].

Calculate the leveldependent pairwise distances in the embedding space.

Set a threshold , where .

For each point which has not yet been added to a folder, find its minimal distance .

If , and form a new folder if also does not belong to a folder. If is already part of a folder , then is added to that folder if . Thus, the threshold on the distance for adding an element to an existing folder is divided by 2 for each added element.

If , remains as a singleton folder.


The partition is set to be all the formed folders.
The trees yield a hierarchical multiscale organization of the data, which then enables us to apply signal processing methods. For example, we can apply nonlocal means to each point based on its neighborhood, to denoise the data, or multiscale analysis via tree based wavelets [34, 31, 35]. However, we aim at a global analysis of the data. To this end, we define a bitree multiscale metric, which compares two points, based on their decomposition via the trees.
IvB Dataadaptive bitree multiscale metric
Applying manifold learning requires an appropriate metric between points. As we cannot associate a sense of locality based on the indexing of the dimensions, we treat the data as vertices in a graph and develop a metric that is based on the multiscale neighborhoods constructed in the partition tree. Given the partition trees in two of the dimensions, our aim is to define a distance between two 2D slices in the remaining dimension. This distance should incorporate the multiscale nature of the data.
Following Leeb [20], we propose a treebased metric in one dimension that incorporates the coupling of the other two dimensions. For a twodimensional matrix, Leeb [20] defines a treebased metric between two points in one dimension based on a partition tree in the other dimension. We will present this metric in our context and then propose a new metric incorporating two partition trees in the case of a 3D dataset.
Consider a single 2D slice of the trial data , and the partition tree on the neurons . A point in the time dimension is a vector of length , consisting of all the neuronal measurements at the time frame during a given trial . The tree metric between two times and within this trial, given the tree is defined as
(2) 
where is a weight function, depending on the folder . The value is the mean of vector in :
(3) 
where denotes the number of points in folder , i.e., its cardinality. The metric encompasses the ability to weight the data based on its multiscale decomposition since each folder is assigned a weight via . The weights can incorporate prior smoothness assumptions on the data, and also enable enhancing either coarse or fine structures in the similarity between points.
We generalize this metric to a distance between 2D matrices, given two partitions trees. We define this distance for the trial dimension, given trees on the time and neuron dimensions, but the same applies in the other dimensions as well. Given a partition tree on the neurons and a partition tree on the time frames, the distance between two trials and is defined as
(4) 
where is a weight function depending on folders and . We term this distance a bitree metric. The value is the mean value of a matrix on the bifolder :
(5) 
i.e., for a given trial , we are averaging the submatrix of the 2D slice defined by the subset of neurons in and the subset of time frames in .
We present a new interpretation of the treebased metrics (2) and (4). These metrics are equivalent to the distance between points, after applying a multiscale transform to the data, where the tree metric (2) corresponds to a 1D transform and the bitree metric (4) corresponds to a 2D transform. For the sake of simplicity we begin with describing the 1D transform in the case of a single 2D slice of the trial data , and then generalize to the 2D transform.
The partition tree can be seen as inducing a multiscale decomposition on the data, via the construction of a dataadaptive filter bank. Define the filter as
(6) 
such that is the indicator function on the neurons belonging to folder . For each filter we calculate the inner product between the filter induced by folder and the measurement vector , yielding a scalar coefficient :
(7) 
The tree defines a multiscale transform by applying filter bank to the measurements vector , resulting in the set of coefficients . The filters of each level of the tree output coefficients, such that . This is demonstrated in Fig. 2. In the middle, a 2D slice is viewed as a 2D matrix and on the left is a partition tree defined on the rows of the matrix. We assume that the rows of the matrix have been permuted so they correspond with leaves of the tree (level 0). In applying the transform , each folder defines an element in the new vector (right), proportional to the average of the entries in the original vector on the support defined by the folder . The new entries in the vector are colored according to the corresponding folders in the tree.
Theorem IV.1
Given a partition tree on the neurons , the tree metric (2) between two times and for a given trial is equivalent to the distance between the multiscale transform defined by the tree and applied to the two vectors:
(8) 
Proof IV.2
(9) 
This result can be generalized to a multiscale 2D transform applied to 2D matrices as in our setting. Define the 2D filter by:
(10) 
where denotes the Kronecker product between the two indicator vectors. Then the elements of the 2D matrix are the coefficients obtained from applying the 2D filter bank defined by the bitree .
Corollary IV.3
The bitree metric (4) between two matrices given a partition tree on the neurons and a partition tree on the time frames is equivalent to the distance between the 2D multiscale transform of the two matrices:
(11) 
This interpretation of the metric as the distance between multiscale transforms has two computational advantages. First, given large datatsets, it is inefficient to calculate full affinity matrices on the points, and instead sparse matrices are used by finding nearest neighbors of each point. Thus, we can apply the multiscale transform to our data, yielding a new feature vector for each point, and then apply approximate nearestneighbor search for the distance to the new vectors [36, 37]. Second, we can relax the norm to other norms such as or . In future work, we intend to establish the properties of this transform and its application to other tasks.
Note that we claimed that regular metrics are inappropriate in processing the data due to its highdimensionality in each dimension of the 3D dataset, i.e., each 2D slice of the data contain a large number of elements. This interpretation of the metric via the transform yields that the proposed metric is equivalent to the distance between vectors/matrices of even higherdimensionality, supposedly contradicting our aim for a good metric. However, due to encompassing weights on the folders, the effective size of the new vectors is smaller than the original dimensionality, as the weights are chosen such that they rapidly decrease to zero based on the folder size.
We note that by using full binary trees in each of the two dimensions, the output of applying the multiscale transform is similar to that of applying the Gaussian pyramid representation, popular in image processing [38], to each 2D matrix , . Instead of applying the Gaussian filter proposed by Burt and Adelson, our transform applies a averaging filter, weighted by , and the resolution at each level will be reduced by 2 as in the Gaussian pyramid.
Relationship to EMD
The Earth mover’s distance (EMD) is a metric used to compare probability distributions or discrete histograms, and is popular in computer vision
[39]. It is fairly insensitive to perturbations since it does not suffer from the fixed binning problems of most distances between histograms. EMD quantifies the difference between the two histograms as the amount of mass one needs to move (flow) between histograms, with respect to a ground distance, so they coincide. In its discrete form, the EMD between two normalized histograms and is defined as the minimal total ground distance “traveled” weighted by the flow:where is the ground distance, and is the flow from bin to bin .
It was shown [40] that a proper choice of the weight makes the tree metric (2) equivalent to EMD, i.e., the ratio of EMD to the treebased metric is always between two constants. The proof follows the KantorovichRubinstein theorem regarding the dual representation of the EMD problem. The weight in (2) is chosen to depend on the tree structure:
(12) 
where weights the folder by its relative size. Positive values of correspond to higher weights on coarser scales of the data, whereas negative values emphasize differences in fine structures in the data. For trees with varyingsized folders, unlike a balanced binary tree, helps to normalize the weights on folders. For , the filter defined in (6) is a uniform averaging filter whose support is determined by . In EMD the histograms are associated with a fixed grid and bins quantizing this grid. In our setting, where the data does not follow a fixed grid, the folders take the place of the bins, and by incorporating their multiscale structure via the weights, they can be seen as bins of varying sizes on the data.
Shirdhonkar and Jacobs [41] proposed a waveletbased metric (wavelet EMD) shown to be equivalent to EMD. The wavelet EMD is calculated as the weighted distance between the wavelet coefficients of the difference between the two histogram. Following [41], Leeb [40] proposed a second metric based on the distance between the coefficients of the difference of distributions expanded in the treebased Haarlike basis [31], which was also shown to be equivalent to EMD. Our interpretation of the metric (2) as an distance between a multiscale filter bank applied to the data, simplifies the calculation even more as it does not require calculating the Haarlike basis defined by the tree, and instead requires only lowpass averaging filters on the support of each folder. This generalizes the wavelet EMD [41], to highdimensional data that is not restricted to a Euclidean grid.
For the bitree metric (4), the weight on a bifolder can be chosen in an equivalent manner to (12) as
(13) 
where weights the bifolder based on the relative size of folder and weights the bifolder based on the relative size of . The values should be set according to the smoothness of the dimension and whether we intend to enhance coarse or fine structures in the data.
IvC Global Embedding
The intrinsic global representation of the data is attained by an integration process of local affinities, often termed “diffusion geometry”. Specifically, the encoding of local variability and structure from different locations (e.g., cortical regions, or trials) is aggregated into a single comprehensive representation through the eigendecomposition of an affinity kernel [6]. This global embedding preserves local structures in the data, thus enabling us to exploit the fine spatiotemporal variations and intertrial variability typical of biological data, in contrast to other methods based on averaging and smoothing the data [42].
Given the bitree multiscale distance between two points (4), we can construct an affinity on the data along each dimension. We choose an exponential function, but other kernels can be considered, dependent on the application. Without loss of generality, we describe the embedding calculation with respect to the dimension of the neurons, but this procedure is applied to the time and trials as well, within our iterative framework. Given the multiscale distance between two neurons and , the affinity is defined as:
(14) 
where is a scale parameter, and depends on the current considered dimension of the 3D data, i.e., each dimension uses a different scale in its affinity. Typically, is chosen to be the mean of distances within the data. The exponential function enhances locality, as points with distance larger than have negligible affinity.
The affinity is used to calculate a lowdimensional embedding of the data, using manifold learning techniques, specifically diffusion maps [6]. Defining an affinity matrix
, we derive a corresponding rowstochastic matrix by normalizing its rows:
(15) 
where is a diagonal matrix whose elements are given by . The eigendecomposition of
yields a a sequence of positive decreasing eigenvalues:
, and right eigenvectors
. Retaining only the first eigenvalues and eigenvectors, the mapping embeds the data set into the Euclidean space :(16) 
Note that for simplicity of notation we omit denoting the eigenvalues and eigenvectors by the relevant dimension , or
. The embedding provides a global lowdimensional representation of the data, which preserves local structures. Euclidean distances in this space are more meaningful than in the original highdimensional space, as they have been shown to be robust to noise. The flexible tree construction is based on the embedding for these reasons. Finally, the embedding integrates the local connections found in the data into a global representation, which enables visualization of the data, reveals overlying temporal trends, organizes the data into meaningful clusters, and identifies outliers and singular points. For more details on diffusion maps, see
[6].IvD Algorithm
Our iterative analysis algorithm composing all three components (tree construction, metric building, embedding) is summarized in Algorithm 1. Note that the order in which the dimensions are processed is arbitrary, and it affect the final results. In addition, since the algorithm is iterative and each component relies on the previous components, an initialization is required. Specifically, calculation of the bitree metric for one dimension requires that partition trees be calculated on the other two dimensions.
One option is to calculate an initial affinity matrix based on a general distance such as the Euclidean distance or cosine similarity. Here, we use the cosine similarity:
(17) 
Note that although the affinity is supposedly between two matrices, effectively it is equivalent to reshaping the matrices as 1D vectors and calculating the affinity using 1D distances. In other words, a general affinity does not take into account the 2D structure of the slices of the 3D data, in contrast to our bitree metric. In addition, these distances are uninformative, as the data are extremely highdimensional. For example, in each dimension of the dataset in the experimental results in Sec. V, the dimension of the measurements is of order .
Given the initial affinity, an embedding and flexible tree are calculated for the neuron dimension (steps 34). This is then repeated for the time dimension (step 5). One second option is to initialize the partition tree for the time dimension to be a binary tree, since the intratrial time is a smooth variable.
Given the trees in two of the dimensions, we can calculate the multiscale metric (4) in the trial dimension (step 7). A corresponding embedding and flexible tree are then calculated (steps 910). We now have a partition tree in each dimension, so we continue in an iterative fashion, going over each dimension and calculating the multiscale metric, diffusion embedding and flexible tree in each iteration, based on the other two dimensions. The resulting output of the algorithm can be used to analyze the data both in terms of its hierarchical structure and through visualization of the embedding. Furthermore, each dimension can be organized by calculating a smooth trajectory in its embedding space. This yields a permutation on the indices of the given dimension. Permuting all three dimensions recovers the smooth structure of the data, respecting the coupling between the neurons and the time dimensions of the data. Python code implementing Algorithm 1 will be released opensource on publication.
V Results
Va Experimental Setup
Our experimental data consists of repeated trials of a complex motor forepaw reach task in awake mice. The animals were trained to reach for a food pellet upon hearing an auditory cue [43]. This complex and versatile task exploits the capability of rodents to use their forepaw very similarly to distal hand movements in primates [43]. The hand reach task is typically learnt by mice over a period of few weeks to become “experts” (success rate of after training over 23 weeks).
Neuronal activity in the motor cortex during task performance was measured using two photon invivo calcium imaging with the recently developed genetically encoded indicators (GECIs) [44]. In addition the network was silenced using DREADDS [45], which was activated using intraperitoneal (IP) injection of the inert agonist (clozapineNoxide (CNO). The analyzed neuronal measurements are of optical calcium fluorescent activity collected from a large population of identified neurons from cortical regions of interest, acquired using two photon microscopy imaging (Fig. 3). In conjunction, highresolution behavioral recordings of the subject are acquired using a camera (400 Hz). This serves to label the time frames and to determine whether the subject performed the task successfully during the trial.
The fluorescent measurements are manually grouped into elliptical regions of interest (ROIs) (Fig. 3), and preprocessing is applied as follows. The spatial average fluorescence of each ROI per time frame in a single trial is
(18) 
where is the florescence image, and are the pixel row and column indices in the image, respectively, and is the area of the th ROI. The baseline florescence for ROI in a single trial is calculated using a subset of time frames corresponding to the florescent averages with the lowest values . Finally, the neuron measurement at each time frame is set using :
(19) 
For simplicity, we refer to the ROIs as neurons in our analysis.
VB Data
We focus on neuronal measurements from the primary motor cortex region (M1), taken from a specific mouse in a single day of experimental training sessions. The data is composed of 59 consecutive trials, where the first 19 trials are considered “control” followed by 40 trials in which the activity of the somatosensory region was silenced by injection of CNO, thus activating DREADDS. Each trial lasts 12 seconds, during which activity in 121 neurons is measured for 119 time frames. Thus, the data can be seen as 3dimensional, measuring a vector of neurons at each time frame within each trial. The data is visualized as 2D slices for several neurons, time frames and trials in Fig. 1. The time frame (1119) within the trial is a local time scale, and the trial index represents a global time scale (159).
Along with neuron measurements, we also have binary data labeling an event for each time and trial (Fig. 4
). The labeling is performed using a modified version of the machine learning based JAABA software, annotating discrete behavioral events
[46]. There are 11 labeled events that provide additional prior information helpful in verifying our analysis. An auditory cue (“tone” event) is activated after 4 seconds (frames 4042) and the food pellet comes to position at 4.4 seconds (frames 4446). The “tone” event is typically followed by either a successful “grab” event and “at mouth” event, which lasts until the end of the trial, or by a several failed “grab” events and then labeled as a “miss” event, i.e., the subject failed to grab the food pellet and bring it to its mouth.The control data consists of 19 trials, 11 of which were successful, i.e., the mouse managed to grab and eat the food pellet. After 19 trials, CNO was injected IP to silence the sensory cortex (S1), which sends feedback information to M1. The next 40 trials, referred to as “silencing trials” included 10 successful trials. During these trials, the behavior of the mouse changes, demonstrated by a decrease in “at mouth” (chewing) events and an increase in “miss” events (in which the mouse does not manage to grab the food). Note that not all silencing trials are “miss” and not all control trials are successful.
VC Trigeometric analysis
The activity of the neurons is such that they are correlated at certain times, but completely unrelated at others, and certain neurons are sensitive to the auditory trigger, whereas others completely disregard it. The goal is to automatically extract coactive communities of neurons, as they relate to the activity of the mouse. We first analyze all 59 trials together, using Algorithm 1. For the weights (13) used in the mutilscale metric (4), we choose . We describe in the following how the analysis is used to derive meaningful results for each dimension.
Figure 5 presents the 3D embedding of the time frames, where each 3D point is colored by the time index (a). The embedding clearly organizes the time frames through the various repetitive experiments into two dominant clusters: “pretone” and “posttone” frames (Fig. 5(b)), where the tone signifies the cue for the animal to begin the hand reach movement. We emphasize that this prior information was not used in the datadriven analysis. The embedding in effect isolates the time where the auditory tone is activated for the subject to reach for food.
Figure 5(c) presents the first eleven nontrivial eigenvectors obtained by the decomposition of the affinity matrix on the time dimension. Some eigenvectors correspond to harmonic functions over the entire interval . However, some are localized either on the pretone region (e.g., ) or on the posttone region (e.g., and ). In addition, each eigenvector captures the time at varying scales. This result demonstrates the power of our analysis; it shows that in a completely datadriven manner, a Fourierlike (harmonic) basis is attained. However, in contrast to the “generic” Fourierbasis, which is fixed, the obtained basis is data adaptive and captures and characterized true hidden phenomena related to external stimuli (the tone) and to different patterns of behavior (before and after the tone).
Thus, the embedding provides a verification of the knowledge we have regarding the time dimension in terms of regions of interest, and enables to pinpoint specific times of interest, essentially capturing the “script” of the trial. We do not present the local decomposition of the time frames via the flexible tree since it is not of interest, as this dimension is smooth, and therefore is just decomposed into local temporal neighborhoods.
We next examine the analysis of the trial dimension, i.e., organization of the global time.
In Fig. 6, we compare the embedding of the trials, obtained from the initial cosine affinity (a), and from the bitree multiscale metric (b). The points are colored by the trial index where blue corresponds to control trials (119), greenorange trials corresponds to the first silencing trials (1944), and red corresponds to last silencing trials (4559). Our trigeometry analysis yields an embedding (Fig. 6(b)) in which the blue and red points, corresponding to the first and last trials, respectively, are grouped together. This clearly indicates the temporal effect of silencing the somatosensory cortex on the activity of motor cortex. This is a promising result since solely from the neuronal activity, the data is selforganized functionally according to the brain activity manipulation we performed without the need to provide this information during the analysis. This result leads us to hypothesize that our silencing manipulation has a lag, and also that it expires over the duration of the experiment. Our analysis recovers hidden biological cues and enables accurate indication of pathological dysfunction driven by neuronal activity evidence.
To highlight the contribution of our approach in the analysis of such data, we compare our embedding to the 3D diffusion maps obtained by the initial cosine affinity (Fig. 6(a)), which does not exhibit any particular organization. Thus, the refinement via iterative application of the algorithm is essential. The multiscale local organization via the trees and coupling of the dimensions via the metric contribute to deriving a meaningful global embedding.
The improved clustering of the trials achieved by the bitree multiscale metric is also apparent when examining the flexible trees obtained from the two embeddings (Fig. 7). The leaves are colored by the trial index as in the embedding. The tree obtained from the new embedding better separates the trials in which the pathological dysfunction caused by the silencing is evident from the normal trials. Remembering that flexible trees are constructed bottomup using the embedding coordinates, this validates the claim that proximity in the embedding space captures the global temporal trend in the data.
To analyze the neurons, we split the data into two parts and analyze each separately, as this enables us to discover both behavioral patterns and pathological dysfunction. First, we examine the 40 trials composing the silencing trials. The neurons were preprocessed by subtracting the mean of each neuron over all trials, and normalizing it by its standard deviation across all trials. This enables us to examine the increase and decrease of activity in the neuron without being sensitive to the intensity of the measurements.
Figure 8 presents the multiscale hierarchical organization of the 2D slices of all the neurons in a flexible tree , obtained after two iterations of our analysis, highlighting several interesting tree folders from level . Neurons composing four folders from this level are presented. The folders are marked in different colors on the tree and the neurons belonging to each folder are grouped together, with a border in corresponding color to the folder. Each neuron has been reorganized as a 2D matrix of size (). The neurons are grouped together according to similar properties and the displayed folders clearly relate to pathological dysfunction. For example, the orange folder consists of neurons that are active only during trials under the effect of somatosensory silencing (horizontal separation). The yellow folder consists of neurons that are active only at or after the tone (vertical separation), and mostly in trials under the effect of the silencing (horizontal separation). In contract, the purple folder consists of neurons, which are active after the tone but during trials without the silencing effect. Finally, the green folder consists of neurons, which were silenced by the manipulation. This leads us to hypothesize, as with the trial analysis, that the effect of the somatosensory silencing has a slight delay, and in addition that it wears off after a certain number of trials, since the experiment was very long. Our analysis groups neurons demonstrating the same activity patterns together in an automatic datadriven manner without manual intervention.
The silencing trials enable us to analyze the neurons in terms of how they are affected by the introduced virus. We now treat the 19 control trials, which allows us to analyze the behavioral aspect of the neurons without external intervention. In Fig. 9, we display the neuron tree, , obtained after one iteration of our analysis, and examine folders for levels . Neurons composing five folders from this level are presented. The folders are marked in different colors on the tree and the neurons belonging to each folder are grouped together, with a border in corresponding color to the folder. Each neuron has been reorganized as a 2D matrix of size (). We use the labeled “at mouth” event and the prior information on the time of the auditory tone to analyze the results. The binary labels indicating “at mouth” activity has also been reordered as a 2D matrix of size (), and is displayed in within the black border.
The results indicate that neurons are grouped by similarity, clearly related to the behavioral data. The upper two folders (red and orange) show increased activity before and during the auditory tone. The next three folders show increased activity after the tone. The yellow folder composes of neurons that are activity during different trials, regardless of “at mouth” activity. The purple folder, on the other hand, contains neurons that are active posttone, almost entirely during which were successful, i.e., the subject managed to eat the food pellet, indicated by continuous “at mouth” labeling till the end of the trial. Finally, the green folder is composed of two neurons with the opposite activity. They are most active posttone during trials in which the subject failed to eat the food pellet. Note that this analysis is datadriven, i.e., no prior information on the event labels is used in grouping the neurons.
In analysis of the neurons, the main contribution is the produced partition tree. The global embedding did not yield meaningful results, and the examination of local folders in the tree was most informative. Note that we are looking at a limited set of “sensors” since the neurons were manually grouped together into ROIs. In future work we intend to analyze a larger group of sensors, by examining all pixels acquired from the 2photon imaging video separately. We know from previous work that increasing the number of sensors is typically beneficial to the iterative analysis. This will remove any introduced biases yielded by the preprocessing and enable to identify spatial structures not limited to ellipses.
Our experimental results demonstrate that our approach identifies for the first time (to the best of our knowledge), solely from observations and in a purely datadriven manner: (i) functional subsets of neurons, (ii) activity patterns associated with particular behaviors, and (iii) the dynamics and variability in these subsets and patterns as they are related to context and to different time scales (from the variability within a trial, to a global trend in trials, induced by the silencing method. In analyzing the intratrial time dimension, we pinpoint the time of the auditory trigger, and separate the time frames into multiscale local regions, before and after the trigger. Finally, in organizing the trials, we are able to both separate the trials to “success” and “failure” cases, and to determine a global trend that relates to an introduced external intervention. Thus, these methods lay a solid foundation for modeling the sensorymotor system by providing sufficiently fine structures and accurate view of the data to test our hypotheses, within an integrated computational theory of sensorymotor perception and action.
We note that conventional manifold learning tools did not yield any intelligent data organization for this case. Thus, organizing the neurons or the time samples separately by a 1D geometry using conventional manifold learning methods is inappropriate for this complex data. The fact, demonstrated here, that the neuronal activity of different types of neurons is correlated only during specific times, and might be random otherwise, verifies the need for coupled organization analysis which simultaneously organizes time, trials and neurons into trigeometries.
Vi Conclusions
In this paper we presented a new datadriven methodology for the analysis of trialbased data, specifically trials of neuronal measurements. Our approach relies on an iterative local to global refinement procedure, which organizes the data in coupled hierarchical structures and yields a global embedding in each dimension. Our analysis enabled extracting hidden biological cues and accurate indication of pathological dysfunction extracted solely from the measurements. We identified neuronal activity patterns and variability in these patterns related to external triggers and behavioral events, at different time scales, from recovering the local “script” of the trial, to a global trend across trials. In this paper we focused on neuronal measurements, but our approach is general and can be applied to other types of trialbased experimental data, and even to general highdimensional datasets such as video, temporal hyperspectral measurements, and more.
In future work, we intend to analyze the neuronal measurements from the twophoton imaging without clustering them into ROIs. This significantly increases the number of “sensors” and should enable to learn complex spatial structures in the cortex. Furthermore, our analysis can be extended to higher dimensions, e.g., incorporating behavioral data as a fourth dimension in the neuronal measurements.
References
 [1] G. M. Shepherd, “Corticostriatal connectivity and its role in disease,” Nature Reviews Neuroscience, vol. 14, no. 4, pp. 278–291, 2013.
 [2] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, Dec. 2000.
 [3] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, pp. 2323–2326, 2000.
 [4] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.
 [5] D. L. Donoho and C. Grimes, “Hessian eigenmaps: New locally linear embedding techniques for highdimensional data,” PNAS, vol. 100, pp. 5591–5596, 2003.
 [6] R. R. Coifman and S. Lafon, “Diffusion maps,” Appl. Comput. Harmon. Anal., vol. 21, no. 1, pp. 5–30, July 2006.
 [7] J. T. Vogelstein, Y. Park, T. Ohyama, R. A. Kerr, J. W. Truman, C. E. Priebe, and M. Zlatic, “Discovery of brainwide neuralbehavioral maps via multiscale unsupervised structure learning,” Science, vol. 344, no. 6182, pp. 386–392, 2014.
 [8] J. P. Cunningham and B. M. Yu, “Dimensionality reduction for largescale neural recordings,” Nature Neuroscience, 2014.
 [9] J. Bennett and S. Lanning, “The Netflix prize,” in Proceedings of KDD cup and workshop, vol. 2007, 2007, p. 35.
 [10] Y. Cheng and G. M. Church, “Biclustering of expression data,” in ISMB, vol. 8, 2000, pp. 93–103.
 [11] C. Tang, L. Zhang, A. Zhang, and M. Ramanathan, “Interrelated twoway clustering: an unsupervised approach for gene expression data analysis,” in Proc. BIBE 2001, 2001, pp. 41–48.
 [12] S. Busygin, O. Prokopyev, and P. M. Pardalos, “Biclustering in data mining,” Computers & Operations Research, vol. 35, no. 9, pp. 2964 – 2987, 2008.
 [13] E. C. Chi, G. I. Allen, and R. G. Baraniuk, “Convex biclustering,” 2014, http://arxiv.org/abs/1408.0856,arXiv:1408.0856 [stat.ME].
 [14] M. Gavish and R. R. Coifman, “Sampling, denoising and compression of matrices by coherent matrix organization,” Appl. Comput. Harmon. Anal., vol. 33, no. 3, pp. 354 – 369, 2012.

[15]
A. Singer and R. R. Coifman, “Nonlinear independent component analysis with diffusion maps,”
Appl. Comput. Harmon. Anal., vol. 25, no. 2, pp. 226 – 239, 2008.  [16] R. Talmon and R. R. Coifman, “Empirical intrinsic geometry for nonlinear modeling and time series filtering,” PNAS, 2013.
 [17] ——, “Intrinsic modeling of stochastic dynamical systems using empirical geometry,” Appl. Comput. Harmon. Anal., pp. –, 2014.
 [18] A. Haddad, D. Kushnir, and R. R. Coifman, “Texture separation via a reference set,” Appl. Comput. Harmon. Anal., vol. 36, no. 2, pp. 335 – 347, Mar. 2014.
 [19] G. Mishne, R. Talmon, and I. Cohen, “Graphbased supervised automatic target detection,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 5, pp. 2738–2754, May 2015.
 [20] W. E. Leeb, “Topics in metric approximation,” Ph.D. dissertation, Yale University, 2015.
 [21] K. Friston, R. Moran, and A. K. Seth, “Analysing connectivity with Granger causality and dynamic causal modelling,” Current opinion in neurobiology, vol. 23, no. 2, pp. 172–178, 2013.
 [22] O. Sporns, “Contributions and challenges for network models in cognitive neuroscience,” Nature neuroscience, vol. 17, no. 5, pp. 652–660, 2014.
 [23] M. Ding, Y. Chen, and S. Bressler, “Granger causality: Basic theory and application to neuroscience,” in Handbook of time series analysis: Recent Theoretical Developments and Applications. Wiley, 2006, pp. 437–460.
 [24] Schreiber, “Measuring information transfer,” Phys. Rev. Lett., vol. 85, no. 2, pp. 461–464, 2000.
 [25] W. Truccolo, U. Eden, M. Fellows, J. Donoghue, and E. Brown, “A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects,” J Neurophysiol, vol. 93, no. 2, pp. 1074–1089, 2005.
 [26] K. V. Shenoy, M. Sahani, and M. M. Churchland, “Cortical control of arm movements: a dynamical systems perspective,” Annual review of neuroscience, vol. 36, pp. 337–359, 2013.
 [27] E. W. Archer, U. Koster, J. W. Pillow, and J. H. Macke, “Lowdimensional models of neural population activity in sensory cortical circuits,” in Proc. NIPS 2014, 2014, pp. 343–351.
 [28] W. Wu, M. Black, D. Mumford, Y. Gao, E. Bienenstock, and J. Donoghue, “Modeling and decoding motor cortical activity using a switching Kalman filter,” IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 933–42, 2004.

[29]
Y. Ahmadian, J. W. Pillow, and L. Paninski, “Efficient Markov chain Monte Carlo methods for decoding neural spike trains,”
Neural Computation, vol. 23, no. 1, pp. 46–96, 2011.  [30] R. Talmon, S. Mallat, H. Zaveri, and R. Coifman, “Manifold learning for latent variable inference in dynamical systems,” IEEE Trans. Signal Process., vol. 63, no. 15, pp. 3843–3856, Aug 2015.

[31]
M. Gavish, B. Nadler, and R. R. Coifman, “Multiscale wavelets on trees, graphs and high dimensional data: Theory and applications to semi supervised learning,” in
Proc. ICML 2010, 2010, pp. 367–374. 
[32]
L. Breiman, “Random forests,”
Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.  [33] J. I. Ankenman, “Geometry and analysis of dual networks on questionnaires,” Ph.D. dissertation, Yale University, 2014.
 [34] A. Buades, B. Coll, and J.M. Morel, “A nonlocal algorithm for image denoising,” in Proc. CVPR 2005, vol. 2, 2005, pp. 60–65.
 [35] I. Ram, M. Elad, and I. Cohen, “Generalized treebased wavelet transform,” IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4199–4209, 2011.
 [36] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, “An optimal algorithm for approximate nearest neighbor searching fixed dimensions,” J. ACM, vol. 45, no. 6, pp. 891–923, Nov. 1998.
 [37] B.K. Yi and C. Faloutsos, “Fast time sequence indexing for arbitrary norms,” in Proc. VLDB 2000, 2000.
 [38] P. Burt and E. Adelson, “The Laplacian pyramid as a compact image code,” IEEE Trans. Commun., vol. 31, no. 4, pp. 532–540, 1983.
 [39] Y. Rubner, C. Tomasi, and L. J. Guibas, “A metric for distributions with applications to image databases,” in Proc. ICCV 1998, 1998, pp. 59–66.
 [40] R. R. Coifman and W. E. Leeb, “Earth mover’s distance and equivalent metrics for spaces with hierarchical partition trees,” Yale University, Tech. Rep., 2013, technical report YALEU/DCS/TR1482.
 [41] S. Shirdhonkar and D. Jacobs, “Approximate earth mover’s distance in linear time,” in Proc. CVPR 2008, June 2008, pp. 1–8.
 [42] D. Pfau, E. A. Pnevmatikakis, and L. Paninski, “Robust learning of lowdimensional dynamics from large neural ensembles,” in NIPS 2013, 2013, pp. 2391–2399.
 [43] I. Q. Whishaw, S. M. Pellis, and B. P. Gorny, “Skilled reaching in rats and humans: evidence for parallel development or homology,” Behavioural Brain Research, vol. 47, no. 1, pp. 59 – 70, 1992.
 [44] T.W. e. a. Chen, “Ultrasensitive fluorescent proteins for imaging neuronal activity,” Nature, vol. 499, no. 7458, pp. 295–300, 2013.
 [45] S. C. Rogan and B. L. Roth, “Remote control of neuronal signaling,” Pharmacological reviews, vol. 63, no. 2, pp. 291–315, 2011.
 [46] M. Kabra, A. A. Robie, M. RiveraAlba, S. Branson, and K. Branson, “JAABA: interactive machine learning for automatic annotation of animal behavior,” Nature Methods, vol. 10, no. 1, pp. 64–67, 2013.
Comments
There are no comments yet.