# Statistical Recurrent Models on Manifold-valued Data

In a number of disciplines, the data (e.g., graphs, manifolds) to be analyzed are non-Euclidean in nature. Geometric deep learning corresponds to techniques that generalize deep neural network models to such non-Euclidean spaces. Several recent papers have shown how convolutional neural networks (CNNs) can be extended to learn with graph-based data. In this work, we study the setting where the data (or measurements) are ordered, longitudinal or temporal in nature and live on a Riemannian manifold -- this setting is common in a variety of problems in statistical machine learning, vision and medical imaging. We show how statistical recurrent network models can be defined in such spaces. We give an efficient algorithm and conduct a rigorous analysis of its statistical properties. We perform extensive numerical experiments showing competitive performance with state-of-the-art methods but with far fewer parameters. We also show applications to a statistical analysis task in brain imaging, a regime where deep neural network models have only been utilized in limited ways.


## 1 Introduction

In the last decade or so, deep neural network models have been enormously successful in learning complicated patterns from data such as images, videos and speech [40, 38] – this has led to a number of technological breakthroughs as well as deployments in turnkey applications. One of the popular neural network architectures that has contributed to these advancements is the convolutional neural network (CNN). In the classical definition of convolution, one often assumes that the data correspond to discrete measurements, acquired at equally spaced intervals (i.e., on a Euclidean lattice), of a scalar (or vector) valued function. Clearly, for images, the Euclidean lattice grid assumption makes sense and the use of convolutional architectures is appropriate – as has been described in the literature, a number of properties such as stationarity, locality and compositionality follow. While the assumption that the underlying data have Euclidean structure is explicit or implicit in an overwhelming majority of models, recently there has been a growing interest in applying or extending deep learning models to non-Euclidean data. This line of work is called geometric deep learning and typically deals with data such as manifolds and graphs. Existing results describe strategies for leveraging the mathematical properties of such geometric or structured data (specifically, the lack of (a) a global linear structure, (b) a global coordinate system, and (c) shift invariance/equivariance) by incorporating these ideas explicitly into the deep networks used to model them [13, 36, 18, 31, 30, 19].

Separate from the evolving body of work at the interface of convolutional neural networks and structured data, there is a mature literature in statistical machine learning and computer vision demonstrating how exploiting the structure (or geometry) of the data can yield advantages. Structured data abound in various data analysis tasks: directional data in measurements from antennas, time series data (curves) in finance and the health sciences, surface normal vectors on the unit sphere (in vision or graphics), probability density functions (in functional data analysis), covariance matrices (for use in conditional independences and image textures), rigid motions (registration), shape representations (shape space analysis), tree-based data (parse trees in natural language processing), subspaces (videos, segmentation) [64, 23], low-rank matrices [12, 62], and kernel matrices are common examples. In neuroimaging, an image may have a structured measurement at each voxel to describe water diffusion [7, 63, 41, 32, 4, 15] or local structural change [29, 67, 35]. The study of the interface between geometry/structure and analysis methods has yielded effective practical tools: in order to define loss functions that make sense for the data at hand, one must first define a metric that is intrinsic to the structure of the data.

The foregoing discussion, for the most part, covers differential geometry inspired algorithms for non-sequential (or non-temporal) data. The study of analogous schemes for temporal or longitudinal data is less well-developed. But the analysis of dynamical scenes and stochastic processes is an important area of machine learning and vision, and it is here that some results have shown the benefits of explicitly using geometric ideas. Examples include the modeling of the temporal evolution of features in dynamic scenes in action recognition [2, 9, 60], tractography [14, 49] and so on. There are also proposals describing the modeling of stochastic linear dynamical systems (LDS) [22, 2, 9, 60]. In [2, 3], the authors studied the Riemannian geometry of LDS to define distances and first-order statistics. Given that the marriage between deep learning and learning on non-Euclidean domains is fairly recent, the existing body of work primarily deals with attempts to generalize the popular CNN architectures. Few if any results exist that study recurrent models for non-Euclidean structured domains.

The broad success of Recurrent Neural Network (RNN) architectures, including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, in sequential modeling tasks such as Natural Language Processing (NLP) has motivated a number of attempts to apply such ideas to model stochastic processes or to characterize dynamical scenes that can be viewed as a sequence of images. Several works have proposed variants of the RNN to model dynamical scenes, including [56, 21, 45, 53, 65]. In the recent past, developments have been made to reduce the number of parameters in RNNs and to make RNNs faster [37, 65]. In [6, 27], the authors proposed an efficient way to handle the vanishing and exploding gradient problems of RNNs using unitary weight matrices. Follow-up work proposed an RNN model that combines the remembering ability of unitary RNNs with the ability of gated RNNs to effectively forget redundant/irrelevant information. Despite these results, we find that no existing model describes a recurrent model for structured (specifically, manifold-valued) data. The main contribution of this paper is to describe a recurrent model (and accompanying theoretical analysis) that falls under the umbrella of “geometric deep learning” – it exploits the geometry of non-Euclidean data but is specifically designed for temporal or ordered measurements.

## 2 Preliminaries: Key Ingredients from Riemannian geometry

In this section, we first give a brief overview of the Riemannian geometry of symmetric positive definite matrices (henceforth denoted by SPD(n)). Note that our development is not limited to SPD(n), but choosing a specific manifold simplifies the presentation and the notation significantly. We then present the key ingredients needed for our proposed recurrent model.

Differential Geometry of SPD(n): Let SPD(n) be the set of n × n symmetric positive definite matrices. The group of n × n full-rank matrices, denoted by GL(n) and called the general linear group, acts on SPD(n) via the group action g · X := gXg^T, where g ∈ GL(n) and X ∈ SPD(n). One can define a GL(n)-invariant intrinsic metric on SPD(n) as d(X, Y) = ‖Log(X^{-1/2} Y X^{-1/2})‖_F. Here, Log is the matrix logarithm. This metric is intrinsic but requires a spectral decomposition for its computation, a computationally intensive task for large matrices. To address this, the Jensen-Bregman LogDet (JBLD) divergence was introduced on SPD(n). As the name suggests, this is not a metric, but it has been proved that the square root of the JBLD is a metric (called the Stein metric), which is defined as d(X, Y) = ( log det((X + Y)/2) − (1/2) log det(XY) )^{1/2}.

Here, we use d without any subscript to denote the Stein metric. It is easy to see that the Stein metric is computationally much more efficient than the GL(n)-invariant natural metric on SPD(n), as no eigendecomposition is required. This will be very useful for training our recurrent model. In the rest of the paper, we assume the metric on SPD(n) to be the Stein metric. We now describe a few operations on SPD(n) that are needed to define the recurrent model.

“Translation” operation on SPD(n): Let I(SPD(n)) be the set of all isometries of SPD(n), i.e., given g ∈ I(SPD(n)), we have d(g · X, g · Y) = d(X, Y) for all X, Y ∈ SPD(n), where · is the group action defined earlier. It is clear that the set of isometries forms a group (henceforth denoted by G), and that (g, X) ↦ g · X is a group action. In this work, we choose as our translation group a subgroup of G built from O(n), the set of n × n orthogonal matrices, whose action is isometric with respect to the Stein metric. As the group operation preserves the distance, we call this group operation “translation”, analogous to the Euclidean case, and denote it by T.
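To make the two operations above concrete, here is a small numerical sketch (ours, not taken from the paper's released code) of the Stein distance and the congruence "translation"; `stein_distance` and `translate` are illustrative names:

```python
import numpy as np

def stein_distance(X, Y):
    """Stein metric on SPD(n):
    d(X, Y) = sqrt( logdet((X + Y)/2) - (1/2) logdet(X Y) ).
    slogdet avoids any spectral decomposition of the inputs."""
    _, ld_mid = np.linalg.slogdet((X + Y) / 2.0)
    _, ld_x = np.linalg.slogdet(X)
    _, ld_y = np.linalg.slogdet(Y)
    return np.sqrt(max(ld_mid - 0.5 * (ld_x + ld_y), 0.0))

def translate(X, g):
    """'Translation' on SPD(n): the congruence action g X g^T.
    For g in the isometry group (e.g., orthogonal g) this preserves
    the Stein distance between any pair of SPD matrices."""
    return g @ X @ g.T
```

For an orthogonal `g`, `det(g X g^T) = det(X)`, so each log-determinant term above is unchanged and the distance is preserved, which is exactly why the operation deserves the name "translation".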

Parametrization of SPD(n): Let X ∈ SPD(n). We obtain the Cholesky factorization X = LL^T, where L is an invertible lower triangular matrix. This gives a unique parametrization of X. With a slight abuse of notation, we use Chol to denote both the decomposition and the reconstruction, depending on the domain of the argument. Note that the diagonal entries of L are positive, while the off-diagonal entries can be any real numbers.
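A minimal sketch of this parametrization, assuming the log of the diagonal is used to make those entries unconstrained (the text only requires them to stay positive; the log-diagonal choice and function names are ours):

```python
import numpy as np

def chol_param(X):
    """SPD(n) -> unconstrained parameter vector via the Cholesky factor
    X = L L^T: the strictly-lower entries of L are free reals; the
    diagonal is positive, so we store its log."""
    L = np.linalg.cholesky(X)
    n = X.shape[0]
    return np.concatenate([L[np.tril_indices(n, -1)], np.log(np.diag(L))])

def chol_unparam(p, n):
    """Inverse map: rebuild the lower-triangular factor and return
    L L^T, which is SPD for any real parameter vector p."""
    k = n * (n - 1) // 2
    L = np.zeros((n, n))
    L[np.tril_indices(n, -1)] = p[:k]
    np.fill_diagonal(L, np.exp(p[k:]))
    return L @ L.T
```

The round trip `chol_unparam(chol_param(X), n)` recovers `X` exactly (up to floating point), which is what makes this a valid chart for optimization.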

Parametrization of SO(n): SO(n) is a Lie group of special orthogonal matrices (of dimension n(n − 1)/2), with corresponding Lie algebra so(n), which consists of the set of skew-symmetric matrices. The Lie algebra is a vector space, so we use the corresponding element of the Lie algebra to parametrize a point on SO(n). Let R ∈ SO(n); we use the matrix logarithm of R to obtain its parametrization as a skew-symmetric matrix, i.e., R = exp(log(R)), where exp is the matrix exponential operator.
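The exp-map direction of this parametrization can be sketched as follows; the truncated-series `expm` and the function names are our illustrative choices (a library matrix exponential would do equally well):

```python
import numpy as np

def expm(A, terms=40):
    """Matrix exponential by truncated power series; adequate for the
    small skew-symmetric matrices used here."""
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

def so_param_to_rotation(v, n):
    """Parametrize SO(n) by the n(n-1)/2 free entries of a
    skew-symmetric matrix A in the Lie algebra so(n); exp(A) lands
    back in the group SO(n)."""
    A = np.zeros((n, n))
    A[np.triu_indices(n, 1)] = v
    A = A - A.T            # skew-symmetric by construction
    return expm(A)
```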

Weighted Fréchet mean (wFM) of matrices on SPD(n): Given X_1, …, X_N ∈ SPD(n) and weights w_1, …, w_N with w_i > 0 for all i and ∑_i w_i = 1, the weighted Fréchet mean (wFM) is:

$$M^{*} = \operatorname*{argmin}_{M \in \mathrm{SPD}(n)} \sum_{i=1}^{N} w_i \, d^{2}(X_i, M) \qquad (1)$$

The existence and uniqueness of the Fréchet mean (FM) has been discussed in detail in the literature. For the rest of the paper, we assume that the samples lie within a geodesic ball of appropriate radius so that the FM exists and is unique. We use FM({X_i}, {w_i}) to denote the wFM of {X_i} with weights {w_i}. With the above tools in hand, we are now ready to formulate the Statistical Recurrent Neural Network on SPD(n), dubbed SPD-SRU.
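For intuition, the wFM in (1) under the Stein metric can be computed in batch by a standard fixed-point iteration obtained by setting the gradient of (1) to zero, namely M ← ( Σ_i w_i ((X_i + M)/2)^{-1} )^{-1}. This is a sketch for illustration only; it is not the paper's fast recursive estimator, which is developed in Section 4:

```python
import numpy as np

def stein_wfm(Xs, w, iters=200):
    """Weighted Frechet mean on SPD(n) under the Stein metric via the
    fixed-point iteration M <- (sum_i w_i ((X_i + M)/2)^{-1})^{-1}."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                               # enforce convex weights
    M = sum(wi * Xi for wi, Xi in zip(w, Xs))     # Euclidean mean as init
    for _ in range(iters):
        M = np.linalg.inv(
            sum(wi * np.linalg.inv((Xi + M) / 2.0) for wi, Xi in zip(w, Xs))
        )
    return M
```

As a sanity check, the Stein mean of the scalars 1 and 4 (equal weights) is 2, their geometric mean, which the iteration reproduces on multiples of the identity.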

## 3 Statistical Recurrent Network Models in the space of SPD(n) matrices

The main motivation for our work comes from the statistical recurrent unit (SRU) model on Euclidean spaces. To set up our formulation, we briefly review the SRU, followed by the details of our recurrent model for manifold-valued measurements.

#### What is the Statistical Recurrent Unit (SRU)?

The SRU is an interesting model for sequential (or temporal) data based on an un-gated recurrent unit. The model maintains the sequential dependency in the input samples through a simple summary statistic – the so-called exponential moving average. Even though the proposal is based on an un-gated architecture, the development and experiments show that the results from the SRU are competitive with more complex alternatives like the LSTM and GRU. One reason put forth in that work is that, using appropriately designed summary statistics, one can essentially emulate complicated gated units and still capture long-term relations (or memory) in sequences. This property is particularly attractive when we study recurrent models for more complicated measurements such as those on manifolds. Recall that the key challenge in extending statistical machine learning models to manifolds involves re-deriving many of the classical (Euclidean) arithmetic and geometric operations while respecting the geometry of the manifold of interest. The simplicity of un-gated units provides an excellent starting point. Below, we describe the key update equations that define the SRU.

Let x_1, x_2, … be an input sequence presented to the model. As in most recurrent models, the training process in the SRU proceeds by updating the weights of the model. Let the weight matrices be denoted by W (with the node indexed by a superscript). The update rules for the SRU are as follows:

$$r_t = \mathrm{ReLU}\big(W^{(r)} \mu_{t-1} + b^{(r)}\big) \qquad (2)$$
$$\varphi_t = \mathrm{ReLU}\big(W^{(\varphi)} r_t + W^{(x)} x_t + b^{(\varphi)}\big) \qquad (3)$$
$$\mu_t^{(\alpha)} = \alpha\, \mu_{t-1}^{(\alpha)} + (1-\alpha)\, \varphi_t, \quad \forall \alpha \in \mathcal{A} \qquad (4)$$
$$o_t = \mathrm{ReLU}\big(W^{(o)} \mu_t + b^{(o)}\big) \qquad (5)$$
where 𝒜 is the set of different scales and μ_t collects the moving averages μ_t^{(α)}. The SRU formulation is analogous to a mean map embedding (MME), but applied to non-i.i.d. samples. Since the average of a set of i.i.d. samples essentially marginalizes over time, simple averaging will lose the temporal/sequential information. On the other hand, the SRU computes a moving average over time which captures the average of the data seen so far, i.e., when computing μ_t from μ_{t−1} (as shown in Fig. 1). This is very similar to taking the average of stochastic processes and looking at the “average process”. Further, by looking at averages over different scales, we can essentially uncover statistics computed over different time scales. This is because μ_t is not only a function of x_t but also a function of x_{t−1}, x_{t−2}, … via μ_{t−1}. This dependence on the past “tokens” in the sequence is shown in Fig. 1 by a dashed line. With this description, we can easily list the key operational components in the update rules in (2)-(5) and then evaluate whether such components can be generalized to serve as the building blocks of our proposed model.

Which low-level operations are needed? We can verify that the key ingredients needed to define the SRU are (i) a weighted sum; (ii) addition of a bias; (iii) a moving average; and (iv) a non-linearity. In principle, if we can generalize each of these operations to the manifold, we will have the basic components needed to define the model. Observe that items (i) and (iii) are essentially weighted sums if we impose a convexity constraint on the weights. The weighted sum for the Euclidean setting can then be generalized using the wFM as defined in Section 2 (denoted by FM).
Figure 1: Sketch of an SPD-SRU and SRU layer (dashed line represents dependence on the previous time point).

If we can do so, it will also provide a way to compute moving averages on SPD(n). The second operation identified above, addition of a bias, is a translation on Euclidean spaces. This can be achieved by the “translation” operation on SPD(n) as defined in Section 2 (denoted by T). Finally, in order to generalize the ReLU to SPD(n), we use the standard ReLU on the parameter space (this will be a local chart of SPD(n)) and then map the result back onto the manifold. This means that we have generalized each of the key components. With this in hand, we are ready to present our proposed recurrent model on SPD(n). We first formally describe our SPD-SRU layer and then contrast it with the SRU layer, to help see the main differences.
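Before moving to the manifold setting, a minimal Euclidean SRU cell following the update pattern just described can be sketched as follows; the weight shapes, scale values, and initialization are our assumptions, and biases are omitted for brevity:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

class SRUCell:
    """Minimal Euclidean SRU sketch: summary statistics are exponential
    moving averages of ReLU features kept at several scales alpha."""

    def __init__(self, d_in, d_stat, alphas=(0.0, 0.5, 0.9, 0.99), seed=0):
        rng = np.random.default_rng(seed)
        self.alphas = alphas
        s = 0.1  # illustrative init scale
        self.W_r = rng.normal(0, s, (d_stat, d_stat * len(alphas)))
        self.W_phi = rng.normal(0, s, (d_stat, d_stat + d_in))
        self.W_o = rng.normal(0, s, (d_stat, d_stat * len(alphas)))
        self.mu = np.zeros((len(alphas), d_stat))  # one average per scale

    def step(self, x):
        mu_flat = self.mu.ravel()
        r = relu(self.W_r @ mu_flat)                     # cf. (2)
        phi = relu(self.W_phi @ np.concatenate([r, x]))  # cf. (3)
        for i, a in enumerate(self.alphas):              # cf. (4)
            self.mu[i] = a * self.mu[i] + (1 - a) * phi
        return relu(self.W_o @ self.mu.ravel())          # cf. (5)
```

Note how the only state carried across time is the stack of moving averages `mu`; this is the summary statistic that the SPD-SRU below replaces with Fréchet means on the manifold.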

#### Basic components of the SPD-SRU model.

Let X_1, X_2, … be an input temporal or ordered sequence of points on SPD(n). The update rules for a layer of the SPD-SRU are as follows:

$$Y_t = \mathrm{FM}\big(\{M^{(\alpha)}_{t-1}\},\ \{w^{(y,\alpha)}\}\big), \qquad R_t = \mathrm{T}\big(Y_t,\ g^{(r)}\big) \qquad (6)$$
$$T_t = \mathrm{FM}\big(\{R_t, X_t\},\ w^{(t)}\big), \qquad \Phi_t = \mathrm{T}\big(T_t,\ g^{(p)}\big) \qquad (7)$$
$$M^{(\alpha)}_{t} = \mathrm{FM}\big(\{M^{(\alpha)}_{t-1}, \Phi_t\},\ \alpha\big), \quad \forall \alpha \in J \qquad (8)$$
$$S_t = \mathrm{FM}\big(\{M^{(\alpha)}_{t}\},\ \{w^{(s,\alpha)}\}\big), \qquad O_t = \mathrm{Chol}\Big(\mathrm{ReLU}\big(\mathrm{Chol}\big(\mathrm{T}(S_t,\ g^{(y)})\big)\big)\Big) \qquad (9)$$

where each M^{(α)}_0 is initialized to be a diagonal matrix with small positive values. As before, the set J consists of positive real numbers from the unit interval. Computing the FM with the different elements of J gives wFMs at different “scales”, exactly as desired. Analogous to the SRU, here the M^{(α)}_t are computed by averaging at different scales as shown in Fig. 1. The model leverages context from previous data by letting the moving averages M^{(α)}_t depend on past data through M^{(α)}_{t−1} (as shown in Fig. 1).

Comparison between the SPD-SRU and the SRU layer: In the SPD-SRU unit above, each update identity is a generalization of an update equation of the SRU. In (6), we compute the weighted combination of the previous FMs (computed using different “scales”) followed by a “translation”: the input is {M^{(α)}_{t−1}} and the output is R_t. This update equation is analogous to the weighted combination of the past means with a bias as given in (2), where the input is μ_{t−1} and the output is r_t. This update rule calculates a weighted combination of the past information. In (7), we compute a weighted combination of the previous information R_t and the current point or token X_t, followed by a “translation”. The inputs of this equation are R_t and X_t and the output is Φ_t. This is analogous to (3), where the inputs are r_t and x_t and the output is φ_t. This update rule combines old and new information. Next, we update the memory based on the combined information at the current time step, i.e., Φ_t. This is accomplished in (8), where we compute an FM (average) at different “scales”; computing averages at different “scales” essentially allows including information from previous data points seen at various time scales. This step is a generalization of (4): the inputs are M^{(α)}_{t−1} and Φ_t and the output is M^{(α)}_t. Finally, in (9), we use a weighted combination of the current FMs (averages) and output O_t. This corresponds to the last update rule of the SRU, i.e., (5). Observe that we did not use a ReLU operation in each update rule of the SPD-SRU, in contrast to the SRU. This is because these update rules are already highly nonlinear, unlike in the SRU; hence, a ReLU unit at the final output of the layer is sufficient. Also, notice that O_t ∈ SPD(n); hence, we can cascade multiple SPD-SRU layers, i.e., in the next layer the input sequence will be O_1, O_2, …. The update equations track the “averages” (FMs) at varying scales.
This is the reason we call our framework a statistical recurrent network. We will shortly see that our framework can utilize parameters more efficiently and requires very few parameters because of its ability to use the covariance structure.

Important properties of the SPD-SRU model: The “translation” operator is analogous to “adding” a bias term in a standard neural network. One reason we call it “translation” is because the group action preserves the metric. Notice that although in this description we track the FMs at different scales, one may easily use other statistics, e.g., the Fréchet median, mode, etc. The key bottleneck is to efficiently compute the moving statistic (whatever it may be), which will be discussed shortly. Note that the SPD-SRU formulation can be generalized to other manifolds. In fact, it can easily be generalized to Riemannian homogeneous spaces for two reasons: (a) closed-form expressions for the Riemannian exponential and inverse exponential maps exist, and (b) a group acts transitively on these spaces, hence we can generalize the definition of “translation”. Other manifolds are also possible, but the technical details will differ. We now comment on learning the parameters of our proposed model.

Learning the parameters: Notice that, using the parametrization of SPD(n), we learn the “bias” term in the parameter space, which is a vector space. The weights in the wFM must satisfy a non-negativity constraint. In order to ensure that this property is satisfied, we learn the square root of each weight, which is unconstrained, i.e., it can take any value on the real line. We impose the affine constraint explicitly by normalizing the weights. Hence, all the trainable parameters lie in a Euclidean space and the optimization over these parameters is unconstrained, so standard techniques are sufficient.
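This reparametrization can be sketched in a few lines (`convex_weights` is an illustrative name):

```python
import numpy as np

def convex_weights(theta):
    """Map unconstrained reals theta to convex weights: squaring gives
    non-negativity, and normalization gives the affine (sum-to-one)
    constraint, so a standard unconstrained optimizer can act on theta."""
    w = np.asarray(theta, dtype=float) ** 2
    return w / w.sum()
```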

Remarks. It is interesting to observe that the update equations in (6)-(9) involve only group operations and wFM computations. But as is evident from (1), the wFM computation requires numerical optimization, which is computationally inefficient. This is a bottleneck: in our proposed model, even a modest batch size trained over a handful of epochs requires computing the FM a very large number of times. Next, we develop formulations to make this wFM computation faster, since it is called hundreds of times in a typical training procedure.

## 4 An efficient way to compute the wFM on SPD(n)

The foregoing discussion describes how the computation of the wFM needs an optimization on the SPD manifold. If this sub-module is slow, the demands of the overall runtime will rule out practical adoption. In contrast, if this sub-module is fast but numerically or statistically unstable, the errors will propagate in unpredictable ways and can adversely affect the parameter estimation. Thus, we need a scheme that balances performance and efficiency.

Estimation of the FM from samples is a well-researched topic. For instance, the authors in [44, 48] used Riemannian gradient descent to compute the FM, but that algorithm's runtime grows with both the number of samples and the number of iterations needed for convergence. This procedure comes with provable consistency guarantees – thus, while it would serve our goals in theory, we find that the runtime of each run makes training incredibly slow. On the other hand, the recursive FM estimator using the Stein metric presented in earlier work is fast and apt for this task if no additional assumptions are made. However, it comes with no theoretical guarantee of consistency.

#### Key Observation.

We found that, with a few important changes to the previously described recursive scheme, one can derive an FM estimator that retains the attractive efficiency behavior and is provably consistent. The key ingredient here is a novel isometric mapping from the SPD manifold to the unit Hilbert sphere. Next, we present the main idea followed by the analysis.

#### Proposed Idea.

Let X_1, …, X_N ∈ SPD(n) be the samples for which we want to compute the FM to be used in (6)-(9). Earlier work presented a recursive Stein-mean estimator, given below:

$$M_1 = X_1, \qquad M_k = M_{k-1}\left[\sqrt{T_k + \frac{(2w_k - 1)^2}{4}\,(I - T_k)^2}\ -\ \frac{2w_k - 1}{2}\,(I - T_k)\right] \qquad (10)$$

where {w_k} is the set of weights and T_k is a function of M_{k−1} and X_k as defined in the original work. Instead, briefly, our strategy is to (i) use an isometric mapping from SPD(n) to the unit Hilbert sphere and (ii) make use of an efficient way to compute the FM on the unit Hilbert sphere. This isometric mapping transfers the problem of proving consistency of the estimator from SPD(n) to the Hilbert sphere, where, as shown below, it is easier to prove. This in turn yields consistency of the FM estimator on SPD(n).

We define an isometric mapping from SPD(n) with the Stein metric to the infinite-dimensional unit hypersphere. In order to define it, notice that we need a metric d_S on the sphere such that the two spaces are isometric. This construction and the associated consistency analysis are described below (all proofs are in the supplement).

###### Definition 1.

Let X ∈ SPD(n), and let f_X be the Gaussian density with mean 0 and covariance matrix X. Normalizing the density by its norm maps it onto the unit Hilbert sphere; let Φ denote this mapping. We define the metric d_S on the sphere in terms of the inner product ⟨·,·⟩ of the mapped points.

The following proposition establishes the isometry between SPD(n) with the Stein metric and the hypersphere with the new metric.

###### Proposition 1.

Let X, Y ∈ SPD(n). Then, d_S(Φ(X), Φ(Y)) = d(X, Y).

Note that Φ maps a point on SPD(n) to the positive orthant of the unit Hilbert sphere, since the components of any (normalized) density are non-negative. We should point out that in this metric space there are no geodesics, since it is not a length space. As a result, we cannot simply take the consistency proof of the stochastic gradient descent based FM estimator available for general Riemannian manifolds and apply it here. Hence, the recursive FM presented next for the identity in (10), with the mapping described above, needs a separate consistency analysis.

#### Recursive Fréchet mean algorithm on (H, d_S).

Let x_1, …, x_N be samples on the positive orthant of the unit Hilbert sphere. The FM of the given samples is defined as the minimizer of the sum of squared distances d_S to the samples. Our recursive algorithm to compute the wFM of the samples is:

$$m_1 = x_1, \qquad m_k = \operatorname*{argmin}_{x}\ \Big( w_k\, d^2(x_k, x) + (1 - w_k)\, d^2(m_{k-1}, x) \Big) \qquad (11)$$

where m_k is the estimate of the FM after the k-th sample. At each step of our algorithm, we simply calculate a wFM of two points, and we choose the weights to be the Euclidean weights. So, in order to construct a recursive algorithm, we need a closed-form expression for this two-point wFM, as stated next.

###### Proposition 2.

The minimizer of (11) admits a closed-form expression in terms of m_{k−1}, x_k and w_k, so each recursive step can be computed without any numerical optimization.
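For intuition, the two-point recursion in (11) can be sketched on a finite-dimensional unit sphere using the standard arc-length geodesic distance; note this is an illustrative stand-in only, since the paper's metric d_S differs and hence so do the constants in its closed form:

```python
import numpy as np

def slerp(a, b, t):
    """Point at fraction t along the great circle from unit vector a
    to unit vector b (the two-point weighted mean on the sphere)."""
    theta = np.arccos(np.clip(a @ b, -1.0, 1.0))
    if theta < 1e-12:
        return a
    return (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)

def recursive_sphere_fm(xs):
    """Incremental Frechet-mean estimate: fold in the k-th sample with
    Euclidean weight 1/k, mirroring the recursion in (11)."""
    m = xs[0]
    for k, x in enumerate(xs[1:], start=2):
        m = slerp(m, x, 1.0 / k)
    return m
```

Each step touches one sample and uses only a closed-form interpolation, which is what makes the recursive estimator so much cheaper than re-solving (1) from scratch at every time step.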

#### Consistency and Convergence analysis of the estimator.

The following proposition (see supplement for proof) gives us the weak consistency of this estimator and also the convergence rate.

###### Proposition 3.
(a) m_N converges in probability to the FM as N → ∞. (b) The rate of convergence of the proposed recursive FM estimator is superlinear.

Due to Proposition 1, we obtain a consistency result for (10) with our mapping. These results show that we now have an FM estimator that is both consistent and efficient – it can be used as a black-box module in our RNN formulation in (6)-(9).

## 5 Experiments

In this section, we demonstrate the application of the SPD-SRU to answer three important questions: (1) Using the manifold constraint, what do we save in terms of the number of parameters and time, and is the performance competitive? (2) When the data are manifold valued, can we still use our framework with the geometry constraint? (3) In a real application, how much improvement can we get over the baseline?

We perform three sets of experiments to answer these questions, namely: (1) classification of moving patterns on Moving MNIST data, (2) classification of actions on UCF11 data and (3) permutation testing to detect group differences between patients with and without Parkinson’s disease. In the following subsections, we discuss each of these datasets in more detail and present the performance of our SPD-SRU. Our code is available at https://github.com/zhenxingjian/SPD-SRU/tree/master.

### 5.1 Savings in terms of number of parameters/time and experiments on vision datasets.

In this section, we perform two sets of experiments, namely (1) classification of moving patterns on Moving MNIST data and (2) classification of actions on UCF11 data, to show the improvement of our proposed framework over state-of-the-art methods in terms of the number of parameters and time. We compared with the LSTM, SRU, TT-GRU and TT-LSTM. In these two classification applications, we use a convolution block before the recurrent unit for all the competing methods except TT-GRU and TT-LSTM. In our SPD-SRU model, before the recurrent layer, we included a covariance block (analogous to prior work, which details its construction) after one convolution layer. So, the input of our SPD-SRU layer is a sequence of SPD matrices whose dimension equals the number of channels from the convolution layer.

#### Classification of moving patterns on Moving MNIST data

We used the Moving MNIST data as generated in prior work. For this experiment, we ran classification experiments with different numbers of classes. In each class, we generated sequences, each of a fixed length, showing digits moving in a frame. Although the digits within a class are random, we fixed the moving pattern of each class by fixing the speed and direction of the movement. In this experiment, the speed is the same for all sequences, but two sequences from two different classes differ in the orientation of the movement. We experimentally find that the SPD-SRU achieves very good cross-validated testing accuracy even when the orientation difference between two classes is small. In fact, the SPD-SRU uses the smallest number of parameters among all methods tested and still offers the best average testing accuracy.

In Table 1, we report the mean and standard deviation of the cross-validated testing accuracy. We would like to point out that the training accuracy saturates for all the competing methods in all cases. For TT-RNN, we reshaped the input and fixed the output shape and rank; for the LSTM and SRU, we fixed the number of output units and the number of statistics, respectively. For each of these baselines, we tried several configurations and report the one with the smallest number of parameters achieving the reported testing accuracy. For the convolution layer, we chose the kernel size and the input and output channel counts so that the dimension of the SPD matrix remains small; as before, the configuration is chosen so that the number of parameters is smallest for the reported testing accuracy.

One can see from the table that the SPD-SRU uses the fewest parameters and achieves very good classification accuracy even for a small orientation difference and for three classes. Note that TT-RNN is closest to the SPD-SRU in terms of the number of parameters. For comparison, we ran another experiment where we vary the difference in orientation between the two classes. The testing accuracies are shown in Fig. 2. We can see that only the SPD-SRU maintains good cross-validated testing accuracy for all orientation differences, while the performance of TT-RNN (both variants) deteriorates as we decrease the difference between the orientations of the two classes (the effect size).

In terms of training time per epoch, TT-RNN is the fastest method, while the SPD-SRU is slower per epoch. However, in this experiment, the SPD-SRU converges to the reported results in far fewer epochs than TT-RNN, so the total training times of the two are almost the same. We also point out that, though the number of trainable parameters is smaller for the SPD-SRU than for TT-RNN, the per-epoch time difference is due to constructing the covariance in each epoch, which can be optimized via faster implementations.

#### Classification of actions on UCF11 data

We performed an action classification experiment on the UCF11 dataset. It contains 1600 video clips in total, belonging to 11 classes that summarize the human action visible in each video clip, such as basketball shooting, diving and others. We followed the same processing steps as in prior work: we generate a sequence of RGB frames from each clip at a fixed frame rate, so the lengths of the frame sequences vary across videos. For the SPD-SRU, we chose two convolution layers followed by the SPD-SRU layers; the dimension of the covariance matrices is determined by the number of output channels of the last convolution layer. For TT-GRU and TT-LSTM, we used the same configurations of input and output factorization as given in the original work. For the SRU and LSTM, we fixed the number of statistics and the number of output units, respectively, and used convolution layers configured to attain the reported testing accuracies. All the models reach saturated training accuracy. We report the testing accuracy together with the number of parameters and the time per epoch in Table 2. From this experiment, we can see that the number of parameters for the SPD-SRU is significantly smaller than for the other models without sacrificing testing accuracy. In terms of training time, the SPD-SRU takes more time per epoch than TT-RNN but converges in fewer epochs, and it reaches its reported testing accuracy after only a few epochs. Hence, analogous to the previous experiment, we can conclude that the SPD-SRU maintains very good classification accuracy while keeping the number of trainable parameters very small. Furthermore, this experiment indicates that the SPD-SRU can achieve competitive performance on real data with a small number of training parameters in comparable time.

### 5.2 Application on manifold valued data

From the previous two experiments, we can conclude that the SPD-SRU requires a smaller number of parameters. In this subsection, we focus our attention on a neuroimaging application where the data are manifold valued. Because the number of parameters is small, we can perform statistical testing on brain connectivity at the fiber-bundle level. We seek to find group differences between subjects with and without Parkinson’s disease (denoted by ‘PD’ and ‘CON’) based on the M1 fiber tracts in both hemispheres of the brain.

#### Permutation testing to detect group differences

The data pool consists of dMRI (human) brain scans acquired from ‘PD’ patients and ‘CON’ subjects. All images were collected using a 3.0 T MR scanner (Philips Achieva) with a 32-channel quadrature volume head coil. The parameters of the diffusion imaging acquisition sequence were as follows: gradient directions = 64, b-values = 0/1000 s/mm², repetition time = 7748 ms, echo time = 86 ms, flip angle = , field of view = mm, matrix size = , number of contiguous axial slices = 60 and SENSE factor P = 2. We used the FSL software to extract the M1 fiber tracts (denoted by ‘LM1’ and ‘RM1’), which consist of and points respectively (please see Fig. 3 for the M1-SMATT fiber tract template). We fit a diffusion tensor at each point along a tract and extract SPD matrices. Now, for each of the two classes, we use layers of SPD-SRU to learn the tract patterns, obtaining two models for ‘PD’ and ‘CON’ (denoted by ‘mPD’ and ‘mCON’).
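Fitting a diffusion tensor is commonly done by log-linear least squares on the model S = S₀ exp(−b gᵀDg); the 3×3 tensor D is symmetric positive definite. A minimal NumPy sketch of this standard fit (the paper's actual pipeline uses FSL; this illustration and its eigenvalue clipping are ours):

```python
import numpy as np

def fit_diffusion_tensor(signals, s0, bvecs, bval):
    """Log-linear least-squares fit of S = s0 * exp(-b g^T D g).

    signals: measured signal per gradient direction; bvecs: unit gradient
    directions (n x 3); bval: the b-value. Returns a 3x3 SPD tensor.
    """
    g = np.asarray(bvecs, dtype=float)
    # each row encodes the 6 unique entries of the quadratic form g^T D g
    G = np.column_stack([g[:, 0]**2, g[:, 1]**2, g[:, 2]**2,
                         2 * g[:, 0] * g[:, 1],
                         2 * g[:, 0] * g[:, 2],
                         2 * g[:, 1] * g[:, 2]])
    y = -np.log(np.asarray(signals, dtype=float) / s0) / bval
    d, *_ = np.linalg.lstsq(G, y, rcond=None)
    D = np.array([[d[0], d[3], d[4]],
                  [d[3], d[1], d[5]],
                  [d[4], d[5], d[2]]])
    # clip eigenvalues so the returned tensor is strictly SPD
    w, v = np.linalg.eigh(D)
    return v @ np.diag(np.clip(w, 1e-8, None)) @ v.T
```

With 64 noise-free directions this recovers the generating tensor exactly, which is why per-point SPD matrices along a tract are a natural representation for the SPD-SRU.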

Now, we use a permutation test based on a “distance” between ‘mPD’ and ‘mCON’. We define the distance between two network models as proposed in . Here, we assume each subject is independent, hence the use of permutation testing is sensible. We then perform a permutation test for each tract as follows: (i) randomly permute the class labels of the subjects and learn ‘mPD’ and ‘mCON’ models for the new groups; (ii) compute the distance between these two models; (iii) repeat steps (i)–(ii) 10,000 times and report the p-value as the fraction of times the permuted distance is at least as large as the distance between the original ‘mPD’ and ‘mCON’. So, we ask whether we can reject the null hypothesis that there is no significant difference between the tract models learned from the two classes.
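Steps (i)–(iii) can be sketched generically. In the sketch below, `fit_model` and `model_distance` are hypothetical placeholders: in the paper they would be, respectively, training an SPD-SRU on a subject subset and the network distance cited above.

```python
import numpy as np

def permutation_pvalue(labels, fit_model, model_distance, n_perm=10000, rng=None):
    """Two-sample permutation test on a distance between fitted models.

    labels: boolean array, True for one group ('PD'), False for the other ('CON').
    fit_model: maps an array of subject indices to a fitted model.
    model_distance: scalar distance between two fitted models.
    """
    rng = np.random.default_rng(rng)
    idx = np.arange(len(labels))
    observed = model_distance(fit_model(idx[labels]), fit_model(idx[~labels]))
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(labels)  # (i) shuffle class labels
        d = model_distance(fit_model(idx[perm]), fit_model(idx[~perm]))  # (ii)
        count += d >= observed          # permuted distance at least as large
    return count / n_perm               # (iii) fraction = p-value
```

A toy usage with scalar "models" (group means) and absolute difference as the distance yields a small p-value when the two groups are well separated, mirroring the test performed on ‘mPD’ and ‘mCON’.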

As a baseline, we use the following scheme: (i) for each tract of each subject, compute the Fréchet mean (FM) of the SPD matrices on the tract; (ii) use Cramér’s test based on the Stein distance between these means; (iii) perform the permutation test based on the Cramér’s test statistic.
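The two ingredients of this baseline can be sketched in NumPy. The Stein (Jensen-Bregman LogDet) divergence is J(A, B) = log det((A+B)/2) − ½ log det(AB), and its Fréchet mean satisfies the fixed point X = (meanᵢ ((X + Aᵢ)/2)⁻¹)⁻¹, obtained from the first-order optimality condition. The fixed-point iteration below is our illustration, not necessarily the recursive estimator of the cited work:

```python
import numpy as np

def stein_divergence(a, b):
    """Symmetric Stein divergence J(A,B) = logdet((A+B)/2) - 0.5*logdet(A B)."""
    ld = lambda m: np.linalg.slogdet(m)[1]   # log-determinant, numerically stable
    return ld((a + b) / 2.0) - 0.5 * (ld(a) + ld(b))

def frechet_mean_stein(mats, iters=100):
    """Fréchet mean of SPD matrices under the Stein divergence.

    Iterates the fixed point X <- (mean_i ((X + A_i)/2)^{-1})^{-1},
    starting from the arithmetic mean.
    """
    x = np.mean(mats, axis=0)
    for _ in range(iters):
        x = np.linalg.inv(
            np.mean([np.linalg.inv((x + a) / 2.0) for a in mats], axis=0))
    return x
```

The divergence is zero exactly when the two matrices coincide and is symmetric in its arguments, which is what makes it usable inside Cramér’s test.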

We found that, using our SPD-SRU model with layers, the p-values for ‘LM1’ and ‘RM1’ are and respectively, while the baseline method gives p-values of and respectively. Hence, we conclude that, unlike the baseline method, using SPD-SRU we can reject the null hypothesis with confidence. To the best of our knowledge, this is the first result demonstrating an RNN-based statistical significance test applied to tract-based group testing in neuroimaging.

## 6 Conclusions

Non-Euclidean or manifold valued data are ubiquitous in science and engineering. In this work, we studied the setting where the data (or measurements) are ordered, longitudinal or temporal in nature and live on a Riemannian manifold. This setting is common in a variety of problems in statistical machine learning, vision and medical imaging. We presented a generalization of the RNN to such non-Euclidean spaces and analyzed its theoretical properties. Our proposed framework is fast and needs far fewer parameters than the state of the art. Extensive experiments show competitive performance on benchmark computer vision data in comparable time. We also applied our framework to perform statistical analysis of brain connectivity, demonstrating its applicability to manifold valued data.

## References

•  Bijan Afsari. Riemannian lp center of mass: existence, uniqueness, and convexity. Proceedings of the American Mathematical Society, 139(2):655–673, 2011.
•  Bijan Afsari, Rizwan Chaudhry, Avinash Ravichandran, and René Vidal. Group action induced distances for averaging and clustering linear dynamical systems with applications to the analysis of dynamic scenes. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2208–2215. IEEE, 2012.
•  Bijan Afsari and René Vidal. The alignment distance on spaces of linear dynamical systems. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 1162–1167. IEEE, 2013.
•  Iman Aganj, Christophe Lenglet, and Guillermo Sapiro. ODF reconstruction in q-ball imaging with solid angle consideration. In Biomedical Imaging: From Nano to Macro, 2009. ISBI’09. IEEE International Symposium on, pages 1398–1401. IEEE, 2009.
•  Derek B Archer, David E Vaillancourt, and Stephen A Coombes. A template and probabilistic atlas of the human sensorimotor tracts using diffusion mri. Cerebral Cortex, pages 1–15, 2017.
•  Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.
•  Peter J Basser, James Mattiello, and Denis LeBihan. MR diffusion tensor spectroscopy and imaging. Biophysical journal, 66(1):259–267, 1994.
•  Timothy EJ Behrens, H Johansen Berg, Saad Jbabdi, Matthew FS Rushworth, and Mark W Woolrich. Probabilistic diffusion tractography with multiple fibre orientations: What can we gain? Neuroimage, 34(1):144–155, 2007.
•  Alessandro Bissacco, Alessandro Chiuso, Yi Ma, and Stefano Soatto. Recognition of human gaits. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 2, pages II–II. IEEE, 2001.
•  Silvere Bonnabel. Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
•  Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
•  Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717, 2009.
•  Rudrasis Chakraborty, Monami Banerjee, and Baba C Vemuri. H-CNNs: Convolutional neural networks for riemannian homogeneous spaces. arXiv preprint arXiv:1805.05487, 2018.
•  Guang Cheng, Hesamoddin Salehian, John R Forder, and Baba C Vemuri. Tractography from HARDI using an intrinsic unscented kalman filter. IEEE Transactions on Medical Imaging, 34(1):298–305, 2015.
•  Guang Cheng, Hesamoddin Salehian, and Baba C Vemuri. Efficient recursive algorithms for computing the mean diffusion tensor and applications to DTI segmentation. In European Conference on Computer Vision, pages 390–401. Springer, 2012.
•  Anoop Cherian, Suvrit Sra, Arindam Banerjee, and Nikolaos Papanikolopoulos. Efficient similarity search for covariance matrices via the jensen-bregman logdet divergence. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2399–2406. IEEE, 2011.
•  Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
•  Taco S Cohen, Mario Geiger, Jonas Koehler, and Max Welling. Spherical cnns. arXiv preprint arXiv:1801.10130, 2018.
•  Taco S Cohen and Max Welling. Steerable cnns. arXiv preprint arXiv:1612.08498, 2016.
•  Francesca Dominici, Aidan McDermott, Scott L Zeger, and Jonathan M Samet. On the use of generalized additive models in time-series studies of air pollution and health. American journal of epidemiology, 156(3):193–203, 2002.
•  Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
•  Gianfranco Doretto, Alessandro Chiuso, Ying Nian Wu, and Stefano Soatto. Dynamic textures. International Journal of Computer Vision, 51(2):91–109, 2003.
•  Ehsan Elhamifar and René Vidal. Sparse subspace clustering. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2790–2797. IEEE, 2009.
•  Maurice Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. Inst. H. Poincaré, 10(3):215–310, 1948.
•  Brian Hall. Lie groups, Lie algebras, and representations: an elementary introduction, volume 222. Springer, 2015.
•  Sigurdur Helgason. Differential geometry and symmetric spaces, volume 12. Academic press, 1962.
•  Mikael Henaff, Arthur Szlam, and Yann LeCun. Recurrent orthogonal networks and long-memory tasks. arXiv preprint arXiv:1602.06662, 2016.
•  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
•  Xue Hua, Alex D Leow, Neelroop Parikshak, Suh Lee, Ming-Chang Chiang, Arthur W Toga, Clifford R Jack Jr, Michael W Weiner, Paul M Thompson, Alzheimer’s Disease Neuroimaging Initiative, et al. Tensor-based morphometry as a neuroimaging biomarker for alzheimer’s disease: an MRI study of 676 AD, MCI, and normal subjects. NeuroImage, 43(3):458–469, 2008.
•  Zhiwu Huang and Luc J Van Gool. A riemannian network for spd matrix learning. In AAAI, volume 2, page 6, 2017.
•  Zhiwu Huang, Jiqing Wu, and Luc Van Gool. Building deep networks on grassmann manifolds. arXiv preprint arXiv:1611.05742, 2016.
•  Bing Jian, Baba C Vemuri, Evren Özarslan, Paul R Carney, and Thomas H Mareci. A novel tensor distribution model for the diffusion-weighted MR signal. NeuroImage, 37(1):164–176, 2007.
•  Li Jing, Caglar Gulcehre, John Peurifoy, Yichen Shen, Max Tegmark, Marin Soljačić, and Yoshua Bengio. Gated orthogonal recurrent units: On learning to forget. arXiv preprint arXiv:1706.02761, 2017.
•  David G Kendall. Shape manifolds, procrustean metrics, and complex projective spaces. Bulletin of the London Mathematical Society, 16(2):81–121, 1984.
•  Hyunwoo J Kim, Nagesh Adluru, Heemanshu Suri, Baba C Vemuri, Sterling C Johnson, and Vikas Singh. Riemannian nonlinear mixed effects models: Analyzing longitudinal deformations in neuroimaging. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
•  Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.
•  Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork rnn. arXiv preprint arXiv:1402.3511, 2014.
•  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
•  Guy Lebanon et al. Riemannian geometry and statistical machine learning. LAP LAMBERT Academic Publishing, 2015.
•  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
•  C. Lenglet, M. Rousson, and R. Deriche. DTI segmentation by statistical surface evolution. IEEE Trans. on Medical Imaging, 25(6):685–700, 2006.
•  Jingen Liu, Jiebo Luo, and Mubarak Shah. Recognizing realistic actions from videos “in the wild”. In Computer vision and pattern recognition, 2009. CVPR 2009. IEEE conference on, pages 1996–2003. IEEE, 2009.
•  Konstantinos Mammasis and RobertW Stewart. Spherical statistics and spatial correlation for multielement antenna systems. EURASIP Journal on Wireless Communications and Networking, 2010(1):307265, 2010.
•  Maher Moakher and Philipp G Batchelor. Symmetric positive-definite matrices: From geometry to applications and visualization. In Visualization and Processing of Tensor Fields, pages 285–298. Springer, 2006.
•  Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 4694–4702. IEEE, 2015.
•  Junier B Oliva, Barnabás Póczos, and Jeff Schneider. The statistical recurrent unit. arXiv preprint arXiv:1703.00381, 2017.
•  FC Park and Bahram Ravani. Bezier curves on riemannian manifolds and lie groups with kinematics applications. Journal of Mechanical Design, 117(1):36–40, 1995.
•  Xavier Pennec, Pierre Fillard, and Nicholas Ayache. A riemannian framework for tensor computing. International Journal of computer vision, 66(1):41–66, 2006.
•  Sonia Pujol, William Wells, Carlo Pierpaoli, Caroline Brun, James Gee, Guang Cheng, Baba Vemuri, Olivier Commowick, Sylvain Prima, Aymeric Stamm, et al. The DTI challenge: toward standardized evaluation of diffusion tensor imaging tractography for neurosurgery. Journal of Neuroimaging, 25(6):875–882, 2015.
•  Chris Quirk, Arul Menezes, and Colin Cherry. Dependency treelet translation: Syntactically informed phrasal smt. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 271–279. Association for Computational Linguistics, 2005.
•  Hesamoddin Salehian, Guang Cheng, Baba C Vemuri, and Jeffrey Ho. Recursive estimation of the stein center of spd matrices and its applications. In Proceedings of the IEEE International Conference on Computer Vision, pages 1793–1800, 2013.
•  Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
•  Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov. Action recognition using visual attention. arXiv preprint arXiv:1511.04119, 2015.
•  Suvrit Sra. Positive definite matrices and the symmetric stein divergence. Technical report, 2011.
•  Anuj Srivastava, Ian Jermyn, and Shantanu Joshi. Riemannian analysis of probability density functions with applications in vision. In Computer Vision and Pattern Recognition, 2007. CVPR. IEEE Conference on, pages 1–8. IEEE, 2007.
•  Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In International conference on machine learning, pages 843–852, 2015.
•  Julian Straub, Jason Chang, Oren Freifeld, and John Fisher III. A dirichlet process mixture model for spherical data. In Artificial Intelligence and Statistics, pages 930–938, 2015.
•  Umberto Triacca. Measuring the distance between sets of arma models. Econometrics, 4(3):32, 2016.
•  Ruey S Tsay. Analysis of financial time series, volume 543. John Wiley & Sons, 2005.
•  Pavan Turaga, Ashok Veeraraghavan, and Rama Chellappa. Statistical analysis on stiefel and grassmann manifolds with applications in computer vision. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
•  Oncel Tuzel, Fatih Porikli, and Peter Meer. Region covariance: A fast descriptor for detection and classification. Computer Vision–ECCV 2006, pages 589–600, 2006.
•  Bart Vandereycken. Low-rank matrix completion by riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236, 2013.
•  Zhizhou Wang and Baba C Vemuri. Dti segmentation using an information theoretic tensor dissimilarity measure. IEEE Transactions on Medical Imaging, 24(10):1267–1277, 2005.
•  Jia Xu, Vamsi K Ithapu, Lopamudra Mukherjee, James M Rehg, and Vikas Singh. Gosus: Grassmannian online subspace updates with structured-sparsity. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 3376–3383. IEEE, 2013.
•  Yinchong Yang, Denis Krompass, and Volker Tresp. Tensor-train recurrent neural networks for video classification. arXiv preprint arXiv:1707.01786, 2017.
•  Kaicheng Yu and Mathieu Salzmann. Second-order convolutional neural networks. arXiv preprint arXiv:1703.06817, 2017.
•  Ernesto Zacur, Matias Bossa, and Salvador Olmos. Multivariate tensor-based morphometry with a right-invariant riemannian distance on GL+ (n). Journal of mathematical imaging and vision, 50(1-2):18–31, 2014.