## 1 Introduction

In computer vision, convolutional neural networks (CNNs) and their variants are ubiquitous and serve as powerful tools for various tasks, e.g. image classification and segmentation. However, traditional CNNs are restricted to data residing in vector spaces, while data residing in smooth non-Euclidean spaces, e.g. Riemannian manifolds, arise naturally in many problem domains. Although Riemannian manifolds lack a vector space structure, the associated Riemannian metric induces notions of distance and angle (between intersecting curves on the manifold) that are intrinsic to the manifold. Commonly encountered examples of Riemannian manifolds in computer vision are the manifold of

symmetric positive-definite (SPD) matrices, the special orthogonal group, the Grassmann manifold, and the sphere. Recently, there has been growing interest in generalizing the well-known CNN and its variants to cope with these types of data while respecting the underlying geometry. In the past few years, there has been a surge in research to develop deep neural networks that deal with data residing on the aforementioned Riemannian manifolds. At the outset, it will be useful to distinguish two types of problems concerning data in non-Euclidean spaces: (i) data that are samples of functions defined on smooth manifolds, and (ii) data that are samples of manifold-valued functions whose domain is Euclidean, or data that are simply sample points on manifolds. In this paper, we address the problem of developing deep neural networks for the data defined in (ii).

In the context of data defined in (i) above, there has recently been a flurry of research activity in developing analogs of CNNs. For example, masci2015geodesic presented the geodesic convolutional neural network (GCNN), for which they defined geodesic convolution as standard convolution in local geodesic charts. poulenard2018multi presented a convolution for directional functions which reduces to the usual convolution when the underlying manifold is . In both masci2015geodesic and poulenard2018multi, convolutions are performed in local geodesic polar charts constructed on the manifold. Samples of functions defined on a sphere are encountered in numerous applications of computer vision, and to this end there is the spherical CNN work reported in esteves2018learning, kondor2018clebsch, and cohen2018spherical. In these works, group-equivariant convolutions were used to replace the standard convolutions in CNNs. Note that the group action on the sphere corresponds to rotations in 3D, which are members of the group SO(3). Recently, the equivariance of convolutions to more general classes of group actions suited to other Riemannian homogeneous spaces has been reported in kondor2018generalization, banerjee2019dmr, and cohen2018spherical. We will not discuss methods suited to this type of data any further in this paper but refer the reader to bronstein2017geometric, who present a good survey of the state-of-the-art in geometric deep learning.

In the context of data described in (ii) above, huang2017riemannian proposed a network architecture consisting of layers that explicitly utilize the structure of SPD matrices. huang2018building presented a deep network for classification of hand-crafted features residing on a Grassmann manifold. However, the above architectures do not resemble the classic convolutional layer of the traditional CNN, which is viewed as one of the key components of the success of CNNs. Furthermore, the operations used in the above networks are not valid for general Riemannian manifolds. For example, in huang2017riemannian, applying ReLU and logarithms to the eigenvalues is not valid for Grassmann manifolds. Besides convolutional layers, batch normalization is also a useful technique in CNNs to avoid over-fitting, and brooks2019riemannian proposed a batch normalization technique for manifold-valued networks.

In this paper, we focus our attention on data represented on a grid where each of the grid points is associated with a value on a known manifold. However, all the aforementioned works are targeted at specific manifolds, e.g. the Grassmann or SPD manifolds. The lack of a consistent framework for designing deep network architectures for data residing on a general Riemannian manifold is partly due to the fact that there is no natural analog of the convolution operation for manifold-valued data. This justifies the need to generalize the convolution operation to data on Riemannian manifolds in order to develop a consistent deep learning framework for such data. Recently, chakraborty2019deep proposed to use the weighted Fréchet mean (wFM) Frechet1948a as an analog of the classical (Euclidean) convolution operation for data points residing in Riemannian manifolds. Although their definition of the wFM as an analogous operation is valid for any Riemannian manifold, the convexity constraints in the definition of the wFM restrict the range of values that the wFM can take, which can limit the modeling capacity of the network, as we will see later.

In order to generalize the (discrete) convolution operation in Euclidean spaces – which is simply a linear combination of weights and image function values inside a certain window – to Riemannian manifolds, we have to define a meaningful "equivalent" of this linear combination operation in the Riemannian manifold setting. In this paper, we propose to map the manifold-valued data points within a convolution window defined over the manifold-valued image to the tangent space anchored at the FM of these points using the Riemannian Log map.
We then perform the linear combination in the tangent space (which is isomorphic to a Euclidean space) and map the result back to the manifold using the Riemannian Exp map. We provide the details of this operation, called the manifold-valued convolution (MVC), in the next section. Further, we prove that the proposed MVC is equivariant to the isometry group actions admitted by the manifold. Armed with the MVC, we then describe how to build an MVC-Net for manifold-valued data by defining the corresponding activation functions and fully-connected (FC) layers for manifold-valued data.

Thus, the main contributions of our work are the following: (i) we define the MVC operation for general Riemannian manifolds and prove that the MVC is equivariant to isometry group actions admitted by the manifold; (ii) we present a deep neural network architecture based on the MVC, called the MVC-Net, for any Riemannian manifold; (iii) we present experiments demonstrating the performance of the MVC-Net on classification problems encountered in medical image analysis and computer vision, along with comparisons to the state-of-the-art.

The rest of this paper is organized as follows. In section 2, we review some essential background from Riemannian geometry. In section 3, we propose the MVC, a novel generalization of the convolution operation to Riemannian manifold-valued images, and show that the MVC is equivariant to isometry group actions admitted by the manifold. We then propose a deep neural network architecture based on the MVC, called the MVC-Net. In section 4, we present experimental results, and we draw conclusions in section 5.

## 2 Review of Riemannian Geometry

In this section, we review some basic material from Riemannian geometry that is necessary in our work.

Let be a -dimensional Riemannian manifold. For , the *tangent space* of at is denoted , which is a -dimensional vector space. Equipped with the Levi-Civita connection, the geodesic starting at is denoted by , where is some interval containing , and is the initial tangent vector, i.e. . Sometimes a geodesic is specified by its two endpoints, and in this case we denote the geodesic by , such that and . The *exponential map* is defined by , where . The exponential map is a diffeomorphism from to its range, and its inverse is denoted . These two maps are of fundamental importance for our proposed layer, which is discussed in the next section.
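As a concrete illustration (not part of the original formulation), the Exp and Log maps have simple closed forms on the unit sphere. The following sketch, assuming the 2-sphere embedded in R³ and function names of our own choosing, shows both maps and their round-trip property:

```python
import numpy as np

def sphere_exp(p, v):
    """Riemannian exponential map on the unit sphere at base point p.

    Maps a tangent vector v (with <p, v> = 0) to the point reached by
    following the geodesic from p in direction v for unit time."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return p.copy()
    return np.cos(nv) * p + np.sin(nv) * (v / nv)

def sphere_log(p, q):
    """Inverse exponential (Log) map: the tangent vector at p pointing
    toward q whose norm equals the geodesic distance between p and q."""
    c = np.clip(np.dot(p, q), -1.0, 1.0)
    theta = np.arccos(c)          # geodesic distance
    if theta < 1e-12:
        return np.zeros_like(p)
    u = q - c * p                 # component of q orthogonal to p
    return theta * u / np.linalg.norm(u)
```

For any q in a normal neighborhood of p, `sphere_exp(p, sphere_log(p, q))` recovers q, which is the diffeomorphism property used throughout the paper.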

In general, there is no global coordinate system on a Riemannian manifold. Therefore, local coordinate systems are important for performing computations on Riemannian manifolds. The most common choice is the *normal coordinate* system, which is based on the Riemannian exponential map and its inverse, the log map. Normal coordinates are constructed as follows. For , there exists a neighborhood of and a neighborhood such that is a diffeomorphism between and (Lemma 5.10 in lee2006riemannian). The neighborhood is called the *normal neighborhood*. The *normal coordinate* of with respect to the normal neighborhood is given by . This concept is important as we will use it in the definition of the manifold-valued convolution in section 3.

The Riemannian metric induces a distance given by
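The distance formula itself appears to have been lost in extraction; the standard definition, reconstructed here in LaTeX, is the infimum of lengths of curves joining the two points:

```latex
d(x, y) \;=\; \inf_{\gamma}\, \int_0^1 \sqrt{\, g_{\gamma(t)}\big(\dot{\gamma}(t), \dot{\gamma}(t)\big) \,}\; dt ,
```

where the infimum is taken over all piecewise-smooth curves $\gamma$ on the manifold with $\gamma(0) = x$ and $\gamma(1) = y$.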

Let . The Fréchet mean (FM) of is

This is a generalization of the mean of points in a vector space. The existence and uniqueness of the FM are discussed in Afsari2011. To be precise, the FM is unique if the points lie in an open ball of radius , where is the *convexity radius* Groisser2004. In practice, this is always assumed to be the case.
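As an illustrative sketch (assuming sphere-valued data; the helper names are ours, not the paper's), the FM can be computed by a fixed-point iteration that alternates the Log map, tangent-space averaging, and the Exp map:

```python
import numpy as np

def sphere_exp(p, v):
    """Exponential map on the unit sphere at p."""
    nv = np.linalg.norm(v)
    return p if nv < 1e-12 else np.cos(nv) * p + np.sin(nv) * (v / nv)

def sphere_log(p, q):
    """Log map on the unit sphere: tangent vector at p pointing to q."""
    c = np.clip(np.dot(p, q), -1.0, 1.0)
    u = q - c * p
    nu = np.linalg.norm(u)
    return np.zeros_like(p) if nu < 1e-12 else np.arccos(c) * u / nu

def frechet_mean(points, iters=100):
    """Fréchet mean by Riemannian gradient descent: repeatedly average
    the Log-mapped points in the tangent space at the current estimate,
    then step back to the manifold with Exp."""
    m = points[0]
    for _ in range(iters):
        avg = np.mean([sphere_log(m, x) for x in points], axis=0)
        m = sphere_exp(m, avg)
    return m
```

At a fixed point of this iteration the average tangent vector vanishes, which is exactly the first-order optimality condition for the FM (within a convexity ball, per Afsari2011).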

With this intrinsic distance metric, the Riemannian manifold is a metric space, and a natural class of transformations to consider is the *isometries*. For a Riemannian manifold, a transformation is called an isometry if it is a diffeomorphism and , where denotes the pullback by . In this work, we consider isometries from to itself. It is known that the collection of isometries forms a group under composition, denoted . For a smooth map , a desirable property is *isometry equivariance*, i.e. . A related concept is *isometry invariance*, i.e. .

###### Remark.

Note that with a slight abuse of notation, for a metric space , we denote the set of all isometry transformations of by as well.

## 3 MVC-Net Theory and Architecture

In this section, we present the MVC and show that it is equivariant under isometry group actions admitted by the manifold. Then we present the architecture of MVC-Net by introducing the basic constituent layers of the MVC-Net.

### 3.1 Manifold-valued convolution (MVC)

Recall that in a CNN, the convolution operation involves a linear combination of the data in a window, i.e. . Due to the lack of a vector space structure on Riemannian manifolds, we cannot perform this usual convolution on manifold-valued images directly. In this work, we propose a generalization of the standard convolution described above to manifold-valued images, called the manifold-valued convolution (MVC), defined as follows.

###### Definition 1.

Let be a Riemannian manifold and and be two functions defined on , where is the set of all integers. The convolution is defined by

(1)

for where .
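The formula in (1) was lost in extraction; based on the prose description (Log-map the window values at their FM, take a linear combination in the tangent space, Exp-map back), a plausible reconstruction, with $m_n$ denoting the FM of the window values, is:

```latex
(f \ast w)(n) \;=\; \mathrm{Exp}_{m_n}\!\Big( \sum_{k} w(k)\, \mathrm{Log}_{m_n} f(n-k) \Big),
\qquad
m_n = \mathrm{FM}\big( \{\, f(n-k) \,\}_{k} \big).
```

This reduces to the standard discrete convolution when the manifold is Euclidean, since there Exp and Log are ordinary vector addition and subtraction.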

An illustration of the MVC operation can be seen in Figure 1. An important property of the convolution operation in Euclidean spaces is that it is equivariant to translations, which are the natural isometry group actions for Euclidean spaces. Thus the MVC, as a generalization of the convolution to Riemannian manifold-valued images, is expected to possess the analogous property, i.e. equivariance to isometry group actions admitted by the manifold. The following lemma is useful for proving this result.

###### Lemma 1.

Let be an isometry. Then for

where is the differential of at . Therefore when the inverse of exists,

The proof of this lemma can be found in most introductory textbooks on Riemannian geometry, e.g. Proposition 5.9 in lee2006riemannian.
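The equations of Lemma 1 were garbled in extraction; the standard statement (cf. Proposition 5.9 in lee2006riemannian), which is what the proof of Theorem 1 relies on, reads:

```latex
\phi\big( \mathrm{Exp}_x(v) \big) \;=\; \mathrm{Exp}_{\phi(x)}\big( d\phi_x\, v \big),
\qquad \text{hence} \qquad
\mathrm{Log}_{\phi(x)}\big( \phi(y) \big) \;=\; d\phi_x\, \mathrm{Log}_x(y),
```

for an isometry $\phi$, a point $x$, a tangent vector $v$ at $x$, and $y$ in a normal neighborhood of $x$.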

###### Theorem 1.

The MVC is equivariant to isometry group actions on both the domains and the ranges of and , i.e. for and ,

(2)

(3)

(4)

(5)

###### Proof.

We show only (2) here, since the other three equalities follow from similar derivations. First, note that the FM of is , where . This is a consequence of the invariance of the intrinsic distance metric under isometries. Then for

This concludes the proof. ∎

Note that the equivariance is preserved even if the FM is replaced by any other point, as long as the choice of point is also equivariant, e.g. replacing by for some . This avoids the computation of the FM and hence is computationally more efficient. In practice, the analytic forms of and are unknown, and only and are observed for some fixed . Thus, from now on, we consider and instead of and . In this situation, the MVC simplifies to

where . For applications in computer vision and medical imaging, the domain is usually or .
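To make the simplified, grid-based MVC concrete, here is a minimal sketch for a 1-D signal of unit-sphere-valued samples (the helper names are ours; the FM is computed by the usual fixed-point iteration):

```python
import numpy as np

def sphere_exp(p, v):
    """Exponential map on the unit sphere at p."""
    nv = np.linalg.norm(v)
    return p if nv < 1e-12 else np.cos(nv) * p + np.sin(nv) * (v / nv)

def sphere_log(p, q):
    """Log map on the unit sphere: tangent vector at p pointing to q."""
    c = np.clip(np.dot(p, q), -1.0, 1.0)
    u = q - c * p
    nu = np.linalg.norm(u)
    return np.zeros_like(p) if nu < 1e-12 else np.arccos(c) * u / nu

def frechet_mean(points, iters=50):
    """FM via tangent-space averaging at the current estimate."""
    m = points[0]
    for _ in range(iters):
        m = sphere_exp(m, np.mean([sphere_log(m, x) for x in points], axis=0))
    return m

def mvc_1d(signal, weights):
    """Manifold-valued convolution (valid mode) on a list of unit vectors.

    Per window: (1) compute the FM of the window values (the anchor),
    (2) Log-map the values to the tangent space at the anchor,
    (3) take the weighted linear combination there,
    (4) Exp-map the result back to the sphere."""
    K, out = len(weights), []
    for n in range(len(signal) - K + 1):
        window = signal[n:n + K]
        m = frechet_mean(window)
        v = sum(w * sphere_log(m, x) for w, x in zip(weights, window))
        out.append(sphere_exp(m, v))
    return out
```

Note that, unlike the wFM of chakraborty2019deep, the weights here are unconstrained; a constant signal is mapped to itself because all Log vectors vanish at the anchor.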

### 3.2 Activation Functions for MVC-Net

In classical neural networks, activation functions, e.g. ReLU, sigmoid, and tanh, play an important role: they make the resulting network non-linear, which allows us to build deep networks by stacking layers of different sizes interleaved with activation functions. The choice of activation function has been studied extensively, and there are a few guidelines for choosing one. First, the activation function must be a contraction map Mallat2016; the precise definition of a contraction map is given below. Second, the activation function should prevent multiple stacked layers of the network from collapsing into a single layer, which is what allows us to build a deep network. In this section, we analyze the MVC-Net in the context of these guidelines. We first show that the MVC layer is not a contraction and that, under some conditions, cascaded MVC layers collapse into one. We then give a possible choice of activation function for use in the MVC-Net.

#### Contraction Property

The following definition of contraction is from Mallat2016.

###### Definition 2.

Let where and are metric spaces with distance metrics and , respectively. The mapping is called a *contraction* if for , there exists such that . If for all , , then is called a *non-expansion*.

Since the range of MVC is a normal neighborhood of the anchor point, it can be easily shown that the MVC layer is *not* a contraction by considering large ’s.

#### Collapsibility Property

In classical neural networks, one reason for adding non-linear activation functions between layers, e.g. sigmoid, ReLU, or tanh, is that without them a multi-layer network collapses into a single-layer network. We want to know whether similar behavior is exhibited by the MVC-Net. For example, consider a network with two MVC layers (without a non-linear activation in between). For simplicity, suppose that there are only two MVC "filters" in the first MVC layer and one MVC "filter" in the second, i.e. the first MVC layer takes as input with weights and the second MVC layer takes as input with weights , where and . Is this two-layer MVC-Net equivalent to a one-layer MVC-Net, i.e., does there exist such that ? We answer this question in the affirmative, under some conditions, in the following theorem.

###### Theorem 2.

Let . If and belong to the same normal coordinate chart, then two cascaded MVC layers collapse to a single layer.

###### Proof.

As mentioned earlier, the anchor point of the map (1) can be any point in the normal coordinate chart. Let be such a point. Consider the weights for . First apply the map (1) to and separately to obtain and . Then apply the map (1) to and to obtain . We will show that there exists such that . Observe that

Hence, for and for , and the two layers collapse into a single layer. ∎

If we consider different normal charts for and , i.e. for and for , then the cascaded two-layer structure will not collapse. However, to avoid any possibility of a collapse, e.g. in the case that , we recommend including a non-linear activation function between the layers. The choices of activation function for manifold-valued inputs are, however, limited. As the most widely used activation function in CNNs is the ReLU, we propose to use the tangent ReLU (tReLU) chakraborty2019deep as the activation function for the MVC-Net.
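One plausible form of the tangent ReLU, sketched here for the sphere (the base point `b` and helper names are our assumptions, not necessarily the exact construction of chakraborty2019deep), applies the Euclidean ReLU to the normal coordinates at a fixed base point and maps back:

```python
import numpy as np

def sphere_exp(p, v):
    """Exponential map on the unit sphere at p."""
    nv = np.linalg.norm(v)
    return p if nv < 1e-12 else np.cos(nv) * p + np.sin(nv) * (v / nv)

def sphere_log(p, q):
    """Log map on the unit sphere: tangent vector at p pointing to q."""
    c = np.clip(np.dot(p, q), -1.0, 1.0)
    u = q - c * p
    nu = np.linalg.norm(u)
    return np.zeros_like(p) if nu < 1e-12 else np.arccos(c) * u / nu

def trelu(x, b):
    """Tangent ReLU sketch: elementwise ReLU in the tangent (normal)
    coordinates at base point b. Points whose normal coordinates are
    already non-negative pass through unchanged."""
    return sphere_exp(b, np.maximum(sphere_log(b, x), 0.0))
```

Because the map is non-linear in the tangent coordinates, inserting it between MVC layers rules out the collapse described in Theorem 2.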

### 3.3 Manifold-valued Fully-connected (MVFC) Layers for MVC-Net

The outputs of the last MVC/tReLU layer form a set of points on the manifold. Therefore, the desired FC layer should take points on the manifold as inputs and output labels (hard assignment) or probability vectors (soft assignment). In this work, we adopt the FC layer used in chakraborty2019deep, i.e. for , first transform to and then apply the usual (Euclidean) FC layers as in a CNN.

### 3.4 Architecture of MVC-Net

For classification problems, the architecture we use in this work parallels that of a CNN, i.e.

The number and sizes of the layers are presented in section 4, as they depend on the experimental settings. Besides the classical CNN, various deep network architectures for data in Euclidean spaces have been proposed to solve specific application problems, and the convolutional layer serves as the basic component in most of them. In a similar manner, for manifold-valued data, we envision application-appropriate architectures built with MVC layers as the basic building blocks.
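A minimal sketch of the MVFC stage described in section 3.3 (our own names; sphere-valued inputs assumed): log-map the manifold-valued outputs at their FM, flatten the tangent coordinates, and feed the resulting Euclidean vector to an ordinary linear layer:

```python
import numpy as np

def sphere_exp(p, v):
    """Exponential map on the unit sphere at p."""
    nv = np.linalg.norm(v)
    return p if nv < 1e-12 else np.cos(nv) * p + np.sin(nv) * (v / nv)

def sphere_log(p, q):
    """Log map on the unit sphere: tangent vector at p pointing to q."""
    c = np.clip(np.dot(p, q), -1.0, 1.0)
    u = q - c * p
    nu = np.linalg.norm(u)
    return np.zeros_like(p) if nu < 1e-12 else np.arccos(c) * u / nu

def frechet_mean(points, iters=50):
    """FM via tangent-space averaging at the current estimate."""
    m = points[0]
    for _ in range(iters):
        m = sphere_exp(m, np.mean([sphere_log(m, x) for x in points], axis=0))
    return m

def mvfc(points, W, bias):
    """Manifold-valued FC sketch: tangent coordinates at the FM of the
    inputs, flattened and passed through a Euclidean linear map."""
    m = frechet_mean(points)
    feats = np.concatenate([sphere_log(m, x) for x in points])
    return W @ feats + bias
```

The Euclidean output can then be passed through further standard FC layers and a softmax, as in the classification architectures of section 4.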

## 4 Experiments

In this section we present several experiments demonstrating the performance of the MVC-net. The experiments involve the use of data from medical imaging as well as computer vision domains. In all the experiments, we present comparisons to the state-of-the-art.

### 4.1 Parkinson’s Disease Classification

Model | Non-linearity | # params. | time (s) / sample | Training Accuracy | Test Accuracy
---|---|---|---|---|---
MVC-net | tReLU | | | |
DTI-ManifoldNet | None | | | |
ResNet-34 | ReLU | | | |
CapsuleNet | ReLU | | | |

In this section, we apply the MVC-Net to a classification problem in the field of movement disorders, specifically, using diffusion magnetic resonance images (dMRIs) to classify Parkinson’s disease (PD) patients from controls.

#### Diffusion MRI Data Acquisition and Pre-processing

The dataset we use in this work consists of dMRIs acquired from 355 Parkinson's disease (PD) patients and 356 control (healthy) subjects. These data were acquired from a combination of three sources, namely: (i) the University of Florida (UFL), (ii) the Parkinson's Progression Markers Initiative (PPMI) database (www.ppmi-info.org/data), and (iii) the University of Michigan. The data acquired at UFL are publicly available for research use by request via the National Institute of Neurological Disorders and Stroke (NINDS) Parkinson's Disease Biomarker Program (PDBP). The PDBP data contained images collected using a 3.0 T MR scanner and a 32-channel quadrature volume head coil. The scanning parameters of the dMRI acquisition sequence were as follows: gradient directions = , b-values = , resolution = mm uniform voxel size. The data from the University of Michigan were obtained using a 3T Philips MR scanner with the following parameters: gradient directions = , b-values = , resolution = mm uniform voxel size. Eddy current correction was applied to each data set using standard motion correction techniques.

From each of these dMRIs, 12 regions of interest (ROIs) – six on each hemisphere of the brain – in the sensorimotor tract are segmented by registering to the sensorimotor area tract template (SMATT) Archer2017SMATTS. These tracts are known to be affected by PD. Figure 3 depicts the M1, dorsal premotor cortex (PMd), and supplementary motor area (SMA) tracts. In our experiments, we adopt the most widely used representation of dMRI in the clinic, namely diffusion tensor images. Diffusion tensors (DTs) are symmetric positive-definite matrices basser1994mr.

#### Diffusion Tensor Representation

The DTI representation of diffusion weighted images assumes a local Gaussian distribution of water diffusion within each voxel basser1994mr. The covariance matrix of each local Gaussian is the diffusion tensor, which is a symmetric positive definite (SPD) matrix. Thus we have a field . We can equip the space with the -invariant metric to make it a Riemannian homogeneous manifold. We estimate the diffusion tensor images from the segmented dMRIs of the sensorimotor tracts using the DiPy software DiPyPackage. This data is fed directly into an MVC-net with five layers. The output from the last of these layers forms the input to an MVFC layer, which maps this input into . Next, two standard fully connected layers are applied to this -valued input, followed by a softmax function to output class probabilities. This architecture was found to give the best performance among similar architectures.

#### Classification Results

We compared the performance of the MVC-Net with several deep network architectures, including the ManifoldNet chakraborty2019deep, the ResNet-34 CNN architecture, and a CapsuleNet architecture with dynamic routing. To perform the comparison, we applied each of the aforementioned architectures to the diffusion tensor image data sets described above.

We train our MVC-net architecture for 200 epochs using the cross-entropy loss and the Adam optimizer with learning rate set to . We obtain a 10-fold cross-validation accuracy of . For the ManifoldNet, we achieved a 10-fold cross-validation accuracy of . The ResNet-34 and CapsuleNet architectures are trained directly on the diffusion weighted images (without any diffusion tensor fitting to the dMRI data in the ROIs, since they cannot cope with symmetric positive definite matrix-valued images). With the ResNet-34 architecture, we observe significant overfitting late in training and utilize early stopping to report the best -fold cross-validation result, which still significantly under-performs the MVC-net and ManifoldNet (the only two approaches that respect the underlying geometry of the data). Comprehensive results are reported in Table 2.

As evident from Table 2, the MVC-net outperforms all other methods on both training and test accuracy while simultaneously having the lowest parameter count. Its inference speed under-performs ResNet-34 and CapsuleNet, but these architectures use operations that have been heavily optimized for inference speed over many years. Further, for the envisioned application of automated Parkinson's diagnosis, the sub-second inference speeds we achieve are more than sufficient in practice.

### 4.2 Anatomical Structure to Function Regression

In this experiment, we consider the problem of learning a map from a structural image of the human brain to a functional physiological measurement. Specifically, we consider the problem of mapping Cauchy deformation tensor (CDT) images, estimated relative to an atlas of diffusion MRI scans of the Substantia Nigra Banerjee2016 – a neuro-anatomical region known to be affected by movement disorders – to MDS-UPDRS scores. The MDS-sponsored revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS) is a quantitative measure of PD severity, assigned by a physician, that combines various physical and psychological biomarkers associated with PD, such as sleep quality, depression, and motor skills. The CDT of a diffusion MRI scan captures the deviation of a particular subject from a reference atlas (i.e. an "average" brain over the population); thus, the CDT captures structural information about a particular brain.

#### Data Acquisition and Pre-processing

The data here consist of high angular resolution diffusion MRI (HARDI) TuchHARDI2002 images of 25 controls, 15 essential tremor (ET) patients, and 26 PD patients, acquired using the same parameters as the PDBP data in the previous experiment. For each patient we have corresponding MDS-UPDRS scores. We segment the Substantia Nigra (40 voxels large) from each of these images. Each image is pre-processed to estimate an ensemble average propagator (EAP) at each voxel, leading to an EAP-field representation of the dMRI data. The EAP is a probability distribution that describes the likelihood of water diffusing along a given vector johansen2013diffusion. To compute the CDT, we follow a standard procedure which we outline here. First, we non-rigidly register cheng2009non each of the EAP-field images to the Montreal Neurological Institute (MNI) reference atlas fonov2011unbiased. Let be the Jacobian of the non-rigid registration; then the CDT at each voxel is given by . This gives an SPD matrix at each voxel, hence for each sample we have a dimensional tensor. To summarize, the independent variables are sized CDT fields describing structural properties of a particular human brain, and the dependent variables are the vectors of MDS-UPDRS scores, quantifying the functional severity of movement disorders.

We use an MVC-net architecture operating on the space where the CDT descriptors live. The architecture for this problem consists of layers followed by an MVFC layer and two Euclidean fully connected layers plus a softmax layer. We compare the performance of the MVC-net to state-of-the-art methods for this task in

chakraborty2019deep and Banerjee2016. The performance is quantified in terms of the statistic. Results are summarized in Table 4.

Model |
---|---
MVC-net |
DTI-ManifoldNet chakraborty2019deep | 0.930
NL Manifold Regression Banerjee2016 | 0.925

As is evident from Table 4, the MVC-net outperforms the competing methods on this task, although all methods perform well. Beyond this, the MVC-net again achieves significant parameter efficiency, with parameters for this architecture. Future work will focus on evaluating the MVC-net on this task with much larger datasets.

### 4.3 Video Classification

We now outline an architecture that uses the MVC-net together with covariance blocks Yu2017SecondOrderCNN to perform video classification. We present results of applying this MVC-net architecture to the Moving MNIST dataset, which is generated using the algorithm in Srivastava2015LSTM-Vid. Each video consists of two MNIST digits moving across the frame. The velocity of both digits is fixed across all videos in a class, but the digits themselves vary (in the range ). Different classes have different angles of motion, and the goal is to classify videos based on this angle.

#### Architecture for Video Classification

Model | # params. | time (s) / epoch | orientation ( - ) | orientation ( - ) | orientation ( -- )
---|---|---|---|---|---
MVC-Net | | | | |
Manifold DCNN | | | | |
SPD-TCN | | | | |
SPD-SRU | | | | |
TT-GRU | | | | |
TT-LSTM | | | | |
SRU | | | | |
LSTM | | | | |

We now present an MVC-net architecture for video classification. Given an input video of dimensions , a covariance block Yu2017SecondOrderCNN is applied in parallel to each frame to yield an tensor. An illustration of the architecture is shown in Figure 5. We will now describe the components of this architecture.

For completeness, we will summarize the covariance block design from Yu2017SecondOrderCNN below. The input to the covariance block is an image of size . We first apply a regular CNN without fully connected layers at the end to get a sized output. Now we interpret each channel as a feature vector and compute a covariance matrix of the channel activations. Finally, to incorporate the first order statistics, we append the mean channel activation to both the last row and column of the covariance matrix to get a shaped output.
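The covariance block described above can be sketched as follows (a minimal version following the prose description; the value of the corner entry of the augmented matrix is our assumption, set to 1 as in common second-order pooling variants):

```python
import numpy as np

def covariance_block(feat):
    """feat: (H, W, C) CNN feature map. Returns a (C+1) x (C+1)
    descriptor: the channel covariance augmented with the mean
    channel activation in the last row and column."""
    H, W, C = feat.shape
    X = feat.reshape(H * W, C)           # one C-dim vector per spatial location
    mu = X.mean(axis=0)                  # mean channel activation
    Xc = X - mu
    cov = Xc.T @ Xc / (H * W)            # C x C channel covariance
    out = np.empty((C + 1, C + 1))
    out[:C, :C] = cov
    out[:C, C] = mu                      # mean in last column ...
    out[C, :C] = mu                      # ... and last row
    out[C, C] = 1.0                      # corner entry (our assumption)
    return out
```

Applying this block to every frame of a video yields the per-frame SPD-like descriptors on which the temporal MVC-net operates.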

As mentioned before, applying a covariance block at each frame of a video in parallel yields a shape tensor, where at each frame we have a covariance matrix, which is an element in the space . We now use a one-dimensional temporal MVC-net architecture to map the per-frame covariance descriptors to class outputs. This is no different than traditional temporal CNNs, i.e. at each layer, a moving window slides over the frames and computes a weighted combination. For our architecture, we use the manifold-valued convolution defined earlier on this sequence of frames each represented by a covariance matrix descriptor . Figure 5 depicts a schematic of the MVC-net tailored for the video classification problem.

#### Experimental Results

For this experiment, we use five layers, followed by an MVFC layer, two Euclidean fully connected layers, and a softmax. We use the Adam optimizer with learning rate set to and train for epochs using the cross-entropy loss. 10-fold cross-validation results are summarized in Table 6. As evident from the table, the MVC-net either outperforms or is competitive with all competing methods in terms of test accuracy.

## 5 Conclusion

In this paper, we presented a generalization of CNNs to manifold-valued images, i.e., images whose values lie in Riemannian manifolds. Such data are commonly encountered in many applications, including but not limited to medical imaging and computer vision. We defined an analog of the traditional convolution operation for manifold-valued images and proved that it is equivariant to the isometry group actions admitted by the manifold. Equivariance is a fundamental design principle in traditional CNNs that affords weight sharing in deep neural networks. Further, we proved that a multi-layer MVC-Net requires the use of non-linear activation functions and proposed the tangent ReLU (tReLU) to this end. The final layer of the MVC-Net is the manifold-valued fully connected layer, whose construction is adopted from chakraborty2019deep. Finally, we presented several experiments demonstrating the performance of the MVC-Net on classification problems drawn from medical imaging and computer vision. Comparisons to the state-of-the-art were presented, demonstrating comparable to superior performance of the MVC-Net in terms of classification accuracy, parameter count, and time/epoch efficiency.
