Riemannian Multi-Manifold Modeling

10/01/2014
by   Xu Wang, et al.
University of Minnesota

This paper advocates a novel framework for segmenting a dataset in a Riemannian manifold M into clusters lying around low-dimensional submanifolds of M. Important examples of M, for which the proposed clustering algorithm is computationally efficient, are the sphere, the set of positive definite matrices, and the Grassmannian. The clustering problem with these examples of M is already useful for numerous application domains such as action identification in video sequences, dynamic texture clustering, brain fiber segmentation in medical imaging, and clustering of deformed images. The proposed clustering algorithm constructs a data-affinity matrix by thoroughly exploiting the intrinsic geometry and then applies spectral clustering. The intrinsic local geometry is encoded by local sparse coding and more importantly by directional information of local tangent spaces and geodesics. Theoretical guarantees are established for a simplified variant of the algorithm even when the clusters intersect. To avoid complication, these guarantees assume that the underlying submanifolds are geodesic. Extensive validation on synthetic and real data demonstrates the resiliency of the proposed method against deviations from the theoretical model as well as its superior performance over state-of-the-art techniques.


1 Introduction

Many modern data sets are of moderate or high dimension but manifest intrinsically low-dimensional structures. A natural quantitative framework for studying such data sets is multi-manifold modeling (MMM) or its special case of hybrid-linear modeling (HLM). In the MMM framework a given dataset is modeled as a union of submanifolds (whereas HLM considers a union of subspaces). When proposing a valid algorithm for MMM, one assumes an underlying dataset that can be modeled as a mixture of submanifolds and tries to prove, under some conditions, that the proposed algorithm can cluster the dataset according to the submanifolds. This framework has been extensively studied and applied for datasets embedded in Euclidean space or the sphere (Arias-Castro et al., 2011, 2013; Cetingul and Vidal, 2009; Elhamifar and Vidal, 2011; Kushnir et al., 2006; Ho et al., 2013a; Lui, 2012; Wang et al., 2011).

Nevertheless, there is an overwhelming number of application domains where information is extracted from datasets that lie on Riemannian manifolds, such as the Grassmannian, the sphere, the orthogonal group, or the manifold of symmetric positive (semi)definite [P(S)D] matrices. For example, auto-regressive moving average (ARMA) models are utilized to extract low-rank linear subspaces (points on the Grassmannian) for identifying spatio-temporal dynamics in video sequences (Turaga et al., 2011). Similarly, convolving patches of images with Gabor filters yields covariance matrices (points on the PD manifold) that effectively capture texture patterns in images (Tou et al., 2009). However, current MMM strategies are not sufficiently accurate for handling data in such more general Riemannian spaces.

The purpose of this paper is to develop theory and algorithms for the MMM problem in more general Riemannian spaces that are relevant to important applications.

Related Work.

Recent advances in parsimonious data representations and their important implications in dimensionality reduction techniques have driven the development of non-standard spectral-clustering schemes that achieve state-of-the-art results in modern applications (Arias-Castro et al., 2013; Chen and Lerman, 2009a; Elhamifar and Vidal, 2009; Goh and Vidal, 2008; Goldberg et al., 2009; Harandi et al., 2013; Liu et al., 2013; Zhang et al., 2012). Such schemes rely on the assumption that data exhibit low-dimensional structures, such as unions of low-dimensional linear subspaces or submanifolds embedded in Euclidean spaces.

Several algorithms for clustering on manifolds are generalizations of well-known schemes developed originally for Euclidean spaces. For example, Gruber and Theis (2006) extended the classical K-means algorithm from Euclidean spaces to Grassmannians, and illustrated an application to nonnegative matrix factorization. Tuzel et al. (2005) capitalized on the Riemannian distance of a matrix Lie group to design an efficient mean-shift (MS) algorithm for multiple 3D rigid motion estimation. Subbarao and Meer (2006), as well as Cetingul and Vidal (2009), extended the MS algorithm further to general analytic manifolds, including Grassmannians, Stiefel manifolds, and matrix Lie groups. O'Hara et al. (2011) showed promising results by using the geodesic distance of product manifolds to cluster human expressions, gestures, and actions in videos. Rathi et al. (2007) solved the image segmentation problem, after recasting it as a matrix clustering problem, via probability distributions on symmetric PD matrices.

Goh and Vidal (2008) extended spectral clustering and nonlinear dimensionality reduction techniques to Riemannian manifolds. These previous works are quite successful when the convex hulls of individual clusters are well-separated, but they often fail when clusters intersect or are closely located.

HLM and MMM accommodate low-dimensional data structures by unions of subspaces or submanifolds, respectively, but are restricted to manifolds embedded in either a Euclidean space or the sphere. Many strategies have been suggested for solving the HLM problem, known also as subspace clustering. These strategies include methods inspired by energy minimization (Bradley and Mangasarian, 2000; Ho et al., 2003; Ma et al., 2007; Poling and Lerman, 2014; Tseng, 2000; Zhang et al., 2009, 2010), algebraic methods (Boult and Brown, 1991; Costeira and Kanade, 1998; Kanatani, 2001, 2002; Ma et al., 2008; Ozay et al., 2010; Vidal et al., 2005), statistical methods (Tipping and Bishop, 1999; Yang et al., 2006), and spectral-type methods with various types of affinities representing subspace-related information (Chen and Lerman, 2009a; Elhamifar and Vidal, 2013; Liu et al., 2013; Yan and Pollefeys, 2006; Zhang et al., 2012). Recent tutorial papers on HLM are Vidal (2011) and Aldroubi (2013). Some theoretical guarantees for particular HLM algorithms appear in (Chen and Lerman, 2009b; Lerman and Zhang, 2011; Soltanolkotabi and Candès, 2012; Soltanolkotabi et al., 2014). There are fewer strategies for the MMM problem, which is also known as manifold clustering. They include higher-order spectral clustering (Arias-Castro et al., 2011), spectral methods based on local PCA (Arias-Castro et al., 2013; Goldberg et al., 2009; Gong et al., 2012; Kushnir et al., 2006; Wang et al., 2011), sparse-coding-based spectral clustering in a Euclidean space (Elhamifar and Vidal, 2011) and its modification to the sphere by Cetingul et al. (2014) (the sparse coding encodes local subspace approximation), energy minimization strategies (Guo et al., 2007), methods based on manifold learning algorithms (Polito and Perona, 2001; Souvenir and Pless, 2005), and methods based on clustering dimension or local density (Barbará and Chen, 2000; Gionis et al., 2005; Haro et al., 2006). Notwithstanding, only higher-order spectral clustering and spectral local PCA are theoretically guaranteed (Arias-Castro et al., 2011, 2013).

In a different context, Rahman et al. (2005) suggested multiscale strategies for signals taking values in Riemannian manifolds, in particular, the sphere, the orthogonal group, the Grassmannian, and the PD manifold. Even though Rahman et al. (2005) addresses a completely different problem, its basic principle is similar in spirit to ours and can be described as follows. Local analysis is performed in the tangent spaces, where the exponential and logarithm maps are used to transform data between local manifold neighborhoods and local tangent space neighborhoods. Information from all local neighborhoods is then integrated to infer global properties.

Contributions.

Despite the popularity of manifold learning, the associated literature lacks generic schemes for clustering low-dimensional data embedded in non-Euclidean spaces. Furthermore, even in the Euclidean setting only few algorithms for MMM or HLM are theoretically guaranteed. This paper aims at filling this gap and provides an MMM approach in a non-Euclidean setting with theoretical guarantees that hold even when the clusters intersect. In order to avoid nontrivial theoretical obstacles, the theory assumes that the underlying submanifolds are geodesic, and we refer to this setting as multi-geodesic modeling (MGM). Clearly, this modeling paradigm is a direct generalization of HLM from Euclidean spaces to Riemannian manifolds. A more practical and robust variant of the theoretical algorithm is also developed, and its superior performance over state-of-the-art clustering techniques is exhibited by extensive validation on synthetic and real datasets. We remark that in practice we require that the logarithm map of the manifold can be computed efficiently, and we show that this assumption does not restrict the wide applicability of this work.

We believe that it is possible to extend the theoretical foundations of this work to deal with general submanifolds by using local geodesic submanifolds (in analogy to Arias-Castro et al. (2013)). However, this will significantly increase the complexity of our proof, which is already not simple. Nevertheless, the proposed method directly applies to the more general setting (without theoretical guarantees) since geodesics are only used in local neighborhoods and not globally. Furthermore, our numerical experiments show that the proposed method works well in real practical scenarios that deviate from the theoretical model.

On a more technical level, the paper is distinguished from previous works in multi-manifold modeling by its careful incorporation of "directional information," e.g., local tangent spaces and geodesics. This is done for two purposes: (i) to distinguish submanifolds at intersections; (ii) to filter out neighboring points that belong to clusters different from the cluster of the query point. In this way, the proposed algorithm allows neighborhoods to include points from different clusters, while previous multi-manifold algorithms (e.g., Elhamifar and Vidal (2011)) require a careful choice of neighborhood radii to avoid points belonging to other clusters.

2 Theoretical Preliminaries

We formulate the theoretical problem of MGM and review the background on Riemannian geometry necessary to follow this work.

2.1 Multi-Geodesic Modeling (MGM)

MGM assumes that each point in a given dataset lies in the tubular neighborhood of some unknown geodesic submanifold S_k, k ∈ {1, …, K}, of a Riemannian manifold M. (Footnote 1: The tubular neighborhood with radius τ of S in M, with metric tensor g and induced distance d_g, is B(S, τ) := {x ∈ M : d_g(x, S) ≤ τ}.) The goal is to cluster the dataset into K groups such that the points of the k-th group are associated with the submanifold S_k. Note that if M is a Euclidean space, geodesic submanifolds are subspaces and MGM boils down to HLM, or equivalently, subspace clustering (Elhamifar and Vidal, 2009; Vidal, 2011; Zhang et al., 2012).

For theoretical purposes, we assume the following data model, which we refer to as uniform MGM: the data points are i.i.d. sampled w.r.t. the uniform distribution on a fixed tubular neighborhood of the union of the submanifolds S_1, …, S_K. We denote the radius of the tubular neighborhood by τ and refer to it as the noise level. Figure 1 illustrates data generated from uniform MGM with two underlying submanifolds (K = 2).

Figure 1: Illustration of data generated from a uniform MGM when K = 2.

The MGM problem only serves our theoretical justification. The numerical experiments show that the proposed algorithm works well under a more general MMM setting. Such a setting may include more general submanifolds (not necessarily geodesic), non-uniform sampling and different kinds and levels of noise.

2.2 Basics of Riemannian Geometry

This section reviews basic concepts from Riemannian geometry; for an extended and accessible review of the topic we recommend the textbook by do Carmo (1992). Let M be a D-dimensional Riemannian manifold with a metric tensor g. A geodesic between x, y ∈ M is a curve in M whose length is locally minimized among all curves connecting x and y. Let d_g(x, y) be the Riemannian distance between x and y on M. If T_x M denotes the tangent space of M at x, then T_x S stands for the tangent subspace at x of a d-dimensional geodesic submanifold S containing x. As shown in Figure 2(a), T_x S is a linear subspace of T_x M. The exponential map exp_x maps a tangent vector v ∈ T_x M to a point exp_x(v) ∈ M, which provides local coordinates around x. By definition, the geodesic submanifold S is the image of T_x S under exp_x (cf., Definition 2). The functional inverse of exp_x is the logarithm map log_x from M to T_x M, which maps x to the origin of T_x M. Let x'_i denote the image of a data point x_i in M by the logarithm map at x; that is, x'_i = log_x(x_i).

(a) Tangent space
(b) Logarithm map
Figure 2: Demonstration of the exponential and logarithm maps as well as the tangent and estimated subspaces. (a) The tangent space T_x M and the exponential map exp_x of a manifold at a point x. Note that the tangent subspace T_x S is a pre-image of S under the exponential map. (b) The logarithm map w.r.t. x and the images by log_x of data points in a local neighborhood of x, in particular, x'_i, the image of x_i. Note the difference between T_x S, which is the image of S under log_x, and the subspace estimated from the images of the data points in the local neighborhood.
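On the unit sphere, these maps admit closed forms, which the following minimal Python/NumPy sketch illustrates (helper names are ours; points are assumed to be unit-norm vectors and tangent vectors orthogonal to their base point):

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere: follow the great circle from x
    in the direction of the tangent vector v (assumed orthogonal to x)."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return x.copy()
    return np.cos(t) * x + np.sin(t) * (v / t)

def sphere_log(x, y):
    """Logarithm map on the unit sphere: the tangent vector at x pointing
    along the shortest geodesic toward y, with length d_g(x, y)."""
    c = np.clip(np.dot(x, y), -1.0, 1.0)
    u = y - c * x                      # component of y orthogonal to x
    nu = np.linalg.norm(u)
    if nu < 1e-12:
        return np.zeros_like(x)
    return np.arccos(c) * (u / nu)     # d_g(x, y) = arccos(<x, y>)
```

On more involved manifolds, such as the PD manifold or the Grassmannian, these maps require matrix decompositions (see Section 3.2.3).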

3 Solutions for the MGM (or MMM) Problem in M

We suggest solutions for the MMM problem in M, with theoretical guarantees supporting one of these solutions when the problem is restricted to MGM. Section 3.1 defines two key quantities for quantifying directional information: estimated local tangent subspaces and geodesic angles. Section 3.2 presents the two solutions and discusses their properties.

3.1 Directional Information

The Estimated Local Tangent Subspace T̂_x.

Figure 2(b) demonstrates the main quantity defined here (T̂_x) as well as related concepts and definitions. Assume a dataset generated by uniform MGM with a single d-dimensional geodesic submanifold S; the dataset is thus contained in a tubular neighborhood of S. Since S is geodesic, for any x ∈ S the set of images of the data by the logarithm map log_x is contained in a tubular neighborhood of the d-dimensional subspace T_x S (possibly with a different radius than τ).

Since the true tangent subspace T_x S is unknown, an estimate of it, T̂_x, is needed. Let B(x, r) be the neighborhood of x with a fixed radius r, and let

N_x := { log_x(x_i) : x_i ∈ X ∩ B(x, r) }.   (1)

Moreover, let C_x denote the local sample covariance matrix of the set N_x, and ‖C_x‖ the spectral norm of C_x, i.e., its maximum eigenvalue. Since N_x lies in a tubular neighborhood of a d-dimensional subspace, estimates of the intrinsic dimension of the local tangent subspace, which is also the dimension of T_x S, can be formed from the bottom eigenvalues of C_x (cf., Arias-Castro et al. (2013)). We adopt this strategy of dimension estimation and define the estimated local tangent subspace, T̂_x, as the span in T_x M of the top eigenvectors of C_x. In theory, the number of top eigenvectors is the number of eigenvalues of C_x that exceed η‖C_x‖ for some fixed η ∈ (0, 1) (see Theorem 1 and its proof for the choice of η). In practice, the number of top eigenvectors is the number of top eigenvalues before the largest gap in the spectrum occurs.
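As a minimal sketch of this estimation step (the helper is ours and is reused in later sketches), both the threshold rule and the eigen-gap rule can be implemented as follows; the rows of `logs` are the log-images of the neighbors of x:

```python
import numpy as np

def estimated_tangent_subspace(logs, eta=None):
    """Estimate the local tangent subspace from the log-images (rows of
    `logs`) of the points in B(x, r). If eta is given, keep eigenvalues
    exceeding eta * ||C_x|| (the theoretical rule); otherwise cut at the
    largest eigen-gap (the practical rule). Returns an orthonormal basis."""
    C = np.cov(logs, rowvar=False)              # local sample covariance C_x
    evals, evecs = np.linalg.eigh(C)            # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder to descending
    if eta is not None:
        k = int(np.sum(evals > eta * evals[0])) # evals[0] equals the spectral norm
    else:
        gaps = evals[:-1] - evals[1:]
        k = int(np.argmax(gaps)) + 1            # keep eigenvalues above the largest gap
    return evecs[:, :max(k, 1)]
```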

Empirical Geodesic Angles.

Let γ_{x,y} be the shortest geodesic (global length minimizer) connecting x and y in M, and let v_{x,y} be the tangent vector of γ_{x,y} at x. In other words, v_{x,y} shows the direction at x of the shortest path from x to y; up to scaling, it equals log_x(y). Given a dataset X, the empirical geodesic angle θ(x, y) is the elevation angle (cf., (9) of Lerman and Whitehouse (2009)) between the vector v_{x,y} and the subspace T̂_x in the Euclidean space T_x M.
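Under the identification of T_x M with a Euclidean space, the empirical geodesic angle reduces to the angle between a vector and a subspace, e.g. (a sketch with our own helper name, assuming an orthonormal basis for T̂_x):

```python
import numpy as np

def empirical_geodesic_angle(v, T_basis):
    """Elevation angle between the tangent vector v = log_x(y) and the
    subspace spanned by the orthonormal columns of T_basis, both expressed
    in coordinates of T_x M."""
    v = v / np.linalg.norm(v)
    proj = T_basis @ (T_basis.T @ v)             # orthogonal projection onto the subspace
    cos_angle = np.clip(np.linalg.norm(proj), 0.0, 1.0)
    return np.arccos(cos_angle)                  # zero when v lies in the subspace
```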

3.2 Proposed Solutions

In Section 3.2.1, we propose a theoretical solution for data sampled according to uniform MGM. We start with its basic motivation, then describe the proposed algorithm, and finally formulate its theoretical guarantees. In Section 3.2.2, we propose a practical algorithm. Finally, Section 3.2.3 discusses the numerical complexity of both algorithms.

3.2.1 Algorithm 1: Theoretical Geodesic Clustering with Tangent information (TGCT)

The proposed solution for the MGM-clustering task applies spectral clustering with carefully chosen weights. Specifically, a similarity graph is constructed whose vertices are data points and whose edges represent the similarity between data points. The challenge is to construct a graph such that two points are locally connected only when they come from the same cluster. In this way, spectral clustering recovers the underlying clusters exactly.

For the sake of illustration, let us assume only two underlying geodesic submanifolds, S_1 and S_2, and that the data X was sampled from their union according to uniform MGM. Given a point x ∈ X, one wishes to connect to it the points from the same submanifold within a local neighborhood B(x, r) for some r > 0. Clearly, it is not realistic to assume that all points in B(x, r) are from the same submanifold as x (due to nearness and intersection of clusters, as demonstrated in Figures 3(a) and 3(b)).

We first assume that there is no intersection near x, as demonstrated in Figure 3(a). In order to identify the points in B(x, r) that come from the same submanifold as x, we use local tangent information at x. Say x lies on S_1. If a neighboring point belongs to S_2, then the geodesic from x to it has a large angle with the tangent subspace of S_1 at x. On the other hand, if the neighbor belongs to S_1, then this angle is close to zero. Therefore, thresholding the empirical geodesic angles is beneficial for eliminating neighboring points that belong to a different submanifold (cf., Figure 3(a)).

(a) Angle filtering
(b) Intersection
Figure 3: In panels (a) and (b), blue points lie on the submanifold S_1 and red points lie on the submanifold S_2. In panel (a), a local neighborhood, which is a disk of radius r around a blue point x, is observed, and the goal is to exclude the red points in it. This can be done by thresholding the angles between geodesics and the tangent subspace T̂_x: the angles w.r.t. blue points are close to zero, whereas the angles w.r.t. red points are sufficiently large. In panel (b), a point x lies at the intersection of the two submanifolds and y is an arbitrary point sufficiently far from it. The goal is to assure that x is not connected to y. This can be done by comparing local estimated dimensions: the estimated dimension in the neighborhood of x exceeds the estimated dimension in the neighborhood of y. Due to this dimension difference, the intersection is disconnected from the two submanifolds.

If x is at or near the intersection, it is hard to estimate correctly the tangent spaces of each submanifold, and the geodesic angles may not be reliable. Instead, one may compare the dimensions of the estimated local tangent subspaces. The estimated dimensions of local neighborhoods of data points close to intersections are larger than those of data points farther away from intersections (cf., Figure 3(b)). The algorithm thus connects x to other neighboring points only when their "local dimensions" (the linear-algebraic dimension of the estimated local tangent subspace) are the same. In this way, the intersection will not be connected with the other clusters.

The dimension difference criterion, together with the angle filtering procedure, guarantees that there are no false connections between different clusters (the rigorous argument is established in the proof of Theorem 1). We combine these two simple ideas with the common spectral-clustering procedure to form the Theoretical Geodesic Clustering with Tangent information (TGCT) algorithm (Algorithm 1).

Input: number of clusters K; a dataset X = {x_1, …, x_N}; a neighborhood radius r; a projection threshold η for estimating tangent subspaces; a distance threshold γ; an angle threshold θ_0.
Output: an index set assigning a cluster label to each x_i.
Steps: compute the following geometric quantities around each point:
  for i = 1, …, N do
      For each x_j ∈ X ∩ B(x_i, r) (cf., (1)), compute x'_j = log_{x_i}(x_j).
      Compute the sample covariance matrix C_i of these images.
      Compute the eigenvectors of C_i whose eigenvalues exceed η‖C_i‖ (their span is T̂_{x_i}).
      For all x_j ∈ B(x_i, r), compute the empirical geodesic angles θ(x_i, x_j) (see Section 3.1).
  end for
Form the affinity matrix W: set W_ij = 1 only if the Riemannian distance between x_i and x_j is at most γ, the dimensions of T̂_{x_i} and T̂_{x_j} coincide, and both empirical geodesic angles θ(x_i, x_j) and θ(x_j, x_i) are below θ_0; otherwise set W_ij = 0.
Apply spectral clustering to the affinity matrix W to determine the output.
Algorithm 1 Theoretical Geodesic Clustering with Tangent information (TGCT)
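The following Python sketch assembles a TGCT-style binary affinity from the ingredients above. It reuses `estimated_tangent_subspace` and `empirical_geodesic_angle` from the earlier sketches and takes the manifold's logarithm map and distance as callables; the precise indicator defining W follows our reading of the textual description, not a verbatim transcription of the paper's formula:

```python
import numpy as np

def tgct_affinity(X, r, eta, gamma, theta0, log_map, dist):
    """Binary TGCT-style affinity: connect x_i and x_j only if they are
    within distance gamma, their estimated local dimensions agree, and
    both empirical geodesic angles fall below theta0."""
    N = len(X)
    bases = []
    for i in range(N):
        logs = np.array([log_map(X[i], X[j]) for j in range(N)
                         if j != i and dist(X[i], X[j]) <= r])
        assert len(logs) >= 2, "each point needs neighbors within radius r"
        bases.append(estimated_tangent_subspace(logs, eta=eta))
    W = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            if dist(X[i], X[j]) > gamma:
                continue
            if bases[i].shape[1] != bases[j].shape[1]:
                continue                         # dimension filtering at intersections
            a_ij = empirical_geodesic_angle(log_map(X[i], X[j]), bases[i])
            a_ji = empirical_geodesic_angle(log_map(X[j], X[i]), bases[j])
            if max(a_ij, a_ji) <= theta0:        # angle filtering
                W[i, j] = W[j, i] = 1.0
    return W
```

The resulting matrix W can then be passed to any off-the-shelf spectral-clustering routine.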

The following theorem asserts that TGCT achieves correct clustering with high probability. Its proof is in Section 5. Its statement relies on two constants, clarified in the proof, which depend only on the underlying geometry of the generative model. For simplicity, the theorem assumes that there are only two geodesic submanifolds and that they are of the same dimension; it can, however, be extended to geodesic submanifolds of different dimensions.

Theorem 1

Consider two smooth compact d-dimensional geodesic submanifolds, S_1 and S_2, of a Riemannian manifold M, and let X be a dataset generated according to uniform MGM w.r.t. S_1 ∪ S_2 with noise level τ. If the positive parameters of the TGCT algorithm, r, η, γ, and θ_0, satisfy the inequalities

(2)

then, with probability at least the bound quantified in the proof, the TGCT algorithm clusters correctly a sufficiently large subset of X, whose relative fraction (over all of X) has an expectation bounded below as quantified in the proof.

3.2.2 Algorithm 2: Geodesic Clustering with Tangent information (GCT)

A practical version of the TGCT algorithm, which we refer to as Geodesic Clustering with Tangent information (GCT), is described in Algorithm 2. This is the algorithm implemented for the experiments in Section 4, and its choice of parameters is clarified in Section A.2. GCT differs from TGCT in three ways. First, hard thresholds in TGCT are replaced by soft ones, which are more flexible. Second, the dimension indicator function is dropped from the affinity matrix. Indeed, numerical experiments indicate that the algorithm works properly without the dimension indicator function whenever only a small portion of the points lies near the intersection. This observation makes sense since the dimension indicator is only used in theory to avoid connecting intersection points to points outside the intersection. Finally, pairwise distances are replaced by weights resulting from a sparsity-cognizant optimization task. Sparse coding takes advantage of the low-dimensional structure of submanifolds and produces larger weights for points coming from the same submanifold (Elhamifar and Vidal, 2011).

Input: number of clusters K; a dataset X = {x_1, …, x_N}; a neighborhood radius r; a distance threshold and an angle threshold (default values in Section A.2).
Output: an index set assigning a cluster label to each x_i.
Steps:
  for i = 1, …, N do
      For each x_j ∈ X ∩ B(x_i, r), compute x'_j = log_{x_i}(x_j), and compute the weights c_ij that minimize

(3)

      over the admissible codes specified in (3); complete these weights by setting c_ij = 0 for x_j outside B(x_i, r).
      Compute the sample covariance matrix C_i of the images x'_j. Find the largest gap between consecutive eigenvalues of C_i and compute the corresponding top eigenvectors (their span is T̂_{x_i}).
      For all x_j ∈ B(x_i, r), compute the empirical geodesic angles θ(x_i, x_j) (see Section 3.1).
  end for
Form the affinity matrix W by exponentiating the sparse-coding weights and softly incorporating the distances and empirical geodesic angles:

(4)

Apply spectral clustering to the affinity matrix W to determine the output.
Algorithm 2 Geodesic Clustering with Tangent information (GCT)

Algorithm 2 solves a sparse coding task in (3). The penalty used is non-standard since each code is multiplied by the distance of the corresponding neighbor from the query point (in Cetingul et al. (2014), these multipliers are all 1). These weights were chosen to increase the effect of nearby points (in addition to promoting sparsity). In particular, the weighting avoids sparse representations via far-away points that are unrelated to the local manifold structure (see further explanation in Figure 4). Similarly to Cetingul et al. (2014), the clustering weights in (4) exponentiate the sparse-coding weights.

Figure 4: Illustration of the need for the weighted sparse optimization in (3). Non-weighted sparse optimization may fail to detect the local structure at x in the manifold setting. The distance-dependent multiplier in the penalty avoids assigning large weights to the far-away blue points.
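Since the displayed objective (3) did not survive extraction here, the following sketch shows only its flavor under stated assumptions: a least-squares fit in the tangent space plus an l1 penalty whose per-code multiplier is the neighbor's distance ‖log_x(x_j)‖, with an affine constraint on the codes as in Elhamifar and Vidal (2011). It uses the cvxpy modeling package:

```python
import numpy as np
import cvxpy as cp

def weighted_sparse_weights(logs, lam=0.1):
    """Weighted sparse coding in the tangent space at x. Rows of `logs`
    are log_x(x_j) for the neighbors x_j; scaling the l1 penalty of the
    code c_j by ||log_x(x_j)|| discourages far-away points."""
    dists = np.linalg.norm(logs, axis=1)       # ||log_x(x_j)||, the penalty multipliers
    c = cp.Variable(logs.shape[0])
    recon = logs.T @ c                         # sum_j c_j log_x(x_j), should be near log_x(x) = 0
    objective = cp.Minimize(0.5 * cp.sum_squares(recon)
                            + lam * cp.norm1(cp.multiply(dists, c)))
    cp.Problem(objective, [cp.sum(c) == 1]).solve()  # assumed affine constraint
    return c.value
```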

3.2.3 Computational Complexity of GCT and TGCT

We briefly discuss the computational complexity of GCT and TGCT, while leaving many technical details to Appendices B and C. The computational complexity of GCT scales with the number of points N, the bound k on the number of nearest neighbors in a neighborhood (typically small by the choice of parameters), the cost CR of computing the Riemannian distance between any two points, and the cost CL of computing the logarithm map of a given point w.r.t. another point. Furthermore, once the logarithm map log_x(y) has been computed, CR is negligible, since d_g(x, y) = ‖log_x(y)‖. The complexity of CL depends on the Riemannian manifold M: for the sphere, CL is linear in the ambient dimension; for the space of n × n symmetric PD matrices, CL is dominated by eigendecompositions, i.e., O(n³); for the Grassmannian, CL is dominated by a thin SVD. In all applications of Riemannian multi-manifold modeling we are aware of, M is known and is one of these examples. For a more general or unknown M, estimation of the logarithm map is discussed in Mémoli and Sapiro (2005) (this estimation is rather slow).
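For concreteness, here is a minimal sketch of the logarithm map on the PD manifold, assuming the affine-invariant metric, for which log_X(Y) = X^{1/2} logm(X^{-1/2} Y X^{-1/2}) X^{1/2}; the two eigendecompositions make the O(n³) cost of CL explicit:

```python
import numpy as np

def spd_log(X, Y):
    """Logarithm map at X on the manifold of symmetric PD matrices with
    the affine-invariant metric. Cost is dominated by eigendecompositions."""
    w, V = np.linalg.eigh(X)
    X_half = V @ np.diag(np.sqrt(w)) @ V.T          # X^{1/2}
    X_ihalf = V @ np.diag(1.0 / np.sqrt(w)) @ V.T   # X^{-1/2}
    M = X_ihalf @ Y @ X_ihalf
    mw, mV = np.linalg.eigh((M + M.T) / 2)          # symmetrize for numerical safety
    log_M = mV @ np.diag(np.log(mw)) @ mV.T
    return X_half @ log_M @ X_half
```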

It is possible to reduce the total computational cost under some assumptions. In particular, in theory, it is possible to implement TGCT (or, more precisely, an approximate variant of it) for the sphere or the Grassmannian with computational complexity of order N^(1+ε), where ε is near zero.

4 Numerical Experiments

To assess performance on both synthetic and real datasets, the GCT algorithm is compared with the following algorithms: sparse manifold clustering (SMC) (Elhamifar and Vidal, 2011; Cetingul et al., 2014), which is adapted here for clustering within a Riemannian manifold and still referred to as SMC; spectral clustering with Riemannian metric (SCR) of Goh and Vidal (2008); and embedded K-means (EKM). These three methods and the choices of parameters for all four methods are reviewed in Appendix A.2.

The ground truth labeling is given in each experiment. To measure the accuracy of each method, the assigned labels are first permuted so as to maximally match the ground truth labels. The clustering rate is then the fraction of points whose permuted label agrees with the ground truth label.
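A minimal sketch of this accuracy measure, using the Hungarian algorithm (via scipy) to find the label permutation that maximizes agreement:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_rate(labels_true, labels_pred):
    """Best-permutation clustering accuracy: permute the predicted labels
    to maximally match the ground truth, then report the matched fraction.
    Labels are assumed to be integers in {0, ..., K-1}."""
    K = int(max(labels_true.max(), labels_pred.max())) + 1
    confusion = np.zeros((K, K), dtype=int)
    for t, p in zip(labels_true, labels_pred):
        confusion[t, p] += 1
    rows, cols = linear_sum_assignment(-confusion)   # maximize total matches
    return confusion[rows, cols].sum() / len(labels_true)
```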

4.1 Experiments with Synthetic Datasets

Six datasets were generated. Datasets I and II are from the Grassmannian, datasets III and IV are from the manifold of symmetric positive-definite (PD) matrices, and datasets V and VI are from the sphere. Each dataset contains points generated from two "parallel" or intersecting submanifolds (130 points on each) and corrupted by white Gaussian noise. The exact constructions are described below.

Datasets I and II

The first two datasets are on the Grassmannian. In dataset I, 130 pairs of subspaces are drawn from the following non-intersecting submanifolds:

where the parameters are equidistantly drawn from a fixed interval and the noise vector comprises i.i.d. normal random variables.

In dataset II, 130 pairs of subspaces lie around two intersecting submanifolds as follows:

where the parameters are equidistantly drawn from a fixed interval and the noise vector comprises, again, i.i.d. normal random variables.

Datasets III and IV

The next two datasets are contained in the manifold of symmetric PD matrices.

In dataset III, 130 pairs of matrices from two intersecting groups are generated from the model

(5)

where the parameter is equidistantly drawn from a fixed interval and the noise term is a symmetric matrix whose entries are i.i.d. normal random variables.

In dataset IV, 130 pairs of matrices from two non-intersecting groups are generated from the model

where the parameters are equidistantly drawn from fixed intervals and the noise term is a symmetric matrix whose entries are i.i.d. normal random variables.

Datasets V and VI

Two datasets are constructed on the unit sphere of the 3-dimensional Euclidean space. Dataset V comprises vectors lying around two parallel arcs:

where the parameters are equidistantly drawn from a fixed interval. To ensure membership in the sphere, the generated vectors are normalized to unit length. On the other hand, dataset VI considers two intersecting arcs.

4.1.1 Numerical Results

Each one of the six datasets is generated according to the postulated models above, and each experiment is repeated multiple times. Table 1 shows the average clustering rate for each method. GCT, SMC, and SCR are all based on the spectral clustering scheme. However, when a dataset has low-dimensional structures, GCT's unique procedure of filtering neighboring points yields superior performance over the other methods, because both SMC and SCR are sensitive to the local scale r and require each neighborhood to contain no points from different groups. This becomes clear from the results on datasets I, IV, and V of non-intersecting submanifolds. SMC only works well on dataset I, where most of the neighborhoods contain only points from the same cluster, while neighborhoods in datasets IV and V often contain points from different clusters. Embedded K-means generally requires that the intrinsic means of different clusters are located far from each other; its performance falls short of GCT's when the groups have low-dimensional structures.

Methods   Set I         Set II        Set III       Set IV        Set V         Set VI
GCT       1.00 ± 0.00   0.98 ± 0.01   0.98 ± 0.00   0.95 ± 0.01   0.98 ± 0.01   0.96 ± 0.01
SMC       0.97 ± 0.04   0.66 ± 0.08   0.88 ± 0.03   0.80 ± 0.02   0.55 ± 0.06   0.69 ± 0.05
SCR       0.51 ± 0.00   0.66 ± 0.07   0.84 ± 0.00   0.80 ± 0.00   0.50 ± 0.00   0.53 ± 0.07
EKM       0.50 ± 0.00   0.50 ± 0.00   0.67 ± 0.00   0.50 ± 0.00   0.50 ± 0.00   0.67 ± 0.06
Table 1: Average clustering rates (± standard deviation) on the six synthetic datasets of Section 4.1.

4.2 Robustness to Noise and Running Time

Section 4.1 illustrated GCT’s superior performance over the competing SMC, SCR, and EKM on a variety of manifolds. This section further investigates GCT’s robustness to noise and computational cost pertaining to running time. In summary, GCT is shown to be far more robust than SMC in the presence of noise, at the price of a small increase of running time.

4.2.1 Robustness to Noise

The proposed tangent filtering scheme enables GCT to successfully eliminate neighboring points that originate from different groups. As such, it exhibits robustness in the presence of noise and/or whenever different groups are close or even intersecting. On the other hand, SMC appears to be sensitive to noise due to its sole dependence on sparse weights. Figures 5 and 6 demonstrate the performance of GCT, SMC, SCR, and EKM on the Grassmannian and the sphere for various noise levels (standard deviations of Gaussian noise).

Figure 5: Performance of clustering methods on the Grassmannian for various noise levels. Datasets are generated according to the model of dataset II, but with an increasing standard deviation of the noise.

The datasets in Figure 5 are generated on the Grassmannian according to the model of dataset II in Section 4.1, but with different noise levels (Section 4.1 used a single fixed noise level). Both SMC and SCR appear to be volatile over the different datasets, with their best performance remaining well below that of GCT. It is worth noting that EKM shows poor clustering accuracy. On the contrary, GCT exhibits remarkable robustness to noise, achieving high clustering rates even at the largest standard deviations considered.

Figure 6: Performance of clustering methods on the sphere for various noise levels. Datasets are generated according to the model of dataset VI, but with an increasing standard deviation of the noise.

GCT’s robustness to noise is also demonstrated in Figure 6, where datasets are generated on the unit sphere according to the model of the dataset VI, but with different noise levels. SMC appears to be volatile also in this setting; it collapses when the standard deviation of noise exceeds , since its affinity matrix precludes spectral clustering from identifying eigenvalues with sufficient accuracy (see further explanation on the collapse of SMC at the end of Section A.2).

4.2.2 Running Time

This section demonstrates that GCT outperforms SMC at the price of a small increase in computational complexity. As in any other manifold clustering algorithm, computations have to be performed per local neighborhood, where local linear structures are leveraged to increase clustering accuracy. The overall complexity scales quadratically w.r.t. the number of data points due to the last step of Algorithm 2, which amounts to spectral clustering of the affinity matrix W. Both the optimization task of (3) and the computation of a few principal eigenvectors of the covariance matrix in Algorithm 2 do not contribute much to the complexity, since these operations are performed on the small number of points in each neighborhood. The computational complexity of GCT is detailed in Appendix C. It is also noteworthy that GCT can be fully parallelized, since the computations per neighborhood are independent; nevertheless, such a route is not followed in this section.

Compared with SMC, GCT has one additional component: identifying tangent spaces through local covariance matrices, a task that entails the local calculation of a few principal eigenvectors. Nevertheless, it is shown in Appendix C.1 that this calculation is cheap relative to the other steps of the pipeline.

The ratios of running times between GCT and SMC for all three types of manifolds are reported in Table 2. It can be readily verified that the extra step of identifying tangent spaces increases GCT's running time by no more than 11% over that of SMC.

Running-time ratio   Set I   Set IV   Set VI
GCT/SMC              1.06    1.05     1.11
Table 2: Ratio of running times of GCT and SMC for instances of the synthetic datasets I, IV, and VI.

Ratios of running times were also investigated for increasing ambient dimensions of the sphere. More precisely, dataset VI of Section 4.1, which lies on the unit sphere in 3-dimensional space, was embedded via a random orthonormal matrix into unit spheres of increasing ambient dimension. Figure 7 shows the ratio of the running time of GCT over that of SMC as a function of the ambient dimension. We observe that the extra cost of computing the eigendecomposition in GCT remains a small fraction of SMC's running time, even for the largest ambient dimensions considered.

Figure 7: Relative running times of GCT w.r.t. SMC as the ambient dimension increases. Dataset VI of Section 4.1 was embedded via a random orthonormal matrix into unit spheres of increasing dimension.

4.3 Synthetic Brain Fibers Segmentation

Cetingul et al. (2014) cast the problem of segmenting diffusion magnetic resonance imaging (DMRI) data of different fiber tracts as a clustering problem on the sphere. The crux of the methodology lies in the transformation of diffusion images, associated with different views of the same object, into orientation distribution functions (ODFs), which are nothing but probability density functions on the sphere. The discretized ODF (dODF) is a probability mass function (pmf) f = (f_1, …, f_D), with f_i ≥ 0 and Σ_i f_i = 1, that describes the water diffusion pattern at a corresponding location of the object's image according to the D viewing directions. Given f and a fixed location, the square-root dODF (SRdODF) is the vector (√f_1, …, √f_D), which lies on the unit sphere since f is a pmf. In this way, pixels of diffusion images of the same object at a given location are mapped into elements of the sphere. Cetingul et al. (2014) assume that each fiber tract is mapped into a submanifold of the sphere and thus try to identify different fiber tracts by multi-manifold modeling on it.
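The SRdODF map itself is one line; a sketch (with our own helper name) of mapping a discretized ODF to the unit sphere:

```python
import numpy as np

def sr_dodf(f):
    """Map a discretized ODF (a pmf over the viewing directions) to its
    square-root representation, a point on the unit sphere: sum(f) = 1
    implies ||sqrt(f)||_2 = 1."""
    f = np.asarray(f, dtype=float)
    assert np.all(f >= 0) and np.isclose(f.sum(), 1.0), "f must be a pmf"
    return np.sqrt(f)
```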

As suggested in Cetingul et al. (2014), to differentiate pixels with similar diffusion patterns that are located far from each other in an image, one has to incorporate pixel spatial information in the segmentation algorithm. Therefore, for GCT, SMC, and SCR, the similarity entry of two pixels is modified by incorporating the Euclidean distance between the two pixels, where the similarity matrix before modification is the one produced by each method (e.g., for GCT, it is described in Algorithm 2). For EKM, where no spectral clustering is employed, the dODF is simply augmented with the spatial coordinates of the pixel.

(a) Six points randomly sampled from the colored regions of the domain.
(b) A configuration of two intersecting fibers generated according to the points in panel (a).
Figure 8: Demonstration of fiber generation. Two fibers are generated in panel (b) by fitting two cubic splines to the two point triplets of panel (a), respectively.

Following Cetingul et al. (2014), we consider here the problem of segmenting, or clustering, two 2D synthetic fiber tracts in a rectangular domain. To generate the fibers, six points are randomly chosen in the colored regions of Figure 8(a). Two cubic splines, each passing through one triplet of these points, are set to be the centers of the fibers (cf., red curves in Figure 8(b)). Fibers are defined as the curved bands around the splines with a fixed bandwidth (cf., blue regions in Figure 8(b)).

Methods   SNR=40        SNR=30        SNR=20        SNR=10
GCT       0.80 ± 0.12   0.82 ± 0.12   0.78 ± 0.14   0.80 ± 0.13
SMC       0.73 ± 0.14   0.73 ± 0.13   0.70 ± 0.13   0.67 ± 0.13
SCR       0.66 ± 0.11   0.66 ± 0.11   0.68 ± 0.11   0.66 ± 0.11
EKM       0.59 ± 0.08   0.58 ± 0.08   0.61 ± 0.08   0.59 ± 0.08
Table 3: Mean ± standard deviation of accuracy rates for the experiments on clustering synthetic brain fibers.
Figure 9: Histogram of clustering rates for a fixed noise level over the total number of experiments. Each bar shows the number of experiments whose rates fall within one of the ten intervals of length 0.05 in the partition. For example, since the tallest bar within the top range is the blue one, GCT is the most likely method to achieve almost accurate clustering. On the contrary, the brown bar is the tallest one within a lower range, meaning that a moderate clustering rate is the most likely outcome for EKM over the experiments.

Given a pair of such fibers, the next step is to map each pixel (e.g., both red and blue ones in Figure 8(b)) to a point (an SRdODF) on the sphere. To this end, the software code provided by Canales-Rodriguez et al. (2013) is used to generate the SRdODFs, where diffusion images at multiple gradient directions, together with a baseline image, are considered. The dimensionality of the generated SRdODFs corresponds to the number of viewing directions. Moreover, Gaussian noise was added in the ODF-generation mechanism, resulting in a prescribed signal-to-noise ratio (more details on the construction can be found in Cetingul et al. (2014)). Typical noise levels for real-data brain images are considered: SNR = 40, 30, 20, and 10.

Once the SRdODFs are formed, clustering is carried out on the sphere, which in turn provides a segmentation of pixels according to the different fiber tracts. A number of pairs of synthetic brain fibers are randomly generated, and clustering is performed for each pair. Table 3 reports the mean ± standard deviation of the clustering accuracy rates. The results clearly suggest that GCT outperforms the other three clustering methods. For one representative noise level, Figure 9 plots the sample distributions of accuracy rates and shows that GCT demonstrates the highest probability of achieving almost accurate clustering among the competing schemes.

4.4 Experiments with Real Data

In this section, GCT's performance is assessed on real datasets, in scenarios where the data within each cluster exhibit submanifold structure.

4.4.1 Stylized Application: Texture Clustering

We cluster local covariance matrices obtained from various transformations of images of the Brodatz database (Randen, 2014), where the goal is to distinguish between the different images independently of the transformation.

The Brodatz database contains images of different textures (e.g., brick wall, beach sand, grass) captured under uniform lighting and in frontal view. We apply three simple deformations to these images, which mimic real settings: different lighting conditions, stretching (obtained by shearing), and different viewpoints (obtained by affine transformation). Figure 10 shows sample images from the Brodatz database and their deformations.

Figure 10: Sample images from the Brodatz database and their deformations. The first row shows six original images; in the second row, each image contains a single texture but different regions have different lighting; the third row shows horizontally sheared (distorted) versions of an image; the fourth row shows affine-transformed (change of viewpoint) versions of an image.

Tou et al. (2009) show that region covariances generated by Gabor filters effectively represent the texture pattern of a region (patch). Given a patch, a Gabor filter of size 11×11 with 8 parameters is used to extract 2,500 feature vectors of length 8. This set of feature vectors is then used to compute an 8×8 covariance matrix for the specific patch.
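A sketch of the region covariance descriptor (the small diagonal load is our own implementation choice, added to guarantee strict positive definiteness):

```python
import numpy as np

def region_covariance(features):
    """Region covariance descriptor: given a (num_pixels x 8) array of
    Gabor feature vectors extracted from a patch, return the 8 x 8
    covariance matrix, a point on the PD manifold."""
    C = np.cov(features, rowvar=False)
    return C + 1e-6 * np.eye(C.shape[0])   # jitter for strict positive definiteness
```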

Three clustering tests, one for each type of deformation, are carried out. In each test, 300 transformed patches are generated equally from 3 different textures, and the region covariance is computed for each patch. The clustering algorithms are then applied to the dataset of 300 region covariances belonging to 3 texture patterns. The generation of transformed patches is described below.

I. Lighting transformation:

A single lighting transformation (demonstrated in Figure 10) is applied to three randomly drawn images from the Brodatz database, and 100 patches of size 60×60 are randomly picked from each of the 3 transformed images.

II. Horizontal shearing:

Three randomly drawn images are horizontally sheared by 100 different angles to get 3 sequences of 100 sheared images. From each sheared image, a patch of size 60×60 is randomly picked.

III. Affine transformation:

Three randomly drawn images are affine-transformed to create 3 sequences of 100 affine-transformed images. From each transformed image, a patch of size 60×60 is randomly picked.

Figure 11 plots the projection of the embedded datasets generated by the above procedure onto their top three principal components (the embedding into Euclidean space is done by direct vectorization of the covariance matrices). The submanifold structure in each cluster can be easily observed.

(a) Lighting transformation
(b) Horizontal shearing
(c) Affine transformation
Figure 11: Projection of the covariance matrices of local patches of the 3 transformed images onto their top principal directions. A dataset of covariance matrices is computed for each transformation type; the 8×8 covariance matrices are identified with vectors in R^64 by direct vectorization. The figure demonstrates the underlying structure of 3 submanifolds for the data generated with each kind of transformation.

The procedure of generating the data is repeated 30 times for each type of transformation. GCT as well as the other three clustering methods are applied to these datasets, and the average clustering rates are reported in Table 4. GCT exhibits the best performance for all datasets and all types of transformations.

Methods                   GCT    SMC    SCR    EKM
Lighting transformation   0.73   0.53   0.68   0.67
Horizontal shearing       0.95   0.61   0.85   0.76
Affine transformation     0.83   0.53   0.82   0.76
Table 4: Average clustering rates for each method over 30 datasets.

4.4.2 Clustering Dynamic Patterns.

Spatio-temporal data such as dynamic textures and videos of human actions can often be approximated by linear dynamical models (Doretto et al., 2003; Turaga et al., 2011). In particular, by leveraging the auto-regressive moving average (ARMA) model, we experiment here with two spatio-temporal databases: Dyntex++ and Ballet. Following Turaga et al. (2011), we employ the ARMA model to associate local spatio-temporal patches with linear subspaces of the same dimension. We then apply manifold clustering on the Grassmannian in order to distinguish between different textures and actions in the Dyntex++ and Ballet databases, respectively.

ARMA Model.

The premise of ARMA modeling is based on the assumption that the spatio-temporal dataset under study is governed by a small number of latent variables whose temporal variations obey a linear rule. More specifically, if y(t) is the observation vector at time t (in our case, the vectorized image frame of a video sequence), then

y(t) = C x(t) + w(t),    x(t + 1) = A x(t) + v(t),    (6)

where x(t) is the vector of latent variables, C is the observation matrix, A is the transition matrix, and w(t) and v(t) are i.i.d. sampled vector-valued random variables obeying zero-mean Gaussian distributions.

We next explain the idea of Turaga et al. (2011) for associating subspaces with spatio-temporal data. Given data {y(t)}, the ARMA parameters C and A can be estimated according to the procedure in Turaga et al. (2011). Moreover, by arbitrarily choosing an order m, it can be verified that, for any t, the stacked observations [y(t)ᵀ, y(t+1)ᵀ, …, y(t+m−1)ᵀ]ᵀ are governed, up to noise terms, by the matrix O_m := [Cᵀ, (CA)ᵀ, …, (CA^(m−1))ᵀ]ᵀ, which is known as the m-th order observability matrix. If the observability matrix is of full column rank, which was the case in all of the conducted experiments, the column space of O_m is a low-dimensional linear subspace; in other words, the estimated ARMA model gives rise to a point on the Grassmannian. For a fixed dataset, different choices of the time interval and several local regions within the image give rise to different estimates of C and A, and thus to different points on the Grassmannian.
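A sketch of this construction (helper name ours): stack the observability matrix and orthonormalize it to obtain a representative of the corresponding point on the Grassmannian:

```python
import numpy as np

def observability_subspace(C, A, m):
    """Stack O_m = [C; CA; ...; CA^(m-1)] and return an orthonormal basis
    of its column space: a point on the Grassmannian when O_m has full
    column rank (as in all experiments reported above)."""
    blocks, P = [], np.eye(A.shape[0])
    for _ in range(m):
        blocks.append(C @ P)
        P = A @ P
    O = np.vstack(blocks)
    Q, _ = np.linalg.qr(O)                 # reduced QR; Q spans col(O_m)
    return Q
```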

Dynamic textures.

The Dyntex++ database (Ghanem and Ahuja, 2010) contains dynamic-texture videos divided into categories; it is a hard-to-cluster database due to its low resolution. Three videos were randomly chosen, each one from a distinct category.

Per video sequence, a number of spatio-temporal patches are randomly chosen. Each frame of a patch is vectorized, and a (Gaussian) random linear projection operator is applied to reduce the dimension of each vectorized frame. Each patch thus yields a sequence of low-dimensional observation vectors, which we use to estimate the underlying ARMA model (with fixed latent dimension and observability order). Consequently, a set of points on the Grassmannian is generated per video category.

We expect points of the same cluster to lie near a submanifold of the Grassmannian. This is due to the repeated pattern of textures in space and time (they often look like shifted versions of each other in space and time). To visualize the submanifold structure, we isometrically embedded the Grassmannian into a Euclidean space (Basri et al., 2011), so that subspaces are mapped to Euclidean points, and then projected the latter points onto their top 3 principal components. Figure 13(a) demonstrates this projection as well as the submanifold structure within each cluster.

Ballet database.

The Ballet database (Wang and Mori, 2009) contains videos of actions from a ballet instruction DVD. The frames of all videos share a common size, and the video lengths vary but exceed 100 frames. Different performers have different attire and speed. Three videos, each associated with a different action, were randomly chosen.

Figure 12: Two samples of Ballet video sequences: the first and second rows are from videos demonstrating actions of hopping and leg-swinging, respectively.

Spatio-temporal patches are generated by selecting consecutive frames from each one of 31 overlapping time intervals. In this way, for each of the three videos, 31 spatio-temporal patches are generated. As in the case of the Dyntex++ database, the video patches are vectorized and downsized by a random projection. Following the previous ARMA modeling approach, we fix the latent dimension and the observability order and associate each such patch with a subspace. Consequently, 93 subspaces (31 per cluster) on the Grassmannian are generated. Figure 13(b) visualizes the 3D representation of the subspaces created from three random videos; their intersection represents still motion.

The procedure described above (generating data by randomly choosing 3 videos from the Dyntex++ and Ballet databases and applying the clustering methods on the Grassmannian) is repeated multiple times. The average clustering accuracy rates are reported in Table 5. GCT achieves the highest rates on both datasets.

(a) Dyntex++
(b) Ballet
Figure 13: Projection onto top 3 principal components of the two embedded datasets (the embedding into Euclidean spaces is according to Basri et al. (2011)). A submanifold structure for each cluster is clearly depicted.
Methods    GCT    SMC    SCR    EKM
Dyntex++   0.85   0.69   0.77   0.42
Ballet     0.81   0.76   0.68   0.47
Table 5: Average clustering accuracy rates for the Dyntex++ and Ballet datasets.

5 Proof of Theorem 1

The idea of the proof is as follows. After excluding points sampled near the possibly nonempty intersection of the submanifolds, we form a graph whose vertices are the points of the remaining set and whose edges are determined by the affinity matrix of Algorithm 1. The proof then establishes that the resulting graph has two connected components, which correspond to the two different submanifolds S_1 and S_2. Spectral clustering can exactly cluster such a graph with an appropriate choice of its tuning parameter, which can be specified by a self-tuning mechanism (Zelnik-Manor and Perona, 2004). This claim follows from Ng et al. (2001) and its unpublished supplemental material.

The basic strategy of the proof and its organization are as follows. Section 5.1 presents additional notation used in the proof. Section 5.2 reminds the reader of the underlying model of the proof (with some additional details). Section 5.3 eliminates undesirable events of negligible probability (it clarifies the probability term in the statement of the theorem).

The rest of the proof (described in Sections 5.4-5.8) is briefly sketched as follows. For simplicity, we first assume no noise, i.e., τ = 0. We define a "sufficiently large" set X' (and its subsets X'_1 and X'_2, associated with S_1 and S_2, respectively) by the following formula:

(7)

In the first part of the proof (see Section 5.4), we show that the graphs of X'_1 and X'_2 (with the weights of the TGCT affinity) are each connected. If we can show that the graphs of X'_1 and X'_2 are also disconnected from each other, then the proof can be concluded. To this end, two auxiliary sets, A_1 and A_2, will be instrumental in the proof. We fix a constant (to be specified later in (34)), which depends on the geometry of S_1 and S_2 and on their angles of intersection, and define

(8)

We will verify two inclusions relating these auxiliary sets to X'_1 and X'_2; in fact, they will be a consequence of the second part of the proof. This part shows that the graph of A_1 is disconnected from the graph of X'_1 as well as from the graph of X'_2. Therefore, X'_1 and X'_2 cannot be connected via points in A_1. At last, we show that they also cannot be connected within A_2; that is, we show in the third part of the proof (Section 5.6) that the graphs of X'_1 ∩ A_2 and X'_2 ∩ A_2 are disconnected from each other. These three parts imply that the graphs of X'_1 and X'_2 form two connected components within X'. By definition, X'_1 and X'_2 are identified with the clusters of S_1 and S_2, respectively. To conclude the proof (for the noiseless case), we estimate the measure of the excluded set. More precisely, we consider the measure of the following set:

(9)

This measure estimate and the conclusion of the proof (for the noiseless case) are established in Section 5.7. Section 5.8 discusses the generalization of the proof to the noisy case.

Various ideas of the proof follow Arias-Castro et al. (2013), which considered multi-manifold modeling in Euclidean spaces. Some of the arguments in the proof of Arias-Castro et al. (2013) even apply to general metric spaces, in particular, to Riemannian manifolds. We thus tried to maintain the notation of Arias-Castro et al. (2013).

However, the algorithm construction and the main theoretical analysis of Arias-Castro et al. (2013) are valid only when the dataset lies in a Euclidean space, and it is nontrivial to extend them to a Riemannian manifold. Indeed, the basic idea of Arias-Castro et al. (2013) is to compare local covariance matrices and use this comparison to infer the relation between the corresponding data points, over which those matrices were generated. However, comparing local covariance matrices when the ambient space is a Riemannian manifold is not as straightforward as in Euclidean spaces, because the local covariance matrices are computed at different tangent spaces with different coordinate systems. Instead, we show that it is sufficient to compare the "local directional information" (i.e., empirical geodesic angles) and the "local dimension"; both of these quantities are derived from the local covariance matrices. However, since the nonlinear mapping to the tangent spaces distorts the uniformity assumed within the ambient space, care must be taken in using the inverse nonlinear map, i.e., the logarithm map.

5.1 Notation

We provide notation in addition to that of Section 2.2. Readers are referred to do Carmo (1992) for a complete introduction to Riemannian geometry.

Let B(x, r) and B'(0, r) denote the r-neighborhoods of x in M and of the origin in T_x M, respectively. They are related by the exponential map exp_x as follows: B(x, r) = exp_x(B'(0, r)). We refer to the coordinates obtained in the tangent space by the exponential map as normal coordinates. Using normal coordinates, T_x M is endowed with the (pullback) Riemannian metric g and its measure dV_g. On the other hand, the tangent space can also be identified with R^D by choosing an orthonormal basis; this provides a Euclidean metric and the Lebesgue measure dx on T_x M, in particular, on B'(0, r). There is a simple relation between dV_g and dx (do Carmo, 1992):

dV_g = √(det(g_ij)) dx.    (10)

Figure 14 highlights the difference between the two metrics. It shows the tangent space at the north pole x of the unit sphere and the straight blue line connecting the images of two points y and z in T_x M; this line is the shortest path w.r.t. the Euclidean metric. On the other hand, the shortest path w.r.t. the pullback metric g is clearly the equator (the geodesic connecting y and z), which is the black arc on the sphere; it is different from the blue line. In fact, only lines in T_x M connecting the origin and other points correspond to geodesics on M for a general metric. As a consequence, the measures dV_g and dx generally differ as well.