 # A regression approach for explaining manifold embedding coordinates

Manifold embedding algorithms map high dimensional data, down to coordinates in a much lower dimensional space. One of the aims of the dimension reduction is to find the intrinsic coordinates that describe the data manifold. However, the coordinates returned by the embedding algorithm are abstract coordinates. Finding their physical, domain related meaning is not formalized and left to the domain experts. This paper studies the problem of recovering the domain-specific meaning of the new low dimensional representation in a semi-automatic, principled fashion. We propose a method to explain embedding coordinates on a manifold as non-linear compositions of functions from a user-defined dictionary. We show that this problem can be set up as a sparse linear Group Lasso recovery problem, find sufficient recovery conditions, and demonstrate its effectiveness on data.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Problem formulation, assumptions and challenges

We make the standard assumption that the observed data are sampled i.i.d. from a smooth manifold 111The reader is referred to [Lee03] for the definitions of the differential geometric terms used in this paper. of intrinsic dimension smoothly embedded in by the inclusion map. In this paper, we will call smooth any function or manifold of class at least . We assume that the intrinsic dimension of is known; for example, by having been estimated previously by one method in [KvL15]. The manifold is a Riemannian manifold with Riemannian metric inherited from the ambient space . Furthermore, we assume the existence of a smooth embedding map , where typically . We call the coordinates in this dimensional ambient space the embedding coordinates; let . In practice, the mapping of the data onto represents the output of an embedding algorithm, and we only have access to and via and its image .

In addition, we are given a dictionary of user-defined and domain-related smooth functions . Our goal is to express the embedding coordinate functions in terms of functions in .

More precisely, we assume that with a smooth function of variables, defined on a open subset of containing the range of . Let , and . The problem is to discover the set such that . We call the functional support of , or the explanation for the manifold in terms of . For instance, in the toluene example, the functions in are all the possible torsions in the molecule, , and is the explanation for the 1-dimensional manifold traced by the configurations.

##### Indeterminacies

In differential geometric terms, the explanation is strongly related to finding a parametrization of . Hence, . Since the function given by the embedding algorithm is not unique, the function cannot be uniquely determined. Moreover, whenever where is a smooth monotonic function, the support may not be unique either.

In Section 7 we give sufficient and necessary conditions under which can be recovered uniquely; intuitively, they consist of functional independencies between the functions in . For instance, it is sufficient to assume that that the dictionary is a functionally independent set, i.e. there is no that can be obtained as a smooth function of other functions in .

Hence, in this paper we resolve the indeterminacies w.r.t. the support , and we limit our scope to recovering . We leave to future work the problem of recovering information on how the functions combine to explain .

## 2 Background on manifold estimation: neighborhood graph, Laplacian and estimating tangent subspaces

##### Manifold learning and intrinsic geometry

Suppose we observe data points that are sampled from a smooth -dimensional submanifold . The task of manifold learning is to provide a diffeomorphism where . The Whitney Embedding Theorem [Lee03] guarantees the existence of a map satisfying this property with . Hence, a good manifold learner will identify a smooth map with .

##### The neighborhood graph and kernel matrix

The neighborhood graph is a data structure that associates to each data point its set of neighbors , where is a neighborhood radius parameter. The neighborhood relation is symmetric, and determines an undirected graph with nodes represented by the data points .

Closely related to the neighborhood graph are the local position matrices , local embedding coordinate matrices , and the kernel matrix whose elements are

 Kii′=⎧⎪⎨⎪⎩exp(−||ξi−ξi′||ϵ2N) ifi′∈Ni0otherwise. (1)

Typically, the radius and the bandwidth parameter are related by with a small constant greater than 1. This ensures that is close to its limit when while remaining sparse, with sparsity structure induced by the neighborhood graph. Rows of this matrix will be denoted to emphasize that when a particular row is passed to an algorithm, only values need to be passed. The neighborhood graph, local position matrices, and kernel matrix play crucial roles in our manifold estimation tasks.

##### Estimating tangent spaces in the ambient space RD

The tangent subspace at point in the data can be estimated by Weighted Local Principal Component Analysis, as described in the LocalPCA

algorithm. The output of this algorithm is an orthogonal matrix

, representing a basis for . For this algorithm and others we define the SVD algorithm of a symmetrical matrix as outputting , where and are the largest eigenvalues and their eigenvectors, respectively, and denote a column vector of ones of length by . We denote by .

##### The renormalized graph Laplacian

The renormalized graph Laplacian, also known as the sample Laplacian, or Diffusion Maps Laplacian , constructed by Laplacian , converges to the manifold Laplace operator ; [CL06] shows that this estimator is unbiased w.r.t. the sampling density on (see also [HAvL05, HAvL07, THJ10]). The Laplacian is a sparse matrix; row of contains non-zeros only for . Thus, as for , rows of this matrix will be denoted . Again, sparsity pattern is given by the neighborhood graph; construction of the neighborhood graph is therefore the computationally expensive component of this algorithm.

The principal eigenvectors of the Laplacian (or alternatively of the matrix of Algorithm Laplacian), corresponding to its smallest eigenvalues, are sometimes used as embedding coordinates of the data; the embedding obtained is known as the Diffusion Map [CL06] or the Laplacian Eigenmap [BN02] of . We use this embedding approach for convenience, but in general, any algorithm which asymptotically generates a smooth embedding is acceptable.

##### The pushforward Riemannian metric

Geometric quantities such as angles and lengths of vectors in the tangent bundle , distances along curves in , etc., are captured by Riemannian geometry. We assume that is a Riemannian manifold, with the metric induced from . Furthermore, we associate with a Riemannian metric which preserves the geometry of . This metric is called the pushforward Riemannian metric and is defined by

 ⟨u,v⟩g=⟨Dϕ−1(ξ)u,Dϕ−1(ξ)v⟩ for all u,v∈Tϕ(ξ)ϕ(M). (2)

In the above, maps vectors from to , and is the Euclidean scalar product.

For each , the associated push-forward Riemannian metric expressed in the coordinates of , is a symmetric, semi-positive definite matrix of rank . The scalar product takes the form .

The matrices can be estimated by the algorithm RMetric of [PM13]. The algorithm uses only local information, and thus can be run efficiently. For notational simplicity, we refer to the embedding coordinates of the points in a neighborhood of a point as .

## 3 Flasso: sparse non-linear functional support recovery

Our approach for explaining manifold coordinates is predicated on the following general method for decomposing a real-valued function over a given dictionary . The main idea is that when , their differentials at any point are in the linear relationship . This is expressed more precisely as follows.

###### Proposition 1 (Leibniz Rule)

Let with . All maps are assumed to be smooth. Then at every point,

 ∇f=∂h∂g1∇g1+∂h∂g2∇g2+…∂h∂gs∇gs, (3)

or, in matrix notation

 Df =DhDgS.

Given samples , dictionary , and function , we would like to discover that is a function of without knowing . The usefulness of this decomposition is that even if is a non-linear function of , its gradient is a linear combination of the gradients at every point. That is, proposition 1 transforms the functional, non-linear sparse recovery problem into a set of linear sparse recovery problems, one for each data point.

##### Formulating the problem as Group Lasso

Let and , for . These vectors in could be estimated, or available analytically. Then, by (3) we construct the following linear model

 yi=p∑j=1βijxij+ϵi=Xiβi:+ϵi,for i=1:n. (4)

In the above, is the vector . If there exists some such that holds, then non-zero s are estimations of for and zeros indicate . Hence, in , only elements are non-zero. The term is added to account for noise or model misspecification.

The key characteristic of the functional support that we leverage is that the same set of elements will be non-zero for all ’s. Denote by the vector . Since finding the set for which , given and is underdetermined when for example , we use Group Lasso [YL06] regularization in order to zero out as many vectors as possible. In particular, each group has size . Thus, the support recovery can be reformulated as the globally convex optimization problem

 ({\sc FLasso})minβJλ(β)=12n∑i=1||yi−Xiβi||22+λ/√n∑j||βj||, (5)

with a regularization parameter. This framework underscores why we normalize gradients by ; else we would favor dictionary functions that are scaled by large constants, since their coefficients would be smaller. Note that normalization of is not necessary since it rescales all estimated partials the same. We call problem (5) Functional Lasso (FLasso). Experiments on synthetic data in Section 8 illustrate the effectiveness of this algorithm.

##### Multidimensional FLasso

The FLasso extends easily to vector valued functions . To promote finding a common support for all embedding coordinates, groups will overlap the embedding coordinates. Let

 yik=∇fk(ξi)βijk=∂hk∂gj(ξi) (6)

and

 βj=vec(βijk,i=1:n,k=1:m)∈Rmn,βik=vec(βijk,j=1:p)∈Rp. (7)

Then the FLasso objective function becomes

 (8)

In the above, is the vector of regression coefficients corresponding to the effect of function , which form group in the multivariate group Lasso problem. The size of each group is .

## 4 Explaining manifold coordinates with FLasso on the tangent bundle

To explain the coordinates with the FLasso method of Section 3, we need (i) to extend FLasso to functions defined on a manifold , and (ii) to reliably estimate their differentials in an appropriate coordinate system.

These tasks require bringing various “gradients” values into the same coordinate system. For function defined on a neighborhood of in , we denote the usual gradient . Similarly for a function defined on a neighborhood of in , the gradient is denoted . To define the gradient of a function on a manifold, we recall that the differential of a scalar function on at point is a linear functional . The gradient of at , denoted is defined by for any . To distinguish between gradients in the tangent bundles of , respectively , we denote the former by and the latter by .

In this section, we will be concerned with the coordinate functions seen as functions on , and with the gradients and of the dictionary functions. It is easy to see that for a choice of basis in ,

### 4.1 FLasso for vector-valued functions on M

We start from the FLasso objective function for -dimensional vector valued functions (8). Since differentials of functions on a manifold operate on the tangent bundle , the definitions of and from (FLasso) are replaced with the values of the respective gradients in

. For the moment we assume that these are given. In Section

4.2 we will concern ourselves with estimating the parameters of from the data, while is given in (12) below.

For each , we fix an orthogonal basis in  , and denote the orthogonal matrix representing the respective basis vectors by . Hence, any vector in can be expressed as by its coordinates in the chosen basis, or as . For any vector in , represents the coordinates of its projection onto ; in particular, if , . These elementary identities are reflected in (9) above. Let

 yik=∇Tϕk(ξi)Yi=[yik]mk=1∈Rd×mβijk=∂hk∂gj(ξi) (10)

and

 βj=vec(βijk,i=1:n,k=1:m)∈Rmn,βik=vec(βijk,j=1:p)∈Rp. (11)

In the above, the columns of are the coordinates of in the chosen basis of ; is the vector of regression coefficients corresponding to the effect of function , which form group in the multivariate group Lasso problem. The size of each group is . By comparing (6) and (7) with (10) and (11) we see, first, the change in the expression of due to projection. We also note that in the definition of the normalization is omitted. For the present, we will assume that the dictionary functions have been appropriately normalized; how to do this is deferred to Section 5.

To obtain the matrices , we project the gradients of the dictionary functions onto the tangent bundle, i.e.

 Xi=TTi∇ξgj(ξi)]j=1:p∈Rd×p. (12)

The FLasso objective function is

 (13)

an expression identical to (8), but in which denote the quantities in (10) and (12). To note that is invariant to the change of basis . Inded, let be a different basis, with

a unitary matrix. Then,

, , and . The next section is concerned with estimating the gradients from the data.

### 4.2 Estimating the coordinate gradients by pull-back

Since is implicitly determined by a manifold embedding algorithm, the gradients of are in general not directly available, and is known only through its values at the data points. We could estimate these gradients naively from differences between neighboring points, but we choose a differential geometric method inspired by [LSW09] and [PM13], that pulls back the differentials of from to , thus enabling them to be compared in the same coordinate system with the derivatives .

If the range of were , then the gradient of a coordinate function would be trivially equal to the -th basis vector, and . If , by projecting the rows of onto the tangent subspace , we get . We then pull back the resulting vectors to  to get , as defined in (10).

##### Pulling back gradϕϕ=I into TξiM

If the embedding induced by were an isometry, the estimation of could be performed by LocalPCA, and the pull-back following [PM13]. Here we do not assume that is isometric. The method we introduce exploits the fact that, even when is not an isometry, the distortion induced by can be estimated and corrected [PM13]. More precisely, we make use of the RMetric algorithm described in Section |refsec:background to estimate for each , the associated push-forward Riemannian metric, expressed in the coordinates of by , a symmetric, semi-positive definite matrix of rank . Additionally, since the theoretical rank of equals , the principal eigenvectors of represent (an estimate of) an orthonormal basis of .

We now apply (2) to the set of vectors for . Let

 Ai=[ProjTξiM(ξi′−ξi)]i′∈Ni∈Rd×kiBi=[ϕ(ξi′)−ϕ(ξi)]i′∈Ni∈Rm×ki (14)

and as defined in the previous section. Then, (2) implies

 ATiYi≈BTiGiI. (15)

The equality is approximate to first order because the operator above is a first order approximation to the logarithmic map ; the error tends to 0 with the neighborhood radius [dC92]. If is full rank , for , we can obtain by least-squares as

 Yi=(AiATi)−1AiBTiGi. (16)

This equation recovers in the local basis of .

If we desired to obtain in the global coordinates, it suffices to express the columns of as vectors in

(or equivalently to apply the appropriate linear transformation to

).

Algorithm PullBackDPhi summarizes the above steps. For the estimation of , the Laplacian must be available. Note also that in obtaining (15) explicit projection on is not necessary, because any component orthogonal to this subspace is in the null space of .

## 5 The full ManifoldLasso algorithm

We are now ready to combine the algorithms of Section 2 with the results of Sections 3 and 4 into the main algorithm of this paper. Algorithm ManifoldLasso takes as input data sampled near an unknown manifold , a dictionary of functions defined on (or alternatively on an open subset of the ambient space that contains ) and an embedding in . The output of ManifoldLasso is a set of indices in , representing the functions that explain .

##### Computation

The first two steps of ManifoldLasso are construction of the neighborhood graph, and estimation of the Laplacian . As shown in Section 2, is a sparse matrix, hence RMetric can be run efficiently by only passing values corresponding to one neighborhood at a time. Note that in our examples and experiments, Diffusion Maps is our chosen embedding algorithm, so the neighborhoods and Laplacian are already available, though in general this is not the case.

The second part of the algorithm estimates the gradients and constructs matrices . The gradient estimation runtime is where is the number of edges in the neighborhood graph, using Cholesky decomposition-based solvers. Finally, the last step is a call to the GroupLasso, which estimates the support of . The computation time of each iteration in GroupLasso is . For large data sets, one can perform the “for” loop over a subset of the original data while retaining the geometric information from the full data set. This replaces the in the computation time with the smaller factor .

##### Normalization

In Group Lasso, the columns of the matrix corresponding to each group are rescaled to have unit Euclidean norm. This ensures that the FLasso algorithm will be invariant to rescaling of dictionary functions by a constant. Since any multiplication of by a non-zero constant, simultaneously with dividing its corresponding by the same constant leaves the reconstruction error of all ’s invariant, but affects the norm , rescaling will favor the dictionary functions that are scaled by large constants. Therefore, the relative scaling of the dictionary functions can influence the support recovered.

When the dictionary functions are defined on , but not outside . In this case we follow the standard Group Lasso recipe. We calculate the normalizing constant

The above is the finite sample version of , integrated w.r.t. the data density on . Then we set which has the same effect as the normalization for FLasso in Section 3.

In the case when the dictionary functions are defined on a neighborhood around in , we compute the normalizing constant with respect to , that is

 γ2j=1nn∑i=1∥∇ξgj(ξi)∥2. (18)

Then, once again, we set . Doing this favors the dictionary functions whose gradients are parallel to the manifold , and penalizes the ’s which have large gradient components perpendicular to .

##### Tuning

As the tuning parameter is increased, the cardinality of the support decreases. Tuning parameters are often selected by cross-validation in Lasso-type problems, but this is not possible in our unsupervised setting. We base our choice of on matching the cardinality of the support to . The recovery results proved in Section 7.2 state that for a single chart, functionally independent dictionary functions suffice.

In manifolds which cannot be represented by a single chart, the situation is more complicated, and it depends on the topology of the co-domains of the functions . For example, in the toluene data presented in Section 8, the manifold is a circle. This cannot be covered by a single chart, but it can be parametrized by a single function in the dictionary, the angle of the bond torsion in Figure 5. Another case when is the case when no single dictionary function can parametrize the manifold. When is not a functionally independent set, it is possible that the parametrization of by is not unique. Hence, in general, the support size may be equal to or larger.

### 5.1 Variants and extensions

The approach utilized here can be extended in several interesting ways. First, our current approach explains the embedding coordinates produced by a particular embedding algorithm. However, the same approach can be used to directly explain the tangent subspace of , independently of any embedding.

Second, one could set up FLasso problems that explain a single coordinate function. In general, manifold coordinates do not have individual meaning, so it will not be always possible to find a good explanation for a single . However, Figure 6 shows that for the ethanol molecule, whose manifold is a torus, there exists a canonical system of coordinates in which each coordinate is explained by one torsion.

Third, manifold codomains of can be used, eliminating certain issues of continuity and chart-counting.

Finally, when the gradients of the dictionary functions are not analytically available, they can also be estimated from data.

## 6 Related work

To our knowledge, ours is the first solution to estimating a function as a non-linear sparse combination of functions in a dictionary. Below we cite some of the closest related work.

Group Lasso has been widely used in sparse regression and variable selection. In [OPV14] the Functional Sparse Shrinkage and (FuSSO) is introduced to solve similar problem in Euclidean space setting. FuSSO uses Group Lasso to recover as a sparse linear combination of functions from an infinite dictionary given implictly by the decomposition over the Fourier basis. More importantly, this method imposes an additive model between the response and the functional covariates, which is not true in our method.

Gradient Learning [YX12]is also a similar work trying to recover non-zero partial derivatives via Group Lasso type regression as methods of variable selection and dimension reduction. Their work does not have functional covariate like us as input. Moreover, the goal of [YX12] is not explaining the coordinates but instead predicting a response given the covariates.

The Active Subspace method of [CDW14] uses the information in the gradient to discover a subspace of maximum variation of . This subspace is given by the principal subspace of the matrix , where is a weighting function averaging over a finite or infinite set of points in the domain of . While this method uses the gradient information, it can only find a global subspace, which would not be adequate for function composition, or for functions defined on non-linear manifolds.

The work of [BPK16] is similar to ours in that it uses a dictionary. The goal is to identify the functional equations of non-linear dynamical systems by regressing the time derivatives of the state variables on a subset of functions in the dictionary, with a sparsity inducing penalty. The recovered functional equation is linear in the dictionary functions, hence any non-linearity in the state variables must be explicitly included in the dictionar. On the other hand, when the functional equation can be expressed as a sum of dictionary functions, then the system is completely identified.

With respect to parametrizing manifolds, the early work of [SR03, TR02] (and references therein) proposes parametrizing the manifold by finite mixtures of local linear models, aligned in a way that provides global coordinates, in a way reminiscent of LTSA[ZZ04].

A point of view different from ours views a set of eigenvectors of the Laplace-Beltrami operator as a parametrization of . Hence, the Diffusion Maps coordinates could be considered such a parametrization [CL06, CLL05, Gea12]. With [MN17]

it was shown that principal curves and surfaces can provide an approximate manifold parametrization. Our work differs from these in two ways (1) first, obviously, the explanations we obtain are endowed with the physical meaning of the domain specific dictionaries, (2) less obviously, descriptors like principal curves or Laplacian eigenfunctions are generally still non-parametric (i.e exist in infinite dimensional function spaces), while the parametrizations by dictionaries we obtain (e.g the torsions) are in finite dimensional spaces.

[DTCK18] tackles a related problem: choosing among the infinitely many Laplacian eigenfunctions which provide a -dimensional parametrization of the manifold; the approach is to solve a set of Local Linear Embedding [RS00] problems, each aiming to represent an eigenfunction as a combination of the preceding ones.

## 7 Uniqueness and recovery results

### 7.1 Functional dependency and uniqueness of representation

Here we study the conditions under which is uniquely represented over a dictionary that contains . Not surprisingly, we will show that these are functional (in)dependency conditions on the dictionary.

We first study when a group of functions on an open can be represented with a subset of functionally independent functions. The following lemma implies that if a group of non-full-rank smooth functions has a constant rank in a neighborhood, then it can be showed that locally we can choose a subset of these functions such that the others can be smoothly represented by them. This is a direct result from the constant rank theorem.

###### Lemma 2 (Remark 2 after Zorich[Zor04] Theorem 2 in Section 8.6.2)

Let be a mapping defined in an neighborhood of a point . Suppose and the rank of the mapping is at every point of a neighborhood and . Moreover assume that the principal minor of order of the matrix is not zero. Then in some neighborhood of there exist functions such that

 fi(x1,x2,⋯,xd)=gi(f1(x1,x2,⋯,xd),f2(x1,x2,⋯,xd),⋯,fk(x1,x2,⋯,xd)) (19)

Applying this lemma we can construct a local representation of a subset in . The following classical result in differential geometry lets us expand the above lemma beyond local to global.

We start with a definition. A smooth partition of unity subordiante to is an indexed family of smooth functions with the following properties:

1. for all and all ;

2. supp for each ;

3. Every has a neighborhood that intersects supp for only finitely many values of ;

4. for all .

###### Lemma 3 (John 2013[Lee03] Theorem 2.23)

Suppose is a smooth manifold, and is any indexed open cover of . Then there exists a smooth partition of unity subordinate to

Now we state our main results.

###### Theorem 4

Assume and are defined as in Section 3, with are functions in open set and let be another subset of functions. Then there exists a mapping on such that iff

 rank(DgSDgS′)=rankDgS′ on U (20)

Proof If is not full rank on , we replace with a subset of so that is full rank globally on . The existence of such subsets are guaranteed by step-wise eliminating functions in

, which will result in a zero matrix in r.h.s if such a subset does not exist and thus leads to a contradiction. Since

, . Let

 gS′∪S(ξ)=(gS′(ξ)gS(ξ)) (21)

and denote the l.h.s. matrix in (20). When the rank of equals the rank of , according to Lemma 2, there exists some neighborhood of and functions

 giS′∪S(ξ)=τix(g1(ξ),⋯,gs′(ξ)), for i=s′+1,s′+2,⋯,s′+s, ξ∈Ux (22)

Here we should notice that is defined only on a neighborhood of . We denote such a neighborhood by and then since this holds for every , therefore we can find an open cover of the original open set . Since each open set in is a manifold, the classical result of partition of unity in Lemma 3 holds, that admits a smooth partition of unity subordinate to the cover . We denote this partition of unity by .

Hence we can define

 τi(ξ)=∑xϕx(ξ)τix(gj1(ξ),⋯,gj|S∪S′|(ξ)), (23)

where the functions in are taken without repetition. For each , the product is on the neighborhood . According to the properties of partition of unity, this is a locally finite sum, which means that we do not have to deal with the problem of convergence. Also this will be a function.

Therefore, globally in we have

 giS∪S′(ξ)=τi(g1(ξ),⋯,gs′(ξ)), for i=s′+1,s′+2,⋯,s′+s, ξ∈U (24)

Now we prove the converse implication. If , then there is , so that . Pick such that ; such an must exist because otherwise it will be in . By the theorem’s assumption, . This implies that is in for any . But this is impossible at .

A direct corollary of this theorem is that in a single chart scenario, we can use exactly functionally independent functions in the dictionary to give the explanation.

###### Corollary 5

Let defined as before. is a smooth manifold with dimension embedded in . Suppose that is also an embedding of and has a decomposition for each . If there is a diffeomorphism , then there is a subset of functions with such that for some function we can find on each

Proof Since is a diffeomorphism, we can write uniquely for each . And also . Then the original decomposition of on each is actually

 ϕ∘φ(η)=h∘gS∘φ(η) (25)

and is a mapping . Now we choose such that is full rank and . From

 rank(D˜gSD˜gS′)=rankD˜gS′ (26)

and the previous theorem, we know that there exists a function such that on each . Finally let then

 ϕ(ξ)=ϕ∘φ∘φ−1(ξ)=h∘˜gS∘φ−1(ξ)=h∘τ∘˜gS′∘φ−1(ξ)=˜h∘gS′(ξ) (27)

holds for each . Now we determine the number of functions in . On one hand we can select so that