 # Manifold learning with bi-stochastic kernels

In this paper we answer the following question: what is the infinitesimal generator of the diffusion process defined by a kernel that is normalized such that it is bi-stochastic with respect to a specified measure? More precisely, under the assumption that data is sampled from a Riemannian manifold we determine how the resulting infinitesimal generator depends on the potentially nonuniform distribution of the sample points, and the specified measure for the bi-stochastic normalization. In a special case, we demonstrate a connection to the heat kernel. We consider both the case where only a single data set is given, and the case where a data set and a reference set are given. The spectral theory of the constructed operators is studied, and Nyström extension formulas for the gradients of the eigenfunctions are computed. Applications to discrete point sets and manifold learning are discussed.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

### 1.1. Introduction

In the fields of machine learning and data mining, kernel based methods related to diffusion processes have proven to be effective tools for data analysis, e.g.

[3, 5, 10, 14, 16]. Such methods are often studied as manifold learning problems where the given data is assumed to have been sampled from an underlying Riemannian manifold, see [1, 6, 15, 17]. In this paper, we adopt this manifold perspective to study bi-stochastic kernel constructions. Let be a measure space; we say that a positive kernel is bi-stochastic with respect to if

 ∫Xb(x,z)dμ(z)=∫Xb(z,y)dμ(z)=1for allx,y∈X.

Bi-stochastic kernels can be constructed by appropriately normalizing positive kernels. We are specifically interested in symmetric constructions. Suppose that is a compact topological space, is a positive Borel measure, and is a continuous positive symmetric kernel. Then it follows from a result of Knopp and Sinkhorn  that there exists a positive continuous function such that

 k(x,y)d(x)d(y)is bi-stochastic with respect to dμ.

Moreover, the bi-stochastic normalization of can be determined iteratively by alternating between normalizing the kernel by a function of such that the integral over is equal to , and vice versa. This iteration was first studied in the matrix setting by Sinkhorn . In this paper we consider bi-stochastic kernel constructions in two settings. First, in Section 2, we consider the continuous setting where is a compact smooth Riemannian manifold. Here, we prove our main results, Theorems 1 and 2, which describe a connection between bi-stochastic kernels and diffusion operators. Second, in Section 3, we consider the discrete setting where is a finite set of points. Here, we translate the results of Theorems 1 and 2 into the discrete setting, and present numerical examples. Finally, in Sections 4 and 5 we prove Theorems 1 and 2, respectively.

## 2. Main Results

### 2.1. Single measure

Suppose that is a compact smooth Riemannian manifold endowed with a measure , where is a positive density function with respect to the volume measure on . Assume that is isometrically embedded in and define by

 kε(x,y):=h(|x−y|2ε),

where is a smooth function of exponential decay, and where denotes the Euclidean norm on . Given a positive weight function , we define , and assert that there exists a function such that the kernel defined by

 bε(x,y)=kε(x,y)d(x)d(y)is bi-% stochastic with respect tod^μ,

i.e., . The existence of such a function follows from  and from the smoothness of as described in Section 4.2. Define the operator

 Bε:L2(X,d^μ)→L2(X,d^μ)byBεf(x)=∫Xbε(x,y)f(y)d^μ(y).

Our first result characterizes the infinitesimal generator associated with , and requires a class of smooth test functions. As in , we define to be the span of the first Neumann eigenfunctions of the Laplace-Beltrami operator on , with the convention (by the possibility of multiplying by ) that is positive semi-definite. Let

 m0=∫Rdh(|t|2)dtandm2=∫Rdt21h(|t|2)dt,

where denotes the Lebesgue measure on .

###### Theorem 1.

Suppose is fixed. Then on

 I−Bεεf→m22m0⎛⎜ ⎜ ⎜ ⎜⎝Δ(f(qw)1/2)(qw)1/2−Δ(qw)1/2(qw)1/2f⎞⎟ ⎟ ⎟ ⎟⎠in L2 norm asε→0,

where denotes the identity operator.

For example, suppose that is the Gaussian kernel which satisfies

 m0=∫Rde−|t|2dt=(2π)d/2,andm2=∫Rdt21e−|t|2dt=2(2π)d/2,

and suppose that , where

 qε(x):=∫Xkε(x,y)dμ(y).

Then we have the following Corollary.

###### Corollary 1.

Let be fixed. Suppose and . Then on

 I−Bεεf→Δfin L2% norm asε→0,

where denotes the positive semi-definite Laplace-Beltrami operator on .

The proof of this Corollary is included after the proof of Theorem 1; essentially, the proof amounts to checking that approximates with sufficiently small error such that can be replaced by in the result of Theorem 1. Corollary 1 suggests that the operator can be viewed as an approximation to the heat kernel; more precisely, we have the following result.

###### Corollary 2.

Suppose that and . Then on

 Bt/εεf→e−tΔfin L2% norm asε→0,

where denotes the Neumann heat kernel on .

The proof of Corollary 2 is included following the proof of Theorem 1.

### 2.2. Connection to diffusion maps

The reader might notice the similarity between the bi-stochastic operator and the averaging operator of Lafon  which also approximates the Laplace-Beltrami operator. Specifically, the operator is defined by

 Aεf(x)=1v(x)2∫Xkε(x,y)qε(x)qε(y)f(y)dμ(y),wherev(x)2=∫Xkε(x,y)qε(x)qε(y)dμ(y).

Unlike the operator , the operator is only row stochastic, not bi-stochastic. Thus, we argue that provides a more natural analog to the heat kernel than the averaging operator . Of course, when these constructions are performed on data, numerically these operators are very similar (they have the same limiting behavior) so in the case of a single data set, the difference between and is mostly a matter of perspective. However, this bi-stochastic perspective is valuable in that it translates to more general settings as we demonstrate in the following section, where a reference measure is additionally given. We remark that bi-stochastic kernels have been previously considered in the context of reference graphs by Coifman and Hirn ; however, in this paper a specialized construction is presented which avoids the use of Sinkhorn iteration, and infinitesimal generators are not investigated.

### 2.3. Reference measure

We are motivated by the reference set embedding of Haddad, Kushnir, and Coifman [8, 12] . Suppose that in addition to the measure space , a reference measure space

 (X,dν)is given such thatdν(r)=p(r)dr,

where is a positive density function with respect to the volume measure on . Define where is a given positive weight function. We assert that there exists a positive function such that the kernel defined by

 cε(x,y)=∫Xkε(x,r)kε(y,r)d^ν(r)d(x)d(y)is bi-stochastic with respect tod^μ,

i.e., . As before, the existence of such a function follows from  and the smoothness of as discussed in Section 4.2. Define the operator

 Cε:L2(X,d^μ)→L2(X,d^μ)% byCεf(x)=∫Xcε(x,y)f(y)d^μ(y).

The following result concerns the infinitesimal generator associated with . We consider the same class of smooth test function as before; that is, is the span of the first Neumann eigenfunctions of the Laplace-Beltrami operator on .

###### Theorem 2.

Suppose is fixed. Then on

 I−Cεεf→m22m0⎛⎜ ⎜ ⎜ ⎜⎝Δ(f⋅(q⋅vw⋅p)1/2)(q⋅vw⋅p)1/2−Δ(q⋅vw⋅p)1/2(q⋅vw⋅p)1/2⋅f+Δ(f⋅(q⋅pw⋅v)1/2)(q⋅pw⋅v)1/2−Δ(q⋅pw⋅v)1/2(q⋅pw⋅v)1/2⋅f⎞⎟ ⎟ ⎟ ⎟⎠

in norm as .

For example, suppose that

is the Gaussian kernel, and define two density estimates

 qε(x)=∫Xkε(x,y)q(y)dyandpε(r)=∫Xkε(r,s)p(s)ds.

Our first Corollary concerns the case where the measure is of interest, while the measure is not. For example, in the discrete setting discussed in Section 3 the density may be artificial, while the density may describe clusters in the data.

###### Corollary 3.

Suppose is fixed. If , and , then on

 I−Cεεf→Δ(fq)q−Δqqfin L2 norm as ε→0.

The proof of this Corollary is immediate from Theorem 2 and the proof of Corollary 1. If neither density function is of interest, then their effects can be removed from the associated infinitesimal generator as in Corollary 1.

###### Corollary 4.

Let be fixed. If , , and , then on

 I−Cεεf→Δfin L2 norm as% ε→0,

where denotes the positive semi-definite Laplace-Beltrami operator on .

The proof of this Corollary is immediate from Theorem 2 and the error analysis performed in the proof of Corollary 1.

###### Corollary 5.

Suppose , , and . Then on

 Ct/εεf→e−tΔfin L2 norm as% ε→0,

where is the Neumann heat kernel on .

The proof of this Corollary is immediate from Corollary 4 and the error analysis in the proof of Corollary 2.

### 2.4. Eigenfunctions and their gradients

The idea of using eigenfunctions of the heat kernel as coordinates for Riemannian manifolds originates with Berard, Besson, and Gallot , and originates in the context of data analysis with Coifman and Lafon . Subsequently, Jones, Maggioni, and Schul  proved that the eigenfunctions of the heat kernel provide locally bi-Lipschitz coordinates, and moreover, that these eigenfunctions and their gradients satisfy estimates from above and below. Informally speaking, the gradient of an eigenfunction describes how the given eigenfunction views a neighborhood of its domain; in fact, in  the magnitude of the gradient of eigenfunctions is used as a selection criteria to build local coordinates. Moreover, the gradient of eigenfunctions is generally a useful tool for data analysis; for example, Wolf, Averbuch, and Neittaanmäki , and Wolf  used the gradients of eigenfunctions for characterization and detection of anomalies. In the following, we justify the existence of the spectral decomposition of and . Furthermore, under the assumption that both operators are based on the Gaussian kernel we compute Nyström extension type formulas for the gradients of the eigenfunctions of and . Recall, that the bi-stochastic operator is defined

 (Bεf)(x)=∫Xbε(x,y)f(y)d^μ(y).

Since the kernel is bounded, and is compact, it follows that the operator is a Hilbert-Schmidt operator, and hence a compact operator. Moreover, since is also symmetric it follows that the operator is self-adjoint and we may apply the Spectral Theorem to conclude that

 (Bεf)(x)=∞∑k=0λk⟨f,φk⟩L2(X,d^μ)φk(x),

where

are the eigenvalues of

, and are the corresponding orthonormal eigenfunctions. In the following, we show that the gradients of the eigenfunctions of have a convenient formulation when the construction is based on the Gaussian kernel.

###### Proposition 1.

Suppose . Then

 ∇φk(x)=1λk∫Xy−¯xεb(x,y)φk(y)d^μ(y)where¯x:=∫Xyb(x,y)d^μ(y).

In the absence of a manifold assumption the formula for in Proposition 1 can be taken as a definition; for example, in Section 3, we discuss how Proposition 1 can be used to define the gradient of an eigenfunction in a setting where is a discrete set without any given differential structure.

### 2.5. Reference measure eigenfunctions and their gradients

Similar results hold for the operator defined by

 Cεf(x)=∫Xcε(x,y)f(y)d^μ(y).

Since the kernel is bounded and is compact, clearly is a Hilbert-Schmidt operator, which implies that is a compact operator. Moreover, the symmetry of implies that is self-adjoint and thus by the Spectral Theorem

 (Cεf)(x)=∞∑j=0λj⟨f,φj⟩L2(X,d^μ)φj(x),

where are the eigenvalues, and are the corresponding orthonormal eigenfunctions on . In this case, the Nyström extension type formula for the gradient of an eigenfunction is slightly more involved, but similar in spirit to the case of a single measure.

###### Proposition 2.

Suppose . Then

 ∇xφj(x)=1λk∫Xr−¯rxε(Fεφj)(x,r)d^ν(r).

where

 (Fεf)(x,r)=∫Xkε(x,r)kε(y,r)d(x)d(y)f(y)d^μ(y)and¯rx:=∫Xr(Fε1)(x,r)d^ν(r).

While this formula for the gradient of an eigenfunction is less elegant than the result from Proposition 1, it nevertheless provides a method of defining in an arbitrary setting.

## 3. The discrete setting

### 3.1. Single measure

Suppose that a data set of points is given. The data set induces a data measure

 dμ(x)=1nn∑i=1δxi(x),

where is a Dirac distribution centered at . While the data measure space does not satisfy the assumptions specified in Section 1, the construction of the operator can be preformed without issue; indeed, the construction only depends on the ability to compute Euclidean distances between points and the ability to integrate. Moreover, if

consists of samples from a probability distribution from a Riemannian manifold, then integrating with respect to the data measure

can be viewed as Monte Carlo integration, which approximates integrals on the manifold with error on the order of . Alternatively, the data measure space can be viewed as an intrinsic abstract measure space and be considered independent from any manifold assumption. In either case, the results regarding infinitesimal generators in Section 1 can guide our choice of normalization. In the following we reiterate the construction of using matrix notation and describe computational considerations which arise in this setting. For notation, we use bold capital letters, e.g., , to denote matrices, lower case bold letters, e.g.,

, to denote column vectors, and

to denote the diagonal matrix with the vector along the diagonal.

### Step 1: Defining the kernel and weight function

Let denote the matrix with entries

 Kε(i,j)=h(|xi−xj|2ε)fori,j=1,…,n,

where is a smooth function of exponential decay, e.g., . Let denote a weight function and define . For example, if an explicit is not given, then for a fixed parameter we could define

 w(i)=(n∑k=1h(|xi−xk|ε))β,fori=1,…,n.

### Step 2: Sinkhorn iteration

Given the weight matrix , we seek a diagonal matrix such that

 D−1KεD−1W−11=1.

The diagonal matrix will be determined via Sinkhorn iteration ; that is to say, by alternating between normalizing the rows and columns of the given matrix. In the case of a symmetric matrix, Sinkhorn iteration can be phrased as the following iterative procedure: first, we set , where denotes the identity matrix, and then define inductively by

We claim that if we consider the limit

 D:=limk→∞Dk+11/2D1/2k,

then the matrix will be bi-stochastic with respect to

as desired. Indeed, the reason for considering the limit of the geometric mean

is that  only guarantees that

 D2k→αDandD2k+1→1αD

for some constant .

### Step 2.5: Heuristic tricks for Sinkhorn iteration

We note that the following trick seems to drastically improve the rate of convergence for some Gaussian kernel matrices. Let , , and define

 Dk+1=diag(KεD−1k(D′k)−1/2W−11)andD′k+1=diag(D−1kKεD−1kW−11).

This iteration corresponds to first “dividing” by the row sums on both sides, and then “dividing” by the square root of the row sums of the resulting matrix on both sides of the resulting matrix; this iteration is motivated by a desire to combat the imbalance between left and right iterates.

### Step 3: Diagonalization

In the discrete setting, the integral operator can be expressed

 Bεf=D−1KεD−1W−1f,

where is a column vector. Suppose that we compute the symmetric eigendecomposition

where

is an orthogonal matrix of eigenvectors, and

is a diagonal matrix of eigenvalues. Define

 Φ=W1/2U.

Suppose that denotes the -th column of the matrix , and denotes the -th diagonal element of the diagonal matrix . Then

 Bφφk=λkφk.

That is to say, is an eigenvector of of eigenvalue .

Suppose . Let denote the data matrix corresponding to points in . Motivated by Proposition 1 we define

 ∇φk:=D−1KεD−1W−1diag(φk)X−λkdiag(φk)¯Xελk,

where

 ¯X:=D−1KεD−1W−1X.

We note that the entire described construction can be completely implemented with matrix operations and one call to a symmetric eigendecomposition routine.

### 3.2. Numerical example: single measure eigenfunction gradient

We illustrate Proposition 1 with a numerical example. We sample points from a rectangle with each coordinate of each point sampled uniformly and independently from their respective ranges. We construct the bi-stochastic kernel using the discrete construction described in Section 3 with the Gaussian kernel and

(the kernel density estimate). We compute the gradients

, , , and using Proposition 1 (Step 4 of the discrete construction), and we plot the resulting vector fields in Figure 1. If , then the first four (nontrivial) Neumann Laplacian eigenfunctions on the rectangle are , , , and ; hence, the vector fields of the gradients of these functions plotted in Figure 1 appear as expected up to errors resulting from the discrete approximation. Figure 1. The discrete vector fields ∇φ1 (top left), ∇φ2 (top right), ∇φ3 (bottom left), and ∇φ4 (bottom right) for our numerical example.

### 3.3. Reference Measure

Next, we describe the discrete formulation of the operator . Suppose that finite sets and are given. These sets induce measures

 dμ(x)=1nn∑i=1δxi(x)anddν(r)=1mm∑j=1δrj(r).

In the following, we reiterate the construction of the operator in matrix notation. Moreover, we describe computational considerations under the assumption that the reference set has significantly less elements than the data set .

### Step 1: Defining the kernel and weight functions

We begin by defining the kernel matrix

 Kε(i,j)=h(|xi−rj|ε),fori=1,…,nandj=1,…,m,

where is a smooth function of exponential decay, and where denotes the Euclidean norm in . Suppose that and are weight vectors and set and . For example, if , and are not explicitly given, then for and we could define

 v(i)=(n∑k=1h(|xi−xk|ε))βandv(j)=(m∑k=1h(|rj−rk|ε))γ,

for and , respectively.

### Step 2: Sinkhorn Iteration

Next, Sinkhorn iteration is used to compute a diagonal matrix such that

 D−1KεV−1K⊤εD−1W−11=1

where denotes a vector of ones. The same Sinkhorn Iteration described in the previous section can be used. We remark that in this case it may be more efficient to applying each matrix to the vector successively in the product rather than computing the matrix product first.

### Step 3: Eigendecomposition

We want to compute the eigenvectors of the operator

 Cεf=D−1KεV−1K⊤εD−1W−1f.

Suppose that we compute the singular value decomposition

 W−1/2D−1KεV−1/2=USV⊤,

where is an orthogonal matrix of left singular vectors, is a diagonal matrix of singular values, and is a orthogonal matrix of right singular vectors. Define

 Φ:=W1/2U.

By the definition of the singular value decomposition, it immediately follows that if is the -th column of , and is the -th diagonal element of , then

 Cεφk=λkφk,

that is, is an eigenvector of of eigenvalue .

Let denote the matrix encoding the reference set . Motivated by Proposition 2 we define

 ∇φj:=(Fεφj)R−λjdiag(φj)¯RXλjε,

where

Here denotes an arbitrary vector, while denotes either an or an vector of ones depending on the context.

### 3.4. Numerical example: reference measure eigenfunction gradient

First, we generate a dataset of points sampled uniformly at random from the unit disc. Second, we select a reference set of points using a pivoted Gram-Schmidt procedure on the kernel matrix

 K(i,j)=e−|xi−xj|2/ε. Figure 2. We plot the data set {xi}ni=1 (left) and the reference set {rj}mj=1 (right).

Specifically, we choose the indices of the reference points from the data set inductively as follows. Let denote the columns of and define where denote the Euclidean norm in . Given we choose by

 ij+1=argmaxi|k(j+1)i|wherek(j+1)i=k(j)i−k(j)i⋅kij|k(j)ij|2k(j)ijfori=1,…,n,

where denotes the Euclidean inner product in . That is to say, at each step we choose the column of with the largest norm, and then orthogonalize all of the columns to the selected column and proceed iteratively. Given the data set and the reference set we construct the operator and its eigenfunctions as described in Section 3. Using Proposition 2 (whose discrete formulation is described in Step 4 in Section 3.3) we compute an approximation of the gradient of the first two (nontrivial) eigenfunctions , and which are plotted in Figure 3

. The first (nontrivial) Neumann Laplacian eigenvalue on the disc is of multiplicity two, and the associated eigenspace is spanned by the functions

 J1(λ1r)cos(θ)andJ1(λ1r)sin(θ)whereλ1≈1.84

is the smallest positive root of the equation ; therefore, the discrete approximations of the gradients and in Figure 3 appear as expected. Figure 3. We plot the discrete vector fields ∇φ1 (left), and ∇φ2 (right)

## 4. Proof of Theorem 1

### 4.1. Notation

Recall that is a compact smooth Riemannian manifold with a measure where is a positive density function with respect to the volume measure on . Furthermore, recall that is isometrically embedded in , and that the kernel is defined

 kε(x,y)=h(|x−y|2ε),

where is a smooth function of exponential decay, and where denotes the Euclidean norm on . Finally, recall that is a positive weight function which defines the measure .

### 4.2. Existence of bi-stochastic normalization

First, we argue why there exists a normalization function such that the kernel defined by

 bε(x,y)=kε(x,y)d(x)d(y)is bi-% stochastic with respect tod^μ.

By a result of Knopp and Sinkhorn  a positive normalization function exists (also see Theorem 5.1 in ). However, if such a function exists, then it satisfies the equation

 d(x)=∫Xkε(x,y)d(y)d^μ(y).

Since the kernel is smooth, we conclude that is also smooth.

### 4.3. Useful Lemmas

Our proof strategy is motivated by Lafon , and we adopt similar notation. Let

 m0=∫Rdh(|t|2)dt,andm2=∫Rdt21h(|t|2)dt,

where denotes Lebesgue measure on . Consider the integral operator by

 (Gεf)(x)=1εd/2∫Xkε(x,y)f(y)dy.

The following Lemma is due to , also see .

###### Lemma 1.

Suppose and . If is at least distance from , then

 (Gεf)(x)=m0f(x)+εm22(E(x)f(x)−Δf(x))+O(ε2),

where is a scalar function of the curvature at .

Additionally, we need an expansion for near the boundary. The following result is due to .

###### Lemma 2.

Suppose and . Then uniformly over within distance of

 Gεf(x)=m0,ε(x)f(x0)+√εm1,ε(x)∂f∂n(x0)+O(ε),

where is the closest point in to (with respect to Euclidean distance), and and are bounded functions of and .

### 4.4. Intermediate results

Before proving Theorem 1 we establish two results: Proposition 3 and Proposition 4, which provide local expansions for the bi-stochastic operator away from the boundary, and near the boundary, respectively.

###### Proposition 3.

Suppose and . If is at least distance from , then

 Bεf(x)=f(x)−εm22m0⎛⎜ ⎜ ⎜ ⎜⎝Δ(f(x)(q(x)w(x))1/2)(q(x)w(x))1/2−Δ(q(x)w(x))1/2(q(x)w(x))1/2f(x)⎞⎟ ⎟ ⎟ ⎟⎠+O(ε2).
###### Proof.

Since is bi-stochastic with respect to , we have the following integral equation

 1=∫Xbε(x,y)d^μ(y)=∫Xkε(x,y)d(x)d(y)q(y)w(y)dy.

Expanding this integral equation using Lemma 1 gives

 1=εd/21d(m01d⋅qw+εm22(E⋅1d⋅qw−Δ(1d⋅qw))+O(ε2)),

where the argument of the functions , , , and have been suppressed for notational brevity. By rearranging terms we have

 d2=εd/2m0qw(1+εm22m0(E−Δ(q/(d⋅w))q/(d⋅w))+O(ε2)).

Next, we compute an expansion for applied to an arbitrary function

 Bεf(x)=∫Xkε(x,y)d(x)d(y)f(y)q(y)w(y)dy.

Expanding the right hand side using Lemma 1 gives

 Bεf=εd/21d(m01d⋅f⋅qw+εm22(E⋅1d⋅f⋅qw−Δ(1d⋅f⋅qw))+O(ε2)).

Rearranging terms gives

 Bεf=m0εd/2f⋅1d2⋅qw(1+εm22m0(E+Δ(f⋅q/(d⋅w))f⋅q/(d⋅w))+O(ε2)).

Substituting in the expansion for into this equation yields

 Bεf=f(1+εm22m0(E−Δ(f⋅q/(d⋅w))f⋅q/(d⋅w)))(1+εm22m0(E−Δ(q/(d⋅w))q/(d⋅w)))−1+O(ε2).

Expanding the inverse term in a Taylor expansion gives

 Bεf=f−εm22m0(Δ(f⋅q/(d⋅w))q/(d⋅w)−