In the fields of machine learning and data mining, kernel-based methods related to diffusion processes have proven to be effective tools for data analysis, e.g., [3, 5, 10, 14, 16]. Such methods are often studied as manifold learning problems where the given data is assumed to have been sampled from an underlying Riemannian manifold; see [1, 6, 15, 17]. In this paper, we adopt this manifold perspective to study bi-stochastic kernel constructions. Let be a measure space; we say that a positive kernel is bi-stochastic with respect to if
Bi-stochastic kernels can be constructed by appropriately normalizing positive kernels. We are specifically interested in symmetric constructions. Suppose that is a compact topological space, is a positive Borel measure, and is a continuous positive symmetric kernel. Then it follows from a result of Knopp and Sinkhorn  that there exists a positive continuous function such that
Moreover, the bi-stochastic normalization of can be determined iteratively by alternating between normalizing the kernel by a function of such that the integral over is equal to , and vice versa. This iteration was first studied in the matrix setting by Sinkhorn . In this paper we consider bi-stochastic kernel constructions in two settings. First, in Section 2, we consider the continuous setting where is a compact smooth Riemannian manifold. Here, we prove our main results, Theorems 1 and 2, which describe a connection between bi-stochastic kernels and diffusion operators. Second, in Section 3, we consider the discrete setting where is a finite set of points. Here, we translate the results of Theorems 1 and 2 into the discrete setting, and present numerical examples. Finally, in Sections 4 and 5 we prove Theorems 1 and 2, respectively.
2. Main Results
2.1. Single measure
Suppose that is a compact smooth Riemannian manifold endowed with a measure , where is a positive density function with respect to the volume measure on . Assume that is isometrically embedded in and define by
where is a smooth function of exponential decay, and where denotes the Euclidean norm on . Given a positive weight function , we define , and assert that there exists a function such that the kernel defined by
Our first result characterizes the infinitesimal generator associated with , and requires a class of smooth test functions. As in , we define to be the span of the first Neumann eigenfunctions of the Laplace-Beltrami operator on , with the convention (by the possibility of multiplying by ) that is positive semi-definite. Let
where denotes the Lebesgue measure on .
Suppose is fixed. Then on
where denotes the identity operator.
For example, suppose that is the Gaussian kernel which satisfies
and suppose that , where
Then we have the following Corollary.
Let be fixed. Suppose and . Then on
where denotes the positive semi-definite Laplace-Beltrami operator on .
The proof of this Corollary is included after the proof of Theorem 1; essentially, the proof amounts to checking that approximates with small enough error that can be replaced by in the result of Theorem 1. Corollary 1 suggests that the operator can be viewed as an approximation to the heat kernel; more precisely, we have the following result.
Suppose that and . Then on
where denotes the Neumann heat kernel on .
2.2. Connection to diffusion maps
The reader might notice the similarity between the bi-stochastic operator and the averaging operator of Lafon  which also approximates the Laplace-Beltrami operator. Specifically, the operator is defined by
Unlike the operator , the operator is only row stochastic, not bi-stochastic. Thus, we argue that provides a more natural analog to the heat kernel than the averaging operator . Of course, when these constructions are performed on data, the operators are numerically very similar (they have the same limiting behavior), so in the case of a single data set the difference between and is mostly a matter of perspective. However, the bi-stochastic perspective is valuable in that it translates to more general settings, as we demonstrate in the following section, where a reference measure is additionally given. We remark that bi-stochastic kernels have been previously considered in the context of reference graphs by Coifman and Hirn ; however, that work presents a specialized construction which avoids the use of Sinkhorn iteration, and does not investigate infinitesimal generators.
2.3. Reference measure
where is a positive density function with respect to the volume measure on . Define where is a given positive weight function. We assert that there exists a positive function such that the kernel defined by
The following result concerns the infinitesimal generator associated with . We consider the same class of smooth test functions as before; that is, is the span of the first Neumann eigenfunctions of the Laplace-Beltrami operator on .
Suppose is fixed. Then on
in norm as .
For example, suppose that
is the Gaussian kernel, and define two density estimates
Our first Corollary concerns the case where the measure is of interest, while the measure is not. For example, in the discrete setting discussed in Section 3 the density may be artificial, while the density may describe clusters in the data.
Suppose is fixed. If , and , then on
The proof of this Corollary is immediate from Theorem 2 and the proof of Corollary 1. If neither density function is of interest, then their effects can be removed from the associated infinitesimal generator as in Corollary 1.
Let be fixed. If , , and , then on
where denotes the positive semi-definite Laplace-Beltrami operator on .
Suppose , , and . Then on
where is the Neumann heat kernel on .
2.4. Eigenfunctions and their gradients
The idea of using eigenfunctions of the heat kernel as coordinates for Riemannian manifolds originates with Berard, Besson, and Gallot , and was introduced in the context of data analysis by Coifman and Lafon . Subsequently, Jones, Maggioni, and Schul  proved that the eigenfunctions of the heat kernel provide locally bi-Lipschitz coordinates, and moreover, that these eigenfunctions and their gradients satisfy estimates from above and below. Informally speaking, the gradient of an eigenfunction describes how the given eigenfunction views a neighborhood of its domain; in fact, in  the magnitude of the gradient of eigenfunctions is used as a selection criterion to build local coordinates. Moreover, the gradients of eigenfunctions are generally a useful tool for data analysis; for example, Wolf, Averbuch, and Neittaanmäki , and Wolf  used the gradients of eigenfunctions for characterization and detection of anomalies. In the following, we justify the existence of the spectral decompositions of and . Furthermore, under the assumption that both operators are based on the Gaussian kernel, we compute Nyström extension type formulas for the gradients of the eigenfunctions of and . Recall that the bi-stochastic operator is defined
Since the kernel is bounded, and is compact, it follows that the operator is a Hilbert-Schmidt operator, and hence a compact operator. Moreover, since is also symmetric it follows that the operator is self-adjoint and we may apply the Spectral Theorem to conclude that
are the eigenvalues of, and are the corresponding orthonormal eigenfunctions. In the following, we show that the gradients of the eigenfunctions of have a convenient formulation when the construction is based on the Gaussian kernel.
Suppose . Then
2.5. Reference measure eigenfunctions and their gradients
Similar results hold for the operator defined by
Since the kernel is bounded and is compact, clearly is a Hilbert-Schmidt operator, which implies that is a compact operator. Moreover, the symmetry of implies that is self-adjoint and thus by the Spectral Theorem
where are the eigenvalues, and are the corresponding orthonormal eigenfunctions on . In this case, the Nyström extension type formula for the gradient of an eigenfunction is slightly more involved, but similar in spirit to the case of a single measure.
Suppose . Then
While this formula for the gradient of an eigenfunction is less elegant than the result from Proposition 1, it nevertheless provides a method of defining in an arbitrary setting.
3. The discrete setting
3.1. Single measure
Suppose that a data set of points is given. The data set induces a data measure
where is a Dirac distribution centered at . While the data measure space does not satisfy the assumptions specified in Section 1, the construction of the operator can be performed without issue; indeed, the construction only depends on the ability to compute Euclidean distances between points and the ability to integrate. Moreover, if
consists of samples from a probability distribution on a Riemannian manifold, then integrating with respect to the data measure can be viewed as Monte Carlo integration, which approximates integrals on the manifold with error on the order of . Alternatively, the data measure space can be viewed as an intrinsic abstract measure space and be considered independently of any manifold assumption. In either case, the results regarding infinitesimal generators in Section 2 can guide our choice of normalization. In the following, we reiterate the construction of using matrix notation and describe computational considerations which arise in this setting. For notation, we use bold capital letters, e.g., , to denote matrices, lower case bold letters, e.g., , to denote column vectors, and to denote the diagonal matrix with the given vector along the diagonal.
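As a minimal illustration of the Monte Carlo viewpoint (a generic NumPy sketch, not taken from the paper; the integrand and sample size are arbitrary choices), integrating against the empirical measure of uniform samples approximates the true integral with error on the order of the inverse square root of the number of samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.random(n)            # n samples from the uniform measure on [0, 1]

# integrating f(x) = x^2 against the data measure is just the sample mean
estimate = np.mean(x**2)
exact = 1.0 / 3.0            # the true integral of x^2 over [0, 1]
error = abs(estimate - exact)
```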
Step 1: Defining the kernel and weight function
Let denote the matrix with entries
where is a smooth function of exponential decay, e.g., . Let denote a weight function and define . For example, if an explicit is not given, then for a fixed parameter we could define
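A minimal NumPy sketch of this step (the function name and the use of a kernel density estimate as the default weight are illustrative assumptions, since the displayed formulas are elided above):

```python
import numpy as np

def gaussian_kernel_matrix(X, eps):
    """K[i, j] = exp(-||x_i - x_j||^2 / eps) for rows x_i of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)      # clip tiny negatives from round-off
    return np.exp(-d2 / eps)

rng = np.random.default_rng(0)
X = rng.random((200, 2))             # toy data set in the unit square
K = gaussian_kernel_matrix(X, eps=0.1)

# one natural choice of weight when none is given: a kernel density estimate
w = K.sum(axis=1)
```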
Step 2: Sinkhorn iteration
Given the weight matrix , we seek a diagonal matrix such that
The diagonal matrix will be determined via Sinkhorn iteration ; that is to say, by alternating between normalizing the rows and columns of the given matrix. In the case of a symmetric matrix, Sinkhorn iteration can be phrased as the following iterative procedure: first, we set , where denotes the identity matrix, and then define inductively by
We claim that if we consider the limit
then the matrix will be bi-stochastic with respect to
as desired. Indeed, the reason for considering the limit of the geometric mean is that  only guarantees that
for some constant .
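The iteration can be sketched as follows; the damped update below is the symmetric form of alternating row/column normalization with a geometric-mean (square-root) step, which also suppresses the two-cycle oscillation between left and right iterates (this equivalence is an assumption about the intended scheme, since the displayed updates are elided above):

```python
import numpy as np

def sinkhorn_symmetric(K, n_iter=2000, tol=1e-12):
    """Return d > 0 such that diag(d) @ K @ diag(d) is (approximately)
    bi-stochastic.  K must be symmetric with positive entries."""
    d = np.ones(K.shape[0])
    for _ in range(n_iter):
        # geometric mean of the current iterate and the plain update 1/(K d)
        d_new = np.sqrt(d / (K @ d))
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d

rng = np.random.default_rng(0)
X = rng.random((100, 2))
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / 0.5)
d = sinkhorn_symmetric(K)
P = d[:, None] * K * d[None, :]      # symmetric, rows and columns sum to 1
```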
Step 2.5: Heuristic tricks for Sinkhorn iteration
We note that the following trick seems to drastically improve the rate of convergence for some Gaussian kernel matrices. Let , , and define
This iteration corresponds to first “dividing” by the row sums on both sides, and then “dividing” on both sides by the square root of the row sums of the resulting matrix; it is motivated by a desire to combat the imbalance between the left and right iterates.
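A sketch of this heuristic (the exact displayed update is elided above, so the following is one plausible reading of the described two-stage step):

```python
import numpy as np

def sinkhorn_heuristic(K, n_iter=200, tol=1e-12):
    """Heuristic symmetric scaling: at each step, divide by the row sums on
    both sides, then by the square roots of the resulting row sums on both
    sides.  Returns the scaled (approximately bi-stochastic) matrix."""
    A = np.array(K, dtype=float)
    for _ in range(n_iter):
        r = A.sum(axis=1)
        A = A / np.outer(r, r)            # divide by row sums on both sides
        s = np.sqrt(A.sum(axis=1))
        A = A / np.outer(s, s)            # then by sqrt of the new row sums
        if np.max(np.abs(A.sum(axis=1) - 1.0)) < tol:
            break
    return A

rng = np.random.default_rng(0)
X = rng.random((80, 2))
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / 0.5)
P = sinkhorn_heuristic(K)
```

For a positive semi-definite kernel such as the Gaussian, a linearization of this combined step contracts the scaling error by a factor of at most 1/8 per iteration, which is consistent with the fast convergence observed.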
Step 3: Diagonalization
In the discrete setting, the integral operator can be expressed
where is a column vector. Suppose that we compute the symmetric eigendecomposition
where is a diagonal matrix of eigenvalues. Define
Suppose that denotes the -th column of the matrix , and denotes the -th diagonal element of the diagonal matrix . Then
That is to say, is an eigenvector of of eigenvalue .
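In code, this step reduces to a symmetric eigendecomposition (a sketch; the measure-weighting conventions from the elided displays are replaced here by the plain eigendecomposition of the bi-stochastic matrix, built inline as in Steps 1 and 2):

```python
import numpy as np

# build a small bi-stochastic matrix P as in the previous steps
rng = np.random.default_rng(0)
X = rng.random((80, 2))
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / 0.5)
d = np.ones(80)
for _ in range(2000):
    d = np.sqrt(d / (K @ d))             # symmetric Sinkhorn update
P = d[:, None] * K * d[None, :]

# symmetric eigendecomposition; sort eigenvalues in decreasing order
evals, evecs = np.linalg.eigh(P)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# the top eigenvalue of a bi-stochastic matrix is 1, with constant eigenvector
```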
Step 4: Gradient estimation
Suppose . Let denote the data matrix corresponding to points in . Motivated by Proposition 1 we define
We note that the entire construction described above can be implemented with matrix operations and one call to a symmetric eigendecomposition routine.
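Since the display for Proposition 1 is elided above, the following is only a generic sketch of the underlying mechanism for the Gaussian kernel: a function extended off the samples as a Gaussian combination can be differentiated in closed form, which is what makes a Nyström-type gradient formula possible (the function name and coefficient vector are hypothetical):

```python
import numpy as np

def gaussian_extension_grad(x, Y, c, eps):
    """Gradient at x of f(x) = sum_j c_j * exp(-||x - y_j||^2 / eps)."""
    diff = x[None, :] - Y                          # (m, d) displacements
    k = np.exp(-np.sum(diff**2, axis=1) / eps)     # (m,) kernel values
    return (-2.0 / eps) * ((c * k) @ diff)         # (d,) by the chain rule

rng = np.random.default_rng(0)
Y = rng.random((30, 2))                 # sample points
c = rng.standard_normal(30)             # extension coefficients
x = np.array([0.3, 0.7])                # evaluation point
g = gaussian_extension_grad(x, Y, c, 0.2)
```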
3.2. Numerical example: single measure eigenfunction gradient
We illustrate Proposition 1 with a numerical example. We sample points from a rectangle with each coordinate of each point sampled uniformly and independently from their respective ranges. We construct the bi-stochastic kernel using the discrete construction described in Section 3 with the Gaussian kernel and
(the kernel density estimate). We compute the gradients, , , and using Proposition 1 (Step 4 of the discrete construction), and we plot the resulting vector fields in Figure 1. If , then the first four (nontrivial) Neumann Laplacian eigenfunctions on the rectangle are , , , and ; hence, the vector fields of the gradients of these functions plotted in Figure 1 appear as expected up to errors resulting from the discrete approximation.
3.3. Reference Measure
Next, we describe the discrete formulation of the operator . Suppose that finite sets and are given. These sets induce measures
In the following, we reiterate the construction of the operator in matrix notation. Moreover, we describe computational considerations under the assumption that the reference set has significantly fewer elements than the data set .
Step 1: Defining the kernel and weight functions
We begin by defining the kernel matrix
where is a smooth function of exponential decay, and where denotes the Euclidean norm in . Suppose that and are weight vectors and set and . For example, if , and are not explicitly given, then for and we could define
for and , respectively.
Step 2: Sinkhorn Iteration
Next, Sinkhorn iteration is used to compute a diagonal matrix such that
where denotes a vector of ones. The same Sinkhorn iteration described in the previous section can be used. We remark that in this case it may be more efficient to apply each matrix to the vector successively rather than forming the matrix product first.
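The remark on ordering can be made concrete: applying the factors to the vector right-to-left costs a sequence of matrix-vector products, whereas forming the matrix product first materializes a large intermediate (the shapes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 50                      # data size n much larger than reference size m
A = rng.standard_normal((m, n))      # e.g. a reference-to-data kernel block
B = rng.standard_normal((n, m))
v = rng.standard_normal(m)

y_fast = A @ (B @ v)                 # two matrix-vector products: O(n m) each
y_slow = (A @ B) @ v                 # forms an m-by-m product via an O(n m^2) multiply
```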
Step 3: Eigendecomposition
We want to compute the eigenvectors of the operator
Suppose that we compute the singular value decomposition
where is an orthogonal matrix of left singular vectors, is a diagonal matrix of singular values, and is an orthogonal matrix of right singular vectors. Define
By the definition of the singular value decomposition, it immediately follows that if is the -th column of , and is the -th diagonal element of , then
that is, is an eigenvector of of eigenvalue .
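The SVD relation used here can be sketched generically (the specific matrix whose SVD is taken is elided above; the mechanism is that the right singular vectors of a rectangular factor are eigenvectors of the associated Gram matrix, with eigenvalues the squared singular values):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((200, 20))    # a hypothetical tall data-to-reference factor

U, s, Vt = np.linalg.svd(B, full_matrices=False)
v0 = Vt[0]                            # first right singular vector

# v0 is an eigenvector of the small Gram matrix B^T B with eigenvalue s[0]**2
```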
Step 4: Eigenfunction Gradients
Let denote the matrix encoding the reference set . Motivated by Proposition 2 we define
Here denotes an arbitrary vector, while denotes either an or an vector of ones depending on the context.
3.4. Numerical example: reference measure eigenfunction gradient
First, we generate a dataset of points sampled uniformly at random from the unit disc. Second, we select a reference set of points using a pivoted Gram-Schmidt procedure on the kernel matrix
Specifically, we choose the indices of the reference points from the data set inductively as follows. Let denote the columns of and define where denote the Euclidean norm in . Given we choose by
where denotes the Euclidean inner product in . That is to say, at each step we choose the column of with the largest norm, orthogonalize all of the columns against the selected column, and proceed iteratively. Given the data set and the reference set , we construct the operator and its eigenfunctions as described in Section 3. Using Proposition 2 (whose discrete formulation is described in Step 4 of Section 3.3) we compute an approximation of the gradients of the first two (nontrivial) eigenfunctions and , which are plotted in Figure 3. The first (nontrivial) Neumann Laplacian eigenvalue on the disc has multiplicity two, and the associated eigenspace is spanned by the functions
where is the smallest positive root of the equation ; therefore, the discrete approximations of the gradients and in Figure 3 appear as expected.
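The selection procedure can be sketched as follows (a plain implementation of the described greedy pivoting; the function name and shapes are assumptions):

```python
import numpy as np

def pivoted_gram_schmidt_select(K, m):
    """Greedily select m column indices of K: at each step take the column
    with the largest Euclidean norm, then orthogonalize every column
    against the chosen one."""
    A = np.array(K, dtype=float)
    chosen = []
    for _ in range(m):
        norms = np.linalg.norm(A, axis=0)
        j = int(np.argmax(norms))
        chosen.append(j)
        q = A[:, j] / norms[j]            # unit vector along the pivot column
        A -= np.outer(q, q @ A)           # remove that direction from all columns
    return chosen

rng = np.random.default_rng(0)
X = rng.random((300, 2))                  # toy stand-in for the sampled data
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / 0.1)
idx = pivoted_gram_schmidt_select(K, 10)
```

Once a column is selected, its residual norm drops to (numerically) zero, so no index is chosen twice.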
4. Proof of Theorem 1
Recall that is a compact smooth Riemannian manifold with a measure where is a positive density function with respect to the volume measure on . Furthermore, recall that is isometrically embedded in , and that the kernel is defined
where is a smooth function of exponential decay, and where denotes the Euclidean norm on . Finally, recall that is a positive weight function which defines the measure .
4.2. Existence of bi-stochastic normalization
First, we argue why there exists a normalization function such that the kernel defined by
Since the kernel is smooth, we conclude that is also smooth.
4.3. Useful Lemmas
Our proof strategy is motivated by Lafon , and we adopt similar notation. Let
where denotes the Lebesgue measure on . Consider the integral operator defined by
Suppose and . If is at least distance from , then
where is a scalar function of the curvature at .
Additionally, we need an expansion for near the boundary. The following result is due to .
Suppose and . Then uniformly over within distance of
where is the closest point in to (with respect to Euclidean distance), and and are bounded functions of and .
4.4. Intermediate results
Before proving Theorem 1 we establish two results: Proposition 3 and Proposition 4, which provide local expansions for the bi-stochastic operator away from the boundary, and near the boundary, respectively.
Suppose and . If is at least distance from , then
Since is bi-stochastic with respect to , we have the following integral equation
Expanding this integral equation using Lemma 1 gives
where the arguments of the functions , , , and have been suppressed for notational brevity. By rearranging terms, we have
Next, we compute an expansion for applied to an arbitrary function
Expanding the right hand side using Lemma 1 gives
Rearranging terms gives
Substituting in the expansion for into this equation yields
Expanding the inverse term in a Taylor expansion gives