# Gaussian Process Landmarking on Manifolds

As a means of improving the analysis of biological shapes, we propose a greedy algorithm for sampling a Riemannian manifold based on the uncertainty of a Gaussian process. This procedure is known to produce a near-optimal experimental design with the manifold as the domain, and appears to outperform the use of user-placed landmarks in representing the geometry of biological objects. We provide an asymptotic analysis for the decay of the maximum conditional variance, which is frequently employed as a greedy criterion for similar variance- or uncertainty-based sequential experimental design strategies; to our knowledge this is the first result of its type for experimental design. The key observation is to link the greedy algorithm with reduced basis methods in the context of model reduction for partial differential equations. We apply the proposed landmarking algorithm to geometric morphometrics, a branch of evolutionary biology focusing on the analysis and comparison of anatomical shapes, and compare the automatically sampled landmarks with the "ground truth" landmarks manually placed by evolutionary anthropologists; the results suggest that Gaussian process landmarks perform equally well or better, in terms of both spatial coverage and downstream statistical analysis. We expect this approach will find additional applications in other fields of research.


## 1 Introduction

This paper grew out of an attempt to apply principles of optimal experimental design, a field of statistics, to geometric morphometrics, a subfield of evolutionary biology that focuses on quantifying the (dis-)similarities between pairs of two-dimensional anatomical surfaces based on their spatial configurations. In contrast to methods for statistical estimation and inference, which typically focus on studying the error made by estimators with respect to a deterministically generated or randomly drawn (but fixed once given) collection of sample observations, and constructing estimators to minimize this error, the paradigm of optimal experimental design is to minimize the empirical risk by an "optimal" choice of sample locations, while the estimator itself and the number of samples are kept fixed [62, 6]. Finding an optimal design amounts to choosing sample points that are most informative for a class of estimators so as to reduce the number of observations; this is most desirable when even one observation is expensive to acquire (e.g. in spatial analysis (geostatistics) [79, 27] and in computationally demanding computer experiments [72]), but similar ideas have long been exploited in the probabilistic analysis of some classical numerical analysis problems (see e.g. [78, 92, 65]).

In this paper, we adopt the methodology of optimal experimental design for discretely sampling Riemannian manifolds, and propose a greedy algorithm that sequentially selects design points based on the uncertainty modeled by a Gaussian process. On anatomical surfaces of interest in geometric morphometric applications, these design points play the role of anatomical landmarks, or simply landmarks, which are geometrically or semantically meaningful feature points crafted by evolutionary biologists for quantitatively comparing large collections of biological specimens in the framework of Procrustes analysis [39, 33, 40]. The effectiveness of our approach on anatomical surfaces, along with more background information on geometric morphometrics and Procrustes analysis, is demonstrated in a companion paper [37]; though the prototypical application scenario in this paper and [37] is geometric morphometrics, we expect the approach proposed here to be more generally applicable to other scientific domains where compact or sparse data representation is in demand. In contexts different from evolutionary biology, closely related (continuous or discretized) manifold sampling problems are addressed in [5, 53, 42], where smooth manifolds are discretized by optimizing the locations of (a fixed number of) points so as to minimize a Riesz functional, or in [61, 66], which study surface simplification via spectral subsampling or geometric relevance.
These approaches, when applied to two-dimensional surfaces, tend to distribute points either empirically with respect to fine geometric details preserved in the discretized point clouds or uniformly over the underlying geometric object, whereas triangular meshes encountered in geometric morphometrics often lack fine geometric details but still demand non-uniform, sparse geometric features that are semantically/biologically meaningful; moreover, it is often not clear whether the desired anatomical landmarks are naturally associated with an energy potential. In contrast, our work is inspired by recent research on active learning with Gaussian processes [25, 63, 45] as well as related applications in finding landmarks along a manifold [49]. Different from many graph-Laplacian-based manifold landmarking algorithms in semi-supervised learning (e.g. [93, 90]), our approach considers a Gaussian process on the manifold whose covariance structure is specified by the heat kernel, with a greedy landmarking strategy that aims to produce a set of geometrically significant samples with adequate coverage for biological traits. Furthermore, in stark contrast with [30, 51], where landmarks are utilized primarily for improving computational efficiency, the landmarks produced by our algorithm explicitly and directly minimize the mean squared prediction error (MSPE) and thus bear rich information for machine learning and data mining tasks. The optimality of the proposed greedy procedure is also established (see Section 4); this is apparently much less straightforward for non-deterministic, sampling-based manifold landmarking algorithms such as [32, 21, 83].

The rest of this paper is organized as follows. The remainder of this introduction motivates our main algorithm and discusses other related work. Section 2 sets notation and provides background material on Gaussian processes and the construction of heat kernels on Riemannian manifolds (and discretizations thereof), as well as the "reweighted kernel" constructed from these discretized heat kernels; Section 3 presents an unsupervised landmarking algorithm for anatomical surfaces inspired by recent work on uncertainty sampling in Gaussian process active learning [49]; Section 4 provides the convergence rate analysis and establishes the MSPE optimality; Section 5 summarizes the current paper with a brief sketch of potential future directions. We defer implementation details of the proposed algorithm for applications in geometric morphometrics to the companion paper [37].

### 1.1 Motivation

To see the link between landmark identification and active learning with uncertainty sampling [48, 74], let us consider the regression problem of estimating a function $f$ defined over a point cloud. Rather than construct the estimator from random sample observations, we adopt the point of view of active learning, in which one is allowed to sequentially query the values of $f$ at user-picked vertices. In order to minimize the empirical risk of an estimator within a given number of iterations, the simplest and most commonly used strategy is to first evaluate (under reasonable probabilistic model assumptions) the informativeness of the vertices on the mesh that have not been queried, and then greedily query the value of $f$ at the vertex at which the response value—inferred from all previous queries—is most "uncertain" in the sense of attaining the highest predictive error (though other uncertainty measures such as the Shannon entropy could be used as well); these sequentially obtained highest-uncertainty points will be treated as morphometric landmarks in our proposed algorithm.

The straightforward application of the active learning strategy summarized above relies upon selecting a regression function $f$ of rich biological information. In the absence of a natural candidate regression function $f$, we seek to reduce in every iteration the maximum "average uncertainty" of a class of regression functions, e.g., one specified by a Gaussian process prior [63]. Throughout this paper we will denote $\mathrm{GP}(m,K)$ for the Gaussian process on a smooth, compact Riemannian manifold $M$ with mean function $m$ and covariance function $K$. If we interpret choosing a single most "biologically meaningful" function as a manual "feature handcrafting" step, the specification of a Gaussian process prior can be viewed as a less restrictive and more stable "ensemble" version; the geometric information can be conveniently encoded into the prior by specifying an appropriate covariance function $K$. We construct such a covariance function in Section 2.2 by reweighting the heat kernel of the Riemannian manifold $M$, adopting (but in the meanwhile also appending further geometric information to) the methodology of Gaussian process optimal experimental design [70, 72, 35] and sensitivity analysis [71, 60] from the statistics literature.

### 1.2 Our Contribution and Other Related Work

The main theoretical contribution of this paper is a convergence rate analysis for the greedy algorithm of uncertainty-based sequential experimental design, which amounts to estimating the uniform rate of decay of the prediction error of a Gaussian process as the number of greedily picked design points increases; on a $d$-dimensional manifold we deduce that the convergence is faster than any inverse polynomial rate, which is also the optimal rate any greedy or non-greedy landmarking algorithm can attain on a generic smooth manifold. This analysis makes use of recent results in the analysis of reduced basis methods, by converting the Gaussian process experimental design into a basis selection problem in a reproducing kernel Hilbert space associated with the Gaussian process. To our knowledge, there does not exist in the literature any earlier analysis of this type for greedy algorithms in optimal experimental design; the convergence results obtained from this analysis can also be used to bound the number of iterations in Gaussian process active learning [25, 45, 49] and maximum entropy design [72, 47, 59]. From a numerical linear algebra perspective, though the rank-1 update procedure detailed in Remark 2 coincides with the well-known algorithm of pivoted Cholesky decomposition for symmetric positive definite matrices (c.f. Section 3.2), we are likewise not aware of similar results in that context on the performance of pivoting. We thus expect our theoretical contribution to shed light upon gaining a deeper understanding of other manifold landmarking algorithms in active and semi-supervised learning [93, 90, 21, 83, 30, 51]. We discuss implementation details of our algorithm for applications in geometric morphometrics in a companion paper [37].

We point out that, though experimental design is a classical problem in the statistical literature [34, 20, 62], it is only very recently that interest in computationally efficient experimental design algorithms began to arise in the computer science community [15, 7, 58, 86, 1, 2]. Most experimental design problems, based on various types of optimality criteria including but not limited to A(verage)-, D(eterminant)-, E(igen)-, V(ariance)-, G-optimality and Bayesian alphabetical optimality, are NP-hard in their exact form [23, 19], with the only exception of T(race)-optimality, which is trivial to solve. For computer scientists, the interesting problem is to find polynomial-time algorithms that efficiently find $\epsilon$-approximations of the optimal solution to the exact problem, where $\epsilon$ is expected to be as small as possible but depends on the size of the problem and the pre-fixed budget for the number of design points; oftentimes these approximation results also require certain constraints on the relative sizes of the dimension of the ambient space, the number of design points, and the total number of candidate points. Different from those approaches, our theoretical contribution assumes no relations between these quantities, and the convergence rate is with respect to the increasing number of landmark points (as opposed to a pre-fixed budget); nevertheless, similar to results obtained in [15, 7, 58, 86, 1, 2], our proposed algorithm has polynomial complexity and is thus computationally tractable; see Section 3.2 for more details. We refer interested readers to [62] for more exhaustive discussions of the optimality criteria used in experimental design.

## 2 Background

### 2.1 Heat Kernels and Gaussian Processes on Riemannian Manifolds: A Spectral Embedding Perspective

Let $(M, g)$ be an orientable compact Riemannian manifold of dimension $d$ with finite volume, where $g$ is the Riemannian metric on $M$. Denote $\mathrm{dvol}_M$ for the canonical volume form with coordinate representation

$$\mathrm{dvol}_M(x) = \sqrt{|g(x)|}\,dx^1\wedge\cdots\wedge dx^d.$$

The finite volume will be denoted as

$$\mathrm{Vol}(M) = \int_M \mathrm{dvol}_M(x) = \int_M \sqrt{|g(x)|}\,dx^1\wedge\cdots\wedge dx^d < \infty,$$

and we will fix the canonical normalized volume form $\mathrm{dvol}_M/\mathrm{Vol}(M)$ as reference. Throughout this paper, all distributions on $M$ are absolutely continuous with respect to $\mathrm{dvol}_M$.

The single-output regression problem on the Riemannian manifold $M$ will be described as follows. Given independent and identically distributed observations $(X_1,Y_1),\cdots,(X_n,Y_n)$ of a random variable $(X,Y)$ on the product probability space $M\times\mathbb{R}$, the goal of the regression problem is to estimate the conditional expectation

$$f(x) := \mathbb{E}\left(Y\mid X=x\right), \tag{1}$$

which is often referred to as a regression function of $Y$ on $X$ [81].

The joint distribution of $X$ and $Y$ will always be assumed absolutely continuous with respect to the product measure on $M\times\mathbb{R}$ for simplicity. A Gaussian process (or Gaussian random field) on $M$ with mean function $m$ and covariance function $K$ is defined as the stochastic process of which any finite marginal distribution on $n$ fixed points $x_1,\cdots,x_n\in M$ is a multivariate Gaussian distribution with mean vector

$$m_n := \left(m(x_1),\cdots,m(x_n)\right)\in\mathbb{R}^n$$

and covariance matrix

$$K_n := \begin{pmatrix} K(x_1,x_1) & \cdots & K(x_1,x_n)\\ \vdots & & \vdots\\ K(x_n,x_1) & \cdots & K(x_n,x_n)\end{pmatrix}\in\mathbb{R}^{n\times n}.$$

A Gaussian process with mean function $m$ and covariance function $K$ will be denoted as $\mathrm{GP}(m,K)$. Under the model $Y\sim\mathrm{GP}(m,K)$, given observed values $y_1,\cdots,y_n$ at locations $x_1,\cdots,x_n$, the best linear predictor (BLP) [79, 72] for the random field at a new point $x\in M$ is given by the conditional expectation

$$Y^*(x) := \mathbb{E}\left[Y(x)\mid Y(x_1)=y_1,\cdots,Y(x_n)=y_n\right] = m(x) + k_n(x)^\top K_n^{-1}\left(Y_n - m_n\right) \tag{2}$$

where $k_n(x) = \left(K(x,x_1),\cdots,K(x,x_n)\right)^\top$ and $Y_n = (y_1,\cdots,y_n)^\top$; at any $x\in M$, the expected squared error, or mean squared prediction error (MSPE), is defined as

$$\mathrm{MSPE}(x; x_1,\cdots,x_n) := \mathbb{E}\left[\left(Y(x)-Y^*(x)\right)^2\right] = \mathbb{E}\left[\left(Y(x)-\mathbb{E}\left[Y(x)\mid Y(x_1)=y_1,\cdots,Y(x_n)=y_n\right]\right)^2\right], \tag{3}$$

which is a function over $M$. Here the expectation is with respect to all realizations of the Gaussian process. Squared integral ($L^2$) or sup ($L^\infty$) norms of the pointwise MSPE are often used as criteria for evaluating the prediction performance over the experimental domain. In geospatial statistics, interpolation with (2) and (3) is known as kriging.
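On a finite candidate set, the BLP (2) and the pointwise MSPE (3) reduce to plain linear algebra. The following minimal sketch (our illustration, not the paper's implementation; it assumes a zero-mean GP and a precomputed kernel matrix `K` over the candidate points, and the function name is ours) computes both quantities at every candidate point:

```python
import numpy as np

def blp_and_mspe(K, obs_idx, y_obs):
    """Simple-kriging predictor (Eq. 2) and pointwise MSPE (Eq. 3) for a
    zero-mean GP restricted to a finite candidate set with kernel matrix K."""
    Kn = K[np.ix_(obs_idx, obs_idx)]   # K_n: covariance among observed points
    kx = K[:, obs_idx]                 # k_n(x) for every candidate x, as rows
    y_star = kx @ np.linalg.solve(Kn, y_obs)   # BLP at every candidate point
    # MSPE(x) = K(x,x) - k_n(x)^T K_n^{-1} k_n(x)
    mspe = np.diag(K) - np.sum(kx * np.linalg.solve(Kn, kx.T).T, axis=1)
    return y_star, mspe
```

Note that the BLP interpolates the observed values exactly and the MSPE vanishes at the observed locations, which is the property the greedy criterion below exploits.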

Our analysis in this paper concerns the sup norm of the prediction error with design points picked using a greedy algorithm, i.e. the quantity

$$\sigma_n := \sup_{x\in M}\,\mathrm{MSPE}(x; x_1,\cdots,x_n)$$

where $x_1,\cdots,x_n$ are chosen according to Algorithm 1. This quantity is compared with the "oracle" prediction error attainable by any sequential or non-sequential experimental design with $n$ points, i.e.

$$d_n := \inf_{x_1,\cdots,x_n\in M}\,\sup_{x\in M}\,\mathrm{MSPE}(x; x_1,\cdots,x_n).$$

As will be shown in (29) in Section 4, $d_n$ can be interpreted as the Kolmogorov width of approximating a Reproducing Kernel Hilbert Space (RKHS) with a reduced basis. The RKHS we consider is a natural one associated with a Gaussian process; see e.g. [28, 55] for general introductions to RKHS and [82] for the RKHS associated with Gaussian processes. We include a brief sketch of the RKHS theory needed for understanding Section 4 in Appendix A.

On Riemannian manifolds, there is a natural choice for the kernel function: the heat kernel, i.e. the kernel of the Laplace-Beltrami operator. Denote $\Delta$ for the Laplace-Beltrami operator on $M$ with respect to the metric $g$, i.e.

$$\Delta f = -\frac{1}{\sqrt{|g|}}\,\partial_i\left(\sqrt{|g|}\,g^{ij}\partial_j f\right),\quad\forall f\in C^\infty(M),$$

where the sign convention is such that $\Delta$ is positive semidefinite. If the manifold has no boundary, the spectrum of $\Delta$ is well known to be real, non-negative, and discrete, with eigenvalues satisfying $0=\lambda_0\le\lambda_1\le\lambda_2\le\cdots\to\infty$, with $\infty$ the only accumulation point of the spectrum; when $M$ has non-empty boundary we assume the Dirichlet boundary condition, so the same inequalities hold for the eigenvalues. If we denote $\phi_i$ for the eigenfunction of $\Delta$ corresponding to the eigenvalue $\lambda_i$, then the set $\{\phi_i\}_{i\ge 0}$ constitutes an orthonormal basis for $L^2(M)$ under the standard inner product

$$\langle f_1, f_2\rangle_M := \int_M f_1(x)\,f_2(x)\,\mathrm{dvol}_M(x).$$

The heat kernel $k_t(x,y)$ is the fundamental solution of the heat equation on $M$:

$$\partial_t u(x,t) = -\Delta u(x,t),\quad x\in M,\ t\in(0,\infty).$$

That is, if the initial data is specified as

$$u(x,t=0) = v(x),$$

then

$$u(x,t) = \int_M k_t(x,y)\,v(y)\,\mathrm{dvol}_M(y).$$

In terms of the spectral data of $\Delta$ (see e.g. [68, 12]), the heat kernel can be written as

$$k_t(x,y) = \sum_{i=0}^{\infty} e^{-\lambda_i t}\,\phi_i(x)\,\phi_i(y),\quad\forall t\ge 0,\ x,y\in M. \tag{4}$$

For any fixed $t>0$, the heat kernel defines a Mercer kernel on $M$ by

$$(x,y)\mapsto k_t(x,y),\quad\forall (x,y)\in M\times M,$$

and the feature mapping (see (42)) takes the form

$$M\ni x\longmapsto \Phi_t(x) := \left(e^{-\lambda_0 t/2}\phi_0(x),\ e^{-\lambda_1 t/2}\phi_1(x),\cdots,\ e^{-\lambda_i t/2}\phi_i(x),\cdots\right)\in\ell^2 \tag{5}$$

where $\ell^2$ is the infinite sequence space equipped with its standard inner product; see e.g. [64, §II.1 Example 3]. Note in particular that

$$k_t(x,y) = \left\langle \Phi_t(x),\,\Phi_t(y)\right\rangle_{\ell^2}. \tag{6}$$

In fact, up to a multiplicative constant depending on $t$, the feature mapping $\Phi_t$ has long been studied in spectral geometry [11] and is known to be an embedding of $M$ into $\ell^2$; furthermore, with this multiplicative correction, the pullback of the canonical metric on $\ell^2$ is asymptotically equal to the Riemannian metric on $M$.

In this paper we focus on Gaussian processes on Riemannian manifolds with heat kernels (or "reweighted" counterparts thereof; see Section 2.2) as covariance functions. There are at least two reasons for heat kernels to be considered natural candidates for covariance functions of Gaussian processes on manifolds. First, as argued in [18, §2.5], the abundant geometric information encoded in the Laplace-Beltrami operator makes the heat kernel a canonical choice for Gaussian processes; Gaussian processes defined this way impose natural geometric priors based on randomly rescaled solutions of the heat equation. Second, by (6), a Gaussian process on $M$ with heat kernel covariance is equivalent to a Gaussian process, with a dot product kernel, on the image of $M$ embedded into $\ell^2$ under the feature mapping (5); this is reminiscent of the methodology of extrinsic Gaussian process regression (eGPR) [50] on manifolds — in order to perform Gaussian process regression on a nonlinear manifold, eGPR first embeds the manifold into a Euclidean space using an arbitrary embedding, then performs Gaussian process regression on the embedded image following standard procedures. This spectral embedding interpretation also underlies recent work constructing Gaussian priors, by means of the graph Laplacian, for uncertainty quantification in graph semi-supervised learning [13].
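To make the spectral formula (4) concrete, the following sketch (our illustration; the unnormalized graph Laplacian of a path graph is used here only as a discrete stand-in for $\Delta$) assembles a discrete heat kernel from an eigendecomposition:

```python
import numpy as np

def heat_kernel(L, t):
    """Discrete analogue of Eq. (4): k_t = sum_i exp(-lambda_i t) phi_i phi_i^T,
    where (lambda_i, phi_i) are eigenpairs of a symmetric PSD Laplacian L."""
    lam, phi = np.linalg.eigh(L)
    return (phi * np.exp(-lam * t)) @ phi.T   # phi diag(e^{-lam t}) phi^T

# Unnormalized Laplacian of a path graph on n vertices (illustrative choice).
n = 10
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(axis=1)) - A
Kt = heat_kernel(L, t=0.5)
```

The semigroup identity $k_{t_1}\cdot k_{t_2} = k_{t_1+t_2}$, immediate from (4) and the orthonormality of the eigenvectors, is a convenient sanity check for such a construction.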

### 2.2 Discretized and Reweighted Heat Kernels

When the Riemannian manifold $M$ is a submanifold embedded in an ambient Euclidean space $\mathbb{R}^D$ and sampled only at finitely many points $x_1,\cdots,x_n$, we know from the literature on Laplacian eigenmaps [9, 10] and diffusion maps [26, 76, 77] that the extrinsic squared exponential kernel matrix

$$K = \left(K_{ij}\right)_{1\le i,j\le n} = \left(\exp\left(-\frac{\left\|x_i-x_j\right\|^2}{t}\right)\right)_{1\le i,j\le n} \tag{7}$$

is a consistent estimator (up to a multiplicative constant) of the heat kernel of the manifold if $x_1,\cdots,x_n$ are sampled uniformly and i.i.d. on $M$, with the bandwidth parameter $t$ appropriately adjusted as $n\to\infty$; similar results hold when the squared exponential kernel is replaced with any anisotropic kernel, and additional renormalization techniques can be used to adjust the kernel if the samples are i.i.d. but not uniformly distributed on $M$; see e.g. [26] for more details. These theoretical results in manifold learning justify using extrinsic kernel functions in a Gaussian process regression framework when the manifold is an embedded submanifold of an ambient Euclidean space; the kernel (7) is also used in [91] for Gaussian process regression on manifolds in a Bayesian setting. Nevertheless, one may well use other discrete approximations of the heat kernel in place of (7) without affecting our theoretical results in Section 4, as long as the kernel matrix is positive (semi-)definite and defines a valid Gaussian process for our landmarking purposes.

The heat kernel of the Riemannian manifold $M$ defines covariance functions for a family of Gaussian processes on $M$, but this type of covariance function depends only on the spectral properties of $\Delta$, whereas in practice we would often like to incorporate prior information addressing the relative high/low confidence of the selected landmarks. For example, the response variables might be measured with higher accuracy (or, equivalently, the influence of random observation noise is damped) where the predictor falls on a region of the manifold $M$ with lower curvature. We encode this type of prior information regarding the relative importance of different locations on the domain manifold in a smooth positive weight function $w$ defined on the entire manifold, whereby a higher value of $w(x)$ indicates a relatively higher importance if a predictor variable is sampled near $x$. Since we assume $M$ is closed and $w$ is smooth and positive, $w$ is bounded below away from zero. To "knit" the weight function into the heat kernel, notice that by the reproducing property we have

$$k_t(x,y) = \int_M k_{t/2}(x,z)\,k_{t/2}(z,y)\,\mathrm{dvol}_M(z) \tag{8}$$

and we can naturally apply the weight function to deform the volume form, i.e. define

$$k_t^w(x,y) = \int_M k_{t/2}(x,z)\,k_{t/2}(z,y)\,w(z)\,\mathrm{dvol}_M(z). \tag{9}$$

Obviously, $k_t^w = k_t$ on $M\times M$ if we pick $w\equiv 1$ on $M$, using the expression (4) for the heat kernel and the orthonormality of the eigenfunctions $\{\phi_i\}$. Intuitively, (9) reweighs the mutual interaction between different regions on $M$ such that the portions with high weights have a more significant influence on the covariance structure of the Gaussian process on $M$. Results established for $k_t$ can often be directly adapted for $k_t^w$.

In practice, when the manifold is sampled only at finitely many i.i.d. points $x_1,\cdots,x_n$ on $M$, the reweighted kernel can be calculated from the discrete extrinsic kernel matrix (7) with $t$ replaced by $t/2$:

$$K^w = \left(K^w_{ij}\right)_{1\le i,j\le n} = \left(\sum_{k=1}^{n} e^{-\frac{\left\|x_i-x_k\right\|^2}{t/2}}\cdot w(x_k)\cdot e^{-\frac{\left\|x_k-x_j\right\|^2}{t/2}}\right)_{1\le i,j\le n} = K^\top W K \tag{10}$$

where $W$ is a diagonal matrix of size $n\times n$ with $w(x_i)$ at its $i$-th diagonal entry, for all $1\le i\le n$, and $K$ is the discrete squared exponential kernel matrix (7) with bandwidth $t/2$. It is worth pointing out that the reweighted kernel no longer equals the kernel in (7) even when we set $w\equiv 1$ at this discrete level. Kernels similar to (9) have also appeared in [22] as the symmetrization of an asymmetric anisotropic kernel.
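At the discrete level, Eq. (10) is a single sandwich product. A minimal sketch (ours; the point cloud `X` and weights `w` are placeholder inputs, and the function name is our own):

```python
import numpy as np

def reweighted_kernel(X, w, t):
    """Discrete reweighted kernel of Eq. (10): K^w = K^T W K, where K is the
    squared-exponential kernel of Eq. (7) evaluated with bandwidth t/2."""
    D2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)  # pairwise ||xi-xj||^2
    K = np.exp(-D2 / (t / 2))
    return K.T @ (w[:, None] * K)   # K^T W K with W = diag(w)
```

Since $K^\top W K = (W^{1/2}K)^\top(W^{1/2}K)$, the result is automatically symmetric and positive semidefinite, as required for a valid covariance matrix.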

Though the reweighting step appears to be a straightforward implementation trick, it turns out to be crucial in the application of automated geometric morphometrics: the landmarking algorithm that will be presented in Section 3 produces biologically much more representative features on anatomical surfaces when the reweighted kernel is adopted. We demonstrate this in greater detail in [37].

## 3 Gaussian Process Landmarking

We present in this section an algorithm motivated by [49] that automatically places "landmarks" on a compact Riemannian manifold using a Gaussian process active learning strategy. Let us begin with an arbitrary nonparametric regression model of the form of Eq. 1. Unlike in standard supervised learning, in which a finite number of sample-label pairs are provided, an active learning algorithm can iteratively decide, based on the memory of all previously inquired sample-label pairs, which sample to query for its label in the next step. In other words, given the sample-label pairs observed up to the $n$-th step, an active learning algorithm can decide at which sample to query the label information of the regression function $f$ to be estimated; typically, the algorithm assumes full knowledge of the sample domain, has access to the regression function as a black box, and strives to optimize its query strategy so as to estimate $f$ in as few steps as possible. With a Gaussian process prior $\mathrm{GP}(m,K)$ on the regression function class, the joint distribution of a finite collection of response values $(Y_1,\cdots,Y_{n+1})$ is assumed to follow a multivariate Gaussian distribution $N\left(m(X_1,\cdots,X_{n+1}),\,K(X_1,\cdots,X_{n+1})\right)$, where

$$m(X_1,\cdots,X_{n+1}) = \left(m(X_1),\cdots,m(X_{n+1})\right)\in\mathbb{R}^{n+1}, \tag{11}$$

$$K(X_1,\cdots,X_{n+1}) = \begin{pmatrix} K(X_1,X_1) & \cdots & K(X_1,X_{n+1})\\ \vdots & & \vdots\\ K(X_{n+1},X_1) & \cdots & K(X_{n+1},X_{n+1})\end{pmatrix}\in\mathbb{R}^{(n+1)\times(n+1)}.$$

For simplicity of statement, the rest of this paper will use the short-hand notations

$$X_1^n = (X_1,\cdots,X_n)\in M^n,\qquad Y_1^n = (Y_1,\cdots,Y_n)\in\mathbb{R}^n, \tag{12}$$

and

$$K_{n,n} = K(X_1,\cdots,X_n)\in\mathbb{R}^{n\times n},\qquad K(X, X_1^n) = \left(K(X,X_1),\cdots,K(X,X_n)\right)^\top\in\mathbb{R}^n. \tag{13}$$

Given observed samples $(X_1^n, Y_1^n)$, at any $X\in M$ the conditional probability of the response value $Y(X)$ follows a normal distribution

$$N\left(\xi_n(X),\,\Sigma_n(X)\right)$$

where

$$\xi_n(X) = K(X,X_1^n)^\top K_{n,n}^{-1}\,Y_1^n,\qquad \Sigma_n(X) = K(X,X) - K(X,X_1^n)^\top K_{n,n}^{-1}\,K(X,X_1^n). \tag{14}$$

In our landmarking algorithm, we simply choose $X_{n+1}$ to be the location on the manifold with the largest conditional variance, i.e.

$$X_{n+1} := \operatorname*{argmax}_{X\in M}\,\Sigma_n(X). \tag{15}$$

Notice that this successive procedure of "landmarking" on $M$ is independent of the specific choice of a regression function, since we only need the covariance function $K$.

### 3.1 Algorithm

The main algorithm of this paper, an unsupervised landmarking procedure for anatomical surfaces, will use a discretized, reweighted kernel constructed from the triangular meshes that digitize anatomical surfaces. We now describe this algorithm in full detail. Let $M$ be a two-dimensional compact surface isometrically embedded in $\mathbb{R}^3$, and denote $\kappa$, $\eta$ for the Gaussian curvature and the (scalar) mean curvature of $M$, respectively. Define a family of weight functions parametrized by $\lambda$ and $\rho$ as

$$w_{\lambda,\rho}(x) = \lambda\,\frac{|\kappa(x)|^\rho}{\displaystyle\int_M |\kappa(\xi)|^\rho\,\mathrm{dvol}_M(\xi)} + (1-\lambda)\,\frac{|\eta(x)|^\rho}{\displaystyle\int_M |\eta(\xi)|^\rho\,\mathrm{dvol}_M(\xi)},\quad\forall x\in M. \tag{16}$$

This weight function seeks to emphasize the influence of high-curvature locations on the surface on the covariance structure of the Gaussian process prior $\mathrm{GP}(0, k_t^{w_{\lambda,\rho}})$, where $k_t^{w_{\lambda,\rho}}$ is the reweighted heat kernel defined in Eq. 9. We stick in this paper with simple kriging [setting $m\equiv 0$ in $\mathrm{GP}(m,K)$], and use default values for $\lambda$ and $\rho$ in our implementation (but one may wish to alter these values to fine-tune the landscape of the weight function for a specific application).

For all practical purposes, we only concern ourselves with $M$ being a piecewise linear surface, represented as a discrete triangular mesh with vertex set $V$ and edge set $E$. We calculate the mean and Gaussian curvature functions on the triangular mesh using standard algorithms in computational geometry [24, 3]. The weight function can then be calculated at each vertex $x_i\in V$ by

$$w_{\lambda,\rho}(x_i) = \lambda\,\frac{|\kappa(x_i)|^\rho}{\displaystyle\sum_{k=1}^{|V|}|\kappa(x_k)|^\rho\,\nu(x_k)} + (1-\lambda)\,\frac{|\eta(x_i)|^\rho}{\displaystyle\sum_{k=1}^{|V|}|\eta(x_k)|^\rho\,\nu(x_k)} \tag{17}$$

where $\nu(x_k)$ is the area of the Voronoi cell of the triangular mesh centered at $x_k$. The reweighted heat kernel is then defined on $V\times V$ as

$$k_t^{w_{\lambda,\rho}}(x_i,x_j) = \sum_{k=1}^{|V|} k_{t/2}(x_i,x_k)\,k_{t/2}(x_k,x_j)\,w_{\lambda,\rho}(x_k)\,\nu(x_k) \tag{18}$$

where the (unweighted) heat kernel $k_{t/2}$ is calculated as in Eq. 7. Until a fixed total number of landmarks is collected, at step $n+1$ the algorithm computes the uncertainty score on $V$ from the existing landmarks $\xi_1,\cdots,\xi_n$ by

$$\Sigma^{(n+1)}(x_i) = k_t^{w_{\lambda,\rho}}(x_i,x_i) - k_t^{w_{\lambda,\rho}}(x_i,\xi_1^n)^\top\, k_t^{w_{\lambda,\rho}}(\xi_1^n,\xi_1^n)^{-1}\, k_t^{w_{\lambda,\rho}}(x_i,\xi_1^n),\quad\forall x_i\in V \tag{19}$$

where

$$k_t^{w_{\lambda,\rho}}(x_i,\xi_1^n) := \begin{pmatrix} k_t^{w_{\lambda,\rho}}(x_i,\xi_1)\\ \vdots\\ k_t^{w_{\lambda,\rho}}(x_i,\xi_n)\end{pmatrix},\qquad k_t^{w_{\lambda,\rho}}(\xi_1^n,\xi_1^n) := \begin{pmatrix} k_t^{w_{\lambda,\rho}}(\xi_1,\xi_1) & \cdots & k_t^{w_{\lambda,\rho}}(\xi_1,\xi_n)\\ \vdots & & \vdots\\ k_t^{w_{\lambda,\rho}}(\xi_n,\xi_1) & \cdots & k_t^{w_{\lambda,\rho}}(\xi_n,\xi_n)\end{pmatrix},$$

and picks the $(n+1)$-th landmark according to the rule

$$\xi_{n+1} = \operatorname*{argmax}_{x_i\in V}\,\Sigma^{(n+1)}(x_i).$$

If there is more than one maximizer of $\Sigma^{(n+1)}$, we just randomly pick one; at the first step the algorithm simply picks the vertex maximizing $x_i\mapsto k_t^{w_{\lambda,\rho}}(x_i,x_i)$ on $V$. See Algorithm 1 for a comprehensive description.
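The greedy loop above can be sketched in a few lines on a finite vertex set. The following illustration is ours, not the paper's reference implementation: it takes any precomputed PSD kernel matrix `Kw` (e.g. the reweighted kernel of Eq. 18) and, for clarity, recomputes the conditional variance from scratch at each step rather than using the faster rank-1 update of Remark 2:

```python
import numpy as np

def gp_landmarks(Kw, n_landmarks):
    """Greedy uncertainty sampling of Eqs. (15)/(19) on a finite vertex set:
    repeatedly pick the vertex with largest conditional variance."""
    var = np.diag(Kw).astype(float).copy()
    chosen = []
    for _ in range(n_landmarks):
        chosen.append(int(np.argmax(var)))     # next landmark: largest variance
        idx = np.array(chosen)
        Kn = Kw[np.ix_(idx, idx)]
        Kx = Kw[:, idx]
        # Sigma^{(n+1)}(x_i) = k(x_i,x_i) - k(x_i,xi)^T Kn^{-1} k(x_i,xi)
        var = np.diag(Kw) - np.sum(Kx * np.linalg.solve(Kn, Kx.T).T, axis=1)
    return chosen
```

Because the conditional variance vanishes at already-chosen vertices, the selected landmarks are automatically distinct and tend to spread across the domain.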

###### Remark 1

Algorithm 1 can be easily adapted to work with point clouds (where connectivity information is not present) and in higher dimensional spaces, which makes it applicable to a much wider range of input data in geometric morphometrics as well as other applications; see e.g. [37]. For instance, it suffices to replace Step 4 of Algorithm 1 with a different discrete curvature (or another type of “importance score”) calculation procedure on point clouds (see e.g. [69, 29]), and replace Step 5 with a nearest-neighbor weighted graph adjacency matrix construction. We require in this paper the inputs to be triangular meshes with edge connectivity only for the ease of statement, as computation of discrete curvatures on triangular meshes is much more straightforward.

###### Remark 2

Note that, according to Eq. 19, each step adds only one new row and one new column to the covariance matrix, which enables us to perform rank-1 updates of its inverse according to the block matrix inversion formula (see e.g. [63, §A.3])

$$K_n^{-1} = \begin{pmatrix} K_{n-1} & P\\ P^\top & K(X_n,X_n)\end{pmatrix}^{-1} = \begin{pmatrix} K_{n-1}^{-1}\left(I_{n-1} + \mu\, P P^\top K_{n-1}^{-1}\right) & -\mu\, K_{n-1}^{-1} P\\ -\mu\, P^\top K_{n-1}^{-1} & \mu \end{pmatrix}$$

where

$$P = \left(K(X_1,X_n),\cdots,K(X_{n-1},X_n)\right)^\top\in\mathbb{R}^{n-1},\qquad \mu = \left(K(X_n,X_n) - P^\top K_{n-1}^{-1} P\right)^{-1}\in\mathbb{R}.$$

This simple trick significantly improves the computational efficiency as it avoids directly inverting the covariance matrix when the number of landmarks becomes large as the iteration progresses.
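A direct sketch of the update in Remark 2 (our illustration; `Kinv_prev` stands for $K_{n-1}^{-1}$, and `P`, `k_nn` for the new cross- and self-covariances):

```python
import numpy as np

def grow_inverse(Kinv_prev, P, k_nn):
    """Block-inversion growth of Remark 2: build K_n^{-1} from K_{n-1}^{-1}
    when one row/column (P, k_nn) is appended to the covariance matrix."""
    v = Kinv_prev @ P
    mu = 1.0 / (k_nn - P @ v)          # reciprocal of the Schur complement
    top = Kinv_prev + mu * np.outer(v, v)
    return np.block([[top, -mu * v[:, None]],
                     [-mu * v[None, :], np.array([[mu]])]])
```

Each update costs $O(n^2)$ instead of the $O(n^3)$ of a fresh inversion, which is the efficiency gain referred to above.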

Before we delve into the theoretical aspects of Algorithm 1, let us present a few typical instances of this algorithm in practical use. A more comprehensive evaluation of the applicability of Algorithm 1 to geometric morphometrics is deferred to [37]. In a nutshell, the Gaussian process landmarking algorithm picks the landmarks on the triangular mesh successively, according to the uncertainty score function at the beginning of each step; at the end of each step the uncertainty score function gets updated, with the information of the newly picked landmark incorporated into the inverse covariance matrix defined as in Eq. 14. Fig. 1 illustrates the first few successive steps on a triangular mesh discretization of a fossil molar of primate Plesiadapoidea. Empirically, we observed that the updates to the uncertainty score function are mostly local, i.e. no abrupt changes of the uncertainty score are observed away from a small geodesic neighborhood centered at each new landmark. Guided by uncertainty and the curvature-reweighted covariance function, Gaussian process landmarking often identifies landmarks of abundant biological information—for instance, the first few Gaussian process landmarks are often highly biologically informative, and demonstrate a comparable level of coverage with observer landmarks manually picked by human experts. See Fig. 2 for a visual comparison between the automatically generated landmarks and the observer landmarks manually placed by evolutionary anthropologists on a different digitized fossil molar.

### 3.2 Numerical Linear Algebra Perspective

Algorithm 1 can be divided into two phases: Lines 1 to 10 focus on constructing the kernel matrix from the geometry of the triangular mesh; from Line 11 onward, only numerical linear algebraic manipulations are involved. In fact, the numerical linear algebra part of Algorithm 1 is identical to Gaussian elimination (or LU decomposition) with a very particular "diagonal pivoting" strategy, which is different from standard full or partial pivoting in Gaussian elimination. To see this, first note that the variance $\Sigma_n$ in Eq. 14 is just the diagonal of the Schur complement of the submatrix of the kernel matrix corresponding to the previously chosen landmarks, and recall from [80, Ex. 20.3] that this Schur complement arises as the bottom-right block after the $n$-th elimination step. The greedy criterion Eq. 15 then amounts to selecting the largest diagonal entry in this Schur complement as the next pivot. Therefore, the second phase of Algorithm 1 can be consolidated into the form of a "diagonal-pivoted" LU decomposition, in which the leading columns of the permutation matrix reveal the locations of the chosen landmarks. In fact, since the kernel matrix we choose is symmetric and positive semidefinite, the rank-1 updates in Remark 2 most closely resemble the pivoted Cholesky decomposition (see e.g. [43, §10.3] or [41]). As with these classical pivoting-based matrix decomposition algorithms, the time and space complexity of the main algorithm in Algorithm 1 is polynomial in the total number of candidate points and the desired number of landmarks; these complexities are comparable with those in the computer science literature [15, 58, 86, 1, 2]. This numerical linear algebraic perspective motivates us to investigate variants of Algorithm 1 based on other numerical linear algebraic algorithms with pivoting in future work.
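The equivalence with pivoted Cholesky can be seen directly in code. In the sketch below (ours, for illustration on any PSD kernel matrix), the running diagonal `d` of the Schur complement is exactly the conditional variance $\Sigma_n$ of Eq. 14, so diagonal pivoting reproduces the greedy rule of Eq. 15:

```python
import numpy as np

def pivoted_cholesky(K, n_steps):
    """Partial Cholesky with diagonal pivoting on a PSD kernel matrix K;
    after j steps, d holds the diagonal of the Schur complement, i.e. the
    conditional variance at every candidate point."""
    K = np.asarray(K, dtype=float)
    d = np.diag(K).copy()
    L = np.zeros((K.shape[0], n_steps))
    pivots = []
    for j in range(n_steps):
        p = int(np.argmax(d))                         # diagonal pivoting = Eq. (15)
        pivots.append(p)
        L[:, j] = (K[:, p] - L[:, :j] @ L[p, :j]) / np.sqrt(d[p])
        d -= L[:, j]**2                               # rank-1 variance update
    return pivots, L, d
```

After $n$ steps, `L @ L.T` is the Nyström approximation of `K` built from the selected pivots, and `d` coincides with the variance computed directly from Eq. 14.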

## 4 Rate of Convergence: Reduced Basis Methods in Reproducing Kernel Hilbert Spaces

In this section we analyze the rate of convergence of our main Gaussian process landmarking algorithm in Section 3. While the notion of "convergence rate" in the context of Gaussian process regression (i.e. kriging [56, 79]) or scattered data approximation (see e.g. [87] and the references therein) refers to how fast the interpolant approaches the true function, our focus in this paper is the rate of convergence of Algorithm 1 itself, i.e. the number of steps the algorithm takes before it terminates. In practice, unless a maximum number of landmarks is predetermined, a natural criterion for terminating the algorithm is to specify a threshold for the sup-norm of the prediction error (3) [i.e. the variance (15)] over the manifold. We emphasize again that, although this greedy approach is motivated by the probabilistic model of Gaussian processes, the MSPE is completely determined once the kernel function and the design points are specified, and so is the greedy algorithmic procedure. Our analysis is centered around bounding the uniform rate at which the pointwise MSPE function (3) decays with respect to the number of landmarks greedily selected.

To this end, we observe the connection between Algorithm 1 and a greedy algorithm studied thoroughly for reduced basis methods in [14, 31] in the context of model reduction. While the analyses in [14, 31] assume general Hilbert and Banach spaces, we apply their results to a reproducing kernel Hilbert space (RKHS), denoted $\mathcal{H}_K$, naturally associated with a Gaussian process; as will be demonstrated below, the MSPE with respect to $n$ selected landmarks can be interpreted as a distance function between elements of $\mathcal{H}_K$ and an $n$-dimensional subspace of $\mathcal{H}_K$ determined by the selected landmarks. We emphasize that, though the connection between Gaussian processes and RKHSs is well known (see e.g. [82] and the references therein), we are not aware of existing literature addressing the resemblance between the two classes of greedy algorithms widely used in Gaussian process experimental design and reduced basis methods.

We begin with a brief summary of the greedy algorithm in reduced basis methods for a general Banach space $V$. The algorithm strives to approximate all elements of a compact subset $\mathcal{F}\subset V$ using a properly constructed linear subspace spanned by (as few as possible) selected elements of $\mathcal{F}$; thus the name "reduced" basis. A popular greedy algorithm for this purpose generates successive approximation spaces by choosing the first basis element according to

$$f_1 := \operatorname*{arg\,max}_{f\in\mathcal{F}} \|f\| \tag{20}$$

and, successively, when $f_1,\cdots,f_n$ have already been picked, choosing

$$f_{n+1} := \operatorname*{arg\,max}_{f\in\mathcal{F}} \operatorname{dist}(f, V_n) \tag{21}$$

where

$$V_n = \operatorname{span}\{f_1, f_2, \cdots, f_n\}$$

and

$$\operatorname{dist}(f, V_n) := \inf_{g\in V_n} \|f - g\|.$$

In words, at each step we greedily pick the function that is "farthest away" from the span of the already chosen basis elements. Intuitively, this is analogous to the farthest point sampling (FPS) algorithm [38, 54] in Banach spaces, with a key difference in the choice of the distance between a point $f$ and the set of selected points $\{f_1,\cdots,f_n\}$: in FPS such a distance is defined as the minimum over all distances $\|f - f_i\|$, whereas in reduced basis methods the distance is between $f$ and the linear subspace spanned by $f_1,\cdots,f_n$.
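The contrast between the two set-to-point distances can be made concrete with a toy sketch in which "functions" are plain vectors in $\mathbb{R}^d$ (all names below are ours, introduced only for illustration):

```python
import numpy as np

def fps_distance(f, selected):
    # FPS: distance from f to the *nearest* already-selected point
    return min(np.linalg.norm(f - g) for g in selected)

def rb_distance(f, selected):
    # reduced basis: distance from f to the linear *span* of the selected points
    B = np.stack(selected, axis=1)                 # columns are basis vectors
    coef, *_ = np.linalg.lstsq(B, f, rcond=None)   # least-squares projection
    return np.linalg.norm(f - B @ coef)
```

Since the span of the selected elements contains each selected element, the reduced-basis distance never exceeds the FPS distance; the reduced-basis greedy criterion is thus the more stringent of the two.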

The Gaussian process landmarking algorithm fits naturally into the framework of reduced basis methods, as follows. Let us first specialize this construction to the case when $V$ is the reproducing kernel Hilbert space $\mathcal{H}_K$, where $\mathcal{M}$ is a compact Riemannian manifold and $K$ is the reproducing kernel. A natural choice for $K$ is the heat kernel with a fixed $t>0$ as in Section 2.1, but for a submanifold isometrically embedded into an ambient Euclidean space it is also common to choose the kernel to be the restriction to the submanifold of a positive (semi-)definite kernel on the ambient Euclidean space, such as Eq. 7 or Eq. 10, for which Sobolev-type error estimates are known in the literature of scattered data approximation [57, 36]. It follows from standard RKHS theory (see e.g. [43]) that

$$\mathcal{H}_K = \overline{\operatorname{span}}\Big\{\sum_{i\in I} a_i K(\cdot, x_i) \;\Big|\; a_i\in\mathbb{R},\ x_i\in\mathcal{M},\ \operatorname{card}(I)<\infty\Big\} \tag{22}$$

and, by the compactness of $\mathcal{M}$ and the regularity of the kernel function, we have for any $x\in\mathcal{M}$

$$\langle K(\cdot,x), K(\cdot,x)\rangle_{\mathcal{H}_K} = K(x,x) \le \|K\|_{\infty,\mathcal{M}\times\mathcal{M}} < \infty$$

which justifies the compactness of

$$\mathcal{F} := \{K(\cdot,x) \mid x\in\mathcal{M}\} \tag{23}$$

as a subset of $\mathcal{H}_K$, since $\mathcal{M}$ embeds into $\mathcal{H}_K$ compactly via $x\mapsto K(\cdot,x)$ [4, 67]. In fact, since we only used the compactness of $\mathcal{M}$ and the boundedness of $K$, the argument above for the compactness of $\mathcal{F}$ can be extended to any Gaussian process defined on a compact metric space with a bounded kernel. The initialization step Eq. 20 now amounts to selecting $x_1\in\mathcal{M}$ that maximizes

$$\|K(\cdot,x)\|_{\mathcal{H}_K}^2 = K(x,x)$$

which is identical to Eq. 15 when $n=0$ (or equivalently, Line 14 in Algorithm 1); furthermore, given $n$ previously selected basis functions $K(\cdot,x_1),\cdots,K(\cdot,x_n)$, the $(n+1)$-th basis function will be chosen according to Eq. 21, i.e. $x_{n+1}$ maximizes the infimum

$$\begin{aligned} \inf_{g\in\operatorname{span}\{K(\cdot,x_1),\cdots,K(\cdot,x_n)\}} \|K(\cdot,x)-g\|_{\mathcal{H}_K}^2 &= \inf_{a_1,\cdots,a_n\in\mathbb{R}} \Big\|K(\cdot,x)-\sum_{i=1}^n a_i K(\cdot,x_i)\Big\|_{\mathcal{H}_K}^2 \\ &= \inf_{a_1,\cdots,a_n\in\mathbb{R}}\; K(x,x) - 2\sum_{i=1}^n a_i K(x,x_i) + \sum_{i=1}^n\sum_{j=1}^n a_i a_j K(x_i,x_j) \\ &\overset{(*)}{=} K(x,x) - K(x,x_1^n)^{\top} K_{n,n}^{-1} K(x,x_1^n) \end{aligned} \tag{24}$$

where the notations are as in Eq. 12 and Eq. 13, i.e.

$$K(x,x_1^n) := \begin{pmatrix} K(x,x_1) \\ \vdots \\ K(x,x_n) \end{pmatrix}, \qquad K_{n,n} := \begin{pmatrix} K(x_1,x_1) & \cdots & K(x_1,x_n) \\ \vdots & & \vdots \\ K(x_n,x_1) & \cdots & K(x_n,x_n) \end{pmatrix}.$$

The equality $(*)$ follows from the observation that, for any fixed $x$, the minimizing vector $a=(a_1,\cdots,a_n)^{\top}$ satisfies

$$K(x,x_1^n) = K_{n,n}\, a \quad\Longleftrightarrow\quad a = K_{n,n}^{-1} K(x,x_1^n).$$

It is clear at this point that maximizing the rightmost quantity in Eq. 24 over $x\in\mathcal{M}$ is equivalent to following the greedy landmark selection criterion Eq. 15 at the $(n+1)$-th step. We thus conclude that Algorithm 1 is equivalent to the greedy algorithm for reduced basis methods in $\mathcal{H}_K$, a reproducing kernel Hilbert space modeled on the compact manifold $\mathcal{M}$. The following lemma summarizes this observation for future reference.
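This equivalence can be checked numerically on a finite point set: the conditional variance of Eq. 15 and the squared RKHS distance of Eq. 24 agree to machine precision, because inner products of kernel sections are themselves kernel evaluations. A minimal sketch with a Gaussian kernel standing in for $K$ (the kernel choice and all names below are ours):

```python
import numpy as np

def kernel(u, v, s=1.0):
    # Gaussian (RBF) kernel, a stand-in for the reproducing kernel K
    return np.exp(-np.sum((u - v) ** 2) / (2 * s * s))

rng = np.random.default_rng(2)
landmarks = rng.normal(size=(4, 3))   # n = 4 landmarks in R^3
x = rng.normal(size=3)                # query point

Knn = np.array([[kernel(a, b) for b in landmarks] for a in landmarks])
kx = np.array([kernel(a, x) for a in landmarks])

# GP conditional variance (MSPE) at x, as in Eq. 15:
var = kernel(x, x) - kx @ np.linalg.solve(Knn, kx)

# Squared RKHS distance, expanding the norm with the optimal a* = Knn^{-1} kx:
a = np.linalg.solve(Knn, kx)
dist2 = kernel(x, x) - 2 * a @ kx + a @ Knn @ a
```

Here `var` and `dist2` coincide up to roundoff, mirroring the equality $(*)$ in Eq. 24.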

###### Lemma 1

Let $\mathcal{M}$ be a compact Riemannian manifold, and let $K:\mathcal{M}\times\mathcal{M}\to\mathbb{R}$ be a positive semidefinite kernel function. Consider the reproducing kernel Hilbert space $\mathcal{H}_K$ as defined in Eq. 22. For any $x\in\mathcal{M}$ and a collection of points $X_n=\{x_1,\cdots,x_n\}\subset\mathcal{M}$, the orthogonal projection $P_n$ from $\mathcal{H}_K$ onto $V_n=\operatorname{span}\{K(\cdot,x_1),\cdots,K(\cdot,x_n)\}$ satisfies

$$P_n(K(\cdot,x)) = \sum_{i=1}^n a_i^*(x)\, K(\cdot,x_i)$$

where $a_i^*(x)$ is the inner product of the vector $K(x,x_1^n)$ with the $i$-th row of

$$\begin{pmatrix} K(x_1,x_1) & \cdots & K(x_1,x_n) \\ \vdots & & \vdots \\ K(x_n,x_1) & \cdots & K(x_n,x_n) \end{pmatrix}^{-1}.$$

In particular, $a_i^*$ has the same regularity as the kernel $K$, for all $1\le i\le n$. Moreover, the squared distance between $K(\cdot,x)$ and the linear subspace $V_n$ has the closed-form expression

$$\begin{aligned} P_{K,X_n}(x) :&= \|K(\cdot,x) - P_n(K(\cdot,x))\|_{\mathcal{H}_K}^2 \\ &= \min_{a_1,\cdots,a_n\in\mathbb{R}} \Big\|K(\cdot,x) - \sum_{i=1}^n a_i K(\cdot,x_i)\Big\|_{\mathcal{H}_K}^2 \\ &= K(x,x) - K(x,x_1^n)^{\top} K_{n,n}^{-1} K(x,x_1^n) \end{aligned} \tag{25}$$

where

$$K(x,x_1^n) := \big(K(x,x_1),\cdots,K(x,x_n)\big)^{\top} \in \mathbb{R}^n.$$

Consequently, for any Gaussian process defined on $\mathcal{M}$ with covariance structure given by the kernel function $K$, the MSPE of the Gaussian process at $x$ conditioned on the observations at $x_1,\cdots,x_n$ equals the squared distance between $K(\cdot,x)$ and the subspace spanned by $K(\cdot,x_1),\cdots,K(\cdot,x_n)$.

The function $P_{K,X_n}$ defined in Eq. 25 is in fact the squared power function in the literature of scattered data approximation; see e.g. [87, Definition 11.2].

The convergence rate of greedy algorithms for reduced basis methods has been investigated in a series of works [17, 14, 31]. The general paradigm is to compare the maximum approximation error incurred after the $n$-th greedy step, denoted

$$\sigma_n := \operatorname{dist}(f_{n+1}, V_n) = \max_{f\in\mathcal{F}} \operatorname{dist}(f, V_n), \tag{26}$$

with the Kolmogorov $n$-width (cf. [52]), a quantity characterizing the theoretically optimal error of approximation by an $n$-dimensional linear subspace, whether generated by a greedy or non-greedy algorithm, defined as

$$d_n := \inf_{Y} \sup_{f\in\mathcal{F}} \operatorname{dist}(f, Y) \tag{27}$$

where the infimum is taken over all $n$-dimensional subspaces $Y$ of $V$. When $n=0$, both $\sigma_0$ and $d_0$ reduce to the $L^\infty$-bound of the kernel function on $\mathcal{M}\times\mathcal{M}$, i.e.