    # Topologically penalized regression on manifolds

We study a regression problem on a compact manifold M. In order to take advantage of the underlying geometry and topology of the data, the regression task is performed on the basis of the first several eigenfunctions of the Laplace-Beltrami operator of the manifold, that are regularized with topological penalties. The proposed penalties are based on the topology of the sub-level sets of either the eigenfunctions or the estimated function. The overall approach is shown to yield promising and competitive performance on various applications to both synthetic and real data sets. We also provide theoretical guarantees on the regression function estimates, on both its prediction error and its smoothness (in a topological sense). Taken together, these results support the relevance of our approach in the case where the targeted function is "topologically smooth".

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Problems of regression on manifolds are of growing importance in statistical learning. Given a manifold , the specific goal is to retrieve a true regression function from data (for that lie on the manifold and noisy real-valued responses of the form where are the additive noise. Such problems arise in many applications where the data samples , although represented by very high dimensional spaces like sets of images and 3D volumes, often have an underlying low-dimensional structure and lie on a manifold. This is in particular the case in medical applications. For instance, Guerrero et al. (2014) and Jie et al. (2015) study regression problems on a set of images of brains. While the set of all images is of very large dimension (the number of pixels), the set of brain images turns out to have a comparatively very small intrinsic dimension. Although there are ways to recover the metric of the underlying unknown manifold (Beg et al., 2005), in this article we adopt an extrinsic approach.

A standard approach for estimating is based on expanding the function in a suitable basis to take advantage of the underlying manifold structure. To this end, we consider the Laplace-Beltrami operator (Rosenberg, 1997), which has been broadly studied, both for its theoretical properties (Zelditch, 2008, 2017; Shi and Xu, 2010) and its great power of applicability in data analysis (Hendriks, 1990; Levy, 2006; Coifman and Lafon, 2006; Saito, 2008; Mahadevan and Maggioni, 2007; Göbel et al., 2018)

. In our context, we chose the basis to be the set of eigenfunctions of the Laplace-Beltrami operator. Since it is impossible to have access to a closed form expression for the Laplace-Beltrami eigenfunctions for various manifolds in full generality, we replace them by the eigenvectors of the Laplacian matrix of a graph built on the data; see

Mohar (1991) for a complete treatment. Using the eigenvectors of the graph Laplacian for diverse learning tasks is an idea that has its roots in the works of Belkin and Niyogi (2003) and has become extremely popular since. There is a plethora of literature on this topic, and we refer to Wang et al. (2015) for a theoretical treatment, and Belkin et al. (2006) and Chun et al. (2016) for two out of many applications. The use of the graph Laplacian spectrum is backed-up by solid guarantees regarding its convergence towards the spectrum of the Laplace-Beltrami operator. We refer to von Luxburg et al. (2008)

for general results adopting the point of view of spectral clustering,

Burago et al. (2015) for a more recent treatment, and Trillos et al. (2020) for the recent generalization of the latter to random data.

In order to efficiently estimate , we will use a penalization procedure; see Giraud (2014) or Massart (2007) for a complete treatment of these methods. Specifically, we will present two types of penalties that both leverage topological information. These penalties are based on persistent homology, a field that has its origins in algebraic topology and Morse theory (Milnor, 1963). The use of persistent homology has become increasingly popular over the past decade, popularized, among others, by the books Edelsbrunner and Harer (2010) and Boissonnat et al. (2018)

. It offers a new approach to data representations. Penalties based on persistence follow a heuristic similar to the one based on total variation (see, for instance,

Rudin et al., 1992; Hütter and Rigollet, 2016) which works by reducing the oscillations of the estimated function in order to reconstruct a smooth function. While the heuristics for total variation penalties and persistence based penalties are similar, they still work quite differently, as discussed below.

Penalizing the persistence has been used recently in Chen et al. (2019) for classification applications, and in Brüel-Gabrielsson et al. (2020)

in the context of Generative Adversarial Networks. Furthermore,

Carriere et al. (2021) has examined optimization with such penalizations in the context of various applications. The novelty of the present work resides in the use of such models in the framework of a regression over a manifold and its joint utilization with a Laplace eigenbasis, enabling a deeper understanding of its topology. It is also the opportunity to study higher dimensional examples where the behavior of topological persistence is fundamentally different from the one of total variation. Indeed, we will see that topological persistence is a very convenient way to prevent the estimated functions from oscillating too much in a stronger way than more standard approaches.

The rest of the paper is organized as follows: in an intuitive fashion, Section 2 presents the motivation behind the introduction of a topological penalty for a Laplace eigenbasis regression and how it can overcome the limitations of total variation denoising. Section 4 discusses two types of topological penalties: one is equivalent to solving a Lasso problem with weights and therefore has a simple theoretical analysis and even has a closed-form solution, while the other one is non-convex. Despite the lack of guarantees in non-convex optimization, we will present an oracle result for the estimated parameter. We also present a result on controlling the topology, or on topological sparsity, of one of our approaches. In Section 5, we will present the results of experiments conducted on both synthetic and real data, in order to highlight the strengths and weaknesses of such an approach as opposed to standard regression methods. We have made the code used in several examples available here.

## 2 Motivation

### 2.1 Laplace eigenbasis regression

We study a regression problem on a compact submanifold of dimension of without boundary. Throughout this paper, we assume the data points are sampled uniformly and independently over . Furthermore, for , the responses are generated based on the model

 Yi=f⋆(Xi)+εi, (1)

where are i.i.d. zero-mean sub-Gaussian noise variables independent of all the ’s. Our goal is to retrieve the function , also referred to as the regression function, from the given observations .

A natural choice of basis to perform a regression and exploit the manifold-structure assumption is the Laplace-Beltrami eigenbasis. Analogously to the Euclidean case, the Laplace-Beltrami operator is (the negative of) the divergence of the gradient: . If we denote by

the metric tensor and by

the components of its inverse, we have the following expression in local coordinates (with Einstein summation convention):

 Δf=−1√det(g)∂i(√det(g)gij∂jf).

We remark here that due to our uniform sampling assumption on the , if suffices to consider the standard Laplace-Beltrami operator as above. The methodology and theory we develop will immediately extend to non-uniform sampling schemes based on weighted Laplace-Beltrami operators (Rosenberg, 1997; Grigoryan, 2009) as long as the sampling distribution is sufficiently light-tailed (say, it satisfies Poincaré inequality). In order for our exposition to convey our main contribution on topological penalization, we stick to the uniform sampling assumption in the rest of this paper.

Notice that in the Euclidean case, where

is the identity matrix, we retrieve the usual well-known formula for the Laplacian (up to a sign convention). The operator

is a self-adjoint operator with compact inverse which implies that its set of eigenvalues is discrete and that they all are non-negative

(Rosenberg, 1997). We can then sort the eigenvalues in nondecreasing order and approximate as a linear combination of the corresponding normalized eigenfunctions . Besides being an orthonormal basis of with many smoothness properties, it is a known fact that the functions are related to the topology of the manifold (Zelditch, 2008). In addition, the Laplace-Beltrami eigenbasis can be seen as an extension of the Fourier basis to general manifolds. Indeed, consider the two dimensional torus parametrized by the mapping :

 ⎧⎪⎨⎪⎩x1=(2+cos(θ))cos(ϕ),x2=(2+cos(θ))sin(ϕ),x3=sin(θ).

On this manifold, the eigenvalues of the Laplace-Beltrami operator are and the corresponding eigenfunctions are up to a normalization constant. By analogy with the approximation of a function by its truncated Fourier series in classical analysis, it is natural to choose the eigenfunctions corresponding to the smallest eigenvalues as a suitable expansion basis for the signal.

Once the number of features is chosen, the problem boils down to use the observed data for finding such that is a good approximation to . To this end, we introduce the design matrix where and we let be a minimizer of

 L(θ)=∥Y−Xθ∥22+μΩ(θ), (2)

where

is the response vector and

is a penalty term also depending on the Laplace-Beltrami eigenfunctions. Our choices for will be discussed below. The scalar is a calibration factor aiming at reducing overfitting. In case we do not know the eigenfunctions , we will use eigenvectors of a graph Laplacian as sample approximation (see below for details).

Examples of classical penalties include -regularization, also called Lasso (see Bühlmann and van de Geer, 2011 for an exhaustive treatment), and total variation penalty. Although the latter provides good theoretical guarantees (see, for example Hütter and Rigollet, 2016 for oracle results and Dutta and Nguyen, 2018 for a metric entropy based approach), it fails to capture some aspects of the geometry of the data. Indeed, consider the square (or equivalently the 2D torus), discretize it as small squares of size (we can assume to be equal to for some integer to avoid boundary issues) and consider a pyramidal function on each square with value 0 at the boundary of the square, and a maximum of attained in the middle of the square (see Figure 1). The total variation of the so obtained function is equal to . Since , it yields that . In particular, it does not depend on , which means that total variation is blind to very small perturbations of the function and is therefore not suited to deal with such a type of noise. We are now going to see in the following subsection a type of penalty that can capture such small oscillations.

### 2.2 Total Persistence

In this section, we present the most basic concepts of topological data analysis as introduced in the reference textbooks by Edelsbrunner and Harer (2010) and Boissonnat et al. (2018). We will try to keep the notions as intuitive as possible and do not lay out the technical details of homology theory. Consider the sub-level sets of a given function . As traverses from to , the topology of the sub-level sets changes and we keep track of these changes in the so-called persistence diagram. More precisely, suppose we are interested in -dimensional topological features present in a sub-level set (or in a topological space), namely a connected component for , a cycle for , a void for , and so on. For simplicity, assume that is a Morse function (in particular, its critical points are non-degenerate), such that the topology of the sub-level sets of only changes at levels corresponding to extremal points (see Theorem 2 below). Then, as the level increases, such -dimensional features might start to exist at a certain level and they might disappear by merging with another component at a different level , where might be equal to if it never disappears. Then we place a point in the plane with coordinates . The set of all such points (each corresponding to a different -dimensional feature) along with the diagonal (accounting for the fact that in general features might appear and disappear at the same level or time) forms the th persistence diagram of . It is a multi-set of as different features might appear and disappear at the same levels. The example of a persistence diagram for a one-dimensional function shown in Figure 2 is taken from Boissonnat et al. (2018).

For the persistence diagram of a function to be well-defined, we need the function to satisfy a tameness assumption (Edelsbrunner and Harer, 2010). Sufficient for this is to assume to be a Morse function. The following result makes precise the above mentioned fact that for Morse functions the topology of the sub-level sets of can be simply described in terms of its critical points, defined by . Recall that critical points of Morse functions are non-degenerate (non-singular Hessian), and that the index of a critical point is the number of negative eigenvalues of the Hessian.

[Edelsbrunner and Harer (2010)] Let f be a Morse function on a smooth manifold and denote by the sub-level set .

• Suppose that there is no critical value between . Then and are diffeomorphic and deformation retracts onto .

• Suppose is a non-degenerate critical point of with index and that . Then for small enough, is homotopy equivalent to with a -handle attached.

As a consequence, for Morse functions all the coordinates of the points in the persistence diagrams of every dimension are critical values of . Furthermore, in the persistence diagram of a feature dimension , the birth times are critical values of index and the death times are critical values of index . We define the persistence of a feature to be its lifetime, namely its death time minus its birth time , and define the persistence of a function, denoted by , as the sum of all individual persistences in dimension . In the literature this sometimes is also called the -total persistence. When we talk about the persistence of a function , it is understood to be the sum of all persistences over all dimensions. The count of all the -dimensional features in a topological space (sub-level set) is called the -th Betti number of this space. The total sum of all the the -th Betti number is called the total Betti number of this space.

Note that the existence of features with infinite persistence would make equal to . To avoid this degeneracy, the quantity is modified (see Polterovich et al. (2020)) by replacing the infinite persistence of a feature born at by The number of topological features with infinite persistences equals the total Betti number of the manifold . We also state a useful result from Chapter 6 of Polterovich et al. (2020) in Lemma 2 below. It can be seen as a corollary of the famous stability inequality in topological data analysis from Cohen-Steiner et al. (2007). The result essentially states that two functions close in uniform norm necessarily have close persistence.

[Polterovich et al., 2020] Let and be two Morse functions on a manifold with total Betti number . Denote by the total number of points (with finite persistence) in the persistence diagram of . Then

 χ(f)−χ(h)≤(2ν(f)+ζ)∥f−h∥∞.

This result remains true when is Morse and is only continuous. Under those circumstances, can be defined by the (possibly infinite) limit of the total persistence of a sequence of Morse functions that uniformly converges towards , as done in Polterovich et al. (2019).

## 3 Methodology

In applications we construct persistence diagrams from random data — think of a random function, such as an estimated regression function, or a function with noise added; see below. The current paradigm in topological data analysis is that in such random persistence diagrams the features with a high persistence are true features, whereas the features with a low persistence that lie near the diagonal are noisy perturbation of the topology. We can see an example of this in Figure 3 where we have computed the value of a function at points uniformly sampled in the square and where we have added Gaussian noise to each entry with three different levels

), and then plotted an interpolation. The function considered here is the sum of four Gaussian functions on a square. This function has a single topological feature of dimension 0 (a connected component is born at level 0 and never dies) and four topological components of dimension 1 (that die at the height of the local modes of each Gaussian). When adding noise to this function, the resulting persistence diagram has many points and the noisy function has a very large persistence. The higher the noise level, the further the noisy features are from the diagonal, until it is hard to distinguish the four true topological features from the noisy features. We can see this observation reflected in the plots of the function itself. This motivates to consider methods that sparsify the persistence diagram in order to denoise the input.

### 3.1 Two types of penalties

We will see two different ways of penalizing the persistence. The first one aims at reducing the dimension of the problem by selecting a ‘small’ number of eigenfunctions, while the second one is more focused on denoising and providing a smoother output.

We first consider the penalty

 Ω1(θ)=p∑i=1|θi|χ(Φi),

where the are eigenfunctions of the Laplace-Beltrami operator. Depending on the problem at hand, the persistence of only one or a few chosen homological dimensions can be used. The idea behind the penalty is that the more a function oscillates, the more likely it is to overfit the data. The penalty can be understood as a weighted Lasso penalty, with weights being the persistences of each eigenfunction. The weighted Lasso is a broadly studied model (see Bühlmann and van de Geer, 2011 for an exhaustive reference). It induces sparsity in the representation of the function, and in our context it aims at introducing an inductive bias towards discarding eigenfunctions with a large persistence.

The second persistence based penalty considered here is

 Ω2=χ(p∑i=1θiΦi).

While does not induce sparsity over the parameter , it aims at inducing a certain kind of ‘topological sparsity’. Indeed, the goal of this penalty is for the reconstructed function to have a much smaller number of points in the persistence diagram than the noisy function. Intuitively, this is achieved by optimizing so as to cancel out oscillations of the individual eigenfunctions when summing them up. Section 5 presents some illustrations of this, especially Figure 8. A main computational issue with the penalty is its non-convexity. Indeed, if we consider the two functions and , they both have a persistence of order , yet the sum of the two has a persistence of order . However, a result in Carriere et al. (2021)

shows the convergence of a stochastic gradient descent algorithm towards a critical point of the loss function for the penalty

.

Finally, we remark that one can potentially consider variants of penalty , for instance, a weighted Ridge penalty of the form . Preliminary experiments have shown that the weighted Lasso performs better than the weighted Ridge. In addition, as we will see in Section 5, in applications a combination of the two penalties described above is quite efficient, and we actually benefit of the selection properties of the weighted Lasso.

### 3.2 Contrasting persistence and total variation

While persistence at a first glance might appear to be a similar measure of the regularity of a function as total variation, this only is true for a function in one dimension, where the persistence is half the total variation (see Polterovich et al., 2020). In higher dimensions these two penalties are no longer equivalent as discussed in the following.

A first indication of the differences between total variation and persistence is given in Figure 4, showing Laplace-Beltrami eigenvalues on along with persistences and total variations of their eigenfunctions

. Within an eigenspace, the eigenvalues are sorted in lexicographical order on

. We remark here that the -axis (corresponding to the index of eigenfunctions) in the sub-figures are all aligned.

Figure 4 shows that the eigenvalues of increase ‘smoothly’ and using them as weights for a Lasso-type penalty is a way to regularize the oscillatory behavior of eigenfunctions of large index. However, it can be seen in panel (b) in Figure 4 that while the persistences of the eigenfunctions show an increasing trend with increasing eigenvalues, the persistences also show an overlaid periodic behavior. A similar behavior can be seen for the total variation, but with a much smaller periodic effects. The significant periodic behavior of the persistences means that eigenfunctions can have similar persistences, even if their eigenvalues are quite different. Vice versa, eigenfunctions with similar (or equal) eigenvalues can have quite different persistences. Indeed, within an eigenspace with eigenvalue , a penalty on the persistence is proportional to , therefore eigenfunctions with or small are more likely to be kept in the model. This effect is much less pronounced for the total variation penalty.

While the above already shows some differences between the two types of measures of regularity of functions, the following observation is perhaps even more relevant for our purposes. To this end, let us reconsider the example of the 2-dimensional pyramidal function shown in Fig. 1. We already observed above that the TV-penalty does not depend on the choice of To understand the behavior of the persistences of these functions, observe that the sub-level sets of the function are empty for levels , and then for , the sub-level sets have homology components of dimension 1, that all merge at . Therefore, the 1-persistence diagram of has points, all born at level 0 and dying at time . Thus, has a total 1-persistence of , which thus increase to infinity as . This is in stark contrast to the behavior of the total variation. In a similar fashion, the sub-level sets of the function have connected components from to that all merge at 0. Therefore, while it also has a total variation that does not depend on .

An example of a function where both and are of importance, consider the same discretization of the space as above, where this time we alternate between a pyramid of height and a reversed pyramid of height (see Figure 5). Similarly to the two previous cases, the persistence diagram of this function has points of coordinate (, 0) (for the 0-homology) and points of coordinate (0, ) (for the 1-homology). Therefore, its 0-persistence is equal to its 1-persistence, both equal to while the total variation here again does not vary with .

It is straightforward to build similar examples in higher dimensions and the effect will be even more striking: For instance, discretize the dimensional hypercube with cubes of size and consider a function increasing linearly towards the center of each cube until it reaches a maximum of . Such a function still has a total variation of order 1 no matter the value of , however, its -dimensional persistence will be of order .

The takeaway of the above discussion is as follows: Consider Table 1 which provides a summary of various measures of regularity of the pyramidal function . We see that if we penalize the supremum norm of this function, the penalty will have no effect as . If we try to penalize its Lipschitz constant or its total variation, the penalty will be the same, no matter the scaling and it will therefore have a very limited effect. In contrast to that, when penalizing the persistence, the effect of the penalty will become quite important as becomes small. Such a function is assimilated to noise as and we want to penalize it as much as possible, which is something that can be achieved by persistence but not by total variation.

### 3.3 Complexity of functions with bounded persistence: A negative result

When penalizing persistence, a natural question that immediately arises is to measure the complexity of the set of bounded persistence functions. Indeed, if this set has a large complexity (which is made precise below), seeking for a candidate function can be very challenging and no good upper-bound on the loss between and its estimation can be derived (see Mohri et al., 2018, Sections 3 and 11 for a more detailed exposition). For regression problems, a standard measure of the size of a set of functions is the so-called fat-shattering dimension introduced in (Kearns and Schapire, 1994). Let . A set of points is said to be -shattered if there exists thresholds such that for any subset , there exists a function such that if and if for all . The fat-shattering dimension of the class is then equal to the cardinality of the maximal -shattered set .

Note that the fat-shattering dimension of the class depends on the parameter . A class has infinite fat-shattering dimension if there are -shattered sets of arbitrarily large size. It is well-known that bounds on the fat-shattering dimension lead to bounds on the covering number and hence the metric entropy and Rademacher complexity of the function class. Furthermore, Bartlett et al. (1996) showed that a function class is learnable (in the sense of Bartlett et al., 1996, Definition 2) if and only if it has finite fat-shattering dimension. Unfortunately, in the case of bounded persistence functions, we have the following result:

Let

• If

• If , then if and otherwise.

First recall that if , persistence is equal to half the total variation and it is a known fact that the fat shattering dimension of the set of functions with total variation smaller than is equal to .

If , assume that (the other case is obvious). Consider a set of points in that form a regular gon. We denote by the points that have label (the others are ). We consider the function such that

and increases smoothly on . This function has a persistence smaller than and the set therefore -shatters this set of points no matter its labeling. Similar examples can be built for .

This observation highlights a challenge to overcome when constructing penalties involving topological persistence. Recall that for total variation penalized function classes, meaningful bounds on the metric entropy are well-known (Dutta and Nguyen, 2018). Hence, efficient learning over such function classes is possible. In contrast, efficient learning directly with the set of functions with bounded persistence seems to be not possible due to Theorem 3.3. This serves as our main motivation for our proposed penalties, in order to restrict the size of the set of candidate functions based on eigenbasis expansions.

### 3.4 Empirical eigenfunctions

For simple manifolds (a flat open space, a torus or a sphere for instance), computing the spectrum of the Laplace-Beltrami operator is analytically tractable. However, for general manifolds this is not possible. Moreover, in practical problems, the manifold itself maybe unknown. To deal with this, we take the standard empirical approach and build an undirected graph on the vertex set , with weights between the vertex and the vertex that are computed according to the ambient metric. In the following experimental study, we consider nearest neighbor and Gaussian-similarity based graphs. We denote by the unnormalized graph Laplacian matrix with degree matrix and weight matrix . The degree matrix is a diagonal matrix simply defined as

 Dii=n∑j=1Wij and Dij=0 if i≠j.

The normalized Laplacian matrix is then given by . Both the matrix and are symmetric positive semi-definite and therefore admit a basis of orthogonal eigenvectors. We will only focus on the normalized Laplacian since it provides slightly better convergence guarantees (von Luxburg et al., 2008). The use of these eigenvectors is justified by the fact that they converge to the true eigenfunctions of the Laplace-Beltrami operator in various metrics as the number of points tends to infinity and the scaling parameter of the graph tends to  (Koltchinskii, 1998; Trillos et al., 2020). A few estimated eigenfunctions based on nearest-neighbor graph Laplacian are plotted in Figure 6, for a point cloud lying on a torus.

We also remark that while we previously defined the persistence for true eigenfunctions, we can simply extend the definition for estimated eigenfunctions by considering where is the Voronoi cell centered on . We also use the notation . Finally, we also mention that using the spectrum of the Laplacian of a graph is a broadly developed idea to perform various statistical learning tasks such as regression or clustering; see, for example, Chun et al. (2016), Irion and Saito (2014), Ng et al. (2002) and von Luxburg et al. (2008).

## 4 Theoretical guarantees

We now provide theoretical guarantees for the proposed persistence regularization methodology. We remark here that the works of Brüel-Gabrielsson et al. (2020) or Chen et al. (2019) also propose related topological penalties but they do not provide theoretical performance guarantees.

###### Assumption 1

We assume the true regression satisfies one of the two assumptions below.

There exists with such that , where are the eigenfunctions of the Laplace-Beltrami operator with corresponding eigenvalue . In this case, we say that has a sparsity index of over the basis .

There exists such that is a Morse function.

A remark is in order regarding the sparsity assumption on in Assumption 1-A2. As discussed in Section 3.2, the relationship between the topological regularity of the eigenfunctions and their ordering is not known to be monotone in general, and it only exhibits a periodic trend. Hence, to maintain generality, we assume is a sparse linear combination of the eigenfunctions.

### 4.1 Theoretical guarantees for the Ω1 penalization

We are first interested in the properties of the penalty . This approach can simply be understood as a weighted Lasso with a random design. Since the Laplacian eigenfunctions form a basis of , the compatibility condition (van de Geer and Bühlmann, 2009) is verified and we have a fast rate of convergence.

Assume that satisfies Assumption 1-A1. Let be the minimizer of with penalty . Then there exists a constant that depends only on the manifold such that for

 μ≥2σ√pC(M)(ln(p)+x)n,

with probability larger than

 1−2e−x−e−0.15nC(M)p+ln(p),

we have

 1n∥X(^θ−θ⋆)∥22+μp∑i=1χ(Φi)|^θi−θ⋆i|≤μ2s2.

The proof of the above theorem is provided in Section 6. For the result in Theorem 4.1 to hold with large probability, the number of samples should be of order at least . Under those circumstances, if the trade-off parameter is chosen of order as this theorem suggests, it can be shown that the overall prediction error is of order , up to multiplicative constants (van de Geer and Bühlmann, 2009). According to Lemma 6.1 in Section 6, the use of a Laplace-Beltrami eigenbasis here translates into an additional multiplicative factor of as opposed to a standard Lasso with a design matrix satisfying the RIP conditions (Bühlmann and van de Geer, 2011).

In the case where we study an approximation of the Laplace-Beltrami eigenbasis by using the estimated eigenfunctions of the graph Laplacian, the design matrix can be chosen orthonormal. Under those circumstances, the Lasso has an explicit solution, which we illustrate next. Let a design matrix be built on the estimated eigenfunctions or the graph Laplacian eigenvectors. Then, the minimizer of the functional

 L′(θ)=∥Y−^Xθ∥22+μp∑i=1|θi|χ(^Φi), (3)

satisfies for all :

 ^θj=χ(^Φj)^ΦTjY(1−μχ(^Φj)2|^ΦTjY|)+. (4)

To see this, consider a new design matrix such that . Minimizing the functional in (3) is then equivalent to minimizing the functional

 L′(θ)=∥Y−^Xθ∥2+μ∥θ∥1,

by solving a standard Lasso problem. This function is convex in and therefore admits a global minimum (not necessarily unique) that we will denote by . Although is not differentiable because of the penalty, we can still write the optimality conditions in terms of its sub-differential. Specifically, we have the sub-differential to be

 ∂L′(θ)={−2^XT(Y−^Xθ)+μz:z∈∂∥θ∥1},

from which the optimality condition could be obtained as

 ^XT^X^θ=^XTY−μ2^z,

where is such that whenever and whenever Due to the orthogonal design, the term simplifies and a straightforward analysis of the possible cases given the sign or the nullity of , for all leads to the solution of given in (4) for all . We therefore have an explicit condition on the eigenbasis selection process, that is if and only if . Stated otherwise, an eigenvector with a high persistence has to explain the data significantly well to be kept in the model.

### 4.2 Theoretical guarantees for the Ω2 penalization

The penalty being non-convex, a thorough theoretical study seems more complicated. Nonetheless guarantees on the prediction error and on the persistence of the reconstruction can be derived, as described below.

Let satisfy Assumption 1-A2. Assume we observe where

are zero-mean sub-Gaussian random variables with parameter

. We further assume that the ’s are uniformly sampled on and that the ’s and the ’s are all independent. Then the estimated parameter verifies with probability larger than

 1−2e−x−exp(−0.1nC(M)p+ln(2p)),

that

 ∥θ⋆−^θ∥2≤16pσ2n[1+C(M)√2xn](1+√x)2+4C(M)p(2ν(f⋆)+ζ)2μ2.

Here a constant that depends only on the manifold and is the number of points in the persistence diagram of . In addition, we also have under the same hypotheses with :

The proof of the above theorem is provided in Section 6. This result holds with large probability if and only if the number of samples is at least of order . Choosing ensures that the trade-off term is of the same order as the main term, and we obtain rate of convergence of order for , which is what we can expect from such a model without any sparsity assumption. The second part of this theorem ensures the topological consistency of the reconstructed function , namely that it has a persistence that remains close to the persistence of the regression function and is therefore topologically smooth to some extent. Again choosing (as suggested above to keep a classical convergence rate for the parameter) leads to a consistency of the persistence of towards that of at a rate . We therefore need a larger sample size (of order at least ) to obtain consistency of the total persistence.

Note that according to Equation 6.14 from Polterovich et al. (2020), the number of features with persistence larger than some given value can be upper bounded by where is a constant that depends only on the metric of the manifold. This shows a connection between the topological smoothness developed in this paper and a more usual notion of smoothness.

### 4.3 Theoretical prospects

A first possible extension of the theoretical results presented in this section can be to generalize Theorem 4.2 to mis-specified models, that is, cases where the target function no longer belongs to but can be any function. To this end, a lower bound on the bias incurred with such a model for a function with fixed total persistence may be found in Proposition 2.1.1 from Polterovich et al. (2019) in the case of surfaces.

Let be a compact orientable Riemannian surface without boundary. Denote by the set of smooth functions over , such that for every , and . Then there exists a constant that only depends on and its metric such that for every Morse function ,

 inf{∥f−h∥∞|h∈Fλ}≥12(ν(f)+1)(χ(f)−κ(λ+1)).

This means that for a fixed , if the persistence of the target function is too large, it will be impossible to approximate it with eigenfunctions of corresponding eigenvalue smaller than , and in order to have a chance of approximating it, we will have to allow for more oscillating functions by letting increase. Balancing the estimation and approximation errors, based on the above result might lead to a data-driven choice for selecting .

The other possibility of extension would be to establish consistency results for the case of estimated eigenfunctions, based on the graph Laplacian approach. For example, in order to derive an oracle inequality similar to that of Theorem 4.2, we would need to establish the convergence of the persistence of the eigenvectors towards the persistence of the eigenfunctions. A potential approach is to leverage the stability results for persistence (for example, Lemma 2), and combine them with error rates for eigenfunction estimation (for example, recent results by Dunson et al. (2021) and Calder et al. (2021) for the empirical uniform norm). However, there are several technical challenges to overcome, in order to implement the above proof strategy. For example, it is not clear if the estimated eigenfunctions satisfy the regularity conditions required, for example, by stability results like Lemma 2. Furthermore, the results from Dunson et al. (2021) and Calder et al. (2021) are existence results, and do not resolve the inherent identifiability issue arising in eigenfunction estimation. In addition, the following two cases are to be distinguished:

• Semi-supervised setting: This corresponds to the case when there is an additional set of unlabeled observations uniformly and independently sampled over . In this case, the eigenfunctions could first be estimated using the above unlabeled observations and the regression coefficients could be subsequently estimated using the estimated eigenfunctions.

• Supervised setting: In this case, the same set of observations is used to estimate the eigenfunctions and the regression coefficients. Extra difficulties arise in this case due to the dependency in estimating the eigenfunctions and the regression coefficients using the same set of observations. We remark that this is the approach we take in the experiments as we discuss in Section 5.

## 5 Experimental results

### 5.1 Experimental design

We have applied the following experimental routine in order to estimate the function , given a set of points in that are assumed to live on a manifold of dimension and a vector of real responses

• Build a proximity graph on the data points . Many options are possible but we have chosen a nearest neighbor graph with .

• Compute the normalized Laplacian matrix of the graph.

• For a fixed , compute the first eigenvectors of the Laplacian matrix which yields a new design matrix in where each column is an eigenvector.

• For the penalty , compute the persistence of each eigenvector, divide each column of the design matrix by its persistence and solve a Lasso with cross-validation.

• For the penalty , we start from a random vector and perform a stochastic gradient descent of , where we compute the persistence of at each iteration, similarly to what is done in Carriere et al. (2021).

The method that has been the most efficient is to take for to perform a variable selection, using Lasso sparsity properties. We then perform a gradient descent of the loss function with penalty on the subset of eigenvectors previously selected. In the tables below, the results for the penalty are always understood to have been obtained this way.

The routine previously described works for the denoising problem where we have a label for each data point. If we are interested in prediction problems, where the set of covariates is split in two: one part with labels and one part without, we build the graph on all the data points since all the points are assumed to live on but we train the model only on the points for which we have a label at our disposal.

Numerically, to compute the persistence of a function for which we know the values at points , we build the Delaunay triangulation (Edelsbrunner and Harer, 2010) on the vertex set where the filtration value of a -dimensional simplex is equal to . Here, we have used the GUDHI library (Maria et al., 2014) to compute the persistence given this filtration.

In what follows, we will present the results of several experiments conducted both on synthetic and real data to investigate the relevance of our approach in practice. We compare our method to standard regression methods on manifolds: Kernel Ridge Regression (KRR), Nearest-neighbour regression (k-NN), Total variation penalty (TV) as well as graph Laplacian eigenmaps with a

penalty (Lasso) and a weighted Lasso where the weights are the total variation of each eigenvector, computed on the graph (Lasso-TV). The performance is measured in terms of root mean squared error between the estimated and the true functions at the data points. Its expression is given by

 RMSE(^f)= ⎷n∑i=1(f⋆(xi)−^f(xi))2n.

All hyperparameters have been tuned by cross validation or grid-search. The code used for data on a Swiss roll and the spinning cat in 2D is available

here.

### 5.2 Simulated data

#### 5.2.1 Illustrative example

In order to illustrate the behavior of the -penalization model, we first look at a synthetic setting where the function we try to reconstruct is a sum of 4 Gaussians, to which we add a large noise (Figure 3).

The persistence diagram of the true function to be estimated has four points in its 1-homology corresponding to the 4 cycles in each Gaussian, all being born at a neighboring saddle and dying at the corresponding local maximum, and one point in its 0 homology dying at infinity corresponding to the only connected component. The 0-homology points on the diagonal correspond to sampling noise. When a noise is added, the visual chaos of the observation is supported by its persistence diagram: it has a lot of features and the statistical noise added to the measurement here is converted into topological noise.

In Figure 7 we compare the results of a stochastic gradient descent penalizing against a Lasso estimation. The reconstruction is visually better and is also better in terms of MSE. Indeed, when penalizing the persistence, we have managed to keep the four most persistent one-dimensional features in the persistent diagram (although their persistence has diminished) and we still observe four peaks, while most of the noise has been removed. Although the Lasso enables some denoising, it has only been able to reconstruct 2 to 3 peaks and does not offer the same denoising performance. Note that it is also possible to consider a topological penalty where we penalize the persistence of the functions except the 4 most persistent points in 1-homology and the most persistent point in 0-homology like in Brüel-Gabrielsson et al. (2020). In this case, we can be more coarse in the choice of the trade-off since the penalty must then be set to 0. This shows that if one is given an a priori on the topology of the function to reconstruct, it can be included in the model very easily.

#### 5.2.2 Torus data

We simulate data on a torus which we recall is homeomorphic to , parametrized by the embedding :

 ⎧⎪⎨⎪⎩x1=(2+cos(θ))cos(ϕ),x2=(2+cos(θ))sin(ϕ),x3=sin(θ).

A function on the torus is identified to a function of . To set up the regression problem, we define the target response where

is the sigmoid function. Note that

is radial symmetric, depending only on the distance between and . The comparison of the RMSE for all methods can be found Table 2.

The persistence diagrams for a typical realization of each penalty can be found in Figure 8. The original function we try to estimate has only one component of dimension one corresponding to its peak in . The addition of a Gaussian noise on each entry adds a lot of points in the persistence diagram, mostly of dimension 1. In this case, it is hard to distinguish the true feature from the noise. Note that a standard Lasso performed on the graph Laplacian eigenbasis reduces the number of points in the persistence diagram and therefore the total persistence. However, there is still a significant number of points and the separation between the true feature and the noise cannot be performed visually. Although the Kernel ridge regression with an optimal bandwidth does not reduce the topological noise, its persistence diagram has one point significantly more persistent than the others, which suggests a faithful reconstruction from a topological point of view. The topological reconstruction has managed to both reduce the topological noise while keeping a single one-dimensional feature more persistent than the rest. Note that this feature is less persistent than the true feature which would suggest that this method induces a shrinkage of the function to be estimated and therefore of the persistence of its critical points.

#### 5.2.3 Data on a Swiss roll

We now consider data on a Swiss roll which is a two dimensional manifold parametrized by the mapping . We have set as target function

 f∗(x,y)=4exp(−((y−7)2/20+(x−6)2/5))+2cos2(x)sin2(y),

for . The function is plotted Figure 9.

Here, we observe that the penalty performs the best and benefits from the selection properties of the regularization with penalty . Data on a Swiss roll shows the limitations of Kernel methods when the number of points is low, as the geodesic metric can be very different from the ambient metric. The use of a NN graph is a solution to bypass this problem since at a small scale the two metrics are close. We see in Table 3 that a topology-based penalty is particularly efficient when the noise level becomes important. Although Kernel Ridge regression and Total Variation penalties provide a very good reconstruction and should be preferred when the noise is low, they become quite unstable as the noise level increases, whereas all Laplacian eigenmaps based methods as well as a k-NN regression are somehow robust to noise. Note that among all possible penalties on Laplacian eigenmaps based models, ran on the eigenfunctions selected by always yield the best results.

### 5.3 Real data

#### 5.3.1 Spinning cat in 1D

This model has also been tried on real data. We consider a data set of images of the same object rotated in space with increments of . The data lie on a one-dimensional submanifold of , the images having size pixels. We can see some of the images of the dataset Figure 10. For each image, we want to retrieve the angle of rotation in radians of the object. The source vector we want to estimate is therefore .

We have built a Gaussian weighted graph on the data points, using the ambient metric between images for the weights. Using a geodesic distance between images would probably yield better results, but the use of the ambient metric is enough to have convergence of the graph Laplacian eigenvectors towards the eigenfunctions of the Laplace-Beltrami operator on the manifold according to Trillos et al. (2020).

We have tried different values of the scaling parameter in the Gaussian weights

We depict in Figure 11 the aspect of some eigenfunctions as a function of the rotation degree of the baseline image for and . Here, we can see that the value is too small: indeed, is supposed to tend to 0 so that the eigenvectors converge to the true eigenfunction, but this limit is supposed to be at a certain speed with respect to according to Trillos et al. (2020). Here, we only have a small number of data at hand (), and therefore, the value visually seems to be working better. For a larger index, the eigenfunctions start to be localized around an image of the manifold, reminding a wavelet-type basis.

Unlike the synthetic experiments, where we have only focused on denoising, here we are interested in the prediction properties of the method. To this extent, the data are randomly split between a training set and a validation set. The graph is then built on the whole dataset (since the data points are all assumed to lie on the same manifold). The optimization procedure is then only performed on the labels from the training set, and we measure the mean square error between the prediction on the new points and the true values from the validation set.

We see here that although an approach based on the penalization of the topological persistence yields results a lot better than a or total variation penalty, it is still outperformed by a kernel ridge regression. This might be due to the small number of data available on which to build the graph or the fact that the topology of the manifold as well as the topology of the source function are very simple and do not benefit fully from the method presented in this paper. We will see in the next subsection a set-up that shows the appeal of such a penalty.

#### 5.3.2 Spinning cat in 2D

We consider the data set from the previous subsection where in addition each image is rotated in another direction by increments of . We thus have at hand a two-dimensional manifold with the homotopy type of a torus, where the two parameters are the degree of rotation of the object within the image and the degree of rotation of the image itself . A few images of the dataset are plotted in Figure 12. We consider the same target response as in the Subsection 5.2.2:

 f∗(θ,ϕ)=ξ[−17(√(θ−π)2+(ϕ−π)2−0.6π)].

We add an i.i.d. Gaussian noise with standard deviation

to the input. We select a random subset of 1000 images over the dataset and randomly split it into a training set and a testing set. The results can be found Table 5.

As opposed to a KRR regression, topology based penalties ( and ) seem to offer a good reconstruction when the number of training data becomes smaller, which shows the benefits of our method for data in a large dimensional space with a small intrinsic dimension.

#### 5.3.3 Electrical consumption dataset

We have tried our method on a dataset available here. The covariates are curves of temperature in Spain, averaged over a week. There is a measurement for each hour of the day, thus each curve can be seen as a point of . However, the possible profiles of temperature are very limited (namely, each curve is increasing towards a maximum in the afternoon and is then decreasing), it is therefore believed that the data lie on a manifold of smaller dimension. For each week, we try to predict the electrical consumption in Spain in GW. Like in the previous experiments, the data are randomly split between a training and a testing set, and the graph Laplacian is built on all the available data. The results can be found Table 6.

On this dataset, we notice once again that a small training set is not really damaging towards topological methods as opposed to more standard methods, illustrating once again the generalization properties of the penalties and . Here, all methods act relatively similarly as they all predict well the electrical consumption over a week when it takes values around its mean (see Figure 13). However, sometimes the electrical consumption over a week reaches a peak and all methods fail to capture the extreme values of the source function. It makes perfect sense here that the knowledge of the temperature alone fails to explain perfectly the global electrical consumption over an entire country, without taking into account any other covariates.

### 5.4 Conclusion of the experiments

The methods developed in this paper couple the use of a graph Laplacian eigenbasis and a topological penalty. The first aspect enables to deal with data living on manifolds with an extrinsic approach: nothing needs to be known on the manifold and its metric, the graph Laplacian being computed on the ambient metric. Laplace eigenmaps methods in general have proven to be useful when the manifold structure is quite strong, illustrated here with the experiment on a Swiss roll. The use of a topological penalty has multiple advantages: it acts as a generalization of total variation penalties in higher dimensions and seems to be a more natural way to regularize functions, as observed in Section 2. Numerically, topological methods almost always provide better results than usual penalties such as TV or penalty. Here, we have developed two types of topological penalties: aims at performing a selection process of the regression basis functions, by discarding the ones that oscillate too much in order to allow a good generalization to new data. On the other hand directly acts on the topology of the source function and performs a strong denoising. Note that the statistical noise on the data translates into a topological noise on the persistence diagram of the corresponding function which is the one the regularization acts onto, by providing a powerful simplification of the persistence diagram of the reconstructed function. When the underlying geometric structure is complex or the noise is important, regularization on the eigenfunctions selected by provides a better reconstruction in terms of RMSE than standard methods, including KRR which appears to be the most competitive one.

We have essentially penalized the total persistence of functions, but it would be possible to numerically penalize any smooth function on the points of the persistence diagram. In particular, establishing a penalty that would erase all low-dimensional persistence features while keeping all the high-dimensional features could be of practical interest, likewise to the smoothly clipped absolute deviation (SCAD) penalty developed by Fan and Li (2001).

One major drawback of topological methods is their computational cost as opposed to a simple KRR regression, especially when we need to compute topological persistence of high dimensions.

## 6 Proofs for Section 4

### 6.1 Proof of Theorem 4.2

We start by the proof of Theorem 4.2 since it introduces a number of lemmas and results that will also be useful to prove the other theorems. We start with a technical lemma based on the celebrated Weyl’s Law (Chavel, 1984; Ivrii, 2016), reformulated for statistical regression purposes in Barron et al. (1999). It provides a control of the sup-norm of the sum of the squares of the first Laplacian eigenfunctions.

Let be a compact Riemannian manifold of dimension and its Laplace-Beltrami operator. Let be the eigenvalues of , and for an eigenvalue , we denote by a normalized eigenfunction. Then there exists constants and depending only on the geometry of (with also potentially depending on the dimension) such that for all ,

 ∥∥ ∥∥p∑i=1Φ2i∥∥ ∥∥∞≤C′(M)λd/2p≤C(M)p,

where the last inequality follows by Weyl’s law. We will also need the following concentration result, itself using Lemma 6.1. Recall that a random variable is called sub-Gaussian with parameter if for all . Denote for every , the tuple . Let be i.i.d sub-Gaussian random variables with parameter , independent of , and let . Then, with probability larger than ,

 ∥∥ ∥∥1nn∑i=1εiΦ(Xi)∥∥ ∥∥≤2σ√pn(1+2√x)√1+C(M) √2xn.

(of Lemma 6.1). Our proof is based on a non-standard result by Kontorovich (2014), on concentration of Lipschitz functions of not necessarily bounded random vectors, which we briefly introduce next.

For , let be random objects living in a measurable space with metric , endowed with measure . Define to be living on the product probability space with product measure , and metric . To each , we also associate symmetrized random objects , where are independent, and are Rademacher random variables (i.e. taking values with probability 1/2) independent of . We also define as the sub-Gaussian diameter of , given by the smallest value of for which we have for all . Then using (Kontorovich, 2014, Proof of Theorem 1), we have for the random object as above, and a -Lipschitz function , that

 P(φ(Z)−Eφ(Z)>t)≤exp(−t22n∑i=1Δ2SG(Zi)). (5)

In our context, for all , we set , and . For , we define the function as

 φ:(v1,…,vn)↦∥∥ ∥∥n∑i=1vi∥∥ ∥∥.

Note that, for two finite sets of vectors and , we have

 ∣∣∥∥∑vi∥∥−∥∥∑wi∥∥∣∣≤∑∥vi−wi∥.

Hence, the function is Lipschitz with respect to the mixed norm of , meaning that we will be able to apply Kontorovich’s result.

Now, considering we proceed by conditioning on , that is, we consider randomness only with respect to , which are independent of the ’s; will denote the corresponding conditional expectation. First, note that we have the following bound on the (conditional) expectation:

 E</