Persistent homology is a central tool in Topological Data Analysis (TDA) that allows to efficiently infer relevant topological features of complex data in a descriptor called persistence diagram. It found many applications in Machine Learning (ML) where it initially played the role of a feature engineering tool through the direct use of persistence diagrams or through dedicated ML architectures to handle them—see, e.g. [hofer2017deep, umeda2017time, carriere2020perslay, dindin2020topological, kim2020efficient]
. For the last few years, a growing number of works have been using persistence theory in different ways in order to, for instance, better understand, design and improve neural network architectures—see, e.g.[rieck2018neural, moor2019topological, carlsson2020topological, gabrielsson2019exposition]
—or design regularization and loss functions incorporating topological terms and penalties for various ML tasks—see, e.g.[chen2019topological, hofer2019connectivity, clough2020topological]. These new use cases of persistence generally involve minimizing functions that depend on persistence diagrams. Such functions are in general highly non-convex and not differentiable, and thus their theoretical and practical minimization can be difficult. In some specific cases, persistence-based functions can be designed to be differentiable and/or some effort have to be made to compute their gradient, so that standard gradient descent techniques can be used to minimize them—see e.g. [wangtopogan, poulenard2018topological, bruel2020topology]. In the general case, recent attempts have been made to better understand their differential structures [leygonie2019framework]
, and, building on the powerful tools provided by software libraries such as PyTorch or TensorFlow, practical methods that allow to encode and optimize a large family of persistence-based functions have been proposed and experimented[gabrielsson2020topology, solomon2020fast]. However, in all these cases, the algorithms used to minimize these functions do not come with theoretical guarantees of convergence to a global or local minimum.
The aim of this note is to show that the algorithm to compute persistence diagrams has a simple combinatorial structure that allows us to easily implement and use stochastic subgradient descent in a general framework. This gives convergence guarantees for a wide class of functions that includes almost all persistence-based functions in the literature. More precisely,
we show that the map converting a filtration over a given simplicial complex into a persistence diagram is semi-algebraic with very easy to compute (sub-)differential;
building on the recent work [subgradient], we show that under mild assumptions, stochastic subgradient descent algorithms applied to functions of persistence converge almost surely to a critical point of the function;
we provide a simple corresponding Python implementation for minimizing functions of persistence that is publicly available at https://github.com/MathieuCarriere/difftda (and that is going to be part of the Gudhi library [maria2014gudhi] soon), and we illustrate it with a few examples from the literature.
The article is organized as follows. In section 2, we introduce notations and recall basic facts about filtrations and the persistence computation algorithm. In section 3, we study the differentiability of persistence-based maps defined in o-minimal structures, and we also provide a few examples showing that almost all persistence-based maps encountered in practice belong to such structures. In section 4, we consider the minimization problem of a persistence-based functional. In section 5, we provide simple implementation for persistence-based function minimization and illustrate it on a few examples. Finally, we conclude in section 6.
2 Filtrations and persistence diagrams
In this section, we show that, given a finite simplicial complex , the map that associates to a filtration of
its persistence diagram can be seen as a simple permutation of the coordinates of a vector containing the filtration values.
2.1 Simplicial complexes and filtrations
Recall that given a set , a simplicial complex is a collection of finite subsets of that satisfies
for any ,
if and then .
An element with is called a -simplex.
Given a simplicial complex and a subset of , a filtration of is an increasing sequence of subcomplexes of with respect to the inclusion—i.e., for any —such that .
To each simplex , one can associate its filtering index . Thus, when is finite, a filtration of can be conveniently represented as a filtering function . Equivalently, it can be represented as a -dimensional vector in whose coordinates are the indices of the simplices of and that satisfies the following condition: if and , then . As a consequence, if the vectorized filtration depends on a parameter, the corresponding family of filtrations can be represented as a map from the space of parameters to in the following way.
Let be a finite simplicial complex and a set. A map is said to be a parametrized family of filtrations if for any and with , one has .
2.2 Computation of persistence diagrams from filtrations
In this section, we briefly recall how the computation of the persistence diagram of a filtered simplicial complex decomposes into: (1) a purely combinatorial part only relying on the order on the simplices induced by the filtration, and (2) a part relying on the filtration values. For a complete introduction to persistent homology and its computation, we refer the reader to standard articles and textbooks, such as [zomorodian2005computing, edelsbrunner2010computational, boissonnat2018geometric].
Let be a finite simplicial complex endowed with a filtration and corresponding filtering function . We regard as a vector in , where is the number of non-empty simplices of . The computation of the persistence diagram can be split into two steps.
First part: combinatorial part (persistence pairs).
The filtering function induces a preorder on the elements of as follows: if . This preorder can be refined into a total order in some arbitrary way (e.g., by indexing the simplices, and using the lexicographic order to break ties). Note that the choice of this refinement does not change the output of the persistence computation, as long as it is deterministic and consistent. The basic algorithm to compute persistence then iterates over the ordered set of simplices in the following way (see [boissonnat2018geometric, Section 11.5.2] for a complete description of the algorithm).
Note that for each dimension , some -dimensional simplices may remain unpaired at the end of the algorithm; their number is equal to the -dimensional Betti number of .
Second part: associate filtration values.
The persistence diagram of the filter function is now obtained by associating to each persistent pair the point . Moreover, to each unpaired simplex is associated the point .
If is the number of persistence pairs and is the number of unpaired simplices, then and the persistence diagram of the filtration of is made of points in (counted with multiplicity) and points (also counted with multiplicity) with infinite second coordinate. Choosing an order (e.g., the lexicographical order) on , the persistence diagram can be represented as a vector in and the output of the persistence algorithm can be simply seen as a permutation of the coordinates of the input vector . Moreover, this permutation only depends on the order on the simplices of induced by .
The subset of points of a persistence diagram with finite coordinates is called the regular part of and is denoted by . The subset of points of with infinite second coordinate is called the essential part of and is denoted by .
With the notations defined above, and can be represented as vectors in and , respectively.
Note that the above construction is usually done dimension by dimension, leading to a single persistence diagram for each dimension in homology. Indeed, in order to obtain a persistence diagram in dimension , it suffices to restrict to the subset of simplices of dimension and . Without loss of generality, and to avoid unnecessary heavy notations, in the remaining of this article, we consider the whole persistence diagram, made of the union of the persistence diagrams in all dimensions .
3 Differentiability of persistence maps defined in o-minimal structures
In most cases encountered in practice, the parametrized families of filtrations that are used are definable in o-minimal structures. Moreover, given a finite simplicial complex , the persistence map that associates its persistence diagram to each filtration on is semi-algebraic. As a consequence, its composition with a parametrized family of filtrations is definable and benefits from interesting differentiability properties.
3.1 Background on o-minimal geometry
In this section, we briefly recall some elements of o-minimal geometry, which are needed in the next sections of this article. For a more detailed introduction, we refer to standard textbooks, such as [coste2000introduction].
[o-minimal structure] An o-minimal structure on the field of real numbers is a collection , where each is a set of subsets of such that:
is exactly the collection of the finite unions of points and intervals;
all algebraic subsets of are in ;
is a Boolean subalgebra of for any ;
if and , then ;
if is the linear projection onto the first coordinates and , then .
An element for some is called a definable set in the o-minimal structure. For a definable set , a map is said to be definable if its graph is a definable set in .
Definable sets are stable under various geometric operations. The complement, closure and interior of a definable set are definable sets. The finite unions and intersections of definable sets are definable. The image of a definable set by a definable map is itself definable. Sums and products of definable functions as well as compositions of definable functions are definable—see [coste2000introduction, Section 1.3]. In particular, the and of finite sets of real-valued definable functions are also definable. An important property of definable sets and definable maps is that they admit a finite Whitney stratification. This implies that (1) any definable set can be decomposed into a finite disjoint union of smooth submanifolds of and (2) for any definable map , can also be decomposed into a finite union of smooth manifolds such that the restriction of on each of these manifolds is a smooth function.
The simplest example of o-minimal structures is given by the family of semi-algebraic subsets of the spaces . Although most of the parametrized families of filtrations encountered in practice are semi-algebraic, the o-minimal framework actually allows one to consider families that mix exponential functions with semi-algebraic functions [wilkie1996model].
3.2 Persistence diagrams of definable parametrized families of filtrations
Let be a finite simplicial complex and be a definable (in a given o-minimal structure) parametrized family of filtrations. If for any , the preorders induced by and on the simplices of are the same, i.e., for any , if and only if , then the pairs of simplices , and the unpaired simplices that are computed by the persistence algorithm (see algorithm 1) are independent of . As a consequence, the persistence diagram of is
Given a total order on , the points of any finite multi-set with points in and points in can be ordered in non-decreasing order, and can be represented as a vector in . In the following, we assume that such an order is given and fixed. As a consequence, denoting by the set of vectors in defining a filtration on , the persistence map that assigns to each filtration of its persistence diagram consists in a permutation of the coordinates of . This permutation is constant on the set of filtrations that define the same preorder on the simplices of . This leads to the following statement.
Given a finite simplicial complex , the map is semi-algebraic, and thus definable in any o-minimal structure. Moreover, there exists a semi-algebraic partition of such that the restriction of to each element of this partition is a Lipschitz map.
As is finite, the number of preorders on the simplices of is finite. Let be a preorder on simplices of , induced by some equalities and inequalities between the coordinates of . Then, the set of filtrations such that gives rise to a preorder equal to is a semi-algebraic set. It follows that is a semi-algebraic set that admit a semi-algebraic partition such that the restriction of the persistence map to each component is a constant permutation.
Now, from the stability theorem for persistence [chazal2012structure], the persistence modules induced by two filtrations are -interleaved where is the sup norm of the vector . As a consequence, the restriction of to each component of the above mentioned semi-algebraic partition of is clearly -Lipschitz with respect to the sup norm in . ∎
Since there exists a finite semi-algebraic partition of on which is a locally constant permutation, the subdifferential of is well-defined and obvious to compute: each coordinate in the output (i.e., the persistence diagram) is a copy of a coordinate in the input (i.e., the filtration values of the simplices). This implies that every partial derivative is either 1 or 0. The output can be seen as a reindexing of the input, and this is indeed how we implement it in our code, so that automatic differentiation frameworks (PyTorch, TensorFlow, etc.) can process the function directly and do not need explicit gradient formulas—see section 5.
Let be a finite simplicial complex and be a definable (in a given o-minimal structure) parametrized family of filtrations. The map is definable.
It also follows from standard finiteness and stratifiability properties of definable sets and maps that is differentiable almost everywhere. More precisely, we have the following result.
Let be a finite simplicial complex and a definable parametrized family of filtrations. Then there exists a finite definable partition of , such that
for any , is a definable manifold of dimension and is differentiable.
3.3 Examples of definable families of filtrations
Below we provide a few examples of common families of filtrations that turn out to be semi-algebraic or definable in a more general o-minimal structure.
[Vietoris-Rips filtrations] The family of Vietoris-Rips filtrations built on top of sets of points is the semi-algebraic parametrized family of filtrations
where is the simplicial complex made of all the faces of the -dimensional simplex, defined, for any and any simplex spanned by a subset , by
One can easily check that the permutation induced by is constant on the connected components of the complement of the union of the subspaces over all the -tuples such that at least of the indices are distinct.
This example naturally extends to the case of Vietoris-Rips-like filtrations in the following way. Let be the set of symmetric matrices with non-negative entries and on the diagonal111Note that this set contains the set of matrices of pairwise distances between sets of points in metric spaces.. This is a semi-algebraic subset of the space of -by- matrices , of dimension . The map defined by
for any and any simplex spanned by a subset , is a semi-algebraic family of filtrations. Note that the set of section 3.2 can be chosen to be the -dimensional semi-algebraic set of matrices with at least entries that are equal.
Weighted Rips filtrations are a generalization of Vietoris-Rips filtrations where weights are assigned to the vertices of the simplicial complex. Given a set of points and a function , the family of weighted Rips filtrations associated to is defined for any simplex spanned by by
, if , ;
As Euclidean distances between points are semi-algebraic, as well as the max function, it is clear from this definition that a family of weighted Rips filtrations is definable as soon as the weight function is definable.
This example easily extends to the case where the weight function depends on the set of points : the weight at vertex is defined by with . Again, the resulting family of filtrations is definable as soon as is definable. A particular example of such a family is given by the so-called DTM filtration [anai2020dtm], where is the average distance from to its -nearest neighbors in . In this case, is semi-algebraic, and the family of DTM filtrations is semi-algebraic.
The o-minimal framework also allows to consider weight functions involving exponential functions, such as, for instance, kernel-based density estimates with Gaussian kernel. This is a consequence of the fundamental result of[wilkie1996model] in o-minimal geometry.
[Sublevel sets filtrations] Let be a finite simplicial complex with vertices . Any real-valued function defined on the vertices of can be represented as a vector . The family of sublevel sets filtrations of functions on the vertices of is defined by
for any and any simplex spanned by a subset . This filtration is also known as the lower-star filtration of . The function is obviously semi-algebraic, and for section 3.2 it is sufficient to choose .
4 Minimization of functions of persistence
Using the same notation as in the previous section, recall that the space of persistence diagrams associated to a filtration of is identified with , where each point in the copies of is a point with finite coordinates in the persistence diagram and each coordinate in is the -coordinate of a point with infinite persistence.
A function is called a function of persistence if it is invariant to permutations of the points of the persistence diagram: for any and any permutations of the sets and , respectively, one has
It follows from the persistence stability theorem [chazal2012structure] that if a function of persistence is locally Lipschitz, then the composition is also locally Lipschitz.
If is a definable function of persistence, then for any definable parametrized family of filtrations , the composition is also definable. As a consequence, it has a well-defined subgradient. Building on a recent result from [subgradient], the aim of this section is to show that stochastic subgradient descent applied to converges almost surely to a critical point under mild conditions on . We show that these conditions are satisfied for a wide variety of examples commonly encountered in Topological Data Analysis.
4.1 Stochastic gradient descent
To minimize , we consider the differential inclusion
whose solutions are the trajectories of the gradient of . They can be approximated by the standard stochastic subgradient algorithm given by the iterations of
where denotes the subgradient of , the sequence is the learning rate, and
is a sequence of random variables. In[subgradient], the authors prove that under mild technical conditions on these two sequences, the stochastic subgradient algorithm converges almost surely to a critical point of as soon as is locally Lipschitz.
More precisely, consider the following assumptions (which correspond to Assumption C in [subgradient]):
for any , , and ;
, almost surely;
denoting by the increasing sequence of -algebras , there exists a function which is bounded on bounded sets such that almost surely, for any ,
These assumptions are standard and not very restrictive. Assumption (1) depends on the choice of the learning rate by the user and is easily satisfied, e.g., taking . Assumption (2) is usually easy to check for most of the functions encountered in practice (see below). Assumption (3) is a standard condition, which states that, conditioned upon the past, the variables’s and ’s.
Under these assumptions (1), (2), and (3), the following result is an immediate consequence of Corollary 5.9 in [subgradient].
Let be a finite simplicial complex, , and a parametrized family of filtrations of that is definable in an o-minimal structure. Let be a definable real-valued function defined on the space of persistence diagrams such that is locally Lipschitz. Then, under the above assumptions (1), (2), and (3), almost surely the limit points of the sequence obtained from the iterations of eq. 8 are critical points of and the sequence converges.
The above theorem provides explicit conditions ensuring the convergence of stochastic gradient descent for functions of persistence. The main criterion to be checked is the local Lipschitz condition for. From section 4, it is enough to check that and are Lipschitz. Regarding , while it is obvious for Rips, Čech, DTM, etc. filtrations on finite point clouds, it is not as naturally satisfied by the alpha-complex filtration, where 3 points with diameter 1 can create an edge and a triangle with an arbitrarily large filtration value. However, such large filtration values only ever participate in persistence intervals of length 0 (points on the diagonal of the diagram): indeed, the alpha-complex filtration and the Čech filtration have the same persistence diagram, if we ignore the diagonal. Filtering out 0-length intervals from the alpha-complex thus gives a locally Lipschitz function and can be seen as a technical detail of computing the Čech filtration.
4.2 Examples of locally Lipschitz functions of persistence
Now, regarding the function , many functions of persistence considered in the literature are Lipschitz or locally Lipschitz, as illustrated by the following, non-exhaustive, list of examples.
[Total persistence] The total persistence function is the sum of the distances of each point of a persistence diagram with finite coordinates to the diagonal: given a persistence diagram represented as a vector in , ,
Thus, is obviously semi-algebraic, and thus definable in any o-minimal structure. It is also Lipschitz.
[Wasserstein and bottleneck distance] Given a persistence diagram , the bottleneck distance between the regular part of a diagram and the regular part of (see section 2.2) is given by
where, denoting the diagonal in , a matching is a subset of such that every point of and , appears exactly once in . One can easily check that the map is semi-algebraic, and thus definable in any o-minimal structure. It is also Lipschitz. This property also extends to the case where the bottleneck distance is replaced by the so-called Wasserstein distance with [cohen2010lipschitz]. Optimization of these functionals and other functionals of bottleneck and Wasserstein distances have been used, for example, in shape matching in [poulenard2018topological], see also the example on 3D shape in section 5.
[Persistence landscapes [bubenik2015statistical]] To any given a point with and , associate the function defined by
Given a persistence diagram , the persistence landscape of is a summary of the arrangement of the graphs of the functions , :
where is the th largest value in the set, or when the set contains less than points. An interesting property of persistence landscapes is that the map from the space of persistence diagrams endowed with the bottleneck distance to the vector space of real-valued functions on endowed with the sup norm is -Lipschitz—see [bubenik2015statistical] or [chazal2015subsampling, Lemma 1].
Given a positive integer , a finite set , and a finite simplicial complex , the map that associates the vector to each persistence diagram of a filtration of is clearly semi-algebraic. As a consequence, section 4.1 applies to any locally Lipschitz definable function composed with the map . Functions of landscapes have been used, for example, to design topological layers in neural networks [kim2020efficient].
[Persistence surfaces [adams2017persistence]] Given a weight function and a real number , the persistence surface associated to a persistence diagram is the function defined by
The definition of only involves algebraic operations, the weight function , and exponential functions. Thus, it follows from [wilkie1996model] that if is a semi-algebraic function, is a finite simplicial complex, and is a finite set of points in , then the map that associates the vector to each persistence diagram of a filtration of is definable in some o-minimal structure. In [adams2017persistence], the authors provide explicit conditions under which the map from persistence diagrams to persistence surfaces is Lipschitz.
Persistence surfaces belong to the set of linear representations of persistence diagrams [chazal2018density], which includes many other ways to vectorize persistence diagrams. In [divol2019understanding], the authors give explicit conditions for such representations to be locally Lipschitz.
5 Implementation and numerical illustrations
We showed in sections 4 and 3 that the usual stochastic gradient descent procedure of eq. 8 enjoys some convergence properties for persistence diagrams and persistence-based functions. This means in particular that the algorithms available in standard libraries such as TensorFlow, which implement stochastic gradient descent among other optimization methods, can be leveraged and used as is for differentiating persistence diagrams, while still ensuring convergence. The purpose of this section is to demonstrate that our code, based on Gudhi and TensorFlow, can be readily used for several different persistence optimization tasks that were presented in the literature. Our code is publicly available online222https://github.com/MathieuCarriere/difftda, as well as some other code for generating the images presented below. In the following, we describe a few optimization tasks that were studied in Topological Data Analysis.
Point cloud optimization.
A toy example in topological optimization is to modify the positions of the points in a point cloud so that its homology is maximized [gabrielsson2020topology, gameiro2016continuation]. In this paragraph, we present two numerical experiments in this vein.
In our first experiment, we start with a point cloud sampled uniformly from the unit square , and optimize the point coordinates, so that the loss is minimized, where is a topological penalty, is the 1-dimensional persistence diagram associated to the Vietoris-Rips filtration of , and stands for the projection onto the diagonal , and where is a penalty term ensuring that the point coordinates stay in the unit square. Using this loss ensures that holes are created within the point cloud so that the cardinality of the persistence diagram is as large as possible, while constraining the points to stay in a fixed region of the Euclidean plane. Another effect of the penalty is to flatten the boundary of the created holes along the boundary of . See the first row of fig. 1. It is important to note that without the penalty , convergence is very difficult to reach since inflating the point cloud with dilations can make the topological penalty arbitrarily small.
In our second experiment, we start with a noisy sample
of the circle with three outliers and use the loss, where stands for the Wasserstein distance (see section 4.2), is the 0-dimensional persistence diagram associated to the Vietoris-Rips filtration of , and is the 0-dimensional persistence diagram associated to the Vietoris-Rips filtration of a clean (i.e., with no noise nor outliers) sample of the circle. See the second row of fig. 1. It is interesting to note that when one does not use extra penalties, optimizing only topological penalties can lead to funny effects: as one can see on the lower middle of fig. 1, the circle got disconnected, and one of the outliers created a small path out of the circle during optimization.
Another task presented in [gabrielsson2020topology] is related to image processing. In this experiment, we optimize the 0-dimensional homology associated to the pixel values of a digit binary image with noise (see fig. 2, upper left). Since noise can be detected as unwanted small connected components, we use the loss , where is the total persistence penalty, is the finite 0-dimensional persistence diagram of the cubical complex associated to , and is a penalty forcing the pixel values to be either or . As can be seen from fig. 2, using the two penalties is essential: if only is used, the noise is amplified (lower left), and if only is used, the noise does not quite disappear, and paths are created between the noise and the central digit to ensure the corresponding connected components are merged right after they appear (lower middle). Note that this funny behavior which appears when penalizing topology alone is similar to what was observed in experiments where persistence was used to simplify functions [attali2009persistence]. Using both penalties completely remove the noise (lower right) in two steps: firstly topology is optimized, and then the paths created by optimizing are removed by optimizing . This two step process can also be observed on the loss function (upper middle) and the sequence of optimized persistence diagrams of the image (upper right).
3D shape processing.
In [poulenard2018topological], persistence optimization is used for modifying functions defined on 3D shapes. More specifically, given a 3D shape , one is interested in optimizing a function , defined on the vertices of , so that the Wasserstein distance between the persistence diagram associated to and is minimized, where is a target persistence diagram (which either comes from another function , or is defined by the user). In this experiment, we minimize the loss , where , that is, the Wasserstein distance between the 0-dimensional persistence diagram associated to the sublevel set of a function —initialized with the height function, see fig. 3 (upper left)—of a human 3D shape, and a target persistence diagram which only contains a single point with the same coordinates as the point in the persistence diagram of the height function of the shape corresponding to the right leg. This makes the function values to force the two points in corresponding to the hands of to go to the diagonal, by creating paths between the hands and the hips (upper middle). It is worth noting that these path creations come from the fact that we only used a topological penalty in the loss: in [poulenard2018topological]
, the authors ensure smoothness of the resulting function by forcing it to be a linear combination of a first few eigenfunctions of the Laplace-Beltrami operator on the 3D shape. We also display the sequence of optimized persistence diagrams infig. 3 (lower row), from which it can be observed that the optimization is piecewise-linear, which reflects the fact that the persistence map has an associated semi-algebraic partition, as per section 3.2.
Our last experiment comes from [gabrielsson2020topology]
, in which the authors use persistence optimization in a linear regression setting. Given some datasetand ground-truth associated values computed as , where is the vector of ground-truth coefficients and is some high-dimensional Gaussian noise, one can leverage some prior knowledge on the shape of to penalize the coefficients with bad topology. In particular, when using with three peaks, as in fig. 4 (left), we use the loss , where , is the 0-dimensional persistence diagram of the sublevel sets of , minus the three most prominent points, is the usual total variation penalty (which can also be interpreted as a topological penalty as it corresponds to the total persistence of the so-called extended persistence [cohen2009extending] of the signal), and is the usual mean-square error (MSE). We optimized with the MSE alone, then MSE plus total variation, then MSE plus total variation plus topological penalty, and we generated new MSE values from new randomly generated test data, see fig. 4 (right). It is interesting to note that using all penalties lead to the best result: using MSE alone leads to overfitting, and adding total variation smooths the coefficients a lot since is initialized with random values. Using all penalties ends up being a right combination of minimizing the error, smoothing the signal, and getting to the right shape of .
In this article, we demonstrated that all previous works for optimizing topology in the literature could be incorporated into our theoretical framework. In particular, we obtained convergence results for very general classes of functions with topological flavor computed with persistence theory, and provided corresponding code that one can use to reproduce previously introduced topological optimization tasks. For future work, we are planning to further investigate tasks related to classifier regularization in ML[chen2019topological], and to improve on computation time using, e.g., vineyards [Cohen-Steiner2006].