Algebraic topology provides a promising framework for extracting nonlinear features from finite metric spaces via the theory of persistent homology [17, 26, 28]. Persistent homology has solved a host of data-driven problems in disparate fields of science and engineering; examples include signal processing, proteomics, cosmology, sensor networks, molecular chemistry and computer vision. The typical output of persistent homology computation is called a barcode, and it constitutes a finite topological invariant of the coarse geometry which governs the shape of a given point cloud.
For the purposes of this introduction, it suffices to think of a barcode as a (multi)set of intervals, each identifying those values of a scale parameter at which some topological feature (such as a connected component, a tunnel, or a cavity) is present when the input metric space is thickened by that scale. A central advantage of persistent homology is its remarkable stability theorem [9, Ch. 5.6]. This result asserts that the map which assigns barcodes to finite metric spaces is 1-Lipschitz when its source and target are equipped with certain natural metrics.
Persistence paths and signature features
Notwithstanding their usefulness for certain tasks, barcodes are notoriously unsuitable for standard statistical inference because the space $\mathbf{Bar}$ of barcodes is itself a nonlinear metric space, and most scalable learning algorithms rely on linear methods. In this work, we construct a feature map of the form
$$\Phi : \mathbf{Bar} \to T(V),$$
where $T(V)$ denotes the tensor algebra of a linear space $V$. The feature map is defined as the composite, $\Phi = S \circ \iota$, of a persistence path embedding $\iota$ and the path signature $S$,
$$\mathbf{Bar} \xrightarrow{\ \iota\ } \mathrm{BV}(V) \xrightarrow{\ S\ } T(V),$$
where the intermediate space $\mathrm{BV}(V)$ contains all continuous maps of bounded variation.
- Persistence path embedding :
The maps disambiguate our ’s. There are many such embeddings, and they differ significantly in terms of stability, computability, and discriminative power.
- Signature features :
The map represents a path as its -valued signature. This map is injective (modulo natural equivalence classes of paths), provides a hierarchical description of a path, and has a rich algebraic structure that captures natural operations on paths, such as concatenation and time reversal.
The concept of a persistence path embedding reflects the interpretation of persistent homology as a dynamic description of the topological features which appear and disappear as a metric space is thickened across various scales.
There is precedent for such constructions; e.g. Bubenik’s landscapes  can be reformulated to give an important example of such an , which we denote with .
Despite their intuitive appeal, these approaches rely ultimately on a choice of feature map for paths, on which the resulting statistical learning guarantees depend. [Footnote 1: For example, one such approach chooses a functional on the Banach space of paths, but how to choose such functionals in a non-parametric fashion and evaluate them efficiently remains unclear (unless something special is known a priori about the probability distribution of the observed barcodes).] We show here that the composition with the signature map resolves such issues. For example, one of our results is that the feature map is
- universal: non-linear functions of the data are approximated by linear functionals in feature space: for every (sufficiently regular) function on barcodes there exists a linear functional in the dual of the tensor algebra whose composition with the feature map approximates that function uniformly over barcodes.
- characteristic: the expected value of the feature map characterizes the law of the random variable: the map which sends a probability measure on the space of barcodes to its expectation in the tensor algebra is injective.
Perhaps the biggest advantage of our approach is that it is not limited to the (integrated) landscape embedding. Besides it, we will also discuss the following unstable path embeddings:
- the naive embedding :
sorts all intervals by decreasing length (with intervals of equal length ordered by increasing birth times), enumerates them $1, \dots, n$, and forms an $n$-dimensional path by running in the $i$-th coordinate with unit speed if the $i$-th bar is active (otherwise remaining constant).
- Euler embedding :
reduces a barcode to a single Euler characteristic curve (see [33, Sec. 3.2]). The resulting feature map is not stable, but is extremely fast to compute.
- Betti embedding :
records only the Betti numbers as a function of the scale, and ignores information (contained in the barcode) which connects homology across different scale values.
- envelope embedding :
constructed by sorting the intervals of a barcode in descending order by length, and then assembling the ordered sequences of birth times and death times into two separate paths. This appears to be a completely new embedding.
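The naive embedding described in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `naive_path`, the representation of a barcode as a list of `(birth, death)` pairs with non-negative births, and the choice to sample the path on a user-supplied grid of scale values are all hypothetical and not taken from the original construction.

```python
def naive_path(barcode, times):
    """Sample the naive persistence path of a barcode at the given scales.

    barcode: list of (birth, death) pairs with 0 <= birth <= death
             (an assumed input format).
    times:   increasing list of scale values at which to sample the path.
    Returns one point per scale value; coordinate i is the amount of time
    the i-th bar (after sorting) has been active up to that scale.
    """
    # Sort by decreasing length, ties broken by increasing birth time.
    bars = sorted(barcode, key=lambda bd: (-(bd[1] - bd[0]), bd[0]))
    path = []
    for t in times:
        # Coordinate i moves at unit speed while birth <= t < death and is
        # constant otherwise, so its value is the length of [b, d) within [0, t].
        path.append([max(0.0, min(t, d) - b) for (b, d) in bars])
    return path
```

With $n$ bars the sampled points live in $n$ dimensions, which is why this embedding is in practice only usable through kernelization.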
Analogous statements for universality and characteristicness hold for the other ’s. Each of these embeddings leads to different properties in terms of stability, computability, and discriminative power, for the associated feature maps . For example, has neither the stability of nor the computability of , but it gives state-of-the-art performance on supervised classification tasks. The emergence of a single feature map which is optimal along all three axes (stability, computability, discriminative power) appears unlikely, since these requirements tend to contravene each other. For example, stability requires the feature map to depend mostly on the longer intervals in a barcode, while in various problems (such as ), the signal of interest also resides in intervals of intermediate and short length.
The dimension of varies significantly between the different persistence path embeddings. If a barcode contains intervals, would map it to a path that evolves in a -dimensional space, whereas always yields a path in dimensions. Each of the above feature maps gives a kernel for barcodes , and following , this kernel can be very efficiently computed regardless of as long as carries an inner product. However, for low-dimensional embeddings, can be computed directly and performs very well (e.g. for the envelope embedding ).
Benchmarks and related work
Statistical learning from barcodes has received a lot of attention, see the background section in  for a recent survey. The most common theme is to construct a kernel or polynomial coordinates [3, 14] that serve as features for barcodes. We believe one strength of our approach is the access to both the kernel and its feature map (at least for the Betti, Euler, and envelope embeddings; in practice the naive embedding is only accessible via kernelization due to the high dimensionality of the persistence paths); the former gives access to well-developed tools from the kernel and Gaussian process learning literature, while the latter allows us to use any learning method such as random forests or neural networks. A second advantage is that different choices of persistence path embeddings facilitate emphasis on different topological properties (so in a supervised learning task, the optimal embedding can be determined by cross-validation).
IC is funded by a Junior Research Fellowship of St John’s College, Oxford. VN’s work is supported by The Alan Turing Institute under the EPSRC grant number EP/N510129/1, and by the Friends of the Institute for Advanced Study. HO is supported by the Oxford-Man Institute of Quantitative Finance. We are grateful to Steve Oudot and Mathieu Carrière for generously sharing their data  with us.
As mentioned in the introduction, we construct maps of the form
from the space of persistence barcodes to the tensor algebra of via the space of bounded variation paths BV. In this section we define these three spaces and recall important properties; see [28, 9] resp.  for more details about resp. BV.
2.1. Persistence, barcodes and stability
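The Vietoris–Rips filtration [26, Sec. 3.1] associates a one-parameter nested family $\{VR_\epsilon(M)\}_{\epsilon \geq 0}$ of finite simplicial complexes to each finite metric space $(M, d)$ via the following rule. A subset $\{x_0, \dots, x_k\}$ of $M$ spans a $k$-dimensional simplex in $VR_\epsilon(M)$ if and only if all the pairwise distances satisfy $d(x_i, x_j) \leq \epsilon$. Thus, one has the inclusion $VR_\epsilon(M) \subseteq VR_{\epsilon'}(M)$ whenever $\epsilon \leq \epsilon'$. Computing the homology of this family [20, Ch. 2] with coefficients in a field $\mathbb{F}$ produces, in each dimension $k$, a corresponding family of $\mathbb{F}$-vector spaces as follows:
$$H_k\big(VR_\epsilon(M); \mathbb{F}\big), \qquad \epsilon \geq 0,$$
and inclusions of simplicial complexes induce linear maps $H_k(VR_\epsilon(M); \mathbb{F}) \to H_k(VR_{\epsilon'}(M); \mathbb{F})$ for $\epsilon \leq \epsilon'$. This data consisting of vector spaces and linear maps indexed by real numbers is called a persistence module.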
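To make the Vietoris–Rips rule concrete, the following minimal sketch enumerates the simplices present at a given scale by brute force. It assumes the convention that a subset spans a simplex when all of its pairwise distances are at most the scale; the function name `rips_simplices`, the distance-matrix input, and the `max_dim` cut-off are illustrative choices, and practical software uses far more efficient algorithms.

```python
from itertools import combinations

def rips_simplices(dist, eps, max_dim=2):
    """List the simplices of the Vietoris-Rips complex at scale eps.

    dist:    symmetric matrix (list of lists) of pairwise distances.
    eps:     scale parameter.
    max_dim: largest simplex dimension to enumerate (a cut-off assumed
             here to keep the enumeration small).
    """
    n = len(dist)
    simplices = []
    for k in range(max_dim + 1):
        for subset in combinations(range(n), k + 1):
            # A subset spans a k-simplex iff all pairwise distances are <= eps.
            if all(dist[i][j] <= eps for i, j in combinations(subset, 2)):
                simplices.append(subset)
    return simplices
```

Because the condition only becomes easier to satisfy as the scale grows, the complexes returned for increasing scales are nested, which is exactly the inclusion used to build the persistence module.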
The following result from  uses the fact that the polynomial ring in one variable acts on sufficiently tame persistence modules. Since this ring is a principal ideal domain, -modules have a particularly simple representation theory.
Theorem 1 (Structure).
Under mild assumptions (always satisfied by Vietoris-Rips homology of finite metric spaces), each persistence module is completely characterized up to isomorphism by a finite collection of (not necessarily distinct) intervals , called its barcode.
Thus, a barcode is simply a multi-set containing some subintervals of . When the persistence module in question comes from the Vietoris-Rips construction as described above, its barcode provides a complete summary of all the intermediate homology of across . In particular, the -th Betti number of , written
equals the number of intervals in the -dimensional barcode that contain . Similarly, for , the rank of the induced map on homology
equals the number of intervals in that contain .
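Both counting statements can be read off a barcode directly. The sketch below assumes half-open intervals `[birth, death)` stored as pairs; the helper names `betti_number` and `persistent_rank` are hypothetical.

```python
def betti_number(barcode, eps):
    """Number of intervals [b, d) in the barcode that contain eps."""
    return sum(1 for b, d in barcode if b <= eps < d)

def persistent_rank(barcode, eps, eps_prime):
    """Number of intervals [b, d) containing both eps and eps_prime.

    For eps <= eps_prime this is the rank of the map on homology induced
    by the inclusion of the complex at scale eps into that at eps_prime.
    """
    return sum(1 for b, d in barcode if b <= eps and eps_prime < d)
```

Setting both scales equal recovers the Betti number, so the rank is the finer of the two invariants.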
There are several efficient algorithms which take as input finite metric spaces and produce as outputs the barcodes of their Vietoris-Rips filtrations [25, 18]. Concurrently, the theory has also developed at a rapid pace, and the following result [12, 9] is an exemplar of its progress. (Note that is the collection of all finite metric spaces while is the collection of all barcodes containing finitely many intervals.)
Theorem 2 (Stability).
Roughly, two barcodes lie within bottleneck distance of each other if it is possible to deform one to the other by moving the endpoints of all its intervals by at most (and vice-versa). Thus, the longer intervals are more stable to perturbation of the originating metric space (in particular, intervals of length smaller than might be created or destroyed during such a deformation).
2.2. Paths of bounded variation
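Let $V$ be a normed real vector space. Given a continuous path $x : [0, T] \to V$ (for some $T > 0$) and a finite partition $\pi = \{0 = t_0 < t_1 < \dots < t_n = T\}$ of $[0, T]$, the 1-variation of $x$ along $\pi$ is given by
$$\|x\|_{1,\pi} = \sum_{i=1}^{n} \big\| x_{t_i} - x_{t_{i-1}} \big\|_V.$$
The total $1$-variation of a continuous path $x$ is defined as
$$\|x\|_{1\text{-var}} = \sup_{\pi} \|x\|_{1,\pi},$$
where the supremum is taken over all finite partitions $\pi$ of $[0, T]$. The normed real vector space $\mathrm{BV}(V)$ consists of all continuous paths $x$ that satisfy $\|x\|_{1\text{-var}} < \infty$, with addition and scalar multiplication being defined pointwise. The induced metric on $\mathrm{BV}(V)$ is given as usual by the norm of the difference of two paths.
Functions of bounded variation lie strictly between Lipschitz-continuous functions and almost everywhere differentiable functions, and in particular every Lipschitz-continuous function lies within $\mathrm{BV}(V)$.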
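For a path sampled at finitely many times, the 1-variation along the sampling partition is just the sum of the norms of consecutive increments. The sketch below uses the Euclidean norm on the target space, which is an assumption made for illustration, and the name `variation_along` is hypothetical.

```python
import math

def variation_along(points):
    """1-variation of a sampled path along the partition given by its samples.

    points: list of equal-length coordinate tuples, one per partition time.
    Uses the Euclidean norm (an assumed choice of norm).
    """
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))
```

For a piecewise linear path whose vertices are exactly the sample points, this value equals the total 1-variation; dropping sample points can only decrease it, by the triangle inequality.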
2.3. The tensor algebra
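Given a real vector space $V$ and an integer $m \geq 1$, let $V^{\otimes m}$ denote the $m$-fold tensor product of $V$ with itself. By convention, $V^{\otimes 0} = \mathbb{R}$. The tensor algebra of $V$ is the direct product
$$T(V) = \prod_{m \geq 0} V^{\otimes m}.$$
Thus, each element of $T(V)$ is a sequence $(v_m)_{m \geq 0}$ where $v_m \in V^{\otimes m}$. We equip $T(V)$ with the structure of a (graded) algebra under the tensor product operation, for which $v_m \otimes w_n$ takes values in $V^{\otimes (m+n)}$. Finally, let us emphasize that $T(V)$ is a linear space, which makes it a suitable feature space.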
3. From barcodes to paths
In this section we introduce several persistence path embeddings
To avoid technicalities, we make the following assumption (which is always met if the barcode arises from the persistent homology of a finite metric space). [Footnote 2: The assumption can be sufficiently mollified, but since the underlying motivation for this work is computational and finitary, we do not lose any structure of interest by restricting to tame barcodes.]
Assume that every barcode encountered in this section is tame in two senses: first, it has only finitely many intervals, and second, each interval is contained within $[0, T]$ for some sufficiently large $T > 0$.
3.1. The (integrated) landscape embedding
We first present an embedding with desirable stability properties. The persistence landscape of a barcode is a single function , but it is often convenient to denote each by . The best introduction to landscapes is visual:
To the left is a barcode containing only three intervals, where each is shown as a point in the plane with coordinates . The construction of three associated landscape functions , which are shown to the right, proceeds by first projecting these points onto the diagonal, and then extracting successive maximal envelopes of the resulting arrangement of line segments. The higher for this illustrated barcode are all identically zero.
[5, Def. 3] The landscape of the barcode is the (continuous) function given by
Here, equals the number of intervals in which contain (see (2) for the case of barcodes arising from persistent homology). Moreover, we adopt the usual convention that whenever the supremum is being taken over the empty set.
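For finite barcodes the landscape can be evaluated without computing any supremum. The sketch below relies on a commonly used equivalent description, in which the $k$-th landscape function at a point is the $k$-th largest value among the "tent" functions of the individual bars; using this reformulation, the function name `landscape`, and the pair-based barcode format are all assumptions of this example.

```python
def landscape(barcode, k, t):
    """Value at t of the k-th landscape function (k = 1, 2, ...).

    Each bar (b, d) contributes the tent value max(0, min(t - b, d - t));
    the k-th landscape is the k-th largest of these values, and it is zero
    when the barcode has fewer than k bars.
    """
    tents = sorted(
        (max(0.0, min(t - b, d - t)) for b, d in barcode), reverse=True
    )
    return tents[k - 1] if k <= len(tents) else 0.0
```

The tent value of a bar at a point is the largest half-width of a window centred at that point which still fits inside the bar, which is how it connects to counting intervals that contain the whole window.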
For tame barcodes , one can safely exclude from the codomain of . Moreover, each becomes bounded and compactly supported (in addition to continuous), so we may view the assignment of landscapes to barcodes as a function
for every .
The landscape embedding assigns to each landscape a path in whose -th component is the -th landscape function for each :
Similarly, the integrated landscape embedding is defined as
The choice of for the target space above is somewhat arbitrary: we may as well have mapped to for any or to via truncation. We now show that inherits stability (in the sense of Theorem 2) from barcodes via their landscapes. The two spaces defined below will appear in the proof.
For , define
the Banach space consisting of all functions for which the following -norm is finite: and
the Sobolev path space consisting of all functions for which there exists some such that . The seminorm of in this case is defined by .
We remark that one usually defines the Sobolev norm on as , while our definition drops the term . For paths defined on a compact interval with , we note that these norms are equivalent (but on unbounded domains, this is no longer the case). Our choice of norm is motivated by the upcoming Lemma 3.5.
At special values of and , the two spaces defined above become more familiar. For instance, is the space obtained by equipping with the product of the counting and Lebesgue measures, as considered in [5, Sec. 2.4]. Similarly, is the space of -Hölder paths in , while is the subspace of absolutely continuous paths in . See [15, Sec. 1.4] for further details. In any case, landscapes in the image of lie in for every possible .
For all , the map is an isometry
The map , , is an isometry:
By definition of , the map , , is also an isometry. The conclusion follows by observing that . ∎
We now obtain a desirable stability property for the integrated landscape embedding.
3.2. The envelope embedding
Consider . Order the intervals of in descending order by their lengths (with intervals of equal length ordered by increasing birth times), and embed them into as the (disjoint) union
The upper envelope of
is the piecewise linear curve obtained by linearly interpolating between the highest points of across , with by convention. Similarly, the lower envelope is obtained by interpolating between the lowest points , again with . Both curves are uniquely extended to have domain by keeping them constant on the intervals and . We illustrate both envelopes in the accompanying figure.
The envelope embedding is a map defined as follows. To each barcode , it associates the path , given by
Here and are the upper and lower envelopes of as described above.
For values of near zero, the upper and lower envelopes capture only the longest, most stable intervals of . As increases, more of the smaller intervals get included, and the output becomes more volatile to small perturbations of (in the bottleneck metric). This motivates us to truncate after a given time: pick an integer and let be the restricted envelope embedding obtained by setting equal to
If the feature map associated to this truncated envelope embedding performs well for small values of and poorly for large ones, then one obtains evidence in favor of the hypothesis that the signal of interest genuinely resides in the larger, more stable intervals.
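The envelope construction, including its truncation, can be sketched as follows. Several details here are assumptions made for illustration rather than the original definition: the $i$-th longest bar is placed at horizontal position $i$, the envelopes are held constant outside the range of positions, the boundary convention mentioned in the text is not modelled, and the names `envelope_point` and `n_bars` are hypothetical.

```python
import math

def envelope_point(barcode, t, n_bars=None):
    """Evaluate (lower envelope, upper envelope) at horizontal position t.

    barcode: non-empty list of (birth, death) pairs.
    n_bars:  if given, keep only the n_bars longest bars (truncation).
    """
    # Descending length, ties broken by increasing birth time.
    bars = sorted(barcode, key=lambda bd: (-(bd[1] - bd[0]), bd[0]))
    if n_bars is not None:
        bars = bars[:n_bars]
    n = len(bars)
    s = min(max(t, 1.0), float(n))   # constant extension outside [1, n]
    left = min(int(math.floor(s)), n)
    right = min(left + 1, n)
    frac = s - left
    # Linear interpolation between the bars at positions left and right.
    lower = (1 - frac) * bars[left - 1][0] + frac * bars[right - 1][0]
    upper = (1 - frac) * bars[left - 1][1] + frac * bars[right - 1][1]
    return lower, upper
```

Small positions only see the longest bars, so lowering `n_bars` restricts the path to the part of the barcode that is most robust to perturbation.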
3.3. The Betti and Euler embeddings
In contrast to the previous two subsections, we now consider embeddings which depend on all homological dimensions.
Denote where contains all intervals of homological dimension . Choose an integer and numbers for and . Setting , the generalised Betti embedding is defined as follows. Let be the (ordered set of) all endpoints of intervals in (which lie in ) together with and . For each , set
where is the Betti number of as in (1). We extend the definition to points in a piecewise linear fashion
The Betti embedding is defined by setting if and otherwise:
The Euler embedding is defined by setting :
When arises from the persistent homology of a finite metric space, it is computationally convenient to recall that is the Euler characteristic of the associated Vietoris-Rips simplicial complex , and is thus also given by an alternating count of simplices across dimension:
Hence, can be computed without knowing the actual homology of . In fact, one could further consider generalised simplex embeddings
which can all be computed without knowing the homology of (albeit do not in general capture homological invariants).
Since and particularly are massive numerical reductions of , it is apparent that metric spaces with very different persistence barcodes might have identical Betti and Euler embeddings. On the other hand, for barcodes arising from certain popular models of random metric spaces, the expected value of is a remarkably good predictor of the Betti number for each — see [21, Sec. 5.3] and the references therein for details.
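The Betti and Euler embeddings are both weighted combinations of Betti numbers evaluated at interval endpoints and then interpolated linearly. The sketch below computes the knots of such a curve for a single output coordinate; the name `weighted_betti_knots`, the list-of-barcodes input (one barcode per homological dimension), and the omission of any extra boundary knots are assumptions of this example.

```python
def weighted_betti_knots(barcodes_by_dim, weights):
    """Knots (t_i, value_i) of a piecewise linear weighted Betti curve.

    barcodes_by_dim: list whose k-th entry is the barcode (list of
                     (birth, death) pairs) in homological dimension k.
    weights:         one weight per dimension; value_i is the weighted sum
                     of Betti numbers at the endpoint t_i.
    """
    endpoints = sorted({e for bars in barcodes_by_dim for bd in bars for e in bd})
    knots = []
    for t in endpoints:
        value = sum(
            w * sum(1 for b, d in bars if b <= t < d)
            for w, bars in zip(weights, barcodes_by_dim)
        )
        knots.append((t, value))
    return knots
```

Choosing alternating weights `[1, -1, 1, ...]` gives the Euler characteristic curve, while a single non-zero weight isolates one Betti curve.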
3.4. Stability and injectivity
As mentioned in the introduction, the emergence of a single feature map which is optimal in terms of stability, discriminative power and computability is unlikely since these three properties tend to contravene each other. Indeed, our persistence path embeddings vary drastically in terms of stability, discriminative power, and computability. In terms of discriminative power, the (integrated) landscape and envelope embeddings are injective maps from the space of barcodes to spaces of bounded variation paths, but neither the Betti nor the Euler embedding is injective, see Remark 3.9. In terms of stability, the only embedding that is stable is the (integrated) landscape embedding; for the other embeddings simple counterexamples can be constructed. [Footnote 3: Consider adding small bars where if is even, if odd for a fixed sufficiently large .]
4. From paths to tensors
We introduce the second component for our feature map .
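For a Banach space $V$, the signature map is defined as
$$S : \mathrm{BV}(V) \to T(V), \qquad S(x) = \big(1, S_1(x), S_2(x), \dots\big),$$
where
$$S_m(x) = \int_{0 < t_1 < \dots < t_m < T} dx_{t_1} \otimes \cdots \otimes dx_{t_m} \in V^{\otimes m}$$
is defined as a Riemann–Stieltjes integral over the simplex of length $T$.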
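Example 4.2 (The case $V = \mathbb{R}^d$).
For a path $x = (x^1, \dots, x^d) \in \mathrm{BV}(\mathbb{R}^d)$, it holds that the level-$m$ component $S_m(x)$ of the signature is
$$S_m(x) = \sum_{(i_1, \dots, i_m)} c_{i_1 \cdots i_m}\, e_{i_1} \otimes \cdots \otimes e_{i_m},$$
where the sum is taken over all multi-indexes $(i_1, \dots, i_m) \in \{1, \dots, d\}^m$, $e_1, \dots, e_d$ is a basis of $\mathbb{R}^d$, and
$$c_{i_1 \cdots i_m} = \int_{0 < t_1 < \dots < t_m < T} dx^{i_1}_{t_1} \cdots dx^{i_m}_{t_m}.$$
Hence the term $S_m(x)$ can simply be interpreted as a collection of $d^m$ real numbers.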
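For piecewise linear paths these iterated integrals can be computed exactly, without numerical integration. The sketch below uses two standard facts: the signature of a straight segment is the tensor exponential of its increment, and the signature of a concatenation is the tensor product of the signatures (Chen's identity). Storing coefficients in a dictionary keyed by multi-indexes (0-based here) and the function names are illustrative choices; this is not the implementation used for the experiments.

```python
from itertools import product
from math import factorial

def segment_signature(delta, max_level):
    """Truncated signature of a straight segment with increment delta."""
    sig = {}
    for m in range(max_level + 1):
        for word in product(range(len(delta)), repeat=m):
            coeff = 1.0 / factorial(m)
            for i in word:
                coeff *= delta[i]
            sig[word] = coeff  # coefficient of e_{i_1} x ... x e_{i_m}
    return sig

def chen_product(a, b, max_level):
    """Truncated tensor product: multi-indexes concatenate, coefficients multiply."""
    out = {}
    for wa, ca in a.items():
        for wb, cb in b.items():
            if len(wa) + len(wb) <= max_level:
                out[wa + wb] = out.get(wa + wb, 0.0) + ca * cb
    return out

def signature(points, max_level):
    """Truncated signature of the piecewise linear path through points."""
    sig = {(): 1.0}
    for p, q in zip(points, points[1:]):
        delta = [qi - pi for pi, qi in zip(p, q)]
        sig = chen_product(sig, segment_signature(delta, max_level), max_level)
    return sig
```

For the path that first moves one unit along the first axis and then one unit along the second, the coefficient indexed by `(0, 1)` is 1 while the one indexed by `(1, 0)` is 0, which shows how the signature records the order in which coordinates move.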
The mapping $S$ is essentially injective up to a natural equivalence class of paths, called tree-like equivalence. [Footnote 4: Two paths $x, y$ are tree-like equivalent iff there exists some $\mathbb{R}$-tree (i.e., a metric space in which any two points are connected by a unique arc which is isometric to a real interval) so that the concatenation of $x$ with the time-reversal of $y$ factors through that tree as a composition of continuous maps, starting and ending at the same point of the tree.]
Let $x, y \in \mathrm{BV}(V)$. Then $S(x) = S(y)$ iff $x$ and $y$ are tree-like equivalent.
Tree-like equivalence can be a useful equivalence relation, e.g. it identifies paths that differ only by time-parametrization. [Footnote 5: A path and any of its reparametrizations by a time-change are tree-like equivalent.] For our applications, it is instructive to think of the level-$m$ component of the signature as the natural generalisation of the monomial of order $m$ of a vector to pathspace.
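Let $v \in V$ and consider the path $x(t) = t v$ where $t \in [0, 1]$. The components of its signature are given by
$$S_m(x) = \frac{v^{\otimes m}}{m!}.$$
Thus $S$ indeed recovers the moment map
$$v \mapsto \Big(1, v, \frac{v^{\otimes 2}}{2!}, \frac{v^{\otimes 3}}{3!}, \dots\Big).$$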
A further useful similarity between the signature map and monomials is that the space of linear functions of the signature is closed under multiplication, which is commonly known as the shuffle identity.
Lemma 4.4 (Shuffle identity).
Suppose for (where is the continuous dual space of ). Then there exists such that for all
The linear functional is known as the shuffle product of and .
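The simplest instance of the shuffle identity can be checked numerically: the product of two level-one signature coordinates equals the sum of the two level-two coordinates obtained by shuffling the indices. The sketch below computes the first two signature levels of a piecewise linear path by accumulating segment by segment; the function name `signature_level2` and the nested-list output are illustrative assumptions.

```python
def signature_level2(points):
    """First two signature levels of the piecewise linear path through points.

    Returns (s1, s2) where s1[i] is the total increment in coordinate i and
    s2[i][j] is the iterated integral of coordinate i against coordinate j.
    """
    d = len(points[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for p, q in zip(points, points[1:]):
        delta = [qi - pi for pi, qi in zip(p, q)]
        for i in range(d):
            for j in range(d):
                # Contribution of one more linear segment (Chen's identity).
                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            s1[i] += delta[i]
    return s1, s2
```

For any such path, `s1[i] * s1[j]` agrees with `s2[i][j] + s2[j][i]` up to rounding, so this particular product of linear functionals of the signature is again a linear functional of the signature.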
In light of Lemma 4.4, one can ask whether linear combinations of such “monomials” are dense in a space of functions, and, whether the sequence of expected “moments” characterizes the law of the random path. For compact subsets of , the answer to both questions is yes, as we shall see in Theorem 5, and follows by a standard Stone–Weierstrass argument; the general, non-compact case is more subtle (cf. classical moment problem) but is known to be true under suitable integrability conditions .
5. Statistical learning
In this section we discuss the problem of statistical learning on the space of barcodes . The space has traditionally been of more interest; however, since the persistent homology map from Section 2.1 is well-understood, we focus here on . Results for -valued random variables pull back along PH to results for -valued random variables.
We are interested in two standard learning problems: given independent random samples of a barcode-valued random variable, our aim is to
(a) learn a function of the data, and
(b) characterize the law of the data.
As mentioned in the introduction, the standard approach to both problems is to find a feature map which is universal and characteristic (addressing points a and b respectively). Let us establish these properties for our feature map
where is a Banach space, is any persistence path embedding (e.g., one of the maps from Section 3) and is the signature map of Section 4. Due to the injectivity of (up to tree-like equivalence), preserves essentially all the information captured by . In particular, if maps some domain injectively into the space of tree-reduced paths (as is the case for any embedding once time is added as a coordinate), then is also injective on . To make this precise, we use suitable quotient spaces.
For , define the equivalence relation iff and are tree-like equivalent. Let denote the quotient of under , and equip with the initial topology with respect to the map
where denotes tree-like equivalence and is equipped with the quotient topology (recall that bears the -variation topology).
On each compact subset , the map has the following properties.
(Universal) Let be continuous. For each , there exists in (the dual space of the tensor algebra) such that
(Characteristic) Denoting by the set of Borel probability measures on , the map
(Kernelized) Suppose further that is a Hilbert space. Then the map
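defines a bounded, continuous kernel which is universal for the space of continuous functions and characteristic for Borel probability measures on . [Footnote 6: A kernel on a set $\mathcal{X}$ is a positive definite map $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The completion of the linear span of the functions $k(x, \cdot)$ with respect to the inner product $\langle k(x, \cdot), k(y, \cdot) \rangle = k(x, y)$ forms a so-called reproducing kernel Hilbert space $\mathcal{H}$. A kernel is called universal for a topological vector space $\mathcal{F}$ if $\mathcal{H}$ embeds continuously into a dense subspace of $\mathcal{F}$, and called characteristic if the transpose map $\mathcal{F}' \to \mathcal{H}'$ is injective, see  for details.]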
By Theorem 4, the continuity of the signature map in the -variation topology [24, Thm. 3.10], and the definition of , it follows that the map is continuous and separates the points of . Combining these properties with Lemma 4.4 shows that the set of functions is a point-separating subalgebra of , hence Point a follows from the Stone–Weierstrass theorem.
Point b follows by duality: the dual of are the Radon measures on (these include the Borel probability measures) and universality implies that the map is dense in this dual; hence, every Radon measure is characterized by
In Point c the boundedness follows from continuity of and compactness of . Finally, every inner product kernel is universal [resp. characteristic] if and only if the feature map is universal [resp. characteristic]; this follows from a general argument about reproducing kernels, see for example [11, Prop. E.3] for details. ∎
The computational bottleneck for is typically the calculation of the signature, since an element of truncated at level needs real numbers if is assumed piecewise linear with at most time points. This gets prohibitively large for moderate dimensions. By contrast, the kernel on , , is defined via the canonical inner product on . Using , the level- approximation to can be computed in time and memory, where is the cost of evaluating one inner product in . [Footnote 7: The low-rank approximation algorithm in  reduces this cost further to time and memory.]
Each of the feature maps naturally generalises to a parametrised feature map where denotes a set of parameters (which will typically be chosen by the learning algorithm). The parameters are:
- the truncation level of the signature, meaning that we only consider the signature components up to that level;
- the time-augmentation parameter: if it is non-zero, then we replace each path by its time-augmented version before computing its signature;
- a lag vector containing non-negative real numbers: we replace each path by its lagged version (determined by the lag vector) before computing its signature;
- a non-linearity: composing the persistence path embedding with a sufficiently regular map from $V$ to another vector space $W$ lifts barcodes to paths in $W$. If this map is injective and non-linear, one expects to obtain more efficient learning from signatures of paths in $W$, since more non-linearities are then available. [Footnote 8: Generically $W$ will be very high- or infinite-dimensional, which prevents the direct calculation of the signature. But if the map is the feature map of a kernel on $V$ (i.e. $W$ is a reproducing kernel Hilbert space over $V$ with that kernel), then the signature kernelization  still allows one to compute the kernel.] Typically we use a classic kernel on $V$, such as the RBF kernel.
For the choice , , (with ) and , the corresponding recovers , and recovers . With slight abuse of notation, we write for and for for the remainder of the article.
We evaluate our feature map on three supervised classification tasks: orbits, textures, and shapes. These are common benchmarks, taken from recent papers, and are described in Figures 1, 6, and 7. [Footnote 9: It can be hard to replicate reported results in the literature, since the preprocessing is often not fully specified; e.g. on OUTEX various downsampling methods combined with CBLP are possible before barcodes are computed. We used the same data set of barcodes and train-test split for all experiments to allow for a fair comparison.] For kernel methods we used a support vector classifier and for feature maps we used a random forest classifier. For kernel methods we used the same Nyström approximation to deal with the quadratic growth of the Gram matrix.
6.1. Computational complexity
For a persistence path in $d$ dimensions, truncation of the feature map at tensors of level less than or equal to $M$ gives
$$1 + d + d^2 + \cdots + d^M = \frac{d^{M+1} - 1}{d - 1} \qquad (d > 1)$$
coordinates (again, akin to classic polynomial features in machine learning). This combinatorial explosion limits the choice of truncation level in practice: the integrated landscape embedding produces paths in for large , which rules out . However, the kernelization needs constant memory and computation time ; in practice this allows us to evaluate up to on standard laptops; thus can be efficiently computed for all ’s as long as the persistence paths are not too long (long persistence paths are in principle possible by using the low-rank algorithm from  which we did not implement). On the other hand , , and are fast to compute directly since their ’s produce low-dimensional paths.
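The growth in the number of coordinates is easy to tabulate. The helper below counts the coordinates of a truncated signature for a path in `d` dimensions, including the constant level-zero term; the function name and the decision to count level zero are assumptions of this sketch.

```python
def num_signature_coordinates(d, max_level):
    """Number of coordinates of a signature truncated at max_level.

    Level m of a path in d dimensions contributes d**m coordinates;
    level 0 is the constant term and is counted here as one coordinate.
    """
    return sum(d ** m for m in range(max_level + 1))
```

A two-dimensional path truncated at level 6 needs 127 numbers, while a 50-dimensional path truncated at level 4 already needs more than six million, which is why high-dimensional embeddings are handled through the kernel instead.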
We implement our feature map and the kernel using Python’s scikit-learn package. As part of this, we use for the signature kernel computation the legacy code from [22, Alg. 3], but we did not implement the low-rank algorithm (which limits to paths with time ticks on laptops, which is the reason why we do not report results for and on the Orbits dataset); for we make use of Terry Lyons’ ESIG package to compute the truncated signature .
To allow for a fair comparison, we also implemented the sliced Wasserstein kernel and the persistence image feature map combined with a random forest. This allows for a reasonable benchmarking since [7, 2] compare the performance of these two methods against a number of other methods; see [7, Sec. 4.2 and 4.3] and [2, Sec. 6].
6.3. Hyperparameter tuning
We use a grid search and $k$-fold cross-validation for parameter tuning. The hyperparameter grid for our feature map consists of the following parameters: the truncation level in the tensor algebra and the time-augmentation parameter; additionally, the parameters of the classifiers were tuned, and if the kernelized version is used, then also the parameters of the kernel on (we used the Gauss kernel throughout, so this is just the length scale ). For the envelope embedding E a further grid search was performed over the restriction parameter described in Section 3.2. Adding lags described in Section 5.1 is possible but was not tested here.
Table 1 reports the mean accuracy of repeating each experiment 20 times together with the standard deviation. As benchmark we report performance of the sliced Wasserstein kernel and the persistence image features. For the former, we used the approximation given in [7, Alg. 1] with six directions. For the latter, we used the persim package [Footnote 10: https://github.com/sauln/persim] with the Gauss kernel, linear weight function, and a grid search over the number of pixels and variance.
Table 1 shows that our approach performs very competitively, achieving state-of-the-art results in two common benchmarks. To the best of our knowledge, the Betti embedding beats the state-of-the-art for shapes in feature form and achieves close to state-of-the-art for textures in kernelized form. This is encouraging since both and provide very competitive benchmarks.
Our implementation of achieves better accuracy, particularly for Orbits, than reported in the original papers , where it is for Textures and for Orbits. We believe this is due to a different Gram matrix approximation (we use Nyström for all kernel methods). Similarly drastically outperformed the results for the same orbits experiment in ; we believe this is due to our choice of classifier, namely a random forest (vs. discriminant subspace ensemble).
The best results for , , and E were achieved by looking at the 0th homology for texture and orbits, and 1st homology for shapes (we did not combine homologies of different dimensions for these methods).
The values for the optimal truncation level in the tensor algebra (as chosen by cross-validated grid search) are given in Table 2; we ran all tests for . The feature maps performed best at higher levels, while the kernels performed best at lower levels. We suspect that the reason is that the Gaussian kernel non-linearity, which lifts barcodes to paths in infinite-dimensional spaces, allows one to capture the needed information already on level or . We also ran the same experiments for the landscape and the naive embedding, and , but the results were not competitive on either dataset. [Footnote 11: We employed [6, Alg. 1] to compute persistence landscapes.]