Persistence diagrams, the most common descriptors of Topological Data Analysis, encode topological properties of data and have already proved pivotal in many different applications of data science. However, since the (metric) space of persistence diagrams is not Hilbert, they end up being difficult inputs for most Machine Learning techniques. To address this concern, several vectorization methods have been put forward that embed persistence diagrams into either finite-dimensional Euclidean space or (implicit) infinite-dimensional Hilbert space with kernels. In this work, we focus on persistence diagrams built on top of graphs. Relying on extended persistence theory and the so-called heat kernel signature, we show how graphs can be encoded by (extended) persistence diagrams in a provably stable way. We then propose a general and versatile framework for learning vectorizations of persistence diagrams, which encompasses most of the vectorization techniques used in the literature. We finally showcase the experimental strength of our setup by achieving competitive scores on classification tasks on real-life graph datasets.
Topological Data Analysis (TDA) is a field of data science whose goal is to detect and encode topological features (such as connected components, loops, cavities…) that are present in datasets in order to improve inference and prediction. Its main descriptor is the so-called persistence diagram, which takes the form of a set of points in the Euclidean plane $\mathbb{R}^2$, each point corresponding to a topological feature of the data, with its coordinates encoding the feature size. This descriptor has been successfully used in many different applications of data science, such as signal analysis [PH15], material science [BHO18], cellular data [CÁM17], or shape recognition [LOC14] to name a few. This wide range of applications is mainly due to the fact that persistence diagrams encode information based on topology, and as such this information is very often complementary to that retrieved by more classical descriptors.
However, the space of persistence diagrams heavily lacks structure: different persistence diagrams may have different numbers of points, and several basic operations, such as addition and scalar multiplication, are not well-defined, which dramatically impedes their use in machine learning applications. To handle this issue, a lot of attention has been devoted to vectorizations of persistence diagrams through the construction of either finite-dimensional embeddings [AEK+17, COO15, CFL+15, KAL18], i.e., embeddings turning persistence diagrams into vectors in Euclidean space $\mathbb{R}^d$, or kernels [BUB15, CCO17, KHF16, LY18, RHB+15], i.e., generalized scalar products that implicitly turn persistence diagrams into elements of infinite-dimensional Hilbert spaces.
Even though these methods improved the use of persistence diagrams in machine learning tremendously, several issues remain. For instance, most of these vectorizations only have a few trainable parameters, which may prevent them from fitting well to specific applications. As a consequence, it may be very difficult to determine which vectorization is going to work best for a given task. Furthermore, kernel methods (which are generally efficient in practice) require computing and storing the kernel evaluations for each pair of persistence diagrams. Since all available kernels have a complexity that is at least linear, and often quadratic, in the number of persistence diagram points for a single matrix entry computation, kernel methods quickly become very expensive in terms of running time and memory usage on large sets or for large diagrams.
In this work, we show how to use neural networks for handling persistence diagrams. Contrary to static vectorization methods proposed in the literature, we actually learn the vectorization with respect to the learning task that is being solved. Moreover, our framework is general enough so that most of the common vectorizations of the literature [AEK+17, BUB15, CFL+15] can be retrieved from our method by specifying its parameters accordingly.
The contribution of this paper is twofold.
First, we introduce in Section 2 a new family of topological signatures on graphs: the extended persistence diagrams built from the Heat Kernel Signatures (HKS) of the graph. These signatures depend on a diffusion parameter $t$. Although HKS are well-known signatures, they have never been used in the context of persistent homology to encode topological information for graphs. We prove that the resulting diagrams are stable with respect to both the input graph and the parameter $t$. Extended persistence, as opposed to the commonly used ordinary persistence, is introduced in order to handle “essential” components, see Section 2.1 below. To our knowledge, it is the first use of extended persistence in a machine learning context. We also experimentally showcase its strength over ordinary persistence on several applications.
Second, building on the recent introduction of Deep Sets from [ZKR+17], we apply and extend that framework for persistence diagrams, implementing PersLay: a simple, highly versatile, automatically differentiable layer for neural network architectures that can process topological information encoded in persistence diagrams computed from all sorts of datasets. Our framework encompasses most of the common vectorizations that exist in the literature, and we give a necessary and sufficient condition for the learned vectorization to be continuous, improving on the analysis of [HKN19]. Using a large-scale dataset coming from dynamical systems which is commonly used in the TDA literature, we give in Section 3.2 a proof-of-concept of the scalability and efficiency of this neural network approach over standard methods in TDA. The implementation of PersLay is publicly available as a plug-and-play Python package based on TensorFlow at https://github.com/MathieuCarriere/perslay.
We finally combine these two contributions in Section 4 by performing graph classification on real-life benchmark datasets coming from various fields of science, such as biology, chemistry and social sciences.
Various techniques have been proposed to encode the topological information that is contained in the structure of a given graph, see for instance [AMA07, LSY+12, FF12]. In this work, we focus on topological features computed with persistent homology (see Section 1.3 below). This requires defining a real-valued function on the nodes of the graph. A first simple choice, made in [HKN+17], is to map each node to its degree. In another recent work [ZW19], the authors proposed to use the Jaccard index and the Ricci curvature. In [TVH18], the authors adopt a slightly different approach: given a graph with $n$ nodes indexed by $\{1, \dots, n\}$ and an integer parameter $r$, they compute an $n \times n$ matrix $M_r$ where $(M_r)_{i,j}$ is the probability that a random walk starting at node $i$ ends at node $j$ after $r$ steps. This matrix can be thought of as a finite metric space, i.e., a set of $n$ points embedded in $\mathbb{R}^n$, on which topological descriptors can also be computed [CdO14]. Here, $r$ acts as a scale parameter: small values of $r$ will encode local information while large values will catch large-scale features. Our approach, presented in Section 2, shares the same scale-parametric idea, with the notable difference that we compute our topological descriptors on the graphs directly.

The first approach to feed a neural network architecture with a persistence diagram was presented in [HKN+17]. It amounts to evaluating the points of the persistence diagram against one (or more) Gaussian distributions with parameters that are learned during the training process (see Section 3 for more details). Such a transformation is oblivious to any ordering of the diagram points, which is a suitable property, and is a particular case of permutation invariant transformations. These general transformations are studied in [ZKR+17], and used to define neural networks on sets. These networks are referred to as Deep Sets. In particular, the authors in [ZKR+17] observed that any permutation invariant function $f$ defined on point clouds supported on $\mathbb{R}^d$ with exactly $n$ points can be written in the form:

$$ f(\{x_1, \dots, x_n\}) = \rho\left(\sum_{i=1}^{n} \phi(x_i)\right) \tag{1} $$

for some $\phi : \mathbb{R}^d \to \mathbb{R}^q$ and $\rho : \mathbb{R}^q \to \mathbb{R}$. Obviously, the converse is true: any function defined by (1) is a permutation invariant function. Hofer et al. make use of this idea in [HKN19], where they suggest three possible functions $\phi$, which roughly correspond to Gaussian, spike and cone functions centered on the diagram points. Our framework builds on the same idea but substantially generalizes it, as we are able to generate many more possible vectorizations and ways to combine them (for instance by using a maximum instead of a sum in Equation 1). We deepen the analysis by observing how common vectorizations of the TDA literature can be obtained again as specific instances of our architecture (Section 3). Moreover, we allow for more general weight functions that provide additional interpretability, as shown in Supplementary Material, Section C.
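To make the Deep Sets form concrete, here is a minimal sketch of a function of the shape given in Equation (1), with the sum optionally replaced by another permutation invariant aggregation. The particular `phi` and `rho` below are hypothetical choices made only for illustration; they are not taken from [ZKR+17] or from the PersLay implementation.

```python
import math

def deep_set(points, phi, rho, op=sum):
    """Apply rho to a permutation invariant aggregation of phi over the points."""
    features = [phi(p) for p in points]
    # Aggregate coordinate by coordinate, so the order of the points is irrelevant.
    pooled = [op(coords) for coords in zip(*features)]
    return rho(pooled)

# Hypothetical feature map and readout, for illustration only.
def phi(p):
    x, y = p
    return (x, y, x * y)

def rho(v):
    return math.tanh(sum(v))

cloud = [(0.0, 1.0), (0.5, 2.0), (1.5, 1.75)]
shuffled = [cloud[2], cloud[0], cloud[1]]

value_sum = deep_set(cloud, phi, rho)                 # sum aggregation, as in (1)
value_sum_shuffled = deep_set(shuffled, phi, rho)     # same points, other order
value_max = deep_set(cloud, phi, rho, op=max)         # maximum instead of sum
value_max_shuffled = deep_set(shuffled, phi, rho, op=max)
```

Reordering the input leaves both outputs unchanged, which is the property that makes this family suitable for persistence diagrams.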
In this section, we briefly recall the basics of ordinary persistence theory. We refer the interested reader to [CEH09, EH10, OUD15] for a thorough description.
Let $X$ be a topological space, and $f : X \to \mathbb{R}$ be a real-valued continuous function. The $\alpha$-sublevel set of $(X, f)$ is then defined as: $X_\alpha = \{x \in X : f(x) \le \alpha\}$. Making $\alpha$ increase from $-\infty$ to $+\infty$ gives an increasing sequence of sublevel sets, called the filtration induced by $f$. It starts with the empty set and ends with the whole space $X$ (see Figure 1 for an illustration on a graph). Ordinary persistence keeps track of the times of appearance and disappearance of topological features (connected components, loops, cavities, etc.) in this sequence. For instance, one can store the value $\alpha_b$, called the birth time, for which a new connected component appears in $X_{\alpha_b}$. This connected component eventually gets merged with another one for some value $\alpha_d \ge \alpha_b$, which is stored as well and called the death time. Moreover, one says that the component persists on the corresponding interval $[\alpha_b, \alpha_d]$. Similarly, we save the values $[\alpha_b, \alpha_d]$ of each loop, cavity, etc. that appears in a specific sublevel set $X_{\alpha_b}$ and disappears (gets “filled”) in $X_{\alpha_d}$. This family of intervals is called the barcode, or persistence diagram, of $(X, f)$, and can be represented as a multiset of points (i.e., point cloud where points are counted with multiplicity) supported on $\mathbb{R}^2$ with coordinates $(\alpha_b, \alpha_d)$.
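The birth and death bookkeeping described above can be sketched for connected components of a graph filtered by a function on its vertices. This is a minimal union-find illustration written for this text, not the algorithm used by any particular library; the function name and the input format (a list of vertex values plus a list of edges) are assumptions. An edge is taken to enter the filtration as soon as both of its endpoints have.

```python
def sublevel_persistence_0(values, edges):
    """0-dimensional ordinary persistence of the sublevel filtration of a graph.

    values[v] is the function value at vertex v; edges is a list of (u, v) pairs.
    Returns (finite_pairs, essential_births).
    """
    neighbors = {v: [] for v in range(len(values))}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    parent = {}   # union-find forest over the vertices inserted so far
    birth = {}    # root -> birth time of its connected component

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    pairs = []
    for v in sorted(range(len(values)), key=lambda w: values[w]):
        parent[v] = v
        birth[v] = values[v]
        for u in neighbors[v]:
            if u not in parent:
                continue                    # u enters the filtration later
            ru, rv = find(u), find(v)
            if ru == rv:
                continue                    # the edge closes a loop: no merge
            # Elder rule: the component born last dies at the merge.
            young, old = (ru, rv) if birth[ru] > birth[rv] else (rv, ru)
            if birth[young] < values[v]:    # skip zero-length intervals
                pairs.append((birth[young], values[v]))
            parent[young] = old
    essential = sorted(birth[r] for r in {find(v) for v in parent})
    return sorted(pairs), essential

# A path graph 0-1-2-3-4 and a triangle, with hypothetical vertex values.
path_pairs, path_essential = sublevel_persistence_0(
    [0.0, 2.0, 1.0, 3.0, 0.5], [(0, 1), (1, 2), (2, 3), (3, 4)])
tri_pairs, tri_essential = sublevel_persistence_0(
    [0.0, 1.0, 2.0], [(0, 1), (1, 2), (0, 2)])
```

On the path, two components born at 1.0 and 0.5 die when they merge into the oldest one, which itself never dies. On the triangle no finite interval is produced at all, even though the graph contains a loop.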
The space of persistence diagrams can be equipped with a parametrized metric $d_p$, $1 \le p \le \infty$, whose proper definition is not required in this work and is given in Supplementary Material, Appendix A for the sake of completeness. In the particular case $p = \infty$, this metric will be referred to as the bottleneck distance between persistence diagrams.
In general, ordinary persistence does not fully encode the topology of $X$. For instance, consider a graph $G = (V, E)$, with vertices $V$ and (non-oriented) edges $E$. Let $f : V \to \mathbb{R}$ be a function defined on its vertices, and consider the sublevel graphs $G_\alpha = (V_\alpha, E_\alpha)$ where $\alpha \in \mathbb{R}$, $V_\alpha = \{v \in V : f(v) \le \alpha\}$, and $E_\alpha = \{(v_1, v_2) \in E : v_1, v_2 \in V_\alpha\}$, see Figure 1. In this sequence $(G_\alpha)_\alpha$, the loops persist forever since they never disappear from the sequence of sublevel graphs (they never get “filled”), and the same applies for whole connected components of $G$. Moreover, branches pointing upwards (with respect to the orientation given by $f$, see Figure 2) are missed (while those pointing downwards are detected), since they do not create connected components when they appear in the sublevel graphs, making ordinary persistence unable to detect them.
To handle this issue, extended persistence refines the analysis by also looking at the so-called superlevel set $X^\alpha = \{x \in X : f(x) \ge \alpha\}$. Similarly to ordinary persistence on sublevel sets, making $\alpha$ decrease from $+\infty$ to $-\infty$ produces a sequence of increasing subsets, for which structural changes can be recorded.
Although extended persistence can be defined for general metric spaces (see the references given above), we restrict ourselves to the case where $X = G$ is a graph. The sequence of increasing superlevel graphs $G^\alpha$ is illustrated in Figure 1. In particular, death times can be defined for loops and whole connected components by picking the superlevel graphs for which the feature appears again, and using the corresponding value $\alpha$ as the death time for these features. In this case, branches pointing upwards can be detected in this sequence of superlevel graphs, in the exact same way that downwards branches were in the sublevel graphs. See Figure 2 for an illustration.
Finally, the family of intervals of the form $[\alpha_b, \alpha_d]$ is turned into a multiset of points in the Euclidean plane $\mathbb{R}^2$ by using the interval endpoints as coordinates. This multiset is called the extended persistence diagram of $f$ and is denoted by $\mathrm{Dg}(G, f)$.
Since graphs have four types of topological features (see Figure 2), namely upwards branches, downwards branches, loops and connected components, the corresponding points in extended persistence diagrams can be of four different types. These types are denoted as $\mathrm{Ord}_0$, $\mathrm{Rel}_1$, $\mathrm{Ext}_0$ and $\mathrm{Ext}_1$ for downwards branches, upwards branches, connected components and loops respectively.
While it encodes more information than ordinary persistence, extended persistence ensures that points have finite coordinates. In comparison, methods relying on ordinary persistence have to design specific tools to handle points with infinite coordinates [HKN+17, HKN19], or simply ignore them [CCO17], losing information in the process. Therefore extended persistence allows the use of generic architectures regardless of the homology dimension. Empirical performances show substantial improvement over using ordinary persistence only (see Supplementary Material, Table 7). In practice, computing extended persistence diagrams can be efficiently done with the C++/Python Gudhi library [THE15]. Persistence diagrams are usually compared with the so-called bottleneck distance $d_B$, whose proper definition is not required for this work and is recalled in Supplementary Material, Section A. However, the resulting metric space is not Hilbert and as such, incorporating diagrams in a learning pipeline requires designing specific tools, which we do in Section 3.
We recall that extended persistence diagrams can be computed only after having defined a real-valued function on the nodes of the graphs. In the next section, we define a family of such functions from the so-called Heat Kernel Signatures (HKS) for graphs, and show that these signatures enjoy stability properties. Moreover, Section 4 will further demonstrate that they lead to competitive results for graph classification.
HKS is an example of a spectral family of signatures, that is, functions derived from the spectral decomposition of graph Laplacians, which provide informative features for graph analysis. We start this section with a few basic definitions. The adjacency matrix $A$ of a graph $G$ with vertex set $V = \{v_1, \dots, v_n\}$ is the matrix $A = (\mathbb{1}_{(v_i, v_j) \in E})_{i,j}$. The degree matrix $D$ is the diagonal matrix defined by $D_{i,i} = \sum_j A_{i,j}$. The normalized graph Laplacian $L_w = L_w(G)$ is the linear operator acting on the space of functions defined on the vertices of $G$, and is represented by the matrix $L_w = I - D^{-1/2} A D^{-1/2}$. It admits an orthonormal basis of eigenfunctions $\Psi = \{\psi_1, \dots, \psi_n\}$ and its eigenvalues satisfy $0 \le \lambda_1 \le \dots \le \lambda_n \le 2$. As the orthonormal eigenbasis is not uniquely defined, the eigenfunctions cannot be used as such to compare graphs. Instead we consider the Heat Kernel Signatures (HKS):

Given a graph $G$ and $t \ge 0$, the Heat Kernel Signature with diffusion parameter $t$ is the function $\mathrm{hks}_{G,t}$ defined on the vertices of $G$ by $\mathrm{hks}_{G,t} : v \mapsto \sum_{k=1}^{n} \exp(-t \lambda_k)\, \psi_k(v)^2$.
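The definition above translates almost directly into a few lines of NumPy. The sketch below assumes the usual normalized Laplacian $I - D^{-1/2} A D^{-1/2}$ and the usual HKS formula (a sum over eigenpairs of $\exp(-t\lambda_k)\psi_k(v)^2$); the function name and the handling of isolated vertices are choices made for this illustration and are not taken from the PersLay code.

```python
import numpy as np

def heat_kernel_signature(adjacency, t):
    """HKS at every vertex of a graph given by a symmetric 0/1 adjacency matrix."""
    A = np.asarray(adjacency, dtype=float)
    degrees = A.sum(axis=1)
    # Assumption: isolated vertices get a zero row/column in the normalized adjacency.
    inv_sqrt = np.where(degrees > 0, 1.0 / np.sqrt(np.maximum(degrees, 1e-12)), 0.0)
    laplacian = np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)   # columns are psi_k
    # Entry v of the result is sum_k exp(-t * lambda_k) * psi_k(v)^2.
    return (eigenvectors ** 2) @ np.exp(-t * eigenvalues)

# Path graph on three vertices; its normalized Laplacian has eigenvalues 0, 1, 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
hks_small = heat_kernel_signature(path, t=0.1)
hks_large = heat_kernel_signature(path, t=10.0)
hks_zero = heat_kernel_signature(path, t=0.0)
```

Because the squared eigenvector entries at a fixed vertex sum to one, the signature equals 1 everywhere at $t = 0$, and its sum over the vertices equals $\sum_k \exp(-t\lambda_k)$. The two end vertices of the path, which are exchanged by a symmetry of the graph, receive the same value.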
The HKS have already been used as signatures to address graph matching problems [HRG14] or to define spectral descriptors to compare graphs [TMK+18]. These signatures rely on the distributions of values taken by the HKS but not on their global topological structures, which are encoded in their extended persistence diagrams. For the sake of conciseness, we denote by $\mathrm{Dg}(G, t)$ the extended persistence diagram obtained from a graph $G$ using the filtration induced by the HKS with diffusion parameter $t$, that is $\mathrm{Dg}(G, t) = \mathrm{Dg}(G, \mathrm{hks}_{G,t})$. The following theorem shows these diagrams to be stable with respect to the bottleneck distance between persistence diagrams. The proof can be found in Supplementary Material, Section A.
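Stability w.r.t. graph perturbations. Let $t \ge 0$ and let $L_w$ be the Laplacian matrix of a graph $G$ with $n$ vertices. Let $G'$ be another graph with $n$ vertices and Laplacian matrix $L_w' = L_w + W$. Then there exists a constant $C(G, t) > 0$ only depending on $t$ and the spectrum of $L_w$ such that, for small enough $\|W\|_F$, where $\|\cdot\|_F$ denotes the Frobenius norm:

$$ d_B(\mathrm{Dg}(G, t), \mathrm{Dg}(G', t)) \le C(G, t)\, \|W\|_F \tag{2} $$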
Building an extended persistence diagram on top of a graph with the Heat Kernel Signatures requires picking a specific value of $t$. In particular, understanding the influence, and thus the choice, of the diffusion parameter $t$ is an important question for statistical and learning applications. First, we state that the map $t \mapsto \mathrm{Dg}(G, t)$ is Lipschitz continuous. The proof is found in Supplementary Material, Section A.
Stability w.r.t. parameter $t$. Let $G$ be a graph. The map $t \mapsto \mathrm{Dg}(G, t)$ is 2-Lipschitz continuous, that is, for $t, t' > 0$,

$$ d_B(\mathrm{Dg}(G, t), \mathrm{Dg}(G, t')) \le 2\,|t - t'| \tag{3} $$
In this section, we introduce PersLay: a general and versatile neural network layer for learning persistence diagram vectorizations.
| Dataset | PSSK | PWGK | SWK | PFK | PersLay |
|---|---|---|---|---|---|
| ORBIT5K | 72.38(2.4) | 76.63(0.7) | 83.6(0.9) | 85.9(0.8) | 87.7(1.0) |
| ORBIT100K | - | - | - | - | 89.2(0.3) |
In order to define a layer for persistence diagrams, we modify the Deep Set architecture [ZKR+17] by defining and implementing a series of new permutation invariant layers, so as to be able to recover and generalize standard vectorization methods used in Topological Data Analysis. To that end we define our generic neural network layer for persistence diagrams, that we call PersLay, through the following equation:

$$ \mathrm{PersLay}(\mathrm{Dg}) := \mathrm{op}\left(\{w(p) \cdot \phi(p)\}_{p \in \mathrm{Dg}}\right) \tag{4} $$

where op is any permutation invariant operation (such as minimum, maximum, sum, $k$th largest value…), $w : \mathbb{R}^2 \to \mathbb{R}$ is a weight function for the persistence diagram points, and $\phi : \mathbb{R}^2 \to \mathbb{R}^q$ is a representation function that we call point transformation, mapping each point $p$ of a persistence diagram to a vector.
In practice, $w$ and $\phi$ are of the form $w_{\theta_1}$ and $\phi_{\theta_2}$, where the gradients of $\theta_1 \mapsto w_{\theta_1}$ and $\theta_2 \mapsto \phi_{\theta_2}$ are known and implemented so that backpropagation can be performed, and the parameters $\theta_1, \theta_2$ can be optimized during the training process. We emphasize that any neural network architecture can be composed with PersLay to generate a neural network architecture for persistence diagrams. Let us now introduce three point transformation functions that we use and implement for the parameter $\phi$ in Equation (4).
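The triangle point transformation $\phi_\Lambda : \mathbb{R}^2 \to \mathbb{R}^q$, $p \mapsto [\Lambda_p(t_1), \dots, \Lambda_p(t_q)]^T$, where the triangle function $\Lambda_p$ associated to a point $p = (x, y) \in \mathbb{R}^2$ is $\Lambda_p : t \mapsto \max\{0, y - |t - x|\}$, with $q \in \mathbb{N}$ and $t_1, \dots, t_q \in \mathbb{R}$.
The Gaussian point transformation $\phi_\Gamma : \mathbb{R}^2 \to \mathbb{R}^q$, $p \mapsto [\Gamma_p(t_1), \dots, \Gamma_p(t_q)]^T$, where the Gaussian function $\Gamma_p$ associated to a point $p \in \mathbb{R}^2$ is $\Gamma_p : t \mapsto \exp(-\|p - t\|_2^2 / (2\sigma^2))$ for a given $\sigma > 0$, $q \in \mathbb{N}$ and $t_1, \dots, t_q \in \mathbb{R}^2$.
The line point transformation $\phi_L : \mathbb{R}^2 \to \mathbb{R}^q$, $p \mapsto [L_{\Delta_1}(p), \dots, L_{\Delta_q}(p)]^T$, where the line function $L_\Delta$ associated to a line $\Delta$ with direction vector $e_\Delta \in \mathbb{R}^2$ and bias $b_\Delta \in \mathbb{R}$ is $L_\Delta : p \mapsto \langle p, e_\Delta \rangle + b_\Delta$, with $q \in \mathbb{N}$ and where $\Delta_1, \dots, \Delta_q$ are lines in the plane.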
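The layer and its three point transformations can be sketched in NumPy as follows. This is an illustrative forward pass with fixed parameters, not the TensorFlow implementation from the PersLay repository: all function names are made up for this example, the formulas assume the standard forms of the three transformations (triangle $\max\{0, y - |t - x|\}$, Gaussian $\exp(-\|p - t\|^2 / (2\sigma^2))$, line $\langle p, e \rangle + b$), and nothing is trained.

```python
import numpy as np

def triangle_transform(diagram, samples):
    """Row i holds max(0, y - |t_j - x|) for the diagram point i = (x, y)."""
    x, y = diagram[:, :1], diagram[:, 1:]
    return np.maximum(0.0, y - np.abs(samples[None, :] - x))

def gaussian_transform(diagram, centers, sigma):
    """Row i holds exp(-||p_i - t_j||^2 / (2 sigma^2)) for each center t_j."""
    sq_dist = ((diagram[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def line_transform(diagram, directions, biases):
    """Row i holds <p_i, e_j> + b_j for each line j."""
    return diagram @ directions.T + biases[None, :]

def perslay(diagram, phi, weight, op):
    """op applied over the points p of the diagram to the vectors w(p) * phi(p)."""
    vectors = weight(diagram)[:, None] * phi(diagram)
    return op(vectors, axis=0)

diagram = np.array([[0.0, 2.0], [1.0, 0.5]])   # two hypothetical diagram points
samples = np.array([0.0, 1.0, 2.0])            # hypothetical sample locations t_j

def constant_weight(d):
    return np.ones(len(d))

def tri(d):
    return triangle_transform(d, samples)

summed = perslay(diagram, tri, constant_weight, np.sum)
pooled = perslay(diagram, tri, constant_weight, np.max)
summed_reversed = perslay(diagram[::-1], tri, constant_weight, np.sum)
```

Any reduction over the point axis that ignores ordering (sum, maximum, a sorted top-$k$) can be passed as `op`. In a trainable version, `samples`, the Gaussian centers, the line parameters and the weights would be variables updated by backpropagation.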
Formulation (4) is very general: despite its simplicity, it encodes most classical persistence diagram vectorizations with a very small set of point transformation functions $\phi$, so that the choice of $\phi$ can be treated as a hyperparameter of sorts. Let us show how it connects to most of the popular vectorizations and kernel methods for persistence diagrams in the literature.
Using $\phi = \phi_\Lambda$ with samples $t_1, \dots, t_q \in \mathbb{R}$, op = $k$th largest value, $w = 1$ (a constant weight function), amounts to evaluating the $k$th persistence landscape [BUB15] on $t_1, \dots, t_q$.
Using $\phi = \phi_\Lambda$ with samples $t_1, \dots, t_q \in \mathbb{R}$, op = sum, arbitrary weight function $w$, amounts to evaluating the persistence silhouette weighted by $w$ [CFL+15] on $t_1, \dots, t_q$.
Using $\phi = \phi_\Gamma$ with samples $t_1, \dots, t_q \in \mathbb{R}^2$, op = sum, arbitrary weight function $w$, amounts to evaluating the persistence surface weighted by $w$ [AEK+17] on $t_1, \dots, t_q$. Moreover, characterizing points of persistence diagrams with Gaussian functions is also the approach advocated in several kernel methods for persistence diagrams [KHF16, LY18, RHB+15].
Using where is a modification of the Gaussian point transformation defined with: for any , where if for some , and otherwise, sum, weight function , is the approach presented in [HKN+17].
Using $\phi = \phi_L$ with lines $\Delta_1, \dots, \Delta_q$, op = $k$th largest value, weight function $w = 1$, is similar to the approach advocated in [CCO17], where the sorted projections of the points onto the lines are then compared with the $\|\cdot\|_1$ norm and exponentiated to build the so-called Sliced Wasserstein kernel for persistence diagrams.
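The first correspondence in this list (triangle functions combined with a $k$th largest value) can be checked numerically on a toy diagram. The sketch below makes one assumption that should be stated loudly: points given as (birth, death) are first mapped to (midpoint, half-persistence) coordinates, so that the triangle function $\max\{0, y - |t - x|\}$ coincides with the tent function $\max\{0, \min(t - b, d - t)\}$ used to define landscapes. The helper names are hypothetical.

```python
def triangle(point, t):
    """Triangle function max(0, y - |t - x|) of a point (x, y)."""
    x, y = point
    return max(0.0, y - abs(t - x))

def kth_largest(values, k):
    """k-th largest entry (k = 1 is the maximum); 0.0 if there are fewer than k."""
    ranked = sorted(values, reverse=True)
    return ranked[k - 1] if k <= len(ranked) else 0.0

def landscape_via_triangles(diagram, samples, k):
    # Assumption: rotate (birth, death) to (midpoint, half-persistence) first.
    rotated = [((b + d) / 2.0, (d - b) / 2.0) for b, d in diagram]
    return [kth_largest([triangle(p, t) for p in rotated], k) for t in samples]

def landscape_direct(diagram, samples, k):
    """Reference: k-th largest of the tents max(0, min(t - b, d - t))."""
    return [kth_largest([max(0.0, min(t - b, d - t)) for b, d in diagram], k)
            for t in samples]

diagram = [(0.0, 4.0), (1.0, 3.0), (2.0, 6.0)]   # hypothetical (birth, death) pairs
samples = [1.0, 2.0, 3.0, 4.0]

first = landscape_via_triangles(diagram, samples, k=1)
second = landscape_via_triangles(diagram, samples, k=2)
```

Replacing `kth_largest` by a weighted sum over the points would give the silhouette-style aggregation from the second item instead.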
Stability of PersLay. The question of the continuity and stability of persistence diagram vectorizations is of importance for TDA practitioners. In [HKN19, Remark 8], authors observed that the operation defined in (4)—with op=sum—is not continuous in general (with respect to the common persistence diagram metrics). Actually, [DL19, Prop. 5.1] showed that for all the map is continuous with respect to the metric if and only if is of the form , where denotes the distance from a point to the diagonal and is a continuous and bounded function. Furthermore, when and is Lipschitz continuous, one can show that the map is actually stable ([HKN19, Thm. 12], [DL19, Prop. 5.2]), in the following sense:
In particular, this means that requiring continuity for the learned vectorization, as done in [HKN19], implies constraining the weight function to take small values for points close to the diagonal. However, in general there is no specific reason to consider that points close to the diagonal are less important than others, given a learning task.
Our first application is on a synthetic dataset used as a benchmark in Topological Data Analysis [AEK+17, CCO17, LY18]. It consists of sequences of points generated by different dynamical systems, see [HSW07]. Given some initial position $(x_0, y_0) \in [0, 1]^2$ and a parameter $r > 0$, we generate a point cloud $(x_n, y_n)_{n = 1, \dots, N}$ following:

$$ x_{n+1} = x_n + r\, y_n (1 - y_n) \mod 1, \qquad y_{n+1} = y_n + r\, x_{n+1} (1 - x_{n+1}) \mod 1 \tag{5} $$

The orbits of this dynamical system heavily depend on the parameter $r$. More precisely, for some values of $r$, voids might form in these orbits (see Supplementary Material, Figure 4), and as such, persistence diagrams are likely to perform well when classifying orbits with respect to the value of $r$ generating them. As in previous works [AEK+17, CCO17, LY18], we use the five different parameters $r = 2.5, 3.5, 4.0, 4.1$ and $4.3$ to simulate the different classes of orbits, with random initialization of $(x_0, y_0)$ and $N = 1000$ points in each simulated orbit. These point clouds are then turned into persistence diagrams using a standard geometric filtration [CdO14], called the AlphaComplex filtration (see http://gudhi.gforge.inria.fr/python/latest/alpha_complex_ref.html), in dimensions $0$ and $1$. We generate two datasets. The first is ORBIT5K, where for each value of $r$ we generate 1,000 orbits, ending up with a dataset of 5,000 point clouds. This dataset is the same as the one used in [LY18]. The second is ORBIT100K, which contains 20,000 orbits per class, resulting in a dataset of 100,000 point clouds, a scale that kernel methods cannot handle. This dataset aims to show the edge of our neural-network-based approach over kernel methods when dealing with very large datasets of large diagrams, since all the previous works dealing with this data [AEK+17, CCO17, LY18] use kernel methods.

Results are displayed in Table 1. Not only do we improve on previous results for ORBIT5K, we also show with ORBIT100K that classification accuracy is further increased as more observations are made available. For consistency we use the same accuracy metric as [LY18], that is, we split observations into 70%-30% training-test sets and report the average test accuracy over repeated runs. The parameters used are summarized in Supplementary Material, Section C.
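The orbit generation can be sketched in plain Python. The recurrence used below is the form this benchmark usually takes in the TDA literature ($x_{n+1} = x_n + r\,y_n(1 - y_n) \bmod 1$, then $y_{n+1} = y_n + r\,x_{n+1}(1 - x_{n+1}) \bmod 1$), and the five values of $r$ are those commonly used for it; both should be read as assumptions wherever the surrounding text is not explicit. Function names and the seeding scheme are hypothetical, and the conversion of each point cloud into a persistence diagram is not shown.

```python
import random

PARAMETERS = (2.5, 3.5, 4.0, 4.1, 4.3)   # assumed class parameters r

def generate_orbit(r, n_points=1000, rng=None):
    """One orbit of the system, started from a uniformly random position."""
    rng = rng or random.Random()
    x, y = rng.random(), rng.random()
    orbit = []
    for _ in range(n_points):
        x = (x + r * y * (1.0 - y)) % 1.0
        y = (y + r * x * (1.0 - x)) % 1.0   # uses the freshly updated x
        orbit.append((x, y))
    return orbit

def make_dataset(orbits_per_class, n_points=1000, seed=0):
    """List of (orbit, label) pairs, the label being the index of r."""
    rng = random.Random(seed)
    return [(generate_orbit(r, n_points, rng), label)
            for label, r in enumerate(PARAMETERS)
            for _ in range(orbits_per_class)]

toy_dataset = make_dataset(orbits_per_class=2, n_points=50)
single_orbit = generate_orbit(4.1, n_points=1000, rng=random.Random(1))
```

With `orbits_per_class=1000` this produces a dataset of the size of ORBIT5K, and with `orbits_per_class=20000` one of the size of ORBIT100K.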
In order to truly showcase the contribution of PersLay, we use a very simple network architecture, namely a two-layer network. The first layer is PersLay, which processes persistence diagrams. The resulting vector is normalized and fed to the second and final layer, a fully-connected layer whose output is used for predictions. See Figure 3 for an illustration. We emphasize that this simplistic two-layer architecture is designed so as to produce knowledge and understanding (see Supplementary Material, Section C), rather than achieving the best possible performances.
In our experiments, we set , where denote the th cell in a grid discretization of the unit square, and all are trainable parameters. is typically set to or . For aggregation operator op we use the sum. Further details are given in Supplementary Material, see Table 5 for reporting of the chosen hyperparameters and Table 6 for a study of the influence of the grid size or the choice of .
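A weight function that is constant on each cell of a grid over the unit square can be sketched as a simple table lookup. The grid size, the clamping of boundary points and the assumption that diagram coordinates have already been rescaled to $[0, 1]^2$ are choices made for this illustration only.

```python
def grid_weight(point, weights):
    """Value of the cell of an n x n grid over [0, 1]^2 that contains the point."""
    n = len(weights)
    x, y = point
    # Clamp so that points on (or beyond) the boundary fall in the nearest cell.
    i = min(max(int(x * n), 0), n - 1)
    j = min(max(int(y * n), 0), n - 1)
    return weights[i][j]

def weighted_sum(diagram, weights, phi):
    """Sum aggregation of w(p) * phi(p) over the diagram, for a scalar phi."""
    return sum(grid_weight(p, weights) * phi(p) for p in diagram)

weights = [[1.0, 2.0],
           [3.0, 4.0]]                      # hypothetical 2 x 2 grid of weights
diagram = [(0.25, 0.75), (0.5, 0.0), (1.0, 1.0)]
total = weighted_sum(diagram, weights, phi=lambda p: p[1] - p[0])
```

With a sum as aggregation, the output is linear in each cell value, so the gradient with respect to one cell is simply the sum of $\phi(p)$ over the diagram points falling in that cell.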
Point transformations are chosen among the three choices introduced in Section 3.1. Empirically, no representation is uniformly better than the others. The best point transformation $\phi$ for a given task could also be selected through a cross-validation scheme, or by learning a linear interpolation between these point transformations, that is, by setting $\phi = \alpha_\Lambda \phi_\Lambda + \alpha_\Gamma \phi_\Gamma + \alpha_L \phi_L$, where $\alpha_\Lambda, \alpha_\Gamma, \alpha_L$ are trainable nonnegative weights that sum to $1$. Thorough exploration of these alternatives is left for future work.

As mentioned in Section 2, the diagrams we produce are stable with respect to the choice of the HKS diffusion parameter $t$ (Thm. 2.2 and 2.3).
As such, we generally use and in our experiments.
We also refer to Supplementary Material, where Figure 7 illustrates the evolution of a persistence diagram w.r.t. $t$ and Figure 8 provides the classification accuracy for varying values of $t$.
In practice, it is thus sufficient to sample a few values of $t$ on a log-scale, as suggested for example in [SOG09, §5].
A subsequent natural question is: given a learning task, can $t$ itself be optimized? The question of optimizing over a family of filtrations induced by parametric functions has been studied both theoretically and practically in very recent works [BND+19, LOT19]. Hence, we also apply this approach to the filtrations induced by the HKS, optimizing the parameter $t$ during the learning process. Note that the running time of the experiments is greatly increased since one has to recompute all persistence diagrams for each epoch, that is, each time $t$ is updated. Moreover, we noticed after preliminary numerical investigations (see Supplementary Material, Section C) that classification accuracies were not improved by a large margin and remained comparable with results obtained without optimizing $t$, so we did not include this optimization step in our results.

Table 5 and the code provided in the supplementary material give a detailed report of the different hyperparameters chosen for each experiment.
We are now ready to evaluate our architecture on a series of different graph datasets commonly used as a baseline in graph classification problems. REDDIT5K, REDDIT12K, COLLAB (from [YV15]), IMDBB and IMDBM (from [TVH18]) are composed of social graphs. COX2, DHFR, MUTAG, PROTEINS, NCI1 and NCI109 are graphs coming from medical or biological frameworks (also from [TVH18]). A quantitative summary of these datasets is found in Supplementary Material, Table 4.
We compare performances with five other top graph classification methods. Scale-variant topo [TVH18] leverages a kernel for ordinary persistence diagrams computed on point clouds used to encode the graphs. RetGK [ZWX+18] is a kernel method for graphs that leverages attributes on the graph vertices and edges when they are available. FGSD [VZ17] is a finite-dimensional graph embedding that does not leverage attributes. Finally, GCNN [XC19] and GIN [XHL+18] are two graph neural network approaches that reach top-tier results. One could also compare our results on the REDDIT datasets to the ones of [HKN+17], where the authors also use persistence diagrams to feed a network (using as first channel a particular case of PersLay, see Section 3), achieving 54.5% and 44.5% accuracy on REDDIT5K and REDDIT12K respectively.
Topological features were extracted using the graph signatures introduced in Section 2. We combine these features with more traditional graph features formed by the eigenvalues of the normalized graph Laplacian along with the deciles of the computed HKS (right-side channel in Figure 3). In order to evaluate the impact of the topological features in this learning process, we refer to the ablation study in Supplementary Material, Table 7.

For each dataset, we perform 10 ten-fold evaluations and report the average and best ten-fold results. For a ten-fold evaluation, we split the data in 10 equally-sized folds, and record the classification accuracy obtained on the $k$th fold (test) after training on the remaining 9 others. Here, $k$ is cycled from 1 to 10. The mean of those 10 ten-fold experiments is naturally more robust for evaluation purposes, and we report it in the column “PersLay (Mean)”. This is consistent with the evaluation procedure from [ZWX+18]. We also report the best single ten-fold accuracy obtained, in the column “PersLay (Max)”, which is comparable to the results reported by all the other competitors.
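The evaluation protocol described above (ten repetitions of a ten-fold split, then a mean and a maximum over the repetitions) can be sketched as follows. The `evaluate` callback, which would train a model on the training indices and return its accuracy on the test indices, is a placeholder supplied by the caller; the shuffling and seeding scheme are assumptions of this sketch.

```python
import random

def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists for folds whose sizes differ by at most one."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]

def repeated_k_fold(n_samples, evaluate, n_repeats=10, n_folds=10):
    """Return (mean, max) over repetitions of the fold-averaged accuracy."""
    scores = []
    for repeat in range(n_repeats):
        fold_scores = [evaluate(train, test)
                       for train, test in k_fold_indices(n_samples, n_folds, seed=repeat)]
        scores.append(sum(fold_scores) / len(fold_scores))
    return sum(scores) / len(scores), max(scores)

# Example with 188 samples (the size of MUTAG) and a dummy evaluation callback
# that only reports the fraction of samples held out in the test fold.
splits = list(k_fold_indices(188))
mean_score, best_score = repeated_k_fold(188, lambda train, test: len(test) / 188)
```

The first returned value corresponds to the “Mean” column and the second to the “Max” column.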
In most cases, our approach is comparable with state-of-the-art results, despite using a very simple neural network architecture. Interestingly, both topology-based methods (SV and PersLay) have mediocre performances on the NCI datasets, suggesting that topology is not discriminative for these datasets. Additional experimental results, including ablation studies and variations of hyperparameters (weight grid size, diffusion parameter $t$) are provided in Supplementary Material, Section C.
| Dataset | SV [TVH18] | RetGK [ZWX+18] | FGSD [VZ17] | GCNN [XC19] | GIN [XHL+18] | PersLay (Mean) | PersLay (Max) |
|---|---|---|---|---|---|---|---|
| REDDIT5K | - | 56.1 | 47.8 | 52.9 | 57.0 | 55.6 | 56.5 |
| REDDIT12K | - | 48.7 | - | 46.6 | - | 47.7 | 49.1 |
| COLLAB | - | 81.0 | 80.0 | 79.6 | 80.1 | 76.4 | 78.0 |
| IMDBB | 72.9 | 71.9 | 73.6 | 73.1 | 74.3 | 71.2 | 72.6 |
| IMDBM | 50.3 | 47.7 | 52.4 | 50.3 | 52.1 | 48.8 | 52.2 |
| COX2 | 78.4 | 80.1 | - | - | - | 80.9 | 81.6 |
| DHFR | 78.4 | 81.5 | - | - | - | 80.3 | 80.9 |
| MUTAG | 88.3 | 90.3 | 92.1 | 86.7 | 89.0 | 89.8 | 91.5 |
| PROTEINS | 72.6 | 75.8 | 73.4 | 76.3 | 75.9 | 74.8 | 75.9 |
| NCI1 | 71.6 | 84.5 | 79.8 | 78.4 | 82.7 | 73.5 | 74.0 |
| NCI109 | 70.5 | - | 78.8 | - | - | 69.5 | 70.1 |
In this article, we introduced a new family of topological signatures on graphs that are both stable and well-formed for learning purposes. In parallel we defined a powerful and versatile neural network layer to process persistence diagrams called PersLay, which generalizes most of the techniques used to vectorize persistence diagrams that can be found in the literature, while optimizing them task-wise.
We showcase the efficiency of our approach by achieving state-of-the-art results on synthetic orbit classification coming from dynamical systems and being competitive on several graph classification problems from real-life data, while working at larger scales than kernel methods developed for persistence diagrams and remaining simpler than most of its neural network competitors. We believe that PersLay has the potential to become a central tool to incorporate topological descriptors in a wide variety of complex machine learning tasks based on neural networks.
Our code is publicly available at https://github.com/MathieuCarriere/perslay.
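Recall (see Section 1.3) that persistence diagrams are generally represented as multisets of points (i.e. points counted with multiplicity) supported on the upper half plane $\{(x, y) \in \mathbb{R}^2 : y > x\}$. Let $D_1$ and $D_2$ be two such diagrams and $1 \le p \le \infty$ be a parameter. Note in particular that $|D_1| \ne |D_2|$ in general. Let $\Delta = \{(x, x) : x \in \mathbb{R}\}$ denote the diagonal, and let $\Gamma(D_1, D_2)$ denote the set of all bijections between $D_1 \cup \Delta$ and $D_2 \cup \Delta$. Then, the $p$-diagram distance between $D_1$ and $D_2$ is defined as:

$$ d_p(D_1, D_2) = \inf_{\gamma \in \Gamma(D_1, D_2)} \left( \sum_{x \in D_1 \cup \Delta} \|x - \gamma(x)\|_\infty^p \right)^{1/p} \tag{6} $$

In particular, if $p = \infty$, we recover the bottleneck distance defined as:

$$ d_B(D_1, D_2) = d_\infty(D_1, D_2) = \inf_{\gamma \in \Gamma(D_1, D_2)} \sup_{x \in D_1 \cup \Delta} \|x - \gamma(x)\|_\infty \tag{7} $$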
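For diagrams with only a handful of points, these distances can be computed by brute force, which makes the role of the diagonal explicit. The sketch below assumes the sup norm as ground distance between points (the usual convention) and enumerates all matchings, so it is only meant as an illustration on tiny inputs; practical implementations rely on dedicated matching algorithms instead.

```python
from itertools import permutations

def diagram_distance(diag1, diag2, p=float("inf")):
    """Brute-force p-diagram distance (bottleneck distance for p = infinity)."""
    n, m = len(diag1), len(diag2)

    def to_diagonal(point):
        return abs(point[1] - point[0]) / 2.0   # sup-norm distance to the diagonal

    # Rows: the n points of diag1, then m diagonal slots.
    # Columns: the m points of diag2, then n diagonal slots.
    def cost(i, j):
        if i < n and j < m:
            (x1, y1), (x2, y2) = diag1[i], diag2[j]
            return max(abs(x1 - x2), abs(y1 - y2))
        if i < n:
            return to_diagonal(diag1[i])   # point of diag1 sent to the diagonal
        if j < m:
            return to_diagonal(diag2[j])   # point of diag2 sent to the diagonal
        return 0.0                         # diagonal matched with diagonal

    best = float("inf")
    for matching in permutations(range(n + m)):
        costs = [cost(i, j) for i, j in enumerate(matching)]
        if p == float("inf"):
            total = max(costs, default=0.0)
        else:
            total = sum(c ** p for c in costs) ** (1.0 / p)
        best = min(best, total)
    return best

diagram_a = [(0.0, 4.0), (1.0, 2.0)]
diagram_b = [(0.0, 3.0)]
bottleneck = diagram_distance(diagram_a, diagram_b)
one_distance = diagram_distance(diagram_a, diagram_b, p=1)
```

Here the optimal matching pairs $(0, 4)$ with $(0, 3)$ at cost 1 and sends $(1, 2)$ to the diagonal at cost 0.5, which is how diagrams with different numbers of points remain comparable.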
The proof directly follows from the following two theorems. The first one, proved in [HRG14], is a consequence of classical arguments from matrix perturbation theory.
Let and let be the Laplacian matrix of a graph with vertices. Let , be the distinct eigenvalues of and denote by the smallest distance between two distinct eigenvalues: . Let be another graph with vertices and Laplacian matrix with , where denotes the Frobenius norm of . Then, if , there exists a constant such that for any vertex ,
if , there exists two constants such that for any vertex ,
In particular, if , there exists a constant —notice that also depends on —such that in the two above cases,
Theorem 2.2 then immediately follows from the second theorem below, which is a special case of general stability results for persistence diagrams.
Fix a graph . With the same notations as in Section 2.2, recall that the eigenvalues of the normalized graph Laplacian satisfy , and the corresponding eigenvectors define an orthonormal family. In particular, is Lipschitz continuous for . Let be two positive diffusion parameters. We have, for any :
Thus in particular,
As in the previous proof, we conclude using the stability of persistence diagrams w.r.t. the bottleneck distance (see Thm. A.2).
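The Lipschitz behaviour in the diffusion parameter can be checked numerically on a small graph. The sketch below assumes the standard definition of the heat kernel signature, $\mathrm{hks}_t(v) = \sum_k \exp(-t\lambda_k)\,\psi_k(v)^2$ with $(\lambda_k, \psi_k)$ the eigenpairs of the normalized graph Laplacian, and uses the equivalent fact that this is the diagonal of $\exp(-tL)$, evaluated here by a truncated Taylor series rather than by an eigendecomposition. The function names are hypothetical, the graph is assumed to have no self-loops, and the truncation is only adequate for moderate values of $t$.

```python
import math


def normalized_laplacian(adj):
    """Normalized Laplacian I - D^{-1/2} A D^{-1/2} of an adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                lap[i][j] = 1.0 if deg[i] > 0 else 0.0
            elif adj[i][j]:
                lap[i][j] = -adj[i][j] / math.sqrt(deg[i] * deg[j])
    return lap


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def hks(adj, t, terms=60):
    """Heat kernel signature of every vertex: diagonal of exp(-t L)."""
    lap = normalized_laplacian(adj)
    n = len(lap)
    # `power` holds the current Taylor term (-t L)^k / k!, starting at k = 0.
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    diag = [1.0] * n
    for k in range(1, terms):
        power = matmul(power, lap)
        power = [[-t * x / k for x in row] for row in power]
        for i in range(n):
            diag[i] += power[i][i]
    return diag


# Path graph on three vertices, two nearby diffusion parameters.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
h_a, h_b = hks(path, 0.5), hks(path, 0.6)
print(max(abs(x - y) for x, y in zip(h_a, h_b)))
```

Under these assumptions, since the eigenvalues lie in $[0, 2]$ and the eigenvectors form an orthonormal basis, the printed value cannot exceed $2\,|t - t'| = 0.2$; this constant is our reading of the argument above rather than a restatement of the theorem.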
Tables 3 and 4 summarize key information of each dataset for both our experiments. We also provide in Figure 4 an illustration of the orbits we generated in Section 3.2.
| Dataset | Number of orbits observed | Number of classes | Number of points per orbit |
|---|---|---|---|
| ORBIT5K | 5,000 | 5 | 1,000 |
| ORBIT100K | 100,000 | 5 | 1,000 |
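| Dataset | Number of graphs | Number of classes | Av. nodes | Av. edges | Av. $\beta_0$ | Av. $\beta_1$ |
|---|---|---|---|---|---|---|
| REDDIT5K | 5,000 | 5 | 508.5 | 594.9 | 3.71 | 90.1 |
| REDDIT12K | 12,000 | 11 | 391.4 | 456.9 | 2.8 | 68.29 |
| COLLAB | 5,000 | 3 | 74.5 | 2457.5 | 1.0 | 2383.7 |
| IMDBB | 1,000 | 2 | 19.77 | 96.53 | 1.0 | 77.76 |
| IMDBM | 1,500 | 3 | 13.00 | 65.94 | 1.0 | 53.93 |
| COX2 | 467 | 2 | 41.22 | 43.45 | 1.0 | 3.22 |
| DHFR | 756 | 2 | 42.43 | 44.54 | 1.0 | 3.12 |
| MUTAG | 188 | 2 | 17.93 | 19.79 | 1.0 | 2.86 |
| PROTEINS | 1,113 | 2 | 39.06 | 72.82 | 1.08 | 34.84 |
| NCI1 | 4,110 | 2 | 29.87 | 32.30 | 1.19 | 3.62 |
| NCI109 | 4,127 | 2 | 29.68 | 32.13 | 1.20 | 3.64 |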
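The last two columns of the table above are consistent with the Euler characteristic relation for graphs, $\beta_1 = |E| - |V| + \beta_0$, where $\beta_0$ counts connected components and $\beta_1$ counts independent loops; reading these columns as average Betti numbers is an inference from the reported values. A minimal sketch, with a hypothetical helper `betti_numbers` and assuming a simple undirected graph given as an edge list:

```python
def betti_numbers(num_vertices, edges):
    """Return (beta_0, beta_1) of a simple undirected graph.

    beta_0 is obtained with a union-find over the edges, and beta_1 follows
    from the Euler characteristic: beta_1 = |E| - |V| + beta_0.
    """
    parent = list(range(num_vertices))

    def find(u):
        # Follow parent pointers to the root, halving paths on the way.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    components = num_vertices
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # merge two components
            components -= 1

    beta0 = components
    beta1 = len(edges) - num_vertices + beta0
    return beta0, beta1


# A triangle plus one isolated vertex: two components, one loop.
print(betti_numbers(4, [(0, 1), (1, 2), (2, 0)]))
```

As a sanity check on the table, the REDDIT5K averages give 594.9 - 508.5 + 3.71 ≈ 90.1, which matches the last column.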
Figure 5 provides an illustration of the weight grid learned after training on the MUTAG dataset. Roughly speaking, activated cells highlight the areas of the plane where the presence of points was discriminating in the classification process. These learned grids thus emphasize the points of the persistence diagrams that matter w.r.t. the learning task.
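To make the role of the weight grid concrete, here is a minimal sketch of a PersLay-style channel: each diagram point is mapped to a vector, scaled by the weight of the grid cell it falls in, and the results are aggregated with a permutation-invariant operation. The names `grid_weight` and `perslay_channel`, the normalization of diagrams to the unit square, and the toy point transformation are illustrative assumptions, not the interface of the released code.

```python
def grid_weight(point, grid):
    """Weight of a diagram point, read from a square grid over [0, 1]^2."""
    n = len(grid)
    x, y = point
    # Clamp indices so that points on or outside the border stay in range.
    i = min(max(int(x * n), 0), n - 1)
    j = min(max(int(y * n), 0), n - 1)
    return grid[i][j]


def perslay_channel(diagram, grid, transform, op="sum"):
    """Aggregate weighted point transformations over a diagram."""
    vectors = [[grid_weight(p, grid) * c for c in transform(p)]
               for p in diagram]
    if not vectors:
        return []  # empty diagram: nothing to aggregate
    columns = list(zip(*vectors))
    if op == "sum":
        return [sum(col) for col in columns]
    if op == "max":
        return [max(col) for col in columns]
    raise ValueError("unsupported operation: " + op)


# A 2x2 grid in which only two cells are activated.
grid = [[0.0, 1.0], [0.0, 0.5]]
diagram = [(0.1, 0.9), (0.6, 0.7), (0.2, 0.3)]


def coords(p):
    """Toy point transformation returning the coordinates themselves."""
    return [p[0], p[1]]


print(perslay_channel(diagram, grid, coords, op="sum"))
print(perslay_channel(diagram, grid, coords, op="max"))
```

In this toy setting the point (0.2, 0.3) falls in a cell of weight zero and does not contribute to the output, which is the behaviour that a learned grid with few activated cells would exhibit.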
As stated in Theorem 2.3, for a fixed graph , the function is Lipschitz continuous with respect to the bottleneck distance between persistence diagrams. Informally (see Supplementary Material, Section A for a formal definition), it means that the points of must move smoothly with respect to . This is experimentally illustrated in Figure 7, where we plot the four diagrams built from a graph of the MUTAG dataset.
As mentioned in Section 2.2, the parameter can also be treated as a trainable parameter that is optimized during learning. In our experiments, however, this does not prove to be worthwhile. Indeed, our diagrams are not particularly sensitive to the choice of , and thus fixing some sampled in log-scale is enough. Figure 6 illustrates the evolution of parameter over 40 epochs when trained on the MUTAG dataset (one epoch corresponds to a pass of stochastic gradient descent over the whole dataset). As one can see, parameter converges quickly. More importantly, it remains almost constant when initialized at , suggesting that this choice is a (locally) optimal one. Fortunately, this is the parameter we use in our experiment (see Table 5). On the other hand, each time is updated (that is, at each epoch), one must recompute the diagrams for all the graphs in the training set, significantly increasing the running time of the algorithm.

| Dataset | Func. used | PD preproc. | PersLay | Optim. |
|---|---|---|---|---|
| ORBIT5K | , | prom(500) | Pm(25,25,10,top5) | adam(0.01, 0., 300) |
| ORBIT100K | , | prom(500) | Pm(25,25,10,top5) | adam(0.01, 0., 300) |
| REDDIT5K | | prom(500) | Pm(25,25,10,sum) | adam(0.01, 0.99, 500) |
| REDDIT12K | | prom(500) | Pm(5,5,10,sum) | adam(0.01, 0.99, 1000) |
| COLLAB | , | prom(500) | Pm(5,5,10,sum) | adam(0.01, 0.9, 1000) |
| IMDBB | , | prom(500) | Im(20,(10,2),20,sum) | adam(0.01, 0.9, 500) |
| IMDBM | , | prom(500) | Im(10,(10,2),10,sum) | adam(0.01, 0.9, 500) |
| COX2 | , | - | Im(20,(10,2),20,sum) | adam(0.01, 0.9, 500) |
| DHFR | , | - | Im(20,(10,2),20,sum) | adam(0.01, 0.9, 500) |
| MUTAG | | - | Im(20,(10,2),10,sum) | adam(0.01, 0.9, 100) |
| PROTEINS | | prom(500) | Im(15,(10,2),10,sum) | adam(0.01, 0.9, 70) |
| NCI1 | , | - | Pm(25,25,10,sum) | adam(0.01, 0.9, 300) |
| NCI109 | , | - | Pm(25,25,10,sum) | adam(0.01, 0.9, 300) |

| | | Grid size for trainable weights | | | | | | Point transformation | | | Perm op | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | None | | | | | | Gaussian | line | triangle | Sum | Max |
| MUTAG | Train/Test acc (%) | 92.3/88.9 | 91.1/88.8 | 91.7/89.6 | 92.3/89.9 | 93.7/88.3 | 94.1/87.7 | 92.5/89.7 | 89.2/84.2 | 91.5/85.0 | 92.3/89.5 | 91.9/87.4 |
| MUTAG | Run time, CPU (s) | 2.30 | 2.77 | 2.79 | 2.77 | 2.77 | 2.78 | 2.80 | 5.91 | 4.42 | 2.75 | 2.82 |
| COLLAB | Train/Test acc (%) | 76.5/75.3 | 78.6/75.8 | 79.0/76.2 | 80.0/76.5 | 83.5/73.9 | 94.0/71.3 | 79.7/75.3 | 79.9/76.1 | 79.4/74.7 | 80.0/76.4 | 78.8/75.0 |
| COLLAB | Run time, GPU (s) | 26.0 | 40.4 | 43.5 | 43.8 | 44.1 | 45.6 | 45.8 | 54.0 | 61.4 | 44.3 | 48.1 |
Input data was fed to the network with minibatches of size 128. For each dataset, various parameters are given (extended persistence diagrams, neural network architecture, optimizers, etc.) that were used to obtain the scores from Footnote 5. In Table 5, we use the following shortcuts:
: persistence diagrams obtained with Gudhi’s dimensional AlphaComplex filtration.
: extended persistence diagram obtained with HKS on the graph with parameter .
prom($k$): preprocessing step selecting the $k$ points that are the farthest away from the diagonal.
PersLay channel Im(, (, ), , op) stands for a function obtained by using a Gaussian point transformation sampled on grid on the unit square followed by a convolution with filters of size , for a weight function optimized on a grid and for an operation op.
PersLay channel Pm(, , , op) stands for a function obtained by using a line point transformation with lines followed by a permutation equivariant function [ZKR+17] in dimension , for a weight function optimized on a grid and for an operation op.
adam() stands for the ADAM optimizer [KB14] with learning rate , using an Exponential Moving Average (https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage) with decay rate , and run during epochs.
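Two of the shortcuts above are simple enough to be sketched directly. The code below is an illustration under stated assumptions, not the released implementation: `prom` assumes that its argument is the number of points kept (as suggested by the value prom(500) in Table 5) and measures the distance to the diagonal through |death - birth|, so that extended diagram points lying below the diagonal are handled too; `ema_update` is a hypothetical helper applying the usual exponential moving average rule to a single scalar.

```python
def prom(diagram, k):
    """Keep the k points of a diagram that are farthest from the diagonal."""
    ranked = sorted(diagram, key=lambda p: abs(p[1] - p[0]), reverse=True)
    return ranked[:k]


def ema_update(shadow, value, decay):
    """One exponential moving average step on a scalar parameter."""
    return decay * shadow + (1.0 - decay) * value


# Persistence values are 1, 5, 1 and 3: the second and fourth points are kept.
diagram = [(0, 1), (0, 5), (2, 3), (4, 1)]
print(prom(diagram, 2))

# Smoothing a parameter trajectory with decay rate 0.9.
shadow = 1.0
for value in [0.0, 0.0, 0.0]:
    shadow = ema_update(shadow, value, 0.9)
print(shadow)
```

With a decay rate of 0, as in the ORBIT rows of Table 5, `ema_update` simply returns the latest value, i.e. no averaging takes place.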
As our approach mixes our topological features and some standard graph features, we provide two ablation studies. In Table 7, the column “Spectral” reports the test accuracies obtained by using only these additional features, while the column “PD alone” records the accuracies obtained with the extended and ordinary persistence diagrams alone. As ordinary persistence only encodes the connectivity properties of graphs, a gap in performance between extended and ordinary persistence can be interpreted as 1-dimensional features (i.e. loops) being informative for classification purposes. Table 7 also reports the standard deviations, which were omitted in 5 for the sake of clarity.

Similarly, we give in Table 6 the influence of the grid size that we choose as weight function . In particular, we also perform an ablation study: a grid size of None means that we enforce for all . As expected, increasing the grid size improves train accuracy but leads to overfitting for too large values. However, this increase has only a small impact on running times, whereas not using any grid significantly lowers them.
Finally, Figure 8 illustrates the variation of accuracy for both MUTAG and COLLAB datasets when varying the HKS parameter used when generating the extended persistence diagrams. One can see that the accuracy reached on MUTAG does not depend on the choice of , which could intuitively be explained by the small size of the graphs in this dataset, making the parameter not very relevant. Experiments are performed on a single 10-fold, with 100 epochs. Parameters of PersLay are set to Im(20,(),20) for this experiment.
| Dataset | Spectral alone | PD alone (Extended) | PD alone (Ordinary) | PersLay |
|---|---|---|---|---|
| REDDIT5K | 49.7(0.3) | 55.0 | 52.5 | 55.6(0.3) |
| REDDIT12K | 39.7(0.1) | 44.2 | 40.1 | 47.7(0.2) |
| COLLAB | 67.8(0.2) | 71.6 | 69.2 | 76.4(0.4) |
| IMDBB | 67.6(0.6) | 68.8 | 64.7 | 71.2(0.7) |
| IMDBM | 44.5(0.4) | 48.2 | 42.0 | 48.8(0.6) |
| COX2 * | 78.2(1.3) | 81.5 | 79.0 | 80.9(1.0) |
| DHFR * | 69.5(1.0) | 78.2 | 71.8 | 80.3(0.8) |
| MUTAG * | 85.8(1.3) | 85.1 | 70.2 | 89.8(0.9) |
| PROTEINS * | 73.5(0.3) | 72.2 | 69.7 | 74.8(0.3) |
| NCI1 * | 65.3(0.2) | 72.3 | 68.9 | 73.5(0.3) |
| NCI109 * | 64.9(0.2) | 67.0 | 66.2 | 69.5(0.3) |