PersLay: A Neural Network Layer for Persistence Diagrams and New Graph Topological Signatures

04/20/2019, by Mathieu Carrière et al.

Persistence diagrams, the most common descriptors of Topological Data Analysis, encode topological properties of data and have already proved pivotal in many different applications of data science. However, since the (metric) space of persistence diagrams is not a Hilbert space, they end up being difficult inputs for most machine learning techniques. To address this concern, several vectorization methods have been put forward that embed persistence diagrams into either finite-dimensional Euclidean spaces or (implicit) infinite-dimensional Hilbert spaces with kernels. In this work, we focus on persistence diagrams built on top of graphs. Relying on extended persistence theory and the so-called Heat Kernel Signatures, we show how graphs can be encoded by (extended) persistence diagrams in a provably stable way. We then propose a general and versatile framework for learning vectorizations of persistence diagrams, which encompasses most of the vectorization techniques used in the literature. We finally showcase the experimental strength of our setup by achieving competitive scores on classification tasks on real-life graph datasets.

1 Introduction

Topological Data Analysis (TDA) is a field of data science whose goal is to detect and encode topological features (such as connected components, loops, cavities…) that are present in datasets in order to improve inference and prediction. Its main descriptor is the so-called persistence diagram, which takes the form of a set of points in the Euclidean plane ℝ², each point corresponding to a topological feature of the data, with its coordinates encoding the size of that feature. This descriptor has been successfully used in many different applications of data science, such as signal analysis [PH15], material science [BHO18], cellular data [CÁM17], or shape recognition [LOC14] to name a few. This wide range of applications is mainly due to the fact that persistence diagrams encode information based on topology, and as such this information is very often complementary to the one retrieved by more classical descriptors.

However, the space of persistence diagrams heavily lacks structure: different persistence diagrams may have different numbers of points, and several basic operations are not well-defined, such as addition and scalar multiplication, which unfortunately dramatically impedes their use in machine learning applications. To handle this issue, a lot of attention has been devoted to vectorizations of persistence diagrams through the construction of either finite-dimensional embeddings [AEK+17, COO15, CFL+15, KAL18], i.e., embeddings turning persistence diagrams into vectors in Euclidean space, or kernels [BUB15, CCO17, KHF16, LY18, RHB+15], i.e., generalized scalar products that implicitly turn persistence diagrams into elements of infinite-dimensional Hilbert spaces.

Even though these methods improved the use of persistence diagrams in machine learning tremendously, several issues remain. For instance, most of these vectorizations only have a few trainable parameters, which may prevent them from fitting well to specific applications. As a consequence, it may be very difficult to determine which vectorization is going to work best for a given task. Furthermore, kernel methods (which are generally efficient in practice) require computing and storing the kernel evaluations for each pair of persistence diagrams. Since all available kernels have a complexity that is at least linear, and often quadratic, in the number of persistence diagram points for a single matrix entry computation, kernel methods quickly become very expensive in terms of running time and memory usage on large sets or for large diagrams.

In this work, we show how to use neural networks for handling persistence diagrams. Contrary to static vectorization methods proposed in the literature, we actually learn the vectorization with respect to the learning task that is being solved. Moreover, our framework is general enough so that most of the common vectorizations of the literature [AEK+17, BUB15, CFL+15] can be retrieved from our method by specifying its parameters accordingly.

1.1 Our contributions

The contribution of this paper is two-fold.

First, we introduce in Section 2 a new family of topological signatures on graphs: the extended persistence diagrams built from the Heat Kernel Signatures (HKS) of the graph. These signatures depend on a diffusion parameter t. Although HKS are well-known signatures, they have never been used in the context of persistent homology to encode topological information for graphs. We prove that the resulting diagrams are stable with respect to both the input graph and the parameter t. The use of extended persistence, as opposed to the commonly used ordinary persistence, is introduced in order to handle “essential” components, see Section 2.1 below. To our knowledge, this is the first use of extended persistence in a machine learning context. We also experimentally showcase its strength over ordinary persistence on several applications.

Second, building on the recent introduction of Deep Sets [ZKR+17], we apply and extend that framework to persistence diagrams, implementing PersLay: a simple, highly versatile, automatically differentiable layer for neural network architectures that can process topological information encoded in persistence diagrams computed from all sorts of datasets. Our framework encompasses most of the common vectorizations that exist in the literature, and we give a necessary and sufficient condition for the learned vectorization to be continuous, improving on the analysis of [HKN19]. Using a large-scale dataset coming from dynamical systems that is commonly used in the TDA literature, we give in Section 3.2 a proof-of-concept of the scalability and efficiency of this neural network approach over standard methods in TDA. The implementation of PersLay is publicly available as a plug-and-play Python package based on tensorflow at https://github.com/MathieuCarriere/perslay.

We finally combine these two contributions in Section 4 by performing graph classification on benchmark datasets coming from various fields of science, such as biology, chemistry and social sciences.

Figure 1: Illustration of sublevel and superlevel graphs: an input graph along with the values of a function (blue), its sublevel graphs for increasing threshold values, and its superlevel graphs for decreasing threshold values.

1.2 Related work

Various techniques have been proposed to encode the topological information that is contained in the structure of a given graph, see for instance [AMA07, LSY+12, FF12]. In this work, we focus on topological features computed with persistent homology (see Section 1.3 below). This requires defining a real-valued function on the nodes of the graph. A first simple choice—made in [HKN+17]—is to map each node to its degree. In another recent work [ZW19], the authors proposed to use the Jaccard index and the Ricci curvature. In [TVH18], the authors adopt a slightly different approach: given a graph with n nodes and an integer parameter p, they compute an n × n matrix whose (i, j) entry is the probability that a random walk starting at node i ends at node j after p steps. This matrix can be thought of as a finite metric space, i.e., a set of points embedded in Euclidean space, on which topological descriptors can also be computed [CdO14]. Here, p acts as a scale parameter: small values of p encode local information while large values catch large-scale features. Our approach, presented in Section 2, shares the same scale-parametric idea, with the notable difference that we compute our topological descriptors on the graphs directly.

The first approach to feed a neural network architecture with persistence diagrams was presented in [HKN+17]. It amounts to evaluating the points of the persistence diagram against one (or more) Gaussian distributions whose parameters are learned during the training process (see Section 3 for more details). Such a transformation is oblivious to any ordering of the diagram points, which is a suitable property, and is a particular case of a permutation invariant transformation. These general transformations are studied in [ZKR+17], and used to define neural networks on sets. These networks are referred to as Deep Sets. In particular, the authors in [ZKR+17] observed that any permutation invariant function F defined on point clouds supported on ℝ^d with exactly n points can be written in the form:

F(x_1, …, x_n) = ρ( Σ_{i=1}^{n} φ(x_i) ),   (1)

for some functions φ: ℝ^d → ℝ^q and ρ: ℝ^q → ℝ^{q'}. Obviously, the converse is true: any function of the form (1) is permutation invariant. Hofer et al. make use of this idea in [HKN19], where they suggest three possible functions φ, which roughly correspond to Gaussian, spike and cone functions centered on the diagram points. Our framework builds on the same idea, but substantially generalizes theirs, as we are able to generate many more possible vectorizations and ways to combine them (for instance by using a maximum instead of a sum in Equation (1)). We deepen the analysis by observing how common vectorizations of the TDA literature can be obtained as specific instances of our architecture (Section 3). Moreover, we allow for more general weight functions that provide additional interpretability, as shown in Supplementary Material, Section C.

1.3 Background on ordinary persistence

In this section, we briefly recall the basics of ordinary persistence theory. We refer the interested reader to [CEH09, EH10, OUD15] for a thorough description.

Let X be a topological space and f: X → ℝ be a real-valued continuous function. The α-sublevel set of f is then defined as X_α := {x ∈ X : f(x) ≤ α}. Making α increase from −∞ to +∞ gives an increasing sequence of sublevel sets, called the filtration induced by f. It starts with the empty set and ends with the whole space X (see Figure 1 for an illustration on a graph). Ordinary persistence keeps track of the times of appearance and disappearance of topological features (connected components, loops, cavities, etc.) in this sequence. For instance, one can store the value α_b, called the birth time, for which a new connected component appears in X_{α_b}. This connected component eventually gets merged with another one for some value α_d ≥ α_b, which is stored as well and called the death time. Moreover, one says that the component persists on the corresponding interval [α_b, α_d]. Similarly, we save the birth and death values of each loop, cavity, etc. that appears in a specific sublevel set and disappears (gets “filled”) in a later one. This family of intervals is called the barcode, or persistence diagram, of f, and can be represented as a multiset of points (i.e., a point cloud where points are counted with multiplicity) supported on {(b, d) ∈ ℝ² : d ≥ b}, each point having coordinates (α_b, α_d).
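To make this construction concrete, here is a minimal sketch (not the authors' code; the toy graph, function values and variable names are illustrative assumptions) of how the sublevel-set filtration of a function on the nodes of a graph can be built and its ordinary persistence computed with the Gudhi library mentioned in Section 2:

    import gudhi

    # Toy graph: a 4-cycle, with one (illustrative) function value per vertex.
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    f = {0: 0.0, 1: 1.0, 2: 0.5, 3: 2.0}

    st = gudhi.SimplexTree()
    for v, val in f.items():
        st.insert([v], filtration=val)                 # a vertex enters at its function value
    for u, v in edges:
        st.insert([u, v], filtration=max(f[u], f[v]))  # an edge enters once both endpoints have

    # Ordinary persistence of the sublevel-set filtration; features that never disappear
    # (here the connected component and the loop) are reported with an infinite death time.
    print(st.persistence())                            # list of (dimension, (birth, death)) pairs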

The space of persistence diagrams can be equipped with a parametrized family of metrics d_p, p ∈ [1, ∞], whose proper definition is not required in this work and is given in Supplementary Material, Appendix A for the sake of completeness. In the particular case p = ∞, this metric is referred to as the bottleneck distance between persistence diagrams.

2 Extended Persistence Diagrams

Figure 2: Extended persistence diagram computed on a graph: topological features of the graph are detected in the sequence of sublevel and superlevel graphs shown on the left of the figure. The corresponding intervals are displayed under the sequence: the black interval represents the connected component of the graph, the red one represents its downward branch, the blue one represents its upward branch, and the green one represents its loop. The extended persistence diagram given by the intervals is shown on the right.

2.1 Extended persistence

In general, ordinary persistence does not fully encode the topology of X. For instance, consider a graph G = (V, E), with vertices V and (non-oriented) edges E. Let f be a function defined on its vertices, and consider the sublevel graphs G_α = (V_α, E_α), where V_α = {v ∈ V : f(v) ≤ α} and E_α = {(v_1, v_2) ∈ E : v_1, v_2 ∈ V_α}, see Figure 1. In this sequence of sublevel graphs, the loops persist forever since they never disappear (they never get “filled”), and the same applies for the whole connected components of G. Moreover, branches pointing upwards (with respect to the orientation given by f, see Figure 2) are missed (while those pointing downwards are detected), since they do not create connected components when they appear in the sublevel graphs, making ordinary persistence unable to detect them.

To handle this issue, extended persistence refines the analysis by also looking at the so-called superlevel sets X^α := {x ∈ X : f(x) ≥ α}. Similarly to ordinary persistence on sublevel sets, making α decrease from +∞ to −∞ produces a sequence of increasing subsets, for which structural changes can be recorded.

Although extended persistence can be defined for general metric spaces (see the references given above), we restrict ourselves to the case where X is a graph. The sequence of increasing superlevel graphs is illustrated in Figure 1. In particular, death times can be defined for loops and whole connected components by picking the superlevel graphs in which the feature appears again, and using the corresponding value α as the death time for these features. In this case, branches pointing upwards can be detected in this sequence of superlevel graphs, in the exact same way that downward branches were in the sublevel graphs. See Figure 2 for an illustration.

Finally, the family of intervals of the form [α_b, α_d] is turned into a multiset of points in the Euclidean plane by using the interval endpoints as coordinates. This multiset is called the extended persistence diagram of f and is denoted by Dg(G, f).

Since graphs have four types of topological features (see Figure 2), namely upward branches, downward branches, loops and connected components, the corresponding points in extended persistence diagrams can be of four different types. These types are denoted as Ord0, Rel1, Ext0 and Ext1 for downward branches, upward branches, connected components and loops respectively.

While it encodes more information than ordinary persistence, extended persistence also ensures that all points have finite coordinates. In comparison, methods relying on ordinary persistence have to design specific tools to handle points with infinite coordinates [HKN+17, HKN19], or simply ignore them [CCO17], losing information in the process. Extended persistence therefore allows the use of generic architectures regardless of the homology dimension. Empirical performances show substantial improvement over using ordinary persistence only (see Supplementary Material, Table 7). In practice, computing extended persistence diagrams can be done efficiently with the C++/Python Gudhi library [THE15]. Persistence diagrams are usually compared with the so-called bottleneck distance d_∞—whose proper definition is not required for this work and is recalled in Supplementary Material, Section A. However, the resulting metric space is not a Hilbert space and as such, incorporating diagrams into a learning pipeline requires designing specific tools, which we do in Section 3.
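As an illustration of the previous paragraph, the following sketch computes the extended persistence diagrams of the same toy graph as above with Gudhi; it assumes a recent Gudhi release in which the SimplexTree class exposes extend_filtration() and extended_persistence().

    import gudhi

    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    f = {0: 0.0, 1: 1.0, 2: 0.5, 3: 2.0}

    st = gudhi.SimplexTree()
    for v, val in f.items():
        st.insert([v], filtration=val)
    for u, v in edges:
        st.insert([u, v], filtration=max(f[u], f[v]))

    st.extend_filtration()                  # encode sublevel and superlevel filtrations together
    ord_, rel_, ext_plus, ext_minus = st.extended_persistence()
    # For a graph, these four lists roughly correspond to the subtypes Ord0, Rel1, Ext0 and Ext1
    # of Section 2.1; in particular the loop now appears in ext_minus with finite coordinates.
    print(ext_minus)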

We recall that extended persistence diagrams can be computed only after having defined a real-valued function on the nodes of the graphs. In the next section, we define a family of such functions from the so-called Heat Kernel Signatures (HKS) for graphs, and show that these signatures enjoy stability properties. Moreover, Section 4 will further demonstrate that they lead to competitive results for graph classification.

2.2 Heat kernel signatures

HKS is an example of a spectral family of signatures, that is, functions derived from the spectral decomposition of graph Laplacians, which provide informative features for graph analysis. We start this section with a few basic definitions. The adjacency matrix A of a graph G = (V, E) with vertex set V = {v_1, …, v_n} is the n × n matrix defined by A_{ij} = 1 if (v_i, v_j) ∈ E and 0 otherwise. The degree matrix D is the diagonal matrix defined by D_{ii} = deg(v_i). The normalized graph Laplacian L_w(G) = I − D^{−1/2} A D^{−1/2} is the linear operator acting on the space of functions defined on the vertices of G. It admits an orthonormal basis of eigenfunctions ψ_1, …, ψ_n and its eigenvalues satisfy 0 ≤ λ_1 ≤ ⋯ ≤ λ_n ≤ 2. As the orthonormal eigenbasis is not uniquely defined, the eigenfunctions cannot be used as such to compare graphs. Instead we consider the Heat Kernel Signatures (HKS):

Definition 2.1 ([HRG14, SOG09]).

Given a graph G = (V, E) and t ≥ 0, the Heat Kernel Signature with diffusion parameter t is the function hks_{G,t} defined on the vertices of G by hks_{G,t}(v) := Σ_{k=1}^{n} exp(−t λ_k) ψ_k(v)².
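A minimal NumPy sketch of Definition 2.1 (an illustration under the notations above, not the released implementation; the adjacency matrix below is an arbitrary example) reads as follows.

    import numpy as np

    def hks(adjacency, t):
        """Heat Kernel Signature with diffusion parameter t, one value per vertex."""
        A = np.asarray(adjacency, dtype=float)
        deg = A.sum(axis=1)
        d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
        L = np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt     # normalized graph Laplacian
        eigvals, eigvecs = np.linalg.eigh(L)                 # orthonormal eigenbasis (columns)
        # hks_{G,t}(v) = sum_k exp(-t * lambda_k) * psi_k(v)^2
        return (np.exp(-t * eigvals)[None, :] * eigvecs ** 2).sum(axis=1)

    A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])  # a 4-cycle
    print(hks(A, t=10.0))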

The HKS have already been used as signatures to address graph matching problems [HRG14] or to define spectral descriptors to compare graphs [TMK+18]. These approaches rely on the distributions of values taken by the HKS, but not on their global topological structure, which is encoded in their extended persistence diagrams. For the sake of concision, we denote by Dg(G, t) the extended persistence diagram obtained from a graph G using the filtration induced by the HKS with diffusion parameter t, that is, Dg(G, t) := Dg(G, hks_{G,t}). The following theorem shows these diagrams to be stable with respect to the bottleneck distance between persistence diagrams. The proof can be found in Supplementary Material, Section A.

Theorem 2.2 (Stability w.r.t. graph perturbations).

Let t > 0 and let L_w(G) be the normalized Laplacian matrix of a graph G with n vertices. Let G′ be another graph with n vertices and normalized Laplacian matrix L_w(G′). Then there exists a constant C(t, G), only depending on t and the spectrum of L_w(G), such that for ‖L_w(G) − L_w(G′)‖_F small enough, where ‖·‖_F denotes the Frobenius norm:

d_∞(Dg(G, t), Dg(G′, t)) ≤ C(t, G) ‖L_w(G) − L_w(G′)‖_F.   (2)

On the influence of the diffusion parameter t.

Building an extended persistence diagram on top of a graph with the Heat Kernel Signatures requires picking a specific value of t. In particular, understanding the influence, and thus the choice, of the diffusion parameter t is an important question for statistical and learning applications. First, we state that the map t ↦ Dg(G, t) is Lipschitz continuous. The proof is found in Supplementary Material, Section A.

Theorem 2.3 (Stability w.r.t. the parameter t).

Let G be a graph whose normalized Laplacian has eigenvalues λ_1 ≤ ⋯ ≤ λ_n. The map t ↦ Dg(G, t) is λ_n-Lipschitz continuous, that is, for t, t′ > 0,

d_∞(Dg(G, t), Dg(G, t′)) ≤ λ_n |t − t′| ≤ 2 |t − t′|.   (3)

It follows from Theorem 2.3 that persistence diagrams are robust to the choice of t. An empirical illustration is shown in Supplementary Material, Figures 7 and 8.
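This robustness can also be checked numerically. The sketch below (illustrative only; it reuses the hypothetical hks() helper from the previous snippet together with Gudhi's extended persistence and bottleneck distance routines) compares the loop subdiagrams obtained from two nearby diffusion parameters.

    import numpy as np
    import gudhi

    def hks_extended_diagrams(A, t):
        """Extended persistence of the HKS filtration on the graph with adjacency matrix A."""
        n = len(A)
        f = hks(A, t)                                    # HKS values on the vertices (see above)
        st = gudhi.SimplexTree()
        for v in range(n):
            st.insert([v], filtration=f[v])
        for u in range(n):
            for v in range(u + 1, n):
                if A[u, v]:
                    st.insert([u, v], filtration=max(f[u], f[v]))
        st.extend_filtration()
        return st.extended_persistence()                 # [Ordinary, Relative, Extended+, Extended-]

    A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
    d1 = hks_extended_diagrams(A, t=10.0)
    d2 = hks_extended_diagrams(A, t=10.5)
    loops1 = [pt for _, pt in d1[3]]                     # Extended- (loop) points as (b, d) pairs
    loops2 = [pt for _, pt in d2[3]]
    print(gudhi.bottleneck_distance(loops1, loops2))     # small, as predicted by Theorem 2.3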

3 Neural Network Learning With Perslay

In this section, we introduce PersLay: a general and versatile neural network layer for learning persistence diagram vectorizations.

Dataset      PSS-K        PWG-K        SW-K        PF-K        PersLay
ORBIT5K      72.38(2.4)   76.63(0.7)   83.6(0.9)   85.9(0.8)   87.7(1.0)
ORBIT100K    —            —            —           —           89.2(0.3)
Table 1: Performance table. PSS-K, PWG-K, SW-K and PF-K stand for the Persistence Scale Space Kernel [RHB+15], the Persistence Weighted Gaussian Kernel [KHF16], the Sliced Wasserstein Kernel [CCO17] and the Persistence Fisher Kernel [LY18] respectively. We report the scores given in [LY18] for the competitors on ORBIT5K, and the ones we obtained using PersLay for both the ORBIT5K and ORBIT100K datasets.

3.1 PersLay

In order to define a layer for persistence diagrams, we modify the Deep Sets architecture [ZKR+17] by defining and implementing a series of new permutation invariant layers, so as to be able to recover and generalize standard vectorization methods used in Topological Data Analysis. To that end, we define our generic neural network layer for persistence diagrams, which we call PersLay, through the following equation:

PersLay(Dg) := op({ w(p) · φ(p) }_{p ∈ Dg}),   (4)

where op is any permutation invariant operation (such as minimum, maximum, sum, kth largest value…), w: ℝ² → ℝ is a weight function for the persistence diagram points, and φ: ℝ² → ℝ^q is a representation function that we call point transformation, mapping each point of a persistence diagram to a vector.

In practice, w and φ are parametrized functions w_θ and φ_θ whose gradients with respect to θ are known and implemented, so that back-propagation can be performed and the parameters θ can be optimized during the training process. We emphasize that any neural network architecture can be composed with PersLay to generate a neural network architecture for persistence diagrams. Let us now introduce the three point transformation functions that we use and implement for the parameter φ in Equation (4); a minimal sketch combining them with Equation (4) is given right after the list below.

  • The triangle point transformation φ_Λ: p ↦ [Λ_p(t_1), …, Λ_p(t_q)]^T, where the triangle function associated to a point p = (x, y) is Λ_p(t) = max(0, y − |t − x|), with q ∈ ℕ and t_1, …, t_q ∈ ℝ.

  • The Gaussian point transformation φ_Γ: p ↦ [Γ_p(t_1), …, Γ_p(t_q)]^T, where the Gaussian function associated to a point p is Γ_p(t) = exp(−‖t − p‖² / (2σ²)) for a given σ > 0, with q ∈ ℕ and t_1, …, t_q ∈ ℝ².

  • The line point transformation φ_L: p ↦ [L_{Δ_1}(p), …, L_{Δ_q}(p)]^T, where the line function associated to a line Δ with direction vector e_Δ ∈ ℝ² and bias b_Δ ∈ ℝ is L_Δ(p) = ⟨p, e_Δ⟩ + b_Δ, with q ∈ ℕ and Δ_1, …, Δ_q lines in the plane.
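The sketch announced above gives a minimal NumPy version of Equation (4) with the triangle point transformation (an illustration under the notations of this section, not the released tensorflow implementation; the toy diagram and sample values are arbitrary).

    import numpy as np

    def triangle_transform(diag, samples):
        # phi_Lambda: evaluate Lambda_p(t) = max(0, y - |t - x|) at each sample t, for p = (x, y)
        x, y = diag[:, 0:1], diag[:, 1:2]
        return np.maximum(0.0, y - np.abs(samples[None, :] - x))      # shape (n_points, n_samples)

    def perslay(diag, weight_fn, samples, op):
        vecs = weight_fn(diag)[:, None] * triangle_transform(diag, samples)   # w(p) * phi(p)
        return op(vecs, axis=0)                                       # permutation invariant aggregation

    diag = np.array([[0.2, 0.8], [0.4, 0.5], [0.1, 0.9]])             # toy diagram, one point per row
    samples = np.linspace(0.0, 1.0, 5)
    ones = lambda d: np.ones(len(d))

    # op = max with constant weights: the first persistence landscape sampled on `samples`
    # (up to the coordinate convention used for the diagram points).
    print(perslay(diag, ones, samples, np.max))
    # op = sum with an arbitrary weight: a weighted persistence silhouette.
    print(perslay(diag, lambda d: d[:, 1] - d[:, 0], samples, np.sum))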

Formulation (4) is very general: despite its simplicity, it remarkably encodes most classical persistence diagram vectorizations with a very small set of point transformation functions φ, allowing one to consider the choice of φ as a hyperparameter of sorts. Let us show how it connects to most of the popular vectorizations and kernel methods for persistence diagrams in the literature.

  • Using φ_Λ with samples t_1, …, t_q, op = kth largest value, and w = 1 (a constant weight function), amounts to evaluating the kth persistence landscape [BUB15] on t_1, …, t_q.

  • Using φ_Λ with samples t_1, …, t_q, op = sum, and an arbitrary weight function w, amounts to evaluating the persistence silhouette weighted by w [CFL+15] on t_1, …, t_q.

  • Using φ_Γ with samples t_1, …, t_q, op = sum, and an arbitrary weight function w, amounts to evaluating the persistence surface weighted by w [AEK+17] on t_1, …, t_q. Moreover, characterizing points of persistence diagrams with Gaussian functions is also the approach advocated in several kernel methods for persistence diagrams [KHF16, LY18, RHB+15].

  • Using φ_Γ′, where Γ′ is a modification of the Gaussian point transformation that rescales (logarithmically) the second coordinate of the points lying below a given threshold ν > 0, with op = sum and weight function w = 1, is the approach presented in [HKN+17].

  • Using φ_L with lines Δ_1, …, Δ_q, op = kth largest value, and weight function w = 1, is similar to the approach advocated in [CCO17], where the sorted projections of the points onto the lines are then compared with the ℓ1 norm and exponentiated to build the so-called Sliced Wasserstein kernel for persistence diagrams.

Stability of PersLay. The question of the continuity and stability of persistence diagram vectorizations is of importance for TDA practitioners. In [HKN19, Remark 8], the authors observed that the operation defined in (4)—with op = sum—is not continuous in general (with respect to the common persistence diagram metrics). Actually, [DL19, Prop. 5.1] showed that for all q ≥ 1 the map Dg ↦ Σ_{p ∈ Dg} w(p) φ(p) is continuous with respect to the metric d_q if and only if w is of the form w(p) = d_Δ(p)^q v(p), where d_Δ(p) denotes the distance from the point p to the diagonal and v is a continuous and bounded function. Furthermore, when φ is bounded and Lipschitz continuous, one can show that the resulting map is actually stable, in the sense that its variations are controlled by the diagram distance d_q ([HKN19, Thm. 12], [DL19, Prop. 5.2]).

In particular, this means that requiring continuity for the learned vectorization, as done in [HKN19], implies constraining the weight function to take small values for points close to the diagonal. However, in general there is no specific reason to consider that points close to the diagonal are less important than others, given a learning task.

3.2 A proof of concept: classification on large scale dynamical system dataset

Our first application is on a synthetic dataset used as a benchmark in Topological Data Analysis [AEK+17, CCO17, LY18]. It consists of sequences of points generated by different dynamical systems, see [HSW07]. Given some initial position (x_0, y_0) ∈ [0, 1]² and a parameter r > 0, we generate a point cloud (x_n, y_n)_{n ≥ 0} following:

x_{n+1} = (x_n + r y_n (1 − y_n)) mod 1,
y_{n+1} = (y_n + r x_{n+1} (1 − x_{n+1})) mod 1.   (5)

The orbits of this dynamical system heavily depend on the parameter r. More precisely, for some values of r, voids might form in these orbits (see Supplementary Material, Figure 4), and as such, persistence diagrams are likely to perform well at classifying orbits with respect to the value of r that generated them. As in previous works [AEK+17, CCO17, LY18], we use five different values of r to simulate the five classes of orbits, with random initialization of (x_0, y_0) and 1,000 points in each simulated orbit. These point clouds are then turned into persistence diagrams in homological dimensions 0 and 1 using a standard geometric filtration [CdO14], the AlphaComplex filtration (http://gudhi.gforge.inria.fr/python/latest/alpha_complex_ref.html). We generate two datasets. The first is ORBIT5K, where for each value of r we generate 1,000 orbits, ending up with a dataset of 5,000 point clouds. This dataset is the same as the one used in [LY18]. The second is ORBIT100K, which contains 20,000 orbits per class, resulting in a dataset of 100,000 point clouds—a scale that kernel methods cannot handle. This dataset aims to show the edge of our neural-network based approach over kernel methods when dealing with very large datasets of large diagrams, since all the previous works dealing with this data [AEK+17, CCO17, LY18] use kernel methods.
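For concreteness, here is a minimal sketch (an illustration, not the exact benchmark script; the value r = 4.1 and the helper names are used only as examples) of how one orbit of Equation (5) can be generated and turned into persistence diagrams with Gudhi's AlphaComplex filtration.

    import numpy as np
    import gudhi

    def orbit(r, n_points=1000, seed=0):
        rng = np.random.default_rng(seed)
        x, y = rng.random(), rng.random()             # random initial position in the unit square
        pts = np.empty((n_points, 2))
        for i in range(n_points):
            x = (x + r * y * (1.0 - y)) % 1.0
            y = (y + r * x * (1.0 - x)) % 1.0
            pts[i] = (x, y)
        return pts

    pts = orbit(r=4.1)
    st = gudhi.AlphaComplex(points=pts).create_simplex_tree()
    st.persistence()                                  # diagrams in homological dimensions 0 and 1
    dgm1 = st.persistence_intervals_in_dimension(1)   # the 1-dimensional features (voids)
    print(dgm1.shape)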

Results are displayed in Table 1. Not only do we improve on previous results for ORBIT5K, we also show with ORBIT100K that classification accuracy further increases as more observations are made available. For consistency we use the same accuracy metric as [LY18]: we split observations into 70%-30% training-test sets and report the average test accuracy over several runs. The parameters used are summarized in Supplementary Material, Section C.

4 Application to Graph Classification

In order to truly showcase the contribution of PersLay, we use a very simple network architecture, namely a two-layer network. The first layer is PersLay, which processes the persistence diagrams. The resulting vector is normalized and fed to the second and final layer, a fully-connected layer whose output is used for predictions. See Figure 3 for an illustration. We emphasize that this simplistic two-layer architecture is designed so as to produce knowledge and understanding (see Supplementary Material, Section C), rather than to achieve the best possible performance.

Choice of hyperparameters.

In our experiments, the weight function w is piecewise constant on a grid: w(p) is the weight of the cell containing p in a grid discretization of the unit square, and all cell weights are trainable parameters. The grid resolution is a hyperparameter (see Supplementary Material, Table 5). For the aggregation operator op we use the sum. Further details are given in Supplementary Material; see Table 5 for the chosen hyper-parameters and Table 6 for a study of the influence of the grid size and of the choice of op.
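A minimal TensorFlow sketch of this piecewise-constant weight function (an illustrative assumption about the implementation, not the released PersLay code) is given below: each diagram point receives the trainable weight of the grid cell of the unit square it falls into, and gradients flow back to the grid through tf.gather_nd.

    import tensorflow as tf

    grid_size = 10
    grid = tf.Variable(tf.ones((grid_size, grid_size)))          # trainable cell weights

    def grid_weight(points):
        """points: (n, 2) tensor of diagram coordinates rescaled to the unit square."""
        idx = tf.cast(tf.clip_by_value(points * grid_size, 0, grid_size - 1), tf.int32)
        return tf.gather_nd(grid, idx)                           # one weight per diagram point

    pts = tf.constant([[0.12, 0.85], [0.40, 0.55]])
    print(grid_weight(pts))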

Point transformations φ are chosen among the three options introduced in Section 3.1. Empirically, no representation is uniformly better than the others. The best point transformation for a given task could also be selected through a cross-validation scheme, or by learning a linear interpolation between these point transformations, setting φ = α_1 φ_Λ + α_2 φ_Γ + α_3 φ_L, where the α_i are trainable non-negative weights that sum to 1. A thorough exploration of these alternatives is left for future work.

As mentioned in Section 2, the diagrams we produce are stable with respect to the choice of the HKS diffusion parameter t (Thms. 2.2 and 2.3). As such, we use a small number of fixed values of t in our experiments (see Supplementary Material, Table 5). We also refer to Supplementary Material, where Figure 7 illustrates the evolution of a persistence diagram w.r.t. t and Figure 8 provides the classification accuracy for varying values of t. In practice, it is thus sufficient to sample a few values of t on a log-scale, as suggested for example in [SOG09, §5].
A subsequent natural question is: given a learning task, can t itself be optimized? The question of optimizing over a family of filtrations induced by parametric functions has been studied both theoretically and practically in very recent works [BND+19, LOT19]. Hence, we also apply this approach to the filtrations induced by the HKS, optimizing the parameter t during the learning process. Note that the running time of the experiments is greatly increased, since one has to recompute all persistence diagrams at each epoch, that is, each time t is updated. Moreover, we noticed after preliminary numerical investigations (see Supplementary Material, Section C) that classification accuracies were not improved by a large margin and remained comparable to the results obtained without optimizing t, so we did not include this optimization step in our results.

Table 5 and the code provided in the supplementary material give a detailed report of the different hyper-parameters chosen for each experiment.

Figure 3: Network architecture illustrated in the case of our graph classification experiments (Section 4). Each graph is encoded as a set of persistence diagrams, then processed by an independent instance of PersLay. Each instance embeds diagrams in some vector space using two functions that are optimized during training and a fixed permutation-invariant operator op.

Experimental settings.

We are now ready to evaluate our architecture on a series of different graph datasets commonly used as baselines in graph classification problems. REDDIT5K, REDDIT12K, COLLAB (from [YV15]), IMDB-B and IMDB-M (from [TVH18]) are composed of social graphs. COX2, DHFR, MUTAG, PROTEINS, NCI1 and NCI109 are graphs coming from medical or biological frameworks (also from [TVH18]). A quantitative summary of these datasets is found in Supplementary Material, Table 4.

We compare performances with five other top graph classification methods. Scale-variant topo [TVH18] leverages a kernel for ordinary persistence diagrams computed on a point cloud used to encode each graph. RetGK [ZWX+18] is a kernel method for graphs that leverages attributes on the graph vertices and edges when they are available. FGSD [VZ17] is a finite-dimensional graph embedding that does not leverage attributes. Finally, GCNN [XC19] and GIN [XHL+18] are two graph neural network approaches that reach top-tier results. One can also compare our results on the REDDIT datasets to those of [HKN+17], where the authors also use persistence diagrams to feed a network (using as first channel a particular case of PersLay, see Section 3), achieving 54.5% and 44.5% accuracy on REDDIT5K and REDDIT12K respectively.

Topological features were extracted using the graph signatures introduced in Section 2. We combine these features with more traditional graph features formed by the eigenvalues of the normalized graph Laplacian along with the deciles of the computed HKS (right-side channel in Figure 3). In order to evaluate the impact of the topological features in this learning process, one may refer to the ablation study in Supplementary Material, Table 7.

For each dataset, we perform 10 ten-fold evaluations and report the average and best ten-fold results. For a ten-fold evaluation, we split the data into 10 equally-sized folds and record the classification accuracy obtained on the i-th fold (test) after training on the remaining nine, cycling i from 1 to 10. The mean of those 10 ten-fold experiments is naturally more robust for evaluation purposes, and we report it in the column “PersLay - Mean”. This is consistent with the evaluation procedure from [ZWX+18]. Simultaneously, we also report the best single ten-fold accuracy obtained, reported in the column “PersLay - Max”, which is comparable to the results reported by all the other competitors.

In most cases, our approach is comparable with state-of-the-art results, despite using a very simple neural network architecture. Interestingly, both topology-based methods (SV and PersLay) have mediocre performances on the NCI datasets, suggesting that topology is not discriminative for these datasets. Additional experimental results, including ablation studies and variations of hyper-parameters (weight grid size, diffusion parameter t), are provided in Supplementary Material, Section C.

Dataset      SV [TVH18]  RetGK [ZWX+18]  FGSD [VZ17]  GCNN [XC19]  GIN [XHL+18]  PersLay Mean  PersLay Max
REDDIT5K     —           56.1            47.8         52.9         57.0          55.6          56.5
REDDIT12K    —           48.7            —            46.6         —             47.7          49.1
COLLAB       —           81.0            80.0         79.6         80.1          76.4          78.0
IMDB-B       72.9        71.9            73.6         73.1         74.3          71.2          72.6
IMDB-M       50.3        47.7            52.4         50.3         52.1          48.8          52.2
COX2 *       78.4        80.1            —            —            —             80.9          81.6
DHFR *       78.4        81.5            —            —            —             80.3          80.9
MUTAG *      88.3        90.3            92.1         86.7         89.0          89.8          91.5
PROTEINS *   72.6        75.8            73.4         76.3         75.9          74.8          75.9
NCI1 *       71.6        84.5            79.8         78.4         82.7          73.5          74.0
NCI109 *     70.5        —               78.8         —            —             69.5          70.1
Table 2: Classification accuracy over benchmark graph datasets. Our results (PersLay, right hand side) are recorded from ten runs of a 10-fold classification evaluation (see Section 4 for details). “Mean” is consistent with the evaluation procedure of [ZWX+18], while “Max” should be compared to [TVH18], [VZ17], [XC19] and [XHL+18], as it corresponds to the mean accuracy over a single 10-fold. The * indicates datasets that contain attributes (labels) on graph nodes and, symmetrically, the methods that leverage such attributes for classification purposes.

5 Conclusion

In this article, we introduced a new family of topological signatures on graphs that are both stable and well-suited for learning purposes. In parallel, we defined a powerful and versatile neural network layer to process persistence diagrams, called PersLay, which generalizes most of the techniques used to vectorize persistence diagrams that can be found in the literature—while optimizing them task-wise.

We showcase the efficiency of our approach by achieving state-of-the-art results on a synthetic orbit classification task coming from dynamical systems and by being competitive on several graph classification problems from real-life data, while working at larger scales than the kernel methods developed for persistence diagrams and remaining simpler than most of its neural network competitors. We believe that PersLay has the potential to become a central tool for incorporating topological descriptors in a wide variety of complex machine learning tasks based on neural networks.

Our code is publicly available at https://github.com/MathieuCarriere/perslay.

References

  • [AEK+17] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier (2017) Persistence images: a stable vector representation of persistent homology. Journal of Machine Learning Research 18 (8). Cited by: §1, §1, 3rd item, §3.2.
  • [AMA07] D. Archambault, T. Munzner, and D. Auber (2007) Topolayout: multilevel graph layout by topological features. IEEE transactions on visualization and computer graphics 13 (2), pp. 305–317. Cited by: §1.2.
  • [BND+19] R. Brüel-Gabrielsson, B. J. Nelson, A. Dwaraknath, P. Skraba, L. J. Guibas, and G. Carlsson (2019) A topology layer for machine learning. arXiv preprint arXiv:1905.12200. Cited by: §4.
  • [BUB15] P. Bubenik (2015) Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research 16 (77), pp. 77–102. Cited by: §1, §1, 1st item.
  • [BHO18] M. Buchet, Y. Hiraoka, and I. Obayashi (2018) Persistent homology and materials informatics. In Nanoinformatics, pp. 75–95. Cited by: §1.
  • [CÁM17] P. Cámara (2017-02) Topological methods for genomics: present and future directions. Current Opinion in Systems Biology 1, pp. 95–101. Cited by: §1.
  • [CCO17] M. Carrière, M. Cuturi, and S. Oudot (2017-07) Sliced Wasserstein kernel for persistence diagrams. In International Conference on Machine Learning, Vol. 70, pp. 664–673. Cited by: §1, §2.1, 5th item, §3.2, Table 1.
  • [COO15] M. Carrière, S. Oudot, and M. Ovsjanikov (2015) Stable topological signatures for points on 3d shapes. In Computer Graphics Forum, Vol. 34, pp. 1–12. Cited by: §1.
  • [CdG+16] F. Chazal, V. de Silva, M. Glisse, and S. Oudot (2016) The structure and stability of persistence modules. Springer International Publishing. Cited by: Theorem A.2.
  • [CdO14] F. Chazal, V. de Silva, and S. Oudot (2014) Persistence stability for geometric complexes. Geometriae Dedicata 173 (1), pp. 193–214. Cited by: §1.2, §3.2.
  • [CFL+15] F. Chazal, B. T. Fasy, F. Lecci, A. Rinaldo, and L. Wasserman (2015) Stochastic convergence of persistence landscapes and silhouettes. Journal of Computational Geometry 6 (2), pp. 140–161. Cited by: §1, §1, 2nd item.
  • [CEH09] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer (2009-02) Extending persistence using Poincaré and Lefschetz duality. Foundations of Computational Mathematics 9 (1), pp. 79–103. Cited by: Theorem A.2, §1.3.
  • [DL19] V. Divol and T. Lacombe (2019) Understanding the topology and the geometry of the persistence diagram space via optimal partial transport. arXiv preprint arXiv:1901.03048. Cited by: §3.1.
  • [EH10] H. Edelsbrunner and J. Harer (2010) Computational topology: an introduction. American Mathematical Society. Cited by: §1.3.
  • [FF12] E. Ferrara and G. Fiumara (2012) Topological features of online social networks. arXiv preprint arXiv:1202.0331. Cited by: §1.2.
  • [HSW07] J. Hertzsch, R. Sturman, and S. Wiggins (2007) DNA microarrays: design principles for maximizing ergodic, chaotic mixing. Small 3 (2), pp. 202–218. Cited by: §3.2.
  • [HKN19] C. D. Hofer, R. Kwitt, and M. Niethammer (2019) Learning representations of persistence barcodes. Journal of Machine Learning Research 20 (126), pp. 1–45. External Links: Link Cited by: §1.1, §1.2, §2.1, §3.1.
  • [HKN+17] C. Hofer, R. Kwitt, M. Niethammer, and A. Uhl (2017) Deep learning with topological signatures. In Advances in Neural Information Processing Systems, pp. 1634–1644. Cited by: §1.2, §1.2, §2.1, 4th item, §4.
  • [HRG14] N. Hu, R. Rustamov, and L. Guibas (2014) Stable and informative spectral signatures for graph matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2305–2312. Cited by: Appendix A, Theorem A.1, §2.2, Definition 2.1.
  • [KAL18] S. Kališnik (2018-01) Tropical coordinates on the space of persistence barcodes. Foundations of Computational Mathematics, pp. 1–29. Cited by: §1.
  • [KB14] D. Kingma and J. Ba (2014-12) Adam: a method for stochastic optimization. arXiv. Cited by: 6th item.
  • [KHF16] G. Kusano, Y. Hiraoka, and K. Fukumizu (2016-06) Persistence weighted Gaussian kernel for topological data analysis. In International Conference on Machine Learning, Vol. 48, pp. 2004–2013. Cited by: §1, 3rd item, Table 1.
  • [LY18] T. Le and M. Yamada (2018) Persistence Fisher kernel: a Riemannian manifold kernel for persistence diagrams. In Advances in Neural Information Processing Systems, pp. 10027–10038. Cited by: §1, 3rd item, §3.2, §3.2, Table 1.
  • [LOT19] J. Leygonie, S. Oudot, and U. Tillmann (2019) A framework for differential calculus on persistence barcodes. External Links: 1910.00960 Cited by: §4.
  • [LOC14] C. Li, M. Ovsjanikov, and F. Chazal (2014-06) Persistence-based structural recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2003–2010. Cited by: §1.
  • [LSY+12] G. Li, M. Semerci, B. Yener, and M. J. Zaki (2012) Effective graph classification based on topological and label attributes. Statistical Analysis and Data Mining: The ASA Data Science Journal 5 (4), pp. 265–283. Cited by: §1.2.
  • [OUD15] S. Oudot (2015) Persistence theory: from quiver representations to data analysis. American Mathematical Society. Cited by: §1.3.
  • [PH15] J. Perea and J. Harer (2015-06) Sliding windows and persistence: an application of topological methods to signal analysis. Foundations of Computational Mathematics 15 (3), pp. 799–838. Cited by: §1.
  • [RHB+15] J. Reininghaus, S. Huber, U. Bauer, and R. Kwitt (2015) A stable multi-scale kernel for topological machine learning. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, 3rd item, Table 1.
  • [SOG09] J. Sun, M. Ovsjanikov, and L. Guibas (2009) A concise and provably informative multi-scale signature based on heat diffusion. Computer graphics forum 28, pp. 1383–1392. Cited by: Definition 2.1, §4.
  • [THE15] The GUDHI Project (2015) GUDHI user and reference manual. GUDHI Editorial Board. External Links: Link Cited by: §2.1.
  • [TVH18] Q. H. Tran, V. T. Vo, and Y. Hasegawa (2018) Scale-variant topological information for characterizing complex networks. arXiv preprint arXiv:1811.03573. Cited by: §1.2, §4, §4, Table 2.
  • [TMK+18] A. Tsitsulin, D. Mottin, P. Karras, A. Bronstein, and E. Müller (2018) Netlsd: hearing the shape of a graph. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2347–2356. Cited by: §2.2.
  • [VZ17] S. Verma and Z. Zhang (2017) Hunt for the unique, stable, sparse and fast feature learning on graphs. In Advances in Neural Information Processing Systems, pp. 88–98. Cited by: §4, Table 2.
  • [XC19] Z. Xinyi and L. Chen (2019) Capsule graph neural network. In International Conference on Learning Representations, External Links: Link Cited by: §4, Table 2.
  • [XHL+18] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018-10) How powerful are graph neural networks?. arXiv. Cited by: §4, Table 2.
  • [YV15] P. Yanardag and S.V.N. Vishwanathan (2015) Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, New York, NY, USA, pp. 1365–1374. External Links: ISBN 978-1-4503-3664-2, Link, Document Cited by: §4.
  • [ZKR+17] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola (2017) Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401. Cited by: 5th item, §1.1, §1.2, §3.1.
  • [ZWX+18] Z. Zhang, M. Wang, Y. Xiang, Y. Huang, and A. Nehorai (2018) RetGK: Graph Kernels based on Return Probabilities of Random Walks. In Advances in Neural Information Processing Systems, pp. 3968–3978. Cited by: §4, §4, Table 2.
  • [ZW19] Q. Zhao and Y. Wang (2019-04) Learning metrics for persistence-based summaries and applications for graph classification. arXiv. Cited by: §1.2.

Appendix A Proofs of Stability Theorems

Definition of diagram distances.

Recall (see Section 1.3) that persistence diagrams are generally represented as multisets of points (i.e. points counted with multiplicity) supported on the upper half plane {(b, d) ∈ ℝ² : d ≥ b}. Let D_1 and D_2 be two such diagrams and let p ∈ [1, ∞] be a parameter. Note in particular that D_1 and D_2 do not have the same number of points in general. Let Δ = {(x, x) : x ∈ ℝ} denote the diagonal, and let Γ(D_1, D_2) denote the set of all bijections between D_1 ∪ Δ and D_2 ∪ Δ. Then, the p-diagram distance between D_1 and D_2 is defined as:

d_p(D_1, D_2) := inf_{γ ∈ Γ(D_1, D_2)} ( Σ_{x ∈ D_1 ∪ Δ} ‖x − γ(x)‖^p )^{1/p}.   (6)

In particular, if p = ∞, we recover the bottleneck distance defined as:

d_∞(D_1, D_2) := inf_{γ ∈ Γ(D_1, D_2)} sup_{x ∈ D_1 ∪ Δ} ‖x − γ(x)‖.   (7)

Proof of Theorem 2.2

The proof directly follows from the following two theorems. The first one, proved in [HRG14], is a consequence of classical arguments from matrix perturbation theory.

Theorem A.1 ([HRG14], Theorem 1).

Let t > 0 and let L_w(G) be the Laplacian matrix of a graph G with n vertices. Let λ_1 < ⋯ < λ_m be the distinct eigenvalues of L_w(G) and denote by gap(G) the smallest distance between two distinct eigenvalues: gap(G) := min_{i ≠ j} |λ_i − λ_j|. Let G′ be another graph with n vertices and Laplacian matrix L_w(G′) = L_w(G) + W, where ‖·‖_F denotes the Frobenius norm. Then, provided ‖W‖_F is small with respect to gap(G), there exist constants depending only on t and on the spectrum of L_w(G) that bound |hks_{G,t}(v) − hks_{G′,t}(v)| linearly in ‖W‖_F for any vertex v. In particular, for ‖W‖_F small enough, there exists a constant C(t, G)—notice that C(t, G) also depends on gap(G)—such that

‖hks_{G,t} − hks_{G′,t}‖_∞ ≤ C(t, G) ‖W‖_F.

Theorem 2.2 then immediately follows by combining it with the following theorem, which is a special case of general stability results for persistence diagrams.

Theorem A.2 ([CdG+16, CEH09]).

Let G be a graph and f, g be two functions defined on its vertices. Then:

d_∞(Dg(G, f), Dg(G, g)) ≤ ‖f − g‖_∞,   (8)

where d_∞ stands for the so-called bottleneck distance between the persistence diagrams Dg(G, f) and Dg(G, g). Moreover, this inequality is also satisfied for each of the subtypes Ord0, Rel1, Ext0 and Ext1 individually.

Proof of Theorem 2.3

Fix a graph G = (V, E) with n vertices. With the same notations as in Section 2.2, recall that the eigenvalues of the normalized graph Laplacian satisfy 0 ≤ λ_1 ≤ ⋯ ≤ λ_n ≤ 2, and that the corresponding eigenvectors ψ_1, …, ψ_n define an orthonormal family, so that Σ_k ψ_k(v)² = 1 for every vertex v. In particular, s ↦ exp(−s λ_k) is λ_k-Lipschitz continuous on [0, +∞) for each k. Let t, t′ be two positive diffusion parameters. We have, for any v ∈ V:

|hks_{G,t}(v) − hks_{G,t′}(v)| ≤ Σ_k |exp(−t λ_k) − exp(−t′ λ_k)| ψ_k(v)² ≤ λ_n |t − t′| Σ_k ψ_k(v)² = λ_n |t − t′|.

Thus in particular, ‖hks_{G,t} − hks_{G,t′}‖_∞ ≤ λ_n |t − t′| ≤ 2 |t − t′|.

As in the previous proof, we conclude using the stability of persistence diagrams w.r.t. the bottleneck distance (see Thm. A.2).

Appendix B Datasets Description

Tables 3 and 4 summarize key information about the datasets used in our experiments. We also provide in Figure 4 an illustration of the orbits generated in Section 3.2.

Dataset      Nb of orbits observed   Number of classes   Number of points per orbit
ORBIT5K      5,000                   5                   1,000
ORBIT100K    100,000                 5                   1,000
Table 3: Description of the two orbit datasets we generated. The five classes correspond to the five parameter choices for r. In both ORBIT5K and ORBIT100K, classes are balanced.
Figure 4: Some examples of orbits generated by the different choices of r (three simulations are represented for each value of r).
Dataset     Nb graphs   Nb classes   Av. nodes   Av. edges   Av. β0   Av. β1
REDDIT5K    5,000       5            508.5       594.9       3.71     90.1
REDDIT12K   12,000      11           391.4       456.9       2.8      68.29
COLLAB      5,000       3            74.5        2457.5      1.0      2383.7
IMDB-B      1,000       2            19.77       96.53       1.0      77.76
IMDB-M      1,500       3            13.00       65.94       1.0      53.93
COX2        467         2            41.22       43.45       1.0      3.22
DHFR        756         2            42.43       44.54       1.0      3.12
MUTAG       188         2            17.93       19.79       1.0      2.86
PROTEINS    1,113       2            39.06       72.82       1.08     34.84
NCI1        4,110       2            29.87       32.30       1.19     3.62
NCI109      4,127       2            29.68       32.13       1.20     3.64
Table 4: Datasets description. β0 (resp. β1) stands for the 0th Betti number (resp. the 1st), that is, the number of connected components (resp. of independent cycles) in a graph. In particular, an average β0 of 1.0 means that all graphs in the dataset are connected, and in this case β1 equals the number of edges minus the number of nodes plus 1.

Appendix C Complementary Experimental Results

C.1 Weight learning

Figure 5 provides an illustration of the weight grid learned after training on the MUTAG dataset. Roughly speaking, activated cells highlight the areas of the plane where the presence of points was discriminating in the classification process. These learned grids thus emphasize the points of the persistence diagrams that matter for the learning task.

Figure 5: Weight function w, when chosen to be a grid of trainable weights, before and after training (MUTAG dataset). Here, Ord0, Rel1, Ext0 and Ext1 denote the extended diagrams corresponding to downward branches, upward branches, connected components and loops respectively (cf. Section 2.1).

C.2 Selection of the HKS diffusion parameter t

As stated in Theorem 2.3, for a fixed graph G, the function t ↦ Dg(G, t) is Lipschitz continuous with respect to the bottleneck distance between persistence diagrams. Informally (see Supplementary Material, Section A for a formal definition), it means that the points of Dg(G, t) must move smoothly with respect to t. This is experimentally illustrated in Figure 7, where we plot the four diagrams built from a graph of the MUTAG dataset.

As mentioned in Section 2.2, the parameter t can also be treated as a trainable parameter that is optimized during the learning. In our experiments, however, it does not prove to be worth it. Indeed, our diagrams are not particularly sensitive to the choice of t, and thus fixing a few values of t sampled on a log-scale is enough. Figure 6 illustrates the evolution of the parameter t over 40 epochs when trained on the MUTAG dataset (one epoch corresponds to a stochastic gradient descent pass performed on the whole dataset). As one can see, the parameter t converges quickly. More importantly, it remains almost constant when initialized at the value used in our experiments (see Table 5), suggesting that this choice is a (locally) optimal one. On the other hand, each time t is updated (that is, at each epoch), one must recompute the diagrams for all the graphs in the training set, significantly increasing the running time of the algorithm.

Figure 6: Evolution of the HKS parameter t when considered as a trainable variable (i.e., differentiating the HKS with respect to t) across 40 epochs, for three different initializations of t, namely 0.1, 1 and 10, on the MUTAG dataset.
Figure 7: Evolution of Dg(G, t) for one graph from the MUTAG dataset (values of t sampled in log-scale).

C.3 Experimental settings

Dataset Func. used PD preproc. PersLay Optim.
ORBIT5K , prom(500) Pm(25,25,10,top-5) adam(0.01, 0., 300)
ORBIT100K , prom(500) Pm(25,25,10,top-5) adam(0.01, 0., 300)
REDDIT5K prom(500) Pm(25,25,10,sum) adam(0.01, 0.99, 500)
REDDIT12K prom(500) Pm(5,5,10,sum) adam(0.01, 0.99, 1000)
COLLAB , prom(500) Pm(5,5,10,sum) adam(0.01, 0.9, 1000)
IMDB-B , prom(500) Im(20,(10,2),20,sum) adam(0.01, 0.9, 500)
IMDB-M , prom(500) Im(10,(10,2),10,sum) adam(0.01, 0.9, 500)
COX2 , Im(20,(10,2),20,sum) adam(0.01, 0.9, 500)
DHFR , Im(20,(10,2),20,sum) adam(0.01, 0.9, 500)
MUTAG Im(20,(10,2),10,sum) adam(0.01, 0.9, 100)
PROTEINS prom(500) Im(15,(10,2),10,sum) adam(0.01, 0.9, 70)
NCI1 , Pm(25,25,10,sum) adam(0.01, 0.9, 300)
NCI109 , Pm(25,25,10,sum) adam(0.01, 0.9, 300)

Table 5: Settings used to generate our experimental results.
Columns: grid size for the trainable weights (None, followed by increasing grid sizes), point transformation (Gaussian, line, triangle), and permutation op (Sum, Max).

MUTAG    Train/Test acc (%)   92.3/88.9  91.1/88.8  91.7/89.6  92.3/89.9  93.7/88.3  94.1/87.7  92.5/89.7  89.2/84.2  91.5/85.0  92.3/89.5  91.9/87.4
         Run time, CPU (s)    2.30  2.77  2.79  2.77  2.77  2.78  2.80  5.91  4.42  2.75  2.82

COLLAB   Train/Test acc (%)   76.5/75.3  78.6/75.8  79.0/76.2  80.0/76.5  83.5/73.9  94.0/71.3  79.7/75.3  79.9/76.1  79.4/74.7  80.0/76.4  78.8/75.0
         Run time, GPU (s)    26.0  40.4  43.5  43.8  44.1  45.6  45.8  54.0  61.4  44.3  48.1

Table 6: Influence of hyper-parameters and ablation study. When varying a single hyper-parameter (e.g. grid size), all the others (e.g. perm op) are fixed to the values described in Supplementary Material, Table 5. Accuracies and running times are averaged over 100 runs (i5-8350U 1.70GHz CPU for the small MUTAG dataset, P100 GPU for the large COLLAB one). Bold-blue font refers to the experimental setting used in Section 4.

Input data was fed to the network with mini-batches of size 128. For each dataset, the various parameters (extended persistence diagrams, neural network architecture, optimizers, etc.) that were used to obtain the scores reported in Section 4 are given in Table 5, using the following shortcuts:

  • Persistence diagrams obtained with Gudhi's AlphaComplex filtration in a given homological dimension (used for the ORBIT datasets).

  • Extended persistence diagrams obtained with the HKS on the graph for a given diffusion parameter t (used for the graph datasets).

  • prom(k): preprocessing step selecting the k points that are farthest away from the diagonal (a minimal sketch of this step is given after this list).

  • PersLay channel Im(q, (f, s), g, op): a function obtained by using a Gaussian point transformation sampled on a q × q grid on the unit square, followed by a convolution with f filters of size s, with a weight function optimized on a g × g grid and an operation op.

  • PersLay channel Pm(q, d, g, op): a function obtained by using a line point transformation with q lines, followed by a permutation equivariant function [ZKR+17] in dimension d, with a weight function optimized on a g × g grid and an operation op.

  • adam(lr, β, e): the ADAM optimizer [KB14] with learning rate lr, using an Exponential Moving Average (https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage) with decay rate β, run during e epochs.
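As announced in the list above, here is a minimal NumPy sketch of the prom(k) preprocessing step (illustrative; the function and variable names are assumptions).

    import numpy as np

    def prom(diagram, k):
        """Keep the k diagram points that are farthest away from the diagonal."""
        diagram = np.asarray(diagram, dtype=float)           # shape (n, 2), rows are (birth, death)
        prominence = np.abs(diagram[:, 1] - diagram[:, 0])   # proportional to the distance to the diagonal
        order = np.argsort(-prominence)
        return diagram[order[:min(k, len(diagram))]]

    dgm = np.array([[0.1, 0.9], [0.4, 0.45], [0.2, 0.6]])
    print(prom(dgm, k=2))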

C.4 Hyper-parameters influence

As our approach mixes our topological features with some standard graph features, we provide two ablation studies. In Table 7, the column “Spectral alone” reports the test accuracies obtained by using only these additional features, while the column “PD alone” records the accuracies obtained with the extended and ordinary persistence diagrams alone. As ordinary persistence only encodes the connectivity properties of graphs, a gap in performance between extended and ordinary persistence can be interpreted as 1-dimensional features (i.e. loops) being informative for classification purposes. Table 7 also reports the standard deviations that were omitted in Table 2 for the sake of clarity.

Similarly, we give in Table 6 the influence of the grid size chosen for the weight function w. In particular, we also perform an ablation study: a grid size of None means that we enforce w(p) = 1 for all p. As expected, increasing the grid size improves train accuracy but leads to overfitting for too large values. However, this increase has only a small impact on running times, whereas not using any grid significantly lowers them.

Finally, Figure 8 illustrates the variation of accuracy for both the MUTAG and COLLAB datasets when varying the HKS parameter t used to generate the extended persistence diagrams. One can see that the accuracy reached on MUTAG does not depend on the choice of t, which could intuitively be explained by the small size of the graphs in this dataset, making the parameter t not very relevant. Experiments are performed on a single 10-fold, with 100 epochs. The parameters of PersLay are set to Im(20,(),20) for this experiment.

Figure 8: Variation of test accuracy for the MUTAG and COLLAB datasets when varying the HKS parameter t (log-10 scale).
Dataset      Spectral alone   PD alone (Extended)   PD alone (Ordinary)   PersLay
REDDIT5K     49.7(0.3)        55.0                  52.5                  55.6(0.3)
REDDIT12K    39.7(0.1)        44.2                  40.1                  47.7(0.2)
COLLAB       67.8(0.2)        71.6                  69.2                  76.4(0.4)
IMDB-B       67.6(0.6)        68.8                  64.7                  71.2(0.7)
IMDB-M       44.5(0.4)        48.2                  42.0                  48.8(0.6)
COX2 *       78.2(1.3)        81.5                  79.0                  80.9(1.0)
DHFR *       69.5(1.0)        78.2                  71.8                  80.3(0.8)
MUTAG *      85.8(1.3)        85.1                  70.2                  89.8(0.9)
PROTEINS *   73.5(0.3)        72.2                  69.7                  74.8(0.3)
NCI1 *       65.3(0.2)        72.3                  68.9                  73.5(0.3)
NCI109 *     64.9(0.2)        67.0                  66.2                  69.5(0.3)
Table 7: Complementary report of experimental results: test accuracies (with standard deviations where available) using the spectral features alone, the persistence diagrams alone (extended vs. ordinary persistence), and the full PersLay architecture.