Implementation of the PersLay layer for persistence diagrams
Persistence diagrams, a key descriptor from Topological Data Analysis, encode and summarize all sorts of topological features and have already proved pivotal in many different applications of data science. But persistence diagrams are weakly structured and therefore constitute a difficult input for most Machine Learning techniques. To address this concern several vectorization methods have been put forward that embed persistence diagrams into either finite-dimensional Euclidean spaces or implicit Hilbert spaces with kernels. But finite-dimensional embeddings are prone to miss a lot of information about persistence diagrams, while kernel methods require the full computation of the kernel matrix. We introduce PersLay: a simple, highly modular layer of learning architecture for persistence diagrams that allows to exploit the full capacities of neural networks on topological information from any dataset. This layer encompasses most of the vectorization methods of the literature. We illustrate its strengths on challenging classification problems on dynamical systems orbit or real-life graph data, with results improving or comparable to the state-of-the-art. In order to exploit topological information from graph data, we show how graph structures can be encoded in the so-called extended persistence diagrams computed with the heat kernel signatures of the graphs.READ FULL TEXT VIEW PDF
Persistence diagrams, the most common descriptors of Topological Data
Graph classification is a difficult problem that has drawn a lot of atte...
There has been an intense recent activity in embedding of very high
Learning task-specific representations of persistence diagrams is an
Topological features such as persistence diagrams and their functional
Topological methods for data analysis present opportunities for enforcin...
Robust topological information commonly comes in the form of a set of
Implementation of the PersLay layer for persistence diagrams
Topological Data Analysis is a field of data science whose purpose is to capture and encode the topological features (such as the connected components, loops, cavities…) that are present in datasets in order to improve inference and prediction. Its main descriptor is the so-called persistence diagram, which takes the form of a set of points in the Euclidean plane , each point corresponding to a topological feature of the data. This descriptor has been successfully used in many different applications of data science, such as material science (Buchet et al., 2018), signal analysis (Perea & Harer, 2015), cellular data (Cámara, 2017), or shape recognition (Li et al., 2014) to name a few. This wide range of applications is mainly due to the fact that persistence diagrams encode information whose essence is topological, and as such this information tends to be complementary to the one captured by more classical descriptors.
However, the space of persistence diagrams heavily lacks structure: different persistence diagrams may have different number of points, and several basic operations are not well-defined, such as addition and mean. This dramatically impedes their capacity to be used in machine learning applications so far. To handle this issue, a lot of attention has been devoted to the vectorization of persistence diagrams through the construction of either finite-dimensional embeddings of persistence diagrams (Adams et al., 2017; Carrière et al., 2015; Chazal et al., 2015; Kališnik, 2018), i.e., embeddings turning diagrams into vectors in Euclidean space , or kernels for persistence diagrams (Bubenik, 2015; Carrière et al., 2017; Kusano et al., 2016; Le & Yamada, 2018; Reininghaus et al., 2015), i.e., generalized scalar products that implicitly turn diagrams into elements of infinite-dimensional Hilbert spaces. Unfortunately, both approaches suffer from major issues: it has been shown that finite-dimensional embeddings miss a lot of information about persistence diagrams (Carrière & Bauer, 2019), and kernel methods require to compute and store the kernel evaluations for each pair of persistence diagrams. Since all available kernels have a complexity that is at least linear, and often quadratic, in the number of persistence diagram points for a single matrix entry computation, kernel methods quickly become very expensive in terms of running time and memory usage on large sets or for large diagrams. An interesting preliminary approach has been developed in trying to feed persistence diagrams to neural networks Hofer et al. (2017), which consists in adapting and branching an existing vectorization method (the so-called persistence images Adams et al. (2017)) to a large neural network. However, there are applications for which persistence images might be inappropriate a choice, and it is generally hard to tell ahead of time which kernel or vectorization method will be the most relevant one.
In this article, we present a framework to make full use of the modularity, learning capacities and computing power of neural networks for Topological Data Analysis. Building on the recent introduction of DeepSet from (Zaheer et al., 2017) that produces a neural network formatting targeted at processing sets of points, we apply and extend that framework to the context of persistence diagrams, thus defining PersLay
: a simple, highly versatile, automatically differentiable layer for neural network architectures that can process topological information from all sorts of datasets. Our framework encompasses most of the popular vectorization and kernel methods that exist in the literature. We demonstrate its strengths on two challenging applications where we reach, improve or significantly improve over state-of-the-art classifying accuracies with performances evaluated by simple network architectures based on PersLay layers. The first is a large-scale classification application on a synthetic set of a hundred thousand dynamical system orbits. Second, we turn to a real graph classification application with benchmark datasets borrowing from the fields of biology, chemistry or social networks. Since graph data also suffer from a lack of general structure, we independently provide a robust method to efficiently encode graph information in a theoretically sound way using an extension of ordinary persistence calledextended persistence, that is computed from a family of functions defined on the graph vertices called heat kernel signatures with ensuing stability properties. Lastly, we make our contribution available as a public tensorflow-based Python package.
To summarize our contributions:
We introduce a simple, versatile neural network layer for persistence diagrams, called PersLay for handling topological information from all sorts of datasets, encompassing most of vectorizing approaches in the literature.
We showcase the strength of our method by achieving state-of-the-art effectiveness on classifying a collection of a hundred thousand synthetic orbit data from dynamical systems, as well as on difficult graph classification problems from the literature.
We introduce a theoretically-based approach for extracting topological information from graph data using a combination of extended persistence and graph signatures.
We provide a ready-to-use PersLay Python package based on tensorflow: https://github.com/MathieuCarriere/perslay.
The basics of (extended) persistence theory are first introduced in Section 2. Our general neural network layer PersLay for persistence diagrams is then defined in Section 3. Finally, the two applications showcased are presented in Section 4.1 for the dynamical system problem, and in Section 4.2 for the graph application. This Section 4.2 also informs on the family of functions used to generate persistence diagrams, the so-called heat kernel signatures with stability properties.
In this section, we recall the basics of persistence (and extended persistence) diagrams. We point to (Cohen-Steiner et al., 2009; Edelsbrunner & Harer, 2010; Oudot, 2015) for a thorough description of general (extended) persistence theory for metric spaces.
Ordinary persistence. Let us consider a pair , where is a topological space, and is a real-valued function, and define the sublevel set . Making increase from to gives an increasing sequence of sets, called the filtration induced by , and which starts with the empty set and ends with the whole space . Ordinary persistence will record the time of appearance and disappearance of topological components (connected components, loops, cavities, etc.) in this sequence. For instance, one can record the value for which a new connected component appears in , called the birth time of the connected component. This connected component eventually gets merged with another for some value , and thus is stored and called the death time of the component, and one says that the component persists on the interval . Similarly, we save the values of each loop, cavity, etc. that appears in a specific sublevel set and disappears (get “filled”) in . This family of intervals is called the barcode, or persistence diagram, of , and can be represented as a multiset of points (i.e., point cloud where points are counted with multiplicity) supported on with coordinates .
In some context, ordinary persistence might not be sufficient to encode the topology of an object . For instance, consider a graph , with vertices and (non-oriented) edges . Let be a function defined on its vertices, and consider the sublevel graphs where , , and , see in Figure 1. In this sequence , loops persist forever since they never disappear from the sequence of sublevel graphs (they never get “filled”), and the same applies for whole connected components of . Moreover, branches pointing upwards (with respect to the orientation given by , see Figure 2) are missed (while those pointing downward are detected), since they do not create connected components when they appear in the sublevel graphs, making ordinary persistence unable to detect them.
Extended persistence. To handle the issue stated above, extended persistence refines the analysis by also looking at the superlevel set . Similarly, making decrease from to also gives a sequence of increasing subsets, for which structural changes can be recorded.
Although extended persistence can be defined for general metric spaces (see the references given above), we restrict ourselves to the case where is a graph. The sequence of increasing superlevel graphs is illustrated in Figure 1 . In particular, death times can be defined for loops and whole connected components by picking the superlevel graphs for which the feature appears again, and using the corresponding value as the death time for these features. In this case, branches pointing upwards can be detected in this sequence of superlevel graphs, in the exact same way that downwards branches were in the sublevel graphs. See Figure 2 for an illustration.
Finally, the family of intervals of the form is turned into a multiset of points in the Euclidean plane by using the interval endpoints as coordinates. This multiset is called the extended persistence diagram of and is denoted by .
Since graphs have four types of topological features (see Figure 2), namely upwards branches, downwards branches, loops and connected components, the corresponding points in extended persistence diagrams can be of four different types. These types are denoted as , , and for downwards branches, upwards branches, connected components and loops respectively:
Note that an advantage of using extended persistence is that all diagram types can be treated similarly, in the sense that points in each type all have finite coordinates. Moreover, as each point represents a topological feature that persists along a given interval, provides explainable information about the topological structure of the pair . In practice, computing extended persistence diagrams can be efficiently done with the C++/Python Gudhi library (The GUDHI Project, 2015). Persistence diagrams are usually compared with the so-called bottleneck distance —whose proper definition is not required for this work (see e.g. (Edelsbrunner & Harer, 2010, VIII.2)). However, the resulting metric space is not Hilbert and as such, incorporating diagrams in a learning pipeline requires to design specific tools.
We have seen in Section 2 how to derive topological descriptors from data, namely (extended) persistence diagrams, which we aim at using in machine learning applications. In this section, we present PersLay, our general neural network layer for (extended) persistence diagrams. To define it, we leverage a recent neural network architecture called DeepSet (Zaheer et al., 2017) that was targeted at processing sets of points.
The main purpose of the DeepSet design is to be invariant to the point orderings in the sets. Any such neural network is called a permutation invariant network. In order to achieve this, Zaheer et al. (Zaheer et al., 2017) propose to process point sets with a layer implementing the general equation: , where is the function implemented by the layer, , and is a point transformation. Their final neural network architecture is then obtained by composing with a transformation , which can be parametrized by any other neural network architecture. It is shown in (Zaheer et al., 2017, Theorem 2) that if the cardinality is the same for all sets, then for any permutation invariant function , there exist and such that . Moreover, this is still true for variable if the sets belong to some countable space.
In this work, we transpose this architecture to the context of persistence diagrams by defining and implementing a series of new permutation invariant layers that will generalize some standard tools used in Topological Data Analysis. To that end we define our generic neural network layer for persistence diagrams , that we call PersLay, through the following equation:
where op is any permutation invariant operation (such as minimum, maximum, sum, th largest value…), is a weight function for the persistence diagram points, and is a function that we call point transformation function. We emphasize that any neural network architecture can be composed with PersLay to generate a neural network architecture for persistence diagrams. Let us now introduce the three point transformation functions that we use and implement for parameter in Equation (2).
The triangle point transformation where the triangle function associated to a point is , with and .
The Gaussian point transformation , where the Gaussian function associated to a point is for a given , and .
The line point transformation , where the line function associated to a line with direction vector and bias is , with and lines.
Note that line point transformations are examples of permutation equivariant functions, which are specific functions defined on sets of points that were used to build the DeepSet layer in (Zaheer et al., 2017).
Formulation (2) has high representative power: it allows to remarkably encode most of classical persistence diagram representations with a very small set of point transformation functions , allowing to consider the choice of
as a hyperparameter of sort. Let us show how we connect it to popular vectorizations and kernel methods for persistence diagrams in the literature.
Using with samples , th largest value, , amounts to evaluating the th persistence landscape (Bubenik, 2015) on .
Using with samples , sum, arbitrary , amounts to evaluating the persistence silhouette weighted by (Chazal et al., 2015) on .
Using with samples , sum, arbitrary , amounts to evaluating the persistence surface weighted by (Adams et al., 2017) on . Moreover, characterizing points of persistence diagrams with Gaussian functions is also the approach advocated in several kernel methods for persistence diagrams (Kusano et al., 2016; Le & Yamada, 2018; Reininghaus et al., 2015).
Using where is a modification of the Gaussian point transformation defined with: for any , where if for some , and otherwise, sum, , is the approach presented in (Hofer et al., 2017).
Using with lines , th largest value, , is similar to the approach advocated in (Carrière et al., 2017), where the sorted projections of the points onto the lines are then compared with the norm and exponentiated to build the so-called Sliced Wasserstein kernel for persistence diagrams.
In practice, samples and lines are parameters that are optimized on the training set by the network. This means that our approach looks for the best locations to evaluate the landscapes, silhouettes and persistence surfaces, as well as the best lines to project the diagrams onto, with respect to the problem the network is trying to solve.
We now introduce two different applications aimed at showcasing both the scaling power, precision and versatility of PersLay. In order to truly showcase the contribution of this layer we use a very simple network architecture, namely a two-layer network. The first layer is the PersLay layer that processes persistence diagrams. The output of this layer is then either used as such (as in the first application §4.1) or concatenated with additional features (for the graph application §4.2). The resulting vector is normalized and fed to the second and final layer, a fully connected layer whose output is used for predictions. An illustration in the context of graph classification is found Fig. 3. We emphasize that this simplistic two-layer architecture is designed so as to produce knowledge and understanding, rather than the best possible performances.
In those applications, the weight function from Equation (2) is actually treated as a trainable parameter whose values are optimized during training. More precisely, the input diagrams are scaled to , so that becomes a function defined on the unit square, that we approximate by discretizing this square into a set of pixels. The value of on each pixel is then optimized individually during training. As both and are functions defined on , it suggests—we leave it as perspective work—that they can be interpreted and visualized in ways to help explain the network learning behavior on persistence diagrams.
Our results can be reproduced with the help of Table 5 in the supplementary material with specific settings for each experiment. The implementation relies on the open source C++/Python library Gudhi (The GUDHI Project, 2015), Python packages sklearn-tda (Carrière, 2018) and tensorflow (Abadi et al., 2015), and is available at https://github.com/MathieuCarriere/perslay.
The first application is demonstrated on synthetic data used as a benchmark in Topological Data Analysis (Adams et al., 2017; Carrière et al., 2017; Le & Yamada, 2018). It consists in sequences of points generated by different dynamical systems, see Hertzsch et al. (2007). Given some initial position and a parameter , we generate a point cloud following (mod 1) and (mod 1). The orbits of this dynamical system are highly dependent on parameter . More precisely, for some values of , voids can form in these orbits (see Fig. 5 in the Supplementary Material), and as such, persistence diagrams are likely to perform well when attempting to classify orbits with respect to the value of generating them. As in previous works (Adams et al., 2017; Carrière et al., 2017; Le & Yamada, 2018), we use the five different parameters and to simulate the different classes of orbits, with random initialization of and points in each simulated orbit. These point clouds are then turned into persistence diagrams using a standard geometric filtration Chazal et al. (2014), namely the AlphaComplex filtration111http://gudhi.gforge.inria.fr/python/latest/alpha_complex_ref.html in dimensions and . Doing so, we generate two datasets. The first is ORBIT5K, where for each value of , we generate orbits, ending up with a dataset of point clouds. This dataset is the same as the one used in (Le & Yamada, 2018). The second, ORBIT100K contains orbits per class, resulting in a dataset of point clouds — a scale that kernel methods cannot handle. This dataset aims to show the edge of our neural-network based approach over kernels methods when dealing with very large datasets of large diagrams. The goal is now to perform classification with the aforementioned architecture, while all the previous works dealing with this data (Adams et al., 2017; Carrière et al., 2017; Le & Yamada, 2018) use a kernel method to classify the persistence diagrams built on top of the point clouds.
Results are displayed in Table 1. Not only do we improve on previous results for ORBIT5K, with performances on ORBIT100K we also show that classification accuracy is further increased as more observations are made available. For consistency we use the same accuracy metric as (Le & Yamada, 2018), we split observations in 70%-30% training-test sets and report the average test accuracy over runs. The parameters used summarized Section C in the Supplementary Material.
We now specialize our framework to the evaluation of graph datasets. We have seen in Section 2 how extended persistence diagrams can be computed from a graph for a given function , defined on its vertices. As the choice for this function is essential in capturing relevant topological information, we first provide more details about our choice of the family of such functions that we use in our experiments: the heat kernel signatures (HKS).
Heat kernel signatures on graphs. HKS is an example of spectral family of signatures, i.e. functions derived from the spectral decomposition of graph Laplacians, which provide informative features for graph analysis. The adjacency matrix of a graph with vertex set is the matrix . The degree matrix is the diagonal matrix defined by . The normalized graph Laplacian is the linear operator acting on the space of functions defined on the vertices of , and is represented by the matrix
. It admits an orthonormal basis of eigenfunctions
and its eigenvalues satisfy. As the orthonormal eigenbasis is not uniquely defined, the eigenfunctions cannot be used as such to compare graphs. Instead we consider the heat kernel signatures:
Given a graph and , the heat kernel signature at time is the function defined on the vertices of by .
The HKS have already been used as signatures to address graph matching problems (Hu et al., 2014) or to define spectral descriptors to compare graphs (Tsitsulin et al., 2018). These signatures rely on the distributions of values taken by the HKS but not on their global topological structures, which are encoded in their extended persistence diagrams. Moreover the following theorem shows these diagrams to be stable with respect to the bottleneck distance between persistence diagrams. The proof is found in the Supplementary Material, Section D.
Let and let be the Laplacian matrix of a graph with vertices. Let be another graph with vertices and Laplacian matrix . Then there exists a constant only depending on and the spectrum of such that, for small enough :
Graph classification experiments. We are now ready to evaluate our architecture on a series of different graph datasets commonly used as a baseline in graph classification problems. REDDIT5K, REDDIT12K, COLLAB (from (Yanardag & Vishwanathan, 2015)) IMDB-B, IMDB-M (from (Tran et al., 2018)) are composed of social graphs. BZR, COX2, DHFR, MUTAG, PROTEINS, NCI1, NCI109, FRANKENSTEIN are graphs coming from medical or biological frameworks (also from (Tran et al., 2018)). A quantitative summary of these datasets is found in Table 4 from the Supplementary Material.
We compare performances with four other top graph classification methods. Scale-variant topo (Tran et al., 2018) uses a kernel for ordinary persistence diagrams computed on the graphs. RetGK (Zhang et al., 2018) is a kernel method for graphs that leverages attributes on the graph vertices and edges, and reaches state-of-the-art results on many datasets. Note that while the exact computation (denoted RetGK1 in Footnote 4) can be quite long, the method can be efficiently approximated (RetGK11 in Footnote 4) while preserving good accuracy scores. FGSD (Verma & Zhang, 2017) is a finite-dimensional graph embedding that does not leverage attributes, and reaches state-of-the-art results on different datasets. Finally, (Xinyi & Chen, 2019) is a neural network approach that also reaches top-tier results. One could also compare our results on the REDDIT datasets to the ones of Hofer et al. (2017), where authors also use persistence diagrams to feed a network (using as first channel a particular case of PersLay, see Section 3), achieving 54.5% and 44.5% of accuracy on REDDIT5K and REDDIT12K respectively.
In order to extract topological features, we use this general scheme: for each graph we compute the HKS filtrations at one or two time step (e.g. with values in ). After generating the corresponding extended persistence diagram induced by HKS (see §2 and §4.2), we eventually keep, for each diagram, the first points which are the farthest away from the diagonal (a specific is fixed for each dataset, see Table 5
in the supplementary material). We combine these topological features with more traditional graph features formed by the eigenvalues of the normalized graph Laplacian along with the deciles of the computed HKS (right-side channel in Figure3). So as to evaluate the impact of persistence diagrams, we also report classification performances for those features alone (thus unplugging the PersLay layers) denoted "Spectral + HKS"55footnotemark: 5 in Table 4.
We use the same accuracy metric as in (Zhang et al., 2018). For each dataset, we compute a final score by averaging ten 10-folds, where a single 10-fold is computed by randomly shuffling the dataset, then splitting it into 10 different parts, and finally classifying each part using the nine others for training and averaging the classification accuracy obtained throughout the folds. We report in Footnote 4
the average and standard deviation of the scores we obtain. In most cases, our approach is comparable, if not better, than state-of-the-art results, despite using a very simple neural network architecture. More importantly, it can be observed from the last two columns of Table4 that, for all datasets, the use of extended persistence diagrams significantly improves over using the additional features alone.
|Dataset||ScaleVariant11footnotemark: 1||RetGK1 * 22footnotemark: 2||RetGK11 * 22footnotemark: 2||FGSD 33footnotemark: 3||GCNN 44footnotemark: 4||Spectral + HKS 55footnotemark: 5||PersLay|
In this article, we propose a versatile, powerful and simple neural network layer to process persistence diagrams called PersLay, which generalizes most of the techniques used to vectorize persistence diagrams that can be found in the literature—while optimizing them task-wise. Our code is freely available publicly at https://github.com/MathieuCarriere/perslay. We showcase the efficiency of our approach by achieving state-of-the-art results on synthetic orbit classification coming from dynamical systems and several graph classification problems from real-life data, while working at larger scales than kernel methods developed for persistence diagrams and remaining simpler than most of its neural network competitors. We believe that PersLay has the potential to become a central tool to incorporate topological descriptors in a wide variety of complex machine learning tasks.
RetGK: Graph Kernels based on Return Probabilities of Random Walks.In Advances in Neural Information Processing Systems, pp. 3968–3978, 2018.
|Dataset||Nb of orbit observed||Number of classes||Number of points per orbit|
|Dataset||Nb graphs||Nb classes||Av. nodes||Av. Edges||Av.||Av.|
|Dataset||Func. used||PD preproc.||DeepSet Channel||Optim.|
|ORBIT5K||,||prom(400)||Pm(25,25,10,top-5)||adam(0.01, 0., 300)|
|ORBIT100K||,||prom(400)||Pm(25,25,10,top-5)||adam(0.01, 0., 300)|
|REDDIT5K||prom(300)||Pm(25,25,10,sum)||adam(0.01, 0.99, 500)|
|REDDIT12K||prom(400)||Pm(5,5,10,sum)||adam(0.01, 0.99, 1000)|
|COLLAB||,||prom(200)||Pm(5,5,10,sum)||adam(0.01, 0.9, 1000)|
|IMDB-B||,||prom(200)||Im(20,(10,2),20,sum)||adam(0.01, 0.9, 500)|
|IMDB-M||,||prom(500)||Im(10,(10,2),10,sum)||adam(0.01, 0.9, 500)|
|BZR||,||—||Im(15,(10,2),10,sum)||adam(0.01, 0.9, 100)|
|COX2||,||—||Im(20,(10,2),20,sum)||adam(0.01, 0.9, 500)|
|DHFR||,||—||Im(20,(10,2),20,sum)||adam(0.01, 0.9, 500)|
|MUTAG||—||Im(20,(10,2),20,sum)||adam(0.01, 0.9, 300)|
|PROTEINS||prom(200)||Im(15,(10,2),10,sum)||adam(0.01, 0.9, 70)|
|NCI1||,||—||Pm(25,25,10,sum)||adam(0.01, 0.9, 300)|
|NCI109||,||—||Pm(25,25,10,sum)||adam(0.01, 0.9, 300)|
|FRANKENSTEIN||—||Im(20,(10,2),20,sum)||adam(0.01, 0.9, 300)|
Input data was fed to the network with mini-batches of size 128. For each dataset, various parameters are given (extended persistence diagrams, neural network architecture, optimizers, etc.) that were used to obtain the scores from Footnote 4. In Table 5, we use the following shortcuts:
: persistence diagrams obtained with Gudhi’s -dimensional AlphaComplex filtration.
: extended persistence diagram obtained with HKS on the graph with parameter .
prom(): preprocessing step selecting the points that are the farthest away from the diagonal.
PersLay channel Im(, (, ), , op) stands for a function obtained by using a Gaussian point transformation sampled on grid on the unit square followed by a convolution with filters of size , for a weight function optimized on a grid and for an operation op.
PersLay channel Pm(, , , op) stands for a function obtained by using a line point transformation with lines followed by a permutation equivariant function  in dimension , for a weight function optimized on a grid and for an operation op.
adam() stands for the ADAM optimizer  with learning rate , using an Exponential Moving Average222https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage with decay rate , and run during epochs.
The proof directly follows from the following two theorems. This first one, proved in , is a consequence of classical arguments from matrix perturbation theory.
Let and let be the Laplacian matrix of a graph with vertices. Let , be the distinct eigenvalues of and denote by the smallest distance between two distinct eigenvalues: . Let be another graph with vertices and Laplacian matrix with , where denotes the Frobenius norm of . Then, if , there exists a constant such that for any vertex ,
if , there exists two constants such that for any vertex ,
In particular, if , there exists a constant - notice that also depends on - such that in the two above cases,
Theorem 4.2 then immediately follows from the second following theorem, which is a special case of general stability results for persistence diagrams.