1 Introduction
Many different areas of science require working with graph data, such as molecular biology, finance and social sciences. Being able to solve machine learning problems with input data given as graphs has thus become a problem of primary importance in the context of big data. Unfortunately, this turns out to be a difficult question, since the space of graphs is not wellstructured (in particular it does not have an Euclidean structure): different graphs may have different numbers of vertices and edges with no clear correspondences between them, and many standard operations, such as computing the sum or the mean of a sample of graphs, are not defined in the space of graphs.
To handle this issue, a lot of attention has been devoted to the construction of kernels that can handle graphs, since kernel methods are a very handy and common set of tools that can cope with nonstructured data. These kernels are constructed with various graph techniques, such as random walks or graphlet isomorphisms. See zhang2018retgk and the references therein for a complete description of graph kernels presented in the literature. The main issue faced by the use of kernels is scalability: kernel methods usually require the storage of the entire corresponding kernel matrices, which are quadratic with respect to the number of inputs. This becomes quickly prohibitive as the dataset size increases. Hence, a more recent line of work has focused on extending neural networks
for graph data. These networks extend the traditional convolution occuring in convolutional neural networks with various neighborhood aggregation schemes for graph vertices, often leading to scalable architectures whose results are competitive with kernel methods. Again, we refer the interested reader to
Xu2018 and the references therein for an overview of the literature.A third line of work uses computational topology to efficiently encode and summarize graph structures in compact descriptors, that can then be processed with the aforementioned techniques. These descriptors are called persistence diagrams, which are sets of points in the Eulidean plane . They aim at computing and quantifying the presence and importance of topological features in graphs (and eventually higher dimensional combinatorial objects—known as simplicial complexes), such as branches, loops and connected components. This is the approach advocated in two recent works, where persistence diagrams are first computed on the graphs, and then processed by either a kernel tran2018scale or a neural network hofer2017deep. Both of these approaches convolve the persistence diagrams with 2D Gaussian functions in order to define the kernel or the first layer of the neural network.
In this article, we follow this third line of work and focus on the interactions between deep learning and computational topology, by studying neural networks that are tailored to handle and process persistence diagrams, and by providing an illustration on graph classification.
Contributions.
In this article, our contributions are two fold:

We provide an efficient and theoretically sound way to encode and summarize graph structures using an extension of ordinary persistence called extended persistence and a specific family of functions called heat kernel signatures.

We then provide a very simple (namely, two layers), versatile and competitive neural network architecture to process (extended) persistence diagrams. It is based on the recently proposed DeepSet architecture Zaheer2017, and encompasses most of the vectorizing approaches for persistence diagrams of the literature.
Note that our architecture is not restricted to graph classification, and can be used as soon as persistence diagrams are given as input, whatever data they were computed from. Moreover, we will soon release publicly the Python code implementing our architecture and persistence diagram computations.
Outline.
Section 2 recalls the basics of (extended) persistence theory. Since (extended) persistence diagrams are defined from functions on graph vertices, we present in Section 3 the family of functions that we use to generate our diagrams, the socalled heat kernel signatures. Then, we detail our general neural network architecture in Section 4, and we finally confirm the efficiency of our approach with a set of experiments in Section 5.
Notations.
In this paper, a graph is denoted by where is the set of vertices and is the set of nonoriented edges, i.e. a subset of quotiented by the equivalence relation that identifies with for any . The weight function on edges is denoted by . A graph is unweighted if all of its edges have same weight .
2 Extended persistence diagrams
In this section, we explain how one can encode graph structures with the socalled extended persistence diagram descriptors. We will merely recall the basics of (extended) persistence theory in the context of graphs, and we refer the interested reader to Edelsbrunner2010; Oudot2015 for a thorough description of general (extended) persistence theory for topological spaces, which was originally defined in CohenSteiner2009.
Ordinary persistence.
Persistence aims at encoding the structure of a pair , where is a graph and is a function defined on its vertices. This is achieved by looking at the sublevel graphs where , and —see Figure 1 for an illustration. Indeed, making increase from to gives a sequence of increasing subgraphs, i.e. subgraphs that are nested with respect to the inclusion. This sequence starts with the empty graph and ends with the full graph . Persistence aims at encoding the topological changes in the sublevel graphs in this sequence. For instance, any branch of pointing downwards (with respect to the orientation given by ) will create a new connected component in the sequence of sublevel graphs, which appears for a specific sublevel graph , the value , called the birth time of this connected component, being the function value at the tip of the branch. Since this connected component eventually gets merged with another at the value where the branch connects to the graph, the corresponding value is stored and called the death time of the branch, and one says that the branch persists on the interval . Similarly, each time a loop of appears in a specific sublevel graph, the corresponding value is saved. Note however that loops persist until since loops never disappear from the sequence of sublevel graphs, and the same applies for whole connected components of . Moreover, branches pointing upwards are completely missed, since they do not create connected components when they appear in the sublevel graphs, making ordinary persistence unable to detect them.
Extended persistence.
To handle this issue, extended persistence refines the analysis by also looking at the superlevel graphs where , and . Similarly, making decrease from to gives a sequence of decreasing subgraphs, for which structural changes can also be recorded, as illustrated on the bottom row of Figure 1. In particular, death times can be defined for loops and whole connected components by picking the superlevel graphs for which the feature appears again, and using the corresponding value as the death time for these features. Moreover, branches pointing upwards can be detected in this sequence of superlevel graphs, in the exact same way that downwards branches were in the sublevel graphs. See the upper part of Figure 2.
Finally, the family of intervals of the form is turned into a multiset of points (i.e. point cloud where points are counted with multiplicity) in the Euclidean plane by using the interval bounds as coordinates. This multiset is called the extended persistence diagram of and is denoted by . Since graphs have four topological features, namely the upwards and downwards branches, the loops and the connected components, points in extended persistence diagrams can have four different types, that are denoted as , , and for downwards branches, upwards branches, connected components and loops respectively:
See the lower part of Figure 2. Note that an advantage of using extended persistence is that all diagram types can be treated similarly, in the sense that points in each type all have finite coordinates, contrarily to ordinary persistence for which points have to be divided into two groups, with either finite or infinite death times, and thus processed differently, as in hofer2017deep. In practice, computing extended persistence diagrams can be efficiently achieved with the C++/Python Gudhi library gudhi.
Choosing the function used to generate extended persistence diagrams is thus a critical step of our pipeline, which we detail, in the context of graph classification, in Section 3. Note however that the architecture that we propose subsequently in Section 4 does not make any assumptions on the choice of function , and can be used to process any set of (extended) persistence diagrams. Before moving to the next section, we finally recall an important property that (extended) persistence diagrams enjoy, namely their stability.
Stability.
The space of persistence diagrams is generally equipped with a metric called the bottleneck distance, denoted by . Its proper definition is not required for this work and can be found in (Edelsbrunner2010, Ch. VIII.2). An important property of extended persistence diagrams endowed with the bottleneck distance is their stability with respect to perturbations applied to the functions generating them. Indeed, diagrams computed with similar functions are also going to be similar in a Lipschitzcontinuous manner, as stated in the following theorem:
Theorem 2.1 (Chazal2016; CohenSteiner2009; Oudot2015)
Let be a graph and be two functions defined on its vertices. Then:
(1) 
where stands for the socalled bottleneck distance between persistence diagrams. Moreover, this inequality is also satisfied for each of the subtypes and individually.
Despite its appealing theoretical properties, the bottleneck distance is not suited for statistical and learning applications of persistence diagrams. The space of persistence diagrams equipped with the bottleneck distance does not have a linear structure, and computing the bottleneck distance for large diagrams can be computationally prohibitive. As such, performing machine learning with persistence diagrams requires more investigation. For instance, in practical applications, diagrams are generally embedded into Hilbert spaces of either finite Adams2017; carriere2015stableSignature3DShape or infinite dimension Carriere2017; Kusano2016; Le2018; Reininghaus2015 using kernel methods. In Section 4, we follow the former approach by building a versatile layer for neural networks that learn taskspecific vectorizations for (extended) persistence diagrams.
Remark 2.2
Note that, even though extended persistence diagrams are stable, they are in fact bagoffeatures descriptors, since they encode the presence and size of every topological feature but are oblivious to their actual layout in the graphs. Hence, different graphs might still have same extended persistence diagrams. However, it has been shown that extended persistence diagrams are actually complete descriptors for graphs that are close enough Carriere2017a.
3 Heat kernel signatures on graphs
We have seen in Section 2 how extended persistence diagrams can be computed from a graph and a function defined on its vertices. Thus, in this section, we provide more details about the family of functions that we use in our experiments in Section 5, called the heat kernel signatures (HKS). HKS is an example of spectral family of signatures, i.e. functions derived from the spectral decomposition of graph Laplacians, which provide informative features for graph analysis. Although our approach based on extended persistence diagrams can be used with any family of signatures, we restrict for clarity of exposure to the HKS, whose extended persistent homology turns out to be stable with respect to the edges weights, leading to robust signatures that we eventually feed to our neural network in Sections 4 and 5.
Graph Laplacian.
The adjacency matrix of a graph with vertex set is the symmetric matrix defined by if and otherwise. The degree matrix is the diagonal matrix defined by . The normalized graph Laplacian is the linear operator acting on the space of functions defined on the vertices of which is represented by the matrix
. It admits an orthonormal basis of eigenfunctions
and its eigenvalues satisfy
.Heat kernel signatures.
Note that, as the orthonormal eigenbasis is not uniquely defined, the eigenfunctions cannot be directly used as signatures to compare graphs. To overcome this problem we consider the heat kernel signatures:
Definition 3.1 (hu2014stable; sun2009concise)
Let . Then, given a graph and , the heat kernel signature at is the function defined on by:
Remark 3.2
Note that the HKS are part of a more general family of signatures, the Laplacian family signatures hu2014stable, in which can be replaced by any other function. Included in this general family are, for instance, the wave kernel signatures aubry2011wave and the wavelet kernel signatures hammond2011wavelets. The constructions and results of the present work extend straightforwardly to these other signatures.
Remark 3.3
Since the HKS can also be computed by taking the diagonal of the heat kernel associated to the heat equation on the graph —see, e.g., (chung1997spectral, Chapter 10)—they can be seen as multiscale signatures and they inherit properties that make them particularly wellsuited for graph analysis and classification. In particular, they do not depend on the choice of orthonormal eigenbasis and are invariant under graph isomorphisms: if is a graph isomorphism, then for any , one has .
Stability.
The HKS have already been used as signatures to address graph matching problems hu2014stable or to define spectral descriptors to compare graphs tsitsulin2018netlsd. These signatures rely on the distributions of values taken by the HKS but not on their global topological structures, which are encoded in their extended persistence diagram. Combining Theorem 2.1 for extended persistence diagrams with stability properties of the HKS proven in hu2014stable leads to the stability of extended persistence diagrams of the HKS under perturbations of the Laplacian, as stated in the following theorem.
Theorem 3.4
Let and let be the Laplacian matrix of a graph with vertices. Let be another graph with vertices and Laplacian matrix . Then there exists a constant only depending on and the spectrum of such that:
as soon as , the Frobenius norm of , is small enough.
4 Neural Network for Persistence Diagrams
In this section, we detail the general neural network architecture that we use to perform classification on the extended persistence diagrams generated with HKS that we presented in Sections 2 and 3. To do so, we leverage a recent neural network architecture called DeepSet Zaheer2017 that was developed to handle sets of points.
4.1 DeepSet neural network
The main purpose of the DeepSet architecture is to be invariant to the point orderings in the sets. Any such neural network is called a permutation invariant network. In order to achieve this, Zaheer et al. Zaheer2017 proposed to develop a network implementing the general equation:
(2) 
where is the function implemented by the network (usually if one is solving a regression problem and is equal to the number of classes if one is solving a classification problem), is a set with points, is a point transformation and is a final transformation often used to make the input and output dimensions agree. Actually, it is shown in (Zaheer2017, Theorem 2) that if the cardinality is the same for all sets, then for any permutation invariant network , there exist and such that Equation (2) is satisfied. Moreover, this is still true for variable if the sets belong to some countable space.
4.2 Extension to persistence diagrams
DeepSet for persistence diagrams.
In this section, we slightly extend the DeepSet architecture to persistence diagrams and we show how several approaches for vectorizing persistence diagrams can be seen as special cases of our general neural network. Our general architecture implements the following general form:
(3) 
where is any permutation invariant operation (such as minimum, maximum, sum, th largest value…), is a weight function for the persistence diagram points, and is a point transformation. Let us now discuss potential choices for these functions that we implemented for our experiments in Section 5.
Weight function .
In our experiments, is actually treated as a parameter whose values are optimized during training. More precisely, the diagrams are scaled to , so that becomes a function defined on the unit square, that we approximate by discretizing this square into a set of pixels. The value of on each pixel is then optimized individually during training.
Point transformation and vectorization of persistence diagrams.
We now present three point transformations that we used and implemented for parameter in Equation (3). Then, we explicit the persistence diagram vectorizations of the literature that are induced by each of these point transformations.

Triangle function. The triangle function associated to a point is:
Let and . The triangle point transformation is:
See Figure 3 for an example.

Gaussian function. The Gaussian function associated to a point is:
Let and . The Gaussian point transformation is:
See Figure 4 for an example.

Line function. The line function associated to a line with direction vector and bias is:
Let and be lines. The line point transformation is:
See Figure 5 for an example.
Note that line point transformations are actually examples of socalled permutation equivariant functions, which are specific functions defined on sets of points that were used to build the DeepSet architecture in Zaheer2017.
Relation to existing vectorization methods.
Here we explicit the connections between our generic formula in Equation (3) and various vectorization and kernel methods for persistence diagrams in the literature.

Using with samples , th largest value, , amounts to evaluating the th persistence landscape Bubenik2015 on .

Using with samples , sum, arbitrary , amounts to evaluating the persistence silhouette weighted by Chazal2015 on .

Using with samples , sum, arbitrary , amounts to evaluating the persistence surface weighted by Adams2017 on . Moreover, characterizing points of persistence diagrams with Gaussian functions is also the approach advocated in several kernel methods for persistence diagrams Kusano2016; Le2018; Reininghaus2015.

Using where is a modification of the Gaussian point transformation defined with: for any , where if for some , and otherwise, sum, , is the approach presented in hofer2017deep. This shows that we substantially generalize the architecture in this article.

Using with lines , th largest value, , is similar to the approach advocated in Carriere2017, where the sorted projections of the points onto the lines are then compared with the norm and exponentiated to build the socalled Sliced Wasserstein kernel for persistence diagrams.
Optimization.
In practice, the samples and lines are parameters that are optimized on the training set by the network. This means that our approach looks for the best locations to evaluate the landscapes, silhouettes and persistence surfaces, as well as the best lines to project the diagrams onto, with respect to the problem the network is trying to solve.
5 Experimental results
In this section, we detail the experiments we used to validate our approach. Our code is based on the open source C++/Python library Gudhi gudhi, Python package sklearntda sklearntda, and Python package tensorflow tensorflow2015 and will be publicly released soon.
5.1 Datasets and data preprocessing
Datasets.
We evaluate our classification model on a series of different graph datasets that are commonly used as a baseline in graph classification problems. A description of these datasets can be found in Table 2 in the Appendix.

REDDIT5K, REDDIT12K, COLLAB, IMDBB, IMDBM are composed of social graphs. We obtain REDDIT5K, REDDIT12K, COLLAB from http://www.mit.edu/~pinary/kdd/datasets.tar.gz, and IMDBB, IMDBM are provided as .graphml files by tran2018scale.

BZR, COX2, DHFR, MUTAG, PROTEINS, NCI1, NCI109, FRANKENSTEIN are composed of graphs coming from medical or biological frameworks. For all these datasets, we used the .graphml files provided by tran2018scale.
All the observations in these datasets are graphs represented by their adjacency matrices. Note that some of them (marked as * in Table 1) may have attributes on their nodes or edges, an additional information that our approach does not use.
Preprocessing.
For each graph in each dataset, we generate a set of persistence diagrams along with some finitedimensional features in the following way: for , we compute the corresponding extended persistence diagram induced by HKS (see Section 2 and Section 3) with parameter . We then divide each diagram into its four different types , , and to get four different diagrams. Hence, each graph gives 16 persistence diagrams: one for each parameter of the HKS and diagram type. These diagrams are then transformed with the birthpersistence function: , truncated to the first points which are the farthest away from the diagonal (we use a specific for each dataset, see Table 3), and scaled to the unit square .
Additional features.
We use in parallel feature vectors built on graphs, which contain the sorted eigenvalues of the normalized graph Laplacian
, along with the deciles of the HKS for each
.5.2 Network architecture
In order to show the contribution of the extended persistence diagrams and the way we use them in neural networks with Equation (3), we voluntarily use a very simple network architecture. Namely, our network has only two layers: the first one consists of few channels that process the persistence diagrams in parallel. Note that we do not necessarily use all the diagrams produced in practice. Table 3 precises, for each experiments, which diagrams where used. The output of this layer is then concatenated with the graph features and the resulting vector is normalized. The second (and final) layer is a fully connected layer, whose output is used to make the prediction. See Figure 6.
5.3 Experimental settings and results
Table 1 summaries the scores obtained by our approach on different sets of graphs. We compare it to four other graph classification methods:

Scalevariant topo tran2018scale, which uses a kernel for persistence diagrams

DLtopo hofer2017deep, which uses a neural network for persistence diagrams. We recall from Section 4 that this approach is a specific case of ours.

RetGK zhang2018retgk is a graph classification method relying on a kernel approach that leverages attributes on the graph vertices and edges. It reaches stateoftheart results on many datasets. Note that while the exact computation (denoted as RetGK1 in Table 1) can be quite long, this method can be efficiently approximated (RetGK11 in Table 1) while preserving good accuracy scores.

FGDS verma2017hunt is a graph classification method that does not leverage attributes, and reaches stateoftheart results on different datasets.
For each dataset, we compute a final score by averaging ten 10folds (as in zhang2018retgk), where a single 10fold is computed by randomly shuffling the dataset, then splitting it into 10 different parts, and finally classifying each part using the nine others for training and averaging the classification accuracy obtained throughout the folds. We then report in Table 1
the average and standard deviation of these scores. One can see that, in most cases, our approach is comparable, if not better, than stateoftheart results, despite avoiding kernels and using very simple neural networks architecture. More importantly, it can be observed from the last two columns of Table
1 that, for all datasets, the use of extended persistence diagrams significantly improves over using the additional features alone.Dataset  ScaleVariant tran2018scale  DLtopo hofer2017deep  RetGK1 * zhang2018retgk  RetGK11 * zhang2018retgk  FGSD verma2017hunt  Spectral + HKS only  Spectral + HKS + TDA 

REDDIT5K  —  54.5  56.1(0.5)  55.3(0.3)  47.8  49.7(0.3)  57.2 
REDDIT12K  —  44.5  48.7(0.2)  47.1(0.3)  —  39.7(0.1)  48.2 
COLLAB  —  —  81.0(0.3)  80.6(0.3)  80.0  67.8(0.2)  74.6(0.2) 
IMDBB  72.9  —  71.9(1.0)  72.3(0.6)  73.6  67.6(0.6)  70.9(0.7) 
IMDBM  50.3  —  47.7(0.3)  48.7(0.6)  52.4  44.5(0.4)  47.2(0.5) 
BZR *  86.6  —  —  —  —  80.8(0.8)  87.2(0.7) 
COX2 *  78.4  —  80.1(0.9)  81.4(0.6)  —  78.2(1.3)  80.7(1.2) 
DHFR *  78.4  —  81.5(0.9)  82.5(0.8)  —  69.5(1.0)  80.2(0.9) 
MUTAG *  88.3  —  90.3(1.1)  90.1(1.0)  92.1  85.8(1.32)  91.2(1.5) 
PROTEINS *  72.6  —  75.8(0.6)  75.2(0.3)  73.4  73.5(0.3)  75.3 
NCI1 *  71.6  —  84.5(0.2)  83.5(0.2)  79.8  65.3()  72.8(0.3) 
NCI109 *  70.5  —  —  —  78.8  64.9(0.2)  70.6(0.3) 
FRANKENSTEIN  69.4  —  —  —  —  62.9(0.1)  70.7(0.4) 
6 Conclusion
In this article, we propose a versatile, powerful and simple way to handle graphs in machine learning. More precisely, our approach is based on combining topological descriptors and neural networks: we presented a simple and easy way to encode graph structure into compact descriptors by combining an extension of persistence theory and a family of Laplacianbased graph functions. These descriptors, in addition to being provably robust, can then be fed to our simple and general neural network architecture that generalizes most of the techniques used to vectorize persistence diagrams that can be found in the literature. Finally, we validated our method by showing that it can achieve stateoftheart results on several graph classification problems, despite being much simpler than most of its competitors.
References
Appendix A Datasets description
Table 2 summarizes key information for each dataset.
Dataset  Nb graphs  Nb classes  Av. nodes  Av. Edges  Av.  Av. 

REDDIT5K  5000  5  508.5  594.9  3.71  90.1 
REDDIT12K  12000  11  391.4  456.9  2.8  68.29 
COLLAB  5000  3  74.5  2457.5  1.0  2383.7 
IMDBB  1000  2  19.77  96.53  1.0  77.76 
IMDBMULTY  1500  3  13.00  65.94  1.0  53.93 
BZR  405  2  35.75  38.36  1.0  3.61 
COX2  467  2  41.22  43.45  1.0  3.22 
DHFR  756  2  42.43  44.54  1.0  3.12 
MUTAG  188  2  17.93  19.79  1.0  2.86 
PROTEINS  1113  2  39.06  72.82  1.08  34.84 
NCI1  4110  2  29.87  32.30  1.19  3.62 
NCI109  4127  2  29.68  32.13  1.20  3.64 
FRANKENSTEIN  4337  2  16.90  17.88  1.09  2.07 
Appendix B Parameters used in our experiments
Input data was fed to the network with minibatches of size 128. We give, for each dataset, parameters details (extended persistence diagrams, neural network architecture, optimizers, etc.) used to obtain the scores we present in Table 1 in the column Our approach. In Table 3, we use the following shortcuts:

: extended persistence diagram obtained with HKS on the graph with parameter .

prom(): preprocessing step consisting in keeping the points that are the farthest away from the diagonal.

channels are denoted in the following way:

Im(, (, ), , ) stands for a function obtained by using a Gaussian point transformation sampled on grid on the unit square followed by a convolution with filters of size , for a weight function optimized on a grid and for an operation .

Pm(, , , ) stands for a function obtained by using a line point transformation with lines followed by a permutation equivariant function with parameters , and , where the maximum is over each column, for a weight function optimized on a grid and for an operation .


adam() stands for the ADAM optimizer [Kingma2014] with learning rate , using an Exponential Moving Average^{1}^{1}1https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage with decay rate , and run during epochs.
Dataset  Func. used  PD preproc.  DeepSet Channel  Optim. 

REDDIT5K  prom(300)  Pm(25,25,10,sum)  adam(0.01, 0.99, 70)  
REDDIT12K  prom(400)  Pm(5,5,10,sum)  adam(0.01, 0.99, 400)  
COLLAB  ,  prom(200)  Im(20,(10,2),20,sum)  adam(0.01, 0.9, 500) 
IMDBB  ,  prom(200)  Im(20,(10,2),20,sum)  adam(0.01, 0.9, 500) 
IMDBM  ,  prom(200)  Im(20,(10,2),20,sum)  adam(0.01, 0.9, 500) 
BZR  ,  —  Im(15,(10,2),10,sum)  adam(0.01, 0.9, 300) 
COX2  —  Im(20,(10,2),20,sum)  adam(0.01, 0.9, 300)  
DHFR  —  Im(20,(10,2),20,sum)  adam(0.01, 0.9, 300)  
MUTAG  —  Im(20,(10,2),20,sum)  adam(0.01, 0.9, 300)  
PROTEINS  prom(200)  Im(15,(10,2),10,sum)  adam(0.01, 0.9, 300)  
NCI1  —  Im(20,(10,2),20,sum)  adam(0.01, 0.9, 300)  
NCI109  —  Im(20,(10,2),20,sum)  adam(0.01, 0.9, 300)  
FRANKENSTEIN  —  Im(20,(10,2),20,sum)  adam(0.01, 0.9, 300)  
