1 Introduction
Persistent homology is a widely used tool in Topological Data Analysis, designed to track topological changes as data is examined at different scales. Its main descriptor is the persistence diagram (PD), often represented as a multiset of birth and death pairs of topological features. Since it can encode topological and geometric properties of the data, it is potentially complementary to features retrieved by classical descriptors. PDs have found applications in many fields, from signal analysis to 3D shape classification [perea2015sliding, buchet2018persistent, li2014persistence, carriere2015stable, saggar2018towards, suh2019persistent, pranav2017topology, kannan2019persistent].
A recent line of research aims at leveraging PDs for graph classification. Given a graph signature on the vertices of a graph, one can construct a sublevel filtration from the resulting nested sequence of increasing subgraphs. The PDs obtained from this filtration can be used in machine learning tasks either by constructing a kernel, such as the sliced Wasserstein kernel [carriere2017sliced] and the persistence Fisher kernel [le2018persistence], or by mapping them to a vector space, such as the persistence landscape [bubenik2015statistical] and the persistence image [adams2017persistence]. However, kernel methods require an evaluation for each pair of graphs and thus do not scale well with the size of the dataset, while vectorization methods are usually nonparametric or have only a few trainable parameters and therefore remain agnostic to the task at hand. For more scalability and flexibility, a deep learning model that operates on sets can be used directly on PDs [hofer2017deep, carriere2019perslay].
In this paper, we focus on graph classification using PDs from a deep learning perspective. A majority of prior works rely on a scalar node descriptor, such as node degree [hofer2017deep] or the heat kernel signature [carriere2019perslay], to construct the filtration, where the choice is often empirical. However, a multidimensional embedding is usually preferred to describe a graph node. While multidimensional persistence has been studied [carlsson2009theory, lesnick2015theory], it exhibits an essentially different character from its one-dimensional version, and generalizing the corresponding concepts, such as persistence diagrams, is often complicated, making it nontrivial to use them in downstream models. Instead, we explore the potential of leveraging multiple PDs for each data sample based on a family of multiscale graph signatures: the $k$-step return probabilities of random walks, where $k$ is the number of random steps from a starting node, indicating the size of the local structure being explored. For example, 2-hop random walks can help capture the neighborhood size of surrounding nodes, while 3-hop walks can capture triangles in the graph. The idea is that different scales let the model capture more information about each node's neighborhood, which in turn can improve performance. The return probability feature has been used as a node structural role descriptor to construct a kernel for graph classification [zhang2018retgk].
We show that, with an appropriate application and model architecture, it can also be useful in topology-based approaches.
Our contributions are as follows. First, we propose a family of scalar graph signatures based on the return probabilities of random walks to construct multiple PDs per data sample. The PDs are padded to a fixed size and stacked together to form a multi-channel input. Second, we introduce RPNet, a deep neural network architecture designed to learn from the proposed features. We demonstrate our method on a wide range of benchmark datasets and observe that it often outperforms other persistence-diagram-based and neural-network-based approaches, while being competitive with state-of-the-art graph kernel methods.
The rest of this paper is organized as follows. Section 2 and Section 3 briefly go through some related work and background. Section 4 describes our proposed graph signature and network architecture. Section 5 and Section 6 present our experiments and results. Finally, we discuss future directions in Section 7.
2 Related Work
Since PDs have an unusual structure for machine learning approaches, many techniques have been proposed to map them into machine-learning-compatible representations. A popular technique is vectorization. For instance, Bubenik et al. [bubenik2015statistical] propose mapping persistence barcodes into a Banach space, referred to as the persistence landscape. Adams et al. [adams2017persistence] propose the persistence image, a discretization of the persistence surface. Another approach constructs a kernel from PDs; for example, the sliced Wasserstein kernel [carriere2017sliced] and the persistence Fisher kernel [le2018persistence] are designed to work on PDs. Rieck et al. [rieck2019persistent] define a kernel constructed from a vector representation of persistence diagrams through the Weisfeiler-Lehman subtree feature. While these methods make it much easier to deploy machine learning techniques, they are predefined and thus often suboptimal for a specific task.
Learning representation has been investigated to provide a taskoptimal representation in the vector space for PDs. Hofer et al. [hofer2017deep] and Carriere et al. [carriere2019perslay] propose a deep learning architecture designed to handle sets of points in PDs. Zhao et al. [zhao2019learning] propose a weighted kernel for persistence image. These methods have been proven to perform well on a wide range of benchmarks.
In these works, a specific node descriptor is usually required for the filtration process. Simple choices include node degree [hofer2017deep] or the node attributes if available [rieck2019persistent]. More complex descriptors include the heat kernel signature [carriere2019perslay] and the Ricci curvature and Jaccard index [zhao2019learning]. For non-attributed graphs, [rieck2019persistent] uses a stabilization procedure based on node degree to leverage more topological information.
Our work takes an approach similar to [hofer2017deep, carriere2019perslay] and proposes a deep learning architecture that contains pooling layers to learn from multiset inputs. Different from prior works, our architecture is able to handle multiple PDs for each data sample, each computed from a node descriptor in the proposed family of graph signatures.
3 Background
We briefly recall the relevant concepts and refer interested readers to [edelsbrunner2010computational, aktas2019persistence] for more detailed definitions of homology.
3.1 Simplices and Simplicial Complexes
Simplices
Simplices are higher-dimensional analogs of points, line segments, and triangles, such as a tetrahedron. Formally, an $n$-simplex is the convex hull of $n+1$ affinely independent points $x_0, \dots, x_n$, i.e., the set of all convex combinations $\sum_{i=0}^{n} \lambda_i x_i$ where $\sum_{i=0}^{n} \lambda_i = 1$ and $\lambda_i \ge 0$ for all $i$. For example, a 0-simplex is a single point (a vertex), a 1-simplex is a line segment (an edge), and a 2-simplex is a triangle with its interior (Figure 1). Simplices are the building blocks of simplicial complexes.
Simplicial Complexes
A simplicial complex $K$ is a finite collection of simplices such that (1) for every simplex $\sigma \in K$, every face $\tau \subseteq \sigma$ also satisfies $\tau \in K$, and (2) for any two simplices $\sigma_1, \sigma_2 \in K$, the intersection $\sigma_1 \cap \sigma_2$ is either empty or a common face of both. The dimension of $K$ is the maximum dimension of any of its simplices. The first condition states that if a simplex is in $K$, then its faces are also in $K$. The second condition means that simplices may only be glued along common faces, so that no improper intersections occur (Figure 2).
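As a concrete illustration, the closure condition (1) can be checked mechanically for an abstract simplicial complex given as a collection of vertex tuples. This is a hypothetical helper of our own, not part of any library used in this paper; condition (2) holds automatically in the abstract (combinatorial) setting, so only face closure is tested.

```python
from itertools import combinations

def is_simplicial_complex(simplices):
    """Check closure under taking faces: every proper face of every
    simplex must itself belong to the collection (condition (1))."""
    S = {frozenset(s) for s in simplices}
    for s in S:
        for r in range(1, len(s)):
            for face in combinations(s, r):
                if frozenset(face) not in S:
                    return False
    return True
```

For example, a filled triangle with all of its edges and vertices passes the check, while dropping one edge violates closure.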
3.2 Persistence Diagrams of Graphs
Consider a graph $G = (V, E)$ with $n$ nodes, $m$ edges, and a node descriptor function $f : V \to \mathbb{R}$. First, we use the descriptor to label the vertices of $G$. These labels are used as threshold values to define the sublevel graph $G_a = (V_a, E_a)$, where $V_a = \{v \in V : f(v) \le a\}$ and $E_a = \{(u, v) \in E : \max(f(u), f(v)) \le a\}$. This provides a nested family of simplicial complexes $\emptyset = G_{a_0} \subseteq G_{a_1} \subseteq \dots \subseteq G_{a_n} = G$, where $a_1 \le a_2 \le \dots \le a_n$ are the node values sorted in increasing order. This sequence, starting with the empty graph and ending with the full graph $G$, is called a filtration of $G$. During this construction, topological features such as connected components and loops appear and disappear, and the times of their appearance and disappearance are recorded. When a component first appears, its birth is tracked, and when it merges into another component, we record its death. Such homological features can be considered graph features. Specifically, these intervals form a persistence diagram, which consists of the birth and death time of each feature as points $(\text{birth}, \text{death})$ in $\mathbb{R}^2$. Figure 3 illustrates the 0-dimensional and 1-dimensional persistence barcodes and the corresponding persistence diagram of a filtration on a simple input graph; there, for each vertex $v$, its node degree $\deg(v) = \sum_{u} A_{uv}$ is used as the node descriptor $f(v)$, where $A$ is the adjacency matrix of $G$.
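To make the construction concrete, the 0-dimensional part of this sublevel persistence can be computed with a single union-find pass over the edges sorted by filtration value, pairing each merge with the younger component's birth (the elder rule). This is an illustrative sketch under our own naming and data layout; the experiments in this paper instead use the Dionysus and Gudhi libraries.

```python
def sublevel_pd0(values, edges):
    """0-dimensional sublevel persistence of a graph.

    values: dict vertex -> filtration value f(v)
    edges: iterable of (u, v) pairs
    Returns a list of (birth, death) pairs; surviving components
    (essential points) get death = float('inf').
    """
    parent = {v: v for v in values}
    birth = dict(values)          # birth time of each component's root

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    pairs = []
    # an edge enters the filtration at max(f(u), f(v))
    for u, v in sorted(edges, key=lambda e: max(values[e[0]], values[e[1]])):
        t = max(values[u], values[v])
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                         # edge closes a loop (1-homology)
        if birth[ru] > birth[rv]:
            ru, rv = rv, ru                  # elder rule: younger one dies
        pairs.append((birth[rv], t))
        parent[rv] = ru
    roots = {find(v) for v in values}
    pairs.extend((birth[r], float('inf')) for r in roots)
    return pairs
```

On a path graph 1-2-3 with $f$ values 0, 1, 2, this yields two non-essential points born and killed as each vertex merges into the older component, plus one essential point for the surviving component.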
3.3 Return Probabilities of Random Walks
Consider an undirected graph $G = (V, E)$. Let $A$ be the adjacency matrix of $G$ and $D$ be the degree matrix, i.e., a diagonal matrix with $D_{ii} = \sum_{j} A_{ij}$.
A $k$-step walk starting from node $v_0$ is a sequence of nodes $(v_0, v_1, \dots, v_k)$ with $(v_{i-1}, v_i) \in E$ for $1 \le i \le k$. A random walk on $G$ is a Markov chain whose transition probabilities are
$P(v_{i+1} = y \mid v_i = x) = A_{xy} / D_{xx}$.
The transition probability matrix is calculated as $P = D^{-1} A$. The $k$-hop transition probability matrix is $P^k$, where $(P^k)_{uv}$ denotes the probability of transitioning from node $u$ to node $v$ in $k$ steps. The return probability of a random walk is defined as $(P^k)_{vv}$, the probability of returning to the source node $v$ after $k$ hops. When $k = 0$, all return probabilities are equal to 1.
A straightforward way to calculate the $k$-hop transition probability matrix requires $k$ dense matrix multiplications, which take $O(kn^3)$ time. Since we are only interested in the diagonal elements, the computation can be carried out more efficiently, reducing the per-$k$ cost of the diagonal to $O(n^2)$ after a one-time $O(n^3)$ preprocessing step [zhang2018retgk]. Calculating all return probabilities for $k \le K$ then takes $O(n^3 + Kn^2)$ time, allowing the computation to scale to large graphs.
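One way to realize this speedup, following the trick used in [zhang2018retgk], reads the diagonal of every $P^k$ off a single eigendecomposition of the symmetrized matrix $M = D^{-1/2} A D^{-1/2}$, which is similar to $P$ and therefore shares its diagonal powers. The sketch below uses our own function name and assumes a graph with no isolated nodes.

```python
import numpy as np

def return_probabilities(A, K):
    """Return a (K, n) array r where r[k-1, v] is the k-hop return
    probability of a random walk started at node v.

    Uses M = D^{-1/2} A D^{-1/2}: since P = D^{-1/2} M D^{1/2},
    the diagonals of P^k and M^k coincide, so one symmetric
    eigendecomposition (O(n^3)) replaces K dense matrix products.
    Assumes every node has degree >= 1.
    """
    A = np.asarray(A, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    M = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    lam, U = np.linalg.eigh(M)                  # done once
    U2 = U ** 2
    # (M^k)_{vv} = sum_j lam_j^k * U[v, j]^2  -> O(n^2) per k
    return np.stack([U2 @ (lam ** k) for k in range(1, K + 1)])
```

For a triangle graph, the 1-hop return probability of every node is 0 (no self-loops) and the 2-hop return probability is 1/2, which matches direct computation of $P^2$.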
3.4 Deep Learning with Sets
Most machine learning algorithms take fixed-size vectors as input; they are not designed for data in the form of sets. In contrast to vectors, sets are invariant to permutation and have variable length, which makes handling this type of input challenging and nontrivial. Research aimed at tackling this problem employs permutation-invariant functions with two desirable characteristics: invariance to the ordering of the inputs and the ability to handle variable-size inputs [zaheer2017deep, rezatofighi2017deepsetnet, xu2018spidercnn]. Specifically, a permutation-invariant function $f$ on a set $X = \{x_1, \dots, x_M\}$ can be defined as
$f(X) = \rho\left( \sum_{i=1}^{M} \phi(x_i) \right),$
where $\phi$ maps each point to a feature vector, $M$ is the number of points, and $\rho$ processes the pooled representation. For example, average pooling and sum pooling are two predefined permutation-invariant functions that are commonly used to handle sets [kipf2016semi, hamilton2017inductive, atwood2016diffusion].
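The construction above can be sketched in a few lines of NumPy. The weight matrices and dimensions here are arbitrary placeholders of our own choosing; the point is only that sum pooling between the two maps makes the output independent of the ordering and size of the input set.

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(5, 16))   # per-point map phi (illustrative weights)
W_rho = rng.normal(size=(16, 8))   # post-pooling map rho

def set_embed(X):
    """Permutation-invariant embedding of a point set X of shape (m, 5):
    f(X) = rho(sum_i phi(x_i))."""
    phi = np.maximum(X @ W_phi, 0.0)     # pointwise features
    pooled = phi.sum(axis=0)             # sum pooling discards ordering
    return np.maximum(pooled @ W_rho, 0.0)

X = rng.normal(size=(7, 5))
perm = rng.permutation(7)
assert np.allclose(set_embed(X), set_embed(X[perm]))  # order-invariant
```

The same function also accepts sets of any size, since the pooled vector has a fixed dimension regardless of the number of input points.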
4 Proposed Method
In this section, we describe our approach to using the return probabilities of random walks for topological graph representation and propose our neural network architecture, referred to as RPNet. Let $G = (V, E)$ be an undirected graph, where $V$ is the set of vertices and $E$ is the set of edges connecting two vertices. We define a set of descriptors $\{f_1, \dots, f_K\}$ on $V$, where $f_k(v)$ represents the $k$-hop return probability of a random walk starting from vertex $v$, with $1 \le k \le K$. The higher the value of $k$, the larger the subgraph being considered. By definition,
$f_k(v_i) = (P^k)_{ii}$ with $P = D^{-1} A$,
where $A$ is the adjacency matrix, $D$ is the degree matrix, and $v_i$ is the $i$-th vertex in $V$.
PDs Computation and Processing
Using the sublevel filtration based on each of the $K$ node descriptors $f_k$, we compute 0- and 1-dimensional PDs for each graph $G$. We split the points in these diagrams into three groups: 0-homology essential points (0-homology points whose death is infinite), 0-homology non-essential points (0-homology points whose death is finite), and 1-homology essential points (1-homology points whose death is infinite). Note that a 1-homology feature never disappears since we do not consider higher-degree homology; therefore, 1-homology non-essential points (1-homology points whose death is finite) do not exist. We normalize these points to the range $[0, 1]$ by dividing them by the maximum coordinate (i.e., birth or death) value ($\infty$ is also mapped to 1), and append a 3-dimensional one-hot vector characterizing each group. Thus, each point is represented by a vector in $\mathbb{R}^5$: birth, death, and the 3-d one-hot vector. After this process, we obtain $K$ sets of 5-dimensional vectors. These sets can vary in size (for example, in Fig. 4, one descriptor yields 6 points (blue) while another yields 5 points (yellow)), so we pad them to the same number of elements, $N$, per set. The final output is a $K \times N \times 5$ feature for each graph $G$. Figure 4 illustrates this process on a simple graph.
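The padding-and-stacking step can be sketched as follows. The function name and zero-padding convention are our own illustrative choices; zero rows are convenient because the sum pooling used later over points leaves them without effect.

```python
import numpy as np

def stack_diagrams(diagrams, n_pad):
    """Pad K variable-size sets of 5-dim points into a (K, N, 5) tensor.

    diagrams: list of K arrays of shape (m_k, 5), each row a normalized
    (birth, death) pair followed by a 3-d one-hot group indicator.
    Sets longer than n_pad are truncated; shorter ones are zero-padded.
    """
    out = np.zeros((len(diagrams), n_pad, 5))
    for k, d in enumerate(diagrams):
        d = np.asarray(d, dtype=float)
        m = min(len(d), n_pad)
        out[k, :m] = d[:m]
    return out
```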
4.1 Network architecture
We propose RPNet, a neural network operating on the extracted features. The network consists of an encoder, a decoder, a point pooling layer, a diagram pooling layer, and a classification head. First, the encoder maps each 5-dimensional point $x$ into a $d_e$-dimensional space, $h = \mathrm{enc}(x)$.
Next, we concatenate the encoded feature with the 3-d one-hot vector $o$ to obtain $[h; o] \in \mathbb{R}^{d_e + 3}$. We then pass it to a decoder that maps each point into a $d_d$-dimensional space, $z = \mathrm{dec}([h; o])$.
The point pooling layer performs sum pooling on points in each diagram, and the diagram pooling layer performs weighted pooling on diagrams for each graph.
The classification head is a multilayer perceptron on the embedding of each graph, followed by a softmax function to output the class probabilities.
Our network is optimized with the cross-entropy loss. Figure 5 shows the architecture of our proposed neural network.
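The forward pass described above can be sketched in plain NumPy. All layer sizes, random weights, and names below are illustrative placeholders of our own; the actual model uses trained PyTorch layers with the block structure and hyperparameters described in Section 5.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(x, W, b):
    return np.maximum(x @ W + b, 0.0)

# hypothetical sizes: 5-d points, d_e = 16, d_d = 32, C = 3 classes
K, N = 4, 10                                   # diagrams per graph, points per diagram
We, be = rng.normal(size=(5, 16)), np.zeros(16)
Wd, bd = rng.normal(size=(16 + 3, 32)), np.zeros(32)
Wc, bc = rng.normal(size=(32, 3)), np.zeros(3)
w_diag = rng.normal(size=K)                    # learnable diagram weights

def rpnet_forward(x):
    """x: (K, N, 5) stacked diagrams -> class probabilities of shape (C,)."""
    h = mlp(x, We, be)                         # encode each point
    h = np.concatenate([h, x[..., 2:]], -1)    # re-attach the 3-d one-hot
    h = mlp(h, Wd, bd)                         # decode
    h = h.sum(axis=1)                          # point pooling: (K, d_d)
    g = np.einsum('k,kd->d', w_diag, h)        # weighted diagram pooling
    logits = g @ Wc + bc
    e = np.exp(logits - logits.max())          # softmax head
    return e / e.sum()

p = rpnet_forward(rng.normal(size=(K, N, 5)))
```

Even with random weights, the output is a valid probability vector over the classes, and the two pooling stages collapse the variable point and diagram dimensions into a fixed-size graph embedding.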
5 Experiments
Setup
We use persistence diagrams up to degree 1 for each graph as inputs to our model. We follow the conventional settings for graph classification as described in previous works [hofer2017deep, carriere2019perslay]. Specifically, we perform 10-fold cross-validation (9 folds for training and 1 fold for testing) for each run. We train each model for at most 500 epochs and report the average accuracy and standard deviation at the epochs with the lowest loss on each fold. Training is terminated after 50 epochs without a decrease in the loss. The Adam optimizer [kingma2014adam] is used with an initial learning rate between 0.001 and 0.01 and a decay factor of 0.5 after every 25 epochs, up to 6 times. For the encoder and decoder, we use several blocks of the following four components, in order: a linear layer, a normalization layer (e.g., BatchNorm [ioffe2015batch] or LayerNorm [ba2016layer]), Dropout [srivastava14a], and an activation layer (e.g., ReLU [agarap2018deep] or ELU [clevert2015fast]). We fine-tune the number of blocks (from 2 to 5) and their dimensions in the encoder, along with the learning rate. We report the best scores for each variant of our models, denoted RPNet-$K$, where $K$ is the number of return probabilities of random walks used to generate the multiscale graph representation. It is worth noting that all the baselines we compare with follow the same setup, making the comparison fair.
Datasets
Table 1: Dataset statistics.

Dataset  #graphs  avg #nodes  avg #edges  #classes
MUTAG  188  17.93  19.79  2
PROTEINS  1113  39.06  72.82  2
NCI1  4110  29.87  32.30  2
NCI109  4127  29.68  32.13  2
PTC_FM  349  14.11  14.48  2
PTC_FR  351  14.56  15.00  2
PTC_MM  336  13.97  14.32  2
PTC_MR  344  14.29  14.69  2
COX2  467  41.22  43.45  2
DHFR  756  42.43  44.54  2
REDDIT-BINARY  2000  429.6  497.8  2
IMDB-BINARY  1000  19.77  96.53  2
COLLAB  5000  74.49  2457.78  3
IMDB-MULTI  1500  13.00  65.94  3
REDDIT-5K  4999  508.5  594.9  5
REDDIT-12K  11929  391.4  456.9  12

We evaluate our proposed method on graph classification tasks over 16 benchmark datasets from different domains, including small chemical compounds and protein molecules (e.g., COX2, PROTEINS) and social networks (e.g., REDDIT-BINARY, IMDB-MULTI). We group these datasets into two categories: (1) datasets with fewer than 1,000 graphs, or fewer than 100 vertices per graph on average, and (2) datasets with more than 1,000 graphs and more than 100 vertices per graph on average. It is worth noting that while most of the datasets consist only of adjacency matrices, some bioinformatics datasets also provide categorical node attributes. The aggregate statistics of these datasets are provided in Table 1 (more details can be found at https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets).

Table 2: Classification accuracy (%) on the group-1 datasets.

Method  MUTAG  PROTEINS  NCI1  NCI109  PTC_MR  PTC_FR  PTC_MM  PTC_FM  COX2  DHFR
WL  82.1  -  82.2  82.5  -  -  -  -  -  -
RetGK  90.3±1.1  75.8±0.6  84.5±0.2  -  62.5±1.6  66.7±1.4  65.6±1.1  62.3±1.0  -  -
GCAPS  -  76.4±4.1  82.7±2.4  81.1±1.3  -  -  -  -  -  -
PersLay  89.8±0.9  74.8±0.3  73.5±0.3  69.5±0.3  -  -  -  -  80.9±1.0  80.3±0.8
SV  88.2±1.0  72.6±0.4  71.3±0.4  69.8±0.2  -  -  -  -  78.4±0.4  78.8±0.7
P-WL  86.1±1.4  75.3±0.7  85.3±0.1  84.8±0.2  63.1±1.5  67.3±1.5  68.4±1.2  64.5±1.8  -  -
P-WL-C  90.5±1.3  75.3±0.4  85.5±0.2  85.0±0.3  64.0±0.8  67.2±1.1  68.6±1.8  65.8±1.2  -  -
RPNet-1  89.4±4.0  74.5±4.5  73.5±1.1  70.7±1.6  60.8±5.3  66.7±3.6  64.9±5.5  63.3±6.8  81.8±3.0  76.3±3.6
RPNet-2  90.9±3.3  75.8±3.3  73.9±2.5  71.1±1.7  60.7±4.6  68.1±7.3  66.7±4.8  63.3±5.3  82.4±2.7  75.5±4.0
RPNet-4  91.0±3.4  75.4±2.6  72.1±1.8  70.0±1.9  62.5±6.1  66.7±4.0  63.7±4.9  64.5±6.1  81.6±3.5  81.3±3.0
RPNet-8  91.0±4.0  74.2±2.8  72.6±1.9  70.4±1.6  59.6±7.1  66.4±4.6  65.5±4.5  62.2±7.8  79.9±2.7  80.7±2.6

Table 3: Classification accuracy (%) on the group-2 datasets.

Method  COLLAB  RDT-B  RDT-M5K  RDT-M12K  IMDB-B  IMDB-M
Neural networks:
GCAPS  77.7±2.5  87.6±2.5  50.1±1.7  -  71.7±3.4  48.5±4.1
PersLay  76.4±0.4  -  55.6±0.3  47.7±0.2  71.2±0.7  48.8±0.6
DeepTopo  -  -  54.5  44.5  -  -
Ours:
RPNet-1  69.2±2.9  91.4±1.8  56.9±2.2  48.5±0.8  69.6±4.2  45.9±3.9
RPNet-2  70.0±2.2  91.6±2.0  57.6±2.4  48.5±1.0  68.3±3.9  46.8±3.1
RPNet-4  71.8±2.5  91.7±1.9  57.2±2.1  48.4  71.9±4.4  47.7±2.9
RPNet-8  73.4±1.8  90.4±2.1  56.4±2.1  48.3±1.2  70.7±3.8  48.9±2.9
Graph kernels:
RetGK  81.0±0.3  92.6±0.3  56.1±0.5  48.7±0.2  71.9±1.0  47.7±0.2
WKPI  -  -  59.5±0.6  48.4±0.5  75.1±1.1  49.5±0.4
SV  79.6±0.3  87.8±0.3  53.1±0.2  -  74.2±0.9  49.9±0.3

Baselines
We compare our results with the WL subtree kernel [shervashidze2011weisfeiler] and RetGK [zhang2018retgk]; deep-learning-based approaches GCAPS [verma2018graph], PersLay [carriere2019perslay], and DeepTopo [hofer2017deep]; and kernel-based methods with topological features: SV [tran2018scale], P-WL and P-WL-C [rieck2019persistent], and WKPI [zhao2019learning]. For these baselines, we report the accuracies and standard deviations (when available) from the original papers, which follow the same evaluation setting. Some of our datasets were not used in certain papers (but were used by others); we denote this by "-", or leave the method out of the corresponding table if it used none of the datasets.
Implementation
Models are implemented in PyTorch [paszke2019pytorch]. Persistence diagrams are computed with the Dionysus (https://mrzv.org/software/dionysus2) and Gudhi (http://gudhi.gforge.inria.fr) libraries.

6 Results
Dataset Group 1
Table 2 shows the results on 10 graph datasets. Our method performs best on 4 out of 10 datasets. On the other 6 datasets, it performs comparably with PersLay [carriere2019perslay], another persistence-diagram-based method.
Dataset Group 2
For large datasets, our method ranks in the top 3 on most datasets, as shown in Table 3. Compared against the other methods in the neural network family, it outperforms them on 5 out of 6 datasets.
Some baseline methods, such as RetGK [zhang2018retgk], leverage node attributes (node labels) of the graph for classification. Our results are promising considering that our approach does not use any node attributes as features. Comparing the best model among our variants (i.e., different values of $K$), we observe that using more than one return probability of random walks achieves better results on most of the datasets. In particular, RPNet-1 performs best on only 1 out of 16 datasets. This implies that our method benefits from using multiple PDs based on multiscale graph signatures. On the other hand, while the variant with $K=2$ (RPNet-2) works best on the small datasets in group 1, increasing $K$ (i.e., $K=4$ and $K=8$) yields some improvement on the large datasets in group 2. Despite their high performance on most benchmarks, graph kernel methods require computing and storing an evaluation for each pair of graphs in the dataset, making them intractable for datasets containing a large number of graphs. For example, the random walk kernel (RW) [vishwanathan2010graph] takes more than 3 days on the NCI1 dataset [zhang2018end]. Our approach avoids the quadratic complexity in the number of graphs required by graph kernels and thus can readily be applied to real-world graph datasets. It is also worth mentioning that our approach has higher variance than others, such as kernel methods. This high variance is a common issue with deep learning approaches and can also be observed in deep learning baselines such as GCAPS [verma2018graph].

Ablation experiments
In this section, we conduct an ablation study on our proposed architecture. We start with RPNet-4 and remove some of its components to study their effect. The comparison is shown in Table 4.
We observe that removing the 3-d one-hot vectors (1) and (2) decreases accuracy by 6.4%, 6.6%, and 2.3% on NCI1, IMDB-M, and PROTEINS, respectively. The gain comes mainly from one-hot vectors (1), which contribute much more to performance than one-hot vectors (2). This shows the importance of using one-hot vectors to preserve information about the type of persistence diagram derived from the original graphs. Besides, we find that learning a set of weights to aggregate the diagram features benefits the model: compared with average pooling as a baseline, weighted diagram pooling improves accuracy by approximately 1-3% on the 3 datasets.
According to the experimental results, incorporating all three components into our model also helps, in general, to reduce the variance, allowing more stable performance.
Table 4: Ablation results. For each dataset, the five configurations range over combinations of one-hot (1), one-hot (2), and weighted diagram pooling (3); the last entry corresponds to the full model.

Dataset  Accuracy ± std
NCI1  65.7±2.9  66.6±2.6  70.8±1.5  70.7±1.0  72.1±1.8
IMDB-M  41.1±4.1  41.2±4.1  43.4±3.2  44.3±3.7  47.7±2.9
PROTEINS  73.1±3.6  73.1±2.8  73.1±3.8  72.3±4.2  75.4±2.6

Here, one-hot (1) denotes concatenation of the 3-d one-hot vectors with the topological features extracted from the graph, one-hot (2) denotes concatenation of the 3-d one-hot vectors with the encoded features (after the encoding step), and weighted diagram pooling (3) indicates using a weighted sum to aggregate the diagrams; otherwise, average pooling is employed. These components are indicated by (1), (2), and (3) in Fig. 5, respectively.

7 Conclusion
In this paper, we propose a family of multiscale graph signatures for persistence diagrams based on the return probabilities of random walks. We introduce RPNet, a deep neural network architecture designed to handle the proposed features, and demonstrate its effectiveness on a wide range of benchmark graph datasets. We show that our proposed method outperforms other persistent-homology-based methods and graph neural networks while being on par with state-of-the-art graph kernels. Several directions are worth exploring. One is to employ a kernel-based approach with our multiset features: there has been work on defining distances between persistence diagrams, and it would be promising to build a kernel method on the proposed multiscale representation. Moreover, our approach does not make use of node attributes, which may be critical for classifying graphs such as chemical compounds, where each node attribute represents a chemical element. Thus, another future direction is to investigate how to incorporate node attributes into our multiscale graph signature. We are also interested in developing a theory for multidimensional persistent homology and using it in deep learning models for downstream tasks.
8 Acknowledgments
Approved for public release, 22607