1 Introduction
Persistent homology is a widely used tool in Topological Data Analysis, designed to track topological changes as data is examined at different scales. Its main descriptor is the persistence diagram (PD), often represented as a multiset of birth and death pairs of topological features. Since it can encode topological and geometric properties of the data, it is potentially complementary to features retrieved by classical descriptors. PDs have found applications in many fields, from signal analysis to 3D shape classification [perea2015sliding, buchet2018persistent, li2014persistence, carriere2015stable, saggar2018towards, suh2019persistent, pranav2017topology, kannan2019persistent].
A recent line of research aims at leveraging PDs for graph classification. Given a graph signature on the vertices of a graph, one can construct a sublevel filtration from the resulting nested sequence of increasing subgraphs. The PDs obtained from this filtration can be used in machine learning tasks either by constructing a kernel, such as the sliced Wasserstein kernel [carriere2017sliced] and the persistence Fisher kernel [le2018persistence], or by mapping them to a vector space, such as the persistence landscape [bubenik2015statistical] and the persistence image [adams2017persistence]. However, kernel methods require an evaluation for each pair of graphs and thus do not scale well with the size of the dataset, while vectorization methods are usually nonparametric or have only a few trainable parameters and therefore remain agnostic to the task at hand. For more scalability and flexibility, a deep learning model that operates on sets can be used directly on PDs [hofer2017deep, carriere2019perslay].
In this paper, we focus on graph classification using PDs from a deep learning perspective. A majority of prior works rely on a scalar node descriptor, such as node degree [hofer2017deep] or the heat kernel signature [carriere2019perslay], to construct the filtration, where the choice is often empirical. However, a multidimensional embedding is usually preferred to describe a graph node. While multidimensional persistence has been studied [carlsson2009theory, lesnick2015theory], it exhibits an essentially different character from its one-dimensional version, and generalizing the corresponding concepts, such as persistence diagrams, is often complicated, making it nontrivial to use them in downstream models. Instead, we explore the potential of leveraging multiple PDs for each data sample based on a family of multiscale graph signatures: the $k$-step return probabilities of random walks, where $k$ is the number of random steps from a starting node, indicating the size of the local structure being explored. For example, 2-hop random walks can help capture the neighborhood size of surrounding nodes, while 3-hop walks can capture triangles in the graph. The idea is that different scales let the model capture more information about each node's neighborhood, which in turn can improve performance. The return probability feature has been used as a node structural role descriptor to construct a kernel for graph classification [zhang2018retgk].
We show that, with an appropriate application and model architecture, it can also be useful in topology-based approaches.
Our contributions are as follows. First, we propose a family of scalar graph signatures based on the return probabilities of random walks to construct multiple PDs per data sample. The PDs are padded to a fixed size and stacked together to form a multi-channel input. Second, we introduce RPNet, a deep neural network architecture designed to learn from the proposed features. We demonstrate our method on a wide range of benchmark datasets and observe that it often outperforms other persistence-diagram-based and neural-network-based approaches, while being competitive with state-of-the-art graph kernel methods.
The rest of this paper is organized as follows. Section 2 and Section 3 briefly go through some related work and background. Section 4 describes our proposed graph signature and network architecture. Section 5 and Section 6 present our experiments and results. Finally, we discuss future directions in Section 7.
2 Related Work
Since PDs have an unusual structure for machine learning approaches, many techniques have been proposed to map them into machine-learning-compatible representations. A popular technique is vectorization. For instance, Bubenik et al. [bubenik2015statistical] propose mapping persistence barcodes into a Banach space, referred to as the persistence landscape. Adams et al. [adams2017persistence] propose the persistence image, a discretization of the persistence surface. Another approach constructs a kernel from PDs; for example, the sliced Wasserstein kernel [carriere2017sliced] and the persistence Fisher kernel [le2018persistence] are designed to work on PDs. Rieck et al. [rieck2019persistent] define a kernel constructed from a vector representation of persistence diagrams through the Weisfeiler-Lehman subtree feature. While these methods make it much easier to deploy machine learning techniques, they are predefined and thus often suboptimal for a specific task.
Learning representation has been investigated to provide a taskoptimal representation in the vector space for PDs. Hofer et al. [hofer2017deep] and Carriere et al. [carriere2019perslay] propose a deep learning architecture designed to handle sets of points in PDs. Zhao et al. [zhao2019learning] propose a weighted kernel for persistence image. These methods have been proven to perform well on a wide range of benchmarks.
In these works, a specific node descriptor is usually required for the filtration process. Simple choices include node degree [hofer2017deep] or the node attributes if available [rieck2019persistent]. More complex descriptors include the heat kernel signature [carriere2019perslay] and the Ricci curvature and Jaccard index [zhao2019learning]. For non-attributed graphs, [rieck2019persistent] uses a stabilization procedure based on node degree to leverage more topological information.
Our work takes an approach similar to [hofer2017deep, carriere2019perslay] and proposes a deep learning architecture that contains pooling layers to learn from multiset inputs. Different from prior works, our architecture is able to handle multiple PDs for each data sample, each computed from a node descriptor in the proposed family of graph signatures.
3 Background
We briefly recall the relevant concepts and refer interested readers to [edelsbrunner2010computational, aktas2019persistence] for more detailed definitions of homology.
3.1 Simplices and Simplicial Complexes
Simplices
Simplices are higher-dimensional analogs of points, line segments, and triangles, such as a tetrahedron. Formally, an $n$-simplex is the convex hull of $n+1$ affinely independent points $x_0, \dots, x_n$, i.e., the set of all convex combinations $\sum_{i=0}^{n} \lambda_i x_i$ where $\sum_{i=0}^{n} \lambda_i = 1$ and $\lambda_i \ge 0$ for all $i$. For example, a 0-simplex is a single point (a vertex), a 1-simplex is a line segment (an edge), and a 2-simplex is a triangle with its interior (Figure 1). Simplices are the building blocks of simplicial complexes.
Simplicial Complexes
A simplicial complex $K$ is a finite collection of simplices such that (1) for every simplex $\sigma \in K$, every face $\tau \subseteq \sigma$ also satisfies $\tau \in K$, and (2) for any two simplices $\sigma_1, \sigma_2 \in K$, the intersection $\sigma_1 \cap \sigma_2$ is either empty or a common face of both. The dimension of $K$ is the maximum dimension of any of its simplices. The first condition states that if a simplex is in $K$, then its faces are also in $K$. The second condition means that simplices may only be glued along common faces, so that no improper intersections occur (Figure 2).
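As a concrete illustration, the closure condition (1) can be checked mechanically for an abstract simplicial complex given as a collection of vertex tuples. This is a hypothetical helper of our own, not part of any library used in this paper; condition (2) holds automatically in the abstract (combinatorial) setting, so only face closure is tested.

```python
from itertools import combinations

def is_simplicial_complex(simplices):
    """Check closure under taking faces: every proper face of every
    simplex must itself belong to the collection (condition (1))."""
    S = {frozenset(s) for s in simplices}
    for s in S:
        for r in range(1, len(s)):
            for face in combinations(s, r):
                if frozenset(face) not in S:
                    return False
    return True
```

For example, a filled triangle with all of its edges and vertices passes the check, while dropping one edge violates closure.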
3.2 Persistence Diagrams of Graphs
Consider a graph $G = (V, E)$ with $n$ nodes, $m$ edges, and a node descriptor function $f : V \to \mathbb{R}$. First, we use the descriptor to label the vertices of $G$. These labels are used as threshold values to define the sublevel graph $G_a = (V_a, E_a)$, where $V_a = \{v \in V : f(v) \le a\}$ and $E_a = \{(u, v) \in E : \max(f(u), f(v)) \le a\}$. This provides a nested family of simplicial complexes $\emptyset = G_{a_0} \subseteq G_{a_1} \subseteq \dots \subseteq G_{a_n} = G$, where $a_1 \le a_2 \le \dots \le a_n$ are the node values sorted in increasing order. This sequence, starting with the empty graph and ending with the full graph $G$, is called a filtration of $G$. During this construction, topological features such as connected components and loops appear and disappear, and the times of their appearance and disappearance are recorded. When a component first appears, its birth is tracked, and when it merges into another component, we record its death. Such homological features can be considered graph features. Specifically, these intervals form a persistence diagram, which consists of the birth and death time of each feature as points $(\text{birth}, \text{death})$ in $\mathbb{R}^2$. Figure 3 illustrates the 0-dimensional and 1-dimensional persistence barcodes and the corresponding persistence diagram of a filtration on a simple input graph; there, for each vertex $v$, its node degree $\deg(v) = \sum_{u} A_{uv}$ is used as the node descriptor $f(v)$, where $A$ is the adjacency matrix of $G$.
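To make the construction concrete, the 0-dimensional part of this sublevel persistence can be computed with a single union-find pass over the edges sorted by filtration value, pairing each merge with the younger component's birth (the elder rule). This is an illustrative sketch under our own naming and data layout; the experiments in this paper instead use the Dionysus and Gudhi libraries.

```python
def sublevel_pd0(values, edges):
    """0-dimensional sublevel persistence of a graph.

    values: dict vertex -> filtration value f(v)
    edges: iterable of (u, v) pairs
    Returns a list of (birth, death) pairs; surviving components
    (essential points) get death = float('inf').
    """
    parent = {v: v for v in values}
    birth = dict(values)          # birth time of each component's root

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    pairs = []
    # an edge enters the filtration at max(f(u), f(v))
    for u, v in sorted(edges, key=lambda e: max(values[e[0]], values[e[1]])):
        t = max(values[u], values[v])
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                         # edge closes a loop (1-homology)
        if birth[ru] > birth[rv]:
            ru, rv = rv, ru                  # elder rule: younger one dies
        pairs.append((birth[rv], t))
        parent[rv] = ru
    roots = {find(v) for v in values}
    pairs.extend((birth[r], float('inf')) for r in roots)
    return pairs
```

On a path graph 1-2-3 with $f$ values 0, 1, 2, this yields two non-essential points born and killed as each vertex merges into the older component, plus one essential point for the surviving component.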
3.3 Return Probabilities of Random Walks
Consider an undirected graph $G = (V, E)$. Let $A$ be the adjacency matrix of $G$ and $D$ be the degree matrix, i.e., a diagonal matrix with $D_{ii} = \sum_{j} A_{ij}$.
A $k$-step walk starting from node $v_0$ is a sequence of nodes $(v_0, v_1, \dots, v_k)$ with $(v_{i-1}, v_i) \in E$ for $1 \le i \le k$. A random walk on $G$ is a Markov chain whose transition probabilities are
$P(v_{i+1} = y \mid v_i = x) = A_{xy} / D_{xx}$.
The transition probability matrix is calculated as $P = D^{-1} A$. The $k$-hop transition probability matrix is $P^k$, where $(P^k)_{uv}$ denotes the probability of transitioning from node $u$ to node $v$ in $k$ steps. The return probability of a random walk is defined as $(P^k)_{vv}$, the probability of returning to the source node $v$ after $k$ hops. When $k = 0$, all return probabilities are equal to 1.
A straightforward way to calculate the $k$-hop transition probability matrix requires $k$ dense matrix multiplications, which take $O(kn^3)$ time. Since we are only interested in the diagonal elements, the computation can be carried out more efficiently, reducing the per-$k$ cost of the diagonal to $O(n^2)$ after a one-time $O(n^3)$ preprocessing step [zhang2018retgk]. Calculating all return probabilities for $k \le K$ then takes $O(n^3 + Kn^2)$ time, allowing the computation to scale to large graphs.
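One way to realize this speedup, following the trick used in [zhang2018retgk], reads the diagonal of every $P^k$ off a single eigendecomposition of the symmetrized matrix $M = D^{-1/2} A D^{-1/2}$, which is similar to $P$ and therefore shares its diagonal powers. The sketch below uses our own function name and assumes a graph with no isolated nodes.

```python
import numpy as np

def return_probabilities(A, K):
    """Return a (K, n) array r where r[k-1, v] is the k-hop return
    probability of a random walk started at node v.

    Uses M = D^{-1/2} A D^{-1/2}: since P = D^{-1/2} M D^{1/2},
    the diagonals of P^k and M^k coincide, so one symmetric
    eigendecomposition (O(n^3)) replaces K dense matrix products.
    Assumes every node has degree >= 1.
    """
    A = np.asarray(A, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    M = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    lam, U = np.linalg.eigh(M)                  # done once
    U2 = U ** 2
    # (M^k)_{vv} = sum_j lam_j^k * U[v, j]^2  -> O(n^2) per k
    return np.stack([U2 @ (lam ** k) for k in range(1, K + 1)])
```

For a triangle graph, the 1-hop return probability of every node is 0 (no self-loops) and the 2-hop return probability is 1/2, which matches direct computation of $P^2$.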
3.4 Deep Learning with Sets
Most machine learning algorithms take fixed-size vectors as input; they are not designed for data in the form of sets. In contrast to vectors, sets are invariant to permutation and have variable length, which makes handling this type of input challenging and nontrivial. Research aimed at tackling this problem employs permutation-invariant functions with two desirable characteristics: invariance to the ordering of the inputs and the ability to handle variable-size inputs [zaheer2017deep, rezatofighi2017deepsetnet, xu2018spidercnn]. Specifically, a permutation-invariant function $f$ on a set $X = \{x_1, \dots, x_M\}$ can be defined as
$f(X) = \rho\left( \sum_{i=1}^{M} \phi(x_i) \right),$
where $\phi$ maps each point to a feature vector, $M$ is the number of points, and $\rho$ processes the pooled representation. For example, average pooling and sum pooling are two predefined permutation-invariant functions that are commonly used to handle sets [kipf2016semi, hamilton2017inductive, atwood2016diffusion].
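The construction above can be sketched in a few lines of NumPy. The weight matrices and dimensions here are arbitrary placeholders of our own choosing; the point is only that sum pooling between the two maps makes the output independent of the ordering and size of the input set.

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(5, 16))   # per-point map phi (illustrative weights)
W_rho = rng.normal(size=(16, 8))   # post-pooling map rho

def set_embed(X):
    """Permutation-invariant embedding of a point set X of shape (m, 5):
    f(X) = rho(sum_i phi(x_i))."""
    phi = np.maximum(X @ W_phi, 0.0)     # pointwise features
    pooled = phi.sum(axis=0)             # sum pooling discards ordering
    return np.maximum(pooled @ W_rho, 0.0)

X = rng.normal(size=(7, 5))
perm = rng.permutation(7)
assert np.allclose(set_embed(X), set_embed(X[perm]))  # order-invariant
```

The same function also accepts sets of any size, since the pooled vector has a fixed dimension regardless of the number of input points.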
4 Proposed Method
In this section, we describe our approach to using the return probabilities of random walks for topological graph representation and propose our neural network architecture, referred to as RPNet. Let $G = (V, E)$ be an undirected graph, where $V$ is the set of vertices and $E$ is the set of edges connecting two vertices. We define a set of descriptors $\{f_1, \dots, f_K\}$ on $V$, where $f_k(v)$ represents the $k$-hop return probability of a random walk starting from vertex $v$, with $1 \le k \le K$. The higher the value of $k$, the larger the subgraph being considered. By definition,
$f_k(v_i) = (P^k)_{ii}$ with $P = D^{-1} A$,
where $A$ is the adjacency matrix, $D$ is the degree matrix, and $v_i$ is the $i$-th vertex in $V$.
PDs Computation and Processing
Using the sublevel filtration based on each of the $K$ node descriptors $f_k$, we compute 0- and 1-dimensional PDs for each graph $G$. We split the points in these diagrams into three groups: 0-homology essential points (0-homology points whose death is infinite), 0-homology non-essential points (0-homology points whose death is finite), and 1-homology essential points (1-homology points whose death is infinite). Note that a 1-homology feature never disappears since we do not consider higher-degree homology; therefore, 1-homology non-essential points (1-homology points whose death is finite) do not exist. We normalize these points to the range $[0, 1]$ by dividing them by the maximum coordinate (i.e., birth or death) value ($\infty$ is also mapped to 1), and append a 3-dimensional one-hot vector characterizing each group. Thus, each point is represented by a vector in $\mathbb{R}^5$: birth, death, and the 3-d one-hot vector. After this process, we obtain $K$ sets of 5-dimensional vectors. These sets can vary in size (for example, in Fig. 4, one descriptor yields 6 points (blue) while another yields 5 points (yellow)), so we pad them to the same number of elements, $N$, per set. The final output is a $K \times N \times 5$ feature for each graph $G$. Figure 4 illustrates this process on a simple graph.
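The padding-and-stacking step can be sketched as follows. The function name and zero-padding convention are our own illustrative choices; zero rows are convenient because the sum pooling used later over points leaves them without effect.

```python
import numpy as np

def stack_diagrams(diagrams, n_pad):
    """Pad K variable-size sets of 5-dim points into a (K, N, 5) tensor.

    diagrams: list of K arrays of shape (m_k, 5), each row a normalized
    (birth, death) pair followed by a 3-d one-hot group indicator.
    Sets longer than n_pad are truncated; shorter ones are zero-padded.
    """
    out = np.zeros((len(diagrams), n_pad, 5))
    for k, d in enumerate(diagrams):
        d = np.asarray(d, dtype=float)
        m = min(len(d), n_pad)
        out[k, :m] = d[:m]
    return out
```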
4.1 Network architecture
We propose RPNet, a neural network operating on the extracted features. The network consists of an encoder, a decoder, a point pooling layer, a diagram pooling layer, and a classification head. First, the encoder maps each 5-dimensional point $x$ into a $d_e$-dimensional space, $h = \mathrm{enc}(x)$.
Next, we concatenate the encoded feature with the 3-d one-hot vector $o$ to obtain $[h; o] \in \mathbb{R}^{d_e + 3}$. We then pass it to a decoder that maps each point into a $d_d$-dimensional space, $z = \mathrm{dec}([h; o])$.
The point pooling layer performs sum pooling on points in each diagram, and the diagram pooling layer performs weighted pooling on diagrams for each graph.
The classification head is a multilayer perceptron on the embedding of each graph, followed by a softmax function to output the class probabilities.
Our network is optimized with the cross-entropy loss. Figure 5 shows the architecture of our proposed neural network.
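The forward pass described above can be sketched in plain NumPy. All layer sizes, random weights, and names below are illustrative placeholders of our own; the actual model uses trained PyTorch layers with the block structure and hyperparameters described in Section 5.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(x, W, b):
    return np.maximum(x @ W + b, 0.0)

# hypothetical sizes: 5-d points, d_e = 16, d_d = 32, C = 3 classes
K, N = 4, 10                                   # diagrams per graph, points per diagram
We, be = rng.normal(size=(5, 16)), np.zeros(16)
Wd, bd = rng.normal(size=(16 + 3, 32)), np.zeros(32)
Wc, bc = rng.normal(size=(32, 3)), np.zeros(3)
w_diag = rng.normal(size=K)                    # learnable diagram weights

def rpnet_forward(x):
    """x: (K, N, 5) stacked diagrams -> class probabilities of shape (C,)."""
    h = mlp(x, We, be)                         # encode each point
    h = np.concatenate([h, x[..., 2:]], -1)    # re-attach the 3-d one-hot
    h = mlp(h, Wd, bd)                         # decode
    h = h.sum(axis=1)                          # point pooling: (K, d_d)
    g = np.einsum('k,kd->d', w_diag, h)        # weighted diagram pooling
    logits = g @ Wc + bc
    e = np.exp(logits - logits.max())          # softmax head
    return e / e.sum()

p = rpnet_forward(rng.normal(size=(K, N, 5)))
```

Even with random weights, the output is a valid probability vector over the classes, and the two pooling stages collapse the variable point and diagram dimensions into a fixed-size graph embedding.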
5 Experiments
Setup
We use persistence diagrams up to degree 1 for each graph as inputs to our model. We follow the conventional settings for graph classification as described in previous works [hofer2017deep, carriere2019perslay]. Specifically, we perform 10-fold cross-validation (9 folds for training and 1 fold for testing) for each run. We train each model for at most 500 epochs and report the average accuracy and standard deviation at the epochs with the lowest loss on each fold. Training is terminated after 50 epochs without a decrease in the loss. The Adam optimizer [kingma2014adam] is used with an initial learning rate between 0.001 and 0.01 and a decay factor of 0.5 after every 25 epochs, up to 6 times. For the encoder and decoder, we use several blocks of the following four components, in order: a linear layer, a normalization layer (e.g., BatchNorm [ioffe2015batch] or LayerNorm [ba2016layer]), Dropout [srivastava14a], and an activation layer (e.g., ReLU [agarap2018deep] or ELU [clevert2015fast]). We fine-tune the number of blocks (from 2 to 5) and their dimensions in the encoder, along with the learning rate. We report the best scores for each variant of our models, denoted RPNet-$K$, where $K$ is the number of return probabilities of random walks used to generate the multiscale graph representation. It is worth noting that all the baselines we compare with follow the same setup, making the comparison fair.
Datasets
Table 1: Dataset statistics.

Dataset  #graphs  avg #nodes  avg #edges  #classes
MUTAG  188  17.93  19.79  2
PROTEINS  1113  39.06  72.82  2
NCI1  4110  29.87  32.30  2
NCI109  4127  29.68  32.13  2
PTC_FM  349  14.11  14.48  2
PTC_FR  351  14.56  15.00  2
PTC_MM  336  13.97  14.32  2
PTC_MR  344  14.29  14.69  2
COX2  467  41.22  43.45  2
DHFR  756  42.43  44.54  2
REDDIT-BINARY  2000  429.6  497.8  2
IMDB-BINARY  1000  19.77  96.53  2
COLLAB  5000  74.49  2457.78  3
IMDB-MULTI  1500  13.00  65.94  3
REDDIT-5K  4999  508.5  594.9  5
REDDIT-12K  11929  391.4  456.9  12

We evaluate our proposed method on graph classification tasks over 16 benchmark datasets from different domains, including small chemical compounds and protein molecules (e.g., COX2, PROTEINS) and social networks (e.g., REDDIT-BINARY, IMDB-MULTI). We group these datasets into two categories: (1) datasets with fewer than 1,000 graphs, or fewer than 100 vertices per graph on average, and (2) datasets with more than 1,000 graphs and more than 100 vertices per graph on average. It is worth noting that while most of the datasets consist only of adjacency matrices, some bioinformatics datasets also provide categorical node attributes. The aggregate statistics of these datasets are provided in Table 1 (more details can be found at https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets).

Table 2: Classification accuracy (%) on the group-1 datasets.

Method  MUTAG  PROTEINS  NCI1  NCI109  PTC_MR  PTC_FR  PTC_MM  PTC_FM  COX2  DHFR
WL  82.1  -  82.2  82.5  -  -  -  -  -  -
RetGK  90.3±1.1  75.8±0.6  84.5±0.2  -  62.5±1.6  66.7±1.4  65.6±1.1  62.3±1.0  -  -
GCAPS  -  76.4±4.1  82.7±2.4  81.1±1.3  -  -  -  -  -  -
PersLay  89.8±0.9  74.8±0.3  73.5±0.3  69.5±0.3  -  -  -  -  80.9±1.0  80.3±0.8
SV  88.2±1.0  72.6±0.4  71.3±0.4  69.8±0.2  -  -  -  -  78.4±0.4  78.8±0.7
P-WL  86.1±1.4  75.3±0.7  85.3±0.1  84.8±0.2  63.1±1.5  67.3±1.5  68.4±1.2  64.5±1.8  -  -
P-WL-C  90.5±1.3  75.3±0.4  85.5±0.2  85.0±0.3  64.0±0.8  67.2±1.1  68.6±1.8  65.8±1.2  -  -
RPNet-1  89.4±4.0  74.5±4.5  73.5±1.1  70.7±1.6  60.8±5.3  66.7±3.6  64.9±5.5  63.3±6.8  81.8±3.0  76.3±3.6
RPNet-2  90.9±3.3  75.8±3.3  73.9±2.5  71.1±1.7  60.7±4.6  68.1±7.3  66.7±4.8  63.3±5.3  82.4±2.7  75.5±4.0
RPNet-4  91.0±3.4  75.4±2.6  72.1±1.8  70.0±1.9  62.5±6.1  66.7±4.0  63.7±4.9  64.5±6.1  81.6±3.5  81.3±3.0
RPNet-8  91.0±4.0  74.2±2.8  72.6±1.9  70.4±1.6  59.6±7.1  66.4±4.6  65.5±4.5  62.2±7.8  79.9±2.7  80.7±2.6

Table 3: Classification accuracy (%) on the group-2 datasets.

Method  COLLAB  RDT-B  RDT-M5K  RDT-M12K  IMDB-B  IMDB-M
Neural networks:
GCAPS  77.7±2.5  87.6±2.5  50.1±1.7  -  71.7±3.4  48.5±4.1
PersLay  76.4±0.4  -  55.6±0.3  47.7±0.2  71.2±0.7  48.8±0.6
DeepTopo  -  -  54.5  44.5  -  -
Ours:
RPNet-1  69.2±2.9  91.4±1.8  56.9±2.2  48.5±0.8  69.6±4.2  45.9±3.9
RPNet-2  70.0±2.2  91.6±2.0  57.6±2.4  48.5±1.0  68.3±3.9  46.8±3.1
RPNet-4  71.8±2.5  91.7±1.9  57.2±2.1  48.4  71.9±4.4  47.7±2.9
RPNet-8  73.4±1.8  90.4±2.1  56.4±2.1  48.3±1.2  70.7±3.8  48.9±2.9
Graph kernels:
RetGK  81.0±0.3  92.6±0.3  56.1±0.5  48.7±0.2  71.9±1.0  47.7±0.2
WKPI  -  -  59.5±0.6  48.4±0.5  75.1±1.1  49.5±0.4
SV  79.6±0.3  87.8±0.3  53.1±0.2  -  74.2±0.9  49.9±0.3

Baselines
We compare our results with the WL subtree kernel [shervashidze2011weisfeiler] and RetGK [zhang2018retgk]; deep-learning-based approaches GCAPS [verma2018graph], PersLay [carriere2019perslay], and DeepTopo [hofer2017deep]; and kernel-based methods with topological features: SV [tran2018scale], P-WL and P-WL-C [rieck2019persistent], and WKPI [zhao2019learning]. For these baselines, we report the accuracies and standard deviations (when available) from the original papers, which follow the same evaluation setting. Some of our datasets were not used in certain papers (but were used by others); we denote this by "-", or leave the method out of the corresponding table if it used none of the datasets.
Implementation
Models are implemented in PyTorch [paszke2019pytorch]. Persistence diagrams are computed with the Dionysus (https://mrzv.org/software/dionysus2) and Gudhi (http://gudhi.gforge.inria.fr) libraries.

6 Results
Dataset Group 1
Table 2 shows the results on 10 graph datasets. Our method performs best on 4 out of 10 datasets. On the other 6 datasets, it performs comparably with PersLay [carriere2019perslay], another persistence-diagram-based method.
Dataset Group 2
For large datasets, our method ranks in the top 3 on most datasets, as shown in Table 3. Compared against the other methods in the neural network family, it outperforms them on 5 out of 6 datasets.
Some baseline methods, such as RetGK [zhang2018retgk], leverage node attributes (node labels) of the graph for classification. Our results are promising considering that our approach does not use any node attributes as features. Comparing the best model among our variants (i.e., different values of $K$), we observe that using more than one return probability of random walks achieves better results on most of the datasets. In particular, RPNet-1 performs best on only 1 out of 16 datasets. This implies that our method benefits from using multiple PDs based on multiscale graph signatures. On the other hand, while the variant with $K=2$ (RPNet-2) works best on the small datasets in group 1, increasing $K$ (i.e., $K=4$ and $K=8$) yields some improvement on the large datasets in group 2. Despite their high performance on most benchmarks, graph kernel methods require computing and storing an evaluation for each pair of graphs in the dataset, making them intractable for datasets containing a large number of graphs. For example, the random walk kernel (RW) [vishwanathan2010graph] takes more than 3 days on the NCI1 dataset [zhang2018end]. Our approach avoids the quadratic complexity in the number of graphs required by graph kernels and thus can readily be applied to real-world graph datasets. It is also worth mentioning that our approach has higher variance than others, such as kernel methods. This high variance is a common issue with deep learning approaches and can also be observed in deep learning baselines such as GCAPS [verma2018graph].

Ablation experiments
In this section, we conduct an ablation study on our proposed architecture. We start with RPNet-4 and remove some of its components to study their effect. The comparison is shown in Table 4.
We observe that removing the 3-d one-hot vectors (1) and (2) decreases accuracy by 6.4%, 6.6%, and 2.3% on NCI1, IMDB-M, and PROTEINS, respectively. The gain comes mainly from one-hot vectors (1), which contribute much more to performance than one-hot vectors (2). This shows the importance of using one-hot vectors to preserve information about the type of persistence diagram derived from the original graphs. Besides, we find that learning a set of weights to aggregate the diagram features benefits the model: compared with average pooling as a baseline, weighted diagram pooling improves accuracy by approximately 1-3% on the 3 datasets.
According to the experimental results, incorporating all three components into our model also helps, in general, to reduce the variance, allowing more stable performance.
Table 4: Ablation results. For each dataset, the five configurations range over combinations of one-hot (1), one-hot (2), and weighted diagram pooling (3); the last entry corresponds to the full model.

Dataset  Accuracy ± std
NCI1  65.7±2.9  66.6±2.6  70.8±1.5  70.7±1.0  72.1±1.8
IMDB-M  41.1±4.1  41.2±4.1  43.4±3.2  44.3±3.7  47.7±2.9
PROTEINS  73.1±3.6  73.1±2.8  73.1±3.8  72.3±4.2  75.4±2.6

Here, one-hot (1) denotes concatenation of the 3-d one-hot vectors with the topological features extracted from the graph, one-hot (2) denotes concatenation of the 3-d one-hot vectors with the encoded features (after the encoding step), and weighted diagram pooling (3) indicates using a weighted sum to aggregate the diagrams; otherwise, average pooling is employed. These components are indicated by (1), (2), and (3) in Fig. 5, respectively.

7 Conclusion
In this paper, we propose a family of multiscale graph signatures for persistence diagrams based on the return probabilities of random walks. We introduce RPNet, a deep neural network architecture designed to handle the proposed features, and demonstrate its effectiveness on a wide range of benchmark graph datasets. We show that our proposed method outperforms other persistent-homology-based methods and graph neural networks while being on par with state-of-the-art graph kernels. Several directions are worth exploring. One is to employ a kernel-based approach with our multiset features: there has been work on defining distances between persistence diagrams, and it would be promising to build a kernel method on the proposed multiscale representation. Moreover, our approach does not make use of node attributes, which may be critical for classifying graphs such as chemical compounds, where each node attribute represents a chemical element. Thus, another future direction is to investigate how to incorporate node attributes into our multiscale graph signature. We are also interested in developing a theory for multidimensional persistent homology and using it in deep learning models for downstream tasks.
8 Acknowledgments
Approved for public release, 22607