
A Multi-scale Graph Signature for Persistence Diagrams based on Return Probabilities of Random Walks

09/28/2022
by Chau Pham, et al., Boston University

Persistence diagrams (PDs), often characterized as multisets of birth and death times of homology classes, are known to provide a topological representation of a graph structure, which is often useful in machine learning tasks. Prior works rely on a single graph signature to construct PDs. In this paper, we explore the use of a family of multi-scale graph signatures to enhance the robustness of topological features. We propose a deep learning architecture to handle this set input. Experiments on benchmark graph classification datasets demonstrate that our proposed architecture outperforms other persistent homology-based methods and achieves competitive performance compared to state-of-the-art methods using graph neural networks. In addition, our approach can easily be applied to large input graphs, as it does not suffer from the limited scalability that can be an issue for graph kernel methods.


1 Introduction

Persistent homology is a commonly used tool in the field of Topological Data Analysis, designed to track topological changes as data is examined at different scales. Its main descriptor is the persistence diagram (PD), often represented as a multiset of birth and death pairs of topological features. Since it can encode topological and geometrical properties of the data, it is potentially complementary to features retrieved by other classical descriptors. Applications of PDs have been found in different fields, from signal analysis to 3D shape classification [perea2015sliding, buchet2018persistent, li2014persistence, carriere2015stable, saggar2018towards, suh2019persistent, pranav2017topology, kannan2019persistent].

A recent line of research is primarily aimed at leveraging PDs in the task of graph classification. Given a graph signature on the vertices of the graph, one can construct a sublevel filtration from the nested sequence of increasing subgraphs. The PDs obtained from this filtration process can be used in machine learning tasks by constructing a kernel, such as the sliced Wasserstein kernel [carriere2017sliced] and the persistence Fisher kernel [le2018persistence], or by mapping them to a vector space, such as the persistence landscape [bubenik2015statistical] and the persistence image [adams2017persistence]. However, kernel methods require an evaluation for each pair of graphs and thus do not scale well with the size of the dataset, while vectorization methods are usually non-parametric or have only a few trainable parameters and therefore remain agnostic to the type of application. For more scalability and flexibility, a deep learning model that operates on sets can be used directly on PDs [hofer2017deep, carriere2019perslay].

In this paper, we focus on graph classification using PDs from a deep learning perspective. A majority of prior works rely on a scalar node descriptor, such as node degree [hofer2017deep] or the heat kernel signature [carriere2019perslay], to construct the filtration, where the choice is often empirical. However, a multi-dimensional embedding is usually preferred to describe a graph node. While multi-dimensional persistence has been studied [carlsson2009theory, lesnick2015theory], it exhibits an essentially different character from its one-dimensional version, and generalizing the corresponding concepts, such as persistence diagrams, is complicated, making them nontrivial to utilize in downstream models. Instead, we explore the potential of leveraging multiple PDs for each data sample based on a family of multi-scale graph signatures: the k-step return probabilities of random walks, where k is the number of random steps from a starting node, indicating the size of the local structure being explored. For example, a 2-hop random walk can help capture the neighborhood size of surrounding nodes, while a 3-hop walk can capture cycles in the graph. The idea is that different scales help the model capture more information about each node's neighborhood, which in turn can improve the model's performance. The return probability feature has been used as a node structural role descriptor to construct a kernel for graph classification [zhang2018retgk]. We show that, with an appropriate application and model architecture, it can also be useful in topology-based approaches.

Our contributions are as follows. First, we propose a family of scalar graph signatures based on the return probability of random walks to construct multiple PDs per data sample. The PDs are padded to a fixed size and stacked together to form a multi-channel input. Second, we introduce RPNet, a deep neural network architecture designed to learn from the proposed features. We demonstrate our method on a wide range of benchmark datasets and observe that it often outperforms other persistence diagram-based and neural network-based approaches, while being competitive with state-of-the-art graph kernel methods.

The rest of this paper is organized as follows. Section 2 and Section 3 briefly go through some related work and background. Section 4 describes our proposed graph signature and network architecture. Section 5 and Section 6 present our experiments and results. Finally, we discuss future directions in Section 7.

2 Related Works

Since PDs have an unusual structure for machine learning approaches, many techniques have been proposed to map them into machine-learning-compatible representations. A popular technique is vectorization. For instance, Bubenik et al. [bubenik2015statistical] propose mapping persistence barcodes into a Banach space, referred to as the persistence landscape. Adams et al. [adams2017persistence] propose the persistence image, which is a discretization of the persistence surface. Another approach is constructing a kernel from PDs. For example, the sliced Wasserstein kernel [carriere2017sliced] and the persistence Fisher kernel [le2018persistence] are designed to work on PDs. Rieck et al. [rieck2019persistent] define a kernel constructed from a vector representation of persistence diagrams through the Weisfeiler-Lehman subtree feature. While these methods make it much easier to deploy machine learning techniques, they are pre-defined and thus often suboptimal for a specific task.

Representation learning has been investigated to provide a task-optimal vector-space representation of PDs. Hofer et al. [hofer2017deep] and Carriere et al. [carriere2019perslay] propose deep learning architectures designed to handle the sets of points in PDs. Zhao et al. [zhao2019learning] propose a weighted kernel for the persistence image. These methods have been shown to perform well on a wide range of benchmarks.

In these works, a specific node descriptor is usually required for the filtration process. Some of the simple choices include the node degree [hofer2017deep] or the node attributes if available [rieck2019persistent]. More complicated descriptors include the heat kernel signature [carriere2019perslay] and Ricci curvature & Jaccard index [zhao2019learning]. For non-attributed graphs, [rieck2019persistent] uses a stabilization procedure based on node degree to leverage more topological information.

Our work employs an approach similar to [hofer2017deep, carriere2019perslay], proposing a deep learning architecture that contains pooling layers to learn from multiset inputs. Different from prior works, our architecture is able to handle multiple PDs for each data sample, each computed from a node descriptor in the proposed family of graph signatures.

3 Background

We briefly review the relevant concepts and refer interested readers to [edelsbrunner2010computational, aktas2019persistence] for more detailed definitions of homology.

3.1 Simplices and Simplicial Complexes

Simplices are higher-dimensional analogs of points, line segments, and triangles, such as a tetrahedron. Formally, an n-simplex is the convex hull of n+1 affinely independent points x_0, ..., x_n, i.e., the set of all convex combinations λ_0 x_0 + ... + λ_n x_n where λ_0 + ... + λ_n = 1 and λ_i ≥ 0 for all i. For example, a 0-simplex is a single point (a vertex), a 1-simplex is a line segment (an edge), and a 2-simplex is a triangle with its interior (Figure 1). Simplices are the building blocks of simplicial complexes.

Figure 1: Examples of simplices. From left to right: a vertex, an edge, a triangle, and a tetrahedron

Simplicial Complexes

A simplicial complex K is a finite collection of simplices such that (1) if a simplex σ is in K, then every face τ of σ is also in K, and (2) the intersection of two simplices σ_1, σ_2 in K is either empty or a common face of both. The dimension of K is the maximum dimension of any of its simplices. The first condition states that if a simplex is in K, then its faces are also in K. The second condition means that simplices may only be glued along their common faces, which avoids improper intersections (Figure 2).

Figure 2: left: a simplicial complex created by 0-, 1-, and 2-simplices; middle: not a simplicial complex, since the intersection of the two triangles is not an edge; right: not a simplicial complex because an edge is missing

3.2 Persistence Diagrams of Graphs

Figure 3: 0-dimensional and 1-dimensional persistent homology on a graph based on a node-degree filtration. (a) Given an input graph, the persistence diagrams are computed by persistent homology; (b) More details on the computation: the growth process of the graph over time steps and its corresponding barcodes. At the first time step (time step 0), there are no nodes or edges. At the subsequent time steps, nodes and edges start to appear. The barcodes represent the persistence diagrams (birth, death).

Consider a graph G = (V, E) with node set V, edge set E, and a node descriptor function f : V → R. First, we use the descriptor to label the vertices in V. These labels are used as threshold values to define the sublevel graph G_i = (V_i, E_i), where V_i = {v ∈ V : f(v) ≤ a_i} and E_i = {(u, v) ∈ E : max(f(u), f(v)) ≤ a_i}. This provides a nested family of subgraphs (simplicial complexes) G_0 ⊆ G_1 ⊆ ... ⊆ G_n = G, where a_1 ≤ a_2 ≤ ... ≤ a_n are the node descriptor values sorted in increasing order. This sequence, starting with the empty graph and ending with the full graph G, is called a filtration of G. During this construction, some holes may appear and disappear; in other words, the times of appearance and disappearance of topological features, such as loops and connected components, are recorded. When a component first appears, its birth time is kept track of, and when it gets merged into another component, we record its death time. Such homological features can be considered graph features. Specifically, these intervals form the persistence diagram, which consists of the birth and death times of the features as points (birth, death) in R^2. Figure 3 illustrates the 0-dimensional and 1-dimensional persistence barcodes and the corresponding persistence diagram of the filtration on a simple input graph. For each vertex v, its node degree deg(v) = Σ_u A_{vu} is used as the node descriptor, where A is the adjacency matrix of G.
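To make the filtration concrete, the following is a minimal sketch (not the authors' code) of building a degree-based sublevel filtration on a graph and computing its persistence diagram with the Gudhi library, one of the libraries used in our implementation. The helper name build_sublevel_diagram and the toy graph are illustrative assumptions.

```python
import gudhi
import numpy as np

def build_sublevel_diagram(n_nodes, edges):
    """0-/1-dimensional persistence of a node-degree sublevel filtration."""
    # Node descriptor: degree of each vertex.
    degree = np.zeros(n_nodes)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    st = gudhi.SimplexTree()
    # Each vertex enters the filtration at its descriptor value ...
    for v in range(n_nodes):
        st.insert([v], filtration=float(degree[v]))
    # ... and each edge enters once both endpoints are present.
    for u, v in edges:
        st.insert([u, v], filtration=float(max(degree[u], degree[v])))

    # persistence_dim_max=True also reports 1-dimensional classes (cycles),
    # which never die in a graph and therefore appear as essential points.
    return st.persistence(persistence_dim_max=True)

# Example: a 4-cycle with one pendant vertex.
print(build_sublevel_diagram(5, [(0, 1), (1, 2), (2, 3), (3, 0), (3, 4)]))
```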

3.3 Return Probabilities of Random Walks

Consider an undirected graph G = (V, E). Let A be the adjacency matrix of G and D be the degree matrix, i.e., the diagonal matrix with D_ii = Σ_j A_ij.

A k-step walk starting from node v_0 is a sequence of nodes (v_0, v_1, ..., v_k), with (v_{i-1}, v_i) ∈ E for i = 1, ..., k. A random walk on G is a Markov chain whose transition probability from node v to node u is A_{vu} / D_{vv}. The transition probability matrix is calculated as P = D^{-1} A. The k-hop transition probability matrix is P^k, where (P^k)_{uv} denotes the probability of moving from node u to node v in k steps. The k-hop return probability of a random walk is (P^k)_{vv}, i.e., the probability of returning to the source node v after k hops. Note that for a graph without self-loops, all 1-step return probabilities are equal to 0, since a walk cannot return to its source in a single step.

A straightforward way to calculate the k-hop transition probability matrix requires repeated multiplication of n × n matrices, which takes O(n^3) time per step. Since we are only interested in the diagonal elements, the computation can be carried out more efficiently and the time complexity can be reduced accordingly [zhang2018retgk]. Calculating all the k-hop return probabilities up to a fixed maximum number of steps therefore remains scalable to large graphs.
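As an illustration, a plain dense-matrix sketch of the k-hop return probabilities (the diagonal of (D^{-1}A)^k) is given below; it does not include the more efficient computation referenced from [zhang2018retgk], and the function name return_probabilities is an assumption.

```python
import numpy as np

def return_probabilities(A, K):
    """Return an (n, K) array whose column k-1 holds the k-hop return probabilities."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    P = A / deg[:, None]               # transition matrix P = D^{-1} A
    Pk = np.eye(n)
    feats = np.zeros((n, K))
    for k in range(1, K + 1):
        Pk = Pk @ P                    # P^k
        feats[:, k - 1] = np.diag(Pk)  # probability of returning to the start after k hops
    return feats

# Example: on a 4-cycle, every node returns with probability 1/2 after 2 hops.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
print(return_probabilities(A, K=4))
```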

3.4 Deep Learning with Sets

Most machine learning algorithms take fixed-size vectors as input; they are not designed for data in the form of sets. In contrast to vectors, sets are invariant to permutation and have variable length, so dealing with this type of input is a challenging and non-trivial task. There has been research aimed at tackling this problem by employing permutation-invariant functions with two desirable characteristics: invariance to the ordering of the inputs and the ability to handle variable-size inputs [zaheer2017deep, rezatofighi2017deepsetnet, xu2018spidercnn]. Specifically, a permutation-invariant function on a set X = {x_1, ..., x_n} can be written in the form f(X) = ρ(Σ_{i=1..n} φ(x_i)), where n is the number of points, φ is a transformation applied to each element, and ρ is a function applied to the aggregated representation. For example, average pooling and sum pooling are two pre-defined permutation-invariant aggregations commonly used to handle sets [kipf2016semi, hamilton2017inductive, atwood2016diffusion].
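Below is a small sketch, under assumed layer sizes, of such a permutation-invariant set function in the spirit of Deep Sets [zaheer2017deep]: a per-point transformation followed by sum pooling and a post-pooling map.

```python
import torch
import torch.nn as nn

class SetFunction(nn.Module):
    def __init__(self, in_dim=5, hidden=64, out_dim=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())  # per-point map
        self.rho = nn.Sequential(nn.Linear(hidden, out_dim), nn.ReLU()) # post-pooling map

    def forward(self, points):                 # points: (num_points, in_dim)
        pooled = self.phi(points).sum(dim=0)   # sum pooling is order-invariant
        return self.rho(pooled)

f = SetFunction()
x = torch.randn(7, 5)                          # a set of 7 points in R^5
perm = x[torch.randperm(7)]
print(torch.allclose(f(x), f(perm)))           # True: output ignores ordering
```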

4 Proposed Method

In this section, we describe our approach to using the return probabilities of random walks for topological graph representation and propose our neural network architecture, referred to as RPNet. Let G = (V, E) be an undirected graph, where V is the set of vertices and E is the set of edges connecting pairs of vertices. We define a set of K descriptors f_1, ..., f_K on V, where f_k(v) is the k-hop return probability of a random walk starting from the vertex v, with 1 ≤ k ≤ K. The higher the value of k, the larger the subgraph being considered. By definition, f_k(v_i) = [(D^{-1}A)^k]_{ii}, where A is the adjacency matrix, D is the degree matrix, and v_i is the i-th vertex in V.

PDs Computation and Processing

Using the sublevel filtration based on each of the K node descriptors f_k, we compute 0- and 1-dimensional PDs for each graph G. We split the points in these diagrams into three groups: 0-homology essential points (0-homology points whose death is infinite), 0-homology non-essential points (0-homology points whose death is finite), and 1-homology essential points (1-homology points whose death is infinite). Note that 1-homology classes never disappear since we do not consider higher-degree homology; therefore, 1-homology non-essential points (1-homology points whose death is finite) do not exist. We normalize these points to the range [0, 1] by dividing them by the maximum coordinate (i.e., birth or death) value (infinity is also mapped to 1), and add a 3-dimensional one-hot vector characterizing each group. Thus, each point is represented by a vector in R^5: birth, death, and the 3-d one-hot vector. After this process, we obtain K sets of 5-dimensional vectors. These sets can vary in size (for example, in Fig. 4, the first descriptor yields a set of 6 points (blue), while the second yields 5 points (yellow)), so we pad them to the same fixed number of elements per set. The final output is a fixed-size, multi-channel feature, one channel per descriptor, for each graph G. Figure 4 illustrates this process on a simple graph.

Figure 4: Procedure to generate and process persistence diagrams (here, two of them) from a graph using return probabilities of random walks: we first use each return probability of a random walk to label the nodes of the graph; the labels are used to generate pairs of (birth, death), which are then stacked together with the 3-d one-hot vectors to obtain the features.
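A hedged sketch of this post-processing step is given below: it splits points into the three groups, normalizes birth/death values to [0, 1] (mapping infinity to 1), appends the 3-d one-hot group indicator, and pads each diagram to a fixed size. Zero-padding and the fixed size of 64 points are assumptions for illustration.

```python
import numpy as np

def process_diagram(dgm, pad_to=64):
    """dgm: list of (dim, (birth, death)) pairs, e.g. as returned by Gudhi."""
    finite = [abs(x) for _, (b, d) in dgm for x in (b, d) if np.isfinite(x)]
    scale = max(finite) if finite and max(finite) > 0 else 1.0

    points = []
    for dim, (b, d) in dgm:
        essential = np.isinf(d)
        b = b / scale
        d = 1.0 if essential else d / scale           # infinity is mapped to 1
        if dim == 0:
            one_hot = [0, 1, 0] if essential else [1, 0, 0]
        else:                                         # 1-dim points are always essential
            one_hot = [0, 0, 1]
        points.append([b, d] + one_hot)

    out = np.zeros((pad_to, 5))                       # pad with zero vectors (assumption)
    out[:len(points)] = np.asarray(points)[:pad_to]
    return out
```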

4.1 Network architecture

Figure 5: Overview of the architecture of RPNet, our proposed neural network

We propose RPNet, a neural network operating on the extracted features. The network consists of an encoder, a decoder, a point pooling layer, a diagram pooling layer, and a classification head. First, the encoder maps each point into a d-dimensional space. Next, we concatenate the encoded features with the 3-d one-hot vectors and pass the result to a decoder, which maps each point into a d'-dimensional space.

The point pooling layer performs sum pooling on points in each diagram, and the diagram pooling layer performs weighted pooling on diagrams for each graph.

The classification head is a multi-layer perceptron on the embedding of each graph, followed by a softmax function to output the class probabilities.

Our network is optimized with the cross-entropy loss. Figure 5 demonstrates the architecture of our proposed neural network.
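The following is a simplified sketch of this pipeline (encoder, decoder, point pooling, weighted diagram pooling, and classification head); the layer dimensions and the exact block composition are assumptions, since the real model fine-tunes these as described in Section 5.

```python
import torch
import torch.nn as nn

class RPNetSketch(nn.Module):
    def __init__(self, num_diagrams=4, enc_dim=64, dec_dim=64, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(5, enc_dim), nn.ReLU())
        # decoder input: encoded point concatenated with its 3-d one-hot vector
        self.decoder = nn.Sequential(nn.Linear(enc_dim + 3, dec_dim), nn.ReLU())
        # learned weights for aggregating the per-diagram embeddings
        self.diagram_weights = nn.Parameter(torch.ones(num_diagrams))
        self.head = nn.Linear(dec_dim, num_classes)

    def forward(self, x):                        # x: (num_diagrams, num_points, 5)
        one_hot = x[..., 2:]                     # keep the group indicator
        h = self.decoder(torch.cat([self.encoder(x), one_hot], dim=-1))
        per_diagram = h.sum(dim=1)               # point pooling (sum over points)
        w = torch.softmax(self.diagram_weights, dim=0)
        graph_emb = (w[:, None] * per_diagram).sum(dim=0)  # weighted diagram pooling
        return self.head(graph_emb)              # logits; softmax + cross-entropy in training

logits = RPNetSketch()(torch.rand(4, 64, 5))
print(logits.shape)                              # torch.Size([2])
```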

5 Experiments

Setup

We use persistence diagrams up to degree 1 for each descriptor f_k as inputs to our model. We follow the conventional settings for graph classification as described in previous works [hofer2017deep, carriere2019perslay]. Specifically, we perform a 10-fold cross-validation (9 folds for training and 1 fold for testing) for each run. We train each model for at most 500 epochs and report the average accuracy and standard deviation over the epochs that have the lowest loss of each fold. Training is terminated after 50 epochs without a decrease in the epoch loss. The Adam optimizer [kingma2014adam] is used with an initial learning rate between 0.001 and 0.01, and a decay factor of 0.5 applied after every 25 epochs, up to 6 times. For the encoder and decoder, we use several blocks of the following four components, in that order: a linear layer, a normalization layer (e.g., BatchNorm [ioffe2015batch], LayerNorm [ba2016layer]), Dropout [srivastava14a], and an activation layer (e.g., ReLU [agarap2018deep], ELU [clevert2015fast]). We fine-tune the number of blocks (from 2 to 5) and their dimensions in the encoder, along with the learning rate. We report the best scores for each variant of our models, denoted as RPNet-K, where K represents the number of return probabilities of random walks used to generate the multi-scale graph representation. It is worth noting that all the baselines we compare with follow the same setup, so the comparison is fair.
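For concreteness, a minimal sketch of this optimization schedule using standard PyTorch components is shown below; the stand-in model and the placeholder training pass are assumptions.

```python
import torch

model = torch.nn.Linear(5, 2)                           # stand-in for RPNet
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.5)
criterion = torch.nn.CrossEntropyLoss()                 # the loss our network optimizes

best_loss, patience = float("inf"), 0
for epoch in range(500):
    # ... training pass over the fold's training split with `criterion`,
    #     accumulating epoch_loss ...
    epoch_loss = 0.0                                    # placeholder for the real loop
    if epoch < 25 * 6:                                  # decay at most 6 times
        scheduler.step()
    if epoch_loss < best_loss:
        best_loss, patience = epoch_loss, 0
    else:
        patience += 1
        if patience >= 50:                              # early stopping
            break
```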

Datasets


Dataset         #graphs   avg #nodes   avg #edges   #classes
MUTAG               188       17.93        19.79        2
PROTEINS           1113       39.06        72.82        2
NCI1               4110       29.87        32.30        2
NCI109             4127       29.68        32.13        2
PTC_FM              349       14.11        14.48        2
PTC_FR              351       14.56        15.00        2
PTC_MM              336       13.97        14.32        2
PTC_MR              344       14.29        14.69        2
COX2                467       41.22        43.45        2
DHFR                756       42.43        44.54        2
REDDIT-BINARY      2000      429.6        497.8         2
IMDB-BINARY        1000       19.77        96.53        2
COLLAB             5000       74.49      2457.78        3
IMDB-MULTI         1500       13.00        65.94        3
REDDIT-5K          4999      508.5        594.9         5
REDDIT-12K        11929      391.4        456.9        12

Table 1: Descriptions of datasets used in our experiments

We evaluated our proposed method in graph classification tasks on 16 graph benchmark datasets from different domains, such as small chemical compounds or protein molecules (e.g., COX2, PROTEINS) and social networks (e.g., REDDIT-BINARY, IMDB-M). We group these datasets into two categories: (1) datasets with fewer than 1,000 graphs, or fewer than 100 vertices per graph on average, and (2) datasets with more than 1,000 graphs and more than 100 vertices per graph on average. It is worth noting that while most of the datasets consist only of an adjacency matrix, some bioinformatics datasets also provide categorical node attributes. The aggregate statistics of these datasets are provided in Table 1 (more details can be found at https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets).



Method    MUTAG     PROTEINS  NCI1      NCI109    PTC-MR    PTC-FR    PTC-MM    PTC-FM    COX2      DHFR
WL        82.1      -         82.2      82.5      -         -         -         -         -         -
RetGK     90.3±1.1  75.8±0.6  84.5±0.2  -         62.5±1.6  66.7±1.4  65.6±1.1  62.3±1.0  -         -
GCAPS     -         76.4±4.1  82.7±2.4  81.1±1.3  -         -         -         -         -         -
PersLay   89.8±0.9  74.8±0.3  73.5±0.3  69.5±0.3  -         -         -         -         80.9±1.0  80.3±0.8
SV        88.2±1.0  72.6±0.4  71.3±0.4  69.8±0.2  -         -         -         -         78.4±0.4  78.8±0.7
P-WL      86.1±1.4  75.3±0.7  85.3±0.1  84.8±0.2  63.1±1.5  67.3±1.5  68.4±1.2  64.5±1.8  -         -
P-WLC     90.5±1.3  75.3±0.4  85.5±0.2  85.0±0.3  64.0±0.8  67.2±1.1  68.6±1.8  65.8±1.2  -         -
RPNet-1   89.4±4.0  74.5±4.5  73.5±1.1  70.7±1.6  60.8±5.3  66.7±3.6  64.9±5.5  63.3±6.8  81.8±3.0  76.3±3.6
RPNet-2   90.9±3.3  75.8±3.3  73.9±2.5  71.1±1.7  60.7±4.6  68.1±7.3  66.7±4.8  63.3±5.3  82.4±2.7  75.5±4.0
RPNet-4   91.0±3.4  75.4±2.6  72.1±1.8  70.0±1.9  62.5±6.1  66.7±4.0  63.7±4.9  64.5±6.1  81.6±3.5  81.3±3.0
RPNet-8   91.0±4.0  74.2±2.8  72.6±1.9  70.4±1.6  59.6±7.1  66.4±4.6  65.5±4.5  62.2±7.8  79.9±2.7  80.7±2.6



Table 2: Classification results (accuracy ± standard deviation) on dataset group 1. The best-performing results are highlighted in boldface. The baselines include graph kernel, graph neural network, and persistent-homology-based methods. We use "-" to indicate that the dataset was not used in the baseline paper.

                            COLLAB    RD-B      RD-M5K    RD-M12K   IMDB-B    IMDB-M
Neural Net    GCAPS         77.7±2.5  87.6±2.5  50.1±1.7  -         71.7±3.4  48.5±4.1
              PersLay       76.4±0.4  -         55.6±0.3  47.7±0.2  71.2±0.7  48.8±0.6
              DeepTopo      -         -         54.5      44.5      -         -
Ours          RPNet-1       69.2±2.9  91.4±1.8  56.9±2.2  48.5±0.8  69.6±4.2  45.9±3.9
              RPNet-2       70.0±2.2  91.6±2.0  57.6±2.4  48.5±1.0  68.3±3.9  46.8±3.1
              RPNet-4       71.8±2.5  91.7±1.9  57.2±2.1  48.4      71.9±4.4  47.7±2.9
              RPNet-8       73.4±1.8  90.4±2.1  56.4±2.1  48.3±1.2  70.7±3.8  48.9±2.9
Graph Kernel  RetGK         81.0±0.3  92.6±0.3  56.1±0.5  48.7±0.2  71.9±1.0  47.7±0.2
              WKPI          -         -         59.5±0.6  48.4±0.5  75.1±1.1  49.5±0.4
              SV            79.6±0.3  87.8±0.3  53.1±0.2  -         74.2±0.9  49.9±0.3

Table 3: Classification results (accuracy ± standard deviation) on dataset group 2. The best performances of our methods and of the other methods are highlighted in boldface. Note: "-" indicates the dataset was not used in the baseline paper.

Baselines

We compare our results with the graph kernel methods WL subtree [shervashidze2011weisfeiler] and RetGK [zhang2018retgk]; the deep learning-based approaches GCAPS [verma2018graph], PersLay [carriere2019perslay], and DeepTopo [hofer2017deep]; and kernel-based methods with topological features: SV [tran2018scale], P-WL and P-WLC [rieck2019persistent], and WKPI [zhao2019learning]. For the above-mentioned baselines, we report the accuracies and standard deviations (if available) from the original papers, which follow the same evaluation setting. Some of our datasets were not used in certain papers (but were used by others); we denote this by "-", or leave the method out of the corresponding table if it did not use any of the datasets in that table.

Implementation

Models are implemented in PyTorch [paszke2019pytorch]. The calculation of persistence diagrams is performed with the Dionysus (https://mrzv.org/software/dionysus2) and Gudhi (http://gudhi.gforge.inria.fr) libraries.

6 Results

Dataset Group 1

Table 2 shows the results on 10 graph datasets. Our method performs best on 4 out of the 10 datasets. On the other 6 datasets, it performs comparably to PersLay [carriere2019perslay], another persistence diagram-based method.

Dataset Group 2

For large datasets, our method ranks in the top 3 on most of the datasets, as observed in Table 3. When compared against the other methods in the neural network family, it outperforms them on 5 out of 6 datasets.

Some baseline methods, such as RetGK [zhang2018retgk], leverage node attributes (node labels) of the graph for classification purposes. Our results are promising considering that our approach does not use any node attributes as features. When comparing the best model among our variants (i.e., different values of K), we observe that using more than one return probability of random walks achieves better results on most of the datasets. In particular, RPNet-1 performs best on only 1 of the 16 datasets. This implies that our method benefits from using multiple PDs based on multi-scale graph signatures. On the other hand, while our variant with K = 2 (RPNet-2) works best on the small datasets in group 1, we observe that increasing K (i.e., K = 4 and K = 8) yields some improvement on the large datasets in group 2. Despite their high performance on most benchmarks, graph kernel methods require computing and storing evaluations for each pair of graphs in the dataset, making them intractable for datasets containing a large number of graphs. For example, the random walk kernel (RW) [vishwanathan2010graph] takes more than 3 days on the NCI1 dataset [zhang2018end]. Our approach avoids the quadratic complexity in the number of graphs required by graph kernels, and thus can easily be applied to real-world graph datasets. It is also worth mentioning that our approach has higher variance compared with others, such as kernel methods. This high variance is a common issue with deep learning approaches, and can also be observed in deep learning baselines such as GCAPS [verma2018graph].

Ablation experiments

In this section, we conduct an ablation study on our proposed architecture. We start with RPNet-4 and then remove some of its components to study their contributions. The comparison is shown in Table 4.

We observe that removing both sets of 3-d one-hot vectors, (1) and (2), decreases the accuracy by 6.4%, 6.6%, and 2.3% on NCI1, IMDB-M, and PROTEINS, respectively. The gain mainly comes from one-hot vectors (1), which contribute much more to the performance than one-hot vectors (2). This shows the importance of using the one-hot vectors to retain information about the type of persistence diagram obtained from the original graphs. Besides, we find that learning a set of weights to aggregate the diagram features is beneficial to the model: compared with average pooling as a baseline, weighted diagram pooling improves accuracy by approximately 1-3% on the 3 datasets.

According to the experimental results, incorporating all three components into our model also helps, in general, to reduce the variance, allowing us to obtain more stable performance.


Dataset    one-hot (1)   one-hot (2)   weighted diagram pooling (3)   Accuracy ± std
NCI1                                                                  65.7±2.9
                                                                      66.6±2.6
                                                                      70.8±1.5
                                                                      70.7±1.0
                                                                      72.1±1.8
IMDB-M                                                                41.1±4.1
                                                                      41.2±4.1
                                                                      43.4±3.2
                                                                      44.3±3.7
                                                                      47.7±2.9
PROTEINS                                                              73.1±3.6
                                                                      73.1±2.8
                                                                      73.1±3.8
                                                                      72.3±4.2
                                                                      75.4±2.6

Table 4: Ablation experiment results of our model's architecture on the 3 datasets NCI1, IMDB-M, and PROTEINS. Note: one-hot (1) denotes concatenation of the 3-d one-hot vectors with the topological features extracted from the graph, one-hot (2) denotes concatenation of the 3-d one-hot vectors with the encoded features (after the encoding step), and weighted diagram pooling (3) indicates using a weighted sum to aggregate the diagrams; otherwise average pooling is employed. These components are indicated by (1), (2), and (3) in Fig. 5, respectively.

7 Conclusion

In this paper, we propose a family of multi-scale graph signatures for persistence diagrams based on return probabilities of random walks. We introduce RPNet, a deep neural network architecture designed to handle the proposed features, and demonstrate its effectiveness on a wide range of benchmark graph datasets. We show that our proposed method outperforms other persistent homology-based methods and graph neural networks while being on par with state-of-the-art graph kernels. There are several other interesting directions worth exploring. One of them is to employ a kernel-based approach with our multiset features. There has been some work on defining distances between two persistence diagrams, and it would be promising to build a kernel method operating on the proposed multi-scale representation. Besides, our approach does not make use of node attributes, which may be critical for classifying graphs such as chemical compounds, where each node attribute represents a chemical element. Thus, another future direction is to investigate in detail how to incorporate node attributes into our multi-scale graph signature. We are also interested in developing a theory for multi-dimensional persistent homology and using it in deep learning models for downstream tasks.

8 Acknowledgments

Approved for public release, 22-607

References