Path Integral Based Convolution and Pooling for Graph Neural Networks

06/29/2020 ∙ by Zheng Ma, et al. ∙ Princeton University, University of Technology Sydney, Zhejiang Normal University, Max Planck Society, University of Cambridge

Graph neural networks (GNNs) extend the functionality of traditional neural networks to graph-structured data. Similar to CNNs, an optimized design of graph convolution and pooling is key to success. Borrowing ideas from physics, we propose a path integral based graph neural network (PAN) for classification and regression tasks on graphs. Specifically, we consider a convolution operation that involves every path linking the message sender and receiver, with learnable weights depending on the path length, which corresponds to the maximal entropy random walk. It generalizes the graph Laplacian to a new transition matrix, which we call the maximal entropy transition (MET) matrix, derived from a path integral formalism. Importantly, the diagonal entries of the MET matrix are directly related to the subgraph centrality, thus providing a natural and adaptive pooling mechanism. PAN provides a versatile framework that can be tailored for different graph data with varying sizes and structures. We can view most existing GNN architectures as special cases of PAN. Experimental results show that PAN achieves state-of-the-art performance on various graph classification/regression tasks, including a new benchmark dataset from statistical mechanics that we propose to boost applications of GNN in physical sciences.


1 Introduction

The triumph of convolutional neural networks (CNNs) has motivated researchers to develop similar architectures for graph-structured data. The task is challenging due to the absence of regular grids. One notable proposal is to define convolutions in the Fourier space BrZaSzLe2013 ; Bronstein_etal2017 . This method relies on finding the spectrum of the graph Laplacian $L = D - A$ (or its normalized version) and then applying filters to the components of the input signal under the corresponding eigenbasis, where $A$ is the adjacency matrix of the graph and $D$ is the corresponding degree matrix. Due to the high computational complexity of diagonalizing the graph Laplacian, many simplifications have been proposed defferrard2016convolutional ; KiWe2017 .
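As background for the spectral approach just described, the following sketch (a toy graph chosen for illustration, not taken from the paper) computes the symmetrically normalized Laplacian and applies a simple low-pass filter in its eigenbasis; the specific graph and filter are assumptions.

```python
import numpy as np

# Toy undirected graph: adjacency matrix A and degree matrix D.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))

# Symmetrically normalized Laplacian L_sym = I - D^{-1/2} A D^{-1/2}.
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_sym = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

# Graph Fourier basis: eigenvectors of L_sym; spectral filters act on the
# projection of a node signal x onto this basis.
eigvals, eigvecs = np.linalg.eigh(L_sym)
x = np.random.rand(len(A))                         # a node signal
x_hat = eigvecs.T @ x                              # graph Fourier transform
x_filtered = eigvecs @ (np.exp(-eigvals) * x_hat)  # example low-pass filter
```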

The graph Laplacian based methods essentially rely on message passing gilmer2017neural between directly connected nodes, with equal weights shared among all edges, which is at heart a generic random walk (GRW) defined on graphs. This can be seen most clearly in the GCN model KiWe2017 , where the normalized adjacency matrix $\tilde D^{-1/2}\tilde A\tilde D^{-1/2}$ (with $\tilde A = A + I$ including self-loops and $\tilde D$ its degree matrix) is directly applied to the left-hand side of the input. In statistical physics, $D^{-1}A$ is known as the transition matrix of a particle doing a random walk on the graph, where the particle hops to all directly connected nodes with equal probability. Many direct space-based methods node2vec_2016 ; LiTaBrZe2015 ; velivckovic2017graph ; Planetoid_2016 can be viewed as generalizations of GRW, but with biased weights among the neighbors.

In this paper, we go beyond the GRW picture, where information necessarily dilutes when a path branches, and instead consider every path linking the message sender and receiver as the elemental unit in message passing. Inspired by the path integral formulation developed by Feynman feynman2010quantum ; feynman1948space , we propose a graph convolution that assigns a trainable weight to each path depending on its length. This formulation results in a maximal entropy transition (MET) matrix, the counterpart of the graph Laplacian in the GRW picture. By introducing a fictitious temperature, we can continuously tune our model from a fully localized one (MLP) to a spectrum-based model. Importantly, the diagonal of the MET matrix is intimately related to the subgraph centrality and thus provides a natural pooling method without extra computations. We call this complete path integral based graph neural network framework PAN.

We demonstrate that PAN outperforms many popular architectures on benchmark datasets. We also introduce a new dataset from statistical mechanics, which overcomes the lack of explainability and tunability of many previous ones. The dataset can serve as another benchmark, especially for boosting applications of GNN in physical sciences. This dataset again confirms that PAN has a faster convergence rate, higher prediction accuracy, and better stability compared to many counterparts.

2 Path Integral Based Graph Convolution

Path integral and MET matrix

Feynman’s path integral formulation feynman2010quantum ; Zinn-Justin:2009 interprets the probability amplitude $\phi(x, t)$ as a weighted average in the configuration space, where the contribution from $\phi(x_0, t_0)$ is computed by summing over the influences (denoted by $e^{iS[x(t)]/\hbar}$, with $S$ the classical action of the path) from all paths connecting $x_0$ and $x$. This formulation has later been extensively used in statistical mechanics and stochastic processes kleinert2009path . We note that this formulation essentially constructs a convolution by considering the contribution from all possible paths in the continuous space.

Figure 1: A schematic analogy between the original path integral formulation in continuous space (left) and the discrete version for a graph (right). Symbols are defined in the text.

Using this idea, but modified for discrete graph structures, we can heuristically propose a statistical mechanics model of how information is shared between different nodes on a given graph. In its most general form, we write the observable $\bar{\phi}_i$ at the $i$-th node for a graph with $N$ nodes as

\bar{\phi}_i = \frac{1}{Z_i} \sum_{j=1}^{N} \Big( \sum_{l:\, i \to j} e^{-E[l]/T} \Big)\, \phi_j,     (1)

where $Z_i$ is the normalization factor known as the partition function for the $i$-th node. Here a path $l = \{l_0, l_1, \ldots, l_n\}$ is a sequence of connected nodes (consecutive nodes share an edge) where $l_0 = i$, and the length of the path is denoted by $|l| = n$. In Figure 1 we draw the analogy between our discrete version and the original formulation. It is straightforward to see that the integral should now be replaced by a summation, and $\phi$ only resides on nodes. Since a statistical mechanics perspective is more appropriate in our case, we directly change the exponential term, which is originally an integral of the Lagrangian, to a Boltzmann factor $e^{-E[l]/T}$ with fictitious energy $E$ and temperature $T$; we choose Boltzmann's constant $k_B = 1$. Nevertheless, we still exploit the fact that the energy is a functional of the path, which gives us a way to weight the influence of other nodes through a certain path. The fictitious temperature $T$ controls the excitation level of the system, which reflects to what extent information is localized or extended. In practice, there is no need to learn the fictitious temperature or energy separately; instead, the neural network can directly learn the overall weights, as will be made clearer later.

To obtain an explicit form of our model, we now introduce some mild assumptions and simplifications. Intuitively, we know that information quality usually decays as the path between the message sender and the receiver becomes longer, so it is reasonable to assume that the energy is not only a functional of the path, but can be further simplified to a function $E(n)$ that depends solely on the length $n$ of the path. In the random walk picture, this means that the hopping is equiprobable among all the paths that have the same length, which maximizes the Shannon entropy of the probability distribution of paths globally; the random walk is therefore given the name maximal entropy random walk burda2009localization . (For a weighted graph, a feasible choice for the functional form of the energy could be $E(l) = E(\tilde n(l))$, where the effective length $\tilde n$ of the path can be defined as a summation of the inverses of the weights along the path, i.e. $\tilde n(l) = \sum_{k} 1/w_{l_k l_{k+1}}$.) By first conditioning on the length of the path, we can introduce the overall $n$-th layer weight for node $i$ by

W_i(n) = \frac{e^{-E(n)/T}}{Z_i} \sum_{j=1}^{N} \rho_n(i, j),     (2)

where $\rho_n(i,j)$ denotes the number of paths between nodes $i$ and $j$ with length $n$, or the density of states for the energy level $E(n)$ with respect to nodes $i$ and $j$, and the summation is taken over all nodes of the graph. Intuitively, a node $j$ with larger $\rho_n(i,j)$ has more channels to talk with node $i$, and thus may impose a greater influence on node $i$, as is the case in our formulation. For example, in Figure 1, two nodes are both two steps away from the node of interest, but the one with more paths connecting them would be assigned a larger weight as a consequence. Presumably, the energy $E(n)$ is an increasing function of $n$, which leads to a decaying weight as $n$ increases. (This does not mean that $W_i(n)$ must necessarily be a decreasing function of $n$, as $\rho_n(i,j)$ grows exponentially in general. It would be valid to apply a cutoff as long as $e^{-E(n)/T}\lambda_1^n \to 0$ for large $n$, where $\lambda_1$ is the largest eigenvalue of the adjacency matrix $A$.) By applying a cutoff $L$ on the maximal path length, we exchange the summation order in (1) to obtain

\bar{\phi}_i = \frac{1}{Z_i} \sum_{n=0}^{L} e^{-E(n)/T} \sum_{j=1}^{N} \rho_n(i, j)\, \phi_j,     (3)

where the partition function can be explicitly written as

Z_i = \sum_{n=0}^{L} e^{-E(n)/T} \sum_{j=1}^{N} \rho_n(i, j).     (4)

A nice property of this formalism is that we can easily compute $\rho_n(i,j)$ by raising the adjacency matrix to the power $n$, which is a well-known result from graph theory, i.e., $\rho_n(i,j) = (A^n)_{ij}$. Plugging this into (3), we now have a group of self-consistent equations governed by a transition matrix (a counterpart of the propagator in quantum mechanics), which can be written in the following compact form

\bar{\phi} = M \phi, \qquad M = Z^{-1} \hat{M}, \qquad \hat{M} = \sum_{n=0}^{L} e^{-E(n)/T} A^n,     (5)

where $Z = \operatorname{diag}(Z_1, \ldots, Z_N)$ collects the partition functions in (4). We call the matrix $M$ the maximal entropy transition (MET) matrix, with regard to the fact that it realizes maximal entropy under the microcanonical ensemble. This transition matrix replaces the role of the graph Laplacian under our framework.

More generally, one can constrain the paths under consideration to, for example, shortest paths or self-avoiding paths. Consequently, $\rho_n(i,j)$ will take more complicated forms and the matrix $M$ needs to be modified accordingly. In this paper, we focus on the simplest scenario and apply no constraints, for simplicity of the discussion.
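The construction of the MET matrix in (4)–(5) can be sketched directly from powers of the adjacency matrix. In the minimal sketch below, the toy graph, cutoff, and per-length weights are illustrative stand-ins for the learnable Boltzmann factors $e^{-E(n)/T}$, not values used in the paper.

```python
import torch

# Toy 4-node path graph.
A = torch.tensor([[0., 1., 0., 0.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [0., 0., 1., 0.]])
L = 3                                    # cutoff on the maximal path length
w = torch.tensor([1.0, 0.8, 0.5, 0.2])   # stand-ins for e^{-E(n)/T}, n = 0..L

# Unnormalized weighted sum of adjacency powers, \hat M = sum_n w_n A^n.
M_hat = sum(w[n] * torch.matrix_power(A, n) for n in range(L + 1))

# Partition function Z_i (Eq. 4) and the MET matrix M = Z^{-1} \hat M (Eq. 5).
Z = M_hat.sum(dim=1)
M = M_hat / Z.unsqueeze(1)
print(M.sum(dim=1))                      # each row sums to 1 after normalization
```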

PAN convolution

The eigenstates, or the basis of the system, satisfy the eigenvalue equation $M u = \lambda u$. Similar to the basis formed by the graph Laplacian, one can define a graph convolution based on the spectrum of the MET matrix, which now has a distinct physical meaning. However, it is computationally impractical to diagonalize $M$ in every iteration as it is updated. To reduce the computational complexity, we apply a trick similar to GCN KiWe2017 by directly multiplying $M$ to the left-hand side of the input and accompanying it with another weight matrix on the right-hand side. The convolutional layer is then reduced to a simple form

X^{(k+1)} = M X^{(k)} W^{(k)},     (6)

where $k$ refers to the layer number and $W^{(k)}$ is a trainable weight matrix. Applying $M$ to the input is essentially a weighted average among the neighbors of a given node, which raises the question of whether the normalization consistent with the path integral formulation works best in a data-driven context. It has been consistently shown experimentally that a symmetric normalization usually gives better results KiWe2017 ; LNet ; MaLiWa2019 . This observation might have an intuitive explanation. Most generally, one can consider the normalization $Z^{-a} \hat{M} Z^{-b}$ with $a + b = 1$, where $\hat{M}$ is the unnormalized weighted sum in (5). There are two extreme situations. When $a = 1$ and $b = 0$, it is called the random-walk normalization, and the model can be understood as “receiver-controlled", in the sense that the node of interest performs an average among all its neighbors weighted by the number of channels that connect them. On the contrary, when $a = 0$ and $b = 1$, the model becomes “sender-controlled", since the weight is determined by the fraction of the flow coming out of the sender that is directed to the receiver. Because the exact interaction between connected nodes in an undirected graph is unknown, the symmetric normalization ($a = b = 1/2$), as a compromise, can outperform both extremes, even if it may not be optimal. This consideration leads us to a final refinement that replaces the normalization in (6) with the symmetric version. The convolutional layer then becomes

X^{(k+1)} = Z^{-1/2} \Big( \sum_{n=0}^{L} e^{-E(n)/T} A^n \Big) Z^{-1/2}\, X^{(k)} W^{(k)}.     (7)

We shall call this graph convolution PANConv.

The optimal cutoff $L$ of the series depends on the intrinsic properties of the graph, which are reflected in the fictitious temperature $T$. Incorporating more terms is analogous to having more particles excited to higher energy levels at a higher temperature. For instance, in the low-temperature limit $T \to 0$, only the $n = 0$ term survives and the model reduces to the MLP model. In the high-temperature limit, all Boltzmann factors are effectively one, and the term with the largest power dominates the summation. We can see this from the eigendecomposition $A^n = \sum_{m=1}^{N} \lambda_m^n\, u_m u_m^{\mathsf T}$, where the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N$ are sorted in descending order. By the Perron–Frobenius theorem, we may keep only the leading-order term with the unique largest eigenvalue when $n$ is large. We then reach a prototype of the high-temperature model governed by $A^L$ alone. The most suitable choice of the cutoff reflects the intrinsic dynamics of the graph.
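A dense-matrix sketch of a convolution layer implementing the symmetrically normalized update (7) is given below; the class name, dense-tensor layout, and weight initialization are assumptions (a production implementation would operate on sparse graphs), but the arithmetic follows the equation.

```python
import torch
import torch.nn as nn

class PANConvSketch(nn.Module):
    """Illustrative dense layer: X' = Z^{-1/2} (sum_n w_n A^n) Z^{-1/2} X W."""

    def __init__(self, in_channels: int, out_channels: int, L: int = 3):
        super().__init__()
        self.path_weights = nn.Parameter(torch.ones(L + 1))  # learnable e^{-E(n)/T}
        self.lin = nn.Linear(in_channels, out_channels, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        n = adj.size(0)
        m_hat = torch.zeros_like(adj)
        a_pow = torch.eye(n, dtype=adj.dtype, device=adj.device)  # A^0 = I
        for w in self.path_weights:
            m_hat = m_hat + w * a_pow          # accumulate w_n * A^n
            a_pow = a_pow @ adj                # next power of A
        z = m_hat.sum(dim=1).clamp(min=1e-12)  # partition function Z_i
        z_inv_sqrt = torch.diag(z.pow(-0.5))
        m_sym = z_inv_sqrt @ m_hat @ z_inv_sqrt  # symmetric MET normalization
        return m_sym @ self.lin(x)               # Eq. (7)

# Usage on a toy graph with 4 nodes and 8 input features.
adj = torch.tensor([[0., 1., 1., 0.],
                    [1., 0., 1., 0.],
                    [1., 1., 0., 1.],
                    [0., 0., 1., 0.]])
x = torch.randn(4, 8)
out = PANConvSketch(8, 16, L=3)(x, adj)   # shape (4, 16)
```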

3 Path Integral Based Graph Pooling

For graph classification and regression tasks, another critical component is the pooling mechanism, which enables us to deal with graph input with variable sizes and structures. Here we show that the PAN framework provides a natural ranking of node importance based on the MET matrix, intimately related to the subgraph centrality. This pooling scheme, denoted by PANPool, requires no further work aside from the convolution and can discover the underlying local motif adaptively.

MET matrix and subgraph centrality

Many different ways to rank the “importance" of nodes in a graph have been proposed in the complex networks community. The most straightforward one is the degree centrality (DC), which counts the number of neighbors; other, more sophisticated measures include, for example, betweenness centrality (BC) and eigenvector centrality (EC) newman2018networks . Although these methods do give specific measures of the global importance of the nodes, they usually fail to pick up local patterns. However, from the way CNNs work on image classification, we know that it is the locally representative pixels that matter.

Estrada and Rodriguez-Velazquez estrada2005subgraph have shown that subgraph centrality is superior to the methods mentioned above in detecting local graph motifs, which are crucial to the analysis of many social and biological networks. The subgraph centrality computes a weighted sum of the numbers of closed walks of different lengths returning to the node. Mathematically, it is simply written as $SC(i) = \sum_{n=0}^{\infty} (A^n)_{ii}/n!$ for node $i$. Interestingly, one immediately sees the resemblance between this expression and the diagonal elements of the MET matrix. The difference is easy to explain. The summation in the MET matrix is truncated at the maximal length $L$, and the weights for different path lengths are learnable. In contrast, the predetermined weight $1/n!$ is a convenient choice that ensures the convergence of the summation and an analytical form of the result, which reads $SC(i) = \sum_{m=1}^{N} (u_m(i))^2 e^{\lambda_m}$, where $u_m(i)$ is the $i$-th element of the orthonormal eigenvector associated with the eigenvalue $\lambda_m$ of $A$.

Now it becomes clear that the MET matrix plays the role of a path integral-based convolution. Its diagonal element $M_{ii}$ also automatically provides a measure of the importance of node $i$, thus enabling a pooling mechanism by sorting $M_{ii}$. Importantly, this pooling method has three main merits compared to the subgraph centrality. First, we can exploit the readily computed MET matrix and thus circumvent extra computations, especially the direct diagonalization of the adjacency matrix required by the subgraph centrality. Second, the weights are data-driven rather than predetermined, so they can effectively adapt to different inputs. Furthermore, the MET matrix is normalized (unlike in the convolution, whether the normalization is symmetric or not does not matter here: we only care about the diagonal terms, and different normalization methods give the same result), which adds weight to the local importance of the nodes and can potentially avoid clustering around “hubs" that are commonly seen in real-world “scale-free" networks barabasi2016network .

The PAN pooling strategy has a similar physical explanation to the PAN convolution. In the low-temperature limit, for example, if we set the cutoff at $L = 2$, the ranking of $M_{ii}$ is of the same order as the ranking of the degrees, and thus we recover the degree centrality. In the high-temperature limit, as $L \to \infty$, the sum is dominated by the magnitude of the $i$-th element of the orthonormal eigenvector associated with the largest eigenvalue of $A$, and the corresponding ranking reduces to the ranking of the eigenvector centrality. By tuning the fictitious temperature, PANPool provides a flexible strategy that can adapt to the “sweet spot" of the input.
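To make the ranking concrete, the sketch below scores the nodes of a toy graph by the diagonal of the MET matrix and, for comparison, by Estrada's subgraph centrality $\operatorname{diag}(e^{A})$; the graph and per-length weights are illustrative assumptions.

```python
import torch

# Toy graph: a triangle (nodes 0, 1, 2) with a pendant node 3.
A = torch.tensor([[0., 1., 1., 0.],
                  [1., 0., 1., 0.],
                  [1., 1., 0., 1.],
                  [0., 0., 1., 0.]])
L, w = 3, torch.tensor([1.0, 0.8, 0.5, 0.2])    # cutoff and per-length weights

M_hat = sum(w[n] * torch.matrix_power(A, n) for n in range(L + 1))
M = M_hat / M_hat.sum(dim=1, keepdim=True)       # normalized MET matrix

# Pooling score: diagonal of the MET matrix (weighted closed paths per node).
score_met = torch.diag(M)

# Subgraph centrality: SC(i) = sum_n (A^n)_{ii} / n!  =  diag(expm(A)).
score_sc = torch.diag(torch.matrix_exp(A))

print(score_met.argsort(descending=True))        # ranking from the MET diagonal
print(score_sc.argsort(descending=True))         # ranking from subgraph centrality
```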

To better understand the effect of the proposed method, in Figure 2 we visualize the top 20% of nodes by different measures of node importance for a connected point pattern called RSA, which we detail in Section 5.2. It is noteworthy that while DC selects points relatively uniformly, the result of EC is highly concentrated. This phenomenon is analogous to the contrast between the rather uniform diffusion in the classical picture and the Anderson localization anderson1958absence in the quantum mechanics of disordered systems burda2009localization . In this sense, our method tries to find a “mesoscopic" description that best fits the structure of the input data. Importantly, we note that the unnormalized MET matrix tends to focus on the densely connected areas or hubs. In contrast, the normalized one tends to choose the locally representative nodes and leave out the equally well-connected nodes in the hubs. This observation leads us to propose an improved pooling strategy that balances the influencers at both the global and local levels.

Figure 2: Top 20% nodes (shown in blue) by different measures of node importance of an RSA pattern from PointPattern dataset. From left to right are results from: Degree Centrality, Eigenvector Centrality, MET matrix without normalization, MET matrix and Hybrid PANPool.

Hybrid PANPool

To combine the contribution of the local motifs and the global importance, we propose a hybrid PAN pooling (denoted by PANPool) using a simple linear model. The global importance can be represented by, but is not limited to, the strength of the input signal itself. More precisely, we project the feature $X$ by a trainable parameter vector $p$ and combine it with the diagonal of the MET matrix to obtain a score vector

\mathrm{score} = X p + \beta\, \operatorname{diag}(M).     (8)

Here $\beta$ is a real learnable parameter that controls the emphasis on these two potentially competing factors. PANPool then selects a fraction of the nodes ranked by this score, and outputs the pooled feature array $\tilde X$ and the corresponding adjacency matrix $\tilde A$. The new node score in (8) jointly considers both node features (at the global level) and graph structure (at the local level). In Figure 2, PANPool tends to select nodes that are important both locally and globally. We also tested alternative designs under the same consideration; see the supplementary material for details.
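A minimal sketch of the hybrid score (8) and the subsequent top-fraction node selection is given below, assuming dense tensors; the pooling ratio and the returned induced adjacency submatrix are illustrative design choices.

```python
import torch

def hybrid_panpool(x, adj, M, p, beta, ratio=0.5):
    """Score nodes by X p + beta * diag(M) and keep the top `ratio` fraction.

    x: (N, F) node features, adj: (N, N) adjacency, M: (N, N) MET matrix,
    p: (F,) trainable projection vector, beta: scalar (learnable in practice).
    """
    score = x @ p + beta * torch.diag(M)        # Eq. (8)
    k = max(1, int(ratio * x.size(0)))
    perm = score.topk(k).indices                # indices of the kept nodes
    return x[perm], adj[perm][:, perm], perm    # pooled features, induced adjacency

# Toy usage.
N, F = 6, 4
x, adj, M = torch.randn(N, F), torch.rand(N, N), torch.rand(N, N)
p, beta = torch.randn(F), torch.tensor(0.5)
x_pooled, adj_pooled, kept = hybrid_panpool(x, adj, M, p, beta, ratio=0.5)
```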

4 Related Works

Graph neural networks have received much attention recently Survey_Battaglia ; LiTaBrZe2015 ; scarselli2009graph ; Survey_ZhangCQ ; Survey_ZhuWW ; Survey_SunMS . For graph convolutions, many works take account of only the first order of the adjacency matrix in the spatial domain, or of the graph Laplacian in the spectral domain. Bruna et al. BrZaSzLe2013 first proposed graph convolution using the Fourier method, which is, however, computationally expensive. Many different methods have been proposed to overcome this difficulty DCNN_2016 ; ChZhSo2018 ; ChMaXi2018fastgcn ; defferrard2016convolutional ; gilmer2017neural ; hamilton2017inductive ; KiWe2017 ; Monti_etal2017 ; Graph-CNN_2017 ; GWNN ; GIN . Another vital stream considers the attention mechanism velivckovic2017graph , which infers the interaction between nodes without using a diffusion-like picture. Some other GNN models use multi-scale information and higher-order adjacency matrices sami2018watch ; mixhop ; ngcn ; flam2020neural ; klicpera2019diffusion ; LNet ; SGC . Compared to the generic diffusion picture node2vec_2016 ; DeepWalk_2014 ; tang2015line , the maximal entropy random walk has already shown excellent performance on link prediction li2011link and community detection ochab2013maximal tasks. Many popular models can be related to, or viewed as, explicit realizations of our framework. We can interpret the MET matrix as an operator that acts on the graph input and works as a kernel allocating appropriate weights among the neighbors of a given node. This mechanism is similar to the attention mechanism velivckovic2017graph , while we restrict the functional form of $M$ based on physical intuitions and preserve a compact form. Although the number of features is kept unchanged when applying $M$, one can easily concatenate the aggregated information of neighbors as in GraphSAGE hamilton2017inductive or GAT velivckovic2017graph . Importantly, the best choice of the cutoff $L$ reveals the intrinsic dynamics of the graph. In particular, by choosing $L = 1$, model (7) is essentially the GCN model KiWe2017 ; the trick of adding self-loops is automatically realized in the higher powers of $A$. By replacing the powers of $A$ in (7) with powers of the GRW transition matrix, we can easily transform our model into a multi-step GRW version, which is indeed the format of LanczosNet LNet . The preliminary ideas about PAN convolution and its application to node classification have been presented at an ICML workshop MaLiWa2019 . This paper focuses on path integral based convolution and pooling for classification and regression tasks at the graph level.

Graph pooling is another crucial step of a GNN for producing a fixed-size output in graph classification and regression tasks. Researchers have proposed many pooling methods from different aspects. For example, one can merely consider node features or node embeddings duvenaud2015convolutional ; gilmer2017neural ; vinyals2015order ; zhang2018end . These global pooling methods do not utilize the hierarchical structure of the graph. One way to reinforce the learning ability is to build a data-dependent pooling layer with trainable operations or parameters cangea2018towards ; gao2019graph ; knyazev2019understanding ; lee2019self ; ying2018hierarchical . One can incorporate more edge information in graph pooling diehl2019towards ; Yuan2020StructPool , or use spectral methods and pool in the Fourier or wavelet domain ma2019graph ; noutahi2019towards ; wang2020haargraph . PANPool is a method that takes both feature and structure into account. Finally, it does not escape our notice that dropping paths could represent an efficient way to achieve dropout.

5 Experiments

In this section, we present the test results of PAN on various datasets for graph classification tasks and compare its performance with some existing GNN methods. All the experiments were performed using PyTorch Geometric fey2019fast and run on a server with an Intel(R) Core(TM) i9-9820X CPU @ 3.30GHz, an NVIDIA GeForce RTX 2080 Ti, and an NVIDIA TITAN V GV100.

5.1 PAN on Graph Classification Benchmarks

Datasets and baseline methods

We test the performance of PAN on five widely used benchmark datasets for graph classification tasks KKMMN2016 : two protein graph datasets, PROTEINS and PROTEINS_full borgwardt2005protein ; dobson2003distinguishing ; one mutagen dataset, MUTAGEN (full name Mutagenicity) riesen2008iam ; kazius2005derivation ; one dataset of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines, NCI1 wale2008comparison ; and one dataset of molecular compounds screened for activity against HIV, AIDS riesen2008iam . These datasets cover different domains, sample sizes, and graph structures, enabling us to obtain a comprehensive understanding of PAN's performance in various scenarios. Specifically, the number of data samples ranges from 1,113 to 4,337, the average number of nodes ranges from 15.69 to 39.06, and the average number of edges ranges from 16.20 to 72.82; see a detailed statistical summary of the datasets in the supplementary material. In Table 1 we compare PAN with existing GNN models built by combining graph convolution layers GCNConv KiWe2017 , SAGEConv hamilton2017inductive , GATConv velivckovic2017graph , or SGConv Wu2019Simplifying , with graph pooling layers TopKPool, SAGPool lee2019self , EdgePool ma2019graph , or ASAPool ranjan2019asap .

Setting

In each experiment, we split each dataset 80%/20% for training and test. All GNN models shared exactly the same architecture: Conv($d$-512) + Pool + Conv(512-256) + Pool + Conv(256-128) + FC(128-$c$), where $d$ is the feature dimension and $c$ is the number of classes. We give the choice of hyperparameters for these layers in the supplementary material. We evaluate the performance by the percentage of correctly predicted labels on test data. Specifically for PAN, we compared different choices of the cutoff $L$ (between 2 and 7) and reported the one that achieved the best result (shown in the brackets of Table 1).
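For concreteness, a minimal sketch of this shared architecture is shown below using the off-the-shelf GCNConv and TopKPooling layers of PyTorch Geometric; the activation functions, graph-level readout, and pooling ratio are assumptions not specified in the text, and the PAN models replace the convolution and pooling blocks with PANConv and PANPool.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, TopKPooling, global_max_pool

class BaselineGNN(torch.nn.Module):
    """Conv(d-512) + Pool + Conv(512-256) + Pool + Conv(256-128) + FC(128-c)."""

    def __init__(self, in_dim: int, num_classes: int):
        super().__init__()
        self.conv1, self.pool1 = GCNConv(in_dim, 512), TopKPooling(512)
        self.conv2, self.pool2 = GCNConv(512, 256), TopKPooling(256)
        self.conv3 = GCNConv(256, 128)
        self.fc = torch.nn.Linear(128, num_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x, edge_index, _, batch, _, _ = self.pool1(x, edge_index, batch=batch)
        x = F.relu(self.conv2(x, edge_index))
        x, edge_index, _, batch, _, _ = self.pool2(x, edge_index, batch=batch)
        x = F.relu(self.conv3(x, edge_index))
        x = global_max_pool(x, batch)          # graph-level readout
        return self.fc(x)                      # class logits
```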

Results

Table 1 reports classification test accuracy for several GNN models. PAN has excellent performance on all datasets, achieves top accuracy on four of the five datasets, and in some cases improves the state of the art by a few percentage points. Even for MUTAGEN, PAN still has the second-best performance. Most interestingly, the optimal choice of the highest order $L$ of the MET matrix varies for different types of graph data. This confirms that the flexibility of PAN enables it to learn and adapt to the most natural representation of the given graph data.

Additionally, we also tested PAN on graph regression tasks such as QM7 and achieved excellent performances. See supplementary material for details.

Method PROTEINS PROTEINSF NCI1 AIDS MUTAGEN
GCNConv + TopKPool 67.71 68.16 50.85 79.25 58.99
SAGEConv + SAGPool 64.13 70.40 64.84 77.50 67.40
GATConv + EdgePool 64.57 62.78 59.37 79.00 62.33
SGConv + TopKPool 68.16 69.06 50.85 79.00 63.82
GATConv + ASAPool 64.57 65.47 50.85 79.25 56.68
SGConv + EdgePool 70.85 69.51 56.33 79.00 70.05
SAGEConv + ASAPool 58.74 58.74 50.73 79.25 56.68
GCNConv + SAGPool 59.64 72.65 50.85 78.75 67.28
PANConv+PANPool (ours) 73.09 (1) 72.65 (1) 68.98 (3) 92.75 (2) 69.70 (2)
Table 1: Performance comparison for graph classification tasks (test accuracy in percentage; bold font is used to highlight the best performance in the list; the value in brackets is the cutoff used in the MET matrix.)

5.2 PAN for Point Distribution Recognition

A new classification dataset for point pattern recognition

People have proposed many graph neural network architectures; however, there are still insufficient well-accepted datasets to assess their relative strengths hu2020open . Despite being popular, many datasets suffer from a lack of understanding of the underlying mechanism, such as whether one can theoretically guarantee that a graph representation is proper. These datasets are usually not controllable either; many different preprocessing tricks might be needed, such as zero padding. Consequently, reproducibility might be compromised.

Figure 3: From left to right: Graph samples generated from HD, Poisson and RSA point processes in PointPattern dataset.

In order to tackle this challenge, we introduce a new graph classification dataset constructed from simple point patterns in statistical mechanics. We simulated three point patterns in 2D: hard disks in equilibrium (HD), a Poisson point process, and random sequential adsorption (RSA) of disks. The HD and Poisson distributions can be seen as simple models that describe the microstructures of liquids and gases hansen1990theory , while RSA is a nonequilibrium stochastic process that introduces new particles one by one subject to nonoverlapping conditions. These systems are well known to be structurally different while being easy to simulate, thus providing a solid and controllable classification task. For each point pattern, the particles are treated as nodes, and edges are subsequently drawn according to whether two particles are within a threshold distance. We name the dataset PointPattern. See Figure 3 for an example of the three types of resulting graphs. The volume fraction (covered by particles) of HD is fixed at 0.5, while we tune the RSA volume fraction to control its similarity to the other two distributions (a Poisson point pattern corresponds to volume fraction 0). As the RSA volume fraction approaches 0.5, RSA patterns become harder to distinguish from HD. We use the degree as the feature for each node. This allows us to generate a series of graph datasets of varying difficulty as classification tasks.
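A minimal sketch of the point-pattern-to-graph conversion described above is shown here: particles become nodes, pairs within a threshold distance become edges, and each node's feature is its degree. The uniform (Poisson-like) sampler and the threshold value are illustrative assumptions; the actual dataset generator also simulates HD and RSA patterns in a periodic box (see Appendix B.1).

```python
import numpy as np

def point_pattern_to_graph(points: np.ndarray, threshold: float):
    """Connect particles closer than `threshold`; use node degree as the feature."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                    # pairwise distances
    adj = ((dist < threshold) & ~np.eye(n, dtype=bool)).astype(float)
    degree = adj.sum(axis=1, keepdims=True)                 # node feature: degree
    return adj, degree

# Example: a Poisson-like point pattern (uniform points) in the unit square.
rng = np.random.default_rng(0)
points = rng.random((200, 2))
adj, features = point_pattern_to_graph(points, threshold=0.08)
```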

Figure 4: Comparison of validation loss and accuracy of PAN, GCN and GIN on PointPattern under similar network architectures with 10 repetitions.

Setting

We tested the PANConv+PANPool model on PointPattern datasets of three different RSA volume fractions, and compared it with two other GNN models which use GCNConv+TopKPool or GINConv+TopKPool as basic architecture blocks cangea2018towards ; gao2019graph ; KiWe2017 ; knyazev2019understanding ; GIN . Each PointPattern dataset is a 3-class classification problem over 15,000 graphs (5,000 of each type) with sizes varying between 100 and 1,000 nodes. All GNN models use the same network architecture: three units of one graph convolutional layer plus one graph pooling layer, followed by fully connected layers. In the GCN and GIN models, we also use global max pooling to compress the node dimension to one before the fully connected layer. We split the data into training, validation, and test sets of size 12,000, 1,500, and 1,500. We fix the number of neurons in the convolutional layers to 64; the learning rate and weight decay are set to 0.001 and 0.0005.

PointPattern GINConv + SAGPool GCNConv + TopKPool PANConv + PANPool (ours)
90.9±2.95 92.9±3.21 99.0±0.30 (4)
86.7±3.30 89.3±3.31 97.6±0.53 (4)
80.2±3.80 85.1±4.06 94.4±0.55 (4)
Table 2: Test accuracy (in percentage, mean ± SD) of PAN, GIN and GCN on three types of PointPattern datasets with different difficulties, trained for up to 20 epochs. The value in brackets is the cutoff L used in the MET matrix.

Results

Table 2 shows the mean and SD of the test accuracy of the three networks on the three PointPattern datasets. PAN outperforms the GIN and GCN models on all datasets by 5 to 10 percentage points in accuracy, while significantly reducing the variance. We observe that PAN's advantage persists over varying task difficulties, which may be due to the consideration of higher-order paths (here $L = 4$). We compare the validation loss and accuracy trends during training of PANConv+PANPool with GCNConv+TopKPool in Figure 4. It illustrates that the learning and generalization capabilities of PAN are better than those of the GCN and GIN models. The loss of PAN decays to much smaller values early on, while the accuracy reaches a higher plateau more rapidly. Moreover, the loss and accuracy of PAN both have much smaller variances, which can be seen most evidently after epoch four. In this perspective, PAN provides a more efficient and stable learning model for the graph classification task. Another intriguing pattern we notice is that the learned weights are concentrated on intermediate powers of the adjacency matrix. This suggests that what differentiates these graph structures is the higher orders of the adjacency matrix, or physically, the pair correlations at intermediate distances. It may explain why PAN performs better than GCN, which uses only $A$ in its model.

6 Conclusion

We propose a path integral based GNN framework (PAN), which consists of self-consistent convolution and pooling units, the latter being closely related to the subgraph centrality. PAN can be seen as a class of generalizations of GNNs. PAN achieves excellent performance on various graph classification and regression tasks, while demonstrating a fast convergence rate and great stability. We also introduce a new graph classification dataset, PointPattern, which can serve as a new benchmark.

References

  • [1] Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alexander A Alemi. Watch your step: Learning node embeddings via graph attention. In NeurIPS, pages 9180–9190, 2018.
  • [2] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Hrayr Harutyunyan, Nazanin Alipourfard, Kristina Lerman, Greg Ver Steeg, and Aram Galstyan. Mixhop: Higher-order graph convolution architectures via sparsified neighborhood mixing. In ICML, 2019.
  • [3] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, and Joonseok Lee. N-GCN: multi-scale graph convolutionfor semi-supervised node classification. In UAI, 2019.
  • [4] Han Altae-Tran, Bharath Ramsundar, Aneesh S Pappu, and Vijay Pande. Low data drug discovery with one-shot learning. ACS Central Science, 3(4):283–293, 2017.
  • [5] Philip W Anderson. Absence of diffusion in certain random lattices. Physical Review, 109(5):1492, 1958.
  • [6] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In NIPS, pages 1993–2001, 2016.
  • [7] Albert-László Barabási et al. Network Science. Cambridge University Press, 2016.
  • [8] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • [9] Lorenz C Blum and Jean-Louis Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database gdb-13. Journal of the American Chemical Society, 131(25):8732–8733, 2009.
  • [10] Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
  • [11] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
  • [12] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • [13] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In ICLR, 2014.
  • [14] Zdzislaw Burda, Jarek Duda, Jean-Marc Luck, and Bartek Waclaw. Localization of the maximal entropy random walk. Physical Review Letters, 102(16):160602, 2009.
  • [15] Cătălina Cangea, Petar Veličković, Nikola Jovanović, Thomas Kipf, and Pietro Liò. Towards sparse hierarchical graph classifiers. In NeurIPS Workshop on Relational Representation Learning, 2018.
  • [16] Jianfei Chen, Jun Zhu, and Le Song. Stochastic training of graph convolutional networks with variance reduction. In ICML, pages 941–949, 2018.
  • [17] Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: fast learning with graph convolutional networks via importance sampling. In ICLR, 2018.
  • [18] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
  • [19] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pages 3844–3852, 2016.
  • [20] Frederik Diehl, Thomas Brunner, Michael Truong Le, and Alois Knoll. Towards graph pooling by edge contraction. In ICML Workshop on Learning and Reasoning with Graph-Structured Representation, 2019.
  • [21] Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783, 2003.
  • [22] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, pages 2224–2232, 2015.
  • [23] Ernesto Estrada and Juan A Rodriguez-Velazquez. Subgraph centrality in complex networks. Physical Review E, 71(5):056103, 2005.
  • [24] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
  • [25] Richard P Feynman. Space-time approach to non-relativistic quantum mechanics. Reviews of Modern Physics, 20:367–387, Apr 1948.
  • [26] Richard P Feynman, Albert R Hibbs, and Daniel F Styer. Quantum mechanics and path integrals. Courier Corporation, 2010.
  • [27] Daniel Flam-Shepherd, Tony Wu, Pascal Friederich, and Alan Aspuru-Guzik. Neural message passing on high order paths. arXiv preprint arXiv:2002.10413, 2020.
  • [28] Hongyang Gao and Shuiwang Ji. Graph U-Nets. In ICML, pages 2083–2092, 2019.
  • [29] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In ICML, pages 1263–1272, 2017.
  • [30] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, pages 855–864, 2016.
  • [31] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, pages 1024–1034, 2017.
  • [32] Jean-Pierre Hansen and Ian R McDonald. Theory of simple liquids. Elsevier, 1990.
  • [33] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
  • [34] Jeroen Kazius, Ross McGuire, and Roberta Bursi. Derivation and validation of toxicophores for mutagenicity prediction. Journal of Medicinal Chemistry, 48(1):312–320, 2005.
  • [35] Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark data sets for graph kernels, 2020. http://www.graphlearning.io/.
  • [36] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • [37] Hagen Kleinert. Path integrals in quantum mechanics, statistics, polymer physics, and financial markets. World scientific, 2009.
  • [38] Johannes Klicpera, Stefan Weißenberger, and Stephan Günnemann. Diffusion improves graph learning. In NeurIPS, pages 13354–13366, 2019.
  • [39] Boris Knyazev, Graham W Taylor, and Mohamed R Amer. Understanding attention and generalization in graph neural networks. In NeurIPS, 2019.
  • [40] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. Self-attention graph pooling. In ICML, pages 3734–3743, 2019.
  • [41] Rong-Hua Li, Jeffrey Xu Yu, and Jianquan Liu. Link prediction: the power of maximal entropy random walk. In CIKM, pages 1147–1156, 2011.
  • [42] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. ICLR, 2016.
  • [43] Renjie Liao, Zhizhen Zhao, Raquel Urtasun, and Richard S Zemel. Lanczosnet: Multi-scale deep graph convolutional networks. In ICLR, 2019.
  • [44] Yao Ma, Suhang Wang, Charu C. Aggarwal, and Jiliang Tang. Graph convolutional networks with EigenPooling. In KDD, pages 723–731, 2019.
  • [45] Zheng Ma, Ming Li, and Yu Guang Wang. PAN: Path integral based convolution for deep graph neural networks. In ICML Workshop on Learning and Reasoning with Graph-Structured Representation, 2019.
  • [46] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, pages 5425–5434, 2017.
  • [47] Mark Newman. Networks. Oxford university press, 2018.
  • [48] Emmanuel Noutahi, Dominique Beani, Julien Horwood, and Prudencio Tossou. Towards interpretable sparse graph representation learning with Laplacian pooling. arXiv preprint arXiv:1905.11577, 2019.
  • [49] JK Ochab and Zdzisław Burda. Maximal entropy random walk in community detection. The European Physical Journal Special Topics, 216(1):73–81, 2013.
  • [50] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In KDD, pages 701–710, 2014.
  • [51] Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, and Vijay Pande. Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072, 2015.
  • [52] Ekagra Ranjan, Soumya Sanyal, and Partha Pratim Talukdar. ASAP: Adaptive structure aware pooling for learning hierarchical graph representations. AAAI, 2020.
  • [53] Kaspar Riesen and Horst Bunke. IAM graph database repository for graph based pattern recognition and machine learning. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 287–297. Springer, 2008.
  • [54] Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O Anatole Von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters, 108(5):058301, 2012.
  • [55] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
  • [56] Felipe Petroski Such, Shagan Sah, Miguel Alexander Dominguez, Suhas Pillai, Chao Zhang, Andrew Michael, Nathan D Cahill, and Raymond Ptucha. Robust spatial filtering with graph convolutional neural networks. IEEE Journal of Selected Topics in Signal Processing, 11(6):884–896, 2017.
  • [57] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In WWW, pages 1067–1077, 2015.
  • [58] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
  • [59] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. In ICLR, 2015.
  • [60] Nikil Wale, Ian A Watson, and George Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14(3):347–375, 2008.
  • [61] Yu Guang Wang, Ming Li, Zheng Ma, Guido Montufar, Xiaosheng Zhuang, and Yanan Fan. Haar graph pooling. In ICML, 2020.
  • [62] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In ICML, pages 6861–6871, 2019.
  • [63] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr, Christopher Fifty, Tao Yu, and Kilian Q Weinberger. Simplifying graph convolutional networks. In ICML, 2019.
  • [64] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.
  • [65] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems.
  • [66] Bingbing Xu, Huawei Shen, Qi Cao, Yunqi Qiu, and Xueqi Cheng. Graph wavelet neural network. In ICLR, 2019.
  • [67] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR, 2019.
  • [68] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML, 2016.
  • [69] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In NeurIPS, pages 4800–4810, 2018.
  • [70] Hao Yuan and Shuiwang Ji. Structpool: Structured graph pooling via conditional random fields. In ICLR, 2020.
  • [71] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In AAAI, 2018.
  • [72] Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A survey. IEEE Transactions on Knowledge and Data Engineering, 2020.
  • [73] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.
  • [74] Jean Zinn-Justin. Path integral. Scholarpedia, 4(2):8674, 2009.

Appendix A Variations of PANPool

In the main text, we discussed the relation between the diagonal of the MET matrix and subgraph centrality, as well as the idea of combining structural information and signals to develop pooling methods. We study several alternatives of the Hybrid PANPool proposed in the paper and report experimental results on benchmark datasets.

First, we consider the subgraph centrality's direct counterpart under the PAN framework, i.e., the weighted sum of powers of $A$. Formally, we take the score to be the diagonal of the MET matrix before normalization, which is written as

\mathrm{score} = \operatorname{diag}\Big( \sum_{n=0}^{L} e^{-E(n)/T} A^n \Big).     (9)

Similarly, we can also combine this unnormalized MET matrix with projected features, i.e.,

\mathrm{score} = X p + \beta \operatorname{diag}\Big( \sum_{n=0}^{L} e^{-E(n)/T} A^n \Big).     (10)

This method also considers both graph structures and signals, while the measure of structural importance is at a global rather than local level.

We can also take simple approaches to mix structural information with signals. Most straightforwardly, we can employ the readily calculated convolved feature $MX$ to define the score. For example, the norm of each row of $MX$ can define a score vector. The score for node $i$ can be written as

\mathrm{score}_i = \| (M X)_{i,:} \|.     (11)

Finally, instead of using a parameterized linear combination of the MET matrix diagonal and the projected signals, we can apply the Hadamard product of the two contributions. The score then becomes

\mathrm{score} = \operatorname{diag}(M) \odot (X p).     (12)

We use PANUMPool, PANXUMPool, PANMPool, and PANXHMPool to denote these variations of PANPool corresponding to (9)–(12) in the following experimental results.
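For reference, the four scores (9)–(12) can be written in a few lines; the dense-tensor layout and parameter names are illustrative, and the norm in (11) is taken here to be the Euclidean norm of each row.

```python
import torch

def panpool_variant_scores(x, M_hat, M, p, beta):
    """Node scores for the PANPool variants.

    x: (N, F) features, M_hat: unnormalized weighted power sum,
    M: normalized MET matrix, p: (F,) projection vector, beta: scalar weight.
    """
    s_um  = torch.diag(M_hat)                   # Eq. (9): unnormalized MET diagonal
    s_xum = x @ p + beta * torch.diag(M_hat)    # Eq. (10): plus projected features
    s_m   = (M @ x).norm(dim=1)                 # Eq. (11): norm of each convolved row
    s_xhm = torch.diag(M) * (x @ p)             # Eq. (12): Hadamard combination
    return s_um, s_xum, s_m, s_xhm
```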

Appendix B Datasets and extended experiments

We put the PyTorch codes for experiments in the folder “codes” with dataset downloading and program execution instructions in “README.md”.

B.1 PointPattern

All simulations are performed in square simulation boxes with periodic boundary conditions. For hard disks, we use the corresponding RSA configurations as initial conditions and then perform an average of 10,000 Monte Carlo steps per particle to equilibrate the system. In the subsequent step of converting a point pattern to a graph, we do not consider the images of the simulation boxes; that is, we do not connect particles across the boundaries. The choice of the threshold is inevitably subjective. Here we use a fixed multiple of $r$ as the threshold, where $r$ is the radius of the corresponding hard disks with the same number density at volume fraction 0.5. This threshold is of the same order as the typical distance between two neighboring particles, which guarantees that the resulting graph is connected.

We list the summary statistics of the three PointPattern datasets used in the main text in Table 3. They can be downloaded from Google Drive. We also show an example in README.md of running PAN on PointPattern; the program includes downloading and preprocessing the PointPattern datasets.

PointPattern
#classes 3 3 3
#graphs 15,000 15,000 15,000
max #nodes 1000 1000 1000
min #nodes 100 100 100
avg #nodes 478 474 475
avg #edges 3265 3223 3220
Table 3: Summary information of PointPattern datasets.

B.2 PAN on Classification Benchmarks

Extended experiments on Classification Benchmark

We list the summary statistics of the benchmark graph classification datasets in Table 4. In Table 5, we report the classification test accuracy of the variations of PAN compared with other methods. All networks utilize the same architecture. The PAN models, in general, have excellent performance on all datasets. The table shows that the variations of PAN can achieve state-of-the-art performance on a variety of graph classification tasks and, in some cases, improve the state of the art by a few percentage points. In particular, PANConv+PANPool tends to perform better on average than other methods and variations, as presented in the main text, while among the alternative PAN pooling methods, PANConv+PANMPool tends to have the smallest SD.

Dataset MUTAG PROTEINS PROTEINSF NCI1 AIDS MUTAGEN
max #nodes 28 620 620 111 95 417
min #nodes 10 4 4 3 2 4
avg #nodes 17.93 39.06 39.06 29.87 15.69 30.32
# node attributes - 1 29 - 4 -
avg #edges 19.79 72.82 72.82 32.30 16.20 30.77
#graphs 188 1,113 1,113 4,110 2,000 4,337
#classes 2 2 2 2 2 2
Table 4: Summary statistics of benchmark graph classification datasets.
Method PROTEINS PROTEINSF NCI1 AIDS MUTAGEN
GCNConv + TopKPool 64.0±0.40 69.6±6.03 49.9±0.50 81.2±1.00 63.5±6.69
SAGEConv + SAGPool 70.5±3.95 63.0±2.34 64.0±3.61 79.5±2.02 67.6±3.24
GATConv + EdgePool 72.4±1.46 71.3±3.16 60.1±1.76 80.5±0.72 71.5±1.09
SGConv + TopKPooling 73.6±1.70 65.9±1.25 61.5±5.11 81.0±0.01 66.3±2.08
GATConv + ASAPooling 64.8±5.43 67.3±4.37 53.9±4.11 84.7±6.21 58.4±5.19
SGConv + EdgePooling 69.0±1.74 70.5±2.48 58.4±1.96 76.7±1.12 70.7±0.69
SAGEConv + ASAPooling 59.2±5.84 63.9±2.44 53.5±2.91 80.6±6.39 63.1±3.74
GCNConv + SAGPooling 71.5±2.72 68.6±2.25 52.2±8.87 83.1±1.10 68.9±5.80
PANConv+PANUMPool (Eq 9) 67.8±0.82 69.1±1.21 59.2±0.69 82.7±7.82 70.0±2.11
PANConv+PANXUMPool (Eq 10) 69.7±1.60 72.6±3.20 60.1±1.74 86.9±3.64 69.4±1.08
PANConv+PANMPool (Eq 11) 66.8±0.78 71.0±0.60 51.9±1.39 80.6±0.44 68.4±1.01
PANConv+PANXHMPool (Eq 12) 68.8±5.23 69.7±1.97 55.9±1.81 91.4±3.39 70.2±1.08
PANConv+PANPool 76.6±2.06 71.7±6.05 60.8±3.45 97.5±1.86 70.9±2.76
Table 5: Performance comparison for graph classification tasks (test accuracy in percentage, mean ± SD; bold font is used to highlight the best performance in the list; the cutoff L of all PAN models on the five datasets are , respectively).

B.3 Quantum Chemistry Regression

QM7

In this section, we test the performance of the PAN model on the QM7 dataset. QM7 has been used to measure the efficacy of machine-learning methods for quantum chemistry [9, 54]. The dataset contains 7,165 molecules, each represented by its Coulomb (energy) matrix and labeled with its atomization energy. The molecules have varying sizes and structures, with up to 23 atoms. We view each molecule as a weighted graph: atoms are nodes, and the Coulomb matrix of the molecule is the adjacency matrix. Since the nodes (atoms) themselves do not have feature information, we set the node feature to a constant vector with all components equal to one, so that the features are uninformative and the learning is mainly concerned with identifying the molecular structure. The task is to predict the atomization energy of each molecule graph, which boils down to a standard graph regression problem.

Method Test MAE
Multitask [51] 123.7±15.6*
RF [11] 122.7±4.2*
KRR [18] 110.3±4.7*
GC [4] 77.9±2.1*
GCNConv+TopKPool 43.6±0.98
PANConv+PANUMPool (Eq 9) 43.5±0.86 (1)
PANConv+PANXUMPool (Eq 10) 43.3±1.32 (2)
PANConv+PANMPool (Eq 11) 43.6±0.84 (2)
PANConv+PANXHMPool (Eq 12) 43.0±1.27 (1)
PANConv+PANPool 42.8±0.63 (1)
Table 6: Test mean absolute error (MAE) comparison on QM7, with the standard deviation over ten repetitions of the experiments. The value in brackets is the cutoff L. '*' indicates records retrieved from [64], and bold font is used to highlight the best performance in the list.

Experimental setting

In the experiment, we normalize the label values by subtracting the mean and scaling the SD to 1; the predicted outputs are then converted back to the original label domain (by re-scaling and adding the mean back). Following [29], we use mean squared error (MSE) as the loss for training and mean absolute error (MAE) as the evaluation metric for validation and test. We use splitting percentages of 80%, 10%, and 10% for training, validation, and testing. We set the hidden dimension of the PANConv and GCN layers to 64, the learning rate to 5.0e-4 for Adam optimization, and the maximal number of epochs to 50 with no early stopping. For better comparison, we repeat all experiments ten times on randomly shuffled datasets with different random seeds.
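The label normalization and the MSE-train / MAE-evaluate protocol described above can be sketched as follows; the tensors are placeholders rather than the actual QM7 data pipeline.

```python
import torch

# y: atomization energies of the training molecules (placeholder tensor).
y = torch.randn(7165)
mean, std = y.mean(), y.std()
y_norm = (y - mean) / std                      # train on normalized labels

mse = torch.nn.MSELoss()                       # training loss on normalized labels

def evaluate_mae(pred_norm: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Map predictions back to the original label domain, then compute MAE."""
    pred = pred_norm * std + mean
    return (pred - target).abs().mean()
```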

Comparison methods and results

We test and compare the performance (test MAE and validation loss) of PAN against a GNN model with GCNConv+SAGPool [36, 40] and other methods, including Multitask Networks (Multitask) [51], Random Forest (RF) [11], Kernel Ridge Regression (KRR) [18], and Graph Convolutional models (GC) [4]. In our test, each PAN model contains one PANConv layer plus one PAN pooling layer, followed by two fully connected layers. The GCN model has two units of GCNConv+SAGPool, followed by GCNConv plus global max pooling and one fully connected layer. For the other methods, we use their public results from [64] on QM7. In Table 6, we evaluate five PAN models on QM7 and compare them with these methods. The PAN models achieve the top average test MAE and a smaller SD than the other methods.