With the advent of data science in various application domains, data is no longer constrained to regular structures, like images and videos, but quite frequently lies on irregular structures represented by graphs (e.g., social networks). Consequently, a series of pioneering works have generalized the state-of-the-art deep learning models used for grid-like data to the hierarchical representation of irregularly structured data. Taking graph data as an example, a collection of spectral convolution networks (bruna2013spectral; defferrard2016convolutional; khasanova2017graph; kipf2016semi) and spatial convolution networks (hamilton2017inductive; wang2018dynamic; velikovi2018graph; zhang2018end; xu2018how; DBLP:conf/aistats/VashishthYBT19) have been developed recently to generalize the convolution operation. On the other hand, although it is an important module in multiscale representation learning, the pooling operator for graph neural networks (GNNs) has been mostly overlooked and deserves more attention.
Graph pooling attempts to use a few degrees of freedom (i.e.,) to summarize the original graph in terms of both graph topology and graph signal. One potential solution is to first extract the skeleton of the original graph and then aggregate information of the other graph parts into the skeleton. Intuitively, the skeleton should be strongly coupled with the other nodes in terms of either structure or signal information. The assignment of the other graph parts to skeleton nodes should then follow a criterion that measures the closeness of different graph parts. Following these general principles, several recent works attempt to design differentiable modules in graph neural networks to extract the skeleton of graphs either explicitly, such as gPool (DBLP:conf/icml/GaoJ19), SAGPool (DBLP:conf/icml/LeeLK19), and iPool (gao2019ipool), or implicitly, e.g., DIFFPOOL (ying2018hierarchical), and then coarsen the graph. However, these methods still have some limitations, either in the joint exploitation of the signal and topology of graph data, or in terms of storage and computational complexity.
Alternatively, spectral graph theory has also provided a large literature on graph analysis, which could potentially lead to graph coarsening schemes. For instance, eigenvalues and eigenvectors associated with a graph are effective in characterizing the topology information of graph data, and several spectral algorithms (von2007tutorial) have been proposed to partition graphs. However, several limitations of these spectral clustering methods hinder their generalization to the design of graph pooling operators. First, these methods consider graph topology information but ignore graph signals, which contain extra information about the graph data. Furthermore, the eigendecomposition of the Laplacian matrix associated with a graph is computationally expensive, and the subsequent k-means clustering algorithm is iterative. This makes it hard to adopt these operators as a building block of graph neural networks that should be able to deal with graphs of arbitrary topology simultaneously.
In this paper, we propose a strategy to generalize the spectral methods to the design of a new graph pooling operator, called ProxPool, by jointly considering the topology and signal information of graph data without an explicit eigendecomposition. We first introduce a proximity measure to evaluate the closeness of an arbitrary pair of nodes of a graph in terms of both the topology and signal information. Specifically, we design a structure-aware kernel on the basis of the eigenvectors of the Laplacian matrix associated with a graph, in order to measure the proximity of nodes that may not be directly connected with an edge. Most importantly, this measure is computed without the need for an explicit eigendecomposition. We further take the signal residing on the vertices of a graph into consideration to characterize the relationship between nodes with an affine transform and a Gaussian RBF kernel. On the basis of the proposed proximity measure that combines signal and topology information, we then propose a novel graph pooling operator consisting of an explicit graph skeleton extraction and a coarsened graph construction, as demonstrated in Fig. 1. It is adaptive and flexible as it can deal with pairs of nodes within diverse neighborhood ranges simultaneously. The proposed pooling operator is also interpretable, stackable, and easy to interleave with diverse graph neural networks, and it can handle graphs of arbitrary structures. It further achieves state-of-the-art performance on public benchmark graph datasets in terms of graph classification. Our main contributions are as follows:
We present a strategy to measure the closeness between two arbitrary vertices of a graph by jointly considering the signal and topology information of a graph.
To capture the structure proximity between pairs of nodes that are not necessarily connected with a direct edge, we design a structure-aware kernel exploiting spectral properties while avoiding computationally complex eigendecomposition.
With our node proximity measure, we design an adaptive and stackable graph pooling operator, which achieves state-of-the-art performance on several graph classification benchmark datasets.
2 Related Work
Several recent methods have been developed for generalizing pooling operators to graphs. For instance, gPool (DBLP:conf/icml/GaoJ19) introduces a trainable vector to obtain node footprint and downsamples the graph accordingly. However, gPool ignores the structure information of graphs. SAGPool (DBLP:conf/icml/LeeLK19) coarsens graphs with self-attention. It exploits one-hop structure in graphs by computing attention scores of nodes with a graph convolution operation, which however leads to similar attention scores of nodes within a neighborhood. As a result, the selected nodes usually concentrate in several specific neighborhoods. DIFFPOOL (ying2018hierarchical) generalizes the Galerkin operator (trottenberg2000multigrid) in algebraic multigrid with a learnable projection matrix and obtains the coarsened graph through projections. However, the number of parameters of DIFFPOOL depends on the number of vertices, which impedes its application to large graphs.
Alternatively, spectral graph theory provides a generalization of the frequency analysis on grid-like data to graph data (shuman2013emerging). Since the spectrum and associated eigenvectors of a graph characterize its topology well, a collection of graph processing methods manage to extract hierarchical representations of graphs. For instance, several spectral clustering algorithms (von2007tutorial; biyikoglu2007laplacian) are proposed for graph clustering. A multilevel recursive spectral bisection method is presented in (barnard1994fast). EigenPooling (DBLP:conf/kdd/0001WAT19) recently designs a graph pooling operator on the basis of the Graph Fourier Transform. It relies on spectral clustering to partition nodes and downsample graphs, and then assigns node signals with the truncated Graph Fourier coefficients of subgraphs. However, it involves computationally expensive and iterative steps, such as the spectral decomposition, which prevents its use as a building block of graph neural networks.
3 Preliminaries and Framework
We consider undirected graphs, and represent them as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. Specifically, $\mathcal{V}$ and $\mathcal{E}$ represent the set of vertices and the set of edges, respectively. The adjacency matrix $A \in \mathbb{R}^{N \times N}$, with $N = |\mathcal{V}|$, characterizes the graph topology with non-negative entries, and a non-zero value $A_{ij}$ corresponds to an edge connecting vertices $v_i$ and $v_j$, with value one for unweighted graphs or an actual edge weight for weighted graphs. The degree of vertices is characterized by a degree matrix $D$, a diagonal matrix with $D_{ii} = \sum_j A_{ij}$. The symmetric normalized Laplacian associated with a graph is defined as $L = I - D^{-1/2} A D^{-1/2}$, and its eigendecomposition is $L = U \Lambda U^{\top}$ with $U = [\mathbf{u}_1, \dots, \mathbf{u}_N]$ and $\Lambda$ being a diagonal matrix composed of $\lambda_1, \dots, \lambda_N$. Specifically, $\mathbf{u}_i$ is an eigenvector associated with eigenvalue $\lambda_i$, with $0 = \lambda_1 \le \dots \le \lambda_N \le 2$. The eigenvalues form the spectrum of the graph and the eigenvectors construct bases of the so-called Graph Fourier Transform (shuman2013emerging).
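As a small numerical illustration of these preliminaries, the sketch below builds the symmetric normalized Laplacian of a toy graph and computes its spectrum with NumPy. It assumes the standard definition $L = I - D^{-1/2} A D^{-1/2}$; the helper name `normalized_laplacian` and the toy path graph are illustrative choices, not part of the original text.

```python
import numpy as np

def normalized_laplacian(A):
    """Return L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    deg = A.sum(axis=1)
    # Guard against isolated vertices (degree 0) to avoid division by zero.
    inv_sqrt = np.zeros_like(deg, dtype=float)
    nonzero = deg > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(deg[nonzero])
    return np.eye(A.shape[0]) - inv_sqrt[:, None] * A * inv_sqrt[None, :]

# Toy undirected path graph on four vertices: 0 - 1 - 2 - 3 (an assumption).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = normalized_laplacian(A)

# For a symmetric matrix, eigh returns eigenvalues in ascending order
# together with orthonormal eigenvectors (the Graph Fourier basis).
eigvals, eigvecs = np.linalg.eigh(L)
```

The eigenvalues of the symmetric normalized Laplacian lie in the interval [0, 2], and the smallest one is zero for any graph, which the sketch lets one check directly.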
Generally, capital letters represent matrices, while bold lowercase letters indicate vectors. Furthermore, we employ a subscript to indicate variables or parameters belonging to the -th layer of the neural network architecture. For instance, for the graph with vertices in the -th neural network representation layer, indicates the signal of dimension residing on node , and represents the signal of the whole graph .
In this paper, we rely on graph convolution networks (GCNs), which usually consist of a stack of interlaced graph convolution layers and graph pooling layers, in order to extract multiscale representations of graph data. For the spatial convolution networks, the graph convolution operator usually adopts a neighborhood message aggregation as:
where is a graph shift operator (e.g., or ), indicates an aggregation function, denotes a non-linear activation function, and are learnable parameters. The graph pooling operator takes as input the adjacency matrix and the node signals of the original graph and generates those of the coarsened graphs. It can be generally formulated as
where denotes a coarsening matrix, which is the core of the design of graph pooling operators.
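To make the role of a coarsening matrix concrete, the following sketch applies one to a toy graph. It assumes the form commonly used in the graph pooling literature, namely a coarsened adjacency $S^{\top} A S$ and coarsened signals $S^{\top} X$ for an assignment matrix $S$ of size $N \times M$. This form, the function name `coarsen`, and the hard two-cluster assignment are assumptions made for illustration, not a restatement of the formulation above.

```python
import numpy as np

def coarsen(A, X, S):
    """Coarsen a graph with an assignment matrix S of shape (N, M).

    Assumed form (illustrative): A' = S^T A S and X' = S^T X.
    """
    return S.T @ A @ S, S.T @ X

# Path graph 0 - 1 - 2 - 3 with 2-dimensional node signals.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.arange(8, dtype=float).reshape(4, 2)

# Hard assignment: nodes {0, 1} -> coarse node 0, nodes {2, 3} -> coarse node 1.
S = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=float)

A_coarse, X_coarse = coarsen(A, X, S)
```

With a hard assignment, the off-diagonal entries of `A_coarse` count the edge weight between clusters, while the diagonal entries accumulate the weight inside each cluster. A soft or sparse assignment changes only the entries of `S`.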
4 Node Proximity
We first introduce a structure-aware kernel to characterize the connection strength between vertices of a graph in terms of graph topology. An RBF kernel then deals with signals residing on the vertices of the graph and complements the structure kernel with the information of node signals. Both kernels are used together to measure the proximity between pairs of nodes of a graph.
4.1 Structure-aware Kernels
The topology of a graph describes the relationship between nodes and is completely characterized by the adjacency matrix associated with a graph. With an adjacency matrix, we can obtain the direct connections between vertices but are unable to directly measure the closeness of two nodes without direct edges. In many situations, however, it is desirable to evaluate the closeness of nodes that are not only direct neighbors but connected within an -hop neighborhood. In this section, we introduce a strategy to measure the closeness of nodes within an -hop neighborhood in terms of the topology of the graph.
We propose to resort to a proxy smooth graph signal to evaluate the node proximity for a graph with arbitrary topology. If a graph signal is smooth, signals that reside on nearby vertices are similar, and we can then measure the closeness of vertices by computing the similarity of the signals they support. We can construct such smooth graph signals for arbitrary graphs with the help of the eigenvectors of their Laplacian matrices.
Specifically, for a symmetric normalized Laplacian matrix associated with a graph , we have
where and are respectively an eigenvalue and its corresponding eigenvector of . Let us now consider a signal with a real value to each vertex of . Equation (3) shows that is a smooth signal for , when is small. Note that is a vector of constant that corresponds to the eigenvalue . When only the first eigenvalues are kept, we can further obtain an -dimensional smooth node signal for . Here, depends on the spectrum of a graph. However, would not be adaptively found, as the spectrum varies with the diverse topologies of different graph data. Alternatively, we present a universal strategy to construct smooth graph signals. Instead of explicitly determining , we introduce a monotonically decreasing real function with and at . For arbitrary vertex , its smooth node signal is obtained by suppressing the amplitudes of eigenvectors corresponding to the large eigenvalues with .
In Proposition 1, we prove that the dimension of equals the number of eigenvalues satisfying .
Proposition 1. Given a graph without isolated vertices and a monotonically decreasing real function with and at , when eigenvalues of satisfy for , the smooth node signal of defined in Eq. (4.1) lies in a -dimensional subspace.
for without isolated vertices, which means that and are invertible. Thus, is linearly independent, as is invertible. Since , is also linearly independent. As a result, is a -dimensional signal for , . ∎
Proposition 1 implies that the dimension of depends on the function and the spectrum of graphs. Thus, an adaptive strategy is presented to determine . Measuring the similarity of the smooth node signals enables a fast implementation of the proximity measure without eigendecomposition. The cosine function is widely used to evaluate the similarity of variables, and its inner-product formulation enables a kernel-based implementation of the proposed proximity measure. The node proximity can thus be quantitatively measured with the similarity of the proxy smooth graph signals residing on two vertices.
With the proxy graph signal , we can obtain this proximity of all the pairs of nodes:
where is a diagonal matrix with the same diagonal values of and
indicates a filter in frequency domain with
We choose the filtering function as
Eq. (9) suggests that can be obtained without the explicit, computation-intensive eigendecomposition of the graph Laplacian. Notably, the proposed proximity measure is a generalization of the cluster kernel defined in (chapelle2003cluster), as is a Gram matrix that naturally derives a kernel. In comparison to the adjacency matrix , characterizes the topology of the graph with the -hop connections in addition to direct edges. The proposed proximity measure is also distinct from the normalized spectral clustering (shi2000normalized) that adopts the node signal for -means clustering of vertices. We exploit a kernel method to measure the closeness between nodes to avoid complex eigendecomposition and excessive iterations.
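The reason a spectral filter can be applied without eigendecomposition is that a polynomial of the eigenvalues corresponds to the same polynomial of the Laplacian itself. The sketch below checks this numerically for a hypothetical filter $g(\lambda) = (1 - \lambda/2)^k$, which is monotonically decreasing on $[0, 2]$. This filter is an assumption chosen for illustration and is not claimed to be the filtering function used in this paper; the function names are likewise illustrative.

```python
import numpy as np

def filter_via_eig(L, k):
    """Apply g(lambda) = (1 - lambda/2)^k through an explicit eigendecomposition."""
    lam, U = np.linalg.eigh(L)
    g = (1.0 - lam / 2.0) ** k
    # U * g scales column j of U by g[j], i.e. computes U diag(g).
    return (U * g) @ U.T

def filter_via_powers(L, k):
    """Apply the same filter using only matrix products: (I - L/2)^k."""
    M = np.eye(L.shape[0]) - L / 2.0
    return np.linalg.matrix_power(M, k)

# Path graph 0 - 1 - 2 - 3 and its symmetric normalized Laplacian.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
L = np.eye(4) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

K_eig = filter_via_eig(L, k=3)
K_pow = filter_via_powers(L, k=3)
```

Both routes give the same matrix up to floating-point error. With `k=3`, the entry linking the two end vertices of the path is positive even though they share no edge, which illustrates how a filtered Laplacian can capture multi-hop proximity.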
4.2 Node Signal Proximity
We now measure the proximity of nodes in terms of graph signals. The radial basis function (RBF) kernel is an effective method to calculate the similarity of two variables with an implicit mapping. Rather than directly applying the RBF kernel, we first project node signals to a low-dimensional subspace with an affine transform, which makes it possible to focus on specific components of the signals:
where is learnable with to reduce computation and . Then node proximity is computed with a Gaussian RBF kernel:
with as the precision. We can obtain the corresponding gram matrix that captures all the proximity values between pairs of nodes with
In this way, with the affine transform and the implicit mapping of the kernel trick, we are able to adaptively characterize the relationship of nodes in terms of node signals.
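The two steps of this subsection, an affine projection followed by a Gaussian RBF kernel, can be sketched as follows. The names `W`, `b` and `gamma` (the precision) are hypothetical, and the weights are random here rather than learned.

```python
import numpy as np

def rbf_gram(X, W, b, gamma):
    """Gram matrix of a Gaussian RBF kernel on affinely projected node signals.

    K[i, j] = exp(-gamma * ||z_i - z_j||^2) with z_i = x_i W + b.
    """
    Z = X @ W + b                        # project each node signal
    sq_norms = (Z ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (Z @ Z.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negative rounding errors
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # 5 nodes with 8-dimensional signals
W = rng.normal(size=(8, 3))   # projection to 3 dimensions (random, not learned)
b = np.zeros(3)

K_signal = rbf_gram(X, W, b, gamma=0.5)
```

The resulting matrix is symmetric with ones on the diagonal, and every entry lies in [0, 1], so it can be read directly as a pairwise similarity.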
4.3 Node Proximity
We finally present a strategy to measure node proximity by jointly considering the topology and signal information. In general, two nodes have high proximity when they are tightly interconnected in topology and the node signals they support are closely related. With structure-aware kernels and RBF kernels that respectively measure the proximity of nodes in terms of the topology and signal information, we implement this “AND Gate” and design the proximity measure by reconciling them with a multiplication:
for the whole graph , we have
where indicates the Hadamard product. The contribution of the topology and signal information can be indirectly but adaptively adjusted with the hyper-parameter in Eq. (11). Furthermore, the proximity measure can be adapted to exploit local and global information with the choice of neighborhood in the structure-aware kernel .
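The “AND Gate” behaviour of the element-wise product is easy to see on a small example. The two matrices below are made-up values standing in for a structure kernel and a signal kernel.

```python
import numpy as np

# Hypothetical structure-aware proximities (zero means no structural link in range).
K_struct = np.array([[1.0, 0.8, 0.0],
                     [0.8, 1.0, 0.5],
                     [0.0, 0.5, 1.0]])

# Hypothetical signal proximities from an RBF kernel.
K_signal = np.array([[1.0, 0.1, 0.9],
                     [0.1, 1.0, 0.6],
                     [0.9, 0.6, 1.0]])

# Hadamard (element-wise) product: high only when both factors are high.
K = K_struct * K_signal
```

Nodes 0 and 2 carry very similar signals but have no structural proximity, so their combined proximity is zero. Nodes 0 and 1 are structurally close but carry dissimilar signals, so their combined proximity is small.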
On the basis of node proximity, we design a novel graph pooling operator. We first introduce a strategy to evaluate the coupling strength of a vertex with other nodes and then present a graph downsampling operation with proximity-based seed node selection. Finally, we present a graph reduction strategy with soft-assignment to construct coarsened graphs towards hierarchical graph representation.
5.1 Graph Downsampling with Seed Node Selection
We adopt a similar downsampling method as (zhang2018end; DBLP:conf/icml/GaoJ19; DBLP:conf/icml/LeeLK19). To faithfully represent the original graph, the nodes selected for graph downsampling should be “strongly coupled” with other nodes or sufficiently representative.
Since our node proximity criterion reconciles the topology and signal information, we can use it to govern the seed node selection during graph downsampling. Specifically, we define the coupling factor for node as
where is the proximity measure defined in Eq. (13). measures the volume that gets from other nodes, and thereby the strength of its coupling. With normalization, we break the symmetry of the proximity of each pair of nodes, and the node with more extensive connections with other nodes different than (i.e., ) gains a greater coupling value (), which eventually favors the selection of strongly connected seed nodes.
For the whole graph, the coupling factor is reformulated as:
where returns a diagonal matrix with the diagonal elements of the input matrix of Eq. (13), is a diagonal matrix with , and indicates a vector of constant 1.
On the basis of the coupling factor of each vertex, we can finally select seed nodes that are “strongly coupled” with others by re-ordering vertices in terms of and keeping the top ones accordingly, i.e.,
where is a global ranking operator that returns the top nodes with the largest score and is dependent on the pooling ratio and the number of nodes.
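The selection step can be sketched as a plain top-k ranking. The score used below, the sum of a node's off-diagonal proximities, is a simplified stand-in for the coupling factor and omits the normalization described above. Rounding the number of kept nodes up with `ceil` is also an assumption, as is the function name `select_seeds`.

```python
import math
import numpy as np

def select_seeds(K, ratio):
    """Keep the ceil(ratio * N) nodes with the largest coupling score.

    The score is a simplified, hypothetical coupling factor: the total
    proximity of a node to all *other* nodes (diagonal removed).
    """
    off_diag = K - np.diag(np.diag(K))
    scores = off_diag.sum(axis=1)
    k = max(1, math.ceil(ratio * K.shape[0]))
    # Sort by descending score; a stable sort keeps ties in index order.
    order = np.argsort(-scores, kind="stable")
    return order[:k], scores

K = np.array([[1.0, 0.8, 0.1, 0.0],
              [0.8, 1.0, 0.6, 0.2],
              [0.1, 0.6, 1.0, 0.3],
              [0.0, 0.2, 0.3, 1.0]])

seeds, scores = select_seeds(K, ratio=0.5)
```

On this toy matrix, nodes 1 and 2 have the largest total proximity to the rest of the graph and are therefore kept as seed nodes.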
Table 1: Classification accuracy (mean ± standard deviation) of the baselines, ProxPool and its variants.
|Method||D&D||PROTEINS||NCI1||NCI109||MUTAGENICITY|
|DIFFPOOL||79.19 ± 3.35||74.96 ± 4.14||74.62 ± 2.04||74.60 ± 1.88||78.55 ± 1.87|
|gPool||79.32 ± 4.07||74.78 ± 4.02||75.64 ± 2.47||75.54 ± 2.00||80.33 ± 1.54|
|SAGPool||79.06 ± 3.96||75.09 ± 4.82||76.23 ± 2.26||75.80 ± 2.35||79.99 ± 2.09|
|EigenPool||78.89 ± 3.95||75.09 ± 3.51||76.57 ± 2.79||76.16 ± 1.94||80.10 ± 2.03|
|ProxPool-NT||79.19 ± 3.49||75.09 ± 4.09||77.77 ± 2.04||76.02 ± 2.04||80.41 ± 2.11|
|ProxPool-NS||80.17 ± 2.27||75.36 ± 4.22||77.87 ± 2.54||76.69 ± 2.67||80.71 ± 1.78|
|ProxPool||79.83 ± 3.18||75.67 ± 4.61||77.83 ± 2.39||77.02 ± 1.97||80.71 ± 1.72|
5.2 Coarsened Graph Construction
A coarsened graph can be constructed with the selected seed nodes. We first consider the aggregation of non-selected nodes to seed nodes. To preserve the locality, a seed node aggregates the information of non-selected nodes within its -hop neighborhoods in terms of their proximity. In order to sparsify the connection of the coarsened graphs, we utilize the sparsemax introduced in (martins2016softmax):
where denotes the index set of seed nodes. Correspondingly, the coarsening matrix , which indicates the assignment of vertices in the original graph to vertices in the coarsened graph, is obtained by:
where the delta function equals iff .
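For readers unfamiliar with sparsemax, the sketch below implements the closed-form projection onto the probability simplex described in (martins2016softmax). Unlike softmax, it can return exact zeros, which is what sparsifies the assignment. How the operator is applied to the proximity values of each node is not reproduced here, and the input vectors are arbitrary examples.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a score vector onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # scores in descending order
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, z.size + 1)
    # The support consists of the largest k with 1 + k * z_(k) > sum_{j<=k} z_(j).
    in_support = 1.0 + ks * z_sorted > cumsum
    k = ks[in_support][-1]
    tau = (cumsum[k - 1] - 1.0) / k          # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)

p = sparsemax([1.0, 0.0, -1.0])
q = sparsemax([0.5, 0.3, -2.0])
```

In the first example all the mass goes to the largest score, and in the second the clearly smaller third score receives exactly zero weight while the first two share the mass.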
After the seed node selection and the aggregation of non-selected nodes, we need to reduce the adjacency matrix of the original graph to another one defined on these seed nodes. In this way, we obtain the connections between nodes in the coarsened graph accordingly and extract multiscale representations with the following layers. For each pair of nodes in the coarsened graph, we construct their connection by taking into consideration the links between subgraphs of the original graph, each of which consists of a seed node and its associated non-selected nodes, and compute the corresponding new edge weight by weighted aggregation of these connections. Specifically, for nodes in the coarsened graph, the weight of the edge connecting them is computed as . If , there is no edge between node and node . In other words, the adjacency matrix with the connections in the coarsened graph is obtained as:
Furthermore, we assign the node signal to a seed node as its original node signal together with the aggregation of its associated nodes through a weighted summation, i.e.,
Note that the proposed pooling operator is generic and can be interlaced with graph convolution layers as well as other modules to extract hierarchical multiscale representations of graph data to solve a variety of tasks. It is differentiable with learnable parameters that can be trained together with other modules of the graph neural network using diverse gradient-based optimization methods.
We evaluate the proposed graph pooling operator and state-of-the-art baselines on graph classification tasks.
6.1 Experimental Settings
Datasets. We follow previous methods (DBLP:conf/icml/LeeLK19; DBLP:conf/kdd/0001WAT19) to conduct experiments on five large public benchmark graph classification datasets (), including D&D, PROTEINS, NCI1, NCI109 and MUTAGENICITY. The datasets can be downloaded from https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets. Statistics and properties of the datasets are summarized in Table 2. Node categorical features are adopted as the node signal.
Network architecture. We evaluate the proposed pooling operators with the help of deep graph convolution networks. Similar to (ying2018hierarchical; DBLP:conf/icml/LeeLK19; gao2019ipool), we integrate the proposed pooling operation into the GraphSAGE (hamilton2017inductive) framework. In the experiments, the network architecture consists of three convolution layers and two (proposed) pooling layers ([conv-pool]:
is the identity matrix and denotes learnable parameters. An normalization function is further utilized after each convolution layer to stabilize and accelerate the training process. The pooling operator then follows the convolution layer to coarsen graphs in accordance with the operator proposed in Section 5. Subsequently, a readout module is adopted to aggregate the graph features at different scales and generate the graph representation .
indicates the node-wise summation and maximum operators to aggregate node information. Finally, a prediction module consisting of two fully connected layers and a softmax layer predicts the class of the graph under study based on the graph representation.
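A readout built from node-wise summation and maximum can be sketched as below. Concatenating the two statistics for each scale and then summing the per-scale vectors is an assumption borrowed from common practice in hierarchical pooling architectures; the exact combination used here is not recoverable from the text, and the feature matrices are made-up values.

```python
import numpy as np

def readout(X):
    """Concatenate the node-wise sum and node-wise maximum of the features."""
    return np.concatenate([X.sum(axis=0), X.max(axis=0)])

# Hypothetical node features at two scales (3 nodes, then 2 nodes after pooling).
X_scale1 = np.array([[1.0, -2.0],
                     [3.0,  0.5],
                     [0.0,  4.0]])
X_scale2 = np.array([[ 2.0, 1.0],
                     [-1.0, 3.0]])

r1 = readout(X_scale1)
r2 = readout(X_scale2)

# Assumed combination across scales: element-wise sum of per-scale readouts.
graph_repr = r1 + r2
```

Because both statistics are invariant to node ordering and independent of the number of nodes, the resulting vector has a fixed length regardless of graph size and can feed the fully connected prediction module.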
Configurations. Following (DBLP:conf/icml/LeeLK19), we randomly split each dataset into training, validation and test sets with a ratio of 8:1:1. The trained model achieving the best performance on the validation set is selected for testing. We conduct 20 random splits for each dataset and report the classification accuracy on the test sets. Mean accuracy with standard deviation is used to alleviate the impact of splitting.
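The evaluation protocol can be sketched with the standard library alone. The dataset size of 1000 graphs, the use of the split index as the random seed, and the function name `split_indices` are all illustrative assumptions.

```python
import random

def split_indices(n, seed, ratios=(0.8, 0.1, 0.1)):
    """Randomly split range(n) into train/validation/test index lists (8:1:1)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # seeded so each split is reproducible
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]       # remainder goes to the test set
    return train, val, test

# Twenty independent random splits of a hypothetical 1000-graph dataset.
splits = [split_indices(1000, seed) for seed in range(20)]
```

For each split, the model with the best validation accuracy would be evaluated once on the test indices, and the mean and standard deviation over the 20 test accuracies would then be reported.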
In our experiments, each graph in a dataset is downsampled with the same pooling ratio . The pooling ratio is set to 0.25 on D&D and 0.3 on the others, in consideration of the larger size of the graphs in D&D. For the network architectures, each convolution layer consists of 64 hidden neurons, and the size of the low-dimensional features in the pooling layers obtained through the affine transformation defined in Eq. (10) is 16. The proposed models are implemented in PyTorch (paszke2017automatic), and the models are optimized with the Adam optimizer (DBLP:journals/corr/KingmaB14) with a batch size of 64. The learning rate is 0.001 on all the datasets, except for D&D, which uses 0.0001. We obtain the following optimal hyper-parameters through grid search: , , and weight decay .
Baselines. We compare our pooling operator with the recent state-of-the-art graph pooling operators for GCNs. DIFFPOOL (ying2018hierarchical) coarsens graphs with an assignment matrix generated by an extra branch of GCNs. Since this branch of GNNs to produce the assignment matrix is predefined and used for all graphs, the sizes of coarsened graphs in the same dataset are the same and are proportional to the maximum number of nodes. We set this proportion to 0.2 so that the average sizes of the coarsened graphs generated by different baselines are similar on most datasets. gPool (DBLP:conf/icml/GaoJ19) introduces a trainable vector to generate the footprint of each node and selects nodes accordingly. SAGPool (DBLP:conf/icml/LeeLK19) first applies attention mechanisms to graph pooling and exploits a graph convolution operation to generate self-attention scores. EigenPooling (DBLP:conf/kdd/0001WAT19) relies on spectral clustering algorithms to partition nodes and takes each cluster as a node of the coarsened graph. The node signal is obtained by projecting signals residing on each subgraph to its first eigenvectors. For fair comparison, all these baselines are re-implemented in the same framework as our pooling operator, except for the downsampling in DIFFPOOL, as discussed above.
6.2 Ablation Studies
ProxPool: The method proposed in Section 5.
ProxPool-NT: The structure-aware kernel is substituted with the adjacency matrix to exploit the topology information. The node proximity defined in Eq. (14) is reformulated as . Graph downsampling and reduction change accordingly.
ProxPool-NS: Only the structure-aware kernel is utilized to model node proximity, i.e., . Graph downsampling and reduction change accordingly.
6.3 Results and Discussions
The classification accuracies achieved by the baselines, ProxPool and its variants are compared in Table 1. The proposed pooling operators outperform all the baseline pooling operators on all five datasets. Fig. 2 shows that ProxPool yields a better selection of seed nodes than gPool and SAGPool, considering the coupling of seed nodes with non-selected nodes. Furthermore, the coarsened graphs produced by gPool and SAGPool consist of several separate subgraphs with only a few nodes, which suggests that they impede information propagation and extraction in the subsequent layers. As shown in Fig. 2 (c) and Fig. 4 (a), DIFFPOOL tends to generate complete coarsened graphs with a dense assignment matrix, which leads to the partial loss of the structure information and signal locality. In contrast, Fig. 4 (b) suggests that ProxPool adopts a sparse assignment matrix to exploit the signal and topology information within the -hop neighborhood of seed nodes. Thus, ProxPool is able to balance the connectivity of the structure and the locality of the signal of the coarsened graphs.
We further study the benefits of the proposed structure-aware kernel. Considering one-hop topology information in modeling the relations between nodes, ProxPool-NT is inferior to ProxPool on all the datasets. As illustrated in Fig. 3, the structure-aware kernel captures -hop topology, in addition to the one-hop relationship modeled with the adjacency matrix. These results demonstrate that the proposed structure-aware kernel can sufficiently exploit the topology information of graph data. We also evaluate the node proximity in terms of the graph signal. ProxPool-NS achieves degraded but competitive performance on most datasets, as it only considers the graph topology information. Fig. 3 shows that the proposed node proximity measure facilitates ProxPool by jointly considering the graph signal and graph topology.
The computational complexity of ProxPool is dominated by the computation of structure-aware kernel. Thus, its complexity would be by optimizing matrix multiplication with the Coppersmith-Winograd algorithm (coppersmith1987matrix). It can be further reduced with sparse implementation. In contrast, the computational complexity of eigendecomposition of a Laplacian matrix associated with is . These results show that ProxPool leverages the structure-aware kernel to efficiently consider the graph signal and graph topology for pooling.
In this paper, we propose a novel graph pooling operator based on a kernel-based measure of node proximity. This measure reconciles the topology and signal information and permits a quantitative evaluation of the closeness of any two nodes within a -hop neighborhood. ProxPool is shown to yield state-of-the-art performance in graph classification. In the future, we plan to employ the proposed node proximity in tasks such as node classification and community detection.