Hierarchical Graph Pooling with Structure Learning(AAAI-2020)
Graph Neural Networks (GNNs), which generalize deep neural networks to graph-structured data, have drawn considerable attention and achieved state-of-the-art performance in numerous graph-related tasks. However, existing GNN models mainly focus on designing graph convolution operations. The graph pooling (or downsampling) operations, which play an important role in learning hierarchical representations, are usually overlooked. In this paper, we propose a novel graph pooling operator, called Hierarchical Graph Pooling with Structure Learning (HGP-SL), which can be integrated into various graph neural network architectures. HGP-SL incorporates graph pooling and structure learning into a unified module to generate hierarchical representations of graphs. More specifically, the graph pooling operation adaptively selects a subset of nodes to form an induced subgraph for the subsequent layers. To preserve the integrity of the graph's topological information, we further introduce a structure learning mechanism to learn a refined graph structure for the pooled graph at each layer. By combining the HGP-SL operator with graph neural networks, we perform graph-level representation learning with a focus on the graph classification task. Experimental results on six widely used benchmarks demonstrate the effectiveness of our proposed model.
Deep neural networks with convolution and pooling layers have achieved great success in various challenging tasks, ranging from computer vision and natural language understanding to video processing. The data in these tasks are typically represented in Euclidean space (i.e., modeled as 2-D or 3-D tensors), and thus usually contain locality and order information for the convolution operations. However, in many real-world problems, a large amount of data, such as social networks, chemical molecules and biological networks, lie on non-Euclidean domains that can be naturally represented as graphs. Given the powerful capabilities of neural networks, it is quite appealing to generalize the convolution and pooling operations to graph-structured data.
Recently, there have been a myriad of attempts to generalize the convolution operations to arbitrary graphs, referred to as graph neural networks (GNNs for short). In general, these algorithms can be classified into two broad categories: spectral and spatial approaches. The spectral methods typically define the graph convolution operations based on the graph Fourier transform [4, 5, 19]. The spatial methods devise the graph convolution operations by aggregating node representations directly from each node's neighborhood [13, 24, 30, 25]. The majority of the aforementioned methods mainly involve transforming, propagating and aggregating node features across the graph, which fits the message passing scheme. GNNs have been applied to different types of graphs [30, 6], and have obtained outstanding performance in numerous graph-related tasks, including node classification, link prediction [27, 37] and recommendation.
Nevertheless, the pooling operations in graphs have not been extensively studied yet, though they play a pivotal part in learning hierarchical representations for the task of graph classification. The goal of graph classification is to predict the label associated with the entire graph by utilizing its node features and graph structure information, i.e., a graph-level representation is needed. GNNs were originally designed to learn meaningful node-level representations, thus a commonly adopted approach to generate a graph-level representation is to globally summarize all the node representations in the graph. Although workable, the graph-level representation generated in this way is inherently “flat”, since the entire graph structure information is neglected during this process. Furthermore, GNNs can only pass messages between nodes through edges, but cannot aggregate node information in a hierarchical way. Meanwhile, graphs often have different substructures and nodes play different roles, therefore they should contribute differently to the graph-level representation. For example, in protein-protein interaction graphs, certain substructures may represent specific functionalities, which are of great significance for predicting the characteristics of the whole graph. To capture both the graph’s local and global structure information, a hierarchical pooling process is needed.
There exists some very recent work that focuses on the hierarchical pooling procedure in GNNs [35, 11, 8, 10]. These models usually coarsen the graphs through grouping or sampling nodes into subgraphs level by level, thus the entire graph information is gradually reduced to the hierarchical induced subgraphs. However, the graph pooling operations still have room for improvement. In node grouping approaches, the hierarchical pooling methods [35, 8] suffer from high computational complexity, which require additional neural networks to downsize the nodes. In node sampling approaches, the generated induced subgraph [11, 20] might fail to preserve the key substructures and eventually lose the completeness of graph topological information. For instance, two nodes that are not directly connected but sharing many common neighbors in the original graph might become unreachable from each other in the induced subgraph, even if intuitively they ought to be “close” in the subgraph. Therefore, the distorted graph structure will hinder the message passing in subsequent layers.
To address the aforementioned limitations, we propose a novel graph pooling operator HGP-SL to learn hierarchical graph-level representations. Specifically, HGP-SL first adaptively selects a subset of nodes according to our defined node information score, which fully utilizes both the node features and graph topological information. In addition, the proposed graph pooling operation is a non-parametric step, therefore no additional parameters need to be optimized during this procedure. Then, we apply a structure learning mechanism with sparse attention to the pooled graph, aiming to learn a refined graph structure that preserves the key substructures in the original graph. We integrate the pooling operator into a graph convolutional neural network to perform graph classification, and the whole procedure can be optimized in an end-to-end manner. To summarize, the main contributions of this paper are as follows:
We introduce a novel graph pooling operator HGP-SL that can be integrated into various graph neural network architectures. Similarly to the pooling operations in convolutional neural networks, our proposed graph pooling operation is non-parametric and very easy to implement. Note that only the pooling process itself is non-parametric: the structure learning mechanism does have an attention parameter, so the overall HGP-SL operator is not non-parametric.
To the best of our knowledge, we are the first to design a structure learning mechanism for the pooled graph, which has the advantage of learning a refined graph structure to preserve the graph’s key substructures.
We conduct extensive experiments on six public datasets to demonstrate HGP-SL’s effectiveness as well as superiority compared to a range of state-of-the-art methods.
GNNs can be generally categorized into two branches: spectral and spatial approaches. The spectral methods typically define parameterized filters according to graph spectral theory. Early work first proposed to define convolution operations for graphs in the Fourier transform domain. Due to its heavy computation cost, this approach has difficulty scaling to large graphs. Later on, ChebNet improved its efficiency by approximating the K-polynomial filters through Chebyshev expansion. GCN further simplified ChebNet by truncating the Chebyshev polynomial to the first-order approximation of the localized spectral filters.
The spatial approaches design convolution operations by directly aggregating the node’s neighborhood information. Among them, GraphSAGE  proposed an inductive algorithm that can generalize to unseen nodes by aggregating its neighborhood content information. GAT  utilized attention mechanism to aggregate nodes’ neighborhood representations with different weights. JK-Net  leveraged flexible neighborhood ranges to enable better node representations. More details can be found in several comprehensive surveys on graph neural networks [39, 38, 32]. Nevertheless, the above mentioned two branches of GNNs are mainly designed for learning meaningful node representations, and unable to generate hierarchical graph representations due to the lack of pooling operations.
Pooling operations in GNNs can scale down the size of inputs and enlarge the receptive fields, thus giving rise to better generalization and performance. DiffPool proposed to softly assign nodes to a set of clusters using neural networks, which forms a dense cluster assignment matrix and is computationally expensive. gPool and SAGPool devised a top-K node selection procedure to form an induced subgraph for the next input layer. Though efficient, this procedure might lose the completeness of the graph structure information and result in isolated subgraphs, which will hamper the message passing process in subsequent layers. EdgePool designed a pooling operation by contracting the edges in the graph, but its flexibility is poor because it always pools roughly half of the total nodes. EigenPool introduced a pooling operator based on the graph Fourier transform, which controls the pooling ratio through spectral clustering and is also very time consuming.
In addition, there are also some approaches that perform global pooling. For instance, Set2Set  implemented the global pooling operation by aggregating information through LSTMs . DGCNN  pooled the graph according to the last channel of the feature map values which are sorted in the descending order. Graph topological based pooling operations are proposed in  and  as well, where Graclus method  is employed as a pooling module.
We are given a set of graph data $\mathcal{G} = \{G_1, G_2, \dots, G_n\}$, where the number of nodes and edges in each graph might be quite different. For an arbitrary graph $G_i = (\mathcal{V}_i, \mathcal{E}_i, X_i)$, we let $n_i$ and $e_i$ denote the number of nodes and edges, respectively. Let $A_i \in \mathbb{R}^{n_i \times n_i}$ be the adjacency matrix describing its edge connection information, and let $X_i \in \mathbb{R}^{n_i \times f}$ represent the node feature matrix, where $f$ is the dimension of node attributes. The label matrix $Y \in \mathbb{R}^{n \times c}$, with $c$ the number of classes, indicates the associated labels for each graph, i.e., if $G_i$ belongs to class $j$, then $Y_{ij} = 1$, otherwise $Y_{ij} = 0$. Since the graph structure and node numbers change between layers due to the graph pooling operation, we further represent the $i$-th graph fed into the $k$-th layer as $G_i^k$ with $n_i^k$ nodes. The adjacency matrix and hidden representation matrix are then denoted as $A_i^k \in \mathbb{R}^{n_i^k \times n_i^k}$ and $H_i^k \in \mathbb{R}^{n_i^k \times d}$. With the above notations, we formally define our problem as follows:
Input: Given a set of graphs $\mathcal{G}_L$ with its label information $Y_L$, the number of graph neural network layers $K$, the pooling ratio $r$, and the representation dimension $d$ in each layer.
Output: Our goal is to predict the unknown graph labels of $\mathcal{G} \setminus \mathcal{G}_L$ with a graph neural network in an end-to-end way.
The graph convolutional neural network (GCN) has been shown to be very efficient and has achieved promising performance in various challenging tasks. Thus, we choose GCN as our model’s building block and briefly review its mechanism in this subsection. Please note that our proposed HGP-SL operator can also be integrated into other graph neural network architectures like GraphSAGE and GAT. We will discuss this in the experiment section. The $k$-th layer in GCN takes graph $G_i$’s adjacency matrix $A_i^k$ and hidden representation matrix $H_i^k$ as input, and the next layer’s output is generated as follows:
$$H_i^{k+1} = \sigma\left((\tilde{D}_i^k)^{-\frac{1}{2}} \tilde{A}_i^k (\tilde{D}_i^k)^{-\frac{1}{2}} H_i^k W_k\right) \tag{1}$$
where $\sigma(\cdot)$ is the non-linear activation function and $\tilde{A}_i^k = A_i^k + I$ is the adjacency matrix with self-connections. $\tilde{D}_i^k$ is the diagonal degree matrix of $\tilde{A}_i^k$, and $W_k \in \mathbb{R}^{d_k \times d_{k+1}}$ is a trainable weight matrix. For ease of parameter tuning, we set the output dimension $d_{k+1} = d_k = d$ for all layers.
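The propagation rule in Equation (1) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `gcn_layer`, the choice of ReLU as the activation, and the toy inputs are all assumptions.

```python
import numpy as np

def gcn_layer(adj, h, weight):
    """One GCN step: ReLU(D~^-1/2 (A + I) D~^-1/2 H W). ReLU is an assumed activation."""
    a_tilde = adj + np.eye(adj.shape[0])      # adjacency with self-connections
    deg = a_tilde.sum(axis=1)                 # degrees of A + I (always >= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization, written with broadcasting instead of diagonal matrices.
    a_norm = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ h @ weight, 0.0)

# Toy path graph 0 - 1 - 2 with one-hot features (hypothetical data).
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
h = np.eye(3)
weight = np.ones((3, 2))
out = gcn_layer(adj, h, weight)
```

Because self-connections are added before normalization, every degree is at least one and the inverse square root is always defined.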
Figure 1 provides an overview of our proposed Hierarchical Graph Pooling with Structure Learning (HGP-SL) combined with a graph neural network, where graph pooling operations are added between graph convolution operations. The proposed HGP-SL operator is composed of two major components: 1) graph pooling, which preserves a subset of informative nodes and forms a smaller induced subgraph; and 2) structure learning, which learns a refined graph structure for the pooled subgraph. The advantage of our proposed structure learning lies in its capability to preserve the essential graph structure information, which facilitates the message passing procedure. As in this illustrative example, the pooled subgraph might contain isolated nodes that intuitively ought to be connected, which would hinder the information propagation in subsequent layers, especially when aggregating information from neighboring nodes. The whole architecture is a stack of convolution and pooling operations, thus making it possible to learn graph representations in a hierarchical way. Then, a readout function is utilized to summarize node representations at each level, and the final graph-level representation is the sum of the summaries of the different levels. Finally, the graph-level representation is fed into a Multi-Layer Perceptron (MLP) with a softmax layer to perform the graph classification task. In what follows, we give the details of the graph pooling and structure learning layers.
In this subsection, we introduce our proposed graph pooling operation to enable down-sampling on graph data. Inspired by [11, 20], the pooling operation identifies a subset of informative nodes to form a new but smaller graph. Here, we design a non-parametric pooling operation, which can fully utilize both the node features and graph structure information.
The key of our proposed graph pooling operation is to define a criterion that guides the node selection procedure. We therefore introduce a criterion named node information score to evaluate the information that each node contains given its neighborhood. In general, if a node’s representation can be well reconstructed by its neighborhood, it means this node can probably be removed in the pooled graph with negligible information loss. Here, we formally define the node information score as the Manhattan distance between the node representation itself and the one constructed from its neighbors:
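$$\mathbf{p} = \gamma(G_i^k) = \left\| \left(I_i^k - (D_i^k)^{-1} A_i^k\right) H_i^k \right\|_1 \tag{2}$$
where $A_i^k$ and $H_i^k$ are the adjacency and node representation matrices, respectively. $\|\cdot\|_1$ denotes the row-wise $\ell_1$ norm. $D_i^k$ represents the diagonal degree matrix of $A_i^k$, and $I_i^k$ is the identity matrix. Therefore, $\mathbf{p} \in \mathbb{R}^{n_i^k}$ encodes the information score of each node in the graph.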
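As a rough illustration of the node information score, the sketch below computes the Manhattan distance between each node's representation and the mean of its neighbors' representations. The function name and the toy graph are hypothetical, and treating an isolated node's neighborhood reconstruction as the zero vector is an assumption made here so that the inverse degree is always defined.

```python
import numpy as np

def node_information_score(adj, h):
    """Row-wise L1 norm of (I - D^-1 A) H: distance of each node from its neighbor mean."""
    adj = np.asarray(adj, dtype=float)
    h = np.asarray(h, dtype=float)
    deg = adj.sum(axis=1)
    # Assumption: nodes with degree 0 get inverse degree 0 (reconstruction = zero vector).
    inv_deg = np.divide(1.0, deg, out=np.zeros_like(deg), where=deg > 0)
    neighbor_mean = inv_deg[:, None] * (adj @ h)
    return np.abs(h - neighbor_mean).sum(axis=1)

# Path graph 0 - 1 - 2 with scalar features 1, 2, 3 (hypothetical data).
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
h = np.array([[1.0], [2.0], [3.0]])
scores = node_information_score(adj, h)
```

In this toy graph the middle node is exactly the average of its two neighbors, so its score is zero and it is the natural candidate for removal.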
After having obtained the node information score, we can now select the nodes that should be preserved by the pooling operator. To approximate the graph information, we choose to preserve the nodes that cannot be well represented by their neighbors, i.e., the nodes with relatively larger node information scores are preserved in the construction of the pooled graph, because they provide more information. Specifically, the graph nodes are first re-ordered based on the value of their node information scores, then we simply select a subset of top-ranked nodes as follows:
$$\mathrm{idx} = \text{top-rank}\left(\mathbf{p}, \lceil r \cdot n_i^k \rceil\right), \qquad H_i^{k+1} = H_i^k(\mathrm{idx}, :), \qquad A_i^{k+1} = A_i^k(\mathrm{idx}, \mathrm{idx}) \tag{3}$$
where $r$ is the pooling ratio and top-rank denotes the function that returns the indices of the top $n_i^{k+1} = \lceil r \cdot n_i^k \rceil$ values. $H_i^k(\mathrm{idx}, :)$ and $A_i^k(\mathrm{idx}, \mathrm{idx})$ perform the row and/or column extraction to form the node representation matrix and adjacency matrix of the induced subgraph. Thus, $H_i^{k+1}$ and $A_i^{k+1}$ represent the node features and graph structure information of the next layer $k+1$.
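The top-rank selection step can be sketched as follows. The helper name `top_rank_pool`, the use of a ceiling when turning the ratio into a node count, and the example scores are assumptions for illustration only.

```python
import math
import numpy as np

def top_rank_pool(adj, h, scores, ratio):
    """Keep the ceil(ratio * n) highest-scoring nodes and the subgraph they induce."""
    n_keep = math.ceil(ratio * len(scores))
    idx = np.argsort(-scores, kind="stable")[:n_keep]   # indices of the largest scores
    h_pool = h[idx]                                     # row extraction
    adj_pool = adj[np.ix_(idx, idx)]                    # row and column extraction
    return h_pool, adj_pool, idx

# Path graph 0 - 1 - 2 - 3 with hypothetical information scores.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
h = np.arange(8, dtype=float).reshape(4, 2)
scores = np.array([0.2, 1.5, 0.0, 0.9])
h_pool, adj_pool, idx = top_rank_pool(adj, h, scores, ratio=0.5)
```

Here nodes 1 and 3 are kept. They share node 2 as a common neighbor in the original graph, yet the induced adjacency matrix contains no edge between them, which is the kind of disconnection that structure learning is meant to repair.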
In this subsection, we present how our proposed structure learning mechanism learns a refined graph structure in the pooled graph. As illustrated in Figure 1, the pooling operation might result in highly related nodes being disconnected in the induced subgraph, which loses the completeness of the graph structure information and further hinders the message passing procedure. Meanwhile, a graph structure obtained from domain knowledge (e.g., a social network) or constructed by hand (e.g., a KNN graph) is usually non-optimal for the learning task in graph neural networks, due to lost or noisy information. To overcome this problem, prior work proposed to adaptively estimate the graph Laplacian using an approximate distance metric learning algorithm, which might lead to a locally optimal solution. Another approach learns the constructed graph structure for node label estimation; however, it generates a densely connected graph and is not applicable in our hierarchical graph-level representation learning scenario.
Here, we develop a novel structure learning layer, which learns a sparse graph structure through a sparse attention mechanism. For graph $G_i$’s pooled subgraph $G_i^k$ at its $k$-th layer, we take its structure information $A_i^k \in \mathbb{R}^{n_i^k \times n_i^k}$ and hidden representations $H_i^k \in \mathbb{R}^{n_i^k \times d}$ as input. Our target is to learn a refined graph structure $S_i^k$ that encodes the underlying pairwise relationship between each pair of nodes. Formally, we utilize a single-layer neural network parameterized by a weight vector $\mathbf{a} \in \mathbb{R}^{1 \times 2d}$. Then, the similarity score between nodes $v_p$ and $v_q$ calculated by the attention mechanism can be expressed as:
$$E_i^k(p,q) = \sigma\left(\mathbf{a}\left[H_i^k(p,:) \,\|\, H_i^k(q,:)\right]^{\top}\right) + \lambda \cdot A_i^k(p,q) \tag{4}$$
where $\sigma(\cdot)$ is the activation function like $\mathrm{ReLU}(\cdot)$ and $\|$ represents the concatenation operation. $H_i^k(p,:)$ and $H_i^k(q,:)$ indicate the $p$-th and $q$-th rows of matrix $H_i^k$, which denote the representations of nodes $v_p$ and $v_q$, respectively. Specifically, $A_i^k$ encodes the induced subgraph structure information, where $A_i^k(p,q) = 0$ if nodes $v_p$ and $v_q$ are not directly connected. We incorporate $A_i^k$ into our structure learning layer to bias the attention mechanism towards giving a relatively larger similarity score to directly connected nodes, while at the same time trying to learn the underlying pairwise relationships between disconnected nodes. $\lambda$ is a trade-off parameter between them.
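A minimal sketch of the similarity score computation is given below. It assumes ReLU as the activation and splits the weight vector into the part applied to the first node and the part applied to the second, which is equivalent to applying it to the concatenation; the function name and all numbers are hypothetical.

```python
import numpy as np

def similarity_scores(h, adj, a, lam):
    """E(p, q) = ReLU(a . [h_p || h_q]) + lam * A(p, q) for every node pair."""
    d = h.shape[1]
    left = h @ a[:d]     # contribution of the first node of each pair
    right = h @ a[d:]    # contribution of the second node of each pair
    raw = left[:, None] + right[None, :]   # equals a . [h_p || h_q]
    return np.maximum(raw, 0.0) + lam * adj

# Two connected nodes with one-hot features (hypothetical data and weights).
h = np.eye(2)
adj = np.array([[0.0, 1.0],
                [1.0, 0.0]])
a = np.array([1.0, -1.0, 0.5, 0.5])
scores = similarity_scores(h, adj, a, lam=2.0)
```

The additive `lam * adj` term raises the score of pairs that are already connected, while pairs without an edge still receive whatever score the attention term assigns.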
To make the similarity scores easily comparable across different nodes, we could normalize them across nodes using the softmax function:
$$S_i^k(p,q) = \frac{\exp\left(E_i^k(p,q)\right)}{\sum_{m=1}^{n_i^k} \exp\left(E_i^k(p,m)\right)} \tag{5}$$
However, the softmax transformation always produces non-zero values and thus results in a dense, fully connected graph, which may introduce a lot of noise into the learned structure. Hence, we propose to utilize the sparsemax function, which retains most of the important properties of the softmax function and in addition is able to produce sparse distributions. The $\mathrm{sparsemax}(\cdot)$ function returns the Euclidean projection of its input onto the probability simplex and can be formulated as follows:
$$S_i^k(p,q) = \mathrm{sparsemax}\left(E_i^k(p,q)\right) = \left[E_i^k(p,q) - \tau\left(E_i^k(p,:)\right)\right]_+ \tag{6}$$
where $[x]_+ = \max\{0, x\}$, and $\tau(\cdot)$ is the threshold function that returns a threshold according to the procedure shown in Algorithm 1. Thus, $\mathrm{sparsemax}(\cdot)$ preserves the values above the threshold and truncates the other values to zero, which yields a sparse graph structure. Similarly to the softmax function, $\mathrm{sparsemax}(\cdot)$ is also non-negative and sums to one, that is, $S_i^k(p,q) \geq 0$ and $\sum_{q} S_i^k(p,q) = 1$. The proof is available in the supplemental material.
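Since the threshold procedure is only referenced here, the following sketch shows one common sort-based way to compute sparsemax for a single score vector. It follows the standard formulation of the sparsemax projection and is not claimed to match the paper's Algorithm 1 line by line; the function name and inputs are illustrative.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a 1-D score vector onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # scores in descending order
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    in_support = 1.0 + ks * z_sorted > cumsum
    k = ks[in_support][-1]                   # size of the support
    tau = (cumsum[k - 1] - 1.0) / k          # threshold
    return np.maximum(z - tau, 0.0)          # values below the threshold become zero

sparse_row = sparsemax([2.0, 1.0, 0.1])
dense_row = sparsemax([0.1, 0.2, 0.3])
```

Applying the function to each row of the similarity matrix would give a row-stochastic structure matrix. In the first example only the largest score survives, whereas the second example keeps all three entries because the scores are close together.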
For large-scale graphs, it is computationally expensive to calculate the similarities between every pair of nodes when learning the structure $S_i^k$. If we further take the graph’s localization and smoothness properties into consideration, it is reasonable to limit the calculation to each node’s $h$-hop neighborhood (e.g., $h = 2$ or $3$). The computation cost of $S_i^k$ can therefore be greatly reduced.
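One way to restrict the similarity computation to a local neighborhood is to precompute a boolean mask of node pairs that lie within a given number of hops, and to score only those pairs. The sketch below is an assumption about how such a mask could be built; the function name is hypothetical, and which graph the hops are measured on (original or pooled) is left to the caller.

```python
import numpy as np

def h_hop_mask(adj, h_hops):
    """Boolean matrix whose (p, q) entry is True when q is within h_hops of p."""
    n = adj.shape[0]
    step = (np.asarray(adj) > 0) | np.eye(n, dtype=bool)   # zero or one hop
    reach = np.eye(n, dtype=bool)
    for _ in range(h_hops):
        # Extend the reachable set by one more hop.
        reach = (reach.astype(int) @ step.astype(int)) > 0
    return reach

# Path graph 0 - 1 - 2 - 3 (hypothetical data).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
mask2 = h_hop_mask(adj, 2)
mask3 = h_hop_mask(adj, 3)
```

Scores would then be computed only where the mask is True, so the cost grows with the size of the local neighborhoods rather than with the square of the number of nodes.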
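After having obtained the refined graph structure $S_i^k$, we conduct graph convolution and pooling operations in the following layers based on $H_i^k$ and $S_i^k$ (instead of $A_i^k$). Thus, Equation (1) can be simplified as follows:
$$H_i^{k+1} = \sigma\left(S_i^k H_i^k W_k\right) \tag{7}$$
Since the learned $S_i^k$ satisfies $\sum_{q} S_i^k(p,q) = 1$, the diagonal degree matrix $D_i^k$ has entries $D_i^k(p,p) = \sum_{q} S_i^k(p,q) = 1$ and thus degenerates to the identity matrix $I_i^k$. Similarly, the calculation of the node information score in Equation (2) can also be simplified as below:
$$\mathbf{p} = \gamma(G_i^k) = \left\| \left(I_i^k - S_i^k\right) H_i^k \right\|_1 \tag{8}$$
which makes our model very easy to implement.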
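With a row-stochastic learned structure, both the convolution and the node information score reduce to plain matrix products, because the degree matrix is the identity. The sketch below illustrates this under the assumption of a ReLU activation; the function names and the small structure matrix are made up for the example.

```python
import numpy as np

def refined_conv(s, h, weight):
    """Simplified convolution on the learned structure: ReLU(S H W)."""
    return np.maximum(s @ h @ weight, 0.0)

def refined_information_score(s, h):
    """Simplified score: row-wise L1 norm of (I - S) H, with no degree normalization."""
    return np.abs(h - s @ h).sum(axis=1)

# Hypothetical learned structure whose rows each sum to one.
s = np.array([[0.5, 0.5],
              [0.0, 1.0]])
h = np.array([[2.0], [4.0]])
weight = np.array([[1.0]])

h_next = refined_conv(s, h, weight)
scores = refined_information_score(s, h)
```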
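As demonstrated in Figure 1, the neural network architecture repeats the graph convolution and pooling operations several times, thus we observe multiple subgraphs of different sizes at each level: $H_i^1, H_i^2, \dots, H_i^K$. To generate a fixed-size graph-level representation, we devise a readout function that aggregates all the node representations in the subgraph. Here, we simply use the concatenation of mean-pooling and max-pooling in each subgraph as follows:
$$\mathbf{r}_i^k = \mathcal{R}(H_i^k) = \sigma\left(\frac{1}{n_i^k}\sum_{p=1}^{n_i^k} H_i^k(p,:) \,\Big\|\, \max_{p=1}^{n_i^k} H_i^k(p,:)\right) \tag{9}$$
where $\sigma$ is a nonlinear activation function, the maximum is taken feature-wise, and $\mathbf{r}_i^k \in \mathbb{R}^{2d}$. We then add the readout outputs of the different levels to form our final graph-level representation (in our experiments, we use a fixed-size node representation across all layers, so the readout outputs all have the same dimension):
$$\mathbf{z}_i = \mathbf{r}_i^1 + \mathbf{r}_i^2 + \dots + \mathbf{r}_i^K \tag{10}$$
which summarizes the graph representations of the different levels.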
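The readout and the level-wise summation can be sketched as below. ReLU is assumed as the nonlinear activation, and the two node representation matrices stand in for the outputs of two pooling levels; both are illustrative choices rather than values from the paper.

```python
import numpy as np

def readout(h):
    """Concatenate feature-wise mean-pooling and max-pooling, then apply ReLU (assumed)."""
    pooled = np.concatenate([h.mean(axis=0), h.max(axis=0)])
    return np.maximum(pooled, 0.0)

# Node representations at two levels with different node counts (hypothetical data).
levels = [
    np.array([[1.0, 2.0],
              [3.0, 4.0]]),
    np.array([[3.0, 4.0]]),
]

# Each readout has length 2d regardless of the node count, so the levels can be added.
z = np.sum([readout(h) for h in levels], axis=0)
```

The first level contributes `[2, 3, 3, 4]` and the second `[3, 4, 3, 4]`, so the graph-level vector keeps a fixed length even though the number of nodes shrinks from level to level.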
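Finally, we feed the graph-level representation into an MLP layer with a softmax classifier, and the loss function is defined as the cross-entropy of the predictions over the labels:
$$\hat{Y} = \mathrm{softmax}\left(\mathrm{MLP}(Z)\right), \qquad \mathcal{L} = -\sum_{i \in L} \sum_{j=1}^{c} Y_{ij} \log \hat{Y}_{ij} \tag{11}$$
where $Z$ stacks the graph-level representations $\mathbf{z}_i$, $\hat{Y}_{ij}$ represents the predicted probability that graph $G_i$ belongs to class $j$, and $Y_{ij}$ is the ground truth. $L$ denotes the training set of graphs that have labels.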
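For concreteness, the cross-entropy over one-hot labels can be sketched as follows. The logits stand in for the MLP output, which is not modeled here; the function name and the use of a log-softmax for numerical stability are choices made for this example.

```python
import numpy as np

def cross_entropy(logits, y_onehot):
    """Summed cross-entropy between softmax(logits) and one-hot labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)              # avoid overflow in exp
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.sum(y_onehot * log_probs)

# Two labeled graphs, two classes, uninformative logits (hypothetical data).
logits = np.zeros((2, 2))
labels = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
loss = cross_entropy(logits, labels)
```

With uniform predictions over two classes, each graph contributes `log 2` to the loss.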
| HGP-SL | 68.79 ± 2.11 | 84.91 ± 1.62 | 80.96 ± 1.26 | 78.45 ± 0.77 | 80.67 ± 1.16 | 82.15 ± 0.58 |
Table 2: Graph classification accuracy with standard deviation (in percentage). We use bold to highlight wins.
We adopt six commonly used public benchmarks (publicly available at https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets) for empirical studies. Statistics of the six datasets are summarized in Table 1, with more descriptions as follows: ENZYMES is a dataset of protein tertiary structures, and each enzyme belongs to one of the 6 EC top-level classes. PROTEINS and D&D are two protein graph datasets, where nodes represent the amino acids and two nodes are connected by an edge if they are less than 6 Angstroms apart. The label indicates whether or not a protein is a non-enzyme. NCI1 and NCI109 are two biological datasets screened for activity against non-small cell lung cancer and ovarian cancer cell lines, where each graph is a chemical compound with nodes and edges representing atoms and chemical bonds, respectively. Mutagenicity is a chemical compound dataset of drugs, which can be categorized into two classes: mutagen and non-mutagen.
In this group, we further consider numerous models that combine GNNs with pooling operator for graph level representation learning. Set2Set  and DGCNN  are two novel global graph pooling algorithms. Another five hierarchical graph pooling models including DiffPool , gPool , SAGPool , EdgePool  and EigenPool  are also compared as baselines.
To further analyze the effectiveness of our proposed HGP-SL operator, we consider four variants here: $\text{HGP-SL}_{\text{NSL}}$ (No Structure Learning), which discards the structure learning layer to verify the effectiveness of our proposed structure learning module; $\text{HGP-SL}_{\text{HOP}}$, which removes the structure learning layer and connects each node to the nodes within its $h$ hops; $\text{HGP-SL}_{\text{DEN}}$ (DENse), which employs the structure learning layer to learn a dense graph structure with the softmax function defined in Equation (5); and HGP-SL, which utilizes the sparsemax function defined in Equation (6) to learn a sparse graph structure. Both $\text{HGP-SL}_{\text{DEN}}$ and HGP-SL use the efficiency-improved structure learning strategy.
as test set. We repeat this random splitting process 10 times, and the average performance with standard deviation is reported. For baseline algorithms, we use the source code released by the authors, and their hyper-parameters are tuned to be optimal based on the validation set. In order to ensure a fair comparison, the same neural network architectures are used for the existing pooling baselines and our proposed model. The dimension of node representations is set to 128 for all methods and datasets. We implement our proposed HGP-SL with PyTorch, and the Adam optimizer is utilized to optimize the model. The learning rate, weight decay, pooling ratio and number of layers are selected by hyper-parameter search. The MLP consists of three fully connected layers with 256, 128 and 64 neurons, respectively, followed by a softmax classifier. An early stopping criterion is employed in the training process, i.e., we stop training if the validation loss does not decrease for 100 consecutive epochs. The source code is publicly available at https://github.com/cszhangzhen/HGP-SL.
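The early stopping rule described above can be sketched in plain Python. This is an illustrative reading rather than the released training loop: the function name is hypothetical, and comparing each validation loss against the best value seen so far (rather than against the previous epoch) is an assumption.

```python
def early_stopping_epoch(val_losses, patience=100):
    """Return the epoch at which training would stop, or None if it never triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss                      # new best validation loss, reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Short hypothetical loss curve with a small patience for illustration.
stop_at = early_stopping_epoch([1.0, 0.9, 0.95, 0.97, 0.96], patience=3)
never = early_stopping_epoch([1.0, 0.9, 0.8, 0.7], patience=3)
```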
The classification performance is reported in Table 2. To summarize, we have the following observations:
First of all, a general observation we can draw from the results is that our proposed HGP-SL consistently outperforms other state-of-the-art baselines among all datasets. For instance, our method achieves about 3.08% improvement over the best baseline in PROTEINS dataset, which is 12.97% improvement over GCN with no hierarchical pooling mechanism. This verifies the necessity of adding graph pooling module.
It is worth noting that the traditional graph kernel based methods demonstrate competitive performance. However, the carefully designed graph kernels typically involve massive human domain knowledge, which has difficulty in generalizing to graphs with arbitrary structures. Furthermore, the two-stage procedure of extracting graph features and performing graph classification might result in sub-optimal performance.
In particular, the global pooling approaches Set2Set and DGCNN are surpassed by most of the hierarchical pooling methods with a few exceptions. This is because their learned graph representations are still “flat”, and the hierarchical structure information or functional units in the graph are ignored, which play an important role in predicting the entire graph labels.
We note that the hierarchical pooling models achieve relatively better performance than most baselines, which further shows the effectiveness of the hierarchical pooling mechanism. Among them, gPool and SAGPool perform poorly on the ENZYMES dataset. This may be due to the limited number of training samples per class, which causes the neural network to overfit. EdgePool, which scales down the size of graphs by contracting pairs of nodes in the graph, achieves the best performance in this group of competitors. Nevertheless, our proposed HGP-SL outperforms EdgePool by varying margins in all settings.
Finally, HGP-SL and $\text{HGP-SL}_{\text{DEN}}$ obtain better performance than $\text{HGP-SL}_{\text{NSL}}$ and $\text{HGP-SL}_{\text{HOP}}$, which justifies the effectiveness of our proposed structure learning layer. Moreover, $\text{HGP-SL}_{\text{HOP}}$ performs worse than HGP-SL. This is because the disconnected nodes are still unreachable within its $h$ hops. HGP-SL further outperforms $\text{HGP-SL}_{\text{DEN}}$, which indicates that the learned dense graph structure might introduce additional noisy information and degrade the performance. Furthermore, in real-world scenarios, graphs usually have sparse topologies, thus our proposed HGP-SL can learn more reasonable graph structures than $\text{HGP-SL}_{\text{DEN}}$.
As mentioned in previous sections, our proposed HGP-SL can be integrated into various graph neural network architectures. We consider the three most widely used graph convolutional architectures as our model’s building block to investigate the effect of different convolution operations: GCN, GraphSAGE and GAT. We evaluate them on three datasets, which cover both small and large datasets. Their results are shown in Table 3. Similar results can also be found on the remaining datasets, thus we omit them due to limited space. As demonstrated in Table 3, the performance on graph classification varies depending on the dataset and on the type of GNN chosen in HGP-SL. In addition, we also combine the top-K selection procedure proposed in gPool and SAGPool with our proposed structure learning. We name them gPool-SL and SAGPool-SL for short. From the results, we observe that gPool-SL and SAGPool-SL outperform gPool and SAGPool by incorporating the structure learning mechanism, which verifies the effectiveness of our proposed structure learning.
We further study the sensitivity of several key hyper-parameters by varying them at different scales. Specifically, we investigate how the number of neural network layers $K$, the graph representation dimension $d$ and the pooling ratio $r$ affect the graph classification performance. Figure 2 reports the settings of $K$, $d$ and $r$ at which HGP-SL achieves nearly the best performance across the different datasets. The pooling ratio cannot be too small, otherwise most of the graph structure information will be lost during the pooling process.
We utilize networkx (https://networkx.github.io/) to visualize the pooling results of HGP-SL and its variants. In detail, we randomly sample a graph from the PROTEINS dataset, which contains 154 nodes. We build a three-layer graph neural network with the pooling ratio set to 0.5, which then generates three pooled graphs with 77, 39 and 20 nodes, respectively. We plot the third pooled graph in Figure 3. It shows that $\text{HGP-SL}_{\text{NSL}}$ and $\text{HGP-SL}_{\text{HOP}}$ fail to preserve meaningful graph topologies, while HGP-SL is able to preserve a relatively reasonable topology of the original protein graph after pooling.
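The node counts quoted above are consistent with taking the ceiling of the pooling ratio times the current node count at every layer. The short check below reproduces that arithmetic; the helper name is hypothetical.

```python
import math

def pooled_sizes(n_nodes, ratio, n_layers):
    """Node counts after each pooling layer, keeping ceil(ratio * n) nodes per layer."""
    sizes = []
    for _ in range(n_layers):
        n_nodes = math.ceil(ratio * n_nodes)
        sizes.append(n_nodes)
    return sizes

sizes = pooled_sizes(154, 0.5, 3)   # 154 -> 77 -> 39 -> 20
```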
In this paper, we investigate graph level representation learning for the task of graph classification. We propose a novel graph pooling operator HGP-SL, which empowers GNNs to learn hierarchical graph representations. It can also be conveniently integrated into various GNN architectures. Specifically, the graph pooling operation is a non-parametric step, which utilizes node features and graph structure information to perform down-sampling on graphs. Then, a structure learning layer is stacked on the pooling operation, which aims to learn a refined graph structure that can best preserve the essential topological information. We combine the proposed HGP-SL operator with graph convolutional neural networks to conduct graph classification task. Comprehensive experiments on six widely used benchmarks demonstrate its superiority to a range of state-of-the-art methods.
This work is supported by Alibaba-Zhejiang University Joint Institute of Frontier Technologies, The National Key R&D Program of China (No.2018YFC2002603), Zhejiang Provincial Natural Science Foundation of China (No. LZ13F020001), the National Natural Science Foundation of China (No.61972349, 61173185, 61173186) and the National Key Technology R&D Program of China (No.2012BAI34B01, 2014BAK15B02), National Natural Science Foundation of China (Grant No: U1866602) and National Key Research and Development Project (Grant No: 2018AAA0101503).
Weighted graph cuts without eigenvectors: a multilevel approach. TPAMI 29 (11), pp. 1944–1957.
International Conference on Machine Learning, pp. 2083–2092.
Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, pp. 5115–5124.
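To summarize, $\mathrm{sparsemax}(\cdot)$ considers the Euclidean projection of the input vector $\mathbf{x} \in \mathbb{R}^{n}$ onto the probability simplex, which can be defined as the following optimization problem:
$$\mathrm{sparsemax}(\mathbf{x}) = \arg\min_{\mathbf{p} \in \Delta^{n-1}} \frac{1}{2}\|\mathbf{p} - \mathbf{x}\|^2, \qquad \Delta^{n-1} = \left\{\mathbf{p} \in \mathbb{R}^{n} \mid \mathbf{1}^{\top}\mathbf{p} = 1,\ \mathbf{p} \geq \mathbf{0}\right\} \tag{12}$$
Then, the Lagrangian of the optimization problem in Equation (12) is:
$$\mathcal{L}(\mathbf{p}, \boldsymbol{\mu}, \tau) = \frac{1}{2}\|\mathbf{p} - \mathbf{x}\|^2 - \boldsymbol{\mu}^{\top}\mathbf{p} + \tau\left(\mathbf{1}^{\top}\mathbf{p} - 1\right) \tag{13}$$
The optimal $(\mathbf{p}^*, \boldsymbol{\mu}^*, \tau^*)$ must satisfy the following Karush-Kuhn-Tucker conditions:
$$\mathbf{p}^* - \mathbf{x} - \boldsymbol{\mu}^* + \tau^*\mathbf{1} = \mathbf{0} \tag{14}$$
$$\mathbf{1}^{\top}\mathbf{p}^* = 1, \qquad \mathbf{p}^* \geq \mathbf{0}, \qquad \boldsymbol{\mu}^* \geq \mathbf{0} \tag{15}$$
$$\mu_i^* p_i^* = 0, \quad \forall i \in [n] \tag{16}$$
If for $i \in [n]$ we have $p_i^* > 0$, then from Equation (16) we must have $\mu_i^* = 0$. Thus, from Equation (14) we get $p_i^* = x_i - \tau^*$. Let $S(\mathbf{x}) = \{j \in [n] \mid p_j^* > 0\}$. From Equation (15) we obtain $\sum_{j \in S(\mathbf{x})} (x_j - \tau^*) = 1$, which yields Line 3 in Algorithm 1, i.e., $\tau^* = \left(\sum_{j \in S(\mathbf{x})} x_j - 1\right) / |S(\mathbf{x})|$. Again from Equation (16), we have that $\mu_i^* > 0$ implies $p_i^* = 0$, which from Equation (14) implies $\mu_i^* = \tau^* - x_i \geq 0$, i.e., $x_i \leq \tau^*$ for $i \notin S(\mathbf{x})$. Thus, we have the procedure in Algorithm 1.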
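As a small worked check of this argument, consider a three-dimensional input, written here as $\mathbf{x}$ with projection $\mathbf{p}^*$, threshold $\tau^*$ and multipliers $\boldsymbol{\mu}^*$. The numbers are chosen purely for illustration and do not come from the paper.

```latex
\begin{align*}
\mathbf{x} &= (1.2,\ 0.8,\ -0.5), \qquad S(\mathbf{x}) = \{1, 2\}, \\
\tau^* &= \frac{(1.2 + 0.8) - 1}{2} = 0.5, \\
\mathbf{p}^* &= \big([1.2 - 0.5]_+,\ [0.8 - 0.5]_+,\ [-0.5 - 0.5]_+\big) = (0.7,\ 0.3,\ 0), \\
\mathbf{1}^{\top}\mathbf{p}^* &= 0.7 + 0.3 + 0 = 1, \\
\mu_1^* &= \mu_2^* = 0, \qquad \mu_3^* = \tau^* - x_3 = 0.5 - (-0.5) = 1 \geq 0 .
\end{align*}
```

The two entries above the threshold keep a positive weight and have zero multipliers, while the third entry lies below the threshold, is truncated to zero, and has a positive multiplier, so stationarity, feasibility and complementary slackness all hold.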