
Robust Hierarchical Graph Classification with Subgraph Attention

Graph neural networks have received significant attention for graph representation and classification in the machine learning community. An attention mechanism applied to the neighborhood of a node improves the performance of graph neural networks. Typically, it helps to identify a neighbor node which plays a more important role in determining the label of the node under consideration. But in real-world scenarios, a particular subset of nodes together, but not the individual pairs in the subset, may be important to determine the label of the graph. To address this problem, we introduce the concept of subgraph attention for graphs. On the other hand, hierarchical graph pooling has been shown to be promising in recent literature. But due to the noisy hierarchical structure of real-world graphs, not all the hierarchies of a graph play an equal role in graph classification. Towards this end, we propose a graph classification algorithm called SubGattPool which jointly learns the subgraph attention and employs two different types of hierarchical attention mechanisms to find the important nodes in a hierarchy and the importance of individual hierarchies in a graph. Experimental evaluation against different types of graph classification algorithms shows that SubGattPool improves the state of the art or remains competitive on multiple publicly available graph classification datasets. We conduct further experiments on both synthetic and real-world graph datasets to justify the usefulness of the different components of SubGattPool and to show its consistent performance on other downstream tasks.



1. Introduction

Graphs are the most suitable way to represent different types of relational data such as social networks, protein interactions and molecular structures. Typically, a graph is represented by $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. Further, each node $v_i \in V$ is also associated with an attribute (or feature) vector $x_i \in \mathbb{R}^D$. The recent advent of deep representation learning has heavily influenced the field of graphs. Graph neural networks (GNNs) (Defferrard et al., 2016; Xu et al., 2019) are developed to use the underlying graph as a computational graph and aggregate node attributes from the neighbors of a node to generate the node embeddings (Kipf and Welling, 2016). A simplistic message passing framework (Gilmer et al., 2017) for graph neural networks can be presented by the following equations.


$$h_v^{l+1} = \mathrm{COMBINE}^{l}\Big(h_v^{l},\ \mathrm{AGGREGATE}^{l}\big(\{h_u^{l} : u \in \mathcal{N}_G(v)\}\big)\Big), \qquad h_G = \mathrm{READOUT}\big(\{h_v^{L} : v \in V\}\big) \tag{1}$$

Here, $h_v^{l+1}$ is the representation of node $v$ of graph $G$ in the $(l+1)$-th layer of the GNN. The function AGGREGATE considers the representations of the neighboring nodes of $v$ from the $l$-th layer of the GNN and maps them into a single vector representation. As the neighbors of a node do not have any ordering in a graph and the number of neighbors can vary for different nodes, AGGREGATE needs to be permutation invariant and should be able to handle a varying number of nodes as input. Then, the function COMBINE uses the representation of node $v$ from the $l$-th layer of the GNN and the aggregated information from the neighbors to obtain an updated representation of the node $v$. Finally, for graph-level tasks, the function READOUT (also known as graph pooling) generates a summary representation $h_G$ for the whole graph from all the node representations $h_v^{L}$, $v \in V$, from the final layer ($L$) of the GNN. Similar to AGGREGATE, the function READOUT also needs to be invariant to different node permutations of the input graph, and should be able to handle graphs with different numbers of nodes.
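The three functions of the message passing framework can be illustrated with a minimal NumPy sketch. The concrete choices here (sum for AGGREGATE, a tanh of two linear maps for COMBINE, mean for READOUT) and the names `W_self` and `W_neigh` are illustrative assumptions, not the architecture proposed in this paper.

```python
import numpy as np

def aggregate(neighbor_states):
    """AGGREGATE: order-independent sum over any number of neighbor vectors."""
    return np.sum(neighbor_states, axis=0)

def combine(own_state, aggregated, W_self, W_neigh):
    """COMBINE: mix a node's own state with the aggregated neighbor information."""
    return np.tanh(own_state @ W_self + aggregated @ W_neigh)

def readout(node_states):
    """READOUT (graph pooling): mean over all final node representations."""
    return node_states.mean(axis=0)

def message_passing_layer(adj, H, W_self, W_neigh):
    """Apply one AGGREGATE + COMBINE step to every node."""
    H_next = np.zeros((H.shape[0], W_self.shape[1]))
    for v in range(adj.shape[0]):
        neighbors = np.flatnonzero(adj[v])
        agg = aggregate(H[neighbors]) if neighbors.size else np.zeros(H.shape[1])
        H_next[v] = combine(H[v], agg, W_self, W_neigh)
    return H_next

# Toy graph with 4 nodes and 3-dimensional attributes (made-up values).
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])
H0 = rng.normal(size=(4, 3))
W_self = rng.normal(size=(3, 5))
W_neigh = rng.normal(size=(3, 5))
h_G = readout(message_passing_layer(adj, H0, W_self, W_neigh))
```

Because both the sum and the mean ignore ordering, relabeling the nodes of the toy graph leaves `h_G` unchanged, which is the invariance requirement stated above.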

In the existing literature, different types of neural architectures are proposed to implement each of the three functions mentioned in Equation 1. For example, GraphSAGE (Hamilton et al., 2017) implements three different variants of the AGGREGATE function with mean, maxpool and LSTM respectively. For a graph-level task such as graph classification (Xu et al., 2019; Duvenaud et al., 2015), GNNs jointly derive the node embeddings and use different pooling mechanisms (Ying et al., 2018; Lee et al., 2019) to obtain a representation of the entire graph. Recently, attention mechanisms on graphs have shown promising results for both node classification (Veličković et al., 2018) and graph classification (Lee et al., 2019, 2018) tasks. There are different ways to compute attention on a graph. Veličković et al. (2018) compute attention between a pair of nodes in the immediate neighborhood to capture the importance of a node on the embedding of the other node by learning an attention vector. Lee et al. (2018) compute attention between a pair of nodes in the neighborhood to guide the direction of a random walk in the graph for graph classification. Lee et al. (2019) propose self-attention pooling of the nodes, which is then used to capture the importance of a node in generating the label of the entire graph.

Most of the attention mechanisms developed in the graph literature use attention to derive the importance of a node or a pair of nodes for different tasks. But in real-world situations, calculating importance up to a pair of nodes is not adequate. In molecular biology or in social networks, the presence of particular sub-structures (a subset of nodes with their connections and features), potentially of varying sizes, in a graph often determines its label. Hence, all the nodes in such a substructure are collectively important, and they may not be important individually or in pairs to classify the graph. In Figure 1, each node (indexed from a to g) in the small synthetic graph can be considered as an agent whose attributes determine its opinion (1: positive, 0: neutral, -1: negative) about 4 products. Suppose the graph can be labelled +1 only if there is a subset of connected (by edges) agents who jointly have a positive opinion about all the products. In this case, the blue shaded connected subgraph is important to determine the label of the graph. Please note that attention over pairs (Veličković et al., 2018) is not enough, as a pair of nodes cannot make the label of the graph +1 by itself. Also, multiple layers of graph convolution (Kipf and Welling, 2016) with pair-wise attention may not work, as the aggregated features of a node get corrupted after the feature aggregation by the first few convolution layers. Besides, recent literature also shows that higher order GNNs that directly aggregate features from the higher order neighborhood of a node are theoretically more powerful than 1st order GNNs (Morris et al., 2019). With these motivations, we develop a novel higher order attention mechanism in the graph which operates at the subgraph level in the vicinity of a node. We call it the subgraph attention mechanism and use it for graph classification. (Footnote 1: Subgraph attention can easily be applied to node classification as well, but we focus only on graph classification in this paper.)

On the other hand, different types of graph pooling mechanisms (i.e., the READOUT function in Equation 1) (Duvenaud et al., 2015; Gilmer et al., 2017; Morris et al., 2019) have been proposed in the recent GNN literature. Simple functions such as taking the sum or mean of all the node representations to compute the graph-level representation are studied in (Duvenaud et al., 2015). Recently, hierarchical graph pooling (Ying et al., 2018; Morris et al., 2019) has gained significant interest as it is able to capture the intrinsic hierarchical structure of several real-world graphs. For example, in a social network, one must model both the ego-networks around individual nodes and the coarse-grained relationships between entire communities (Newman, 2003). Instead of directly obtaining a graph-level summary vector, hierarchical pooling mechanisms recursively convert the input graph to graphs of smaller sizes. But hierarchical representation (Ying et al., 2018) often fails to perform well in practice, mainly due to two major shortcomings. First, there is significant loss of information in learning the sequential hierarchies of a graph when the data is limited. Second, it treats all the nodes within a hierarchy, and all the hierarchies, equally while computing the entire graph representation. But for some real-world graphs, the structure between the sub-communities may be more important than that between the nodes or the communities to determine the label of the entire graph (Newman, 2003). Moreover, due to the presence of noise, some of the discovered hierarchies may not follow the actual hierarchical structure of the graph (Sun et al., 2017), and can negatively impact the overall graph representation. To address these issues, we again use attention to differentiate the units of a hierarchical graph representation in a GNN framework. Thus, our contributions in this paper are multifold, as follows:

  • We propose a novel higher order attention mechanism (called subgraph attention) for graph neural networks, which is based on the importance of a subgraph of dynamic size to determine the label of the graph.

  • We also propose hierarchical attention in graph representation. More precisely, we propose intra-level and inter-level attention, which respectively find important nodes within a hierarchy and important hierarchies of the hierarchical representation of the graph. This enables the overall architecture to minimize the loss of information in the hierarchical learning and to achieve robust performance on noisy real-world graph datasets.

  • We propose a novel neural network architecture SubGattPool (Sub-Graph attention network with hierarchically attentive graph Pooling) to combine the above two ideas for graph classification. Thorough experimentation on both real-world and synthetic graphs shows the merit of the proposed algorithms over the state of the art.

Figure 1. Example to motivate subgraph attention

2. Related Work and the Research Gaps

A survey on network representation learning (Grover and Leskovec, 2016; Bandyopadhyay et al., 2018) and graph neural networks can be found in (Wu et al., 2019). In the interest of space, we briefly discuss some of the more prominent approaches for graph classification and representation. Graph kernel based approaches (Vishwanathan et al., 2010), which map the graphs to a Hilbert space implicitly or explicitly, remained the state of the art for graph classification for a long time. There are different types of graph kernels in the literature, such as the random walk based kernel (Kashima et al., 2003), shortest path based kernels (Borgwardt and Kriegel, 2005), the graphlet counting based kernel (Shervashidze et al., 2009), the Weisfeiler-Lehman subtree kernel (Shervashidze et al., 2011) and the Deep graph kernel (Yanardag and Vishwanathan, 2015). But most of the existing graph kernels use hand-crafted features and they often fail to adapt to the data distribution of the graph.

Significant progress has happened in the domain of node representation and node-level tasks via graph neural networks. Spectral graph convolutional neural networks with fast localized convolutions (Defferrard et al., 2016; Kipf and Welling, 2016), graph attention (GAT) over a pair of connected nodes in the graph convolution framework (Veličković et al., 2018), attention over different layers of convolution (Xu et al., 2018), position-aware graph neural networks (You et al., 2019) and hyperbolic graph convolution networks (Chami et al., 2019) are some notable examples of GNNs for node representation. To go from node embeddings to a single representation for the whole graph, simple aggregation techniques such as taking the average of node embeddings in the final layer of a GCN (Duvenaud et al., 2015) and more advanced deep learning architectures that operate over sets (Gilmer et al., 2017) have been used. The attention based graph classification technique GAM (Lee et al., 2018) processes only a portion of the graph by adaptively selecting a sequence of informative nodes. DIFFPOOL (Ying et al., 2018) is a recently proposed hierarchical GNN which uses GCN based pooling to create a set of hierarchical graphs, one in each level. Lee et al. (2019) propose a self-attention based pooling strategy which determines the importance of a node in finding the label of the graph. Different extensions of GNNs, such as Ego-CNN (Tzeng and Wu, 2019) and ChebyGIN (Knyazev et al., 2019), are proposed for graph classification. Theoretical frameworks to analyze the representational power of GNNs are proposed in (Xu et al., 2019; Maron et al., 2019). Knyazev et al. (2019) study the ability of attention GNNs to generalize to larger and more complex graphs.

Higher order GNNs which operate beyond the immediate neighborhood have been proposed recently. Based on the higher dimensional Weisfeiler-Leman algorithm, k-GNN (Morris et al., 2019) derives the representation of all the subgraphs of size $k$ through convolution. Mixhop GNN for node classification (Abu-El-Haija et al., 2019) aggregates node features according to the higher order adjacency matrices. Though these higher order GNNs are more powerful representations of graphs, they do not employ attention in the higher order neighborhood. To the best of our knowledge, (Yang et al., 2019) is the only work to propose an attention mechanism on the shortest paths starting from a node to generate the node embedding. However, their computation of the shortest path depends on the pairwise node attention, and this may fail in cases where a collection of nodes together is important, but not the individual pairs. Our proposed subgraph attention addresses this gap in the literature. Further, hierarchical pooling as proposed in DIFFPOOL (Ying et al., 2018) has become a popular pooling strategy in GNNs (Morris et al., 2019). But it suffers from loss of information and from representing the whole graph by only the last level (containing a single node) of the hierarchy. As discussed in Section 1, some intermediate levels may play a more important role in determining the label of the entire graph than the last one (Newman, 2003). The intra-level and inter-level attention mechanisms proposed in this work precisely address this research gap in hierarchical graph representation.

3. Proposed Approach: SubGattPool

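We formally define the problem of graph classification first. Given a set of $M$ graphs $\mathcal{G} = \{G_1, G_2, \dots, G_M\}$, and a subset of graphs $\mathcal{G}_s \subseteq \mathcal{G}$ in which each graph $G_i \in \mathcal{G}_s$ is labelled with $y_i \in \mathcal{L}_g$ (the subscript $g$ stands for 'graphs'), the task is to predict the label of a graph $G_j \in \mathcal{G} \setminus \mathcal{G}_s$ using the structure of the graphs and the node attributes, and the graph labels from $\mathcal{G}_s$. Again, this leads to learning a function $f_g : \mathcal{G} \to \mathcal{L}_g$. Here, $\mathcal{L}_g$ is the set of discrete labels for the graphs.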

Figure 2 shows the high-level architecture of SubGattPool. One major component of SubGattPool is the generation of node representations through the SubGraph attention (referred to as SubGatt) layer. Below, we describe the building blocks of SubGatt for any arbitrary graph. For ease of reading, we summarize all the important notations used in this paper in Table 1.

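| Notation | Explanation |
| --- | --- |
| $\mathcal{G}$ | Set of graphs in a graph dataset |
| $G = (V, E, X)$ | One single graph |
| $\mathcal{L}_g$ | Set of discrete labels for graphs |
| $x_i \in \mathbb{R}^D$ | Attribute vector for the $i$-th node |
| $\mathcal{S}_i$ | Multiset of sampled subgraphs for the node $i$ |
| $\hat{x}_{il} \in \mathbb{R}^{TD}$ | Derived feature vector of a subgraph |
| $T$ | Maximum size (i.e., number of nodes) of a subtree |
| $L$ | Number of subgraphs to sample for each node |
| $x_G \in \mathbb{R}^K$ | Final representation of the graph $G$ |
| $G^1, \dots, G^R$ | Level graphs of some input graph $G$ |
| $Z^r \in \mathbb{R}^{N_r \times K}$ | Embedding matrix of $G^r$ |
| $P^r \in \mathbb{R}^{N_r \times N_{r+1}}$ | Node assignment matrix from $G^r$ to $G^{r+1}$ |

Table 1. Different notations used in the paper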

3.1. Subgraph Attention Mechanism

The input to the subgraph attention network is an attributed graph $G = (V, E, X)$, where $V$ is the set of $N$ nodes and $x_i \in \mathbb{R}^D$ is the attribute vector of the node $i$. The output of the model is a set of node features (or embeddings) $h_i \in \mathbb{R}^K$, $\forall i \in [N]$ ($K$ is potentially different from $D$). We use $[n]$ to denote the set $\{1, 2, \dots, n\}$ for any positive integer $n$. We define the immediate (or first order) neighborhood of a node $i$ as $\mathcal{N}(i) = \{j : (i, j) \in E\}$. For simplicity of notation, we assume an input graph to be undirected for the rest of the paper, but extending it to directed graphs is straightforward.

3.1.1. Subgraph selection and Sampling

For each node in the graph, we aim to find the importance of the nearby subgraphs to that node. In general, subgraphs can be of any shape or size. Motivated by prior work on graph kernels (Shervashidze et al., 2011), we choose to consider only a set of rooted subtrees as the set of candidate subgraphs. So for a node $i$, any tree of the form $(i)$, or $(i, j)$ where $(i, j) \in E$, or $(i, j, k)$ where $(i, j) \in E$ and $(j, k) \in E$, and so on will form the set of candidate subgraphs of $i$. We restrict the maximum size (i.e., number of nodes) of a subtree to $T$. Also note that the node $i$ is always a part of any candidate subgraph for the node $i$ according to our design. For example, all possible subgraphs of maximum size 3 for the node a in Figure 1 are: (a), (a,b), (a,d), (a,f), (a,b,c), (a,b,f), (a,b,g), (a,d,e), (a,f,e) and (a,f,b).

Depending on the maximum size ($T$) of a rooted subtree, the number of candidate subgraphs for a node can be very large. For example, the number of rooted subtrees for a node can grow on the order of $d_{\max}^{T-1}$, where $d_{\max}$ is the maximum degree of a node in the graph. Clearly, computing attention over this many subgraphs for each node is computationally difficult. So we employ a subgraph subsampling technique, inspired by the node subsampling techniques for network embedding (Hamilton et al., 2017). First, we fix the number of subgraphs to sample for each node. Let this number be $L$. For each node in the input graph, if the total number of rooted subtrees of maximum size $T$ is more than (or equal to) $L$, we randomly sample $L$ subtrees without replacement. If the total number of rooted subtrees of maximum size $T$ is less than $L$, we use round robin sampling (i.e., permute all the subtrees and pick samples from the beginning of the list; after consuming all the trees, start again from the beginning until $L$ subtrees have been picked). For each node, the sample of subtrees remains the same for one epoch of the algorithm (explained in the next subsection) and new samples are taken in each epoch. In any epoch, let us use the notation $\mathcal{S}_i = \{S_{i1}, \dots, S_{iL}\}$ to denote the set (more precisely, a multiset, as subgraphs can repeat) of sampled subgraphs for the node $i$.
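The candidate enumeration and the two sampling regimes described above can be sketched as follows. This is a minimal illustration that assumes candidate subgraphs are paths starting at the root with no repeated node, which is what the Figure 1 example suggests. The edge list is inferred from that example and covers only the part of the graph around node a; the function names are hypothetical.

```python
import random

def rooted_subtrees(adj, root, max_size):
    """Enumerate paths that start at `root`, have at most `max_size` nodes
    and never revisit a node."""
    found = []

    def extend(path):
        found.append(tuple(path))
        if len(path) == max_size:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    extend([root])
    return found

def sample_subtrees(candidates, num_samples, rng):
    """Sample without replacement when enough candidates exist,
    otherwise cycle through a shuffled list (round robin)."""
    if len(candidates) >= num_samples:
        return rng.sample(candidates, num_samples)
    shuffled = candidates[:]
    rng.shuffle(shuffled)
    return [shuffled[i % len(shuffled)] for i in range(num_samples)]

# Edges inferred from the subgraphs listed for node a (an assumption).
edges = [("a", "b"), ("a", "d"), ("a", "f"), ("b", "c"),
         ("b", "f"), ("b", "g"), ("d", "e"), ("f", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)

candidates = rooted_subtrees(adj, "a", max_size=3)
sampled = sample_subtrees(candidates, num_samples=12, rng=random.Random(0))
```

With a maximum size of 3, node a has ten candidates, so requesting 12 samples falls into the round robin case and every candidate appears at least once.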

Figure 2. SubGattPool Network for graph classification

3.1.2. Subgraph Attention Network

This subsection describes the attention mechanism on the set of rooted subtrees selected for each epoch of the algorithm. As mentioned, the node of interest is always positioned as the root of each subgraph generated for that node. The next step is to generate a feature for the subgraph. We tried different simple feature aggregations (for example, the mean) of the nodes that belong to the subgraph as the feature of the subgraph. It turns out that concatenation of the features of the nodes gives better performance. But for the attention to work, we need feature vectors of equal length (the length is $TD$) for all the subgraphs. So if a subgraph has fewer than $T$ nodes, we append zeros at the end to obtain an equal-length feature vector for all the subgraphs. For example, if the maximum size of a subgraph is $T = 4$, then the feature of the subgraph $(i, j, k)$ is $[x_i \,\|\, x_j \,\|\, x_k \,\|\, \mathbf{0}] \in \mathbb{R}^{4D}$, where $\|$ is the concatenation operation and $\mathbf{0}$ is the zero vector in $\mathbb{R}^D$. Let us denote this derived feature vector of any subgraph $S_{il}$ as $\hat{x}_{il} \in \mathbb{R}^{TD}$, $\forall i \in [N]$ and $l \in [L]$.
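The zero-padded concatenation can be written in a few lines. The sketch below uses a hypothetical helper `subgraph_feature` and a small made-up attribute matrix with integer node ids.

```python
import numpy as np

def subgraph_feature(subgraph, X, max_size):
    """Concatenate the attribute vectors of the nodes in `subgraph`
    (root first) and pad with zeros up to max_size * D entries."""
    D = X.shape[1]
    feat = np.zeros(max_size * D)
    for pos, node in enumerate(subgraph):
        feat[pos * D:(pos + 1) * D] = X[node]
    return feat

# Four nodes with D = 2 attributes each (illustrative values).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [-1.0, 0.0]])
feat = subgraph_feature((0, 2, 1), X, max_size=4)
```

Here the three-node subgraph fills the first six entries and the last two stay zero, so every subgraph yields a vector of the same length regardless of its size.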

Next, we use self-attention on the features of the sampled subgraphs for each node, as described here. As the first step, we apply a shared linear transformation, parameterized by a trainable weight matrix $W \in \mathbb{R}^{K \times TD}$, to the features of all the sampled subgraphs $S_{il}$, $\forall i \in [N]$ and $l \in [L]$, selected in an epoch. Next, we introduce a trainable self-attention vector $a \in \mathbb{R}^K$ to compute the attention coefficient $\alpha_{il}$, which captures the importance of the subgraph $S_{il}$ on the node $i$, as follows:

$$\alpha_{il} = \frac{\exp\big(\sigma(a^\top W \hat{x}_{il})\big)}{\sum_{l' \in [L]} \exp\big(\sigma(a^\top W \hat{x}_{il'})\big)}, \qquad h_i = \sigma\Big(\sum_{l=1}^{L} \alpha_{il} W \hat{x}_{il}\Big)$$

Here, $\sigma$ is a non-linear activation function. We have used Leaky ReLU as the activation function for all the experiments.

$\alpha_{il}$ gives normalized attention scores over the set of sampled subgraphs for each node. We use them to compute the representation $h_i$ of a node $i$, as shown in the second equation above. Please note that the attention mechanism described in (Veličković et al., 2018) operates only over the immediate neighboring nodes, whereas the higher order attention mechanism proposed in this work operates over subgraphs. One can easily extend the above subgraph attention to multi-head attention by employing a few independent attention mechanisms of the above form and concatenating the resulting representations (Vaswani et al., 2017). This completes one full subgraph attention layer. We can stack multiple such layers to design a full SubGatt network.
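A single-head subgraph attention step for one node might look like the sketch below. It assumes the score of a subgraph is Leaky ReLU applied to the dot product of the attention vector with the linearly transformed subgraph feature, followed by a softmax over the sampled subgraphs. The slope 0.2 and all tensor values are assumptions made for illustration.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """Leaky ReLU; the slope value is an assumption, not taken from the paper."""
    return np.where(x > 0, x, slope * x)

def subgatt_node_embedding(sub_feats, W, a):
    """sub_feats: (L, T*D) features of the sampled subgraphs of one node,
    W: (K, T*D) shared weight matrix, a: (K,) attention vector.
    Returns the attention coefficients (L,) and the node embedding (K,)."""
    projected = sub_feats @ W.T            # (L, K) transformed subgraph features
    scores = leaky_relu(projected @ a)     # (L,) unnormalized importance
    scores = scores - scores.max()         # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    h = leaky_relu(alpha @ projected)      # attention-weighted sum, then activation
    return alpha, h

rng = np.random.default_rng(0)
L, T, D, K = 12, 3, 4, 8
sub_feats = rng.normal(size=(L, T * D))
W = rng.normal(size=(K, T * D))
a = rng.normal(size=K)
alpha, h = subgatt_node_embedding(sub_feats, W, a)
```

Since the node embedding is a weighted sum over the sampled subgraphs, reordering the rows of `sub_feats` permutes `alpha` accordingly but leaves `h` unchanged.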

3.2. Hierarchically Attentive Graph Pooling

This subsection discusses all the components of the SubGattPool architecture. As shown in Figure 2, there are $R$ different levels of the graph in the hierarchical architecture. The first level is the input graph. Let us denote these level graphs (i.e., graphs at different levels) by $G^1, \dots, G^R$. There is a GNN layer between the level graph $G^r$ (i.e., the graph at level $r$) and the level graph $G^{r+1}$. This GNN layer comprises an embedding layer, which generates the embeddings of the nodes of $G^r$, and a pooling layer, which maps the nodes of $G^r$ to the nodes of $G^{r+1}$. We refer to the GNN layer between the level graphs $G^r$ and $G^{r+1}$ as the $r$-th layer of the GNN, $r = 1, \dots, R-1$. Please note that the number of nodes $N_1$ in the first level graph depends on the input graph, but we keep the number of nodes $N_r$ in the subsequent level graphs $G^r$ ($r = 2, \dots, R$) fixed for all the input graphs (in a graph classification dataset), which helps us to design the shared hierarchical attention mechanisms, as discussed later. As pooling mechanisms shrink a graph, $N_r > N_{r+1}$, $\forall r \le R-1$.

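Let us assume that any level graph $G^r$ is defined by its adjacency matrix $A^r \in \mathbb{R}^{N_r \times N_r}$ and the feature matrix $X^r \in \mathbb{R}^{N_r \times K}$ (except for $G^1$, which is the input graph, with feature matrix $X^1 \in \mathbb{R}^{N \times D}$). The $r$-th embedding layer and the pooling layer are defined by:

$$Z^r = \mathrm{GNN}_{r,\mathrm{embed}}(A^r, X^r), \qquad P^r = \mathrm{softmax}\big(\mathrm{GNN}_{r,\mathrm{pool}}(A^r, X^r)\big)$$

Here, $Z^r \in \mathbb{R}^{N_r \times K}$ is the embedding matrix of the nodes of $G^r$. The softmax after the pooling is applied row-wise. The $(i, j)$-th element of $P^r \in \mathbb{R}^{N_r \times N_{r+1}}$ gives the probability of assigning node $i$ in $G^r$ to node $j$ in $G^{r+1}$. Based on these, the graph $G^{r+1}$ is constructed as follows,

$$A^{r+1} = (P^r)^\top A^r P^r, \qquad X^{r+1} = (P^r)^\top Z^r \tag{4}$$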
The matrix $P^r$ contains information about how nodes in $G^r$ are mapped to the nodes of $G^{r+1}$, and the adjacency matrix $A^r$ contains information about the connections between nodes in $G^r$. Eq. 4 combines them to generate the connections between the nodes (i.e., the adjacency matrix $A^{r+1}$) of $G^{r+1}$. The node feature matrix $X^{r+1}$ of $G^{r+1}$ is generated similarly. As the embedding and pooling GNNs, we use SubGatt networks (Section 3.1) only after the level graph $G^1$. This is because the other level graphs ($G^r$, $r \ge 2$) have a larger number of soft edges (i.e., with probabilistic edge weights) due to the use of softmax at the end of the pooling layers. Hence, the number of neighboring rooted subtrees will be high in those level graphs and the chance of having discrete patterns will be lower. We use GIN (Xu et al., 2019) as the embedding and pooling GNNs for $G^r$, $r \ge 2$. GIN has been shown to be the most powerful 1st order GNN, and the $l$-th layer of GIN can be defined as:

$$h_v^{l+1} = \mathrm{MLP}^{l}\Big((1 + \epsilon^{l})\, h_v^{l} + \sum_{u \in \mathcal{N}(v)} h_u^{l}\Big) \tag{5}$$

Here, $h_v^{l}$ is the hidden representation of the node $v$ in the $l$-th layer of GIN and $\epsilon^{l}$ is a learnable parameter.
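One hierarchical step (embed, assign, coarsen) can be sketched in NumPy as below. For brevity this sketch uses a GIN-style layer for both the embedding and the pooling network at every level, whereas the architecture described here uses SubGatt at the first level. The two-layer ReLU MLP, the weight shapes and the toy graph are assumptions.

```python
import numpy as np

def row_softmax(M):
    """Softmax applied independently to each row."""
    M = M - M.max(axis=1, keepdims=True)
    E = np.exp(M)
    return E / E.sum(axis=1, keepdims=True)

def gin_layer(A, H, W1, W2, eps=0.0):
    """GIN-style update: MLP((1 + eps) * h_v + sum of neighbor states),
    with a two-layer ReLU MLP given by W1 and W2."""
    mixed = (1.0 + eps) * H + A @ H
    return np.maximum(mixed @ W1, 0.0) @ W2

def coarsen(A, X, W_embed, W_pool):
    """Compute embeddings Z and soft assignments P, then build the next level graph."""
    Z = gin_layer(A, X, *W_embed)                 # (N_r, K) node embeddings
    P = row_softmax(gin_layer(A, X, *W_pool))     # (N_r, N_next) assignments
    return P.T @ A @ P, P.T @ Z, P

rng = np.random.default_rng(0)
N, D, K, N_next = 6, 4, 8, 3
upper = np.triu(rng.integers(0, 2, size=(N, N)), k=1)
A = (upper + upper.T).astype(float)               # random undirected toy graph
X = rng.normal(size=(N, D))
W_embed = (rng.normal(size=(D, K)), rng.normal(size=(K, K)))
W_pool = (rng.normal(size=(D, K)), rng.normal(size=(K, N_next)))
A_next, X_next, P = coarsen(A, X, W_embed, W_pool)
```

Each row of `P` sums to one, so the total edge weight of the graph is preserved by the coarsening, and the resulting adjacency matrix is dense with soft (fractional) edge weights, which is the effect discussed above for level graphs beyond the first.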

Intra-level attention layer: As observed in (Lee et al., 2019), hierarchical GNNs often suffer from the loss of information in the various embedding and pooling layers, from the input graph to the last level graph summarizing the entire graph. Moreover, the learned hierarchy is often not perfect due to the noisy structure of real-world graphs. To alleviate these problems, we propose to use attention mechanisms again, to combine features from different level graphs of our hierarchical architecture. We consider the level graphs $G^2$ to $G^R$ for this, as their respective numbers of nodes are the same across all the graphs in a dataset. We introduce an intra-level attention layer to obtain a global feature for each level graph $G^r$, $r = 2, \dots, R$. More precisely, we use convolution based self-attention within the level graph as:

$$e^r = \mathrm{softmax}\big((\tilde{D}^r)^{-\frac{1}{2}} \tilde{A}^r (\tilde{D}^r)^{-\frac{1}{2}} X^r \theta_{\mathrm{intra}}\big), \qquad x^r = (X^r)^\top e^r \tag{6}$$

Here, the softmax to compute $e^r \in \mathbb{R}^{N_r}$ is taken so that a component of $e^r$ becomes the normalized (i.e., probabilistic) importance of the corresponding node in $G^r$. $\tilde{A}^r = A^r + I$ is the adjacency matrix of $G^r$ with added self loops. $\tilde{D}^r$ is the diagonal matrix of dimension $N_r \times N_r$ with $\tilde{D}^r_{ii} = \sum_j \tilde{A}^r_{ij}$. $\theta_{\mathrm{intra}} \in \mathbb{R}^K$ is the trainable vector of parameters of intra-level attention, which is shared across all the level graphs $G^r$, $r = 2, \dots, R$. Intuitively, $\theta_{\mathrm{intra}}$ contains the importance of individual attributes, and the components of the $N_r$-dimensional vector $X^r \theta_{\mathrm{intra}}$ give the same for each node. Finally, multiplying that by $(\tilde{D}^r)^{-\frac{1}{2}} \tilde{A}^r (\tilde{D}^r)^{-\frac{1}{2}}$ produces the (normalized) importance of a node based on its own features and the features of its immediate neighbors (for one layer of intra-level attention). Hence, $x^r$, which is a $K$-dimensional representation of the level graph $G^r$, is a sum of the features of the nodes weighted by the respective normalized node importance. Please note that the impact of the first few level graphs becomes noisy due to the many subsequent operations in a hierarchical pooling method. But representing level graphs separately by the proposed intra-level attention makes their impact more prominent.
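A minimal sketch of intra-level attention for one level graph is given below, assuming the node scores are obtained by one step of symmetric-normalized graph convolution applied to the product of the feature matrix and the shared parameter vector. The names `theta` and `intra_level_attention` and the random inputs are illustrative.

```python
import numpy as np

def intra_level_attention(A, X, theta):
    """A: (N_r, N_r) non-negative adjacency of a level graph, X: (N_r, K)
    node features, theta: (K,) shared attention parameters.
    Returns node importances e (N_r,) and the level representation x (K,)."""
    A_tilde = A + np.eye(A.shape[0])              # add self loops
    deg = A_tilde.sum(axis=1)                     # at least 1 thanks to self loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    scores = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X @ theta
    scores = scores - scores.max()                # shift for numerical stability
    e = np.exp(scores) / np.exp(scores).sum()     # softmax over the nodes
    return e, X.T @ e                             # importance-weighted feature sum

rng = np.random.default_rng(0)
N_r, K = 5, 4
upper = np.triu(rng.random(size=(N_r, N_r)), k=1)
A = upper + upper.T                               # soft, symmetric edge weights
X = rng.normal(size=(N_r, K))
theta = rng.normal(size=K)
e, x_level = intra_level_attention(A, X, theta)
```

The output `x_level` has a fixed length `K` no matter how many nodes the level graph has, which is what allows the level representations to be stacked in the next step.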

Inter-level attention layer: This layer aims to get the final representation, referred to as $x_G$, of the input graph $G$ from $x^2, \dots, x^R$, as obtained from the intra-level attention layers. It is fed to a neural classifier. As different level graphs of the hierarchical representation have different importance in determining the label of the input graph, we propose to use the following self-attention mechanism.

$$e = \mathrm{softmax}\big(X_{\mathrm{inter}}\, \theta_{\mathrm{inter}}\big), \qquad x_G = X_{\mathrm{inter}}^\top e \tag{7}$$

Here, $X_{\mathrm{inter}}$ is the $(R-1) \times K$ dimensional matrix whose rows correspond to $x^r$ (the output of the intra-level attention layer for $G^r$), $r = 2, \dots, R$. $\theta_{\mathrm{inter}} \in \mathbb{R}^K$ is a trainable self-attention vector. Similar to Eq. 6, the softmax is taken to convert $e \in \mathbb{R}^{R-1}$ to a probability distribution over the importance of different graph levels. Finally, the vector representation $x_G \in \mathbb{R}^K$ of the input graph is computed as a weighted sum of the representations of the different level graphs $G^2, \dots, G^R$. $x_G$ is fed to a classification layer of the GNN, which is a dense layer followed by a softmax, to classify the entire input graph in an end-to-end fashion. This completes the construction of the SubGattPool architecture.
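The inter-level attention and the final classification layer can be sketched as follows, assuming the level importances are a softmax over the products of the stacked level representations with a trainable vector. The classifier weights `W_out` and `b_out` and the random inputs are hypothetical placeholders.

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def inter_level_attention(X_inter, theta_inter):
    """X_inter: (R-1, K) stacked level representations,
    theta_inter: (K,) attention parameters.
    Returns level importances (R-1,) and the graph representation (K,)."""
    e = softmax(X_inter @ theta_inter)
    return e, X_inter.T @ e

def classify(x_G, W_out, b_out):
    """Dense layer followed by softmax over the graph labels."""
    return softmax(W_out @ x_G + b_out)

rng = np.random.default_rng(0)
R, K, num_classes = 3, 128, 2
X_inter = rng.normal(size=(R - 1, K))     # one row per level graph 2..R
theta_inter = rng.normal(size=K)
W_out = 0.1 * rng.normal(size=(num_classes, K))
b_out = np.zeros(num_classes)
level_importance, x_G = inter_level_attention(X_inter, theta_inter)
probs = classify(x_G, W_out, b_out)
```

Because `x_G` is a convex combination of the level representations, a level with a large importance weight dominates the final graph representation rather than only the last level doing so.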

3.3. Key Insights of SubGattPool

The first layer of SubGattPool consists of an embedding SubGatt network and a pooling SubGatt network, which have a total of $O(TDK)$ trainable parameters. Subsequent layers of SubGattPool have GIN as embedding and pooling layers, which have a total of $O(RK^2)$ parameters. The total number of parameters for the intra-level attention layers is $O(K)$, as $\theta_{\mathrm{intra}}$ is shared across the level graphs. Finally, the inter-level attention layer has $O(K)$ parameters. Hence, the total number of parameters to train in the SubGattPool network is $O(TDK + RK^2)$, which is independent of both the average number of nodes and the number of graphs in the dataset. We use Adam (with learning rate set to 0.001) on the cross-entropy loss of graph classification to train these parameters.

Please note that, in contrast to existing hierarchical pooling mechanisms in GNNs (Ying et al., 2018; Morris et al., 2019), SubGattPool does not rely only on the last level of the GNN hierarchy to obtain the final graph representation. SubGattPool may even have more than one node in the last level graph. Essentially, information from all the level graphs is aggregated through the attention layers. SubGattPool is also less prone to information loss in the hierarchy and is able to learn the importance of individual nodes in a hierarchy (i.e., within a level graph) and the importance of different hierarchies. In terms of design, most of the existing GNNs use GCN embedding and pooling layers (Ying et al., 2018). In contrast, we propose a subgraph attention mechanism through the SubGatt network (discussed in Section 3.1) and use it along with GIN as the embedding and pooling layers of SubGattPool. The following lemma shows that SubGattPool, though it has different types of components in the overall architecture, satisfies a fundamental property required of a graph neural network.

Lemma 3.1.

For a graph $G$, with adjacency matrix $A$ and node attribute matrix $X$, let us use $\mathrm{SGP}(A, X)$ to denote the final graph representation generated by SubGattPool on that graph. Let $\Pi$ be any permutation matrix. Assuming that the initialization and random selection strategies of the neural architecture are always the same, $\mathrm{SGP}(A, X) = \mathrm{SGP}(\Pi A \Pi^\top, \Pi X)$.

Proof. Please note that $\Pi A \Pi^\top$ is the new adjacency matrix and $\Pi X$ is the new feature matrix of the same graph under the node permutation defined by the permutation matrix $\Pi$. So, to prove the above, we need to show that each component of SubGattPool is invariant to any node permutation. First, SubGatt uses an attention mechanism over the neighboring subgraphs through the subgraph attention equations of Section 3.1.2. Clearly, a different ordering of neighbors would not affect the node embeddings, as we use a weighted sum aggregator where the weights are learned through the subgraph attention. Next, the GIN aggregator (as in Equation 5) is also invariant to node permutation. Thus, all the embedding and pooling layers (as shown in Figure 2) present in SubGattPool are invariant to different node permutations. Finally, both the intra-level and inter-level attention mechanisms also do not depend on the ordering of the nodes in any level graph, as each of them uses sum aggregation with self-attention. Hence, SubGattPool is invariant to node permutations of the input graph. ∎
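The permutation argument can be checked numerically on a small example. The sketch below is not the full SubGattPool network: it uses a one-hop sum aggregation as a stand-in for the embedding and pooling GNNs (an assumption) and verifies that the coarsened graph is identical for the original and the permuted input.

```python
import numpy as np

def row_softmax(M):
    M = M - M.max(axis=1, keepdims=True)
    E = np.exp(M)
    return E / E.sum(axis=1, keepdims=True)

def coarsen(A, X, W_embed, W_pool):
    """One-hop sum aggregation stands in for the embedding and pooling GNNs."""
    A_self = A + np.eye(len(A))
    Z = np.tanh(A_self @ X @ W_embed)          # node embeddings
    P = row_softmax(A_self @ X @ W_pool)       # soft node assignments
    return P.T @ A @ P, P.T @ Z                # next level adjacency and features

rng = np.random.default_rng(0)
N, D, K, N_next = 5, 3, 4, 2
upper = np.triu(rng.integers(0, 2, size=(N, N)), k=1)
A = (upper + upper.T).astype(float)
X = rng.normal(size=(N, D))
W_embed = rng.normal(size=(D, K))
W_pool = rng.normal(size=(D, N_next))

Pi = np.eye(N)[rng.permutation(N)]             # random permutation matrix
A_next, X_next = coarsen(A, X, W_embed, W_pool)
A_next_perm, X_next_perm = coarsen(Pi @ A @ Pi.T, Pi @ X, W_embed, W_pool)
max_diff = max(np.abs(A_next - A_next_perm).max(),
               np.abs(X_next - X_next_perm).max())
```

Permuting the input only permutes the rows of the embedding and assignment matrices, and these permutations cancel in the products that form the next level graph, so `max_diff` is zero up to floating point error.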

4. Experimental Evaluation

This section describes the details of the experiments conducted on both real-life and synthetic datasets.

| Dataset | #Graphs | #Max Nodes | Avg. #Nodes | #Labels | #Attributes |
| --- | --- | --- | --- | --- | --- |
| MUTAG | 188 | 28 | 17.93 | 2 | NA |
| PTC | 344 | 64 | 14.29 | 2 | NA |
| PROTEINS | 1113 | 620 | 39.06 | 2 | 29 |
| NCI1 | 4110 | 111 | 29.87 | 2 | NA |
| NCI109 | 4127 | 111 | 29.68 | 2 | NA |
| IMDB-BINARY | 1000 | 136 | 19.77 | 2 | NA |
| IMDB-MULTI | 1500 | 89 | 13.00 | 3 | NA |

Table 2. Statistics of different datasets used in our experiments

4.1. Experimental Setup for Graph Classification

We use 5 bioinformatics graph datasets (MUTAG, PTC, PROTEINS, NCI1 and NCI109) and 2 social network datasets (IMDB-BINARY and IMDB-MULTI) to evaluate the performance for graph classification. These datasets are publicly available. Table 2 contains a high-level summary of these datasets.

To compare the performance of SubGattPool, we choose twenty state-of-the-art baseline algorithms from the domains of graph kernels, unsupervised graph representation and graph neural networks (Table 3). The reported accuracy numbers of the baseline algorithms are collected from (Maron et al., 2019; Sun et al., 2020; Narayanan et al., 2017a) where the same experimental setup is adopted. Thus, we avoid any degradation of the performance of the baseline algorithms due to insufficient parameter tuning and validation.

We adopt the same experimental setup as in (Xu et al., 2019). We perform 10-fold cross validation and report the averaged accuracy and the corresponding standard deviation for graph classification. We keep the values of the hyperparameters the same across all the datasets, based on the averaged validation accuracy. We set the pooling ratio (defined as $N_{r+1}/N_r$, the ratio of the numbers of nodes in consecutive level graphs) at 0.5, the number of levels $R = 3$ and the maximum subgraph size ($T$) to be 3. We sample $L = 12$ subgraphs for each node in each epoch of SubGatt. Following most of the literature, we set the embedding dimension $K$ to be 128. We use L2 normalization and dropout in the SubGattPool architecture to make the training stable.
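The evaluation protocol can be sketched as a plain 10-fold loop that reports the mean and standard deviation of the per-fold accuracy. The hyperparameter values in `CONFIG` are the ones stated in the text; the key names, the fold construction and the `evaluate_fold` callback (which would train on one index set and return test accuracy on the other) are hypothetical, and the dummy evaluator below does no training.

```python
import numpy as np

# Hyperparameters as stated in the text (dictionary keys are illustrative).
CONFIG = {
    "pooling_ratio": 0.5,
    "num_levels": 3,             # R
    "max_subgraph_size": 3,      # T
    "subgraphs_per_node": 12,    # L
    "embedding_dim": 128,        # K
    "learning_rate": 0.001,
}

def k_fold_indices(num_graphs, k, seed):
    """Shuffle the graph indices and split them into k nearly equal folds."""
    order = np.random.default_rng(seed).permutation(num_graphs)
    return np.array_split(order, k)

def cross_validate(num_graphs, evaluate_fold, k=10, seed=0):
    """Return mean and standard deviation of the per-fold accuracy."""
    folds = k_fold_indices(num_graphs, k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        accuracies.append(evaluate_fold(train_idx, test_idx))
    return float(np.mean(accuracies)), float(np.std(accuracies))

def dummy_evaluate(train_idx, test_idx):
    """Placeholder evaluator: returns the training fraction instead of an accuracy."""
    return len(train_idx) / (len(train_idx) + len(test_idx))

# 188 is the number of graphs in MUTAG (Table 2).
mean_score, std_score = cross_validate(188, dummy_evaluate)
```

In a real run, `evaluate_fold` would build a model from `CONFIG`, fit it on the training graphs and return the accuracy on the held-out fold.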

| Algorithm | MUTAG | PTC | PROTEINS | NCI1 | NCI109 | IMDB-B | IMDB-M |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GK (Shervashidze et al., 2009) | 81.39±1.7 | 55.65±0.5 | 71.39±0.3 | 62.49±0.3 | 62.35±0.3 | NA | NA |
| RW (Vishwanathan et al., 2010) | 79.17±2.1 | 55.91±0.3 | 59.57±0.1 | NA | NA | NA | NA |
| PK (Neumann et al., 2016) | 76±2.7 | 59.5±2.4 | 73.68±0.7 | 82.54±0.5 | NA | NA | NA |
| WL (Shervashidze et al., 2011) | 84.11±1.9 | 57.97±2.5 | 74.68±0.5 | 84.46±0.5 | 85.12±0.3 | NA | NA |
| AWE-DD (Ivanov and Burnaev, 2018) | NA | NA | NA | NA | NA | 74.45±5.8 | 51.54±3.6 |
| AWE-FB (Ivanov and Burnaev, 2018) | 87.87±9.7 | NA | NA | NA | NA | 73.13±3.2 | 51.58±4.6 |
| node2vec (Grover and Leskovec, 2016) | 72.63±10.20 | 58.85±8.00 | 57.49±3.57 | 54.89±1.61 | 52.68±1.56 | NA | NA |
| sub2vec (Adhikari et al., 2017) | 61.05±15.79 | 59.99±6.38 | 53.03±5.55 | 52.84±1.47 | 50.67±1.50 | 55.26±1.54 | 36.67±0.83 |
| graph2vec (Narayanan et al., 2017b) | 83.15±9.25 | 60.17±6.86 | 73.30±2.05 | 73.22±1.81 | 74.26±1.47 | 71.1±0.54 | 50.44±0.87 |
| InfoGraph (Sun et al., 2020) | 89.01±1.13 | 61.65±1.43 | NA | NA | NA | 73.03±0.87 | 49.69±0.53 |
| DGCNN (Zhang et al., 2018) | 85.83±1.7 | 58.59±2.5 | 75.54±0.9 | 74.44±0.5 | NA | 70.03±0.9 | 47.83±0.9 |
| PSCN (Niepert et al., 2016) | 88.95±4.4 | 62.29±5.7 | 75±2.5 | 76.34±1.7 | NA | 71±2.3 | 45.23±2.8 |
| DCNN (Atwood and Towsley, 2016) | NA | NA | 61.29±1.6 | 56.61±1.0 | NA | 49.06±1.4 | 33.49±1.4 |
| ECC (Simonovsky and Komodakis, 2017) | 76.11 | NA | NA | 76.82 | 75.03 | NA | NA |
| DGK (Yanardag and Vishwanathan, 2015) | 87.44±2.7 | 60.08±2.6 | 75.68±0.5 | 80.31±0.5 | 80.32±0.3 | 66.96±0.6 | 44.55±0.5 |
| DIFFPOOL (Ying et al., 2018) | 83.56 | NA | 76.25 | NA | NA | NA | 47.91 |
| IGN (Maron et al., 2018) | 83.89±12.95 | 58.53±6.86 | 76.58±5.49 | 74.33±2.71 | 72.82±1.45 | 72.0±5.54 | 48.73±3.41 |
| GIN (Xu et al., 2019) | 89.4±5.6 | 64.6±7.0 | 76.2±2.8 | 82.7±1.7 | NA | 75.1±5.1 | 52.3±2.8 |
| 1-2-3GNN (Morris et al., 2019) | 86.1 | 60.9 | 75.5 | 76.2 | NA | 74.2 | 49.5 |
| 3WL-GNN (Maron et al., 2019) | 90.55±8.7 | 66.17±6.54 | 77.2±4.73 | 83.19±1.11 | 81.84±1.85 | 72.6±4.9 | 50±3.15 |
| SubGattPool | 93.29±4.78 | 67.13±6.45 | 76.92±3.44 | 82.59±1.42 | 80.95±1.76 | 76.49±2.94 | 52.46±3.48 |
| Rank | 1 | 1 | 2 | 3 | 3 | 1 | 1 |

Table 3. Classification accuracy (%) of different algorithms (21 in total) for graph classification, reported as mean ± standard deviation where available. NA denotes the case when the result of a baseline algorithm could not be found on that particular dataset from the existing literature. The last row 'Rank' is the rank (1 being the highest position) of our proposed algorithm SubGattPool among all the algorithms present in the table.

4.2. Performance on Graph Classification

Table 3 shows the performance of SubGattPool along with the diverse set of baseline algorithms for graph classification on multiple real-world datasets. From the results, we observe that SubGattPool improves the state-of-the-art on MUTAG, PTC, IMDB-B and IMDB-M. On PROTEINS, the performance gap with the best performing baseline (3WL-GNN (Maron et al., 2019)) is less than 1%. On NCI1 and NCI109, however, the WL kernel turns out to be the best performing algorithm, with a good margin over all the GNN based algorithms. It is interesting to note that SubGattPool outperforms the existing hierarchical GNN algorithms DIFFPOOL and 1-2-3GNN consistently on all the datasets. This is because of (i) the attention over subgraphs in the SubGatt embedding and pooling layers, and (ii) the intra-level and inter-level attention mechanisms over different level graphs, which make the overall architecture more robust and reduce information loss. In terms of standard deviation, SubGattPool is highly competitive and often better than most of the better performing GNNs (especially GIN and 3WL-GNN).

Figure 3. (a) A sample graph. (b) Normalized attention values of different subgraphs selected for node 7 of the graph in (a). Clearly, the attention to the clique of size 4 is higher than that to all the other subgraphs.

4.3. Interpretation of Subgraph Attention via Synthetic Experiment

Subgraph attention is a key component of SubGattPool. Here, we validate the learned attention values on different subgraphs by conducting an experiment on a small synthetic dataset containing 50 graphs, each having 8 nodes. Each graph has 2 balanced communities and, for exactly 50% of the graphs, one community consists of a clique of size 4. We label a graph with +1 if the clique of size 4 is present; otherwise the label is -1. The goal of this experiment is to see if SubGattPool is able to learn this simple rule of graph classification by paying proper attention to the substructure which determines the label of a graph.

We run SubGattPool on this synthetic dataset, with , , #SubGatt layers=1, and . Once the training is complete, we randomly select a graph from the positive class and a node in it and plot the attention values of all the subgraphs selected in the last epoch for that node, in Figure 3. Clearly, the attention value corresponding to the clique (containing the nodes 7, 6, 5 and 4) is much higher than that to the other subgraphs. We have manually verified the same observation on multiple graphs in this synthetic dataset. Thus, SubGattPool is able to pay more attention to the correct substructure (i.e., subgraph) and pay less attention to other irrelevant substructures. This also explains the robust behavior of SubGattPool.
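A dataset with the properties stated in this section (50 graphs, 8 nodes each, two balanced communities, a 4-clique community in exactly half of the graphs, labels +1/-1) could be generated along the following lines. The text does not specify how the non-clique communities are wired or how the two communities are connected, so the cycle-plus-optional-chord construction, the single bridging edge, the node numbering and all function names here are assumptions made for illustration.

```python
import itertools
import random


def sparse_community(nodes, rng):
    """Connected 4-node community that is NOT a clique: a cycle plus at most one chord."""
    nodes = list(nodes)
    rng.shuffle(nodes)
    edges = {frozenset((nodes[i], nodes[(i + 1) % len(nodes)])) for i in range(len(nodes))}
    if rng.random() < 0.5:
        edges.add(frozenset((nodes[0], nodes[2])))  # 5 of the 6 possible edges at most
    return edges


def clique(nodes):
    """All pairwise edges among `nodes`."""
    return {frozenset(pair) for pair in itertools.combinations(nodes, 2)}


def make_graph(positive, rng):
    """Return (edge set, label) for one 8-node graph with communities {0..3} and {4..7}."""
    a, b = range(0, 4), range(4, 8)
    edges = sparse_community(a, rng)
    edges |= clique(b) if positive else sparse_community(b, rng)
    # A single bridge joins the communities, so no 4-clique can span both of them.
    edges.add(frozenset((rng.choice(list(a)), rng.choice(list(b)))))
    return edges, (1 if positive else -1)


def has_4_clique(edges):
    """Brute-force check for a clique of size 4 among the 8 nodes."""
    return any(
        all(frozenset(pair) in edges for pair in itertools.combinations(quad, 2))
        for quad in itertools.combinations(range(8), 4)
    )


def make_dataset(num_graphs=50, seed=0):
    """Build the dataset with exactly half of the graphs labeled +1."""
    rng = random.Random(seed)
    positives = [True] * (num_graphs // 2) + [False] * (num_graphs - num_graphs // 2)
    rng.shuffle(positives)
    return [make_graph(p, rng) for p in positives]
```

The `has_4_clique` helper is only a sanity check that the label agrees with the presence of the clique; it is not part of the model.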

Figure 4. t-SNE visualization of the graphs from MUTAG (different colors show different labels of the graphs) by the representations generated by: (a) DIFFPOOL; (b) SubGattPool \SubGatt, i.e., SubGattPool with the SubGatt embedding and pooling layers replaced by GIN; (c) SubGattPool \I-I-L-A, i.e., SubGattPool without intra-level and inter-level attention; (d) the complete SubGattPool network. Compared to (a), both the SubGatt layer and the intra/inter-level attention individually improve the performance. The different classes are separated most clearly by the complete SubGattPool, which again shows the merit of the proposed algorithm.
Figure 5. t-SNE visualization of the graphs from PTC (different colors show different labels of the graphs) by the representations generated by different GNN algorithms: (a) DIFFPOOL; (b) SubGattPool \SubGatt; (c) SubGattPool \I-I-L-A; (d) SubGattPool. The description of each variant of SubGattPool is exactly the same as in Figure 4. For PTC as well, the variants of SubGattPool perform better than DIFFPOOL, and the complete SubGattPool visualizes PTC better than the variants obtained by removing one or more of its components.

4.4. Graph Clustering

Though our proposed algorithm SubGattPool is for graph classification, we also want to check the quality of the graph representations obtained in SubGattPool through graph clustering. We use only a subset of recently proposed GNN based algorithms as baselines in this experiment. We use similar hyperparameter values (discussed in Section 4.1) as applicable and adopt the same hyperparameter tuning strategy to obtain the graph representations for all the algorithms considered. The vector representations obtained for all the graphs by a GNN are given to the K-Means++ (Arthur and Vassilvitskii, 2006) algorithm to get the clusters. To evaluate the quality of clustering, we use unsupervised clustering accuracy (Bandyopadhyay et al., 2019, 2020), which considers different permutations of the labels and chooses the label ordering that gives the best possible accuracy: $Acc(\hat{C}, C) = \max_{\mathcal{P}} \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(\mathcal{P}(\hat{C}_i) = C_i)$. Here $C$ is the ground truth labeling of the dataset such that $C_i$ gives the ground truth label of the $i$th data point, and $N$ is the number of data points. Similarly, $\hat{C}$ is the clustering assignment discovered by some algorithm, and $\mathcal{P}$ is a permutation of the set of labels. We assume $\mathbf{1}$ to be a logical operator which returns 1 when the argument is true, and 0 otherwise. Table 4 shows that SubGattPool is able to outperform all the baselines we used for graph clustering on all three datasets. Please note that DGI and InfoGraph derive the graph embeddings in an unsupervised way, whereas DIFFPOOL and SubGattPool use supervision. Naturally, the performance of the latter two is better on all the datasets. Further, the use of subgraph attention along with the hierarchical attention layers helps SubGattPool to perform consistently better than DIFFPOOL, which is also hierarchical in nature.
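The unsupervised clustering accuracy described above, i.e., the best accuracy over all relabelings of the predicted clusters, can be sketched as below. This is an illustrative brute-force version with a hypothetical function name; it enumerates every permutation of the label set, which is only practical for a small number of classes (a matching algorithm such as the Hungarian method is the usual choice for larger label sets).

```python
import itertools


def clustering_accuracy(true_labels, cluster_ids):
    """Best accuracy over all permutations of the label set.

    Each permutation renames the predicted cluster ids; the score for a
    permutation is the fraction of points whose renamed cluster id equals
    the ground truth label.
    """
    labels = sorted(set(true_labels) | set(cluster_ids))
    best = 0.0
    for perm in itertools.permutations(labels):
        mapping = dict(zip(labels, perm))
        hits = sum(1 for t, c in zip(true_labels, cluster_ids) if mapping[c] == t)
        best = max(best, hits / len(true_labels))
    return best


# Clusters that match the ground truth up to a renaming get a perfect score.
print(clustering_accuracy([0, 0, 1, 1, 2], [1, 1, 0, 0, 2]))
```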

DGI 74.73 59.20 36.83
InfoGraph 77.65 59.93 35.93
DIFFPOOL 82.08 60.81 41.72
SubGattPool 90.68 65.45 50.23
Table 4. Clustering accuracy(%).
Figure 6. Sensitivity analysis of SubGattPool for graph classification on MUTAG with respect to different hyper-parameters: (a) Maximum subgraph size, (b) Number of subgraphs sampled per epoch for each node, (c) Embedding dimension and (d) Number of SubGatt layers in SubGattPool.

4.5. Model Ablation Study

SubGattPool has mainly two novel components: the SubGatt layer, and the intra-level and inter-level attention layers, which make SubGattPool a mixture of both global and hierarchical pooling strategies. To see the usefulness of each component, we show the performance after removing that component from SubGattPool. We choose graph visualization of MUTAG in Figure 4 and graph visualization of PTC in Figure 5 as the downstream tasks for this experiment. We use t-SNE (van der Maaten and Hinton, 2008) to map the graph embeddings onto a two-dimensional plane. Different colors represent different labels of the graphs, and the performance is better when different colors form different clusters in the plot. We choose DIFFPOOL as the base model in Figure 4(a) because it is also a hierarchical graph representation technique. In Figure 4(b), we replace the SubGatt embedding and pooling layers by GIN embedding and pooling layers in SubGattPool (refer to Figure 2). Similarly, in Figure 4(c), we remove the inter-level and intra-level attention and obtain the graph representation from the last level graph (by creating only one node there) in SubGattPool. Finally, Figure 4(d) shows the performance of SubGattPool, which combines all these components into a single network. Clearly, the performances in Figure 4(b) and 4(c) are better than that in Figure 4(a), but the best performance is achieved in Figure 4(d), which uses the complete SubGattPool network on MUTAG. The same improvement of the variants of SubGattPool over DIFFPOOL, and of SubGattPool over its variants, is also prominent in Figure 5 on the PTC dataset. This clearly shows the individual and combined usefulness of the various components of SubGattPool for graph representation.

4.6. Sensitivity Analysis

We conduct a sensitivity analysis of the proposed algorithm in this section. The SubGatt network has four important hyperparameters: (i) the maximum size of a subgraph (T), (ii) the number of subgraphs sampled per node in each epoch (L), (iii) the dimension of the final node representation or embedding (K) (See Eq. 3.1.2) and (iv) the number of SubGatt layers used in the network. We conduct graph classification experiments on MUTAG to see the sensitivity of SubGattPool with respect to each of these hyperparameters. Figure 6 shows the variation of the performance of the SubGattPool network for graph classification with respect to all these hyperparameters. We show both the average graph classification accuracy and the standard deviation over 10 repetitions for each experiment.

From Figure 6(a), we can see that the performance of SubGattPool on MUTAG improves when the maximum subgraph size is set to 3. As the average size of a graph in MUTAG is quite small, a subgraph of size more than 3 does not help. Similarly, Figure 6(b) shows that with an increasing number of samples (L) for each node in an epoch, the performance of SubGattPool improves first, and then saturates. The same observation can be made in Figure 6(c) for the embedding dimension of the graphs. We use SubGatt as the embedding and pooling layers of the GNN after level graph 1. Figure 6(d) shows that the best performance on MUTAG is obtained with 2 layers of SubGatt. Adding more layers actually deteriorates the performance because of oversmoothing, which is a well-known problem of graph neural networks (Luan et al., 2019). Overall, the variation is as expected and often small with respect to each hyperparameter, which shows the robustness of SubGattPool. Please note that when we vary one hyperparameter of SubGattPool, the values of all other hyperparameters are fixed to the values mentioned in Section 4.1.
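The one-at-a-time protocol used here (vary a single hyperparameter while the others stay at their Section 4.1 values) can be sketched as a small configuration generator. The dictionary keys, the candidate values in the grid and the default number of SubGatt layers are assumptions for illustration; only the defaults for the maximum subgraph size (3), the number of sampled subgraphs (12) and the embedding dimension (128) come from the text.

```python
# Defaults follow Section 4.1 where stated; "num_layers" is an assumed default.
DEFAULTS = {"max_subgraph_size": 3, "num_samples": 12, "embed_dim": 128, "num_layers": 2}

# Hypothetical candidate values; the text does not list the exact grid.
GRID = {
    "max_subgraph_size": [2, 3, 4, 5],
    "num_samples": [4, 8, 12, 16],
    "embed_dim": [32, 64, 128, 256],
    "num_layers": [1, 2, 3, 4],
}


def one_at_a_time(defaults, grid):
    """Yield (name, value, config) where at most one hyperparameter differs from its default."""
    for name, values in grid.items():
        for value in values:
            config = dict(defaults)  # copy so the defaults are never mutated
            config[name] = value
            yield name, value, config


runs = list(one_at_a_time(DEFAULTS, GRID))
for name, value, config in runs[:4]:
    print(name, value, config)
```

Each generated configuration would then be evaluated for 10 repetitions, and the mean accuracy and standard deviation plotted against the varied hyperparameter.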

5. Conclusion

We have proposed a novel GNN based robust graph classification algorithm called SubGattPool, which uses higher order attention over the subgraphs of a graph and also addresses some shortcomings of the existing hierarchical graph representation techniques. We conduct experiments with both real world and synthetic graph datasets on multiple graph-level downstream tasks to show the robustness of our algorithm. We are also able to improve the state-of-the-art graph classification performance on four popularly used graph datasets. In the future, we would like to theoretically examine the expressive power of SubGatt and SubGattPool for node and graph representations respectively. We will also analyze the recovery of communities in a graph in the hierarchical structure of SubGattPool. Overall, we believe that this work will encourage further development in the area of hierarchical graph representation and classification.

References


  • S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, K. Lerman, H. Harutyunyan, G. Ver Steeg, and A. Galstyan (2019) MixHop: higher-order graph convolutional architectures via sparsified neighborhood mixing. In International Conference on Machine Learning, pp. 21–29. Cited by: §2.
  • B. Adhikari, Y. Zhang, N. Ramakrishnan, and B. A. Prakash (2017) Distributed representations of subgraphs. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 111–117. Cited by: Table 3.
  • D. Arthur and S. Vassilvitskii (2006) K-means++: the advantages of careful seeding. Technical report Stanford. Cited by: §4.4.
  • J. Atwood and D. Towsley (2016) Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1993–2001. Cited by: Table 3.
  • S. Bandyopadhyay, H. Kara, A. Kannan, and M. N. Murty (2018) Fscnmf: fusing structure and content via non-negative matrix factorization for embedding information networks. arXiv preprint arXiv:1804.05313. Cited by: §2.
  • S. Bandyopadhyay, N. Lokesh, and M. N. Murty (2019) Outlier aware network embedding for attributed networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 12–19. Cited by: §4.4.
  • S. Bandyopadhyay, S. V. Vivek, and M. Murty (2020) Outlier resistant unsupervised deep architectures for attributed network embedding. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 25–33. Cited by: §4.4.
  • K. M. Borgwardt and H. Kriegel (2005) Shortest-path kernels on graphs. In Fifth IEEE international conference on data mining (ICDM’05), pp. 8–pp. Cited by: §2.
  • I. Chami, Z. Ying, C. Ré, and J. Leskovec (2019) Hyperbolic graph convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 4869–4880. Cited by: §2.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852. Cited by: §1, §2.
  • D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §1, §1, §2.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272. Cited by: §1, §1, §2.
  • A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §2, Table 3.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1025–1035. Cited by: §1, §3.1.1.
  • S. Ivanov and E. Burnaev (2018) Anonymous walk embeddings. arXiv preprint arXiv:1805.11921. Cited by: Table 3.
  • H. Kashima, K. Tsuda, and A. Inokuchi (2003) Marginalized kernels between labeled graphs. In ICML, pp. 321–328. Cited by: §2.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §1, §2.
  • B. Knyazev, G. W. Taylor, and M. R. Amer (2019) Understanding attention in graph neural networks. In Advances in Neural Information Processing Systems, Cited by: §2.
  • J. B. Lee, R. Rossi, and X. Kong (2018) Graph classification using structural attention. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1666–1674. Cited by: §1, §2.
  • J. Lee, I. Lee, and J. Kang (2019) Self-attention graph pooling. In International Conference on Machine Learning, pp. 3734–3743. Cited by: §1, §2, §3.2.
  • S. Luan, M. Zhao, X. Chang, and D. Precup (2019) Break the ceiling: stronger multi-scale deep graph convolutional networks. In Advances in Neural Information Processing Systems, pp. 10943–10953. Cited by: §4.6.
  • H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman (2019) Provably powerful graph networks. In Advances in Neural Information Processing Systems, pp. 2153–2164. Cited by: §2, §4.1, §4.2, Table 3.
  • H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman (2018) Invariant and equivariant graph networks. arXiv preprint arXiv:1812.09902. Cited by: Table 3.
  • C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe (2019) Weisfeiler and leman go neural: higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4602–4609. Cited by: §1, §1, §2, §3.3, Table 3.
  • A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal (2017a) Graph2vec: learning distributed representations of graphs. In 13th International Workshop on Mining and Learning with Graphs (MLGWorkshop 2017), Cited by: §4.1.
  • A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal (2017b) Graph2vec: learning distributed representations of graphs. arXiv preprint arXiv:1707.05005. Cited by: Table 3.
  • M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting (2016) Propagation kernels: efficient graph kernels from propagated information. Machine Learning 102 (2), pp. 209–245. Cited by: Table 3.
  • M. E. Newman (2003) The structure and function of complex networks. SIAM review 45 (2), pp. 167–256. Cited by: §1, §2.
  • M. Niepert, M. Ahmed, and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In International conference on machine learning, pp. 2014–2023. Cited by: Table 3.
  • N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt (2011) Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12 (Sep), pp. 2539–2561. Cited by: §2, §3.1.1, Table 3.
  • N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt (2009) Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp. 488–495. Cited by: §2, Table 3.
  • M. Simonovsky and N. Komodakis (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3693–3702. Cited by: Table 3.
  • F. Sun, J. Hoffman, V. Verma, and J. Tang (2020) InfoGraph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In International Conference on Learning Representations. Cited by: §4.1, Table 3.
  • J. Sun, D. Ajwani, P. K. Nicholson, A. Sala, and S. Parthasarathy (2017) Breaking cycles in noisy hierarchies. In Proceedings of the 2017 ACM on Web Science Conference, pp. 151–160. Cited by: §1.
  • R. Tzeng and S. Wu (2019) Distributed, egocentric representations of graphs for detecting critical structures. In International Conference on Machine Learning, pp. 6354–6362. Cited by: §2.
  • L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605. Cited by: §4.5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.1.2.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2018) Graph attention networks. In International Conference on Learning Representations. Cited by: §1, §1, §2, §3.1.2.
  • S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt (2010) Graph kernels. Journal of Machine Learning Research 11 (Apr), pp. 1201–1242. Cited by: §2, Table 3.
  • Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §2.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In International Conference on Learning Representations. Cited by: §1, §1, §2, §3.2, §4.1, Table 3.
  • K. Xu, C. Li, Y. Tian, T. Sonobe, K. Kawarabayashi, and S. Jegelka (2018) Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pp. 5449–5458. Cited by: §2.
  • P. Yanardag and S. Vishwanathan (2015) Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374. Cited by: §2, Table 3.
  • Y. Yang, X. Wang, M. Song, J. Yuan, and D. Tao (2019) SPAGAN: shortest path graph attention network. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 4099–4105. Cited by: §2.
  • Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec (2018) Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pp. 4800–4810. Cited by: §1, §1, §2, §2, §3.3, Table 3.
  • J. You, R. Ying, and J. Leskovec (2019) Position-aware graph neural networks. In International Conference on Machine Learning, pp. 7134–7143. Cited by: §2.
  • M. Zhang, Z. Cui, M. Neumann, and Y. Chen (2018) An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Table 3.