1. Introduction
Graphs are a natural way to represent different types of relational data such as social networks, protein interactions and molecular structures. Typically, a graph is represented by $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. Further, each node $i \in V$ is also associated with an attribute (or feature) vector $x_i \in \mathbb{R}^D$. The recent advent of deep representation learning has heavily influenced the field of graphs. Graph neural networks (GNNs) (Defferrard et al., 2016; Xu et al., 2019) were developed to use the underlying graph as a computational graph and aggregate node attributes from the neighbors of a node to generate the node embeddings (Kipf and Welling, 2016). A simple message passing framework (Gilmer et al., 2017) for graph neural networks can be presented by the following equations:

(1) $a_v^{(l)} = \text{AGGREGATE}^{(l)}\big(\{h_u^{(l-1)} : u \in N(v)\}\big)$, $\quad h_v^{(l)} = \text{COMBINE}^{(l)}\big(h_v^{(l-1)}, a_v^{(l)}\big)$, $\quad h_G = \text{READOUT}\big(\{h_v^{(L)} : v \in V\}\big)$

Here, $h_v^{(l)}$ is the representation of node $v$ of graph $G$ in the $l$th layer of the GNN. The function AGGREGATE considers the representations of the neighboring nodes of $v$ from the $(l-1)$th layer of the GNN and maps them into a single vector representation. As the neighbors of a node do not have any ordering in a graph and the number of neighbors can vary across nodes, the AGGREGATE function needs to be permutation invariant and should be able to handle a varying number of nodes as input. Then, the function COMBINE uses the representation of node $v$ from the $(l-1)$th layer of the GNN and the aggregated information from the neighbors to obtain an updated representation of the node $v$. Finally, for graph level tasks, the function READOUT (also known as graph pooling) generates a summary representation for the whole graph from all the node representations $h_v^{(L)}$, $v \in V$, from the final layer $L$ of the GNN. Similar to AGGREGATE, the function READOUT also needs to be invariant to different node permutations of the input graph, and should be able to handle graphs with different numbers of nodes.
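The framework above can be sketched in a few lines of pure Python. The mean AGGREGATE, elementwise-sum COMBINE and sum READOUT below are illustrative choices, not the specific functions used later in the paper:

```python
# A minimal sketch of the message passing framework in Equation (1),
# with a mean AGGREGATE, an elementwise-sum COMBINE and a sum READOUT.
# These particular function choices are illustrative only.

def aggregate(neighbor_reps):
    # Permutation-invariant mean over a variable number of neighbors.
    if not neighbor_reps:
        return [0.0, 0.0]  # 2-dim fallback for this toy example
    dim = len(neighbor_reps[0])
    return [sum(r[d] for r in neighbor_reps) / len(neighbor_reps) for d in range(dim)]

def combine(self_rep, agg_rep):
    # Elementwise sum of a node's own representation and the aggregate.
    return [a + b for a, b in zip(self_rep, agg_rep)]

def readout(node_reps):
    # Permutation-invariant sum over all final-layer node representations.
    dim = len(next(iter(node_reps.values())))
    return [sum(r[d] for r in node_reps.values()) for d in range(dim)]

def gnn_layer(adj, reps):
    # One message passing layer over all nodes of the graph.
    return {v: combine(reps[v], aggregate([reps[u] for u in adj[v]])) for v in adj}

# Toy graph: a triangle with 2-dimensional node features.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
reps = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
h_G = readout(gnn_layer(adj, reps))
```

Because both the aggregator and the readout are sums or means over unordered collections, relabeling the nodes leaves `h_G` unchanged, which is exactly the invariance the text requires.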
In the existing literature, different types of neural architectures have been proposed to implement each of the three functions mentioned in Equation 1. For example, GraphSAGE (Hamilton et al., 2017) implements three different variants of the AGGREGATE function with mean, maxpool and LSTM respectively. For a graph level task such as graph classification (Xu et al., 2019; Duvenaud et al., 2015), GNNs jointly derive the node embeddings and use different pooling mechanisms (Ying et al., 2018; Lee et al., 2019) to obtain a representation of the entire graph. Recently, attention mechanisms on graphs have shown promising results for both node classification (Veličković et al., 2018) and graph classification (Lee et al., 2019, 2018) tasks. There are different ways to compute attention on graphs. (Veličković et al., 2018) compute attention between pairs of nodes in the immediate neighborhood, learning an attention vector to capture the importance of a node on the embedding of the other node. (Lee et al., 2018) compute attention between pairs of nodes in the neighborhood to guide the direction of a random walk in the graph for graph classification. (Lee et al., 2019) propose self attention pooling of the nodes, which is then used to capture the importance of a node for generating the label of the entire graph.
Most of the attention mechanisms developed in the graph literature use attention to derive the importance of a node or a pair of nodes for different tasks. But in real-world situations, computing importance only up to a pair of nodes is not adequate. In molecular biology or in social networks, the presence of particular substructures (a subset of nodes with their connections and features), potentially of varying sizes, in a graph often determines its label. Hence, all the nodes in such a substructure are collectively important, and they may not be important individually or in pairs to classify the graph. In Figure 1, each node (indexed from $a$ to $g$) in the small synthetic graph can be considered as an agent whose attributes determine its opinion (+1: positive, 0: neutral, -1: negative) about 4 products. Suppose the graph can be labelled +1 only if there is a subset of connected (by edges) agents who jointly have a positive opinion about all the products. In this case, the blue shaded connected subgraph is important to determine the label of the graph. Please note, attention over pairs of nodes (Veličković et al., 2018) is not enough, as no single pair of nodes can make the label of the graph +1 by itself. Also, multiple layers of graph convolution (Kipf and Welling, 2016) with pairwise attention may not work, as the aggregated features of a node get corrupted after the feature aggregation by the first few convolution layers. Besides, recent literature also shows that higher order GNNs, which directly aggregate features from the higher order neighborhood of a node, are theoretically more powerful than 1st order GNNs (Morris et al., 2019). With these motivations, we develop a novel higher order attention mechanism which operates at the subgraph level in the vicinity of a node. We call it subgraph attention and use it for graph classification (subgraph attention can easily be applied for node classification as well, but we focus only on graph classification in this paper).

On the other hand, different types of graph pooling mechanisms (i.e., the READOUT function in Equation 1) have been proposed in the recent GNN literature (Duvenaud et al., 2015; Gilmer et al., 2017; Morris et al., 2019). Simple functions such as taking the sum or mean of all the node representations to compute the graph-level representation are studied in (Duvenaud et al., 2015). Recently, hierarchical graph pooling (Ying et al., 2018; Morris et al., 2019) has gained significant interest as it is able to capture the intrinsic hierarchical structure of several real-world graphs.
For example, in a social network, one must model both the ego-networks around individual nodes, as well as the coarse-grained relationships between entire communities (Newman, 2003). Instead of directly obtaining a graph level summary vector, hierarchical pooling mechanisms recursively convert the input graph into graphs of smaller sizes. But hierarchical representation (Ying et al., 2018) often fails to perform well in practice, mainly due to two major shortcomings. First, there is significant loss of information in learning the sequential hierarchies of a graph when the data is limited. Second, it treats all the nodes within a hierarchy, and all the hierarchies, equally while computing the entire graph representation. But for some real-world graphs, the structure between the subcommunities may be more important than that between the nodes or the communities to determine the label of the entire graph (Newman, 2003). Moreover, due to the presence of noise, some of the discovered hierarchies may not follow the actual hierarchical structure of the graph (Sun et al., 2017), and can negatively impact the overall graph representation. To address these issues, we again use attention to differentiate between different units of a hierarchical graph representation in a GNN framework. Thus, our contributions in this paper are multifold, as follows:

We propose a novel higher order attention mechanism (called subgraph attention) for graph neural networks, which is based on the importance of subgraphs of dynamic size in determining the label of the graph.

We also propose hierarchical attention in graph representation. More precisely, we propose intra-level and inter-level attention, which respectively find important nodes within a hierarchy and important hierarchies in the hierarchical representation of the graph. This enables the overall architecture to minimize the loss of information in the hierarchical learning and to achieve robust performance on real-world noisy graph datasets.

We propose a novel neural network architecture SubGattPool (SubGraph attention network with hierarchically attentive graph Pooling) to combine the above two ideas for graph classification. Thorough experimentation on both real-world and synthetic graphs shows the merit of the proposed algorithms over the state-of-the-art.
2. Related Work and the Research Gaps
A survey on network representation learning (Grover and Leskovec, 2016; Bandyopadhyay et al., 2018) and graph neural networks can be found in (Wu et al., 2019). In the interest of space, we briefly discuss some of the more prominent approaches for graph classification and representation. Graph kernel based approaches (Vishwanathan et al., 2010), which map the graphs to a Hilbert space implicitly or explicitly, remained the state of the art for graph classification for a long time. There are different types of graph kernels in the literature, such as the random walk based kernel (Kashima et al., 2003), shortest path based kernels (Borgwardt and Kriegel, 2005), the graphlet counting based kernel (Shervashidze et al., 2009), the Weisfeiler-Lehman subtree kernel (Shervashidze et al., 2011) and the deep graph kernel (Yanardag and Vishwanathan, 2015). But most of the existing graph kernels use handcrafted features and they often fail to adapt to the data distribution of the graphs.
Significant progress has been made in the domain of node representation and node level tasks via graph neural networks. Spectral graph convolutional neural networks with fast localized convolutions (Defferrard et al., 2016; Kipf and Welling, 2016), graph attention (GAT) over a pair of connected nodes in the graph convolution framework (Veličković et al., 2018), attention over different layers of convolution (Xu et al., 2018), position aware graph neural networks (You et al., 2019) and hyperbolic graph convolution networks (Chami et al., 2019) are some notable examples of GNNs for node representation. To go from node embeddings to a single representation for the whole graph, simple aggregation techniques such as taking the average of node embeddings in the final layer of a GCN (Duvenaud et al., 2015) and more advanced deep learning architectures that operate over sets (Gilmer et al., 2017) have been used. The attention based graph classification technique GAM (Lee et al., 2018) processes only a portion of the graph by adaptively selecting a sequence of informative nodes. DIFFPOOL (Ying et al., 2018) is a recently proposed hierarchical GNN which uses a GCN based pooling to create a set of hierarchical graphs at each level. (Lee et al., 2019) propose a self attention based pooling strategy which determines the importance of a node for finding the label of the graph. Different extensions of GNNs, such as EgoCNN (Tzeng and Wu, 2019) and ChebyGIN (Knyazev et al., 2019), have been proposed for graph classification. Theoretical frameworks to analyze the representational power of GNNs are proposed in (Xu et al., 2019; Maron et al., 2019). (Knyazev et al., 2019) study the ability of attention GNNs to generalize to larger and more complex graphs.

Higher order GNNs which operate beyond the immediate neighborhood have been proposed recently. Based on the higher dimensional Weisfeiler-Leman algorithm, k-GNN (Morris et al., 2019) derives the representations of all the subgraphs of size $k$ through convolution. The MixHop GNN for node classification, proposed in (Abu-El-Haija et al., 2019), aggregates node features according to higher order adjacency matrices. Though these higher order GNNs yield more powerful representations of graphs, they do not employ attention in the higher order neighborhood. To the best of our knowledge, (Yang et al., 2019) is the only work to propose an attention mechanism on the shortest paths starting from a node to generate the node embedding. However, their computation of shortest paths depends on pairwise node attention, and this may fail in cases when a collection of nodes together is important, but not the individual pairs. Our proposed subgraph attention addresses this gap in the literature.
Further, hierarchical pooling as proposed in DIFFPOOL (Ying et al., 2018) has become a popular pooling strategy in GNNs (Morris et al., 2019). But it suffers from loss of information and from its nature of representing the whole graph by the last level (containing only a single node) of the hierarchy. As discussed in Section 1, some intermediate levels may play a more important role than the last one in determining the label of the entire graph (Newman, 2003). The intra-level and inter-level attention mechanisms proposed in this work precisely address this research gap in hierarchical graph representation.
3. Proposed Approach: SubGattPool
We formally define the problem of graph classification first. Given a set of graphs $\mathcal{G}$, and a subset of graphs $\mathcal{G}_s \subseteq \mathcal{G}$ with each graph $G_i \in \mathcal{G}_s$ labelled with $y_i \in \mathcal{Y}_g$ (the subscript $g$ stands for 'graphs'), the task is to predict the label of a graph $G_j \in \mathcal{G} \setminus \mathcal{G}_s$ using the structure of the graphs and the node attributes, and the graph labels from $\mathcal{G}_s$. This leads to learning a function $f_g : \mathcal{G} \to \mathcal{Y}_g$. Here, $\mathcal{Y}_g$ is the set of discrete labels for the graphs.
Figure 2 shows the high-level architecture of SubGattPool. One major component of SubGattPool is the generation of node representations through the SubGraph attention (referred to as SubGatt) layer. Below, we describe the building blocks of SubGatt for an arbitrary graph. For ease of reading, we summarize all the important notations used in this paper in Table 1.
Notations  Explanations
$\mathcal{G}$  Set of graphs in a graph dataset
$G$  One single graph
$\mathcal{Y}_g$  Set of discrete labels for graphs
$x_i$  Attribute vector of the $i$th node
$S_i$  Multiset of sampled subgraphs for the node $v_i$
$f(s)$  Derived feature vector of a subgraph $s$
$T$  Maximum size (i.e., number of nodes) of a subtree
$L$  Number of subgraphs to sample for each node
$h_G$  Final representation of the graph
$G^{(r)}$  Level graphs of an input graph
$Z^{(r)}$  Embedding matrix of $G^{(r)}$
$P^{(r)}$  Node assignment matrix from $G^{(r)}$ to $G^{(r+1)}$
3.1. Subgraph Attention Mechanism
The input to the subgraph attention network is an attributed graph $G = (V, E)$, where $V = \{v_1, \dots, v_n\}$ is the set of nodes and $x_i \in \mathbb{R}^D$ is the attribute vector of the node $v_i$. The output of the model is a set of node features (or embeddings) $h_i \in \mathbb{R}^K$, $i \in [n]$ ($K$ is potentially different from $D$). We use $[n]$ to denote the set $\{1, 2, \dots, n\}$ for any positive integer $n$. We define the immediate (or first order) neighborhood of a node $v_i$ as $N(i) = \{v_j \in V \mid (v_i, v_j) \in E\}$. For simplicity of notation, we assume an input graph to be undirected for the rest of the paper, but extending the approach to directed graphs is straightforward.
3.1.1. Subgraph selection and Sampling
For each node in the graph, we aim to find the importance of the nearby subgraphs to that node. In general, subgraphs can be of any shape or size. Motivated by prior work on graph kernels (Shervashidze et al., 2011), we choose to consider only a set of rooted subtrees as the set of candidate subgraphs. So for a node $v$, any tree of the form $(v)$, or $(v, u)$ where $u \in N(v)$, or $(v, u, w)$ where $u \in N(v)$ and $w \in N(u)$, and so on, will form the set of candidate subgraphs of $v$. We restrict the maximum size (i.e., number of nodes) of a subtree to $T$. Also note that the node $v$ is always a part of any candidate subgraph for the node $v$ by design. For example, all possible subgraphs of maximum size 3 for the node a in Figure 1 are: (a), (a,b), (a,d), (a,f), (a,b,c), (a,b,f), (a,b,g), (a,d,e), (a,f,e) and (a,f,b).
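One simple reading of this candidate set, consistent with the size-3 example for node a, is the set of simple paths starting at the root with at most $T$ nodes. The sketch below enumerates them; the edge list is reconstructed from the example in Figure 1 and is only an assumption:

```python
# Enumerate the candidate subgraphs of a node as simple paths starting at
# the node, with at most T nodes. The adjacency below is an assumed
# reconstruction of the Figure 1 graph from the listed example subgraphs.

def rooted_subtrees(adj, root, T):
    """All simple paths starting at `root` with at most T nodes."""
    out = []
    def dfs(path):
        out.append(tuple(path))
        if len(path) == T:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:          # keep the path simple
                dfs(path + [nxt])
    dfs([root])
    return out

adj = {
    'a': ['b', 'd', 'f'], 'b': ['a', 'c', 'f', 'g'], 'c': ['b'],
    'd': ['a', 'e'], 'e': ['d', 'f'], 'f': ['a', 'b', 'e'], 'g': ['b'],
}
subtrees = rooted_subtrees(adj, 'a', 3)   # 10 candidates, matching the example
```

Run on this reconstructed graph, the enumeration yields exactly the ten candidates listed in the text, from (a) up to (a,f,b).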
Depending on the maximum size $T$ of a rooted subtree, the number of candidate subgraphs for a node can be very large, as it grows combinatorially with the degrees of the nearby nodes. Clearly, computing attention over that many subgraphs for each node is computationally difficult. So we employ a subgraph subsampling technique, inspired by the node subsampling techniques for network embedding (Hamilton et al., 2017). First, we fix the number of subgraphs to sample for each node. Let this number be $L$. For each node in the input graph, if the total number of rooted subtrees of size at most $T$ is more than (or equal to) $L$, we randomly sample $L$ subtrees without replacement. If the total number of rooted subtrees is less than $L$, we use round robin sampling (i.e., permute all the subtrees, pick samples from the beginning of the list and, after consuming all the trees, start again from the beginning until $L$ subtrees are picked). For each node, the sample of subtrees remains the same for one epoch of the algorithm (explained in the next subsection) and new samples are taken in each epoch. In any epoch, let us use the notation $S_i$ to denote the set (more precisely, a multiset, as subgraphs can repeat) of sampled subgraphs for the node $v_i$.

3.1.2. Subgraph Attention Network
This subsection describes the attention mechanism on the set of rooted subtrees selected in each epoch of the algorithm. As mentioned, the node of interest is always positioned as the root of each subgraph generated for that node. The next step is to generate a feature vector for each subgraph. We tried different simple feature aggregations (e.g., mean) of the nodes that belong to the subgraph as the feature of the subgraph. It turns out that concatenation of the features of the nodes gives better performance. But for the attention to work, we need equal length feature vectors (of length $TD$) for all the subgraphs. So if a subgraph has fewer than $T$ nodes, we append zeros at the end to assign an equal length feature vector to all the subgraphs. For example, if the maximum size of a subgraph is $T = 3$, then the feature of the subgraph (a,b) is $x_a \oplus x_b \oplus \mathbf{0}$, where $\oplus$ is the concatenation operation and $\mathbf{0}$ is the zero vector in $\mathbb{R}^D$. Let us denote this derived feature vector of any subgraph $s \in S_i$ as $f(s) \in \mathbb{R}^{TD}$.
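The concatenate-then-pad scheme can be sketched directly; the 2-dimensional node features here are an illustrative assumption:

```python
# Build a fixed-length subgraph feature by concatenating node features in
# root-first order and zero-padding to length T*D (here D = 2 as a toy choice).

def subgraph_feature(subtree, features, T):
    D = len(next(iter(features.values())))
    vec = [x for v in subtree for x in features[v]]   # concatenation
    vec += [0.0] * (T * D - len(vec))                  # zero padding
    return vec

features = {'a': [1.0, 2.0], 'b': [3.0, 4.0]}
f_ab = subgraph_feature(('a', 'b'), features, T=3)     # length T*D = 6
```

Every subgraph feature thus has the same length regardless of how many of the $T$ slots are filled, which is what allows a single shared attention transformation in the next step.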
Next, we use self-attention on the features of the sampled subgraphs of each node, as described here. As the first step, we apply a shared linear transformation, parameterized by a trainable weight matrix $W \in \mathbb{R}^{K \times TD}$, to the features of all the subgraphs sampled in an epoch. Next, we introduce a trainable self attention vector $\mathbf{a} \in \mathbb{R}^{K}$ to compute the attention coefficient $\alpha_{is}$, which captures the importance of the subgraph $s \in S_i$ on the node $v_i$, as follows:

(2) $\alpha_{is} = \dfrac{\exp\big(\sigma(\mathbf{a}^T W f(s))\big)}{\sum_{s' \in S_i} \exp\big(\sigma(\mathbf{a}^T W f(s'))\big)}$

Here $\sigma$ is a nonlinear activation function. We have used Leaky ReLU as the activation function for all the experiments.
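The attention computation can be sketched as follows. The toy shapes, the identity weight matrix and the final weighted sum are illustrative assumptions, not the paper's trained parameters:

```python
import math

# Sketch of subgraph attention (Eq. 2): score each sampled subgraph feature
# with a shared weight matrix W and attention vector a_vec, softmax-normalize
# over the node's sampled subgraphs, then form the node embedding as the
# attention-weighted sum of the transformed subgraph features.

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def subgraph_attention(sub_feats, W, a_vec):
    transformed = [matvec(W, f) for f in sub_feats]
    scores = [leaky_relu(dot(a_vec, t)) for t in transformed]
    m = max(scores)                                  # numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    alphas = [e / Z for e in exps]
    K = len(transformed[0])
    h = [sum(alphas[j] * transformed[j][k] for j in range(len(alphas)))
         for k in range(K)]
    return alphas, h

sub_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # 3 sampled subgraph features
W = [[1.0, 0.0], [0.0, 1.0]]                          # identity, for the sketch
a_vec = [1.0, 1.0]
alphas, h = subgraph_attention(sub_feats, W, a_vec)
```

With these toy inputs the third subgraph scores highest and receives the largest normalized attention weight, while the two symmetric subgraphs receive equal weights.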
Eq. 2 gives normalized attention scores over the set of sampled subgraphs for each node. We use them to compute the representation of a node as $h_i = \sigma\big(\sum_{s \in S_i} \alpha_{is} W f(s)\big)$. Please note, the attention mechanism described in (Veličković et al., 2018) operates only over the immediate neighboring nodes, whereas the higher order attention mechanism proposed in this work operates over subgraphs. Needless to say, one can easily extend the above subgraph attention to multi-head attention by employing a few independent attention mechanisms of Eq. 2 and concatenating the resulting representations (Vaswani et al., 2017). This completes one full subgraph attention layer. We can stack multiple such layers to design a full SubGatt network.

3.2. Hierarchically Attentive Graph Pooling
This subsection discusses all the components of the SubGattPool architecture. As shown in Figure 2, there are $R$ different levels of the graph in the hierarchical architecture. The first level is the input graph. Let us denote these level graphs (i.e., graphs at different levels) by $G^{(1)}, G^{(2)}, \dots, G^{(R)}$. There is a GNN layer between the level graph $G^{(r)}$ (i.e., the graph at level $r$) and the level graph $G^{(r+1)}$. This GNN layer comprises an embedding layer, which generates the embeddings of the nodes of $G^{(r)}$, and a pooling layer, which maps the nodes of $G^{(r)}$ to the nodes of $G^{(r+1)}$. We refer to the GNN layer between the level graphs $G^{(r)}$ and $G^{(r+1)}$ as the $r$th layer of the GNN, $r \in [R-1]$. Please note, the number of nodes $n_1$ in the first level graph depends on the input graph, but we keep the number of nodes $n_r$ in the subsequent level graphs ($r \geq 2$) fixed for all the input graphs (in a graph classification dataset), which helps us to design the shared hierarchical attention mechanisms, as discussed later. As pooling mechanisms shrink a graph, $n_{r+1} < n_r$, $\forall r \in [R-1]$.
Let us assume that any level graph $G^{(r)}$ is defined by its adjacency matrix $A^{(r)} \in \mathbb{R}^{n_r \times n_r}$ and the feature matrix $X^{(r)} \in \mathbb{R}^{n_r \times K}$ (except for $G^{(1)}$, which is the input graph, and its feature matrix is $X$). The $r$th embedding layer and pooling layer are defined by:

(3) $Z^{(r)} = \text{GNN}^{(r)}_{\text{embed}}\big(A^{(r)}, X^{(r)}\big)$, $\quad P^{(r)} = \text{softmax}\Big(\text{GNN}^{(r)}_{\text{pool}}\big(A^{(r)}, X^{(r)}\big)\Big)$

Here, $Z^{(r)} \in \mathbb{R}^{n_r \times K}$ is the embedding matrix of the nodes of $G^{(r)}$. The softmax after the pooling is applied row-wise. The $(i,j)$th element of $P^{(r)} \in \mathbb{R}^{n_r \times n_{r+1}}$ gives the probability of assigning node $v_i$ in $G^{(r)}$ to node $v_j$ in $G^{(r+1)}$. Based on these, the graph $G^{(r+1)}$ is constructed as follows,

(4) $A^{(r+1)} = {P^{(r)}}^T A^{(r)} P^{(r)}$, $\quad X^{(r+1)} = {P^{(r)}}^T Z^{(r)}$
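The coarsening step of Eq. 4 can be sketched with plain matrix products. The 4-node path graph and hard cluster assignment below are toy choices for illustration:

```python
# Sketch of hierarchical pooling: given an assignment matrix P (n_r x n_{r+1}),
# adjacency A (n_r x n_r) and embeddings Z (n_r x K), the next level graph has
# A' = P^T A P and X' = P^T Z. Pure-Python matrix helpers keep it self-contained.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def coarsen(A, Z, P):
    Pt = transpose(P)
    A_next = matmul(matmul(Pt, A), P)
    X_next = matmul(Pt, Z)
    return A_next, X_next

# 4-node path graph pooled into 2 clusters: {0, 1} and {2, 3}.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
Z = [[1.0], [2.0], [3.0], [4.0]]
P = [[1, 0], [1, 0], [0, 1], [0, 1]]   # hard assignment for the sketch
A_next, X_next = coarsen(A, Z, P)
```

The resulting 2-node graph has self-weight 2 on each cluster (the internal edge, counted from both endpoints) and weight 1 between the clusters (the single crossing edge), while the cluster features are the sums of the member embeddings; a soft, row-stochastic `P` would produce weighted mixtures instead.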
The matrix $P^{(r)}$ contains information about how nodes in $G^{(r)}$ are mapped to the nodes of $G^{(r+1)}$, and the adjacency matrix $A^{(r)}$ contains information about the connections of nodes in $G^{(r)}$. Eq. 4 combines them to generate the connections between the nodes (i.e., the adjacency matrix $A^{(r+1)}$) of $G^{(r+1)}$. The node feature matrix $X^{(r+1)}$ of $G^{(r+1)}$ is generated similarly. As the embedding and pooling GNNs, we use SubGatt networks (Section 3.1) only after the level graph 1. This is because the other level graphs ($r \geq 2$) have more soft edges (i.e., with probabilistic edge weights) due to the use of softmax at the end of the pooling layers. Hence, the number of neighboring rooted subtrees would be high in those level graphs and the chance of finding discrete patterns would be low. We use GIN (Xu et al., 2019) as the embedding and pooling GNNs for the layers $r \geq 2$. GIN has been shown to be the most powerful 1st order GNN, and the $k$th layer of GIN can be defined as:
(5) $h_v^{(k)} = \text{MLP}^{(k)}\Big(\big(1 + \epsilon^{(k)}\big) \cdot h_v^{(k-1)} + \sum_{u \in N(v)} h_u^{(k-1)}\Big)$

Here, $h_v^{(k)}$ is the hidden representation of the node $v$ in the $k$th layer of GIN and $\epsilon^{(k)}$ is a learnable parameter.

Intra-level attention layer: As observed in (Lee et al., 2019), hierarchical GNNs often suffer from loss of information in the various embedding and pooling layers, from the input graph to the last level graph summarizing the entire graph. Moreover, the learned hierarchy is often not perfect due to the noisy structure of real-world graphs. To alleviate these problems, we propose to use attention mechanisms again, to combine features from the different level graphs of our hierarchical architecture. We consider the level graphs $G^{(2)}$ to $G^{(R)}$ for this, as their respective numbers of nodes are the same across all the graphs in a dataset. We introduce the intra-level attention layer to obtain a global feature for each level graph $G^{(r)}$, $r \geq 2$. More precisely, we use convolution based self attention within the level graph as:
(6) $e^{(r)} = \text{softmax}\big(\hat{D}^{-1} \hat{A}^{(r)} X^{(r)} \theta_{\text{intra}}\big)$, $\quad x^{(r)} = {e^{(r)}}^T X^{(r)}$

Here, the softmax to compute $e^{(r)} \in \mathbb{R}^{n_r}$ is taken so that a component of $e^{(r)}$ becomes the normalized (i.e., probabilistic) importance of the corresponding node in $G^{(r)}$. $\hat{A}^{(r)} = A^{(r)} + I$ is the adjacency matrix of $G^{(r)}$ with added self loops. $\hat{D}$ is the diagonal matrix of dimension $n_r \times n_r$ with $\hat{D}_{ii} = \sum_j \hat{A}^{(r)}_{ij}$. $\theta_{\text{intra}} \in \mathbb{R}^{K}$ is the trainable vector of parameters of intra-level attention, which is shared across all the level graphs $G^{(r)}$, $r \geq 2$. Intuitively, $\theta_{\text{intra}}$ contains the importance of the individual attributes, and the components of the $n_r$ dimensional vector $\hat{D}^{-1} \hat{A}^{(r)} X^{(r)} \theta_{\text{intra}}$ give the same for each node; taking the softmax then produces the (normalized) importance of a node based on its own features and the features of its immediate neighbors (for one layer of intra-level attention). Hence, $x^{(r)}$, which is a $K$ dimensional representation of the level graph $G^{(r)}$, is a sum of the features of the nodes weighted by their respective normalized node importance. Please note, the impact of the first few level graphs becomes noisy due to too many subsequent operations in a hierarchical pooling method. But representing the level graphs separately by the proposed intra-level attention makes their impact more prominent.
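The intra-level readout can be sketched as follows. The degree-normalized smoothing used for the per-node scores is an assumption based on the description above, and the 2-node graph is a toy example:

```python
import math

# Sketch of intra-level attention: smooth node features with the degree-
# normalized self-loop adjacency, score each node with a shared vector theta,
# softmax over nodes, then return the attention-weighted sum of node features.

def intra_level_readout(A_hat, X, theta):
    n, K = len(A_hat), len(X[0])
    deg = [sum(row) for row in A_hat]
    scores = []
    for i in range(n):
        # (D^-1 A_hat X)_i . theta : one scalar importance score per node
        smoothed = [sum(A_hat[i][j] * X[j][k] for j in range(n)) / deg[i]
                    for k in range(K)]
        scores.append(sum(s * t for s, t in zip(smoothed, theta)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    e = [v / sum(exps) for v in exps]                 # normalized node importance
    return [sum(e[i] * X[i][k] for i in range(n)) for k in range(K)]

A_hat = [[1, 1], [1, 1]]            # 2-node graph with self loops added
X = [[1.0, 0.0], [0.0, 1.0]]
theta = [1.0, 0.0]
x_r = intra_level_readout(A_hat, X, theta)   # one K-dim vector per level graph
```

In this symmetric toy graph both nodes get equal importance, so the level-graph vector is simply the average of the two node features.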
Inter-level attention layer: This layer aims to get the final representation, referred to as $h_G$, of the input graph from $x^{(2)}, \dots, x^{(R)}$, as obtained from the intra-level attention layers; it is then fed to a neural classifier. As different level graphs of the hierarchical representation have different importance in determining the label of the input graph, we propose to use the following self-attention mechanism:

(7) $\beta = \text{softmax}\big(E \, \theta_{\text{inter}}\big)$, $\quad h_G = \sum_{r=2}^{R} \beta_r \, x^{(r)}$

$E \in \mathbb{R}^{(R-1) \times K}$ is the matrix whose rows correspond to $x^{(r)}$ (the output of the intra-level attention layer for $G^{(r)}$), $r \geq 2$. $\theta_{\text{inter}} \in \mathbb{R}^{K}$ is a trainable self attention vector. Similar to Eq. 6, the softmax is taken to convert $E \, \theta_{\text{inter}}$ to a probability distribution over the importance of the different graph levels. Finally, the vector representation $h_G$ of the input graph is computed as a weighted sum of the representations of the different level graphs. $h_G$ is fed to a classification layer of the GNN, which is a dense layer followed by a softmax, to classify the entire input graph in an end-to-end fashion. This completes the construction of the SubGattPool architecture.

3.3. Key Insights of SubGattPool
The first layer of SubGattPool consists of an embedding SubGatt network and a pooling SubGatt network. Subsequent layers of SubGattPool have GIN as the embedding and pooling layers. The intra-level attention layers share the single parameter vector $\theta_{\text{intra}}$ across the level graphs, and the inter-level attention layer adds only the vector $\theta_{\text{inter}}$. Hence, the total number of parameters to train in the SubGattPool network is independent of both the average number of nodes and the number of graphs in the dataset. We use ADAM (with learning rate set to 0.001) on the cross-entropy loss of graph classification to train these parameters.
Please note that, in contrast to existing hierarchical pooling mechanisms in GNNs (Ying et al., 2018; Morris et al., 2019), SubGattPool does not rely only on the last level of the GNN hierarchy to obtain the final graph representation; SubGattPool may even have more than one node in the last level graph. Essentially, information from all the level graphs is aggregated through the attention layers. SubGattPool is thus less prone to information loss in the hierarchy, and is able to learn the importance of individual nodes in a hierarchy (i.e., within a level graph) as well as the importance of the different hierarchies. In terms of design, most of the existing GNNs use GCN embedding and pooling layers (Ying et al., 2018), whereas we propose the subgraph attention mechanism through the SubGatt network (discussed in Section 3.1) and use it along with GIN as the different embedding and pooling layers of SubGattPool. The following lemma shows that SubGattPool, though it has different types of components in the overall architecture, satisfies a fundamental property required of a graph neural network.
Lemma 3.1.
For a graph $G$ with adjacency matrix $A$ and node attribute matrix $X$, let us use $\text{SGP}(A, X)$ to denote the final graph representation generated by SubGattPool on that graph. Let $P$ be any permutation matrix. Assuming that the initialization and random selection strategies of the neural architecture are always the same, $\text{SGP}(P A P^T, P X) = \text{SGP}(A, X)$.
Proof.
Please note that $P A P^T$ is the new adjacency matrix and $P X$ is the new feature matrix of the same graph under the node permutation defined by the permutation matrix $P$. So, to prove the above, we need to show that each component of SubGattPool is invariant to any node permutation. First, SubGatt uses an attention mechanism over the neighboring subgraphs through Equation 2. Clearly, a different ordering of the neighbors would not affect the node embeddings, as we use a weighted sum aggregator whose weights are learned through the subgraph attention. Next, the GIN aggregator (as in Equation 5) is also invariant to node permutations. Thus, all the embedding and pooling layers (as shown in Figure 2) present in SubGattPool are invariant to different node permutations. Finally, both the intra-level and inter-level attention mechanisms also do not depend on the ordering of the nodes in any level graph, as each of them uses sum aggregation with self-attention. Hence, SubGattPool is invariant to node permutations of the input graph. ∎
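The invariance argument can be checked numerically on a toy stand-in. The degree-weighted sum readout below is a simplified permutation-invariant model, not the full SubGattPool network:

```python
# Numerical check of the Lemma 3.1 property on a toy model: permuting the
# nodes (A -> its relabeled copy, X -> reordered rows) leaves the graph
# representation unchanged, because the readout is a sum over nodes of
# quantities that depend only on each node's own feature and degree.

def degree_weighted_readout(A, X):
    # Each node contributes its feature scaled by (1 + degree); sum over nodes.
    n, k = len(A), len(X[0])
    return [sum((1 + sum(A[i])) * X[i][j] for i in range(n)) for j in range(k)]

def permute(A, X, perm):
    # perm[i] = original index of the node placed at position i
    n = len(A)
    A_p = [[A[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
    X_p = [X[perm[i]] for i in range(n)]
    return A_p, X_p

A = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]       # a 3-node star
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
h1 = degree_weighted_readout(A, X)
A_p, X_p = permute(A, X, [2, 0, 1])          # relabel the nodes
h2 = degree_weighted_readout(A_p, X_p)
```

Any permutation of the node labels produces identical `h1` and `h2`, mirroring the sum-aggregation argument used in the proof.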
4. Experimental Evaluation
This section describes the details of the experiments conducted on both real-life and synthetic datasets.
Dataset  #Graphs  #Max Nodes  Avg. Number of Nodes  #Labels  #Attributes 
MUTAG  188  28  17.93  2  NA 
PTC  344  64  14.29  2  NA 
PROTEINS  1113  620  39.06  2  29 
NCI1  4110  111  29.87  2  NA 
NCI109  4127  111  29.68  2  NA 
IMDB-BINARY  1000  136  19.77  2  NA 
IMDB-MULTI  1500  89  13.00  3  NA 
4.1. Experimental Setup for Graph Classification
We use five bioinformatics graph datasets (MUTAG, PTC, PROTEINS, NCI1 and NCI109) and two social network datasets (IMDB-BINARY and IMDB-MULTI) to evaluate graph classification performance. The details of these datasets can be found at (https://bit.ly/39T079X). Table 2 contains a high-level summary of these datasets.
To compare the performance of SubGattPool, we choose twenty state-of-the-art baseline algorithms from the domains of graph kernels, unsupervised graph representation and graph neural networks (Table 3). The reported accuracy numbers of the baseline algorithms are collected from (Maron et al., 2019; Sun et al., 2020; Narayanan et al., 2017a), where the same experimental setup is adopted. Thus, we avoid any degradation of the performance of the baseline algorithms due to insufficient parameter tuning and validation.
We adopt the same experimental setup as in (Xu et al., 2019). We perform 10-fold cross validation and report the averaged accuracy and the corresponding standard deviation for graph classification. We keep the values of the hyperparameters the same across all the datasets, based on the averaged validation accuracy. We set the pooling ratio (defined as $n_{r+1}/n_r$, $r \geq 2$) to 0.5, the number of levels R=3 and the maximum subgraph size (T) to 3. We sample L=12 subgraphs for each node in each epoch of SubGatt. Following most of the literature, we set the embedding dimension K to 128. We use L2 normalization and dropout in the SubGattPool architecture to make the training stable.

Algorithms  MUTAG  PTC  PROTEINS  NCI1  NCI109  IMDB-B  IMDB-M 
GK (Shervashidze et al., 2009)  81.39±1.7  55.65±0.5  71.39±0.3  62.49±0.3  62.35±0.3  NA  NA 
RW (Vishwanathan et al., 2010)  79.17±2.1  55.91±0.3  59.57±0.1  NA  NA  NA  NA 
PK (Neumann et al., 2016)  76±2.7  59.5±2.4  73.68±0.7  82.54±0.5  NA  NA  NA 
WL (Shervashidze et al., 2011)  84.11±1.9  57.97±2.5  74.68±0.5  84.46±0.5  85.12±0.3  NA  NA 
AWE-DD (Ivanov and Burnaev, 2018)  NA  NA  NA  NA  NA  74.45±5.8  51.54±3.6 
AWE-FB (Ivanov and Burnaev, 2018)  87.87±9.7  NA  NA  NA  NA  73.13±3.2  51.58±4.6 
node2vec (Grover and Leskovec, 2016)  72.63±10.20  58.85±8.00  57.49±3.57  54.89±1.61  52.68±1.56  NA  NA 
sub2vec (Adhikari et al., 2017)  61.05±15.79  59.99±6.38  53.03±5.55  52.84±1.47  50.67±1.50  55.26±1.54  36.67±0.83 
graph2vec (Narayanan et al., 2017b)  83.15±9.25  60.17±6.86  73.30±2.05  73.22±1.81  74.26±1.47  71.1±0.54  50.44±0.87 
InfoGraph (Sun et al., 2020)  89.01±1.13  61.65±1.43  NA  NA  NA  73.03±0.87  49.69±0.53 
DGCNN (Zhang et al., 2018)  85.83±1.7  58.59±2.5  75.54±0.9  74.44±0.5  NA  70.03±0.9  47.83±0.9 
PSCN (Niepert et al., 2016)  88.95±4.4  62.29±5.7  75±2.5  76.34±1.7  NA  71±2.3  45.23±2.8 
DCNN (Atwood and Towsley, 2016)  NA  NA  61.29±1.6  56.61±1.0  NA  49.06±1.4  33.49±1.4 
ECC (Simonovsky and Komodakis, 2017)  76.11  NA  NA  76.82  75.03  NA  NA 
DGK (Yanardag and Vishwanathan, 2015)  87.44±2.7  60.08±2.6  75.68±0.5  80.31±0.5  80.32±0.3  66.96±0.6  44.55±0.5 
DIFFPOOL (Ying et al., 2018)  83.56  NA  76.25  NA  NA  NA  47.91 
IGN (Maron et al., 2018)  83.89±12.95  58.53±6.86  76.58±5.49  74.33±2.71  72.82±1.45  72.0±5.54  48.73±3.41 
GIN (Xu et al., 2019)  89.4±5.6  64.6±7.0  76.2±2.8  82.7±1.7  NA  75.1±5.1  52.3±2.8 
123GNN (Morris et al., 2019)  86.1  60.9  75.5  76.2  NA  74.2  49.5 
3WLGNN (Maron et al., 2019)  90.55±8.7  66.17±6.54  77.2±4.73  83.19±1.11  81.84±1.85  72.6±4.9  50±3.15 
SubGattPool  93.29±4.78  67.13±6.45  76.92±3.44  82.59±1.42  80.95±1.76  76.49±2.94  52.46±3.48 
Rank  1  1  2  3  3  1  1 
4.2. Performance on Graph Classification
Table 3 shows the performance of SubGattPool along with a diverse set of baseline algorithms for graph classification on multiple real-world datasets. From the results, we can observe that SubGattPool is able to improve the state of the art on MUTAG, PTC, IMDB-B and IMDB-M for graph classification. On PROTEINS, the performance gap with the best performing baseline, 3WLGNN (Maron et al., 2019), is less than 1%. But on NCI1 and NCI109, the WL kernel turns out to be the best performing algorithm, with a good margin over all the GNN based algorithms. It is interesting to note that SubGattPool is able to outperform the existing hierarchical GNN algorithms DIFFPOOL and 123GNN consistently on all the datasets. This is because of (i) the use of attention over subgraphs in the SubGatt embedding and pooling layers, and (ii) the use of intra-level and inter-level attention mechanisms over the different level graphs, which makes the overall architecture more robust and reduces information loss. In terms of standard deviation, SubGattPool is highly competitive and often better than most of the better performing GNNs (especially GIN and 3WLGNN).
4.3. Interpretation of Subgraph Attention via Synthetic Experiment
Subgraph attention is a key component of SubGattPool. Here, we validate the learned attention values on different subgraphs by conducting an experiment on a small synthetic dataset containing 50 graphs, each having 8 nodes. Each graph has 2 balanced communities, and for exactly 50% of the graphs, one community consists of a clique of size 4. We label a graph with +1 if the clique of size 4 is present; otherwise the label is −1. The goal of this experiment is to see if SubGattPool is able to learn this simple rule of graph classification by paying proper attention to the substructure which determines the label of a graph.
We run SubGattPool on this synthetic dataset with a single SubGatt layer. Once the training is complete, we randomly select a graph from the positive class and a node in it, and plot the attention values of all the subgraphs selected in the last epoch for that node in Figure 3. Clearly, the attention value corresponding to the clique (containing the nodes 7, 6, 5 and 4) is much higher than those of the other subgraphs. We have manually verified the same observation on multiple graphs in this synthetic dataset. Thus, SubGattPool is able to pay more attention to the correct substructure (i.e., subgraph) and less attention to other irrelevant substructures. This also explains the robust behavior of SubGattPool.
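A dataset with the structure described above can be generated with a short sketch. The concrete node layout (communities on nodes 0–3 and 4–7), the intra-community edge probability, and the number of inter-community edges are our assumptions for illustration; the paper only specifies the size, the balanced communities, and the clique-based labeling rule.

```python
import random
from itertools import combinations

def random_block(nodes, rng, p=0.6):
    """Random intra-community edges; guaranteed NOT to form a full clique.
    (The edge probability p is an assumption; the paper does not specify it.)"""
    edges = {tuple(sorted(e)) for e in combinations(nodes, 2) if rng.random() < p}
    n = len(nodes)
    if len(edges) == n * (n - 1) // 2:
        edges.pop()  # drop one edge so this community is not a clique
    return edges

def make_graph(positive, rng):
    """One 8-node graph with two balanced communities (nodes 0-3 and 4-7).
    The second community is a 4-clique exactly when positive is True."""
    a = random_block(range(0, 4), rng)
    if positive:
        b = {tuple(sorted(e)) for e in combinations(range(4, 8), 2)}  # 4-clique
    else:
        b = random_block(range(4, 8), rng)
    # Two sparse inter-community edges -- too few to ever create a
    # cross-community 4-clique (that would need at least three such edges).
    inter = {tuple(sorted((rng.randrange(0, 4), rng.randrange(4, 8))))
             for _ in range(2)}
    return a | b | inter

rng = random.Random(0)
# 25 positive (+1, clique present) and 25 negative (-1) graphs.
dataset = [(make_graph(i < 25, rng), +1 if i < 25 else -1) for i in range(50)]
```

Because the only way a 4-clique can appear is inside a community, the label +1 holds exactly when the planted clique is present, matching the classification rule the experiment tests.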
4.4. Graph Clustering
Though our proposed algorithm SubGattPool is designed for graph classification, we also want to check the quality of the graph representations obtained by SubGattPool through graph clustering. We use only a subset of recently proposed GNN-based algorithms as baselines in this experiment. We use similar hyperparameter values (discussed in Section 4.1) as applicable and adopt the same hyperparameter tuning strategy to obtain the graph representations for all the algorithms considered. The vector representations obtained for all the graphs by a GNN are given to the K-means++ (Arthur and Vassilvitskii, 2006) algorithm to get the clusters. To evaluate the quality of clustering, we use unsupervised clustering accuracy (Bandyopadhyay et al., 2019, 2020), which considers different permutations of the labels and chooses the label ordering which gives the best possible accuracy:

$ACC = \max_{\Pi} \frac{\sum_{i=1}^{N} \mathbb{1}(\Pi(c_i) = l_i)}{N}$

Here $l$ is the ground truth labeling of the dataset such that $l_i$ gives the ground truth label of the $i$th data point. Similarly, $c$ is the clustering assignment discovered by some algorithm, $\Pi$ is a permutation of the set of labels, and $\mathbb{1}(\cdot)$ is a logical operator which returns 1 when the argument is true, and 0 otherwise. Table 4 shows that SubGattPool is able to outperform all the baselines we used for graph clustering on all three datasets. Please note that DGI and InfoGraph derive the graph embeddings in an unsupervised way, whereas DIFFPOOL and SubGattPool use supervision. Naturally, the performance of the latter two is better on all the datasets. Further, the use of subgraph attention along with the hierarchical attention layers helps SubGattPool to perform consistently better than DIFFPOOL, which is also hierarchical in nature.

Algorithms  MUTAG  PROTEINS  IMDB-M
DGI  74.73  59.20  36.83 
InfoGraph  77.65  59.93  35.93 
DIFFPOOL  82.08  60.81  41.72 
SubGattPool  90.68  65.45  50.23 
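The unsupervised clustering accuracy used above can be sketched directly from its definition: try every one-to-one relabeling of the discovered clusters onto the ground-truth labels and keep the best match. The function name and argument names are ours, not from the paper.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids):
    """Unsupervised clustering accuracy: the best accuracy over all
    one-to-one relabelings (permutations Pi) of the discovered clusters
    onto the ground-truth labels."""
    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best = 0.0
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))  # one candidate relabeling Pi
        hits = sum(mapping[c] == t for c, t in zip(cluster_ids, true_labels))
        best = max(best, hits / len(true_labels))
    return best
```

Brute-forcing permutations is exponential in the number of clusters; with more than a handful of classes, the same optimum is usually found with the Hungarian algorithm (e.g., `scipy.optimize.linear_sum_assignment` on the confusion matrix).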
4.5. Model Ablation Study
SubGattPool has two main novel components: the SubGatt layer, and the intra-level and inter-level attention layers, which together make SubGattPool a mixture of global and hierarchical pooling strategies. To see the usefulness of each component, we show the performance after removing that component from SubGattPool. We choose graph visualization of MUTAG in Figure 4 and graph visualization of PTC in Figure 5 as the downstream tasks for this experiment. We use t-SNE (van der Maaten and Hinton, 2008) to map the graph embeddings to a two-dimensional plane. Different colors represent different labels of the graphs, and the performance is better when different colors form different clusters in the plot. We choose DIFFPOOL as the base model in Figure 4(a) because it is also a hierarchical graph representation technique. In Figure 4(b), we replace the SubGatt embedding and pooling layers with GIN embedding and pooling layers in SubGattPool (see Figure 2). Similarly, in Figure 4(c), we remove the inter-level and intra-level attention and obtain the graph representation from the last level graph (by creating only one node there) in SubGattPool. Finally, Figure 4(d) shows the performance of SubGattPool, which combines all these components into a single network. Clearly, the performances in Figures 4(b) and 4(c) are better than that in Figure 4(a), but the best performance is achieved in Figure 4(d), which uses the complete SubGattPool network on MUTAG. The same observations, i.e., the improved performance of the variants of SubGattPool over DIFFPOOL and of SubGattPool over its variants, are also prominent in Figure 5 on the PTC dataset. This clearly shows the individual and combined usefulness of the various components of SubGattPool for graph representation.
4.6. Sensitivity Analysis
We conduct a sensitivity analysis of the proposed algorithm in this section. The SubGatt network has four important hyperparameters: (i) the maximum size of a subgraph, (ii) the number of subgraphs sampled per node in each epoch, (iii) the dimension of the final node representation or embedding (see Section 3.1.2), and (iv) the number of SubGatt layers used in the network. We conduct the graph classification experiment on MUTAG to see the sensitivity of SubGattPool with respect to each of these hyperparameters. Figure 6 shows the variation of the performance of the SubGattPool network for graph classification with respect to all these hyperparameters. We show both the average graph classification accuracy and the standard deviation over 10 repetitions for each experiment.
From Figure 6, we can see that the performance of SubGattPool on MUTAG improves when the maximum size of a subgraph is set to 3. As the average size of a graph in MUTAG is quite small, a subgraph of size more than 3 does not help. Similarly, with an increasing number of sampled subgraphs for each node in an epoch, the performance of SubGattPool improves first and then saturates. The same observation can be made for the embedding dimension of the graphs. We use SubGatt as the embedding and pooling layers of the GNN after level graph 1. Figure 6 also shows that the best performance on MUTAG is obtained with 2 layers of SubGatt. Adding more layers actually deteriorates the performance because of over-smoothing, which is a well-known problem of graph neural networks (Luan et al., 2019). Overall, the variation with respect to each hyperparameter is as expected and often small, which shows the robustness of SubGattPool. Please note that when we vary one hyperparameter of SubGattPool, the values of all other hyperparameters are fixed to the values mentioned in Section 4.1.
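The one-at-a-time sweep protocol behind Figure 6 can be sketched as below. The `train_and_eval` routine and every number inside it are hypothetical placeholders standing in for an actual training run of SubGattPool on MUTAG; only the reporting scheme (mean and standard deviation over 10 repetitions while the other hyperparameters stay fixed) comes from the text.

```python
import random
import statistics

def train_and_eval(max_subgraph_size, seed):
    """Hypothetical stand-in for one training run of SubGattPool on MUTAG
    that returns a test accuracy; the numbers are synthetic, not the paper's."""
    rng = random.Random(seed * 10 + max_subgraph_size)
    base = {1: 0.85, 2: 0.90, 3: 0.93, 4: 0.92}[max_subgraph_size]
    return base + rng.uniform(-0.02, 0.02)

# Vary one hyperparameter while the others stay fixed, and report
# mean +/- standard deviation over 10 repetitions.
for size in (1, 2, 3, 4):
    accs = [train_and_eval(size, s) for s in range(10)]
    print(f"max subgraph size {size}: "
          f"{statistics.mean(accs):.3f} +/- {statistics.stdev(accs):.3f}")
```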
5. Conclusion
We have proposed a novel GNN-based robust graph classification algorithm called SubGattPool, which uses higher-order attention over the subgraphs of a graph and also addresses some shortcomings of existing hierarchical graph representation techniques. We conduct experiments with both real-world and synthetic graph datasets on multiple graph-level downstream tasks to show the robustness of our algorithm. We are also able to improve the state-of-the-art graph classification performance on four popularly used graph datasets. In the future, we would like to theoretically examine the expressive power of SubGatt and SubGattPool for node and graph representations, respectively. We will also analyze the recovery of communities in a graph within the hierarchical structure of SubGattPool. Overall, we believe that this work will encourage further development in the area of hierarchical graph representation and classification.
References
MixHop: higher-order graph convolutional architectures via sparsified neighborhood mixing. In International Conference on Machine Learning, pp. 21–29.
Distributed representations of subgraphs. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 111–117.
K-means++: the advantages of careful seeding. Technical report, Stanford.
Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1993–2001.
FSCNMF: fusing structure and content via non-negative matrix factorization for embedding information networks. arXiv preprint arXiv:1804.05313.
Outlier aware network embedding for attributed networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 12–19.
Outlier resistant unsupervised deep architectures for attributed network embedding. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 25–33.
Shortest-path kernels on graphs. In Fifth IEEE International Conference on Data Mining (ICDM'05), 8 pp.
Hyperbolic graph convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 4869–4880.
Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852.
Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232.
Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1263–1272.
node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864.
Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1025–1035.
Anonymous walk embeddings. arXiv preprint arXiv:1805.11921.
Marginalized kernels between labeled graphs. In ICML, pp. 321–328.
Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Understanding attention in graph neural networks. In Advances in Neural Information Processing Systems.
Graph classification using structural attention. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1666–1674.
Self-attention graph pooling. In International Conference on Machine Learning, pp. 3734–3743.
Break the ceiling: stronger multi-scale deep graph convolutional networks. In Advances in Neural Information Processing Systems, pp. 10943–10953.
Provably powerful graph networks. In Advances in Neural Information Processing Systems, pp. 2153–2164.
Invariant and equivariant graph networks. arXiv preprint arXiv:1812.09902.
Weisfeiler and Leman go neural: higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4602–4609.
graph2vec: learning distributed representations of graphs. In 13th International Workshop on Mining and Learning with Graphs (MLG Workshop 2017).
graph2vec: learning distributed representations of graphs. arXiv preprint arXiv:1707.05005.
Propagation kernels: efficient graph kernels from propagated information. Machine Learning 102 (2), pp. 209–245.
The structure and function of complex networks. SIAM Review 45 (2), pp. 167–256.
Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pp. 2014–2023.
Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12 (Sep), pp. 2539–2561.
Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp. 488–495.
Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3693–3702.
InfoGraph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In International Conference on Learning Representations.
Breaking cycles in noisy hierarchies. In Proceedings of the 2017 ACM on Web Science Conference, pp. 151–160.
Distributed, egocentric representations of graphs for detecting critical structures. In International Conference on Machine Learning, pp. 6354–6362.
Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605.
Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
Graph attention networks. In International Conference on Learning Representations.
Graph kernels. Journal of Machine Learning Research 11 (Apr), pp. 1201–1242.
A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596.
How powerful are graph neural networks? In International Conference on Learning Representations.
Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pp. 5449–5458.
Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374.
SPAGAN: shortest path graph attention network. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 4099–4105.
Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pp. 4800–4810.
Position-aware graph neural networks. In International Conference on Machine Learning, pp. 7134–7143.
An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence.