Introduction
Graphs are among the most useful data structures for representing real-world data such as molecules, social networks and citation networks. Consequently, representation learning (Bengio et al., 2013; Hamilton et al., 2017a, b)
on graph-structured data has gained prominence in the machine learning community. In recent years, Graph Neural Networks (GNNs)
(Kipf and Welling, 2016; Defferrard et al., 2016; Gilmer et al., 2017; Wu et al., 2020) have become the models of choice for learning representations of graph-structured data. GNNs operate locally on each node by iteratively aggregating information from its neighbourhood in the form of messages and updating its embedding based on the messages received. This local procedure has been shown to capture some of the global structure of the graph. However, there are many simple topological properties, such as detecting triangles, that GNNs fail to capture (Chen et al., 2020; Knyazev et al., 2019). Theoretically, GNNs have been shown to be no more powerful than the 1-dimensional Weisfeiler-Lehman (1-WL) test for graph isomorphism (Xu et al., 2018; Morris et al., 2019) in terms of their capacity to distinguish non-isomorphic graphs. 1-WL is a classical technique that iteratively colors vertices by injectively hashing the multisets of colors of their adjacent vertices. Although this procedure produces a stable coloring, which can be used to compare graphs, it fails to uniquely color many non-symmetric vertices. This is precisely the reason for the failure of 1-WL, and thereby GNNs, to distinguish many simple graph structures (Kiefer et al., 2020; Arvind et al., 2020). Further works on improving the power of GNNs have tried to incorporate higher-order WL generalizations in the form of higher-order GNNs (Morris et al., 2019; Maron et al., 2019; Morris et al., 2020). These higher-order GNNs jettison the local nature of 1-WL and operate on tuples of vertices, and hence are computationally demanding. However, this suggests that techniques for learning better node representations can be found in classical algorithms used to detect graph isomorphisms.
Another possible approach for learning node features is the individualization and refinement (IR) paradigm, which provides a useful toolbox for testing graph isomorphism. In fact, all state-of-the-art isomorphism solvers follow this approach and are fairly fast in practice (McKay and others, 1981; Darga et al., 2004; Junttila and Kaski, 2011; McKay and Piperno, 2014). IR techniques aim to assign a distinct color to each vertex such that the coloring respects permutation invariance and is canonical, i.e. unique to its isomorphism class (no two non-isomorphic graphs get the same coloring). This is achieved first by individualization, a process of recoloring a vertex in order to artificially distinguish it from the rest of the vertices. Thereafter, a color refinement procedure like 1-WL message passing is applied to propagate this information across the graph. This process of IR is repeated until each vertex gets a unique color. But the coloring is not canonical yet. To preserve permutation invariance, whenever a vertex is individualized, we have to individualize, and thereafter refine, all other vertices with the same color as well. As each individualization of a vertex gives a different final discrete coloring, this generates a search tree of colorings where each individualization forms a tree node. The tree generated is canonical. The coloring of the graph at each leaf can also be used to label the graph, and the graph with the largest label can be used as a canonical graph. However, the size of the search tree grows exponentially fast, and isomorphism solvers prune the search tree heavily by detecting symmetries like automorphisms, and by other hand-crafted techniques.
In this paper, we propose to improve the representations learnt by GNNs by incorporating the inductive biases suggested by individualization and refinement into the node-embedding updates. Unlike isomorphism solvers, from a learning perspective it is not desirable to check for automorphisms to prune the search tree. For computational efficiency, we restrict our search to individualizing a fixed number of nodes in each iteration. To prevent the search from blowing up exponentially, we take an approach similar to beam search and reduce the refined graphs to a single representative graph before repeating the individualization and refinement process. This simple technique lets us learn richer node embeddings while keeping the computational complexity manageable. We validate our approach with experiments on synthetic and real datasets, where we show that our model does well on problems where 1-WL GNN models clearly fail. We then show that our model outperforms other prominent higher-order GNNs by a substantial margin on some real datasets.
Related Work
Over the last few years, there has been a considerable number of works on understanding the representative power of GNNs in terms of their capacity to distinguish non-isomorphic graphs (Xu et al., 2018; Morris et al., 2019; Chen et al., 2019b; Dehmamy et al., 2019; Srinivasan and Ribeiro, 2019; Loukas, 2019; Barceló et al., 2019). These works have established that GNNs are no more expressive than 1-WL kernels in graph classification (Shervashidze et al., 2011). Furthermore, Chen et al. (2020) have shown that many commonly occurring substructures cannot be detected by GNNs. One example is counting the number of triangles, which can be easily computed from the third power of the adjacency matrix of the graph (Knyazev et al., 2019).
To improve the expressive power of GNNs, multiple works have tried to break away from this limitation of the local message passing of 1-WL. Higher-order GNNs, which can be seen as neural versions of the k-dimensional WL tests, have been proposed and studied in a series of works (Maron et al., 2018; Morris et al., 2019; Maron et al., 2019; Morris et al., 2020). Although these models are provably more powerful than 1-WL GNNs, they suffer from computational bottlenecks as they operate on tuples of nodes. Maron et al. (2019) proposed a computationally feasible model, but it is limited to 3-WL expressivity. Note that 1-WL and 2-WL have the same expressivity (Maron et al., 2019). Another line of work on improving the expressivity of GNNs introduces extra features which can be computed by preprocessing the graphs, in the form of distance encoding (Li et al., 2020) and subgraph isomorphism counting (Bouritsas et al., 2020). These techniques help in practice but fall short in the quest for better algorithms that extract such information through improved learning.
The main problem of message passing in GNNs is that nodes fail to identify whether successive messages received are from the same or different nodes. To address this issue, Sato et al. (2020) and Dasoulas et al. (2019) proposed introducing features with random identifiers for each node. These models, though showing some benefit, can only maintain permutation invariance in expectation. Furthermore, structural message passing in the form of matrices with unique identifiers may help, as shown in Vignac et al. (2020), but its fast version has the same expressivity as Maron et al. (2019).
In this work, we aim to address some of these concerns by leveraging the individualization and refinement
paradigm of isomorphism solvers, which are often fast in practice. We make a few approximations, individualizing a fixed number of nodes and keeping a single representative graph in each iteration, to manage the computational complexity. To avoid a loss in accuracy, we use learning both for adaptively selecting the nodes to individualize and for merging the graphs. Our approach provides a flexible trade-off between the robustness and speed of the model, using the number of nodes to individualize as a hyperparameter which can be tuned on training data.
Preliminaries
Let G = (V, E) be a graph with vertex set V and edge set E. Two graphs G and H are isomorphic if there exists an adjacency-preserving bijective mapping φ : V(G) → V(H), i.e. (u, v) ∈ E(G) iff (φ(u), φ(v)) ∈ E(H). An automorphism of G is an isomorphism that maps G onto itself. Intuitively, two vertices can be mapped to each other via an automorphism if they are structurally indistinguishable in the graph. A prominent way of identifying isomorphism is by coloring the nodes based on the structure of the graph in a permutation-invariant manner and comparing whether two graphs have the same coloring. Formally, a vertex coloring is a surjective function π : V → ℕ which assigns each vertex of the graph a color (natural number). A graph G is called a colored graph (G, π) if π is a coloring of G. Given a graph G, a cell of π is the set of vertices with a given color. A vertex coloring partitions V into cells of color classes and hence is often called a partition. If any two vertices of the same color are adjacent to the same number of vertices of each color, then the coloring is called an equitable coloring, which cannot be further refined. If π and π′ are equitable colorings of a graph, then π′ is finer than or equal to π if π′(u) = π′(v) implies π(u) = π(v) for all u, v ∈ V. This implies that each cell of π′ is a subset of a cell of π, but the converse is not true. A discrete coloring is a coloring where each color cell has a single vertex, i.e. each vertex is assigned a distinct color.
Vertex refinement or 1-WL test
1-dimensional Weisfeiler-Lehman coloring is a graph-coloring algorithm used to test isomorphism between graphs. Initially, all vertices are labeled uniformly with the same color, and the colors are then refined iteratively based on the local neighbourhood of each vertex. In each iteration, two vertices are assigned different colors if the multisets of their colored neighbourhoods are different. Specifically, if c_v^(t) is the color of vertex v at time step t, then the colors are refined with the following update: c_v^(t+1) = hash(c_v^(t), {{ c_u^(t) : u ∈ N(v) }}), where {{·}} denotes a multiset and N(v) is the neighbourhood of v. This procedure produces an equitable coloring where no further refinement is possible, and the algorithm stops. This is a powerful technique of node coloring, but it has been shown to be restricted for many classes of graphs which cannot be distinguished by 1-WL refinement (Kiefer et al., 2020; Arvind et al., 2020; Chen et al., 2020).
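As an illustration, the refinement loop above can be sketched in a few lines of Python (a minimal sketch with our own function names; real implementations use injective hashing rather than explicit relabeling):

```python
from collections import Counter

def wl_refine(adj, colors=None):
    """Iteratively refine vertex colors until the coloring is equitable.

    adj: dict mapping each vertex to a list of its neighbours.
    colors: optional initial coloring; defaults to a uniform coloring.
    """
    colors = colors or {v: 0 for v in adj}
    while True:
        # New signature = (own color, multiset of neighbour colors).
        sigs = {
            v: (colors[v], tuple(sorted(Counter(colors[u] for u in adj[v]).items())))
            for v in adj
        }
        # Injectively relabel distinct signatures with small integers.
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new_colors = {v: palette[sigs[v]] for v in adj}
        if new_colors == colors:   # equitable: no further refinement possible
            return colors
        colors = new_colors

# A 6-cycle is vertex-transitive: 1-WL leaves all vertices with one color.
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(len(set(wl_refine(cycle).values())))  # 1
```

On a path graph, by contrast, the refinement separates endpoint vertices from interior ones, illustrating how local degree information propagates.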
Individualization and refinement
Most of the present graph isomorphism solvers, e.g. Nauty (McKay and others, 1981) and Traces (McKay and Piperno, 2014), are based on a coloring paradigm called individualization and refinement. In practice, these solvers are quite fast even though they can take exponential time in the worst case. A comprehensive explanation of individualization and refinement algorithms can be found in McKay and Piperno (2014). Below we give a brief description.
Individualization refers to picking a vertex among the vertices of a given color and distinguishing it with a new color. Once a vertex is distinguished from the rest, this information can be propagated to the other nodes by 1-WL message passing. An example is shown in Figure 1, where initially 1-WL coloring is unable to distinguish the two graphs. Individualizing one of the blue-colored vertices and further refinement produces an equitable coloring of the graphs such that they become distinguishable.
The procedure to generate a canonical coloring of the graph is as follows. Initially, vertex refinement is used to obtain the equitable coloring of the graph. This partitions the graph into a set of color cells. Then one of the color cells, called the target cell, is chosen, and a vertex from it is reassigned a new color. This is propagated across the graph until a further refined equitable coloring of the graph is obtained. But this refinement comes at a cost: it can only be done in a permutation-invariant manner if all the vertices of the chosen target cell are individualized and thereafter refined. If we choose a target cell of size k, i.e. the chosen color has been assigned to k nodes in the graph, then after individualization and subsequent refinement we have k colored graphs which are finer than the one before individualization. Note that these graphs may still not be discrete, and hence the process is repeated for each of the k refined graphs until all final graph colorings are discrete. This takes the shape of a search tree whose tree nodes are graphs with equitable colorings. This search tree is canonical to the isomorphism class of the graph, i.e. the search tree is unique to the graph's isomorphism class. Finally, one of the leaves is chosen based on a predefined function which sorts all discrete colorings found as leaves of the search tree.
Furthermore, vertices belonging to the same automorphism group in the graph always get the same color in the initial equitable coloring, and all of them induce the same refined coloring of the graph when individualized. Intuitively, this is because there is no structural difference between the vertices of the same automorphism group. For example, in Figure 1, individualizing any of the blue vertices would produce the same set of node colorings. Therefore, isomorphism solvers prune the search tree by detecting automorphism groups. Effectively, even if the target cell is large, the branching can be reduced if automorphisms are detected.
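A toy sketch of the IR search described above (pure Python, our own simplification: a Nauty-style smallest-cell target choice, no automorphism pruning, and explicit relabeling in place of hashing):

```python
from collections import Counter

def refine(adj, colors):
    """1-WL refinement of `colors` to an equitable coloring."""
    while True:
        sigs = {v: (colors[v], tuple(sorted(Counter(colors[u] for u in adj[v]).items())))
                for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new = {v: palette[sigs[v]] for v in adj}
        if new == colors:
            return colors
        colors = new

def ir_leaves(adj, colors=None):
    """Enumerate all discrete colorings reachable by individualization-refinement."""
    colors = refine(adj, colors or {v: 0 for v in adj})
    cells = Counter(colors.values())
    if all(size == 1 for size in cells.values()):   # discrete coloring: a leaf
        return [colors]
    # Target cell: the smallest non-singleton cell (a Nauty-style choice).
    target = min((c for c, size in cells.items() if size > 1), key=lambda c: cells[c])
    fresh = max(colors.values()) + 1
    leaves = []
    for v in adj:                                   # branch on every vertex in the cell
        if colors[v] == target:
            child = dict(colors)
            child[v] = fresh                        # individualize v
            leaves.extend(ir_leaves(adj, child))
    return leaves

# 4-cycle: the root cell has 4 vertices, each branch needs one more
# individualization, so the search tree has 4 x 2 = 8 discrete leaves.
c4 = {i: [(i - 1) % 4, (i + 1) % 4] for i in range(4)}
print(len(ir_leaves(c4)))  # 8
```

Real solvers prune most of these branches by detecting that the branching vertices lie in the same orbit, which is exactly the symmetry detection the paragraph above describes.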
Proposed Method
GNNs learn node embeddings by iteratively aggregating the embeddings of each node's neighbours. In this work, we propose to leverage the individualization and refinement paradigm to learn to distinguish nodes with similar embeddings. The key idea is to approximate the search process generated by IR-based isomorphism solvers. To make it computationally feasible, we make the necessary approximations. A conceptual overview of our approach is shown in Figure 2 and pseudocode is given in Algorithm 1.
Notations:
We use h_v for the embedding of node v and h_G for the aggregated embedding of graph G, generated with an appropriate global pooling function applied on the node embeddings. We use H to denote the matrix of node embeddings, and H^(l) is the embedding matrix at layer l. Note that a layer refers to one full iteration of the individualization-refinement step. Embeddings are initialized with node features, or with a constant value in the absence of node features.
Initialization
Graph Neural Networks (GNNs) are neural versions of 1-WL message passing, and hence node embeddings converge after iterating for a fixed number of steps. At this stage, some of the nodes end up with the same embeddings. This is equivalent to the equitable partitioning of the graph. It serves as a stable initialization of the node embeddings and partitions the graph akin to color cells.
Target cell selection
After the initial GNN refinement, the graph is a multiset of color (embedding) classes. To break the symmetry, we need to choose a color cell, called the target cell, which contains the chosen set of nodes with the same color, in order to individualize each one of them. We approximate target-cell selection by choosing one of the node embeddings in the graph and selecting the k most similar nodes to the chosen embedding. The choice of the target cell is important, as it determines the breadth and the depth of the search tree and thereby the richness of the embeddings learnt. Usually, isomorphism solvers use hand-engineered techniques to choose the target cell. Note that the size of the target cell itself may not matter much with respect to the refinement it induces on the graph. Some solvers like Nauty choose the smallest non-singleton cell, whereas Traces chooses the largest cell in order to shorten the depth of the search tree. This is one place where we can leverage learning from data: given a dataset, the target cell chosen should best fit the data. Let s be a scoring function on nodes which can be used to select the top-k indices for individualization in each iteration. If I^(l) is the set of nodes of the target cell to be individualized at layer l, then
I^(l) = top-k({ s(h_v^(l)) : v ∈ V })  (1)
One simple method of implementing such a scoring function would be top-k pooling, as done in (Gao and Ji, 2019; Lee et al., 2019). But this may not capture the best target cell, as these functions are parameterized only by a projection vector and do not provide a specific inductive bias towards an end goal. Intuitively, to bias the scoring function towards better target cells, it should determine the cell based on the current state of the graph, i.e. the number and size of distinct embeddings (cells), and the embeddings that have already been individualized, i.e. the history. Therefore, we model the target-cell selector as a recurrent neural network with a GRU cell, as it can track previous states via its hidden state.
Precisely, at each step, the GRU cell outputs a projection vector which is used to rank the nodes based on their similarity with the projection vector. At iteration l, the GRU cell takes two inputs: the pooled feature of the graph and the hidden state of the GRU from the previous time step. It then produces two embeddings: the updated hidden state and an output embedding. The output is treated as a surrogate for the target-cell embedding, onto which all the nodes are projected and ranked based on the projection score. Let p be the projection vector output by the GRU, AGG be an aggregator set function, q be the hidden-state vector and H be the set of node embeddings.
Then the following rule is applied to rank the nodes and obtain the individualization indices:
g^(l) = AGG({ h_v^(l) : v ∈ V })  (2)
q^(l), p^(l) = GRU(g^(l), q^(l-1))  (3)
s_v = h_v^(l) · p^(l) / ||p^(l)||  (4)
I^(l) = top-k({ s_v : v ∈ V })  (5)
r^(l) = Readout({ s_v ⊙ p^(l) : v ∈ I^(l) })  (6)
where ⊙ is the element-wise product. To select the nodes, the node embeddings are projected onto p^(l) and a score is computed for each node. The top-k ranked indices are selected into the set I^(l). To be able to backpropagate through the scoring function, the top-k scores are multiplied with the projection vector, and a Readout function is used to generate a separate feature vector which is concatenated with the final graph embedding before prediction.
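The score-and-rank step can be sketched as follows (a minimal pure-Python sketch with our own function name; a fixed projection vector stands in for the learned GRU output, and the learned model implements this with differentiable tensor operations):

```python
import math

def topk_individualization_indices(H, p, k):
    """Score nodes by projecting their embeddings onto p; return top-k indices.

    H: list of node-embedding vectors; p: projection vector (in the model,
    the output embedding of the GRU cell); k: number of nodes to individualize.
    """
    norm = math.sqrt(sum(x * x for x in p))
    scores = [sum(h_j * p_j for h_j, p_j in zip(h, p)) / norm for h in H]
    idx = sorted(range(len(H)), key=lambda i: -scores[i])[:k]
    # In the learned model, the top-k scores are multiplied with the
    # projection vector and fed to a readout, keeping selection differentiable.
    gated = [[scores[i] * p_j for p_j in p] for i in idx]
    return idx, gated

H = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
p = [1.0, 0.0]
idx, _ = topk_individualization_indices(H, p, k=2)
print(idx)  # [0, 1]
```

Nodes whose embeddings are most aligned with the projection vector are ranked first, which is how the selector approximates picking all members of one color cell.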
Individualization
Now that we have the selected indices in I^(l), we individualize each of the nodes separately and branch out. Our individualization function is an MLP that acts only on the selected node embeddings. The node embeddings of I^(l) are updated by passing them through the individualization function; thereafter, the embedding matrix of the graph is updated by masking:
h_v ← MLP_ind(h_v), for v ∈ I^(l)  (7)
This is repeated for each selected node separately, as we can only color one node at a time in the graph. The node-embedding matrix is updated with the individualized node embedding. The next step is to refine the graph to propagate this information.
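The masked update can be sketched as follows (a toy sketch; the affine "MLP" and function names are our own stand-ins for the learned individualization MLP):

```python
def individualize(H, idx, mlp):
    """Recolor one selected node by passing its embedding through the
    individualization function; all other embeddings are left untouched
    (this is the masking step)."""
    return [mlp(h) if v == idx else list(h) for v, h in enumerate(H)]

# Toy stand-in for the learned MLP: shift the embedding into a fresh range,
# artificially distinguishing the node from its color class.
mlp = lambda h: [x + 10.0 for x in h]
H = [[0.0], [0.0], [1.0]]
print(individualize(H, 1, mlp))  # [[0.0], [10.0], [1.0]]
```

Each selected index produces one such individualized copy of the embedding matrix, which is then refined independently, mirroring the branches of the IR search tree.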
Refinement and Aggregation
After each individualization, we run the GNN again for a fixed number of steps until the node embeddings converge again. This process of refinement gives us k refined graphs. Ideally, we should expand all the graphs further, but this incurs extra computational cost which grows exponentially with depth.
In cases where we need to search over a tree, a popular greedy heuristic is beam search, where a search tree of fixed width is kept. Selecting the best graph out of the k refined graphs would require a value-function approximator to score each of the graphs. Such a function would be difficult to train given the limited amount of labeled data. As an approximation, we instead construct a multiset function to aggregate the embeddings of each node into a single embedding. In principle, a universal multiset function approximator should be able to approximate selection as well. Also, here we need a set function over a set of sets, i.e. a set of graphs. For this, we first construct graph-aware node embeddings by combining the node embeddings with the pooled embeddings of the same graph. Finally, node embeddings are aggregated across graphs to generate a new representative graph. Precisely, let H_i^(l) be the set of node embeddings after individualization and refinement of index i in layer l. We first compute g_i^(l), a representation for H_i^(l), as
g_i^(l) = AGG(H_i^(l))  (8)
Thereafter, we feed g_i^(l) to an MLP and add the result to each node embedding to get a graph-representative node embedding for the nodes of the refined graphs:
ĥ_{i,v} = h_{i,v} + MLP(g_i^(l))  (9)
We then max-pool the node embeddings across the k graphs, i.e. for a node v, we pool together the embeddings {ĥ_{i,v}}_{i=1}^{k} to generate the aggregated new representation for node v:
h_v^(l+1) = max_{i ∈ [k]} ĥ_{i,v}  (10)
With the new representative graph embeddings, we repeat the IR procedure d times before readout.
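The three aggregation steps can be sketched as follows (a minimal sketch under simplifying assumptions: the MLP of Eq. (9) is replaced by the identity for brevity, and the aggregator of Eq. (8) is a plain sum):

```python
def aggregate_refined_graphs(graphs):
    """Merge k refined copies of a graph into one representative graph.

    graphs: list of k node-embedding lists, one per individualization branch;
    graphs[i][v] is the embedding vector of node v in branch i.
    """
    k, n, d = len(graphs), len(graphs[0]), len(graphs[0][0])
    aware = []
    for branch in graphs:
        # Pool each branch into a single graph vector (Eq. 8, sum aggregator).
        pooled = [sum(node[j] for node in branch) for j in range(d)]
        # Add the pooled vector back to every node of that branch (Eq. 9,
        # with an identity map standing in for the MLP).
        aware.append([[node[j] + pooled[j] for j in range(d)] for node in branch])
    # Element-wise max across the k branches for each node (Eq. 10).
    return [[max(aware[i][v][j] for i in range(k)) for j in range(d)]
            for v in range(n)]

g1 = [[1.0], [0.0]]   # branch 1: two nodes with 1-d embeddings
g2 = [[0.0], [2.0]]   # branch 2
print(aggregate_refined_graphs([g1, g2]))  # [[2.0], [4.0]]
```

Adding the pooled graph vector first makes each node embedding "graph-aware", so the subsequent max-pool compares nodes in the context of their own branch rather than in isolation.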
On aggregation of graphs and fixing k
IR algorithms usually expand the tree until the colorings are discrete. The colorings are then used to label the leaves, and the leaf with the largest label is selected. One way to simulate this is to learn a method of selecting the correct branch at each internal tree node in order to reach the correct leaf. Instead of selecting a branch, we learn to aggregate the children of the internal nodes into a single node, and we find that aggregation works well in our experiments. In the ideal case where the multiset functions used for aggregation are universal approximators (Zaheer et al., 2017; Qi et al., 2017), they would be able to approximate an algorithm that selects one of the graphs, and hence could work as well as an algorithm that learns to select. We use a simpler aggregator that is not necessarily a universal approximator. However, the use of the max-pooling operator likely makes simulating selection easier: if all embeddings are positive, multiplying the non-desired embeddings by a small number would result in max-pooling selecting the desired embedding.
As for determining the value of k, different graphs can have different cell sizes, and hence it is best to choose k by cross-validation. The best k for a dataset is likely to be close to the average target-cell size for a given layer. Cross-validation also helps when the cells consist of nodes from the same automorphism group; for such cells, a smaller value of k suffices. But as shown in the Analysis section, a larger value of k is more robust, as it includes all nodes of the selected cell.
Analysis
In this section, we discuss some properties of the proposed GNN model with individualization and refinement, as defined in Algorithm 1, which we call the GNN-IR model.
Permutation invariance: Each step of GNN-IR with arbitrary width k preserves permutation invariance for a large class of graphs, though it may fail to preserve it for some graphs. All operators used in GNN-IR are either permutation-invariant operators on sets or operators on individual nodes. If the input graph has unique attributes for all nodes, then the node operators operate on the same nodes irrespective of the input node permutation. Hence, GNN-IR is permutation-invariant for all graphs with unique node attributes. However, if the nodes are not distinguishable, then node operators may operate on different nodes when the permutation of the input nodes changes. The stage which decides which nodes are operated on is the target-cell selection stage, where the top-k nodes are selected for individualization.
Consider graphs with no node attributes. Let the initial coloring of the graph be π = {V_1, …, V_m}, where V_j is the set of vertices with the same color. For arbitrary width k, it is possible that not all vertices of a color cell V_j are included in the individualization set I. Then, permutation invariance is preserved if, upon individualization, all vertices of V_j induce the same refinement on the graph; effectively, which vertex of V_j is included in I is irrelevant. This happens when V_j forms an orbit, i.e. all vertices of V_j can be mapped to each other via an automorphism. Therefore, for arbitrary k, if all V_j are orbit cells, i.e. π is an orbit partition, then each step of GNN-IR preserves permutation invariance of the graph embeddings. For example, consider one of the graphs in Figure 1, where the blue vertices form an orbit, i.e. all blue vertices are structurally equivalent, and hence individualization of any blue vertex generates the same refinement. We also show in Lemma 1 that the output of each step of GNN-IR remains an orbit-partitioned coloring, and hence permutation invariance is preserved over multiple steps of GNN-IR.
Lemma 1. If the input to a GNN-IR step, which includes target-cell selection, individualization-refinement and aggregation, is an orbit-partitioned coloring, then the output will also be an orbit-partitioned coloring.
Proof: Included in the Appendix.
Expressive power: We characterise the expressive power of GNN-IR in terms of its capacity to distinguish non-isomorphic graphs.
Proposition 1. Assume we use universal set approximators in GNN-IR for the target-cell selection, refinement and aggregation steps. Then GNN-IR is more expressive than all 1-WL-equivalent GNNs, i.e. GNN-IR can distinguish all graphs distinguishable by such a GNN, and there exist graphs non-distinguishable by such a GNN which can be distinguished by GNN-IR.
Proof sketch.
A detailed proof is given in the Appendix; here, we give a brief sketch. First, we show that graphs distinguishable by GNNs generate orbit partitions on 1-WL refinement. Hence, by Lemma 1, GNN-IR preserves permutation invariance on these graphs, and since the first layer of GNN-IR is a GNN, they are distinguishable by GNN-IR. Next, we illustrate two graphs which are not distinguishable by a GNN and show how a step of GNN-IR, including the individualization-refinement and aggregation operators, distinguishes these two graphs. ∎
Runtime analysis: For bounded-degree graphs, the runtime of a GNN for a fixed number of iterations is O(n). GNN-IR builds on the GNN, and in each IR step the GNN is run k times. Assuming constant-time aggregation with 1-layer MLPs, running GNN-IR for d steps takes O(nkd).
Experiments
In this section, we report evaluation results of GNN-IR on multiple experiments.
Model architecture: We follow the MPNN (Gilmer et al., 2017) architecture in all our experiments. We first convolve with a 1-WL convolution operator, then update with a GRU cell (Chung et al., 2014), and for readout we use either sum-pooling or the set2set (Vinyals et al., 2015) function. For convolution, we use the GIN (Xu et al., 2018), NNConv (Gilmer et al., 2017) or PNA (Corso et al., 2020) convolution operator, depending on the dataset and the availability of edge features. We run the GNN in each IR layer for a fixed number of steps and share the parameters of the GNN across IR iterations, which allows us to go deeper without increasing the number of parameters. We use 1-hidden-layer MLPs in all set aggregators along with sum pooling. In datasets with edge features, we treat edges as variables and read out from both node and edge variables before the final prediction. Code was implemented in PyTorch Geometric (Fey and Lenssen, 2019).
Counting Triangles
Table 1: Results on TRIANGLES (accuracy on Train, Test-orig and Test-large; time in sec/epoch and ratio w.r.t. GIN; GNN-IR rows correspond to different (d, k) settings).

Model        Train  Test-orig  Test-large  sec/ep  ratio
GIN, top-k
ChebyGIN
GAT
GIN*                                        17.1    1
GNN-IR
             90     76         28           42.1    2.46
             92     86         34           51.7    3.02
             86     78         28           57.0    3.33
             98     97         41           80.8    4.73
             93     91         46           85.6    5.00
             99     99         51           98.7    5.77
Counting triangles is a simple task with the analytic solution tr(A³)/6, where A is the adjacency matrix of the graph. However, 1-WL GNN models provably cannot count triangles (Chen et al., 2020). Therefore, we evaluate GNN-IR on the publicly available synthetic dataset TRIANGLES (Knyazev et al., 2019), where the task is to count the number of triangles in each graph. The dataset comes with four splits: training, validation, test-original and test-large. The first three splits contain small graphs, while the test-large split contains substantially larger graphs; it is challenging and tests the generalization ability of the model. We compare with the baseline 1-WL GNN models GAT and GIN, and with ChebyGIN (Knyazev et al., 2019), which is a more powerful GNN.
Table 1 shows the accuracy of the models on both test sets. To analyse the effect of the depth and width of IR, we report results for various values of d, the number of IR layers, and k, the number of individualizations in each layer. As the results show, just one layer of IR significantly increases the accuracy on Test-orig. Furthermore, a clear pattern emerging from the results is that accuracy and generalization improve with larger depth d and width k. We also report the extra time taken by each IR layer, which includes the GNN refinement steps per individualization, and compare it with GIN, our base GNN. The time ratio w.r.t. GIN shows that the amount of computation increases with more IR layers and larger width k, but the increase is only linear in both d and k.
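The analytic solution mentioned above can be verified directly: each triangle contributes six closed walks of length 3 (three starting vertices times two directions), so dividing the trace of A³ by 6 gives the count.

```python
def count_triangles(adj_matrix):
    """Count triangles in an undirected graph as trace(A^3) / 6."""
    n = len(adj_matrix)

    def matmul(X, Y):
        return [[sum(X[i][t] * Y[t][j] for t in range(n)) for j in range(n)]
                for i in range(n)]

    A3 = matmul(matmul(adj_matrix, adj_matrix), adj_matrix)
    return sum(A3[i][i] for i in range(n)) // 6

# K4 (complete graph on 4 vertices) contains C(4,3) = 4 triangles.
K4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
print(count_triangles(K4))  # 4
```

This is exactly the kind of global quantity that local 1-WL message passing cannot compute, which is why the dataset separates 1-WL models from more expressive ones.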
Recognizing Circulant Skip Links (CSL) graphs
Table 2: Results on the CSL dataset (mean, median, max, min and std of accuracy).

Model     mean  median  max  min  std
GIN
RP-GIN
3-WL-GNN
GNN-IR
A type of graph that 1-WL GNN models cannot classify is regular graphs, which provide no distinguishing information in node degrees. To evaluate GNN-IR on highly regular graphs, we use the Circulant Skip Links (CSL) dataset released by (Murphy et al., 2019). A CSL graph is a 4-regular graph in which the vertices form a cycle, with additional skip links between vertices at a fixed distance from each other. The dataset consists of graphs from several isomorphism classes. We evaluate GNN-IR on the CSL dataset with cross-validation as in (Murphy et al., 2019), and report results along with 1-WL models and more expressive models such as RP-GNN (Murphy et al., 2019) and 3-WL GNN. Since CSL graphs have a highly regular structure, the number of individualizations needed to break the symmetry is high. Table 2 shows that the 1-WL GIN model fails, whereas there is a clear improvement with GNN-IR. Note that higher values of k result in much better gains than increasing the number of IR layers d. This suggests that with a larger width k, either target color cells of larger size are more helpful in breaking the symmetry, or including more of the smaller color cells for individualization helps in covering the optimal target cell among the selected nodes.
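A CSL graph of the kind described above can be constructed in a few lines (a generic sketch with our own function name and parameters; the exact dataset parameters follow Murphy et al., 2019):

```python
def csl_graph(n, skip):
    """Build a Circulant Skip Links graph: a cycle on n vertices plus
    skip links between vertices at distance `skip` along the cycle."""
    edges = set()
    for v in range(n):
        edges.add(frozenset((v, (v + 1) % n)))      # cycle edge
        edges.add(frozenset((v, (v + skip) % n)))   # skip link
    return edges

g = csl_graph(11, 3)
degree = {v: sum(1 for e in g if v in e) for v in range(11)}
print(sorted(set(degree.values())))  # [4]
```

Every vertex has the same degree and the same neighbourhood structure, so 1-WL assigns all vertices a single color regardless of the skip length, which is why graphs with different skip lengths are indistinguishable to 1-WL GNNs.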
Real-world benchmarks
We now evaluate GNN-IR on real-world datasets on graph regression tasks. We use the ZINC-10K (Jin et al., 2018; Dwivedi et al., 2020), ALCHEMY-10K (Chen et al., 2019a) and QM9 (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014) datasets, since these are used by recent prominent models (Morris et al., 2020; Corso et al., 2020), and to compare with the neural versions of higher-order GNNs (Morris et al., 2020), which are more expressive than 1-WL GNNs. Note that kernel versions usually perform much better than their neural counterparts. We use NNConv as our base GNN for QM9 and PNAConv for Alchemy and ZINC. For readout, we use the set2set output for QM9, following (Gilmer et al., 2017), and sum pooling for the other datasets. For all datasets, we follow the dataset splits and report mean absolute error (MAE) as in (Morris et al., 2020). For a more recent comparison with state-of-the-art models, we add PNA (Corso et al., 2020) and DGN (Beani et al., 2021) for the ZINC and ALCHEMY datasets, and LRBP-net (Dupty and Lee, 2020) for QM9.
Table 6: Ablation results on CSL and TRIANGLES.

                       CSL                                 TRIANGLES
                 mean   median  max    min    std     Train  Test-orig  Test-large
GIN              10     10      10     10     0       90     47         18
GNN-IR with
a) random nodes  11.9   13.3    13.3   10     1.8     94.7   93.6       39.5
b) without GRU   90.7   96.7    96.7   70     16.6    97.1   97.2       41.2
c) sum aggr      54.0   60.0    70.0   20     19.4    95.8   92.4       35.2
GNN-IR           98.7   100.0   100.0  96.67  1.8     99.4   99.3       51.1
Tables 4 and 5 show that the improvement over standard baselines is consistent across the datasets. Specifically, for ZINC-10K and QM9, the improvement of GNN-IR over the k-WL GNN variants is substantial. Note that it is non-trivial to compare the expressive power of GNN-IR in terms of k-WL models. The results show that GNN-IR outperforms the neural versions of k-WL, which suggests that GNN-IR has at least a better inductive bias for these problems than k-WL GNNs. However, it is not clear which properties of the algorithm produce this better inductive bias. It would be interesting to understand the conditions under which GNN-IR and k-WL GNNs do well on different tasks. We leave this study for future work.
Runtime and ablation study: We compare GNN-IR with k-WL GNNs in terms of their runtime against accuracy. Table 5 shows the ratio of time per epoch of each model w.r.t. GINE (1-WL). There is only a linear increase in the runtime of GNN-IR with increasing k, and it achieves much better gains than k-WL GNNs at lower runtimes. We also conducted ablation experiments to study the effectiveness of various components of GNN-IR. For this, we take the best GNN-IR models for the CSL and TRIANGLES datasets and consider randomly selecting the top-k nodes instead of learning the selection, replacing the GRU in Equation 3 with an MLP, and using sum instead of max when aggregating the refined graphs. Table 6 shows that, compared to the GNN, randomly selecting the top-k nodes helps on the TRIANGLES dataset but does not help at all on CSL, suggesting that learning the top-k selection is needed to break the symmetry in graph structures. Replacing the GRU with an MLP and using the sum aggregator decrease performance slightly but still fare much better than the GNN.
Note: Due to space constraints, we provide results on the TUDataset benchmark in the Appendix; GNN-IR gives competitive performance against GNN-based methods, although it does not achieve state-of-the-art performance when compared to all methods.
Conclusion
In this work, we propose learning richer representations of graph-structured data with individualization and refinement, a technique followed by most practical isomorphism solvers. Our approach is computationally feasible and can adaptively select nodes to break the symmetry in GNN embeddings. Experimental evaluation shows that our model substantially outperforms other 1-WL and more expressive GNNs on several benchmark datasets. Future work includes understanding the power and limitations of learning individualization and refinement functions for improving GNN models.
References
 On Weisfeiler-Leman invariance: subgraph counts and related graph properties. Journal of Computer and System Sciences 113, pp. 42–59.
 The logical expressiveness of graph neural networks. In International Conference on Learning Representations.
 Directional graph networks. In International Conference on Machine Learning, pp. 748–758.
 Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828.
 Improving graph neural network expressivity via subgraph isomorphism counting. arXiv preprint arXiv:2006.09252.
 Alchemy: a quantum chemistry dataset for benchmarking AI models. arXiv preprint arXiv:1906.09427.
 Can graph neural networks count substructures? arXiv preprint arXiv:2002.04025.
 On the equivalence between graph isomorphism testing and function approximation with GNNs. arXiv preprint arXiv:1905.12560.
 Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
 Principal neighbourhood aggregation for graph nets. arXiv preprint arXiv:2004.05718.
 Exploiting structure in symmetry detection for CNF. In Proceedings of the 41st Annual Design Automation Conference, pp. 530–534.
 Coloring graph neural networks for node disambiguation. arXiv preprint arXiv:1912.06058.
 Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852.
 Understanding the representation power of graph neural networks in learning graph topology. arXiv preprint arXiv:1907.05008.
 Graph neural tangent kernel: fusing graph neural networks with graph kernels. arXiv preprint arXiv:1905.13192.
 Neuralizing efficient higher-order belief propagation. arXiv preprint arXiv:2010.09283.
 Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982.
 A fair comparison of graph neural networks for graph classification. arXiv preprint arXiv:1912.09893.
 Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
 Graph U-Nets. In International Conference on Machine Learning, pp. 2083–2092.
 Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1263–1272.
 Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
 Representation learning on graphs: methods and applications. arXiv preprint arXiv:1709.05584.
 Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, pp. 2323–2332.
 Conflict propagation and component recursion for canonical labeling. In International Conference on Theory and Practice of Algorithms in (Computer) Systems, pp. 151–162.
 Benchmark data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.
 Power and limits of the Weisfeiler-Leman algorithm. Technical report, Fachgruppe Informatik.
 Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
 Directional message passing for molecular graphs. arXiv preprint arXiv:2003.03123.
 Understanding attention and generalization in graph neural networks. arXiv preprint arXiv:1905.02850.
 Self-attention graph pooling. In International Conference on Machine Learning, pp. 3734–3743.
 Distance encoding: design provably more powerful GNNs for structural representation learning. arXiv preprint arXiv:2009.00142.
 What graph neural networks cannot learn: depth vs width. In International Conference on Learning Representations.
 Provably powerful graph networks. arXiv preprint arXiv:1905.11136.
 Invariant and equivariant graph networks. arXiv preprint arXiv:1812.09902.
 Practical graph isomorphism.
 Practical graph isomorphism, II. Journal of Symbolic Computation 60, pp. 94–112.
 Weisfeiler and Leman go sparse: towards scalable higher-order graph embeddings. Advances in Neural Information Processing Systems 33.
 Weisfeiler and Leman go neural: higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4602–4609.
 Relational pooling for graph representations. In International Conference on Machine Learning, pp. 4663–4673.
 Random walk graph neural networks. Advances in Neural Information Processing Systems 33, pp. 16211–16222.
 PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
 Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1, pp. 140022.
 Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling 52 (11), pp. 2864–2875.
 Random features strengthen graph neural networks. arXiv preprint arXiv:2002.03155.
 SchNet: a deep learning architecture for molecules and materials. The Journal of Chemical Physics 148 (24), pp. 241722.
 Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12 (9).
 On the equivalence between positional node embeddings and structural graph representations. In International Conference on Learning Representations.
 PhysNet: a neural network for predicting energies, forces, dipole moments, and partial charges. Journal of Chemical Theory and Computation 15 (6), pp. 3678–3693.
 Building powerful and equivariant graph neural networks with structural message-passing. arXiv e-prints, pp. arXiv–2006.
 Order matters: sequence to sequence for sets. arXiv preprint arXiv:1511.06391.
 A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems.
 How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
 Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374.
 Hierarchical graph representation learning with differentiable pooling. arXiv preprint arXiv:1806.08804.
 Deep sets. arXiv preprint arXiv:1703.06114.
 An end-to-end deep learning architecture for graph classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
Appendix A
In this section, we first provide proofs of the propositions. Thereafter, we expand the Experiments section with an additional set of experiments: we provide results on the TUDataset benchmark and additional details on experiments and datasets.
Proofs
Permutation invariance:
Lemma 1. If the input to a GNN-IR round is an orbit-partitioned coloring, then the output will also be an orbit-partitioned coloring.
Proof: Each round of GNN-IR consists of selection of nodes, individualization-refinement of the selected nodes to generate refined colorings, and aggregation of these colorings into a single coloring. Firstly, since the input is an orbit-partition, the node-selection stage is permutation-invariant. The refined colorings generated after the individualization-refinement stage are all orbit-partitions, because any refined sub-partition of an orbit-partition is also an orbit-partition. As a proof by contradiction: assume a sub-partition is not an orbit-partition. Then there exist two vertices in a color-cell of the sub-partition which cannot be mapped to each other via an automorphism. By the definition of color-refinement, all color-cells of the sub-partition are subsets of some orbit-cell of the initial orbit-partition, which implies there exists an automorphism between the two vertices, contradicting the assumption. Hence, the refined colorings are orbit-partitions.
Now, the refined colorings of the initial orbit-partition are mapped to a single coloring by the aggregation operation. Given that all the refined colorings are finer than the initial coloring, we need to show that any node-wise aggregation will be either finer than or equal to the initial orbit-partition in terms of the number and size of the color-cells, though the embeddings may be transformed. Equivalently, if two vertices have different colors before the IR step, then they should have different colors after the IR step.
Consider the node-wise aggregation operation. It takes colorings as input and, for each node, injectively maps the set of colors/embeddings to a new color. If two vertices have different colors in at least one of the colorings, then they will be mapped to different colors, except in one case: two vertices $x$ and $y$ with different colors in at least one of the colorings are mapped to the same color only when the refinements are rotations of each other. For simplicity, consider aggregating two refinements; a similar argument follows for higher values of $k$. Let $u$ and $v$ be the individualized vertices generating the two refinements, and let $c_w^{(u)}$ denote the color of a vertex $w$ in the refined coloring after individualizing $u$. If the two refined colorings are rotations of each other, then $c_x^{(u)} = c_y^{(v)}$ and $c_y^{(u)} = c_x^{(v)}$ for some vertices $x$ and $y$. Then, when we aggregate each node across colorings as a set of colors, the set becomes the same for both $x$ and $y$, i.e., $\{c_x^{(u)}, c_x^{(v)}\} = \{c_y^{(u)}, c_y^{(v)}\}$. Below, we show that such a condition cannot occur for any pair of vertices with different colors in the initial orbit-partition.
For this, consider two vertices $x$ and $y$ with different colors in the initial orbit-partition, i.e., $\pi(x) \neq \pi(y)$, where $\pi$ is the coloring before the current layer of GNN-IR. Now, if the individualized vertices $u$ and $v$ belong to different color-cells, then the refinements they induce will be different; this follows from the procedure of vertex-refinement Shervashidze et al. (2011). Therefore, the new colors of $x$ and $y$ will be different in both refinements, i.e., $c_x^{(u)} \neq c_y^{(u)}$ and $c_x^{(v)} \neq c_y^{(v)}$. Note also that $c_x^{(u)} \neq c_x^{(v)}$ and $c_y^{(u)} \neq c_y^{(v)}$, since after vertex-refinement the new color of a vertex depends both on its previous color and on the color of the individualized vertex. In this case, all four colors are distinct, which implies $\{c_x^{(u)}, c_x^{(v)}\} \neq \{c_y^{(u)}, c_y^{(v)}\}$. Therefore, the node-wise aggregated sets differ and hence $x$ and $y$ receive different colors after aggregation.
If the individualized vertices $u$ and $v$ belong to the same color-cell, then the refinements they induce will be the same. But since the initial colors of $x$ and $y$ were different, the node-wise aggregated sets of colors will still differ between the two vertices. Equivalently, if the colors of $x$ and $y$ due to individualizing $u$ are $c_x^{(u)}$ and $c_y^{(u)}$ respectively, then the colors due to $v$ will also be $c_x^{(u)}$ and $c_y^{(u)}$. In this case, the node-wise aggregated sets are $\{c_x^{(u)}\}$ and $\{c_y^{(u)}\}$ with $c_x^{(u)} \neq c_y^{(u)}$, and hence $x$ and $y$ receive different colors after aggregation.
Hence, the aggregation operation generates a coloring which is also an orbit-partition. ∎
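The two primitives the proof reasons about, vertex refinement (1-WL) and individualization, can be sketched in plain Python. This is a minimal combinatorial sketch with illustrative names (`wl_refine`, `individualize`), not the paper's learned version:

```python
from collections import Counter

def wl_refine(adj, colors, rounds=None):
    """Iterate 1-WL color refinement until the partition is stable.

    adj: dict mapping each vertex to a list of neighbours.
    colors: dict mapping each vertex to an initial (hashable) color.
    Returns a stable coloring as a dict of small integers.
    """
    rounds = rounds if rounds is not None else len(adj)
    for _ in range(rounds):
        # New color = injective hash of (own color, multiset of neighbour colors).
        signatures = {
            v: (colors[v], tuple(sorted(Counter(colors[u] for u in adj[v]).items())))
            for v in adj
        }
        # Relabel signatures with consecutive integers (an injective recoloring).
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: palette[signatures[v]] for v in adj}
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors  # partition did not get finer: stable
        colors = new_colors
    return colors

def individualize(colors, v):
    """Give vertex v a fresh unique color before re-refining (the 'I' in IR)."""
    out = dict(colors)
    out[v] = max(colors.values()) + 1
    return out
```

On a 6-cycle, plain 1-WL leaves every vertex the same color, while individualizing one vertex and re-refining yields the orbit partition of the stabilizer, exactly the symmetry-breaking behaviour the proofs rely on.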
Expressive power of GNN-IR: We now proceed to the proof of Proposition 1, which states that GNN-IR is more expressive than GNN. For the proof, we need to know the conditions under which the aggregation procedure of GNN-IR, as described in Algorithm 1, maps two sets of colorings to different colorings. The following lemma gives one condition under which the aggregation maps refinements of two non-isomorphic graphs to distinct aggregated colorings.
Lemma 2. Assume we use universal set approximators in the aggregation function to merge refined colorings (colored graphs) in Algorithm 1. Consider two sets of colorings $S_1$ and $S_2$ with $|S_1| = |S_2| = k$. If $S_1 \cap S_2 = \emptyset$, i.e., $S_1$ and $S_2$ do not share any coloring out of the $k$ colorings, then, irrespective of the node embeddings, the aggregation function of GNN-IR maps $S_1$ and $S_2$ to different colorings.
Proof:
With the assumption of universal set approximators for AGG in lines 15 and 18 of Algorithm 1, the aggregation function first maps each coloring $C$ to a unique embedding $e_C$. We then concatenate $e_C$ with each of the node embeddings of the coloring, and use another set-function approximator to reduce the node embeddings to one. For $S_1$ and $S_2$, since $S_1 \cap S_2 = \emptyset$, the corresponding sets of $e_C$'s will be different and consequently, irrespective of the node embeddings $h_v$, every vertex across colorings will be mapped to a different embedding after the concatenation $[h_v \,\|\, e_C]$. Therefore, with the set aggregation of the concatenated node embeddings, $S_1$ and $S_2$ will be mapped to different colorings. ∎
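For discrete colors, the injectivity argument in Lemma 2 can be checked exactly by replacing the learned universal set approximator with a set itself; `coloring_id` and `aggregate_colorings` are hypothetical names for this sketch:

```python
def coloring_id(coloring):
    """Injectively map a whole coloring (node -> color) to one hashable id."""
    return tuple(sorted(coloring.items()))

def aggregate_colorings(colorings):
    """Node-wise aggregation: each node collects, over all k colorings, the
    pair (its color in that coloring, the coloring's id), as a set.
    A frozenset is an exact injective set 'aggregator' for discrete inputs."""
    nodes = colorings[0].keys()
    return {
        v: frozenset((c[v], coloring_id(c)) for c in colorings)
        for v in nodes
    }
```

Because each pair carries the id of the coloring it came from, two disjoint sets of colorings yield different aggregates for every node, mirroring the role of the per-coloring embedding $e_C$ in the lemma.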
Below, we give the extended proof of Proposition 1.
Proposition 1. Assume we use universal set approximators in GNN-IR for the target-cell selection, refinement and aggregation steps. GNN-IR is more expressive than all 1-WL equivalent GNNs, i.e., GNN-IR can distinguish all graphs distinguishable by GNN and there exist graphs non-distinguishable by GNN which can be distinguished by GNN-IR.
Proof: Consider the procedure of generating the final embedding of GNN-IR used for graph prediction. After every step of GNN-IR, we pool the node embeddings for readout. We then concatenate the pooled embeddings of each layer and feed the result to an MLP for the final prediction. If $g_l$ is the pooled graph embedding after layer $l$, then the concatenated embedding will be $[g_1 \,\|\, g_2 \,\|\, \dots \,\|\, g_L]$ for GNN-IR with $L$ layers. Clearly, this concatenated embedding must be different for all graphs distinguishable by GNN-IR.
First, consider $\mathcal{G}$ as the set of graphs such that the equitable coloring induced after 1-WL/GNN refinement on any $G \in \mathcal{G}$ uniquely identifies $G$. In other words, $\mathcal{G}$ is the set of graphs which are distinguishable by any 1-WL equivalent GNN. Note that if an equitable coloring uniquely identifies a graph, then its color-cells form the vertex orbits of the automorphism group of the graph Kiefer et al. (2020). Therefore, the equitable coloring induced by GNN refinement on any $G \in \mathcal{G}$ also forms an orbit-partition, and by Lemma 1, GNN-IR with any width preserves its permutation-invariance. Therefore, the final GNN-IR embedding will be permutation-invariant regardless of the width $k$ and the number of layers $L$. Now, since $G$ is distinguishable by GNN, $g_1$ will be unique for $G$ and $G$ will be distinguishable by GNN-IR.
Next, we need to show that there exist graphs which are not distinguishable by any 1-WL equivalent GNN but are distinguishable by GNN-IR. Graphs A and B in Figure 3 are two such graphs which are not distinguishable by GNN and hence, the pooled GNN embedding $g_1$ will be the same for these two graphs. It can be shown that the initial 1-WL refinement of graphs A and B generates orbit-partitioned colorings. Such a refinement effectively partitions each graph into two orbit-cells, i.e., the blue and red colored vertices. Since the initial 1-WL refinement of both graphs generates orbit-partitioned colorings, by Lemma 1, GNN-IR preserves permutation-invariance for any number of IR steps.
As illustrated in Figure 3, the individualization of either a blue or a red vertex induces different refined colorings for graphs A and B. Since the number of vertices is 6, any value of $k$ will result in distinct multisets of colorings $S_A$ and $S_B$ for graphs A and B respectively, such that $S_A \cap S_B = \emptyset$. By Lemma 2, the aggregation function maps the two graphs to distinct colorings, which can be pooled to obtain $g_l$. Hence, in the final embedding, $g_l$ for any layer $l$ will be different for the two graphs. Therefore, these graphs can be distinguished by GNN-IR. ∎
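The readout used in the proof, pool within each IR layer and concatenate across layers, can be sketched as follows (sum-pooling chosen for illustration; `readout` is a hypothetical name):

```python
def readout(per_layer_embeddings):
    """Build [g_1 || g_2 || ... || g_L]: sum-pool node embeddings within
    each IR layer, then concatenate the pooled vectors across layers."""
    final = []
    for nodes in per_layer_embeddings:  # one dict (node -> embedding) per layer
        dim = len(next(iter(nodes.values())))
        final.extend(sum(emb[d] for emb in nodes.values()) for d in range(dim))
    return final
```

Since sum-pooling is permutation-invariant and the concatenation order follows layer index rather than node identity, the final vector inherits the invariance of each layer's coloring.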
Experiments
Graph classification on TUDataset benchmark
We evaluate GNN-IR on six datasets from the publicly available TUDataset benchmark, a graph classification benchmark suite (Kersting et al., 2016; Yanardag and Vishwanathan, 2015). The datasets are broadly in the domains of social media (IMDB-BINARY, IMDB-MULTI, COLLAB) and chemical science (NCI1, PROTEINS, MUTAG). We use GIN as our base GNN convolution operator and sum/mean pooling for readout.
Experimental setup
The evaluation procedures on TUDataset vary considerably in the recent GNN literature, which has given rise to different numbers for the same models. One evaluation setup (Xu et al., 2018) splits the datasets into training and validation sets in the ratio 90:10, takes the mean of the validation curves across 10 folds, and reports the mean/std of the best epoch of the mean validation curve. This protocol is non-standard but is used because of the small size of the datasets. Recently, however, Errica et al. (2019) performed a fair comparison of the prominent GNN models: a standard 10-fold 90:10 train-test split, reporting the average accuracy on the test set, with model selection done on a separate 10% split of the training set. We adopt this standard method of evaluation and report accuracies in the 10-fold experiment as described in Errica et al. (2019) and Nikolentzos and Vazirgiannis (2020).
Note that we only compare with neural GNN models and do not claim state-of-the-art scores on these datasets. Kernel methods such as Du et al. (2019); Morris et al. (2020) usually perform better than their neural counterparts on these particular datasets; in fact, the higher-order local WL models in kernel form hold the state-of-the-art scores for most of the TUDatasets. Morris et al. (2020) report results only for the kernel versions of "local WL" models on TUDatasets, not the neural versions. Nonetheless, since we compare with higher-order WL GNNs in our experiments, we also report results of 3-WL GNN in this setup.
For our experiments, we compare the proposed GNN-IR model against the following GNN baselines: DGCNN (Zhang et al., 2018), DiffPool (Ying et al., 2018), GIN (Xu et al., 2018), GraphSAGE (Hamilton et al., 2017a) and RWNN (Nikolentzos and Vazirgiannis, 2020), as well as the higher-order 3-WL GNN (Morris et al., 2019). Results for these baselines are as reported in Nikolentzos and Vazirgiannis (2020), and we use the code of 1-2-3-GNN (Morris et al., 2019) to generate results for 3-WL GNN in this setup.
Without node-degree features
The social media datasets do not come with node attributes and are typically initialized with node degrees as node features. In principle, since node degrees can be computed by 1-WL message passing, the presence of node-degree features should not affect the performance of GNN models. In practice, however, GNN models perform poorly without node-degree features, as shown in the results of Errica et al. (2019). In order to assess GNN-IR's robustness to the absence of node-degree features, we also report results without them; in this case, we initialize node features with constant values.
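The two initializations compared here amount to the following (a hypothetical helper for illustration; practical pipelines often use one-hot degree features instead of raw counts):

```python
def init_node_features(adj, use_degree=True):
    """Degree features when allowed, else a constant feature, which forces
    the model to rely purely on the graph structure it can discover itself."""
    return {v: [float(len(adj[v]))] if use_degree else [1.0] for v in adj}
```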
Results
Table 7 shows the accuracy of the GNN models on graph classification on TUDataset. GNN-IR scores are competitive on all the datasets. GNN-IR outperforms the higher-order 3-WL GNN by a significant margin on the IMDB-B and NCI1 datasets, while on the other datasets an increase of 1-2 percentage points can be seen. Compared to all other baselines, GNN-IR achieves the best score on 4 out of 6 datasets and is close to the best performing model on the other 2. Note the results in comparison to GIN, the base convolution operator used in GNN-IR: GNN-IR's clear edge over GIN shows that the improvement comes from the individualization and refinement mechanism.
Furthermore, the difference is even clearer in Table 8, which shows the accuracy of the models on graphs without node-degree features. With the exception of the COLLAB dataset, the accuracy of GNN-IR does not decrease significantly on the two IMDB datasets. This suggests that GNN-IR captures structural information from the graphs significantly better than 1-WL equivalent GNN models. Since GNN-IR works by breaking the symmetry between nodes, it is more robust to the absence of node-degree features.
Experimental details
In this section, we give further details of the experiments conducted. The code was written with the PyTorch Geometric (Fey and Lenssen, 2019) library and the experiments were performed on a GeForce RTX 2080Ti GPU. Hyperparameter tuning was done on a separate validation set formed from the training set for all datasets. For a fair comparison, we used the implementations of Morris et al. (2020) for ZINC10K, ALCHEMY10K and the TUDataset benchmark, and PyTorch Geometric example implementations for the rest of the datasets.
For all datasets, our message-passing model consists of a convolution operator for the message function and a GRU for the update, following the MPNN implementation of Gilmer et al. (2017). The specific convolution operator used for each dataset is given in Table 9. Additionally, we share the Conv-GRU parameters across the IR layers, as this architecture scales well to deeper layers without increasing the parameter count and without losing performance; it also means that the improvement shown across datasets is not due to using more parameters. In each IR layer, the GNN is run for a fixed number of message-passing steps. In the set aggregators, we use a 1-hidden-layer MLP before sum-pooling the vectors of the set. Finally, on datasets with edge features, we treat edges as variables and read out from both node and edge variables before the final fully connected layer.
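Abstractly, sharing parameters across IR layers just means the same message and update functions are reused every round; a minimal sketch (here `message` and `update` are plain functions standing in for the convolution and GRU, not the paper's code):

```python
def mpnn_rounds(adj, h, message, update, steps=3):
    """Run `steps` rounds of message passing with shared message/update fns.

    adj: dict node -> list of neighbours; h: dict node -> embedding (list).
    Reusing the same `message`/`update` in every round is what parameter
    sharing across IR layers amounts to: depth adds no parameters.
    """
    for _ in range(steps):
        msgs = {v: [message(h[v], h[u]) for u in adj[v]] for v in adj}
        # aggregate incoming messages by elementwise sum, then update the state
        h = {v: update(h[v], [sum(col) for col in zip(*msgs[v])]) for v in adj}
    return h
```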
We use cross entropy and mean absolute error (MAE) loss functions for graph classification and regression respectively. We report dataset statistics and all the hyperparameters used for the results in Table 9. These hyperparameters were chosen with a separate validation set for each dataset. Below we describe details of all datasets and the splits used in the experiments.

Dataset            TRIANGLES  CSL  ZINC10K  ALCHEMY10K  QM9     COLLAB  IMDB-B  IMDB-M  NCI1  PROTEINS  MUTAG
#graphs
Node feat          Yes        No   Yes      Yes         Yes     No      No      No      Yes   Yes       Yes
Edge feat          No         No   Yes      Yes         Yes     No      No      No      No    No        No
batch size
hidden layer size
epochs
start lr
decay rate
decay steps
patience
#IR layers
width
Conv Operator      GIN        GIN  PNAConv  PNAConv     NNConv  GIN     GIN     GIN     GIN   GIN       GIN
Datasets
TRIANGLES:
The TRIANGLES dataset (Knyazev et al., 2019) consists of graphs with the task of counting the number of triangles in each graph; the number of classes is 10. We use node degrees as node features. The data splits are training, validation, test-original and test-large. Except for test-large, all splits contain smaller graphs; the test-large set contains larger graphs and tests the generalization ability of the model. Figure 4 shows example graphs from the two test sets, where both have the same number of triangles. It can be seen in Table 1 that there is significant improvement on the test-large set with increasing IR layers and width of each IR layer, suggesting better generalization to larger graphs by our model.
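The target labels correspond to the classical exact count, which can be computed per edge (a small illustrative implementation, not part of the dataset's tooling):

```python
def count_triangles(adj):
    """Count triangles: for each edge (u, v) with u < v, count the common
    neighbours of u and v. Every triangle is seen once per each of its
    three edges, so the total is divided by 3."""
    hits = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                hits += len(set(adj[u]) & set(adj[v]))
    return hits // 3
```

This is trivial with global access to the adjacency structure, which is what makes the task a sharp probe of what local message passing can and cannot recover.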
Circulant Skip Links (CSL):
The Circulant Skip Links dataset is a graph classification dataset introduced in Murphy et al. (2019) to test the expressive power of GNNs. A CSL graph is a 4-regular graph on $n$ vertices: a cycle with additional skip edges between pairs of vertices that are a fixed distance $R$ apart in cyclical order. The graphs come from 10 classes, each corresponding to a different skip length $R$. Figure 5 shows two non-isomorphic graphs from the CSL dataset with 11 nodes. The dataset has 150 graphs, with each class having 15 graphs. Following Murphy et al. (2019), we use a 5-fold cross-validation split, with each split having train, validation and test data. We use the validation split to decay the learning rate and for model selection.
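A CSL graph can be built directly from the description above; `csl_graph` is a hypothetical helper, not code from the benchmark:

```python
def csl_graph(n, r):
    """Circulant skip-link graph: an n-cycle plus skip edges between
    vertices r apart (4-regular whenever 1 < r < n / 2)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for d in (1, r):
            adj[v].add((v + d) % n)
            adj[(v + d) % n].add(v)
    return adj
```

Two CSL graphs with different skip lengths share the same degree sequence (everything is 4-regular), which is exactly why 1-WL, and hence standard GNNs, cannot tell the classes apart.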
ZINC10K:
The ZINC10K dataset (Jin et al., 2018; Dwivedi et al., 2020) contains 12K Zinc molecules, of which 10K are for training and 1K each for the validation and test sets. The dataset comes with constrained-solubility values associated with the molecules, and the task is to regress these values. The performance measure is the mean absolute error (MAE) of the regression on each graph. We use the baseline results as in Morris et al. (2020) and add the more recent, better performing PNA model (Corso et al., 2020) for a more thorough comparison. We train with the L1 loss and report the absolute error on the test set.
ALCHEMY10K:
ALCHEMY10K (Chen et al., 2019a) is a recently released graph regression dataset. The task is to regress values related to quantum chemistry. The targets are the same as in QM9, but the molecules contain more heavy atoms. We use the same smaller version of the dataset as used in Morris et al. (2020), with 10K training, 1K validation and 1K test molecules randomly sampled from the original full dataset. As with ZINC10K, we add PNA (Corso et al., 2020) for a more recent comparison; for this, we ran PNA on the same dataset with the code provided officially by the PyTorch Geometric library. We use the L1 loss and report MAE on the regression targets.
QM9:
QM9 (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014) is a prominent large-scale dataset in the domain of quantum chemistry. It contains more than 130K drug-like molecules with sizes ranging from 4 to 29 atoms per molecule; each molecule may contain up to 9 heavy (non-hydrogen) atoms. The task is to regress 12 quantum-mechanical properties associated with each molecule. In our experiments, we follow the MPNN architecture of Gilmer et al. (2017) for node and edge features. We further add edge variables for message passing and read out from both node and edge variables: for edge message passing, we simply concatenate the two incident node hidden vectors, pass them through an MLP, and update the edge variables before finally reading out from both node and edge variables. We follow the same random split into training, validation and test sets as in Morris et al. (2019, 2020). We compare with the recent GNN models as in Morris et al. (2020), which covers prominent GNN models tested on this dataset. Note that, as in Morris et al. (2020), we do not compare with SchNet (Schütt et al., 2018), PhysNet (Unke and Meuwly, 2019) and DimeNet (Klicpera et al., 2020), which incorporate physical knowledge in the modeling of message passing; all compared models are general GNN models. We train with the L1 loss and report MAE on the targets.