Introduction
Graph-structured data is ubiquitous across application domains ranging from chemo- and bioinformatics to image and social network analysis. To develop successful machine learning models in these domains, we need techniques that can exploit the rich information inherent in graph structure, as well as the feature information contained within a graph’s nodes and edges. In recent years, numerous approaches have been proposed for machine learning with graphs, most notably approaches based on graph kernels [Vishwanathan et al.2010] or, alternatively, graph neural network algorithms [Hamilton, Ying, and Leskovec2017b].
Kernel approaches typically fix a set of features in advance, e.g., indicator features over subgraph structures or features of local node neighborhoods. For example, one of the most successful kernel approaches, the Weisfeiler-Lehman subtree kernel [Shervashidze et al.2011], which is based on the 1-dimensional Weisfeiler-Leman graph isomorphism heuristic [Grohe2017, pp. 79 ff.], generates node features through an iterative relabeling, or coloring, scheme: First, all nodes are assigned a common initial color; the algorithm then iteratively recolors a node by aggregating over the multiset of colors in its neighborhood, and the final feature representation of a graph is the histogram of the resulting node colors. By iteratively aggregating over local node neighborhoods in this way, the WL subtree kernel is able to effectively summarize the neighborhood substructures present in a graph. However, while powerful, the WL subtree kernel, like other kernel methods, is limited because this feature construction scheme is fixed (i.e., it does not adapt to the given data distribution). Moreover, this approach, like the majority of kernel methods, focuses only on the graph structure and cannot interpret continuous node and edge labels, such as real-valued vectors, which play an important role in applications such as bio- and chemoinformatics.
Graph neural networks (GNNs) have emerged as a machine learning framework addressing the above challenges. Standard GNNs can be viewed as a neural version of the 1-WL algorithm, where colors are replaced by continuous feature vectors and neural networks are used to aggregate over node neighborhoods [Hamilton, Ying, and Leskovec2017a, Kipf and Welling2017]. In effect, the GNN framework can be viewed as implementing a continuous form of graph-based “message passing”, where local neighborhood information is aggregated and passed on to the neighbors [Gilmer et al.2017]. By deploying a trainable neural network to aggregate information in local node neighborhoods, GNNs can be trained in an end-to-end fashion together with the parameters of the classification or regression algorithm, possibly allowing for greater adaptability and better generalization compared to the kernel counterpart of the classical 1-WL algorithm.
Up to now, the evaluation and analysis of GNNs has been largely empirical, showing promising results compared to kernel approaches, see, e.g., [Ying et al.2018b]. However, it remains unclear how GNNs actually encode graph structure information into their vector representations, and whether there are theoretical advantages of GNNs compared to kernel-based approaches.
Present Work. We offer a theoretical exploration of the relationship between GNNs and kernels that are based on the 1-WL algorithm. We show that GNNs cannot be more powerful than the 1-WL in terms of distinguishing non-isomorphic (sub)graphs, e.g., the properties of subgraphs around each node. This result holds for a broad class of GNN architectures and all possible choices of parameters for them. On the positive side, we show that given the right parameter initialization GNNs have the same expressiveness as the 1-WL algorithm, completing the equivalence. Since the power of the 1-WL has been completely characterized, see, e.g., [Arvind et al.2015, Kiefer, Schweitzer, and Selman2015], we can transfer these results to the case of GNNs, showing that both approaches have the same shortcomings.
Going further, we leverage these theoretical relationships to propose a generalization of GNNs, called $k$-GNNs, which are neural architectures based on the $k$-dimensional WL algorithm ($k$-WL) and which are strictly more powerful than GNNs. The key insight in these higher-dimensional variants is that they perform message passing directly between subgraph structures, rather than individual nodes. This higher-order form of message passing can capture structural information that is not visible at the node level.
Graph kernels based on the $k$-WL have been proposed in the past [Morris, Kersting, and Mutzel2017]. However, a key advantage of implementing higher-order message passing in GNNs, which we demonstrate here, is that we can design hierarchical variants of $k$-GNNs, which combine graph representations learned at different granularities in an end-to-end trainable framework. Concretely, in the presented hierarchical approach the initial messages in a $k$-GNN are based on the output of a lower-dimensional $k'$-GNN (with $k' < k$), which allows the model to effectively capture graph structures of varying granularity. Many real-world graphs exhibit a hierarchical structure (e.g., in a social network we must model both the ego-networks around individual nodes, as well as the coarse-grained relationships between entire communities, see, e.g., [Newman2003]), and our experimental results demonstrate that these hierarchical $k$-GNNs are able to consistently outperform traditional GNNs on a variety of graph classification and regression tasks. Across twelve graph regression tasks from the QM9 benchmark, we find that our hierarchical model reduces the mean absolute error by 54.45% on average. For graph classification, we find that our hierarchical models lead to slight performance gains.
Key Contributions. Our key contributions are summarized as follows:

We show that 1-GNNs are not more powerful than the 1-WL in terms of distinguishing non-isomorphic (sub)graphs. Moreover, we show that, assuming a suitable parameter initialization, 1-GNNs have the same power as the 1-WL.

We propose $k$-GNNs, which are strictly more powerful than 1-GNNs. Moreover, we propose a hierarchical version of $k$-GNNs, so-called 1-$k$-GNNs, which are able to work with the fine- and coarse-grained structures of a given graph, and relationships between those.

Our theoretical findings are backed up by an experimental study, showing that higher-order graph properties are important for successful graph classification and regression.
Related Work
Our study builds upon a wealth of work at the intersection of supervised learning on graphs, kernel methods, and graph neural networks.
Historically, kernel methods, which implicitly or explicitly map graphs to elements of a Hilbert space, have been the dominant approach for supervised learning on graphs. Important early work in this area includes random-walk based kernels [Gärtner, Flach, and Wrobel2003, Kashima, Tsuda, and Inokuchi2003] and kernels based on shortest paths [Borgwardt and Kriegel2005]. More recently, developments in graph kernels have emphasized scalability, focusing on techniques that bypass expensive Gram matrix computations by using explicit feature maps. Prominent examples of this trend include kernels based on graphlet counting [Shervashidze et al.2009], and, most notably, the Weisfeiler-Lehman subtree kernel [Shervashidze et al.2011] as well as its higher-order variants [Morris, Kersting, and Mutzel2017]. Graphlet and Weisfeiler-Leman kernels have been successfully employed within frameworks for smoothed and deep graph kernels [Yanardag and Vishwanathan2015a, Yanardag and Vishwanathan2015b]. Recent works focus on assignment-based approaches [Kriege, Giscard, and Wilson2016, Nikolentzos, Meladianos, and Vazirgiannis2017, Johansson and Dubhashi2015], spectral approaches [Kondor and Pan2016], and graph decomposition approaches [Nikolentzos et al.2018]. Graph kernels were dominant in graph classification for several years, leading to new state-of-the-art results on many classification tasks. However, they are limited by the fact that they cannot effectively adapt their feature representations to a given data distribution, since they generally rely on a fixed set of features. More recently, a number of approaches to graph classification based upon neural networks have been proposed. Most of the neural approaches fit into the graph neural network framework proposed by [Gilmer et al.2017].
Notable instances of this model include Neural Fingerprints [Duvenaud et al.2015], Gated Graph Neural Networks [Li et al.2016], GraphSAGE [Hamilton, Ying, and Leskovec2017a], SplineCNN [Fey et al.2018], and the spectral approaches proposed in [Bruna et al.2014, Defferrard, X., and Vandergheynst2016, Kipf and Welling2017], all of which descend from early work in [Merkwirth and Lengauer2005] and [Scarselli et al.2009b]. Recent extensions and improvements to the GNN framework include approaches to incorporate different local structures around subgraphs [Xu et al.2018] and novel techniques for pooling node representations in order to perform graph classification [Zhang et al.2018, Ying et al.2018b]. GNNs have achieved state-of-the-art performance on several graph classification benchmarks in recent years, see, e.g., [Ying et al.2018b], as well as in applications such as protein-protein interaction prediction [Fout et al.2017], recommender systems [Ying et al.2018a], and the analysis of quantum interactions in molecules [Schütt et al.2017]. A survey of recent advancements in GNN techniques can be found in [Hamilton, Ying, and Leskovec2017b].
Up to this point (and despite their empirical success) there has been very little theoretical work on GNNs, with the notable exceptions of Li et al.’s [Li, Han, and Wu2018] work connecting GNNs to a special form of Laplacian smoothing and Lei et al.’s [Lei et al.2017] work showing that the feature maps generated by GNNs lie in the same Hilbert space as some popular graph kernels. Moreover, Scarselli et al. [Scarselli et al.2009a] investigate the approximation capabilities of GNNs.
Preliminaries
We start by fixing notation, and then outline the Weisfeiler-Leman algorithm and the standard graph neural network framework.
Notation and Background
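A graph $G$ is a pair $(V, E)$ with a finite set of nodes $V$ and a set of edges $E \subseteq \{ \{u, v\} \subseteq V \mid u \neq v \}$. We denote the set of nodes and the set of edges of $G$ by $V(G)$ and $E(G)$, respectively. For ease of notation we denote the edge $\{u, v\}$ in $E(G)$ by $(u, v)$ or $(v, u)$. Moreover, $N(v)$ denotes the neighborhood of $v$ in $V(G)$, i.e., $N(v) = \{ u \in V(G) \mid (v, u) \in E(G) \}$. We say that two graphs $G$ and $H$ are isomorphic if there exists an edge-preserving bijection $\varphi \colon V(G) \to V(H)$, i.e., $(u, v)$ is in $E(G)$ if and only if $(\varphi(u), \varphi(v))$ is in $E(H)$. We write $G \simeq H$ and call the equivalence classes induced by $\simeq$ isomorphism types. Let $S \subseteq V(G)$, then $G[S] = (S, E_S)$ is the subgraph induced by $S$ with $E_S = \{ (u, v) \in E(G) \mid u, v \in S \}$. A node coloring is a function $V(G) \to \Sigma$ with arbitrary codomain $\Sigma$. Then a node-colored or labeled graph $(G, l)$ is a graph $G$ endowed with a node coloring $l \colon V(G) \to \Sigma$. We say that $l(v)$ is a label or color of $v \in V(G)$. We say that a node coloring $c$ refines a node coloring $d$, written $c \sqsubseteq d$, if $c(v) = c(w)$ implies $d(v) = d(w)$ for every $v, w$ in $V(G)$. Two colorings are equivalent if $c \sqsubseteq d$ and $d \sqsubseteq c$, and we write $c \equiv d$. A color class $Q \subseteq V(G)$ of a node coloring $c$ is a maximal set of nodes with $c(v) = c(w)$ for every $v, w$ in $Q$. Moreover, let $[1:n] = \{1, \dots, n\} \subset \mathbb{N}$ for $n > 1$, let $S$ be a set, then the set of $k$-sets is $[S]^k = \{ U \subseteq S \mid |U| = k \}$ for $k \geq 2$, which is the set of all subsets with cardinality $k$, and let $\{\!\!\{ \dots \}\!\!\}$ denote a multiset.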
Weisfeiler-Leman Algorithm
We now describe the 1-WL algorithm for labeled graphs. Let $(G, l)$ be a labeled graph. In each iteration $t \geq 0$, the 1-WL computes a node coloring $c_l^{(t)} \colon V(G) \to \Sigma$, which depends on the coloring from the previous iteration. In iteration $0$, we set $c_l^{(0)} = l$. Now in iteration $t > 0$, we set
$$c_l^{(t)}(v) = \mathrm{HASH}\Big( \big( c_l^{(t-1)}(v), \{\!\!\{ c_l^{(t-1)}(u) \mid u \in N(v) \}\!\!\} \big) \Big), \tag{1}$$
where HASH bijectively maps the above pair to a unique value in $\Sigma$, which has not been used in previous iterations. To test two graphs $G$ and $H$ for isomorphism, we run the above algorithm in “parallel” on both graphs. Now if the two graphs have a different number of nodes colored $\sigma$ in $\Sigma$, the 1-WL concludes that the graphs are not isomorphic. Moreover, if the number of colors between two iterations does not change, i.e., the cardinalities of the images of $c_l^{(t-1)}$ and $c_l^{(t)}$ are equal, the algorithm terminates. Termination is guaranteed after at most $\max\{|V(G)|, |V(H)|\}$ iterations. It is easy to see that the algorithm is not able to distinguish all non-isomorphic graphs, e.g., see [Cai, Fürer, and Immerman1992]. Nonetheless, it is a powerful heuristic, which can successfully test isomorphism for a broad class of graphs [Babai and Kucera1979].
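The refinement loop can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `wl_colors` and `wl_distinguishes`, the adjacency-dict graph format, and the use of a shared dictionary in place of the injective HASH are all assumptions made for this example. The two graphs are compared by refining their disjoint union, so that equal signatures receive equal colors on both sides.

```python
from collections import Counter


def wl_colors(adj, labels, iterations):
    """Return the 1-WL coloring after `iterations` refinement rounds.

    adj maps each node to a list of neighbors; labels maps nodes to labels.
    """
    palette = {}  # stands in for the injective HASH: signature -> fresh id
    colors = {v: palette.setdefault(("init", labels[v]), len(palette)) for v in adj}
    for _ in range(iterations):
        # New color = (own old color, sorted multiset of neighbor colors).
        colors = {
            v: palette.setdefault(
                (colors[v], tuple(sorted(colors[u] for u in adj[v]))), len(palette)
            )
            for v in adj
        }
    return colors


def wl_distinguishes(adj_g, labels_g, adj_h, labels_h):
    """Refine the disjoint union of G and H and compare the color histograms."""
    adj = {("g", v): [("g", u) for u in nbrs] for v, nbrs in adj_g.items()}
    adj.update({("h", v): [("h", u) for u in nbrs] for v, nbrs in adj_h.items()})
    labels = {("g", v): lab for v, lab in labels_g.items()}
    labels.update({("h", v): lab for v, lab in labels_h.items()})
    # |V| rounds are always enough for the partition to become stable.
    colors = wl_colors(adj, labels, iterations=len(adj))
    hist_g = Counter(c for (side, _), c in colors.items() if side == "g")
    hist_h = Counter(c for (side, _), c in colors.items() if side == "h")
    return hist_g != hist_h
```

Because every new color contains the node's previous color, comparing the histograms of the final round is sufficient; a fixed number of rounds is used here instead of the early-termination test for brevity.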
The $k$-dimensional Weisfeiler-Leman algorithm ($k$-WL), for $k \geq 2$, is a generalization of the 1-WL which colors tuples from $V(G)^k$ instead of nodes. That is, the algorithm computes a coloring $c_{k,l}^{(t)} \colon V(G)^k \to \Sigma$. In order to describe the algorithm, we define the $j$-th neighborhood
$$N_j(s) = \{ (s_1, \dots, s_{j-1}, r, s_{j+1}, \dots, s_k) \mid r \in V(G) \} \tag{2}$$
of a $k$-tuple $s = (s_1, \dots, s_k)$ in $V(G)^k$. That is, the $j$-th neighborhood $N_j(s)$ of $s$ is obtained by replacing the $j$-th component of $s$ by every node from $V(G)$. In iteration $0$, the algorithm labels each $k$-tuple with its atomic type, i.e., two $k$-tuples $s$ and $s'$ in $V(G)^k$ get the same color if the map $s_i \mapsto s'_i$ induces a (labeled) isomorphism between the subgraphs induced by the nodes from $s$ and $s'$, respectively. For iteration $t > 0$, we define
$$C_j^{(t)}(s) = \mathrm{HASH}\Big( \{\!\!\{ c_{k,l}^{(t-1)}(s') \mid s' \in N_j(s) \}\!\!\} \Big), \tag{3}$$
and set
$$c_{k,l}^{(t)}(s) = \mathrm{HASH}\Big( \big( c_{k,l}^{(t-1)}(s), \big( C_1^{(t)}(s), \dots, C_k^{(t)}(s) \big) \big) \Big). \tag{4}$$
Hence, two tuples $s$ and $s'$ with $c_{k,l}^{(t-1)}(s) = c_{k,l}^{(t-1)}(s')$ get different colors in iteration $t$ if there exists $j$ in $[1:k]$ such that the number of $j$-neighbors of $s$ and $s'$, respectively, colored with a certain color is different. The algorithm then proceeds analogously to the 1-WL. By increasing $k$, the algorithm gets more powerful in terms of distinguishing non-isomorphic graphs, i.e., for each $k \geq 2$, there are non-isomorphic graphs which can be distinguished by the $(k+1)$-WL but not by the $k$-WL [Cai, Fürer, and Immerman1992]. We note here that the above variant is not equal to the folklore variant of $k$-WL described in [Cai, Fürer, and Immerman1992], which differs slightly in its update rule. However, it holds that the $k$-WL using Equation 4 is as powerful as the folklore $k$-WL [Grohe and Otto2015].
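To make the tuple-based update concrete, the following Python sketch follows the description above: the initial color is the atomic type of a tuple, and each round pairs the old color with one neighbor-color multiset per tuple position. It is an illustrative sketch only; the names `atomic_type` and `k_wl_histogram` are hypothetical, and instead of an explicit HASH the nested signatures themselves serve as colors, which keeps histograms of different graphs directly comparable at the price of signatures that grow with the number of rounds.

```python
from collections import Counter
from itertools import product


def atomic_type(adj, labels, s):
    """Initial color of tuple s: labels, equality pattern, adjacency pattern."""
    pairs = [(i, j) for i in range(len(s)) for j in range(i + 1, len(s))]
    return (
        tuple(labels[v] for v in s),
        tuple(s[i] == s[j] for i, j in pairs),
        tuple(s[j] in adj[s[i]] for i, j in pairs),
    )


def k_wl_histogram(adj, labels, k, iterations):
    """Color histogram of the tuple-based k-WL after `iterations` rounds."""
    nodes = sorted(adj)
    colors = {s: atomic_type(adj, labels, s) for s in product(nodes, repeat=k)}
    for _ in range(iterations):
        refined = {}
        for s in colors:
            per_position = []
            for j in range(k):
                # j-th neighborhood: replace component j by every node of the graph.
                neighborhood = (s[:j] + (r,) + s[j + 1:] for r in nodes)
                multiset = Counter(colors[t] for t in neighborhood)
                per_position.append(tuple(sorted(multiset.items())))
            refined[s] = (colors[s], tuple(per_position))
        colors = refined
    return Counter(colors.values())
```

On a 6-cycle and the disjoint union of two triangles, in which every node has degree 2, the histograms for $k = 3$ already differ, since only the second graph contains tuples whose three nodes are pairwise adjacent. The cost grows as $|V(G)|^k$ tuples per round, so the sketch is only meant for very small graphs.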
WL Kernels. After running the WL algorithm, the concatenation of the histogram of colors in each iteration can be used as a feature vector in a kernel computation. Specifically, in the histogram for every color $\sigma$ in $\Sigma$ there is an entry containing the number of nodes or $k$-tuples that are colored with $\sigma$.
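As a small illustration of this feature construction for the node-based case, the sketch below keeps one sparse histogram per graph and adds the color counts of every iteration to it, which corresponds to concatenating the per-iteration histograms. The function names and the adjacency-dict input format are assumptions of this example; the kernel value is taken to be the plain dot product of two feature vectors.

```python
from collections import Counter


def wl_feature_maps(graphs, iterations):
    """Map each (adj, labels) pair to a Counter over WL colors.

    Colors of iterations 0..iterations are all kept, so each Counter is the
    concatenation of the per-iteration color histograms.
    """
    palette = {}  # shared across graphs so equal signatures get equal ids
    colorings = [
        {v: palette.setdefault(("init", labels[v]), len(palette)) for v in adj}
        for adj, labels in graphs
    ]
    features = [Counter(c.values()) for c in colorings]
    for _ in range(iterations):
        colorings = [
            {
                v: palette.setdefault(
                    (c[v], tuple(sorted(c[u] for u in adj[v]))), len(palette)
                )
                for v in adj
            }
            for (adj, _), c in zip(graphs, colorings)
        ]
        for feat, c in zip(features, colorings):
            feat.update(c.values())
    return features


def wl_subtree_kernel(feat_a, feat_b):
    """Dot product of two sparse color-histogram feature vectors."""
    return sum(count * feat_b[color] for color, count in feat_a.items())
```

For an unlabeled triangle and an unlabeled path on three nodes with one refinement round, the feature maps share the initial color (three nodes each) and the color of nodes with two neighbors, which gives a kernel value of 3 * 3 + 3 * 1 = 12.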
Graph Neural Networks
Let $(G, l)$ be a labeled graph with an initial node coloring $f^{(0)} \colon V(G) \to \mathbb{R}^{1 \times d}$ that is consistent with $l$. This means that each node $v$ is annotated with a feature $f^{(0)}(v)$ in $\mathbb{R}^{1 \times d}$ such that $f^{(0)}(u) = f^{(0)}(v)$ if and only if $l(u) = l(v)$. Alternatively, $f^{(0)}(v)$ can be an arbitrary real-valued feature vector associated with $v$. Examples include continuous atomic properties in chemoinformatic applications where nodes correspond to atoms, or vector representations of text in social network applications. A GNN model consists of a stack of neural network layers, where each layer aggregates local neighborhood information, i.e., features of neighbors, around each node and then passes this aggregated information on to the next layer.
A basic GNN model can be implemented as follows [Hamilton, Ying, and Leskovec2017b]. In each layer $t > 0$, we compute a new feature
$$f^{(t)}(v) = \sigma\Big( f^{(t-1)}(v) \cdot W_1^{(t)} + \sum_{w \in N(v)} f^{(t-1)}(w) \cdot W_2^{(t)} \Big) \tag{5}$$
in $\mathbb{R}^{1 \times e}$ for $v$, where $W_1^{(t)}$ and $W_2^{(t)}$ are parameter matrices from $\mathbb{R}^{d \times e}$, and $\sigma$ denotes a component-wise non-linear function, e.g., a sigmoid or a ReLU. (For clarity of presentation we omit biases.)
Following [Gilmer et al.2017], one may also replace the sum defined over the neighborhood in the above equation by a permutation-invariant, differentiable function, and one may substitute the outer sum, e.g., by a column-wise vector concatenation or LSTM-style update step. Thus, in full generality a new feature $f^{(t)}(v)$ is computed as
$$f^{(t)}(v) = f^{W_1}_{\text{merge}}\Big( f^{(t-1)}(v), f^{W_2}_{\text{aggr}}\big( \{\!\!\{ f^{(t-1)}(w) \mid w \in N(v) \}\!\!\} \big) \Big), \tag{6}$$
where $f^{W_2}_{\text{aggr}}$ aggregates over the set of neighborhood features and $f^{W_1}_{\text{merge}}$ merges the node’s representations from step $(t-1)$ with the computed neighborhood features. Both $f^{W_2}_{\text{aggr}}$ and $f^{W_1}_{\text{merge}}$ may be arbitrary differentiable, permutation-invariant functions (e.g., neural networks), and, by analogy to Equation 5, we denote their parameters as $W_2$ and $W_1$, respectively. In the rest of this paper, we refer to neural architectures implementing Equation 6 as 1-dimensional GNN architectures (1-GNNs).
A vector representation $f_{\text{GNN}}$ over the whole graph can be computed by summing over the vector representations computed for all nodes, i.e.,
$$f_{\text{GNN}}(G) = \sum_{v \in V(G)} f^{(T)}(v),$$
where $T > 0$ denotes the last layer. More refined approaches use differential pooling operators based on sorting [Zhang et al.2018] and soft assignments [Ying et al.2018b].
In order to adapt the parameters $W_1$ and $W_2$ of Equations 5 and 6 to a given data distribution, they are optimized in an end-to-end fashion (usually via stochastic gradient descent) together with the parameters of a neural network used for classification or regression.
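The basic layer and the sum readout can be sketched without any deep learning library. The following pure-Python example is only an illustration of the forward computation (no training); the helper names `matvec`, `gnn_layer`, and `sum_readout`, the list-of-lists matrix format, and the default ReLU are assumptions of this sketch, and biases are omitted as in the text.

```python
def matvec(row, matrix):
    """Multiply a 1 x d row vector with a d x e matrix given as a list of rows."""
    return [sum(x * w for x, w in zip(row, col)) for col in zip(*matrix)]


def gnn_layer(adj, feats, w1, w2, act=lambda x: max(x, 0.0)):
    """One basic layer: act(f(v) W1 + sum over neighbors w of f(w) W2)."""
    new_feats = {}
    for v, nbrs in adj.items():
        dim = len(feats[v])
        # By linearity, summing neighbor features first and multiplying once
        # with W2 equals summing the individual products f(w) W2.
        agg = [sum(feats[u][i] for u in nbrs) for i in range(dim)]
        own, nbr = matvec(feats[v], w1), matvec(agg, w2)
        new_feats[v] = [act(a + b) for a, b in zip(own, nbr)]
    return new_feats


def sum_readout(feats):
    """Graph-level representation: sum of all node feature vectors."""
    return [sum(col) for col in zip(*feats.values())]
```

On a path with three nodes, constant input features `[1.0]`, and both weight matrices set to `[[1.0]]`, one layer yields 2.0 for the two end nodes and 3.0 for the middle node, so the readout is `[7.0]`.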
Relationship Between 1-WL and 1-GNNs
In the following we explore the relationship between the 1-WL and 1-GNNs. Let $(G, l)$ be a labeled graph, and let $W^{(t)} = \big( W_1^{(t')}, W_2^{(t')} \big)_{t' \leq t}$ denote the GNN parameters given by Equation 5 or Equation 6 up to iteration $t$. We encode the initial labels $l(v)$ by vectors $f^{(0)}(v) \in \mathbb{R}^{1 \times d}$, e.g., using a 1-hot encoding.
Our first theoretical result shows that the 1-GNN architectures do not have more power in terms of distinguishing between non-isomorphic (sub)graphs than the 1-WL algorithm. More formally, let $f^{W_1}_{\text{merge}}$ and $f^{W_2}_{\text{aggr}}$ be any two functions chosen in (6). For every encoding of the labels $l(v)$ as vectors $f^{(0)}(v)$, and for every choice of $W^{(t)}$, we have that the coloring $c_l^{(t)}$ of 1-WL always refines the coloring $f^{(t)}$ induced by a 1-GNN parameterized by $W^{(t)}$.
Theorem 1.
Let $(G, l)$ be a labeled graph. Then for all $t \geq 0$ and for all choices of initial colorings $f^{(0)}$ consistent with $l$, and weights $W^{(t)}$,
$$c_l^{(t)} \sqsubseteq f^{(t)}.$$
Our second result states that there exists a sequence of parameter matrices $W^{(t)}$ such that 1-GNNs have exactly the same power in terms of distinguishing non-isomorphic (sub)graphs as the 1-WL algorithm. This even holds for the simple architecture (5), provided we choose the encoding of the initial labeling $l$ in such a way that different labels are encoded by linearly independent vectors.
Theorem 2.
Let $(G, l)$ be a labeled graph. Then for all $t \geq 0$ there exists a sequence of weights $W^{(t)}$, and a 1-GNN architecture such that
$$c_l^{(t)} \equiv f^{(t)}.$$
Hence, in the light of the above results, 1-GNNs may be viewed as an extension of the 1-WL which in principle have the same power but are more flexible in their ability to adapt to the learning task at hand and are able to handle continuous node features.
Shortcomings of Both Approaches
The power of 1-WL has been completely characterized, see, e.g., [Arvind et al.2015]. Hence, by using Theorems 1 and 2, this characterization is also applicable to 1-GNNs. On the other hand, 1-GNNs have the same shortcomings as the 1-WL. For example, both methods will give the same color to every node in a graph consisting of a triangle and a 4-cycle, although vertices from the triangle and the vertices from the 4-cycle are clearly different. Moreover, they are not capable of capturing simple graph theoretic properties, e.g., triangle counts, which are an important measure in social network analysis [Milo et al.2002, Newman2003].
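The triangle-plus-cycle example is easy to check by hand or in code. The sketch below builds an unlabeled graph made of a triangle and a disjoint cycle (a cycle of length 4 is assumed here; any length gives the same outcome because every node has degree 2), runs color refinement, and compares the result with per-node triangle counts. The helper names are hypothetical, and colors are re-indexed in every round, which preserves the partition into color classes.

```python
def refine_once(adj, colors):
    """One 1-WL step: recolor by (own color, sorted neighbor colors)."""
    sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
    ids = {sig: i for i, sig in enumerate(sorted(set(sigs.values())))}
    return {v: ids[sigs[v]] for v in adj}


def triangles_at(adj, v):
    """Number of triangles that contain node v."""
    nbrs = adj[v]
    return sum(1 for i, a in enumerate(nbrs) for b in nbrs[i + 1:] if b in adj[a])


# Triangle on nodes 0-2 and a 4-cycle on nodes 3-6 (cycle length is an assumption).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 6], 4: [3, 5], 5: [4, 6], 6: [5, 3]}
colors = {v: 0 for v in adj}
for _ in range(len(adj)):
    colors = refine_once(adj, colors)

num_color_classes = len(set(colors.values()))
triangle_counts = {v: triangles_at(adj, v) for v in adj}
```

After refinement there is a single color class, while the triangle counts separate nodes 0 to 2 (one triangle each) from nodes 3 to 6 (none).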
k-dimensional Graph Neural Networks
In the following, we propose a generalization of 1-GNNs, so-called $k$-GNNs, which are based on the $k$-WL. Due to scalability and limited GPU memory, we consider a set-based version of the $k$-WL. For a given $k$, we consider all $k$-element subsets $[V(G)]^k$ over $V(G)$. Let $s = \{s_1, \dots, s_k\}$ be a $k$-set in $[V(G)]^k$, then we define the neighborhood of $s$ as
$$N(s) = \{ t \in [V(G)]^k \mid |s \cap t| = k - 1 \}.$$
The local neighborhood $N_L(s)$ consists of all $t \in N(s)$ such that $(v, w) \in E(G)$ for the unique $v \in s \setminus t$ and the unique $w \in t \setminus s$. The global neighborhood $N_G(s)$ then is defined as $N(s) \setminus N_L(s)$. (Note that the definition of the local neighborhood is different from the one defined in [Morris, Kersting, and Mutzel2017], which is a superset of our definition. Our computations therefore involve sparser graphs.)
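The set-based neighborhoods can be enumerated directly from this description. The following sketch is a naive illustration that compares all pairs of $k$-element subsets; the function name `k_set_neighborhoods` and the adjacency-dict input are assumptions of this example, and no attempt is made to be efficient.

```python
from itertools import combinations


def k_set_neighborhoods(adj, k):
    """Return {s: (local, global)} for all k-element node subsets s.

    Two k-sets are neighbors if they share exactly k - 1 nodes; the neighbor
    is local if the two exchanged nodes are joined by an edge.
    """
    k_sets = [frozenset(c) for c in combinations(sorted(adj), k)]
    result = {}
    for s in k_sets:
        local, global_ = set(), set()
        for t in k_sets:
            if len(s & t) != k - 1:
                continue
            (v,) = s - t  # the unique node that is only in s
            (w,) = t - s  # the unique node that is only in t
            (local if w in adj[v] else global_).add(t)
        result[s] = (local, global_)
    return result
```

On the path 0-1-2-3 with $k = 2$, the set {0, 1} has four neighbors; only {0, 2} is local, because it is obtained by exchanging node 1 for its graph neighbor 2.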
The set-based $k$-WL works analogously to the $k$-WL, i.e., it computes a coloring $c_{s,k,l}^{(t)} \colon [V(G)]^k \to \Sigma$ as in Equation 1 based on the above neighborhood. Initially, $c_{s,k,l}^{(0)}$ colors each element $s$ in $[V(G)]^k$ with the isomorphism type of $G[s]$.
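For the small values of $k$ considered here, the isomorphism type of an induced subgraph can be computed by brute force. The sketch below returns a canonical form of the labeled subgraph induced by a node set by minimizing over all orderings of the set; the function name `iso_type` is hypothetical, and labels are assumed to be mutually comparable (e.g., all strings or all integers).

```python
from itertools import combinations, permutations


def iso_type(adj, labels, s):
    """Canonical form of the labeled induced subgraph G[s] (brute force).

    Two node sets get the same value exactly when their labeled induced
    subgraphs are isomorphic. The cost is |s|! orderings per set.
    """
    best = None
    for order in permutations(s):
        label_part = tuple(labels[v] for v in order)
        edge_part = tuple(
            int(order[j] in adj[order[i]])
            for i, j in combinations(range(len(order)), 2)
        )
        candidate = (label_part, edge_part)
        if best is None or candidate < best:
            best = candidate
    return best
```

Mapping each distinct canonical form to an index then gives the initial colors, or a one-hot vector when the result is used as an input feature.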
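Let $(G, l)$ be a labeled graph. In each $k$-GNN layer $t \geq 0$, we compute a feature vector $f_k^{(t)}(s)$ for each $k$-set $s$ in $[V(G)]^k$. For $t = 0$, we set $f_k^{(0)}(s)$ to $f^{\text{iso}}(s)$, a one-hot encoding of the isomorphism type of $G[s]$ labeled by $l$. In each layer $t > 0$, we compute new features by
$$f_k^{(t)}(s) = \sigma\Big( f_k^{(t-1)}(s) \cdot W_1^{(t)} + \sum_{u \in N_L(s) \cup N_G(s)} f_k^{(t-1)}(u) \cdot W_2^{(t)} \Big),$$
where $N_L(s)$ and $N_G(s)$ denote the local and the global neighborhood of $s$, respectively. Moreover, one could split the sum into two sums ranging over $N_L(s)$ and $N_G(s)$ respectively, using distinct parameter matrices to enable the model to learn the importance of local and global neighborhoods. To scale $k$-GNNs to larger datasets and to prevent overfitting, we propose local $k$-GNNs, where we omit the global neighborhood of $s$, i.e.,
$$f_{k,L}^{(t)}(s) = \sigma\Big( f_{k,L}^{(t-1)}(s) \cdot W_1^{(t)} + \sum_{u \in N_L(s)} f_{k,L}^{(t-1)}(u) \cdot W_2^{(t)} \Big).$$
The running time for evaluation of the above depends on $|V|$, $k$, and the sparsity of the graph (each iteration can be bounded by the number of subsets of size $k$ times the maximum degree). Note that we can scale our method to larger datasets by using sampling strategies introduced in, e.g., [Morris, Kersting, and Mutzel2017, Hamilton, Ying, and Leskovec2017a]. We can now lift the results of the previous section to the $k$-dimensional case.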
Proposition 3.
Let $(G, l)$ be a labeled graph and let $k \geq 2$. Then for all $t \geq 0$, for all choices of initial colorings $f_k^{(0)}$ consistent with $l$ and for all weights $W^{(t)}$,
$$c_{s,k,l}^{(t)} \sqsubseteq f_k^{(t)}.$$
Again the second result states that there exists a suitable initialization of the parameter matrices $W^{(t)}$ such that $k$-GNNs have exactly the same power in terms of distinguishing non-isomorphic (sub)graphs as the set-based $k$-WL.
Proposition 4.
Let $(G, l)$ be a labeled graph and let $k \geq 2$. Then for all $t \geq 0$ there exists a sequence of weights $W^{(t)}$, and a $k$-GNN architecture such that
$$c_{s,k,l}^{(t)} \equiv f_k^{(t)}.$$
Hierarchical Variant
One key benefit of the end-to-end trainable $k$-GNN framework, compared to the discrete $k$-WL algorithm, is that we can hierarchically combine representations learned at different granularities. Concretely, rather than simply using one-hot indicator vectors as initial feature inputs in a $k$-GNN, we propose a hierarchical variant of $k$-GNN that uses the features learned by a $(k-1)$-dimensional GNN, in addition to the (labeled) isomorphism type, as the initial features, i.e.,
$$f_k^{(0)}(s) = \sigma\Big( \Big[ f^{\text{iso}}(s), \sum_{u \subset s} f_{k-1}^{(T_{k-1})}(u) \Big] \cdot W_{k-1} \Big),$$
for some $T_{k-1} > 0$, where $f^{\text{iso}}(s)$ is the one-hot encoding of the isomorphism type of $G[s]$, $W_{k-1}$ is a matrix of appropriate size, and square brackets denote matrix concatenation.
Hence, the features are recursively learned from dimensions $1$ to $k$ in an end-to-end fashion. This hierarchical model also satisfies Propositions 3 and 4, so its representational capacity is theoretically equivalent to a standard $k$-GNN (in terms of its relationship to $k$-WL). Nonetheless, hierarchy is a natural inductive bias for graph modeling, since many real-world graphs incorporate hierarchical structure, so we expect this hierarchical formulation to offer empirical utility.
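The hierarchical initialization can be illustrated with a small forward computation. In the sketch below, the initial feature of a $k$-set is obtained by concatenating its one-hot isomorphism type with pooled lower-level features and applying a weight matrix and a non-linearity. The function name `hierarchical_init` is hypothetical, and pooling by summing over all $(k-1)$-element subsets of the set is an assumption of this example; lower-level features are stored in a dict keyed by frozensets.

```python
def hierarchical_init(iso_onehot, lower_feats, s, weight, act=lambda x: max(x, 0.0)):
    """Initial feature of the k-set s from its iso type and (k-1)-level features.

    iso_onehot: one-hot list for the isomorphism type of G[s]
    lower_feats: {frozenset of k-1 nodes: feature list} from the lower level
    weight: matrix (list of rows) with len(iso_onehot) + feature-dim rows
    """
    subsets = [s - {v} for v in s]  # all (k-1)-element subsets of s
    dim = len(next(iter(lower_feats.values())))
    pooled = [sum(lower_feats[u][i] for u in subsets) for i in range(dim)]
    concat = list(iso_onehot) + pooled  # the bracket [ . , . ] in the text
    return [act(sum(x * w for x, w in zip(concat, col))) for col in zip(*weight)]
```

For $k = 2$ the lower level consists of single nodes, so the features of the two end points of a pair are summed before the concatenation.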
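Table 1: Classification accuracies in percent on the graph classification benchmark datasets.
| Type | Method | Pro | IMDB-Bin | IMDB-Mul | PTC-FM | NCI1 | Mutag | PTC-MR |
|---|---|---|---|---|---|---|---|---|
| Kernel | Graphlet | 72.9 | 59.4 | 40.8 | 58.3 | 72.1 | 87.7 | 54.7 |
| Kernel | Shortest-path | 76.4 | 59.2 | 40.5 | 62.1 | 74.5 | 81.7 | 58.9 |
| Kernel | 1-WL | 73.8 | 72.5 | 51.5 | 62.9 | 83.1 | 78.3 | 61.3 |
| Kernel | 2-WL | 75.2 | 72.6 | 50.6 | 64.7 | 77.0 | 77.0 | 61.9 |
| Kernel | 3-WL | 74.7 | 73.5 | 49.7 | 61.5 | 83.1 | 83.2 | 62.5 |
| Kernel | WL-OA | 75.3 | 73.1 | 50.4 | 62.7 | 86.1 | 84.5 | 63.6 |
| GNN | DCNN | 61.3 | 49.1 | 33.5 | - | 62.6 | 67.0 | 56.6 |
| GNN | PatchySan | 75.9 | 71.0 | 45.2 | - | 78.6 | 92.6 | 60.0 |
| GNN | DGCNN | 75.5 | 70.0 | 47.8 | - | 74.4 | 85.8 | 58.6 |
| GNN | 1-GNN No Tuning | 70.7 | 69.4 | 47.3 | 59.0 | 58.6 | 82.7 | 51.2 |
| GNN | 1-GNN | 72.2 | 71.2 | 47.7 | 59.3 | 74.3 | 82.2 | 59.0 |
| GNN | 1-2-3-GNN No Tuning | 75.9 | 70.3 | 48.8 | 60.0 | 67.4 | 84.4 | 59.3 |
| GNN | 1-2-3-GNN | 75.5 | 74.2 | 49.5 | 62.8 | 76.2 | 86.1 | 60.9 |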
Table 2: Mean absolute errors on the QM9 dataset.
Target  DTNN [Wu et al.2018]  MPNN [Wu et al.2018]  1-GNN  1-2-GNN  1-3-GNN  1-2-3-GNN  Gain
0.358  0.493  0.493  0.476  4.0%  
0.95  0.89  0.78  0.46  65.3%  
0.00388  0.00541  0.00331  0.00328  0.00337  –  
0.00512  0.00623  0.00355  0.00354  0.00351  1.4%  
0.0112  0.0066  0.0049  0.0047  0.0048  6.1%  
28.5  34.1  21.5  25.8  22.9  37.0%  
ZPVE  0.00172  0.00216  0.00124  0.00064  0.00019  85.5%  
2.43  2.05  2.32  0.6855  0.0427  98.5%  
2.43  2.00  2.08  0.686  0.111  94.9%  
2.43  2.02  2.23  0.070  0.794  98.1%  
2.43  2.02  1.94  0.140  0.587  97.6%  
0.27  0.42  0.27  0.0989  0.158  65.0% 
Experimental Study
In the following, we want to investigate potential benefits of GNNs over graph kernels as well as the benefits of our proposed $k$-GNN architectures over 1-GNN architectures. More precisely, we address the following questions:
Q1: How do the (hierarchical) $k$-GNNs perform in comparison to state-of-the-art graph kernels?
Q2: How do the (hierarchical) $k$-GNNs perform in comparison to the 1-GNN in graph classification and regression tasks?
Q3: How much (if any) improvement is provided by optimizing the parameters of the GNN aggregation function, compared to just using random GNN parameters while optimizing the parameters of the downstream classification/regression algorithm?
Datasets
To compare our $k$-GNN architectures to kernel approaches we use well-established benchmark datasets from the graph kernel literature [Kersting et al.2016]. The nodes of each graph in these datasets are annotated with (discrete) labels or no labels.
To demonstrate that our architectures scale to larger datasets and offer benefits on real-world applications, we conduct experiments on the QM9 dataset [Ramakrishnan et al.2014, Ruddigkeit et al.2012, Wu et al.2018], which consists of 133 385 small molecules. The aim here is to perform regression on twelve targets representing energetic, electronic, geometric, and thermodynamic properties, which were computed using density functional theory.
Baselines
We use the following kernel and GNN methods as baselines for our experiments.
Kernel Baselines. We use the Graphlet kernel [Shervashidze et al.2009], the shortest-path kernel [Borgwardt and Kriegel2005], the Weisfeiler-Lehman subtree kernel (WL) [Shervashidze et al.2011], the Weisfeiler-Lehman Optimal Assignment kernel (WL-OA) [Kriege, Giscard, and Wilson2016], and the global-local $k$-WL [Morris, Kersting, and Mutzel2017] with $k$ in $\{2, 3\}$ as kernel baselines. For each kernel, we computed the normalized Gram matrix. We used the $C$-SVM implementation of LIBSVM [Chang and Lin2011] to compute the classification accuracies using 10-fold cross validation. The parameter $C$ was selected from a predefined set of values by 10-fold cross validation on the training folds.
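The text does not state which normalization was applied to the Gram matrices. A common choice, assumed in the sketch below, is cosine normalization, which divides each entry by the geometric mean of the two corresponding diagonal entries so that every graph has unit self-similarity. The function name `normalize_gram` is hypothetical, and strictly positive diagonal entries are assumed.

```python
import math


def normalize_gram(gram):
    """Cosine-normalize a kernel matrix: K[i][j] / sqrt(K[i][i] * K[j][j])."""
    diag = [math.sqrt(gram[i][i]) for i in range(len(gram))]
    return [
        [gram[i][j] / (diag[i] * diag[j]) for j in range(len(gram))]
        for i in range(len(gram))
    ]
```

The normalized matrix could then be passed to an SVM that accepts precomputed kernels.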
Neural Baselines. To compare GNNs to kernels we used the basic 1-GNN layer of Equation 5, DCNN [Wang et al.2018], PatchySan [Niepert, Ahmed, and Kutzkov2016], and DGCNN [Zhang et al.2018]. For the QM9 dataset we used a 1-GNN layer similar to [Gilmer et al.2017], where we replaced the inner sum of Equation 5 with a 2-layer MLP in order to incorporate edge features (bond type and distance information). Moreover, we compare against the numbers provided in [Wu et al.2018].
Model Configuration
We always used three layers for 1-GNN, and two layers for (local) 2-GNN and 3-GNN, all with the same hidden-dimension size. For the hierarchical variant we used architectures that use features computed by 1-GNN as initial features for the 2-GNN (1-2-GNN) and 3-GNN (1-3-GNN), respectively. Moreover, using the combination of the former we componentwise concatenated the computed features of the 1-2-GNN and the 1-3-GNN (1-2-3-GNN). For the final classification and regression steps, we used a three-layer MLP, with binary cross entropy and mean squared error for the optimization, respectively. For classification we used a dropout layer after the first layer of the MLP. We applied global average pooling to generate a vector representation of the graph from the computed node features for each $k$. The resulting vectors are concatenated column-wise before feeding them into the MLP. Moreover, we used the Adam optimizer with a fixed initial learning rate and applied an adaptive learning rate decay based on validation results down to a fixed minimum. We trained the classification and the regression networks for a fixed number of epochs each.
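The pooling step that produces the MLP input can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, features are plain Python lists, and `per_k_feats` is an assumed container that maps each $k$ to the features computed for its nodes or $k$-sets.

```python
def global_mean_pool(feats):
    """Average all feature vectors of one level into a single vector."""
    rows = list(feats.values())
    return [sum(col) / len(rows) for col in zip(*rows)]


def graph_representation(per_k_feats):
    """Concatenate the pooled vectors of all levels k, in increasing order of k."""
    representation = []
    for k in sorted(per_k_feats):
        representation.extend(global_mean_pool(per_k_feats[k]))
    return representation
```

The resulting vector has one block per level, so its length is the sum of the feature dimensions used for the individual values of $k$.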
Experimental Protocol
For the smaller datasets, which we use for comparison against the kernel methods, we performed a 10-fold cross validation where we randomly sampled 10% of each training fold to act as a validation set. For the QM9 dataset, we follow the dataset splits described in [Wu et al.2018]. We randomly sampled 10% of the examples for validation, another 10% for testing, and used the remaining for training. We used the same initial node features as described in [Gilmer et al.2017]. Moreover, in order to illustrate the benefits of our hierarchical $k$-GNN architecture, we did not use a complete graph, where edges are annotated with pairwise distances, as input. Instead, we only used pairwise Euclidean distances for connected nodes, computed from the provided node coordinates. The code was built upon the work of [Fey et al.2018] and is provided at https://github.com/chrsmrrs/kgnn.
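The cross-validation protocol for the smaller datasets can be sketched with the standard library alone. The generator below is an illustrative assumption about how such splits might be produced, not the code from the linked repository; the name `cv_splits`, the seed handling, and the unstratified fold assignment are choices made for this example.

```python
import random


def cv_splits(n, folds=10, val_fraction=0.1, seed=0):
    """Yield (train, val, test) index lists for k-fold cross validation.

    A random `val_fraction` of each training fold is held out for validation.
    """
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    for fold in range(folds):
        test = indices[fold::folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        rng.shuffle(train)
        n_val = max(1, round(len(train) * val_fraction))
        yield train[n_val:], train[:n_val], test
```

With 100 graphs, each of the ten splits contains 81 training, 9 validation, and 10 test indices.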
Results and Discussion
In the following we answer questions Q1 to Q3. Table 1 shows the results for comparison with the kernel methods on the graph classification benchmark datasets. Here, the hierarchical $k$-GNN is on par with the kernels despite the small dataset sizes (answering question Q1). We also find that the 1-2-3-GNN significantly outperforms the 1-GNN on all seven datasets (answering Q2), with the 1-GNN being the overall weakest method across all tasks. (Note that in very recent work, GNNs have shown superior results over kernels when using advanced pooling techniques [Ying et al.2018b]. Note that our layers can be combined with these pooling layers. However, we opted to use standard global pooling in order to compare a typical GNN implementation with standard off-the-shelf kernels.) We can further see that optimizing the parameters of the aggregation function only leads to slight performance gains on two out of three datasets, and that no optimization even achieves better results on the Proteins benchmark dataset (answering Q3). We attribute this effect to the one-hot encoded node labels, which allow the GNN to gather enough information out of the neighborhood of a node, even when this aggregation is not learned.
Table 2 shows the results for the QM9 dataset. On eleven out of twelve targets all of our hierarchical variants beat the 1-GNN baseline, providing further evidence for Q2. For example, on one target we achieve a large improvement of 98.1% in MAE compared to the baseline. Moreover, on ten out of twelve datasets, the hierarchical $k$-GNNs beat the baselines from [Wu et al.2018]. However, the additional structural information extracted by the $k$-GNN layers does not serve all tasks equally, leading to huge differences in gains across the targets.
It should be noted that our $k$-GNN models have more parameters than the 1-GNN model, since we stack two additional $k$-GNN layers for each $k$. However, extending the 1-GNN model by additional layers to match the number of parameters of the $k$-GNN did not lead to better results in any experiment.
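The average improvement of 54.45% quoted in the introduction can be recomputed from the Gain column of Table 2. The short check below copies the twelve entries in table order; counting the entry printed as a dash as 0.0 is an assumption of this sketch.

```python
# Gain column of Table 2 in percent; the entry printed as a dash is taken as 0.0.
gains = [4.0, 65.3, 0.0, 1.4, 6.1, 37.0, 85.5, 98.5, 94.9, 98.1, 97.6, 65.0]

mean_gain = sum(gains) / len(gains)
print(f"{mean_gain:.2f}%")  # prints 54.45%
```

The sum of the entries is 653.4, and dividing by the twelve targets gives 54.45.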
Conclusion
We presented a theoretical investigation of GNNs, showing that a wide class of GNN architectures cannot be stronger than the 1-WL. On the positive side, we showed that, in principle, GNNs possess the same power in terms of distinguishing between non-isomorphic (sub)graphs, while having the added benefit of adapting to the given data distribution. Based on this insight, we proposed $k$-GNNs, which are a generalization of GNNs based on the $k$-WL. This new model is strictly stronger than 1-GNNs in terms of distinguishing non-isomorphic (sub)graphs and is capable of distinguishing more graph properties. Moreover, we devised a hierarchical variant of $k$-GNNs, which can exploit the hierarchical organization of most real-world graphs. Our experimental study shows that $k$-GNNs consistently outperform 1-GNNs and beat state-of-the-art neural architectures on large-scale molecule learning tasks. Future work includes designing task-specific $k$-GNNs, e.g., devising $k$-GNN layers that exploit expert knowledge in bio- and chemoinformatic settings.
Acknowledgments
This work is supported by the German research council (DFG) within the Research Training Group 2236 UnRAVeL and the Collaborative Research Center SFB 876, Providing Information by Resource-Constrained Analysis, projects A6 and B2.
References
 [Arvind et al.2015] Arvind, V.; Köbler, J.; Rattan, G.; and Verbitsky, O. 2015. On the power of color refinement. In Symposium on Fundamentals of Computation Theory, 339–350.
 [Babai and Kucera1979] Babai, L., and Kucera, L. 1979. Canonical labelling of graphs in linear average time. In Symposium on Foundations of Computer Science, 39–46.
 [Borgwardt and Kriegel2005] Borgwardt, K. M., and Kriegel, H.-P. 2005. Shortest-path kernels on graphs. In ICDM, 74–81.
 [Bruna et al.2014] Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2014. Spectral networks and deep locally connected networks on graphs. In ICLR.
 [Cai, Fürer, and Immerman1992] Cai, J.; Fürer, M.; and Immerman, N. 1992. An optimal lower bound on the number of variables for graph identifications. Combinatorica 12(4):389–410.
 [Chang and Lin2011] Chang, C.-C., and Lin, C.-J. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27.
 [Defferrard, X., and Vandergheynst2016] Defferrard, M.; X., B.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 3844–3852.
 [Duvenaud et al.2015] Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. P. 2015. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2224–2232.
 [Fey et al.2018] Fey, M.; Lenssen, J. E.; Weichert, F.; and Müller, H. 2018. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In CVPR.
 [Fout et al.2017] Fout, A.; Byrd, J.; Shariat, B.; and Ben-Hur, A. 2017. Protein interface prediction using graph convolutional networks. In NIPS, 6533–6542.
 [Gärtner, Flach, and Wrobel2003] Gärtner, T.; Flach, P.; and Wrobel, S. 2003. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines. 129–143.
 [Gilmer et al.2017] Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. In ICML.
 [Grohe and Otto2015] Grohe, M., and Otto, M. 2015. Pebble games and linear equations. Journal of Symbolic Logic 80(3):797–844.
 [Grohe2017] Grohe, M. 2017. Descriptive Complexity, Canonisation, and Definable Graph Structure Theory. Lecture Notes in Logic. Cambridge University Press.
 [Hamilton, Ying, and Leskovec2017a] Hamilton, W. L.; Ying, R.; and Leskovec, J. 2017a. Inductive representation learning on large graphs. In NIPS, 1025–1035.
 [Hamilton, Ying, and Leskovec2017b] Hamilton, W. L.; Ying, R.; and Leskovec, J. 2017b. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin 40(3):52–74.
 [Johansson and Dubhashi2015] Johansson, F. D., and Dubhashi, D. 2015. Learning with similarity functions on graphs using matchings of geometric embeddings. In KDD, 467–476.
 [Kashima, Tsuda, and Inokuchi2003] Kashima, H.; Tsuda, K.; and Inokuchi, A. 2003. Marginalized kernels between labeled graphs. In ICML, 321–328.
 [Kersting et al.2016] Kersting, K.; Kriege, N. M.; Morris, C.; Mutzel, P.; and Neumann, M. 2016. Benchmark data sets for graph kernels.
 [Kiefer, Schweitzer, and Selman2015] Kiefer, S.; Schweitzer, P.; and Selman, E. 2015. Graphs identified by logics with counting. In MFCS, 319–330. Springer.
 [Kipf and Welling2017] Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
 [Kondor and Pan2016] Kondor, R., and Pan, H. 2016. The multiscale Laplacian graph kernel. In NIPS, 2982–2990.
 [Kriege, Giscard, and Wilson2016] Kriege, N. M.; Giscard, P.-L.; and Wilson, R. C. 2016. On valid optimal assignment kernels and applications to graph classification. In NIPS, 1615–1623.
 [Lei et al.2017] Lei, T.; Jin, W.; Barzilay, R.; and Jaakkola, T. S. 2017. Deriving neural architectures from sequence and graph kernels. In ICML, 2024–2033.
 [Li et al.2016] Li, W.; Saidi, H.; Sanchez, H.; Schäf, M.; and Schweitzer, P. 2016. Detecting similar programs via the Weisfeiler-Leman graph kernel. In International Conference on Software Reuse, 315–330.
 [Li, Han, and Wu2018] Li, Q.; Han, Z.; and Wu, X.-M. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 3538–3545.
 [Merkwirth and Lengauer2005] Merkwirth, C., and Lengauer, T. 2005. Automatic generation of complementary descriptors with molecular graph networks. Journal of Chemical Information and Modeling 45(5):1159–1168.
 [Milo et al.2002] Milo, R.; Shen-Orr, S.; Itzkovitz, S.; Kashtan, N.; Chklovskii, D.; and Alon, U. 2002. Network motifs: simple building blocks of complex networks. Science 298(5594):824–827.
 [Morris, Kersting, and Mutzel2017] Morris, C.; Kersting, K.; and Mutzel, P. 2017. Glocalized Weisfeiler-Lehman kernels: Global-local feature maps of graphs. In ICDM, 327–336.
 [Newman2003] Newman, M. E. J. 2003. The structure and function of complex networks. SIAM review 45(2):167–256.
 [Niepert, Ahmed, and Kutzkov2016] Niepert, M.; Ahmed, M.; and Kutzkov, K. 2016. Learning convolutional neural networks for graphs. In ICML, 2014–2023.
 [Nikolentzos et al.2018] Nikolentzos, G.; Meladianos, P.; Limnios, S.; and Vazirgiannis, M. 2018. A degeneracy framework for graph similarity. In IJCAI, 2595–2601.
 [Nikolentzos, Meladianos, and Vazirgiannis2017] Nikolentzos, G.; Meladianos, P.; and Vazirgiannis, M. 2017. Matching node embeddings for graph similarity. In AAAI, 2429–2435.
 [Ramakrishnan et al.2014] Ramakrishnan, R.; Dral, P. O.; Rupp, M.; and von Lilienfeld, O. A. 2014. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1.
 [Ruddigkeit et al.2012] Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; and Reymond, J.-L. 2012. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling 52(11):2864–2875.
 [Scarselli et al.2009a] Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009a. Computational capabilities of graph neural networks. IEEE Transactions on Neural Networks 20(1):81–102.
 [Scarselli et al.2009b] Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009b. The graph neural network model. IEEE Transactions on Neural Networks 20(1):61–80.
 [Schütt et al.2017] Schütt, K.; Kindermans, P.-J.; Sauceda, H. E.; Chmiela, S.; Tkatchenko, A.; and Müller, K.-R. 2017. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. In NIPS, 992–1002.
 [Shervashidze et al.2009] Shervashidze, N.; Vishwanathan, S. V. N.; Petri, T. H.; Mehlhorn, K.; and Borgwardt, K. M. 2009. Efficient graphlet kernels for large graph comparison. In AISTATS, 488–495.
 [Shervashidze et al.2011] Shervashidze, N.; Schweitzer, P.; van Leeuwen, E. J.; Mehlhorn, K.; and Borgwardt, K. M. 2011. Weisfeiler-Lehman graph kernels. JMLR 12:2539–2561.
 [Vishwanathan et al.2010] Vishwanathan, S. V. N.; Schraudolph, N. N.; Kondor, R.; and Borgwardt, K. M. 2010. Graph kernels. JMLR 11:1201–1242.
 [Wang et al.2018] Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S. E.; Bronstein, M. M.; and Solomon, J. M. 2018. Dynamic graph CNN for learning on point clouds. CoRR abs/1801.07829.
 [Wu et al.2018] Wu, Z.; Ramsundar, B.; Feinberg, E. N.; Gomes, J.; Geniesse, C.; Pappu, A. S.; Leswing, K.; and Pande, V. 2018. Moleculenet: a benchmark for molecular machine learning. Chemical Science 9:513–530.
 [Xu et al.2018] Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.-i.; and Jegelka, S. 2018. Representation learning on graphs with jumping knowledge networks. In ICML, 5453–5462.
 [Yanardag and Vishwanathan2015a] Yanardag, P., and Vishwanathan, S. V. N. 2015a. Deep graph kernels. In KDD, 1365–1374.
 [Yanardag and Vishwanathan2015b] Yanardag, P., and Vishwanathan, S. V. N. 2015b. A structural smoothing framework for robust graph comparison. In NIPS, 2134–2142.
 [Ying et al.2018a] Ying, R.; He, R.; Chen, K.; Eksombatchai, P.; Hamilton, W. L.; and Leskovec, J. 2018a. Graph convolutional neural networks for web-scale recommender systems. In KDD.
 [Ying et al.2018b] Ying, R.; You, J.; Morris, C.; Ren, X.; Hamilton, W. L.; and Leskovec, J. 2018b. Hierarchical graph representation learning with differentiable pooling. In NIPS.
 [Zhang et al.2018] Zhang, M.; Cui, Z.; Neumann, M.; and Chen, Y. 2018. An end-to-end deep learning architecture for graph classification. In AAAI, 4428–4435.
Appendix
In the following we provide proofs for Theorem 1, Theorem 2, Proposition 3, and Proposition 4.
Proof of Theorem 1
Theorem 5 (Theorem 1 in the main paper).
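Let $(G, l)$ be a labeled graph. Then for all $t \geq 0$ and for all choices of initial colorings $f^{(0)}$ consistent with $l$, and weights $\mathbf{W}^{(t)}$,
$$c_l^{(t)} \sqsubseteq f^{(t)}.$$
For the theorem we consider a single iteration of the $1$-WL algorithm and the $1$-GNN on a single graph.
Proof of Theorem 1.
We show for an arbitrary iteration $t$ and nodes $u, v \in V(G)$ that $c_l^{(t)}(u) = c_l^{(t)}(v)$ implies $f^{(t)}(u) = f^{(t)}(v)$. In iteration $0$ we have $c_l^{(0)} \equiv f^{(0)}$, as the initial node coloring $f^{(0)}$ is chosen consistent with $l$.
Let $u, v \in V(G)$ and $t \in \mathbb{N}$ such that $c_l^{(t+1)}(u) = c_l^{(t+1)}(v)$. Assume for the induction that $c_l^{(t)} \sqsubseteq f^{(t)}$ holds. As $c_l^{(t+1)}(u) = c_l^{(t+1)}(v)$, we know from the refinement step of the $1$-WL that the old colors $c_l^{(t)}(u) = c_l^{(t)}(v)$ of $u$ and $v$, as well as the multisets $\{\!\!\{ c_l^{(t)}(w) \mid w \in N(u) \}\!\!\}$ and $\{\!\!\{ c_l^{(t)}(w) \mid w \in N(v) \}\!\!\}$ of colors of the neighbors of $u$ and $v$, are identical.
Let $M_u = \{\!\!\{ f^{(t)}(w) \mid w \in N(u) \}\!\!\}$ and $M_v = \{\!\!\{ f^{(t)}(w) \mid w \in N(v) \}\!\!\}$ be the multisets of feature vectors of the neighbors of $u$ and $v$, respectively. By the induction hypothesis, we know that $M_u = M_v$ and $f^{(t)}(u) = f^{(t)}(v)$, so that, independent of the choice of $f^{W_1}_{\text{merge}}$ and $f^{W_2}_{\text{aggr}}$, we get $f^{(t+1)}(u) = f^{(t+1)}(v)$. This holds as the input to both functions $f^{W_1}_{\text{merge}}$ and $f^{W_2}_{\text{aggr}}$ is identical. This proves $c_l^{(t+1)} \sqsubseteq f^{(t+1)}$ and thereby the theorem. ∎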
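The induction above can be checked numerically on a small graph. The following is a minimal sketch under stated assumptions, not the authors' implementation: it runs color refinement next to a simple neighborhood-aggregation layer and verifies that nodes sharing a color also share a feature vector. The layer form relu(F W1 + A F W2), the random integer weights (chosen so that the arithmetic is exact), the one-hot initial features, and all function names are illustrative assumptions.

```python
import numpy as np

def wl_colors(adj, labels, iterations):
    """Color refinement; adj maps each node to a list of its neighbors."""
    colors = dict(labels)
    for _ in range(iterations):
        # New signature: old color plus the sorted multiset of neighbor colors.
        sigs = {v: (colors[v], tuple(sorted(colors[w] for w in adj[v]))) for v in adj}
        # Rename distinct signatures injectively to small integers.
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: palette[sigs[v]] for v in adj}
    return colors

def gnn_features(adj, labels, iterations, dim=4, seed=0):
    """Hypothetical layer F <- relu(F W1 + A F W2) with random integer weights."""
    rng = np.random.default_rng(seed)
    nodes = sorted(adj)
    index = {v: i for i, v in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)), dtype=np.int64)
    for v in nodes:
        for w in adj[v]:
            A[index[v], index[w]] = 1
    # One-hot initial features, i.e. an initial coloring consistent with the labels.
    num_labels = max(labels.values()) + 1
    F = np.eye(num_labels, dtype=np.int64)[[labels[v] for v in nodes]]
    for _ in range(iterations):
        W1 = rng.integers(-2, 3, size=(F.shape[1], dim))
        W2 = rng.integers(-2, 3, size=(F.shape[1], dim))
        F = np.maximum(F @ W1 + A @ F @ W2, 0)
    return {v: tuple(F[index[v]].tolist()) for v in nodes}

def wl_refines_gnn(adj, labels, iterations, seed=0):
    """True if equal refinement colors always imply equal feature vectors."""
    c = wl_colors(adj, labels, iterations)
    f = gnn_features(adj, labels, iterations, seed=seed)
    return all(f[u] == f[v] for u in adj for v in adj if c[u] == c[v])

# Path graph 0-1-2-3-4 with a single node label (assumed toy input).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
uniform = {v: 0 for v in path}
print(wl_refines_gnn(path, uniform, iterations=3))
```

Changing the seed changes the weights but should not change the outcome of the check, which is the point of the theorem: the statement holds for every choice of weights, not just well-trained ones.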
Proof of Theorem 2
Theorem 6 (Theorem 2 in the main paper).
Let $(G, l)$ be a labeled graph. Then for all $t \geq 0$ there exists a sequence of weights $\mathbf{W}^{(t)}$ and a $1$-GNN architecture such that
$$c_l^{(t)} \equiv f^{(t)}.$$
For the proof we start by giving the proof for graphs where all nodes have the same initial color and then extend it to colored graphs. In order to do that we use a slightly adapted but equivalent version of the $1$-WL. Note that the extension to colored graphs is mostly technical, while the important idea is already contained in the first case.
Uncolored Graphs
Let be the refinement operator for the $1$-WL, mapping the old coloring to the updated one :
We first show that for uncolored graphs this is equivalent to the update rule :
We denote by $J$ the all-$1$ matrix, where the size will always be clear from the context.
Lemma 7.
Let be a graph, , and such that . Then for all .
Proof.
Let be minimal such that there are with
(7)  
and  
(8) 
Then , because as there are no initial colors. Let be the color classes of . That is, for all we have if and only if there is an such that . Similarly, let be the color classes of . Observe that the partition of refines the partition . Indeed, if there were , such that and , then all , would satisfy and , contradicting the minimality of .
Hence, the two update rules are equivalent.
Corollary 8.
For all and all we have .
Thus we can use the update rule for the proof on unlabeled graphs. For the proof, it will be convenient to assume that (although we still work with the notation ). It follows that . A node coloring defines a matrix where the row of is defined by . Here we interpret as a node from . As colorings and matrices can be interpreted as one another, given a matrix we write (or ) for a Weisfeiler-Leman iteration on the coloring induced by the matrix . For the GNN computation we provide a matrix-based notation. Using the adjacency matrix of and a coloring , we can write the update rule of the GNN layer as
where is the refinement operator of GNNs corresponding to a single iteration of the $1$-WL. For simplicity of the proof, we choose
and the bias as
Note that we later provide a way to simulate the sign function using ReLU operations, which indicates that choosing the sign function is not a hard restriction.
Lemma 9.
Let be a matrix such that for all and the rows of are pairwise distinct. Then there is a matrix such that the matrix is nonsingular.
Proof.
Let where is the upper bound on the matrix entries of and . Then the entries of are nonnegative and pairwise distinct. Without loss of generality, we assume that such that . Now we choose numbers such that
(9) 
for all as the are ordered. Let and and . Then has entries , and thus by (9),
(10) 
Thus is nonsingular. Now we simply let . Then . ∎
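The construction in this proof can be tried out numerically. The sketch below is only an illustration under assumptions that are not taken from the text: the input matrix has non-negative integer entries, the encoding vector is `z = (1, m, m^2, ...)` with `m` one more than the largest entry, the offset is written as `B X - J` with `J` the all-ones matrix, and the sign function maps positive values to +1 and everything else to -1. The name `lemma9_witness` is hypothetical.

```python
import numpy as np

def sign(M):
    """Sign convention assumed here: +1 for positive entries, -1 otherwise."""
    return np.where(M > 0, 1, -1)

def lemma9_witness(B):
    """Build X such that sign(B X - J) is nonsingular (J = all-ones matrix).

    Assumes B has non-negative integer entries and pairwise distinct rows.
    """
    B = np.asarray(B, dtype=np.int64)
    p, q = B.shape
    m = int(B.max()) + 1                      # strict upper bound on the entries
    z = np.array([m ** k for k in range(q)], dtype=np.int64)
    b = B @ z                                 # base-m encoding: distinct rows give distinct numbers
    bs = np.sort(b)[::-1].astype(float)       # bs[0] > bs[1] > ... > bs[p-1] >= 0
    x = np.empty(p)
    x[0] = 1.0 / (bs[0] + 1.0)                # bs[i] * x[0] < 1 for every i
    for j in range(1, p):
        # 1 / bs[j-1] < x[j] < 1 / bs[j], so bs[i] * x[j] > 1 exactly when i < j.
        x[j] = 2.0 / (bs[j - 1] + bs[j])
    return np.outer(z, x)                     # X = z x^T, hence B X = b x^T

# Assumed toy input with pairwise distinct rows.
B = np.array([[0, 1], [1, 0], [1, 1], [2, 0]])
X = lemma9_witness(B)
S = sign(B @ X - 1)                           # subtracting 1 equals subtracting J
print(S)
print(np.linalg.matrix_rank(S) == B.shape[0])
```

After sorting the rows by decreasing encoded value, the sign pattern is +1 strictly above the diagonal and -1 elsewhere. Subtracting consecutive rows of that pattern leaves a single nonzero entry each time, which is one way to see that it has full rank.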
Let us call a matrix row-independent modulo equality if the set of all rows appearing in the matrix is linearly independent.
Example 10.
The matrix
is row-independent modulo equality.
Note that the all-$1$ matrix is row-independent modulo equality in all dimensions.
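The definition is easy to test mechanically: drop repeated rows, then check that the remaining rows are linearly independent. The following is a minimal sketch of such a check. The function name and the example matrices are illustrative assumptions, and the rank is computed numerically, so it is only reliable for well-conditioned inputs.

```python
import numpy as np

def is_row_independent_modulo_equality(M):
    """True if the distinct rows of M are linearly independent."""
    M = np.atleast_2d(np.asarray(M, dtype=float))
    distinct = np.unique(M, axis=0)           # drop repeated rows
    return int(np.linalg.matrix_rank(distinct)) == distinct.shape[0]

# Two distinct rows, one of them repeated: independent modulo equality.
print(is_row_independent_modulo_equality([[1, 0, 0], [0, 1, 0], [1, 0, 0]]))
# All-ones matrix: a single distinct nonzero row.
print(is_row_independent_modulo_equality(np.ones((3, 2))))
# Three distinct rows in two dimensions cannot be independent.
print(is_row_independent_modulo_equality([[1, 0], [0, 1], [1, 1]]))
```

The three calls print `True`, `True`, and `False`, in that order.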
Lemma 11.
Let , and let be row-independent modulo equality. Then there is a such that the matrix is row-independent modulo equality and
Proof.
Let be the color classes of (that is, for all it holds that ). Let be the matrix with rows for all . Then the rows of are linearly independent, and thus there is a matrix such that is the identity matrix. It follows that is the matrix with entries
(11) 
Let be the matrix with entries . Note that
(12) 
because for all and we have
where the second equality follows from Equation (11). By the definition of as the WL operator on uncolored graphs, we have
(13) 
if we view as a coloring of .
Let be the color classes of , and let be the matrix with rows for all and . Then for all , and the rows of are pairwise distinct. By Lemma 9, there is a matrix such that the matrix is nonsingular. This implies that the matrix is row-independent modulo equality. Moreover, by (13). We let be the matrix of obtained from by adding all-$0$ columns. Then
is row-independent modulo equality and . ∎
Corollary 12.
There is a sequence with such that for all ,
where is given by the fold application of on the initial uniform coloring .
Colored Graphs
We now extend the computation to colored graphs. In order to do that, we again use an equivalent but slightly different variant of the Weisfeiler-Leman update rule leading to colorings instead of the usual . We then start by showing that both update rules are equivalent.
We define to be the refinement operator for the $1$-WL, mapping a coloring to the updated one as follows:
Note that for we use the initial color of a node whereas used the color from the previous round. The idea of using those old colors is to make sure that any two nodes which got a different color in iteration , get different colors in iteration . This is formalized by the following lemma.
Lemma 14.
Let be a colored graph, , and such that