Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks

10/04/2018 ∙ by Christopher Morris, et al. ∙ RWTH Aachen University, TU Dortmund, McGill University

In recent years, graph neural networks (GNNs) have emerged as a powerful neural architecture to learn vector representations of nodes and graphs in a supervised, end-to-end fashion. Up to now, GNNs have only been evaluated empirically, showing promising results. The following work investigates GNNs from a theoretical point of view and relates them to the 1-dimensional Weisfeiler-Leman graph isomorphism heuristic (1-WL). We show that GNNs have the same expressiveness as the 1-WL in terms of distinguishing non-isomorphic (sub-)graphs. Hence, both algorithms also have the same shortcomings. Based on this, we propose a generalization of GNNs, so-called k-dimensional GNNs (k-GNNs), which can take higher-order graph structures at multiple scales into account. These higher-order structures play an essential role in the characterization of social networks and molecule graphs. Our experimental evaluation confirms our theoretical findings and shows that higher-order information is useful for graph classification and regression.





Introduction

Graph-structured data is ubiquitous across application domains ranging from chemo- and bioinformatics to image and social network analysis. To develop successful machine learning models in these domains, we need techniques that can exploit the rich information inherent in graph structure, as well as the feature information contained within a graph's nodes and edges. In recent years, numerous approaches have been proposed for machine learning with graphs, most notably approaches based on graph kernels [Vishwanathan et al.2010] or, alternatively, graph neural network algorithms [Hamilton, Ying, and Leskovec2017b].

Kernel approaches typically fix a set of features in advance, e.g., indicator features over subgraph structures or features of local node neighborhoods. For example, one of the most successful kernel approaches, the Weisfeiler-Lehman subtree kernel [Shervashidze et al.2011], which is based on the 1-dimensional Weisfeiler-Leman graph isomorphism heuristic [Grohe2017, pp. 79 ff.], generates node features through an iterative relabeling, or coloring, scheme: First, all nodes are assigned a common initial color; the algorithm then iteratively recolors a node by aggregating over the multiset of colors in its neighborhood, and the final feature representation of a graph is the histogram of the resulting node colors. By iteratively aggregating over local node neighborhoods in this way, the WL subtree kernel is able to effectively summarize the neighborhood substructures present in a graph. However, while powerful, the WL subtree kernel, like other kernel methods, is limited because this feature construction scheme is fixed (i.e., it does not adapt to the given data distribution). Moreover, this approach, like the majority of kernel methods, focuses only on the graph structure and cannot interpret continuous node and edge labels, such as real-valued vectors, which play an important role in applications such as bio- and chemoinformatics.

Graph neural networks (GNNs) have emerged as a machine learning framework addressing the above challenges. Standard GNNs can be viewed as a neural version of the 1-WL algorithm, where colors are replaced by continuous feature vectors and neural networks are used to aggregate over node neighborhoods [Hamilton, Ying, and Leskovec2017a, Kipf and Welling2017]. In effect, the GNN framework can be viewed as implementing a continuous form of graph-based "message passing", where local neighborhood information is aggregated and passed on to the neighbors [Gilmer et al.2017]. By deploying a trainable neural network to aggregate information in local node neighborhoods, GNNs can be trained in an end-to-end fashion together with the parameters of the classification or regression algorithm, possibly allowing for greater adaptability and better generalization compared to the kernel counterpart of the classical 1-WL algorithm.

Up to now, the evaluation and analysis of GNNs have been largely empirical, showing promising results compared to kernel approaches, see, e.g., [Ying et al.2018b]. However, it remains unclear how GNNs actually encode graph structure information into their vector representations, and whether there are theoretical advantages of GNNs compared to kernel-based approaches.

Present Work. We offer a theoretical exploration of the relationship between GNNs and kernels that are based on the 1-WL algorithm. We show that GNNs cannot be more powerful than the 1-WL in terms of distinguishing non-isomorphic (sub-)graphs, e.g., the properties of subgraphs around each node. This result holds for a broad class of GNN architectures and all possible choices of parameters for them. On the positive side, we show that given the right parameter initialization GNNs have the same expressiveness as the 1-WL algorithm, completing the equivalence. Since the power of the 1-WL has been completely characterized, see, e.g., [Arvind et al.2015, Kiefer, Schweitzer, and Selman2015], we can transfer these results to the case of GNNs, showing that both approaches have the same shortcomings.

Going further, we leverage these theoretical relationships to propose a generalization of GNNs, called k-GNNs. These are neural architectures based on the k-dimensional WL algorithm (k-WL) and are strictly more powerful than GNNs. The key insight in these higher-dimensional variants is that they perform message passing directly between subgraph structures, rather than individual nodes. This higher-order form of message passing can capture structural information that is not visible at the node level.

Graph kernels based on the k-WL have been proposed in the past [Morris, Kersting, and Mutzel2017]. However, a key advantage of implementing higher-order message passing in GNNs, which we demonstrate here, is that we can design hierarchical variants of k-GNNs, which combine graph representations learned at different granularities in an end-to-end trainable framework. Concretely, in the presented hierarchical approach the initial messages in a k-GNN are based on the output of a lower-dimensional k'-GNN (with k' < k), which allows the model to effectively capture graph structures of varying granularity. Many real-world graphs exhibit a hierarchical structure (e.g., in a social network we must model both the ego-networks around individual nodes and the coarse-grained relationships between entire communities, see, e.g., [Newman2003]), and our experimental results demonstrate that these hierarchical k-GNNs are able to consistently outperform traditional GNNs on a variety of graph classification and regression tasks. Across twelve graph regression tasks from the QM9 benchmark, we find that our hierarchical model reduces the mean absolute error by 54.45% on average. For graph classification, we find that our hierarchical models lead to slight performance gains.

Key Contributions. Our key contributions are summarized as follows:

  1. We show that GNNs are not more powerful than the 1-WL in terms of distinguishing non-isomorphic (sub-)graphs. Moreover, we show that, assuming a suitable parameter initialization, GNNs have the same power as the 1-WL.

  2. We propose k-GNNs, which are strictly more powerful than GNNs. Moreover, we propose a hierarchical version of k-GNNs, so-called 1-k-GNNs, which are able to work with the fine- and coarse-grained structures of a given graph, and relationships between those.

  3. Our theoretical findings are backed up by an experimental study, showing that higher-order graph properties are important for successful graph classification and regression.

Related Work

Our study builds upon a wealth of work at the intersection of supervised learning on graphs, kernel methods, and graph neural networks.

Historically, kernel methods, which implicitly or explicitly map graphs to elements of a Hilbert space, have been the dominant approach for supervised learning on graphs. Important early work in this area includes random-walk based kernels [Gärtner, Flach, and Wrobel2003, Kashima, Tsuda, and Inokuchi2003] and kernels based on shortest paths [Borgwardt and Kriegel2005]. More recently, developments in graph kernels have emphasized scalability, focusing on techniques that bypass expensive Gram matrix computations by using explicit feature maps. Prominent examples of this trend include kernels based on graphlet counting [Shervashidze et al.2009], and, most notably, the Weisfeiler-Lehman subtree kernel [Shervashidze et al.2011] as well as its higher-order variants [Morris, Kersting, and Mutzel2017]. Graphlet and Weisfeiler-Leman kernels have been successfully employed within frameworks for smoothed and deep graph kernels [Yanardag and Vishwanathan2015a, Yanardag and Vishwanathan2015b]. Recent works focus on assignment-based approaches [Kriege, Giscard, and Wilson2016, Nikolentzos, Meladianos, and Vazirgiannis2017, Johansson and Dubhashi2015], spectral approaches [Kondor and Pan2016], and graph decomposition approaches [Nikolentzos et al.2018]. Graph kernels were dominant in graph classification for several years, leading to new state-of-the-art results on many classification tasks. However, they are limited by the fact that they cannot effectively adapt their feature representations to a given data distribution, since they generally rely on a fixed set of features. More recently, a number of approaches to graph classification based upon neural networks have been proposed. Most of the neural approaches fit into the graph neural network framework proposed by [Gilmer et al.2017]. Notable instances of this model include Neural Fingerprints [Duvenaud et al.2015], Gated Graph Neural Networks [Li et al.2016], GraphSAGE [Hamilton, Ying, and Leskovec2017a], SplineCNN [Fey et al.2018], and the spectral approaches proposed in [Bruna et al.2014, Defferrard, X., and Vandergheynst2016, Kipf and Welling2017], all of which descend from early work in [Merkwirth and Lengauer2005] and [Scarselli et al.2009b]. Recent extensions and improvements to the GNN framework include approaches to incorporate different local structures around subgraphs [Xu et al.2018] and novel techniques for pooling node representations in order to perform graph classification [Zhang et al.2018, Ying et al.2018b]. GNNs have achieved state-of-the-art performance on several graph classification benchmarks in recent years, see, e.g., [Ying et al.2018b], as well as in applications such as protein-protein interaction prediction [Fout et al.2017], recommender systems [Ying et al.2018a], and the analysis of quantum interactions in molecules [Schütt et al.2017]. A survey of recent advancements in GNN techniques can be found in [Hamilton, Ying, and Leskovec2017b].

Up to this point (and despite their empirical success) there has been very little theoretical work on GNNs, with the notable exceptions of Li et al.'s [Li, Han, and Wu2018] work connecting GNNs to a special form of Laplacian smoothing and Lei et al.'s [Lei et al.2017] work showing that the feature maps generated by GNNs lie in the same Hilbert space as some popular graph kernels. Moreover, Scarselli et al. [Scarselli et al.2009a] investigate the approximation capabilities of GNNs.


We start by fixing notation, and then outline the Weisfeiler-Leman algorithm and the standard graph neural network framework.

Notation and Background

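A graph $G$ is a pair $(V, E)$ with a finite set of nodes $V$ and a set of edges $E \subseteq \{ \{u, v\} \subseteq V \mid u \neq v \}$. We denote the set of nodes and the set of edges of $G$ by $V(G)$ and $E(G)$, respectively. For ease of notation we denote the edge $\{u, v\}$ in $E(G)$ by $(u, v)$ or $(v, u)$. Moreover, $N(v)$ denotes the neighborhood of $v$ in $V(G)$, i.e., $N(v) = \{ u \in V(G) \mid (v, u) \in E(G) \}$. We say that two graphs $G$ and $H$ are isomorphic if there exists an edge-preserving bijection $\varphi \colon V(G) \to V(H)$, i.e., $(u, v)$ is in $E(G)$ if and only if $(\varphi(u), \varphi(v))$ is in $E(H)$. We write $G \simeq H$ and call the equivalence classes induced by $\simeq$ isomorphism types. Let $S \subseteq V(G)$; then $G[S] = (S, E_S)$ is the subgraph induced by $S$ with $E_S = \{ (u, v) \in E(G) \mid u, v \in S \}$. A node coloring is a function $V(G) \to \Sigma$ with arbitrary codomain $\Sigma$. Then a node colored or labeled graph $(G, l)$ is a graph $G$ endowed with a node coloring $l \colon V(G) \to \Sigma$. We say that $l(v)$ is a label or color of $v \in V(G)$. We say that a node coloring $c$ refines a node coloring $d$, written $c \sqsubseteq d$, if $c(v) = c(w)$ implies $d(v) = d(w)$ for every $v, w$ in $V(G)$. Two colorings are equivalent if $c \sqsubseteq d$ and $d \sqsubseteq c$, and we write $c \equiv d$. A color class $Q \subseteq V(G)$ of a node coloring $c$ is a maximal set of nodes with $c(v) = c(w)$ for every $v, w$ in $Q$. Moreover, let $[1:n] = \{1, \dots, n\} \subset \mathbb{N}$ for $n > 1$, let $S$ be a set; then the set of $k$-sets is $[S]^k = \{ U \subseteq S \mid |U| = k \}$ for $k \geq 2$, which is the set of all subsets with cardinality $k$, and let $\{\!\!\{ \dots \}\!\!\}$ denote a multiset.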
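The refinement relation between colorings is used throughout the theoretical results, so a small executable check may help. The following is a minimal sketch, assuming colorings are stored as Python dictionaries from nodes to hashable colors; the function names `refines` and `equivalent` are illustrative and not taken from the paper.

```python
def refines(c, d):
    """Return True if coloring c refines coloring d.

    c refines d when c[v] == c[w] implies d[v] == d[w] for all nodes v, w.
    Both colorings are assumed to be dicts over the same node set.
    """
    first_seen = {}  # color under c -> color under d of the first node seen
    for v in c:
        if c[v] in first_seen:
            if first_seen[c[v]] != d[v]:
                return False
        else:
            first_seen[c[v]] = d[v]
    return True


def equivalent(c, d):
    """Two colorings are equivalent if each refines the other."""
    return refines(c, d) and refines(d, c)


fine = {"a": 0, "b": 1, "c": 2}
coarse = {"a": "x", "b": "x", "c": "y"}
print(refines(fine, coarse))   # True: every color class of `fine` lies in one class of `coarse`
print(refines(coarse, fine))   # False: "a" and "b" share a color in `coarse` but not in `fine`
```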

Weisfeiler-Leman Algorithm

We now describe the 1-WL algorithm for labeled graphs. Let $(G, l)$ be a labeled graph. In each iteration $t \geq 0$, the 1-WL computes a node coloring $c_l^{(t)} \colon V(G) \to \Sigma$, which depends on the coloring from the previous iteration. In iteration $0$, we set $c_l^{(0)} = l$. Now in iteration $t > 0$, we set

$$ c_l^{(t)}(v) = \mathrm{HASH}\Big( \big( c_l^{(t-1)}(v), \{\!\!\{ c_l^{(t-1)}(u) \mid u \in N(v) \}\!\!\} \big) \Big) \tag{1} $$

where HASH bijectively maps the above pair to a unique value in $\Sigma$, which has not been used in previous iterations. To test two graphs $G$ and $H$ for isomorphism, we run the above algorithm in "parallel" on both graphs. Now if the two graphs have a different number of nodes colored $\sigma$ in $\Sigma$, the 1-WL concludes that the graphs are not isomorphic. Moreover, if the number of colors between two iterations does not change, i.e., the cardinalities of the images of $c_l^{(t-1)}$ and $c_l^{(t)}$ are equal, the algorithm terminates. Termination is guaranteed after at most $\max\{|V(G)|, |V(H)|\}$ iterations. It is easy to see that the algorithm is not able to distinguish all non-isomorphic graphs, e.g., see [Cai, Fürer, and Immerman1992]. Nonetheless, it is a powerful heuristic, which can successfully test isomorphism for a broad class of graphs [Babai and Kucera1979].
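The iterative recoloring and the resulting isomorphism test can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: graphs are adjacency dictionaries, the injective hash is modelled by a dictionary (`palette`) that hands out a fresh integer per signature, and running the two graphs "in parallel" is modelled by refining their disjoint union. The names `wl_1` and `wl_1_test` are hypothetical.

```python
from collections import Counter


def wl_1(adj, labels):
    """Run 1-WL color refinement and return a stable coloring (node -> int)."""
    palette = {}  # signature -> color id; plays the role of the injective hash
    colors = {v: palette.setdefault(("init", labels[v]), len(palette)) for v in adj}
    for _ in range(len(adj)):
        new_colors = {}
        for v in adj:
            # Signature: own color plus the sorted multiset of neighbor colors.
            sig = (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            new_colors[v] = palette.setdefault(sig, len(palette))
        if len(set(new_colors.values())) == len(set(colors.values())):
            break  # number of colors unchanged: the partition is stable
        colors = new_colors
    return colors


def wl_1_test(adj_g, labels_g, adj_h, labels_h):
    """Return False if 1-WL distinguishes the two graphs, True if it cannot."""
    # Refine the disjoint union so that both graphs share one palette.
    adj = {("g", v): [("g", u) for u in nb] for v, nb in adj_g.items()}
    adj.update({("h", v): [("h", u) for u in nb] for v, nb in adj_h.items()})
    labels = {("g", v): lab for v, lab in labels_g.items()}
    labels.update({("h", v): lab for v, lab in labels_h.items()})
    colors = wl_1(adj, labels)
    hist_g = Counter(c for (side, _), c in colors.items() if side == "g")
    hist_h = Counter(c for (side, _), c in colors.items() if side == "h")
    return hist_g == hist_h


path3 = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
c6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
ones3 = {v: 1 for v in range(3)}
ones6 = {v: 1 for v in range(6)}

print(wl_1_test(path3, ones3, triangle, ones3))    # False: degrees already differ
print(wl_1_test(c6, ones6, two_triangles, ones6))  # True: both graphs are 2-regular
```

The second call illustrates the limitation mentioned above: a 6-cycle and two disjoint triangles are not isomorphic, yet the heuristic assigns every node the same color in both graphs.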

The k-dimensional Weisfeiler-Leman algorithm (k-WL), for $k \geq 2$, is a generalization of the 1-WL which colors tuples from $V(G)^k$ instead of nodes. That is, the algorithm computes a coloring $c_{k,l}^{(t)} \colon V(G)^k \to \Sigma$. In order to describe the algorithm, we define the $j$-th neighborhood

$$ N_j(s) = \{ (s_1, \dots, s_{j-1}, r, s_{j+1}, \dots, s_k) \mid r \in V(G) \} \tag{2} $$

of a $k$-tuple $s = (s_1, \dots, s_k)$ in $V(G)^k$. That is, the $j$-th neighborhood $N_j(s)$ of $s$ is obtained by replacing the $j$-th component of $s$ by every node from $V(G)$. In iteration $0$, the algorithm labels each $k$-tuple with its atomic type, i.e., two $k$-tuples $s$ and $s'$ in $V(G)^k$ get the same color if the map $s_i \mapsto s'_i$ induces a (labeled) isomorphism between the subgraphs induced from the nodes from $s$ and $s'$, respectively. For iteration $t > 0$, we define

$$ C_j^{(t)}(s) = \mathrm{HASH}\Big( \{\!\!\{ c_{k,l}^{(t-1)}(s') \mid s' \in N_j(s) \}\!\!\} \Big) \tag{3} $$

and set

$$ c_{k,l}^{(t)}(s) = \mathrm{HASH}\Big( \big( c_{k,l}^{(t-1)}(s), \big( C_1^{(t)}(s), \dots, C_k^{(t)}(s) \big) \big) \Big). \tag{4} $$

Hence, two tuples $s$ and $s'$ with $c_{k,l}^{(t-1)}(s) = c_{k,l}^{(t-1)}(s')$ get different colors in iteration $t$ if there exists $j$ in $[1:k]$ such that the number of $j$-neighbors of $s$ and $s'$, respectively, colored with a certain color is different. The algorithm then proceeds analogously to the 1-WL. By increasing $k$, the algorithm gets more powerful in terms of distinguishing non-isomorphic graphs, i.e., for each $k \geq 2$, there are non-isomorphic graphs which can be distinguished by the ($k+1$)-WL but not by the k-WL [Cai, Fürer, and Immerman1992]. We note here that the above variant is not equal to the folklore variant of k-WL described in [Cai, Fürer, and Immerman1992], which differs slightly in its update rule. However, it holds that the k-WL using Equation 4 is as powerful as the folklore ($k-1$)-WL [Grohe and Otto2015].
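To make the tuple-based update concrete, the following Python sketch runs a fixed number of k-WL rounds on small graphs and compares the resulting color histograms. It is an illustration under stated assumptions rather than a reference implementation: graphs are dictionaries from nodes to neighbor sets, the hash is modelled by a shared dictionary (`palette`), the multiset over each $j$-th neighborhood is used directly in place of its hash, and the names `atomic_type` and `k_wl_histogram` are hypothetical.

```python
from collections import Counter
from itertools import product


def atomic_type(adj, labels, s):
    """Initial color of a k-tuple: equality pattern, edge pattern, node labels."""
    k = len(s)
    eq = tuple(s[i] == s[j] for i in range(k) for j in range(k))
    edges = tuple(s[j] in adj[s[i]] for i in range(k) for j in range(k))
    return (eq, edges, tuple(labels[v] for v in s))


def k_wl_histogram(adj, labels, k, rounds, palette):
    """Run `rounds` iterations of k-WL on k-tuples and return the color histogram.

    `palette` maps signatures to color ids; sharing it between graphs keeps
    their colors comparable.
    """
    nodes = sorted(adj)
    tuples = list(product(nodes, repeat=k))
    colors = {s: palette.setdefault(("init", atomic_type(adj, labels, s)), len(palette))
              for s in tuples}
    for _ in range(rounds):
        new_colors = {}
        for s in tuples:
            # For each position j: multiset of colors over the j-th neighborhood.
            c_js = tuple(
                tuple(sorted(colors[s[:j] + (r,) + s[j + 1:]] for r in nodes))
                for j in range(k)
            )
            new_colors[s] = palette.setdefault((colors[s], c_js), len(palette))
        colors = new_colors
    return Counter(colors.values())


c6 = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
ones = {v: 1 for v in range(6)}

pal2, pal3 = {}, {}
same_2 = (k_wl_histogram(c6, ones, 2, 2, pal2)
          == k_wl_histogram(two_triangles, ones, 2, 2, pal2))
same_3 = (k_wl_histogram(c6, ones, 3, 2, pal3)
          == k_wl_histogram(two_triangles, ones, 3, 2, pal3))
print(same_2, same_3)  # True False
```

On this pair of graphs the 2-dimensional variant produces identical histograms, while the 3-dimensional variant separates them because only one of the graphs contains a triple of mutually adjacent nodes.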

WL Kernels. After running the WL algorithm, the concatenation of the histogram of colors in each iteration can be used as a feature vector in a kernel computation. Specifically, in the histogram for every color $\sigma$ in $\Sigma$ there is an entry containing the number of nodes or $k$-tuples that are colored with $\sigma$.
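As a rough illustration of this feature construction for the node-based (1-WL) case, the sketch below collects color counts over all iterations and takes their inner product as the kernel value. It assumes adjacency dictionaries and a shared `palette` dictionary standing in for the hash; because colors from different iterations never coincide, one combined counter is equivalent to concatenating the per-iteration histograms. The names `wl_features` and `wl_kernel` are illustrative.

```python
from collections import Counter


def wl_features(adj, labels, iterations, palette):
    """Color counts over iterations 0..iterations of 1-WL (a sparse feature vector)."""
    colors = {v: palette.setdefault(("init", labels[v]), len(palette)) for v in adj}
    features = Counter(colors.values())
    for _ in range(iterations):
        colors = {
            v: palette.setdefault(
                (colors[v], tuple(sorted(colors[u] for u in adj[v]))), len(palette)
            )
            for v in adj
        }
        features.update(colors.values())
    return features


def wl_kernel(feat_g, feat_h):
    """Inner product of two sparse color-count vectors."""
    return sum(count * feat_h[color] for color, count in feat_g.items())


palette = {}  # shared so that equal signatures get equal colors in both graphs
path3 = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
ones = {0: 1, 1: 1, 2: 1}

f_path = wl_features(path3, ones, 1, palette)
f_tri = wl_features(triangle, ones, 1, palette)
print(wl_kernel(f_path, f_tri))  # 12
```

Here the two graphs share the initial color (3 x 3 = 9) and the color of degree-2 nodes after one iteration (1 x 3 = 3), which gives the value 12.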

Graph Neural Networks

Let $(G, l)$ be a labeled graph with an initial node coloring $f^{(0)} \colon V(G) \to \mathbb{R}^{1 \times d}$ that is consistent with $l$. This means that each node $v$ is annotated with a feature $f^{(0)}(v)$ in $\mathbb{R}^{1 \times d}$ such that $f^{(0)}(u) = f^{(0)}(v)$ if and only if $l(u) = l(v)$. Alternatively, $f^{(0)}(v)$ can be an arbitrary real-valued feature vector associated with $v$. Examples include continuous atomic properties in chemoinformatic applications where nodes correspond to atoms, or vector representations of text in social network applications. A GNN model consists of a stack of neural network layers, where each layer aggregates local neighborhood information, i.e., features of neighbors, around each node and then passes this aggregated information on to the next layer.

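A basic GNN model can be implemented as follows [Hamilton, Ying, and Leskovec2017b]. In each layer $t > 0$, we compute a new feature

$$ f^{(t)}(v) = \sigma\Big( f^{(t-1)}(v) \cdot W_1^{(t)} + \sum_{w \in N(v)} f^{(t-1)}(w) \cdot W_2^{(t)} \Big) \tag{5} $$

in $\mathbb{R}^{1 \times e}$ for $v$, where $W_1^{(t)}$ and $W_2^{(t)}$ are parameter matrices from $\mathbb{R}^{d \times e}$, and $\sigma$ denotes a component-wise non-linear function, e.g., a sigmoid or a ReLU.

Footnote 1: For clarity of presentation we omit biases.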
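The layer update of Equation 5 can be written out directly. The following is a minimal pure-Python sketch, assuming features are lists of floats, weight matrices are lists of rows, and the non-linearity is a ReLU; the helper names `matvec` and `gnn_layer` are illustrative, and biases are omitted as in the text.

```python
def matvec(x, W):
    """Multiply row vector x (length d) by matrix W (d x e, given as a list of rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def gnn_layer(adj, feats, W1, W2):
    """One basic GNN layer: ReLU(f(v) W1 + sum over neighbors w of f(w) W2)."""
    new_feats = {}
    for v in adj:
        # Sum neighbor features first; by linearity this equals summing f(w) W2.
        agg = [0.0] * len(feats[v])
        for w in adj[v]:
            agg = [a + b for a, b in zip(agg, feats[w])]
        pre = [a + b for a, b in zip(matvec(feats[v], W1), matvec(agg, W2))]
        new_feats[v] = [max(0.0, p) for p in pre]
    return new_feats


# Path 0 - 1 - 2 with a one-dimensional constant feature and 1 x 1 weights.
path3 = {0: [1], 1: [0, 2], 2: [1]}
feats = {v: [1.0] for v in path3}
out = gnn_layer(path3, feats, [[1.0]], [[1.0]])
print(out)  # {0: [2.0], 1: [3.0], 2: [2.0]}
```

With identity weights the output is one plus the node degree, so a single layer already separates the end nodes from the middle node, mirroring one round of 1-WL on this graph.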

Following [Gilmer et al.2017], one may also replace the sum defined over the neighborhood in the above equation by a permutation-invariant, differentiable function, and one may substitute the outer sum, e.g., by a column-wise vector concatenation or LSTM-style update step. Thus, in full generality a new feature $f^{(t)}(v)$ is computed as

$$ f^{(t)}(v) = f_{\text{merge}}^{W_1}\Big( f^{(t-1)}(v), f_{\text{aggr}}^{W_2}\big( \{\!\!\{ f^{(t-1)}(w) \mid w \in N(v) \}\!\!\} \big) \Big) \tag{6} $$

where $f_{\text{aggr}}^{W_2}$ aggregates over the set of neighborhood features and $f_{\text{merge}}^{W_1}$ merges the node's representations from step $(t-1)$ with the computed neighborhood features. Both $f_{\text{merge}}^{W_1}$ and $f_{\text{aggr}}^{W_2}$ may be arbitrary differentiable, permutation-invariant functions (e.g., neural networks), and, by analogy to Equation 5, we denote their parameters as $W_1$ and $W_2$, respectively. In the rest of this paper, we refer to neural architectures implementing Equation 6 as 1-dimensional GNN architectures (1-GNNs).

A vector representation $f_{\text{GNN}}$ over the whole graph can be computed by summing over the vector representations computed for all nodes, i.e.,

$$ f_{\text{GNN}}(G) = \sum_{v \in V(G)} f^{(T)}(v), $$

where $T > 0$ denotes the last layer. More refined approaches use differentiable pooling operators based on sorting [Zhang et al.2018] and soft assignments [Ying et al.2018b].

In order to adapt the parameters $W_1$ and $W_2$ of Equations 5 and 6 to a given data distribution, they are optimized in an end-to-end fashion (usually via stochastic gradient descent) together with the parameters of a neural network used for classification or regression.

Relationship Between 1-WL and 1-GNNs

In the following we explore the relationship between the 1-WL and 1-GNNs. Let $(G, l)$ be a labeled graph, and let $\mathbf{W}^{(t)} = \big( W_1^{(t')}, W_2^{(t')} \big)_{t' \leq t}$ denote the GNN parameters given by Equation 5 or Equation 6 up to iteration $t$. We encode the initial labels $l(v)$ by vectors $f^{(0)}(v) \in \mathbb{R}^{1 \times d}$, e.g., using a 1-hot encoding.

Our first theoretical result shows that the 1-GNN architectures do not have more power in terms of distinguishing between non-isomorphic (sub-)graphs than the 1-WL algorithm. More formally, let $f_{\text{merge}}^{W_1}$ and $f_{\text{aggr}}^{W_2}$ be any two functions chosen in (6). For every encoding of the labels $l(v)$ as vectors $f^{(0)}(v)$, and for every choice of $\mathbf{W}^{(t)}$, we have that the coloring $c_l^{(t)}$ of 1-WL always refines the coloring $f^{(t)}$ induced by a 1-GNN parameterized by $\mathbf{W}^{(t)}$.

Theorem 1.

Let $(G, l)$ be a labeled graph. Then for all $t \geq 0$ and for all choices of initial colorings $f^{(0)}$ consistent with $l$, and weights $\mathbf{W}^{(t)}$, we have $c_l^{(t)} \sqsubseteq f^{(t)}$.

Our second result states that there exists a sequence of parameter matrices $\mathbf{W}^{(t)}$ such that 1-GNNs have exactly the same power in terms of distinguishing non-isomorphic (sub-)graphs as the 1-WL algorithm. This even holds for the simple architecture (5), provided we choose the encoding of the initial labeling $l$ in such a way that different labels are encoded by linearly independent vectors.

Theorem 2.

Let $(G, l)$ be a labeled graph. Then for all $t \geq 0$ there exists a sequence of weights $\mathbf{W}^{(t)}$, and a 1-GNN architecture such that $c_l^{(t)} \equiv f^{(t)}$.

Hence, in the light of the above results, 1-GNNs may be viewed as an extension of the 1-WL which in principle has the same power but is more flexible in its ability to adapt to the learning task at hand and is able to handle continuous node features.

Shortcomings of Both Approaches

The power of 1-WL has been completely characterized, see, e.g., [Arvind et al.2015]. Hence, by using Theorems 1 and 2, this characterization is also applicable to 1-GNNs. On the other hand, 1-GNNs have the same shortcomings as the 1-WL. For example, both methods will give the same color to every node in a graph consisting of a triangle and a 4-cycle, although vertices from the triangle and the vertices from the 4-cycle are clearly different. Moreover, they are not capable of capturing simple graph theoretic properties, e.g., triangle counts, which are an important measure in social network analysis [Milo et al.2002, Newman2003].
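The triangle-plus-4-cycle example can be checked by hand, and the short Python sketch below does the same mechanically. It is an illustration under the assumption of unlabeled graphs given as adjacency dictionaries; the helper names `color_refinement` and `count_triangles` are hypothetical, and the 7-cycle is added here only as a second 2-regular graph on the same number of nodes for comparison.

```python
def color_refinement(adj):
    """1-WL on an unlabeled graph; returns a stable coloring as node -> int."""
    colors = {v: 0 for v in adj}  # uniform initial color
    for _ in range(len(adj)):
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        ids = {sig: i for i, sig in enumerate(sorted(set(sigs.values())))}
        if len(ids) == len(set(colors.values())):
            break  # no color class was split
        colors = {v: ids[sigs[v]] for v in adj}
    return colors


def count_triangles(adj):
    """Count triangles by enumerating ordered triples u < v < w."""
    return sum(1 for u in adj for v in adj[u] for w in adj[v]
               if u < v < w and u in adj[w])


# Disjoint union of a triangle (0, 1, 2) and a 4-cycle (3, 4, 5, 6).
tri_c4 = {0: [1, 2], 1: [0, 2], 2: [0, 1],
          3: [4, 6], 4: [3, 5], 5: [4, 6], 6: [5, 3]}
c7 = {i: [(i - 1) % 7, (i + 1) % 7] for i in range(7)}

print(len(set(color_refinement(tri_c4).values())))  # 1: all nodes share one color
print(count_triangles(tri_c4), count_triangles(c7))  # 1 0
```

Every node has exactly two neighbors of the same color, so refinement never splits the single color class, even though the two graphs differ in their triangle counts.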

k-dimensional Graph Neural Networks

(a) Hierarchical 1-2-3-GNN network architecture
(b) Pooling from 2- to 3-GNN.
Figure 1: Illustration of the proposed hierarchical variant of the k-GNN layer. For each subgraph $S$ on $k$ nodes a feature $f$ is learned, which is initialized with the learned features of all $(k-1)$-element subgraphs of $S$. Hence, a hierarchical representation of the input graph is learned.

In the following, we propose a generalization of 1-GNNs, so-called k-GNNs, which are based on the k-WL. Due to scalability and limited GPU memory, we consider a set-based version of the k-WL. For a given $k$, we consider all $k$-element subsets $[V(G)]^k$ over $V(G)$. Let $s = \{s_1, \dots, s_k\}$ be a $k$-set in $[V(G)]^k$; then we define the neighborhood of $s$ as

$$ N(s) = \{ t \in [V(G)]^k \mid |s \cap t| = k - 1 \}. $$

The local neighborhood $N_L(s)$ consists of all $t \in N(s)$ such that $(v, w) \in E(G)$ for the unique $v \in s \setminus t$ and the unique $w \in t \setminus s$. The global neighborhood $N_G(s)$ then is defined as $N(s) \setminus N_L(s)$. (Footnote 2: Note that the definition of the local neighborhood is different from the one defined in [Morris, Kersting, and Mutzel2017], which is a superset of our definition. Our computations therefore involve sparser graphs.)
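The local and global neighborhoods of a $k$-set can be enumerated with a few lines of Python. The sketch below assumes a graph given as a dictionary from nodes to neighbor sets and uses a quadratic scan over all pairs of $k$-sets, which is only meant for small examples; the function name `set_neighborhoods` is illustrative.

```python
from itertools import combinations


def set_neighborhoods(adj, k):
    """Return (local, global) neighborhoods for every k-set of nodes.

    t is a neighbor of s if they share exactly k - 1 nodes. It is local if
    the two nodes they differ in are adjacent, and global otherwise.
    """
    sets = [frozenset(c) for c in combinations(sorted(adj), k)]
    local = {s: [] for s in sets}
    glob = {s: [] for s in sets}
    for s in sets:
        for t in sets:
            if len(s & t) != k - 1:
                continue
            (v,) = s - t  # the unique node of s missing from t
            (w,) = t - s  # the unique node of t missing from s
            if w in adj[v]:
                local[s].append(t)
            else:
                glob[s].append(t)
    return local, glob


# Path 0 - 1 - 2 - 3 with k = 2.
path4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
local, glob = set_neighborhoods(path4, 2)
s = frozenset({0, 1})
print(sorted(sorted(t) for t in local[s]))  # [[0, 2]]
print(sorted(sorted(t) for t in glob[s]))   # [[0, 3], [1, 2], [1, 3]]
```

For the set {0, 1}, only {0, 2} is a local neighbor, because it is obtained by exchanging node 1 for its graph neighbor 2; the remaining three neighbors are global.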

The set-based k-WL works analogously to the k-WL, i.e., it computes a coloring $c_{s,k,l}^{(t)} \colon [V(G)]^k \to \Sigma$ as in Equation 1 based on the above neighborhood. Initially, $c_{s,k,l}^{(0)}$ colors each element $s$ in $[V(G)]^k$ with the isomorphism type of $G[s]$.

Let $(G, l)$ be a labeled graph. In each k-GNN layer $t \geq 0$, we compute a feature vector $f_k^{(t)}(s)$ for each $k$-set $s$ in $[V(G)]^k$. For $t = 0$, we set $f_k^{(0)}(s)$ to $f^{\text{iso}}(s)$, a one-hot encoding of the isomorphism type of $G[s]$ labeled by $l$. In each layer $t > 0$, we compute new features by

$$ f_k^{(t)}(s) = \sigma\Big( f_k^{(t-1)}(s) \cdot W_1^{(t)} + \sum_{u \in N_L(s) \cup N_G(s)} f_k^{(t-1)}(u) \cdot W_2^{(t)} \Big). $$

Moreover, one could split the sum into two sums ranging over $N_L(s)$ and $N_G(s)$, respectively, using distinct parameter matrices to enable the model to learn the importance of local and global neighborhoods. To scale k-GNNs to larger datasets and to prevent overfitting, we propose local k-GNNs, where we omit the global neighborhood of $s$, i.e.,

$$ f_{k,L}^{(t)}(s) = \sigma\Big( f_{k,L}^{(t-1)}(s) \cdot W_1^{(t)} + \sum_{u \in N_L(s)} f_{k,L}^{(t-1)}(u) \cdot W_2^{(t)} \Big). $$

The running time for evaluation of the above depends on $|V|$, $k$, and the sparsity of the graph (each iteration can be bounded by the number of subsets of size $k$ times the maximum degree). Note that we can scale our method to larger datasets by using sampling strategies introduced in, e.g., [Morris, Kersting, and Mutzel2017, Hamilton, Ying, and Leskovec2017a]. We can now lift the results of the previous section to the $k$-dimensional case.
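A local k-GNN layer can be sketched for the smallest interesting case, $k = 2$ on an unlabeled graph, where the isomorphism type of a 2-set is simply "edge" or "non-edge". The code below is an illustrative sketch under these assumptions, with pure-Python lists as features and a ReLU non-linearity; it is not the released implementation, and all function names are hypothetical.

```python
from itertools import combinations


def matvec(x, W):
    """Multiply row vector x (length d) by matrix W (d x e, given as a list of rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def two_sets_and_local_neighbors(adj):
    """All 2-sets of nodes and, for each, its local neighborhood."""
    sets = [frozenset(c) for c in combinations(sorted(adj), 2)]
    local = {s: [] for s in sets}
    for s in sets:
        for t in sets:
            if len(s & t) == 1:
                (v,) = s - t
                (w,) = t - s
                if w in adj[v]:
                    local[s].append(t)
    return sets, local


def iso_type_one_hot(adj, s):
    """One-hot isomorphism type of an unlabeled 2-set: [edge, non-edge]."""
    u, v = sorted(s)
    return [1.0, 0.0] if v in adj[u] else [0.0, 1.0]


def local_k_gnn_layer(local, feats, W1, W2):
    """ReLU(f(s) W1 + sum over local neighbors u of f(u) W2) for every k-set s."""
    new_feats = {}
    for s, neighbors in local.items():
        agg = [0.0] * len(feats[s])
        for t in neighbors:
            agg = [a + b for a, b in zip(agg, feats[t])]
        pre = [a + b for a, b in zip(matvec(feats[s], W1), matvec(agg, W2))]
        new_feats[s] = [max(0.0, p) for p in pre]
    return new_feats


# Path 0 - 1 - 2 with identity weights.
path3 = {0: {1}, 1: {0, 2}, 2: {1}}
sets, local = two_sets_and_local_neighbors(path3)
feats = {s: iso_type_one_hot(path3, s) for s in sets}
identity = [[1.0, 0.0], [0.0, 1.0]]
out = local_k_gnn_layer(local, feats, identity, identity)
print(out[frozenset({0, 2})])  # [2.0, 1.0]
```

The non-edge {0, 2} has two local neighbors, both of which are edges, so its updated feature records its own type in the second coordinate and the two neighboring edges in the first.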

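Proposition 3.

Let $(G, l)$ be a labeled graph and let $k \geq 2$. Then for all $t \geq 0$, for all choices of initial colorings $f_k^{(0)}$ consistent with $l$ and for all weights $\mathbf{W}^{(t)}$, we have $c_{s,k,l}^{(t)} \sqsubseteq f_k^{(t)}$.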

Again, the second result states that there exists a suitable initialization of the parameter matrices $\mathbf{W}^{(t)}$ such that k-GNNs have exactly the same power in terms of distinguishing non-isomorphic (sub-)graphs as the set-based k-WL.

Proposition 4.

Let $(G, l)$ be a labeled graph and let $k \geq 2$. Then for all $t \geq 0$ there exists a sequence of weights $\mathbf{W}^{(t)}$, and a k-GNN architecture such that $c_{s,k,l}^{(t)} \equiv f_k^{(t)}$.

Hierarchical Variant

One key benefit of the end-to-end trainable k-GNN framework, compared to the discrete k-WL algorithm, is that we can hierarchically combine representations learned at different granularities. Concretely, rather than simply using one-hot indicator vectors as initial feature inputs in a k-GNN, we propose a hierarchical variant of k-GNN that uses the features learned by a $(k-1)$-dimensional GNN, in addition to the (labeled) isomorphism type, as the initial features, i.e.,

$$ f_k^{(0)}(s) = \sigma\Big( \Big[ f^{\text{iso}}(s), \sum_{u \subset s} f_{k-1}^{(T_{k-1})}(u) \Big] \cdot W_{k-1} \Big), $$

for some $T_{k-1} > 0$, where $W_{k-1}$ is a matrix of appropriate size, and square brackets denote matrix concatenation.

Hence, the features are recursively learned from dimensions $1$ to $k$ in an end-to-end fashion. This hierarchical model also satisfies Propositions 3 and 4, so its representational capacity is theoretically equivalent to a standard k-GNN (in terms of its relationship to k-WL). Nonetheless, hierarchy is a natural inductive bias for graph modeling, since many real-world graphs incorporate hierarchical structure, so we expect this hierarchical formulation to offer empirical utility.
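The hierarchical initialization can be illustrated with a small Python sketch. It assumes that the features of the lower level are stored in a dictionary keyed by frozensets of size $k - 1$, that the sum ranges over the $(k-1)$-element subsets of $s$, and that the non-linearity is a ReLU; the names `matvec` and `hierarchical_init` and the example weights are illustrative only.

```python
from itertools import combinations


def matvec(x, W):
    """Multiply row vector x (length d) by matrix W (d x e, given as a list of rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def hierarchical_init(k_sets, iso_feats, lower_feats, W):
    """Initial k-set features: ReLU([iso(s), sum of lower-level features] W)."""
    init = {}
    for s in k_sets:
        subsets = [frozenset(c) for c in combinations(sorted(s), len(s) - 1)]
        # Pool (sum) the learned features of all (k-1)-element subsets of s.
        pooled = [sum(col) for col in zip(*(lower_feats[u] for u in subsets))]
        concat = iso_feats[s] + pooled  # list concatenation
        init[s] = [max(0.0, a) for a in matvec(concat, W)]
    return init


# k = 2: the lower level holds one learned feature per single node.
s = frozenset({0, 1})
iso_feats = {s: [1.0, 0.0]}  # one-hot isomorphism type
lower_feats = {frozenset({0}): [2.0], frozenset({1}): [3.0]}
W = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, -1.0]]  # shape (2 + 1) x 2

print(hierarchical_init([s], iso_feats, lower_feats, W)[s])  # [3.5, 0.0]
```

The concatenated input is [1, 0, 5], so the two pre-activations are 1 + 0.5 * 5 = 3.5 and -5, and the ReLU clips the second to zero.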

Method Dataset

Graphlet 72.9 59.4 40.8 58.3 72.1 87.7 54.7
Shortest-path 76.4 59.2 40.5 62.1 74.5 81.7 58.9
1-WL 73.8 72.5 51.5 62.9 83.1 78.3 61.3
2-WL 75.2 72.6 50.6 64.7 77.0 77.0 61.9
3-WL 74.7 73.5 49.7 61.5 83.1 83.2 62.5
WL-OA 75.3 73.1 50.4 62.7 86.1 84.5 63.6

DCNN 61.3 49.1 33.5 62.6 67.0 56.6
PatchySan 75.9 71.0 45.2 78.6 92.6 60.0
DGCNN 75.5 70.0 47.8 74.4 85.8 58.6
1-GNN No Tuning 70.7 69.4 47.3 59.0 58.6 82.7 51.2
1-GNN 72.2 71.2 47.7 59.3 74.3 82.2 59.0
1-2-3-GNN No Tuning 75.9 70.3 48.8 60.0 67.4 84.4 59.3
1-2-3-GNN 75.5 74.2 49.5 62.8 76.2 86.1 60.9
Table 1: Classification accuracies in percent on various graph benchmark datasets.
Target Method
DTNN [Wu et al.2018] MPNN [Wu et al.2018] 1-GNN 1-2-GNN 1-3-GNN 1-2-3-GNN Gain
0.358 0.493 0.493 0.476 4.0%
0.95 0.89 0.78 0.46 65.3%
0.00388 0.00541 0.00331 0.00328 0.00337
0.00512 0.00623 0.00355 0.00354 0.00351 1.4%
0.0112 0.0066 0.0049 0.0047 0.0048 6.1%
28.5 34.1 21.5 25.8 22.9 37.0%
ZPVE 0.00172 0.00216 0.00124 0.00064 0.00019 85.5%
2.43 2.05 2.32 0.6855 0.0427 98.5%
2.43 2.00 2.08 0.686 0.111 94.9%
2.43 2.02 2.23 0.070 0.794 98.1%
2.43 2.02 1.94 0.140 0.587 97.6%
0.27 0.42 0.27 0.0989 0.158 65.0%
Table 2: Mean absolute errors on the QM9 dataset. The far-right column shows the improvement of the best k-GNN model in comparison to the 1-GNN baseline.

Experimental Study

In the following, we want to investigate potential benefits of GNNs over graph kernels as well as the benefits of our proposed k-GNN architectures over 1-GNN architectures. More precisely, we address the following questions:

Q1: How do the (hierarchical) k-GNNs perform in comparison to state-of-the-art graph kernels?

Q2: How do the (hierarchical) k-GNNs perform in comparison to the 1-GNN in graph classification and regression tasks?

Q3: How much (if any) improvement is provided by optimizing the parameters of the GNN aggregation function, compared to just using random GNN parameters while optimizing the parameters of the downstream classification/regression algorithm?


To compare our k-GNN architectures to kernel approaches we use well-established benchmark datasets from the graph kernel literature [Kersting et al.2016]. The nodes of each graph in these datasets are annotated with (discrete) labels or no labels.

To demonstrate that our architectures scale to larger datasets and offer benefits on real-world applications, we conduct experiments on the QM9 dataset [Ramakrishnan et al.2014, Ruddigkeit et al.2012, Wu et al.2018], which consists of 133 385 small molecules. The aim here is to perform regression on twelve targets representing energetic, electronic, geometric, and thermodynamic properties, which were computed using density functional theory.


We use the following kernel and GNN methods as baselines for our experiments.

Kernel Baselines. We use the Graphlet kernel [Shervashidze et al.2009], the shortest-path kernel [Borgwardt and Kriegel2005], the Weisfeiler-Lehman subtree kernel (WL) [Shervashidze et al.2011], the Weisfeiler-Lehman Optimal Assignment kernel (WL-OA) [Kriege, Giscard, and Wilson2016], and the global-local k-WL [Morris, Kersting, and Mutzel2017] with $k$ in $\{2, 3\}$ as kernel baselines. For each kernel, we computed the normalized Gram matrix. We used the C-SVM implementation of LIBSVM [Chang and Lin2011] to compute the classification accuracies using 10-fold cross validation. The parameter C was selected from by 10-fold cross validation on the training folds.
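The text does not spell out how the Gram matrix is normalized. A common choice, assumed here purely for illustration, is cosine normalization, where each entry is divided by the geometric mean of the two corresponding diagonal entries so that every graph has unit self-similarity. The function name `normalize_gram` and the example values are hypothetical.

```python
import math


def normalize_gram(K):
    """Cosine-normalize a Gram matrix: K[i][j] / sqrt(K[i][i] * K[j][j]).

    Assumes all diagonal entries are strictly positive.
    """
    n = len(K)
    return [[K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)]
            for i in range(n)]


# Small symmetric example with positive diagonal.
K = [[14.0, 12.0],
     [12.0, 18.0]]
K_norm = normalize_gram(K)
print(K_norm[0][0], round(K_norm[0][1], 4))  # 1.0 0.7559
```

A matrix normalized in this way can then be passed to an SVM that accepts precomputed kernels.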

Neural Baselines. To compare GNNs to kernels we used the basic 1-GNN layer of Equation 5, DCNN [Wang et al.2018], PatchySan [Niepert, Ahmed, and Kutzkov2016], and DGCNN [Zhang et al.2018]. For the QM9 dataset we used a 1-GNN layer similar to [Gilmer et al.2017], where we replaced the inner sum of Equation 5 with a 2-layer MLP in order to incorporate edge features (bond type and distance information). Moreover, we compare against the numbers provided in [Wu et al.2018].

Model Configuration

We always used three layers for 1-GNN, and two layers for (local) 2-GNN and 3-GNN, all with a hidden-dimension size of . For the hierarchical variant we used architectures that use features computed by 1-GNN as initial features for the 2-GNN (1-2-GNN) and 3-GNN (1-3-GNN), respectively. Moreover, using the combination of the former we componentwise concatenated the computed features of the 1-2-GNN and the 1-3-GNN (1-2-3-GNN). For the final classification and regression steps, we used a three layer MLP, with binary cross entropy and mean squared error for the optimization, respectively. For classification we used a dropout layer with after the first layer of the MLP. We applied global average pooling to generate a vector representation of the graph from the computed node features for each $k$. The resulting vectors are concatenated column-wise before feeding them into the MLP. Moreover, we used the Adam optimizer with an initial learning rate of and applied an adaptive learning rate decay based on validation results to a minimum of . We trained the classification networks for epochs and the regression networks for epochs.

Experimental Protocol

For the smaller datasets, which we use for comparison against the kernel methods, we performed a 10-fold cross validation where we randomly sampled 10% of each training fold to act as a validation set. For the QM9 dataset, we follow the dataset splits described in [Wu et al.2018]. We randomly sampled 10% of the examples for validation, another 10% for testing, and used the remaining for training. We used the same initial node features as described in [Gilmer et al.2017]. Moreover, in order to illustrate the benefits of our hierarchical k-GNN architecture, we did not use a complete graph, where edges are annotated with pairwise distances, as input. Instead, we only used pairwise Euclidean distances for connected nodes, computed from the provided node coordinates. The code was built upon the work of [Fey et al.2018] and is provided at https://github.com/chrsmrrs/k-gnn.
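The cross-validation protocol for the smaller datasets can be sketched as an index-splitting routine. This is a hedged illustration using only the Python standard library; the function name `cv_splits`, the seed, and the rounding of the validation size are assumptions of this sketch and are not taken from the released code.

```python
import random


def cv_splits(n, n_folds=10, val_fraction=0.1, seed=0):
    """Yield (train, val, test) index lists for n_folds-fold cross validation.

    For each fold, val_fraction of the remaining training indices is sampled
    at random to act as a validation set.
    """
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        rest = [j for f, fold in enumerate(folds) if f != i for j in fold]
        rng.shuffle(rest)
        n_val = max(1, round(val_fraction * len(rest)))
        yield rest[n_val:], rest[:n_val], test


splits = list(cv_splits(100))
train, val, test = splits[0]
print(len(train), len(val), len(test))  # 81 9 10
```

Each example appears in exactly one test fold, and within a fold the training, validation, and test indices are disjoint.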

Results and Discussion

In the following we answer questions Q1 to Q3. Table 1 shows the results for comparison with the kernel methods on the graph classification benchmark datasets. Here, the hierarchical k-GNN is on par with the kernels despite the small dataset sizes (answering question Q1). We also find that the 1-2-3-GNN significantly outperforms the 1-GNN on all seven datasets (answering Q2), with the 1-GNN being the overall weakest method across all tasks. (Footnote 3: Note that in very recent work, GNNs have shown superior results over kernels when using advanced pooling techniques [Ying et al.2018b]. Note that our layers can be combined with these pooling layers. However, we opted to use standard global pooling in order to compare a typical GNN implementation with standard off-the-shelf kernels.) We can further see that optimizing the parameters of the aggregation function only leads to slight performance gains on two out of three datasets, and that no optimization even achieves better results on the Proteins benchmark dataset (answering Q3). We attribute this effect to the one-hot encoded node labels, which allow the GNN to gather enough information out of the neighborhood of a node, even when this aggregation is not learned.

Table 2 shows the results for the QM9 dataset. On eleven out of twelve targets all of our hierarchical variants beat the 1-GNN baseline, providing further evidence for Q2. For example, on one target we achieve a large improvement of 98.1% in MAE compared to the baseline. Moreover, on ten out of twelve targets, the hierarchical k-GNNs beat the baselines from [Wu et al.2018]. However, the additional structural information extracted by the k-GNN layers does not serve all tasks equally, leading to huge differences in gains across the targets.

It should be noted that our k-GNN models have more parameters than the 1-GNN model, since we stack two additional GNN layers for each $k$. However, extending the 1-GNN model by additional layers to match the number of parameters of the k-GNN did not lead to better results in any experiment.


Conclusion

We presented a theoretical investigation of GNNs, showing that a wide class of GNN architectures cannot be stronger than the 1-WL. On the positive side, we showed that, in principle, GNNs possess the same power in terms of distinguishing between non-isomorphic (sub-)graphs, while having the added benefit of adapting to the given data distribution. Based on this insight, we proposed k-GNNs, which are a generalization of GNNs based on the k-WL. This new model is strictly stronger than GNNs in terms of distinguishing non-isomorphic (sub-)graphs and is capable of distinguishing more graph properties. Moreover, we devised a hierarchical variant of k-GNNs, which can exploit the hierarchical organization of most real-world graphs. Our experimental study shows that k-GNNs consistently outperform 1-GNNs and beat state-of-the-art neural architectures on large-scale molecule learning tasks. Future work includes designing task-specific k-GNNs, e.g., devising k-GNN layers that exploit expert knowledge in bio- and chemoinformatic settings.


This work is supported by the German research council (DFG) within the Research Training Group 2236 UnRAVeL and the Collaborative Research Center SFB 876, Providing Information by Resource-Constrained Analysis, projects A6 and B2.


  • [Arvind et al.2015] Arvind, V.; Köbler, J.; Rattan, G.; and Verbitsky, O. 2015. On the power of color refinement. In Symposium on Fundamentals of Computation Theory, 339–350.
  • [Babai and Kucera1979] Babai, L., and Kucera, L. 1979. Canonical labelling of graphs in linear average time. In Symposium on Foundations of Computer Science, 39–46.
  • [Borgwardt and Kriegel2005] Borgwardt, K. M., and Kriegel, H.-P. 2005. Shortest-path kernels on graphs. In ICDM, 74–81.
  • [Bruna et al.2014] Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2014. Spectral networks and deep locally connected networks on graphs. In ICLR.
  • [Cai, Fürer, and Immerman1992] Cai, J.; Fürer, M.; and Immerman, N. 1992. An optimal lower bound on the number of variables for graph identifications. Combinatorica 12(4):389–410.
  • [Chang and Lin2011] Chang, C.-C., and Lin, C.-J. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27.
  • [Defferrard, X., and Vandergheynst2016] Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 3844–3852.
  • [Duvenaud et al.2015] Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. P. 2015. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2224–2232.
  • [Fey et al.2018] Fey, M.; Lenssen, J. E.; Weichert, F.; and Müller, H. 2018. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In CVPR.
  • [Fout et al.2017] Fout, A.; Byrd, J.; Shariat, B.; and Ben-Hur, A. 2017. Protein interface prediction using graph convolutional networks. In NIPS, 6533–6542.
  • [Gärtner, Flach, and Wrobel2003] Gärtner, T.; Flach, P.; and Wrobel, S. 2003. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines. 129–143.
  • [Gilmer et al.2017] Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. In ICML.
  • [Grohe and Otto2015] Grohe, M., and Otto, M. 2015. Pebble games and linear equations. Journal of Symbolic Logic 80(3):797–844.
  • [Grohe2017] Grohe, M. 2017. Descriptive Complexity, Canonisation, and Definable Graph Structure Theory. Lecture Notes in Logic. Cambridge University Press.
  • [Hamilton, Ying, and Leskovec2017a] Hamilton, W. L.; Ying, R.; and Leskovec, J. 2017a. Inductive representation learning on large graphs. In NIPS, 1025–1035.
  • [Hamilton, Ying, and Leskovec2017b] Hamilton, W. L.; Ying, R.; and Leskovec, J. 2017b. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin 40(3):52–74.
  • [Johansson and Dubhashi2015] Johansson, F. D., and Dubhashi, D. 2015. Learning with similarity functions on graphs using matchings of geometric embeddings. In KDD, 467–476.
  • [Kashima, Tsuda, and Inokuchi2003] Kashima, H.; Tsuda, K.; and Inokuchi, A. 2003. Marginalized kernels between labeled graphs. In ICML, 321–328.
  • [Kersting et al.2016] Kersting, K.; Kriege, N. M.; Morris, C.; Mutzel, P.; and Neumann, M. 2016. Benchmark data sets for graph kernels.
  • [Kiefer, Schweitzer, and Selman2015] Kiefer, S.; Schweitzer, P.; and Selman, E. 2015. Graphs identified by logics with counting. In MFCS, 319–330. Springer.
  • [Kipf and Welling2017] Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
  • [Kondor and Pan2016] Kondor, R., and Pan, H. 2016. The multiscale laplacian graph kernel. In NIPS, 2982–2990.
  • [Kriege, Giscard, and Wilson2016] Kriege, N. M.; Giscard, P.-L.; and Wilson, R. C. 2016. On valid optimal assignment kernels and applications to graph classification. In NIPS, 1615–1623.
  • [Lei et al.2017] Lei, T.; Jin, W.; Barzilay, R.; and Jaakkola, T. S. 2017. Deriving neural architectures from sequence and graph kernels. In ICML, 2024–2033.
  • [Li et al.2016] Li, W.; Saidi, H.; Sanchez, H.; Schäf, M.; and Schweitzer, P. 2016. Detecting similar programs via the Weisfeiler-Leman graph kernel. In International Conference on Software Reuse, 315–330.
  • [Li, Han, and Wu2018] Li, Q.; Han, Z.; and Wu, X.-M. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 3538–3545.
  • [Merkwirth and Lengauer2005] Merkwirth, C., and Lengauer, T. 2005. Automatic generation of complementary descriptors with molecular graph networks. Journal of Chemical Information and Modeling 45(5):1159–1168.
  • [Milo et al.2002] Milo, R.; Shen-Orr, S.; Itzkovitz, S.; Kashtan, N.; Chklovskii, D.; and Alon, U. 2002. Network motifs: simple building blocks of complex networks. Science 298(5594):824–827.
  • [Morris, Kersting, and Mutzel2017] Morris, C.; Kersting, K.; and Mutzel, P. 2017. Glocalized Weisfeiler-Lehman kernels: Global-local feature maps of graphs. In ICDM, 327–336.
  • [Newman2003] Newman, M. E. J. 2003. The structure and function of complex networks. SIAM review 45(2):167–256.
  • [Niepert, Ahmed, and Kutzkov2016] Niepert, M.; Ahmed, M.; and Kutzkov, K. 2016. Learning convolutional neural networks for graphs. In ICML, 2014–2023.
  • [Nikolentzos et al.2018] Nikolentzos, G.; Meladianos, P.; Limnios, S.; and Vazirgiannis, M. 2018. A degeneracy framework for graph similarity. In IJCAI, 2595–2601.
  • [Nikolentzos, Meladianos, and Vazirgiannis2017] Nikolentzos, G.; Meladianos, P.; and Vazirgiannis, M. 2017. Matching node embeddings for graph similarity. In AAAI, 2429–2435.
  • [Ramakrishnan et al.2014] Ramakrishnan, R.; Dral, P. O.; Rupp, M.; and von Lilienfeld, O. A. 2014. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1.
  • [Ruddigkeit et al.2012] Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; and Reymond, J.-L. 2012. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling 52(11):2864–2875.
  • [Scarselli et al.2009a] Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009a. Computational capabilities of graph neural networks. IEEE Transactions on Neural Networks 20(1):81–102.
  • [Scarselli et al.2009b] Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009b. The graph neural network model. IEEE Transactions on Neural Networks 20(1):61–80.
  • [Schütt et al.2017] Schütt, K.; Kindermans, P. J.; Sauceda, H. E.; Chmiela, S.; Tkatchenko, A.; and Müller, K. R. 2017. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. In NIPS, 992–1002.
  • [Shervashidze et al.2009] Shervashidze, N.; Vishwanathan, S. V. N.; Petri, T. H.; Mehlhorn, K.; and Borgwardt, K. M. 2009. Efficient graphlet kernels for large graph comparison. In AISTATS, 488–495.
  • [Shervashidze et al.2011] Shervashidze, N.; Schweitzer, P.; van Leeuwen, E. J.; Mehlhorn, K.; and Borgwardt, K. M. 2011. Weisfeiler-Lehman graph kernels. JMLR 12:2539–2561.
  • [Vishwanathan et al.2010] Vishwanathan, S. V. N.; Schraudolph, N. N.; Kondor, R.; and Borgwardt, K. M. 2010. Graph kernels. JMLR 11:1201–1242.
  • [Wang et al.2018] Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S. E.; Bronstein, M. M.; and Solomon, J. M. 2018. Dynamic graph CNN for learning on point clouds. CoRR abs/1801.07829.
  • [Wu et al.2018] Wu, Z.; Ramsundar, B.; Feinberg, E. N.; Gomes, J.; Geniesse, C.; Pappu, A. S.; Leswing, K.; and Pande, V. 2018. Moleculenet: a benchmark for molecular machine learning. Chemical Science 9:513–530.
  • [Xu et al.2018] Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.-i.; and Jegelka, S. 2018. Representation learning on graphs with jumping knowledge networks. In ICML, 5453–5462.
  • [Yanardag and Vishwanathan2015a] Yanardag, P., and Vishwanathan, S. V. N. 2015a. Deep graph kernels. In KDD, 1365–1374.
  • [Yanardag and Vishwanathan2015b] Yanardag, P., and Vishwanathan, S. V. N. 2015b. A structural smoothing framework for robust graph comparison. In NIPS, 2134–2142.
  • [Ying et al.2018a] Ying, R.; He, R.; Chen, K.; Eksombatchai, P.; Hamilton, W. L.; and Leskovec, J. 2018a. Graph convolutional neural networks for web-scale recommender systems. KDD.
  • [Ying et al.2018b] Ying, R.; You, J.; Morris, C.; Ren, X.; Hamilton, W. L.; and Leskovec, J. 2018b. Hierarchical graph representation learning with differentiable pooling. In NIPS.
  • [Zhang et al.2018] Zhang, M.; Cui, Z.; Neumann, M.; and Chen, Y. 2018. An end-to-end deep learning architecture for graph classification. In AAAI, 4428–4435.


In the following we provide proofs for Theorem 1, Theorem 2, Proposition 3, and Proposition 4.

Proof of Theorem 1

Theorem 5 (Theorem 1 in the main paper).

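Let $(G, l)$ be a labeled graph. Then for all $t \geq 0$ and for all choices of initial colorings $f^{(0)}$ consistent with $l$, and weights $\mathbf{W}^{(t)}$,

$$c_l^{(t)} \sqsubseteq f^{(t)}.$$

For the theorem we consider a single iteration of the 1-WL algorithm and the GNN on a single graph.

Proof of Theorem 1.

We show for an arbitrary iteration $t$ and nodes $u, v \in V(G)$, that $c_l^{(t)}(u) = c_l^{(t)}(v)$ implies $f^{(t)}(u) = f^{(t)}(v)$. In iteration $0$ we have $c_l^{(0)} \equiv f^{(0)}$ as the initial node coloring $f^{(0)}$ is chosen consistent with $l$.

Let $u, v \in V(G)$ and $t \in \mathbb{N}$ such that $c_l^{(t+1)}(u) = c_l^{(t+1)}(v)$. Assume for the induction that $c_l^{(t)} \sqsubseteq f^{(t)}$ holds. As $c_l^{(t+1)}(u) = c_l^{(t+1)}(v)$ we know from the refinement step of the 1-WL that the old colors $c_l^{(t)}(u)$ and $c_l^{(t)}(v)$ of $u$ and $v$ as well as the multisets $\{\!\!\{ c_l^{(t)}(w) \mid w \in N(u) \}\!\!\}$ and $\{\!\!\{ c_l^{(t)}(w) \mid w \in N(v) \}\!\!\}$ of colors of the neighbors of $u$ and $v$ are identical.

Let $M_u = \{\!\!\{ f^{(t)}(w) \mid w \in N(u) \}\!\!\}$ and $M_v = \{\!\!\{ f^{(t)}(w) \mid w \in N(v) \}\!\!\}$ be the multisets of feature vectors of the neighbors of $u$ and $v$ respectively. By the induction hypothesis, we know that $M_u = M_v$ and $f^{(t)}(u) = f^{(t)}(v)$ such that independent of the choice of $f^{W_1}_{\mathrm{merge}}$ and $f^{W_2}_{\mathrm{aggr}}$ we get $f^{(t+1)}(u) = f^{(t+1)}(v)$. This holds as the input to both functions $f^{W_1}_{\mathrm{merge}}$ and $f^{W_2}_{\mathrm{aggr}}$ is identical. This proves $c_l^{(t+1)} \sqsubseteq f^{(t+1)}$ and thereby the theorem. ∎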
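The statement of Theorem 1 can be checked numerically on a small graph. The following is a minimal sketch, not the authors' implementation: it assumes the basic layer form relu(F W1 + A F W2), one-hot initial features, and small random integer weights (chosen only so that the arithmetic is exact). The helper names `wl_colors`, `gnn_features`, and `wl_refines_gnn` are hypothetical.

```python
import numpy as np


def wl_colors(adj, labels, iterations):
    """Run 1-WL color refinement; returns one integer color per node."""
    colors = list(labels)
    for _ in range(iterations):
        # Signature = (old color, sorted multiset of neighbor colors).
        signatures = [
            (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in range(len(adj))
        ]
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures)))}
        colors = [palette[sig] for sig in signatures]
    return colors


def gnn_features(adj, labels, iterations, dim=4, seed=0):
    """Apply `iterations` layers F <- relu(F W1 + A F W2) with random weights."""
    rng = np.random.default_rng(seed)
    n = len(adj)
    A = np.zeros((n, n), dtype=np.int64)
    for v, neighbors in enumerate(adj):
        for u in neighbors:
            A[v, u] = 1
    # One-hot initial features: an initial coloring consistent with the labels.
    F = np.eye(max(labels) + 1, dtype=np.int64)[labels]
    for _ in range(iterations):
        # Small integer weights keep all sums exact (an illustrative assumption).
        W1 = rng.integers(-2, 3, size=(F.shape[1], dim))
        W2 = rng.integers(-2, 3, size=(F.shape[1], dim))
        F = np.maximum(F @ W1 + A @ F @ W2, 0)
    return F


def wl_refines_gnn(colors, F):
    """True if equal 1-WL colors always imply equal GNN feature rows."""
    n = len(colors)
    return all(
        np.array_equal(F[u], F[v])
        for u in range(n) for v in range(n)
        if colors[u] == colors[v]
    )


# Path graph 0-1-2-3 with uniform node labels.
adj = [[1], [0, 2], [1, 3], [2]]
labels = [0, 0, 0, 0]
colors = wl_colors(adj, labels, iterations=3)
features = gnn_features(adj, labels, iterations=3)
print(colors, wl_refines_gnn(colors, features))
```

Changing `seed` or `dim` varies the weights. By the theorem, the check should succeed for every choice, although the converse (equal features implying equal colors) is not guaranteed for arbitrary weights.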

Proof of Theorem 2

Theorem 6 (Theorem 2 in the main paper).

Let $(G, l)$ be a labeled graph. Then for all $t \geq 0$ there exists a sequence of weights $\mathbf{W}^{(t)}$ and a 1-GNN architecture such that

$$c_l^{(t)} \equiv f^{(t)}.$$

For the proof we start by giving the proof for graphs where all nodes have the same initial color and then extend it to colored graphs. In order to do that we use a slightly adapted but equivalent version of the 1-WL. Note that the extension to colored graphs is mostly technical, while the important idea is already contained in the first case.

Uncolored Graphs

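Let $\mathcal{C}$ be the refinement operator for the 1-WL, mapping the old coloring $c^{(t)}$ to the updated one $c^{(t+1)}$:

$$c^{(t+1)}(v) = \mathcal{C}(c^{(t)})(v) = \big(c^{(t)}(v), \{\!\!\{ c^{(t)}(u) \mid u \in N(v) \}\!\!\}\big).$$

We first show that for uncolored graphs this is equivalent to the update rule $\tilde{\mathcal{C}}$:

$$\tilde{c}^{(t+1)}(v) = \tilde{\mathcal{C}}(\tilde{c}^{(t)})(v) = \{\!\!\{ \tilde{c}^{(t)}(u) \mid u \in N(v) \}\!\!\}.$$

We denote $J$ as the all-$1$ matrix where the size will always be clear from the context.

Lemma 7.

Let $G$ be a graph, $t \in \mathbb{N}$, and $v, w \in V(G)$ such that $\tilde{c}^{(t)}(v) \neq \tilde{c}^{(t)}(w)$. Then $\tilde{c}^{(t')}(v) \neq \tilde{c}^{(t')}(w)$ for all $t' \geq t$.

Proof.

Let $t$ be minimal such that there are $v, w \in V(G)$ with

$$\tilde{c}^{(t)}(v) \neq \tilde{c}^{(t)}(w) \tag{7}$$

$$\tilde{c}^{(t+1)}(v) = \tilde{c}^{(t+1)}(w). \tag{8}$$

Then $t \geq 1$, because $\tilde{c}^{(0)}(v) = \tilde{c}^{(0)}(w)$ as there are no initial colors. Let $P_1, \ldots, P_p$ be the color classes of $\tilde{c}^{(t)}$. That is, for all $x, y \in V(G)$ we have $\tilde{c}^{(t)}(x) = \tilde{c}^{(t)}(y)$ if and only if there is an $i \in [p]$ such that $x, y \in P_i$. Similarly, let $Q_1, \ldots, Q_q$ be the color classes of $\tilde{c}^{(t-1)}$. Observe that the partition $\{P_1, \ldots, P_p\}$ of $V(G)$ refines the partition $\{Q_1, \ldots, Q_q\}$. Indeed, if there were $i \in [p]$, $j \neq j' \in [q]$ such that $P_i \cap Q_j \neq \emptyset$ and $P_i \cap Q_{j'} \neq \emptyset$, then all $x \in P_i \cap Q_j$, $y \in P_i \cap Q_{j'}$ would satisfy $\tilde{c}^{(t-1)}(x) \neq \tilde{c}^{(t-1)}(y)$ and $\tilde{c}^{(t)}(x) = \tilde{c}^{(t)}(y)$, contradicting the minimality of $t$.

Choose $v, w$ satisfying (7) and (8). By (7), there is an $i \in [q]$ such that $|N(v) \cap Q_i| \neq |N(w) \cap Q_i|$. Let $j_1, \ldots, j_\ell \in [p]$ such that $Q_i = P_{j_1} \cup \cdots \cup P_{j_\ell}$. By (8), for all $j \in [p]$ we have $|N(v) \cap P_j| = |N(w) \cap P_j|$. As the $P_j$ are disjoint, this implies $|N(v) \cap Q_i| = |N(w) \cap Q_i|$, which is a contradiction. ∎

Hence, the two update rules are equivalent.

Corollary 8.

For all graphs $G$ and all $t \in \mathbb{N}$ we have $c^{(t)} \equiv \tilde{c}^{(t)}$.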
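The equivalence of the two update rules on uncolored graphs can be illustrated with a short sketch. It assumes that the standard rule recolors a node by its old color together with the multiset of its neighbors' colors, while the adapted rule uses the neighbor multiset only; the helper names `refine`, `partition`, and `rules_agree`, and the example graph, are hypothetical.

```python
def refine(adj, colors, keep_own_color):
    """One refinement step; optionally include the node's own old color."""
    signatures = []
    for v in range(len(adj)):
        neighbor_colors = tuple(sorted(colors[u] for u in adj[v]))
        if keep_own_color:
            signatures.append((colors[v], neighbor_colors))
        else:
            signatures.append(neighbor_colors)
    palette = {sig: i for i, sig in enumerate(sorted(set(signatures)))}
    return [palette[sig] for sig in signatures]


def partition(colors):
    """Color classes as a set of frozensets, ignoring the color names."""
    classes = {}
    for v, c in enumerate(colors):
        classes.setdefault(c, set()).add(v)
    return {frozenset(nodes) for nodes in classes.values()}


def rules_agree(adj, iterations):
    """Compare both update rules, starting from the uniform coloring."""
    with_own = [0] * len(adj)
    without_own = [0] * len(adj)
    for _ in range(iterations):
        with_own = refine(adj, with_own, keep_own_color=True)
        without_own = refine(adj, without_own, keep_own_color=False)
        if partition(with_own) != partition(without_own):
            return False
    return True


# A small tree: node 0 has leaves 1 and 2 and a tail 3-4-5.
tree = [[1, 2, 3], [0], [0], [0, 4], [3, 5], [4]]
print(rules_agree(tree, iterations=4))
```

Only the induced partitions are compared, since the two rules assign different color names to the same classes.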

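Thus we can use the update rule $\tilde{\mathcal{C}}$ for the proof on unlabeled graphs. For the proof, it will be convenient to assume that $V(G) = [n]$ (although we still work with the notation $V(G)$). It follows that $|V(G)| = n$. A node coloring $c$ defines a matrix $F \in \mathbb{R}^{n \times d}$ where the $i$th row of $F$ is defined by $c(i) \in \mathbb{R}^{1 \times d}$. Here we interpret $i$ as a node from $V(G)$. As colorings and matrices can be interpreted as one another, given a matrix $F \in \mathbb{R}^{n \times d}$ we write $\mathcal{C}(F)$ (or $\tilde{\mathcal{C}}(F)$) for a Weisfeiler-Leman iteration on the coloring induced by the matrix $F$. For the GNN computation we provide a matrix based notation. Using the adjacency matrix $A \in \{0, 1\}^{n \times n}$ of $G$ and a coloring $F^{(t)} \in \mathbb{R}^{n \times d}$, we can write the update rule of the GNN layer as

$$F^{(t+1)} = \mathcal{B}(F^{(t)}) = \sigma\big(A F^{(t)} W^{(t)} + b J\big),$$

where $\mathcal{B}$ is the refinement operator of GNNs corresponding to a single iteration of the 1-WL. For simplicity of the proof, we choose

$$\sigma(x) = \mathrm{sign}(x) = \begin{cases} 1 & \text{if } x > 0, \\ -1 & \text{otherwise,} \end{cases}$$

and the bias as

$$b = -1.$$

Note that we later provide a way to simulate the sign function using ReLU operations to indicate that choosing the sign function is not really a hard restriction.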

Lemma 9.

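Let $B \in \mathbb{N}^{s \times t}$ be a matrix such that $0 \leq B_{ij} \leq m - 1$ for all $i \in [s]$, $j \in [t]$ and the rows of $B$ are pairwise distinct. Then there is a matrix $X \in \mathbb{R}^{t \times s}$ such that the matrix $\mathrm{sign}(BX - J) \in \{-1, 1\}^{s \times s}$ is non-singular.

Proof.

Let $z = (1, m, m^2, \ldots, m^{t-1})^T \in \mathbb{N}^{t}$ where $m$ is the upper bound on the matrix entries of $B$ and $b = Bz \in \mathbb{N}^{s}$. Then the entries of $b$ are nonnegative and pairwise distinct. Without loss of generality, we assume that $b = (b_1, \ldots, b_s)^T$ such that $b_1 > b_2 > \cdots > b_s \geq 0$. Now we choose numbers $x_1, \ldots, x_s \in \mathbb{R}$ such that

$$b_i \cdot x_j \begin{cases} < 1 & \text{if } i \geq j, \\ > 1 & \text{if } i < j \end{cases} \tag{9}$$

for all $i, j \in [s]$, which is possible as the $b_i$ are ordered. Let $x = (x_1, \ldots, x_s) \in \mathbb{R}^{1 \times s}$ and $C = b \cdot x \in \mathbb{R}^{s \times s}$ and $C' = \mathrm{sign}(C - J)$. Then $C$ has entries $C_{ij} = b_i x_j$, and thus by (9),

$$C' = \begin{pmatrix} -1 & 1 & \cdots & 1 \\ -1 & -1 & \ddots & \vdots \\ \vdots & & \ddots & 1 \\ -1 & -1 & \cdots & -1 \end{pmatrix}. \tag{10}$$

Thus $C'$ is non-singular. Now we simply let $X = z \cdot x \in \mathbb{R}^{t \times s}$. Then $BX = Bzx = bx = C$. ∎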
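The construction in this proof can be sketched in a few lines of NumPy. The sketch assumes that the lemma concerns the matrix sign(BX − J), with J the all-ones matrix and sign mapping positive entries to 1 and all other entries to −1. The concrete choice x_j = 1/(b_j + 0.5) is one of many valid ones and is introduced here only for illustration, as is the helper name `lemma9_matrix`.

```python
import numpy as np


def sign(M):
    """Sign convention assumed here: +1 for positive entries, -1 otherwise."""
    return np.where(M > 0, 1, -1)


def lemma9_matrix(B, m):
    """Build X for an integer matrix B with entries in {0, ..., m-1}
    and pairwise distinct rows, so that sign(B X - J) is non-singular."""
    B = np.asarray(B, dtype=np.int64)
    s, t = B.shape
    # Reading each row as a base-m number gives pairwise distinct integers.
    z = m ** np.arange(t, dtype=np.int64)
    b = B @ z
    # Thresholds between consecutive sorted values: b_i * x_j < 1 exactly
    # when b_i <= b_j, because all b_i are integers.
    b_desc = np.sort(b)[::-1].astype(float)
    x = 1.0 / (b_desc + 0.5)
    return np.outer(z, x)


B = np.array([[0, 1], [1, 0], [1, 1], [2, 0]])
X = lemma9_matrix(B, m=3)
S = sign(B @ X - 1.0)
print(S)
print(np.linalg.matrix_rank(S))
```

The rows of `B` are not sorted beforehand, so `S` is a row permutation of the triangular sign pattern from the proof, which does not affect non-singularity.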

Let us call a matrix row-independent modulo equality if the set of all rows appearing in the matrix is linearly independent.

Example 10.

The matrix

is row-independent modulo equality.

Note that the all-$1$ matrix is row-independent modulo equality in all dimensions.
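The definition can be turned into a small numerical check. This is a sketch under the assumption that "the set of all rows appearing in the matrix" means the distinct rows; the function name and the example matrices are hypothetical and are not the matrix of Example 10.

```python
import numpy as np


def is_row_independent_modulo_equality(M):
    """True if the distinct rows of M form a linearly independent set."""
    M = np.asarray(M, dtype=float)
    distinct_rows = np.unique(M, axis=0)
    return bool(np.linalg.matrix_rank(distinct_rows) == distinct_rows.shape[0])


# Repeated rows are allowed, as long as the distinct ones are independent.
print(is_row_independent_modulo_equality([[1, 0], [1, 0], [0, 1]]))
# The all-1 matrix has a single distinct, non-zero row.
print(is_row_independent_modulo_equality(np.ones((3, 2))))
# Three distinct rows in two dimensions cannot be independent.
print(is_row_independent_modulo_equality([[1, 0], [0, 1], [1, 1]]))
```

The rank computation relies on NumPy's default numerical tolerance, so the check is only reliable for well-scaled matrices.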

Lemma 11.

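Let $d \in \mathbb{N}$, and let $F \in \mathbb{R}^{n \times d}$ be row-independent modulo equality. Then there is a $W \in \mathbb{R}^{d \times n}$ such that the matrix $\mathrm{sign}(AFW - J)$ is row-independent modulo equality and

$$\mathrm{sign}(AFW - J) \equiv \tilde{\mathcal{C}}(F).$$

Proof.

Let $Q_1, \ldots, Q_q$ be the color classes of $F$ (that is, for all $v, w \in V(G)$ it holds that $F_v = F_w \iff \exists j \in [q] : v, w \in Q_j$). Let $\tilde{F} \in \mathbb{R}^{q \times d}$ be the matrix with rows $\tilde{F}_j = F_v$ for all $j \in [q]$, $v \in Q_j$. Then the rows of $\tilde{F}$ are linearly independent, and thus there is a matrix $M \in \mathbb{R}^{d \times q}$ such that $\tilde{F} M$ is the $(q \times q)$ identity matrix. It follows that $FM \in \mathbb{R}^{n \times q}$ is the matrix with entries

$$(FM)_{vj} = \begin{cases} 1 & \text{if } v \in Q_j, \\ 0 & \text{otherwise.} \end{cases} \tag{11}$$

Let $D \in \mathbb{N}^{n \times q}$ be the matrix with entries $D_{vj} = |N(v) \cap Q_j|$. Note that

$$AFM = D, \tag{12}$$

because for all $v \in V(G)$ and $j \in [q]$ we have

$$(AFM)_{vj} = \sum_{v' \in V(G)} A_{vv'} (FM)_{v'j} = \sum_{v' \in Q_j} A_{vv'} = D_{vj},$$

where the second equality follows from Equation (11). By the definition of $\tilde{\mathcal{C}}$ as the 1-WL operator on uncolored graphs, we have

$$\tilde{\mathcal{C}}(F) \equiv D \tag{13}$$

if we view $D$ as a coloring of $V(G)$.

Let $P_1, \ldots, P_p$ be the color classes of $D$, and let $\tilde{D} \in \mathbb{N}^{p \times q}$ be the matrix with rows $\tilde{D}_i = D_v$ for all $i \in [p]$ and $v \in P_i$. Then $0 \leq \tilde{D}_{ij} \leq n - 1$ for all $i \in [p]$, $j \in [q]$, and the rows of $\tilde{D}$ are pairwise distinct. By Lemma 9, there is a matrix $X \in \mathbb{R}^{q \times p}$ such that the matrix $\mathrm{sign}(\tilde{D} X - J) \in \mathbb{R}^{p \times p}$ is non-singular. This implies that the matrix $\mathrm{sign}(AFMX - J) = \mathrm{sign}(DX - J)$ is row-independent modulo equality. Moreover, $\mathrm{sign}(AFMX - J) \equiv D \equiv \tilde{\mathcal{C}}(F)$ by (13). We let $W \in \mathbb{R}^{d \times n}$ be the matrix obtained from $MX \in \mathbb{R}^{d \times p}$ by adding $n - p$ all-0 columns. Then

$$\mathrm{sign}(AFW - J)$$

is row-independent modulo equality and $\mathrm{sign}(AFW - J) \equiv \mathrm{sign}(AFMX - J) \equiv \tilde{\mathcal{C}}(F)$. ∎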
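The first half of this proof rests on one identity: multiplying the feature matrix by a right inverse of its distinct rows yields class-indicator columns, and multiplying by the adjacency matrix then counts neighbors per color class. A minimal sketch of that step follows, assuming the right inverse is taken to be the Moore-Penrose pseudoinverse (one possible choice); the helper name `neighbor_class_counts` and the example matrices are hypothetical.

```python
import numpy as np


def neighbor_class_counts(A, F):
    """For F row-independent modulo equality, return (D, cls) where
    D[v, j] is the number of neighbors of v in color class j and
    cls[v] is the color class of node v."""
    A = np.asarray(A, dtype=float)
    F = np.asarray(F, dtype=float)
    # The distinct rows of F are the color classes.
    F_tilde, cls = np.unique(F, axis=0, return_inverse=True)
    cls = np.asarray(cls).reshape(-1)
    # F_tilde has independent rows, so its pseudoinverse is a right inverse.
    M = np.linalg.pinv(F_tilde)
    indicator = F @ M          # one-hot class membership per node
    D = A @ indicator          # neighbor counts per class
    return np.rint(D).astype(int), cls


# Path graph 0-1-2-3; the end nodes share one feature row, the inner nodes another.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
F = np.array([[2, 1], [1, 3], [1, 3], [2, 1]])
D, cls = neighbor_class_counts(A, F)
print(cls)
print(D)
```

Two nodes receive the same row of `D` exactly when they see the same number of neighbors in every class, which is the coloring produced by one refinement step that ignores the node's own color.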

Corollary 12.

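There is a sequence $(W^{(t)})_{t \in \mathbb{N}}$ with $W^{(t)} \in \mathbb{R}^{n \times n}$ such that for all $t \in \mathbb{N}$,

$$\tilde{c}^{(t)} \equiv F^{(t)},$$

where $F^{(t)}$ is given by the $t$-fold application of $\mathcal{B}$ on the initial uniform coloring $F^{(0)} = J$.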

Remark 13.

The construction in Lemma 11 always outputs a matrix with as many columns as there are color classes in the resulting coloring. Thus we can choose to be

and pad the matrix using additional


Colored Graphs

We now extend the computation to colored graphs. In order to do that, we again use an equivalent but slightly different variant of the Weisfeiler-Leman update rule leading to colorings $c_*^{(t)}$ instead of the usual $c_l^{(t)}$. We then start by showing that both update rules are equivalent.

We define $\mathcal{C}_*$ to be the refinement operator for the 1-WL, mapping a coloring $c_*^{(t)}$ to the updated one $c_*^{(t+1)}$ as follows:

$$c_*^{(t+1)}(v) = \mathcal{C}_*(c_*^{(t)})(v) = \big(l(v), \{\!\!\{ c_*^{(t)}(u) \mid u \in N(v) \}\!\!\}\big).$$

Note that for $\mathcal{C}_*$ we use the initial color of a node whereas $\mathcal{C}$ used the color from the previous round. The idea of using those old colors is to make sure that any two nodes which got a different color in iteration $t$, get different colors in iteration $t' \geq t$. This is formalized by the following lemma.

Lemma 14.

Let $(G, l)$ be a colored graph, $t \in \mathbb{N}$, and $v, w \in V(G)$ such that