How Powerful are Graph Neural Networks?

Keyulu Xu et al. (MIT, Stanford University), October 1, 2018

Graph Neural Networks (GNNs) for representation learning of graphs broadly follow a neighborhood aggregation framework, where the representation vector of a node is computed by recursively aggregating and transforming feature vectors of its neighboring nodes. Many GNN variants have been proposed and have achieved state-of-the-art results on both node and graph classification tasks. However, despite GNNs revolutionizing graph representation learning, there is limited understanding of their representational properties and limitations. Here, we present a theoretical framework for analyzing the expressive power of GNNs in capturing different graph structures. Our results characterize the discriminative power of popular GNN variants, such as Graph Convolutional Networks and GraphSAGE, and show that they cannot learn to distinguish certain simple graph structures. We then develop a simple architecture that is provably the most expressive among the class of GNNs and is as powerful as the Weisfeiler-Lehman graph isomorphism test. We empirically validate our theoretical findings on a number of graph classification benchmarks, and demonstrate that our model achieves state-of-the-art performance.

1 Introduction

Learning with graph structured data, such as molecules, social, biological, and financial networks, requires effective representation of their graph structure (Hamilton et al., 2017b). Recently, there has been a surge of interest in Graph Neural Network (GNN) approaches for representation learning of graphs (Li et al., 2016; Hamilton et al., 2017a; Kipf & Welling, 2017; Velickovic et al., 2018; Xu et al., 2018). GNNs broadly follow a recursive neighborhood aggregation (or message passing) scheme, where each node aggregates feature vectors of its neighbors to compute its new feature vector (Gilmer et al., 2017; Xu et al., 2018). After $k$ iterations of aggregation, a node is represented by its transformed feature vector, which captures the structural information within the node's $k$-hop network neighborhood. The representation of an entire graph can then be obtained through pooling, for example, by summing the representation vectors of all nodes in the graph.

Many GNN variants with different neighborhood aggregation and graph-level pooling schemes have been proposed (Battaglia et al., 2016; Defferrard et al., 2016; Duvenaud et al., 2015; Hamilton et al., 2017a; Kearnes et al., 2016; Kipf & Welling, 2017; Li et al., 2016; Velickovic et al., 2018; Verma & Zhang, 2018; Ying et al., 2018; Zhang et al., 2018). Empirically, these GNNs have achieved state-of-the-art performance in many tasks such as node classification, link prediction, and graph classification. However, the design of new GNNs is mostly based on empirical intuition, heuristics, and experimental trial-and-error. There is little theoretical understanding of the properties and limitations of GNNs, and formal analysis of GNNs' representational capacity is limited.

Here, we present a theoretical framework for analyzing the representational power of GNNs. We formally characterize how expressive different GNN variants are in learning to represent and distinguish between different graph structures. Our framework is inspired by the close connection between GNNs and the Weisfeiler-Lehman (WL) graph isomorphism test (Weisfeiler & Lehman, 1968), a powerful test known to distinguish a broad class of graphs (Babai & Kucera, 1979). Similar to GNNs, the WL test iteratively updates a given node’s feature vector by aggregating feature vectors of its network neighbors. What makes the WL test so powerful is its injective aggregation update that maps different node neighborhoods to different feature vectors. Our key insight is that a GNN can have as large discriminative power as the WL test if the GNN’s aggregation scheme is highly expressive and can model injective functions.

To mathematically formalize the above insight, our framework first abstracts the feature vectors of a node's neighbors as a multiset, i.e., a set with possibly repeating elements. Then, the neighbor aggregation in GNNs can be abstracted as a function over the multiset. We rigorously study different variants of multiset functions and theoretically characterize their discriminative power, i.e., how well different aggregation functions can distinguish different multisets. The more discriminative the multiset function is, the more powerful the underlying GNN.

Our main results are summarized as follows:


  • We show that GNNs are at most as powerful as the WL test in distinguishing graph structures.

  • We establish conditions on the neighbor aggregation and graph pooling functions under which the resulting GNN is as powerful as the WL test.

  • We identify graph structures that cannot be distinguished by popular GNN variants, such as GCN (Kipf & Welling, 2017) and GraphSAGE (Hamilton et al., 2017a), and we precisely characterize the kinds of graph structures such GNN-based models can capture.

  • We develop a simple neural architecture, Graph Isomorphism Network (GIN), and show that its discriminative/representational power is equal to the power of the WL test.

We validate our theory via experiments on graph classification datasets, where the expressive power of GNNs is crucial to capture graph structures. In particular, we compare the performance of GNNs with various aggregation functions. Our results confirm that the most powerful GNN (our Graph Isomorphism Network (GIN)) has high representational power as it almost perfectly fits the training data, whereas the less powerful GNN variants often severely underfit the training data. In addition, the representationally more powerful GNNs outperform the others in terms of test set accuracy and achieve state-of-the-art performance on many graph classification benchmarks.

2 Preliminaries

We begin by summarizing some of the most common GNN models and, along the way, introduce our notation. Let $G = (V, E)$ denote a graph with node feature vectors $X_v$ for $v \in V$. There are two tasks of interest: (1) Node classification, where each node $v \in V$ has an associated label $y_v$ and the goal is to learn a representation vector $h_v$ of $v$ such that $v$'s label can be predicted as $y_v = f(h_v)$; (2) Graph classification, where, given a set of graphs $\{G_1, \ldots, G_N\} \subseteq \mathcal{G}$ and their labels $\{y_1, \ldots, y_N\} \subseteq \mathcal{Y}$, we aim to learn a representation vector $h_G$ that helps predict the label of an entire graph, $y_G = g(h_G)$.

Graph Neural Networks. GNNs use the graph structure and node features $X_v$ to learn a representation vector of a node, $h_v$, or of the entire graph, $h_G$. Modern GNNs follow a neighborhood aggregation strategy, where we iteratively update the representation of a node by aggregating representations of its neighbors. After $k$ iterations of aggregation, a node's representation captures the structural information within its $k$-hop network neighborhood. Formally, the $k$-th layer of a GNN is

$$a_v^{(k)} = \mathrm{AGGREGATE}^{(k)}\left(\left\{h_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right), \qquad h_v^{(k)} = \mathrm{COMBINE}^{(k)}\left(h_v^{(k-1)},\, a_v^{(k)}\right) \tag{2.1}$$

where $h_v^{(k)}$ is the feature vector of node $v$ at the $k$-th iteration/layer. We initialize $h_v^{(0)} = X_v$, and $\mathcal{N}(v)$ is the set of nodes adjacent to $v$. The choice of AGGREGATE$^{(k)}$ and COMBINE$^{(k)}$ in GNNs is crucial. A number of architectures for AGGREGATE have been proposed. In Graph Convolutional Networks (GCN) (Kipf & Welling, 2017), AGGREGATE has been formulated as

$$a_v^{(k)} = \mathrm{MEAN}\left\{\mathrm{ReLU}\left(W \cdot h_u^{(k-1)}\right),\ \forall u \in \mathcal{N}(v)\right\} \tag{2.2}$$

where $W$ is a learnable matrix. In the pooling variant of GraphSAGE (Hamilton et al., 2017a), the mean operation in Eq. 2.2 is replaced by an element-wise max-pooling. The COMBINE step could be a concatenation followed by a linear mapping $W \cdot \left[h_v^{(k-1)},\, a_v^{(k)}\right]$, as in GraphSAGE. In GCN, the COMBINE step is omitted and the model simply aggregates node $v$ along with its neighbors, i.e., $\mathcal{N}(v)$ in Eq. 2.2 is replaced by $\mathcal{N}(v) \cup \{v\}$.
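To make the AGGREGATE/COMBINE abstraction concrete, below is a minimal NumPy sketch of the two aggregation styles just discussed. It is an illustration only, not the paper's implementation or any library's API: the adjacency-list encoding, the weight shapes, and the toy graph are assumptions made for the example.

```python
import numpy as np

def gcn_style_layer(H, adj, W):
    """Mean aggregation in the spirit of Eq. 2.2, with the GCN-style choice of
    aggregating each node together with its neighbors (the COMBINE step is omitted)."""
    H_new = np.zeros((H.shape[0], W.shape[1]))
    for v, neighbors in enumerate(adj):
        transformed = np.maximum(H[neighbors + [v]] @ W, 0.0)  # ReLU(W h_u) for each u
        H_new[v] = transformed.mean(axis=0)                    # mean over N(v) ∪ {v}
    return H_new

def sage_style_layer(H, adj, W):
    """AGGREGATE (mean over N(v)) followed by COMBINE (concatenate with h_v,
    then a linear map W · [h_v, a_v]), in the style of GraphSAGE."""
    d = H.shape[1]
    H_new = np.zeros((H.shape[0], W.shape[1]))
    for v, neighbors in enumerate(adj):
        a_v = H[neighbors].mean(axis=0) if neighbors else np.zeros(d)
        H_new[v] = np.concatenate([H[v], a_v]) @ W
    return H_new

# Toy usage: a 4-node path graph with random 8-dimensional features.
rng = np.random.default_rng(0)
adj = [[1], [0, 2], [1, 3], [2]]                                        # adjacency lists
H0 = rng.normal(size=(4, 8))
print(gcn_style_layer(H0, adj, rng.normal(size=(8, 16)) * 0.1).shape)   # (4, 16)
print(sage_style_layer(H0, adj, rng.normal(size=(16, 16)) * 0.1).shape) # (4, 16)
```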

For node classification, the node representation $h_v^{(K)}$ of the final iteration is used for prediction. For graph classification, the READOUT function aggregates node features from the final iteration to obtain the entire graph's representation $h_G$:

$$h_G = \mathrm{READOUT}\left(\left\{h_v^{(K)} \mid v \in G\right\}\right) \tag{2.3}$$

READOUT can be a simple permutation invariant function such as summation or a more sophisticated graph-level pooling function (Ying et al., 2018; Zhang et al., 2018).

Figure 1: Subtree structures at the blue nodes in the Weisfeiler-Lehman graph isomorphism test. Two WL iterations can capture and distinguish the structure of rooted subtrees of height 2.

Weisfeiler-Lehman test. The graph isomorphism problem asks whether two graphs are topologically identical. This is a challenging problem: no polynomial-time algorithm is known for it yet (Garey, 1979; Garey & Johnson, 2002; Babai, 2016). Despite some corner cases (Cai et al., 1992), the Weisfeiler-Lehman (WL) test of graph isomorphism (Weisfeiler & Lehman, 1968) is an effective and computationally efficient test that distinguishes a broad class of graphs (Babai & Kucera, 1979). Its 1-dimensional form, “naive vertex refinement”, is analogous to neighborhood aggregation in GNNs. Assuming each node has a categorical label (when a node instead has a feature vector, an injective function is used to map it to a categorical label), the WL test iteratively (1) aggregates the labels of nodes and their neighborhoods, and (2) hashes the aggregated labels into unique new labels. The algorithm decides that two graphs are different if at some iteration their node labels differ.

Based on the WL test, Shervashidze et al. (2011) proposed the WL subtree kernel that measures the similarity between graphs. The kernel uses the counts of node labels at different iterations of the WL test as the feature vector of a graph. Intuitively, a node's label at the $k$-th iteration of the WL test represents a subtree structure of height $k$ rooted at the node (Figure 1). Thus, the graph features considered by the WL subtree kernel are essentially counts of different rooted subtrees in the graph.
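The refinement procedure just described is short enough to sketch directly. The Python below is a self-contained illustration, not the paper's code: it encodes graphs as adjacency lists and uses nested tuples as the "hashed" labels, which play the role of the injective relabeling in the WL test.

```python
from collections import Counter

def wl_signatures(adj, labels, num_iters):
    """1-dimensional Weisfeiler-Lehman refinement.

    adj    : list of neighbor-index lists
    labels : initial categorical node labels (hashable)
    Returns one Counter of node labels per iteration. A node's label after i
    iterations is a nested tuple encoding its rooted subtree of height i.
    """
    history = [Counter(labels)]
    for _ in range(num_iters):
        labels = [
            (labels[v], tuple(sorted(labels[u] for u in adj[v])))  # (own label, neighbor multiset)
            for v in range(len(adj))
        ]
        history.append(Counter(labels))
    return history

def wl_test(adj1, labels1, adj2, labels2, num_iters=3):
    """False as soon as the label multisets differ (graphs are non-isomorphic);
    True if the test cannot tell the graphs apart."""
    h1 = wl_signatures(adj1, labels1, num_iters)
    h2 = wl_signatures(adj2, labels2, num_iters)
    return all(c1 == c2 for c1, c2 in zip(h1, h2))

# Path vs. star on 4 nodes: distinguished after one refinement step.
path = [[1], [0, 2], [1, 3], [2]]
star = [[1, 2, 3], [0], [0], [0]]
print(wl_test(path, [0] * 4, star, [0] * 4))                 # False

# A known corner case: a 6-cycle vs. two disjoint triangles (both 2-regular).
cycle6 = [[1, 5], [0, 2], [1, 3], [2, 4], [3, 5], [4, 0]]
two_triangles = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4]]
print(wl_test(cycle6, [0] * 6, two_triangles, [0] * 6))      # True: WL cannot tell them apart
```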

3 Theoretical framework: overview

We start with an overview of our framework for analyzing the expressive power of GNNs. A GNN recursively updates each node's feature vector to capture the network structure and features of other nodes around it, i.e., its rooted subtree structures (Figure 1). For notational simplicity, we can assign each distinct feature vector a unique label in $\{a, b, c, \ldots\}$. Then, the feature vectors of a set of neighboring nodes form a multiset: the same element can appear multiple times since different nodes can have identical feature vectors.

Definition 1 (Multiset).

A multiset is a generalized concept of a set that allows multiple instances of its elements. More formally, a multiset is a 2-tuple $X = (S, m)$ where $S$ is the underlying set of $X$ formed from its distinct elements, and $m: S \to \mathbb{N}_{\geq 1}$ gives the multiplicity of the elements.
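In code, a multiset is conveniently modeled by a dictionary from elements to multiplicities; the short Python snippet below uses collections.Counter for this purpose (an illustrative representation, not a notation from the paper).

```python
from collections import Counter

# A multiset X = (S, m): the Counter's keys are the underlying set S,
# and its values are the multiplicity function m.
X1 = Counter(["a", "a", "b"])   # S = {a, b}, m(a) = 2, m(b) = 1
X2 = Counter(["a", "b"])        # same underlying set, different multiset

print(set(X1) == set(X2))       # True  -- same underlying set S
print(X1 == X2)                 # False -- different multiplicities m
```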

In order to analyze the representational power of a GNN, we analyze when a GNN maps two nodes to the same location in the embedding space. Intuitively, the most powerful GNN maps two nodes to the same location only if they have identical subtree structures with identical features on the corresponding nodes. Since subtree structures are defined recursively via node neighborhoods (Figure 1), we can reduce our analysis recursively to the question of when a GNN maps two neighborhoods to the same embedding. The most powerful GNN would never map two different neighborhoods, i.e., multisets of feature vectors, to the same location. This means its aggregation scheme is injective. Thus, we abstract a GNN's aggregation scheme as a class of functions over multisets that its neural networks can represent, and analyze whether they are able to represent injective multiset functions. Next, we use this reasoning to develop a maximally powerful GNN. In Section 5, we study popular GNN variants and see that their aggregation schemes are inherently not injective and thus less powerful, but that they can capture other interesting properties of graphs.

4 Generalizing the WL test with graph neural networks

Ideally, a representationally powerful GNN could distinguish different graphs by mapping them to different locations in the embedding space. This is, however, equivalent to solving the graph isomorphism problem. In our analysis, we characterize the representational capacity of GNNs via a slightly weaker criterion: the Weisfeiler-Lehman (WL) graph isomorphism test, which is known to work well in general, with a few exceptions. Proofs of all lemmas and theorems can be found in the appendix.

Lemma 2.

Let $G_1$ and $G_2$ be any two non-isomorphic graphs. If a graph neural network $\mathcal{A}: \mathcal{G} \to \mathbb{R}^d$ following the neighborhood aggregation scheme maps $G_1$ and $G_2$ to different embeddings, the Weisfeiler-Lehman graph isomorphism test also decides that $G_1$ and $G_2$ are not isomorphic.

Hence, any aggregation-based GNN is at most as powerful as the WL test in distinguishing different graphs. A natural follow-up question is whether there exist GNNs that are, in principle, as powerful as the WL test. Our answer, in Theorem 3, is yes: if the neighbor aggregation and graph pooling functions are injective, then the resulting GNN is as powerful as the WL test.

Theorem 3.

Let $\mathcal{A}: \mathcal{G} \to \mathbb{R}^d$ be a GNN following the neighborhood aggregation scheme. With a sufficient number of iterations, $\mathcal{A}$ maps any graphs $G_1$ and $G_2$ that the Weisfeiler-Lehman test of isomorphism decides as non-isomorphic to different embeddings if the following conditions hold:


  • $\mathcal{A}$ aggregates and updates node features iteratively with

    $$h_v^{(k)} = \phi\left(h_v^{(k-1)},\ f\left(\left\{h_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right)\right),$$

    where the function $f$, which operates on multisets, and $\phi$ are injective.

  • $\mathcal{A}$'s graph-level readout, which operates on the multiset of node features $\left\{h_v^{(K)}\right\}$, is injective.

We prove Theorem 3 in the appendix. It is worth noting that GNNs have an important benefit over the WL test: node feature vectors in the WL test are essentially one-hot encodings and thus cannot capture the similarity between subtrees. In contrast, a GNN satisfying the criteria in Theorem 3 generalizes the WL test by learning to embed the subtrees into a continuous space. This enables GNNs not only to discriminate different structures, but also to learn to map similar graph structures to similar embeddings and capture dependencies between graph structures. Such learned embeddings are particularly helpful for generalization when the co-occurrence of subtrees is sparse across different graphs or there are noisy edges (Yanardag & Vishwanathan, 2015).

4.1 Graph Isomorphism Network (GIN)

Next we develop a model that provably satisfies the conditions in Theorem 3 and thus generalizes the WL test. We name the resulting architecture Graph Isomorphism Network (GIN).

To model injective multiset functions for the neighbor aggregation, we develop a theory of “deep multisets”, i.e., parameterizing universal multiset functions with neural networks. Our next lemma states that sum aggregators can represent injective, in fact, universal functions over multisets.

Lemma 4.

Assume $\mathcal{X}$ is countable. There exists a function $f: \mathcal{X} \to \mathbb{R}^n$ so that $h(X) = \sum_{x \in X} f(x)$ is unique for each finite multiset $X \subset \mathcal{X}$. Moreover, any multiset function $g$ can be decomposed as $g(X) = \phi\left(\sum_{x \in X} f(x)\right)$ for some function $\phi$.

We prove Lemma 4 in the appendix. The proof extends the setting in Zaheer et al. (2017) from sets to multisets. An important distinction between deep multisets and sets is that certain popular injective set functions, such as the mean aggregator, are not injective multiset functions. Thanks to the universal approximation theorem (Hornik et al., 1989; Hornik, 1991), we can use multi-layer perceptrons (MLPs) to model and learn $f$ and $\phi$ in Lemma 4 for universal injective embeddings. In practice, we model $f$ and $\phi$ jointly with a single MLP, because MLPs can represent compositions of functions. In the first iteration, we do not need MLPs before summation if the input features are one-hot encodings, since their summation alone is injective. GIN updates node representations as

$$h_v^{(k)} = \mathrm{MLP}^{(k)}\left(\sum_{u \in \mathcal{N}(v) \cup \{v\}} h_u^{(k-1)}\right) \tag{4.1}$$

In contrast to GNNs, which combine a node’s feature with its aggregated neighborhood feature as in Eq. 2.1, GIN does not have the combine step and simply aggregates the node along with its neighbors. Although not intuitive, Theorem 3 suggests this simple scheme is as powerful as more complicated ones. Experimentally we observe that such simplicity improves performance.
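For concreteness, here is a minimal NumPy sketch of the GIN update in Eq. 4.1, using a 2-layer MLP with random, untrained weights; the toy graph, parameter shapes, and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """A small 2-layer MLP with ReLU, standing in for MLP^{(k)} in Eq. 4.1."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def gin_layer(H, adj, W1, b1, W2, b2):
    """GIN update (Eq. 4.1): sum the features over N(v) ∪ {v}, then apply the MLP."""
    summed = np.stack([H[adj[v] + [v]].sum(axis=0) for v in range(H.shape[0])])
    return mlp(summed, W1, b1, W2, b2)

# Toy usage on a 4-node path graph with one-hot input features.
rng = np.random.default_rng(0)
adj = [[1], [0, 2], [1, 3], [2]]
H0 = np.eye(4)
d_in, d_hid, d_out = 4, 16, 16
params = (rng.normal(size=(d_in, d_hid)) * 0.1, np.zeros(d_hid),
          rng.normal(size=(d_hid, d_out)) * 0.1, np.zeros(d_out))
H1 = gin_layer(H0, adj, *params)
print(H1.shape)   # (4, 16)
```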

4.2 Readout of subtree structures of different depths

An important aspect of the graph-level readout is that node representations, corresponding to subtree structures, get more refined and global as the number of iterations increases. A sufficient number of iterations is key to achieving good discriminative power. Yet, features from earlier iterations may sometimes generalize better. To consider all structural information, GIN uses information from all depths/iterations of the model. We achieve this by an architecture similar to Jumping Knowledge Networks (JK-Nets) (Xu et al., 2018), where we replace Eq. 2.3 with graph representations concatenated across all iterations:

$$h_G = \mathrm{CONCAT}\left(\mathrm{READOUT}\left(\left\{h_v^{(k)} \mid v \in G\right\}\right)\ \middle|\ k = 0, 1, \ldots, K\right) \tag{4.2}$$

By Theorem 3 and Lemma 4, if GIN replaces READOUT in Eq. 4.2 with summing all node features from the same iterations (we do not need an extra MLP before summation for the same reason as in Eq. 4.1), it provably generalizes the WL test and the WL subtree kernel.
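The following sketch shows this readout with the summation choice just described; the per-iteration feature matrices and their dimensions are placeholders (in a real model they would come from stacked GIN layers).

```python
import numpy as np

def graph_readout(H_per_iter):
    """Graph-level readout in the spirit of Eq. 4.2: sum node features at each
    iteration (including the input features at k = 0), then concatenate the
    per-iteration graph vectors."""
    return np.concatenate([H_k.sum(axis=0) for H_k in H_per_iter])

# Toy example: a 4-node graph with input features and one layer of hidden features.
rng = np.random.default_rng(0)
H0 = np.eye(4)                    # features at k = 0
H1 = rng.normal(size=(4, 16))     # features after one (hypothetical) GIN layer
h_G = graph_readout([H0, H1])
print(h_G.shape)                  # (20,) = 4 + 16
```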

5 Less powerful but still interesting GNNs

Figure 2: Ranking of sum, mean, and max-pooling aggregators by their expressive power over a multiset. The left panel shows the input multiset; the remaining panels illustrate what each aggregator captures: sum captures the full multiset, mean captures the proportion/distribution of element types, and max ignores multiplicities, reducing the multiset to a simple set.
Figure 3: Examples of simple graph structures that mean and max-pooling aggregators fail to distinguish: (a) mean and max both fail; (b) max fails; (c) mean and max both fail. Figure 2 explains how the different aggregators “compress” these graph structures/multisets.

Next we study GNNs that do not satisfy the conditions in Theorem 3, including GCN (Kipf & Welling, 2017) and GraphSAGE (Hamilton et al., 2017a). We conduct ablation studies on two aspects of the aggregator in Eq. 4.1: (1) 1-layer perceptrons instead of MLPs and (2) mean or max-pooling instead of the sum. We will see that these GNN variants get confused by surprisingly simple graphs and are less powerful than the WL test. Nonetheless, models with mean aggregators like GCN perform well for node classification tasks. To better understand this, we precisely characterize what different GNN variants can and cannot capture about a graph and discuss the implications for learning with graphs.

5.1 1-layer perceptron is insufficient for capturing structures

The function $f$ in Lemma 4 helps map distinct multisets to unique embeddings. It can be parameterized by an MLP thanks to the universal approximation theorem (Hornik, 1991). Nonetheless, many existing GNNs instead use a 1-layer perceptron $\sigma \circ W$ (Duvenaud et al., 2015; Kipf & Welling, 2017; Zhang et al., 2018), a linear mapping followed by a non-linear activation function such as a ReLU. Such 1-layer mappings are examples of Generalized Linear Models (Nelder & Wedderburn, 1972). Therefore, we are interested in understanding whether 1-layer perceptrons are enough for graph learning. Lemma 5 shows that there are indeed network neighborhoods (multisets) that models with 1-layer perceptrons can never distinguish.

Lemma 5.

There exist finite multisets $X_1 \neq X_2$ so that for any linear mapping $W$, $\ \sum_{x \in X_1} \mathrm{ReLU}(Wx) = \sum_{x \in X_2} \mathrm{ReLU}(Wx)$.

The main idea of the proof for Lemma 5 is that 1-layer perceptrons can behave much like linear mappings, so the GNN layers degenerate into simply summing over neighborhood features. GNNs with 1-layer perceptrons thus lack representational capacity, and, as we will later see empirically, when applied to graph classification they may severely underfit, whereas GNNs with MLPs usually do not.

5.2 Structures that confuse mean and max-pooling

What happens if we replace the sum in $h(X) = \sum_{x \in X} f(x)$ with mean or max-pooling, as in GCN and GraphSAGE? Mean and max-pooling aggregators are still well-defined multiset functions because they are permutation invariant. However, they are not injective. Figure 2 ranks the three aggregators by their representational power, and Figure 3 illustrates pairs of structures that the mean and max-pooling aggregators fail to distinguish. Here, node colors denote different node features, and we assume the GNNs aggregate neighbors first before combining them with the central node.

In Figure 3(a), every node has the same feature $a$, and $f(a)$ is the same across all nodes (for any function $f$). When performing neighborhood aggregation, the mean or maximum over copies of $f(a)$ remains $f(a)$ and, by induction, we always obtain the same node representation everywhere. Thus, mean and max-pooling aggregators fail to capture any structural information. In contrast, a sum aggregator distinguishes the structures because the two neighborhoods contain different numbers of copies of $f(a)$ and thus sum to different values. The same argument can be applied to any unlabeled graph. If node degrees instead of a constant value are used as node input features, in principle, mean can recover sum, but max-pooling cannot.

Figure 3(a) suggests that mean and max have trouble distinguishing graphs with nodes that have repeating features. Let $h_r$ and $h_g$ ($r$ for red, $g$ for green) denote node features transformed by $f$. Figure 3(b) shows that the maximum over the neighborhoods of the blue nodes yields $\max(h_g, h_r)$ and $\max(h_g, h_g, h_r)$, which collapse to the same representation. Thus, max-pooling fails to distinguish them. In contrast, the sum aggregator still works because $h_g + h_r$ and $h_g + h_g + h_r$ are in general not equal. Similarly, in Figure 3(c), both mean and max fail because $\frac{1}{2}(h_g + h_r) = \frac{1}{4}(h_g + h_g + h_r + h_r)$ and $\max(h_g, h_r) = \max(h_g, h_g, h_r, h_r)$.
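These cases are easy to check numerically. The snippet below uses arbitrary stand-in feature vectors for the red and green nodes (the specific vectors are an assumption made for the example) and compares the three aggregators on the multisets from Figure 3(b) and 3(c).

```python
import numpy as np

h_r, h_g = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # stand-in features for red/green

def agg(multiset, kind):
    M = np.stack(multiset)
    return {"sum": M.sum(0), "mean": M.mean(0), "max": M.max(0)}[kind]

# Figure 3(b): {g, r} vs. {g, g, r} -- max collapses them, sum and mean do not.
A, B = [h_g, h_r], [h_g, h_g, h_r]
print([np.allclose(agg(A, k), agg(B, k)) for k in ("sum", "mean", "max")])
# [False, False, True]

# Figure 3(c): {g, r} vs. {g, g, r, r} -- both mean and max collapse them.
A, B = [h_g, h_r], [h_g, h_g, h_r, h_r]
print([np.allclose(agg(A, k), agg(B, k)) for k in ("sum", "mean", "max")])
# [False, True, True]
```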

5.3 Mean learns distributions

To characterize the class of multisets that the mean aggregator can distinguish, consider the example $X_1 = (S, m)$ and $X_2 = (S, k \cdot m)$, where $X_1$ and $X_2$ have the same set of distinct elements, but $X_2$ contains $k$ copies of each element of $X_1$. Any mean aggregator maps $X_1$ and $X_2$ to the same embedding, because it simply takes averages over individual element features. Thus, the mean captures the distribution (proportions) of elements in a multiset, but not the exact multiset.

Corollary 6.

Assume $\mathcal{X}$ is countable. There exists a function $f: \mathcal{X} \to \mathbb{R}^n$ so that for $h(X) = \frac{1}{|X|}\sum_{x \in X} f(x)$, $h(X_1) = h(X_2)$ if and only if finite multisets $X_1$ and $X_2$ have the same distribution. That is, assuming $|X_2| \geq |X_1|$, we have $X_1 = (S, m)$ and $X_2 = (S, k \cdot m)$ for some $k \in \mathbb{N}_{\geq 1}$.

The mean aggregator may perform well if, for the task, the statistical and distributional information in the graph is more important than the exact structure. Moreover, when node features are diverse and rarely repeat, the mean aggregator is as powerful as the sum aggregator. This may explain why, despite the limitations identified in Section 5.2, GNNs with mean aggregators are effective for node classification tasks, such as classifying article subjects and community detection, where node features are rich and the distribution of neighborhood features provides a strong signal for the task.

5.4 Max-pooling learns sets with distinct elements

The examples in Figure 3 illustrate that max-pooling considers multiple nodes with the same feature as only one node (i.e., it treats a multiset as a set). Max-pooling captures neither the exact structure nor the distribution. However, it may be suitable for tasks where it is important to identify representative elements or the “skeleton”, rather than to distinguish the exact structure or distribution. Qi et al. (2017) empirically show that the max-pooling aggregator learns to identify the skeleton of a 3D point cloud and that it is robust to noise and outliers. For completeness, the next corollary shows that the max-pooling aggregator captures the underlying set of a multiset.

Corollary 7.

Assume $\mathcal{X}$ is countable. Then there exists a function $f: \mathcal{X} \to \mathbb{R}^\infty$ so that for $h(X) = \max_{x \in X} f(x)$, $h(X_1) = h(X_2)$ if and only if $X_1$ and $X_2$ have the same underlying set.

6 Experiments

We evaluate and compare the training and test performance of GIN and less powerful GNN variants.

Datasets. We use 9 graph classification benchmarks: 4 bioinformatics datasets (MUTAG, PTC, NCI1, PROTEINS) and 5 social network datasets (COLLAB, IMDB-BINARY, IMDB-MULTI, REDDIT-BINARY and REDDIT-MULTI5K) (Kersting et al., 2016). In the bioinformatic graphs, the nodes have categorical input features; in the social networks, they have no features. For the REDDIT datasets, we set all node feature vectors to be the same (thus, features here are uninformative); for the other social graphs, we use one-hot encodings of node degrees. Dataset statistics are summarized in Table 1, and more details of the data can be found in Appendix G.

Models and configurations. We evaluate GIN (Eqs. 4.1 and 4.2) and five less powerful GNN variants: architectures that replace the sum in Eq. 4.1 with mean or max-pooling (for REDDIT-BINARY, REDDIT-MULTI5K, and COLLAB, we did not run experiments for max-pooling due to GPU memory constraints), or that replace MLPs with 1-layer perceptrons, i.e., a linear mapping followed by ReLU. In Figure 4 and Table 1, a model is named by the aggregator/perceptron it uses. We apply the same graph-level readout (READOUT in Eq. 4.2) for GINs and all GNN variants, specifically, sum readout on the bioinformatics datasets and mean readout on the social datasets due to better test performance.

Following Yanardag & Vishwanathan (2015) and Niepert et al. (2016), we perform 10-fold cross-validation with LIB-SVM (Chang & Lin, 2011), using 9 folds for training and 1 for testing. For all configurations, 5 GNN layers (including the input layer) are applied, and all MLPs have 2 layers. Batch normalization (Ioffe & Szegedy, 2015) is applied on every hidden layer. We use the Adam optimizer (Kingma & Ba, 2015) and decay the initial learning rate every 50 epochs. The hyper-parameters we tune for each dataset are: (1) the number of hidden units (chosen separately for bioinformatics and social graphs); (2) the batch size; (3) the dropout ratio after the dense layer (Srivastava et al., 2014); (4) the number of epochs.

Baselines. We compare the GNNs above with a number of state-of-the-art baselines for graph classification: (1) the WL subtree kernel (Shervashidze et al., 2011), where a $C$-SVM (Chang & Lin, 2011) was used as the classifier; the hyper-parameters we tune are $C$ of the SVM and the number of WL iterations; (2) state-of-the-art deep learning architectures: Diffusion-convolutional neural networks (DCNN) (Atwood & Towsley, 2016), PATCHY-SAN (Niepert et al., 2016), and Deep Graph CNN (DGCNN) (Zhang et al., 2018); (3) Anonymous Walk Embeddings (AWL) (Ivanov & Burnaev, 2018). For the deep learning methods and AWL, we report the accuracies reported in the original papers.

Figure 4: Training set performance of GINs, less powerful GNN variants, and the WL subtree kernel.
Datasets IMDB-B IMDB-M RDT-B RDT-M5K COLLAB MUTAG PROTEINS PTC NCI1

Datasets

# graphs 1000 1500 2000 5000 5000 188 1113 344 4110
# classes 2 3 2 5 3 2 2 2 2
Avg # nodes 19.8 13.0 429.6 508.5 74.5 17.9 39.1 25.5 29.8

Baselines

WL subtree 73.8 50.9 81.0 52.5 78.9 90.4 75.0 59.9 86.0
DCNN 49.1 33.5 52.1 67.0 61.3 56.6 62.6
PatchySan 71.0 45.2 86.3 49.1 72.6 92.6 75.9 60.0 78.6
DGCNN 70.0 47.8 73.7 85.8 75.5 58.6 74.4
AWL 74.6 51.6 87.9 54.7 73.9 87.9

GNN variants

GIN (Sum–MLP) 75.1 52.3 92.4 57.5 80.2 89.4 76.2 64.6 82.7
Sum–1-Layer 74.1 52.2 90.0 55.1 80.6 90.0 76.2 63.1 82.0
Mean–MLP 73.7 52.3 50.0 (71.2) 20.0 (41.3) 79.2 83.5 75.5 66.6 80.9
Mean–1-Layer 74.0 51.9 50.0 (69.7) 20.0 (39.7) 79.0 85.6 76.0 64.2 80.2
Max–MLP 73.2 51.1 84.0 76.0 64.6 77.8
Max–1-Layer 72.3 50.9 85.1 75.9 63.9 77.7
Table 1: Classification accuracies (%). The 50.0 and 20.0 entries of the mean-based GNNs on the REDDIT datasets are test accuracies equal to the chance rates, obtained when all nodes have the same feature vector; the values in parentheses are the test accuracies when node degrees are used as input node features instead. The best-performing GNNs are highlighted with boldface. On datasets where GIN's accuracy is not strictly the highest among GNN variants, GIN is comparable to the best, because a paired t-test at significance level 10% does not distinguish GIN from the best. If a baseline performs better than all GNNs, it is highlighted with boldface and an asterisk.

6.1 Results

Training set performance. We validate our theoretical analysis of the representational power of GNNs by comparing their training accuracies. Figure 4 shows training curves of GINs and less powerful GNN variants with the same hyper-parameter settings. First, the theoretically most powerful GNN, i.e. GIN (Sum–MLP), is able to almost perfectly fit all training sets. In comparison, the less powerful GNN variants severely underfit on many datasets. In particular, the training accuracy patterns align with our ranking by the models’ representational power: GNN variants with MLPs tend to have higher training accuracies than those with 1-layer perceptrons, and GNNs with sum aggregators tend to fit the training sets better than those with mean and max-pooling aggregators.

However, on our datasets, the training accuracies of the GNNs never exceed those of the WL subtree kernel, which has the same discriminative power as the WL test. For example, on IMDB-BINARY, none of the models can perfectly fit the training set, and the GNNs achieve at most the same training accuracy as the WL kernel. This pattern aligns with our result that the WL test provides an upper bound for the representational capacity of aggregation-based GNNs. Our theoretical results focus on representational power and do not yet take optimization (e.g., local minima) into account. Nonetheless, the empirical results align very well with our theory.

Test set performance. Next, we compare test accuracies. Although our theoretical results do not directly speak about generalization ability of GNNs, it is reasonable to expect that GNNs with strong expressive power can accurately capture graph structures of interest and thus generalize well. Table 1 compares test accuracies of GINs (Sum–MLP), other GNN variants, as well as the state-of-the-art baselines.

First, GINs outperform (or achieve performance comparable to) the less powerful GNN variants on all 9 datasets, achieving state-of-the-art performance. In particular, GINs shine on the social network datasets, which contain a relatively large number of training graphs. On the REDDIT datasets, all nodes were assigned the same, uninformative feature vector. Here, GINs and sum-aggregation GNNs accurately capture the graph structure (as predicted in Section 5.2) and significantly outperform other models. Mean-aggregation GNNs, however, fail to capture structural information and do not perform better than random guessing. Even if node degrees are provided as input features, mean-based GNNs perform much worse than sum-based GNNs.

7 Conclusion

In this paper, we developed theoretical foundations for reasoning about the expressive power of GNNs and proved tight bounds on the representational capacity of popular GNN variants. Along the way, we also designed a provably most powerful GNN under the neighborhood aggregation framework. An interesting direction for future work is to go beyond the neighborhood aggregation (or message passing) framework in order to pursue even more powerful architectures for learning with graphs. It would also be interesting to understand and improve the generalization properties of GNNs.

Acknowledgments

This research was supported by NSF CAREER award 1553284 and a DARPA D3M award. This research was also supported in part by NSF, ARO MURI, IARPA HFC, Boeing, Huawei, Stanford Data Science Initiative, and Chan Zuckerberg Biohub.

We gratefully thank Prof. Ken-ichi Kawarabayashi and Prof. Masashi Sugiyama for generously supporting this research with GPU computing resources, as well as for their wonderful advice. We thank Tomohiro Sonobe and Kento Nozawa for their great management of the GPU servers used in this research. We thank Dr. Yasuo Tabei for inviting Keyulu to give a talk at RIKEN AIP in Nihonbashi, Tokyo, which led to the collaboration on this research. We thank Jingling Li for initiating and arranging the collaboration on this research, as well as for her great inspiration and discussions. We thank Rex Ying and William Hamilton for helpful reviews and positive comments on our paper. We thank Chengtao Li for helpful discussions on the title of the paper. Finally, we thank Simon S. Du for his very helpful discussions and positive comments on our work.

References

  • Atwood & Towsley (2016) James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1993–2001, 2016.
  • Babai (2016) László Babai. Graph isomorphism in quasipolynomial time. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pp. 684–697. ACM, 2016.
  • Babai & Kucera (1979) László Babai and Ludik Kucera. Canonical labelling of graphs in linear average time. In Foundations of Computer Science, 1979., 20th Annual Symposium on, pp. 39–46. IEEE, 1979.
  • Battaglia et al. (2016) Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems (NIPS), pp. 4502–4510, 2016.
  • Cai et al. (1992) Jin-Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.
  • Chang & Lin (2011) Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
  • Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NIPS), pp. 3844–3852, 2016.
  • Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems (NIPS), pp. 2224–2232, 2015.
  • Garey (1979) Michael R Garey. A guide to the theory of NP-completeness. Computers and Intractability, 1979.
  • Garey & Johnson (2002) Michael R Garey and David S Johnson. Computers and Intractability, volume 29. W. H. Freeman, New York, 2002.
  • Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning (ICML), pp. 1263–1272, 2017.
  • Hamilton et al. (2017a) William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NIPS), pp. 1025–1035, 2017a.
  • Hamilton et al. (2017b) William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 40(3):52–74, 2017b.
  • Hornik (1991) Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
  • Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
  • Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pp. 448–456, 2015.
  • Ivanov & Burnaev (2018) Sergey Ivanov and Evgeny Burnaev. Anonymous walk embeddings. In International Conference on Machine Learning (ICML), pp. 2191–2200, 2018.
  • Kearnes et al. (2016) Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016.
  • Kersting et al. (2016) Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.
  • Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • Kipf & Welling (2017) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
  • Li et al. (2016) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. In International Conference on Learning Representations (ICLR), 2016.
  • Nelder & Wedderburn (1972) J. A. Nelder and R. W. M. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society, Series A, General, 135:370–384, 1972.
  • Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning (ICML), pp. 2014–2023, 2016.
  • Qi et al. (2017) Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Shervashidze et al. (2011) Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.
  • Verma & Zhang (2018) Saurabh Verma and Zhi-Li Zhang. Graph capsule convolutional neural networks. arXiv preprint arXiv:1805.08090, 2018.
  • Weisfeiler & Lehman (1968) Boris Weisfeiler and AA Lehman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12–16, 1968.
  • Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning (ICML), pp. 5453–5462, 2018.
  • Yanardag & Vishwanathan (2015) Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374. ACM, 2015.
  • Ying et al. (2018) Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems (NIPS), 2018.
  • Zaheer et al. (2017) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401, 2017.
  • Zhang et al. (2018) Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In AAAI Conference on Artificial Intelligence, pp. 4438–4445, 2018.

Appendix A Proof for Lemma 2

Proof.

Suppose after $k$ iterations, a graph neural network $\mathcal{A}$ has $\mathcal{A}(G_1) \neq \mathcal{A}(G_2)$ but the WL test cannot decide $G_1$ and $G_2$ are non-isomorphic. It follows that from iteration $0$ to $k$ in the WL test, $G_1$ and $G_2$ always have the same collection of node labels. In particular, because $G_1$ and $G_2$ have the same WL node labels for iterations $i$ and $i+1$ for any $i = 0, \ldots, k-1$, $G_1$ and $G_2$ have the same collection, i.e., multiset, of WL node labels $\left\{l_v^{(i)}\right\}$ as well as the same collection of node neighborhoods $\left\{\left(l_v^{(i)}, \left\{l_u^{(i)} : u \in \mathcal{N}(v)\right\}\right)\right\}$. Otherwise, the WL test would have obtained different collections of node labels at iteration $i+1$ for $G_1$ and $G_2$, as different multisets get unique new labels.

The WL test always relabels different multisets of neighboring nodes into different new labels. We show that on the same graph $G = G_1$ or $G_2$, if WL node labels $l_v^{(i)} = l_u^{(i)}$, we always have GNN node features $h_v^{(i)} = h_u^{(i)}$ for any iteration $i$. This apparently holds for $i = 0$ because WL and GNN start with the same node features. Suppose this holds for iteration $j$; if, for any $u, v$, $l_v^{(j+1)} = l_u^{(j+1)}$, then it must be the case that

$$\left(l_v^{(j)}, \left\{l_w^{(j)} : w \in \mathcal{N}(v)\right\}\right) = \left(l_u^{(j)}, \left\{l_w^{(j)} : w \in \mathcal{N}(u)\right\}\right).$$

By our assumption on iteration $j$, we must have

$$\left(h_v^{(j)}, \left\{h_w^{(j)} : w \in \mathcal{N}(v)\right\}\right) = \left(h_u^{(j)}, \left\{h_w^{(j)} : w \in \mathcal{N}(u)\right\}\right).$$

In the aggregation process of the GNN, the same AGGREGATE and COMBINE are applied. The same input, i.e., neighborhood features, generates the same output. Thus, $h_v^{(j+1)} = h_u^{(j+1)}$. By induction, if WL node labels $l_v^{(i)} = l_u^{(i)}$, we always have GNN node features $h_v^{(i)} = h_u^{(i)}$ for any iteration $i$. This creates a valid mapping $\varphi$ such that $h_v^{(i)} = \varphi\left(l_v^{(i)}\right)$ for any $v \in G$. It follows from the fact that $G_1$ and $G_2$ have the same multiset of WL neighborhood labels that $G_1$ and $G_2$ also have the same collection of GNN neighborhood features

$$\left\{\left(h_v^{(i)}, \left\{h_u^{(i)} : u \in \mathcal{N}(v)\right\}\right)\right\} = \left\{\left(\varphi\left(l_v^{(i)}\right), \left\{\varphi\left(l_u^{(i)}\right) : u \in \mathcal{N}(v)\right\}\right)\right\}.$$

Thus, the collections $\left\{h_v^{(i+1)}\right\}$ are the same. In particular, we have the same collection of GNN node features $\left\{h_v^{(k)}\right\}$ for $G_1$ and $G_2$. Because the graph-level readout function is permutation invariant with respect to the collection of node features, $\mathcal{A}(G_1) = \mathcal{A}(G_2)$. Hence we have reached a contradiction. ∎

Appendix B Proof for Theorem 3

Proof.

Let $\mathcal{A}$ be a graph neural network for which the two conditions of Theorem 3 hold. Let $G_1$, $G_2$ be any graphs that the WL test decides as non-isomorphic at iteration $K$. Because the graph-level readout function is injective, i.e., it maps distinct multisets of node features to unique embeddings, it suffices to show that $\mathcal{A}$'s neighborhood aggregation process, with sufficient iterations, embeds $G_1$ and $G_2$ into different multisets of node features. Let us first assume $\mathcal{A}$ updates node representations as

$$h_v^{(k)} = \phi\left(h_v^{(k-1)},\ f\left(\left\{h_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right)\right)$$

with injective functions $f$ and $\phi$. The WL test applies a predetermined injective hash function $g$ to update the WL node labels $l_v^{(k)}$:

$$l_v^{(k)} = g\left(l_v^{(k-1)},\ \left\{l_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right).$$

We will show, by induction, that for any iteration $k$, there always exists an injective function $\varphi$ such that $h_v^{(k)} = \varphi\left(l_v^{(k)}\right)$. This apparently holds for $k = 0$ because the initial node features are the same for the WL test and the GNN on both $G_1$ and $G_2$, so $\varphi$ can be the identity function for $k = 0$. Suppose this holds for iteration $k-1$; we show that it also holds for $k$. Substituting $h_u^{(k-1)}$ with $\varphi\left(l_u^{(k-1)}\right)$ gives us

$$h_v^{(k)} = \phi\left(\varphi\left(l_v^{(k-1)}\right),\ f\left(\left\{\varphi\left(l_u^{(k-1)}\right) : u \in \mathcal{N}(v)\right\}\right)\right).$$

It follows from the fact that the composition of injective functions is injective that there exists some injective function $\psi$ so that

$$h_v^{(k)} = \psi\left(l_v^{(k-1)},\ \left\{l_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right).$$

Then we have

$$h_v^{(k)} = \psi \circ g^{-1}\, g\left(l_v^{(k-1)},\ \left\{l_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right) = \psi \circ g^{-1}\left(l_v^{(k)}\right),$$

and $\varphi = \psi \circ g^{-1}$ is injective because the composition of injective functions is injective. Hence for any iteration $k$, there always exists an injective function $\varphi$ such that $h_v^{(k)} = \varphi\left(l_v^{(k)}\right)$. At the $K$-th iteration, the WL test decides that $G_1$ and $G_2$ are non-isomorphic, that is, the multisets $\left\{l_v^{(K)}\right\}$ are different for $G_1$ and $G_2$. The graph neural network $\mathcal{A}$'s node embeddings $\left\{h_v^{(K)}\right\} = \left\{\varphi\left(l_v^{(K)}\right)\right\}$ must also be different for $G_1$ and $G_2$ because of the injectivity of $\varphi$.

Now let us prove the theorem for the case where $\mathcal{A}$ aggregates the central node along with its neighbors,

$$h_v^{(k)} = f\left(\left\{h_u^{(k-1)} : u \in \mathcal{N}(v) \cup \{v\}\right\}\right).$$

The difficulty in proving the theorem for this form of aggregation mainly lies in the fact that it does not immediately distinguish the root, or central node, from its neighbors. For example, consider two adjacent nodes $v$ and $v'$ of a chain graph that carry different features but whose closed neighborhoods $\mathcal{N}(v) \cup \{v\}$ and $\mathcal{N}(v') \cup \{v'\}$ contain the same multiset of node features. If we aggregate once at $v$ and at $v'$, we essentially get the same new representation for both, although the WL test would have represented these nodes as two different rooted trees (different roots over the same unrooted multiset). The key insight for resolving this is that it is possible to distinguish the two structures (different roots, but the same neighborhood plus central node) with two iterations of aggregation, unless the structures are symmetric, that is, the two adjacent nodes (roots) in consideration are both adjacent to the same neighbors, in which case there is no need to distinguish them because the multisets of node features are the same. Before we formally prove the theorem, let us look at the chain graph example again to get more intuition. If we apply the neighborhood aggregation a second time, the features at $v$ and $v'$ now encode their 2-hop neighborhoods, which differ, so we can successfully distinguish the two structures rooted at $v$ and $v'$. However, if we instead take $v$ and $v'$ to be adjacent nodes of a complete graph, where every node is adjacent to all others, the representations of $v$ and $v'$ stay the same even after applying the aggregation twice. But that is fine, because the corresponding rooted structures are genuinely symmetric and we end up with the same, and thus correct, collection of node features. In summary, unless the two rooted subtrees under consideration are "symmetric", after two iterations we can always recover the root and thus distinguish different rooted trees; if they are symmetric, we obtain the same multiset of node features, which is the correct representation. Next, we present a more formal argument.

Let $v$ and $v'$ be adjacent nodes, and suppose we are interested in distinguishing the rooted tree structures rooted at $v$ in graph $G$ and at $v'$ in graph $G'$, where both rooted trees have the same unrooted multiset representation, i.e., $\left\{h_u : u \in \mathcal{N}_G(v) \cup \{v\}\right\} = \left\{h_u : u \in \mathcal{N}_{G'}(v') \cup \{v'\}\right\}$. Here, we want our graph neural network to produce a unique embedding for distinct structures, and the goal is to recover the root from the unrooted multiset representations; then we can apply the same argument as for the "AGGREGATE and then COMBINE" aggregation process. Let us consider the node representations at $v$ and $v'$ after two iterations of the aggregation above. With some abuse of notation, let the second-hop structure of $v$ in $G$ denote the multiset of closed-neighborhood multisets of the members of $\mathcal{N}_G(v)$, where $\mathcal{N}_G(u)$ denotes the neighborhood of a node $u$ in $G$; we define the second-hop structure of $v'$ in $G'$ similarly.

Suppose it is the case that, for $v$ and $v'$, both the closed neighborhoods and the second-hop structures coincide; that is, the two rooted trees from $G$ and $G'$ are, in fact, symmetric, in the sense that $v$, $v'$, and their shared neighbors form the same structure (e.g., the same triangle). Indeed, in this case we cannot distinguish the subtrees at $v$ and $v'$ after two iterations of aggregation. However, that is not a problem for distinguishing $G$ and $G'$, because both graphs then have the same local structure around $v$ and $v'$, and thus the same, and therefore correct, collection of node features. Aggregating the central node along with its neighbors does not reduce the discriminative power of the GNN in this case.

Now suppose we do not have the symmetric structure for $v$ and $v'$. Then either a) the (closed) neighborhoods of $v$ in $G$ and of $v'$ in $G'$ are no longer equivalent after the first iteration of aggregation, or b) their second-hop structures differ. After two iterations of aggregation, the node representation of $v$ in $G$ aggregates the first-iteration features of $\mathcal{N}_G(v) \cup \{v\}$, and likewise for $v'$ in $G'$. If either a) or b) holds, i.e., in the non-symmetric case, these node representations for $v$ and $v'$ are different. Our graph neural network thus successfully distinguishes the embeddings of the rooted subtrees, despite only using unrooted representations.

In conclusion, the aggregation process of the form

$$h_v^{(k)} = f\left(\left\{h_u^{(k-1)} : u \in \mathcal{N}(v) \cup \{v\}\right\}\right)$$

has the same discriminative power in embedding different graphs as the aggregation of the form

$$h_v^{(k)} = \phi\left(h_v^{(k-1)},\ f\left(\left\{h_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right)\right),$$

provided $f$ and $\phi$ are valid injective functions. Moreover, GNNs in both forms have the same discriminative power as the WL test. ∎

Appendix C Proof for Lemma 4

Proof.

We first prove that there exists a mapping $f$ so that $\sum_{x \in X} f(x)$ is unique for each finite multiset $X$. Because $\mathcal{X}$ is countable, there exists a mapping $Z: \mathcal{X} \to \mathbb{N}$ from $x \in \mathcal{X}$ to natural numbers. Because the multisets $X$ of interest are finite, there exists a number $N \in \mathbb{N}$ so that $|X| < N$ for all $X$. Then an example of such an $f$ is $f(x) = N^{-Z(x)}$. This $f$ can be viewed as a more compressed form of a one-hot vector or $N$-digit presentation. Thus, $h(X) = \sum_{x \in X} f(x)$ is an injective function of multisets.

$\phi\left(\sum_{x \in X} f(x)\right)$ is permutation invariant, so it is a well-defined multiset function. For any multiset function $g$, we can construct such a $\phi$ by letting $\phi\left(\sum_{x \in X} f(x)\right) = g(X)$. Note that such a $\phi$ is well-defined because $h(X) = \sum_{x \in X} f(x)$ is injective. ∎
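The construction in this proof is easy to check numerically. The snippet below is an illustration with an arbitrary three-element alphabet and exact rational arithmetic (the alphabet, the choice of $N$, and the example multisets are assumptions made for the demo).

```python
from collections import Counter
from fractions import Fraction

Z = {"a": 1, "b": 2, "c": 3}     # an enumeration Z(x) of a countable feature space
N = 10                           # any bound larger than the size of every multiset below

def f(x):
    return Fraction(1, N ** Z[x])            # f(x) = N^(-Z(x)), kept exact

def h(multiset):
    return sum(f(x) for x in multiset)       # h(X) = sum_{x in X} f(x)

X1, X2, X3 = ["a", "a", "b"], ["a", "b", "b"], ["b", "a", "a"]
print(h(X1) == h(X2))                        # False: different multisets, different sums
print(h(X1) == h(X3))                        # True: equal multisets give equal sums
print(Counter(X1) == Counter(X2))            # False, matching the h comparison above
```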

Appendix D Proof for Lemma 5

Proof.

Let us consider the example $X_1 = \{1, 1, 1, 1, 1\}$ and $X_2 = \{2, 3\}$, i.e., two different multisets of positive numbers that sum up to the same value. We will be using the homogeneity of ReLU.

Let $W$ be an arbitrary linear transform that maps $x \in X_1, X_2$ into $\mathbb{R}^n$. It is clear that, at the same coordinates, $Wx$ are either all positive or all negative for all $x$, because all $x$ in $X_1$ and $X_2$ are positive scalars. It follows that, at the same coordinate, $\mathrm{ReLU}(Wx)$ are either all positive or all $0$ for all $x$ in $X_1, X_2$. For the coordinates where the $\mathrm{ReLU}(Wx)$ are $0$, we have $\sum_{x \in X_1} \mathrm{ReLU}(Wx) = \sum_{x \in X_2} \mathrm{ReLU}(Wx)$. For the coordinates where the $Wx$ are positive, linearity still holds. It then follows from linearity that

$$\sum_{x \in X} \mathrm{ReLU}(Wx) = \mathrm{ReLU}\left(W \sum_{x \in X} x\right),$$

where $X$ could be $X_1$ or $X_2$. Because $\sum_{x \in X_1} x = \sum_{x \in X_2} x$, we have the following, as desired:

$$\sum_{x \in X_1} \mathrm{ReLU}(Wx) = \sum_{x \in X_2} \mathrm{ReLU}(Wx). \qquad ∎$$
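This collapse is easy to verify numerically; in the sketch below, the multisets $\{1,1,1,1,1\}$ and $\{2,3\}$ follow the proof, while the random matrices $W$ are arbitrary stand-ins for "any linear mapping".

```python
import numpy as np

X1 = [1.0] * 5        # multiset {1, 1, 1, 1, 1}
X2 = [2.0, 3.0]       # multiset {2, 3}; same total sum as X1

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
for _ in range(3):                                      # try several random linear maps
    W = rng.normal(size=(4, 1))                         # maps a scalar into R^4
    s1 = sum(relu(W @ np.array([x])) for x in X1)       # sum_x ReLU(W x) over X1
    s2 = sum(relu(W @ np.array([x])) for x in X2)       # sum_x ReLU(W x) over X2
    print(np.allclose(s1, s2))                          # True for every W
```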

Appendix E Proof for Corollary 6

Proof.

Suppose multisets $X_1$ and $X_2$ have the same distribution. Without loss of generality, let us assume $X_1 = (S, m)$ and $X_2 = (S, k \cdot m)$ for some $k \in \mathbb{N}_{\geq 1}$, i.e., $X_1$ and $X_2$ have the same underlying set and the multiplicity of each element in $X_2$ is $k$ times that in $X_1$. Then we have $|X_2| = k\,|X_1|$ and $\sum_{x \in X_2} f(x) = k \cdot \sum_{x \in X_1} f(x)$. Thus,

$$\frac{1}{|X_2|} \sum_{x \in X_2} f(x) = \frac{1}{k \cdot |X_1|} \cdot k \cdot \sum_{x \in X_1} f(x) = \frac{1}{|X_1|} \sum_{x \in X_1} f(x).$$

Now we show that there exists a function $f$ so that $\frac{1}{|X|} \sum_{x \in X} f(x)$ is unique for multisets of different distributions. Because $\mathcal{X}$ is countable, there exists a mapping $Z: \mathcal{X} \to \mathbb{N}$ from $x \in \mathcal{X}$ to natural numbers. Because the multisets $X$ of interest are finite, there exists a number $N \in \mathbb{N}$ so that $|X| < N$ for all $X$. Then an example of such an $f$ is $f(x) = N^{-2Z(x)}$. ∎

Appendix F Proof for Corollary 7

Proof.

Suppose multisets $X_1$ and $X_2$ have the same underlying set $S$. Then we have

$$\max_{x \in X_1} f(x) = \max_{x \in S} f(x) = \max_{x \in X_2} f(x).$$

Now we show that there exists a mapping $f$ so that $\max_{x \in X} f(x)$ is unique for $X$'s with the same underlying set. Because $\mathcal{X}$ is countable, there exists a mapping $Z: \mathcal{X} \to \mathbb{N}$ from $x \in \mathcal{X}$ to natural numbers. Then an example of such an $f: \mathcal{X} \to \mathbb{R}^\infty$ is defined as $f_i(x) = 1$ for $i = Z(x)$ and $f_i(x) = 0$ otherwise, where $f_i(x)$ is the $i$-th coordinate of $f(x)$. Such an $f$ essentially maps a multiset to the one-hot embedding of its underlying set. ∎

Appendix G Details of datasets

We give detailed descriptions of the datasets used in our experiments. Further details can be found in Yanardag & Vishwanathan (2015).

Social networks datasets.

IMDB-BINARY and IMDB-MULTI are movie collaboration datasets. Each graph corresponds to an ego-network of an actor/actress, where nodes correspond to actors/actresses and an edge is drawn between two actors/actresses if they appear in the same movie. Each graph is derived from a pre-specified genre of movies, and the task is to classify the genre the graph is derived from. REDDIT-BINARY and REDDIT-MULTI5K are balanced datasets where each graph corresponds to an online discussion thread and nodes correspond to users. An edge is drawn between two nodes if at least one of them responded to the other's comment. The task is to classify each graph to the community or subreddit it belongs to. COLLAB is a scientific collaboration dataset, derived from 3 public collaboration datasets, namely High Energy Physics, Condensed Matter Physics, and Astro Physics. Each graph corresponds to the ego-network of a researcher from one of these fields, and the task is to classify each graph to the field the corresponding researcher belongs to.

Bioinformatics datasets.

MUTAG is a dataset of 188 mutagenic aromatic and heteroaromatic nitro compounds with 7 discrete labels. PROTEINS is a dataset where nodes are secondary structure elements (SSEs) and there is an edge between two nodes if they are neighbors in the amino-acid sequence or in 3D space. It has 3 discrete labels, representing helix, sheet or turn. PTC is a dataset of 344 chemical compounds that reports the carcinogenicity for male and female rats and it has 19 discrete labels. NCI1 is a dataset made publicly available by the National Cancer Institute (NCI) and is a subset of balanced datasets of chemical compounds screened for ability to suppress or inhibit the growth of a panel of human tumor cell lines, having 37 discrete labels.