Nested Graph Neural Networks

by   Muhan Zhang, et al.
Peking University

The success of graph neural networks (GNNs) in graph classification is closely related to the Weisfeiler-Lehman (1-WL) algorithm. By iteratively aggregating neighboring node features to a center node, both 1-WL and GNNs obtain a node representation that encodes a rooted subtree around the center node. These rooted subtree representations are then pooled into a single representation of the whole graph. However, rooted subtrees have limited expressiveness for representing a non-tree graph. To address this, we propose Nested Graph Neural Networks (NGNNs). NGNN represents a graph with rooted subgraphs instead of rooted subtrees, so that two graphs sharing many identical subgraphs (rather than subtrees) tend to have similar representations. The key is to make each node representation encode a subgraph around it rather than a subtree. To achieve this, NGNN extracts a local subgraph around each node and applies a base GNN to each subgraph to learn a subgraph representation. The whole-graph representation is then obtained by pooling these subgraph representations. We provide a rigorous theoretical analysis showing that NGNN is strictly more powerful than 1-WL. In particular, we prove that NGNN can discriminate almost all r-regular graphs, where 1-WL always fails. Moreover, unlike other more powerful GNNs, NGNN only introduces a constant-factor higher time complexity than standard GNNs. NGNN is a plug-and-play framework that can be combined with various base GNNs. We test NGNN with different base GNNs on several benchmark datasets. NGNN uniformly improves their performance and shows highly competitive performance on all datasets.




1 Introduction

Graphs are an important tool for modeling relational data in the real world. Representation learning over graphs has become a popular topic of machine learning in recent years. While network embedding methods, such as DeepWalk (Perozzi et al., 2014), can learn node representations well, they fail to generalize to whole-graph representations, which are crucial for applications such as graph classification, molecule modeling, and drug discovery. In contrast, although traditional graph kernels (Haussler, 1999; Shervashidze et al., 2009; Kondor et al., 2009; Borgwardt and Kriegel, 2005; Neumann et al., 2016; Shervashidze et al., 2011) can be used for graph classification, they often define graph similarity in a heuristic way that is not parameterized and lacks the flexibility to deal with features.

In this context, graph neural networks (GNNs) have regained people’s attention and become the state-of-the-art graph representation learning tool (Scarselli et al., 2009; Bruna et al., 2013; Duvenaud et al., 2015; Li et al., 2015; Kipf and Welling, 2016; Defferrard et al., 2016; Dai et al., 2016; Veličković et al., 2017; Zhang et al., 2018; Ying et al., 2018). GNNs use message passing to propagate features between connected nodes. By iteratively aggregating neighboring node features to the center node, GNNs learn node representations encoding their local structure and feature information. These node representations can be further pooled into a graph representation, enabling graph-level tasks such as graph classification. In this paper, we will use message passing GNNs to denote this class of GNNs based on repeated neighbor aggregation (Gilmer et al., 2017), in order to distinguish them from some high-order GNN variants (Morris et al., 2019; Maron et al., 2019a; Chen et al., 2019) where the effective message passing happens between high-order node tuples instead of nodes.

GNNs' message passing scheme mimics the 1-dimensional Weisfeiler-Lehman (1-WL) algorithm (Weisfeiler and Lehman, 1968), which iteratively refines a node's color according to its current color and the multiset of its neighbors' colors. This procedure essentially encodes a rooted subtree around each node into its final color, where the rooted subtree is constructed by recursively expanding the neighbors of the root node. One critical reason for GNNs' success in graph classification is that two graphs sharing many identical or similar rooted subtrees are more likely to be classified into the same class, which aligns with the inductive bias that two graphs are similar if they have many common substructures (Vishwanathan et al., 2010).

Despite this, rooted subtrees are still limited in terms of expressing all possible substructures that can appear in a graph. Two graphs that share many identical rooted subtrees may still be dissimilar because their other substructure patterns differ. Take the two graphs $G_1$ and $G_2$ in Figure 1 as an example. If we apply 1-WL or a message passing GNN to them, the two graphs will always have the same representation no matter how many iterations/layers we use. This is because all nodes in the two graphs have identical rooted subtrees across all tree heights. However, the two graphs are quite different from a holistic perspective: $G_1$ is composed of two triangles, while $G_2$ is a hexagon. The intrinsic reason for such a failure is that rooted subtrees have limited expressiveness for representing general graphs, especially those with cycles.

Figure 1: The two original graphs $G_1$ and $G_2$ are non-isomorphic. $G_1$ is composed of two triangles, while $G_2$ is a hexagon. However, neither 1-WL nor message passing GNNs can differentiate them, since all nodes in the two graphs share identical rooted subtrees at any height (see the rooted subtrees around $v_1$ and $v_2$ in the middle block for example). In comparison, we can discriminate the two graphs by comparing their height-1 rooted subgraphs around any nodes. For example, the height-1 rooted subgraph around $v_1$ is a closed triangle, but the height-1 rooted subgraph around $v_2$ is an open triangle (see the red boxes in the right block).

To address this issue, we propose Nested Graph Neural Networks (NGNNs). The core idea is that, instead of encoding a rooted subtree, the final representation of a node should encode a rooted subgraph (local $h$-hop subgraph) around it. The subgraph is not restricted to any particular graph type such as a tree, but serves as a general description of the local neighborhood around a node. Rooted subgraphs offer much better representation power than rooted subtrees, e.g., we can easily discriminate the two graphs in Figure 1 by only comparing their height-1 rooted subgraphs.

To represent a graph with rooted subgraphs, NGNN uses two levels of GNNs: base (inner) GNNs and an outer GNN. By extracting a local rooted subgraph around each node, NGNN first applies a base GNN to each node’s subgraph independently. Then, a subgraph pooling layer is applied to each subgraph to aggregate the intermediate node representations into a subgraph representation. This subgraph representation is used as the final representation of the root node. Rather than encoding a rooted subtree, this final node representation encodes the local subgraph around it, which contains more information than a subtree. Finally, all the final node representations are further fed into an outer GNN to learn a representation for the entire graph. Figure 2 shows one NGNN implementation using message passing GNNs as the base GNNs and a simple graph pooling layer as the outer GNN.

One may wonder why NGNN is more powerful than a GNN if the base GNN, being message-passing-based, still seems to learn only rooted subtrees. One key reason lies in the subgraph pooling layer. Take the height-1 rooted subgraphs (marked with red boxes) around $v_1$ and $v_2$ in Figure 1 as an example. Although the height-1 rooted subtrees of $v_1$ and $v_2$ are still the same, their neighbors (labeled by 1 and 2) have different height-1 rooted subtrees. Thus, applying a one-layer message passing GNN plus a subgraph pooling as the base GNN is sufficient to discriminate $G_1$ and $G_2$.

Figure 2: A particular implementation of the NGNN framework. It first extracts (copies) a rooted subgraph (height=1) around each node from the original graph, and then applies a base GNN with a subgraph pooling layer to each rooted subgraph independently to learn a subgraph representation. The subgraph representation is used as the root node’s final representation in the original graph. Then, a graph pooling layer is used to summarize the final node representations into a graph representation.

The NGNN framework has several distinctive advantages. Firstly, it allows freely choosing the base GNN, and can enhance the base GNN's representation power in a plug-and-play fashion. Theoretically, we prove that NGNN is more powerful than message passing GNNs and 1-WL by being able to discriminate almost all $r$-regular graphs (where 1-WL always fails). Secondly, by extracting rooted subgraphs, NGNN allows augmenting the initial features of a node with subgraph-specific structural features such as distance encoding (Li et al., 2020b) to improve the quality of the learned node representations. Thirdly, unlike other more powerful graph neural networks, especially those based on higher-order WL tests (Morris et al., 2019; Maron et al., 2019a; Chen et al., 2019; Morris et al., 2020), NGNN still has linear time and space complexity w.r.t. graph size like standard message passing GNNs, thus maintaining good scalability. We demonstrate the effectiveness of the NGNN framework on various synthetic and real-world graph classification/regression datasets. On synthetic datasets, NGNN demonstrates higher-than-1-WL expressive power, matching our theorem very well. On real-world datasets, NGNN consistently enhances the performance of a wide range of base GNNs, achieving highly competitive results on all datasets.

2 Preliminaries

2.1 Notation and problem definition

We consider the graph classification/regression problem. Given a graph $G = (V, E)$ where $V$ is the node set and $E$ is the edge set, we aim to learn a function mapping $G$ to its class or target value $y$. The nodes and edges in $G$ can have feature vectors associated with them, denoted by $x_v$ (for node $v$) and $e_{vu}$ (for edge $(v, u)$), respectively.

2.2 Weisfeiler-Lehman test

The Weisfeiler-Lehman (1-WL) test (Weisfeiler and Lehman, 1968) is a popular algorithm for graph isomorphism checking. The classical 1-WL works as follows. At first, all nodes receive a color 1. Each node collects its neighbors' colors into a multiset. Then, 1-WL updates each node's color so that two nodes get the same new color if and only if their current colors are the same and they have identical multisets of neighbor colors. This process is repeated until the number of colors does not increase between two iterations. 1-WL then reports that two graphs are non-isomorphic if their node colors differ at some iteration; otherwise it fails to determine whether they are non-isomorphic. See (Shervashidze et al., 2011; Zhang and Chen, 2017) for more details.
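The color refinement described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `wl_histograms` and `cycle` are hypothetical, graphs are assumed to be adjacency dictionaries, and a fixed number of iterations stands in for the stopping criterion. A palette shared across graphs keeps colors comparable between them.

```python
from collections import Counter

def wl_histograms(graphs, num_iters):
    """Run 1-WL color refinement jointly on several graphs.

    Each graph is a dict {node: list of neighbors}. Returns one color
    histogram per graph; different histograms imply non-isomorphism.
    """
    colors = [{v: 0 for v in g} for g in graphs]   # uniform initial color
    for _ in range(num_iters):
        palette = {}                                # signature -> new color id
        new_colors = []
        for g, col in zip(graphs, colors):
            new_col = {}
            for v in g:
                # Signature: own color plus the multiset of neighbor colors.
                sig = (col[v], tuple(sorted(col[u] for u in g[v])))
                new_col[v] = palette.setdefault(sig, len(palette))
            new_colors.append(new_col)
        colors = new_colors
    return [Counter(col.values()) for col in colors]

def cycle(nodes):
    """Adjacency dict of a cycle through `nodes` (illustrative helper)."""
    n = len(nodes)
    return {nodes[i]: [nodes[(i - 1) % n], nodes[(i + 1) % n]] for i in range(n)}

two_triangles = {**cycle([0, 1, 2]), **cycle([3, 4, 5])}
hexagon = cycle([0, 1, 2, 3, 4, 5])
path = {0: [1], 1: [0, 2], 2: [1]}

h_tri, h_hex = wl_histograms([two_triangles, hexagon], num_iters=3)
print(h_tri == h_hex)   # 1-WL cannot separate the Figure 1 pair
```

On the two graphs of Figure 1 the histograms coincide at every iteration, because every node has degree 2 and therefore always receives the same signature.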

1-WL essentially encodes the rooted subtrees around each node at different heights into its color representations. Figure 1 middle shows the rooted subtrees around $v_1$ and $v_2$. Two nodes will have the same color at iteration $h$ if and only if their height-$h$ rooted subtrees are the same.

3 Nested Graph Neural Network

In this section, we introduce our Nested Graph Neural Network (NGNN) framework and theoretically demonstrate its higher representation power than message passing GNNs.

3.1 Limitations of the message passing GNNs

Most existing GNNs follow the message passing framework (Gilmer et al., 2017): given a graph $G$, each node's hidden state $h_v^t$ is updated based on its previous state and the messages $m_v^{t+1}$ from its neighbors:

$$ m_v^{t+1} = \sum_{u \in N(v|G)} M_t(h_v^t, h_u^t, e_{vu}), \qquad h_v^{t+1} = U_t(h_v^t, m_v^{t+1}). \tag{1} $$

Here $M_t, U_t$ are the message and update functions at time stamp $t$, $e_{vu}$ is the feature of edge $(v, u)$, and $N(v|G)$ is the set of $v$'s neighbors in graph $G$. The initial hidden states $h_v^0$ are given by the raw node features $x_v$. After $T$ time stamps (iterations), the final node representations $h_v^T$ are summarized into a whole-graph representation $h_G$ with a readout (pooling) function $R$ (e.g., mean or sum):

$$ h_G = R(\{h_v^T \mid v \in G\}). \tag{2} $$

Such a message passing (or neighbor aggregation) scheme iteratively aggregates neighbor information into a center node’s hidden state, making it encode a local rooted subtree around the node. The final node representations will contain both the local structure and feature information around nodes, enabling node-level tasks such as node classification. After a pooling layer, these node representations can be further summarized into a graph representation, enabling graph-level tasks. When there is no edge feature and the node features are from a countable space, it is shown that message passing GNNs are at most as powerful as the 1-WL test for discriminating non-isomorphic graphs (Xu et al., 2018; Morris et al., 2019).
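As a toy illustration of this scheme, the sketch below runs sum-aggregation message passing with scalar hidden states followed by a sum readout. The specific message and update functions (pass the neighbor state, then add twice the aggregated message) are arbitrary assumptions chosen only to keep the example short; edge features are ignored, and all names are hypothetical.

```python
def cycle(nodes):
    """Adjacency dict of a cycle through `nodes` (illustrative helper)."""
    n = len(nodes)
    return {nodes[i]: [nodes[(i - 1) % n], nodes[(i + 1) % n]] for i in range(n)}

def message_passing(adj, x, num_layers):
    """Toy message passing: message = neighbor state, update = h + 2 * m."""
    h = dict(x)                                           # initial states = raw features
    for _ in range(num_layers):
        m = {v: sum(h[u] for u in adj[v]) for v in adj}   # aggregate neighbor messages
        h = {v: h[v] + 2 * m[v] for v in adj}             # update hidden states
    return h

def readout(h):
    """Sum pooling over the final node states."""
    return sum(h.values())

two_triangles = {**cycle([0, 1, 2]), **cycle([3, 4, 5])}
hexagon = cycle([0, 1, 2, 3, 4, 5])
ones = {v: 1 for v in range(6)}                           # no discriminative features

r_tri = readout(message_passing(two_triangles, ones, 3))
r_hex = readout(message_passing(hexagon, ones, 3))
print(r_tri == r_hex)
```

With identical initial features, every node in both graphs sees two neighbors carrying the same state, so the two whole-graph representations agree regardless of the number of layers.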

An $h$-layer message passing GNN gives two nodes the same final representation if they have identical height-$h$ rooted subtrees (i.e., both the structures and the features on the corresponding nodes/edges are the same). If two graphs have many identical (or similar) rooted subtrees, they will also have similar graph representations after pooling. This insight is crucial for the success of modern GNNs in graph classification, because it aligns with the inductive bias that two graphs are similar if they have many common substructures. Such insight has also been used in designing the WL subtree kernel (Shervashidze et al., 2011), a state-of-the-art graph classification method before GNNs.

However, message passing GNNs have several limitations. Firstly, a rooted subtree is only one specific substructure. It is not general enough to represent arbitrary subgraphs, especially those with cycles, due to the natural restriction of the tree structure. Secondly, using rooted subtrees as the elementary substructure results in a discriminating power bounded by the 1-WL test. For example, no two $n$-node $r$-regular graphs can be discriminated by message passing GNNs. Thirdly, standard message passing GNNs do not allow using root-node-specific structural features (such as the distance between a node and the root node) to improve the quality of the learned root node's representation. We need to break through these limitations in order to design more powerful GNNs.

3.2 The NGNN framework

To address the above limitations, we propose the Nested Graph Neural Network (NGNN) framework. NGNN no longer aims to encode a rooted subtree around each node. Instead, in NGNN, each node's final representation encodes the general local subgraph around it rather than a subtree, so that two graphs sharing many identical or similar rooted subgraphs will have similar representations.

Definition 1.

(Rooted subgraph) Given a graph $G$ and a node $v$, the height-$h$ rooted subgraph $G_v^h$ of $v$ is the subgraph induced from $G$ by the nodes within $h$ hops of $v$ (including $h$-hop nodes).
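A minimal sketch of this definition, assuming graphs are stored as adjacency dictionaries: a breadth-first search collects the nodes within the given number of hops, and the induced subgraph keeps every edge of the original graph whose two endpoints were both collected. The name `rooted_subgraph` is hypothetical.

```python
from collections import deque

def rooted_subgraph(adj, root, height):
    """Return (sub_adj, dist) for the rooted subgraph of `root`.

    sub_adj is the induced adjacency dict; dist maps each kept node to
    its hop distance from the root.
    """
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] == height:          # do not expand beyond `height` hops
            continue
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    # Induced subgraph: keep all edges between collected nodes.
    sub_adj = {v: [u for u in adj[v] if u in dist] for v in dist}
    return sub_adj, dist

def num_edges(sub_adj):
    return sum(len(nbrs) for nbrs in sub_adj.values()) // 2

triangle = {0: [2, 1], 1: [0, 2], 2: [1, 0]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

closed, _ = rooted_subgraph(triangle, 0, 1)
opened, _ = rooted_subgraph(hexagon, 0, 1)
print(num_edges(closed), num_edges(opened))
```

Both height-1 subgraphs contain three nodes, but the one taken from a triangle keeps three edges while the one taken from the hexagon keeps only two, which is the closed versus open triangle contrast of Figure 1.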

To make a node's final representation encode a rooted subgraph, we need to compute a subgraph representation. To achieve this, we resort to an arbitrary GNN, which we call the base GNN of NGNN. For example, the base GNN can be simply a message passing GNN, which performs message passing within each rooted subgraph to learn an intermediate representation for every node of the subgraph, and then uses a pooling layer to summarize a subgraph representation from the intermediate node representations. This subgraph representation is used as the final representation of the root node in the original graph. Take root node $w$ as an example. We first perform $T$ rounds of message passing within node $w$'s rooted subgraph $G_w^h$. Let $v$ be any node appearing in $G_w^h$. We have

$$ m_{v,G_w^h}^{t+1} = \sum_{u \in N(v|G_w^h)} M_t(h_{v,G_w^h}^t, h_{u,G_w^h}^t, e_{vu}), \qquad h_{v,G_w^h}^{t+1} = U_t(h_{v,G_w^h}^t, m_{v,G_w^h}^{t+1}). \tag{3} $$

Here $M_t, U_t$ are the message and update functions of the base GNN at time stamp $t$, $N(v|G_w^h)$ denotes the set of $v$'s neighbors within $w$'s rooted subgraph $G_w^h$, and $h_{v,G_w^h}^t$ and $m_{v,G_w^h}^{t+1}$ denote node $v$'s hidden state and message specific to rooted subgraph $G_w^h$ at time stamp $t$. Note that when node $v$ appears in different nodes' rooted subgraphs, its hidden states and messages will also be different. This is in contrast to standard GNNs, where a node's hidden state and message at time $t$ are the same regardless of which root node it contributes to. For example, $h_v^t$ and $m_v^{t+1}$ in Eq. 1 do not depend on any particular rooted subgraph.

After $T$ rounds of message passing, we apply a subgraph pooling layer to summarize a subgraph representation $h_w$ from the intermediate node representations $\{h_{v,G_w^h}^T \mid v \in G_w^h\}$:

$$ h_w = R_0(\{h_{v,G_w^h}^T \mid v \in G_w^h\}), \tag{4} $$

where $R_0$ is the subgraph pooling layer. This subgraph representation $h_w$ will be used as root node $w$'s final representation in the original graph. Note that the base GNNs are simultaneously applied to all nodes' rooted subgraphs to return a final node representation for every node in the original graph, and all the base GNNs share the same parameters. With such node representations, NGNN uses an outer GNN to further process and aggregate them into a graph representation of the whole graph. For simplicity, we let the outer GNN be simply a graph pooling layer denoted by $R_1$:

$$ h_G = R_1(\{h_w \mid w \in G\}). \tag{5} $$

The Nested GNN framework can be understood as a two-level GNN, or a GNN of GNNs—the inner subgraph-level GNNs (base GNNs) are used to learn node representations from their rooted subgraphs, while the outer graph-level GNN is used to return a whole-graph representation from the inner GNNs’ outputs. The inner GNNs all share the same parameters which are trained end-to-end with the outer GNN. Figure 2 depicts the implementation of the NGNN framework described above.
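The two-level structure can be sketched end to end in plain Python. This is an illustrative stand-in rather than the paper's implementation: learned message, update, and pooling functions are replaced by nested tuples and sorted multisets, which are injective by construction, and all function names are hypothetical.

```python
from collections import deque

def rooted_subgraph(adj, root, height):
    """Induced subgraph on the nodes within `height` hops of `root`."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] < height:
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
    return {v: [u for u in adj[v] if u in dist] for v in dist}

def base_gnn(sub_adj, num_layers):
    """Message passing inside one rooted subgraph, then subgraph pooling."""
    h = {v: () for v in sub_adj}                      # identical initial features
    for _ in range(num_layers):
        # New state = (old state, multiset of neighbor states within the subgraph).
        h = {v: (h[v], tuple(sorted(h[u] for u in sub_adj[v]))) for v in sub_adj}
    return tuple(sorted(h.values()))                  # pool all node states

def ngnn(adj, height, num_layers):
    """Apply the base GNN to every rooted subgraph, then pool over roots."""
    node_reps = [base_gnn(rooted_subgraph(adj, w, height), num_layers) for w in adj]
    return tuple(sorted(node_reps))                   # outer graph pooling

def cycle(nodes):
    n = len(nodes)
    return {nodes[i]: [nodes[(i - 1) % n], nodes[(i + 1) % n]] for i in range(n)}

two_triangles = {**cycle([0, 1, 2]), **cycle([3, 4, 5])}
hexagon = cycle([0, 1, 2, 3, 4, 5])
print(ngnn(two_triangles, 1, 1) != ngnn(hexagon, 1, 1))
```

With height-1 subgraphs and a single layer, the subgraph pooling already separates the two graphs of Figure 1: inside a triangle every node has two neighbors, while inside the hexagon's open triangle the two non-root nodes have only one.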

Compared to message passing GNNs, NGNN changes the “receptive field” of each node from a rooted subtree to a rooted subgraph, in order to capture better local substructure information. The rooted subgraph is read by a base GNN to learn a subgraph representation. Finally, the outer GNN reads the subgraph representations output by the base GNNs to return a graph representation.

Note that, when we apply the base GNN to a rooted subgraph, this rooted subgraph is extracted (copied) out of the original graph and treated as a completely independent graph from the other rooted subgraphs and the original graph. This allows the same node to have different representations within different rooted subgraphs. For example, in Figure 2, the same node appears in four different rooted subgraphs. Sometimes it is the root node, while other times it is a 1-hop neighbor of the root node. NGNN enables learning different representations for the same node when it appears in different rooted subgraphs, in contrast to standard GNNs where a node only has one single representation at one time stamp (Eq. 1). Similarly, NGNN also enables using different initial features for the same node when it appears in different rooted subgraphs. This allows us to customize a node's initial features based on its structural role within a rooted subgraph, as opposed to using the same initial features for a node across all rooted subgraphs. For example, we can optionally augment node $v$'s initial features with the distance between node $v$ and the root: when node $v$ is the root node, we give it an additional feature $0$; and when $v$ is a $k$-hop neighbor of the root, we give it an additional feature $k$. Such feature augmentation may help better capture a node's structural role within a rooted subgraph. It is a distinctive advantage of NGNN and is not possible in standard GNNs.
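The distance-based augmentation can be sketched as follows, assuming adjacency dictionaries and arbitrary raw features. Each node receives one feature tuple per rooted subgraph it appears in, with the hop distance to that subgraph's root appended; the function names are hypothetical.

```python
from collections import deque

def distance_to_root(adj, root, height):
    """Hop distance from `root` for every node within `height` hops."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] < height:
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
    return dist

def augmented_features(adj, x, height):
    """For every root w, map each node v of w's rooted subgraph to (x[v], distance)."""
    return {
        w: {v: (x[v], d) for v, d in distance_to_root(adj, w, height).items()}
        for w in adj
    }

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
x = {v: "C" for v in path}                 # placeholder raw features
feats = augmented_features(path, x, height=1)
print(feats[0][1], feats[1][1])            # node 1 as a neighbor, then as the root
```

Node 1 gets the extra feature 1 inside node 0's subgraph and 0 inside its own, which a standard GNN with one feature vector per node cannot express.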

3.3 The representation power of NGNN

We theoretically characterize the additional expressive power of NGNN (using message passing GNNs as base GNNs) as opposed to standard message passing GNNs. We focus on the ability to discriminate regular graphs because they form an important category of graphs which standard GNNs cannot represent well. Using 1-WL or message passing GNNs, any two $n$-sized $r$-regular graphs will have the same representation, unless discriminative node features are available. In contrast, we prove that NGNN can distinguish almost all pairs of $n$-sized $r$-regular graphs regardless of node features.

Definition 2.

If the message passing (Eq. 3) and the two-level graph pooling (Eqs. 4,5) are all injective given input from a countable space, then the NGNN is called proper.

A proper NGNN always exists due to the representation power of fully-connected neural networks used for message passing and Deep Set for graph pooling (Zaheer et al., 2017). For all pairs of graphs that 1-WL can discriminate, there always exists a proper NGNN that can also discriminate them, because two graphs discriminated by 1-WL must have different multisets of rooted subtrees at some height $h$, and a rooted subtree is always included in a rooted subgraph with the same height.

Now we present our main theorem.

Theorem 1.

Consider all pairs of -sized -regular graphs, where . For any small constant , there exists a proper NGNN using at most -height rooted subgraphs and -layer message passing, which distinguishes almost all () such pairs of graphs.

We include the proof in Appendix A. Theorem 1 has three implications. Firstly, since NGNN can discriminate almost all -regular graphs where 1-WL always fails, it is strictly more powerful than 1-WL and message passing GNNs. Secondly, it implies that NGNN does not need to extract subgraphs with a too large height (about ) to be more powerful. Moreover, NGNN is already powerful with very few layers, i.e., an arbitrarily small constant times (as few as 1 layer). This benefit comes from the subgraph pooling (Eq. 4), freeing us from using deep base GNNs. We further conduct a simulation experiment in Appendix D to verify Theorem 1 by testing how well NGNN discriminates -regular graphs in practice. The results match almost perfectly with our theory.

Although NGNN is strictly more powerful than 1-WL and 2-WL (1-WL and 2-WL have the same discriminating power (Maron et al., 2019a)), it is unclear whether NGNN is more powerful than 3-WL. Our early-stage analysis shows that neither NGNN nor 3-WL can discriminate strongly regular graphs with the same parameters (Brouwer and Haemers, 2012). We leave the exact comparison between NGNN and 3-WL to future work.

3.4 Discussion

Base GNN. NGNN is a general plug-and-play framework to increase the power of a base GNN. For the base GNN, we are not restricted to message passing GNNs as described in Section 3.2. For example, we can also use GNNs approximating the power of higher-dimensional WL tests, such as 1-2-3-GNN Morris et al. (2019) and PPGN/Ring-GNN (Maron et al., 2019a; Chen et al., 2019), as the base GNN. In fact, one limitation of these high-order GNNs is their complexity. Using the NGNN framework we can greatly alleviate this by applying the higher-order GNN to multiple small rooted subgraphs instead of the whole graph. Suppose a rooted subgraph has at most nodes, then by applying a high-order GNN to all rooted subgraphs, we can reduce the time complexity from to .

Complexity. We compare the time complexity of NGNN (using message passing GNNs as base GNNs) with a standard message passing GNN. Suppose the graph has $n$ nodes with a maximum degree $d$, and the maximum number of nodes in a rooted subgraph is $c$. Each message passing iteration in a standard message passing GNN takes $O(nd)$ operations. In NGNN, we need to perform message passing over all nodes' rooted subgraphs, which takes $O(ncd)$. We will keep $c$ small (which can be achieved by using a small $h$) to improve NGNN's scalability. Additionally, a small $h$ enables the base GNN to focus on learning local subgraph patterns.
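To make the comparison concrete, the sketch below counts the messages sent in one iteration on a ring graph, once over the whole graph and once over all rooted subgraphs. The ring, the chosen heights, and the function names are illustrative assumptions; the count is the number of directed edges along which a message is sent.

```python
from collections import deque

def rooted_nodes(adj, root, height):
    """Nodes within `height` hops of `root` (breadth-first search)."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] < height:
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
    return set(dist)

def standard_messages(adj):
    """Messages per iteration on the whole graph: one per directed edge."""
    return sum(len(nbrs) for nbrs in adj.values())

def nested_messages(adj, height):
    """Messages per iteration summed over every node's rooted subgraph."""
    total = 0
    for w in adj:
        nodes = rooted_nodes(adj, w, height)
        total += sum(1 for v in nodes for u in adj[v] if u in nodes)
    return total

n = 100
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
print(standard_messages(ring), nested_messages(ring, 1), nested_messages(ring, 2))
```

On this 100-node ring the standard count is 200, while the nested counts are 400 for height 1 and 800 for height 2, so the overhead is a factor that depends on the subgraph size and not on the number of nodes.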

In Appendix B, we discuss some other design choices of NGNN.

4 Related work

Understanding GNNs' representation power is a fundamental problem in GNN research. Xu et al. (2018) and Morris et al. (2019) first proved that the discriminating power of message passing GNNs is bounded by the 1-WL test, namely they cannot discriminate two non-isomorphic graphs that 1-WL fails to discriminate (such as $r$-regular graphs). Since then, there has been increasing effort in enhancing GNNs' discriminating power beyond 1-WL (Morris et al., 2019; Chen et al., 2019; Maron et al., 2019a; Murphy et al., 2019; Li et al., 2020b; Bouritsas et al., 2020; You et al., 2021; Beaini et al., 2020; Morris et al., 2020). Many GNNs have been proposed to mimic higher-dimensional WL tests, such as 1-2-3-GNN (Morris et al., 2019), Ring-GNN (Chen et al., 2019) and PPGN (Maron et al., 2019a). However, these models generally require learning the representations of all node tuples of certain cardinality (e.g., node pairs, node triples and so on), and thus cannot leverage the sparsity of graph structure and are difficult to scale to large graphs. Some works study the universality of GNNs for approximating any invariant or equivariant functions over graphs (Maron et al., 2018; Chen et al., 2019; Maron et al., 2019b; Keriven and Peyré, 2019; Azizian and Lelarge, 2020). However, reaching universality would require polynomial($n$)-order tensors, which is of more theoretical than practical value.

Dasoulas et al. (2019) propose to augment nodes of identical attributes with different colors, which requires exhausting all the coloring choices to reach universality. Similarly, Relational Pooling (RP) (Murphy et al., 2019) uses the ensemble of permutation-aware functions over graphs to reach universality, which requires exhausting all permutations to achieve its theoretical power. Its local version, Local Relational Pooling (LRP) (Chen et al., 2020), applies RP over subgraphs around nodes, which is similar to our work yet still requires exhausting node permutations in local subgraphs and, moreover, loses RP's theoretical power. In contrast, NGNN maintains a controllable cost by only applying a message passing GNN to local subgraphs, and is guaranteed to be more powerful than 1-WL.

Because of the high cost of mimicking high-dimensional WL tests, several works have been proposed to increase GNNs' representation power within the message passing framework. Observing that different neighbors are indistinguishable during neighbor aggregation, some works propose to add one-hot node index features or random features to GNNs (Loukas, 2019; Sato et al., 2020). These methods work well when nodes naturally have distinct identities irrespective of the graph structure. However, although they make GNNs more discriminative, they also lose some of GNNs' generalization ability because they cannot guarantee that nodes with identical neighborhoods have the same embedding; the resulting models are also no longer permutation invariant. Repeating random initialization helps avoid this issue but converges much more slowly (Abboud et al., 2020). An exception is structural message-passing (SMP) (Vignac et al., 2020), which propagates one-hot node index features to learn a global feature matrix for each node. The feature matrix is further pooled to learn a permutation-invariant node representation.

On the contrary, some works propose to use structural features to augment GNNs without hurting the generalization ability of GNNs. SEAL (Zhang and Chen, 2018; Zhang et al., 2020), IGMC (Zhang and Chen, 2020) and DE (Li et al., 2020b) use distance-based features, where a distance vector w.r.t. the target node set to predict is calculated for each node as its additional features. Our NGNN framework is naturally compatible with such distance-based features due to its independent rooted subgraph processing. GSN (Bouritsas et al., 2020) uses the count of certain substructures to augment node/edge features, which also surpasses 1-WL theoretically. However, GSN needs a properly defined substructure set to incorporate domain-specific inductive biases, while NGNN aims to learn arbitrary substructures around nodes without the need to predefine a substructure set.

Concurrent to our work, You et al. (2021) propose Identity-aware GNN (ID-GNN). ID-GNN uses different weight parameters between each root node and its context nodes during message passing. It also extracts a rooted subgraph around each node, and thus can be viewed as a special case of NGNN with: 1) the number of message passing layers equivalent to the subgraph height, 2) directly using the root node’s intermediate representation as its final representation without subgraph pooling, and 3) augmenting initial node features with 0/1 “identity”. However, the extra power of ID-GNN only comes from the “identity” feature, while the power of NGNN comes from the subgraph pooling—without using any node features, NGNN is still provably more discriminative than 1-WL. Another similar work to ours is natural graph network (NGN) (de Haan et al., 2020). NGN argues that graph convolution weights need not be shared among all nodes but only (locally) isomorphic nodes. If we view our distance-based node features as refining the graph convolution weights so that nodes within a center node’s neighborhood are no longer treated symmetrically, then our NGNN reduces to an NGN.

The idea of independently performing message passing within -hop neighborhood is also explored in -hop GNN (Nikolentzos et al., 2020) and MixHop (Abu-El-Haija et al., 2019). However, MixHop directly concatenates the aggregation results of neighbors at different hops as the root representation, which ignores the connections between other nodes in the rooted subgraph. -hop GNN sequentially performs message passing for -hop, -hop, …, and 0-hop node (the update of -hop nodes depend on the updated states of -hop nodes), while NGNN simultaneously performs message passing for all nodes in the subgraph thus is more parallelizable. Both MixHop and -hop GNN directly use the root node’s representation as its final node representation. In contrast, NGNN uses a subgraph pooling to summarize all node representations within the subgraph as the final root representation, which distinguishes NGNN from other -hop models. As Theorem 1 shows, the subgraph pooling enables using a much smaller number of message passing layers (as small as 1) than the depth of the subgraph, while MixHop and -hop GNN always require . MixHop and -hop GNN also do not have the strong theoretical power of NGNN to discriminate -regular graphs. Like SEAL and -hop GNN, G-Meta (Huang and Zitnik, 2020) is another work extracting subgraphs around nodes/links. It focuses specifically on a meta-learning setting.

5 Experiments

In this section, we study the effectiveness of the NGNN framework for graph classification and regression tasks. In particular, we want to answer the following questions:

Q1 Can NGNN reach its theoretical power to discriminate 1-WL-indistinguishable graphs?
Q2 How often and how much does NGNN improve the performance of a base GNN?
Q3 How does NGNN perform in comparison to state-of-the-art GNN methods in open benchmarks?
Q4 How much extra computation time does NGNN incur?

We implement the NGNN framework based on the PyTorch Geometric library (Fey and Lenssen, 2019b). Our code is available at

5.1 Datasets

To answer Q1, we use a simulation dataset of $r$-regular graphs and the EXP dataset (Abboud et al., 2020) containing 600 pairs of 1-WL-indistinguishable but non-isomorphic graphs. To answer Q2, we use the QM9 dataset (Ramakrishnan et al., 2014; Wu et al., 2018) and the TU datasets (Kersting et al., 2016). QM9 contains 130K small molecules. The task here is to perform regression on twelve targets representing energetic, electronic, geometric, and thermodynamic properties, based on the graph structure and node/edge features. TU contains five graph classification datasets including D&D (Dobson and Doig, 2003), MUTAG (Debnath et al., 1991), PROTEINS (Dobson and Doig, 2003), PTC_MR (Toivonen et al., 2003), and ENZYMES (Schomburg et al., 2004). We used the datasets provided by PyTorch Geometric (Fey and Lenssen, 2019b), where for QM9 we performed unit conversions to match the units used by Morris et al. (2019). The evaluation metric is Mean Absolute Error (MAE) for QM9 and Accuracy (%) for TU. To answer Q3, we use two Open Graph Benchmark (OGB) datasets (Hu et al., 2020), ogbg-molhiv and ogbg-molpcba. The ogbg-molhiv dataset contains 41K small molecules, and the task is to classify whether a molecule inhibits HIV or not. ROC-AUC is used for evaluation. The ogbg-molpcba dataset contains 438K molecules with 128 classification tasks. The evaluation metric is Average Precision (AP) averaged over all the tasks. We include the statistics for QM9 and OGB datasets in Table 1.

| Dataset | #Graphs | Avg. #nodes | Avg. #edges | Split ratio | #Tasks | Task type | Metric |
|---|---|---|---|---|---|---|---|
| QM9 | 129,433 | 18.0 | 18.6 | 80/10/10 | 12 | Regression | MAE |
| ogbg-molhiv | 41,127 | 25.5 | 27.5 | 80/10/10 | 1 | Classification | ROC-AUC |
| ogbg-molpcba | 437,929 | 26.0 | 28.1 | 80/10/10 | 128 | Classification | AP |

Table 1: Statistics and evaluation metrics of the QM9 and OGB datasets.

5.2 Models

QM9. We use 1-GNN, 1-2-GNN, 1-3-GNN, and 1-2-3-GNN from (Morris et al., 2019) as both the baselines and the base GNNs of NGNN. Among them, 1-GNN is a standard message passing GNN with 1-WL power. 1-2-GNN is a GNN mimicking 2-WL, where message passing happens among 2-tuples of nodes. 1-3-GNN and 1-2-3-GNN mimic 3-WL, where message passing happens among 3-tuples of nodes. 1-2-GNN and 1-3-GNN use features computed by 1-GNN as initial node features, and 1-2-3-GNN uses the concatenated features from 1-2-GNN and 1-3-GNN. We additionally include numbers provided by (Wu et al., 2018) and Deep LRP (Chen et al., 2020) as baselines. Note that we omit more recent methods (Anderson et al., 2019; Klicpera et al., 2020; Qiao et al., 2020) using advanced physical representations calculated from angles, atom coordinates, and quantum mechanics, which may obscure the comparison of models’ pure graph representation power. For NGNN, we uniformly use height-3 rooted subgraphs. For a fair comparison, the base GNNs in NGNN use exactly the same hyperparameters as when they are used alone, except for 1-GNN, where we increase the number of message passing layers from 3 to 5 to make the number of layers larger than the subgraph height, similar to (Zeng et al., 2020). For subgraph pooling and graph pooling layers, we uniformly use mean pooling. All other settings follow (Morris et al., 2019).

TU. We use four widely adopted GNNs as the baselines and the base GNNs of NGNN: GCN (Kipf and Welling, 2016), GraphSAGE (Hamilton et al., 2017), GIN (Xu et al., 2018), and GAT (Veličković et al., 2017). Since TU datasets suffer from inconsistent evaluation standards (Errica et al., 2019), we uniformly use the 10-fold cross validation framework provided by PyTorch Geometric (Fey and Lenssen, 2019a) for all the models to ensure a fair comparison. For GNNs, we search over the number of message passing layers. For NGNNs, we similarly search over the subgraph height, so that both NGNNs and GNNs can have equal-depth local receptive fields. For NGNNs, we always use a predetermined number of message passing layers rather than searching it together with the subgraph height, because that would give NGNNs more hyperparameters to tune. All models have 32 hidden dimensions, and are trained for 100 epochs with a batch size of 128. For each fold, we record the test accuracy with the hyperparameters chosen based on the best validation performance of this fold. Finally, we report the average test accuracy across all 10 folds.
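The per-fold model selection described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `results` structure that maps each hyperparameter setting to a `(validation accuracy, test accuracy)` pair for every fold; the numbers are made up for the example and are not results from the paper.

```python
from statistics import mean

def cross_validated_accuracy(results):
    """results: one dict per fold, mapping a hyperparameter setting
    to a (validation_accuracy, test_accuracy) pair."""
    selected_test_scores = []
    for fold in results:
        # Pick the setting with the best validation accuracy in this fold only.
        best_setting = max(fold, key=lambda setting: fold[setting][0])
        # Record the test accuracy obtained with that setting.
        selected_test_scores.append(fold[best_setting][1])
    # Report the average test accuracy across folds.
    return mean(selected_test_scores)

# Hypothetical two-fold example with two candidate settings "a" and "b".
toy_results = [
    {"a": (0.70, 0.68), "b": (0.75, 0.71)},
    {"a": (0.80, 0.74), "b": (0.72, 0.79)},
]
print(cross_validated_accuracy(toy_results))
```

Note that the test accuracy is never used for selection: in the second toy fold, setting "b" has the higher test score but "a" is chosen because its validation score is higher.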

OGB. We use GNNs achieving top places on the OGB graph classification leaderboard (at the time of submission) as the baselines, including GCN (Kipf and Welling, 2016), GIN (Xu et al., 2018), DeeperGCN (Li et al., 2020a), Deep LRP (Chen et al., 2020), PNA (Corso et al., 2020), DGN (Beaini et al., 2020), GINE (Brossard et al., 2020), and PHC-GNN (Le et al., 2021). Note that high-order GNNs (Morris et al., 2019; Maron et al., 2019a; Chen et al., 2019; Morris et al., 2020) are not included here, because despite being theoretically more discriminative, they are not among the GNNs with the best empirical performance on modern large-scale graph benchmarks, and their complexity also raises a scalability issue. For NGNN, we use GIN as the base GNN (although GIN is not among the strongest baselines here). Some baselines additionally use the virtual node technique (Gilmer et al., 2017; Li et al., 2015; Ishiguro et al., 2019); these are marked by “*”. For NGNN, we search over both the subgraph height and the number of layers. We train the NGNN models for 100 and 150 epochs for ogbg-molhiv and ogbg-molpcba, respectively, and report the validation and test scores at the best validation epoch. We also find that our models are subject to high performance variance across epochs, likely due to the increased expressiveness. Thus, we save a model checkpoint every 10 epochs, and additionally report the ensemble performance obtained by averaging the predictions from all checkpoints. The final hyperparameter choices and more details about the experimental settings are included in Appendix C. All results are averaged over 10 independent runs.
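The checkpoint ensemble mentioned above amounts to an element-wise average of the predictions produced by each saved checkpoint. The sketch below assumes the predictions are per-example scores stored as plain lists; whether the released code averages probabilities or logits is not stated here, and the function name and numbers are hypothetical.

```python
def ensemble_predictions(checkpoint_predictions):
    """Average per-example predictions over checkpoints.

    checkpoint_predictions: list of equally long lists, one list per checkpoint.
    """
    num_checkpoints = len(checkpoint_predictions)
    # zip(*...) groups the predictions of all checkpoints for the same example.
    return [sum(scores) / num_checkpoints for scores in zip(*checkpoint_predictions)]

# Hypothetical scores from three checkpoints on four examples.
toy_predictions = [
    [0.9, 0.2, 0.4, 0.6],
    [0.7, 0.1, 0.6, 0.6],
    [0.8, 0.3, 0.5, 0.9],
]
print(ensemble_predictions(toy_predictions))
```

Averaging smooths out epoch-to-epoch fluctuations of individual checkpoints, which is the motivation given in the text for reporting the ensemble alongside the single best-validation epoch.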

In the following, we uniformly use “Nested GNN” to denote an NGNN model using “GNN” as the base GNN. For example, Nested GIN denotes an NGNN model using GIN (Xu et al., 2018) as the base GNN. For the NGNN models in QM9, TU and OGB datasets, we augment the initial features of a node with Distance Encoding (DE) (Li et al., 2020b), which uses the (generalized) distance between a node and the root as its additional feature, due to DE’s successful applications in link-level tasks (Zhang and Chen, 2018, 2020). Note that such feature augmentation is not applicable to the baseline models as discussed in Section 3.2. An ablation study on the effects of the DE features is included in Appendix E.
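As a rough illustration of this feature augmentation, the sketch below appends the shortest-path distance to the root to each node's feature vector inside a rooted subgraph. Using the plain shortest-path distance is an assumption made for the example (the text refers to a generalized distance), and `distance_to_root` and `augment_features` are hypothetical helper names.

```python
from collections import deque

def distance_to_root(adj, root, height):
    """Shortest-path distance from `root` for every node within `height` hops."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == height:
            continue  # nodes beyond the subgraph height are not part of the subgraph
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def augment_features(features, dist):
    """Append the distance to the root as one extra feature per subgraph node."""
    return {u: features[u] + [float(d)] for u, d in dist.items()}

# Path graph 0 - 1 - 2 - 3 with a constant one-dimensional feature per node.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
features = {u: [1.0] for u in path}

dist = distance_to_root(path, root=1, height=2)
print(augment_features(features, dist))
```

The extra feature depends on which node is the root, so the same node receives different augmented features in different rooted subgraphs. A plain GNN processes each graph once with no designated root, which is consistent with the remark that this augmentation does not apply to the baselines.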

5.3 Results and discussion

To answer Q1, we first run a simulation to test NGNN’s power for discriminating r-regular graphs. The results are presented in Appendix D. They match almost perfectly with Theorem 1, demonstrating that a practical NGNN can fulfil its theoretical power for discriminating r-regular graphs.

| Method | Test Accuracy |
|---|---|
| GCN-RNI (Abboud et al., 2020) | 98.0 ± 1.85 |
| PPGN (Maron et al., 2019a) | 50.0 ± 0.00 |
| 1-2-3-GNN (Morris et al., 2019) | 50.0 ± 0.00 |
| 3-GCN (Abboud et al., 2020) | 99.7 ± 0.004 |
| Nested GIN | 99.9 ± 0.26 |

Table 2: Results (%) on EXP.

We also test NGNN’s expressive power using the EXP dataset provided by Abboud et al. (2020), which contains 600 carefully constructed pairs of 1-WL-indistinguishable but non-isomorphic graphs. The two graphs in each pair have different labels, so a standard message passing GNN cannot predict both correctly, resulting in an expected classification accuracy of only 50%. We exactly follow the experimental settings of Abboud et al. (2020) and copy their baseline results. In Table 2, our Nested GIN model achieves a 99.9% classification accuracy, which outperforms all the baselines and distinguishes almost all the 1-WL-indistinguishable graph pairs. These results verify that NGNN’s expressive power is indeed beyond that of 1-WL and message passing GNNs.
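To illustrate what "1-WL-indistinguishable" means, the sketch below runs 1-WL color refinement on a toy pair of non-isomorphic graphs and compares the resulting color histograms. The pair (a 6-cycle versus two disjoint triangles) is a textbook example chosen for brevity and is not taken from the EXP dataset; `wl_histogram` is a hypothetical helper that keeps colors as nested tuples instead of hashing them.

```python
from collections import Counter

def wl_histogram(adj, rounds=3):
    """Histogram of 1-WL colors after `rounds` refinement steps,
    starting from identical colors on all nodes."""
    colors = {u: () for u in adj}
    for _ in range(rounds):
        # New color = (own color, sorted multiset of neighbor colors).
        colors = {
            u: (colors[u], tuple(sorted(colors[v] for v in adj[u])))
            for u in adj
        }
    return Counter(colors.values())

cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
path6 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

# Same histogram: 1-WL cannot tell the 6-cycle from the two triangles.
print(wl_histogram(cycle6) == wl_histogram(triangles))
# Different histogram: 1-WL does separate the 6-cycle from a 6-node path.
print(wl_histogram(cycle6) == wl_histogram(path6))
```

Because every node in both 2-regular graphs sees the same multiset of neighbor colors at every step, refinement never splits the colors. A model bounded by 1-WL therefore assigns both graphs the same representation and can be right on at most one graph of such a pair.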

To answer Q2, we adopt the QM9 and TU datasets. We show the QM9 results in Table 3. If the Nested version of a base GNN achieves a better result than the base GNN itself, we color that cell light green. As we can see, NGNN brings performance gains to all base GNNs on most targets, sometimes by large margins. We also show the results on TU in Table 4. NGNNs also show improvement over their base GNNs in most cases. These results indicate that NGNN is a general framework for improving a GNN’s power. We further compute the maximum reduction of MAE for QM9 and the maximum improvement of accuracy for TU before and after applying NGNN. NGNN reduces the MAE by up to 7.9 times for QM9, and increases the accuracy by up to 14.3% for TU. These results answer Q2, indicating that NGNN can bring steady and significant improvement to base GNNs.
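The two headline numbers can be reproduced from the table entries with a few lines of arithmetic. The sketch below assumes that "maximum reduction" is the largest ratio of a base GNN's MAE to its Nested counterpart's MAE, and that "maximum improvement" is the largest relative accuracy gain; this reading is inferred from the reported values rather than stated in the text. The inputs are the eighth row of the QM9 table and the last column (600 graphs) of the TU table.

```python
# Eighth QM9 row: MAE of 1-GNN, 1-2-GNN, 1-3-GNN, 1-2-3-GNN and their Nested versions.
base_mae = [2.32, 0.0357, 0.6855, 0.0427]
nested_mae = [0.295, 0.252, 0.291, 0.205]

# Largest factor by which a Nested model reduces its base model's MAE.
max_reduction = max(b / n for b, n in zip(base_mae, nested_mae))
print(round(max_reduction, 1))  # 7.9

# Last TU column: accuracy of GCN, GraphSAGE, GIN, GAT and their Nested versions.
base_acc = [27.3, 30.7, 38.3, 30.2]
nested_acc = [31.2, 30.7, 29.0, 29.5]

# Largest relative accuracy gain (in percent) of a Nested model over its base model.
max_improvement = max(100 * (n - b) / b for b, n in zip(base_acc, nested_acc))
print(round(max_improvement, 1))  # 14.3
```

Both maxima are taken over base/Nested pairs, so they report the best case per row or column. In the same QM9 row, for instance, 1-2-GNN and 1-2-3-GNN have a lower MAE than their Nested versions.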

| Target | DTNN | MPNN | Deep LRP | 1-GNN | 1-2-GNN | 1-3-GNN | 1-2-3-GNN | Ne. 1-GNN | Ne. 1-2-GNN | Ne. 1-3-GNN | Ne. 1-2-3-GNN | Max. reduction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0.244 | 0.358 | 0.364 | 0.493 | 0.493 | 0.473 | 0.476 | 0.428 | 0.437 | 0.436 | 0.433 | 1.2 |
| | 0.95 | 0.89 | 0.298 | 0.78 | 0.27 | 0.46 | 0.27 | 0.29 | 0.278 | 0.261 | 0.265 | 2.7 |
| | 0.00388 | 0.00541 | 0.00254 | 0.00321 | 0.00331 | 0.00328 | 0.00337 | 0.00265 | 0.00275 | 0.00265 | 0.00279 | 1.2 |
| | 0.00512 | 0.00623 | 0.00277 | 0.00355 | 0.00350 | 0.00354 | 0.00351 | 0.00297 | 0.00271 | 0.00269 | 0.00276 | 1.3 |
| | 0.0112 | 0.0066 | 0.00353 | 0.0049 | 0.0047 | 0.0046 | 0.0048 | 0.0038 | 0.0039 | 0.0039 | 0.0039 | 1.8 |
| | 17.0 | 28.5 | 19.3 | 34.1 | 21.5 | 25.8 | 22.9 | 20.5 | 20.4 | 20.2 | 20.1 | 1.7 |
| ZPVE | 0.00172 | 0.00216 | 0.00055 | 0.00124 | 0.00018 | 0.00064 | 0.00019 | 0.00020 | 0.00017 | 0.00017 | 0.00015 | 6.2 |
| | 2.43 | 2.05 | 0.413 | 2.32 | 0.0357 | 0.6855 | 0.0427 | 0.295 | 0.252 | 0.291 | 0.205 | 7.9 |
| | 2.43 | 2.00 | 0.413 | 2.08 | 0.107 | 0.686 | 0.111 | 0.361 | 0.265 | 0.278 | 0.200 | 5.8 |
| | 2.43 | 2.02 | 0.413 | 2.23 | 0.070 | 0.794 | 0.0419 | 0.305 | 0.241 | 0.267 | 0.249 | 7.3 |
| | 2.43 | 2.02 | 0.413 | 1.94 | 0.140 | 0.587 | 0.0469 | 0.489 | 0.272 | 0.287 | 0.253 | 4.0 |
| | 0.27 | 0.42 | 0.129 | 0.27 | 0.0989 | 0.158 | 0.0944 | 0.174 | 0.0891 | 0.0879 | 0.0811 | 1.8 |

Table 3: MAE results on QM9 (smaller the better; "Ne." stands for Nested). A colored cell means NGNN is better than the base GNN.
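| Dataset | D&D | MUTAG | PROTEINS | PTC_MR | ENZYMES |
|---|---|---|---|---|---|
| #Graphs | 1178 | 188 | 1113 | 344 | 600 |
| Avg. #nodes | 284.32 | 17.93 | 39.06 | 14.29 | 32.63 |
| GCN | 71.6 ± 2.8 | 73.4 ± 10.8 | 71.7 ± 4.7 | 56.4 ± 7.1 | 27.3 ± 5.5 |
| GraphSAGE | 71.6 ± 3.0 | 74.0 ± 8.8 | 71.2 ± 5.2 | 57.0 ± 5.5 | 30.7 ± 6.3 |
| GIN | 70.5 ± 3.9 | 84.5 ± 8.9 | 70.6 ± 4.3 | 51.2 ± 9.2 | 38.3 ± 6.4 |
| GAT | 71.0 ± 4.4 | 73.9 ± 10.7 | 72.0 ± 3.3 | 57.0 ± 7.3 | 30.2 ± 4.2 |
| Nested GCN | 76.3 ± 3.8 | 82.9 ± 11.1 | 73.3 ± 4.0 | 57.3 ± 7.7 | 31.2 ± 6.7 |
| Nested GraphSAGE | 77.4 ± 4.2 | 83.9 ± 10.7 | 74.2 ± 3.7 | 57.0 ± 5.9 | 30.7 ± 6.3 |
| Nested GIN | 77.8 ± 3.9 | 87.9 ± 8.2 | 73.9 ± 5.1 | 54.1 ± 7.7 | 29.0 ± 8.0 |
| Nested GAT | 76.0 ± 4.4 | 81.9 ± 10.2 | 73.7 ± 4.8 | 56.7 ± 8.1 | 29.5 ± 5.7 |
| Max. improvement | 10.4% | 13.4% | 4.7% | 5.7% | 14.3% |

Table 4: Accuracy results (%) on TU datasets.

| Method | ogbg-molhiv (AUC), Validation | ogbg-molhiv (AUC), Test | ogbg-molpcba (AP), Validation | ogbg-molpcba (AP), Test |
|---|---|---|---|---|
| GCN* | 83.84 ± 0.91 | 75.99 ± 1.19 | 24.95 ± 0.42 | 24.24 ± 0.34 |
| GIN* | 84.79 ± 0.68 | 77.07 ± 1.49 | 27.98 ± 0.25 | 27.03 ± 0.23 |
| Deep LRP | 82.09 ± 1.16 | 77.19 ± 1.40 | - | - |
| DeeperGCN* | - | - | 29.20 ± 0.25 | 27.81 ± 0.38 |
| HIMP | - | 78.80 ± 0.82 | - | - |
| PNA | 85.19 ± 0.99 | 79.05 ± 1.32 | - | - |
| DGN | 84.70 ± 0.47 | 79.70 ± 0.97 | - | - |
| GINE* | - | - | 30.65 ± 0.30 | 29.17 ± 0.15 |
| PHC-GNN | 82.17 ± 0.89 | 79.34 ± 1.16 | 30.68 ± 0.25 | 29.47 ± 0.26 |
| Nested GIN* | 83.17 ± 1.99 | 78.34 ± 1.86 | 29.15 ± 0.35 | 28.32 ± 0.41 |
| Nested GIN* (ens) | 80.80 ± 2.78 | 79.86 ± 1.05 | 30.59 ± 0.56 | 30.07 ± 0.37 |

Table 5: Results (%) on OGB datasets (* virtual node).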

To answer Q3, we compare Nested GIN with leading methods on the OGB leaderboard. The results are shown in Table 5. Nested GIN achieves highly competitive performance with these leading GNN models, albeit using a relatively weak base GNN (GIN). Compared to GIN alone, Nested GIN shows clear performance gains. It achieves test scores up to 79.86 and 30.07 on ogbg-molhiv and ogbg-molpcba, respectively, which outperform all the baselines. In particular, for the challenging ogbg-molpcba, our Nested GIN can achieve 30.07 and 28.32 test AP with and without ensemble, respectively, outperforming the plain GIN model (with 27.03 test AP) significantly. These results demonstrate the great empirical performance and potential of NGNN even compared to heavily tuned open leaderboard models, despite using only GIN as the base GNN.

To answer Q4, we report the training time per epoch for GIN and Nested GIN on OGB datasets. On ogbg-molhiv, GIN takes 54s per epoch, while Nested GIN takes 183s (about 3.4 times as long). On ogbg-molpcba, GIN takes 10min per epoch, while Nested GIN takes 20min (twice as long). This indicates that the time complexity of NGNN is comparable to that of message passing GNNs. The extra complexity comes from independently learning better node representations from rooted subgraphs, which is a trade-off for the higher expressivity.

In summary, our experiments have firmly shown that NGNN is a theoretically sound method which brings consistent gains to its base GNNs in a plug-and-play way. Furthermore, NGNN still maintains a controllable time complexity compared to other more powerful GNNs.

Finally, we point out one memory limitation of the current NGNN implementation. Currently, NGNN does not scale to graph datasets with a large average node number (such as REDDIT-BINARY) or datasets with a large average node degree (such as ogbg-ppa), because it copies a rooted subgraph for each node into GPU memory. Reducing the batch size or subgraph height helps, but at the same time leads to performance degradation. One may wonder why materializing all the subgraphs in GPU memory is necessary. The reason is that we want to batch-process all the subgraphs simultaneously. Otherwise, we would have to extract subgraphs sequentially on the fly, which results in a much higher latency. We leave the exploration of memory-efficient NGNN to future work.
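A back-of-the-envelope way to see this memory growth is to count how many node copies are materialized when one rooted subgraph is stored per node. The sketch below does this for two toy graphs; `ball_size` and `materialized_nodes` are hypothetical helper names, and the count ignores edges and feature dimensions.

```python
from collections import deque

def ball_size(adj, root, height):
    """Number of nodes within `height` hops of `root` (size of its rooted subgraph)."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == height:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return len(dist)

def materialized_nodes(adj, height):
    """Total node copies when one rooted subgraph is stored for every node."""
    return sum(ball_size(adj, u, height) for u in adj)

# A sparse 10-node cycle and a star with one hub and 10 leaves (11 nodes).
cycle = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
star = {0: list(range(1, 11)), **{i: [0] for i in range(1, 11)}}

print(materialized_nodes(cycle, 2))  # 50 copies for 10 original nodes
print(materialized_nodes(star, 1))   # 31 copies for 11 original nodes
print(materialized_nodes(star, 2))   # 121 copies for 11 original nodes
```

On the star, raising the height from 1 to 2 makes every rooted subgraph cover the whole graph, so the number of copies grows quadratically in the node count. This mirrors why high-degree or large graphs are the problematic cases named in the text.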

6 Conclusions

We have proposed Nested Graph Neural Network (NGNN), a general framework for improving a GNN’s representation power. NGNN learns node representations encoding rooted subgraphs instead of rooted subtrees. Theoretically, we prove that NGNN can discriminate almost all r-regular graphs, where 1-WL always fails. Empirically, NGNN consistently improves the performance of various base GNNs across different datasets without incurring the high complexity of other more powerful GNNs.


The authors thank the reviewers for their actionable suggestions to improve the manuscript. Li is partly supported by the 2021 JP Morgan Faculty Award and the National Science Foundation (NSF) award HDR-2117997.


  • R. Abboud, İ. İ. Ceylan, M. Grohe, and T. Lukasiewicz (2020) The surprising power of graph neural networks with random node initialization. arXiv preprint arXiv:2010.01179. Cited by: §4, §5.1, §5.3, Table 2.
  • S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, K. Lerman, H. Harutyunyan, G. Ver Steeg, and A. Galstyan (2019) Mixhop: higher-order graph convolutional architectures via sparsified neighborhood mixing. In international conference on machine learning, pp. 21–29. Cited by: §4.
  • B. Anderson, T. Hy, and R. Kondor (2019) Cormorant: covariant molecular neural networks. arXiv preprint arXiv:1906.04015. Cited by: §5.2.
  • W. Azizian and M. Lelarge (2020) Characterizing the expressive power of invariant and equivariant graph neural networks. arXiv preprint arXiv:2006.15646. Cited by: §4.
  • D. Beaini, S. Passaro, V. Létourneau, W. L. Hamilton, G. Corso, and P. Liò (2020) Directional graph networks. arXiv preprint arXiv:2010.02863. Cited by: §4, §5.2.
  • K. M. Borgwardt and H. Kriegel (2005) Shortest-path kernels on graphs. In 5th IEEE International Conference on Data Mining, pp. 8–pp. Cited by: §1.
  • G. Bouritsas, F. Frasca, S. Zafeiriou, and M. M. Bronstein (2020) Improving graph neural network expressivity via subgraph isomorphism counting. arXiv preprint arXiv:2006.09252. Cited by: §4, §4.
  • R. Brossard, O. Frigo, and D. Dehaene (2020) Graph convolutions that can finally model local structure. arXiv preprint arXiv:2011.15069. Cited by: §5.2.
  • A. E. Brouwer and W. H. Haemers (2012) Strongly regular graphs. In Spectra of Graphs, pp. 115–149. Cited by: §3.3.
  • J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §1.
  • Z. Chen, L. Chen, S. Villar, and J. Bruna (2020) Can graph neural networks count substructures?. Advances in neural information processing systems. Cited by: §4, §5.2, §5.2.
  • Z. Chen, S. Villar, L. Chen, and J. Bruna (2019) On the equivalence between graph isomorphism testing and function approximation with gnns. In Advances in Neural Information Processing Systems, pp. 15894–15902. Cited by: §1, §1, §3.4, §4, §5.2.
  • G. Corso, L. Cavalleri, D. Beaini, P. Liò, and P. Veličković (2020) Principal neighbourhood aggregation for graph nets. arXiv preprint arXiv:2004.05718. Cited by: §5.2.
  • H. Dai, B. Dai, and L. Song (2016) Discriminative embeddings of latent variable models for structured data. In Proceedings of The 33rd International Conference on Machine Learning, pp. 2702–2711. Cited by: §1.
  • G. Dasoulas, L. D. Santos, K. Scaman, and A. Virmaux (2019) Coloring graph neural networks for node disambiguation. arXiv preprint arXiv:1912.06058. Cited by: §4.
  • P. de Haan, T. Cohen, and M. Welling (2020) Natural graph networks. arXiv preprint arXiv:2007.08349. Cited by: §4.
  • A. K. Debnath, d. C. R. Lopez, G. Debnath, A. J. Shusterman, and C. Hansch (1991) Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity.. Journal of medicinal chemistry 34 (2), pp. 786–797. Cited by: §5.1.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3837–3845. Cited by: §1.
  • P. D. Dobson and A. J. Doig (2003) Distinguishing enzyme structures from non-enzymes without alignments. Journal of molecular biology 330 (4), pp. 771–783. Cited by: §5.1.
  • D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §1.
  • F. Errica, M. Podda, D. Bacciu, and A. Micheli (2019) A fair comparison of graph neural networks for graph classification. arXiv preprint arXiv:1912.09893. Cited by: §5.2.
  • M. Fey and J. E. Lenssen (2019a) Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, Cited by: §5.2.
  • M. Fey and J. E. Lenssen (2019b) Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428. Cited by: §5.1, §5.
  • H. Gao and S. Ji (2019) Graph u-nets. arXiv preprint arXiv:1905.05178. Cited by: Appendix B.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272. Cited by: §1, §3.1, §5.2.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1025–1035. Cited by: §5.2.
  • D. Haussler (1999) Convolution kernels on discrete structures. Technical report Citeseer. Cited by: §1.
  • W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec (2020) Open graph benchmark: datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687. Cited by: §5.1.
  • K. Huang and M. Zitnik (2020) Graph meta learning via local subgraphs. Advances in Neural Information Processing Systems 33. Cited by: §4.
  • K. Ishiguro, S. Maeda, and M. Koyama (2019) Graph warp module: an auxiliary module for boosting the power of graph neural networks. arXiv preprint arXiv:1902.01020. Cited by: §5.2.
  • N. Keriven and G. Peyré (2019) Universal invariant and equivariant graph neural networks. arXiv preprint arXiv:1905.04943. Cited by: §4.
  • K. Kersting, N. M. Kriege, C. Morris, P. Mutzel, and M. Neumann (2016) Benchmark data sets for graph kernels. External Links: Link Cited by: §5.1.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §5.2, §5.2.
  • D. J. Klein and M. Randić (1993) Resistance distance. Journal of Mathematical Chemistry 12 (1), pp. 81–95. Cited by: Appendix C.
  • J. Klicpera, J. Groß, and S. Günnemann (2020) Directional message passing for molecular graphs. arXiv preprint arXiv:2003.03123. Cited by: §5.2.
  • R. Kondor, N. Shervashidze, and K. M. Borgwardt (2009) The graphlet spectrum. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 529–536. Cited by: §1.
  • T. Le, M. Bertolini, F. Noé, and D. Clevert (2021) Parameterized hypercomplex graph neural networks for graph classification. arXiv preprint arXiv:2103.16584. Cited by: §5.2.
  • G. Li, C. Xiong, A. Thabet, and B. Ghanem (2020a) Deepergcn: all you need to train deeper gcns. arXiv preprint arXiv:2006.07739. Cited by: §5.2.
  • P. Li, Y. Wang, H. Wang, and J. Leskovec (2020b) Distance encoding–design provably more powerful gnns for structural representation learning. arXiv preprint arXiv:2009.00142. Cited by: Appendix A, Appendix A, Appendix E, §1, §4, §4, §5.2.
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §1, §5.2.
  • A. Loukas (2019) What graph neural networks cannot learn: depth vs width. arXiv preprint arXiv:1907.03199. Cited by: §4.
  • H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman (2019a) Provably powerful graph networks. In Advances in Neural Information Processing Systems, pp. 2156–2167. Cited by: §1, §1, §3.3, §3.4, §4, §5.2, Table 2.
  • H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman (2018) Invariant and equivariant graph networks. arXiv preprint arXiv:1812.09902. Cited by: §4.
  • H. Maron, E. Fetaya, N. Segol, and Y. Lipman (2019b) On the universality of invariant networks. In International conference on machine learning, pp. 4363–4371. Cited by: §4.
  • C. Morris, G. Rattan, and P. Mutzel (2020) Weisfeiler and leman go sparse: towards scalable higher-order graph embeddings. Cited by: §1, §4, §5.2.
  • C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe (2019) Weisfeiler and leman go neural: higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4602–4609. Cited by: §1, §1, §3.1, §3.4, §4, §5.1, §5.2, §5.2, Table 2.
  • R. Murphy, B. Srinivasan, V. Rao, and B. Ribeiro (2019) Relational pooling for graph representations. In International Conference on Machine Learning, pp. 4663–4673. Cited by: §4.
  • M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting (2016) Propagation kernels: efficient graph kernels from propagated information. Machine Learning 102 (2), pp. 209–245. Cited by: §1.
  • G. Nikolentzos, G. Dasoulas, and M. Vazirgiannis (2020) K-hop graph neural networks. Neural Networks 130, pp. 195–205. Cited by: §4.
  • B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §1.
  • Z. Qiao, M. Welborn, A. Anandkumar, F. R. Manby, and T. F. Miller III (2020) OrbNet: deep learning for quantum chemistry using symmetry-adapted atomic-orbital features. The Journal of Chemical Physics 153 (12), pp. 124111. Cited by: §5.2.
  • R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilienfeld (2014) Quantum chemistry structures and properties of 134 kilo molecules. Scientific data 1 (1), pp. 1–7. Cited by: §5.1.
  • R. Sato, M. Yamada, and H. Kashima (2020) Random features strengthen graph neural networks. arXiv preprint arXiv:2002.03155. Cited by: §4.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §1.
  • I. Schomburg, A. Chang, C. Ebeling, M. Gremse, C. Heldt, G. Huhn, and D. Schomburg (2004) BRENDA, the enzyme database: updates and major new developments. Nucleic acids research 32 (suppl_1), pp. D431–D433. Cited by: §5.1.
  • N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt (2011) Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12 (Sep), pp. 2539–2561. Cited by: §1, §2.2, §3.1.
  • N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. M. Borgwardt (2009) Efficient graphlet kernels for large graph comparison.. In AISTATS, Vol. 5, pp. 488–495. Cited by: §1.
  • H. Toivonen, A. Srinivasan, R. D. King, S. Kramer, and C. Helma (2003) Statistical evaluation of the predictive toxicology challenge 2000–2001. Bioinformatics 19 (10), pp. 1183–1193. Cited by: §5.1.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §5.2.
  • C. Vignac, A. Loukas, and P. Frossard (2020) Building powerful and equivariant graph neural networks with structural message-passing. arXiv e-prints, pp. arXiv–2006. Cited by: §4.
  • S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt (2010) Graph kernels. Journal of Machine Learning Research 11 (Apr), pp. 1201–1242. Cited by: §1.
  • B. Weisfeiler and A. Lehman (1968) A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia 2 (9), pp. 12–16. Cited by: §1, §2.2.
  • Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande (2018) MoleculeNet: a benchmark for molecular machine learning. Chemical science 9 (2), pp. 513–530. Cited by: §5.1, §5.2.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §3.1, §4, §5.2, §5.2, §5.2.
  • Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec (2018) Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pp. 4800–4810. Cited by: Appendix B, §1.
  • J. You, J. Gomes-Selman, R. Ying, and J. Leskovec (2021) Identity-aware graph neural networks. arXiv preprint arXiv:2101.10320. Cited by: §4, §4.
  • M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401. Cited by: §3.3.
  • H. Zeng, M. Zhang, Y. Xia, A. Srivastava, A. Malevich, R. Kannan, V. Prasanna, L. Jin, and R. Chen (2020) Deep graph neural networks with shallow subgraph samplers. arXiv preprint arXiv:2012.01380. Cited by: Appendix B, §5.2.
  • M. Zhang and Y. Chen (2017) Weisfeiler-lehman neural machine for link prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 575–583. Cited by: §2.2.
  • M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pp. 5165–5175. Cited by: Appendix E, §4, §5.2.
  • M. Zhang and Y. Chen (2020) Inductive matrix completion based on graph neural networks. In International Conference on Learning Representations, External Links: Link Cited by: Appendix E, §4, §5.2.
  • M. Zhang, Z. Cui, M. Neumann, and Y. Chen (2018) An end-to-end deep learning architecture for graph classification. In AAAI, pp. 4438–4445. Cited by: Appendix B, §1.
  • M. Zhang, P. Li, Y. Xia, K. Wang, and L. Jin (2020) Revisiting graph neural networks for link prediction. arXiv preprint arXiv:2010.16103. Cited by: §4.

Appendix A Proof of Theorem 1

The proof is inspired by the previous theoretical characterization of the power of distance features in Li et al. [2020b]. Extracting a rooted subgraph of a given height around a center node is essentially equivalent to injecting distance features that indicate whether the distance between a node and the center node is below a threshold set by that height. In the following, we show explicitly how these distance features make NGNN more powerful than the 1-WL test. We first outline the proof. Consider two r-regular graphs with the same number of nodes, and pick one node from each graph. After extracting rooted subgraphs of a suitable bounded height around these two nodes, the implicit distance features give the nodes on the boundary of the two subgraphs special node representations. These special node representations are then propagated within the subgraphs. After some steps of propagation, we can prove that NGNN, by leveraging the subgraph pooling (Eq. 4), can distinguish these two subgraphs. This implies that NGNN may generate different node representations for the two picked nodes. Then, a union bound can be used to turn this difference in node representations into a difference in the representations of the two graphs. Note that the proof assumes there are no node/edge attributes that can be leveraged. Additional node/edge attributes can only increase the probability of distinguishing the two graphs.

The first lemma is to analyze the difference between the structures of the rooted subgraphs around two nodes over two -node -regular graphs. Before introducing that, we need to define a notion termed edge configuration. For a node in graph , let denote the set of nodes in that are exactly -hop neighbors of , i.e., the shortest path distance between and any node is . Then, we know the height- rooted subgraph over around the center node is the subgraph induced by the node set .

Definition 3.

The edge configuration between and is a list where denotes the number of nodes in of which each has exactly edges from .

When we say two edge configurations (between and ), (between and ) are equal, we mean that these two lists are component-wise equal to each other. Obviously, we should also have if . Now, we are ready to propose the first lemma.
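The notion of an edge configuration can be made concrete with a small sketch. The code below groups nodes by their shortest-path distance to a root and, for two consecutive distance layers, counts how many nodes of the outer layer have exactly i edges coming from the inner layer. The function names, the tuple encoding (entry i-1 holds the count for exactly i edges), and the two 3-regular example graphs are assumptions made for illustration, since the symbols of the definition are not reproduced here.

```python
from collections import deque

def distance_layers(adj, root):
    """Map each distance k to the set of nodes exactly k hops from `root`."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    layers = {}
    for u, d in dist.items():
        layers.setdefault(d, set()).add(u)
    return layers

def edge_configuration(adj, root, k, max_degree):
    """Entry i-1 counts the nodes at distance k+1 with exactly i edges from distance k."""
    layers = distance_layers(adj, root)
    inner, outer = layers.get(k, set()), layers.get(k + 1, set())
    config = [0] * max_degree
    for u in outer:
        edges_from_inner = sum(1 for v in adj[u] if v in inner)
        config[edges_from_inner - 1] += 1  # every outer node has at least one such edge
    return tuple(config)

# Two 3-regular graphs on six nodes: K_{3,3} and the triangular prism.
k33 = {0: [3, 4, 5], 1: [3, 4, 5], 2: [3, 4, 5], 3: [0, 1, 2], 4: [0, 1, 2], 5: [0, 1, 2]}
prism = {0: [1, 2, 3], 1: [0, 2, 4], 2: [0, 1, 5], 3: [0, 4, 5], 4: [1, 3, 5], 5: [2, 3, 4]}

print(edge_configuration(k33, 0, 0, 3), edge_configuration(prism, 0, 0, 3))
# (3, 0, 0) (3, 0, 0)
print(edge_configuration(k33, 0, 1, 3), edge_configuration(prism, 0, 1, 3))
# (0, 0, 2) (0, 2, 0)
```

In this toy pair the configurations agree between distances 0 and 1 but differ between distances 1 and 2. A difference of this kind between two regular graphs is what the first lemma shows to exist with high probability.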

Lemma 1.

For two graphs and that are uniformly independently sampled from all -node -regular graphs, where , we pick any two nodes, each from one graph, denoted by and respectively. Then, there is at least one

with probability

such that . Moreover, with at least the same probability, for all , the number of edges between and are at least for .


This lemma can be obtained by following the steps 1-3 in the proof of Theorem 3.3 in Li et al. [2020b]. ∎

Now, we set . We focus on the two extracted subgraphs and . We first prove a lemma that shows with a certain number of layers, a proper NGNN will generate different representations for and , i.e., and in Eq. 4.

Lemma 2.

For two graphs and that are uniformly independently sampled from all -node -regular graphs, where , we pick any two nodes, each from one graph, denoted by and respectively, and do -height rooted subgraph extraction around and . With at most many layers, a proper message passing GNN (with injective and subgraph pooling) will generate different representations for the extracted two subgraphs with probability at least .


According to Lemma 1, we know that with probability , there exists at least one such that . So there exists at least one that make (thus the difference in edge configurations appears in and ) and we pick the largest .

Now let us consider running a message passing GNN over the two subgraphs , . All nodes are initialized with the same node features. The nodes of these two subgraphs can be categorized into (), for respectively. Next, let us consider the node representations in these categories during the message passing procedure. We have the following observations.

  1. Note that all the nodes other than those in have degree in both subgraphs. Therefore, in the -th iteration, the nodes in for will share the same node representation. We call this node representation the default representation. Note that if we do not perform rooted subgraph extraction, then all nodes in every r-regular graph hold the default representation.

  2. Node representations that are different from default representations will first appear among the nodes in after the first iteration. This is because there are at least edges between and before performing subgraph extraction (due to Lemma 1) and none of these edges appear in the extracted subgraphs. Then, almost all nodes in hold only degree one (and thus do not have degree to keep default representations) within the corresponding extracted subgraphs. We refer to the node representations that are different from the default ones as new representations. New representations may be mutually different.

  3. Those new different node representations will propagate to nodes in , and so on via iterative message passing. Moreover, during such a propagation procedure, after iterations, new representations will at least make almost all nodes in hold representations different from almost all nodes in for , which can be easily obtained by induction from to .

From the above three observations, we may compare the propagation procedure between and . Suppose that in the first steps of message passing, the set of node representations (both the default ones and the new ones) remains the same between the two extracted subgraphs. If this is not true, the result is already proven. As they hold different edge configurations in , when the new node representations propagate from to , this will necessarily induce different sets of new node representations between and . At this step, node representations are still the same between and for as they are all default node representations. Though also hold new node representations, they are different from those in for . At this point, if an injective subgraph pooling operation is adopted, then the obtained representations of and , i.e., and , are different. ∎

Based on Lemma 2, applying a union bound while comparing a node representation of with all node representations of , we reach the final conclusion. Specifically, we consider a node of , say , and another arbitrary node of , say . Using Lemma 2, we know with probability , is different from . Then, using the union bound, with probability , we have . Therefore, if the final graph pooling (Eq. 5) is injective, we may guarantee that NGNN can generate different representations for and .
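To make the role of rooted subgraph extraction concrete, the following minimal sketch contrasts plain 1-WL colour refinement with the same refinement run inside height-1 rooted subgraphs. It uses two small 2-regular graphs (a 6-cycle and two disjoint triangles) rather than the random regular graphs of the theorem. The helper names (`wl_colors`, `rooted_subgraph`, `wl_signature`, `nested_signature`) and the use of nested tuples as an injective colouring are illustrative assumptions, not the implementation used in the paper.

```python
from collections import deque


def wl_colors(adj, rounds):
    """1-WL colour refinement with uniform initial colours.

    Colours are nested tuples, so each update is injective by construction.
    """
    colors = {v: () for v in adj}
    for _ in range(rounds):
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return colors


def rooted_subgraph(adj, root, height):
    """Induced subgraph on all nodes within `height` hops of `root`."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] == height:
            continue
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return {v: [u for u in adj[v] if u in dist] for v in dist}


def wl_signature(adj, rounds):
    """Whole-graph signature: sorted multiset of 1-WL node colours."""
    return tuple(sorted(wl_colors(adj, rounds).values()))


def nested_signature(adj, height, rounds):
    """Sorted multiset, over all nodes, of each rooted subgraph's signature."""
    return tuple(sorted(
        wl_signature(rooted_subgraph(adj, v, height), rounds) for v in adj
    ))


def cycle(nodes):
    """Adjacency lists of a cycle through `nodes` in the given order."""
    n = len(nodes)
    return {nodes[i]: [nodes[i - 1], nodes[(i + 1) % n]] for i in range(n)}


hexagon = cycle([0, 1, 2, 3, 4, 5])
two_triangles = {**cycle([0, 1, 2]), **cycle([3, 4, 5])}

same_under_wl = wl_signature(hexagon, 3) == wl_signature(two_triangles, 3)
same_under_nested = (
    nested_signature(hexagon, 1, 2) == nested_signature(two_triangles, 1, 2)
)
print(same_under_wl, same_under_nested)  # True False
```

In the 6-cycle, every height-1 rooted subgraph is a 3-node path whose two end nodes have degree one, whereas in the triangles every rooted subgraph is the full triangle. The removed boundary edges are what produce the non-default representations in this toy case.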

Appendix B Design choices of NGNN

In this section, we discuss some other design choices of NGNN.

High-order NGNN. NGNN is a two-level GNN (a GNN of GNNs), where a base GNN is used to learn a final node representation from a rooted subgraph and an outer GNN (graph pooling) is used to learn a graph representation from the base GNNs’ outputs. This design thus involves one level of nesting, which we call first-order NGNN. To extend the framework, we propose high-order NGNN, where we make the base GNN itself an NGNN. That is, we perform the subgraph representation learning tasks each using a first-order NGNN, where we treat each subgraph the same as the graph in the original NGNN. This way, we arrive at a second-order NGNN with two levels of nesting (a GNN of NGNNs, or a GNN of GNNs of GNNs). Repeating this construction, we can in principle construct an arbitrary-order NGNN. It is interesting to investigate whether high-order NGNNs can further enhance the representation power and the practical performance of a base GNN. We leave the exploration of such architectures to future work.

Pooling functions and . To summarize node representations into a subgraph/graph representation, we need a readout (pooling) function. Popular choices include sum, mean, max, as well as more complex ones such as selecting top- nodes [Zhang et al., 2018, Gao and Ji, 2019] and hierarchical approaches [Ying et al., 2018]. In this paper, we find mean pooling works very well, which directly takes the mean of node representations as the subgraph/graph representation. We also find another pooling function to be sometimes useful for subgraph pooling, called center pooling (CP). CP directly uses the root node’s representation to represent the entire subgraph. The success of CP relies on using more layers of message passing than the height of the rooted subgraph, so that even the intermediate representation of the center root node alone can have sufficient information about the entire subgraph. This is feasible for rooted subgraphs with a small height. Note that when using a number of message passing layers smaller than the subgraph height, NGNN with CP will reduce to a standard message passing GNN.
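The difference between mean pooling and center pooling can be illustrated with a small sketch. It assumes a toy sum-aggregation layer (`message_passing`) with scalar features, which stands in for the base GNN and is not the GIN layer used in the experiments. The two example subgraphs `sub_a` and `sub_b` are hypothetical height-2 rooted subgraphs (root 0) that differ only in an edge between the two nodes farthest from the root.

```python
def message_passing(adj, feats, layers):
    """Toy layers with sum aggregation: h_v <- h_v + sum of neighbour h_u."""
    h = {v: list(x) for v, x in feats.items()}
    for _ in range(layers):
        h = {
            v: [h[v][k] + sum(h[u][k] for u in adj[v]) for k in range(len(h[v]))]
            for v in adj
        }
    return h


def mean_pool(node_reprs):
    """Average the node representations coordinate-wise."""
    n = len(node_reprs)
    dim = len(next(iter(node_reprs.values())))
    return [sum(vec[k] for vec in node_reprs.values()) / n for k in range(dim)]


def center_pool(node_reprs, root):
    """Use only the root node's representation for the whole subgraph."""
    return list(node_reprs[root])


# Root 0 with neighbours 1 and 2; nodes 3 and 4 are two hops from the root.
sub_a = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2]}
sub_b = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1, 4], 4: [2, 3]}
ones = {v: [1.0] for v in sub_a}

for layers in (2, 3):
    cp_a = center_pool(message_passing(sub_a, ones, layers), root=0)
    cp_b = center_pool(message_passing(sub_b, ones, layers), root=0)
    print(layers, cp_a, cp_b)
# 2 [9.0] [9.0]
# 3 [25.0] [27.0]

print(mean_pool(message_passing(sub_a, ones, 2)),
      mean_pool(message_passing(sub_b, ones, 2)))
# [7.0] [9.0]
```

With two layers (equal to the subgraph height) the root's representation is identical for both subgraphs, so center pooling cannot separate them, while a third layer lets the extra edge reach the root. Mean pooling already separates the two subgraphs with two layers because it reads every node.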

Subgraph height and base GNN layers . NGNN is flexible in terms of choosing the subgraph height and the number of message passing layers in the base GNN. Theorem 1 provides a guide for choosing and when discriminating -regular graphs. In practice, we find using and generally performs well across various tasks. Using a small will restrict the receptive field, causing NGNN to learn overly local features. Using an overly large might cause each rooted subgraph to include the entire graph. For the number of message passing layers , we find that using performs better. A possible explanation is that a large lets each node in a rooted subgraph absorb the whole-subgraph information more fully, thus learning a better intermediate node representation that reflects its structural position within the subgraph. Please refer to [Zeng et al., 2020] for more motivation for using more message passing layers than the subgraph height.
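The trade-off in the subgraph height can be seen by counting how many nodes a rooted subgraph contains as the height grows. The sketch below uses the 3-regular cube graph on 8 nodes as an assumed toy example; `nodes_within` is a hypothetical helper based on breadth-first search.

```python
from collections import deque


def nodes_within(adj, root, height):
    """Set of nodes at most `height` hops away from `root` (BFS)."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] == height:
            continue
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return set(dist)


# Cube graph: nodes are 3-bit integers, edges flip exactly one bit.
cube = {v: [v ^ (1 << b) for b in range(3)] for v in range(8)}

sizes = [len(nodes_within(cube, 0, h)) for h in range(5)]
print(sizes)  # [1, 4, 7, 8, 8]
```

From height 3 onwards every rooted subgraph is the entire cube, so all roots see the same subgraph and the nesting no longer adds information. A very small height, on the other hand, keeps only the immediate neighbourhood.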

Appendix C More details about the experimental settings

The experiments were run on a Linux server with 64GB memory, two NVIDIA RTX 2080S (8GB) GPUs and an INTEL i9-9900 8-core CPU. For ogbg-molhiv, the final NGNN architecture used a rooted subgraph height and number of GIN layers . Mean pooling is used in both the subgraph and graph pooling. The final NGNN architecture for ogbg-molpcba used a rooted subgraph height and the number of GIN layers . Center pooling (CP) is used in the subgraph pooling and mean pooling is used in the graph pooling. Although we searched and , we found the final performance is not very sensitive to these hyperparameters as long as is between 3 and 5 and . For the DE features, we use shortest path distance and resistance distance [Klein and Randić, 1993].

Appendix D Simulation experiments to verify Theorem 1

We conduct a simulation over random regular graphs to validate Lemma 2 (how well NGNN distinguishes nodes of regular graphs) and Theorem 1 (how well NGNN distinguishes regular graphs). The results are shown in Figure 3, which match our theory almost perfectly. Basically, we sample 100 -node 3-regular graphs uniformly at random, and then apply an untrained NGNN to these graphs to see how often NGNN can distinguish the nodes and graphs at different rooted subgraph height and node number . The required at different matches almost perfectly with the lower bound in Lemma 2. More details are contained in the caption of Figure 3.

Figure 3: Simulation to verify Theorem 1. The left panel shows the node-level (with only subgraph pooling) simulation results. The right panel shows the graph-level (with both subgraph and graph pooling) simulation results. We uniformly sample 100 -node 3-regular graphs with ranging from 10 to 1280. We let the rooted subgraph height range from 1 to 10. We apply an untrained Nested GIN with one message passing layer to these graphs (with a uniform 1 as node features). In the left panel, we compare the final node representations (after subgraph pooling) from all graphs output by the Nested GIN. If the difference between two node representations is smaller than machine accuracy, they are regarded as indistinguishable. The shade of each scatter point’s color reflects the portion of indistinguishable node pairs at a certain . The darker the point, the more indistinguishable node pairs there are. In the right panel, we compare the final graph representations (after graph pooling) output by the Nested GIN. The blue and red dashed lines show the theoretical upper and lower bounds for to discriminate almost all nodes in -node 3-regular graphs, respectively. As we can see, the node-level simulation results perfectly match the theory (Lemma 2): when is larger than , almost all nodes from -regular graphs are distinguishable by NGNN. When is even larger than , the nodes can hardly be distinguished because each subgraph contains the entire regular graph. The graph-level simulation results show that even with a very small , NGNN can still discriminate almost all -regular graphs; in practice, does not even need to be chosen beyond . This is because although most nodes from two -regular graphs cannot be distinguished when , the graph pooling can still distinguish the two graphs as long as there exists one single node from one graph holding a representation different from any node representation from the other graph.
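A much smaller version of the node-level simulation can be sketched with the standard library alone. This is a simplified stand-in under stated assumptions, not the code behind Figure 3: it samples 5 graphs with 20 nodes instead of 100 graphs with up to 1280 nodes, and it replaces the untrained one-layer Nested GIN by the sorted multiset of node degrees inside each rooted subgraph, which is what one injective message passing layer followed by injective pooling can see when all node features are equal. All function names are hypothetical.

```python
import random
from collections import deque
from itertools import combinations


def random_regular_graph(n, d, rng):
    """Sample a simple d-regular graph with the pairing model plus rejection."""
    while True:
        stubs = [v for v in range(n) for _ in range(d)]
        rng.shuffle(stubs)
        edges = set()
        for a, b in zip(stubs[::2], stubs[1::2]):
            edge = (min(a, b), max(a, b))
            if a == b or edge in edges:
                break  # self-loop or multi-edge: reject and resample
            edges.add(edge)
        else:
            adj = {v: [] for v in range(n)}
            for a, b in edges:
                adj[a].append(b)
                adj[b].append(a)
            return adj


def rooted_subgraph(adj, root, height):
    """Induced subgraph on all nodes within `height` hops of `root`."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] == height:
            continue
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return {v: [u for u in adj[v] if u in dist] for v in dist}


def subgraph_signature(adj, root, height):
    """Sorted multiset of node degrees inside the rooted subgraph."""
    sub = rooted_subgraph(adj, root, height)
    return tuple(sorted(len(nbrs) for nbrs in sub.values()))


def indistinguishable_fraction(graphs, height):
    """Fraction of node pairs (across all graphs) with equal signatures."""
    sigs = [subgraph_signature(g, v, height) for g in graphs for v in g]
    pairs = list(combinations(sigs, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


rng = random.Random(0)
graphs = [random_regular_graph(20, 3, rng) for _ in range(5)]
fractions = {h: indistinguishable_fraction(graphs, h) for h in range(1, 7)}
for h, frac in fractions.items():
    print(f"h={h}: {frac:.3f} of node pairs indistinguishable")
```

Once the height is large enough for a rooted subgraph to cover its whole connected component, every node in it has degree 3 again and the signature carries no structural information, which mirrors the upper bound discussed in the caption.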

Appendix E Ablation study on DE

In this paper, we choose Distance Encoding (DE) [Li et al., 2020b] to augment the initial node features of NGNN, due to its good theoretical properties for improving the expressive power of message passing GNNs as well as its superb empirical performance on link prediction tasks [Zhang and Chen, 2018, 2020]. DE encodes the distance between a node and the root node into a vector through an embedding layer. The distance embedding is concatenated with the raw features of a node as its new features (in this rooted subgraph) input to the base GNN. Note that when this node appears in another rooted subgraph, it may have a different distance to that root node, thus resulting in different DE features in different subgraphs. Only the NGNN framework can leverage such a subgraph-specific feature augmentation—a standard GNN treats a node always the same no matter which node’s rooted subgraph/subtree it is in.
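The subgraph-specific nature of DE can be sketched as follows. The example assumes shortest path distance only (no resistance distance) and a fixed one-hot lookup table `embedding` in place of the learned embedding layer. The names `hop_distances` and `de_features` are hypothetical.

```python
from collections import deque


def hop_distances(adj, root, height):
    """Shortest path distance from `root` to each node within `height` hops."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] == height:
            continue
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def de_features(adj, raw_feats, root, height, embedding):
    """Concatenate raw features with the embedding of the distance to root."""
    dist = hop_distances(adj, root, height)
    return {v: raw_feats[v] + embedding[d] for v, d in dist.items()}


# Path graph 0 - 1 - 2 - 3 with a one-dimensional raw feature per node.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
raw = {v: [0.5] for v in path}
# Assumed stand-in for a learned embedding layer: one-hot vector per distance.
embedding = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}

in_subgraph_of_0 = de_features(path, raw, root=0, height=2, embedding=embedding)
in_subgraph_of_3 = de_features(path, raw, root=3, height=2, embedding=embedding)
print(in_subgraph_of_0[2])  # [0.5, 0.0, 0.0, 1.0]  (two hops from root 0)
print(in_subgraph_of_3[2])  # [0.5, 0.0, 1.0, 0.0]  (one hop from root 3)
```

Node 2 receives different input features in the two rooted subgraphs, which a standard GNN operating once on the whole graph cannot express.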

Method                      μ        α        ε_HOMO   ε_LUMO   Δε       ⟨R²⟩     ZPVE     U_0      U        H        G        C_v
1-GNN                       0.493    0.78     0.00321  0.00355  0.0049   34.1     0.00124  2.32     2.08     2.23     1.94     0.27
Nested 1-GNN (no DE)        0.466    0.38     0.00292  0.00294  0.0042   24.0     0.00040  1.09     1.76     1.04     1.19     0.111
Nested 1-GNN (with DE)      0.428    0.29     0.00265  0.00297  0.0038   20.5     0.00020  0.295    0.361    0.305    0.489    0.174
1-2-GNN                     0.493    0.27     0.00331  0.00350  0.0047   21.5     0.00018  0.0357   0.107    0.070    0.140    0.0989
Nested 1-2-GNN (no DE)      0.454    0.308    0.00280  0.00278  0.0041   23.3     0.00029  0.349    0.281    0.395    0.307    0.0945
Nested 1-2-GNN (with DE)    0.437    0.278    0.00275  0.00271  0.0039   20.4     0.00017  0.252    0.265    0.241    0.272    0.0891
1-3-GNN                     0.473    0.46     0.00328  0.00354  0.0046   25.8     0.00064  0.6855   0.686    0.794    0.587    0.158
Nested 1-3-GNN (no DE)      0.448    0.298    0.00276  0.00276  0.0040   22.0     0.00025  0.410    0.396    0.370    0.422    0.0936
Nested 1-3-GNN (with DE)    0.436    0.261    0.00265  0.00269  0.0039   20.2     0.00017  0.291    0.278    0.267    0.287    0.0879
1-2-3-GNN                   0.476    0.27     0.00337  0.00351  0.0048   22.9     0.00019  0.0427   0.111    0.0419   0.0469   0.0944
Nested 1-2-3-GNN (no DE)    0.449    0.306    0.00282  0.00286  0.0041   22.0     0.00023  0.220    0.218    0.268    0.205    0.0975
Nested 1-2-3-GNN (with DE)  0.433    0.265    0.00279  0.00276  0.0039   20.1     0.00015  0.205    0.200    0.249    0.253    0.0811
Table 6: Ablation study on QM9 comparing Nested GNNs with and without DE features.

In this section, we do ablation experiments to study the effect of the DE features. We choose QM9 as the testbed. The base GNNs are the same as in Table 3. For each base GNN, we compare it with its Nested GNN version without DE features (no DE) and its Nested GNN version with DE features (with DE). The results are shown in Table 6.

In Table 6, we color the cell with light green if the NGNN (no DE) is better than the base GNN, and mark the cell with green if the NGNN (with DE) is additionally better than the NGNN (no DE). From the results, we can first observe that NGNNs (no DE) generally outperform the base GNNs, validating that even without any feature augmentation the NGNN framework still enhances the performance of base GNNs. Furthermore, we can observe that if NGNN improves over the base GNN, adding DE features could further enlarge the performance improvement by achieving the smallest MAEs among the three (i.e., base GNN, NGNN (no DE) and NGNN (with DE)). This demonstrates the usefulness of augmenting NGNN with DE features. Note that adding such DE features can be done simultaneously with the rooted subgraph extraction process, which only adds a negligible amount of time. Thus, augmenting NGNN with DE features is almost a free yet powerful operation to further enhance NGNN’s power, which motivates us to make it a default choice of NGNN.
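As a quick check of these observations for one base model, the following sketch counts the targets on which each variant wins, using the three 1-GNN rows of Table 6 copied in column order (lower MAE is better). The variable names are illustrative, and the other base GNNs can be checked the same way.

```python
# MAEs on the 12 QM9 targets, copied from the 1-GNN rows of Table 6.
base = [0.493, 0.78, 0.00321, 0.00355, 0.0049, 34.1,
        0.00124, 2.32, 2.08, 2.23, 1.94, 0.27]
no_de = [0.466, 0.38, 0.00292, 0.00294, 0.0042, 24.0,
         0.00040, 1.09, 1.76, 1.04, 1.19, 0.111]
with_de = [0.428, 0.29, 0.00265, 0.00297, 0.0038, 20.5,
           0.00020, 0.295, 0.361, 0.305, 0.489, 0.174]

# Targets where Nested 1-GNN (no DE) improves over plain 1-GNN.
no_de_beats_base = sum(n < b for n, b in zip(no_de, base))
# Targets where adding DE improves further over Nested 1-GNN (no DE).
de_beats_no_de = sum(d < n for d, n in zip(with_de, no_de))
# Targets where Nested 1-GNN (with DE) is the best of the three.
de_is_best = sum(d < min(n, b) for d, n, b in zip(with_de, no_de, base))

print(no_de_beats_base, de_beats_no_de, de_is_best)  # 12 10 10
```

For 1-GNN, nesting alone improves all 12 targets, and adding DE gives the smallest MAE on 10 of them.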