1 Introduction
Graphs are an important tool for modeling relational data in the real world, and representation learning over graphs has become a popular topic in machine learning in recent years. While network embedding methods, such as DeepWalk
(Perozzi et al., 2014), can learn node representations well, they fail to generalize to whole-graph representations, which are crucial for applications such as graph classification, molecule modeling, and drug discovery. Conversely, although traditional graph kernels (Haussler, 1999; Shervashidze et al., 2009; Kondor et al., 2009; Borgwardt and Kriegel, 2005; Neumann et al., 2016; Shervashidze et al., 2011) can be used for graph classification, they often define graph similarity in a heuristic way that is not parameterized and lacks the flexibility to deal with features.
In this context, graph neural networks (GNNs) have regained people's attention and become the state-of-the-art graph representation learning tools (Scarselli et al., 2009; Bruna et al., 2013; Duvenaud et al., 2015; Li et al., 2015; Kipf and Welling, 2016; Defferrard et al., 2016; Dai et al., 2016; Veličković et al., 2017; Zhang et al., 2018; Ying et al., 2018). GNNs use message passing to propagate features between connected nodes. By iteratively aggregating neighboring node features to the center node, GNNs learn node representations encoding their local structure and feature information. These node representations can be further pooled into a graph representation, enabling graph-level tasks such as graph classification. In this paper, we will use message passing GNNs to denote this class of GNNs based on repeated neighbor aggregation (Gilmer et al., 2017), in order to distinguish them from some high-order GNN variants (Morris et al., 2019; Maron et al., 2019a; Chen et al., 2019) where the effective message passing happens between high-order node tuples instead of nodes.
GNNs' message passing scheme mimics the 1-dimensional Weisfeiler-Lehman (1-WL) algorithm (Weisfeiler and Lehman, 1968), which iteratively refines a node's color according to its current color and the multiset of its neighbors' colors. This procedure essentially encodes a rooted subtree around each node into its final color, where the rooted subtree is constructed by recursively expanding the neighbors of the root node. One critical reason for GNNs' success in graph classification is that two graphs sharing many identical or similar rooted subtrees are more likely to be classified into the same class, which aligns with the inductive bias that two graphs are similar if they have many common substructures
(Vishwanathan et al., 2010). Despite this, rooted subtrees are still limited in expressing all possible substructures that can appear in a graph. Two graphs may share a lot of identical rooted subtrees yet not be similar at all, because their other substructure patterns differ. Take the two graphs in Figure 1 as an example. If we apply 1-WL or a message passing GNN to them, the two graphs will always have the same representation no matter how many iterations/layers we use, because all nodes in the two graphs have identical rooted subtrees across all tree heights. However, the two graphs are quite different from a holistic perspective: one is composed of two triangles, while the other is a hexagon. The intrinsic reason for such a failure is that rooted subtrees have limited expressiveness for representing general graphs, especially those with cycles.
To address this issue, we propose Nested Graph Neural Networks (NGNNs). The core idea is that, instead of encoding a rooted subtree, we want the final representation of a node to encode a rooted subgraph (a local $h$-hop subgraph) around it. The subgraph is not restricted to any particular graph type such as a tree, but serves as a general description of the local neighborhood around a node. Rooted subgraphs offer much better representation power than rooted subtrees; e.g., we can easily discriminate the two graphs in Figure 1 by only comparing their height-1 rooted subgraphs.
To represent a graph with rooted subgraphs, NGNN uses two levels of GNNs: base (inner) GNNs and an outer GNN. By extracting a local rooted subgraph around each node, NGNN first applies a base GNN to each node’s subgraph independently. Then, a subgraph pooling layer is applied to each subgraph to aggregate the intermediate node representations into a subgraph representation. This subgraph representation is used as the final representation of the root node. Rather than encoding a rooted subtree, this final node representation encodes the local subgraph around it, which contains more information than a subtree. Finally, all the final node representations are further fed into an outer GNN to learn a representation for the entire graph. Figure 2 shows one NGNN implementation using message passing GNNs as the base GNNs and a simple graph pooling layer as the outer GNN.
One may wonder: if the base GNN is message-passing-based, doesn't it still learn only rooted subtrees? Why, then, is NGNN more powerful than a plain GNN? One key reason lies in the subgraph pooling layer. Take the height-1 rooted subgraphs (marked with red boxes) around the two root nodes in Figure 1 as an example. Although the two root nodes' height-1 rooted subtrees are still the same, their neighbors (labeled by 1 and 2) have different height-1 rooted subtrees. Thus, applying a one-layer message passing GNN plus a subgraph pooling as the base GNN is sufficient to discriminate the two root nodes.
The NGNN framework has multiple exclusive advantages. Firstly, it allows freely choosing the base GNN, and can enhance the base GNN's representation power in a plug-and-play fashion. Theoretically, we prove that NGNN is more powerful than message passing GNNs and 1-WL, being able to discriminate almost all regular graphs (where 1-WL always fails). Secondly, by extracting rooted subgraphs, NGNN allows augmenting the initial features of a node with subgraph-specific structural features such as distance encoding (Li et al., 2020b) to improve the quality of the learned node representations. Thirdly, unlike other more powerful graph neural networks, especially those based on higher-order WL tests (Morris et al., 2019; Maron et al., 2019a; Chen et al., 2019; Morris et al., 2020), NGNN still has linear time and space complexity w.r.t. graph size like standard message passing GNNs, thus maintaining good scalability. We demonstrate the effectiveness of the NGNN framework on various synthetic/real-world graph classification/regression datasets. On synthetic datasets, NGNN demonstrates higher-than-1-WL expressive power, matching our theorem very well. On real-world datasets, NGNN consistently enhances a wide range of base GNNs' performance, achieving highly competitive results on all datasets.
2 Preliminaries
2.1 Notation and problem definition
We consider the graph classification/regression problem. Given a graph $G = (V, E)$, where $V$ is the node set and $E$ is the edge set, we aim to learn a function mapping $G$ to its class or target value $y$. The nodes and edges in $G$ can have feature vectors associated with them, denoted by $x_v$ (for node $v$) and $e_{u,v}$ (for edge $(u,v)$), respectively.

2.2 Weisfeiler-Lehman test
The Weisfeiler-Lehman (1-WL) test (Weisfeiler and Lehman, 1968) is a popular algorithm for graph isomorphism checking. The classical 1-WL works as follows. At first, all nodes receive the same initial color. In each iteration, every node collects its neighbors' colors into a multiset, and 1-WL then updates each node's color so that two nodes get the same new color if and only if their current colors are the same and they have identical multisets of neighbor colors. This process is repeated until the number of colors does not increase between two iterations. Then, 1-WL returns that two graphs are non-isomorphic if their node colors differ at some iteration, and otherwise fails to determine whether they are non-isomorphic. See (Shervashidze et al., 2011; Zhang and Chen, 2017) for more details.
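The color refinement procedure above can be sketched in a few lines of plain Python (a toy illustration of the test, not tied to any library; graphs are adjacency dicts and colors are relabeled integers):

```python
from collections import Counter

def wl_colors(adj, iterations=3):
    """Run 1-WL color refinement; adj maps node -> list of neighbors.

    Returns the final color histogram of the graph."""
    colors = {v: 0 for v in adj}  # all nodes start with the same color
    for _ in range(iterations):
        # signature = (current color, sorted multiset of neighbor colors)
        signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                      for v in adj}
        # relabel distinct signatures with fresh integer colors
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return Counter(colors.values())

# Two 2-regular graphs: two disjoint triangles vs. one hexagon (as in Figure 1).
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

print(wl_colors(two_triangles) == wl_colors(hexagon))  # True: 1-WL cannot tell them apart
```

Since every node in both graphs has degree 2, all signatures stay identical forever, illustrating why 1-WL (and hence message passing GNNs) fails on regular graphs.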
1-WL essentially encodes the rooted subtrees around each node at different heights into its color representations. Figure 1 (middle) shows the rooted subtrees around the two example nodes. Two nodes will have the same color at iteration $t$ if and only if their height-$t$ rooted subtrees are the same.
3 Nested Graph Neural Network
In this section, we introduce our Nested Graph Neural Network (NGNN) framework and theoretically demonstrate its higher representation power than message passing GNNs.
3.1 Limitations of message passing GNNs
Most existing GNNs follow the message passing framework (Gilmer et al., 2017): given a graph $G$, each node $v$'s hidden state is updated based on its previous state and the messages from its neighbors:

$m_v^{(t)} = \sum_{u \in \mathcal{N}(v)} M_t\big(h_v^{(t-1)}, h_u^{(t-1)}, e_{u,v}\big), \quad h_v^{(t)} = U_t\big(h_v^{(t-1)}, m_v^{(t)}\big).$  (1)

Here $M_t$ and $U_t$ are the message and update functions at time stamp $t$, $e_{u,v}$ is the feature of edge $(u,v)$, and $\mathcal{N}(v)$ is the set of $v$'s neighbors in graph $G$. The initial hidden states $h_v^{(0)}$ are given by the raw node features $x_v$. After $T$ time stamps (iterations), the final node representations $h_v^{(T)}$ are summarized into a whole-graph representation $h_G$ with a readout (pooling) function $R$ (e.g., mean or sum):

$h_G = R\big(\{h_v^{(T)} \mid v \in V\}\big).$  (2)
Such a message passing (or neighbor aggregation) scheme iteratively aggregates neighbor information into a center node's hidden state, making it encode a local rooted subtree around the node. The final node representations will contain both the local structure and feature information around nodes, enabling node-level tasks such as node classification. After a pooling layer, these node representations can be further summarized into a graph representation, enabling graph-level tasks. When there are no edge features and the node features are from a countable space, it has been shown that message passing GNNs are at most as powerful as the 1-WL test for discriminating non-isomorphic graphs (Xu et al., 2018; Morris et al., 2019).
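As a toy numeric illustration of Eqs. 1 and 2, the sketch below assumes identity message functions, sum aggregation, an additive update, and sum readout (a drastically simplified GIN-like layer of our own, not any specific model from the paper):

```python
def message_passing(adj, x, T=2):
    """Minimal message passing: h_v^(t) = h_v^(t-1) + sum_u h_u^(t-1)
    over neighbors u of v. adj maps node -> neighbor list; x maps node -> scalar."""
    h = dict(x)
    for _ in range(T):
        h = {v: h[v] + sum(h[u] for u in adj[v]) for v in adj}
    return h

def readout(h):
    """Sum pooling into a whole-graph representation (Eq. 2 with R = sum)."""
    return sum(h.values())

path3 = {0: [1], 1: [0, 2], 2: [1]}  # a 3-node path with unit scalar features
h = message_passing(path3, {0: 1.0, 1: 1.0, 2: 1.0}, T=1)
print(readout(h))  # 7.0: each node added its neighbors' features once
```

Real models replace the identity message and additive update with learned neural functions $M_t$ and $U_t$, but the information flow is the same.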
A $T$-layer message passing GNN will give two nodes the same final representation if they have identical height-$T$ rooted subtrees (i.e., both the structures and the features on the corresponding nodes/edges are the same). If two graphs have a lot of identical (or similar) rooted subtrees, they will also have similar graph representations after pooling. This insight is crucial for the success of modern GNNs in graph classification, because it aligns with the inductive bias that two graphs are similar if they have many common substructures. Such insight has also been used in designing the WL subtree kernel (Shervashidze et al., 2011), a state-of-the-art graph classification method before GNNs.
However, message passing GNNs have several limitations. Firstly, the rooted subtree is only one specific type of substructure; it is not general enough to represent arbitrary subgraphs, especially those with cycles, due to the natural restriction of the tree structure. Secondly, using the rooted subtree as the elementary substructure results in a discriminating power bounded by the 1-WL test; for example, no two same-sized $r$-regular graphs can be discriminated by message passing GNNs. Thirdly, standard message passing GNNs do not allow using root-node-specific structural features (such as the distance between a node and the root node) to improve the quality of the learned root node's representation. We need to break through these limitations in order to design more powerful GNNs.
3.2 The NGNN framework
To address the above limitations, we propose the Nested Graph Neural Network (NGNN) framework. NGNN no longer aims to encode a rooted subtree around each node. Instead, in NGNN, each node's final representation encodes the general local subgraph around it rather than merely a subtree, so that two graphs sharing a lot of identical or similar rooted subgraphs will have similar representations.
Definition 1.
(Rooted subgraph) Given a graph $G = (V, E)$ and a node $v \in V$, the height-$h$ rooted subgraph $G_v^h$ of $v$ is the subgraph induced from $G$ by the nodes within $h$ hops of $v$ (including the $h$-hop nodes).
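A height-$h$ rooted subgraph can be extracted with a simple breadth-first search. The sketch below (an illustration of the definition, with graphs as adjacency dicts) also returns each node's distance to the root, which can serve as an optional structural feature:

```python
from collections import deque

def rooted_subgraph(adj, root, h):
    """Return the node set, induced edges, and root distances of the
    height-h rooted subgraph of `root` in the graph `adj`."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] < h:  # only expand nodes strictly inside the height budget
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
    nodes = set(dist)
    # induced edges: keep every original edge whose endpoints both survive
    edges = {(u, v) for u in nodes for v in adj[u] if v in nodes and u < v}
    return nodes, edges, dist  # dist doubles as a distance-to-root feature

hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
nodes, edges, dist = rooted_subgraph(hexagon, 0, 1)
print(sorted(nodes))  # [0, 1, 5]: the root and its 1-hop neighbors
```

For the hexagon this height-1 subgraph is a 3-node path, whereas for the two-triangles graph of Figure 1 it would be a triangle, which is exactly the structural difference NGNN exploits.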
To make a node's final representation encode a rooted subgraph, we need to compute a subgraph representation. To achieve this, we resort to an arbitrary GNN, which we call the base GNN of NGNN. For example, the base GNN can simply be a message passing GNN, which performs message passing within each rooted subgraph to learn an intermediate representation for every node of the subgraph, and then uses a pooling layer to summarize a subgraph representation from the intermediate node representations. This subgraph representation is used as the final representation of the root node in the original graph. Take root node $w$ as an example. We first perform $T$ rounds of message passing within node $w$'s rooted subgraph $G_w^h$. Let $v$ be any node appearing in $G_w^h$. We have
$m_{v,w}^{(t)} = \sum_{u \in \mathcal{N}(v \mid G_w^h)} M_t\big(h_{v,w}^{(t-1)}, h_{u,w}^{(t-1)}, e_{u,v}\big), \quad h_{v,w}^{(t)} = U_t\big(h_{v,w}^{(t-1)}, m_{v,w}^{(t)}\big).$  (3)
Here $M_t$ and $U_t$ are the message and update functions of the base GNN at time stamp $t$, $\mathcal{N}(v \mid G_w^h)$ denotes the set of $v$'s neighbors within $w$'s rooted subgraph $G_w^h$, and $h_{v,w}^{(t)}$ and $m_{v,w}^{(t)}$ denote node $v$'s hidden state and message specific to rooted subgraph $G_w^h$ at time stamp $t$. Note that when node $v$ attends different nodes' rooted subgraphs, its hidden states and messages will also be different. This is in contrast to standard GNNs, where a node's hidden state and message at time $t$ are the same regardless of which root node it contributes to; for example, $h_v^{(t)}$ and $m_v^{(t)}$ in Eq. 1 do not depend on any particular rooted subgraph.
After $T$ rounds of message passing, we apply a subgraph pooling layer to summarize a subgraph representation from the intermediate node representations $\{h_{v,w}^{(T)} \mid v \in G_w^h\}$:
$h_w = R_{\text{sub}}\big(\{h_{v,w}^{(T)} \mid v \in G_w^h\}\big),$  (4)
where $R_{\text{sub}}$ is the subgraph pooling layer. This subgraph representation $h_w$ will be used as root node $w$'s final representation in the original graph. Note that the base GNNs are simultaneously applied to all nodes' rooted subgraphs to return a final representation for every node in the original graph, and all the base GNNs share the same parameters. With such node representations, NGNN uses an outer GNN to further process and aggregate them into a representation of the whole graph. For simplicity, we let the outer GNN be simply a graph pooling layer denoted by $R$:
$h_G = R\big(\{h_w \mid w \in V\}\big).$  (5)
The Nested GNN framework can be understood as a two-level GNN, or a GNN of GNNs: the inner subgraph-level GNNs (base GNNs) are used to learn node representations from their rooted subgraphs, while the outer graph-level GNN returns a whole-graph representation from the inner GNNs' outputs. The inner GNNs all share the same parameters, which are trained end-to-end with the outer GNN. Figure 2 depicts the implementation of the NGNN framework described above.
Compared to message passing GNNs, NGNN changes the “receptive field” of each node from a rooted subtree to a rooted subgraph, in order to capture better local substructure information. The rooted subgraph is read by a base GNN to learn a subgraph representation. Finally, the outer GNN reads the subgraph representations output by the base GNNs to return a graph representation.
Note that when we apply the base GNN to a rooted subgraph, this rooted subgraph is extracted (copied) out of the original graph and treated as a graph completely independent from the other rooted subgraphs and the original graph. This allows the same node to have different representations within different rooted subgraphs. For example, in Figure 2, the same node appears in four different rooted subgraphs; sometimes it is the root node, while other times it is a 1-hop neighbor of the root. NGNN enables learning different representations for the same node when it appears in different rooted subgraphs, in contrast to standard GNNs, where a node has only one single representation at one time stamp (Eq. 1). Similarly, NGNN also enables using different initial features for the same node when it appears in different rooted subgraphs. This allows us to customize a node's initial features based on its structural role within a rooted subgraph, as opposed to using the same initial features for a node across all rooted subgraphs. For example, we can optionally augment node $v$'s initial features with the distance between $v$ and the root: when node $v$ is the root node, we give it an additional feature 0, and when $v$ is a $k$-hop neighbor of the root, we give it an additional feature $k$. Such feature augmentation may help better capture a node's structural role within a rooted subgraph. It is an exclusive advantage of NGNN and is not possible in standard GNNs.
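To make the whole pipeline concrete, here is a deliberately tiny NGNN sketch of our own (not the paper's implementation): the "base GNN" is a single round of degree counting inside each rooted subgraph, "subgraph pooling" sorts the resulting node states into a tuple, and "graph pooling" collects a multiset over roots. Even this crude instantiation separates the two graphs of Figure 1, which 1-WL cannot:

```python
from collections import Counter, deque

def khop_subgraph(adj, root, h):
    """Nodes within h hops of root (copied out and treated independently)."""
    dist = {root: 0}
    q = deque([root])
    while q:
        v = q.popleft()
        if dist[v] < h:
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    q.append(u)
    return set(dist)

def ngnn_fingerprint(adj, h=1):
    """Toy NGNN: one message passing round per rooted subgraph (each node's
    state = its degree inside the subgraph), subgraph pooling = sorted tuple
    of states, graph pooling = multiset of subgraph fingerprints."""
    graph_repr = []
    for root in adj:
        nodes = khop_subgraph(adj, root, h)
        states = {v: sum(1 for u in adj[v] if u in nodes) for v in nodes}
        graph_repr.append(tuple(sorted(states.values())))  # subgraph pooling
    return Counter(graph_repr)                             # graph pooling

two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(ngnn_fingerprint(two_triangles) != ngnn_fingerprint(hexagon))  # True
```

Every height-1 rooted subgraph of the two-triangles graph is a triangle (states (2, 2, 2)), while every one in the hexagon is a path (states (1, 1, 2)), so the two graph fingerprints differ even with one message passing layer, exactly as argued in Section 3.2.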
3.3 The representation power of NGNN
We theoretically characterize the additional expressive power of NGNN (using message passing GNNs as base GNNs) as opposed to standard message passing GNNs. We focus on the ability to discriminate regular graphs because they form an important category of graphs that standard GNNs cannot represent well. Using 1-WL or message passing GNNs, any two same-sized $r$-regular graphs will have the same representation, unless discriminative node features are available. In contrast, we prove that NGNN can distinguish almost all pairs of same-sized $r$-regular graphs regardless of node features.
Definition 2.
(Proper NGNN) An NGNN is proper if its message passing layers, its subgraph pooling layer, and its graph pooling layer are all injective functions over their inputs.

A proper NGNN always exists due to the representation power of fully-connected neural networks used for message passing and Deep Sets for graph pooling (Zaheer et al., 2017). For all pairs of graphs that 1-WL can discriminate, there always exists a proper NGNN that can also discriminate them: two graphs discriminated by 1-WL must have different multisets of rooted subtrees at some height $h$, while a rooted subtree is always included in the rooted subgraph of the same height.
Now we present our main theorem.
Theorem 1.
Consider all pairs of $n$-sized $r$-regular graphs, where $3 \le r < (2\log n)^{1/2}$. For any small constant $\epsilon > 0$, there exists a proper NGNN using at most height-$(\frac{1}{2} + \epsilon)\frac{\log n}{\log(r-1)}$ rooted subgraphs and $\epsilon\frac{\log n}{\log(r-1)}$-layer message passing, which distinguishes almost all ($1 - o(1)$) such pairs of graphs.
We include the proof in Appendix A. Theorem 1 has three implications. Firstly, since NGNN can discriminate almost all regular graphs, where 1-WL always fails, it is strictly more powerful than 1-WL and message passing GNNs. Secondly, it implies that NGNN does not need to extract subgraphs with a very large height (about $\frac{1}{2}\frac{\log n}{\log(r-1)}$) to be more powerful. Moreover, NGNN is already powerful with very few message passing layers, i.e., an arbitrarily small constant $\epsilon$ times $\frac{\log n}{\log(r-1)}$ (as few as 1 layer). This benefit comes from the subgraph pooling (Eq. 4), freeing us from using deep base GNNs. We further conduct a simulation experiment in Appendix D to verify Theorem 1 by testing how well NGNN discriminates regular graphs in practice. The results match almost perfectly with our theory.
Although NGNN is strictly more powerful than 1-WL and 2-WL (1-WL and 2-WL have the same discriminating power (Maron et al., 2019a)), it is unclear whether NGNN is more powerful than 3-WL. Our early-stage analysis shows that both NGNN and 3-WL cannot discriminate strongly regular graphs with the same parameters (Brouwer and Haemers, 2012). We leave the exact comparison between NGNN and 3-WL to future work.
3.4 Discussion
Base GNN. NGNN is a general plug-and-play framework to increase the power of a base GNN. The base GNN is not restricted to message passing GNNs as described in Section 3.2. For example, we can also use GNNs approximating the power of higher-dimensional WL tests, such as 1-2-3-GNN (Morris et al., 2019) and PPGN/Ring-GNN (Maron et al., 2019a; Chen et al., 2019), as the base GNN. In fact, one limitation of these high-order GNNs is their complexity. The NGNN framework can greatly alleviate this by applying the high-order GNN to many small rooted subgraphs instead of the whole graph: supposing a rooted subgraph has at most $s$ nodes, applying a high-order GNN with cubic complexity to all $n$ rooted subgraphs reduces the time complexity from $O(n^3)$ to $O(ns^3)$.
Complexity. We compare the time complexity of NGNN (using message passing GNNs as base GNNs) with a standard message passing GNN. Suppose the graph has $n$ nodes with a maximum degree $d$, and the maximum number of nodes in a rooted subgraph is $s$. Each message passing iteration in a standard message passing GNN takes $O(nd)$ operations. In NGNN, we need to perform message passing over all nodes' rooted subgraphs, which takes $O(nsd)$. We keep $s$ small (which can be achieved by using a small height $h$) to improve NGNN's scalability. Additionally, a small $s$ enables the base GNN to focus on learning local subgraph patterns.
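For intuition on how $s$ grows with the height $h$ and maximum degree $d$, one can use the tree-like upper bound $s \le 1 + d + d(d-1) + \dots + d(d-1)^{h-1}$ (our own back-of-the-envelope bound, which ignores overlapping neighborhoods; overlaps only make $s$ smaller):

```python
def subgraph_size_bound(d, h):
    """Tree-like upper bound on rooted-subgraph size:
    s <= 1 + d + d(d-1) + ... + d(d-1)^(h-1)."""
    s, frontier = 1, d
    for _ in range(h):
        s += frontier          # add the nodes at the current hop
        frontier *= (d - 1)    # each new node opens at most d-1 fresh edges
    return s

# Example figures (hypothetical graph): n nodes, max degree d, height h.
n, d, h = 10_000, 4, 2
s = subgraph_size_bound(d, h)  # 1 + 4 + 12 = 17
# Per-iteration cost: O(nd) for a standard GNN vs. O(nsd) for NGNN.
print(n * d, n * s * d)
```

With small $h$ and moderate degrees, $s$ stays a small constant, so NGNN's overhead over a standard message passing GNN is a constant factor rather than a change in asymptotic order.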
In Appendix B, we discuss some other design choices of NGNN.
4 Related work
Understanding GNNs' representation power is a fundamental problem in GNN research. Xu et al. (2018) and Morris et al. (2019) first proved that the discriminating power of message passing GNNs is bounded by the 1-WL test, namely they cannot discriminate two non-isomorphic graphs that 1-WL fails to discriminate (such as regular graphs). Since then, there has been increasing effort in enhancing GNNs' discriminating power beyond 1-WL (Morris et al., 2019; Chen et al., 2019; Maron et al., 2019a; Murphy et al., 2019; Li et al., 2020b; Bouritsas et al., 2020; You et al., 2021; Beaini et al., 2020; Morris et al., 2020). Many GNNs have been proposed to mimic higher-dimensional WL tests, such as 1-2-3-GNN (Morris et al., 2019), Ring-GNN (Chen et al., 2019) and PPGN (Maron et al., 2019a). However, these models generally require learning the representations of all node tuples of a certain cardinality (e.g., node pairs, node triples and so on), thus cannot leverage the sparsity of the graph structure and are difficult to scale to large graphs. Some works study the universality of GNNs for approximating any invariant or equivariant functions over graphs (Maron et al., 2018; Chen et al., 2019; Maron et al., 2019b; Keriven and Peyré, 2019; Azizian and Lelarge, 2020). However, reaching universality would require polynomial($n$)-order tensors, which holds more theoretical value than practical applicability.
Dasoulas et al. (2019) propose to augment nodes of identical attributes with different colors, which requires exhausting all the coloring choices to reach universality. Similarly, Relational Pooling (RP) (Murphy et al., 2019) uses an ensemble of permutation-aware functions over graphs to reach universality, which requires exhausting all node permutations to achieve its theoretical power. Its local version, Local Relational Pooling (LRP) (Chen et al., 2020), applies RP to subgraphs around nodes; this is similar to our work, yet it still requires exhausting node permutations in local subgraphs and, moreover, loses RP's theoretical power. In contrast, NGNN maintains a controllable cost by only applying a message passing GNN to local subgraphs, and is guaranteed to be more powerful than 1-WL.

Because of the high cost of mimicking high-dimensional WL tests, several works have been proposed to increase GNNs' representation power within the message passing framework. Observing that different neighbors are indistinguishable during neighbor aggregation, some works propose to add one-hot node index features or random features to GNNs (Loukas, 2019; Sato et al., 2020). These methods work well when nodes naturally have distinct identities irrespective of the graph structure. However, although making GNNs more discriminative, they also lose some of GNNs' generalization ability, since nodes with identical neighborhoods are no longer guaranteed to have the same embedding; the resulting models are also no longer permutation invariant. Repeating the random initialization helps alleviate this issue but leads to much slower convergence (Abboud et al., 2020). An exception is structural message-passing (SMP) (Vignac et al., 2020), which propagates one-hot node index features to learn a global feature matrix for each node; the feature matrix is further pooled into a permutation-invariant node representation.
On the contrary, some works propose to use structural features to augment GNNs without hurting their generalization ability. SEAL (Zhang and Chen, 2018; Zhang et al., 2020), IGMC (Zhang and Chen, 2020) and DE (Li et al., 2020b) use distance-based features, where a distance vector w.r.t. the target node set to predict is calculated for each node as its additional features. Our NGNN framework is naturally compatible with such distance-based features due to its independent rooted-subgraph processing. GSN (Bouritsas et al., 2020) uses the counts of certain substructures to augment node/edge features, which also theoretically surpasses 1-WL. However, GSN needs a properly defined substructure set to incorporate domain-specific inductive biases, while NGNN aims to learn arbitrary substructures around nodes without the need to predefine a substructure set.
Concurrent to our work, You et al. (2021) propose Identity-aware GNN (ID-GNN). ID-GNN uses different weight parameters between each root node and its context nodes during message passing. It also extracts a rooted subgraph around each node, and can thus be viewed as a special case of NGNN with: 1) the number of message passing layers equal to the subgraph height, 2) the root node's intermediate representation used directly as its final representation without subgraph pooling, and 3) initial node features augmented with a 0/1 "identity". However, the extra power of ID-GNN comes only from the "identity" feature, while the power of NGNN comes from the subgraph pooling: without using any node features, NGNN is still provably more discriminative than 1-WL. Another work similar to ours is the natural graph network (NGN) (de Haan et al., 2020). NGN argues that graph convolution weights need not be shared among all nodes but only among (locally) isomorphic nodes. If we view our distance-based node features as refining the graph convolution weights so that nodes within a center node's neighborhood are no longer treated symmetrically, then our NGNN reduces to an NGN.
The idea of independently performing message passing within the $k$-hop neighborhood of each node is also explored in $k$-hop GNN (Nikolentzos et al., 2020) and MixHop (Abu-El-Haija et al., 2019). However, MixHop directly concatenates the aggregation results of neighbors at different hops as the root representation, which ignores the connections between the other nodes in the rooted subgraph. $k$-hop GNN sequentially performs message passing for $k$-hop, $(k{-}1)$-hop, …, and 0-hop nodes (the update of $i$-hop nodes depends on the updated states of $(i{+}1)$-hop nodes), while NGNN simultaneously performs message passing for all nodes in the subgraph and is thus more parallelizable. Both MixHop and $k$-hop GNN directly use the root node's representation as its final node representation. In contrast, NGNN uses a subgraph pooling to summarize all node representations within the subgraph into the final root representation, which distinguishes NGNN from other $k$-hop models. As Theorem 1 shows, the subgraph pooling enables using a much smaller number of message passing layers (as small as 1) than the depth of the subgraph, while MixHop and $k$-hop GNN always require as many layers as the number of hops. MixHop and $k$-hop GNN also lack NGNN's strong theoretical power to discriminate regular graphs. Like SEAL and $k$-hop GNN, G-Meta (Huang and Zitnik, 2020) is another work extracting subgraphs around nodes/links; it focuses specifically on a meta-learning setting.
5 Experiments
In this section, we study the effectiveness of the NGNN framework for graph classification and regression tasks. In particular, we want to answer the following questions:
Q1 Can NGNN reach its theoretical power to discriminate 1-WL-indistinguishable graphs?
Q2 How often and how much does NGNN improve the performance of a base GNN?
Q3 How does NGNN perform in comparison to state-of-the-art GNN methods on open benchmarks?
Q4 How much extra computation time does NGNN incur?
We implement the NGNN framework based on the PyTorch Geometric library
(Fey and Lenssen, 2019b). Our code is available at https://github.com/muhanzhang/NestedGNN.

5.1 Datasets
To answer Q1, we use a simulation dataset of regular graphs and the EXP dataset (Abboud et al., 2020), which contains 600 pairs of 1-WL-indistinguishable but non-isomorphic graphs. To answer Q2, we use the QM9 dataset (Ramakrishnan et al., 2014; Wu et al., 2018) and the TU datasets (Kersting et al., 2016). QM9 contains 130K small molecules; the task is to perform regression on twelve targets representing energetic, electronic, geometric, and thermodynamic properties, based on the graph structure and node/edge features. TU contains five graph classification datasets: D&D (Dobson and Doig, 2003), MUTAG (Debnath et al., 1991), PROTEINS (Dobson and Doig, 2003), PTC_MR (Toivonen et al., 2003), and ENZYMES (Schomburg et al., 2004). We used the datasets provided by PyTorch Geometric (Fey and Lenssen, 2019b), where for QM9 we performed unit conversions to match the units used by (Morris et al., 2019). The evaluation metric is Mean Absolute Error (MAE) for QM9 and Accuracy (%) for TU. To answer Q3, we use two Open Graph Benchmark (OGB) datasets (Hu et al., 2020), ogbg-molhiv and ogbg-molpcba. The ogbg-molhiv dataset contains 41K small molecules; the task is to classify whether a molecule inhibits HIV or not, evaluated by ROC-AUC. The ogbg-molpcba dataset contains 438K molecules with 128 classification tasks; the evaluation metric is Average Precision (AP) averaged over all the tasks. We include the statistics of the QM9 and OGB datasets in Table 1.

Dataset       | #Graphs | Avg. #nodes | Avg. #edges | Split ratio | #Tasks | Task type      | Metric
QM9           | 129,433 | 18.0        | 18.6        | 80/10/10    | 12     | Regression     | MAE
ogbg-molhiv   | 41,127  | 25.5        | 27.5        | 80/10/10    | 1      | Classification | ROC-AUC
ogbg-molpcba  | 437,929 | 26.0        | 28.1        | 80/10/10    | 128    | Classification | AP
5.2 Models
QM9. We use 1-GNN, 1-2-GNN, 1-3-GNN, and 1-2-3-GNN from (Morris et al., 2019) as both the baselines and the base GNNs of NGNN. Among them, 1-GNN is a standard message passing GNN with 1-WL power. 1-2-GNN is a GNN mimicking 2-WL, where message passing happens among 2-tuples of nodes. 1-3-GNN and 1-2-3-GNN mimic 3-WL, where message passing happens among 3-tuples of nodes. 1-2-GNN and 1-3-GNN use features computed by 1-GNN as initial node features, and 1-2-3-GNN uses the concatenated features from 1-2-GNN and 1-3-GNN. We additionally include the numbers provided by (Wu et al., 2018) and Deep LRP (Chen et al., 2020) as baselines. Note that we omit more recent methods (Anderson et al., 2019; Klicpera et al., 2020; Qiao et al., 2020) that use advanced physical representations calculated from angles, atom coordinates, and quantum mechanics, which may obscure the comparison of models' pure graph representation power. For NGNN, we uniformly use height-3 rooted subgraphs. For a fair comparison, the base GNNs in NGNN use exactly the same hyperparameters as when they are used alone, except for 1-GNN, where we increase the number of message passing layers from 3 to 5 to make the number of layers larger than the subgraph height, similar to (Zeng et al., 2020). For the subgraph pooling and graph pooling layers, we uniformly use mean pooling. All other settings follow (Morris et al., 2019).

TU. We use four widely adopted GNNs as the baselines and the base GNNs of NGNN: GCN (Kipf and Welling, 2016), GraphSAGE (Hamilton et al., 2017), GIN (Xu et al., 2018), and GAT (Veličković et al., 2017). Since TU datasets suffer from inconsistent evaluation standards (Errica et al., 2019), we uniformly use the 10-fold cross validation framework provided by PyTorch Geometric (Fey and Lenssen, 2019a) for all the models to ensure a fair comparison. For GNNs, we search the number of message passing layers over a small grid. For NGNNs, we similarly search the subgraph height $h$ over the same grid, so that both NGNNs and GNNs can have equal-depth local receptive fields. For NGNNs, we always use $h$ message passing layers instead of searching the number of layers together with $h$, because that would give NGNNs more hyperparameters to tune. All models have 32 hidden dimensions, and are trained for 100 epochs with a batch size of 128. For each fold, we record the test accuracy with the hyperparameters chosen based on the best validation performance of this fold. Finally, we report the average test accuracy across all the 10 folds.
OGB. We use GNNs achieving top places on the OGB graph classification leaderboard (https://ogb.stanford.edu/docs/leader_graphprop/, at the time of submission) as the baselines, including GCN (Kipf and Welling, 2016), GIN (Xu et al., 2018), DeeperGCN (Li et al., 2020a), Deep LRP (Chen et al., 2020), PNA (Corso et al., 2020), DGN (Beaini et al., 2020), GINE (Brossard et al., 2020), and PHC-GNN (Le et al., 2021). Note that high-order GNNs (Morris et al., 2019; Maron et al., 2019a; Chen et al., 2019; Morris et al., 2020) are not included here: despite being theoretically more discriminative, they are not among the GNNs with the best empirical performance on modern large-scale graph benchmarks, and their complexity also raises a scalability issue. For NGNN, we use GIN as the base GNN (although GIN is not among the strongest baselines here). Some baselines additionally use the virtual node technique (Gilmer et al., 2017; Li et al., 2015; Ishiguro et al., 2019); these are marked with "*". For NGNN, we search over the subgraph height and the number of layers. We train the NGNN models for 100 and 150 epochs on ogbg-molhiv and ogbg-molpcba, respectively, and report the validation and test scores at the best validation epoch. We also find that our models are subject to high performance variance across epochs, likely due to the increased expressiveness. Thus, we save a model checkpoint every 10 epochs, and additionally report the ensemble performance obtained by averaging the predictions from all checkpoints. The final hyperparameter choices and more details about the experimental settings are included in Appendix
C. All results are averaged over 10 independent runs.

In the following, we uniformly use "Nested GNN" to denote an NGNN model using "GNN" as the base GNN. For example, Nested GIN denotes an NGNN model using GIN (Xu et al., 2018) as the base GNN. For the NGNN models on the QM9, TU, and OGB datasets, we augment the initial features of each node with Distance Encoding (DE) (Li et al., 2020b), which uses the (generalized) distance between a node and the root as an additional feature, motivated by DE's successful applications in link-level tasks (Zhang and Chen, 2018, 2020). Note that such feature augmentation is not applicable to the baseline models, as discussed in Section 3.2. An ablation study on the effects of the DE features is included in Appendix E.
5.3 Results and discussion
To answer Q1, we first run a simulation to test NGNN's power to discriminate regular graphs. The results are presented in Appendix D. They match Theorem 1 almost perfectly, demonstrating that a practical NGNN can realize its theoretical power for discriminating regular graphs.
Method  Test Accuracy

GCN-RNI (Abboud et al., 2020)  98.0 ± 1.85
PPGN (Maron et al., 2019a)  50.0 ± 0.00
1-2-3-GNN (Morris et al., 2019)  50.0 ± 0.00
3-GCN (Abboud et al., 2020)  99.7 ± 0.004
Nested GIN  99.9 ± 0.26
We also test NGNN's expressive power using the EXP dataset provided by Abboud et al. (2020), which contains 600 carefully constructed pairs of 1-WL-indistinguishable but non-isomorphic graphs. The two graphs in each pair have different labels, so a standard message passing GNN cannot predict both correctly, resulting in an expected classification accuracy of only 50%. We exactly follow the experimental settings of Abboud et al. (2020) and copy the baseline results from that work. As shown in Table 2, our Nested GIN model achieves a 99.9% classification accuracy, outperforming all the baselines and distinguishing almost all of the 1-WL-indistinguishable graph pairs. These results verify that NGNN's expressive power is indeed beyond 1-WL and message passing GNNs.
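As a toy illustration of this point (our own example, not part of the EXP benchmark), consider the complete bipartite graph K_{3,3} and the triangular prism: both are 3-regular graphs on six nodes, so 1-WL colors every node identically and cannot tell them apart, yet their height-1 rooted subgraphs already differ. A minimal sketch:

```python
from itertools import combinations

def rooted_subgraph_sizes(edges, n, height=1):
    """For each node, extract its height-1 rooted subgraph and
    return the sorted list of induced edge counts."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    sizes = []
    for root in range(n):
        nodes = {root} | adj[root]  # height-1 ball around the root
        induced = sum(1 for a, b in combinations(sorted(nodes), 2) if b in adj[a])
        sizes.append(induced)
    return sorted(sizes)

# Two 1-WL-indistinguishable 3-regular graphs on 6 nodes:
k33   = [(u, v) for u in range(3) for v in range(3, 6)]          # complete bipartite
prism = [(0,1),(1,2),(0,2),(3,4),(4,5),(3,5),(0,3),(1,4),(2,5)]  # triangular prism

print(rooted_subgraph_sizes(k33, 6))    # [3, 3, 3, 3, 3, 3]  (each subgraph is a star)
print(rooted_subgraph_sizes(prism, 6))  # [4, 4, 4, 4, 4, 4]  (each contains a triangle)
```

Even this crude subgraph statistic separates the two graphs, whereas any degree-based 1-WL refinement assigns all twelve nodes the same color.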
To answer Q2, we adopt the QM9 and TU datasets. We show the QM9 results in Table 3. If the nested version of a base GNN achieves a better result than the base GNN itself, we color that cell light green. As we can see, NGNN brings performance gains to all base GNNs on most targets, sometimes by large margins. We show the results on TU in Table 4; NGNNs likewise improve over their base GNNs in most cases. These results indicate that NGNN is a general framework for improving a GNN's power. We further compute the maximum reduction of MAE on QM9 and the maximum improvement of accuracy on TU before and after applying NGNN: NGNN reduces the MAE by up to a factor of 7.9 on QM9, and increases the accuracy by up to 14.3% on TU. These results answer Q2, indicating that NGNN brings steady and significant improvement to base GNNs.
Target  DTNN  MPNN  Deep LRP  1-GNN  1-2-GNN  1-3-GNN  1-2-3-GNN  Ne. 1-GNN  Ne. 1-2-GNN  Ne. 1-3-GNN  Ne. 1-2-3-GNN  Max. reduction  (Ne. = Nested)

μ  0.244  0.358  0.364  0.493  0.493  0.473  0.476  0.428  0.437  0.436  0.433  1.2
α  0.95  0.89  0.298  0.78  0.27  0.46  0.27  0.29  0.278  0.261  0.265  2.7
ε_HOMO  0.00388  0.00541  0.00254  0.00321  0.00331  0.00328  0.00337  0.00265  0.00275  0.00265  0.00279  1.2
ε_LUMO  0.00512  0.00623  0.00277  0.00355  0.00350  0.00354  0.00351  0.00297  0.00271  0.00269  0.00276  1.3
Δε  0.0112  0.0066  0.00353  0.0049  0.0047  0.0046  0.0048  0.0038  0.0039  0.0039  0.0039  1.8
⟨R²⟩  17.0  28.5  19.3  34.1  21.5  25.8  22.9  20.5  20.4  20.2  20.1  1.7
ZPVE  0.00172  0.00216  0.00055  0.00124  0.00018  0.00064  0.00019  0.00020  0.00017  0.00017  0.00015  6.2
U₀  2.43  2.05  0.413  2.32  0.0357  0.6855  0.0427  0.295  0.252  0.291  0.205  7.9
U  2.43  2.00  0.413  2.08  0.107  0.686  0.111  0.361  0.265  0.278  0.200  5.8
H  2.43  2.02  0.413  2.23  0.070  0.794  0.0419  0.305  0.241  0.267  0.249  7.3
G  2.43  2.02  0.413  1.94  0.140  0.587  0.0469  0.489  0.272  0.287  0.253  4.0
C_v  0.27  0.42  0.129  0.27  0.0989  0.158  0.0944  0.174  0.0891  0.0879  0.0811  1.8
Method  D&D  MUTAG  PROTEINS  PTC_MR  ENZYMES
#Graphs  1178  188  1113  344  600
Avg. #nodes  284.32  17.93  39.06  14.29  32.63
GCN  71.6 ± 2.8  73.4 ± 10.8  71.7 ± 4.7  56.4 ± 7.1  27.3 ± 5.5
GraphSAGE  71.6 ± 3.0  74.0 ± 8.8  71.2 ± 5.2  57.0 ± 5.5  30.7 ± 6.3
GIN  70.5 ± 3.9  84.5 ± 8.9  70.6 ± 4.3  51.2 ± 9.2  38.3 ± 6.4
GAT  71.0 ± 4.4  73.9 ± 10.7  72.0 ± 3.3  57.0 ± 7.3  30.2 ± 4.2
Nested GCN  76.3 ± 3.8  82.9 ± 11.1  73.3 ± 4.0  57.3 ± 7.7  31.2 ± 6.7
Nested GraphSAGE  77.4 ± 4.2  83.9 ± 10.7  74.2 ± 3.7  57.0 ± 5.9  30.7 ± 6.3
Nested GIN  77.8 ± 3.9  87.9 ± 8.2  73.9 ± 5.1  54.1 ± 7.7  29.0 ± 8.0
Nested GAT  76.0 ± 4.4  81.9 ± 10.2  73.7 ± 4.8  56.7 ± 8.1  29.5 ± 5.7
Max. improvement  10.4%  13.4%  4.7%  5.7%  14.3%
  ogbg-molhiv (AUC)  ogbg-molpcba (AP)
Method  Validation  Test  Validation  Test
GCN*  83.84 ± 0.91  75.99 ± 1.19  24.95 ± 0.42  24.24 ± 0.34
GIN*  84.79 ± 0.68  77.07 ± 1.49  27.98 ± 0.25  27.03 ± 0.23
Deep LRP  82.09 ± 1.16  77.19 ± 1.40  –  –
DeeperGCN*  –  –  29.20 ± 0.25  27.81 ± 0.38
HIMP  –  78.80 ± 0.82  –  –
PNA  85.19 ± 0.99  79.05 ± 1.32  –  –
DGN  84.70 ± 0.47  79.70 ± 0.97  –  –
GINE*  –  –  30.65 ± 0.30  29.17 ± 0.15
PHC-GNN  82.17 ± 0.89  79.34 ± 1.16  30.68 ± 0.25  29.47 ± 0.26
Nested GIN*  83.17 ± 1.99  78.34 ± 1.86  29.15 ± 0.35  28.32 ± 0.41
Nested GIN* (ens)  80.80 ± 2.78  79.86 ± 1.05  30.59 ± 0.56  30.07 ± 0.37
To answer Q3, we compare Nested GIN with leading methods on the OGB leaderboard. The results are shown in Table 5. Nested GIN achieves highly competitive performance with these leading GNN models, despite using a relatively weak base GNN (GIN). Compared to GIN alone, Nested GIN shows clear performance gains. It achieves test scores of up to 79.86 and 30.07 on ogbg-molhiv and ogbg-molpcba, respectively, outperforming all the baselines. In particular, on the challenging ogbg-molpcba, Nested GIN achieves 30.07 and 28.32 test AP with and without ensembling, respectively, significantly outperforming the plain GIN model (27.03 test AP). These results demonstrate the strong empirical performance and potential of NGNN even compared to heavily tuned leaderboard models, despite using only GIN as the base GNN.
To answer Q4, we report the training time per epoch for GIN and Nested GIN on the OGB datasets. On ogbg-molhiv, GIN takes 54 s per epoch, while Nested GIN takes 183 s. On ogbg-molpcba, GIN takes 10 min per epoch, while Nested GIN takes 20 min. This verifies that NGNN has time complexity comparable to that of message passing GNNs. The extra cost comes from independently learning better node representations from rooted subgraphs, a trade-off for the higher expressivity.
In summary, our experiments firmly show that NGNN is a theoretically sound method that brings consistent gains to its base GNNs in a plug-and-play way. Furthermore, NGNN maintains a controllable time complexity compared to other more powerful GNNs.
Finally, we point out one memory limitation of the current NGNN implementation. NGNN currently does not scale to graph datasets with a large average node count (such as REDDIT-BINARY) or a large average node degree (such as ogbg-ppa), because a rooted subgraph must be copied to GPU memory for every node. Reducing the batch size or the subgraph height helps, but at the cost of performance degradation. One may wonder why materializing all the subgraphs in GPU memory is necessary. The reason is that we want to batch-process all the subgraphs simultaneously; otherwise, we would have to extract subgraphs sequentially on the fly, which results in much higher latency. We leave the exploration of memory-efficient NGNN to future work.
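To see why materializing every rooted subgraph is memory-hungry, note that the total number of node copies equals the sum of the height-h ball sizes over all roots, which grows quickly with the average degree. A rough back-of-the-envelope sketch (the ring graph is our own toy example, not one of the benchmark datasets):

```python
from collections import deque

def materialization_factor(adj, height):
    """Ratio of (total nodes across all rooted subgraphs) to (nodes in the
    original graph), i.e., how many times the graph is copied on average."""
    n = len(adj)
    total = 0
    for root in range(n):
        dist = {root: 0}
        q = deque([root])
        while q:  # BFS truncated at the given height
            u = q.popleft()
            if dist[u] == height:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += len(dist)  # size of this root's subgraph
    return total / n

# A 100-node ring: each height-3 ball contains 7 nodes -> 7x memory blow-up.
ring = {i: [(i - 1) % 100, (i + 1) % 100] for i in range(100)}
print(materialization_factor(ring, height=3))  # 7.0
```

For a sparse ring the factor is modest, but on a graph with average degree d the ball size grows roughly like d^h, which is exactly the regime where the current implementation runs out of GPU memory.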
6 Conclusions
We have proposed Nested Graph Neural Network (NGNN), a general framework for improving GNN’s representation power. NGNN learns node representations encoding rooted subgraphs instead of rooted subtrees. Theoretically, we prove NGNN can discriminate almost all regular graphs where 1WL always fails. Empirically, NGNN consistently improves the performance of various base GNNs across different datasets without incurring the complexity like other more powerful GNNs.
Acknowledgments
The authors greatly appreciate the reviewers' actionable suggestions for improving the manuscript. Li is partly supported by the 2021 JP Morgan Faculty Award and the National Science Foundation (NSF) award HDR-2117997.
References
 The surprising power of graph neural networks with random node initialization. arXiv preprint arXiv:2010.01179. Cited by: §4, §5.1, §5.3, Table 2.
 MixHop: higher-order graph convolutional architectures via sparsified neighborhood mixing. In International Conference on Machine Learning, pp. 21–29. Cited by: §4.
 Cormorant: covariant molecular neural networks. arXiv preprint arXiv:1906.04015. Cited by: §5.2.
 Characterizing the expressive power of invariant and equivariant graph neural networks. arXiv preprint arXiv:2006.15646. Cited by: §4.
 Directional graph networks. arXiv preprint arXiv:2010.02863. Cited by: §4, §5.2.
 Shortest-path kernels on graphs. In 5th IEEE International Conference on Data Mining, 8 pp. Cited by: §1.
 Improving graph neural network expressivity via subgraph isomorphism counting. arXiv preprint arXiv:2006.09252. Cited by: §4, §4.
 Graph convolutions that can finally model local structure. arXiv preprint arXiv:2011.15069. Cited by: §5.2.
 Strongly regular graphs. In Spectra of Graphs, pp. 115–149. Cited by: §3.3.
 Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §1.
 Can graph neural networks count substructures?. Advances in neural information processing systems. Cited by: §4, §5.2, §5.2.
 On the equivalence between graph isomorphism testing and function approximation with gnns. In Advances in Neural Information Processing Systems, pp. 15894–15902. Cited by: §1, §1, §3.4, §4, §5.2.
 Principal neighbourhood aggregation for graph nets. arXiv preprint arXiv:2004.05718. Cited by: §5.2.
 Discriminative embeddings of latent variable models for structured data. In Proceedings of The 33rd International Conference on Machine Learning, pp. 2702–2711. Cited by: §1.
 Coloring graph neural networks for node disambiguation. arXiv preprint arXiv:1912.06058. Cited by: §4.
 Natural graph networks. arXiv preprint arXiv:2007.08349. Cited by: §4.
 Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry 34 (2), pp. 786–797. Cited by: §5.1.
 Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3837–3845. Cited by: §1.
 Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology 330 (4), pp. 771–783. Cited by: §5.1.
 Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §1.
 A fair comparison of graph neural networks for graph classification. arXiv preprint arXiv:1912.09893. Cited by: §5.2.
 Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, Cited by: §5.2.
 Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428. Cited by: §5.1, §5.
 Graph U-Nets. arXiv preprint arXiv:1905.05178. Cited by: Appendix B.
 Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1263–1272. Cited by: §1, §3.1, §5.2.
 Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1025–1035. Cited by: §5.2.
 Convolution kernels on discrete structures. Technical report Citeseer. Cited by: §1.
 Open graph benchmark: datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687. Cited by: §5.1.
 Graph meta learning via local subgraphs. Advances in Neural Information Processing Systems 33. Cited by: §4.
 Graph warp module: an auxiliary module for boosting the power of graph neural networks. arXiv preprint arXiv:1902.01020. Cited by: §5.2.
 Universal invariant and equivariant graph neural networks. arXiv preprint arXiv:1905.04943. Cited by: §4.
 Benchmark data sets for graph kernels. External Links: Link Cited by: §5.1.
 Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §5.2, §5.2.
 Resistance distance. Journal of Mathematical Chemistry 12 (1), pp. 81–95. Cited by: Appendix C.
 Directional message passing for molecular graphs. arXiv preprint arXiv:2003.03123. Cited by: §5.2.
 The graphlet spectrum. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 529–536. Cited by: §1.
 Parameterized hypercomplex graph neural networks for graph classification. arXiv preprint arXiv:2103.16584. Cited by: §5.2.
 DeeperGCN: all you need to train deeper GCNs. arXiv preprint arXiv:2006.07739. Cited by: §5.2.
 Distance encoding: design provably more powerful GNNs for structural representation learning. arXiv preprint arXiv:2009.00142. Cited by: Appendix A, Appendix A, Appendix E, §1, §4, §4, §5.2.
 Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §1, §5.2.
 What graph neural networks cannot learn: depth vs width. arXiv preprint arXiv:1907.03199. Cited by: §4.
 Provably powerful graph networks. In Advances in Neural Information Processing Systems, pp. 2156–2167. Cited by: §1, §1, §3.3, §3.4, §4, §5.2, Table 2.
 Invariant and equivariant graph networks. arXiv preprint arXiv:1812.09902. Cited by: §4.
 On the universality of invariant networks. In International conference on machine learning, pp. 4363–4371. Cited by: §4.
 Weisfeiler and Leman go sparse: towards scalable higher-order graph embeddings. Cited by: §1, §4, §5.2.
 Weisfeiler and Leman go neural: higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4602–4609. Cited by: §1, §1, §3.1, §3.4, §4, §5.1, §5.2, §5.2, Table 2.
 Relational pooling for graph representations. In International Conference on Machine Learning, pp. 4663–4673. Cited by: §4.
 Propagation kernels: efficient graph kernels from propagated information. Machine Learning 102 (2), pp. 209–245. Cited by: §1.
 K-hop graph neural networks. Neural Networks 130, pp. 195–205. Cited by: §4.
 DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710. Cited by: §1.

 OrbNet: deep learning for quantum chemistry using symmetry-adapted atomic-orbital features. The Journal of Chemical Physics 153 (12), pp. 124111. Cited by: §5.2.
 Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1 (1), pp. 1–7. Cited by: §5.1.
 Random features strengthen graph neural networks. arXiv preprint arXiv:2002.03155. Cited by: §4.
 The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §1.

 BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Research 32 (suppl_1), pp. D431–D433. Cited by: §5.1.
 Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12 (Sep), pp. 2539–2561. Cited by: §1, §2.2, §3.1.
 Efficient graphlet kernels for large graph comparison.. In AISTATS, Vol. 5, pp. 488–495. Cited by: §1.
 Statistical evaluation of the predictive toxicology challenge 2000–2001. Bioinformatics 19 (10), pp. 1183–1193. Cited by: §5.1.
 Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §5.2.
 Building powerful and equivariant graph neural networks with structural message-passing. arXiv e-prints. Cited by: §4.
 Graph kernels. Journal of Machine Learning Research 11 (Apr), pp. 1201–1242. Cited by: §1.
 A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia 2 (9), pp. 12–16. Cited by: §1, §2.2.
 MoleculeNet: a benchmark for molecular machine learning. Chemical science 9 (2), pp. 513–530. Cited by: §5.1, §5.2.
 How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §3.1, §4, §5.2, §5.2, §5.2.
 Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pp. 4800–4810. Cited by: Appendix B, §1.
 Identity-aware graph neural networks. arXiv preprint arXiv:2101.10320. Cited by: §4, §4.
 Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401. Cited by: §3.3.
 Deep graph neural networks with shallow subgraph samplers. arXiv preprint arXiv:2012.01380. Cited by: Appendix B, §5.2.
 Weisfeiler-Lehman neural machine for link prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 575–583. Cited by: §2.2.
 Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pp. 5165–5175. Cited by: Appendix E, §4, §5.2.
 Inductive matrix completion based on graph neural networks. In International Conference on Learning Representations, External Links: Link Cited by: Appendix E, §4, §5.2.
 An end-to-end deep learning architecture for graph classification. In AAAI, pp. 4438–4445. Cited by: Appendix B, §1.
 Revisiting graph neural networks for link prediction. arXiv preprint arXiv:2010.16103. Cited by: §4.
Appendix A Proof of Theorem 1
The proof is inspired by the previous theoretical characterization of the power of distance features by Li et al. [2020b]. Basically, performing rooted subgraph extraction of a certain height around a center node is essentially equivalent to injecting distance features that indicate whether the distance between a node and the center node is below that height. In the following, we explicitly show how these distance features make NGNN more powerful than the 1-WL test. Let us first outline the proof. Consider two n-node r-regular graphs, and pick one node from each graph. By performing rooted subgraph extraction of a certain height around these two nodes, due to the implicit distance features, we can show that the nodes on the boundary of the two obtained subgraphs receive special node representations. These special node representations are then propagated within the subgraphs. After some steps of propagation, we prove that NGNN, by leveraging the subgraph pooling (Eq. 4), can distinguish the two subgraphs. This implies that NGNN may generate different node representations for the two picked nodes. Then, a union bound transforms this difference in node representations into a difference between the representations of the two graphs. Note that the proof assumes there are no node/edge attributes to leverage; additional node/edge attributes can only improve the chance of distinguishing the two graphs.
The first lemma analyzes the difference between the structures of the rooted subgraphs around two nodes of two n-node r-regular graphs. Before introducing it, we define a notion termed the edge configuration. For a node in graph , let denote the set of nodes in that are exactly hop neighbors of , i.e., whose shortest path distance to is . Then, the height rooted subgraph around the center node is the subgraph induced by the node set .
Definition 3.
The edge configuration between and is a list where denotes the number of nodes in , each of which has exactly edges from .
When we say that two edge configurations (between and , and between and ) are equal, we mean that the two lists are component-wise equal. Obviously, we should also have if . Now we are ready to state the first lemma.
Lemma 1.
For two graphs and that are uniformly and independently sampled from all n-node r-regular graphs, where , pick any two nodes, one from each graph, denoted by and respectively. Then, there is at least one
with probability
such that . Moreover, with at least the same probability, for all , the number of edges between and is at least for .

Proof.
This lemma follows from steps 1–3 of the proof of Theorem 3.3 in Li et al. [2020b]. ∎
Now, we set . We focus on the two extracted subgraphs and . We first prove a lemma showing that, with a certain number of layers, a proper NGNN generates different representations for and , i.e., and in Eq. 4.
Lemma 2.
For two graphs and that are uniformly and independently sampled from all n-node r-regular graphs, where , pick any two nodes, one from each graph, denoted by and respectively, and perform height rooted subgraph extraction around and . With at most many layers, a proper message passing GNN (with injective and subgraph pooling) generates different representations for the two extracted subgraphs with probability at least .
Proof.
According to Lemma 1, we know that with probability , there exists at least one such that . So there exists at least one that makes (thus the difference in edge configurations appears in and ), and we pick the largest such .
Now let us consider running a message passing GNN over the two subgraphs , . All nodes are initialized with the same node features. The nodes of these two subgraphs can be categorized into (), for respectively. Next, let us consider the node representations in these categories during the message passing procedure. We have the following observations.

Note that all nodes other than those in have degree in both subgraphs. Therefore, in the th iteration, the nodes in for share the same node representation. We call this node representation the default representation. Note that if we did not perform rooted subgraph extraction, all nodes of any regular graph would hold the default representation.

Node representations different from the default representation first appear among the nodes in after the first iteration. This is because there are at least edges between and before the subgraph extraction (due to Lemma 1), and all these edges are removed in the extracted subgraphs. Then, almost all nodes in have reduced degrees (and thus no longer have degree to keep the default representation) within the corresponding extracted subgraphs. We uniformly call node representations that differ from the default ones new representations. New representations may be mutually different.

These new node representations then propagate to nodes in , and so forth, via iterative message passing. Moreover, after iterations of such propagation, the new representations make almost all nodes in hold representations different from those of almost all nodes in for , which can be easily obtained by induction from to .
Based on the above three observations, we may compare the propagation procedure between and . Suppose that in the first steps of message passing, the sets of node representations (both default and new) remain the same between the two extracted subgraphs; if not, the result is already proven. As the subgraphs hold different edge configurations in , when the new node representations propagate from to , they necessarily induce different sets of new node representations in and . At this point, node representations remain the same between and for , as they are all default representations. Although also holds new node representations, they differ from those in for . Therefore, if an injective subgraph pooling operation is adopted, the obtained representations of and , i.e., and , are different. ∎
Based on Lemma 2, a union bound comparing a node representation of with all node representations of yields the final conclusion. Specifically, consider a node of , say , and an arbitrary node of , say . By Lemma 2, with probability , is different from . Then, by the union bound, with probability , we have . Therefore, if the final graph pooling (Eq. 5) is injective, NGNN is guaranteed to generate different representations for and .
Appendix B Design choices of NGNN
In this section, we discuss some other design choices of NGNN.
High-order NGNN. NGNN is a two-level GNN (a GNN of GNNs): a base GNN learns a final node representation from each rooted subgraph, and an outer GNN (graph pooling) learns a graph representation from the base GNN's outputs. This design involves one level of nesting, which we call a first-order NGNN. To extend the framework, we propose high-order NGNN, where the base GNN is itself an NGNN. That is, each subgraph representation learning task is performed by a first-order NGNN, treating each subgraph the same as the graph in the original NGNN. This yields a second-order NGNN with two levels of nesting (a GNN of NGNNs, or a GNN of GNNs of GNNs). Repeating this construction, we can in principle build an NGNN of arbitrary order. It is interesting to investigate whether high-order NGNNs can further enhance the representation power and practical performance of a base GNN. We leave the exploration of such architectures to future work.
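The recursion above can be sketched abstractly as follows. All names here are our own, and the order-0 base case is a trivial stand-in readout (a sorted degree histogram) rather than a trained GNN; the point is only to show how order-k pools the order-(k-1) representations of rooted subgraphs:

```python
from collections import Counter

def extract_ball(adj, root, height):
    """Induced subgraph (as an adjacency dict) of nodes within `height` hops."""
    frontier, nodes = {root}, {root}
    for _ in range(height):
        frontier = {v for u in frontier for v in adj[u]} - nodes
        nodes |= frontier
    return {u: [v for v in adj[u] if v in nodes] for u in nodes}

def nested_repr(adj, order, height):
    """Order 0: stand-in base GNN readout (sorted degree histogram).
    Order k: pool the order-(k-1) representation of each rooted subgraph."""
    if order == 0:
        return tuple(sorted(Counter(len(nb) for nb in adj.values()).items()))
    return tuple(sorted(nested_repr(extract_ball(adj, v, height), order - 1, height)
                        for v in adj))

# Two 3-regular graphs with identical degree histograms: the order-0
# readout ties, but one level of nesting (i.e., NGNN) separates them.
k33   = {0: [3,4,5], 1: [3,4,5], 2: [3,4,5], 3: [0,1,2], 4: [0,1,2], 5: [0,1,2]}
prism = {0: [1,2,3], 1: [0,2,4], 2: [0,1,5], 3: [4,5,0], 4: [3,5,1], 5: [3,4,2]}
print(nested_repr(k33, 0, 1) == nested_repr(prism, 0, 1))  # True
print(nested_repr(k33, 1, 1) == nested_repr(prism, 1, 1))  # False
```

An order-1 call corresponds to NGNN as described in the paper; order 2 would be a GNN of NGNNs, at the cost of one more multiplicative factor in subgraph materialization.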
Pooling functions and . To summarize node representations into a subgraph/graph representation, we need a readout (pooling) function. Popular choices include sum, mean, and max, as well as more complex ones such as selecting the top nodes [Zhang et al., 2018, Gao and Ji, 2019] and hierarchical approaches [Ying et al., 2018]. In this paper, we find that mean pooling, which directly takes the mean of the node representations as the subgraph/graph representation, works very well. We also find another pooling function, called center pooling (CP), to be sometimes useful for subgraph pooling. CP directly uses the root node's representation to represent the entire subgraph. The success of CP relies on using more layers of message passing than the height of the rooted subgraph, so that the intermediate representation of the root node alone carries sufficient information about the entire subgraph. This is feasible for rooted subgraphs with a small height. Note that when the number of message passing layers is smaller than the subgraph height, NGNN with CP reduces to a standard message passing GNN.
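The two subgraph readouts can be contrasted in a few lines. In this numpy sketch (our own illustration), the node-representation matrix stands in for the base GNN's output on one rooted subgraph, with row 0 assumed to be the root:

```python
import numpy as np

def mean_pool(node_reprs: np.ndarray) -> np.ndarray:
    """Mean pooling: average all node representations of the subgraph."""
    return node_reprs.mean(axis=0)

def center_pool(node_reprs: np.ndarray, root_idx: int = 0) -> np.ndarray:
    """Center pooling (CP): represent the subgraph by its root node alone.
    Only meaningful when the base GNN runs more message passing layers than
    the subgraph height, so the root has absorbed the whole subgraph."""
    return node_reprs[root_idx]

reprs = np.array([[1.0, 0.0],   # root node
                  [0.0, 2.0],
                  [3.0, 2.0]])
print(mean_pool(reprs))    # elementwise mean of the three rows: [4/3, 4/3]
print(center_pool(reprs))  # the root row alone: [1.0, 0.0]
```

Both functions map a variable-size set of node representations to one fixed-size vector, which is the only property the subgraph pooling step requires.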
Subgraph height and base GNN layers . NGNN is flexible in choosing the subgraph height and the number of message passing layers in the base GNN. Theorem 1 provides a guide for choosing them when discriminating regular graphs. In practice, we find that moderate values of both generally perform well across various tasks. Too small a subgraph height restricts the receptive field, causing NGNN to learn overly local features; too large a height may cause each rooted subgraph to include the entire graph. For the number of message passing layers, we find that using more layers than the subgraph height performs better. This can be explained by the fact that more layers let each node in a rooted subgraph more fully absorb the whole-subgraph information, yielding a better intermediate node representation that reflects the node's structural position within the subgraph. Please refer to [Zeng et al., 2020] for more motivation for using more message passing layers than the subgraph height.
Appendix C More details about the experimental settings
The experiments were run on a Linux server with 64 GB memory, two NVIDIA RTX 2080S (8 GB) GPUs, and an Intel i9-9900 8-core CPU. For ogbg-molhiv, the final NGNN architecture used a rooted subgraph height and a number of GIN layers . Mean pooling is used for both the subgraph and graph pooling. The final NGNN architecture for ogbg-molpcba used a rooted subgraph height and a number of GIN layers . Center pooling (CP) is used for the subgraph pooling and mean pooling for the graph pooling. Although we searched over the subgraph height and the number of layers, we found that the final performance is not very sensitive to these hyperparameters as long as the height is between 3 and 5 and . For the DE features, we use the shortest path distance and the resistance distance [Klein and Randić, 1993].
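The resistance distance can be computed from the Moore-Penrose pseudoinverse of the graph Laplacian as R(u, v) = L⁺_uu + L⁺_vv − 2·L⁺_uv. A numpy sketch (our own illustration, not the paper's implementation):

```python
import numpy as np

def resistance_distance(adj_matrix: np.ndarray) -> np.ndarray:
    """All-pairs resistance distance of a connected graph [Klein and Randic, 1993].
    R(u, v) = L+_{uu} + L+_{vv} - 2 L+_{uv}, where L+ is the pseudoinverse
    of the graph Laplacian L = D - A."""
    laplacian = np.diag(adj_matrix.sum(axis=1)) - adj_matrix
    lp = np.linalg.pinv(laplacian)
    d = np.diag(lp)
    return d[:, None] + d[None, :] - 2 * lp

# Triangle graph: between any two nodes, a direct edge (resistance 1) is in
# parallel with a two-edge path (resistance 2), giving 1*2/(1+2) = 2/3.
triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
print(resistance_distance(triangle)[0, 1])  # ~0.6667
```

Unlike the shortest path distance, the resistance distance accounts for all paths between two nodes, so the two generalized distances give complementary DE features.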
Appendix D Simulation experiments to verify Theorem 1
We conduct a simulation over random regular graphs to validate Lemma 2 (how well NGNN distinguishes nodes of regular graphs) and Theorem 1 (how well NGNN distinguishes regular graphs). The results are shown in Figure 3 and match our theory almost perfectly. Basically, for each node number , we uniformly sample 100 3-regular graphs at random, and then apply an untrained NGNN to these graphs to see how often NGNN can distinguish the nodes and graphs at different rooted subgraph heights and node numbers . The required at different matches almost perfectly the lower bound in Lemma 2. More details are given in the caption of Figure 3.
Appendix E Ablation study on DE
In this paper, we choose Distance Encoding (DE) [Li et al., 2020b] to augment the initial node features of NGNN, due to its good theoretical properties for improving the expressive power of message passing GNNs, as well as its superb empirical performance on link prediction tasks [Zhang and Chen, 2018, 2020]. DE encodes the distance between a node and the root node into a vector through an embedding layer. The distance embedding is concatenated with the node's raw features as its new features (in this rooted subgraph), which are input to the base GNN. Note that when this node appears in another rooted subgraph, it may have a different distance to that subgraph's root, thus receiving different DE features in different subgraphs. Only the NGNN framework can leverage such subgraph-specific feature augmentation: a standard GNN treats a node the same no matter which node's rooted subgraph/subtree it appears in.
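Concretely, the augmentation can be sketched as follows. This illustrative code (names are our own) uses a one-hot distance encoding truncated at the subgraph height as a stand-in for the learned embedding layer described above:

```python
from collections import deque

def de_augmented_features(adj, raw_feats, root, height):
    """For one rooted subgraph: concatenate each node's raw feature list with
    a one-hot encoding of its shortest path distance to the root (a stand-in
    for the learned distance embedding)."""
    dist = {root: 0}
    q = deque([root])
    while q:  # BFS truncated at the subgraph height
        u = q.popleft()
        if dist[u] == height:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    out = {}
    for v, d in dist.items():
        one_hot = [1.0 if d == k else 0.0 for k in range(height + 1)]
        out[v] = raw_feats[v] + one_hot  # list concatenation
    return out

# The same node gets different DE features in different rooted subgraphs:
path = {0: [1], 1: [0, 2], 2: [1]}
feats = {v: [float(v)] for v in path}
print(de_augmented_features(path, feats, root=0, height=2)[2])  # [2.0, 0.0, 0.0, 1.0]
print(de_augmented_features(path, feats, root=2, height=2)[2])  # [2.0, 1.0, 0.0, 0.0]
```

The last two lines show the subgraph-specific nature of DE: node 2 is at distance 2 from root 0 but at distance 0 from root 2, so its augmented features differ between the two rooted subgraphs, which a standard GNN cannot express.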
Method  μ  α  ε_HOMO  ε_LUMO  Δε  ⟨R²⟩  ZPVE  U₀  U  H  G  C_v

1-GNN  0.493  0.78  0.00321  0.00355  0.0049  34.1  0.00124  2.32  2.08  2.23  1.94  0.27
Nested 1-GNN (no DE)  0.466  0.38  0.00292  0.00294  0.0042  24.0  0.00040  1.09  1.76  1.04  1.19  0.111
Nested 1-GNN (with DE)  0.428  0.29  0.00265  0.00297  0.0038  20.5  0.00020  0.295  0.361  0.305  0.489  0.174
1-2-GNN  0.493  0.27  0.00331  0.00350  0.0047  21.5  0.00018  0.0357  0.107  0.070  0.140  0.0989
Nested 1-2-GNN (no DE)  0.454  0.308  0.00280  0.00278  0.0041  23.3  0.00029  0.349  0.281  0.395  0.307  0.0945
Nested 1-2-GNN (with DE)  0.437  0.278  0.00275  0.00271  0.0039  20.4  0.00017  0.252  0.265  0.241  0.272  0.0891
1-3-GNN  0.473  0.46  0.00328  0.00354  0.0046  25.8  0.00064  0.6855  0.686  0.794  0.587  0.158
Nested 1-3-GNN (no DE)  0.448  0.298  0.00276  0.00276  0.0040  22.0  0.00025  0.410  0.396  0.370  0.422  0.0936
Nested 1-3-GNN (with DE)  0.436  0.261  0.00265  0.00269  0.0039  20.2  0.00017  0.291  0.278  0.267  0.287  0.0879
1-2-3-GNN  0.476  0.27  0.00337  0.00351  0.0048  22.9  0.00019  0.0427  0.111  0.0419  0.0469  0.0944
Nested 1-2-3-GNN (no DE)  0.449  0.306  0.00282  0.00286  0.0041  22.0  0.00023  0.220  0.218  0.268  0.205  0.0975
Nested 1-2-3-GNN (with DE)  0.433  0.265  0.00279  0.00276  0.0039  20.1  0.00015  0.205  0.200  0.249  0.253  0.0811
In this section, we conduct ablation experiments to study the effect of the DE features. We choose QM9 as the testbed. The base GNNs are the same as in Table 3. We compare each base GNN with its nested version without DE features (no DE) and its nested version with DE features (with DE). The results are shown in Table 6.
In Table 6, we color a cell light green if the NGNN (no DE) is better than the base GNN, and green if the NGNN (with DE) is additionally better than the NGNN (no DE). From the results, we first observe that NGNNs (no DE) generally outperform the base GNNs, validating that even without any feature augmentation the NGNN framework still enhances the performance of base GNNs. Furthermore, when NGNN improves over the base GNN, adding DE features can further enlarge the improvement, achieving the smallest MAEs among the three models (base GNN, NGNN (no DE), and NGNN (with DE)). This demonstrates the usefulness of augmenting NGNN with DE features. Note that the DE features can be computed during the rooted subgraph extraction process, adding only a negligible amount of time. Thus, augmenting NGNN with DE features is an almost free yet powerful way to further enhance NGNN's power, which motivates us to make it a default choice of NGNN.