1. Introduction
Graph Neural Networks (GNNs) capture local graph structure and feature information in a trainable fashion to derive powerful node representations. They have shown promising success on multiple graph-based machine learning tasks (ying2018graph; scarselli2008graph; hu2020open) and are widely adopted by various web applications including social networks (ying2018graph; rossi2020temporal), recommendation (berg2017graph; fan2019graph; yu2021self), fraud detection (wang2019fdgars; li2019spam; liu2020alleviating), etc. Various strategies have been proposed to improve the expressiveness of GNNs (hamilton2017inductive; velivckovic2017graph; xu2018powerful; schlichtkrull2018modeling).
One natural candidate for improving the performance of a GNN is to increase its parameter size by either expanding the hidden dimension or the number of GNN layers. However, this can result in a large computational cost with only a modest performance gain. As a representative example, Figure 1 displays the performance of GraphSage (hamilton2017inductive) under different settings on the ogbn-products dataset and the corresponding model parameter sizes. From these results, it can be seen that either increasing the hidden dimension or increasing the number of GNN layers grows the model parameter size rapidly but brings little performance improvement in terms of test accuracy. For example, in order to improve the accuracy of a 3-layer GraphSage model by 1%, we need to add 2.3× more parameters (by increasing the hidden dimension from 256 to 512). Furthermore, with a larger hidden dimension a model is more likely to overfit the training data. On the other hand, stacking multiple GNN layers may over-smooth the features of nodes (oono2019graph; chen2020measuring). As shown in Figure 1(a), GraphSage reaches its peak performance with only 3 GNN layers and a hidden dimension of 512.
Inspired by the Network-in-Network architecture (lin2013network), we present Network-in-Graph Neural Network (NGNN), a model-agnostic methodology that allows arbitrary GNN models to increase their capacity by making the model deeper. However, instead of adding more GNN layers, NGNN deepens a GNN model by inserting nonlinear feedforward neural network layer(s) within each GNN layer. This leads to a much smaller memory footprint than recent alternative deep GNN architectures (li2019deepgcns; li2020deepergcn) and can be applied to all kinds of GNN models with various training methods including full-graph training, neighbor sampling (hamilton2017inductive), cluster-based sampling (chiang2019cluster) and local subgraph sampling (zhang2018link). Thus, it can easily scale to large graphs. Moreover, an analysis of NGNN in conjunction with GraphSage on perturbed ogbn-products shows that NGNN is a cheap yet effective way to keep the model stable against both node feature and graph structure perturbations.
In this work, we applied NGNN to GCN (kipf2016semi), GraphSage (hamilton2017inductive), GAT (velivckovic2017graph), AGDN (sun2020adaptive) and SEAL (zhang2018link). We also combined the proposed technique with different mini-batch training methods including neighbor sampling, graph clustering and local subgraph sampling. We conducted comprehensive experiments on several large-scale graph datasets for both node classification and link prediction, leading to the following conclusions (which hold as of the time of this submission):

NGNN improves the performance of GraphSage and GAT and their variants on node classification datasets including ogbn-products, ogbn-arxiv, ogbn-proteins and reddit. It improves the test accuracy by 1.6% on the ogbn-products dataset for GraphSage. Furthermore, NGNN with AGDN+BoT+self-KD+C&S (huang2020combining) achieves the fourth place on the ogbn-arxiv leaderboard (https://ogb.stanford.edu/docs/leader_nodeprop/) and NGNN with GAT+BoT (wang2021bag) achieves the second place on the ogbn-proteins leaderboard with many fewer model parameters.

NGNN improves the performance of SEAL, GCN and GraphSage and their variants on link prediction datasets including ogbl-collab, ogbl-ppa and ogbl-ddi. For example, it increases the test hit@100 score by 7.08% on the ogbl-ppa dataset for SEAL, which outperforms all the state-of-the-art approaches on the ogbl-ppa leaderboard (https://ogb.stanford.edu/docs/leader_linkprop/) by a substantial margin. Furthermore, NGNN achieves an improvement of the test hit@20 score by 6.22% on the ogbl-ddi dataset for GraphSage+EdgeAttr, which also takes the first place on the ogbl-ddi leaderboard.

NGNN improves the performance of GraphSage and GAT under different training methods including full-graph training, neighbor sampling, graph clustering, and subgraph sampling.

NGNN is a more effective way of improving model performance than expanding the hidden dimension: it requires fewer parameters and less training time to achieve better performance than simply doubling the hidden dimension.
In summary, we present NGNN, a method that deepens a GNN model without adding extra GNN message-passing layers. We show that NGNN significantly improves the performance of vanilla GNNs on various datasets for both node classification and link prediction, and we demonstrate the generality of NGNN by applying it to various GNN architectures.
2. Related Work
Deep models have been widely studied in various domains including computer vision (simonyan2014very; he2016deep), natural language processing (brown2020language), and speech recognition (zhang2017very). VGG (simonyan2014very) investigates the effect of convolutional neural network depth on accuracy in the large-scale image recognition setting, demonstrating that the depth of representations is essential to model performance, although accuracy does not always grow with depth. ResNet (he2016deep) eases the difficulty of training deep models by introducing residual connections between the input and output of layers. DenseNet (huang2017cvpr) takes this idea a step further by adding connections across layers. GPT-3 (brown2020language) presents an autoregressive language model with 96 layers that achieves SOTA performance on various NLP tasks. Even so, while deep neural networks have achieved great success in various domains, the use of deep models in graph representation learning is less well-established.
Most recent works (li2019deepgcns; li2020deepergcn; li2021training) attempt to train deep GNN models with a large number of parameters and achieve SOTA performance. For example, DeepGCN (li2019deepgcns) adapts the concepts of residual connections, dense connections, and dilated convolutions (yu2015multi) to training very deep GCNs. However, DeepGCN and its successor DeeperGCN (li2020deepergcn) have large memory footprints during model training, which can be subject to current hardware limitations. RevGNN (li2021training) explores grouped reversible graph connections to train deep GNNs with a much smaller memory footprint. However, RevGNN can only work with full-graph training and cluster-based mini-batch training, which makes it difficult to combine with other methods designed for large-scale graphs such as neighbor sampling (hamilton2017inductive) and layer-wise sampling (chen2018fastgcn). In contrast, NGNN deepens a GNN model by inserting nonlinear feedforward layer(s) within each GNN layer. It can be applied to all kinds of GNN models with various training methods including full-graph training, neighbor sampling (hamilton2017inductive), layer-wise sampling (chen2018fastgcn) and cluster-based sampling (chiang2019cluster).
Xu et al. (xu2018powerful) used Multi-layer Perceptrons (MLPs) to learn the injective functions of the Graph Isomorphism Network (GIN) model and showed their effectiveness on graph classification tasks, but they did not show whether adding an MLP within GNN layers works effectively across wide-ranging node classification and link prediction tasks. Additionally, You et al. (you2020design) mentioned that adding MLPs within GNN layers could benefit performance. However, they did not systematically analyze the reason for the performance improvement introduced by extra nonlinear layers, nor evaluate numerous SOTA GNN architectures on large-scale graph datasets for both node classification and link prediction tasks.
3. Building Network in Graph Neural Network Models
3.1. Preliminaries
A graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is composed of nodes $\mathcal{V}$ and edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. Furthermore, $A$ denotes the corresponding adjacency matrix of $\mathcal{G}$. Let $X \in \mathbb{R}^{|\mathcal{V}| \times d}$ be the node feature space such that $x_v \in \mathbb{R}^{d}$ represents the node feature of node $v \in \mathcal{V}$. Formally, the $l$-th layer of a GNN is defined as (we omit edge features for simplicity):

(1)  $h^{(l)} = \sigma\big(f^{(l)}_{\theta}(\mathcal{G}, h^{(l-1)})\big)$

where the function $f^{(l)}_{\theta}$ is determined by learnable parameters $\theta$ and $\sigma$ is an optional activation function. Additionally, $h^{(l)}$ represents the embeddings of the nodes in the $l$-th layer, and $h^{(l)} = X$ when $l = 0$. With an $L$-layer GNN, the node embeddings $h^{(L)}$ in the last layer are used by downstream tasks like node classification and link prediction.
3.2. Basic NGNN Design
Inspired by the network-in-network architecture (lin2013network), we deepen a GNN model by inserting nonlinear feedforward neural network layer(s) within each GNN layer. The $l$-th layer in NGNN is thus constructed as:

(2)  $h^{(l)} = g^{(l)}\big(\sigma(f^{(l)}_{\theta}(\mathcal{G}, h^{(l-1)}))\big)$

The calculation of $g^{(l)}$ is defined layer-wise as:

(3)  $z^{(l,k)} = \sigma\big(W^{(l,k)} z^{(l,k-1)}\big), \quad k = 1, \dots, T$

where the $W^{(l,k)}$ are learnable weight matrices, $\sigma$ is an activation function, and $T$ is the number of in-GNN nonlinear feedforward neural network layers. The first in-GNN layer ($k = 1$) takes the output of the GNN transformation, $z^{(l,0)} = \sigma(f^{(l)}_{\theta}(\mathcal{G}, h^{(l-1)}))$, as input and performs the nonlinear transformation; the last one produces $h^{(l)} = z^{(l,T)}$.
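To make this construction concrete, the sketch below shows one way an NGNN layer could wrap a message-passing layer with in-GNN feedforward transformations following Equations (2)-(3). It uses DGL's SAGEConv for illustration; the class and argument names (NGNNSAGEConv, num_ngnn_layers) are ours and do not refer to any released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import SAGEConv


class NGNNSAGEConv(nn.Module):
    """GraphSage layer with T extra nonlinear feedforward layers inside (Eq. 2-3).

    Illustrative sketch only; the structure follows the paper's description.
    """

    def __init__(self, in_feats, out_feats, num_ngnn_layers=1):
        super().__init__()
        self.conv = SAGEConv(in_feats, out_feats, aggregator_type="mean")
        # T in-GNN feedforward layers, each with a weight matrix W^{(l,k)}.
        self.ngnn_layers = nn.ModuleList(
            [nn.Linear(out_feats, out_feats) for _ in range(num_ngnn_layers)]
        )

    def forward(self, graph, feat):
        # Standard message passing f_theta(G, h^{(l-1)}) followed by the activation.
        h = F.relu(self.conv(graph, feat))
        # In-GNN nonlinear feedforward layers: the NGNN part of the layer.
        for layer in self.ngnn_layers:
            h = F.relu(layer(h))
        return h
```

A full NGNN-GraphSage model simply stacks several such layers; Section 4.6 discusses enabling the extra feedforward layers only for the input or hidden GNN layers.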
3.3. Discussion
In this section, we demonstrate that an NGNN architecture can better handle both noisy node features and noisy graph structures relative to its vanilla GNN counterpart.
Remark 1.
GNNs work well when the input features consist of true features plus noise that is distinguishable from them. But when the true features are mixed with noise, GNNs can struggle to filter out the noise, especially as the noise level increases.
GNNs follow a neural message passing scheme (gilmer2017neural) to aggregate information from the neighbors of a target node. In doing so, they can perform noise filtering and learn from the resulting signal when the noise is in some way distinguishable from the true features, such as when the latter are mostly low-frequency (nt2019revisiting). However, when the noise level becomes too large and the noise is mixed with the true features, it cannot easily be reduced by GNNs (huang2019residual). Figure 2 demonstrates this scenario. Here we randomly added Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ to the node features $x$ in the ogbn-products data, where $\sigma$ is the standard deviation ranging from 0.1 to 5.0. We adopt two different methods for adding noise: 1) concatenating the noise to the features, $x' = [x \,\|\, \epsilon]$, as shown in Figure 2(a), where $\|$ is a concatenation operation, and 2) adding the noise to the features, $x' = x + \epsilon$, as shown in Figure 2(b). We trained GraphSage models (using the DGL (wang2020deep) implementation) under five different settings: 1) baseline GraphSage with the default 3-layer structure and a hidden dimension of 256, denoted as GraphSage; 2) GraphSage with the hidden dimension increased to 512, denoted as GraphSage-512; 3) 4-layer GraphSage, denoted as GraphSage-4layer; 4) GraphSage with one additional nonlinear layer in each GNN layer, denoted as NGNN-GraphSage-1; and 5) GraphSage with two additional nonlinear layers in each GNN layer, denoted as NGNN-GraphSage-2. In all cases we used ReLU as the activation function. As shown in Figure 2(a), GraphSage performs well when the noise is highly distinguishable from the true features. But the performance starts dropping when the noise is mixed with the true features, and it decays faster once $\sigma$ becomes larger than 1.0, as shown in Figure 2(b).
The same scenario happens with the gfNN model (nt2019revisiting), which is formed by transforming the input node features via multiplications with the adjacency matrix followed by application of a single MLP block. This relatively simple model was shown to be more noise tolerant than GCN and SGC (wu2019simplifying); however, the performance of gfNN turns out to be much lower than the baseline GraphSage model in our experiments, so we do not present its results here.
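For reference, the two perturbation schemes could be implemented as below; this is a minimal sketch assuming node features stored in a dense tensor, not the exact script used to produce Figure 2.

```python
import torch


def perturb_features(x: torch.Tensor, sigma: float, mode: str = "add") -> torch.Tensor:
    """Inject Gaussian noise with standard deviation sigma into node features.

    mode="concat": x' = [x || eps], the noise stays distinguishable from the true features.
    mode="add":    x' = x + eps,    the noise is mixed with the true features.
    """
    eps = torch.randn_like(x) * sigma
    if mode == "concat":
        return torch.cat([x, eps], dim=1)  # doubles the feature dimension
    return x + eps
```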
Remark 2.
NGNN is a cheap yet effective way to form a GNN architecture that is stable against node feature perturbations.
One potential way to improve the denoising capability of a GNN model is to increase the parameter count via a larger hidden dimension. As shown in Figure 2(b), GraphSage-512 does perform better than the baseline GraphSage. But it is also more expensive, as its parameter size (675,887) is larger than that of baseline GraphSage (206,895). And it is still not as effective as either NGNN model, both of which use considerably fewer parameters (see below) and yet have more stable performance as the noise level increases.
An alternative strategy for increasing the model parameter count is to add more GNN layers. As shown in Figure 2(b), by adding one more GNN layer, GraphSage-4layer does outperform baseline GraphSage when $\sigma$ is smaller than 4.0. However, as a deeper GNN potentially aggregates more noisy information from its $k$-hop neighbors (zeng2020deep), the performance of GraphSage-4layer drops below baseline GraphSage when $\sigma$ reaches 5.0.
In contrast to the above two methods, NGNN-GraphSage achieves much better performance, as shown in Figure 2(b), with fewer parameters (272,687 for NGNN-GraphSage-1 and 338,479 for NGNN-GraphSage-2) than GraphSage-512 and without introducing new GNN layers. It helps maintain model performance when $\sigma$ is smaller than 1.0 and slows the downward trend when $\sigma$ is larger than 1.0 compared to the other three counterparts.
Remark 3.
NGNN can also keep a GNN model stable against graph structure perturbations.
We now show that by applying NGNN to a GNN, the model can better deal with graph structure perturbations. For this purpose, we randomly added $r \cdot |\mathcal{E}|$ noise edges to the original graph of ogbn-products, where $r$ is the ratio of newly added noise edges to the existing edges. For example, $r = 0.01$ means we randomly added 618.6K edges (the graph of ogbn-products has 61,859,140 edges). We trained 3-layer GraphSage models with a hidden dimension of 256 under three different settings: 1) vanilla GraphSage, denoted as GraphSage; 2) GraphSage with one additional nonlinear layer in each GNN layer, denoted as NGNN-GraphSage-1; and 3) GraphSage with two additional nonlinear layers in each GNN layer, denoted as NGNN-GraphSage-2. Figure 3 shows the results. It can be seen that NGNN helps preserve the model performance when $r$ is smaller than 0.01 and slows the performance degradation once $r$ exceeds 0.01, compared to vanilla GraphSage.
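A minimal sketch of this structural perturbation is shown below, assuming a DGL graph; the helper name add_noise_edges is ours, and the actual experiment may differ in details such as adding reverse edges or avoiding duplicates.

```python
import torch
import dgl


def add_noise_edges(g: dgl.DGLGraph, r: float) -> dgl.DGLGraph:
    """Randomly add r * |E| noise edges between uniformly sampled node pairs."""
    num_noise = int(r * g.num_edges())
    src = torch.randint(0, g.num_nodes(), (num_noise,))
    dst = torch.randint(0, g.num_nodes(), (num_noise,))
    # Returns a new graph with the extra (possibly duplicate) edges appended.
    return dgl.add_edges(g, src, dst)
```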
4. Experiments
We next provide experimental evidence to show that NGNN works well with various GNN architectures for both node classification and link prediction tasks in Sections 4.2 and 4.3. We also show that NGNN works with different training methods in Section 4.4. Finally, we discuss the impact of different NGNN settings in Sections 4.5 and 4.6.
4.1. Evaluation Setup
Datasets
We conducted experiments on seven datasets: ogbn-products, ogbn-arxiv and ogbn-proteins from OGB (hu2020open) and reddit (http://snap.stanford.edu/graphsage/) for node classification, and ogbl-collab, ogbl-ppa and ogbl-ddi from OGB (hu2020open) for link prediction. The detailed statistics are summarized in Table 1.
Table 1. Dataset statistics.

Datasets  # Nodes  # Edges
Node Classification
ogbn-products  2,449,029  61,859,140
ogbn-arxiv  169,343  1,166,243
ogbn-proteins  132,524  39,561,252
reddit  232,965  114,615,892
Link Prediction
ogbl-collab  235,868  1,285,465
ogbl-ppa  576,289  30,326,273
ogbl-ddi  4,267  1,334,889
Table 2. Baseline GNN models and their variants.

GNN Model  Description
Node classification task
GraphSage  Vanilla GraphSage with neighbor sampling.
GraphSage-Cluster  Vanilla GraphSage with cluster-based sampling (chiang2019cluster).
GAT-FLAG  GAT with FLAG (kong2020flag) enhancement.
GAT+BoT  GAT with bag of tricks (wang2021bag).
AGDN+BoT  AGDN with bag of tricks.
AGDN+BoT+self-KD+C&S  AGDN with bag of tricks, knowledge distillation and correct & smooth (huang2020combining).
Link prediction task
SEAL-DGCNN  Vanilla SEAL using DGCNN (zhang2018end) as the backbone GNN.
GCN-full  Vanilla GCN with full-graph training.
GraphSage-full  Vanilla GraphSage with full-graph training.
GraphSage+EdgeAttr  GraphSage with edge attributes.

GraphSage+EdgeAttr comes from https://github.com/lustoo/OGB_link_prediction
Table 3. Node classification results.

Dataset  ogbn-products  ogbn-arxiv  ogbn-proteins  reddit
Eval Metric  Accuracy (%)  Accuracy (%)  ROC-AUC (%)  Accuracy (%)
GraphSage  Vanilla  78.27±0.45  71.15±1.66  75.67±1.72  96.19±0.08
GraphSage  NGNN  79.88±0.34  71.77±1.18  76.30±0.96  96.21±0.04
GraphSage-Cluster  Vanilla  78.72±0.63  56.57±1.56  67.45±1.21  95.27±0.09
GraphSage-Cluster  NGNN  78.91±0.59  56.76±1.08  68.12±0.96  95.34±0.09
GAT-NS  Vanilla  79.23±0.16  72.10±1.12  81.76±0.17  96.12±0.02
GAT-NS  NGNN  79.67±0.09  71.88±1.10  81.91±0.21  96.45±0.05
GAT-FLAG  Vanilla  80.75±0.14  71.56±1.11  81.81±0.15  95.27±0.02
GAT-FLAG  NGNN  80.99±0.09  71.74±1.10  81.84±0.11  95.68±0.03

Any score difference between the vanilla GNN and the NGNN-GNN that is greater than 0.5% is highlighted with boldface.
Table 4. Node classification results with bag of tricks.

Dataset  Model  Accuracy / ROC-AUC (%)
ogbn-arxiv  AGDN+BoT  Vanilla  74.03±0.15
ogbn-arxiv  AGDN+BoT  NGNN  74.25±0.17
ogbn-arxiv  AGDN+BoT+self-KD+C&S  Vanilla  74.28±0.13
ogbn-arxiv  AGDN+BoT+self-KD+C&S  NGNN  74.34±0.14
ogbn-proteins  GAT+BoT  Vanilla  87.73±0.18
ogbn-proteins  GAT+BoT  NGNN  88.09±0.1
Table 5. Link prediction results. For each dataset we report the vanilla GNN and the NGNN variant.

Metric (%)  ogbl-collab (Vanilla / NGNN)  ogbl-ppa (Vanilla / NGNN)  ogbl-ddi (Vanilla / NGNN)
SEAL-DGCNN  hit@20  45.76±0.72 / 46.19±0.58  16.10±1.85 / 20.82±1.76  30.75±2.12 / 31.93±3.00
SEAL-DGCNN  hit@50  54.70±0.49 / 54.82±0.20  32.58±1.42 / 37.25±0.98  43.99±1.11 / 42.39±3.23
SEAL-DGCNN  hit@100  60.13±0.32 / 60.70±0.18  49.36±1.24 / 56.44±0.99  51.25±1.60 / 49.63±3.65
GCN-full  hit@10  35.94±1.60 / 36.69±0.82  4.00±1.46 / 5.64±0.93  47.82±5.90 / 48.22±7.00
GCN-full  hit@50  49.52±0.70 / 51.83±0.50  14.23±1.81 / 18.44±1.88  79.56±3.83 / 82.56±4.03
GCN-full  hit@100  55.74±0.44 / 57.41±0.22  20.21±1.92 / 26.78±0.92  87.58±1.33 / 89.48±1.68
GraphSage-full  hit@10  32.59±3.56 / 36.83±2.56  3.68±1.02 / 3.52±1.24  54.27±9.86 / 60.75±4.94
GraphSage-full  hit@50  51.66±0.35 / 52.62±1.04  15.02±1.69 / 15.55±1.92  82.18±4.00 / 84.58±1.89
GraphSage-full  hit@100  56.91±0.72 / 57.96±0.56  23.56±1.58 / 24.45±2.34  91.94±0.64 / 92.58±0.88
GraphSage+EdgeAttr  hit@20  -  -  87.06±4.81 / 93.28±1.61
GraphSage+EdgeAttr  hit@50  -  -  97.98±0.42 / 98.39±0.21
GraphSage+EdgeAttr  hit@100  -  -  98.98±0.16 / 99.21±0.08

The evaluation metrics used on the OGB leaderboard are hit@50 for ogbl-collab, hit@100 for ogbl-ppa and hit@20 for ogbl-ddi.
Any hit score difference between the vanilla GNN and the NGNN GNN that is greater than 1% is highlighted with boldface.
The evaluation metrics used for ogbl-ddi when profiling GCN-full and GraphSage-full are hit@20, hit@50 and hit@100.
We evaluated the effectiveness of NGNN by applying it to various GNN models including GCN (kipf2016semi), GraphSage (hamilton2017inductive), Graph Attention Network (GAT) (velivckovic2017graph), Adaptive Graph Diffusion Networks (AGDN) (sun2020adaptive) and SEAL (zhang2020revisiting), as well as their variants. Table 2 presents all the baseline models. We directly followed the implementation and configuration of each baseline model from the OGB (hu2020open) leaderboard and added nonlinear layer(s) into each GNN layer for NGNN. Table 11 presents the detailed configuration of each model. All models were trained on a single V100 GPU with 32GB memory. We report average performance over 10 runs for all models except the SEAL-related models; as training SEAL models is very expensive, we used 5 runs instead.
4.2. Node classification
First, we analyzed how NGNN improves the performance of GNN models on node classification tasks. Table 3 presents the overall results. It can be seen that NGNN-based models outperform their baseline models in most cases. Notably, NGNN tends to perform well with GraphSage: it improves the test accuracy of GraphSage on ogbn-products and ogbn-arxiv by 1.61% and 0.62% respectively, and improves the ROC-AUC score of GraphSage on ogbn-proteins by 0.63%. As the baseline performance on the reddit dataset is already quite high, the improvement brought by NGNN there is, unsurprisingly, not significant.
We further analyze the performance of NGNN combined with bag of tricks (wang2021bag) on ogbn-arxiv and ogbn-proteins in Table 4. It can be seen that NGNN-based models outperform their vanilla counterparts. NGNN with AGDN+BoT+self-KD+C&S even achieves the first place among all methods that use no extension to the input data on the ogbn-arxiv leaderboard as of the time of this submission (and the fourth place on the entire ogbn-arxiv leaderboard). NGNN with GAT+BoT also achieves the second place on the ogbn-proteins leaderboard with 5.83 times fewer parameters than the current leading method RevGNN-Wide (NGNN GAT+BoT has 11,740,552 parameters while RevGNN-Wide has 68,471,608 parameters).
4.3. Link prediction
Second, we analyzed how NGNN improves the performance of GNN models on link prediction tasks. Table 5 presents the results on the ogbl-collab, ogbl-ppa and ogbl-ddi datasets. As shown in the table, the performance improvement of NGNN over SEAL models is significant. NGNN improves the hit@20, hit@50 and hit@100 of SEAL-DGCNN by 4.72%, 4.67% and 7.08% respectively on ogbl-ppa. NGNN with SEAL-DGCNN achieves the first place on the ogbl-ppa leaderboard with an improvement of hit@100 by 5.82% over the current leading method MLP+CN&RA&AA (https://github.com/lustoo/OGB_link_prediction). Furthermore, NGNN with GraphSage+EdgeAttr achieves the first place on the ogbl-ddi leaderboard with an improvement of hit@20 by 5.47% over the current leading method, vanilla GraphSage+EdgeAttr. As GraphSage+EdgeAttr only provides results on ogbl-ddi, we do not compare its performance on other datasets. NGNN also works with GCN and GraphSage on link prediction tasks: it improves their performance in nearly all cases. In particular, it improves the hit@20, hit@50 and hit@100 of GCN by 1.64%, 4.21% and 6.57% respectively on ogbl-ppa.
Table 6. Test accuracy (%) of GraphSage and GAT with different training methods on ogbn-products.

Sampling Methods  full-graph  neighbor sampling  cluster-based sampling
GraphSage  78.27  78.70  78.72
GraphSage-NGNN  79.88  79.11  78.91
GAT  80.75  79.23  71.41
GAT-NGNN  80.99  79.67  76.76
Table 7. Test accuracy (%) on ogbn-products with different numbers of NGNN layers and hidden sizes.

Model: GraphSage
Hidden size  128  256  512
baseline  77.44  78.27  79.37
NGNN 1-layer  77.39  79.53  79.12
NGNN 2-layer  78.79  79.88  79.94
NGNN 4-layer  78.79  79.52  79.88

Model: GAT
Hidden size  64  128  256
baseline  68.41  79.23  75.26
NGNN 1-layer  69.72  79.67  77.53
NGNN 2-layer  69.86  78.26  78.76
NGNN 4-layer  69.41  78.23  78.61
Table 8. Model parameter sizes with different numbers of NGNN layers and hidden sizes.

Model: GraphSage
Hidden size  128  256  512
baseline  70,703  206,895  675,887
NGNN 1-layer  87,215  272,687  938,543
NGNN 2-layer  103,727  338,479  1,201,199
NGNN 4-layer  136,751  470,063  1,726,511

Model: GAT
Hidden size  64  128  256
baseline  510,056  1,543,272  5,182,568
NGNN 1-layer  514,152  1,559,656  5,248,104
NGNN 2-layer  518,248  1,576,040  5,313,640
NGNN 4-layer  526,440  1,608,808  5,444,712
Table 9. Single training epoch time (seconds) of GraphSage with different numbers of nonlinear layers added into GNN layers on ogbn-products.

Model: GraphSage (secs)
Hidden size  128  256  512
baseline  18.71±0.41  20.71±0.74  25.40±0.35
NGNN 1-layer  19.57±1.04  20.98±0.47  29.07±0.69
NGNN 2-layer  19.25±0.87  21.36±0.48  30.01±0.13
NGNN 4-layer  19.79±0.72  24.41±0.38  32.33±0.19
4.4. NGNN with Different Training Methods
Next, we present the effectiveness of using NGNN with different training methods, including full-graph training, neighbor sampling and cluster-based sampling. Table 6 presents the results. It can be seen that NGNN improves the performance of GraphSage and GAT with all of these training methods on ogbn-products. It is worth mentioning that NGNN also works with the local subgraph sampling method proposed by SEAL (zhang2018link), as shown in Section 4.3.
4.5. Effectiveness of Multiple NGNN Layers
We studied the effectiveness of adding multiple nonlinear layers to GNN layers on ogbn-products using GraphSage and GAT. The baseline model is a three-layer GNN model. We applied 1, 2 or 4 nonlinear layers to each hidden GNN layer, denoted as NGNN 1-layer, NGNN 2-layer and NGNN 4-layer respectively. The GAT models use eight attention heads and all heads share the same NGNN layer(s). Table 7 presents the results. As shown in the table, NGNN 2-layer performed best in most cases across different hidden sizes. This reveals that adding nonlinear layers can be effective, but the effect may vanish when we continue to add more layers. The reason is straightforward: adding more nonlinear layers can eventually cause overfitting.
We also observe that deeper models can achieve better performance with many fewer trainable parameters than wider models. Table 8 presents the model parameter size of each model. As shown in the table, the parameter size of GraphSage with NGNN 2-layer and a hidden size of 256 is 338,479, which is about 2× smaller than that of vanilla GraphSage with a hidden size of 512 (675,887), while its performance is much better.
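For reference, the parameter counts in Table 8 can be checked for any such configuration with a small helper; this is a generic sketch assuming a PyTorch nn.Module (e.g., a stack of the illustrative NGNNSAGEConv layers from Section 3.2), not the exact accounting script used for the paper.

```python
import torch.nn as nn


def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters, as reported in Table 8."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```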
Furthermore, we observe that adding NGNN layers only slightly increases the model training time. Table 9 presents the single-epoch training time of GraphSage under different configurations. As shown in the table, the epoch time of GraphSage with NGNN 2-layer and a hidden size of 256 is only 3.1% longer than that of vanilla GraphSage with the same hidden size, even though the corresponding parameter size is 1.63× larger.
Table 10. Test accuracy (%) on ogbn-products when applying NGNN to different GNN layers.

  GraphSage  GAT
baseline  78.27  79.23
NGNN all  79.88  79.49
NGNN input  79.81  78.87
NGNN hidden  79.91  79.68
NGNN output  78.60  78.45
4.6. Effectiveness of Applying NGNN to Different GNN Layers
Finally, we studied the effectiveness of applying NGNN to only the input GNN layer (NGNN input), only the hidden GNN layers (NGNN hidden), only the output GNN layer (NGNN output), and all GNN layers (NGNN all) on the ogbn-products dataset using GraphSage and GAT. The baseline model is a three-layer GNN model. The hidden dimension is 256 for GraphSage and 128 for GAT. Table 10 presents the results. As the table shows, applying NGNN only to the output GNN layer brings little or no benefit, while applying NGNN to the hidden and input GNN layers improves model performance, with the hidden layers benefiting the most. This demonstrates that the benefit of NGNN mainly comes from adding additional nonlinear layers into the input and hidden GNN layers.
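To illustrate how this placement choice could be configured, the sketch below extends the illustrative NGNNSAGEConv layer from Section 3.2 with an ngnn_position option covering the four settings in Table 10; the class and flag names are ours, not from a released implementation.

```python
import torch.nn as nn


class NGNNGraphSage(nn.Module):
    """Three-layer GraphSage where in-GNN layers are enabled per position.

    ngnn_position: one of "all", "input", "hidden", "output" (see Table 10).
    Assumes the illustrative NGNNSAGEConv class defined in Section 3.2.
    """

    def __init__(self, in_feats, hidden_feats, out_feats,
                 ngnn_position="hidden", num_ngnn_layers=2):
        super().__init__()
        positions = ["input", "hidden", "output"]
        dims = [(in_feats, hidden_feats), (hidden_feats, hidden_feats),
                (hidden_feats, out_feats)]
        self.layers = nn.ModuleList([
            NGNNSAGEConv(
                d_in, d_out,
                # Enable the extra feedforward layers only at the chosen position(s).
                num_ngnn_layers if ngnn_position in ("all", pos) else 0,
            )
            for (d_in, d_out), pos in zip(dims, positions)
        ])

    def forward(self, graph, feat):
        for layer in self.layers:
            feat = layer(graph, feat)
        return feat
```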
Table 11. Detailed configuration of each model.

Dataset  Model  hidden size  layers  aggregation  NGNN position  NGNN setting
Node classification tasks
ogbn-products  GraphSage  256  3  mean  hidden-only  1 relu + 1 sigmoid
ogbn-products  GraphSage-Cluster  256  3  mean  hidden-only  1 relu + 1 sigmoid
ogbn-products  GAT-FLAG  256  3  sum  hidden-only  1 relu + 1 sigmoid
ogbn-products  GAT-NS  256  3  sum  hidden-only  1 relu + 1 sigmoid
ogbn-arxiv  GraphSage  256  3  mean  hidden-only  1 relu + 1 sigmoid
ogbn-arxiv  GraphSage-Cluster  256  3  mean  hidden-only  1 relu + 1 sigmoid
ogbn-arxiv  GAT-FLAG  256  3  sum  hidden-only  1 relu + 1 sigmoid
ogbn-arxiv  GAT+BoT  120  6  sum  hidden-only  2 relu
ogbn-arxiv  AGDN+BoT  256  3  GAT-HA  hidden-only  1 relu
ogbn-arxiv  AGDN+BoT+self-KD+C&S  256  3  GAT-HA  hidden-only  1 relu
ogbn-proteins  GraphSage  256  3  mean  hidden-only  1 relu
ogbn-proteins  GraphSage-Cluster  256  3  mean  hidden-only  1 relu
ogbn-proteins  GAT-FLAG  256  3  sum  hidden-only  1 relu
ogbn-proteins  GAT-NS  256  3  sum  hidden-only  1 relu
reddit  GraphSage  256  3  mean  hidden-only  1 relu + 1 sigmoid
reddit  GraphSage-Cluster  256  3  mean  hidden-only  1 relu + 1 sigmoid
reddit  GAT-FLAG  256  3  sum  hidden-only  1 relu + 1 sigmoid
reddit  GAT-NS  256  3  sum  hidden-only  1 relu + 1 sigmoid
Link prediction tasks
ogbl-collab  SEAL-DGCNN  256  3  sum  all-layers  1 relu
ogbl-collab  GCN-full  256  3  mean  hidden-only  2 relu
ogbl-collab  GraphSage-full  256  3  mean  hidden-only  2 relu
ogbl-ppa  SEAL-DGCNN  32  3  sum  all-layers  1 relu
ogbl-ppa  GCN-full  256  3  mean  all-layers  2 relu
ogbl-ppa  GraphSage-full  256  3  mean  all-layers  2 relu
ogbl-ddi  SEAL-DGCNN  32  3  sum  hidden-only  1 relu
ogbl-ddi  GCN-full  256  2  mean  input-only  1 relu
ogbl-ddi  GraphSage-full  256  2  mean  input-only  1 relu
ogbl-ddi  GraphSage+EdgeAttr  512  2  mean  all-layers  2 relu
5. Conclusion and Future Work
We present NGNN, a model-agnostic methodology that allows arbitrary GNN models to increase their capacity by inserting nonlinear feedforward neural network layer(s) inside GNN layers. Moreover, unlike existing deep GNN approaches, NGNN does not have a large memory overhead and can work with various training methods including neighbor sampling, graph clustering and local subgraph sampling. Empirically, we demonstrate that NGNN works with various GNN models on both node classification and link prediction tasks and achieves state-of-the-art results. Future work includes evaluating NGNN on more GNN models and investigating whether NGNN works on broader graph-related prediction tasks. We also plan to explore methodologies to make a single GNN layer deeper.