Graph Neural Networks (GNNs) capture local graph structure and feature information in a trainable fashion to derive powerful node representations. They have shown promising success on multiple graph-based machine learning tasks (ying2018graph; scarselli2008graph; hu2020open) and are widely adopted by various web applications including social networks (ying2018graph; rossi2020temporal), recommendation (berg2017graph; fan2019graph; yu2021self), fraud detection (wang2019fdgars; li2019spam; liu2020alleviating), etc. Various strategies have been proposed to improve the expressiveness of GNNs (hamilton2017inductive; velivckovic2017graph; xu2018powerful; schlichtkrull2018modeling).
One natural candidate for improving the performance of a GNN is to increase its parameter size by either expanding the hidden dimension or the number of GNN layers. However, this can result in a large computational cost with only a modest performance gain. As a representative example, Figure 1 displays the performance of GraphSage (hamilton2017inductive) under different settings on the ogbn-products dataset and the corresponding model parameter sizes. From these results, it can be seen that either increasing the hidden dimension or increasing the number of GNN layers increases the model parameter size exponentially, but brings little performance improvement in terms of test accuracy. For example, in order to improve the accuracy of a 3-layer GraphSage model by 1%, we need to add 2.3× more parameters (by increasing the hidden dimension from 256 to 512). Furthermore, with a larger hidden dimension a model is more likely to overfit the training data. On the other hand, stacking multiple GNN layers may oversmooth the features of nodes (oono2019graph; chen2020measuring). As shown in Figure 1(a), GraphSage reaches its peak performance with only 3 GNN layers and a hidden dimension of 512.
Inspired by the Network-in-Network architecture (lin2013network), we present Network-in-Graph Neural Network (NGNN), a model-agnostic methodology that allows arbitrary GNN models to increase their model capacity by making the model deeper. However, instead of adding more GNN layers, NGNN deepens a GNN model by inserting non-linear feedforward neural network layer(s) within each GNN layer. This leads to a much smaller memory footprint than recent alternative deep GNN architectures (li2019deepgcns; li2020deepergcn), and NGNN can be applied to all kinds of GNN models with various training methods including full-graph training, neighbor sampling (hamilton2017inductive), cluster-based sampling (chiang2019cluster) and local subgraph sampling (zhang2018link). Thus, it easily scales to large graphs. Moreover, an analysis of NGNN in conjunction with GraphSage on perturbed ogbn-products shows that NGNN is a cheap yet effective way to keep the model stable against either node feature or graph structure perturbations.
In this work, we applied NGNN to GCN (kipf2016semi), GraphSage (hamilton2017inductive), GAT (velivckovic2017graph), AGDN (sun2020adaptive) and SEAL (zhang2018link). We also combined the proposed technique with different mini-batch training methods including neighbor sampling, graph clustering and local subgraph sampling. We conducted comprehensive experiments on several large-scale graph datasets for both node classification and link prediction, leading to the following conclusions (which hold as of the time of this submission):
NGNN improves the performance of GraphSage and GAT and their variants on node classification datasets including ogbn-products, ogbn-arxiv, ogbn-proteins and reddit. For example, it improves the test accuracy of GraphSage by 1.6% on the ogbn-products dataset. Furthermore, NGNN with AGDN+BoT+self-KD+C&S (huang2020combining) achieves fourth place on the ogbn-arxiv leaderboard (https://ogb.stanford.edu/docs/leader_nodeprop/) and NGNN with GAT+BoT (wang2021bag) achieves second place on the ogbn-proteins leaderboard with many fewer model parameters.
NGNN improves the performance of SEAL, GCN and GraphSage and their variants on link prediction datasets including ogbl-collab, ogbl-ppa and ogbl-ddi. For example, it increases the test hits@100 score of SEAL by 7.08% on the ogbl-ppa dataset, outperforming all the state-of-the-art approaches on the ogbl-ppa leaderboard (https://ogb.stanford.edu/docs/leader_linkprop/) by a substantial margin. Furthermore, NGNN improves the test hits@20 score of GraphSage+EdgeAttr by 6.22% on the ogbl-ddi dataset, which also takes first place on the ogbl-ddi leaderboard.
NGNN improves the performance of GraphSage and GAT under different training methods including full-graph training, neighbor sampling, graph clustering, and subgraph sampling.
NGNN is a more effective way of improving model performance than expanding the hidden dimension. It requires fewer parameters and less training time to achieve better performance than simply doubling the hidden dimension.
In summary, we present NGNN, a method that deepens a GNN model without adding extra GNN message-passing layers. We show that NGNN significantly improves the performance of vanilla GNNs on various datasets for both node classification and link prediction, and we demonstrate the generality of NGNN by applying it to various GNN architectures.
2. Related Work
Deep models have been widely studied in various domains including computer vision (simonyan2014very; he2016deep), natural language processing (brown2020language), and speech recognition (zhang2017very). VGG (simonyan2014very) investigates the effect of convolutional neural network depth on accuracy in the large-scale image recognition setting, demonstrating that the depth of representations is essential to model performance; however, accuracy does not always grow as depth grows. ResNet (he2016deep) eases the difficulty of training deep models by introducing residual connections between input and output layers. DenseNet (huang2017cvpr) takes this idea a step further by adding connections across layers. GPT-3 (brown2020language) presents an autoregressive language model with 96 layers that achieves SOTA performance on various NLP tasks. Even so, while deep neural networks have achieved great success in various domains, the use of deep models in graph representation learning is less well-established.
Most recent works (li2019deepgcns; li2020deepergcn; li2021training) attempt to train deep GNN models with a large number of parameters and achieve SOTA performance. For example, DeepGCN (li2019deepgcns) adapts the concepts of residual connections, dense connections, and dilated convolutions (yu2015multi) to training very deep GCNs. However, DeepGCN and its successor DeeperGCN (li2020deepergcn) have large memory footprints during model training, which can be subject to current hardware limitations. RevGNN (li2021training) explored grouped reversible graph connections to train deep GNNs and has a much smaller memory footprint. However, RevGNN only works with full-graph training and cluster-based mini-batch training, which makes it difficult to combine with other methods designed for large-scale graphs such as neighbor sampling (hamilton2017inductive) and layer-wise sampling (chen2018fastgcn). In contrast, NGNN deepens a GNN model by inserting non-linear feedforward layer(s) within each GNN layer. It can be applied to all kinds of GNN models with various training methods including full-graph training, neighbor sampling (hamilton2017inductive), layer-wise sampling (chen2018fastgcn) and cluster-based sampling (chiang2019cluster).
Xu et al. (xu2018powerful) used Multilayer Perceptrons (MLPs) to learn the injective functions of the Graph Isomorphism Network (GIN) model and showed its effectiveness on graph classification tasks, but they did not show whether adding an MLP within GNN layers works effectively across wide-ranging node classification and link prediction tasks. Additionally, You et al. (you2020design) mentioned that adding MLPs within GNN layers could benefit performance. However, they did not systematically analyze the reason for the performance improvement introduced by extra non-linear layers, nor evaluate it with numerous SOTA GNN architectures on large-scale graph datasets for both node classification and link prediction tasks.
3. Building Network in Graph Neural Network Models
A graph $G = (V, E)$ is composed of nodes and edges, where $V$ is the set of nodes and $E$ is the set of edges. Furthermore, $A$ denotes the corresponding adjacency matrix of $G$. Let $X$ be the node feature space such that $x_v \in X$ represents the node feature of node $v \in V$. Formally, the $l$-th layer of a GNN is defined as follows (we omit edge features for simplicity):

$$h_v^{(l)} = \sigma\left(f_\theta^{(l)}\left(h_v^{(l-1)}, \left\{h_u^{(l-1)} : u \in \mathcal{N}(v)\right\}\right)\right),$$

where the function $f_\theta^{(l)}$ is determined by learnable parameters $\theta$ and $\sigma$ is an optional activation function. Additionally, $h_v^{(l)}$ represents the embedding of node $v$ in the $l$-th layer, and $h_v^{(0)} = x_v$. With an $L$-layer GNN, the node embeddings $h_v^{(L)}$ in the last layer are used by downstream tasks like node classification and link prediction.
3.2. Basic NGNN Design
Inspired by the network-in-network architecture (lin2013network), we deepen a GNN model by inserting non-linear feedforward neural network layer(s) within each GNN layer. The $l$-th layer in NGNN is thus constructed as:

$$h_v^{(l)} = g_\omega^{(l)}\left(\sigma\left(f_\theta^{(l)}\left(h_v^{(l-1)}, \left\{h_u^{(l-1)} : u \in \mathcal{N}(v)\right\}\right)\right)\right).$$

The calculation of $g_\omega^{(l)}$ is defined layer-wise as:

$$g_\omega^{(l)}(h) = \sigma\left(W_T^{(l)} \cdots \sigma\left(W_1^{(l)} h\right)\right),$$

where $W_1^{(l)}, \ldots, W_T^{(l)}$ are learnable weight matrices, $\sigma$ is an activation function, and $T$ is the number of in-GNN non-linear feedforward neural network layers. The first in-GNN layer takes the output of $\sigma(f_\theta^{(l)}(\cdot))$ as input and performs the non-linear transformation.
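To make the construction above concrete, the following sketch implements a single GraphSage-style mean-aggregation layer and its NGNN variant in NumPy. The function names, the toy graph, and the mean aggregator are illustrative assumptions for exposition, not the exact implementation used in our experiments:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sage_layer(H, A_hat, W_self, W_neigh):
    # Vanilla GraphSage-style layer: combine self features with
    # mean-aggregated neighbor features (A_hat is the row-normalized adjacency).
    return relu(H @ W_self + (A_hat @ H) @ W_neigh)

def ngnn_sage_layer(H, A_hat, W_self, W_neigh, ngnn_weights):
    # NGNN: apply T extra non-linear feedforward layers *inside* the GNN layer,
    # after message passing, before the result reaches the next GNN layer.
    h = sage_layer(H, A_hat, W_self, W_neigh)
    for W in ngnn_weights:
        h = relu(h @ W)
    return h

# Toy example: 4-node graph, 8-dim features, hidden dim 8, T = 2.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
A_hat = A / A.sum(axis=1, keepdims=True)   # mean aggregation
H = rng.normal(size=(4, 8))
W_self, W_neigh = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
ngnn_weights = [rng.normal(size=(8, 8)) for _ in range(2)]

out = ngnn_sage_layer(H, A_hat, W_self, W_neigh, ngnn_weights)
print(out.shape)  # (4, 8)
```

Note that the NGNN layer keeps the same input/output interface as the vanilla layer, which is why it can be dropped into any multi-layer GNN without changing the message-passing scheme.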
In this section, we demonstrate that an NGNN architecture can better handle both noisy node features and noisy graph structures relative to its vanilla GNN counterpart.
Remark 1 ().
GNNs work well when the input features consist of true features and noise that are distinguishable from each other. But when the true features are mixed with noise, GNNs can struggle to filter out the noise, especially as the noise level increases.
GNNs follow a neural message passing scheme (gilmer2017neural) to aggregate information from the neighbors of a target node. In doing so, they can perform noise filtering and learn from the resulting signal when the noise is in some way distinguishable from the true features, such as when the latter are mostly low-frequency (nt2019revisiting). However, when the noise level becomes too large and the noise is mixed with the true features, it cannot easily be reduced by GNNs (huang2019residual). Figure 2 demonstrates this scenario. Here we randomly added Gaussian noise $\epsilon \sim N(0, \sigma^2)$ to the node features in the ogbn-products data, where $\sigma$ is the standard deviation, ranging from 0.1 to 5.0. We adopt two different methods for adding noise: 1) $x_v' = x_v \,\|\, \epsilon$ as shown in Figure 2(a), where $\|$ is a concatenation operation, and 2) $x_v' = x_v + \epsilon$ as shown in Figure 2(b). We trained GraphSage models (using the DGL (wang2020deep) implementation) under five different settings: 1) baseline GraphSage with the default 3-layer structure and a hidden dimension of 256, denoted as GraphSage; 2) GraphSage with the hidden dimension increased to 512, denoted as GraphSage-512; 3) 4-layer GraphSage, denoted as GraphSage-4layer; 4) GraphSage with one additional non-linear layer in each GNN layer, denoted as NGNN-GraphSage-1; and 5) GraphSage with two additional non-linear layers in each GNN layer, denoted as NGNN-GraphSage-2. In all cases we used ReLU as the activation function. As shown in Figure 2(a), GraphSage performs well when the noise is highly distinguishable from the true features. But the performance starts dropping when the noise is mixed with the true features, and decays faster when $\sigma$ becomes larger than 1.0, as shown in Figure 2(b).
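The two perturbation schemes can be sketched as follows; the function and variable names are our own, and the feature dimensions are toy values rather than those of ogbn-products:

```python
import numpy as np

def perturb_features(X, sigma, mode, rng):
    """Perturb node features with Gaussian noise N(0, sigma^2).

    mode='concat': append a noise block to each feature vector, so the
    noise stays separable from the true features (Figure 2(a) setting);
    mode='add': mix the noise directly into the true features
    (Figure 2(b) setting).
    """
    noise = rng.normal(scale=sigma, size=X.shape)
    if mode == "concat":
        return np.concatenate([X, noise], axis=1)
    if mode == "add":
        return X + noise
    raise ValueError(f"unknown mode: {mode}")

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))             # toy input: 5 nodes, 100-dim features
X_concat = perturb_features(X, 1.0, "concat", rng)
X_add = perturb_features(X, 1.0, "add", rng)
print(X_concat.shape, X_add.shape)  # (5, 200) (5, 100)
```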
The same scenario occurs with the gfNN model (nt2019revisiting), which is formed by transforming input node features via multiplications with the adjacency matrix followed by application of a single MLP block. This relatively simple model was shown to be more noise tolerant than GCN and SGC (wu2019simplifying); however, the performance of gfNN turns out to be much lower than the baseline GraphSage model in our experiments, so we do not present those results here.
Remark 2 ().
NGNN is a cheap yet effective way to form a GNN architecture that is stable against node feature perturbations.
One potential way to improve the denoising capability of a GNN model is to increase the parameter count via a larger hidden dimension. As shown in Figure 2(b), GraphSage-512 does perform better than the baseline GraphSage. But it is also more expensive, as its parameter size (675,887) is larger than that of the baseline GraphSage (206,895). And it is still not as effective as either NGNN model, both of which use considerably fewer parameters (see below) and yet have more stable performance as the noise level increases.
An alternative strategy for increasing the model parameter count is to add more GNN layers. As shown in Figure 2(b), by adding one more GNN layer, GraphSage-4layer does outperform the baseline GraphSage when $\sigma$ is smaller than 4.0. However, as a deeper GNN potentially aggregates more noisy information from its multi-hop neighbors (zeng2020deep), the performance of GraphSage-4layer drops below the baseline GraphSage when $\sigma$ is 5.0.
In contrast to the above two methods, NGNN-GraphSage achieves much better performance, as shown in Figure 2(b), with fewer parameters (272,687 for NGNN-GraphSage-1 and 338,479 for NGNN-GraphSage-2) than GraphSage-512 and without introducing new GNN layers. It helps maintain model performance when $\sigma$ is smaller than 1.0 and slows the downward trend when $\sigma$ is larger than 1.0, compared to the other three counterparts.
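As a quick sanity check on these counts, the gaps between the reported parameter sizes correspond exactly to one and two extra 256-dimensional feedforward layers (256×256 weights plus 256 biases each):

```python
# One extra 256 -> 256 non-linear feedforward layer adds
# 256*256 weights plus 256 biases.
extra = 256 * 256 + 256
print(extra)                  # 65792
print(206_895 + 1 * extra)    # 272687, the NGNN-GraphSage-1 count
print(206_895 + 2 * extra)    # 338479, the NGNN-GraphSage-2 count
```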
Remark 3 ().
NGNN can also keep a GNN model stable against graph structure perturbations.
We now show that by applying NGNN to a GNN, it can better deal with graph structure perturbations. For this purpose, we randomly added $\alpha \cdot |E|$ edges to the original graph of ogbn-products, where $\alpha$ is the ratio of newly added noise edges to the existing edges $E$. For example, $\alpha = 0.01$ means we randomly added 618.6K edges (the graph of ogbn-products has 61,859,140 edges). We trained 3-layer GraphSage models with a hidden dimension of 256 under three different settings: 1) vanilla GraphSage, denoted as GraphSage; 2) GraphSage with one additional non-linear layer in each GNN layer, denoted as NGNN-GraphSage-1; and 3) GraphSage with two additional non-linear layers in each GNN layer, denoted as NGNN-GraphSage-2. Figure 3 shows the results. It can be seen that NGNN helps preserve model performance when $\alpha$ is smaller than 0.01 and slows the performance degradation once $\alpha$ exceeds 0.01, compared to vanilla GraphSage.
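The edge-perturbation procedure above can be sketched as follows; the function name, toy graph size, and the decision not to deduplicate sampled edges are illustrative assumptions:

```python
import numpy as np

def perturb_edges(num_nodes, edges, alpha, rng):
    """Add alpha * |E| uniformly random noise edges to an edge list.

    edges is an (|E|, 2) array of node-index pairs. This simple sketch
    does not deduplicate against existing edges.
    """
    num_new = int(alpha * len(edges))
    new_edges = rng.integers(0, num_nodes, size=(num_new, 2))
    return np.concatenate([edges, new_edges], axis=0)

rng = np.random.default_rng(0)
edges = rng.integers(0, 1000, size=(10_000, 2))  # toy graph: 1000 nodes, 10K edges
perturbed = perturb_edges(1000, edges, 0.01, rng)
print(len(perturbed) - len(edges))  # 100 noise edges added
```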
We next provide experimental evidence to show that NGNN works well with various GNN architectures for both node classification and link prediction tasks in Sections 4.2 and 4.3. We also show that NGNN works with different training methods in Section 4.4. Finally, we discuss the impact of different NGNN settings in Sections 4.5 and 4.6.
4.1. Evaluation Setup
We conducted experiments on seven datasets, including ogbn-products, ogbn-arxiv and ogbn-proteins from OGB (hu2020open) and reddit (http://snap.stanford.edu/graphsage/) for node classification, and ogbl-collab, ogbl-ppa and ogbl-ddi from OGB (hu2020open) for link prediction. The detailed statistics are summarized in Table 1.
Table 1. Dataset statistics: for each dataset (grouped by node classification and link prediction tasks), the number of nodes and the number of edges.
Table 2. Baseline models.

Node classification task:
- GraphSage: Vanilla GraphSage with neighbor sampling.
- GraphSage-Cluster: Vanilla GraphSage with cluster-based sampling (chiang2019cluster).
- GAT-FLAG: GAT with FLAG (kong2020flag) enhancement.
- GAT+BoT: GAT with bag of tricks (wang2021bag).
- AGDN+BoT: AGDN with bag of tricks.
- AGDN+BoT+self-KD+C&S: AGDN with bag of tricks, self-knowledge distillation and correct & smooth (huang2020combining).

Link prediction task:
- SEAL-DGCNN: Vanilla SEAL using DGCNN (zhang2018end) as the backbone GNN.
- GCN-full: Vanilla GCN with full-graph training.
- GraphSage-full: Vanilla GraphSage with full-graph training.
- GraphSage+EdgeAttr: GraphSage with edge attributes (from https://github.com/lustoo/OGB_link_prediction).
Any score difference between a vanilla GNN and its NGNN counterpart that is greater than 0.5% is highlighted in boldface.
| ||Vanilla GNN||NGNN||Vanilla GNN||NGNN||Vanilla GNN||NGNN|
|GCN-full||hit@10||35.94±1.60||36.69±0.82||4.00±1.46||5.64±0.93||47.82±5.90||48.22±7.00|
The evaluation metrics used on the OGB leaderboard are hit@50 for ogbl-collab, hit@100 for ogbl-ppa and hit@20 for ogbl-ddi.
Any hit score difference between a vanilla GNN and its NGNN counterpart that is greater than 1% is highlighted in boldface.
The evaluation metrics used for ogbl-ddi when profiling GCN-full and GraphSage-full are hit@20, hit@50 and hit@100.
We evaluated the effectiveness of NGNN by applying it to various GNN models including GCN (kipf2016semi), GraphSage (hamilton2017inductive), Graph Attention Network (GAT) (velivckovic2017graph), Adaptive Graph Diffusion Networks (AGDN) (sun2020adaptive), and SEAL (zhang2020revisiting) and their variants. Table 2 presents all the baseline models. We directly followed the implementation and configuration of each baseline model from the OGB (hu2020open) leaderboard and added non-linear layer(s) into each GNN layer for NGNN. Table 11 presents the detailed configuration of each model. All models were trained on a single V100 GPU with 32GB memory. We report average performance over 10 runs for all models except the SEAL-related models; as training SEAL models is very expensive, we used 5 runs for them instead.
4.2. Node classification
First, we analyzed how NGNN improves the performance of GNN models on node classification tasks. Table 3 presents the overall results. It can be seen that NGNN-based models outperform their baseline models in most cases. Notably, NGNN tends to perform well with GraphSage: it improves the test accuracy of GraphSage on ogbn-products and ogbn-arxiv by 1.61% and 0.62% respectively, and improves the ROC-AUC score of GraphSage on ogbn-proteins by 0.63%. But as the baseline performance on the reddit dataset is already quite high, not surprisingly, the overall improvement from NGNN there is not significant.
We further analyze the performance of NGNN combined with bag of tricks (wang2021bag) on ogbn-arxiv and ogbn-proteins in Table 4. It can be seen that NGNN-based models outperform their vanilla counterparts. NGNN with AGDN+BoT+self-KD+C&S even achieves first place among all methods that use no extension to the input data on the ogbn-arxiv leaderboard as of the time of this submission (fourth place on the entire ogbn-arxiv leaderboard). NGNN with GAT+BoT also achieves second place on the ogbn-proteins leaderboard with 5.83 times fewer parameters than the current leading method RevGNN-Wide (NGNN-GAT+BoT has 11,740,552 parameters while RevGNN-Wide has 68,471,608 parameters).
4.3. Link prediction
Second, we analyzed how NGNN improves the performance of GNN models on link prediction tasks. Table 5 presents the results on the ogbl-collab, ogbl-ppa and ogbl-ddi datasets. As shown in the table, the performance improvement of NGNN over SEAL models is significant. NGNN improves the hit@20, hit@50 and hit@100 of SEAL-DGCNN by 4.72%, 4.67% and 7.08% respectively on ogbl-ppa. NGNN with SEAL-DGCNN achieves first place on the ogbl-ppa leaderboard with an improvement in hit@100 of 5.82% over the current leading method MLP+CN&RA&AA (https://github.com/lustoo/OGB_link_prediction). Furthermore, NGNN with GraphSage+EdgeAttr achieves first place on the ogbl-ddi leaderboard with an improvement in hit@20 of 5.47% over the current leading method, vanilla GraphSage+EdgeAttr. As GraphSage+EdgeAttr only provided results on ogbl-ddi, we do not compare its performance on other datasets. NGNN also works with GCN and GraphSage on link prediction tasks: it improves the performance of GCN and GraphSage in all cases. In particular, it improves the hit@20, hit@50 and hit@100 of GCN by 1.64%, 4.21% and 6.57% respectively on ogbl-ppa.
Table 9. The single training epoch time of GraphSage with different numbers of non-linear layers added into the GNN layers on ogbn-products.
4.4. NGNN with Different Training Methods
Next, we present the effectiveness of using NGNN with different training methods including full-graph training, neighbor sampling and cluster-based sampling. Table 6 presents the results. It can be seen that NGNN improves the performance of GraphSage and GAT with all of these training methods on ogbn-products. It is worth mentioning that NGNN also works with the local subgraph sampling method proposed by SEAL (zhang2018link), as shown in Section 4.3.
4.5. Effectiveness of Multiple NGNN Layers
We studied the effectiveness of adding multiple non-linear layers to GNN layers on ogbn-products using GraphSage and GAT. The baseline model is a three-layer GNN model. We applied 1, 2 or 4 non-linear layers to each hidden GNN layer, denoted as NGNN-1layer, NGNN-2layer and NGNN-4layer respectively. The GAT models use eight attention heads, and all heads share the same NGNN layer(s). Table 7 presents the results. As shown in the table, NGNN-2layer performed best across different hidden sizes in most cases. This reveals that adding non-linear layers can be effective, but the effect may vanish as we continue to add more layers. The reason is straightforward: adding more non-linear layers can eventually cause overfitting.
We also observe that deeper models can achieve better performance with many fewer trainable parameters than wider models. Table 8 presents the parameter size of each model. As shown in the table, the parameter size of GraphSage with NGNN-2layer and a hidden size of 256 is 338,479, which is 2× smaller than the parameter size of vanilla GraphSage with a hidden size of 512, i.e., 675,887. And its performance is much better than that of vanilla GraphSage with a hidden size of 512.
Furthermore, we also observe that adding NGNN layers only slightly increases the model training time. Table 9 presents the single training epoch time of GraphSage under different configurations. As shown in the table, the epoch time of GraphSage with NGNN-2layer and a hidden size of 256 is only 3.1% longer than that of vanilla GraphSage with the same hidden size, even though the corresponding parameter size is 1.63× larger.
4.6. Effectiveness of Applying NGNN to Different GNN Layers
Finally, we studied the effectiveness of applying NGNN to only the input GNN layer (NGNN-input), only the hidden GNN layers (NGNN-hidden), only the output GNN layer (NGNN-output), and all the GNN layers on the ogbn-products dataset using GraphSage and GAT. The baseline model is a three-layer GNN model. The hidden dimension is 256 for GraphSage and 128 for GAT. Table 10 presents the results. As the table shows, applying NGNN only to the output GNN layer brings little or no benefit, while applying NGNN to the hidden and input GNN layers improves model performance, especially the hidden layers. This demonstrates that the benefit of NGNN mainly comes from adding additional non-linear layers into the input and hidden GNN layers.
Table 11. Detailed model configurations (dataset, model, hidden size, number of layers, aggregation, NGNN position and NGNN setting) for the node classification and link prediction tasks.
5. Conclusion and Future Work
We present NGNN, a model-agnostic methodology that allows arbitrary GNN models to increase their model capacity by inserting non-linear feedforward neural network layer(s) inside GNN layers. Moreover, unlike existing deep GNN approaches, NGNN does not have a large memory overhead and can work with various training methods including neighbor sampling, graph clustering and local subgraph sampling. Empirically, we demonstrate that NGNN works with various GNN models on both node classification and link prediction tasks and achieves state-of-the-art results. Future work includes evaluating NGNN on more GNN models and investigating whether NGNN can work on broader graph-related prediction tasks. We also plan to explore methodologies to make a single GNN layer deeper.