Network In Graph Neural Network

11/23/2021
by Xiang Song, et al.
Amazon
Washington University in St Louis

Graph Neural Networks (GNNs) have shown success in learning from graph-structured data containing node/edge feature information, with application to social networks, recommendation, fraud detection and knowledge graph reasoning. In this regard, various strategies have been proposed in the past to improve the expressiveness of GNNs. For example, one straightforward option is to simply increase the parameter size by either expanding the hidden dimension or increasing the number of GNN layers. However, wider hidden layers can easily lead to overfitting, and incrementally adding more GNN layers can potentially result in over-smoothing. In this paper, we present a model-agnostic methodology, namely Network In Graph Neural Network (NGNN), that allows arbitrary GNN models to increase their model capacity by making the model deeper. However, instead of adding or widening GNN layers, NGNN deepens a GNN model by inserting non-linear feedforward neural network layer(s) within each GNN layer. An analysis of NGNN as applied to a GraphSage base GNN on ogbn-products data demonstrates that it can keep the model stable against either node feature or graph structure perturbations. Furthermore, wide-ranging evaluation results on both node classification and link prediction tasks show that NGNN works reliably across diverse GNN architectures. For instance, it improves the test accuracy of GraphSage on ogbn-products by 1.6%, the hits@100 score of SEAL on ogbl-ppa by 7.08%, and the hits@20 score of GraphSage+Edge-Attr on ogbl-ddi by 6.22%. At the time of submission, it achieved two first places on the OGB link prediction leaderboard.


1. Introduction

Graph Neural Networks (GNNs) capture local graph structure and feature information in a trainable fashion to derive powerful node representations. They have shown promising success on multiple graph-based machine learning tasks (ying2018graph; scarselli2008graph; hu2020open) and are widely adopted by various web applications including social networks (ying2018graph; rossi2020temporal), recommendation (berg2017graph; fan2019graph; yu2021self), fraud detection (wang2019fdgars; li2019spam; liu2020alleviating), etc. Various strategies have been proposed to improve the expressiveness of GNNs (hamilton2017inductive; velivckovic2017graph; xu2018powerful; schlichtkrull2018modeling).

(a) The test accuracy of GraphSage with different settings.
(b) The model parameter sizes of GraphSage with different settings.
Figure 1. The test accuracy of GraphSage on ogbn-products with different numbers of GNN layers (from 2 to 4) and different hidden dimension sizes (from 128 to 1024), and the corresponding model parameter sizes.

One natural candidate for improving the performance of a GNN is to increase its parameter size by either expanding the hidden dimension or the number of GNN layers. However, this can result in a large computational cost with only a modest performance gain. As a representative example, Figure 1 displays the performance of GraphSage (hamilton2017inductive) under different settings on the ogbn-products dataset and the corresponding model parameter sizes. From these results, it can be seen that either increasing the hidden dimension or increasing the number of GNN layers grows the model parameter size dramatically, but brings little performance improvement in terms of test accuracy. For example, in order to improve the accuracy of a 3-layer GraphSage model by 1%, we need to add 2.3× more parameters (by increasing the hidden dimension from 256 to 512). Furthermore, with a larger hidden dimension a model is more likely to overfit the training data. On the other hand, stacking multiple GNN layers may oversmooth the features of nodes (oono2019graph; chen2020measuring). As shown in Figure 1(a), GraphSage reaches its peak performance with only 3 GNN layers and a hidden dimension of 512.

Inspired by the Network-in-Network architecture (lin2013network), we present Network In Graph Neural Network (NGNN), a model-agnostic methodology that allows arbitrary GNN models to increase their model capacity by making the model deeper. However, instead of adding more GNN layers, NGNN deepens a GNN model by inserting non-linear feedforward neural network layer(s) within each GNN layer. This leads to a much smaller memory footprint than recent alternative deep GNN architectures (li2019deepgcns; li2020deepergcn) and can be applied to all kinds of GNN models with various training methods including full-graph training, neighbor sampling (hamilton2017inductive), cluster-based sampling (chiang2019cluster) and local subgraph sampling (zhang2018link). Thus, it can easily scale to large graphs. Moreover, analysis of NGNN in conjunction with GraphSage on perturbed ogbn-products shows that NGNN is a cheap yet effective way to keep the model stable against either node feature or graph structure perturbations.

In this work, we applied NGNN to GCN (kipf2016semi), GraphSage (hamilton2017inductive), GAT (velivckovic2017graph), AGDN (sun2020adaptive) and SEAL (zhang2018link). We also combined the proposed technique with different mini-batch training methods including neighbor sampling, graph clustering and local subgraph sampling. We conducted comprehensive experiments on several large-scale graph datasets for both node classification and link prediction, leading to the following conclusions (which hold as of the time of this submission):

  • NGNN improves the performance of GraphSage and GAT and their variants on node classification datasets including ogbn-products, ogbn-arxiv, ogbn-proteins and reddit. It improves the test accuracy of GraphSage by 1.6% on the ogbn-products dataset. Furthermore, NGNN with AGDN+BoT+self-KD+C&S (huang2020combining) achieves fourth place on the ogbn-arxiv leaderboard (https://ogb.stanford.edu/docs/leader_nodeprop/) and NGNN with GAT+BoT (wang2021bag) achieves second place on the ogbn-proteins leaderboard with many fewer model parameters.

  • NGNN improves the performance of SEAL, GCN and GraphSage and their variants on link prediction datasets including ogbl-collab, ogbl-ppa and ogbl-ddi. For example, it increases the test hits@100 score by 7.08% on the ogbl-ppa dataset for SEAL, which outperforms all the state-of-the-art approaches on the ogbl-ppa leaderboard (https://ogb.stanford.edu/docs/leader_linkprop/) by a substantial margin. Furthermore, NGNN improves the test hits@20 score by 6.22% on the ogbl-ddi dataset for GraphSage+EdgeAttr, which also takes first place on the ogbl-ddi leaderboard.

  • NGNN improves the performance of GraphSage and GAT under different training methods including full-graph training, neighbor sampling, graph clustering, and subgraph sampling.

  • NGNN is a more effective way of improving model performance than expanding the hidden dimension. It requires fewer parameters and less training time to achieve better performance than simply doubling the hidden dimension.

In summary, we present NGNN, a method that deepens a GNN model without adding extra GNN message-passing layers. We show that NGNN significantly improves the performance of vanilla GNNs on various datasets for both node classification and link prediction. We demonstrate the generality of NGNN by applying it to various GNN architectures.

2. Related Work

Deep models have been widely studied in various domains including computer vision (simonyan2014very; he2016deep), natural language processing (brown2020language), and speech recognition (zhang2017very). VGG (simonyan2014very) investigates the effect of convolutional neural network depth on accuracy in the large-scale image recognition setting, demonstrating that the depth of representations is essential to model performance; however, accuracy does not keep improving as depth grows. ResNet (he2016deep) eases the difficulty of training deep models by introducing residual connections between input and output layers. DenseNet (huang2017cvpr) takes this idea a step further by adding connections across layers. GPT-3 (brown2020language) presents an autoregressive language model with 96 layers that achieves SOTA performance on various NLP tasks. Even so, while deep neural networks have achieved great success in various domains, the use of deep models in graph representation learning is less well-established.

Most recent works (li2019deepgcns; li2020deepergcn; li2021training) attempt to train deep GNN models with a large number of parameters and achieve SOTA performance. For example, DeepGCN (li2019deepgcns) adapts the concepts of residual connections, dense connections, and dilated convolutions (yu2015multi) to training very deep GCNs. However, DeepGCN and its successor DeeperGCN (li2020deepergcn) have large memory footprints during model training, which can be subject to current hardware limitations. RevGNN (li2021training) explores grouped reversible graph connections to train a deep GNN and has a much smaller memory footprint. However, RevGNN can only work with full-graph training and cluster-based mini-batch training, which makes it difficult to combine with other methods designed for large-scale graphs such as neighbor sampling (hamilton2017inductive) and layer-wise sampling (chen2018fastgcn). In contrast, NGNN deepens a GNN model by inserting non-linear feedforward layer(s) within each GNN layer. It can be applied to all kinds of GNN models with various training methods including full-graph training, neighbor sampling (hamilton2017inductive), layer-wise sampling (chen2018fastgcn) and cluster-based sampling (chiang2019cluster).

Xu et al. (xu2018powerful) used Multilayer Perceptrons (MLPs) to learn the injective functions of the Graph Isomorphism Network (GIN) model and showed its effectiveness on graph classification tasks. But they did not show whether adding an MLP within GNN layers works effectively across wide-ranging node classification and link prediction tasks. Additionally, You et al. (you2020design) mentioned that adding MLPs within GNN layers could benefit performance. However, they did not systematically analyze the reason for the performance improvement introduced by extra non-linear layers, nor evaluate with numerous SOTA GNN architectures on large-scale graph datasets for both node classification and link prediction tasks.

(a) Gaussian noise is concatenated with node features.
(b) Gaussian noise is added to node features.
Figure 2. The test accuracy (%) of GraphSage, GraphSage with a hidden dimension of 512 (GraphSage-512), 4-layer GraphSage (GraphSage-4layer), GraphSage with one additional non-linear layer in each GNN layer (NGNN-GraphSage-1) and GraphSage with two additional non-linear layers in each GNN layer (NGNN-GraphSage-2) on ogbn-products with randomly added Gaussian noise in node features. By default, a GraphSage model has three GNN layers with a hidden dimension of 256.

3. Building Network in Graph Neural Network Models

3.1. Preliminaries

A graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is composed of nodes and edges, where $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set of edges. Furthermore, $A$ denotes the corresponding adjacency matrix of $\mathcal{G}$. Let $X$ be the node feature space such that $x_v \in X$ represents the node feature of node $v$. Formally, the $l$-th layer of a GNN is defined as (we omit edge features for simplicity):

$$h^{(l)} = \sigma\left(f^{(l)}_{w}\left(\mathcal{G}, h^{(l-1)}\right)\right), \qquad (1)$$

where the function $f^{(l)}_{w}$ is determined by learnable parameters $w$ and $\sigma$ is an optional activation function. Additionally, $h^{(l)}$ represents the embeddings of the nodes in the $l$-th layer, and $h^{(l-1)} = X$ when $l = 1$. With an $L$-layer GNN, the node embeddings $h^{(L)}$ in the last layer are used by downstream tasks like node classification and link prediction.
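To make Eq. (1) concrete, the following is a minimal sketch (ours, not the authors' released code) of a vanilla $L$-layer GraphSage stack using DGL's SAGEConv; the mean aggregator, layer sizes and ReLU activation are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import SAGEConv


class VanillaSAGE(nn.Module):
    """A plain L-layer GraphSage stack: h^(l) = sigma(f_w^(l)(G, h^(l-1))), Eq. (1)."""

    def __init__(self, in_dim, hidden_dim, out_dim, num_layers=3):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * (num_layers - 1) + [out_dim]
        self.convs = nn.ModuleList(
            SAGEConv(dims[i], dims[i + 1], aggregator_type="mean")
            for i in range(num_layers)
        )

    def forward(self, graph, feat):
        h = feat
        for i, conv in enumerate(self.convs):
            h = conv(graph, h)               # f_w^(l)(G, h^(l-1))
            if i < len(self.convs) - 1:      # sigma on all but the output layer
                h = F.relu(h)
        return h
```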

3.2. Basic NGNN Design

Inspired by the network-in-network architecture (lin2013network), we deepen a GNN model by inserting non-linear feedforward neural network layer(s) within each GNN layer. The $l$-th layer in NGNN is thus constructed as:

$$h^{(l)} = g^{(l)}_{w}\left(f^{(l)}_{w}\left(\mathcal{G}, h^{(l-1)}\right)\right). \qquad (2)$$

The calculation of $g^{(l)}_{w}$ is defined layer-wise as:

$$h_{(k)} = \sigma\left(W_{(k)} h_{(k-1)}\right), \quad k = 1, \ldots, T, \qquad (3)$$

where $W_{(k)}$ are learnable weight matrices, $\sigma$ is an activation function, and $T$ is the number of in-GNN non-linear feedforward neural network layers. The first in-GNN layer takes the output of $f^{(l)}_{w}$ as input and performs the non-linear transformation.
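The following is a minimal sketch of how Eqs. (2)-(3) could be realized by wrapping an arbitrary DGL-style convolution; the class and argument names (NGNNLayer, num_ff_layers) are ours rather than the paper's, and the wrapper assumes a single-head layer such as SAGEConv whose output is a 2-D feature matrix.

```python
import torch.nn as nn


class NGNNLayer(nn.Module):
    """Eq. (2)-(3): a base GNN layer f_w followed by T in-layer feedforward layers g_w."""

    def __init__(self, conv, hidden_dim, num_ff_layers=2, activation=None):
        super().__init__()
        self.conv = conv                          # any DGL-style layer called as conv(graph, feat)
        self.activation = activation or nn.ReLU()
        self.ff_layers = nn.ModuleList(           # the in-GNN non-linear feedforward layers
            nn.Linear(hidden_dim, hidden_dim) for _ in range(num_ff_layers)
        )

    def forward(self, graph, feat):
        h = self.conv(graph, feat)                # message passing: f_w^(l)(G, h^(l-1))
        for ff in self.ff_layers:                 # h_(k) = sigma(W_(k) h_(k-1)), k = 1..T
            h = self.activation(ff(h))
        return h
```

For example, NGNNLayer(SAGEConv(256, 256, "mean"), hidden_dim=256, num_ff_layers=2) would correspond to the NGNN-2layer variant evaluated in Section 4.5.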

3.3. Discussion

In this section, we demonstrate that an NGNN architecture can better handle both noisy node features and noisy graph structures relative to its vanilla GNN counterpart.

Figure 3. The model performance of GraphSage, GraphSage with one additional non-linear layer in each GNN layer (NGNN-GraphSage-1) and GraphSage with two additional non-linear layers in each GNN layer (NGNN-GraphSage-2) on ogbn-products with randomly added noise edges.
Remark 1 ().

GNNs work well when the input features consist of distinguishable true features and noise. But when the true features are mixed with noise, GNNs can struggle to filter out the noise, especially as the noise level increases.

GNNs follow a neural message passing scheme (gilmer2017neural) to aggregate information from the neighbors of a target node. In doing so, they can perform noise filtering and learn from the resulting signal when the noise is in some way distinguishable from the true features, such as when the latter are mostly low-frequency (nt2019revisiting). However, when the noise level becomes too large and the noise is mixed with the true features, it cannot easily be reduced by GNNs (huang2019residual). Figure 2 demonstrates this scenario. Here we randomly added Gaussian noise $\mathcal{N}(0, \sigma^2)$ to the node features $X$ of the ogbn-products data, where $\sigma$ is the standard deviation ranging from 0.1 to 5.0. We adopt two different methods for adding noise: 1) concatenating the noise with the features, $X \,\Vert\, \mathcal{N}(0, \sigma^2)$, as shown in Figure 2(a), where $\Vert$ is a concatenation operation, and 2) adding the noise to the features, $X + \mathcal{N}(0, \sigma^2)$, as shown in Figure 2(b). We trained GraphSage models (using the DGL (wang2020deep) implementation) under five different settings: 1) baseline GraphSage with the default 3-layer structure and a hidden dimension of 256, denoted as GraphSage; 2) GraphSage with the hidden dimension increased to 512, denoted as GraphSage-512; 3) 4-layer GraphSage, denoted as GraphSage-4layer; 4) GraphSage with one additional non-linear layer in each GNN layer, denoted as NGNN-GraphSage-1; and 5) GraphSage with two additional non-linear layers in each GNN layer, denoted as NGNN-GraphSage-2. In all cases we used ReLU as the activation function. As shown in Figure 2(a), GraphSage performs well when the noise is highly distinguishable from the true features. But the performance starts dropping when the noise is mixed with the true features, and it decays faster when $\sigma$ becomes larger than 1.0, as shown in Figure 2(b).
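For illustration, the two perturbation schemes could be reproduced with a sketch like the following; the function name and signature are our own, as the paper does not specify its exact noise-injection code.

```python
import torch


def perturb_features(x, sigma, mode="add"):
    """Inject Gaussian noise N(0, sigma^2) into node features x, as in Figure 2."""
    noise = torch.randn_like(x) * sigma
    if mode == "concat":         # Figure 2(a): noise concatenated with the true features
        return torch.cat([x, noise], dim=1)
    return x + noise             # Figure 2(b): noise mixed into the true features
```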

The same scenario happens with the gfNN model (nt2019revisiting), which is formed by transforming input node features via multiplications of the adjacency matrix followed by application of a single MLP block. This relatively simple model was shown to be more noise tolerant than GCN and SGC (wu2019simplifying); however, the performance of gfNN turns out to be much lower than the baseline GraphSage model in our experiments, so we do not present those results here.

Remark 2 ().

NGNN is a cheap yet effective way to form a GNN architecture that is stable against node feature perturbations.

One potential way to improve the denoising capability of a GNN model is to increase the parameter count via a larger hidden dimension. As shown in Figure 2(b), GraphSage-512 does perform better than the baseline GraphSage. But it is also more expensive, as its parameter size (675,887) is larger than that of baseline GraphSage (206,895). And it is still not as effective as either NGNN model, both of which use considerably fewer parameters (see below) and yet have more stable performance as the noise level increases.

An alternative strategy for increasing the model parameter count is to add more GNN layers. As shown in Figure 2(b), by adding one more GNN layer, GraphSage-4layer does outperform baseline GraphSage when $\sigma$ is smaller than 4.0. However, as a deeper GNN potentially aggregates more noisy information from its multi-hop neighbors (zeng2020deep), the performance of GraphSage-4layer drops below baseline GraphSage when $\sigma$ is 5.0.

In contrast to the above two methods, NGNN-GraphSage achieves much better performance, as shown in Figure 2(b), with fewer parameters (272,687 for NGNN-GraphSage-1 and 338,479 for NGNN-GraphSage-2) than GraphSage-512 and without introducing new GNN layers. It helps maintain model performance when $\sigma$ is smaller than 1.0 and slows the downward trend when $\sigma$ is larger than 1.0, compared to the other three counterparts.

Remark 3 ().

NGNN can also keep a GNN model stable against graph structure perturbations.

We now show that by applying NGNN to a GNN, it can better deal with graph structure perturbations. For this purpose, we randomly added $\gamma \cdot |\mathcal{E}|$ edges to the original graph of ogbn-products, where $\gamma$ is the ratio of newly added noise edges to the existing edges. For example, $\gamma = 0.01$ means we randomly added 618.6K edges (the graph of ogbn-products has 61,859,140 edges). We trained 3-layer GraphSage models with a hidden dimension of 256 under three different settings: 1) vanilla GraphSage, denoted as GraphSage; 2) GraphSage with one additional non-linear layer in each GNN layer, denoted as NGNN-GraphSage-1; and 3) GraphSage with two additional non-linear layers in each GNN layer, denoted as NGNN-GraphSage-2. Figure 3 shows the results. It can be seen that NGNN helps preserve the model performance when $\gamma$ is smaller than 0.01 and slows the performance degradation when $\gamma$ is larger than 0.01, compared to vanilla GraphSage.
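As a hedged illustration of this setup (not the authors' code), random noise edges could be injected into a DGL graph as follows; the function name and the uniform sampling of endpoints are our assumptions.

```python
import dgl
import torch


def perturb_graph(g, ratio):
    """Add ratio * |E| random noise edges to graph g, as in the Figure 3 experiment."""
    num_new = int(ratio * g.num_edges())
    src = torch.randint(0, g.num_nodes(), (num_new,))
    dst = torch.randint(0, g.num_nodes(), (num_new,))
    return dgl.add_edges(g, src, dst)   # returns a new graph containing the extra edges
```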

4. Experiments

We next provide experimental evidence to show that NGNN works well with various GNN architectures for both node classification and link prediction tasks in Sections 4.2 and 4.3. We also show that NGNN works with different training methods in Section 4.4. Finally, we discuss the impact of different NGNN settings in Sections 4.5 and 4.6.

4.1. Evaluation Setup

Datasets

We conducted experiments on seven datasets, including ogbn-products, ogbn-arxiv and ogbn-proteins from ogbn (hu2020open) and reddit (http://snap.stanford.edu/graphsage/) for node classification, and ogbl-collab, ogbl-ppa and ogbl-ddi from ogbl (hu2020open) for link prediction. The detailed statistics are summarized in Table 1.

Datasets # Nodes # Edges
Node Classification
ogbn-products 2,449,029 61,859,140
ogbn-arxiv 169,343 1,166,243
ogbn-proteins 132,524 39,561,252
reddit 232,965 114,615,892
Link Prediction
ogbl-collab 235,868 1,285,465
ogbl-ppa 576,289 30,326,273
ogbl-ddi 4,267 1,334,889
Table 1. Dataset statistics.
GNN Model Description
Node classification task
GraphSage Vanilla GraphSage with neighbor sampling.
GraphSage-Cluster Vanilla GraphSage with cluster-based sampling (chiang2019cluster).
GAT-FLAG GAT with FLAG (kong2020flag) enhancement.
GAT+BoT GAT with bag of tricks (wang2021bag).
AGDN+BoT AGDN with bag of tricks.
AGDN+BoT+self-KD+C&S AGDN with bag of tricks, self knowledge distillation and correct & smooth (huang2020combining).
Link prediction task
SEAL-DGCNN Vanilla SEAL using DGCNN (zhang2018end) as the backbone GNN.
GCN-full Vanilla GCN with full-graph training.
GraphSage-full Vanilla GraphSage with full-graph training.
GraphSage+EdgeAttr GraphSage with edge attributes.
Table 2. Baseline GNN models used in evaluation.
Dataset ogbn-products ogbn-arxiv ogbn-proteins reddit
Eval Metric Accuracy(%) Accuracy(%) ROC-AUC(%) Accuracy(%)
GraphSage Vanilla 78.27±0.45 71.15±1.66 75.67±1.72 96.19±0.08
NGNN 79.88±0.34 71.77±1.18 76.30±0.96 96.21±0.04
GraphSage-Cluster Vanilla 78.72±0.63 56.57±1.56 67.45±1.21 95.27±0.09
NGNN 78.91±0.59 56.76±1.08 68.12±0.96 95.34±0.09
GAT-NS Vanilla 79.23±0.16 72.10±1.12 81.76±0.17 96.12±0.02
NGNN 79.67±0.09 71.88±1.10 81.91±0.21 96.45±0.05
GAT-FLAG Vanilla 80.75±0.14 71.56±1.11 81.81±0.15 95.27±0.02
NGNN 80.99±0.09 71.74±1.10 81.84±0.11 95.68±0.03
  • Any score difference between vanilla GNN and NGNN GNN that is greater than 0.5% is highlighted with boldface.

Table 3. Performance of NGNN on ogbn-products, ogbn-arxiv, ogbn-proteins and reddit.
Dataset Model Accuracy(%)
ogbn-arxiv AGDN+BoT Vanilla 74.03±0.15
ogbn-arxiv NGNN 74.25±0.17
ogbn-arxiv AGDN+BoT+ Vanilla 74.28±0.13
ogbn-arxiv self-KD+C&S NGNN 74.34±0.14
ogbn-proteins GAT+BoT Vanilla 87.73±0.18
ogbn-proteins NGNN 88.09±0.1
Table 4. Performance (as measured by classification accuracy and ROC-AUC for ogbn-arxiv and ogbn-proteins, respectively) of NGNN combined with bag of tricks on ogbn-arxiv and ogbn-proteins.
Metric (%) ogbl-collab ogbl-ppa ogbl-ddi
Vanilla GNN NGNN Vanilla GNN NGNN Vanilla GNN NGNN
SEAL- hit@20 45.76±0.72 46.19±0.58 16.10±1.85 20.82±1.76 30.75±2.12 31.93±3.00
DGCNN hit@50 54.70±0.49 54.82±0.20 32.58±1.42 37.25±0.98 43.99±1.11 42.39±3.23
hit@100 60.13±0.32 60.70±0.18 49.36±1.24 56.44±0.99 51.25±1.60 49.63±3.65
GCN-full hit@10 35.94±1.60 36.69±0.82 4.00±1.46 5.64±0.93 47.82±5.90 48.22±7.00
hit@50 49.52±0.70 51.83±0.50 14.23±1.81 18.44±1.88 79.56±3.83 82.56±4.03
hit@100 55.74±0.44 57.41±0.22 20.21±1.92 26.78±0.92 87.58±1.33 89.48±1.68
GraphSage- hit@10 32.59±3.56 36.83±2.56 3.68±1.02 3.52±1.24 54.27±9.86 60.75±4.94
full hit@50 51.66±0.35 52.62±1.04 15.02±1.69 15.55±1.92 82.18±4.00 84.58±1.89
hit@100 56.91±0.72 57.96±0.56 23.56±1.58 24.45±2.34 91.94±0.64 92.58±0.88
GraphSage+ hit@20 - - - - 87.06±4.81 93.28±1.61
EdgeAttr hit@50 - - - - 97.98±0.42 98.39±0.21
hit@100 - - - - 98.98±0.16 99.21±0.08
  • The evaluation metrics used in the OGB leaderboard are hit@50 for ogbl-collab, hit@100 for ogbl-ppa and hit@20 for ogbl-ddi.

  • Any hit score difference between vanilla GNN and NGNN GNN that is greater than 1% is highlighted with boldface.

  • The evaluation metrics used for ogbl-ddi when profiling GCN-full and GraphSage-full are hit@10, hit@50 and hit@100.

Table 5. Performance of NGNN on ogbl-collab, ogbl-ppa and ogbl-ddi. We use hit@10, hit@20, hit@50 and hit@100 as the evaluation metrics.

We evaluated the effectiveness of NGNN by applying it to various GNN models including GCN (kipf2016semi), GraphSage (hamilton2017inductive), Graph Attention Network (GAT) (velivckovic2017graph), Adaptive Graph Diffusion Networks (AGDN) (sun2020adaptive) and SEAL (zhang2020revisiting), as well as their variants. Table 2 presents all the baseline models. We directly followed the implementation and configuration of each baseline model from the OGB (hu2020open) leaderboard and added non-linear layer(s) into each GNN layer for NGNN. Table 11 presents the detailed configuration of each model. All models were trained on a single V100 GPU with 32GB memory. We report average performance over 10 runs for all models except the SEAL-related models; as training SEAL models is very expensive, we used 5 runs for those instead.

4.2. Node classification

First, we analyzed how NGNN improves the performance of GNN models on node classification tasks. Table 3 presents the overall results. It can be seen that NGNN-based models outperform their baseline models in most cases. Notably, NGNN tends to perform well with GraphSage. It improves the test accuracy of GraphSage on ogbn-products and ogbn-arxiv by 1.61% and 0.62% respectively. It also improves the ROC-AUC score of GraphSage on ogbn-proteins by 0.63%. But as the baseline performance on the reddit dataset is already quite high, the overall improvement of NGNN there is, unsurprisingly, not significant.

We further analyze the performance of NGNN combined with bag of tricks (wang2021bag) on ogbn-arxiv and ogbn-proteins in Table 4. It can be seen that NGNN-based models outperform their vanilla counterparts. NGNN with AGDN+BoT+self-KD+C&S even achieves first place among all methods that use no extension to the input data on the ogbn-arxiv leaderboard as of the time of this submission (fourth place on the entire ogbn-arxiv leaderboard). NGNN with GAT+BoT also achieves second place on the ogbn-proteins leaderboard with 5.83 times fewer parameters than the current leading method RevGNN-Wide (NGNN-GAT+BoT has 11,740,552 parameters while RevGNN-Wide has 68,471,608 parameters).

4.3. Link prediction

Second, we analyzed how NGNN improves the performance of GNN models on link prediction tasks. Table 5 presents the results on the ogbl-collab, ogbl-ppa and ogbl-ddi datasets. As shown in the table, the performance improvement of NGNN over SEAL models is significant. NGNN improves the hit@20, hit@50 and hit@100 of SEAL-DGCNN by 4.72%, 4.67% and 7.08% respectively on ogbl-ppa. NGNN with SEAL-DGCNN achieves first place on the ogbl-ppa leaderboard with an improvement in hit@100 of 5.82% over the current leading method MLP+CN&RA&AA (https://github.com/lustoo/OGB_link_prediction). Furthermore, NGNN with GraphSage+EdgeAttr achieves first place on the ogbl-ddi leaderboard with an improvement in hit@20 of 5.47% over the current leading method, vanilla GraphSage+EdgeAttr. As GraphSage+EdgeAttr only provides results on ogbl-ddi, we do not compare its performance on other datasets. NGNN also works with GCN and GraphSage on link prediction tasks. As shown in the table, it improves the performance of GCN and GraphSage in almost all cases. In particular, it improves the hit@10, hit@50 and hit@100 of GCN by 1.64%, 4.21% and 6.57% respectively on ogbl-ppa.

Method full-graph neighbor sampling cluster-based sampling
GraphSage 78.27 78.70 78.72
GraphSage-NGNN 79.88 79.11 78.91
GAT 80.75 79.23 71.41
GAT-NGNN 80.99 79.67 76.76
Table 6. Test accuracy (%) of GraphSage and GAT with and without NGNN trained with different training methods on ogbn-products.
Model GraphSage
Hidden-size 128 256 512
baseline 77.44 78.27 79.37
NGNN-1layer 77.39 79.53 79.12
NGNN-2layer 78.79 79.88 79.94
NGNN-4layer 78.79 79.52 79.88
Model GAT
Hidden-size 64 128 256
baseline 68.41 79.23 75.26
NGNN-1layer 69.72 79.67 77.53
NGNN-2layer 69.86 78.26 78.76
NGNN-4layer 69.41 78.23 78.61
Table 7. Test accuracy (%) of GraphSage and GAT with different numbers of non-linear layers added into GNN layers on ogbn-products.
Model GraphSage
Hidden-size 128 256 512
baseline 70,703 206,895 675,887
NGNN-1layer 87,215 272,687 938,543
NGNN-2layer 103,727 338,479 1,201,199
NGNN-4layer 136,751 470,063 1,726,511
Model GAT
Hidden-size 64 128 256
baseline 510,056 1,543,272 5,182,568
NGNN-1layer 514,152 1,559,656 5,248,104
NGNN-2layer 518,248 1,576,040 5,313,640
NGNN-4layer 526,440 1,608,808 5,444,712
Table 8. The parameter size of each model in Table 7.
Model GraphSage (secs)
Hidden-size 128 256 512
baseline 18.71±0.41 20.71±0.74 25.40±0.35
NGNN-1layer 19.57±1.04 20.98±0.47 29.07±0.69
NGNN-2layer 19.25±0.87 21.36±0.48 30.01±0.13
NGNN-4layer 19.79±0.72 24.41±0.38 32.33±0.19
Table 9. The single-epoch training time (in seconds) of GraphSage with different numbers of non-linear layers added into GNN layers on ogbn-products.

4.4. NGNN with Different Training Methods

Next, we present the effectiveness of using NGNN with different training methods, including full-graph training, neighbor sampling and cluster-based sampling. Table 6 presents the results. It can be seen that NGNN improves the performance of GraphSage and GAT with all of these training methods on ogbn-products. It is worth mentioning that NGNN also works with the local subgraph sampling method proposed by SEAL (zhang2018link), as shown in Section 4.3.

4.5. Effectiveness of Multiple NGNN Layers

We studied the effectiveness of adding multiple non-linear layers to GNN layers on ogbn-products using GraphSage and GAT. The baseline model is a three-layer GNN model. We applied 1, 2 or 4 non-linear layers to each hidden GNN layer, denoted as NGNN-1layer, NGNN-2layer and NGNN-4layer respectively. The GAT models use eight attention heads, and all heads share the same NGNN layer(s). Table 7 presents the results. As shown in the table, NGNN-2layer performed best in most cases across different hidden sizes. This reveals that adding non-linear layers can be effective, but the effect diminishes as we continue to add more layers. The reason is straightforward: adding more non-linear layers can eventually cause overfitting.

We also observe that deeper models can achieve better performance with many fewer trainable parameters than wider models. Table 8 presents the parameter size of each model. As shown in the table, the parameter size of GraphSage with NGNN-2layer and a hidden size of 256 is 338,479, which is about 2× smaller than that of vanilla GraphSage with a hidden size of 512 (675,887), yet its performance is much better.
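A rough back-of-the-envelope calculation (ours; it counts only the square weight matrices and ignores biases and aggregator-specific weights) illustrates why the in-GNN layers are cheaper than widening:

```python
# Approximate per-layer weight counts, ignoring biases and aggregator details.
hidden = 256
widened = 512

gnn_layer_wide = widened * widened            # one message-passing weight at width 512
gnn_layer_narrow = hidden * hidden            # the same weight at width 256
ngnn_ff_layer = hidden * hidden               # one extra in-GNN feedforward layer

print(gnn_layer_wide)                         # 262144
print(gnn_layer_narrow + 2 * ngnn_ff_layer)   # 196608: narrow + NGNN-2layer stays cheaper
```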

Furthermore, we also observe that adding NGNN layers only slightly increases the model training time. Table 9 presents the single-epoch training time of GraphSage under different configurations. As shown in the table, the epoch time of GraphSage with NGNN-2layer and a hidden size of 256 is only 3.1% longer than that of vanilla GraphSage with the same hidden size, even though the corresponding parameter size is 1.63× larger.

GraphSage GAT
baseline 78.27 79.23
NGNN-all 79.88 79.49
NGNN-input 79.81 78.87
NGNN-hidden 79.91 79.68
NGNN-output 78.60 78.45
Table 10. Test accuracy (%) of GraphSage and GAT when applying NGNN to different GNN layers on ogbn-products.

4.6. Effectiveness of Applying NGNN to Different GNN Layers

Finally, we studied the effectiveness of applying NGNN to only the input GNN layer (NGNN-input), only the hidden GNN layers (NGNN-hidden), only the output GNN layer (NGNN-output), or all the GNN layers (NGNN-all) on the ogbn-products dataset using GraphSage and GAT. The baseline model is a three-layer GNN model. The hidden dimension size is 256 for GraphSage and 128 for GAT. Table 10 presents the results. As the table shows, applying NGNN only to the output GNN layer brings little or no benefit, while applying NGNN to the hidden and input GNN layers improves model performance, with the hidden layers benefiting most. This demonstrates that the benefit of NGNN mainly comes from adding additional non-linear layers into the input and hidden GNN layers.
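A minimal sketch of how these position variants could be wired up for a GraphSage backbone is shown below; the class and argument names (PositionalNGNNSage, ngnn_position) are our own illustration rather than the authors' implementation, and the exact activation placement is an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import SAGEConv


class PositionalNGNNSage(nn.Module):
    """3-layer GraphSage; an extra feedforward block is inserted only at the chosen position(s)."""

    def __init__(self, in_dim, hidden_dim, out_dim, ngnn_position="hidden"):
        super().__init__()
        dims = [in_dim, hidden_dim, hidden_dim, out_dim]
        roles = ["input", "hidden", "output"]
        self.convs = nn.ModuleList(
            SAGEConv(dims[i], dims[i + 1], aggregator_type="mean") for i in range(3)
        )
        # nn.Identity() marks layers that keep the vanilla (non-NGNN) form.
        self.ffs = nn.ModuleList(
            nn.Linear(dims[i + 1], dims[i + 1])
            if ngnn_position in ("all", roles[i]) else nn.Identity()
            for i in range(3)
        )

    def forward(self, graph, feat):
        h = feat
        for i, (conv, ff) in enumerate(zip(self.convs, self.ffs)):
            h = conv(graph, h)                       # base message passing f_w
            if not isinstance(ff, nn.Identity):
                h = F.relu(ff(h))                    # the inserted in-GNN non-linear layer
            if i < len(self.convs) - 1:
                h = F.relu(h)                        # usual activation between GNN layers
        return h
```

For instance, PositionalNGNNSage(in_dim, 256, num_classes, ngnn_position="hidden") would correspond to the NGNN-hidden configuration in Table 10.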

Dataset Model hidden size layers aggregation NGNN position NGNN setting
Node classification tasks
ogbn-product GraphSage 256 3 mean hidden-only 1-relu+1-sigmoid
ogbn-product GraphSage-cluster 256 3 mean hidden-only 1-relu+1-sigmoid
ogbn-product GAT-flag 256 3 sum hidden-only 1-relu+1-sigmoid
ogbn-product GAT-ns 256 3 sum hidden-only 1-relu+1-sigmoid
ogbn-arxiv GraphSage 256 3 mean hidden-only 1-relu+1-sigmoid
ogbn-arxiv GraphSage-cluster 256 3 mean hidden-only 1-relu+1-sigmoid
ogbn-arxiv GAT-flag 256 3 sum hidden-only 1-relu+1-sigmoid
ogbn-arxiv GAT+BoT 120 6 sum hidden-only 2-relu
ogbn-arxiv AGDN+BoT 256 3 GAT-HA hidden-only 1-relu
ogbn-arxiv AGDN+BoT+self-KD+C&S 256 3 GAT-HA hidden-only 1-relu
ogbn-protein GraphSage 256 3 mean hidden-only 1-relu
ogbn-protein GraphSage-cluster 256 3 mean hidden-only 1-relu
ogbn-protein GAT-flag 256 3 sum hidden-only 1-relu
ogbn-protein GAT-ns 256 3 sum hidden-only 1-relu
reddit GraphSage 256 3 mean hidden-only 1-relu+1-sigmoid
reddit GraphSage-cluster 256 3 mean hidden-only 1-relu+1-sigmoid
reddit GAT-flag 256 3 sum hidden-only 1-relu+1-sigmoid
reddit GAT-ns 256 3 sum hidden-only 1-relu+1-sigmoid
Link prediction tasks
ogbl-collab Seal-DGCNN 256 3 sum all-layers 1-relu
ogbl-collab GCN-full 256 3 mean hidden-only 2-relu
ogbl-collab GraphSage-full 256 3 mean hidden-only 2-relu
ogbl-ppa Seal-DGCNN 32 3 sum all-layers 1-relu
ogbl-ppa GCN-full 256 3 mean all-layers 2-relu
ogbl-ppa GraphSage-full 256 3 mean all-layers 2-relu
ogbl-ddi Seal-DGCNN 32 3 sum hidden-only 1-relu
ogbl-ddi GCN-full 256 2 mean input-only 1-relu
ogbl-ddi GraphSage-full 256 2 mean input-only 1-relu
ogbl-ddi GraphSage+EdgeAttr 512 2 mean all-layers 2-relu
Table 11. Model settings of NGNN models. The NGNN position column indicates where we put the non-linear layers: hidden-only means applying NGNN only to the hidden GNN layers, input-only means applying NGNN only to the input layer, and all-layers means applying NGNN to all the GNN layers. The NGNN setting column indicates how we organize each NGNN layer. For example, 1-relu+1-sigmoid means NGNN contains one feedforward neural network layer with ReLU as its activation function followed by another feedforward neural network layer with Sigmoid as its activation function, and 2-relu means NGNN contains two feedforward neural network layers, each with ReLU as its activation function.

5. Conclusion and Future Work

We present NGNN, a model-agnostic methodology that allows arbitrary GNN models to increase their model capacity by inserting non-linear feedforward neural network layer(s) inside GNN layers. Moreover, unlike existing deep GNN approaches, NGNN does not have a large memory overhead and can work with various training methods including neighbor sampling, graph clustering and local subgraph sampling. Empirically, we demonstrate that NGNN works with various GNN models on both node classification and link prediction tasks and achieves state-of-the-art results. Future work includes evaluating NGNN on more GNN models and investigating whether NGNN can work on broader graph-related prediction tasks. We also plan to explore methodologies to make a single GNN layer deeper in the future.

References