
Graph Attention Multi-Layer Perceptron

by   Wentao Zhang, et al.

Graph neural networks (GNNs) have recently achieved state-of-the-art performance in many graph-based applications. Despite their high expressive power, they typically need to perform an expensive recursive neighborhood expansion over multiple training epochs, which raises a scalability issue. Moreover, most GNNs are inflexible since they are restricted to fixed-hop neighborhoods and insensitive to the actual receptive field demands of different nodes. We circumvent these limitations by introducing a scalable and flexible Graph Attention Multilayer Perceptron (GAMLP). By separating the non-linear transformation from feature propagation, GAMLP significantly improves scalability and efficiency, performing the propagation procedure in a pre-computed manner. With three principled receptive field attention mechanisms, each node in GAMLP can flexibly and adaptively leverage the propagated features over receptive fields of different sizes. We conduct extensive evaluations on three large open graph benchmarks (ogbn-papers100M, ogbn-products, and ogbn-mag), demonstrating that GAMLP not only achieves state-of-the-art performance but also provides high scalability and efficiency.





1 Introduction

Graph Neural Networks (GNNs) are powerful deep neural networks for graph-structured data and have become the de facto method in many semi-supervised and unsupervised graph representation learning scenarios, such as node classification, link prediction, recommendation, and knowledge graphs kipf2016semi; hamilton2017inductive; bo2020structural; cui2020adaptive; fan2019graph; trouillon2017knowledge. By stacking graph convolution layers, GNNs can learn node representations that utilize information from the $k$-hop neighborhood, and thus enhance model performance by involving more unlabeled nodes in the training process.

Unlike images, text, or tabular data, where training samples are independently distributed, graph data contains extra relational information between nodes. Besides, real-world graphs are usually huge. For example, the users and their relationships in WeChat form a graph with billions of nodes and tens of billions of edges. Every node in a $k$-layer GNN incorporates a set of nodes, including the node itself and its $k$-hop neighbors; this set is called the Receptive Field (RF). As the size of the RF grows exponentially with the number of GNN layers, the rapidly expanding RF introduces high computation and memory costs on a single machine. Moreover, even in a distributed environment, a GNN has to read a great amount of data from neighboring nodes to compute a single target node representation, leading to high communication costs. Despite their effectiveness, the utilization of neighborhood information in GNNs thus leads to a scalability issue when training on large graphs.

A commonly used approach to tackle this issue is sampling, such as node sampling hamilton2017inductive; DBLP:conf/icml/ChenZS18, layer sampling DBLP:conf/nips/Huang0RH18; DBLP:conf/iclr/ChenMX18, and graph sampling chiang2019cluster; DBLP:conf/iclr/ZengZSKP20. However, sampling-based methods are imperfect: they still face high communication costs, and the sampling quality strongly influences model performance. A recent alternative direction for scalable GNNs is model simplification. For example, Simplified GCN (SGC) wu2019simplifying decouples feature propagation from the non-linear transformation and executes the former during pre-processing. Unlike sampling-based methods, which still require feature propagation in every training epoch, this time-consuming process in SGC is executed only once, and only the nodes of the training set are involved in model training. As a result, SGC is computation- and memory-efficient on a single machine and scalable in distributed settings, since it does not need to fetch the features of neighboring nodes during model training. Despite its scalability, SGC adopts a fixed number of feature propagation layers, leading to a fixed RF for all nodes. Such graph-wise propagation lacks the flexibility to model the interesting correlations on node features under different receptive fields. Either long-range dependencies cannot be fully leveraged due to an undersized RF, or local information is lost because over-smoothed noise is introduced by an oversized RF. Both result in non-optimal discriminative node representations.

Figure 1: (a) Inconsistent optimal steps: test accuracy of SGC on 20 randomly sampled nodes of Citeseer. The X-axis is the node ID, and the Y-axis is the number of propagation steps (layers); the color from white to blue represents the ratio of correct predictions over 50 different runs. (b) Inconsistent RF expansion speed: the local graph structures of two nodes in different regions; the node in the dense region has a larger RF within two iterations of propagation.

A line of simplified models has been proposed to better use the propagated features under different propagation layers and RF. As graph-wise propagation only considers features under a fixed layer, SIGN frasca2020sign proposes to concatenate all these features without information loss, while S²GC zhu2021simple averages all these features to generate a combined feature with the same dimension. Although these methods consider the influence of different layers, the relative importance of the different propagated features is ignored: under a large propagation depth, some over-smoothed features with oversized RF will introduce feature noise and degrade model performance. GBP chen2020scalable tackles this issue by adopting a constant decay factor for the weighted average of the propagated features. Motivated by Personalized PageRank, propagated features with more propagation layers face a higher risk of over-smoothing, so they contribute less to the final averaged features in GBP. All these methods adopt a layer-wise propagation mechanism and consider the features after different numbers of propagation layers. Despite their effectiveness, they fail to consider feature combination at the node-wise level.

As shown in Figure 1(a), different nodes require different propagation steps and corresponding smoothness levels. Besides, homogeneous and non-adaptive feature averaging may be unsuitable for all nodes due to the inconsistent RF expansion speed shown in Figure 1(b). To support scalable and node-adaptive graph learning, we propose a novel MLP with three RF attention mechanisms, abbreviated as GAMLP. Experimental results demonstrate that GAMLP achieves state-of-the-art performance on the three largest ogbn datasets, while maintaining high scalability and efficiency.

2 Preliminaries

In this section, we introduce the notations and review some current works tackling GNN scalability.

Notations. We consider an undirected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with $n$ nodes and $m$ edges. We denote by $\mathbf{A}$ the adjacency matrix of $\mathcal{G}$, weighted or not. Nodes can possibly have feature vectors of size $f$, stacked up in an $n\times f$ matrix $\mathbf{X}$. $\mathbf{D}=\operatorname{diag}(d_1,d_2,\dots,d_n)$ denotes the degree matrix of $\mathbf{A}$, where $d_i=\sum_{j}\mathbf{A}_{ij}$ is the degree of node $i$. Suppose $\mathcal{V}_l$ is the labeled node set; our goal is to predict the labels of nodes in the unlabeled set $\mathcal{V}_u$ with the supervision of $\mathcal{V}_l$.

Sampling. A commonly used method to tackle the scalability issue (i.e., the recursive neighborhood expansion) in GNNs is sampling. As a node-wise sampling method, GraphSAGE hamilton2017inductive randomly samples a fixed-size set of neighbors for computation in each mini-batch. VR-GCN DBLP:conf/icml/ChenZS18 analyzes variance reduction so that it can reduce the sample size at an additional memory cost. For layer-wise sampling, FastGCN DBLP:conf/iclr/ChenMX18 samples a fixed number of nodes at each layer, and ASGCN DBLP:conf/nips/Huang0RH18 proposes adaptive layer-wise sampling with better variance control. At the graph level, Cluster-GCN chiang2019cluster first clusters the nodes and then samples nodes within the clusters, and GraphSAINT DBLP:conf/iclr/ZengZSKP20 directly samples a subgraph for mini-batch training. As an approach orthogonal to model simplification, sampling has already been widely used in many GNNs and GNN systems distdgl_ai3_2020; aligraph_vldb_2019; pygeometric_iclr_2019.

Graph-wise Propagation. Recent studies have observed that non-linear feature transformation contributes little to GNN performance compared to feature propagation. Thus, a new direction for scalable GNNs is based on Simplified GCN (SGC) wu2019simplifying, which successively removes the nonlinearities and collapses the weight matrices between consecutive layers. This reduces the GNN to a linear model operating on $K$-layer propagated features:

$$\mathbf{Y}=\operatorname{softmax}\big(\hat{\mathbf{A}}^{K}\mathbf{X}\mathbf{W}\big),$$

where $\hat{\mathbf{A}}=\tilde{\mathbf{D}}^{-r}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{r-1}$, $\hat{\mathbf{A}}^{K}\mathbf{X}$ is the $K$-layer propagated feature, $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}$, and $\tilde{\mathbf{D}}=\mathbf{D}+\mathbf{I}$. By setting $r$ to 0.5, 1, and 0, $\hat{\mathbf{A}}$ represents the symmetric normalized adjacency matrix $\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$ DBLP:conf/iclr/KlicperaBG19, the transition probability matrix $\tilde{\mathbf{D}}^{-1}\tilde{\mathbf{A}}$ DBLP:conf/iclr/ZengZSKP20, or the reverse transition probability matrix $\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1}$ xu2018representation, respectively. As the propagated features can be precomputed, SGC is more scalable and efficient on large graphs. However, such graph-wise propagation restricts every node to the same number of propagation steps and a fixed RF. Therefore, some nodes' features may be over-smoothed or under-smoothed due to the inconsistent RF expansion speed, leading to suboptimal performance.

Layer-wise Propagation. Following SGC, some recent methods adopt layer-wise propagation to combine features from different propagation layers. SIGN frasca2020sign proposes to concatenate the propagated features of different iterations after a linear transformation: $\big[\mathbf{X}^{(0)}\mathbf{W}_{0}\,\|\,\mathbf{X}^{(1)}\mathbf{W}_{1}\,\|\,\cdots\,\|\,\mathbf{X}^{(K)}\mathbf{W}_{K}\big]$. S²GC zhu2021simple proposes simple spectral graph convolution, which averages the propagated features over iterations as $\frac{1}{K+1}\sum_{k=0}^{K}\mathbf{X}^{(k)}$. In addition, GBP chen2020scalable further improves the combination process by a weighted average $\sum_{k=0}^{K} w_k \mathbf{X}^{(k)}$ with layer weight $w_k$. Similar to these works, we also use a linear model for higher training scalability. The difference is that we consider propagation from a node-wise perspective: each node in GAMLP has a personalized combination of the propagated features of different steps.
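The three layer-wise combination schemes can be contrasted in a short numpy sketch (illustrative only; the decay parameter `beta` and the toy normalized adjacency are assumptions, not the papers' exact hyperparameters):

```python
# Layer-wise feature combination: SIGN concatenates, S2GC averages,
# GBP takes a decayed weighted average over propagation steps.
import numpy as np

def propagate_all(A_hat, X, K):
    """Return the list [X^(0), ..., X^(K)] of propagated features."""
    feats = [X]
    for _ in range(K):
        feats.append(A_hat @ feats[-1])
    return feats

def sign_combine(feats):
    return np.concatenate(feats, axis=1)           # [X^(0) || ... || X^(K)]

def s2gc_combine(feats):
    return np.mean(feats, axis=0)                  # (1/(K+1)) * sum_k X^(k)

def gbp_combine(feats, beta=0.5):
    w = np.array([beta * (1 - beta) ** k for k in range(len(feats))])
    w = w / w.sum()                                # normalized decayed weights
    return sum(wk * Xk for wk, Xk in zip(w, feats))

A_hat = np.full((3, 3), 1.0 / 3)                   # toy normalized adjacency
X = np.eye(3)
feats = propagate_all(A_hat, X, K=2)
```

Note all three schemes assign the same per-step weight to every node, which is exactly the node-level limitation GAMLP targets.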

Figure 2: Overview of the proposed GAMLP, including (1) feature propagation, (2) feature combination with RF attention, and (3) MLP training. The feature propagation can be pre-processed.

3 The GAMLP Model

3.1 Overview

As shown in Figure 2, GAMLP decomposes end-to-end GNN training into three parts: feature propagation, feature combination with RF attention, and MLP training. As the feature propagation is pre-processed only once and MLP training is efficient and scalable, we can easily scale GAMLP to large graphs. Besides, with the RF attention, each node in GAMLP can adaptively obtain suitable combination weights for the propagated features under different RF, thus boosting model performance.

Figure 3: The architecture of GAMLP with JK Attention.

3.2 Establishment of GAMLP

3.2.1 Feature Propagation

We separate out the essential operation of GNNs, feature propagation, by removing the neural network and nonlinear activation used for feature transformation. Specifically, we construct a parameter-free $K$-step feature propagation as:

$$\mathbf{X}^{(k)}=\hat{\mathbf{A}}\,\mathbf{X}^{(k-1)},\qquad \mathbf{X}^{(0)}=\mathbf{X},\qquad k=1,\dots,K,$$

where $\mathbf{X}^{(k)}$ contains the features of a fixed RF: the node itself and its $k$-hop neighborhood.

After the $K$-step feature propagation shown in Eq. 9, we correspondingly get a list of propagated features under different propagation steps: $\{\mathbf{X}^{(0)},\mathbf{X}^{(1)},\dots,\mathbf{X}^{(K)}\}$. For node-wise propagation, we propose to average these propagated features in a weighted manner:

$$\hat{\mathbf{X}}=\sum_{k=0}^{K}\operatorname{diag}(\mathbf{w}_{k})\,\mathbf{X}^{(k)},$$

where $\operatorname{diag}(\mathbf{w}_k)$ is the diagonal matrix derived from the $n$-dimensional vector $\mathbf{w}_k$, and $w_k(i)$ measures the importance of the $k$-step propagated feature for node $i$. To satisfy the different RF requirements of each node, we introduce three RF attention mechanisms to compute $w_k(i)$.
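The node-wise weighted combination can be sketched in a few lines of numpy (a minimal illustration, assuming the weight matrix `W` has already been produced by one of the attention mechanisms below):

```python
# Node-wise combination: X_hat = sum_k diag(w_k) X^(k), where W[i, k] weights
# the k-step propagated feature of node i individually.
import numpy as np

def node_wise_combine(feats, W):
    """feats: list of (n, f) arrays; W: (n, K+1) per-node, per-step weights."""
    n, f = feats[0].shape
    X_hat = np.zeros((n, f))
    for k, X_k in enumerate(feats):
        X_hat += W[:, k:k + 1] * X_k               # row-wise scaling = diag(w_k) @ X^(k)
    return X_hat

feats = [np.ones((2, 3)), 2 * np.ones((2, 3))]     # toy 0- and 1-step features
W = np.array([[1.0, 0.0],                          # node 0 keeps only step 0
              [0.0, 1.0]])                         # node 1 keeps only step 1
X_hat = node_wise_combine(feats, W)
```

Contrast with the layer-wise schemes above: here each row of `W` can differ, so two nodes may draw on entirely different receptive fields.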

3.2.2 Receptive Field Attention

Smoothing Attention. Suppose we execute feature propagation infinitely many times. In that case, the node embeddings within the same connected component reach a stationary state, and it becomes hard to distinguish a specific node from the others. This issue is referred to as over-smoothing li2018deeper. Concretely, when applying $\hat{\mathbf{A}}=\tilde{\mathbf{D}}^{-r}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{r-1}$ as the adjacency matrix, the stationary state follows

$$\mathbf{X}^{(\infty)}=\hat{\mathbf{A}}^{\infty}\mathbf{X},\qquad \hat{\mathbf{A}}^{\infty}_{i,j}=\frac{(d_i+1)^{1-r}\,(d_j+1)^{r}}{2m+n}.$$
To avoid the over-smoothing issue introduced by an oversized RF, the weight parameterized by node $i$ and aggregation step $k$ is defined as:

$$w_k(i)=\delta\!\big(\mathbf{s}^{\top}\big[\mathbf{X}^{(k)}_{i}\,\|\,\mathbf{X}^{(\infty)}_{i}\big]\big),$$

where $\|$ stands for concatenation, $\delta$ denotes an activation function (e.g., sigmoid), and $\mathbf{s}$ is a trainable vector. A larger $w_k(i)$ means that the $k$-step propagated feature of node $i$ is more distant from the stationary state and thus carries less risk of being over-smoothed noise. Therefore, propagated features with larger $w_k(i)$ should contribute more to the feature combination.
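A minimal numpy sketch of the smoothing attention score follows (not the authors' implementation; the sigmoid choice for the activation and the random toy inputs are assumptions):

```python
# Smoothing attention: score each node's k-step feature against its stationary
# feature X_inf via a trainable vector s applied to their concatenation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smoothing_attention(X_k, X_inf, s):
    """w_k(i) = sigmoid(s . [X_k[i] || X_inf[i]]); returns one weight per node."""
    concat = np.concatenate([X_k, X_inf], axis=1)  # shape (n, 2f)
    return sigmoid(concat @ s)                     # shape (n,)

n, f = 4, 3
rng = np.random.default_rng(0)
X_k = rng.normal(size=(n, f))                      # toy k-step features
X_inf = rng.normal(size=(n, f))                    # toy stationary features
s = rng.normal(size=2 * f)                         # trainable scoring vector
w = smoothing_attention(X_k, X_inf, s)
```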

Recursive Attention. At each propagation step $k$, we recursively measure the information gain of $\mathbf{X}^{(k)}$ compared with the previously combined feature $\hat{\mathbf{X}}^{(k-1)}=\sum_{l=0}^{k-1}\operatorname{diag}(\mathbf{w}_{l})\,\mathbf{X}^{(l)}$:

$$w_k(i)=\delta\!\big(\mathbf{s}^{\top}\big[\mathbf{X}^{(k)}_{i}\,\|\,\hat{\mathbf{X}}^{(k-1)}_{i}\big]\big).$$

As $\hat{\mathbf{X}}^{(k-1)}$ combines the graph information under different propagation steps and RF, a large proportion of the information in $\mathbf{X}^{(k)}$ may already exist in $\hat{\mathbf{X}}^{(k-1)}$, leading to a small information gain. A larger $w_k(i)$ means the feature $\mathbf{X}^{(k)}_{i}$ is more important to the current state of node $i$, since combining it will introduce higher information gain.
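The recursion can be sketched as follows (an illustrative assumption of the exact update; real training would learn `s` by backpropagation rather than use a fixed random vector):

```python
# Recursive attention: at step k, score X^(k) against the running combined
# feature built from steps 0..k-1, then fold X^(k) into that running state.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recursive_attention(feats, s):
    """Return an (n, K+1) weight matrix, one column per propagation step."""
    n, f = feats[0].shape
    combined = np.zeros((n, f))                    # combined feature so far
    weights = []
    for X_k in feats:
        w_k = sigmoid(np.concatenate([X_k, combined], axis=1) @ s)  # (n,)
        weights.append(w_k)
        combined = combined + w_k[:, None] * X_k   # fold step k into the state
    return np.stack(weights, axis=1)

rng = np.random.default_rng(1)
feats = [rng.normal(size=(5, 2)) for _ in range(3)]  # toy K=2 propagated feats
s = rng.normal(size=4)                               # scoring vector over [X_k || combined]
W = recursive_attention(feats, s)
```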

JK Attention. Jumping Knowledge Network (JK-Net) xu2018representation adopts layer aggregation to combine node embeddings from different GCN layers, and can thus leverage propagated node information with different RF. Motivated by JK-Net, we propose to guide the feature combination process with the model prediction trained on all the propagated features. Figure 3 shows the corresponding model architecture of GAMLP with JK attention, which includes two branches: the concatenated JK branch and the attention-based combination branch. We define the MLP prediction of the JK branch as $\mathbf{E}=\operatorname{MLP}\big(\big[\mathbf{X}^{(0)}\,\|\,\mathbf{X}^{(1)}\,\|\,\cdots\,\|\,\mathbf{X}^{(K)}\big]\big)$, and then define the combination weight as:

$$w_k(i)=\delta\!\big(\mathbf{s}^{\top}\big[\mathbf{X}^{(k)}_{i}\,\|\,\mathbf{E}_{i}\big]\big).$$
The JK branch aims to create a multi-scale feature representation for each node, which helps the attention mechanism learn the weight $w_k(i)$. The learned weights are then fed into the attention-based combination branch to generate each node's refined attention feature representation. As training proceeds, the attention-based combination branch gradually emphasizes those neighborhood regions that are more helpful to the target nodes. The JK attention can model a wider neighborhood while enhancing correlations, yielding a better feature representation for each node.
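A single linear layer can stand in for the JK-branch MLP in a compact sketch (the one-layer reference projection and toy shapes are simplifying assumptions; the actual JK branch is a deeper MLP):

```python
# JK attention: a prediction on the concatenation of all propagated features
# serves as a per-node reference vector E; each step k is scored against E.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def jk_attention(feats, W_jk, s):
    """Reference E = [X^(0)||...||X^(K)] W_jk; w_k(i) = sigmoid(s.[X^(k)_i || E_i])."""
    E = np.concatenate(feats, axis=1) @ W_jk       # (n, h) JK-branch output
    weights = [sigmoid(np.concatenate([X_k, E], axis=1) @ s) for X_k in feats]
    return np.stack(weights, axis=1)               # (n, K+1)

rng = np.random.default_rng(2)
n, f, K, h = 4, 3, 2, 5
feats = [rng.normal(size=(n, f)) for _ in range(K + 1)]
W_jk = rng.normal(size=((K + 1) * f, h))           # stand-in for the JK MLP
s = rng.normal(size=f + h)
W = jk_attention(feats, W_jk, s)
```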

3.2.3 Incorporating Label Propagation.

To reinforce the model performance, we propose a simple and scalable way to take advantage of the node labels in the training set. First, the label embedding matrix $\hat{\mathbf{Y}}^{(0)}\in\mathbb{R}^{n\times c}$ is initialized as all zeros. Then, we use the hard training labels to fill in the all-zero matrix and propagate it with the normalized adjacency matrix:

$$\hat{\mathbf{Y}}^{(k)}=\hat{\mathbf{A}}\,\hat{\mathbf{Y}}^{(k-1)},\qquad \hat{\mathbf{Y}}^{(0)}_{i}=\mathbf{y}_{i}\ \text{ for }\ i\in\mathcal{V}_{l},$$

where $\mathcal{V}_l$ is the labeled node set. After $K$ steps of Label Propagation, we get the final label embedding $\hat{\mathbf{Y}}^{(K)}$ and then use it to enhance the model prediction.
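The label-propagation step can be sketched directly (a minimal dense-matrix illustration; the toy adjacency and the reuse of the feature-propagation operator are assumptions):

```python
# Label propagation: zero-initialize the label matrix, fill rows of training
# nodes with one-hot labels, then propagate K times with A_hat.
import numpy as np

def propagate_labels(A_hat, labels, train_idx, n_classes, K):
    Y = np.zeros((A_hat.shape[0], n_classes))
    Y[train_idx, labels[train_idx]] = 1.0          # hard labels, train set only
    for _ in range(K):
        Y = A_hat @ Y
    return Y

A_hat = np.full((3, 3), 1.0 / 3)                   # toy normalized adjacency
labels = np.array([0, 1, 1])
Y_prop = propagate_labels(A_hat, labels, train_idx=np.array([0, 1]),
                          n_classes=2, K=2)
```

Note the test-node row starts at zero and acquires label mass purely through propagation from its labeled neighbors.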

3.2.4 Model Training

Previous work zhang2021evaluating shows that the main limitations of deep GNNs are the over-smoothing introduced by large numbers of propagation steps and the model degradation introduced by too many layers of non-linear transformation. The proposed attention-based feature propagation can adaptively leverage the propagated features over receptive fields of different sizes and avoid the over-smoothing issue.

Large graphs zhang2021evaluating require many layers of non-linear transformation. To tackle the model degradation problem, we propose to use an initial residual connection:

$$\mathbf{H}^{(l)}=\sigma\!\big(\mathbf{H}^{(l-1)}\mathbf{W}^{(l)}\big)+\hat{\mathbf{X}},\qquad \mathbf{H}^{(0)}=\hat{\mathbf{X}},\quad l=1,\dots,L,$$

where $L$ is the number of MLP layers and $\hat{\mathbf{X}}$ is the combined feature matrix.

The output of this $L$-layer MLP is then added to the label embedding from Sec. 3.2.3 to get the final output embedding:

$$\mathbf{Y}=\operatorname{softmax}\big(\mathbf{H}^{(L)}+\operatorname{MLP}\big(\hat{\mathbf{Y}}^{(K)}\big)\big).$$

The MLP here is used to map the label embedding $\hat{\mathbf{Y}}^{(K)}$ to the same space as $\mathbf{H}^{(L)}$.
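The full forward pass can be sketched with a two-layer MLP in numpy (a simplified illustration under stated assumptions: one residual addition, a single linear projection for the label embedding, and random toy weights; the actual model uses deeper MLPs and trained parameters):

```python
# Forward pass sketch: MLP over combined features X_hat with an initial
# residual, plus a projected label embedding, followed by softmax.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))   # stabilized softmax
    return e / e.sum(axis=1, keepdims=True)

def forward(X_hat, Y_label, W1, W2, W_res, W_lab):
    H = np.maximum(X_hat @ W1, 0.0)                # hidden layer (ReLU)
    H = H @ W2 + X_hat @ W_res                     # initial residual from X_hat
    return softmax(H + Y_label @ W_lab)            # add projected label embedding

rng = np.random.default_rng(3)
n, f, h, c = 5, 4, 8, 3
X_hat = rng.normal(size=(n, f))                    # combined features
Y_label = rng.normal(size=(n, c))                  # propagated label embedding
out = forward(X_hat, Y_label,
              rng.normal(size=(f, h)), rng.normal(size=(h, c)),
              rng.normal(size=(f, c)), rng.normal(size=(c, c)))
```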

We adopt the Cross-Entropy (CE) measurement between the predicted softmax outputs and the one-hot ground-truth label distributions as the objective function:

$$\mathcal{L}_{\mathrm{CE}}=-\sum_{i\in\mathcal{V}_{l}}\sum_{c}y_{i,c}\log \mathbf{Y}_{i,c},$$

where $\mathbf{y}_{i}$ is the one-hot label indicator vector of node $i$.

3.2.5 Reliable Label Utilization (RLU)

Reliable Label Propagation.

To better utilize the predicted soft labels (i.e., softmax outputs), we split the whole training process into multiple stages, each containing a full training procedure of the GAMLP model. At the first stage, the GAMLP model is trained according to the above-mentioned procedure. At later stages, however, we take advantage of the reliable soft labels predicted by the last stage to improve the label embedding. We denote the prediction results of the $m$-th stage as:

$$\tilde{\mathbf{Y}}^{(m)}=\operatorname{softmax}\big(\mathbf{Z}^{(m)}/T\big),$$

where $\mathbf{Z}^{(m)}$ are the output logits of the $m$-th stage and the temperature $T$ controls the softness of the softmax distribution; lower values of $T$ lead to a harder (more peaked) distribution.
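Temperature scaling is easy to verify numerically (a minimal sketch; the logit values are arbitrary):

```python
# Temperature-scaled softmax: dividing logits by T before softmax flattens the
# distribution for high T and sharpens it for low T.
import numpy as np

def temperature_softmax(logits, T):
    z = logits / T
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 0.0]])
soft = temperature_softmax(logits, T=4.0)          # higher T: flatter
hard = temperature_softmax(logits, T=0.5)          # lower T: more peaked
```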

Suppose we are now at the beginning of the $m$-th stage ($m>1$) of the whole training process. Rather than using only the training labels to construct the initial label embedding $\hat{\mathbf{Y}}^{(0)}$, we also adopt the results predicted at the last stage for the nodes in the validation and test sets. To ensure the reliability of the predicted soft labels, we use a threshold $\alpha$ to filter out low-confidence nodes in the validation and test sets. The reliable node set is formulated as:

$$\mathcal{R}=\Big\{\,i\in\mathcal{V}_{u}\;\Big|\;\max_{c}\tilde{\mathbf{Y}}^{(m-1)}_{i,c}>\alpha\,\Big\}.$$

In this formulation, the reliable node set $\mathcal{R}$ is composed of nodes whose predicted probability of belonging to the most likely class at the $(m-1)$-th stage is greater than the threshold $\alpha$.
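The filtering step amounts to a one-line confidence mask (illustrative sketch; the threshold value and toy probabilities are assumptions):

```python
# Reliable-node filtering: keep candidate nodes whose maximum predicted class
# probability from the previous stage exceeds the threshold.
import numpy as np

def reliable_nodes(probs, candidate_idx, threshold):
    """Return candidate indices with max class probability > threshold."""
    conf = probs[candidate_idx].max(axis=1)
    return candidate_idx[conf > threshold]

probs = np.array([[0.9, 0.1],                      # confident
                  [0.6, 0.4],                      # below threshold
                  [0.2, 0.8]])                     # confident
idx = reliable_nodes(probs, np.array([0, 1, 2]), threshold=0.7)
```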

Reliable Label Distillation.

To fully take advantage of the helpful information from the last stage, we also include a knowledge distillation module in our model. Again, to guarantee the reliability of the knowledge distillation module, we only include the nodes in the reliable node set $\mathcal{R}$ at the $m$-th stage ($m>1$) and define the weighted KL divergence as:

$$\mathcal{L}_{\mathrm{KD}}=\sum_{i\in\mathcal{R}}\eta_{i}\,\mathrm{KL}\big(\tilde{\mathbf{Y}}^{(m-1)}_{i}\,\big\|\,\mathbf{Y}_{i}\big),$$

where $\eta_{i}=\max_{c}\tilde{\mathbf{Y}}^{(m-1)}_{i,c}$ for reliable node $i$ is its predicted probability of belonging to the most likely class at the $(m-1)$-th stage. We incorporate $\eta_i$ to better guide the distillation process, assigning higher weights to more confident nodes.
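The confidence-weighted KL term can be sketched as follows (a minimal dense implementation under the stated form; the epsilon for numerical safety is an implementation assumption):

```python
# Confidence-weighted KL distillation: each reliable node contributes
# KL(prev_stage || current) scaled by its previous-stage confidence eta_i.
import numpy as np

def weighted_kl_loss(p_prev, p_cur, eps=1e-12):
    """p_prev, p_cur: (n, c) probability rows for the reliable node set."""
    kl = np.sum(p_prev * (np.log(p_prev + eps) - np.log(p_cur + eps)), axis=1)
    eta = p_prev.max(axis=1)                       # confidence weights
    return float(np.sum(eta * kl))

p_prev = np.array([[0.9, 0.1], [0.3, 0.7]])
loss_same = weighted_kl_loss(p_prev, p_prev.copy())      # identical predictions
loss_diff = weighted_kl_loss(p_prev, np.array([[0.5, 0.5], [0.5, 0.5]]))
```

As expected, the loss vanishes when the current predictions match the previous stage and is positive otherwise.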

The complete training loss for the $m$-th stage ($m>1$) is defined as:

$$\mathcal{L}=\mathcal{L}_{\mathrm{CE}}+\gamma\,\mathcal{L}_{\mathrm{KD}},$$

where $\gamma$ is a hyperparameter balancing the importance of the knowledge distillation module.

3.3 Relation with current methods

GAMLP vs. GBP. Both GAMLP and GBP compute a weighted average of the propagated features under different propagation steps and RF. However, GBP adopts layer-wise propagation and ignores the inconsistent RF expansion speed across nodes. As the optimal propagation steps and smoothing levels differ across nodes, some nodes may face over-smoothing or under-smoothing even when propagated for the same number of steps. GAMLP considers feature propagation from a more fine-grained, node-wise perspective. Compared with GBP, SGC, and S²GC, the limitation of GAMLP is that all the propagated features are required during model training, leading to a higher memory cost.

GAMLP vs. GAT. Each node in a GAT layer learns to combine the embeddings (or features) of its neighbors in a weighted manner with an attention mechanism, and the attention weights are measured with the local information in a fixed RF: the node itself and its direct neighbors. Different from the attention mechanism in GAT, GAMLP considers more global information under different RF.

GAMLP vs. DAGNN. DAGNN can adaptively learn the combination weights via the gating mechanism and thus assign proper weights for different nodes. However, the gating mechanism in DAGNN is correlated with the parameterized node embedding rather than the training-free node feature used in GAMLP, leading to low scalability and efficiency.

GAMLP vs. JK-Net. Motivated by JK-Net, GAMLP with JK attention concatenates the propagated features under different propagation steps. However, the model prediction based on the concatenated feature is only used as a reference vector for the attention-based combination branch in GAMLP, rather than as the final result. Compared with JK-Net, GAMLP with JK attention is more effective in alleviating the over-smoothing and scalability issues that a deep architecture introduces.

4 Experiments

In this section, we verify the effectiveness of GAMLP on large-scale graph datasets. We aim to answer the following two questions. Q1: Compared with current methods, can GAMLP achieve higher predictive accuracy? Q2: Why is GAMLP effective?

4.1 Experimental Setup

Dataset #Nodes #Features #Edges #Classes #Train/Val/Test
ogbn-products 2,449,029 100 61,859,140 47 196K/49K/2,204K
ogbn-papers100M 111,059,956 128 1,615,685,872 172 1,207K/125K/214K
Table 1: Overview of datasets.

Datasets and baselines. We conduct the experiments on the ogbn-products and ogbn-papers100M datasets from hu2021ogb. The dataset statistics are shown in Table 1. For the comparison on the ogbn-products dataset, we choose the following baseline methods: GCN kipf2016semi, GraphSAGE hamilton2017inductive, SIGN frasca2020sign, DeeperGCN li2020deepergcn, SAGN and SAGN+SLE sun2021scalable, UniMP shi2020masked, and MLP+C&S huang2020combining. For the comparison on the ogbn-papers100M dataset, we choose the following baseline methods: SGC wu2019simplifying, SIGN and SIGN-XL frasca2020sign, SAGN and SAGN+SLE sun2021scalable. The validation and test accuracy of all these baseline methods are taken directly from the OGB leaderboard.


To alleviate the influence of randomness, we repeat each method ten times and report the mean performance and the standard deviations. The experiments are conducted on a machine with Intel(R) Xeon(R) Platinum 8255C CPU@2.50GHz, and a single Tesla V100 GPU with 32GB GPU memory. The operating system of the machine is Ubuntu 16.04. As for software versions, we use Python 3.6, Pytorch 1.7.1, and CUDA 10.1. The hyper-parameters in each baseline are set according to the original paper if available. Please refer to Appendix B for the detailed hyperparameter settings for our GAMLP+RLU.

Methods Validation Accuracy Test Accuracy
GCN 92.00±0.03 75.64±0.21
GraphSAGE 92.24±0.07 78.50±0.14
SIGN 92.99±0.04 80.52±0.16
DeeperGCN 92.38±0.09 80.98±0.20
SAGN 93.09±0.04 81.20±0.07
UniMP 93.08±0.17 82.56±0.31
MLP+C&S 91.47±0.09 84.18±0.07
SAGN+SLE 92.87±0.03 84.28±0.14
GAMLP 93.12±0.03 83.54±0.09
GAMLP+RLU 93.24±0.05 84.59±0.10
Table 2: Test accuracy on ogbn-products dataset.
Methods Validation Accuracy Test Accuracy
SGC 66.48±0.20 63.29±0.19
SIGN 69.32±0.06 65.68±0.06
SIGN-XL 69.84±0.06 66.06±0.19
SAGN 70.34±0.99 66.75±0.84
SAGN+SLE 71.31±0.10 68.00±0.15
GAMLP 71.17±0.14 67.71±0.20
GAMLP+RLU 71.59±0.05 68.25±0.11
Table 3: Test accuracy on ogbn-papers100M dataset.

4.2 Experimental Results.

End-to-end comparison. To answer Q1, Tables 2 and 3 show the validation and test accuracy of GAMLP, GAMLP+RLU, and all the baseline methods. The raw performance of GAMLP is highly competitive, outperforming the current SOTA single models UniMP and SAGN. With the help of RLU, the test accuracy of GAMLP+RLU improves further and outperforms all the compared baselines, exceeding the strongest baseline SAGN+SLE by 0.17% and 0.25% on the ogbn-products and ogbn-papers100M datasets, respectively. The performance of GAMLP empirically illustrates the effectiveness of our proposed attention mechanisms.

Figure 4: The average attention weights of propagated features of different steps on 60 randomly selected nodes from ogbn-products.

Interpretability. GAMLP can adaptively and effectively combine multi-scale propagated features for each node. To demonstrate this, Figure 4 shows the average attention weights of the propagated features according to the number of steps and the degrees of the input nodes, where the maximum step is 6. In this experiment, we randomly select 20 nodes for each degree range (1-4, 5-8, 9-12) and plot the relative weight normalized by the maximum value. We make two observations from the heat map: 1) the 1-step and 2-step propagated features are always of great importance, which shows that GAMLP captures local information as those widely used 2-layer methods do; 2) the weights of propagated features with larger steps drop faster as the degree grows, which indicates that our attention mechanism can prevent high-degree nodes from including excessive irrelevant nodes, which leads to over-smoothing. From these two observations, we conclude that GAMLP is able to identify the different RF demands of nodes and explicitly weight each propagated feature.

5 Conclusion

This paper presents Graph Attention Multilayer Perceptron (GAMLP), a scalable, efficient, and powerful graph learning method based on receptive field attention. Concretely, GAMLP defines three principled attention mechanisms, i.e., smoothing attention, recursive attention, and JK attention, so that each node in GAMLP can leverage the propagated features over receptive fields of different sizes in a node-specific way. Extensive experiments on three large ogbn graphs verify the effectiveness of the proposed method.


Appendix A Experiments on ogbn-mag

A.1 Compared Baselines

The ogbn-mag dataset is a heterogeneous graph consisting of 1,939,743 nodes and 21,111,007 edges of different types. For comparison, we choose eight baseline methods from the OGB ogbn-mag leaderboard: R-GCN [schlichtkrull2018modeling], SIGN [frasca2020sign], HGT [hu2020heterogeneous], R-GSN [wu2021r], HGConv [yu2020hybrid], R-HGNN [yu2021heterogeneous], NARS [yu2020scalable], and NARS-SAGN+SLE [sun2021scalable].

A.2 Adapting GAMLP to Heterogeneous Graphs

In its original design, GAMLP does not support training on heterogeneous graphs. Here we imitate the model design of NARS to adapt GAMLP to heterogeneous graphs.

First, we sample subgraphs from the original heterogeneous graph according to relation types and regard each subgraph as a homogeneous graph, even though it may contain different kinds of nodes and edges. Then, on each subgraph, the propagated features of different steps are generated. The propagated features of the same propagation step across different subgraphs are aggregated using 1-d convolution. After that, the aggregated features of different steps are fed into our GAMLP to get the final results. We call this variant of GAMLP NARS-GAMLP, as it mimics the design of NARS.

As the ogbn-mag dataset only contains node features for "paper" nodes, we adopt the ComplEx algorithm [trouillon2017knowledge] to generate features for the other node types.

A.3 Experiment Results

We report the validation and test accuracy of our proposed GAMLP and GAMLP+RLU on the ogbn-mag dataset in Table 4. The results show that NARS-GAMLP achieves strong performance on the heterogeneous graph ogbn-mag, outperforming the strongest single-model baseline NARS. Equipped with RLU, NARS-GAMLP+RLU further improves the test accuracy and exceeds the SOTA method NARS-SAGN+SLE by a significant margin of 1.50%.

Methods Validation Accuracy Test Accuracy
R-GCN 40.84±0.41 39.77±0.46
SIGN 40.68±0.10 40.46±0.12
HGT 49.84±0.47 49.27±0.61
R-GSN 51.82±0.41 50.32±0.37
HGConv 53.00±0.18 50.45±0.17
R-HGNN 53.61±0.22 52.04±0.26
NARS 53.72±0.09 52.40±0.16
NARS-SAGN+SLE 55.91±0.17 54.40±0.15
NARS-GAMLP 55.48±0.08 53.96±0.18
NARS-GAMLP+RLU 57.02±0.41 55.90±0.27
Table 4: Test accuracy on ogbn-mag dataset.

Appendix B Detailed Hyperparameters

We provide the detailed hyperparameter settings of GAMLP+RLU in Tables 5, 6, and 7 to help reproduce the results. To reproduce the experimental results of GAMLP, follow the same hyperparameter settings but run only the first stage.

Datasets attention type hidden size num layer in JK num layer activation
ogb-products Recursive 512 / 4 leaky relu, a=0.2
ogb-papers100M JK 1024 4 6 sigmoid
ogb-mag JK 512 4 4 leaky relu, a=0.2
Table 5: Detailed hyperparameter setting on OGB datasets.
Datasets hops hops for label input dropout attention dropout dropout
ogb-products 5 9 0.2 0.5 0.5
ogb-papers100M 6 9 0 0 0.5
ogb-mag 5 3 0.1 0 0.5
Table 6: Detailed hyperparameter setting on OGB datasets.
Datasets gamma threshold temperature batch size stages
ogb-products 0.1 0.85 1 50000 400, 300, 300, 300
ogb-papers100M 1 0 0.001 5000 100, 150, 150, 150
ogb-mag 10 0.4 1 10000 250, 200, 200, 200
Table 7: Detailed hyperparameter setting on OGB datasets.