1 Introduction
Graph Neural Networks (GNNs) have recently become very popular for graph machine learning (GML) research and have shown great results on node classification tasks (Kipf and Welling, 2016; Hamilton et al., 2017; Veličković et al., 2017) like product prediction on co-purchasing graphs and paper category prediction on citation graphs. However, for large-scale industrial applications, MLPs remain the major workhorse, despite the common (implicit) underlying graphs and suitability for GML formalisms. One reason for this academic-industrial gap is the scalability and deployment challenges brought by data dependency in GNNs (Zhang et al., 2020; Jia et al., 2020), which makes GNNs hard to deploy for latency-constrained applications that require fast inference.
Neighborhood fetching caused by graph dependency is one of the major sources of GNN latency. Inference on a target node necessitates fetching the topology and features of many neighbor nodes, especially on small-world graphs (detailed discussion in Section 4). Common inference acceleration techniques like pruning (Zhou et al., 2021) and quantization (Tailor et al., 2021; Zhao et al., 2020) can speed up GNNs to some extent by reducing Multiplication-and-ACcumulation (MAC) operations. However, their improvements are limited since the graph dependency is not resolved. Unlike GNNs, MLPs have no dependency on graph data and are easier to deploy. They also enjoy the auxiliary benefit of sidestepping the cold-start problem that often arises during online prediction on relational data (Wei et al., 2020), meaning MLPs can infer reasonably even when neighbor information of a newly encountered node is not immediately available. On the other hand, this lack of graph dependency typically hurts relational learning, limiting MLP performance on GML tasks compared to GNNs. We thus ask: can we bridge the two worlds, enjoying the low-latency, dependency-free nature of MLPs and the graph context awareness of GNNs at the same time?
Present work. Our key finding is that it is possible to distill knowledge from GNNs to MLPs without significant performance loss, while reducing inference time drastically for node classification. The knowledge distillation (KD) can be done offline, coupled with model training. In other words, we can shift considerable work from the latency-constrained inference step, where time reduction in milliseconds makes a huge difference, to the less time-sensitive training step, where time cost in hours or days is often tolerable. We call our approach Graph-less Neural Network (GLNN). Specifically, GLNN is a modeling paradigm involving KD from a GNN teacher to a student MLP; the resulting GLNN is an MLP optimized through KD, so it enjoys the benefits of graph context-awareness in training but has no graph dependency in inference. Regarding speed, GLNNs have superior efficiency and are 146×–273× faster than GNNs and 14×–27× faster than other inference acceleration methods. Regarding performance, under a production setting involving both transductive and inductive predictions on 7 datasets, GLNN accuracies improve over MLPs by 12.36% on average and match GNNs on 6/7 datasets. We comprehensively study when and why GLNNs can achieve results competitive with GNNs. Our analysis suggests the critical factors for such performance are large MLP sizes and high mutual information between node features and labels. Our observations align with recent results in vision and language, which posit that large enough (or slightly modified) MLPs can achieve similar results as CNNs and Transformers (Liu et al., 2021; Tolstikhin et al., 2021; Melas-Kyriazi, 2021; Touvron et al., 2021; Ding et al., 2021). Our core contributions are as follows:


We propose GLNNs, which eliminate the neighborhood-fetching latency in GNN inference via cross-model KD to MLPs.

We show GLNNs have performance competitive with GNNs, while enjoying 239×–758× faster inference than vanilla GNNs and 27×–32× faster inference than other inference acceleration methods.

We study GLNN properties comprehensively by investigating their performance under different settings, how they work as regularizers, their inductive bias, expressiveness, and limitations.
2 Related Work
Graph Neural Networks. The idea of GNNs started with generalizing convolution to graphs (Bruna et al., 2014; Defferrard et al., 2017), which was later simplified into the message-passing neural network (MPNN) framework by GCN (Kipf and Welling, 2016). The majority of GNNs since are built on GCN and can be placed under MPNN. For example, GAT employs attention (Veličković et al., 2017), PPNP employs personalized PageRank (Klicpera et al., 2019), and GCNII and DeeperGCN employ residual and dense connections (Chen et al., 2020; Li et al., 2019).
Inference Acceleration. Inference acceleration has been pursued through hardware improvements (Chen et al., 2016; Judd et al., 2016) and algorithmic improvements via pruning (Han et al., 2015), quantization (Gupta et al., 2015), and KD (Hinton et al., 2015). Specifically for GNNs, pruning (Zhou et al., 2021) and quantizing GNN parameters (Tailor et al., 2021; Zhao et al., 2020), or distilling to smaller GNNs (Yang et al., 2021b; Yan et al., 2020; Yang et al., 2021a), have been studied. These approaches speed up GNN inference to a certain extent but do not eliminate the neighborhood-fetching latency. In contrast, our cross-model KD approach resolves this issue. Another line of research focuses on fast GNN training via sampling (Hamilton et al., 2017; Zou et al., 2019; Chen et al., 2018); it is orthogonal and complementary to our goal of inference acceleration.
3 Preliminaries
Notations. For GML tasks, the input is usually a graph and its node features, which we write as G = (V, E), where V stands for all nodes and E stands for all edges. Let N = |V| denote the total number of nodes. We use X ∈ ℝ^{N×D} to represent node features, with row x_v being the D-dimensional feature of node v. We represent edges with an adjacency matrix A ∈ {0, 1}^{N×N}, with A_{u,v} = 1 if edge (u, v) ∈ E, and 0 otherwise. For node classification, one of the most important GML applications, the prediction targets are Y ∈ ℝ^{N×K}, where row y_v is a K-dimensional one-hot vector for node v. For a given V, usually a small portion of nodes will be labeled, which we mark using the superscript L, i.e. V^L, X^L, and Y^L. The majority of nodes will be unlabeled, which we mark using the superscript U, i.e. V^U, X^U, and Y^U.
Graph Neural Networks. Most GNNs fit under the message-passing framework, where the representation h_v of each node v is updated iteratively in each layer by collecting messages from its neighbors, denoted N(v). For the l-th layer, h_v^{(l)} is obtained from the previous-layer representations h^{(l-1)} via an aggregation operation AGGR followed by an UPDATE operation, as

h_v^{(l)} = UPDATE(h_v^{(l-1)}, AGGR({h_u^{(l-1)} : u ∈ N(v)})).
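As a concrete illustration, one message-passing layer with mean aggregation can be sketched in NumPy (our illustration, not the paper's exact operator; the function name `mp_layer` and the ReLU update are assumptions):

```python
import numpy as np

def mp_layer(H, A, W_self, W_neigh):
    """One message-passing layer: AGGR (mean over neighbors) then UPDATE.

    H: (N, d) node representations from the previous layer.
    A: (N, N) adjacency matrix, A[u, v] = 1 if edge (u, v) exists.
    """
    deg = A.sum(axis=1, keepdims=True).clip(min=1)    # avoid divide-by-zero
    msg = A @ H / deg                                 # AGGR: mean of neighbor states
    return np.maximum(H @ W_self + msg @ W_neigh, 0)  # UPDATE: linear + ReLU
```

Stacking L such layers means each node's final representation depends on its L-hop neighborhood, which is exactly the dependency GLNN removes at inference time.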
4 Motivation
As mentioned in Section 1, GNNs can have considerable inference latency due to their graph dependency. Each additional GNN layer involves fetching one more hop of neighbor nodes. Inference for a single node v with an L-layer GNN on a graph with average degree R requires fetching O(R^L) nodes. R can be large for real-world graphs, e.g. 208 for the Twitter social graph (Ching et al., 2015). Also, since fetching for successive layers must be done sequentially, the total latency explodes quickly as L increases. Figure 1 illustrates the successive dependence added by each GNN layer, and the exponential explosion in the number of nodes and inference time induced by multiple GNN layers. In contrast, the inference time of an MLP with the same number of layers is much smaller and only grows linearly. This marked gap contributes greatly to the practicality of MLPs in industrial applications over GNNs.
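To make the growth concrete, a back-of-the-envelope count of nodes fetched by an L-layer GNN (our illustration; it assumes a uniform average degree R with no neighbor overlap, so it is an upper bound):

```python
def nodes_fetched(R, L):
    """Upper bound on neighbor fetches for one L-layer GNN inference
    on a graph with average degree R: each layer adds one more hop,
    so the total is R + R^2 + ... + R^L."""
    return sum(R ** l for l in range(1, L + 1))
```

With R = 208 (the Twitter-scale average degree above), a 2-layer GNN already touches over forty thousand nodes, while an MLP touches one.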
The node-fetching latency is exacerbated by two factors. Firstly, newer GNN architectures are getting deeper, e.g. 64 layers (Chen et al., 2020), 112 layers (Li et al., 2019), and even 1001 layers (Li et al., 2021). Secondly, industrial-scale graphs are frequently too large to fit into the memory of a single machine, necessitating sharding of the graph out of main memory. For example, Twitter had 288M monthly active users (nodes) and an estimated 60B followers (edges) as of 3/2015, and Facebook had 1.39B active users with more than 400B edges as of 12/2014 (Ching et al., 2015). Even when stored in a sparse-matrix-friendly format (often COO or CSR), these graphs are on the order of TBs and constantly growing. Moving away from in-memory storage results in even slower neighbor fetching.
MLPs, on the other hand, lack the means to exploit graph topology, which hurts their performance on node classification. For example, test accuracy on Products is 78.61 for GraphSAGE compared to 62.47 for an equal-sized MLP. Nonetheless, recent results in vision and language posit that large (or slightly modified) MLPs can achieve similar results as CNNs and Transformers (Liu et al., 2021). We thus ask: can we bridge the best of GNNs and MLPs to get high-accuracy, low-latency models? This motivates us to perform cross-model KD from GNNs to MLPs.
5 Graph-less Neural Networks
In this section, we introduce GLNNs in detail, clarify the experimental settings for our investigations, and answer the following exploration questions regarding GLNN properties: 1) How do GLNNs compare to MLPs and GNNs? 2) Can GLNNs work well under both transductive and inductive settings? 3) How do GLNNs compare to other inference acceleration methods? 4) How do GLNNs benefit from KD? 5) Do GLNNs have sufficient model expressiveness? 6) When will GLNNs fail to work?
5.1 The GLNN Framework
The idea of GLNN is straightforward yet, as we will see, extremely effective. In short, we train a "boosted" MLP via KD from a teacher GNN. KD was introduced in Hinton et al. (2015), with the main idea being to transfer knowledge from a cumbersome teacher to a simpler student. In our case, we generate soft targets z_v for each node v with a teacher GNN. Then we train the student MLP with both the true labels y_v and z_v. The objective is Equation 1, with λ being a weight parameter, L_label being the cross-entropy loss between y_v and the student predictions ŷ_v, and L_teacher being the KL divergence between ŷ_v and z_v:

L = λ Σ_{v ∈ V^L} L_label(ŷ_v, y_v) + (1 − λ) Σ_{v ∈ V} L_teacher(ŷ_v, z_v)    (1)
The model after KD, i.e. GLNN, is essentially an MLP. Therefore, GLNNs have no graph dependency during inference and are as fast as MLPs. On the other hand, through offline KD, GLNN parameters are optimized to predict and generalize as well as GNNs, with the added benefits of faster inference and easier deployment. In Figure 2, we show the offline KD and online inference steps of GLNNs.
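The objective can be sketched as follows (a minimal NumPy sketch of an Eq.-1-style loss; the function names are ours, and for simplicity both terms are averaged over a single batch rather than summed separately over labeled and all nodes):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def glnn_loss(student_logits, soft_targets, labels, lam=0.5):
    """Weighted sum of cross-entropy with true labels and KL divergence
    to the teacher GNN's soft targets (cf. Eq. 1).

    student_logits: (N, K) MLP outputs.
    soft_targets:   (N, K) teacher class probabilities z_v.
    labels:         (N,) ground-truth class indices.
    """
    p = softmax(student_logits)                              # student predictions
    n = len(labels)
    ce = -np.log(p[np.arange(n), labels] + 1e-12).mean()     # L_label
    kl = (soft_targets                                        # L_teacher: KL(z || p)
          * np.log((soft_targets + 1e-12) / (p + 1e-12))).sum(axis=1).mean()
    return lam * ce + (1 - lam) * kl
```

The loss goes to zero when the student both predicts the correct class and matches the teacher's full output distribution, which is how topology knowledge reaches the graph-free student.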
5.2 Experiment Settings
Datasets. We consider all five datasets used in the CPF paper (Yang et al., 2021a), i.e. Cora, Citeseer, Pubmed, A-computer, and A-photo. To fully evaluate our method, we also include two larger OGB datasets (Hu et al., 2020), i.e. Arxiv and Products.
Model Architectures. For consistent results, we use GraphSAGE (Hamilton et al., 2017) with GCN aggregation as the teacher. We conduct ablation studies with other GNN teachers like GCN (Kipf and Welling, 2016), GAT (Veličković et al., 2017), and APPNP (Klicpera et al., 2019) in Section 6.
Evaluation Protocol.
For all experiments in this section, we report the average and standard deviation over ten runs with different random seeds. Model performance is measured as accuracy, and results are reported on test data with the best model selected using validation data.
Transductive vs. Inductive. Given G, X, and Y, we divide the node classification goal into two settings: transductive (tran) and inductive (ind). For the inductive setting, we hold out part of the test data for inductive evaluation only. We first select inductive nodes to hold out, which partitions the unlabeled nodes V^U into the disjoint inductive subset and observed subset, i.e. V^U = V^U_obs ⊔ V^U_ind. Then we hold out all the edges connected to nodes in V^U_ind as well. Therefore, we end up with two disjoint graphs G^obs and G^ind with no shared nodes or edges. Node features and labels are partitioned into three disjoint sets, i.e. X = X^L ⊔ X^U_obs ⊔ X^U_ind and Y = Y^L ⊔ Y^U_obs ⊔ Y^U_ind. Concretely, the input/output of both settings becomes


tran: train on G, X, and Y^L; evaluate on V^U; KD uses soft targets z_v for nodes v ∈ V.

ind: train on G^obs, X^L, X^U_obs, and Y^L; evaluate on V^U_ind; KD uses soft targets z_v for nodes v ∈ V^L ⊔ V^U_obs.
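One way to construct such a split (a minimal sketch under our assumptions; `inductive_split` is our name, and the paper's exact selection procedure may differ):

```python
import numpy as np

def inductive_split(A, test_nodes, ind_frac=0.2, seed=0):
    """Hold out `ind_frac` of the test nodes, plus all their incident
    edges, to form the inductive subset; the rest stay observed."""
    rng = np.random.default_rng(seed)
    test_nodes = np.asarray(test_nodes)
    n_ind = int(len(test_nodes) * ind_frac)
    ind = rng.choice(test_nodes, size=n_ind, replace=False)
    A_obs = A.copy()
    A_obs[ind, :] = 0   # remove every edge touching an inductive node
    A_obs[:, ind] = 0
    obs = np.setdiff1d(test_nodes, ind)
    return A_obs, ind, obs
```

Training then sees only `A_obs` and the observed features, while the held-out nodes are first encountered at evaluation time, mimicking new nodes entering the graph after deployment.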
Further details of the data splits, hyperparameters, and model settings are given in Appendix A.
5.3 How do GLNNs compare to MLPs and GNNs?
We start by comparing GLNNs to equal-sized MLPs and GNNs, i.e. with the same number of layers and hidden dimensions. We first study the transductive setting, i.e. the standard setting for all the datasets used in past works, so our experimental results in Table 2 can be directly compared to those reported in previous literature such as Yang et al. (2021a) and Hu et al. (2020).
As shown in Table 2, all GLNNs improve over MLPs by large margins. On the smaller datasets (first 5 rows), GLNNs can even outperform their teacher GNNs. In other words, for each of these tasks, with the same parameter budget, there exists a set of MLP parameters with GNN-competitive performance. We investigate the rationale in Sections 5.6 and 5.7. For the larger OGB datasets (last 2 rows), GLNN performance improves over MLPs but is still worse than the teacher GNNs. However, as we show in Table 2, this gap can be mitigated by increasing the MLP size to MLPw (the suffix -wi notes that hidden layers are enlarged i times; e.g. MLPw4 has 4-times-wider hidden layers than the MLP given in the context). In Figure 3 (right), we visualize the trade-off between prediction accuracy and model inference time for different model sizes. We show that gradually increasing GLNN size pushes its performance close to SAGE. On the other hand, when we reduce the number of layers of SAGE (the suffix -Li explicitly notes a model with i layers; e.g. SAGE-L2 is a 2-layer SAGE), the accuracy quickly drops below that of GLNNs. A detailed discussion of the rationale for increasing MLP sizes is in Appendix B.
5.4 Can GLNNs work well under both transductive and inductive settings?
Although transductive is the commonly studied setting for node classification, it does not encompass prediction on unseen nodes. Therefore, it may not be the best way to evaluate a deployed model, which must often generate predictions for new data points as well as reliably maintain performance on old ones. Thus, to better understand the effectiveness of GLNN, we also consider their performance under a realistic production setting, which contains both transductive and inductive predictions.
To evaluate a model inductively, we hold out some unlabeled test nodes from training to form an inductive subset, i.e. V^U = V^U_obs ⊔ V^U_ind. In production, a model might be retrained periodically, e.g. weekly. The held-out nodes in V^U_ind represent new nodes that entered the graph between two rounds of training. V^U_ind is usually small compared to the size of V^U_obs; e.g. Graham (2012) estimates 5–7% weekly growth for the fastest-growing tech startups. In our case, to mitigate randomness and better evaluate generalizability, we consider a larger V^U_ind containing 20% of the test data. We also evaluate on V^U_obs containing the other 80% of the test data, representing standard transductive prediction on existing unlabeled nodes, since inference is commonly redone on existing nodes in real-world cases. We report both results along with an interpolated production (prod) result per dataset in Table 3. The prod results paint a clearer picture of model generalization as well as accuracy in production settings. We also do an ablation study of inductive split ratios other than 20–80 in Section 6.
In Table 3, we see that GLNNs still improve over MLPs by large margins for inductive predictions. On 6/7 datasets, the GLNN prod performance is competitive with GNNs, which supports deploying GLNNs as much faster models with no or only slight performance loss. On the Arxiv dataset, GLNN performance is notably below GNNs; we hypothesize this is because Arxiv has a particularly challenging data split that causes a distribution shift between test and training nodes, which is hard for GLNNs to capture without utilizing neighbor information like GNNs do. However, we note that GLNN performance is still substantially improved over MLPs.
5.5 How do GLNNs compare to other inference acceleration methods?
Common inference acceleration techniques include pruning and quantization. These approaches reduce model parameters and Multiplication-and-ACcumulation (MAC) operations, but they do not eliminate neighborhood-fetching latency; therefore, their speed gain on GNNs is less significant than on NNs. For GNNs specifically, we can also perform neighbor sampling to reduce the fetching latency. We show an explicit speed comparison between vanilla SAGE, SAGE quantized from FP32 to INT8 (QSAGE), SAGE with 50% of its weights pruned (PSAGE), inference with neighbor sampling at fan-out 15, and GLNN in Table 4. All numbers are on Products, with the same setting as Figure 1. We see that GLNN is considerably faster.
Two other kinds of methods considered for inference acceleration are GNN-to-GNN KD, like TinyGNN (Yan et al., 2020), and Graph-Augmented MLPs (GA-MLPs), like SGC (Wu et al., 2019) or SIGN (Frasca et al., 2020). Inference of GNN-to-GNN KD is likely to be slower than a GNN-L with the same L as the student, since there will usually be overhead introduced by extra modules like the Peer-Aware Module (PAM) in TinyGNN. GA-MLPs precompute augmented node features and apply MLPs to them. With precomputation, their inference time will be the same as MLPs for dimension-preserving augmentations (SGC) and the same as the enlarged MLPw for augmentations involving concatenation (SIGN). Thus, for both kinds of approaches, it is sufficient to compare GLNN with GNN-L and MLPw, which we have already done in Figure 3 (left). We see that GNN-Ls are much slower than MLPs. For GA-MLPs, since full precomputation cannot be done for inductive nodes, GA-MLPs still need to fetch neighbor nodes. This makes them much slower than MLPw in the inductive setting, and even slower than pruned GNNs and TinyGNN, as shown in Zhou et al. (2021).
5.6 How does GLNN benefit from distillation?
We showed that GNNs are markedly better than MLPs on node classification tasks, but with KD, GLNNs can often become competitive with GNNs. This indicates that there exist suitable MLP parameters that can well approximate the ideal prediction function from node features to labels. However, these parameters can be difficult to learn through standard stochastic gradient descent. We hypothesize that KD helps to find them through regularization and the transfer of inductive bias.
First, we show that KD can help regularize the student model and prevent overfitting. Comparing the loss curves of a directly trained MLP and a GLNN in Figure LABEL:figure:loss_curve, the gap between training and validation loss is visibly larger for MLPs than for GLNNs, and MLPs show obvious overfitting trends. Second, we analyze the inductive bias that makes GNNs powerful on node classification, namely that node inferences should be influenced by the graph topology, especially by neighbor nodes. In contrast, MLPs are more flexible but have less inductive bias. A similar difference exists between Transformers (Vaswani et al., 2017) and MLPs. In Liu et al. (2021), it is shown that the inductive bias in Transformers can be compensated for by a simple gate on large MLPs. For node classification, we hypothesize that KD transfers the missing inductive bias, so GLNNs can perform competitively. Soft labels from GNN teachers are heavily influenced by the graph topology due to this inductive bias. They maintain nonzero probabilities on classes other than the ground truth provided by one-hot labels, which can help the student learn to compensate for the inductive bias missing from MLPs. To evaluate this hypothesis quantitatively, we define the cut loss λ_cut in Equation 2 to measure the consistency between model predictions and graph topology (details in Appendix C):

λ_cut = tr(Ŷ^T A Ŷ) / tr(Ŷ^T D Ŷ)    (2)

Here Ŷ is the soft classification probability output by the model, and A and D are the adjacency and degree matrices. When λ_cut is close to 1, the predictions and the graph topology are highly consistent. In our experiments, we observe that the average λ_cut for SAGE over the five CPF datasets is 0.9221, indicating high consistency. The same average for MLPs is only 0.7644, but for GLNNs it is 0.8986. This shows that GLNN predictions indeed benefit from the graph topology knowledge contained in the teacher outputs (the full table of λ_cut values is in Appendix C).
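The cut-consistency measure is straightforward to compute; a sketch under our assumptions (dense matrices, `cut_consistency` is our name for λ_cut):

```python
import numpy as np

def cut_consistency(Y_soft, A):
    """lambda_cut = tr(Y^T A Y) / tr(Y^T D Y): close to 1 when nodes
    predicted to share a class are mostly connected to each other.

    Y_soft: (N, K) soft class probabilities per node.
    A:      (N, N) adjacency matrix; D is its diagonal degree matrix.
    """
    D = np.diag(A.sum(axis=1))
    return np.trace(Y_soft.T @ A @ Y_soft) / np.trace(Y_soft.T @ D @ Y_soft)
```

On a graph of two disconnected pairs, predictions that match the pairs score exactly 1, while predictions that cut across them score 0, which is the intuition behind comparing SAGE, MLP, and GLNN values.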
5.7 Do GLNNs have enough model expressiveness?
Intuitively, the addition of neighbor information makes GNNs more powerful than MLPs when classifying nodes. Thus, a natural question regarding KD from GNNs to MLPs is whether MLPs are expressive enough to represent graph data as well as GNNs. Many recent works have studied GNN model expressiveness (Xu et al., 2018; Chen et al., 2021). The latter analyzed GNNs and GA-MLPs for node classification and characterized expressiveness as the number of equivalence classes of rooted graphs induced by the model (formal definitions in Appendix D). The conclusion is that GNNs are more powerful than GA-MLPs, but in most real-world cases their expressiveness is indistinguishable.
We adopt the analysis framework from Chen et al. (2021) and show in Appendix D that the number of equivalence classes induced by MLPs is |X|, one per distinct node feature, while the number induced by GNNs grows exponentially with R and L. Here R denotes the max node degree, L denotes the number of GNN layers, and X denotes the set of all possible node features. The former count is apparently larger, which implies that GNNs are more expressive. Empirically, however, the gap makes little difference when |X| is large. In real applications, node features can be high-dimensional, like bag-of-words or even word embeddings, making |X| enormous. For bag-of-words, |X| is on the order of C^D, where D is the vocabulary size and C is the max word frequency. The expressiveness of an L-layer GNN is lower-bounded by the |X| classes induced by MLPs, but empirically, both MLPs and GNNs should have enough expressiveness given that D is usually in the hundreds or larger (see Table 5).
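A quick back-of-the-envelope check of how enormous |X| gets (our illustration; the function name is an assumption):

```python
import math

def feature_space_bits(C, D):
    """log2 of the feature-space size |X| = C^D, for features with D
    entries each taking one of C values (e.g. C = 2 for binary bag-of-words)."""
    return D * math.log2(C)
```

For Cora-scale binary bag-of-words features (D = 1433), |X| = 2^1433, vastly more distinct inputs than nodes in any real graph, so the |X| equivalence classes available to an MLP are more than enough in practice.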
5.8 When will GLNNs fail to work?
As discussed in Section 5.7 and Appendix D, the goal of GML node classification is to fit a function on the rooted graph G[v] and label y_v. From the information-theoretic perspective, fitting by minimizing the commonly used cross-entropy loss is equivalent to maximizing the mutual information (MI) between G[v] and y_v, as shown in Qin et al. (2020). If we consider G[v] as a joint distribution of two random variables X and E, representing the node features and edges in G[v] respectively, we have

I(G[v]; y_v) = I(X; y_v) + I(E; y_v | X).    (3)

The second term depends on the edges, which MLPs do not see; thus MLPs can only maximize the first term, I(X; y_v). In the extreme case, I(X; y_v) can be zero when y_v is independent of X, for example when every node is labeled by its degree or by whether it forms a triangle. Then MLPs cannot fit meaningful functions, and neither can GLNNs. However, such cases are rare and unexpected in the practical settings our work is mainly concerned with. For real GML tasks, node features and structural roles are often highly correlated (Lerique et al., 2020), hence MLPs can achieve reasonable results based on node features alone, and GLNNs can potentially achieve much better results. We study the failure case of GLNNs by creating a low-MI scenario in Section 6.
6 Ablation Studies
In this section, we do ablation studies of GLNNs on node feature noise, inductive split rates, and teacher GNN architecture. Reported results are test accuracies averaged over the five CPF datasets.
Noisy node features. Following Section 5.8, we investigate failure cases of GLNNs by adding different levels of Gaussian noise to the node features to decrease their mutual information with the labels. Specifically, we replace X with X̃ = (1 − α)X + αn. Here n is isotropic Gaussian noise independent of X, and α ∈ [0, 1] denotes the noise level. We show the inductive performance of MLPs, GNNs, and GLNNs under different noise levels in the left plot of Figure 4. We see that as α increases, the accuracy of MLPs and GLNNs decreases faster than that of GNNs, while the performance of GLNNs and GNNs remains comparable for small αs. When α reaches 1, X̃ and Y become independent, which corresponds to the extreme case discussed in Section 5.8.
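The noising step can be sketched directly from the formula (the function name `add_feature_noise` is ours):

```python
import numpy as np

def add_feature_noise(X, alpha, seed=0):
    """X_tilde = (1 - alpha) * X + alpha * n, with n isotropic Gaussian
    noise independent of X; alpha = 1 makes features label-independent."""
    rng = np.random.default_rng(seed)
    n = rng.standard_normal(X.shape)
    return (1 - alpha) * X + alpha * n
```

At α = 0 the features are untouched; at α = 1 they are pure noise, reproducing the zero-MI extreme from Section 5.8.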
Inductive split rate. In Section 5.4, we use a 20–80 split of the test data for inductive evaluation. In the middle plot of Figure 4, we show results under different split rates. We see that as the inductive portion increases, GNN and MLP performance stays roughly the same, while GLNN inductive performance drops slightly. We only consider rates up to 50–50, since having 50% or more inductive nodes is highly atypical in practice. When a large amount of new data is encountered, practitioners can opt to retrain the model on all the data before deployment.
Teacher GNN architecture. For the experiments above, we used SAGE to represent GNNs. In Figure 4 (right), we now show results with other GNNs as teachers, e.g. GCN (Kipf and Welling, 2016), GAT (Veličković et al., 2017), and APPNP (Klicpera et al., 2019). We see that GLNNs can learn from different GNN teachers and improve over MLPs. Their performance with all four teachers is similar, with the GLNN distilled from APPNP slightly worse than the others. In fact, a similar phenomenon was observed in Yang et al. (2021a) as well, i.e. APPNP benefits the student the least. One possible reason is that the first step of APPNP is to utilize the node's own features for prediction (prior to propagating over the graph), which is very similar to what the student MLP does, and thus provides less extra information to MLPs than other teachers.
7 Conclusion and Future Work
In this paper, we explored whether we can bridge the best of GNNs and MLPs to achieve accurate and fast GML models for deployment. We found that KD from GNNs to MLPs helps eliminate the inference-time graph dependency, resulting in GLNNs that are 146×–273× faster than GNNs while enjoying competitive performance. We conducted a comprehensive study of GLNN properties. The promising results on 7 datasets across different domains show that GLNNs can be a handy choice for deploying latency-constrained models. In our experiments, the current version of GLNNs does not show competitive inductive performance on the Arxiv dataset. More advanced distillation techniques can potentially improve GLNN performance, and we leave this investigation as future work.
References
 Mincut pooling in graph neural networks. CoRR abs/1907.00481. External Links: Link, 1907.00481 Cited by: Appendix C.
 Spectral networks and locally connected networks on graphs. External Links: 1312.6203 Cited by: §2.
 FastGCN: fast learning with graph convolutional networks via importance sampling. In International Conference on Learning Representations, External Links: Link Cited by: §2.
 On graph neural networks versus graphaugmented {mlp}s. In International Conference on Learning Representations, External Links: Link Cited by: Appendix D, Appendix D, Appendix D, Appendix E, Appendix E, §5.7, §5.7.
 Simple and deep graph convolutional networks. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 1725–1735. External Links: Link Cited by: §2, §4.

 Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 367–379. External Links: Document Cited by: §2.
 One trillion edges: graph processing at Facebook-scale. Proceedings of the VLDB Endowment 8 (12), pp. 1804–1815. Cited by: §4, §4.
 Convolutional neural networks on graphs with fast localized spectral filtering. External Links: 1606.09375 Cited by: §2.

 Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, New York, NY, USA, pp. 551–556. External Links: ISBN 1581138881, Link, Document Cited by: Appendix C.
 RepMLP: re-parameterizing convolutions into fully-connected layers for image recognition. External Links: 2105.01883 Cited by: §1.
 SIGN: scalable inception graph neural networks. External Links: 2004.11198 Cited by: §5.5.
 Want to start a startup?. http://www.paulgraham.com/growth.html. Cited by: §5.4.
 Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 1737–1746. External Links: Link Cited by: §2.
 Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: §1, §2, §5.2.
 Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems  Volume 1, NIPS’15, Cambridge, MA, USA, pp. 1135–1143. Cited by: §2.
 Distilling the knowledge in a neural network. External Links: 1503.02531 Cited by: §A.3, §2, §5.1.
 Open graph benchmark: datasets for machine learning on graphs. CoRR abs/2005.00687. External Links: Link, 2005.00687 Cited by: Figure 1, §5.2, §5.3.
 Redundancyfree computation for graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 997–1005. Cited by: §1.
 Proteus: exploiting numerical precision variability in deep neural networks. In Proceedings of the 2016 International Conference on Supercomputing, pp. 23. Cited by: §2.
 Adam: a method for stochastic optimization. In ICLR (Poster), External Links: Link Cited by: §A.5.
 Semisupervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §2, §5.2, §6.
 Predict then propagate: graph neural networks meet personalized pagerank. External Links: 1810.05997 Cited by: §2, §5.2, §6.
 Joint embedding of structure and features via graph convolutional networks. Applied Network Science 5 (1), pp. 1–24. Cited by: §5.8.
 Training graph neural networks with 1000 layers. In International Conference on Machine Learning (ICML), Cited by: §4.

 DeepGCNs: can GCNs go as deep as CNNs?. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9267–9276. Cited by: §2, §4.
 Pay attention to MLPs. External Links: 2105.08050 Cited by: §1, §4, §5.6.

 Do you even need attention? a stack of feed-forward layers does surprisingly well on ImageNet. External Links: 2105.02723 Cited by: §1.
 Rethinking softmax with cross-entropy: neural network classifier as mutual information estimator. External Links: 1911.10688 Cited by: §5.8.
 Degreequant: quantizationaware training for graph neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.
 Mlpmixer: an allmlp architecture for vision. arXiv preprint arXiv:2105.01601. Cited by: §1.
 Resmlp: feedforward networks for image classification with dataefficient training. arXiv preprint arXiv:2105.03404. Cited by: §1.
 Attention is all you need. External Links: 1706.03762 Cited by: §5.6.
 Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §2, §5.2, §6.
 Deep graph library: a graphcentric, highlyperformant package for graph neural networks. arXiv preprint arXiv:1909.01315. Cited by: §A.5, Figure 1.
 Fast adaptation for coldstart collaborative filtering with metalearning. In 2020 IEEE International Conference on Data Mining (ICDM), pp. 661–670. Cited by: §1.
 Simplifying graph convolutional networks. External Links: 1902.07153 Cited by: §5.5.
 How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §5.7.
 TinyGNN: learning efficient graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pp. 1848–1856. Cited by: §2, §5.5.
 Extract the knowledge of graph neural networks and go beyond it: an effective knowledge distillation framework. External Links: 2103.02885 Cited by: §2, §5.2, §5.3, §6.
 Distilling knowledge from graph convolutional networks. External Links: 2003.10477 Cited by: §2.
 Agl: a scalable system for industrialpurpose graph machine learning. arXiv preprint arXiv:2003.02454. Cited by: §1.
 Learned low precision graph neural networks. External Links: 2009.09232 Cited by: §1, §2.
 Accelerating large scale realtime GNN inference using channel pruning. CoRR abs/2105.04528. External Links: Link, 2105.04528 Cited by: §1, §2, §5.5.
 Layerdependent importance sampling for training deep and large graph convolutional networks. arXiv preprint arXiv:1911.07323. Cited by: §2.
Appendix A Detailed Experiment Settings
A.1 Datasets
Here we provide a detailed description of the datasets used in our experiments. Four of them are citation graphs: Cora, Citeseer, Pubmed, and ogbn-arxiv, whose node features are descriptions of the papers in the form of bag-of-words vectors, TF-IDF vectors, or word embedding vectors.
In Table 5, we provide the basic statistics of these datasets.
Dataset | # Nodes | # Edges | # Features | # Classes
Cora | 2,485 | 5,069 | 1,433 | 7
Citeseer | 2,110 | 3,668 | 3,703 | 6
Pubmed | 19,717 | 44,324 | 500 | 3
A-computer | 13,381 | 245,778 | 767 | 10
A-photo | 7,487 | 119,043 | 745 | 8
Arxiv | 169,343 | 1,166,243 | 128 | 40
Products | 2,449,029 | 61,859,140 | 100 | 47
For all datasets, we follow the setting in the original paper to split the data. Specifically, for the five smaller datasets from the CPF paper, we use the CPF splitting strategy, and each random seed corresponds to a different split. For the OGB datasets, we follow the official OGB splits, which are based on time for Arxiv and on popularity for Products.
A.2 Model Hyperparameters
The hyperparameters of the GNN models on each dataset are taken from the best hyperparameters provided by the CPF paper and the official OGB examples. For the student MLPs and GLNNs, unless otherwise specified with w or L, we set the number of layers and the hidden dimension of each layer to be the same as the teacher GNN, so their total number of parameters stays the same as the teacher GNN.
Hyperparameter | SAGE | GAT | APPNP
# layers | 2 | 2 | 2
hidden dim | 128 | 64 | 64
learning rate | 0.01 | 0.01 | 0.01
weight decay | 0.0005 | 0.01 | 0.01
dropout | 0 | 0.6 | 0.5
fan-out | 5,5 | – | –
attention heads | – | 8 | –
power iterations | – | – | 10
Dataset | Arxiv | Products
# layers | 3 | 3
hidden dim | 256 | 256
learning rate | 0.001 | 0.001
weight decay | 0 | 0
dropout | 0.2 | 0.5
normalization | batch | batch
fan-out | [5, 10, 15] | [5, 10, 15]
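As a sanity check on the equal-parameter setup described above, the layer sizes in these tables can be turned into parameter counts with simple arithmetic. The sketch below (function name is ours) counts only MLP weights and biases; teacher GNN counts differ slightly per architecture.

```python
def mlp_param_count(dims):
    """Count weights + biases of an MLP with layer sizes dims[0] -> dims[-1]."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

# A 2-layer student MLP for Cora: 1,433 input features, 128 hidden units, 7 classes.
cora_mlp = mlp_param_count([1433, 128, 7])
print(cora_mlp)  # 184455 = (1433*128 + 128) + (128*7 + 7)
```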
For GLNNs, we do a hyperparameter search over learning rate in [0.01, 0.005, 0.001], weight decay in [0, 0.001, 0.002, 0.005, 0.01], and dropout from 0.1 to 0.6.
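A minimal sketch of this grid search (the `train_and_eval` routine is a placeholder for the actual training loop, not part of the released code):

```python
from itertools import product

# Search grid from the text; dropout is swept in steps of 0.1 over [0.1, 0.6].
learning_rates = [0.01, 0.005, 0.001]
weight_decays = [0, 0.001, 0.002, 0.005, 0.01]
dropouts = [round(0.1 * i, 1) for i in range(1, 7)]  # 0.1 ... 0.6

def train_and_eval(lr, wd, dropout):
    # Placeholder: train a GLNN with these hyperparameters, return validation accuracy.
    raise NotImplementedError

def best_config(evaluate=train_and_eval):
    """Exhaustively evaluate the grid and return the best (lr, wd, dropout) triple."""
    results = {}
    for lr, wd, p in product(learning_rates, weight_decays, dropouts):
        results[(lr, wd, p)] = evaluate(lr, wd, p)
    return max(results, key=results.get)
```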
A.3 Knowledge Distillation
We use the distillation method proposed in Hinton et al. (2015) as in Equation 1. There, the hard labels were found to be helpful, so a nonzero λ was suggested. In our case, we did a little tuning of λ but did not find nonzero values to be very helpful. Therefore, we report all of our results with λ = 0, i.e. only the second term involving soft labels is effective. More careful tuning of λ should further improve the results, since the search space is strictly larger. We implemented a weighted version in our code, and we leave the choice of λ as future work.
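As a concrete sketch of this weighted objective (NumPy; λ is the loss weight between hard-label cross-entropy and teacher soft labels, and the function names are ours, not from the released code):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_probs, labels, lam=0.0, eps=1e-12):
    """Weighted distillation: lam * CE(hard labels) + (1 - lam) * KL(teacher || student)."""
    p = softmax(student_logits)
    ce = -np.log(p[np.arange(len(labels)), labels] + eps).mean()
    kl = (teacher_probs * (np.log(teacher_probs + eps) - np.log(p + eps))).sum(axis=-1).mean()
    return lam * ce + (1.0 - lam) * kl

# With lam = 0, as reported in the paper, only the soft-label term is active.
```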
A.4 The Transductive Setting and The Inductive Setting
Given the graph G, the node features X, and the labels Y^L of the labeled nodes, node classification can be divided into two settings, i.e. transductive and inductive. In real applications, the former can correspond to predicting missing attributes of a user based on the user profile and other existing users, while the latter can correspond to predicting labels of new nodes that are only seen at inference time. To create the inductive setting on a given dataset, we hold out some nodes, along with the edges connected to them, during training and use them for inductive evaluation only. These nodes and edges are picked from the test data. Using the notation defined above, we pick the inductive node set V_ind from the unlabeled nodes V^U, which partitions V^U into a disjoint inductive subset and observed subset, i.e. V^U = V_obs ⊔ V_ind. We then take out all the edges connected to nodes in V_ind to further partition the whole graph, so we end up with G = G_obs ⊔ G_ind, X = X_obs ⊔ X_ind, and Y = Y_obs ⊔ Y_ind. In the transductive setting, models train on G, X, and Y^L and predict on V^U; in the inductive setting, models train on G_obs, X_obs, and Y^L and predict on V_ind.
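The partition above can be sketched in a few lines of plain Python (edge-list representation; function and variable names are illustrative, not from the released code):

```python
import random

def inductive_split(nodes, edges, test_nodes, ind_fraction=0.2, seed=0):
    """Hold out a fraction of test nodes, plus their incident edges, for inductive evaluation."""
    rng = random.Random(seed)
    ind = set(rng.sample(sorted(test_nodes), int(ind_fraction * len(test_nodes))))
    obs_nodes = [v for v in nodes if v not in ind]
    # Any edge touching an inductive node is removed from the observed graph.
    obs_edges = [(u, v) for (u, v) in edges if u not in ind and v not in ind]
    ind_edges = [(u, v) for (u, v) in edges if u in ind or v in ind]
    return obs_nodes, obs_edges, sorted(ind), ind_edges
```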
We visualize the difference between the inductive setting and the transductive setting in Figure 5.
A.5 Implementation and Hardware Details
The experiments on both the baselines and our approach are implemented using PyTorch, the DGL library (Wang et al., 2019) for GNN algorithms, and the Adam optimizer (Kingma and Ba, 2015). We run all experiments on a machine with 80 Intel(R) Xeon(R) E5-2698 v4 @ 2.20GHz CPUs and a single NVIDIA V100 GPU with 16GB RAM.
Appendix B Space and Time Complexity of GNNs vs. MLPs
In Table 2, the model comparison was between equal-sized MLPs (GLNNs) and GNNs. Comparing models with equal parameter counts is a common convention for fair space complexity, but it is not entirely fair for cross-model comparison, especially MLPs vs. GNNs. To do inference with GNNs, the graph needs to be loaded into memory, either entirely or batch by batch, which takes much more space than the model parameters. Thus, the space complexity of GNNs is actually much higher than that of equal-sized MLPs. From the time complexity perspective, the major inference latency of GNNs comes from the data dependency, as shown in Section 4. Under the same setting as Figure 1, we show in the left part of Figure 3 that even a 5-layer MLP with 8-times-wider hidden layers still runs much faster than a single-layer SAGE. Another example of cross-model comparison is Transformers vs. RNNs: large Transformers can have more parameters than RNNs because of the attention mechanism, yet they generally run faster than RNNs.
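To make the latency argument concrete: the number of nodes an L-layer GNN must fetch for one prediction grows multiplicatively with the neighbor fan-out, while an MLP always touches exactly one node. A rough upper-bound count, assuming a fixed fan-out per layer:

```python
def fetched_nodes(num_layers, fanout):
    """Upper bound on nodes fetched for one target node: 1 + R + R^2 + ... + R^L."""
    return sum(fanout ** l for l in range(num_layers + 1))

# E.g. 3 layers with fan-out 15 already touch thousands of nodes per prediction,
# while an MLP of any depth touches exactly 1.
print(fetched_nodes(3, 15))  # 3616
```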
Moreover, comparing GNNs and MLPs with the same number of parameters ignores their difference in inductive bias. GNNs are at an advantage since they draw on two sources of information at inference time: the learned parameters and the graph data. For GNNs, the inductive bias that neighboring nodes influence each other lets the graph complement knowledge missing from the model parameters. For MLPs, the model parameters are the only place to store learned knowledge. MLPs are thus more flexible, with less inductive bias, but this may require more parameters for the model to learn well (see Section 5.6 for a more detailed discussion of inductive bias).
In Table 2, we saw that under the equal-sized comparison, GLNNs are not as good as GNNs on the OGB datasets. Following the discussion above, and given that the GLNNs used in Table 2 are relatively small (3 layers and 256 hidden dimensions) for the millions of nodes in the OGB datasets, we ask whether this gap can be mitigated by increasing the MLP, and thus GLNN, sizes. Our answer is shown in Table 2.
Appendix C Consistency Measure of Model Predictions and Graph Topology Based on Min-Cut
We introduce a metric to measure the consistency between model predictions and graph topology based on the min-cut problem in Section 5.6. The K-way normalized min-cut problem, or simply min-cut, partitions the nodes in V into K disjoint subsets by removing the minimum volume of edges. According to Dhillon et al. (2004), the min-cut problem can be expressed as

max_C (1/K) Σ_{k=1}^{K} (C_k^T A C_k) / (C_k^T D C_k),   (4)

with C ∈ {0,1}^{N×K} being the node assignment matrix that partitions V, i.e. C_{ik} = 1 if node i is assigned to class k, A being the adjacency matrix, and D being the degree matrix. The quantity we try to maximize here tells us whether the assignment is consistent with the graph topology: the bigger it is, the fewer edges need to be removed, and the more consistent the assignment is with the existing connections in the graph. In Bianchi et al. (2019), the authors show that when the hard assignment C is replaced by a soft classification probability Ŷ, the cut loss in Equation 2 becomes a good approximation of Equation 4 and can be used as the measuring metric.
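A minimal NumPy sketch of the soft cut metric, using the trace-form relaxation from Bianchi et al. (2019), Tr(ŶᵀAŶ)/Tr(ŶᵀDŶ); the function name and toy graph are ours:

```python
import numpy as np

def cut_consistency(adj, probs):
    """Soft min-cut consistency: trace(P^T A P) / trace(P^T D P).
    adj: (N, N) symmetric adjacency matrix; probs: (N, K) soft class probabilities.
    Values closer to 1 mean predictions agree better with graph connectivity."""
    deg = np.diag(adj.sum(axis=1))
    return np.trace(probs.T @ adj @ probs) / np.trace(probs.T @ deg @ probs)

# Two disconnected pairs with predictions aligned to the components -> metric 1.0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
print(cut_consistency(A, P))  # 1.0
```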
Datasets | SAGE | MLP | GLNN
Cora | 0.9347 | 0.7026 | 0.8852
Citeseer | 0.9485 | 0.7693 | 0.9339
Pubmed | 0.9605 | 0.9455 | 0.9701
A-computer | 0.9003 | 0.6976 | 0.8638
A-photo | 0.8664 | 0.7069 | 0.8398
Average | 0.9221 | 0.7644 | 0.8986
Appendix D Expressiveness of GNNs vs. MLPs in Terms of Equivalence Classes of Rooted Graphs
In Chen et al. (2021), the expressiveness of GNNs and GA-MLPs was theoretically quantified in terms of induced equivalence classes of rooted graphs. We adopt their framework and perform a similar analysis for GNNs vs. MLPs. We first define rooted graphs.
Definition 1 (Rooted Graph).
A rooted graph, denoted G^{[v]}, is a graph G with one node v in V designated as the root. GNNs, GA-MLPs, and MLPs can all be considered functions on rooted graphs. The goal of a node-level task on node v with label y_v is to fit a function to the input-output pairs (G^{[v]}, y_v).
We denote the space of rooted graphs as 𝒢. Following Chen et al. (2021), the expressive power of a model on graph data is evaluated by its ability to approximate functions on 𝒢. This is further characterized as the number of induced equivalence classes of rooted graphs on 𝒢, with the equivalence relation defined as follows. Given a family of functions F on 𝒢, we define an equivalence relation ∼_F among all rooted graphs such that G_1^{[v_1]} ∼_F G_2^{[v_2]} if and only if f(G_1^{[v_1]}) = f(G_2^{[v_2]}) for all f in F. We now give a proposition to characterize the GNN expressive power (proof in Appendix E).
Proposition 1.
Let C denote the set of all possible node features with n = |C| ≥ 2, and let m denote the maximum node degree with m ≥ 2. Then the total number of equivalence classes of rooted graphs induced by an L-layer GNN is lower bounded by N_L = n \binom{n+m-1}{m} \binom{n+m-2}{m-1}^{2^L - 2}.
As shown in Proposition 1, the expressive power of GNNs grows doubly-exponentially in the number of layers L, meaning it grows linearly in L after taking the logarithm twice. The expressive power of GA-MLPs only grows exponentially in L, as shown in Chen et al. (2021). Under this framework, the expressive power of MLPs, which correspond to 0-layer GA-MLPs, is n. Since the former is much larger than the latter, the conclusion is that GNNs are much more expressive than MLPs. The gap between these two numbers indeed exists, but empirically it only makes a difference when n is small. In Chen et al. (2021), both the lower bound proof and the constructed examples showing GNNs are more powerful than GA-MLPs assumed a small n. In real applications and the datasets considered in this work, the node features can be high-dimensional vectors like bag-of-words, which makes n enormous. Thus, this gap matters little empirically.
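To illustrate the doubly-exponential growth numerically, the sketch below evaluates a lower bound of the form n·C(n+m−1, m)·C(n+m−2, m−1)^{2^L−2} (our reading of the bound derived in Appendix E; the exact constant factors are an assumption) and shows that its double logarithm grows roughly linearly in L:

```python
import math

def equivalence_lower_bound(n, m, L):
    """Lower bound on equivalence classes induced by an L-layer GNN
    (n distinct feature values, node degree m); see Appendix E."""
    base = math.comb(n + m - 1, m)
    step = math.comb(n + m - 2, m - 1)
    return n * base * step ** (2 ** L - 2)

# log log of the bound increases by roughly log 2 per extra layer,
# i.e. doubly-exponential growth in L.
for L in (1, 2, 3, 4):
    print(L, math.log(math.log(equivalence_lower_bound(2, 2, L))))
```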
Appendix E Proof of Proposition 1
To prove Proposition 1, we first define rooted aggregation trees, which are similar to but distinct from rooted graphs.
Definition 2 (Rooted Aggregation Tree).
The depth-K rooted aggregation tree of a rooted graph G^{[v]} is a depth-K rooted tree with a (possibly many-to-one) mapping from every node in the tree to some node in G, where (i) the root of the tree is mapped to node v, and (ii) the children of every node u in the tree are mapped to the neighbors in G of the node to which u is mapped.
A rooted aggregation tree can be obtained by unrolling the neighborhood aggregation steps of a GNN. An illustration of rooted graphs and rooted aggregation trees can be found in Figure 4 of Chen et al. (2021). We denote the set of all rooted aggregation trees of depth L by T_L, and use S_L to denote the subset of T_L in which the node features belong to C, every node has exactly degree m (m children), and at least two out of every node's m children have different features. In other words, a node cannot have all-identical children. With rooted aggregation trees defined, we are ready to prove Proposition 1. The proof is adapted from the proof of Lemma 3 in Chen et al. (2021).
Proof.
Since each equivalence class on 𝒢 induced by the family of all depth-L GNNs consists of all rooted graphs that share the same rooted aggregation tree of depth L (Chen et al., 2021), the lower bound problem in Proposition 1 reduces to lower bounding |T_L|, which can be further reduced to lower bounding N_L = |S_L|. We now bound N_L inductively.
When L = 1, the root of the tree can have n different choices of feature. For the m children, we pick m features from C with repetition allowed, which leads to \binom{n+m-1}{m} cases. Therefore, N_1 = n\binom{n+m-1}{m}.
Assuming the statement holds for L, we show it holds for L+1 by constructing trees in S_{L+1} from trees in S_L. We do this by assigning node features in C to the m children of each leaf node of a tree T in S_L. First note that when T_1 and T_2 are two non-isomorphic trees in S_L, the depth-(L+1) trees constructed from T_1 and T_2 will be different no matter how the node features are assigned. Now we consider all the trees that can be constructed from a given T in S_L by assigning node features to the children of its leaf nodes.
We first consider all paths from the root to the leaves in T. Each path consists of a sequence of nodes whose features form a tuple P = (c_1, …, c_L). A leaf node is called a node under P if the path from the root to it corresponds to P. The children of nodes under different P's are always distinguishable, and thus any assignments lead to distinct rooted aggregation trees of depth L+1. The assignments to children of nodes under the same P, on the other hand, could be overcounted. Therefore, to lower bound N_{L+1}, we only consider a special type of assignment that avoids overcounting: the children of all nodes under the same P are assigned the same multiset of features.
Since every node in T has at least two children with different features, there are at least 2^L different P's corresponding to paths from the root to the leaves. For a leaf node under a fixed P, one of its children needs to have the same feature as the leaf's parent node; this restriction comes from the definition of rooted aggregation trees. Therefore, we only pick features for the other m−1 children, which gives \binom{n+m-2}{m-1} cases for each P. Through this construction, the total number of depth-(L+1) trees constructed from T can be lower bounded by \binom{n+m-2}{m-1}^{2^L}. Finally, since this lower bound holds for all T in S_L, we derive N_{L+1} ≥ N_L\binom{n+m-2}{m-1}^{2^L}, and hence N_L ≥ n\binom{n+m-1}{m}\binom{n+m-2}{m-1}^{2^L-2}.
∎