Recent years have seen increasing attention to Graph Neural Nets (GNNs) [14, 12, 2, 10], which have achieved superior performance in many graph tasks, such as node classification [10, 17] and graph classification [16, 18]. Different from traditional neural networks that are defined on regular structures such as sequences or images, GNNs operate on graphs, which provide a more general abstraction for structured data and subsume regular structures as special cases. The power of GNNs is that they can directly define learnable compositional functions on (arbitrary) graphs, thus extending classic networks (e.g. CNNs, RNNs) to more irregular and general domains.
Despite their success, it is unclear what GNNs have learned, and how sophisticated the learned graph functions are. Zeiler and Fergus showed that traditional CNNs used in image recognition learn complex hierarchical and compositional features, and deep non-linear computation has also been found to be beneficial. Is this also the case when applying GNNs to common graph problems? Recently, Wu et al. showed that, for common node classification benchmarks, non-linearity can be removed in GNNs without suffering much loss of performance. The resulting linear GNNs collapse into a logistic regression on graph propagated features. This raises doubts about the necessity of complex GNNs, which require much more expensive computation, for node classification benchmarks. Here we take a step further in dissecting GNNs, and examine the necessity of complex GNN parts on more challenging graph classification benchmarks [20, 23, 18].
To better understand GNNs on graph classification, we dissect a GNN into two parts/stages: 1) the graph filtering part, where graph-based neighbor aggregations are performed, and 2) the set function part, where a set of hidden node features are composed for prediction. We aim to test the importance of both parts separately, and seek answers to the following questions. Do we need a sophisticated graph filtering function for a particular task or dataset? And if we have a powerful set function, is it enough to use a simple graph filtering function?
To answer these questions, we first propose Graph Feature Network (GFN), a simple lightweight neural net defined on a set of graph augmented features. Unlike GNNs, which learn a multi-step neighbor aggregation function on graphs [1, 4], the GFN only utilizes graphs in constructing its input features. It first augments nodes with graph structural and propagated features, and then learns a neural net directly on the set of nodes (i.e. a bag of graph pre-processed feature vectors), which makes GFN a fast approximation to GNN. We then prove that GFN can be derived by linearizing the graph filtering part of a GNN, and leverage this connection to design experiments to probe both GNN parts separately.
Empirically, we perform evaluations on common graph classification benchmarks [20, 23, 18], and find that GFN can match or exceed the best accuracies produced by recently proposed GNNs, at a fraction of the computation cost. This result casts doubt on the necessity of non-linear graph filtering, and suggests that the existing GNNs may not have learned more sophisticated graph functions than linear neighbor aggregation on these benchmarks. Our ablations on GFN further demonstrate the importance of a non-linear set function, as its linearization can hurt performance significantly.
Summary of contributions. We propose Graph Feature Network (GFN): a simple and lightweight model for graph classification. We dissect GNNs on graph classification and leverage GFN to study the necessity of complex GNN parts. Empirically we show GFN trains faster and matches the best performance of GNNs. Our results provide new perspectives on the functions that GNNs learn, and also suggest the current benchmarks for evaluating them are inadequate (not sufficiently differentiating).
Graph classification problem.
We use $G = (V, E)$ to denote a graph, where $V$ is a set of vertices/nodes, and $E$ is a set of edges. We further denote an attributed graph as $G_X = (G, X) \in \mathcal{G}_X$, where $X \in \mathbb{R}^{n \times d}$ are node attributes with $n = |V|$. It is assumed that each attributed graph is associated with some label $y \in \mathcal{Y}$, where $\mathcal{Y}$ is a set of pre-defined categories. The goal in the graph classification problem is to learn a mapping function $f: \mathcal{G}_X \to \mathcal{Y}$, such that we can predict the target class for unseen graphs accurately. Many real-world problems can be formulated as graph classification problems, such as social and biological graph classification [20, 10].
Graph neural networks.
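Graph Neural Networks (GNNs) define functions on the space of attributed graphs $\mathcal{G}_X$. Typically, the graph function, $\mathrm{GNN}(G, X)$, learns a multiple-step transformation of the original attributes/signals for final node level or graph level prediction. In each step $t$, a new node representation, $h_v^{(t)}$, is learned. Initially, $h_v^{(0)}$ is initialized with the node attribute vector, and during each subsequent step, a neighbor aggregation function is applied to generate the new node representation. More specifically, common neighbor aggregation functions for the $v$-th node take the following form:
$$h_v^{(t)} = f\big(h_v^{(t-1)}, \{h_{v'}^{(t-1)} \mid v' \in \mathcal{N}(v)\}\big), \tag{1}$$
where $\mathcal{N}(v)$ is a set of neighboring nodes of node $v$. To instantiate this neighbor aggregation function, Kipf and Welling propose the Graph Convolutional Network (GCN) aggregation scheme as follows.
$$H^{(t)} = \sigma\big(\tilde{A} H^{(t-1)} W^{(t)}\big), \tag{2}$$
where $W^{(t)}$ is the learnable transformation weight, $\tilde{A} = \tilde{D}^{-1/2}(A + \epsilon I)\tilde{D}^{-1/2}$ is the normalized adjacency matrix with $\epsilon$ as a constant ($\epsilon = 1$ in Kipf and Welling) and $\tilde{D}_{ii} = \epsilon + \sum_j A_{ij}$, $\sigma(\cdot)$ is a non-linear activation function such as ReLU, and $H^{(t)}$ are the hidden states of all nodes at the $t$-th step.
More sophisticated neighbor aggregation schemes have also been proposed, such as GraphSAGE (Hamilton et al.), which allows pooling and recurrent aggregation over neighboring nodes. Most recently, in Graph Isomorphism Network (GIN) (Xu et al.), a more powerful aggregation function is proposed as follows.
$$h_v^{(t)} = \mathrm{MLP}^{(t)}\Big(\big(1 + \epsilon^{(t)}\big) h_v^{(t-1)} + \sum_{v' \in \mathcal{N}(v)} h_{v'}^{(t-1)}\Big), \tag{3}$$
where MLP abbreviates multi-layer perceptron and $\epsilon^{(t)}$ can either be zero or a learnable parameter.
Finally, in order to generate a graph level representation $h_G$, a readout function is used, which generally takes the following form:
$$h_G = g\big(\{h_v^{(T)} \mid v \in V\}\big). \tag{4}$$
This can be instantiated by a global sum pooling, i.e. $h_G = \sum_{v \in V} h_v^{(T)}$, followed by fully connected layers to generate the categorical or numerical output.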
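To make the GCN aggregation of Eq. 2 and the sum-pooling readout concrete, the following is a minimal NumPy sketch, not the implementation used in the experiments. It assumes a dense adjacency matrix, the self-loop constant set to 1, ReLU as the activation, and no bias terms; the function names and the toy graph are hypothetical.

```python
import numpy as np

def normalized_adjacency(A, eps=1.0):
    """Return D~^{-1/2} (A + eps*I) D~^{-1/2} for a dense adjacency matrix A."""
    A_hat = A + eps * np.eye(A.shape[0])
    deg = A_hat.sum(axis=1)                      # D~_ii = eps + sum_j A_ij
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, H, W):
    """One GCN step: ReLU(A_norm @ H @ W)."""
    return np.maximum(A_norm @ H @ W, 0.0)

def sum_readout(H):
    """Global sum pooling over nodes, giving one vector per graph."""
    return H.sum(axis=0)

# Toy example: a path graph 0 - 1 - 2 with 2-dimensional node attributes.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4))                     # hypothetical layer weights

A_norm = normalized_adjacency(A)
H1 = gcn_layer(A_norm, X, W1)                    # node representations after one step
h_G = sum_readout(H1)                            # graph-level representation
```

Because the readout sums over nodes, relabeling the nodes (permuting the rows of `X` together with the rows and columns of `A`) leaves `h_G` unchanged.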
3.1 Graph feature network
Our model is motivated by the question whether, with a powerful graph readout function, we can simplify the sophisticated multi-step neighbor aggregation functions (such as Eq. 2 and 3). Therefore we propose Graph Feature Network (GFN): a neural set function defined on a set of graph augmented features.
Graph augmented features.
In GFN, we replace the sophisticated neighbor aggregation functions (such as Eq. 2 and 3) with graph augmented features based on $G_X$. Here we consider two categories as follows: 1) graph structural/topological features, which are related to the intrinsic graph structure, such as node degrees or node centrality scores (we only use node degree in this work, as it is very efficient to calculate during both training and inference), but do not rely on node attributes; 2) graph propagated features, which leverage the graph as a medium to propagate node attributes. The graph augmented features $X^G$ can be seen as the output of a feature extraction function defined on the attributed graph, i.e. $X^G = \gamma(G, X)$, and Eq. 5 below gives a specific form, which combines node degree features and multi-scale graph propagated features as follows:
$$X^G = \gamma(G, X) = \big[d, X, \tilde{A}^1 X, \tilde{A}^2 X, \cdots, \tilde{A}^K X\big], \tag{5}$$
where $d \in \mathbb{R}^{n \times 1}$ is the degree vector of all nodes, $\tilde{A}$ is the normalized adjacency matrix, and the comma-separated features are concatenated.
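As an illustration of how such graph augmented features could be pre-computed, here is a small NumPy sketch that concatenates the node degree, the raw attributes, and the attributes propagated over one to K hops. It assumes a dense adjacency matrix and GCN-style symmetric normalization with a self-loop; the function name, the argument `K`, and the toy graph are assumptions made for this example.

```python
import numpy as np

def graph_augmented_features(A, X, K, eps=1.0):
    """Concatenate [d, X, A~X, ..., A~^K X] column-wise (one row per node)."""
    n = A.shape[0]
    A_hat = A + eps * np.eye(n)
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    degree = A.sum(axis=1, keepdims=True)        # structural feature, shape (n, 1)
    feats = [degree, X]
    P = X
    for _ in range(K):
        P = A_norm @ P                           # one more propagation step
        feats.append(P)
    return np.concatenate(feats, axis=1)

# Toy example: a path graph 0 - 1 - 2 with 2-dimensional node attributes.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
XG = graph_augmented_features(A, X, K=2)         # shape (3, 1 + 2 * (2 + 1))
```

The function involves no learnable parameters, so its output can be computed once per graph and cached before training starts.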
Neural set function.
To build a powerful graph readout function based on graph augmented features $X^G$, we use a neural set function. The neural set function discards the graph structures and learns purely based on the set of augmented node features. Motivated by the general form of a permutation-invariant set function shown by Zaheer et al., we define our neural set function for GFN as follows:
$$\mathrm{GFN}(G, X) = \rho\Big(\sum_{v \in V} \phi\big(X^G_v\big)\Big). \tag{6}$$
Both $\phi(\cdot)$ and $\rho(\cdot)$ are parameterized by neural networks. Concretely, we parameterize the function $\phi(\cdot)$ as a multi-layer perceptron (MLP), i.e. $\phi(x) = \sigma(\sigma(\cdots\sigma(x^\top W)\cdots))$. Note that a single layer of $\phi(\cdot)$ resembles a graph convolution layer $H^{(t)} = \sigma(\tilde{A} H^{(t-1)} W^{(t)})$ with the adjacency matrix $\tilde{A}$ replaced by the identity matrix $I$ (a.k.a. $1 \times 1$ convolution). As for the function $\rho(\cdot)$, we parameterize it with another MLP (i.e. fully connected layers in this case).
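The neural set function can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: biases and batch normalization are omitted, the set function applied after pooling is taken to be a small MLP followed by a linear output layer (`W_out`, a hypothetical name), and all weights and feature values are random placeholders.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp(x, weights):
    """Apply ReLU(x @ W) for each W in turn (biases omitted for brevity)."""
    for W in weights:
        x = relu(x @ W)
    return x

def gfn_forward(XG, phi_weights, rho_weights, W_out):
    """Per-node MLP, sum over nodes, then an MLP and a linear layer giving logits."""
    node_hidden = mlp(XG, phi_weights)           # applied to every node independently
    pooled = node_hidden.sum(axis=0)             # permutation-invariant sum over nodes
    return mlp(pooled, rho_weights) @ W_out

rng = np.random.default_rng(0)
XG = rng.normal(size=(5, 7))                     # 5 nodes, 7 augmented features (toy values)
phi_weights = [rng.normal(size=(7, 8)), rng.normal(size=(8, 8))]
rho_weights = [rng.normal(size=(8, 8))]
W_out = rng.normal(size=(8, 3))                  # 3 hypothetical classes

logits = gfn_forward(XG, phi_weights, rho_weights, W_out)

# Shuffling the node order does not change the prediction.
perm = rng.permutation(5)
logits_perm = gfn_forward(XG[perm], phi_weights, rho_weights, W_out)
```

Note that the adjacency matrix never appears in the forward pass: the graph enters only through the pre-computed rows of `XG`.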
GFN provides a way to approximate GNN with less computational overhead, especially during the training process. Since the graph augmented features can be pre-computed before training starts, the graph structures are not involved in the iterative training process. This brings the following advantages. First, since there is no neighbor aggregation step in GFN, it reduces computational complexity. To see this, one can compare a single layer feature transformation function in GFN, i.e. $\sigma(HW)$, against the neighbor aggregation function in GCN, i.e. $\sigma(\tilde{A}HW)$. Secondly, since graph augmented features of different scales are readily available from the input layer, GFN can leverage them much earlier, and thus may require fewer transformation layers. Lastly, it also eases the implementation-related overhead, since the neighbor aggregation operations in graphs are typically implemented by sparse matrix operations.
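The per-layer saving can be made explicit with a rough operation count. This is a back-of-the-envelope estimate rather than a measurement, and it assumes that the normalized adjacency matrix is stored in sparse form with $O(n + |E|)$ non-zeros, that a layer maps $d$ input channels to $d'$ output channels, and that the GCN layer first forms the dense product and then applies the sparse propagation.

```latex
\begin{align*}
\text{GFN layer } \sigma(HW):            &\quad O(n\, d\, d') \\
\text{GCN layer } \sigma(\tilde{A} H W): &\quad O\big(n\, d\, d' + (n + |E|)\, d'\big)
\end{align*}
```

Under these assumptions the extra term grows with the number of edges, so the gap between the two layers widens on denser graphs.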
3.2 From GNN to GFN: a dissection of GNNs
To better understand GNNs on graph classification, we propose a formal dissection/decomposition of GNNs into two parts/stages: the graph filtering part and the set function part. As we shall see shortly, simplifying the graph filtering part allows us to derive GFN from GNN, and also to assess the importance of the two GNN parts separately.
To make the concepts clearer, we first give formal definitions of the two GNN parts in the dissection.
Definition 1 (Graph filtering). A graph filtering function, $Y = F_G(X)$, performs a transformation of input signals based on the graph $G$, which takes a set of signals $X \in \mathbb{R}^{n \times d}$ and outputs another set of filtered signals $Y \in \mathbb{R}^{n \times d'}$.
Graph filtering in most existing GNNs consists of multi-step neighbor aggregation operations, i.e. multiple steps of Eq. 1. For example, in GCN, the multi-step neighbor aggregation can be expressed as $H^{(T)} = \sigma\big(\tilde{A}\,\sigma(\cdots\sigma(\tilde{A} X W^{(1)})\cdots)\,W^{(T)}\big)$.
Definition 2 (Set function). A set function, $Y' = T(Y)$, takes a set of vectors $Y \in \mathbb{R}^{n \times d'}$ where their order does not matter, and outputs a task specific prediction $Y' \in \mathcal{Y}$.
The graph readout function in Eq. 4 is a set function, which enables a graph level prediction that is permutation invariant w.r.t. the nodes in the graph. Although a typical readout function is simply a global pooling, the set function can be as complicated as Eq. 6.
Claim 1. A GNN that is a mapping of $\mathcal{G}_X \to \mathcal{Y}$ can be decomposed into a graph filtering function followed by a set function, i.e. $\mathrm{GNN}(X) = T \circ F_G(X)$.
This claim is obvious for the neighbor aggregation framework defined by Eq. 1 and 4, which most existing GNN variants such as GCN, GraphSAGE and GIN follow. This claim is also general, even for unforeseen GNN variants that do not explicitly follow this framework (we can absorb the set function $T$ into $F_G$: let the output $Y = F_G(X)$ be the final logits for the pre-defined classes, and set $T$ to a softmax function with zero temperature, i.e. $\exp(x/\tau)/Z$ with $\tau \to 0$).
We aim to assess the importance of the two GNN parts separately. However, it is worth pointing out that the above decomposition is not unique in general, and the functionality of the two parts can overlap: if the graph filtering part has fully transformed graph features, then a simple set function may be used for prediction. This makes it challenging to answer the question: do we need a sophisticated graph filtering part for a particular task or dataset, especially when a powerful set function is used? To better disentangle these two parts and study their importance more independently, similar to Wu et al., we propose to simplify the graph filtering part by linearizing it.
Definition 3 (Linear graph filtering). We say a graph filtering function $F_G(X)$ is linear w.r.t. $X$ iff it can be expressed as $F_G(X) = \Gamma(G, X)\theta$, where $\Gamma(G, X)$ is a linear map of $X$, and $\theta$ is the only learnable parameter.
A graph filtering function can be linearized by removing its non-linear operations, such as the activation function $\sigma(\cdot)$ in Eq. 2. By doing so, the graph filtering becomes linear w.r.t. $X$, thus multi-layer weights collapse into a single linear transformation, described by $\theta$. More concretely, let us consider a linearized GCN: its $K$-th layer can be written as $F_G(X) = \tilde{A}^K X \big(\prod_{k=1}^{K} W^{(k)}\big)$, and we can rewrite the weights with $\theta = \prod_{k=1}^{K} W^{(k)}$.
The linearization of the graph filtering part enables us to disentangle graph filtering and the set function more thoroughly: the graph filtering part mainly constructs graph augmented features (by setting $\gamma(G, X) = \Gamma(G, X)$), and the set function learns to compose them for the graph-level prediction. This leads to the proposed GFN. In other words, GNNs with a linear graph filtering part can be expressed as GFN with appropriate graph augmented features. This is shown more formally in Proposition 1 below.
Proposition 1. Let $\mathrm{GNN}^{lin}(G, X)$ be a mapping of $\mathcal{G}_X \to \mathcal{Y}$ that has a linear graph filtering part, i.e. $F_G(X) = \Gamma(G, X)\theta$, then we have $\mathrm{GNN}^{lin}(G, X) = \mathrm{GFN}(G, X)$, where $\gamma(G, X) = \Gamma(G, X)$.
The proof can be found in the appendix.
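The collapse of a linearized multi-layer GCN into a single propagated-feature term can be checked numerically. The sketch below is an illustrative sanity check, not part of the formal proof; it assumes a random undirected graph, GCN-style normalization with a self-loop, and random weight matrices, all of which are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, hidden, K = 6, 4, 5, 3

# Random symmetric adjacency without self-loops, then GCN-style normalization.
upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1).astype(float)
A = upper + upper.T
A_hat = A + np.eye(n)
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

X = rng.normal(size=(n, d))
dims = [d] + [hidden] * K
weights = [rng.normal(size=(dims[k], dims[k + 1])) for k in range(K)]

# K linear GCN layers: H <- A_norm @ H @ W_k (activation removed).
H = X
for W in weights:
    H = A_norm @ H @ W

# Collapsed form: A_norm^K @ X @ theta, with theta = W_1 W_2 ... W_K.
theta = np.linalg.multi_dot(weights)
collapsed = np.linalg.matrix_power(A_norm, K) @ X @ theta

max_gap = np.abs(H - collapsed).max()            # zero up to floating-point error
```

The two results agree because, without the activation, each layer is a left multiplication by the same matrix and a right multiplication by a weight, and these can be regrouped freely.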
We have shown that GFN can be derived from GNN by linearizing its graph filtering function (a small exception is GFNs whose feature extraction function $\gamma$ is not a linear map of $X$; this is not the case for the one defined by Eq. 5), and that GFN can be more efficient than its GNN counterpart. Beyond being a fast approximation, GFN can also help us design experiments to understand the functions that GNNs learn and the current benchmarks for evaluating them. First, by comparing GNN with linear graph filtering (i.e. GFN) against standard GNN with non-linear graph filtering, we can assess the importance of the non-linear graph filtering part. Secondly, by comparing GFN with a linear set function against standard GFN with a non-linear set function, we can assess the importance of the non-linear set function. The outcomes of these comparisons can also help us judge the complexity of the benchmark, assuming complex tasks/datasets require both non-linear GNN parts.
4.1 Datasets and settings
The main datasets we consider are commonly used graph classification benchmarks [20, 18, 19]. The graphs in the collection can be categorized into two categories: (1) biological graphs, including MUTAG, NCI1, PROTEINS, D&D, ENZYMES; and (2) social graphs, including COLLAB, IMDB-Binary (IMDB-B), IMDB-Multi (IMDB-M), Reddit-Multi-5K (RE-M5K), Reddit-Multi-12K (RE-M12K). It is worth noting that the social graphs have no node attributes, while the biological graphs come with categorical node attributes. The detailed statistics can be found in the appendix. In addition to the common graph benchmarks, we also consider image classification on MNIST where pixels are treated as nodes and eight nearest neighbors in the grid, with an extra self-loop, are used to construct the graph.
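The pixel-graph construction for MNIST can be sketched as follows. This is an illustrative example that assumes border pixels simply have fewer neighbors (no padding or wrap-around) and uses a dense adjacency matrix on a small grid for readability; the function name is hypothetical.

```python
import numpy as np

def grid_adjacency(height, width):
    """Dense adjacency for a pixel grid: 8 nearest neighbours plus a self-loop."""
    n = height * width
    A = np.zeros((n, n))
    for r in range(height):
        for c in range(width):
            i = r * width + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        A[i, rr * width + cc] = 1.0   # (dr, dc) == (0, 0) is the self-loop
    return A

A = grid_adjacency(4, 4)   # small grid for illustration; MNIST images are 28 x 28
```

Since every image shares the same grid, this adjacency only needs to be built once, and in practice a sparse representation would be preferable for the full 784-node graph.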
We compare with two families of baselines. The first family of baselines are kernel-based, namely the Weisfeiler-Lehman subtree kernel (WL) (Shervashidze et al.), Deep Graph Kernel (DGK) (Yanardag and Vishwanathan) and AWE (Ivanov and Burnaev), which combine kernel-based methods with a learning-based approach to learn embeddings. The second family of baselines are GNN-based models, which include the recently proposed PATCHY-SAN (PSCN) (Niepert et al.), Deep Graph CNN (DGCNN) (Zhang et al.), CapsGNN (Xinyi and Chen) and GIN (Xu et al.).
For the above baselines, we use their accuracies reported in the original papers, following the same evaluation setting as prior work [18, 19]. Architecture and hyper-parameters can make a difference, so to enable a better controlled comparison between GFN and GNN, we also implement Graph Convolutional Networks (GCN) from Kipf and Welling. More specifically, our GCN model contains a dense feature transformation layer, i.e. $H^{(1)} = \sigma(X W^{(1)})$, followed by three GCN layers, i.e. $H^{(t)} = \sigma(\tilde{A} H^{(t-1)} W^{(t)})$. We also vary the number of GCN layers in our ablation study. To enable graph level prediction, we add a global sum pooling, followed by two fully-connected layers that produce categorical probabilities over the pre-defined categories.
For the proposed GFN, we mirror our GCN model configuration to allow direct comparison. Therefore, we use the same architecture, parameterization and training setup, but replace the GCN layers with feature transformation layers (totaling four such layers). Converting a GCN layer to a feature transformation layer is equivalent to setting $\tilde{A} = I$ in the GCN layer. We also construct a faster GFN, namely “GFN-light”, that contains only a single feature transformation layer, which can further reduce the training time while maintaining similar performance.
For both our GCN and GFN, we utilize ReLU activation and batch normalization, and fix the hidden dimensionality to 128. No regularization is applied. Furthermore, we use a batch size of 128, a fixed learning rate of 0.001, and the Adam optimizer (Kingma and Ba). To compare with existing work, we follow [18, 19] and perform 10-fold cross validation. We run the model for 100 epochs, and select the epoch in the same way as Xu et al., i.e., a single epoch with the best cross-validation accuracy averaged over the 10 folds is selected. We report the average and standard deviation of test accuracies at the selected epoch over 10 folds.
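The epoch selection rule described above can be written down compactly. The sketch below assumes the per-fold accuracies are collected in an array with one row per fold and one column per epoch, and it uses the population standard deviation; the function name and the toy numbers are made up for illustration and are not results from the paper.

```python
import numpy as np

def select_epoch(acc):
    """acc[f, e] = accuracy of fold f after epoch e. Returns (epoch, mean, std)."""
    acc = np.asarray(acc, dtype=float)
    mean_per_epoch = acc.mean(axis=0)            # average over folds
    best_epoch = int(mean_per_epoch.argmax())    # one epoch shared by all folds
    return best_epoch, mean_per_epoch[best_epoch], acc[:, best_epoch].std()

# Toy numbers: 3 folds x 4 epochs (made up, not results from the paper).
acc = [[0.60, 0.70, 0.80, 0.75],
       [0.55, 0.72, 0.78, 0.80],
       [0.58, 0.69, 0.82, 0.76]]
epoch, mean_acc, std_acc = select_epoch(acc)
```

The key point is that a single epoch index is chosen for all folds jointly, rather than picking the best epoch separately within each fold.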
In terms of input node features for the proposed GFN, by default, we use both degree and multi-scale propagated features (up to scale $K$), that is $[d, X, \tilde{A}^1 X, \tilde{A}^2 X, \cdots, \tilde{A}^K X]$. We turn discrete features into one-hot vectors, and also discretize degree features into one-hot vectors, as suggested by Xu et al. We set $X = \mathbf{1}$ (an all-ones vector) for the social graphs we consider, as there are no node attributes. By default, we also augment node features in our GCN with an extra node degree feature (to counter the fact that the normalized adjacency matrix may lose the degree information). Other graph augmented features are also studied for GCN.
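One way to discretize node degrees into one-hot vectors is sketched below. The clipping of large degrees into a shared last bin (`max_degree`) is an assumption added for this example, as is the function name; the text above does not specify how high-degree nodes are handled.

```python
import numpy as np

def one_hot_degree(A, max_degree=None):
    """One-hot encode node degrees; degrees above max_degree share the last bin."""
    deg = A.sum(axis=1).astype(int)
    if max_degree is None:
        max_degree = int(deg.max())
    deg = np.minimum(deg, max_degree)
    return np.eye(max_degree + 1)[deg]

# Toy example: a path graph 0 - 1 - 2, whose node degrees are 1, 2, 1.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
D = one_hot_degree(A)
```

For graphs without node attributes, such a one-hot degree matrix can take the place of the scalar degree column in the input features.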
For MNIST, we train and evaluate on the given train/test split. Additionally, since MNIST benefits more from deeper GCN layers, we parameterize our GCN model using a residual network (He et al.) with multiple GCN blocks. The number of blocks is kept the same for GCN and GFN, and is varied according to the size of the total receptive field. GFN utilizes the same multi-scale features as in Eq. 5. All experiments are run on an Nvidia GTX 1080 Ti GPU.
4.2 Performance comparison between GFN and existing GNN variants
Biological and social datasets.
Tables 1 and 2 show the results of different methods on both biological and social datasets. It is worth noting that on both types of datasets, GFN achieves similar performance to our GCN, and matches or exceeds existing state-of-the-art results on multiple datasets. This suggests that GFN could very well approximate GCN (and other GNN variants) for these benchmarks. This result also casts doubt on the necessity of non-linear graph filtering for these benchmarks.
MNIST pixel graphs.
We report the accuracies under different total receptive field sizes (i.e. the number of hops a pixel could condition its computation on). Results in Table 3 show that, for all three receptive field sizes, GCN with non-linear neighbor aggregation outperforms GFN with linear graph propagated features. This indicates that non-linear graph filtering is essential for performing well on this dataset. Note that our results are not directly comparable to those of traditional CNNs, as our GNN does not distinguish the neighbor pixel direction in its parameterization, and a global sum pooling of pixels does not leverage spatial information. For context, when using coordinates as features, both GCN and GFN achieve nearly 99% accuracy.
4.3 Training time comparisons between GFNs and GCNs
We compare the training time of our GCN and the proposed GFNs. Figure 1 shows a significant speedup (at least 1.4× as fast) from utilizing GFN compared to GCN, especially for datasets with denser edges such as the COLLAB dataset. Also, since our GFN can work with fewer transformation layers, GFN-light can achieve a better speedup by reducing the number of transformation layers. Note that our GCN is already very efficient, as it is built on a highly optimized framework (Fey and Lenssen).
4.4 Ablations on features, architectures, and visualization
To better understand the impact of features, we test both models with different input node features. Table 4 shows that 1) graph features are very important for both GFN and GCN, 2) the node degree feature is surprisingly important, and multi-scale features can further improve on that, and 3) even with multi-scale features, GCN still performs similarly to GFN, which further suggests that linear graph filtering is enough. More detailed results (per dataset) can be found in the appendix.
Architecture depth and linear set function.
We vary the number of convolutional layers (with the two FC-layers after sum pooling kept the same), and also test the necessity of a non-linear set function by constructing GFN-flat. GFN-flat contains no feature transform layer, but just the global sum pooling followed by a single fully connected layer (mimicking multi-class logistic regression). Table 5 shows that 1) GCN benefits from multiple graph convolutional layers, with significantly diminishing returns, 2) GFN with a single feature transformation layer already works well, likely due to the availability of multi-scale input node features, which otherwise require multiple GCN layers to obtain, and 3) by collapsing GFN into a linear model (i.e. linearizing the set function) the performance degrades significantly, which demonstrates the importance of a non-linear set function.
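For reference, the GFN-flat variant described above can be written out as a single expression. The symbols used here are assumptions for this sketch: the vector for node $v$ is its row of graph augmented features, and $W$ and $b$ are the weight and bias of the single fully connected layer.

```latex
\hat{y} \;=\; \mathrm{softmax}\Big( W^\top \sum_{v \in V} X^G_v + b \Big)
        \;=\; \mathrm{softmax}\Big( \sum_{v \in V} W^\top X^G_v + b \Big)
```

The second equality holds because the sum and the linear map commute: every node contributes additively to the logits, so no interaction between nodes or between features can be modeled before the softmax.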
Figure 2 shows visualization of random and misclassified samples from the IMDB-B dataset. We could not clearly distinguish graphs from different classes easily based on their appearance, suggesting that both GFN and GCN are capturing underlying non-trivial features. More visualization from different datasets can be found in the appendix.
In this work, we conduct a dissection of GNNs based on the proposed Graph Feature Network. GFN can be seen as a simplified GNN with linear graph filtering and non-linear set function, thus it can be used as a tool to assess and understand the complexity of learned GNNs. Empirically, we evaluate the approach on common graph classification benchmarks, and show that GFN can match or exceed the best results by recently proposed GNNs, with a fraction of computation cost. Our results also provide the following new perspectives on both the functions that GNNs learn and the current benchmarks for evaluating them.
First, the fact that GCN with linear graph filtering (i.e. our GFN) performs comparably to our GCN under the same hyper-parameter settings on the tested benchmarks, suggests that non-linear graph filtering is not essential, and the GCN, potentially other GNN variants as well, may not have learned more sophisticated graph functions than linear neighbor aggregation. However, we find the non-linear set function is important, and its linearization leads to poor results.
Secondly, when we test on graphs constructed from image dataset (MNIST), the similarly configured GCN outperforms GFN by a large margin, indicating the importance of non-linear graph filtering for this type of graph dataset.
Finally, the contrasting results on the two types of graphs above seem to suggest that the commonly used graph classification benchmarks [20, 23, 18] are inadequate and not sufficiently differentiating, since linear graph filtering is powerful enough to perform well. For this reason, we encourage the community to explore and adopt more convincing benchmarks for testing advanced GNN variants, or include GFN as a standard baseline to provide a sanity check.
We would like to thank Yunsheng Bai and Zifeng Kang for their help in a related project prior to this work. We also thank Jascha Sohl-Dickstein, Yasaman Bahri, Yewen Wang, Ziniu Hu and Allan Zhou for helpful discussions and feedback. This work is partially supported by NSF III-1705169, NSF CAREER Award 1741634, and an Amazon Research Award.
- Dai et al.  Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International conference on machine learning, 2016.
- Defferrard et al.  Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, 2016.
- Fey and Lenssen  Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
- Gilmer et al.  Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, 2017.
- Hamilton et al.  Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, 2016.
- Ioffe and Szegedy  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Ivanov and Burnaev  Sergey Ivanov and Evgeny Burnaev. Anonymous walk embeddings. arXiv preprint arXiv:1805.11921, 2018.
- Kingma and Ba  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kipf and Welling  Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- Klicpera et al.  Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Combining neural networks with personalized pagerank for classification on graphs. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1gL-2A9Ym.
- Li et al.  Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
- Niepert et al.  Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International conference on machine learning, 2016.
- Scarselli et al.  Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 2009.
- Shervashidze et al.  Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 2011.
- Simonovsky and Komodakis  Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Wu et al.  Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr, Christopher Fifty, Tao Yu, and Kilian Q Weinberger. Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153, 2019.
- Xinyi and Chen  Zhang Xinyi and Lihui Chen. Capsule graph neural network. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Byl8BnRcYm.
- Xu et al.  Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
- Yanardag and Vishwanathan  Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.
- Zaheer et al.  Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in neural information processing systems, 2017.
- Zeiler and Fergus  Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, 2014.
- Zhang et al.  Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Appendix A Proofs
Here we provide the proof for Proposition 1.
Appendix B Detailed statistics of datasets
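Biological graphs:
|Dataset|MUTAG|NCI1|PROTEINS|D&D|ENZYMES|
|Avg # nodes|17.93|29.87|39.06|284.32|32.63|
|Avg # edges|19.79|32.30|72.82|715.66|62.14|
Social graphs:
|Dataset|COLLAB|IMDB-B|IMDB-M|RE-M5K|RE-M12K|
|Avg # nodes|74.49|19.77|13.00|508.52|391.41|
|Avg # edges|2457.78|96.53|65.94|594.87|456.89|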
Appendix C Detailed performances with different features
Table 8 shows the performance under different graph features for GNNs and GFNs. It is evident that both models benefit significantly from graph features, especially GFNs.
Appendix D Detailed performances with different architecture depths
Table 9 shows the performance per dataset under different numbers of layers.
Appendix E Detailed visualizations
Figures 3, 4, 6, and 5 show the random and misclassified samples for MUTAG, PROTEINS, IMDB-B, and IMDB-M, respectively. In general, it is difficult to find the patterns of each class by visually examining the graphs. The misclassified patterns are also not visually distinguishable, except for the IMDB-B/IMDB-M datasets, where some graphs seem ambiguous.