Graph-centered machine learning has received significant interest in recent years due to the ubiquity of graph-structured data and its importance in solving numerous real-world problems such as semi-supervised node classification, community detection, graph classification and link prediction Zhu (2005); Fortunato (2010); Shervashidze et al. (2011); Lü and Zhou (2011). Usually, the data at hand contains two sources of information: node features and graph topology. For example, in social networks, nodes represent users with different combinations of interests and properties, which constitute the features; edges capture friendship and collaboration relations that may or may not depend on the node features. Hence, learning methods that can simultaneously exploit node features and graph topology are of great importance.
Graph neural networks (GNNs) leverage their representational power to provide state-of-the-art performance when addressing the above described application domains. Many GNNs use message passing Gilmer et al. (2017); Battaglia et al. (2018) to manipulate node features and graph topology. They are constructed by stacking neural network layers which essentially propagate and transform node features over the given graph topology. Different types of network layers have been proposed and used in practice, including graph convolutional layers (GCN) Defferrard et al. (2016); Bruna et al. (2014); Kipf and Welling (2017), graph attention layers (GAT) Veličković et al. (2018) and many others Hamilton et al. (2017); Wijesinghe and Wang (2019); Zeng et al. (2020); Abu-El-Haija et al. (2019). Although in principle an arbitrary number of layers may be stacked, practical models are usually shallow, as such architectures are known to achieve better empirical performance than deep networks.
A widely accepted explanation for the performance degradation of GNNs with increasing depth is feature over-smoothing, which may be intuitively explained as follows. GNN feature propagation is a form of random walk over the node features on the graph, and under proper conditions such random walks converge at an exponential rate to their stationary distributions. This levels the expressive power of the features and renders them nondiscriminative. This intuitive reasoning was first described for linear settings in Li et al. (2018) and has recently been studied in Oono and Suzuki (2020) for a setting involving nonlinear rectifiers.
While GNNs essentially attempt to combine the information provided by node features and graph topology, topological information alone forms the basis of many graph-based learning paradigms that perform “large-step” propagation over graphs. For example, traditional label propagation methods allow for many steps of message passing in order to achieve good performance for semi-supervised learning without node features Zhu et al. (2003); various forms of PageRank, centered around random walks on graphs, offer excellent community detection performance Kloumann and Kleinberg (2014); Kloster and Gleich (2014). An important recent theoretical finding reported in Li et al. (2019b) shows that topological information obtained through random walks has strong discriminatory power even after a large number of propagation steps. Hence, a natural question arises: Is it possible to design a GNN method that optimally combines node and topological features, as they require different depths of message propagation?
We address this question by combining GNNs with Generalized PageRank (GPR) techniques within a new model termed GPR-GNN. The GPR-GNN architecture is designed to first learn the hidden features and then to propagate them via GPR techniques. The focal component of the network is the GPR procedure, which associates each step of feature propagation with a learnable weight, with the added constraint that all weights are normalized jointly. The weights emphasize the contributions of different steps during the information propagation procedure and further adaptively trade off the degree of smoothing of node features against the aggregation potential of topological features. The GPR-GNN scheme significantly differs from recent GNN models that utilize PageRanks with fixed weights Wu et al. (2019); Klicpera et al. (2019a, b)
and fail to properly exploit the benefits of both node and topological features. The fixed weights are chosen heuristically and may require complex additional hyper-parameter tuning. At the same time, other works that address the feature over-smoothing issue Xu et al. (2018); Rong et al. (2020); Li et al. (2019a); Zeng et al. (2020); Zhao and Akoglu (2020) do not provide theoretical explanations of the actual role of topological features.
The performance improvements of GPR-GNN are demonstrated both theoretically and empirically. Theoretically, GPR-GNN can provably mitigate the feature over-smoothing issue adaptively, even after large-step propagation. The key intuition is that when large-step propagation leads to feature over-smoothing, the corresponding GPR weights are automatically forced towards zero as they are recognized as not being “helpful”—they do not contribute towards a decrease of the loss function. Thus, the effect of large-step propagation on feature over-smoothing is reduced until the point that GPR-GNN escapes this effect. Moreover, GPR-GNN allows for model interpretability via the learned weights of different steps, and may assist users in better understanding the importance of node features and graph topology in their graph-structured data.
To better understand how GPR-GNN trades off node and topological feature exploration, we first describe the recently proposed contextual stochastic block model (cSBM) used to evaluate our learner Deshpande et al. (2018). The cSBM allows for smoothly controlling the “information ratio” between node features and graph topology. We show that GPR-GNN outperforms all other baseline methods for the task of semi-supervised node classification on the cSBM. We then proceed to show that GPR-GNN offers state-of-the-art performance on node-classification benchmark real-world datasets. In addition, GPR-GNN is simple to implement, and we also describe certain techniques that make GPR-GNN end-to-end trainable. Our training techniques themselves may be of independent practical interest.
The remainder of the paper is organized as follows. In Section 2 we introduce the relevant notation and review the GCN architecture and the over-smoothing problem. In Section 3 we introduce the ideas behind our GPR-based approach and the GPR-GNN model. The theoretical analysis of GPR-GNN is presented in Section 4. The experimental setup is described in Section 5, while our new synthetic datasets based on the cSBM are introduced in Section 5.1. Finally, the experimental results on benchmark real-world datasets are presented in Section 5.2. Due to space limitations, all proofs are deferred to the Supplementary Material.
Let $G = (V, E)$ be an undirected graph with node set $V$ and edge set $E$. Let $n = |V|$ denote the number of nodes, assumed to belong to one of $C$ classes. The nodes are associated with the node feature matrix $\mathbf{X} \in \mathbb{R}^{n \times f}$, where $f$ denotes the number of features per node. The label matrix is denoted by $\mathbf{Y} \in \{0,1\}^{n \times C}$, where each row is a one-hot vector. Throughout the paper, we use $\mathbf{M}_{i:}$ to indicate the $i$-th row and $\mathbf{M}_{:j}$ to indicate the $j$-th column of a matrix $\mathbf{M}$. We also use $\mathrm{am}(\mathbf{v})$ to denote the one-hot argmax indicator of a vector $\mathbf{v}$: we have $\mathrm{am}(\mathbf{v})_i = 1$ if and only if $i = \arg\max_j v_j$ (ties are broken evenly), and $\mathrm{am}(\mathbf{v})_i = 0$ otherwise. The symbol $\delta_{ij}$ is reserved for the Kronecker delta function. We use $\langle \cdot, \cdot \rangle$ to denote the standard Euclidean inner product.
The graph is described by the adjacency matrix $\mathbf{A}$, while $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ describes the graph with added self-loops. We let $\tilde{\mathbf{D}}$ be the diagonal degree matrix of $\tilde{\mathbf{A}}$, where $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. We write $\tilde{\mathbf{A}}_{sym} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$ for the symmetrically normalized adjacency matrix with self-loops.
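As a concrete reference, the normalized operator defined above can be formed directly from an edge list. The following sketch (plain NumPy with dense matrices for clarity; the function name is our own) is illustrative only:

```python
import numpy as np

def sym_norm_adj(edges, n):
    """Build the symmetrically normalized adjacency with self-loops,
    A_tilde_sym = D_tilde^{-1/2} (A + I) D_tilde^{-1/2}."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0   # undirected edge
    A_tilde = A + np.eye(n)       # add self-loops
    d = A_tilde.sum(axis=1)       # degrees in the self-loop graph
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt
```

For a triangle graph, every node has self-loop degree 3, so the normalized matrix is the constant matrix with entries 1/3.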
GCN and the over-smoothing problem. One of the key components of most GNN models is the graph convolutional layer. Specifically, the function performed by the $k$-th layer of a GCN can be written as:
$\mathbf{H}^{(k)} = \mathrm{ReLU}\big(\tilde{\mathbf{A}}_{sym} \mathbf{H}^{(k-1)} \mathbf{W}^{(k)}\big), \qquad \mathbf{H}^{(0)} = \mathbf{X},$
where $\mathbf{H}^{(k)}$ denotes the output of the $k$-th layer and $\mathbf{W}^{(k)}$ represents the trainable weight matrix of that layer. The key issue that limits stacking many such layers is the over-smoothing phenomenon: If one were to remove the nonlinear rectifier ReLU in the above expression, then, provided that the graph is irreducible and aperiodic, $\tilde{\mathbf{A}}_{sym}^k \mathbf{X}$ converges as $k$ grows to a matrix in which each row depends only on the degree of the corresponding node. This causes the model to lose the discriminatory power of the node features.
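The linear-case convergence can be checked numerically. The following sketch (a toy path graph and random features; all constants are illustrative) shows that after many linear propagation steps, every feature row collapses onto a multiple of the square-root degree vector:

```python
import numpy as np

# Path graph on 4 nodes, self-loops added, symmetric normalization.
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
A_tilde = A + np.eye(4)
d = A_tilde.sum(1)
A_sym = A_tilde / np.sqrt(np.outer(d, d))

X = np.random.RandomState(0).randn(4, 2)   # toy node features
H = X.copy()
for _ in range(400):                       # many linear propagation steps
    H = A_sym @ H

# Each row of H is now (nearly) proportional to sqrt(d_i): dividing
# row i by sqrt(d_i) makes all rows identical, i.e. nondiscriminative.
R = H / np.sqrt(d)[:, None]
rows_identical = np.allclose(R, R[0], atol=1e-5)
print(rows_identical)                      # prints True
```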
3 GPR-GNN: Motivation
Generalized PageRanks and the trade-off between node and topological features. Generalized PageRank (GPR) methods were first used in the context of unsupervised graph clustering, where they showed significant performance improvements over existing methods Kloumann et al. (2017). The operational principle of GPRs can be succinctly described as follows. Given a seed node $s$ in some cluster of the graph, a one-dimensional feature vector is initialized as the indicator $x^{(0)} = \mathbf{e}_s$ of the seed. The GPR score is defined as $\sum_{k \geq 0} \gamma_k \tilde{\mathbf{A}}_{sym}^k x^{(0)}$, where the parameters $\{\gamma_k\}_{k \geq 0}$ are referred to as the GPR weights. Clustering of the graph is performed locally by thresholding the GPR score. The authors of Li et al. (2019b) recently introduced and theoretically analyzed a special form of GPR termed Inverse PR and showed that long random walk paths are more beneficial for clustering than previously assumed, provided that the GPR weights are properly selected. The findings relevant to this work are summarized in Figure 1 (b), depicting the predictive power of the $k$-step propagation of the seed indicator with increasing $k$ for clustering tasks based on graph topology only (a detailed explanation of the experiments is provided in the Supplement). The importance of long random walk paths is evident, since propagating information through only a few steps of a random walk is insufficient to capture all relevant graph-topology features.
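A truncated GPR score can be computed with one sparse-friendly propagation step per weight. The sketch below (the function name and weight values are our own, for illustration) evaluates $\sum_{k=0}^{K} \gamma_k \tilde{\mathbf{A}}_{sym}^k \mathbf{e}_s$:

```python
import numpy as np

def gpr_scores(A_sym, seed, gamma):
    """GPR score vector sum_k gamma[k] * A_sym^k e_seed, computed
    with one matrix-vector product per propagation step."""
    n = A_sym.shape[0]
    x = np.zeros(n)
    x[seed] = 1.0                 # one-hot seed indicator
    score = gamma[0] * x
    for g in gamma[1:]:
        x = A_sym @ x             # one propagation step
        score = score + g * x
    return score
```

With geometrically decaying weights $\gamma_k = \alpha(1-\alpha)^k$ this yields a truncated Personalized PageRank; clustering then thresholds the resulting scores.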
A significant problem of GNNs is that while large-step propagation helps with extracting topological features, it also introduces over-smoothing effects for the node features (Section 2). The most important premise of our work is that if one allows for training the GPR weights, it becomes possible to trade off the large-step topology exploration benefits against the feature over-smoothing loss. This also strongly motivates combining GPRs with GNNs. The main findings of our work are summarized in Section 4, where we theoretically prove that, after many steps of the random walk, if node features become smoothed to an extent that does not allow for reliable predictions, GPR-GNN naturally biases the GPR weights corresponding to those steps towards zero. This further emphasizes the need for trainable GPR weights in models combining random walks with GNNs.
(a) Hidden state feature extraction is performed by a neural network applied to individual node features, after which the hidden states are propagated via GPR. Note that both the GPR weights $\gamma_k$ and the parameter set $\theta$ of the neural network are learned simultaneously in an end-to-end fashion (as indicated in red). In our implementation, we further restrict the weights $\gamma_k$ to be probabilities (nonnegative values that sum up to one). (b) The clustering performance based on $k$-step propagation (the suffix -d is used for degree normalization). As may be observed, a few propagation steps are insufficient to capture the graph topological features and large-step propagation is necessary.
The GPR-GNN Model. GPR-GNN first extracts hidden state features for each node and then uses GPR to propagate them. The GPR-GNN process can be mathematically described as:
$\mathbf{H}^{(0)} = f_{\theta}(\mathbf{X}), \qquad \mathbf{H}^{(k)} = \tilde{\mathbf{A}}_{sym} \mathbf{H}^{(k-1)}, \qquad \hat{\mathbf{Y}} = \mathrm{softmax}\Big(\sum_{k=0}^{K} \gamma_k \mathbf{H}^{(k)}\Big),$
where $\mathbf{X}$ denotes the node feature matrix and $f_{\theta}(\cdot)$ represents a neural network with parameter set $\theta$ that generates the hidden state features $\mathbf{H}^{(0)}$. The GPR weights $\gamma_k$ are trainable, and we restrict them to be nonnegative with $\sum_{k=0}^{K} \gamma_k = 1$. This setting differs significantly from that used in APPNP Klicpera et al. (2019a), SGC Wu et al. (2019) and GDC Klicpera et al. (2019b), for which the propagation rules cannot be changed adaptively as in GPR-GNN. Also, it can easily be seen that APPNP and SGC are special cases of our model, as APPNP fixes $\gamma_k = \alpha(1-\alpha)^k$, while SGC removes all nonlinearities and fixes $\gamma_k = \delta_{kK}$ for some integer $K$. The former weight choice defines a Personalized PageRank (PPR) Jeh and Widom (2003), which is known to be suboptimal compared to some other GPR frameworks for community learning problems Li et al. (2019b). Fixing the GPR weights makes the model unable to adaptively learn the optimal propagation rules, which is of crucial importance: As we show in Section 4, learning the optimal GPR weights is not only critical for mitigating the effect of over-smoothing but also relates to optimal graph filtering. In practice, we find that the normalization of the weights is important to ensure that the model can be trained stably. Moreover, a standard co-training procedure for the GPR weights and the other parameters does not work. Therefore, we introduce a novel training technique termed “gradient dropout” to facilitate learning the GPR weights. Detailed discussions of our implementation can be found in the Supplement.
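The forward pass above can be sketched in a few lines of NumPy. Here `f_theta` stands in for the trained neural network, and the explicit renormalization of the weights is an illustrative stand-in for the constrained training described in the Supplement:

```python
import numpy as np

def gpr_gnn_forward(A_sym, X, gamma, f_theta):
    """GPR-GNN forward pass: H0 = f_theta(X), H_k = A_sym H_{k-1},
    predictions = softmax(sum_k gamma_k H_k)."""
    gamma = np.asarray(gamma, dtype=float)
    gamma = gamma / gamma.sum()          # keep weights a probability vector
    H = f_theta(X)                       # hidden state features H^(0)
    Z = gamma[0] * H
    for g in gamma[1:]:
        H = A_sym @ H                    # one GPR propagation step
        Z = Z + g * H
    # row-wise softmax for class predictions
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)
```

Note that only $K$ sparse matrix products are needed, so the propagation adds no trainable matrices beyond the $K+1$ scalars $\gamma_k$.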
Another benefit of the GPR-GNN model is its interpretability. As already pointed out, GPR-GNN has the ability to adaptively control the contribution of each propagation step, which allows for appropriate smoothing of node features over long paths. Examining the learned GPR weights also helps in understanding the properties of the GPR-GNN method itself. To this end, we show by experiments that for the first few steps of the random walk the learned GPR weights of GPR-GNN are close to those of PPR, while the remaining GPR weights are significantly different. This explains why the heuristic PPR propagation often works well in practice but nevertheless offers suboptimal performance.
4 Theoretical properties of GPR-GNN
Escaping from over-smoothing. As already emphasized, the most crucial innovation of the GPR-GNN method is to make the GPR weights adaptively learnable, which allows GPR-GNN to avoid over-smoothing and trade off node and topology feature informativeness. Intuitively, when large-step propagation is not beneficial, it increases the training loss. Hence the gradient of the corresponding GPR weight becomes positive, which subsequently reduces its value. We prove this intuitive observation by first characterizing the behavior of $\tilde{\mathbf{A}}_{sym}^k$ for sufficiently large $k$, corresponding to large-step propagation, via Lemma 4.1. Based on this result, we then mathematically formulate the over-smoothing phenomenon in Definition 4.2. Through a careful analysis of the gradients of the GPR weights for sufficiently large $k$, we derive their closed form in Theorem 4.3. We then show that the gradient remains positive until GPR-GNN resolves the over-smoothing effect.
Lemma 4.1. Assume that the nodes of an undirected and connected graph $G$ belong to one of $C$ classes. Then, for $k$ large enough, we have
$\big(\tilde{\mathbf{A}}_{sym}^k\big)_{ij} \approx \frac{\sqrt{\tilde{D}_{ii}\,\tilde{D}_{jj}}}{2|E| + n}.$
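This limiting behavior can be verified numerically: for a connected graph with self-loops, $\tilde{\mathbf{A}}_{sym}^k$ converges to the rank-one matrix with entries $\sqrt{\tilde{D}_{ii}\tilde{D}_{jj}}/(2|E|+n)$, since $\tilde{\mathbf{D}}^{1/2}\mathbf{1}$ is the dominant eigenvector. The sketch below uses a small illustrative edge set of our own choosing:

```python
import numpy as np

n = 6
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A_tilde = A + np.eye(n)                    # add self-loops
d = A_tilde.sum(1)                         # degrees with self-loops
A_sym = A_tilde / np.sqrt(np.outer(d, d))  # D^{-1/2} (A+I) D^{-1/2}

# Predicted limit: rank-one matrix sqrt(d_i d_j) / (2|E| + n).
limit = np.sqrt(np.outer(d, d)) / d.sum()  # d.sum() equals 2|E| + n
P = np.linalg.matrix_power(A_sym, 1000)
gap = np.abs(P - limit).max()              # vanishingly small for large k
```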
For any node $i$ and large enough $k$, if the label prediction is dominated by $\mathbf{H}^{(k)}$, every node acquires a representation proportional to $\sqrt{\tilde{D}_{ii}}$ times a common vector. Hence, we arrive at the same label for all nodes. This is what we refer to as the over-smoothing phenomenon.
Definition 4.2 (The over-smoothing phenomenon).
First, recall that $\hat{\mathbf{Y}} = \mathrm{softmax}\big(\sum_{k=0}^{K} \gamma_k \mathbf{H}^{(k)}\big)$. If over-smoothing occurs in GPR-GNN for $k$ large enough, we have $\arg\max_j \hat{\mathbf{Y}}_{ij} = c$ for all nodes $i$ and some fixed class $c$.
We are now ready to state our main result. For simplicity, we present an informal version here; the formal statement can be found in the Supplement.
Theorem 4.3 (Informal). Under the same assumptions as those listed in Lemma 4.1, if the training set contains nodes from more than one class, then the GPR-GNN method can always avoid over-smoothing. More specifically, let $L$ be the cross-entropy loss over the training set. Then, for $k$ large enough, we have
$\frac{\partial L}{\partial \gamma_k} > 0.$
Since we introduced a self-loop for each node, the graph is aperiodic and Lemma 4.1 applies. Ignoring vanishing terms, the closed-form gradient expression in Equation (3) is nonnegative, and equality holds if and only if the over-smoothed prediction is in perfect alignment with the ground truth labels in the training set. However, if the training set contains nodes from different classes, equality cannot hold. Thus the gradient of $\gamma_k$ remains positive, and $\gamma_k$ keeps decreasing until GPR-GNN resolves the over-smoothing effect. Note that this is possible only when the GPR weights are adaptively trained.
Equivalence of the GPR method and polynomial graph filtering. We start by recalling that $\tilde{\mathbf{A}}_{sym} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$. Next, let $\tilde{\mathbf{A}}_{sym} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^T$ be the eigenvalue decomposition of $\tilde{\mathbf{A}}_{sym}$. As a result, we have
$\sum_{k=0}^{K} \gamma_k \tilde{\mathbf{A}}_{sym}^k = \mathbf{U} \Big( \sum_{k=0}^{K} \gamma_k \boldsymbol{\Lambda}^k \Big) \mathbf{U}^T,$
which represents a polynomial graph filter of order $K$ with the GPR weights serving as the filter coefficients. Note that restricting the GPR weights to be nonnegative leads to low-pass filtering, since the relevant eigenvalues of $\tilde{\mathbf{A}}_{sym}$ are nonnegative (see the proof of Lemma 4.1 in the Supplement for details).
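The spectral identity above can be checked numerically on a toy operator (the weight values are illustrative):

```python
import numpy as np

# A small normalized adjacency (triangle with self-loops).
A_sym = np.ones((3, 3)) / 3.0
gamma = np.array([0.5, 0.3, 0.2])       # example GPR weights, summing to one

# Direct GPR operator: sum_k gamma_k A_sym^k.
direct = sum(g * np.linalg.matrix_power(A_sym, k) for k, g in enumerate(gamma))

# Spectral form: U diag(sum_k gamma_k lambda^k) U^T.
lam, U = np.linalg.eigh(A_sym)
filt = sum(g * lam**k for k, g in enumerate(gamma))  # polynomial of eigenvalues
spectral = (U * filt) @ U.T

forms_agree = np.allclose(direct, spectral)
```

Learning the $\gamma_k$ thus amounts to learning the polynomial filter response applied to the graph spectrum.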
Homophily versus Heterophily. The homophily principle McPherson et al. (2001); Bojchevski et al. (2019), in the context of node classification, asserts that nodes from the same class tend to form edges. This is a common assumption in graph clustering, especially in the unsupervised setting Von Luxburg (2007); Tsourakakis (2015); Dau and Milenkovic (2017). However, there exist practical scenarios in which the graphs are heterophilic, in which case node classification is significantly more difficult Jia and Benson (2020). In this case, restricting the GPR weights to be nonnegative is inappropriate. By allowing the GPR weights to take arbitrary values, GPR-GNN can be adjusted to accommodate heterophilic graphs. Nevertheless, in practice, we find that restricting the GPR weights to be nonnegative always leads to better results on benchmark datasets. This is due to the fact that all the networks appear to obey the homophily principle, which we also empirically verified and report in Table 2. An in-depth study of GPR-GNN under heterophilic graph assumptions is left as future work.
GPR-GNN prevents over-fitting. Besides over-smoothing, another important problem that limits the depth of GCN-like models is over-fitting. Recall that in GCN, each additional propagation step requires introducing an additional trainable weight matrix. In contrast, GPR-GNN needs to introduce only one additional scalar parameter per step, which does not affect the complexity of the neural network component. Combined with Theorem 4.3, this shows that GPR-GNN simultaneously mitigates the over-smoothing and over-fitting problems with a very simple GNN architecture.
5 The experimental setup
Our experimental setup examines the semi-supervised node classification task in the sparse label regime and transductive setting. The sparse label regime is of much greater practical importance than the rich label setting: Obtaining node labels is expensive and time-consuming Cohn et al. (1994), and there is significant ongoing effort to address this problem via active learning on graphs Bilgic et al. (2010); Dasarathy et al. (2015); Chien et al. (2019, 2020).
The authors of Shchur et al. (2018) recently pointed out issues that arise with experimental setups that consider only a single training/validation/test split of the data. To address this concern, we use random splits in all experiments. For all datasets, we repeat each experiment with multiple random splits and initializations, unless specified otherwise.
Comparable models. We compare GPR-GNN with state-of-the-art models: GCN Kipf and Welling (2017), GAT Veličković et al. (2018), GraphSAGE Hamilton et al. (2017), JK-Net Xu et al. (2018), APPNP Klicpera et al. (2019a) and SGC Wu et al. (2019)
. For all these architectures, we use the corresponding Pytorch Geometric library implementations Fey and Lenssen (2019) with proper hyper-parameter tuning. (Note that JK-Net, APPNP and SGC, per design, may heuristically capture some topological information via large-step propagation.) We test APPNP with both $\alpha = 0.1$ and $\alpha = 0.2$ and denote the two implementations by APPNP(0.1) and APPNP(0.2), respectively.
The GPR-GNN model setup. We fix the random walk path length $K$ and use a multilayer perceptron (MLP) for the NN component $f_{\theta}$. For the GPR weights, we use peak initialization: We choose a large value for the central GPR weight and set all other GPR weights to be small, while enforcing that the GPR weights form a probability mass function. As mentioned above, we also use a new technique termed “gradient dropout” for learning the GPR weights, which greatly improves their quality. Further experimental settings are discussed in the Supplement.
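Peak initialization itself is straightforward to sketch; in the snippet below the peak position and the peak value `large` are hypothetical placeholders (the values used in our experiments are given in the Supplement):

```python
import numpy as np

def peak_init(K, peak, large=0.9):
    """Peak initialization of the K+1 GPR weights: place most of the
    probability mass on one step and spread the rest uniformly."""
    gamma = np.full(K + 1, (1.0 - large) / K)  # small residual mass
    gamma[peak] = large                        # the peak weight
    return gamma                               # sums to one by construction
```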
5.1 Testing new cSBM synthetic datasets
The cSBM Deshpande et al. (2018) allows for gradual testing of the trade-off between the node features and the graph topology in the learning process. In the cSBM, the node features are Gaussian random vectors, where the mean of the Gaussian depends on the community assignment. The difference of the means is controlled by a parameter $\mu$, while the difference of the edge densities within and between the communities is controlled by a parameter $\lambda$. Hence $\mu$ and $\lambda$ capture the “relative informativeness” of the node features and the graph topology, respectively. The information-theoretic limits of reconstruction for the cSBM are characterized in Deshpande et al. (2018). The results show that, asymptotically, one needs $\lambda^2 + \mu^2/\xi > 1$ to ensure a vanishing ratio of the number of misclassified nodes to the total number of nodes, where $\xi = n/f$ and $f$ denotes the dimension of the node features.
Note that given a tolerance value $\epsilon$, the set $\{(\lambda, \mu) : \lambda^2 + \mu^2/\xi = 1 + \epsilon\}$ is an arc of an ellipsoid. To fairly and continuously control the extent of information carried by the node features and the graph topology, we introduce a parameter $\phi$ that interpolates along this arc. The setting $\phi = 0$ indicates that only the node features are informative, while $|\phi| = 1$ indicates that only the graph topology is informative. Due to space limitations we refer the interested reader to Deshpande et al. (2018) for a review of all formal theoretical results and only outline the cSBM properties needed for our analysis. Additional information is also available in the Supplementary Material.
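For reference, a minimal two-community cSBM sampler under the parametrization of Deshpande et al. (2018) can be sketched as follows; the average degree `d` and the exact scaling constants are illustrative assumptions, not the settings used in our experiments:

```python
import numpy as np

def sample_csbm(n, f, lam, mu, d=5, rng=None):
    """Sample a two-community cSBM: an SBM graph whose within/between edge
    densities are split by lam, plus Gaussian node features whose means
    differ across communities by an amount controlled by mu."""
    rng = rng or np.random.RandomState(0)
    y = rng.choice([-1, 1], size=n)                 # community labels
    u = rng.randn(f) / np.sqrt(f)                   # random mean direction
    X = np.sqrt(mu / n) * np.outer(y, u) + rng.randn(n, f) / np.sqrt(f)
    # edge probs: (d + lam*sqrt(d))/n within, (d - lam*sqrt(d))/n across
    same = np.equal.outer(y, y)
    P = np.where(same, d + lam * np.sqrt(d), d - lam * np.sqrt(d)) / n
    A = (rng.rand(n, n) < P).astype(float)
    A = np.triu(A, 1)
    A = A + A.T                                     # undirected, no self-loops
    return A, X, y
```

Setting $\lambda = 0$ leaves only the features informative, while $\mu = 0$ leaves only the topology informative, mirroring the role of $\phi$ above.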
Results. We examine the robustness of all baseline methods and GPR-GNN using cSBM-generated data over a range of $\phi$ values. The results are summarized in Table 1. We further tested the performance of an MLP as the reference method that does not use the graph topology. When the node features carry most of the information, GPR-GNN offers the same performance as the other GNNs. However, as the graph topology information becomes more important, GPR-GNN significantly outperforms all methods, with a gain that increases with the importance of the topology. Clearly, in practice, graph-structured information is usually highly relevant and readily available, which makes a strong case for GPR-GNN. This can also be directly verified from Table 2 by observing the presence of strong homophily in all benchmark datasets.
Table 1: Average accuracy and its corresponding standard deviation on the cSBM. Note that GPR-GNN performs significantly better starting from the parameter value for which the graph topology is at least as important as the node feature information.
5.2 Experiments on benchmark datasets
We use benchmark datasets available from the Pytorch Geometric library, including the citation networks Cora, CiteSeer, PubMed Sen et al. (2008); Yang et al. (2016) and DBLP Pan et al. (2016); Bojchevski and Günnemann (2018), the coauthor networks CS and Physics Shchur et al. (2018), the Amazon co-purchase graphs Computers and Photo McAuley et al. (2015); Shchur et al. (2018) and the Reddit posts graph Hamilton et al. (2017). We summarize the dataset statistics in Table 2. We also evaluate the in-class edge density and the cross-class edge density in the graphs, which confirms that the aforementioned homophily properties are at work in each of the tested datasets.
We use accuracy (the micro-F1 score) as the evaluation metric, and the relevant results are summarized in Table 3. We also report the relative accuracy and ranking to enable a more detailed comparison. With respect to the relative accuracy, for each dataset we normalize the accuracy of each model by the best model accuracy. In the context of rankings, a smaller ranking indicates better performance. We also compute the average relative accuracy and average ranking for each model across all datasets. Due to space limitations, we only report the average accuracy and average ranking; additional results are available in the Supplementary Material.
Table 3 shows that, in general, GPR-GNN outperforms all methods tested: It outperforms all other methods on all but two of the benchmark datasets, while on the remaining two (Computers and Reddit) it offers only slightly worse results than GAT. However, note that both GAT and GraphSAGE are significantly more memory intensive due to their complex architectures. Also, although GAT and GraphSAGE can perform batch training on small datasets such as Cora and CiteSeer, we surprisingly find that neighborhood sampling Hamilton et al. (2017) produces better results. Thus, for GAT and GraphSAGE we only report neighborhood sampling results for all the datasets. The Reddit dataset is too large for any method to perform training in the batch setting; thus, we apply neighborhood sampling for all methods when using the Reddit data and execute fewer runs. Note that GAT and GraphSAGE are inherently designed to operate in conjunction with neighborhood sampling, which is not the case for GPR-GNN. This may be one of the reasons why GPR-GNN performs slightly worse than GAT on Reddit (the study of optimal mini-batch settings for GPR-GNN is beyond the scope of this paper). GPR-GNN also significantly outperforms all baseline methods in terms of average relative accuracy and average ranking.
Furthermore, we observe that GPR-GNN consistently outperforms APPNP(0.1) and APPNP(0.2) on all benchmark datasets. This shows that GPR-GNN is indeed more accurate than APPNP. Moreover, fixing $\alpha$ in the APPNP model is not a robust strategy. For example, APPNP(0.1) performs poorly on Computers while APPNP(0.2) performs poorly on Cora. This once again emphasizes the importance of leveraging GPR as the propagation scheme and learning adequate GPR weights.
As a final remark, observe that SGC does not perform well on the Computers and Photo datasets. We conjecture that in our sparse label regime, a nonlinearity is necessary for GNNs to learn meaningful results. This agrees with the recent finding reported in Bojchevski et al. (2019) that SGC may break down on some “hard” tasks.
The blue dotted lines denote the upper and lower quartiles. The magnitude (y-axis) is shown in log-scale.
Model interpretability. We examined the GPR weights learned by GPR-GNN on Cora and Photo in more detail (Figure 2 (a), (b)). Interestingly, we find that for the first few steps of GPR the learned weights are close to the PPR weights with $\alpha = 0.1$ or $\alpha = 0.2$. This explains why APPNP with these choices of $\alpha$ offers good performance in practice; it also shows that GPR-GNN can learn the appropriate GPR weights. Except for the weight that is large due to our peak initialization procedure, the large-step GPR weights remain small. This is due to the high correlation between large-step propagation results, which was also observed in the studies of GPRs reported in Kloumann et al. (2017); Li et al. (2019b).
Escaping from over-smoothing. To test the ability of GPR-GNN to mitigate over-smoothing, we place the peak of the GPR weight initialization at the largest step $K$. This forces GPR-GNN to start over-smoothing from the first epoch and produce the same label prediction for all nodes. For Cora, we find that in almost all runs GPR-GNN predicts the same label for all nodes during the first epochs, which implies that over-smoothing indeed occurs immediately. From Figure 2 (c) we can observe that GPR-GNN escapes over-smoothing, as the GPR weights of the first few steps increase significantly. Moreover, on average, the final prediction accuracy is much higher than the initial accuracy in the first epoch. Similar results can be observed for other datasets.
- Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research 18 (1), pp. 6446–6531. Cited by: §13.
- MixHop: higher-order graph convolutional architectures via sparsified neighborhood mixing. In International Conference on Machine Learning, pp. 21–29. Cited by: §1.
- Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Cited by: §1.
- Active learning for networked data. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 79–86. Cited by: §5.
- Deep gaussian embedding of graphs: unsupervised inductive learning via ranking. In International Conference on Learning Representations. Cited by: §5.2.
- Is pagerank all you need for scalable graph neural networks?. In ACM KDD, MLG Workshop, Cited by: §4, §5.2.
- Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR 2014). Cited by: §1.
- Active learning in the geometric block model. In Thirty-Fourth AAAI Conference on Artificial Intelligence. Cited by: §5.
- Active learning over hypergraphs with pointwise and pairwise queries. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2466–2475. Cited by: §5.
- Improving generalization with active learning. Machine learning 15 (2), pp. 201–221. Cited by: §5.
- S2: an efficient graph based active learning algorithm with application to nonparametric classification. In Conference on Learning Theory, pp. 503–522. Cited by: §5.
- Latent network features and overlapping community discovery via boolean intersection representations. IEEE/ACM Transactions on Networking 25 (5), pp. 3219–3234. Cited by: §4.
- Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852. Cited by: §1.
- Contextual stochastic block models. In Advances in Neural Information Processing Systems, pp. 8581–8593. Cited by: §1, Theorem 13.1, §13, §13, §13, §5.1, §5.1.
- Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428. Cited by: §11.2, §5.
- Community detection in graphs. Physics reports 486 (3-5), pp. 75–174. Cited by: §1.
- Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272. Cited by: §1.
- Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §1, Table 4, Table 5, §5.2, §5.2, §5.
- Scaling personalized web search. In Proceedings of the 12th international conference on World Wide Web, pp. 271–279. Cited by: §3.
- Outcome correlation in graph neural network regression. arXiv preprint arXiv:2002.08274. Cited by: §4.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §11.2.
- Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §1, §5.
- Combining neural networks with personalized pagerank for classification on graphs. In International Conference on Learning Representations. Cited by: §1, §10, §3, §5.
- Diffusion improves graph learning. In Advances in Neural Information Processing Systems, pp. 13333–13345. Cited by: §1, §3.
- Heat kernel based community detection. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1386–1395. Cited by: §1.
- Community membership identification from small seed sets. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1366–1375. Cited by: §1.
- Block models and personalized pagerank. Proceedings of the National Academy of Sciences 114 (1), pp. 33–38. Cited by: §3, §5.2.
- DeepGCNs: can GCNs go as deep as CNNs?. In The IEEE International Conference on Computer Vision (ICCV). Cited by: §1.
- Optimizing generalized pagerank methods for seed-expansion community detection. In Advances in Neural Information Processing Systems, pp. 11705–11716. Cited by: §1, §13, §3, §3, §5.2.
- Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.
- Link prediction in complex networks: a survey. Physica A: statistical mechanics and its applications 390 (6), pp. 1150–1170. Cited by: §1.
- Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52. Cited by: §5.2.
- Birds of a feather: homophily in social networks. Annual review of sociology 27 (1), pp. 415–444. Cited by: §4.
- Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations, Cited by: §1.
- Tri-party deep network representation. Network 11 (9), pp. 12. Cited by: §5.2.
- DropEdge: towards deep graph convolutional networks on node classification. In International Conference on Learning Representations, External Links: Cited by: §1.
- Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §5.2.
- Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868. Cited by: §5.2, §5.
- Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12 (77), pp. 2539–2561. Cited by: §1.
- Provably fast inference of latent features from networks: with applications to learning social circles and multilabel classification. In Proceedings of the 24th International Conference on World Wide Web, pp. 1111–1121. Cited by: §4.
- Graph attention networks. In International Conference on Learning Representations, External Links: Cited by: §1, §5.
- A tutorial on spectral clustering. Statistics and computing 17 (4), pp. 395–416. Cited by: §4, §7, §7.
- DFNets: spectral cnns for graphs with feedback-looped filters. In Advances in Neural Information Processing Systems, pp. 6007–6018. Cited by: §1.
- Simplifying graph convolutional networks. In International Conference on Machine Learning, pp. 6861–6871. Cited by: §1, §3, §5.
- Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pp. 5453–5462. Cited by: §1, §10, §5.
- Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48, pp. 40–48. Cited by: §5.2.
- GraphSAINT: graph sampling based inductive learning method. In International Conference on Learning Representations, External Links: Cited by: §1, §1.
- PairNorm: tackling oversmoothing in gnns. In International Conference on Learning Representations, External Links: Cited by: §1.
- Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine learning (ICML-03), pp. 912–919. Cited by: §1.
- Semi-supervised learning literature survey. Technical report University of Wisconsin-Madison Department of Computer Sciences. Cited by: §1.
6 Formal statement of Theorem 4.3 and the relevant proofs
We start with the formal statement of our main theorem. Let us replace the standard softmax with a smoothed softmax equipped with a smoothing parameter; the standard softmax is recovered as a special case. With a slight abuse of notation, exponentiation of a vector is understood element-wise. We also use the cross entropy loss, where
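As a concrete illustration (not part of the formal statement), the smoothed softmax and the cross entropy loss can be sketched as follows; the smoothing parameter is elided in the text above, so the name `beta` and the convention that `beta = 1` recovers the standard softmax are assumptions:

```python
import numpy as np

def smooth_softmax(z, beta=1.0):
    # Softmax with smoothing parameter beta (assumed name); beta = 1 is
    # assumed to recover the standard softmax.
    z = np.asarray(z, dtype=float)
    e = np.exp(beta * (z - z.max()))  # shift by the max for numerical stability
    return e / e.sum()

def cross_entropy(p, y):
    # Cross entropy loss for one example: -log of the predicted
    # probability assigned to the ground-truth class y.
    return -np.log(p[y])
```

A one-hot prediction on the correct class drives the loss toward 0, while a near-uniform (over-smoothed) prediction yields a loss close to the logarithm of the number of classes.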
Finally, we ignore the normalization of GPR weights used in practice for simplicity. The formal statement of our main theorem reads as follows.
To prove the theorem, we find the subsequent lemmas useful.
Let be the cross entropy loss and let be the training set. Under the same assumption as given in Lemma 4.1, the gradient of for large enough is
For any real vector and large enough, we have .
The first equality is due to the fact that and . Recall that by Lemma 4.1, . Since we have a self-loop for each node, and thus . Note that when ignoring the terms in the last part of (9), equality is achieved if and only if . This means that over-smoothing results in a prediction that perfectly aligns with the ground truth labels in the training set. However, if our training set contains nodes from different classes, then equality can never be attained. Thus, the gradient of will always be positive and will keep decreasing until GPR-GNN escapes the over-smoothing effect.
7 Proof of Lemma 4.1
We start by showing that the symmetric graph Laplacian
is positive semi-definite. Let be any real vector of unit norm and , then we have
where the last step follows from the definition of the degree.
Next we show that is indeed an eigenvalue of
associated with the unit eigenvector, where .
Let be the all one vector. Then, a direct calculation reveals that
Combining this result with the positive semi-definite property of the Laplacian shows that is indeed the smallest eigenvalue of , associated with the eigenvector . Moreover, from (13) and the assumption that the graph is connected, it is not hard to see that the multiplicity of this eigenvalue is exactly 1 (see Propositions 2 and 4 in Von Luxburg (2007) for more details). Finally, from (10) it is obvious that the largest eigenvalue of is , which corresponds to the eigenvector . Hence all other eigenvalues of .
Next, we prove that . This can also be shown directly from (13). Note that
The inequality follows from an application of the Cauchy-Schwarz inequality. Consequently, the largest eigenvalue of is bounded by , which means that . Note that equality holds if and only if the underlying graph is bipartite. However, this is impossible in our setting since we have added a self-loop to each node. Hence . This means
Hence, for any we have
Note that this can also be written with the term as
This completes the proof.
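As a numerical sanity check of the above argument (not a substitute for the proof), one can verify on a small connected graph with self-loops that the symmetric graph Laplacian is positive semi-definite with a simple zero eigenvalue, so that the normalized adjacency matrix has all eigenvalues in (-1, 1] with the eigenvalue 1 being simple:

```python
import numpy as np

# Small connected, non-bipartite graph: a triangle (0, 1, 2) with a
# pendant node 3 attached to node 2.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
A_tilde = A + np.eye(4)                    # add a self-loop to each node
d = A_tilde.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
A_sym = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetrically normalized adjacency
L_sym = np.eye(4) - A_sym                  # symmetric graph Laplacian

lap_eigs = np.linalg.eigvalsh(L_sym)       # ascending order
adj_eigs = np.linalg.eigvalsh(A_sym)
# lap_eigs lie in [0, 2), so adj_eigs = 1 - lap_eigs lie in (-1, 1];
# the top adjacency eigenvalue 1 is simple since the graph is connected.
```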
8 Proof of Lemma 6.2
Recall that our loss function equals
Then by taking the partial derivative of the loss function with respect to we have
Next, recall that for GPR-GNN we also have
. Plugging this expression into the previous formula and applying the chain rule we obtain
Setting for large enough , it follows from Lemma 4.1 that
9 Proof of Lemma 6.3
Let . Then by the definition of softmax for we have
Note that when and when . Without loss of generality we assume that there are maxima in , where and let denote the set of indices of those maxima. Then, taking the limit we have
This implies that for large enough one has
The above result completes the proof.
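The limiting behavior established above can be checked numerically: for a large smoothing parameter, the smoothed softmax distributes its mass uniformly over the indices attaining the maximum. The name `beta` for the smoothing parameter is an assumption, since the symbol is elided in the text:

```python
import numpy as np

def smooth_softmax(z, beta):
    # Smoothed softmax; beta is the assumed name of the smoothing parameter.
    z = np.asarray(z, dtype=float)
    e = np.exp(beta * (z - z.max()))
    return e / e.sum()

z = np.array([3.0, 1.0, 3.0, 2.0])   # two tied maxima, at indices 0 and 2
p = smooth_softmax(z, beta=200.0)
# For large beta the mass splits evenly over the maxima, so p is
# numerically [0.5, 0, 0.5, 0].
```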
10 Related works
Among the methods that differ from GCN-type GNNs, APPNP is one of the state-of-the-art methods related to our GPR-GNN approach. Compared to GPR-GNN, APPNP requires tuning an important hyperparameter, and the choice of is highly dependent on the dataset used. In contrast, our GPR-GNN adaptively learns the optimal GPR weights in an end-to-end fashion. In Klicpera et al. (2019a), the hyperparameter is chosen to be either or . In our Experiments section we show that our GPR-GNN consistently outperforms APPNP for both of the recommended choices of and on all benchmark datasets.
Among the GCN-like models, we find that JK-Net Xu et al. (2018) exhibits some similarities with GPR-GNN. Roughly speaking, it also aggregates the output from every convolutional layer in GCN. However, the depth of the JK-Net is still limited Klicpera et al. (2019a) and we again consistently outperform this method on all benchmark datasets.
11 Experimental setup details
11.1 Experimental setup for Figure 1 (b)
For each of the classes (communities), we randomly choose a node in the community as a seed to form the initial one-hot vector , where . Then we return the top- scoring nodes and compute the recall. A similar method is used for the degree-normalized case (-d). For each network, the results are based on independently chosen seeds for each community-network pair and are then averaged over all communities.
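The exact propagation rule used in Figure 1 (b) and its parameters are elided above, so the following only sketches the generic pipeline with personalized PageRank as a placeholder: propagate a one-hot seed vector, return the top-k scoring nodes, and compute the recall against the seed's community:

```python
import numpy as np

def ppr_scores(A, seed, alpha=0.5, iters=100):
    # Personalized PageRank from a one-hot seed vector; alpha is the
    # damping factor. Both the propagation rule and its parameters are
    # placeholders for the elided ones in the text.
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    x = np.zeros(n)
    x[seed] = 1.0
    pi = x.copy()
    for _ in range(iters):
        pi = (1.0 - alpha) * x + alpha * pi @ P
    return pi

def topk_recall(scores, community, k):
    # Fraction of the ground-truth community among the top-k scoring nodes.
    topk = set(np.argsort(-scores)[:k])
    return len(topk & set(community)) / len(community)
```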
11.2 Experimental setup for Section 5
All experiments are performed on a Linux machine with cores, GB of RAM, and an NVIDIA Tesla P100 GPU with GB of GPU memory. For the training set, we ensure that the number of nodes from each class is approximately the same while keeping the total number of training nodes close to . For the validation set, we randomly sample of the nodes and place the remaining ones into the test set.
For all baseline models, we directly use the implementation in the PyTorch Geometric library Fey and Lenssen (2019) together with the corresponding properly tuned hyperparameters. For all methods, the default hyperparameter setting is as follows: learning rate , dropout , early stopping and weight decay . The maximum number of epochs is chosen to be for both the real benchmark datasets and our cSBM synthetic datasets. All models use the Adam optimizer Kingma and Ba (2014). Note that the early stopping criterion is exactly the same as in PyTorch Geometric: once the epoch count exceeds half of the maximum number of epochs, we check whether the current validation loss is lower than the average over the past epochs; if it is not, we stop training.
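The early stopping rule described above can be sketched as follows; the numerical values of the averaging window and the maximum number of epochs are elided in the text, so they are left as parameters:

```python
def should_stop(epoch, max_epoch, val_losses, window):
    # Early-stopping rule mirroring the description above: do nothing
    # during the first half of training; afterwards, stop when the
    # current validation loss is no lower than its mean over the last
    # `window` epochs. (The window size is elided in the text.)
    if epoch <= max_epoch // 2 or len(val_losses) <= window:
        return False
    recent_mean = sum(val_losses[-window - 1:-1]) / window
    return val_losses[-1] >= recent_mean
```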
For GCN, we use 2 GCN layers with 64 hidden units. For GAT, we use 2 GAT convolutional layers, where the first layer has 8 attention heads with 8 hidden units each and the second layer has 1 attention head and 64 hidden units. For GraphSAGE, we use 2 layers with 16 hidden units and mean pooling during aggregation. For JK-Net, we use the GCN-based model with 2 layers and 16 hidden units per layer; for the layer aggregation part, we use an LSTM with 16 channels and 4 layers. For SGC, we use 2 steps of propagation. For the MLP tested in the cSBM experiments, we choose a 2-layer fully connected network with 64 hidden units. For APPNP, we use the same 2-layer MLP with 10 steps of propagation and the parameter equal to and , respectively.
12 GPR-GNN model details
GPR weights initialization. As mentioned in the main text, we find that not all initializations of the GPR weights work well in practice. We therefore use a peak-like initialization procedure, which sets one medium-range GPR weight ( in our experiments) to a large value and all other GPR weights to a small value. This peak initialization is inspired by the success of SGC, which corresponds to the case for some integer . We empirically find that once a GPR weight reaches zero (either from the initialization or through parameter updates), it is hard to change afterwards. Hence we suggest using a small value ( in our experiments) instead of exactly zero.
GPR weights normalization. We empirically determined that taking the absolute value of the weights and normalizing them by their sum works well in practice. The reason why absolute values are favored over ReLUs is that they make the GPR weights less likely to become exactly zero. As mentioned above, once a GPR weight reaches zero it is hard to update any further.
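The peak-like initialization and the normalization of the GPR weights can be sketched as follows; the specific peak position, peak value, and small value are elided in the text, so they are left as parameters:

```python
import numpy as np

def peak_init(num_steps, peak_idx, peak_value, eps):
    # Peak-like initialization: one medium-range GPR weight is set to a
    # large value, all others to a small eps > 0 rather than exactly 0,
    # since weights that hit 0 are hard to move afterwards.
    gamma = np.full(num_steps + 1, eps)
    gamma[peak_idx] = peak_value
    return gamma

def normalize_gpr(gamma):
    # Absolute value followed by normalization by the sum; the absolute
    # value makes the weights unlikely to become exactly 0 (unlike a ReLU).
    g = np.abs(gamma)
    return g / g.sum()
```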
Gradient dropout. We also introduce a new technique, which we refer to as gradient dropout, for training the GPR weights. In a nutshell, we randomly erase the gradient of each GPR weight independently with probability before the gradient update. Note that this is significantly different from standard dropout methods, which erase the value itself. We find that gradient dropout is necessary for learning good GPR weights, while standard dropout does not work properly. This is because setting a GPR weight to 0 may dramatically degrade the currently learned, near-optimal propagation rule. In contrast, setting the gradient to 0 leaves the weights themselves unchanged. We conjecture that gradient dropout not only introduces much-needed random perturbations to the GPR weights during training, but also balances the learning speed of the neural network and the GPR component of the system.
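A minimal sketch of gradient dropout, assuming the gradient is available as a NumPy array; in the actual PyTorch implementation, the same masking would be applied to the gradient tensor of the GPR weights before the optimizer step:

```python
import numpy as np

def gradient_dropout(grad, p, rng):
    # Zero each entry of the GPR-weight gradient independently with
    # probability p before the update; unlike standard dropout, the
    # weights themselves are left untouched.
    mask = rng.random(grad.shape) >= p
    return grad * mask
```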
Other hyperparameter details. For the remaining hyperparameters, we use the same learning rate, dropout, early stopping, and maximum epoch number as for the other baseline models. We use a smaller learning rate for the GPR part in order to balance the learning speed of the neural network part and the GPR part. We set the peak value at the initialization stage to be .
13 cSBM details
The cSBM adds Gaussian random vectors as node features on top of the classical SBM. For simplicity, we assume equally sized communities with node labels in . Each node is associated with a -dimensional Gaussian vector, where is the number of nodes and has independent standard normal entries. The (undirected) graph in the cSBM is described by the adjacency matrix defined as
Similar to the classical SBM, given the node labels the edges are independent. The symbol stands for the average degree of the graph. Also, recall that and control the information strength carried by the node features and the graph structure respectively.
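A minimal sketch of cSBM sampling, following the standard parametrization of Deshpande et al. (2018) since the exact constants are elided in the text: same-community edges appear with probability (d + λ√d)/n, cross-community edges with probability (d − λ√d)/n, and the features are a rank-one signal plus Gaussian noise. The labels are sampled i.i.d. here for simplicity, whereas the text assumes exactly balanced communities:

```python
import numpy as np

def csbm(n, f, d, lam, mu, rng):
    # n nodes, f feature dimensions, average degree d; lam and mu control
    # the information carried by the graph and the features, respectively.
    v = rng.choice([-1, 1], size=n)                  # community labels
    u = rng.normal(size=f) / np.sqrt(f)              # hidden feature direction
    X = np.sqrt(mu / n) * np.outer(v, u) + rng.normal(size=(n, f)) / np.sqrt(f)
    p_in = (d + lam * np.sqrt(d)) / n                # same-community edge prob.
    p_out = (d - lam * np.sqrt(d)) / n               # cross-community edge prob.
    P = np.where(np.equal.outer(v, v), p_in, p_out)
    A = (rng.random((n, n)) < P).astype(float)
    A = np.triu(A, 1)
    A = A + A.T                                      # symmetric, no self-loops
    return A, X, v
```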
One reason for using the cSBM to generate synthetic data is that the information-theoretic limit of the model is already characterized in Deshpande et al. (2018). This result is summarized below.
Theorem 13.1 (Informal main result in Deshpande et al. (2018)).
Assume that , and . Then there exists an estimator such that is bounded away from if and only if .
In our experiments, we set and thus have . We vary and along the arc for some to ensure that we are in the achievable parameter regime. We also choose for all our experiments. To better understand the information ratio between node features and graph structure, we introduce an auxiliary parameter that we can control in our experiments. When is close to , the information is mostly contained in the node features, while when is close to , the information is mostly contained in the graph structure. Although the theoretical result in Deshpande et al. (2018) holds for an unsupervised setting, it still represents a good synthetic model for testing the robustness of GNNs in various relevant settings.
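Since the exact constants are elided in the text, the following only sketches the detectability criterion of Deshpande et al. (2018) as we read it, with γ denoting the limit of n/f:

```python
def detectable(lam, mu, n, f):
    # Detectability criterion for the cSBM (our reading of Deshpande et
    # al. (2018)): with gamma = n / f, weak recovery of the communities
    # is possible iff lam**2 + mu**2 / gamma > 1.
    gamma = n / f
    return lam ** 2 + mu ** 2 / gamma > 1
```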
As a final remark, the theoretically optimal clustering method for the cSBM is belief propagation (BP), which is also known to be optimal for the standard SBM Deshpande et al. (2018); Abbe (2017). The authors of Li et al. (2019b) empirically showed that, for the classical SBM setting, the clustering performance of GPR converges to that of BP as the number of propagation steps increases, but that the same is not true for PPR. This also explains why one should not use PPR instead of Inverse PR in the message propagation scheme.
14 Additional results