Graphs are a ubiquitous data representation of many real-world phenomena, with applications ranging from social networks to chemistry, biology, and recommendation systems (Zhou et al., 2018). In such domains, in addition to the features of each data point, the input includes a graph representing relationships between the data points. The structure of the graph can be exploited in machine learning tasks to make more accurate predictions than when each data point is considered separately. One way to leverage the graph structure is to extract graph properties (e.g., node connectivity, distances, degrees (Chojnacki et al., 2010), centralities (Das et al., 2018), and density regions (Burch, 2018)), which can be useful to identify node/subgraph roles, or to identify more significant areas of the graph. Furthermore, there are “higher resolution” features, such as the presence, the position, and the quantity of particular sub-graphs or patterns of nodes, that can distinguish graphs that represent different phenomena or functionalities (Rossi and Ahmed, 2015; Khorshidi et al., 2018).
In the past few years, Graph Neural Networks (GNNs) have received a huge amount of attention from the research community. GNNs are the generalization of deep learning to graph-structured data. One class of models in particular, the Graph Convolutional Network (GCN), has proven extremely effective and is the current state of the art for tasks such as graph classification, node classification, and link prediction. GCNs aim at generalizing the convolution operation, made popular by Convolutional Neural Networks (CNNs) for grid-like structures, to the graph domain, adopting a message passing mechanism. In particular, at each GCN layer, every node in the graph receives a message (e.g., a feature vector) from its 1-hop neighbours. The messages are then aggregated with a permutation-invariant function (e.g., mean or sum) and are used to update the node’s representation vector with a learnable, possibly non-linear, transformation. The final node embedding vectors are used to make predictions, and the whole process is trained end-to-end. Empirically, the best results are obtained when the message passing procedure is repeated 2 or 3 times, as a higher number of layers leads to over-smoothing (Li et al., 2018; Xu et al., 2018b). Thus, GCNs only leverage the graph structure in the form of the 2-hop or 3-hop neighbourhood of each node. This encodes the inductive bias that nodes that are close in the graph probably belong to the same class, and it yields strong results compared to treating the data as unstructured. However, it is still unclear whether such a scheme really exploits all the information provided by a graph.
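To make the message passing step described above concrete, the following is a minimal NumPy sketch of a single layer with mean aggregation followed by a linear map and a ReLU. The function name `mean_aggregation_layer` and the toy graph are illustrative assumptions; this is a generic aggregator, not the exact propagation rule of any specific model discussed in the paper.

```python
import numpy as np


def mean_aggregation_layer(A, H, W):
    """One message passing step: average 1-hop neighbour features, then transform.

    A: (n, n) adjacency matrix, H: (n, d) node features, W: (d, d_out) weights.
    """
    deg = A.sum(axis=1, keepdims=True)
    # Isolated nodes (degree 0) simply receive an all-zero message.
    messages = (A @ H) / np.maximum(deg, 1)
    # Learnable transformation followed by a ReLU non-linearity.
    return np.maximum(messages @ W, 0.0)


# Path graph 0 - 1 - 2 with 2-dimensional node features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W = np.eye(2)  # identity weights, so the output equals the aggregated messages

H1 = mean_aggregation_layer(A, H, W)
```

Stacking two or three such layers lets information travel at most two or three hops, which is the locality limitation the paper investigates.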
In this work we assess whether the injection of structural information not captured by 2 or 3-hop neighbourhoods has a significant impact on the performance of several state-of-the-art GCN models, potentially showing the need for more powerful architectures to fully exploit the intricate and vast amount of information that is embedded in graph structures. In this regard, our contributions are fourfold. Firstly, we propose and formalize four different levels of structural information injection. Secondly, we propose a novel and practical regularization strategy, Random Walk with Restart Regularization (RWRReg), to inject structural information using random walks with restart, allowing GCNs to leverage long-range dependencies. RWRReg does not require additional operations at inference time, maintains the permutation-invariance of GCN models, and leads to an average increase in accuracy on both node classification, and graph classification. Thirdly, we prove a theoretical result linking random walks with restart and the Weisfeiler-Leman algorithm, providing a theoretical foundation for their use in GCNs. Fourthly, we test how the injection of structural information can impact the performance of 6 different GCN models on node classification, graph classification, and on the task of triangle counting. Results show that current state-of-the-art models lack the ability to extract long range information, and this is severely affecting their performance.
Organization of The Paper.
The rest of the paper is organized as follows. Section 2 formalizes four different strategies to inject additional structural information into existing GCN models. These include a novel regularization technique, RWRReg, based on random walks with restart. Section 3 presents the models considered to study the effects of the injection of structural knowledge. Our extensive experimental evaluation is presented in Section 4. Section 5 presents modifications of RWRReg to improve its scalability and usability for all the considered tasks and scenarios. Section 6 discusses related work, while Section 7 concludes the paper with some final remarks.
2 Injecting Long-Range Information in GCNs
To test if GCNs are missing out on important information that is encoded in the structure of a graph, we inject additional structural information into existing GCN models, and test how the performance of these models changes in several graph-related tasks. Intuitively, based on a model’s performance when injected with different levels of structural information, we can understand how much information is not captured by GCNs, and whether this additional knowledge can improve performance on the considered tasks. In the rest of this section we present the notation used throughout the paper, the four levels of structural information injection that we consider, and an analytical result proving the effectiveness of using information from random walks with restart.
We use uppercase bold letters for matrices ($\mathbf{M}$), and lowercase bold letters for vectors ($\mathbf{v}$). We use plain letters with subscript indices to refer to a specific element of a matrix ($M_{ij}$), or of a vector ($v_i$). We refer to the vector containing the $i$-th row of a matrix with the subscript “$i:$” ($\mathbf{M}_{i:}$), while we refer to the $j$-th column with the subscript “$:j$” ($\mathbf{M}_{:j}$).
For a graph $G = (V, E)$, where $V$ is the set of $n$ nodes and $E$ is the set of edges, the input is given by a tuple $(\mathbf{X}, \mathbf{A})$. $\mathbf{X}$ is an $n \times d$ matrix where the $i$-th row contains the $d$-dimensional feature vector of the $i$-th node, and $\mathbf{A}$ is the $n \times n$ adjacency matrix. For the sake of clarity we restrict our presentation to undirected graphs, but similar concepts can be applied to directed graphs.
2.2 Structural Information Injection
We consider four different levels of structural information injection, briefly described below.
We concatenate each node’s adjacency matrix row to its feature vector. This explicitly empowers the GCN model with the connectivity of each node, and allows for higher level structural reasoning when considering a neighbourhood (the model will have access to the connectivity of the whole neighbourhood when aggregating messages from neighbouring nodes).
Random Walk with Restart (RWR) Matrix.
We perform random walks with restart (RWR) (Page et al., 1998) from each node $v$, thus obtaining an $n$-dimensional vector (for each node) that gives a score of how much $v$ is “related” to each other node in the graph. We concatenate this vector of RWR features to each node’s feature vector. The choice of RWR is motivated by its capability to capture the relevance between two nodes in a graph (Tong et al., 2006), and by the possibility to modulate the exploration of long-range dependencies by changing the restart probability. Intuitively, if a RWR starting at node $v$ is very likely to visit a node $u$ (e.g., there are multiple paths that connect the two), then there will be a high score in the RWR vector for $v$ at position $u$. This gives the GCN model higher level information about the structure of the graph that goes beyond the 1-hop neighbourhood of each node, and, again, it allows for high level reasoning on neighbourhood connectivity.
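As an illustration, the RWR scores for all start nodes can be obtained in closed form on small graphs. The sketch below is a minimal example under stated assumptions: it uses the standard stationary formulation of RWR with a row-normalized transition matrix, assumes a graph without isolated nodes, and uses the hypothetical names `rwr_matrix` and `X_aug`. It is not necessarily the procedure used in the paper's implementation.

```python
import numpy as np


def rwr_matrix(A, restart_prob=0.15):
    """Stationary RWR scores; row i holds the scores of a walk restarting at node i."""
    n = A.shape[0]
    # Row-stochastic transition matrix (assumption: no isolated nodes).
    P = A / A.sum(axis=1, keepdims=True)
    # Fixed point of S = (1 - c) * S @ P + c * I, solved in closed form.
    return restart_prob * np.linalg.inv(np.eye(n) - (1.0 - restart_prob) * P)


# Path graph 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
S = rwr_matrix(A)

# Inject the RWR scores by concatenating them to the (here random) node features.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
X_aug = np.concatenate([X, S], axis=1)  # shape (n, d + n)
```

The matrix inverse costs cubic time in the number of nodes, so for large graphs an iterative or approximate method would be needed instead.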
We define a novel regularization term that pushes nodes with mutually high RWR scores to have embeddings that are close to each other (independently of how far apart they are in the graph). This regularization term encourages the message passing procedure defined by GCNs, which acts on neighbouring nodes, to produce embeddings where pairs of nodes with a high RWR score have similar representations. Therefore, the learned representation combines local information with long-range information provided by RWR. Let $\mathbf{S}$ be the $n \times n$ matrix with the RWR scores. We define the RWRReg (Random Walk with Restart Regularization) loss as follows:
$$\mathcal{L}_{RWRReg} = \sum_{i,j \in V} S_{ij} \, \lVert \mathbf{H}_{i:} - \mathbf{H}_{j:} \rVert^2$$
where $\mathbf{H}$ is a matrix of size $n \times d'$ containing $d'$-dimensional node embeddings that are in between graph convolution layers (see Appendix A for the exact point in which $\mathbf{H}$ is considered for each model). With this approach, the loss function used to train the model becomes $\mathcal{L} = \mathcal{L}_{original} + \lambda \mathcal{L}_{RWRReg}$, where $\mathcal{L}_{original}$ is the original loss function for each model, and $\lambda$ is a balancing term.
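A regularizer of this kind can be sketched in a few lines of NumPy. The sketch assumes that the penalty is the sum, over all node pairs, of the RWR score times the squared Euclidean distance between the two embeddings; this specific form, as well as the names `rwr_reg_loss` and `total_loss`, are assumptions made for illustration. In practice the same expression would be written in an automatic differentiation framework so that it can be minimized jointly with the task loss.

```python
import numpy as np


def rwr_reg_loss(S, H):
    """Sum over node pairs (i, j) of S[i, j] * ||H[i] - H[j]||^2."""
    sq_norms = (H ** 2).sum(axis=1)
    # Pairwise squared Euclidean distances between the rows of H.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (H @ H.T)
    return float((S * sq_dists).sum())


def total_loss(task_loss, S, H, lam):
    """Original task loss plus the balanced regularization term."""
    return task_loss + lam * rwr_reg_loss(S, H)


# Two nodes with equal mutual RWR scores and embeddings at distance 5.
S = np.array([[0.5, 0.5],
              [0.5, 0.5]])
H = np.array([[0.0, 0.0],
              [3.0, 4.0]])

reg = rwr_reg_loss(S, H)                    # 0.5 * 25 + 0.5 * 25 = 25.0
loss = total_loss(1.0, S, H, lam=0.1)       # 1.0 + 0.1 * 25.0 = 3.5
```

Because the term only involves embeddings produced during training, nothing extra needs to be computed at inference time.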
RWR Matrix + RWR Regularization.
We combine the previous two types of structural knowledge injection. The intuition is that it should be easier to enforce the RWRReg by having the additional long range information provided by the RWR features. We expect this type of information injection to have the highest impact on performance of the models on downstream tasks.
2.3 Relationship between the 1-Weisfeiler-Leman Algorithm and RWRs
In this section we provide analytical evidence that the information from RWR significantly empowers GCNs. In particular, we prove an interesting connection between the 1-Weisfeiler-Leman (1-WL) algorithm and RWR.
The 1-WL algorithm for graph isomorphism testing uses an iterative coloring, or relabeling, scheme, in which all nodes are initially assigned the same label (e.g., the value $1$). It then iteratively refines the color of each node by aggregating the multiset of colors in its neighborhood. The final feature representation of a graph is the histogram of resulting node colors. (For a more detailed description of the 1-WL algorithm we refer the reader to Shervashidze et al. (2011).) It is known that there are graphs that are different but are not distinguishable by the 1-WL algorithm, and that $n$ iterations are enough to distinguish two graphs of $n$ vertices which are distinguishable by the 1-WL algorithm. There is a well-known connection (Kipf and Welling, 2017; Xu et al., 2018a) between 1-WL and aggregation-based GCNs, which can be seen as a differentiable approximation of the algorithm. In particular, graphs that can be distinguished in $k$ iterations by the 1-WL algorithm can be distinguished by GCNs in $k$ message passing iterations (Morris et al., 2019).
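The color refinement scheme can be sketched as follows. This is a minimal illustration, not an optimized implementation: the function name `wl_histogram` is hypothetical, graphs are given as adjacency dictionaries, and each new color is represented directly by the pair (old color, sorted neighbour colors) rather than being compressed to a fresh integer, which keeps colors comparable across graphs at the cost of growing color size.

```python
from collections import Counter


def wl_histogram(adj, iterations):
    """Histogram of node colors after the given number of 1-WL refinement steps.

    adj: dict mapping each node to the list of its neighbours.
    """
    colors = {v: 1 for v in adj}  # every node starts with the same label
    for _ in range(iterations):
        # New color = own color plus the sorted multiset of neighbour colors.
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())


# A 6-cycle and two disjoint triangles: different graphs, both 2-regular.
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}

# A path and a star on 4 nodes: distinguished after one iteration.
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star4 = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

same = wl_histogram(cycle6, 3) == wl_histogram(two_triangles, 3)
different = wl_histogram(path4, 1) != wl_histogram(star4, 1)
```

The first pair is a standard example of non-isomorphic graphs that 1-WL cannot tell apart, since every node sees the same neighbourhood colors at every iteration.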
Here, we prove that graphs that are distinguishable by 1-WL in $k$ iterations have different feature representations extracted by RWR of length $k$. Given a graph $G = (V, E)$, we define its $k$-step RWR representation as the set of vectors $\mathbf{r}_v = [r_{v,u_1}, \dots, r_{v,u_n}]$, $v \in V$, where each entry $r_{v,u}$ describes the probability that a RWR of length $k$ starting in $v$ ends in $u$.
Let $G_1$ and $G_2$ be two non-isomorphic graphs for which the 1-WL algorithm terminates with the correct answer after $k$ iterations and starting from the labelling of all $1$’s. Then the $k$-step RWR representations of $G_1$ and $G_2$ are different.
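For intuition, a $k$-step RWR representation can be computed by iterating the walk distribution. The sketch below rests on an assumption about the walk model that the text does not spell out: at each step the walker returns to its start node with the restart probability and otherwise moves to a uniformly random neighbour. The name `k_step_rwr` is hypothetical, and the graph is assumed to have no isolated nodes.

```python
import numpy as np


def k_step_rwr(A, k, restart_prob=0.15):
    """Row v gives the distribution of a k-step RWR that started at node v."""
    n = A.shape[0]
    # Row-stochastic transition matrix (assumption: no isolated nodes).
    P = A / A.sum(axis=1, keepdims=True)
    R = np.eye(n)  # after 0 steps the walker is still at its start node
    for _ in range(k):
        # Move to a random neighbour, or jump back to the start node.
        R = (1.0 - restart_prob) * (R @ P) + restart_prob * np.eye(n)
    return R


# Path graph 0 - 1 - 2 - 3.
A_path = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
R2 = k_step_rwr(A_path, k=2)
```

Under this walk model, the representation converges to the stationary RWR scores as $k$ grows, while small values of $k$ only carry information about nearby nodes.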
The proof can be found in Appendix B. Given that $k$ iterations of the 1-WL algorithm require GCNs of depth $k$ to be performed, but in practice GCNs are limited to depth 2 or 3, the result above shows that RWR can empower GCNs with relevant information that is discarded in practice.
Recent work (Micali and Zhu, 2016) has shown that anonymous random walks (i.e., random walks where the global identities of nodes are not known) of fixed length starting at a node $v$ are sufficient to reconstruct the local neighborhood within a fixed distance of $v$. Subsequently, anonymous random walks have been introduced in the context of learning graph representations (Ivanov and Burnaev, 2018). While providing interesting insights into the information obtained by random walks, such results are complementary to ours, since they assume access to the distribution of entire walks of a given length, while our RWR representation only stores information on the probability of ending in a given node. In addition, such works do not provide a connection between RWR and 1-WL.
3 Choice of Models
In order to test the effect of the different levels of structural information injection and to obtain results that are indicative of the whole class of GCN models, our experimental study covers most of the proposed techniques for spatial graph convolution. We conceptually identify four different categories from which we select representative models (a detailed review is presented in Appendix C).
Simple Aggregation Models.
Such models fall into the message passing framework (Gilmer et al., 2017) and utilize a “simple” aggregation strategy, where each node receives messages (e.g., feature vectors) from its neighbours, and uses the received messages to update its embedding vector. As a representative we choose GCN (Kipf and Welling, 2017), one of the fundamental and most widely used GNN models. We also consider GraphSage (Hamilton et al., 2017), as it represents a different aggregation strategy where a set of neighborhood aggregation functions is learned, and a sampling approach is used for defining fixed-size neighbourhoods.
Several models have used an attention mechanism in a GNN scenario (Lee et al., 2018a, b; Veličković et al., 2018; Zhang et al., 2018). While they fall into the message passing framework, we consider them separately as they employ a more sophisticated aggregation scheme. As a representative we focus on GAT (Veličković et al., 2018), the first to present an attention mechanism over nodes for the aggregation phase, and currently one of the best performing models on several datasets. Furthermore, it can be used in an inductive scenario.
Pooling on graphs is a very challenging task, since it has to take into account that each node might have a different sized neighbourhood. Among the methods that have been proposed for differentiable pooling on graphs (Cangea et al., 2018; Ying et al., 2018b; Diehl et al., 2019; Gao and Ji, 2019; Lee et al., 2019), we choose DiffPool (Ying et al., 2018b) for its strong empirical results. Furthermore, it can learn to dynamically adjust the number of clusters (the number is a hyperparameter, but the network can learn to use fewer clusters if necessary).
Morris et al. (2019) prove that message-passing GNNs cannot be more powerful than the 1-WL algorithm, and propose $k$-GNNs, which rely on a subgraph message-passing mechanism and are proven to be as powerful as the $k$-WL algorithm. Another approach that goes beyond the WL algorithm was proposed by Murphy et al. (2019). Both models are computationally intractable in their initial theoretical formulation, so approximations are needed. As a representative we choose $k$-GNNs, to test if subgraph message-passing is affected by additional structural information.
We now present our framework for evaluating the effects of the injection of structural information into GNNs, and the results of our experiments. We first present the results for node classification, and then the results for graph classification. Subsequently, we study the impact of the restart probability of RWR on the results, and finally we study the impact of structural information injection on the task of triangle counting, where the ability to analyze the graph structure is fundamental.
We use each architecture for the task that best suits its design: GCN, GraphSage, and GAT for node classification, and DiffPool and $k$-GNN for graph classification. We add an adapted version of GCN for graph classification, as a common strategy for this task is to deploy a node-level GNN, and then apply a readout function to combine node embeddings into a global graph embedding vector.
With regard to datasets, for node classification we considered the three most used benchmarking datasets in the literature: Cora, Citeseer, and Pubmed (Sen et al., 2008). Analogously, for graph classification we chose three frequently used datasets: ENZYMES, PROTEINS, and D&D (Kersting et al., 2016). Dataset statistics can be found in Appendix D.
For all the considered models we take the hyperparameters from the implementations released by the authors. The only parameter tuned using the validation set is the balancing term $\lambda$ when RWRReg is applied. We found that the RWRReg loss tends to be larger than the Cross Entropy loss for prediction, and the best values for $\lambda$ lie in the range . For all the RWR-based techniques we used a restart probability of 0.15, as it is a common default value used in many papers and software libraries. (The effects of different restart probabilities are explored below.) Detailed information on the implementation of the considered models can be found in Appendix A. Source code is provided as Supplementary Material and will be made publicly available upon acceptance.
For each dataset we follow the approach that has been widely adopted in the literature: we take 20 labeled nodes per class as training set, 500 nodes as validation set, and 1000 nodes for testing. Most authors have used the train/validation/test split defined by Yang et al. (2016). Since we want to test the general effect of the injection of structural information, we differ from this approach and we do not rely on a single split. We perform 100 runs, where at each run we randomly sample 20 nodes per class for training, 500 random nodes for validation, and 1000 random nodes for testing. We then report mean and variance for the accuracy on the test set over these 100 runs.
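The repeated random split protocol can be sketched as below. This is an illustrative sketch, not the paper's code: the function name `sample_split` is hypothetical, the labels are synthetic, and the sketch assumes that validation and test nodes are drawn without overlap from the nodes not used for training.

```python
import numpy as np


def sample_split(labels, rng, per_class=20, n_val=500, n_test=1000):
    """Sample train/validation/test node indices for one run."""
    labels = np.asarray(labels)
    # 20 random training nodes for each class.
    train = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=per_class, replace=False)
        for c in np.unique(labels)
    ])
    # Assumption: validation and test sets are disjoint from training and each other.
    rest = rng.permutation(np.setdiff1d(np.arange(labels.size), train))
    return train, rest[:n_val], rest[n_val:n_val + n_test]


rng = np.random.default_rng(0)
# Synthetic labels: 2708 nodes and 7 classes, sizes chosen for illustration only.
labels = rng.integers(0, 7, size=2708)

# One split per run; a model would be trained and evaluated once per split.
splits = [sample_split(labels, rng) for _ in range(100)]
train_idx, val_idx, test_idx = splits[0]
```

Reporting the mean and variance of the test accuracy over all sampled splits reduces the dependence of the conclusions on any single, possibly favourable, split.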
Results are summarized in Table 1, where we observe that the simple addition of RWR features to the feature vector of each node is sufficient to give a performance gain. The RWRReg term then significantly increases this gain, showing that structural and long-range information are important even for node classification, and confirming that looking only at neighbours and close nodes is not enough.
Following the approach from Ying et al. (2018b) and Morris et al. (2019) we use 10-fold cross validation, and report mean and variance of the accuracy on graph classification. Results are summarized in Table 2. The performance gains given by the injection of structural information are even more apparent than for the node classification task. Intuitively, the structure of the nodes in a graph is fundamental for distinguishing different graphs. Most notably, the addition of the adjacency features is sufficient to give a large performance boost.
Surprisingly, models like DiffPool and $k$-GNN show an important difference in accuracy when structural information is injected, meaning that even the most advanced methods suffer from the inability to properly exploit all the structural information that is encoded in graph data.
Impact of RWR Restart Probability.
We tested how performance changes with different restart probabilities. Intuitively, higher restart probabilities might put too much focus on close nodes, while lower probabilities may focus too much on nodes that are “central” in the graph structure, with fewer differences in the RWR features between nodes. Figure 1 (a) summarises how the accuracy on node classification changes with different restart probabilities. Results for graph classification are shown in Figure 1 (b). In accordance with our intuition, higher restart probabilities focus on close nodes (and less on distant nodes), and produce lower accuracies. Furthermore, we notice that injecting RWR information never lowers performance below that of the model without any injection.
The TRIANGLES dataset (Knyazev et al., 2019) is composed of randomly generated graphs, where the task is to count the number of triangles contained in each graph. This is a hard task for GNNs, as the aggregation of neighbouring nodes’ features with permutation-invariant functions does not allow the model to explicitly access structural information.
The TRIANGLES dataset has a test set with 10,000 graphs, of which half are similar in size to the ones in the training and validation sets (4-25 nodes), and half are bigger (up to 100 nodes). This permits an evaluation of the generalization capabilities to graphs of unseen sizes.
For this regression task we use a three-layer GCN followed by max-pooling, and we minimize the Mean Squared Error (MSE) loss (additional details can be found in Appendix A). Table 3 presents MSE results on the test dataset as a whole and on the two splits separately. We see that the addition of RWR features and of RWRReg provides significant benefits, especially when the model has to generalize to graphs of unseen sizes, while the addition of adjacency features leads to overfitting (additional analyses are available in Appendix E).
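For reference, the regression target of this task can be computed exactly from the adjacency matrix. The sketch below uses the standard identity that, for a simple undirected graph, the trace of the cubed adjacency matrix counts every triangle six times; the function name `count_triangles` and the toy graphs are illustrative and are not taken from the dataset.

```python
import numpy as np


def count_triangles(A):
    """Number of triangles in a simple undirected graph with adjacency matrix A."""
    A = np.asarray(A, dtype=np.int64)
    # trace(A^3) counts closed walks of length 3: each triangle is counted
    # once per starting node (3) and once per direction (2).
    return int(np.trace(A @ A @ A)) // 6


# Complete graph on 4 nodes: every choice of 3 nodes forms a triangle.
K4 = np.ones((4, 4), dtype=int) - np.eye(4, dtype=int)

# Path graph on 4 nodes: no triangles.
P4 = np.array([[0, 1, 0, 0],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [0, 0, 1, 0]])

n_k4 = count_triangles(K4)
n_p4 = count_triangles(P4)
```

The computation depends on products of adjacency entries across three nodes, which is exactly the kind of structural information that neighbourhood aggregation does not expose directly.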
5 Practical RWR Regularization
As shown in Section 4, the addition of RWR features as node features coupled with RWRReg provides a significant improvement of the accuracy on all considered tasks. However, these benefits come at a high cost: adding RWR features increases the input size by $n^2$ elements (which is prohibitive for large graphs), and RWRReg requires the computation of an additional loss term (and the storage of the RWR matrix) during training. Furthermore, all the considered models have a weight matrix at each layer that depends on the feature dimension, which means we are also increasing the number of parameters at the first layer by $n \times d_1$ (where $d_1$ is the dimension of the feature vector for each node after the first GCN layer). In this section we propose a practical way to take advantage of the injection of structural information without increasing the number of parameters, while controlling the memory consumption during training.
The results in Section 4 show that the sole addition of the RWRReg term increases the performance of the considered models by more than 5%. Furthermore, RWRReg does not increase the size of the input feature vectors, does not require additional operations at inference time, and maintains the permutation invariance of GCN models. Therefore, RWRReg alone is a very practical tool that significantly improves the quality of GCN models. However, when dealing with very large graphs, keeping the RWR matrix in memory to compute RWRReg during training might be too expensive. We therefore explore how the sparsification of this matrix affects the resulting model. In particular, we apply a top-$k$ strategy: for each node, we only keep the $k$ highest RWR weights. Figure 2 shows how different values of $k$ impact performance on node classification (which usually is the task with the largest graphs). We can see that the addition of the RWRReg term is always beneficial. Furthermore, by taking the top-, we can reduce the number of entries in the RWR matrix of elements, while still obtaining an average increment on the accuracy of the model. This strategy then allows one to select the value of $k$ that best suits the available memory, while still obtaining a high performing model (better than GCN without structural information injection).
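The top-$k$ sparsification can be sketched as follows. This is a minimal dense NumPy illustration with the hypothetical name `top_k_sparsify`; it assumes `1 <= k <= n`, and a memory-conscious implementation would store the result in a sparse matrix format rather than as a dense array.

```python
import numpy as np


def top_k_sparsify(S, k):
    """Keep the k largest entries in each row of S and set the rest to zero."""
    S = np.asarray(S, dtype=float)
    # Column indices of the k largest entries per row (unordered within the k).
    idx = np.argpartition(S, -k, axis=1)[:, -k:]
    out = np.zeros_like(S)
    rows = np.arange(S.shape[0])[:, None]
    out[rows, idx] = S[rows, idx]
    return out


# Toy RWR-like score matrix for 3 nodes (values chosen for illustration).
S = np.array([[0.6, 0.1, 0.3],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])
S_top2 = top_k_sparsify(S, k=2)
```

After sparsification each row stores only $k$ values, so the memory needed for the regularization term grows linearly, rather than quadratically, with the number of nodes.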
6 Related Work
The field of GNNs has become extremely vast; for a thorough review we refer the reader to the latest survey on the subject (Wu et al., 2019). To the best of our knowledge there are no studies that test if additional structural information can significantly impact GCNs, and there has been very little interest in long-range dependencies between nodes. However, there are some works that are conceptually related to our approach.
Klicpera et al. (2019b) use RWR to create a new (weighted) adjacency matrix where message passing is performed. While this can enable long-range communication, it is impractical for inductive scenarios, as the RWR matrix needs to be calculated for each new graph. In contrast, our RWRReg method only uses the RWR matrix in the training phase, and does not require any additional operation at inference time. Other works have used random walks with GCNs in different ways. Li et al. (2018) use random walks in a co-training scenario to add new nodes for the GCN’s training set. Ying et al. (2018a) and Zhang et al. (2019) use random walks to define aggregation neighbourhoods that are not confined to a fixed distance. Abu-El-Haija et al. (2018) and Abu-El-Haija et al. (2019) use powers of the adjacency matrix, which can be considered as random walk statistics, to define neighbourhoods of different scales. Zhuang and Ma (2018) use random walks to define the positive pointwise mutual information (PPMI) matrix and then use it in place of the adjacency matrix in the GCN formulation. Klicpera et al. (2019a) use a diffusion strategy based on RWR instead of aggregating information from neighbours. We remark how all the aforementioned papers focus on creating smart or extended neighbourhoods which are then used for node aggregation, while we show that node aggregation (or message-passing) without additional information (e.g., RWR features or RWR-based regularization) is not capable of fully exploiting structural graph information.
Pei et al. (2020) propose a strategy to insert long range dependencies information in GCNs by performing aggregation between neighbours in a latent space obtained with some classical node embedding techniques, but it is limited to transductive tasks. Our method can be easily applied to any existing GCN architecture, and works also on inductive tasks. Gao et al. (2019), and Jiang and Lin (2018) use regularization techniques to enforce that the embeddings of neighbouring nodes should be close to each other. The first uses Conditional Random Fields, while the second uses a regularization term based on the graph Laplacian. Both approaches only focus on 1-hop neighbours and do not take long range dependencies into account.
With regards to the study of the capabilities and weaknesses of GNNs, Li et al. (2018) and Xu et al. (2018b) study the over-smoothing problem that appears in Deep-GCN architectures, while Xu et al. (2018a) and Morris et al. (2019) characterize the relation to the Weisfeiler-Leman algorithm. Other works have expressed the similarity with distributed computing (Sato et al., 2019; Loukas, 2020), and the alignment with particular algorithmic structures (Xu et al., 2020). These important contributions have advanced our understanding of the capabilities of GNNs, but they do not quantify the impact of additional structural information.
In this work we showed that state-of-the-art GCN models ignore relevant information regarding node and graph similarity that is revealed by long-distance relations among nodes. We describe four ways to inject such information in several models, and empirically show that the performance of all models significantly improves when such information is used. We then propose a novel regularization technique based on RWR, which leads to an average improvement of on all models. Our experimental results are supported by a novel connection between RWR and the 1-Weisfeiler-Leman algorithm, which proves that RWR encodes long-range relations that are not captured by considering only neighbours at distance at most 2 or 3, as is common practice in GCNs. Based on our results, there are several interesting directions for future research, including the design of GCN architectures that directly capture long-distance relations.
Work partially supported by MIUR, the Italian Ministry of Education, University and Research, under PRIN Project n. 20174LF3T8 AHeAD (Efficient Algorithms for HArnessing Networked Data) and under the initiative “Departments of Excellence” (Law 232/2016), and by the grant STARS2017 from the University of Padova.
- N-GCN: multi-scale graph convolution for semi-supervised node classification. In UAI, Cited by: §6.
- MixHop: higher-order graph convolution architectures via sparsified neighborhood mixing. In International Conference on Machine Learning (ICML), Cited by: §6.
- Local graph partitioning using pagerank vectors. 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06). External Links: Cited by: §6.
- Fast incremental and personalized pagerank. Proceedings of the VLDB Endowment 4 (3), pp. 173–184. External Links: Cited by: §6.
- Exploring density regions for analyzing dynamic graph data. Journal of Visual Languages & Computing 44, pp. 133 – 144. External Links: Cited by: §1.
Towards sparse hierarchical graph classifiers. NIPS Workshop on Relational Representation Learning. Cited by: §3.
- Node degree distribution in affiliation graphs for social network density modeling. pp. 51–61. External Links: Cited by: §1.
- Fast and accurate deep network learning by exponential linear units (ELUs). ICLR. Cited by: §A.5, §A.7.
- Study on centrality measures in social networks: a survey. Social Network Analysis and Mining 8 (1). External Links: Cited by: §1.
- Towards graph pooling by edge contraction. ICML Workshop on Learning and Reasoning with Graph-Structured Data. Cited by: §3.
- Conditional random field enhanced graph convolutional neural networks. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD ’19. External Links: Cited by: §6.
- Graph U-nets. External Links: Cited by: §3.
- Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 1263–1272. External Links: Cited by: §3.
- Inductive representation learning on large graphs. In NIPS, Cited by: §C.2, §3.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. ICML. Cited by: §A.6.
- Anonymous walk embeddings. arXiv preprint arXiv:1805.11921. Cited by: §2.3.
Graph laplacian regularized graph convolutional networks for semi-supervised learning. ArXiv abs/1809.09839. Cited by: §6.
- The treatment of ties in ranking problems. Biometrika 33 (3), pp. 239–251. External Links: Cited by: Appendix F.
- Benchmark data sets for graph kernels. Cited by: §4.
- The role of graphlets in viral processes on networks. Journal of Nonlinear Science. External Links: Cited by: §1.
- Adam: a method for stochastic optimization. ICLR. Cited by: §A.1, §A.2, §A.3, §A.5, §A.6, §A.7.
- Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §C.1, §2.3, §3.
- Predict then propagate: graph neural networks meet personalized pagerank. In ICLR, Cited by: §6.
- Diffusion improves graph learning. In Conference on Neural Information Processing Systems (NeurIPS), Cited by: §6.
- Understanding attention in graph neural networks. In International Conference on Learning Representations (ICLR) Workshop on Representation Learning on Graphs and Manifolds, Cited by: Appendix A, §4.
- Attention models in graphs: a survey. ArXiv abs/1807.07984. Cited by: §3.
- Graph classification using structural attention. In KDD, Cited by: §3.
- Self-attention graph pooling. In ICML, Cited by: §3.
- Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, Cited by: §1, §6, §6.
- Efficient algorithms for personalized pagerank. ArXiv abs/1512.04633. Cited by: §6.
- What graph neural networks cannot learn: depth vs width. In International Conference on Learning Representations, Cited by: §6.
- Reconstructing markov processes from independent and anonymous experiments. Discrete Applied Mathematics 200, pp. 108–122. Cited by: §2.3.
- Weisfeiler and Leman go neural: higher-order graph neural networks. In AAAI, Cited by: §C.5, §2.3, §3, §4, §6.
- Relational pooling for graph representations. ICML. Cited by: §3.
- The pagerank citation ranking: bringing order to the web. In Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, pp. 161–172. Cited by: §2.2.
- Geom-GCN: geometric graph convolutional networks. In International Conference on Learning Representations, Cited by: §6.
- Role discovery in networks. IEEE Transactions on Knowledge and Data Engineering 27, pp. 1112–1131. Cited by: §1.
- Approximation ratios of graph neural networks for combinatorial problems. In NeurIPS, Cited by: §6.
- Collective classification in network data. AI Magazine 29 (3), pp. 93. Cited by: §4.
- Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12 (Sep), pp. 2539–2561. Cited by: §2.3.
- Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958. Cited by: §A.1, §A.2, §A.3, §A.5, §A.7.
- Fast random walk with restart and its applications. Sixth International Conference on Data Mining (ICDM’06), pp. 613–622. Cited by: §2.2, §6.
- Graph Attention Networks. ICLR. Cited by: §C.3, §3.
- Efficient algorithms for approximate single-source personalized pagerank queries. ACM Transactions on Database Systems 44 (4), pp. 1–37. Cited by: §6.
- TopPPR: top-k personalized pagerank queries with precision guarantees on large graphs. Proceedings of the 2018 International Conference on Management of Data - SIGMOD ’18. Cited by: §6.
- A comprehensive survey on graph neural networks. ArXiv abs/1901.00596. Cited by: §6.
- How powerful are graph neural networks?. ArXiv abs/1810.00826. Cited by: §2.3, §6.
- Representation learning on graphs with jumping knowledge networks. In ICML, Cited by: §1, §6.
- What can neural networks reason about?. In International Conference on Learning Representations, Cited by: §6.
- Revisiting semi-supervised learning with graph embeddings. In ICML, Cited by: §4.
- Graph convolutional neural networks for web-scale recommender systems. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD ’18. Cited by: §6.
- Hierarchical graph representation learning with differentiable pooling. In NeurIPS, Cited by: §C.4, §3, §4.
- Heterogeneous graph neural network. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD ’19. Cited by: §6.
- GaAN: gated attention networks for learning on large and spatiotemporal graphs. In UAI, Cited by: §3.
- Graph neural networks: a review of methods and applications. ArXiv abs/1812.08434. Cited by: §1.
- Dual graph convolutional networks for graph-based semi-supervised classification. Proceedings of the 2018 World Wide Web Conference on World Wide Web - WWW ’18. Cited by: §6.
Appendix A Model Implementation Details
We present here a detailed description of the implementations of the models we use in our experimental section. Whenever possible, we started from the official implementation of the authors of each model. Table 4 contains links to the implementations we used as the starting point for the code for our experiments.
All models are trained with early stopping on the validation set (training stops if the validation loss does not decrease for a certain number of epochs) and, unless explicitly specified, we use Cross Entropy as the loss function for all classification tasks.
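The early-stopping rule described above can be sketched as follows. This is a minimal illustration rather than the authors' training loop: the function name, the `train_step` and `validation_loss` callables, and the default `patience` are hypothetical, since the text only says training stops after "a certain number of epochs" without improvement.

```python
def train_with_early_stopping(train_step, validation_loss,
                              max_epochs=200, patience=20):
    """Call train_step() once per epoch and stop as soon as validation_loss()
    has not improved for `patience` consecutive epochs.

    Returns (best validation loss, epoch at which it was reached).
    `patience` and `max_epochs` are illustrative defaults, not paper values.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_step()
        loss = validation_loss()
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_loss, best_epoch
```

In practice one would also checkpoint the model parameters at `best_epoch` and restore them after the loop ends.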
For the task of graph classification we zero-pad the feature vectors of each node to make them all the same length when we inject structural information into the node feature vectors.
For the task of triangle counting we follow Knyazev et al.  and use the one-hot representation of node degrees as node feature vectors to impose some structural information in the network.
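As an illustration of the degree-based features mentioned above, the following sketch builds one-hot degree vectors from a dense adjacency matrix. The function name and the `max_degree` cap are assumptions made for this example and are not taken from the referenced implementation.

```python
def one_hot_degree_features(adjacency, max_degree=None):
    """Return one feature row per node with a single 1 at the index equal to
    the node's degree.

    adjacency: list of lists with 0/1 entries (symmetric, no self loops).
    max_degree: optional cap (hypothetical parameter); degrees above the cap
    share the last slot.
    """
    degrees = [sum(row) for row in adjacency]
    if max_degree is None:
        max_degree = max(degrees)
    features = []
    for degree in degrees:
        row = [0] * (max_degree + 1)
        row[min(degree, max_degree)] = 1
        features.append(row)
    return features
```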
The experiments were run on a GPU cluster with 7 Nvidia 1080Ti GPUs and, when the memory consumption was too large to fit in the GPUs, on a CPU cluster equipped with 8 12-core Intel Xeon Gold 5118 CPUs @ 2.30GHz and 1.5 TB of RAM.
In the rest of this Section we go through each model used in our experiments, and we specify the architecture, the hyperparameters, and the position of the node embeddings used for RWRReg. For a review of these models, we refer to Appendix C.
|GCN (for node classification)||https://github.com/tkipf/pygcn|
|GCN (for graph classification)||https://github.com/bknyaz/graph_nn|
|GCN (for triangle counting)|
A.1 GCN (node classification)
We use a two-layer architecture. The first layer outputs a 16-dimensional embedding vector for each node, and passes it through a ReLU activation, before applying dropout Srivastava et al. , with probability . The second layer outputs a $c$-dimensional embedding vector for each node, where $c$ is the number of output classes, and these vectors are passed through Softmax to get the output probabilities for each class. An additional L2-loss is added with a balancing term of . The model is trained using the Adam optimizer Kingma and Ba  with a learning rate of 0.01.
We apply the RWRReg on the 16-dimensional node embeddings after the first layer.
A.2 GCN (graph classification)
We first have two GCN layers, each one generating a 128-dimensional embedding vector for each node. Then we apply max-pooling on the features of the nodes and pass the pooled 128-dimensional vector to a two-layer feed-forward neural network with $c$ neurons at the last layer, where $c$ is the number of output classes. A ReLU activation is applied in between the two feed-forward layers, and Softmax is applied after the last layer. Dropout Srivastava et al.  is applied in between the last GCN layer and the feed-forward layer, and in between the feed-forward layers (after ReLU), in both cases with probability of 0.1. The model is trained using the Adam optimizer Kingma and Ba  with a learning rate of 0.0005.
We apply the RWRReg on the 128-dimensional node embeddings after the last GCN layer.
A.3 GCN (counting triangles)
We first have three GCN layers, each one generating a 64-dimensional embedding vector for each node. Then we apply max-pooling on the features of the nodes and pass the pooled 64-dimensional vector to a one-layer feed-forward neural network with one neuron. Dropout Srivastava et al.  is applied in between the last GCN layer and the feed-forward layer with probability of 0.1. The model is trained by minimizing the mean squared error (MSE) and is optimized using the Adam optimizer Kingma and Ba  with a learning rate of 0.005.
We apply the RWRReg on the 64-dimensional node embeddings after the last GCN layer.
A.4 GraphSage
We use a two-layer architecture. For Cora we sample 5 nodes per neighbourhood at the first layer and 5 at the second, while on the other datasets we sample 10 nodes per neighbourhood at the first layer and 25 at the second. Both layers are composed of mean-aggregators (i.e., we take the mean of the feature vectors of the nodes in the sampled neighbourhood) that output a 128-dimensional embedding vector per node. After the second layer these embeddings are multiplied by a learnable matrix with size $128 \times c$, where $c$ is the number of output classes, thus giving a $c$-dimensional vector per node. These vectors are passed through Softmax to get the output probabilities for each class. The model is optimized using Stochastic Gradient Descent with a learning rate of 0.7.
We apply the RWRReg on the 128-dimensional node embeddings after the second aggregation layer.
A.5 GAT
We use a two-layer architecture. The first layer uses an 8-headed attention mechanism that outputs an 8-dimensional embedding vector per node. LeakyReLU is set with slope . Dropout Srivastava et al.  (with probability of 0.6) is applied after both layers. The second layer outputs a $c$-dimensional vector for each node, where $c$ is the number of classes; before passing each vector through Softmax to obtain the output predictions, the vectors are passed through an ELU activation Clevert et al. . An additional L2-loss is added with a balancing term of . The model is optimized using Adam Kingma and Ba  with a learning rate of 0.005.
We apply the RWRReg on the 8-dimensional node embeddings after the first attention layer. A particular note needs to be made for the training of GATs: we found that naively implementing the RWRReg term on the node embeddings in between two layers leads to an exploding loss, as the RWRReg term grows exponentially at each epoch. We believe this happens because the attention mechanism in GATs allows the network to infer that certain close nodes, even 1-hop neighbours, might not be important to a specific node, and so they should not be embedded close to each other. This clearly conflicts with the RWRReg loss, since 1-hop neighbours always have a high score. We solved this issue by using the attention weights to scale the RWR coefficients at each epoch (we make sure that gradients are not calculated for this operation, as we only use the attention weights for scaling). This way the RWRReg penalizations are in accordance with the attention mechanism, and still encode long-range dependencies.
A.6 DiffPool
We use a 1-pooling architecture. The initial node feature matrix is passed through two 3-layer GCNs (one to obtain the assignment matrix and one for node embeddings), where each layer outputs a 20-dimensional vector per node. Pooling is then applied, where the number of clusters is set as 10% of the number of nodes in the graph, and then another 3-layer GCN is applied to the pooled node features. Batch normalization Ioffe and Szegedy  is added in between every GCN layer. The final graph embedding is passed through a 2-layer MLP with a final Softmax activation. An additional L2-loss is added with a balancing term of , together with two pooling-specific losses. The first enforces the intuition that nodes that are close to each other should be pooled together and is defined as: $L_{LP} = \lVert A^{(l)} - S^{(l)} {S^{(l)}}^{\top} \rVert_F$, where $\lVert \cdot \rVert_F$ is the Frobenius norm, $A^{(l)}$ is the adjacency matrix at layer $l$, and $S^{(l)}$ is the assignment matrix at layer $l$. The second one encourages the cluster assignment to be close to a one-hot vector, and is defined as: $L_E = \frac{1}{n} \sum_{i=1}^{n} H(S_i)$, where $H$ is the entropy function and $S_i$ is the $i$-th row of the assignment matrix. However, in the implementation available online, the authors do not make use of these additional losses. We follow the latter implementation. The model is optimized using Adam Kingma and Ba  with a learning rate of 0.001.
We apply the RWRReg on the 20-dimensional node embeddings after the first 3-layer GCN (before pooling). We also tried applying it after pooling on the coarsened graph, but the fact that this graph can change during training leads to poor results.
A.7 k-GNN
We use the hierarchical 1-2-3-GNN architecture (which is the one showing the highest empirical results). First a 1-GNN is applied to obtain node embeddings, then these embeddings are used as initial values for the 2-GNN (1-2-GNN). The embeddings of the 2-GNN are then used as initial values for the 3-GNN (1-2-3-GNN). The 1-GNN applies 3 graph convolutions (as defined in Appendix C.5), while the 2-GNN and the 3-GNN apply 2 graph convolutions. Each convolution outputs a 64-dimensional vector and is followed by an ELU activation Clevert et al. . For each $k$, node features are then globally averaged and the final vectors are concatenated and passed through a three-layer MLP. The first layer outputs a 64-dimensional vector, the second outputs a 32-dimensional vector, and the third outputs a $c$-dimensional vector, where $c$ is the number of output classes. To obtain the final output probabilities for each class, log(Softmax) is applied, and the negative log likelihood is used as loss function. An ELU activation Clevert et al.  is applied after the first and the second MLP layers; furthermore, dropout Srivastava et al.  is applied after the first MLP layer with probability 0.5. The model is optimized using Adam Kingma and Ba  with a learning rate of 0.01, and a decaying learning rate schedule based on validation results (with minimum value of ).
We apply the RWRReg on the 64-dimensional node embeddings after the 1-GNN. We were not able to apply it after the 2-GNN and the 3-GNN as well, as doing so caused out-of-memory issues with our computing resources. In fact, $k$-GNNs are very expensive memory-wise.
Appendix B Proof of Proposition 1
Given a graph $G = (V, E)$, we define its $k$-step RWR representation as the set of vectors $\mathbf{r}_v = [r^{(k)}_{v,u_1}, \dots, r^{(k)}_{v,u_n}]$, $v \in V$, where each entry $r^{(k)}_{v,u}$ describes the probability that a RWR of length $k$ starting in $v$ ends in $u$.
Proposition 1. Let $G_1$ and $G_2$ be two non-isomorphic graphs for which the 1-WL algorithm terminates with the correct answer after $k$ iterations and starting from the labelling of all 1’s. Then the $k$-step RWR representations of $G_1$ and $G_2$ are different.
Proof. Consider the WL algorithm with initial labeling given by all 1’s. It’s easy to see that i) after $k$ iterations the label of a node $v$ corresponds to the information regarding the degree distribution of the neighborhood of distance $\leq k$ from $v$ and ii) in iteration $i$, the degrees of nodes at distance $i$ from $v$ are included in the label of $v$. In fact, after the first iteration, two nodes have the same colour if they have the same degree, as the colour of each node is given by the multiset of the colours of its neighbours (and we start with initial labeling given by all 1’s). After the second colour refinement iteration two nodes have the same colour if they had the same colour after the first iteration (i.e. have the same degree), and the multisets containing the colours (degrees) of their neighbours are the same. In general, after the $i$-th iteration, two nodes have the same colour if they had the same colour in iteration $i-1$, and the multiset containing the degrees of the neighbours at distance $i$ is the same for the two nodes. Hence, two nodes that have different colours after a certain iteration will have different colours in all the successive iterations. Furthermore, the colour after the $i$-th iteration depends on the colour at the previous iteration (which “encodes” the distribution of degrees of neighbours up to distance $i-1$ included), and the multiset of the degrees of neighbours at distance $i$.
Given two non-isomorphic graphs $G_1$ and $G_2$, if the WL algorithm terminates with the correct answer starting from the all 1’s labelling in $k$ iterations, it means that there is no matching between vertices in $G_1$ and vertices in $G_2$ such that matched vertices have the same degree distribution for neighborhoods at distance exactly $k$. Equivalently, any matching that minimizes the number of matched vertices with different degree distribution has at least one such pair. Now consider one such matching $M$, and let $v \in G_1$ and $w \in G_2$ be vertices matched in $M$ with different degree distributions for neighborhoods at distance exactly $k$. Since $v$ and $w$ have different degree distributions at distance $k$, the number of choices for paths of length $k$ starting from $v$ and $w$ must be different (since the number of choices for the $k$-th edge on the path is different). Therefore, there must be at least a node $u \in G_1$ and a node $z \in G_2$ that are matched by $M$ but for which the number of paths of length $k$ from $v$ to $u$ is different from the number of paths of length $k$ from $w$ to $z$. Since $r^{(k)}_{v,u}$ is proportional to the number of paths of length $k$ from $v$ to $u$, we have that $r^{(k)}_{v,u} \neq r^{(k)}_{w,z}$, that is $\mathbf{r}_v \neq \mathbf{r}_w$. Thus, the $k$-step RWR representations of $G_1$ and $G_2$ are different. ∎
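The colour refinement procedure used in the proof can be sketched in a few lines. This is an illustrative implementation of 1-WL starting from the all 1’s labelling; the function names are hypothetical, and the two graphs are refined jointly (as a disjoint union) so that their colours are comparable.

```python
from collections import Counter

def wl_colours(adj, iterations):
    """Colour refinement from the all-1's labelling.
    adj: dict mapping each node to the list of its neighbours."""
    colours = {v: 1 for v in adj}
    for _ in range(iterations):
        # New colour = (old colour, sorted multiset of neighbour colours).
        signatures = {
            v: (colours[v], tuple(sorted(colours[u] for u in adj[v])))
            for v in adj
        }
        # Relabel signatures with small integers, deterministically.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colours = {v: palette[signatures[v]] for v in adj}
    return colours

def wl_distinguishes(adj_a, adj_b, iterations):
    """True if the colour histograms of the two graphs differ after
    `iterations` rounds of refinement."""
    union = {("a", v): [("a", u) for u in nbrs] for v, nbrs in adj_a.items()}
    union.update({("b", v): [("b", u) for u in nbrs] for v, nbrs in adj_b.items()})
    colours = wl_colours(union, iterations)
    hist_a = Counter(c for (side, _), c in colours.items() if side == "a")
    hist_b = Counter(c for (side, _), c in colours.items() if side == "b")
    return hist_a != hist_b
```

For example, a path on three nodes and a triangle are separated after one iteration (their degree multisets differ), whereas a 6-cycle and two disjoint triangles are never separated, since every node has degree 2 in both graphs.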
Appendix C Model Reviews
In this section we provide a review of the models chosen for our experimental evaluation. Let us first define some notation: we use $\sigma$ to indicate a non-linear activation function (e.g., ReLU), and we use $L$ to indicate the number of layers in a model. We also define CONCAT as the function that takes as input two vectors and returns their concatenation.
We provide details about the specific implementation of each model in Appendix A.
C.1 Graph Convolutional Networks
While the term “Graph Convolutional Network” refers to the entirety of models that operate on graphs by emulating the convolution operation, the model that is nowadays associated with this term is the GCN developed by Kipf and Welling . Let $H^{(l)} \in \mathbb{R}^{n \times d_l}$ be the matrix containing the $d_l$-dimensional embedding of each node at layer $l$ (with $H^{(0)} = X$, the matrix of input node features). Each GCN layer (or filter) is defined by the following equation:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$
where $\tilde{A} = A + I$ is the adjacency matrix with self loops, $\tilde{D}$ is a diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, and $W^{(l)}$ is the learnable weight matrix at layer $l$. The final node embeddings are given by $H^{(L)}$.
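To make the layer definition concrete, here is a small dependency-free sketch of a single GCN layer with symmetric normalization of the self-looped adjacency matrix and a ReLU activation. The function names are illustrative, and dense Python lists are used only for readability; real implementations rely on sparse tensor libraries.

```python
import math

def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adjacency, features, weights):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 . H . W)."""
    n = len(adjacency)
    # Add self loops.
    a_tilde = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    # Degrees of the self-looped graph.
    deg = [sum(row) for row in a_tilde]
    # Symmetric normalization.
    a_hat = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    pre_activation = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, x) for x in row] for row in pre_activation]
```

Stacking two such calls, with a different weight matrix each, gives the kind of two-layer architecture described in Appendix A.1 (without dropout).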
C.2 GraphSage
GraphSage Hamilton et al.  is a spatial approach that learns a set of aggregation functions that take as input a set of node embeddings and return a fixed-size vector. These functions are used to aggregate the nodes in a neighbourhood, and, for each node, the output is passed through a feed-forward neural network to obtain its embedding. The nodes in each neighbourhood are sampled uniformly. Let $\mathcal{N}(v)$ be a fixed-size set of nodes uniformly sampled from the 1-hop neighbourhood of node $v$. Let $W^{(l)}$ be the learnable weight matrix at layer $l$, and let AGGREGATE be a permutation invariant function (e.g. sum, mean) that takes as input a multiset of node feature vectors, and returns a single vector of the same dimension. We describe GraphSage’s embedding procedure in Algorithm 1.
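A single GraphSage-style layer with a mean aggregator can be sketched as below. This is a simplified illustration, not a transcription of Algorithm 1: the function name is hypothetical, neighbours are sampled without replacement, the activation is fixed to ReLU, and the final normalization step of the original method is omitted.

```python
import random

def graphsage_mean_layer(adj_list, h, weight, sample_size, rng):
    """One layer: sample neighbours, average their embeddings, concatenate
    with the node's own embedding, then apply a linear map and ReLU.

    adj_list: dict node -> list of neighbours
    h:        dict node -> embedding (list of floats, all of the same length)
    weight:   (2 * dim) x out_dim matrix given as a list of rows
    rng:      a random.Random instance used for neighbour sampling
    """
    new_h = {}
    for v, nbrs in adj_list.items():
        dim = len(h[v])
        sampled = rng.sample(nbrs, min(sample_size, len(nbrs))) if nbrs else []
        if sampled:
            aggregated = [sum(h[u][k] for u in sampled) / len(sampled)
                          for k in range(dim)]
        else:
            aggregated = [0.0] * dim  # isolated node: nothing to aggregate
        concatenated = h[v] + aggregated  # CONCAT(own embedding, aggregate)
        new_h[v] = [max(0.0, sum(x * w for x, w in zip(concatenated, col)))
                    for col in zip(*weight)]
    return new_h
```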
C.3 Graph Attention Networks
The Graph Attention Network (GAT) Veličković et al.  is a powerful variant of GCN that uses an attention operator to weight the importance of each node in a neighbourhood in order to obtain more meaningful node embeddings. Let $a$ be a single-layer feedforward neural network that takes as input the concatenation of the feature vectors of two nodes, and outputs an attention coefficient. Let $W$ be a learnable weight matrix. We describe the procedure behind GATs in Algorithm 2, where LeakyReLU is the variant of ReLU defined as: $\text{LeakyReLU}(x) = x$ if $x > 0$, and $\alpha x$ otherwise, where $\alpha$ is the slope.
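The attention computation can be illustrated with a single-head sketch. The names below are hypothetical, the attention network is reduced to a single weight vector `a`, the default slope of 0.2 is an assumption made for this example, and the output activation is left out.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_attention(adj_list, h, weight, a, slope=0.2):
    """Single-head attention over each neighbourhood.

    adj_list: dict node -> non-empty list of neighbours (add the node itself
              to its own list if self-attention is wanted)
    h:        dict node -> input feature vector
    weight:   in_dim x out_dim matrix (list of rows)
    a:        attention vector of length 2 * out_dim
    Returns (alpha, out): alpha[i][j] is the normalized attention of i over j,
    out[i] is the attention-weighted sum of transformed neighbour features.
    """
    wh = {v: [sum(x * w for x, w in zip(h[v], col)) for col in zip(*weight)]
          for v in h}
    alpha, out = {}, {}
    for i, nbrs in adj_list.items():
        scores = {j: leaky_relu(sum(p * q for p, q in zip(a, wh[i] + wh[j])), slope)
                  for j in nbrs}
        top = max(scores.values())  # subtract the max for a stable softmax
        exps = {j: math.exp(s - top) for j, s in scores.items()}
        total = sum(exps.values())
        alpha[i] = {j: e / total for j, e in exps.items()}
        out[i] = [sum(alpha[i][j] * wh[j][k] for j in nbrs)
                  for k in range(len(wh[i]))]
    return alpha, out
```

A multi-head layer, such as the 8-headed one of Appendix A, would run this computation once per head with separate parameters and combine the per-head outputs.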
C.4 DiffPool
DiffPool Ying et al. [2018b] generates hierarchical representations of graphs, emulating the pooling layer of a CNN. Each pooling layer outputs a coarsened representation of the graph at the previous layer, which can ease the task of graph classification.
DiffPool uses two GCNs at each layer: one that generates embeddings, and another that generates a soft cluster-assignment matrix, which is then used to obtain the feature and adjacency matrices of the coarsened graph returned as output of the layer. We present DiffPool in detail in Algorithm 3, where $n_l$ is a hyperparameter representing the number of clusters at layer $l$, and $d_l$ is the dimension of the node feature vectors at layer $l$. At the last layer everything is pooled into a single cluster/node to get the final graph embedding.
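The coarsening step can be sketched as follows, under the usual formulation in which the soft assignment matrix S (a row-wise softmax of the pooling GCN output) maps embeddings Z and adjacency A to S^T Z and S^T A S. The function names are illustrative, and the two GCNs that would produce `embeddings` and `assignment_logits` are not shown.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def row_softmax(m):
    result = []
    for row in m:
        top = max(row)
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

def diffpool_coarsen(adjacency, embeddings, assignment_logits):
    """Return (pooled_features, pooled_adjacency) for one pooling layer.

    adjacency:         n x n matrix
    embeddings:        n x d matrix (output of the embedding GCN)
    assignment_logits: n x clusters matrix (output of the pooling GCN)
    """
    s = row_softmax(assignment_logits)  # soft cluster assignments
    s_t = transpose(s)
    pooled_features = matmul(s_t, embeddings)            # clusters x d
    pooled_adjacency = matmul(matmul(s_t, adjacency), s)  # clusters x clusters
    return pooled_features, pooled_adjacency
```

Note that the diagonal of the pooled adjacency matrix accumulates the edges that fall inside each cluster.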
C.5 k-GNN
The main idea behind $k$-GNNs Morris et al.  is to consider groups of $k$ nodes, in order to perform message passing between subgraph structures, rather than nodes. This should allow the network to access structural information that would not be available at node level.
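We present here the tractable version of the $k$-GNN algorithm. Let $[V]^k$ be the set of all possible $k$-element subsets over $V$. Let $s = \{s_1, \dots, s_k\} \in [V]^k$, then a neighbourhood of $s$ is defined as: $N(s) = \{t \in [V]^k \mid |s \cap t| = k-1\}$. The local neighbourhood is then defined as $N_L(s) = \{t \in N(s) \mid (v, w) \in E \text{ for the unique } v \in s \setminus t \text{ and } w \in t \setminus s\}$. The propagation function of each layer is then defined as:
$$f_k^{(l)}(s) = \sigma\Big(f_k^{(l-1)}(s)\, W_1^{(l)} + \sum_{u \in N_L(s)} f_k^{(l-1)}(u)\, W_2^{(l)}\Big), \qquad f_k^{(0)}(s) = f^{\text{iso}}(s),$$
where $f^{\text{iso}}(s)$ is a function returning a one-hot encoding of the isomorphism type of $s$, and $W_1^{(l)}$ and $W_2^{(l)}$ are the learnable matrices at layer $l$. The authors propose a hierarchical version that combines the information at different granularities, where the initial feature vector of $s$ is given by a concatenation of the isomorphism type and the features learned by a $(k-1)$-GNN:
$$f_k^{(0)}(s) = \sigma\Big(\Big[f^{\text{iso}}(s),\ \sum_{u \subset s} f_{k-1}^{(L_{k-1})}(u)\Big] W_{k-1}\Big),$$
where $L_{k-1}$ is the number of the last layer in the $(k-1)$-GNN, and $W_{k-1}$ is a learnable matrix.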
Appendix D Datasets
We briefly present here some additional details about the datasets used for our experimental section. Table 5 summarizes the datasets for node classification, while Table 6 presents information about the datasets for graph classification and triangle counting. Finally, Table 7 contains download links for the datasets.
|Dataset||Graphs||Classes||Avg. # Nodes||Avg. # Edges|
Appendix E Adjacency Matrix Features Lead to Bad Generalization on the Triangle Counting Task
We present additional details about the overfitting behaviour of GCN on the triangle counting task when injected with adjacency matrix information. In Figure 3 we plot the evolution of the MSE on the training and test set over the training epochs. GCN-AD reaches the lowest error on the training set, while the highest on the test set, thus confirming its overfitting behaviour. We can observe that after 6 epochs, GCN-AD is already the model presenting the lowest training loss, and it remains so until the end. Furthermore we can notice how the test loss presents a growing trend, which is in contrast to the other models.
Appendix F Empirical Analysis of the Random Walk with Restart Matrix
We now analyse the RWR matrix to justify the use of RWR for the encoding of long range dependencies, and other important structural information. We consider the three node classification datasets (see Section 4 of the paper), as this is the task with the largest input graphs, and hence where this kind of information seems more relevant.
We first consider the distribution of the RWR weights at different distances from a given node (we consider RWR with a restart probability of , as done for the experimental evaluation of our proposed technique). In particular, for each node, we take the sum of the weights assigned to the 1-hop neighbours, the 2-hop neighbours, and so on. We then take the average, over all nodes, of the sum of the RWR weights at each hop. We discard nodes that belong to connected components with diameter , and we only plot the values for the distances that have an average sum of weights higher than . Plots are shown in Figure 4. We notice that the RWR matrix contains information that goes beyond the immediate neighbourhood of a node. In fact, we see that approximately of the weights are contained within the 6-hop neighbourhood, with a significant portion that is not contained in the 2-hop neighbourhood usually accessed by GCN-like models.
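The per-hop analysis described above can be sketched as follows. This is an illustrative computation rather than the code used for Figure 4: the RWR scores are obtained by simple fixed-point iteration, the function names are hypothetical, the default restart probability of 0.15 is an assumed placeholder, and every node is assumed to have at least one neighbour.

```python
from collections import deque

def rwr_scores(adj_list, start, restart_prob=0.15, iterations=100):
    """Approximate the stationary RWR distribution from `start`:
    at each step the walker restarts with probability restart_prob,
    otherwise it moves to a uniformly chosen neighbour."""
    scores = {v: 0.0 for v in adj_list}
    scores[start] = 1.0
    for _ in range(iterations):
        nxt = {v: 0.0 for v in adj_list}
        for v, nbrs in adj_list.items():
            share = (1.0 - restart_prob) * scores[v] / len(nbrs)
            for u in nbrs:
                nxt[u] += share
        nxt[start] += restart_prob
        scores = nxt
    return scores

def bfs_distances(adj_list, start):
    """Shortest-path (hop) distance from `start` to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in adj_list[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def weight_per_hop(adj_list, start, restart_prob=0.15):
    """Sum of the RWR weights assigned to the nodes at each hop distance."""
    scores = rwr_scores(adj_list, start, restart_prob)
    per_hop = {}
    for v, d in bfs_distances(adj_list, start).items():
        if d > 0:
            per_hop[d] = per_hop.get(d, 0.0) + scores[v]
    return per_hop
```

Averaging `weight_per_hop` over all starting nodes gives the kind of per-distance profile discussed in the text.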
|Dataset||Average Kendall Tau-b|
Table 8: Average and standard deviation, over all nodes, of Kendall Tau-b values measuring the non-trivial relationships between nodes captured by the RWR weights.
Next we analyse whether RWR captures some non-trivial relationships between nodes. In particular, we investigate if there are nodes that are far from the starting node, but receive a higher weight than some closer nodes.
To quantify this property we use the Kendall Tau-b measure (Kendall ); we use the Tau-b version because the elements in the sequences we analyze are not all distinct. In more detail, for each node $v$ we consider the sequence $\mathbf{w}_v$ where the $i$-th element is the weight that the RWR from node $v$ has assigned to node $i$: $\mathbf{w}_v[i] = w_{v,i}$. We then define the sequence $\mathbf{d}_v$ such that $\mathbf{d}_v[i] = \text{dist}(v, m_i)$, where dist(x, y) is the shortest path distance between node x and node y, and $m_i$ is the node with the $i$-th highest RWR weight in $\mathbf{w}_v$. Intuitively, if the RWR matrix were not capable of capturing non-trivial relationships, $\mathbf{d}_v$ would be a sorted list (with repetitions). By comparing $\mathbf{d}_v$ with its sorted version using the Kendall Tau-b rank, we obtain a value between 1 and $-1$, where 1 means that the two sequences are identical, and $-1$ means that one is the reverse of the other. Table 8 presents the results, averaged over all nodes, on the node classification datasets. These results show that while there is a strong relation between the information provided by RWR and the distance between nodes, there is information in the RWR that is not captured by shortest path distances.
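The measure can be sketched with a direct quadratic-time implementation of Kendall Tau-b, followed by the comparison between the distance sequence and its sorted version. The function names are hypothetical, and for large graphs a library routine would be preferable to this pairwise loop.

```python
import math

def kendall_tau_b(x, y):
    """Kendall Tau-b between two equally long sequences (handles ties)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both sequences: ignored by Tau-b
            elif dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x_only)
                      * (concordant + discordant + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")

def distance_rank_agreement(rwr_weights, distances):
    """Order the nodes by decreasing RWR weight, read off their distances,
    and compare that sequence with its sorted version."""
    order = sorted(range(len(rwr_weights)), key=lambda i: -rwr_weights[i])
    by_weight = [distances[i] for i in order]
    return kendall_tau_b(by_weight, sorted(by_weight))
```

A value of 1 means that the RWR weights decrease monotonically with distance, while lower values indicate that some farther nodes outrank closer ones.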
As an example of the non-trivial relationships encoded by RWR, Figure 5 presents a sequence taken from a node in Cora. This sequence obtains a Kendall Tau-b value of . We can observe that the nodes at distance 1 are the nodes with the highest weights, however, for distances greater than 1, we already have some non-trivial relationships. In fact, we observe some nodes at distance 3 that receive a larger weight than nodes at distance 2. There are many other interesting non-trivial relationships, for example we notice that some nodes at distance 7, and some at distance 11, obtain a higher weight than some nodes at distance 5.
Appendix G Fast Implementation of the Random Walk with Restart Regularization
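Let $H \in \mathbb{R}^{n \times d}$ be the matrix containing the node embeddings, and $S \in \mathbb{R}^{n \times n}$ be the matrix with the RWR statistics. We are interested in the following quantity:
$$\sum_{i=1}^{n} \sum_{j=1}^{n} S_{i,j} \lVert H_{i,:} - H_{j,:} \rVert^2 .$$
To calculate it in a fast way (especially when using GPUs) we use the following procedure. Let us first define the following matrices:
$$\hat{S} = S + S^{\top}, \qquad D = \text{diag}(d_1, \dots, d_n) \text{ with } d_i = \sum_{j=1}^{n} \hat{S}_{i,j}, \qquad \Delta = D - \hat{S},$$
where we are allowed to make $S$ symmetric because $\lVert H_{i,:} - H_{j,:} \rVert^2 = \lVert H_{j,:} - H_{i,:} \rVert^2$. We then have
$$\sum_{i=1}^{n} \sum_{j=1}^{n} S_{i,j} \lVert H_{i,:} - H_{j,:} \rVert^2 = \sum_{k=1}^{d} H_{:,k}^{\top} \Delta H_{:,k} = \text{Tr}(H^{\top} \Delta H),$$
where $\text{Tr}(\cdot)$ is the trace of the matrix. Note that $H_{:,k}^{\top}$ is the $k$-th column of $H$, transposed, so its size is $1 \times n$.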
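The equivalence between a pairwise penalty and a trace expression can be checked numerically. The sketch below assumes that the quantity of interest is the sum, over all node pairs, of the RWR score times the squared Euclidean distance between the two embeddings, and that the symmetrized matrix is the RWR matrix plus its transpose; both function names are hypothetical, and in practice the trace form would be evaluated with dense GPU matrix products.

```python
def pairwise_penalty(h, s):
    """Direct O(n^2 d) evaluation: sum_ij s[i][j] * ||h[i] - h[j]||^2."""
    n = len(h)
    return sum(
        s[i][j] * sum((a - b) ** 2 for a, b in zip(h[i], h[j]))
        for i in range(n) for j in range(n)
    )

def trace_penalty(h, s):
    """Same quantity written as Tr(H^T (D - S_sym) H),
    with S_sym = S + S^T and D the diagonal matrix of row sums of S_sym."""
    n, d = len(h), len(h[0])
    s_sym = [[s[i][j] + s[j][i] for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in s_sym]
    laplacian = [[(degree[i] if i == j else 0.0) - s_sym[i][j]
                  for j in range(n)] for i in range(n)]
    total = 0.0
    for k in range(d):
        column = [h[i][k] for i in range(n)]  # k-th column of H
        total += sum(column[i] * laplacian[i][j] * column[j]
                     for i in range(n) for j in range(n))
    return total
```

Both functions return the same value up to floating-point error, which is what makes the trace form a drop-in replacement for the pairwise loop.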