1 Introduction
Graphs are ubiquitous in the real world and its description through scientific models. They are used to study the spread of information, to optimize delivery, to recommend new books, to suggest friends, or to find a party’s potential voters. Deep learning approaches have achieved great success on many important graph problems such as link prediction (Grover & Leskovec, 2016; Bojchevski et al., 2018), graph classification (Duvenaud et al., 2015; Niepert et al., 2016; Gilmer et al., 2017) and semi-supervised node classification (Yang et al., 2016; Kipf & Welling, 2017).
There are many approaches for leveraging deep learning algorithms on graphs. Node embedding methods use random walks or matrix factorization to directly train individual node embeddings, often without using node features and usually in an unsupervised manner, i.e. without leveraging node classes (Perozzi et al., 2014; Tang et al., 2015; Nandanwar & Murty, 2016; Grover & Leskovec, 2016; Qiu et al., 2018; Faerman et al., 2018). Many other approaches use both graph structure and node features in a supervised setting. Examples for these include spectral graph convolutional neural networks (Bruna et al., 2014; Defferrard et al., 2016), message passing (or neighbor aggregation) algorithms (Kearnes et al., 2016; Kipf & Welling, 2017; Hamilton et al., 2017; Pham et al., 2017; Monti et al., 2017; Gilmer et al., 2017), and neighbor aggregation via recurrent neural networks (Scarselli et al., 2009; Li et al., 2016; Dai et al., 2018). Among these categories, the class of message passing algorithms has garnered particular attention recently.
Several works have been aimed at improving the basic neighborhood aggregation scheme by using attention mechanisms (Kearnes et al., 2016; Hamilton et al., 2017; Veličković et al., 2018), random walks (Abu-El-Haija et al., 2018a; Ying et al., 2018; Li et al., 2018), edge features (Kearnes et al., 2016; Gilmer et al., 2017; Schlichtkrull et al., 2018) and making it more scalable on large graphs (Chen et al., 2018; Ying et al., 2018). However, all of these methods only use the information of a very limited neighborhood for each node.
Increasing the size of the neighborhood used by the algorithm, i.e. its range, however, is not trivial since neighborhood aggregation in this scheme is essentially a type of Laplacian smoothing and too many layers lead to oversmoothing (Li et al., 2018). Xu et al. (2018) highlighted the same problem by establishing a relationship between the message passing algorithm termed Graph Convolutional Network (GCN) by Kipf & Welling (2017) and a random walk. Using this relationship we can see that GCN converges to this random walk’s limit distribution as the number of layers increases. This limit distribution is a property of the graph as a whole and does not take the random walk’s starting (root) node into account. As such it is unsuited to describe the root node’s neighborhood. Hence, the quality of this aggregation procedure necessarily deteriorates for a high number of layers (or aggregation/propagation steps) since we approach the limit distribution.
To solve this issue, in this paper, we first highlight the inherent connection between the limit distribution and PageRank (Page et al., 1998). We then propose an algorithm that utilizes a propagation scheme derived from personalized PageRank instead. This algorithm adds a chance of teleporting back to the root node, which ensures that the score encodes the local neighborhood for every root node (Page et al., 1998). The teleport probability allows us to balance the needs of preserving locality (i.e. staying close to the root node to avoid oversmoothing) and leveraging the information from a large neighborhood. We show that this propagation scheme permits the use of far more (in fact, infinitely many) propagation steps without leading to oversmoothing.
Moreover, while propagation and classification are inherently intertwined in message passing schemes, our proposed algorithm separates the neural network from the propagation scheme. This allows us to achieve a much higher range without changing the neural network, whereas in the message passing scheme every additional propagation step would require an additional layer. It also permits the independent development of the propagation algorithm and the neural network generating predictions from node features. That is, we can combine any state-of-the-art prediction method with our propagation scheme. We even found that adding our propagation scheme significantly improves the accuracy of networks that have been trained without using any graph information.
Our model achieves state-of-the-art results while requiring fewer parameters and less training time compared to most competing models, with a computational complexity that is linear in the number of edges. We show these results in the most thorough study (including significance testing) of message passing models using graphs with text-based features that has been done so far.
2 Graph convolutional networks and their limited range
We first introduce our notation and explain the problem our model solves. Let $G = (V, E)$ be a graph with nodes $V$ and edges $E$. Let $n$ denote the number of nodes and $m$ the number of edges. The nodes are described by the feature matrix $X \in \mathbb{R}^{n \times f}$, with the number of features $f$ per node, and the class (or label) matrix $Y \in \mathbb{R}^{n \times c}$, with the number of classes $c$. The graph $G$ is described by the adjacency matrix $A \in \mathbb{R}^{n \times n}$. $\tilde{A} = A + I_n$ denotes the adjacency matrix with added self-loops.
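One simple and widely used message passing algorithm for semi-supervised classification is the Graph Convolutional Network (GCN). In the case of two message passing layers its equation is
(1) $Z_{\text{GCN}} = \operatorname{softmax}\left(\hat{\tilde{A}}\, \operatorname{ReLU}\left(\hat{\tilde{A}} X W_0\right) W_1\right)$
where $Z_{\text{GCN}} \in \mathbb{R}^{n \times c}$ are the predicted node labels, $\hat{\tilde{A}} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is the symmetrically normalized adjacency matrix with self-loops, with the diagonal degree matrix $\tilde{D}_{ij} = \sum_k \tilde{A}_{ik} \delta_{ij}$, and $W_0$ and $W_1$ are trainable weight matrices (Kipf & Welling, 2017).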
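The two-layer forward pass can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the reference implementation: the helper names (`normalized_adjacency`, `gcn_forward`), the four-node path graph, the features and the weights are all made-up assumptions, and the matrices are dense lists for readability.

```python
import math

def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense 0/1 adjacency matrix."""
    n = len(adj)
    a_tilde = [[float(adj[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def softmax_rows(m):
    out = []
    for row in m:
        shift = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def gcn_forward(adj, x, w0, w1):
    """Two message passing layers: softmax(A_hat ReLU(A_hat X W0) W1)."""
    a_hat = normalized_adjacency(adj)
    hidden = relu(matmul(a_hat, matmul(x, w0)))
    return softmax_rows(matmul(a_hat, matmul(hidden, w1)))

# Hypothetical toy data: path graph 0 - 1 - 2 - 3, two features, two classes.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
w0 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
w1 = [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.2]]
z = gcn_forward(adj, x, w0, w1)

# Node 3 is three hops away from node 0, so changing its features
# cannot affect node 0's prediction after only two propagation steps.
x_perturbed = [row[:] for row in x]
x_perturbed[3] = [5.0, -3.0]
z_perturbed = gcn_forward(adj, x_perturbed, w0, w1)
```

Each propagation step multiplies by the normalized adjacency matrix once, so the receptive field of a node grows by exactly one hop per layer.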
With two GCN layers, only neighbors in the two-hop neighborhood are considered. There are essentially two reasons why a message passing algorithm like GCN can’t be trivially expanded to use a larger neighborhood. First, aggregation by averaging causes oversmoothing if too many layers are used. The model therefore loses its focus on the local neighborhood (Li et al., 2018). Second, most common aggregation schemes use learnable weight matrices in each layer. Therefore, using a larger neighborhood necessarily increases the depth and number of learnable parameters of the neural network (the growth in parameters can be circumvented by weight sharing, although this is typically not done). However, the required neighborhood size and neural network depth are two completely orthogonal aspects. This fixed relationship is a strong limitation and leads to bad compromises.
We will start by concentrating on the first issue. Using a randomization assumption for the ReLU activations, Xu et al. (2018) have shown that for a $k$-layer GCN the influence score of node $x$ on $y$, $I(x, y)$, is proportional in expectation to a slightly modified $k$-step random walk distribution starting at the root node $x$, $P_{\text{rw}'}(x \to y, k)$. Hence, the information of node $x$ spreads to node $y$ in a random walk-like manner. If we take the limit $k \to \infty$ and the graph is irreducible and aperiodic, this random walk probability distribution $P_{\text{rw}'}(x \to y, k)$ converges to the limit (or stationary) distribution $P_{\text{lim}}(\to y)$. This distribution can be obtained by solving the equation $\pi_{\text{lim}} = \hat{\tilde{A}} \pi_{\text{lim}}$. Obviously, the result only depends on the graph as a whole and is independent of the random walk’s starting (root) node $x$. This global property is therefore unsuitable for describing the root node’s neighborhood.
3 Personalized propagation of neural predictions
From message passing to personalized PageRank. We can solve the problem of lost focus by recognizing the connection between the limit distribution and PageRank (Page et al., 1998). The only differences between these two are the added self-loops and the adjacency matrix normalization in $\hat{\tilde{A}}$. Original PageRank is calculated via $\pi_{\text{pr}} = A_{\text{rw}} \pi_{\text{pr}}$, with $A_{\text{rw}} = A D^{-1}$. Having made this connection we can now consider using a variant of PageRank that takes the root node into account: personalized PageRank (Page et al., 1998). We define the root node $x$ via the teleport vector $i_x$, which is a one-hot indicator vector. Our adaptation of personalized PageRank can be obtained for node $x$ using the recurrent equation $\pi_{\text{ppr}}(i_x) = (1 - \alpha) \hat{\tilde{A}} \pi_{\text{ppr}}(i_x) + \alpha i_x$, with the teleport (or restart) probability $\alpha \in (0, 1]$. By solving this equation, we obtain
(2) $\pi_{\text{ppr}}(i_x) = \alpha \left(I_n - (1 - \alpha) \hat{\tilde{A}}\right)^{-1} i_x$
Introducing the teleport vector $i_x$ allows us to preserve the node’s local neighborhood even in the limit distribution. In this model the influence score of root node $x$ on node $y$, $I(x, y)$, is proportional to the $y$-th element of our personalized PageRank $\pi_{\text{ppr}}(i_x)$. This value is different for every root node. How fast it decreases as we move away from the root node can be adjusted via $\alpha$. By substituting the indicator vector $i_x$ with the unit matrix $I_n$ we obtain our fully personalized PageRank matrix $\Pi_{\text{ppr}} = \alpha \left(I_n - (1 - \alpha) \hat{\tilde{A}}\right)^{-1}$, whose element $(yx)$ specifies the influence score of node $x$ on node $y$, $I(x, y) \propto \Pi_{\text{ppr}}^{(yx)}$. Note that due to symmetry $\Pi_{\text{ppr}}^{(yx)} = \Pi_{\text{ppr}}^{(xy)}$, i.e. the influence of $x$ on $y$ is equal to the influence of $y$ on $x$. Also, this inverse always exists since $\frac{1}{1 - \alpha} > 1$ and therefore can’t be an eigenvalue of $\hat{\tilde{A}}$ (see Appendix A).
Personalized propagation of neural predictions (PPNP). To utilize the above influence scores for semi-supervised classification we generate predictions for each node based on its own features and then propagate them via our fully personalized PageRank scheme to generate the final predictions. This is the foundation of personalized propagation of neural predictions. PPNP’s model equation is
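(3) $Z_{\text{PPNP}} = \operatorname{softmax}\left(\alpha \left(I_n - (1 - \alpha) \hat{\tilde{A}}\right)^{-1} H\right), \quad H_{i,:} = f_\theta(X_{i,:})$
where $X$ is the feature matrix and $f_\theta$ a neural network with parameter set $\theta$ generating the predictions $H \in \mathbb{R}^{n \times c}$. Note that $f_\theta$ operates on each node’s features independently.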
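The exact propagation step can be illustrated on a toy graph. The sketch below is an assumption-laden illustration, not the authors' implementation: it uses dense lists, a small hand-written Gauss-Jordan solver instead of a linear algebra library, a hypothetical prediction matrix `h` standing in for the output of the neural network, and a teleport probability `alpha = 0.1` chosen only for the demonstration.

```python
import math

def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense 0/1 adjacency matrix."""
    n = len(adj)
    a_tilde = [[float(adj[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def propagate_exact(adj, h, alpha):
    """Return alpha * (I - (1 - alpha) * A_hat)^{-1} @ h without forming the inverse."""
    n = len(adj)
    a_hat = normalized_adjacency(adj)
    system = [[(1.0 if i == j else 0.0) - (1.0 - alpha) * a_hat[i][j]
               for j in range(n)] for i in range(n)]
    return solve(system, [[alpha * v for v in row] for row in h])

def softmax_rows(m):
    out = []
    for row in m:
        shift = max(row)
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Hypothetical toy data: path graph 0 - 1 - 2 - 3. Only the two end nodes
# carry informative (made-up) predictions for class 0 and class 1.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
h = [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
alpha = 0.1

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
pi = propagate_exact(adj, identity, alpha)       # fully personalized PageRank matrix
z = softmax_rows(propagate_exact(adj, h, alpha))  # propagated predictions
```

In this toy setting the inner nodes have no signal of their own, yet after propagation node 1 leans towards the class of its closer end node 0 and node 2 towards that of node 3.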
As a consequence, PPNP separates the neural network used for generating predictions from the propagation scheme. This separation additionally solves the second issue mentioned above: the depth of the neural network is now fully independent of the propagation algorithm. As we saw when connecting GCN to PageRank, personalized PageRank can effectively use even infinitely many neighborhood aggregation layers, which is clearly not possible in the classical message passing framework. Furthermore, the separation gives us the flexibility to use any method for generating predictions, e.g. deep convolutional neural networks for graphs of images.
While generating predictions and propagating them happen consecutively during inference, it is important to note that the model is trained end-to-end. That is, the gradient flows through the propagation scheme during backpropagation (implicitly considering infinitely many neighborhood aggregation layers). Adding these propagation effects significantly improves the model’s accuracy.
Efficiency analysis. Directly calculating the fully personalized PageRank matrix $\Pi_{\text{ppr}}$ is computationally inefficient and results in a dense $\mathbb{R}^{n \times n}$ matrix. Using this matrix would lead to a computational complexity and memory requirement of $\mathcal{O}(n^2)$ for training and inference.
To solve this issue, reconsider the equation $Z = \operatorname{softmax}\left(\alpha \left(I_n - (1 - \alpha) \hat{\tilde{A}}\right)^{-1} H\right)$. Instead of viewing this equation as a combination of a dense fully personalized PageRank matrix with the prediction matrix, we can also view it as a variant of topic-sensitive PageRank, with each class corresponding to one topic (Haveliwala, 2002). In this view every column of $H$ defines an (unnormalized) distribution over nodes that acts as a teleport set. Hence, we can approximate PPNP via an approximate computation of topic-sensitive PageRank.
Approximate personalized propagation of neural predictions (APPNP). More precisely, APPNP achieves linear computational complexity by approximating topic-sensitive PageRank via power iteration. While PageRank’s power iteration is connected to the regular random walk, the power iteration of topic-sensitive PageRank is related to a random walk with restarts. Each power iteration (random walk/propagation) step of our topic-sensitive PageRank variant is, thus, calculated via
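(4) $Z^{(0)} = H = f_\theta(X), \quad Z^{(k+1)} = (1 - \alpha) \hat{\tilde{A}} Z^{(k)} + \alpha H, \quad Z^{(K)} = \operatorname{softmax}\left((1 - \alpha) \hat{\tilde{A}} Z^{(K-1)} + \alpha H\right)$
where the prediction matrix $H$ acts as both the starting vector and the teleport set, $K$ defines the number of power iteration steps and $k \in [0, K-2]$. Note that this method retains the graph’s sparsity and never constructs an $\mathbb{R}^{n \times n}$ matrix. The convergence of this iterative scheme can be shown by investigating the resulting series (see Appendix B).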
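The power iteration can be sketched with neighbor lists, so that no dense matrix over all node pairs is ever built. This is a minimal sketch under stated assumptions: the function names (`normalized_neighbors`, `appnp_scores`, `fixed_point_residual`), the toy path graph, the prediction matrix `h` and the value `alpha = 0.1` are illustrative choices, and the final softmax is omitted so that the raw scores can be inspected.

```python
import math

def normalized_neighbors(edges, n):
    """Sparse D^{-1/2} (A + I) D^{-1/2} as per-node lists of (neighbor, weight)."""
    neighbors = [{i} for i in range(n)]  # start with the self-loops
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    deg = [len(nb) for nb in neighbors]
    return [[(j, 1.0 / math.sqrt(deg[i] * deg[j])) for j in sorted(neighbors[i])]
            for i in range(n)]

def propagate(a_hat, z):
    """One sparse product A_hat @ z; the cost is linear in the number of edges."""
    num_classes = len(z[0])
    return [[sum(w * z[j][c] for j, w in row) for c in range(num_classes)]
            for row in a_hat]

def appnp_scores(a_hat, h, alpha, num_steps):
    """Iterate Z <- (1 - alpha) * A_hat @ Z + alpha * H, starting from Z = H."""
    z = [row[:] for row in h]
    for _ in range(num_steps):
        az = propagate(a_hat, z)
        z = [[(1.0 - alpha) * az[i][c] + alpha * h[i][c] for c in range(len(h[0]))]
             for i in range(len(h))]
    return z

def fixed_point_residual(a_hat, z, h, alpha):
    """Frobenius norm of (1 - alpha) * A_hat @ Z + alpha * H - Z."""
    az = propagate(a_hat, z)
    return math.sqrt(sum(((1.0 - alpha) * az[i][c] + alpha * h[i][c] - z[i][c]) ** 2
                         for i in range(len(h)) for c in range(len(h[0]))))

# Hypothetical toy data: path graph 0 - 1 - 2 - 3 with made-up predictions.
a_hat = normalized_neighbors([(0, 1), (1, 2), (2, 3)], n=4)
h = [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
alpha = 0.1

residuals = {k: fixed_point_residual(a_hat, appnp_scores(a_hat, h, alpha, k), h, alpha)
             for k in (2, 10, 200)}
```

The residual measures how far the current iterate is from satisfying the fixed-point equation; it shrinks by at least a factor of `1 - alpha` per step, which is why a moderate number of steps already gives a good approximation.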
Note that the propagation scheme of this model does not require any additional parameters to train – as opposed to models like GCN, which require more parameters for each additional propagation layer. We can therefore propagate very far with very few parameters. Our experiments show that this ability is indeed very beneficial (Section 6). A similar model expressed in the message passing framework would therefore not be able to achieve the same level of performance.
The reformulation of PPNP via fixed-point iterations illustrates a connection to the graph neural network (GNN) model (Scarselli et al., 2009). While the latter uses a learned fixed-point iteration, our approach uses a fixed one and applies a (learned) feature transformation before propagation.
In both PPNP and APPNP, the size of the neighborhood influencing each node can be adjusted via the teleport probability $\alpha$. The freedom to choose $\alpha$ allows us to adjust the model for different types of networks, since varying graph types require the consideration of different neighborhood sizes, as shown in Section 6 and described by Grover & Leskovec (2016) and Abu-El-Haija et al. (2018b).
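How the teleport probability controls locality can be seen on a small regular graph. The following sketch is illustrative only: the ring graph, the helper names and the two teleport probabilities (0.05 and 0.5) are assumptions chosen for the demonstration, not values taken from the experiments.

```python
import math

def ring_neighbors(n):
    """Neighbor lists (including the self-loop) of a ring graph with n >= 3 nodes."""
    return [[(i - 1) % n, i, (i + 1) % n] for i in range(n)]

def personalized_pagerank(neighbors, root, alpha, num_steps=500):
    """Power iteration for pi = (1 - alpha) * A_hat @ pi + alpha * i_root."""
    n = len(neighbors)
    deg = [len(nb) for nb in neighbors]
    teleport = [1.0 if i == root else 0.0 for i in range(n)]
    pi = list(teleport)
    for _ in range(num_steps):
        pi = [(1.0 - alpha)
              * sum(pi[j] / math.sqrt(deg[i] * deg[j]) for j in neighbors[i])
              + alpha * teleport[i]
              for i in range(n)]
    return pi

neighbors = ring_neighbors(8)
pi_low = personalized_pagerank(neighbors, root=0, alpha=0.05)   # wide neighborhood
pi_high = personalized_pagerank(neighbors, root=0, alpha=0.5)   # narrow neighborhood

for distance in range(5):
    print(distance, round(pi_low[distance], 3), round(pi_high[distance], 3))
```

On this ring, a larger teleport probability concentrates the influence on the root node, while a smaller one spreads it further along the ring, including to node 4 on the opposite side.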
4 Related work
Several works have tried to improve the training of message passing algorithms and increase the neighborhood available at each node by adding skip connections (Li et al., 2016; Pham et al., 2017; Hamilton et al., 2017; Ying et al., 2018). One recent approach combined skip connection with aggregation schemes (Xu et al., 2018). However, the range of all of these models is still limited, as apparent in the low number of message passing layers used. While it is possible to add skip connections in the neural network used by our algorithm, this would not influence the propagation scheme. Our approach to solving the range problem is therefore unrelated to these models.
Li et al. (2018) facilitated training by combining message passing with co- and self-training. The improvements achieved by this combination are similar to results reported with other semi-supervised classification models (Buchnik & Cohen, 2018). Note that most algorithms, including ours, can be improved using self- and co-training. However, each additional step used by these methods corresponds to a full training cycle and therefore significantly increases the training time.
5 Experimental setup
Recently, many experimental evaluations have suffered from experimental bias by using varying training setups, from superficial statistical evaluation, and from overfitting. The latter is caused by experiments only using a single training/validation/test split, by not distinguishing clearly between the validation and test set, and by fine-tuning hyperparameters to each dataset or even data split. Message passing algorithms are very sensitive to both data splits and weight initialization (as clearly shown by our evaluation). Thus, a carefully designed evaluation protocol is extremely important. Our work aims to establish such a thorough evaluation protocol. First, we run each experiment 100 times on multiple random splits and initializations. Second, we split the data into a visible and a test set, which do not change. The test set was only used once to report the final performance; in particular, it has never been used to perform hyperparameter and model selection. To further prevent overfitting we use the same number of layers and hidden units, dropout rate, regularization parameter, and learning rate across datasets, since all datasets are scientific networks and use bag-of-words as features. To prevent experimental bias we optimized the hyperparameters of all models individually on the same setup using a grid search on Citeseer and Cora-ML and use the same early stopping criterion across models.
Finally, to ensure the statistical robustness of our experimental setup, we calculate confidence intervals via bootstrapping and report the p-values of a paired $t$-test for our main claims. To our knowledge, this is the most rigorous study on GCN-like models which has been done so far on graphs having text-based features. More details about the experimental setup are provided in Appendix C.
Table 1: Dataset statistics.
Dataset | Type | Classes | Features | Nodes | Edges | Label rate | Avg. diameter
Citeseer | Citation |
Cora-ML | Citation |
PubMed | Citation |
MS Academic | Coauthor |
Datasets. We use four text-classification datasets for evaluation. Citeseer (Sen et al., 2008), Cora-ML (McCallum et al., 2000) and PubMed (Namata et al., 2012) are citation graphs, where each node represents a paper and the edges represent citations between them. We further introduce the Microsoft Academic graph, where edges represent co-authorship. We use the largest connected component of each graph. All graphs use a bag-of-words representation of the papers’ abstracts as features. Table 1 reports the dataset statistics.
Baseline models. We use five state-of-the-art models as baselines: GCN (Kipf & Welling, 2017), network of GCNs (N-GCN) (Abu-El-Haija et al., 2018a), graph attention networks (GAT) (Veličković et al., 2018), bootstrapped (self-trained) feature propagation (bt. FP) (Buchnik & Cohen, 2018) and jumping knowledge networks with concatenation (JK) (Xu et al., 2018). For GCN we also show the results of the (unoptimized) vanilla version (V. GCN) to demonstrate the strong impact of early stopping and hyperparameter optimization. We describe the optimized hyperparameters of all models in Appendix D.
Model hyperparameters. To ensure a fair model comparison we used a neural network for PPNP that is structurally very similar to GCN and has the same number of parameters. We use two layers with $h$ hidden units. We apply regularization with parameter $\lambda$ on the weights of the first layer and use dropout with dropout rate $d$ on both layers and the adjacency matrix. For APPNP, adjacency dropout is resampled for each power iteration step. For propagation we use the teleport probability $\alpha$ and $K$ power iteration steps for APPNP. We use a different value of $\alpha$ on the Microsoft Academic graph due to its structural difference (see Figure 5 and its discussion). The combination of this shallow neural network with a comparatively high number of power iteration steps achieved the best results during hyperparameter optimization (see Appendix H).
6 Results
Overall accuracy. The results for the accuracy (micro F1-score) are summarized in Table 2. Similar trends are observed for the macro F1-score (see Appendix E). Both PPNP and APPNP significantly outperform the state-of-the-art baseline models on all datasets. Our rigorous setup might understate the improvements achieved by PPNP and APPNP; this result is statistically significant, as tested via a paired $t$-test (see Appendix F). This thorough setup furthermore shows that the advantages reported by recent works practically vanish when training is harmonized, hyperparameters are properly optimized and multiple data splits are considered. A simple GCN with optimized hyperparameters outperforms several recently proposed models on our setup.
Figure 2 shows how broad the accuracy distribution of each model is. This is caused by both random initialization and different data splits (train / early stopping / test). This demonstrates how crucial a statistically rigorous evaluation is for a conclusive model comparison. Moreover, it shows the sensitivity (robustness) of each method, e.g. PPNP, APPNP and GAT typically have lower variance.
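A bootstrapped confidence interval over repeated runs, as used in this kind of evaluation, can be sketched as follows. This is a generic percentile bootstrap written for illustration; the function name `bootstrap_mean_ci`, the number of resamples and the accuracy values are made up and are not results from this study.

```python
import random

def bootstrap_mean_ci(values, num_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(num_resamples))
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * num_resamples)]
    upper = means[min(num_resamples - 1, int((1.0 - tail) * num_resamples))]
    return lower, upper

# Hypothetical accuracies from repeated runs with different splits and seeds.
accuracies = [0.81, 0.84, 0.79, 0.83, 0.80, 0.85, 0.78, 0.82, 0.83, 0.81]
lower, upper = bootstrap_mean_ci(accuracies)
print(f"mean accuracy {sum(accuracies) / len(accuracies):.3f}, "
      f"95% CI [{lower:.3f}, {upper:.3f}]")
```

Because the interval is derived from the empirical distribution of the runs, it reflects the spread caused by both initialization and data splits without assuming normality.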
Table 2: Accuracy (micro F1-score) per model and dataset.
Model | Citeseer | Cora-ML | PubMed | MS Academic
V. GCN |
GCN |
N-GCN |
GAT |
JK |
Bt. FP |
PPNP |
APPNP |
out of memory on PubMed, MS Academic (see efficiency analysis in Section 3)
Table 3: Average training time per epoch. PPNP and APPNP are only slightly slower than GCN and much faster than more sophisticated methods like GAT.
Graph | V. GCN | GCN | N-GCN | GAT | JK | Bt. FP | PPNP | APPNP
Citeseer |
Cora-ML |
PubMed |
MS Academic |
not applicable, since core method not trainable
out of memory on PubMed, MS Academic (see efficiency analysis in Section 3)
Training time per epoch. We report the average training time per epoch in Table 3. We only compare the training time per epoch since all hyperparameters were optimized solely for accuracy and the early stopping criterion used is very generous. Obviously, (exact) PPNP can only be applied to moderately sized graphs, while APPNP scales to large data. On average, APPNP is slightly slower than GCN due to its higher number of matrix multiplications. It scales similarly with graph size as GCN and is therefore significantly faster than other more sophisticated models like GAT. This is observed even though our implementation improved GAT’s training time roughly by a factor of 2 compared to the reference implementation.
Training set size. Since the labeling rate is often very small for real-world datasets, investigating how the models perform with a small number of training samples is very important. Figure 3 shows how the number of training nodes per class impacts the achieved accuracy on Cora-ML. The results on the other datasets can be found in Appendix G. The dominance of PPNP and APPNP becomes especially clear for small training sets. This can be attributed to their higher range, which allows them to propagate the information further away from the (few) training nodes.
Number of power iteration steps. Figure 4 shows how the accuracy depends on the number of power iteration steps for two different propagation schemes. The first mimics the standard propagation as known from GCNs (i.e. taking $\alpha = 0$ in APPNP). As clearly shown, the performance breaks down as we increase the number of power iteration steps $K$ (since we approach the global PageRank solution). However, when using personalized propagation (with a nonzero teleport probability $\alpha$) the accuracy increases and converges to the solution of (exact) PPNP with infinitely many propagation steps, thus demonstrating that the personalized propagation principle is indeed beneficial. As also shown in the figure, a moderate number of power iterations is enough to approximate the exact PPNP solution.
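The difference between the two propagation schemes can be reproduced qualitatively on a toy graph. In the sketch below, which is an illustration under assumptions rather than the experiment itself, a one-hot vector is propagated from each end of a six-node path for 200 steps, once without teleport and once with an arbitrarily chosen teleport probability of 0.1, and the cosine similarity of the two resulting vectors is compared. All names and values are hypothetical.

```python
import math

def path_neighbors(n):
    """Neighbor lists of a path graph with a self-loop added to every node."""
    return [[j for j in (i - 1, i, i + 1) if 0 <= j < n] for i in range(n)]

def propagate_from_root(neighbors, root, alpha, num_steps):
    """Iterate z <- (1 - alpha) * A_hat @ z + alpha * i_root, starting at i_root."""
    n = len(neighbors)
    deg = [len(nb) for nb in neighbors]
    teleport = [1.0 if i == root else 0.0 for i in range(n)]
    z = list(teleport)
    for _ in range(num_steps):
        z = [(1.0 - alpha)
             * sum(z[j] / math.sqrt(deg[i] * deg[j]) for j in neighbors[i])
             + alpha * teleport[i]
             for i in range(n)]
    return z

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

neighbors = path_neighbors(6)
similarity = {}
for alpha in (0.0, 0.1):
    from_left = propagate_from_root(neighbors, 0, alpha, 200)
    from_right = propagate_from_root(neighbors, 5, alpha, 200)
    similarity[alpha] = cosine(from_left, from_right)
    print(f"alpha={alpha}: cosine similarity between opposite roots = "
          f"{similarity[alpha]:.3f}")
```

Without teleport the two propagated vectors become almost parallel, so the result no longer depends on the root node; with teleport they stay clearly distinct even after many steps.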
Teleport probability $\alpha$. Figure 5 shows how the accuracy on the validation set is affected by the hyperparameter $\alpha$. While the optimum differs slightly for every dataset, the best-performing teleport probabilities consistently lie in a similar range. This probability should be adjusted for the dataset under investigation, since different graphs exhibit different neighborhood structures (Grover & Leskovec, 2016; Abu-El-Haija et al., 2018b).
Neural network without propagation. PPNP and APPNP are trained end-to-end, with the propagation scheme affecting (i) the neural network during training, and (ii) the classification decision during inference. Investigating how the model performs without propagation shows if and how valuable this addition is. Figure 6 shows how propagation affects both training and inference. "Never" denotes the case where no propagation is used; essentially we train and apply a standard multilayer perceptron (MLP) using the features only. "Training" denotes the case where we use APPNP during training to learn $f_\theta$; at inference time, however, only $f_\theta$ is used to predict the class labels. "Inference", in contrast, denotes the case where $f_\theta$ is trained without APPNP (i.e. standard MLP on features); this pre-trained network with fixed weights, however, is then used within APPNP during inference time. Finally, "Inf. & Training" denotes the regular APPNP, which always uses propagation.
The best results are achieved with regular APPNP, which validates our approach. However, on most datasets the classification accuracy decreases surprisingly little when propagation is only performed during inference. Skipping propagation during training can significantly reduce training time for large graphs, since all nodes can be handled independently. Furthermore, this shows that our model can be combined with pre-trained neural networks that do not incorporate any graph information and still significantly improve their accuracy. Moreover, Figure 6 shows that only applying propagation during training can also lead to large improvements. This suggests that our model should be able to handle cases of online and inductive learning where only the features and not the neighborhood information of an incoming (previously unobserved) node are available.
7 Conclusion
In this paper we have introduced personalized propagation of neural predictions (PPNP) and a fast approximation, APPNP. We derived this model by considering the relationship between GCN and PageRank and extending it to personalized PageRank. This model solves the limited range problem inherent in many message passing models without introducing any additional parameters. It uses the information from a large, adjustable neighborhood for classifying each node and is computationally efficient. This model outperforms several modern methods for semi-supervised classification on multiple graphs in the most thorough study which has been done for GCN-like models so far.
For future work it would be interesting to combine PPNP with more complex neural networks used e.g. in computer vision or natural language processing. Furthermore, faster or incremental approximations of personalized PageRank (Bahmani et al., 2010, 2011; Lofgren et al., 2014) and more sophisticated propagation schemes would also benefit the method.
References
 Abu-El-Haija et al. (2018a) Sami Abu-El-Haija, Amol Kapoor, Bryan Perozzi, and Joonseok Lee. N-GCN: Multi-scale Graph Convolution for Semi-supervised Node Classification. In International Workshop on Mining and Learning with Graphs (MLG), 2018a.
 Abu-El-Haija et al. (2018b) Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alex Alemi. Watch Your Step: Learning Node Embeddings via Graph Attention. In NIPS, 2018b.
 Bahmani et al. (2010) Bahman Bahmani, Abdur Chowdhury, and Ashish Goel. Fast Incremental and Personalized PageRank. VLDB, 4(3):173–184, 2010.
 Bahmani et al. (2011) Bahman Bahmani, Kaushik Chakrabarti, and Dong Xin. Fast Personalized PageRank on MapReduce. In SIGMOD, 2011.
 Bojchevski et al. (2018) Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. NetGAN: Generating Graphs via Random Walks. In ICML, 2018.
 Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral Networks and Deep Locally Connected Networks on Graphs. ICLR, 2014.
 Buchnik & Cohen (2018) Eliav Buchnik and Edith Cohen. Bootstrapped Graph Diffusions: Exposing the Power of Nonlinearity. Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), 2(1):1–19, April 2018.
 Chen et al. (2018) Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling. ICLR, 2018.
 Dai et al. (2018) Hanjun Dai, Zornitsa Kozareva, Bo Dai, Alexander J. Smola, and Le Song. Learning Steady-States of Iterative Algorithms over Graphs. In ICML, 2018.
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In NIPS, 2016.
 Duvenaud et al. (2015) David K. Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional Networks on Graphs for Learning Molecular Fingerprints. In NIPS, 2015.
 Faerman et al. (2018) Evgeniy Faerman, Felix Borutta, Kimon Fountoulakis, and Michael W. Mahoney. LASAGNE: Locality And Structure Aware Graph Node Embedding. In International Conference on Web Intelligence (WI), 2018.
 Gilmer et al. (2017) Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural Message Passing for Quantum Chemistry. In ICML, 2017.
 Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
 Grover & Leskovec (2016) Aditya Grover and Jure Leskovec. node2vec: Scalable Feature Learning for Networks. In KDD, 2016.
 Hamilton et al. (2017) William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive Representation Learning on Large Graphs. In NIPS, 2017.
 Haveliwala (2002) Taher H. Haveliwala. Topic-sensitive PageRank. In WWW, 2002.
 Kearnes et al. (2016) Steven M. Kearnes, Kevin McCloskey, Marc Berndl, Vijay S. Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016.
 Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. ICLR, 2015.
 Kipf & Welling (2017) Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. ICLR, 2017.
 Li et al. (2018) Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper Insights Into Graph Convolutional Networks for Semi-Supervised Learning. In AAAI, 2018.
 Li et al. (2016) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated Graph Sequence Neural Networks. In ICLR, 2016.
 Lofgren et al. (2014) Peter Lofgren, Siddhartha Banerjee, Ashish Goel, and Seshadhri Comandur. FAST-PPR: scaling personalized pagerank estimation for large graphs. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA, August 24-27, 2014.
 Martín Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015.
 McCallum et al. (2000) Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000.
 Monti et al. (2017) Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M. Bronstein. Geometric Deep Learning on Graphs and Manifolds Using Mixture Model CNNs. In CVPR, 2017.
 Namata et al. (2012) Galileo Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven Active Surveying for Collective Classification. In International Workshop on Mining and Learning with Graphs (MLG), 2012.
 Nandanwar & Murty (2016) Sharad Nandanwar and M. N. Murty. Structural Neighborhood Based Classification of Nodes in a Network. In KDD, 2016.
 Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning Convolutional Neural Networks for Graphs. In ICML, 2016.
 Page et al. (1998) Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1998.
 Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: online learning of social representations. In KDD, 2014.
 Pham et al. (2017) Trang Pham, Truyen Tran, Dinh Q. Phung, and Svetha Venkatesh. Column Networks for Collective Classification. In AAAI, 2017.
 Qiu et al. (2018) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In ACM International Conference on Web Search and Data Mining (WSDM), 2018.
 Scarselli et al. (2009) F. Scarselli, M. Gori, Ah Chung Tsoi, M. Hagenbuchner, and G. Monfardini. The Graph Neural Network Model. IEEE Transactions on Neural Networks, 20(1):61–80, January 2009.
 Schlichtkrull et al. (2018) Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling Relational Data with Graph Convolutional Networks. In Extended Semantic Web Conference (ESWC), 2018.
 Sen et al. (2008) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective Classification in Network Data. AI Magazine, 29(3):93–106, 2008.
 Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale Information Network Embedding. In WWW, 2015.
 Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. In ICLR, 2018.
 Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation Learning on Graphs with Jumping Knowledge Networks. In ICML, 2018.
 Yang et al. (2016) Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting Semi-Supervised Learning with Graph Embeddings. In ICML, 2016.
 Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In KDD, 2018.
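Appendix A Existence of $\boldsymbol{\Pi}_{\text{PPR}}$
The matrix
$$\boldsymbol{\Pi}_{\text{PPR}} = \alpha \left( \boldsymbol{I}_n - (1 - \alpha) \hat{\tilde{\boldsymbol{A}}} \right)^{-1} \tag{5}$$
exists iff the determinant $\det\left( \boldsymbol{I}_n - (1 - \alpha) \hat{\tilde{\boldsymbol{A}}} \right) \neq 0$. For $\alpha = 1$ this holds trivially; otherwise it is the case iff $\det\left( \hat{\tilde{\boldsymbol{A}}} - \frac{1}{1 - \alpha} \boldsymbol{I}_n \right) \neq 0$, i.e. iff $\frac{1}{1 - \alpha}$ is not an eigenvalue of $\hat{\tilde{\boldsymbol{A}}}$. This value is always larger than 1 since the teleport probability $\alpha \in (0, 1]$. Furthermore, the symmetrically normalized matrix $\hat{\tilde{\boldsymbol{A}}} = \tilde{\boldsymbol{D}}^{-1/2} \tilde{\boldsymbol{A}} \tilde{\boldsymbol{D}}^{-1/2}$ has the same eigenvalues as the row-stochastic matrix $\tilde{\boldsymbol{A}}_{\text{rw}} = \tilde{\boldsymbol{D}}^{-1} \tilde{\boldsymbol{A}}$. This can be shown by multiplying the eigenvalue equation $\hat{\tilde{\boldsymbol{A}}} \boldsymbol{v} = \lambda \boldsymbol{v}$ with $\tilde{\boldsymbol{D}}^{-1/2}$ from the left and substituting $\boldsymbol{w} = \tilde{\boldsymbol{D}}^{-1/2} \boldsymbol{v}$, which gives $\tilde{\boldsymbol{A}}_{\text{rw}} \boldsymbol{w} = \lambda \boldsymbol{w}$. However, the largest eigenvalue of a row-stochastic matrix is 1, as can be proven using the Gershgorin circle theorem. Hence, $\frac{1}{1 - \alpha}$ cannot be an eigenvalue of $\hat{\tilde{\boldsymbol{A}}}$ and $\boldsymbol{\Pi}_{\text{PPR}}$ always exists.
Appendix B Convergence of APPNP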
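APPNP uses the iterative equation
$$\boldsymbol{Z}^{(0)} = \boldsymbol{H}, \qquad \boldsymbol{Z}^{(k+1)} = (1 - \alpha) \hat{\tilde{\boldsymbol{A}}} \boldsymbol{Z}^{(k)} + \alpha \boldsymbol{H}. \tag{6}$$
After the $k$-th propagation step, the resulting predictions are
$$\boldsymbol{Z}^{(k)} = \left( (1 - \alpha)^k \hat{\tilde{\boldsymbol{A}}}^k + \alpha \sum_{i=0}^{k-1} (1 - \alpha)^i \hat{\tilde{\boldsymbol{A}}}^i \right) \boldsymbol{H}. \tag{7}$$
If we take the limit $k \to \infty$, the left term tends to 0 and the right term becomes a geometric series. The series converges since $\alpha \in (0, 1]$ and $\hat{\tilde{\boldsymbol{A}}}$ is symmetrically normalized, so its eigenvalues have absolute value at most 1 and the spectral radius of $(1 - \alpha) \hat{\tilde{\boldsymbol{A}}}$ is at most $1 - \alpha < 1$, resulting in
$$\boldsymbol{Z}^{(\infty)} = \alpha \left( \boldsymbol{I}_n - (1 - \alpha) \hat{\tilde{\boldsymbol{A}}} \right)^{-1} \boldsymbol{H}, \tag{8}$$
which is the equation for calculating (exact) PPNP.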
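The existence and convergence arguments of Appendices A and B can be checked numerically on a toy graph. The following is a minimal NumPy sketch, not the authors' implementation. It assumes the propagation matrix is the symmetrically normalized adjacency matrix with added self-loops; the graph, the random predictions `h`, and the value `alpha = 0.2` are arbitrary illustrative choices.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetrically normalize A + I (self-loops added; an assumption of this sketch)."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]

def ppnp_exact(a_hat, h, alpha):
    """Closed form of Eq. (8), solved as a linear system instead of inverting."""
    n = a_hat.shape[0]
    return alpha * np.linalg.solve(np.eye(n) - (1.0 - alpha) * a_hat, h)

def appnp(a_hat, h, alpha, num_steps):
    """Iterate Eq. (6), starting from Z^(0) = H."""
    z = h.copy()
    for _ in range(num_steps):
        z = (1.0 - alpha) * a_hat @ z + alpha * h
    return z

# Hypothetical undirected toy graph with 4 nodes.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 1],
                [0, 1, 1, 0]], dtype=float)
h = np.random.default_rng(0).normal(size=(4, 3))  # stand-in for the predictions H
alpha = 0.2                                       # arbitrary teleport probability

a_hat = normalized_adjacency(adj)
# Appendix A: all eigenvalues lie in [-1, 1], so 1 / (1 - alpha) is not among them.
max_abs_eig = np.abs(np.linalg.eigvalsh(a_hat)).max()
# Appendix B: the iterates approach the closed-form solution as k grows.
z_exact = ppnp_exact(a_hat, h, alpha)
errors = [np.abs(appnp(a_hat, h, alpha, k) - z_exact).max() for k in (1, 10, 100)]
```

Since the error contracts by a factor of at most $1 - \alpha$ per step (in the Frobenius norm), the entries of `errors` shrink geometrically with the number of propagation steps.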
Appendix C Experimental details
The sampling procedure is illustrated in Figure 7. The data is first split into a visible and a test set. For the visible set, 1500 nodes were sampled for the citation graphs and 5000 for Microsoft Academic. The test set contains all remaining nodes. We use three different label sets in each experiment: a training set of 20 nodes per class, an early stopping set of 500 nodes, and either a validation or a test set. The validation set contains the remaining nodes of the visible set. We use 20 random seeds for determining the splits. These seeds are drawn once and fixed across runs to facilitate comparisons. We use one set of seeds for the validation splits and a different set for the test splits. Each experiment is run with 5 random initializations on each data split, leading to a total of 100 runs per experiment.
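The sampling procedure can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: it assumes that the training and early stopping sets are drawn from the visible set, and the function name `sample_split` and the random toy labels are hypothetical.

```python
import numpy as np

def sample_split(labels, num_visible, num_train_per_class, num_early_stopping, seed):
    """Split node indices into train / early stopping / validation / test sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(labels.shape[0])
    visible, test = perm[:num_visible], perm[num_visible:]

    # Training set: a fixed number of visible nodes per class.
    train = np.concatenate([
        rng.choice(visible[labels[visible] == c], size=num_train_per_class, replace=False)
        for c in np.unique(labels)
    ])
    # Early stopping set: sampled from the visible nodes not used for training.
    remaining = np.setdiff1d(visible, train)
    early_stopping = rng.choice(remaining, size=num_early_stopping, replace=False)
    # Validation set: whatever is left of the visible set.
    validation = np.setdiff1d(remaining, early_stopping)
    return {"train": train, "early_stopping": early_stopping,
            "validation": validation, "test": test}

# Hypothetical toy data: 3000 nodes with 3 classes.
labels = np.random.default_rng(0).integers(0, 3, size=3000)
split = sample_split(labels, num_visible=1500, num_train_per_class=20,
                     num_early_stopping=500, seed=1)
```

Fixing `seed` per split corresponds to drawing the split seeds once and reusing them across runs, so that all models are compared on identical splits.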
The early stopping criterion uses a patience of and an (unreachably high) maximum of epochs. The patience is reset whenever either the accuracy increases or the loss decreases on the early stopping set. We choose the parameter set achieving the highest accuracy and break ties by selecting the lowest loss on this set. This early stopping strategy was inspired by GAT (Veličković et al., 2018).
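A minimal sketch of this stopping and selection rule is given below. It assumes that "increases" and "decreases" are measured against the best values seen so far; the function name, the toy history, and the `patience` and `max_epochs` values in the usage lines are placeholders rather than the settings used in the experiments.

```python
import math

def select_with_early_stopping(history, patience, max_epochs):
    """history yields (accuracy, loss) on the early stopping set, one pair per epoch.

    Returns the chosen (accuracy, loss, epoch) and the last epoch that was run.
    """
    best_acc, best_loss = -math.inf, math.inf
    chosen = None
    stale_epochs = 0
    last_epoch = -1
    for epoch, (acc, loss) in enumerate(history):
        if epoch >= max_epochs:
            break
        last_epoch = epoch
        # Patience is reset if EITHER the accuracy rises or the loss falls.
        if acc > best_acc or loss < best_loss:
            stale_epochs = 0
        else:
            stale_epochs += 1
        best_acc, best_loss = max(best_acc, acc), min(best_loss, loss)
        # Keep the parameters with the highest accuracy; break ties by lowest loss.
        if chosen is None or acc > chosen[0] or (acc == chosen[0] and loss < chosen[1]):
            chosen = (acc, loss, epoch)
        if stale_epochs >= patience:
            break
    return chosen, last_epoch

# Hypothetical per-epoch (accuracy, loss) values.
history = [(0.5, 1.0), (0.6, 0.9), (0.6, 0.8), (0.55, 0.85),
           (0.58, 0.82), (0.59, 0.81), (0.9, 0.1)]
chosen_short, last_short = select_with_early_stopping(history, patience=3, max_epochs=100)
chosen_long, last_long = select_with_early_stopping(history, patience=10, max_epochs=100)
```

With the short patience, training stops before the final epoch is reached, which illustrates why a sufficiently large patience matters when the metrics plateau for a while.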
We used TensorFlow (Martín Abadi et al., 2015) for all experiments except bootstrapped feature propagation. All uncertainties and confidence intervals correspond to a confidence level of and were calculated by bootstrapping with samples.
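As an illustration of how such an interval can be obtained, the sketch below computes a percentile bootstrap interval for the mean accuracy over runs. The percentile method, the function name, the toy accuracies, and the `confidence_level` and `num_samples` values are assumptions for this example and are not taken from the experimental setup.

```python
import numpy as np

def bootstrap_mean_ci(values, confidence_level, num_samples, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample the runs with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(num_samples, len(values)))
    means = values[idx].mean(axis=1)
    tail = (1.0 - confidence_level) / 2.0
    lower, upper = np.quantile(means, [tail, 1.0 - tail])
    return values.mean(), lower, upper

# Hypothetical accuracies of 100 runs.
accuracies = np.linspace(0.70, 0.80, 100)
mean, lower, upper = bootstrap_mean_ci(accuracies, confidence_level=0.9, num_samples=500)
```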
Appendix D Baseline hyperparameters
Vanilla GCN uses the original settings of two layers with hidden units, no dropout on the adjacency matrix, regularization parameter and the original early stopping with a maximum of 200 steps and a patience of 10 steps based on the loss.
The optimized GCN uses two layers with hidden units, dropout on the adjacency matrix with and regularization parameter .
NGCN uses hidden units, heads per random walk length and random walks of up to steps. It uses regularization on all layers with and the attention variant for merging the predictions (AbuElHaija et al., 2018a). Note that this model effectively uses hidden units, which is 5 times as many as the optimized GCN model, GAT and PPNP use.
For GAT we use the (well optimized) original hyperparameters, except the regularization parameter and learning rate . As opposed to the original paper, we do not use different hyperparameters on PubMed, as described in our experimental setup.
Bootstrapped feature propagation uses a return probability of , 10 propagation steps, 10 bootstrapping (self-training) steps with training nodes added per step. We add the training nodes with the lowest entropy on the predictions. The number of nodes added per class is based on the class proportions estimated using the predictions. Note that this model does not include any stochasticity in its initialization. We therefore only run it once per train/early stopping/test split.
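One possible reading of the node selection in a single self-training step is sketched below. It assumes that class proportions are estimated as the fraction of candidate nodes predicted for each class and that per-class quotas are rounded down; both choices, as well as the function name and the toy probabilities, are illustrative assumptions.

```python
import numpy as np

def select_confident_nodes(probs, candidates, num_to_add):
    """Pick low-entropy candidates, with per-class quotas from predicted proportions."""
    cand_probs = probs[candidates]
    entropy = -(cand_probs * np.log(cand_probs + 1e-12)).sum(axis=1)
    predicted = cand_probs.argmax(axis=1)
    num_classes = probs.shape[1]
    counts = np.bincount(predicted, minlength=num_classes)
    # Quota per class, proportional to predicted class frequency (rounded down).
    quotas = counts * num_to_add // len(candidates)
    selected = []
    for c in range(num_classes):
        members = np.flatnonzero(predicted == c)
        lowest = members[np.argsort(entropy[members])[:quotas[c]]]
        selected.append(candidates[lowest])
    return np.concatenate(selected)

# Hypothetical predicted class probabilities for 6 unlabeled nodes and 2 classes.
probs = np.array([[0.99, 0.01], [0.60, 0.40], [0.90, 0.10],
                  [0.55, 0.45], [0.20, 0.80], [0.05, 0.95]])
new_train_nodes = select_confident_nodes(probs, np.arange(6), num_to_add=3)
```

Here four of the six nodes are predicted as class 0, so two of the three added nodes come from class 0 and one from class 1, each chosen by lowest prediction entropy within its class.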
For the jumping knowledge networks we use the concatenation variant with three layers and hidden units per layer. We apply regularization with on all layers and perform dropout with on all layers but not on the adjacency matrix.
Appendix E F1 score
Model  Citeseer  CoraML  PubMed  MS Academic 

V. GCN  
GCN  
NGCN  
GAT  
JK  
Bt. FP  
PPNP      
APPNP  
out of memory on PubMed, MS Academic (see efficiency analysis in Section 3) 
Appendix F Paired test
Model  Citeseer  CoraML  PubMed  MS Academic 

PPNP      
APPNP 
Model  Citeseer  CoraML  PubMed  MS Academic 

PPNP      
APPNP 