As a natural abstraction of real-world entities and their relationships, graphs are widely adopted as a tool for modeling machine learning tasks on relational data. Applications are manifold, including document classification in citation networks, user recommendations in social networks and function prediction of proteins in biological networks.
Remarkable success has been achieved by recent efforts in formulating deep learning models operating on graph-structured domains. Unsupervised node embedding techniques (perozzi2014deepwalk; tang2015line; cao2015grarep; grover2016node2vec; qiu2018network) employ matrix factorization to derive distributed vector space representations for further downstream tasks. In settings where labels are provided, semi-supervised models can be trained end-to-end to improve performance for a given task. In particular, graph neural network models (bronstein2017geometric; gilmer2017neural; battaglia2018relational) have been established as a de-facto standard for semi-supervised learning on graphs. While spectral methods (bruna2013spectral; defferrard2016convolutional; monti2017geometric; bronstein2017geometric) can be derived from a signal processing point of view, a message passing perspective (duvenaud2015convolutional; li2015gated; kearnes2016molecular; kipf2016semi; hamilton2017inductive; gilmer2017neural) has proved especially useful due to its flexibility and amenability to highly parallel GPU computation (fey2019fast). Further recent works have considered additional edge features (gilmer2017neural; schlichtkrull2018modeling) and attention mechanisms (velivckovic2017graph; thekumparampil2018attention; lee2018attention), addressed scalability (chen2018fastgcn; wu2019simplifying) and studied the expressive power of graph neural network models (xu2018powerful; morris2018weisfeiler).
While the above techniques may serve as a basis for modeling further tasks such as link prediction (kipf2016variational; zhang2018link) or graph classification (niepert2016learning; lee2018graph), we focus on semi-supervised node classification in this work. Given a graph, a feature matrix and a label matrix, the goal is to predict labels for a set of unlabeled nodes based on graph topology, node attributes and observed node labels. If no node attributes are available, auxiliary features such as one-hot vectors or node degrees may be used, depending on the task at hand. All graphs considered in the following are undirected; however, an extension to directed graphs is straightforward.
Despite their success, existing neural message passing algorithms suffer from several central issues. First, information is pulled indiscriminately from $k$-hop neighborhoods, which will include many irrelevant nodes and miss important ones. In particular, long-range dependencies are modeled ineffectively, since unnecessary messages not only impede efficiency but also introduce noise. Further, interesting correlations might exist on different levels of locality, which makes it necessary to consider multi-scale representations. These issues prevent existing neural message passing algorithms from reaching their full potential in terms of prediction performance.
To address the above issues, we propose a novel push-based neural message passing algorithm which propagates information on demand rather than indiscriminately pulling it from all neighbors. We show that it can be interpreted equivalently as either an asynchronous message passing scheme or a single synchronous message passing iteration over sparse neighborhoods derived from Approximate Personalized PageRank. Thereby, each node neighborhood is personalized to its source node, providing a stronger structural bias and resulting in a node-adaptive receptive field. Both views are illustrated in Figure 1. Consequently, our model benefits from the existing synchronous neural message passing framework while providing additional advantages derived from its asynchronous message passing interpretation. In contrast to existing synchronous methods, our model further eliminates the need for stacking multiple message passing layers to reach distant nodes by introducing a suitable neighborhood function. It additionally supports highly efficient training and is able to learn combinations of multi-scale representations.
2 Neural Message Passing Algorithms
Neural message passing algorithms follow a synchronous neighborhood aggregation scheme. Starting with an initial feature matrix, each node sends a message to each of its neighbors and updates its own state based on the aggregated received messages, and this step is repeated for a fixed number of iterations. Borrowing some notation from fey2019fast, we formalize this procedure in Algorithm 1 in terms of a message function, a permutation-invariant aggregation function and an update function. All of these functions are required to be differentiable.
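The synchronous scheme described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the paper's Algorithm 1: the function name `synchronous_message_passing`, the dictionary-based graph encoding and the toy path graph are assumptions made for this example, and differentiability is ignored.

```python
from typing import Callable, Dict, List

Vector = List[float]


def synchronous_message_passing(
    neighbors: Dict[int, List[int]],
    features: Dict[int, Vector],
    num_iterations: int,
    message: Callable[[Vector, Vector], Vector],
    aggregate: Callable[[List[Vector]], Vector],
    update: Callable[[Vector, Vector], Vector],
) -> Dict[int, Vector]:
    """Run a fixed number of synchronous rounds; all nodes update at once."""
    states = {v: list(x) for v, x in features.items()}
    for _ in range(num_iterations):
        new_states = {}
        for v, nbrs in neighbors.items():
            # Messages are computed from the states of the previous round.
            msgs = [message(states[v], states[u]) for u in nbrs]
            agg = aggregate(msgs) if msgs else [0.0] * len(states[v])
            new_states[v] = update(states[v], agg)
        states = new_states
    return states


def sum_aggregate(msgs: List[Vector]) -> Vector:
    """Permutation-invariant aggregation: element-wise sum of messages."""
    return [sum(vals) for vals in zip(*msgs)]


# Hypothetical toy setup: path graph 0 - 1 - 2, only node 0 carries a signal.
path = {0: [1], 1: [0, 2], 2: [1]}
x = {0: [1.0], 1: [0.0], 2: [0.0]}
send_state = lambda receiver, sender: sender  # message function
add_update = lambda state, agg: [s + a for s, a in zip(state, agg)]

after_one = synchronous_message_passing(path, x, 1, send_state, sum_aggregate, add_update)
after_two = synchronous_message_passing(path, x, 2, send_state, sum_aggregate, add_update)
```

In this toy example, node 2 is two hops away from node 0, so its state is still zero in `after_one` and only becomes non-zero in `after_two`, which illustrates how the receptive field grows by one hop per iteration.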
One of the simplest and most widespread representatives of this framework is the Graph Convolutional Network (GCN) (kipf2016semi), which can be defined via
$$H^{(k+1)} = \sigma\left(\hat{A} H^{(k)} W^{(k)}\right),$$
where $H^{(0)} = X$ is the input feature matrix, $W^{(k)}$ are learnable weight matrices, $\sigma$ is a non-linearity (ReLU is used for hidden layers, softmax for the final prediction layer), and $\hat{A} = \tilde{D}^{-1/2} (A + I) \tilde{D}^{-1/2}$ is a symmetrically normalized adjacency matrix with self-loops, with $\tilde{D}$ denoting the degree matrix of $A + I$. Note that due to self-loops each node also aggregates its own features. Normalization preserves the scale of the feature vectors. In each GCN layer, features are transformed and aggregated from direct neighbors as a weighted sum.
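The symmetric normalization with self-loops and the resulting weighted-sum aggregation of a GCN layer can be made concrete with a small example. The following is a minimal pure-Python sketch under the standard formulation of kipf2016semi; the helper names (`normalized_adjacency`, `matmul`, `gcn_layer`) and the dense list-of-lists matrix encoding are assumptions chosen for readability, not an efficient or official implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def normalized_adjacency(adj: Matrix) -> Matrix:
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def relu(m: Matrix) -> Matrix:
    return [[max(0.0, x) for x in row] for row in m]


def gcn_layer(a_hat: Matrix, h: Matrix, w: Matrix) -> Matrix:
    """One hidden GCN layer: transform features, aggregate, apply ReLU."""
    return relu(matmul(a_hat, matmul(h, w)))


# Hypothetical toy graph with two connected nodes and one feature each.
a_hat = normalized_adjacency([[0, 1], [1, 0]])
out = gcn_layer(a_hat, [[1.0], [3.0]], [[1.0]])
```

With self-loops both nodes have degree 2, so every entry of `a_hat` is 0.5 and each node ends up with the average of its own and its neighbor's feature.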
While various models which can be formulated in this framework have achieved remarkable performance, a general issue with synchronous message passing schemes is that long-range dependencies in the graph are not modeled effectively. If the neighborhood function returns the one-hop neighborhood of each node (which is commonly the case), then each message passing iteration expands the receptive field by one hop. For a single node to gather information from another node at distance $k$, $k$ message passing iterations need to be performed for all nodes in the graph. Sending a large number of unnecessary messages not only results in unnecessary computation but also introduces noise into the learned node features. On the same note, xu2018representation and li2018deeper pointed out an over-smoothing effect. xu2018representation showed that with an increasing number of layers, node importance in a GCN converges to the graph's random walk limit distribution, i.e., all local information is lost.
2.1 Asynchronous Message Passing
Instead of passing messages along all edges in multiple subsequent rounds, one might consider an asynchronous propagation scheme where nodes perform state updates and send messages one after another. In particular, pushing natively supports adaptivity, since instead of just pulling information from all neighbors, nodes are able to push and receive important information on demand. This motivates our push-based message passing framework (Algorithm 2).
First, it is important to note that each node needs to aggregate incoming messages until it is selected to be updated. For that purpose, we introduce aggregator states which contain novel unprocessed information for each node. After a node has used its aggregator state to update its own state and has pushed messages to its neighbors, the aggregator state is reset until the node receives more information and becomes active again. Further note that the aggregator states naturally lend themselves to serve as a basis for selecting the next node and for a convergence criterion, based on the amount of unprocessed information. The message, aggregation and update functions fulfill the same roles and share the same requirements as their synchronous counterparts. Though not specifically indicated, in principle, different functions may be used for different iterations.
As a particular instance of this framework, we propose Local Push Message Passing (LPMP) (Algorithm 3). For the next update, the node with the largest aggregator state is selected, since it holds the largest amount of unprocessed information. Similarly, convergence is attained if each node has only a small amount of unprocessed information left. Note that all state updates are additive and no learnable transformations are applied in order to effectively treat long-range dependencies and retain flexibility. Feature transformations may be applied before or after propagation. Further, in each iteration of the outer loop, only the features of a single source node are diffused through the graph in order to avoid excessive smoothing, which might occur when multiple features are propagated at the same time over longer distances in the graph. Also, all iterations of the outer loop are independent of each other and can be performed in parallel. Remaining details of the algorithm will be motivated and explained below.
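To make the push idea tangible, the following sketch diffuses one unit of "information" from a single source node, always processing the node with the largest unprocessed amount and stopping once every node holds at most a small threshold. It is not the paper's Algorithm 3: the concrete update rule (keep a fraction `alpha`, split the rest evenly among neighbors) is an assumed forward-push variant chosen for illustration, and the names `local_push`, `settled` and `residual` are hypothetical.

```python
from typing import Dict, List, Tuple


def local_push(
    neighbors: Dict[int, List[int]],
    source: int,
    alpha: float = 0.5,
    eps: float = 1e-4,
) -> Tuple[Dict[int, float], Dict[int, float]]:
    """Push mass from `source` until no node holds more than `eps` unprocessed."""
    residual = {source: 1.0}          # aggregator states: unprocessed information
    settled: Dict[int, float] = {}    # node states: processed information
    while residual:
        # Select the node holding the largest amount of unprocessed information.
        u = max(residual, key=residual.get)
        mass = residual[u]
        if mass <= eps:
            break  # convergence: every node has only little left to process
        del residual[u]  # reset the aggregator state of the selected node
        nbrs = neighbors[u]
        if not nbrs:
            # Assumption: an isolated node simply keeps everything it holds.
            settled[u] = settled.get(u, 0.0) + mass
            continue
        # Additive state update, no learnable transformation involved.
        settled[u] = settled.get(u, 0.0) + alpha * mass
        share = (1.0 - alpha) * mass / len(nbrs)
        for v in nbrs:
            residual[v] = residual.get(v, 0.0) + share
    return settled, residual


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3.
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
settled, residual = local_push(toy, source=0)
```

In this variant the total of settled and residual mass stays equal to one throughout, and each push moves at least `alpha * eps` into the settled states, so the loop terminates after a bounded number of pushes. Only nodes that actually receive a noticeable amount of mass are ever touched, which is the adaptivity the text refers to.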
We further wish to point out that the synchronous framework does not consider any notion of convergence but instead introduces a hyper-parameter for the number of message passing iterations. An early work on Graph Neural Network (GNN) (scarselli2009graph) applies contraction mappings and performs message passing until node states converge to a fixed point. However, neighborhood aggregation is still performed synchronously.
Finally, further instances of the general push-based message passing framework may be considered in future work. We focus on this particular algorithm due to its nice interpretation in terms of existing push algorithms (as detailed below), its favorable properties and since we observed it to perform well in practice.
3 Pushing Networks
The LPMP algorithm described above is inspired by local push algorithms for computation of Approximate Personalized PageRank (APPR) (jeh2003scaling; berkhin2006bookmark) and, in particular, we will show in the following how it can be equivalently described as a single synchronous message passing iteration using sparse APPR neighborhoods. Thus, the proposed message passing scheme effectively combines the advantages of existing synchronous algorithms with the benefits of asynchronous message passing described above.
3.1 Personalized Node Neighborhoods
Personalized PageRank (PPR) refers to a localized variant of PageRank (page1999pagerank) where random walks are restarted only from a certain set of nodes. We consider the special case in which the starting distribution is a unit vector, i.e., when computing the PPR vector $\pi_i$ of node $i$, walks are always restarted at $i$ itself. Formally, the row vector $\pi_i$ can be defined as the solution of the linear system
$$\pi_i = \alpha e_i + (1 - \alpha)\, \pi_i P,$$
where $e_i$ denotes the $i$th unit vector, $P$ denotes the random walk transition matrix of the graph, and the restart probability $\alpha$ controls the locality, where a larger value leads to stronger localization. The PPR vectors for all nodes can be stored as rows of a PPR matrix $\Pi$. Intuitively, $\pi_{ij}$ corresponds to the probability that a random walk starting at $i$ stops at $j$, where the expected length of the walk is controlled by $\alpha$. The vector $\pi_i$ can be interpreted as an importance measure for node $i$ over all other nodes, where $\pi_{ij}$ measures the importance of $j$ for $i$. Since these measures are not sparse and global computation of $\Pi$ is costly, we consider local computation of APPR instead. In particular, we refer to the Reverse Local Push algorithm (andersen2007local), since it comes with several useful theoretical properties. The complexity of computing the whole matrix is reduced to a bound that is linear in the number of nodes (andersen2007local). The approximation threshold $\epsilon$ controls the quality of approximation, sparsification and runtime, where a larger value leads to sparser solutions. For a more in-depth discussion, we refer to andersen2007local.
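The role of the restart probability can be checked numerically on small graphs. The sketch below assumes the standard PPR fixed point, in which the PPR vector equals the restart probability times the unit vector of the source plus the remaining probability times one random walk step, and solves it by simple power iteration. This is an exact (dense) reference computation for illustration, not the Reverse Local Push algorithm; the function name `personalized_pagerank` and the toy graphs are assumptions, and every node is assumed to have at least one neighbor.

```python
from typing import Dict, List


def personalized_pagerank(
    neighbors: Dict[int, List[int]],
    source: int,
    alpha: float,
    num_steps: int = 200,
) -> Dict[int, float]:
    """Iterate pi <- alpha * e_source + (1 - alpha) * pi * P until it stabilizes."""
    nodes = list(neighbors)
    pi = {v: 0.0 for v in nodes}
    pi[source] = 1.0
    for _ in range(num_steps):
        nxt = {v: 0.0 for v in nodes}
        nxt[source] += alpha  # restart mass always returns to the source
        for u in nodes:
            # One random walk step: u spreads its mass evenly over its neighbors.
            share = (1.0 - alpha) * pi[u] / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        pi = nxt
    return pi


# Hypothetical toy graphs: a single edge and a path with four nodes.
edge = {0: [1], 1: [0]}
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

pi_edge = personalized_pagerank(edge, source=0, alpha=0.5)
pi_local = personalized_pagerank(path4, source=0, alpha=0.5)
pi_global = personalized_pagerank(path4, source=0, alpha=0.1)
```

For the single edge, the fixed point can be solved by hand and gives 1 / (2 - alpha) for the source, i.e. 2/3 for a restart probability of 0.5. On the path, the larger restart probability keeps more mass at the source, matching the statement that a larger value leads to stronger localization.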
Based on the above neighborhood function, we propose the following neural message passing algorithm:
Definition 1 (PushNet)
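Let two MLPs be given, one applied before and one applied after propagation, each with its own parameters and hidden dimensions. Further, let a tensor store precomputed APPR matrices for different scales and let a differentiable scale aggregation function be given. Given input features, the layers of PushNet first transform the features with the first MLP, then propagate the transformed features with each precomputed APPR matrix, aggregate the resulting multi-scale representations per node with the scale aggregation function, and finally apply the second MLP.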
In most cases, the output dimension of the last layer equals the number of classes, such that the output provides the final predictions for each node over all classes. In general, PushNet might also be applied to different graph learning problems such as graph classification, where learned node representations are pooled and labels are predicted for whole graphs. However, we leave these further applications to future work.
To draw the connection between synchronous and asynchronous message passing, we show that the base variant of PushNet with no feature transformations and a single scale is equivalent to LPMP (Algorithm 3):
Theorem 1. Let the restart probability and the approximation threshold be fixed, let both MLPs be identity functions and let a single scale be used. Then the output of PushNet equals the output of LPMP.
The main idea is that instead of propagating features directly as in LPMP, we can first propagate scalar importance weights as in Reverse Local Push and then propagate features in a second step. Thus, the entire discussion of LPMP is directly applicable to PushNet, including adaptivity, effective treatment of long-range dependencies and avoidance of over-smoothing. We wish to point out that an additional interpretation of adaptivity can be derived from the perspective of PushNet: APPR-induced neighborhoods of different nodes are sparse and directly exclude irrelevant nodes from consideration, in contrast to commonly used $k$-hop neighborhoods. In this sense, APPR is adaptive to the particular source node. To the best of our knowledge, no existing neural message passing algorithm shares this property.
In practice, it is favorable to not propagate features using LPMP, but to pre-compute APPR matrices such that features are propagated only once along all non-zero APPR entries and there is no need to propagate gradients back over long paths of messages. Thus, PushNet benefits from the existing synchronous neural message passing framework while providing additional advantages derived from its asynchronous interpretation.
3.3 Learning Multi-Scale Representations
Additional properties of PushNet compared to LPMP include feature transformations, which may be applied before and after feature propagation. Since the optimal neighborhood size cannot be assumed to be the same for each node and patterns might be observed at multiple scales, we additionally propagate features over different localities by varying the restart probability $\alpha$. The multi-scale representations are then aggregated per node into a single vector such that the model learns to combine different scales for a given node. In particular, we consider the following scale aggregation functions:
sum: Summation of multi-scale representations. Intuitively, sum-aggregation corresponds to an unnormalized average with uniform weights attached to all scales.
Note that due to distributivity, PushNet with sum aggregation reduces to propagation with a single matrix, namely the sum of the APPR matrices over all scales, i.e., features can be propagated and additively combined over an arbitrary number of different scales at the cost of only a single propagation. Thereby, the non-zero entries of the summed matrix are given by the union of the non-zero entries of the individual APPR matrices. However, usually the number of non-zero entries will be close to that of the matrix at the largest scale, since nodes considered at a smaller scale will most often also be considered at a larger scale. Thus, complexity will be dominated by the largest scale considered.
max: Element-wise maximum of multi-scale representations. The most informative scale is selected for each feature individually. This way, different features may correspond to more local or more global properties.
cat: Concatenation of multi-scale representations. Scale combination is learned in subsequent layers. The implied objective is to learn a scale aggregation function which is globally optimal for all nodes.
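The three aggregation functions listed above, and the distributivity argument for sum aggregation, can be illustrated on tiny dense matrices. This is a minimal sketch: the matrices `appr_small` and `appr_large` are made-up stand-ins for precomputed APPR matrices at two scales, and the helper names are assumptions for this example.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product, standing in for sparse feature propagation."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def aggregate_scales(reps: List[Matrix], mode: str) -> Matrix:
    """Combine per-scale node representations into one matrix."""
    n, d = len(reps[0]), len(reps[0][0])
    if mode == "sum":   # uniform, unnormalized combination of all scales
        return [[sum(r[i][j] for r in reps) for j in range(d)] for i in range(n)]
    if mode == "max":   # most informative scale per feature
        return [[max(r[i][j] for r in reps) for j in range(d)] for i in range(n)]
    if mode == "cat":   # keep all scales, let later layers combine them
        return [[x for r in reps for x in r[i]] for i in range(n)]
    raise ValueError(f"unknown aggregation mode: {mode}")


# Made-up propagation matrices for two scales and a 2 x 2 feature matrix.
appr_small = [[0.5, 0.5], [0.0, 1.0]]
appr_large = [[1.0, 0.0], [0.25, 0.75]]
h = [[1.0, 2.0], [3.0, 4.0]]

reps = [matmul(appr_small, h), matmul(appr_large, h)]
summed = aggregate_scales(reps, "sum")
maxed = aggregate_scales(reps, "max")
concatenated = aggregate_scales(reps, "cat")

# Distributivity: propagate once with the element-wise sum of both matrices.
appr_sum = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(appr_small, appr_large)]
single_pass = matmul(appr_sum, h)
```

Here `single_pass` coincides with `summed`, which is the reason sum aggregation costs only one propagation. Concatenation doubles the feature dimension per additional scale, so the subsequent layer has to be sized accordingly.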
3.4 The PushNet Model Family
We wish to point out several interesting special cases of our model. In our default setting, prediction layers will always be dense with a softmax activation. If hidden layers are used, we use a single dense layer with ReLU activation.
PushNet. The general case in which the feature transformations before and after propagation are generic MLPs. By default, the first is a single dense hidden layer and the second is a dense prediction layer.
PushNet-PTP and PushNet-PP. No feature transformation is performed prior to propagation. In this case, the propagated features need to be computed only once and can then be cached, making learning extremely efficient. The following sub-cases are of particular interest:
PushNet-PTP. The sequence of operations is "push – transform – predict". In this case, the transformation after propagation is a generic MLP, consisting of 2 layers by default.
PushNet-PP. The model performs the operations "push – predict" and uses no hidden layers. Predictions can be interpreted as the result of logistic regression on aggregated features. This version is similar to SGC (wu2019simplifying), with the difference that SGC does not consider multiple scales and propagates over $k$-hop neighborhoods.
LPMP. In this setting, both feature transformations are identity functions and only a single scale is used, i.e., no feature transformations are performed and features are aggregated over a single scale. This setting corresponds to LPMP, cf. Theorem 1. Note that this model describes only feature propagation; no actual predictions are made.
PushNet-TPP. In this setting, a generic MLP is applied before propagation and no transformation is applied afterwards, such that the model first predicts class labels for each node and then propagates the predicted class labels. This setting is similar to APPNP (klicpera2018predict) but with some important differences. APPNP considers only a single fixed scale and does not propagate over APPR neighborhoods. Instead, multiple message passing layers are stacked to perform a power iteration approximation of PPR. The resulting receptive field is restricted to $k$-hop neighborhoods. Note that cat aggregation is not applicable here.
3.5 Comparison with Existing Neural Message Passing Algorithms
Existing message passing algorithms have explored different concepts of node importance, i.e., weights used in neighborhood aggregation. While GCN (kipf2016semi) and other GCN-like models use normalized adjacency matrix entries in each layer, Simplified Graph Convolution (SGC) (wu2019simplifying) aggregates nodes over $k$-hop neighborhoods in a single iteration using a $k$-step random walk matrix. Approximate Personalized Propagation of Neural Predictions (APPNP) (klicpera2018predict) also relies on a $k$-step random walk matrix but with restarts, which can be equivalently interpreted as a power iteration approximation of the PPR matrix. Graph Attention Network (GAT) (velivckovic2017graph) learns a similarity function which computes a pairwise importance score given two nodes' feature vectors. All of the above methods aggregate features over fixed $k$-hop neighborhoods. PushNet, on the other hand, aggregates over sparse APPR neighborhoods using the corresponding importance scores.
Multi-scale representations have been considered in Jumping Knowledge Networks (JK) (xu2018representation) where intermediate representations of a GCN or GAT base network are combined before prediction. The original intention was to avoid over-smoothing by introducing these skip connections. Similarly, liao2019lanczosnet concatenate propagated features at multiple selected scales. LD (faerman2018semi) uses APPR to compute local class label distributions at multiple scales and proposes different combinations but is limited to unattributed graphs. PushNet varies the restart probability in APPR to compute multi-scale representations and combines them using simple and very efficient aggregation functions. APPR-Roles (borutta2019structural) also employs APPR but to compute structural node embeddings in an unsupervised setting. The idea of performing no learnable feature transformations prior to propagation was explored already in SGC (wu2019simplifying). It allows for caching propagated features, resulting in very efficient training, and is also used by PushNet-PTP and PushNet-PP. LD (faerman2018semi) explored the idea of propagating class labels instead of latent representations. Similarly, APPNP (klicpera2018predict) and PushNet-TPP propagate predicted class labels.
To the best of our knowledge, the only existing work considering asynchronous neural message passing is SSE (dai2018learning). Compared to PushNet, SSE is pull-based, i.e., in each iteration a node pulls features from all neighbors and updates its state, until convergence to steady node states. To make learning feasible, stochastic training is necessary. Further, the work focuses on learning graph algorithms for different tasks and results for semi-supervised node classification were not very competitive. In contrast, PushNet offers very fast deterministic training and adaptive state updates due to a push-based approach.
We compare PushNet and its variants against six state-of-the-art models, GCN (kipf2016semi), GAT (velivckovic2017graph), JK (xu2018representation) with base model GCN and GAT, SGC (wu2019simplifying), Graph Isomorphism Network (GIN) (xu2018powerful) and APPNP (klicpera2018predict) on five established node classification benchmark datasets. For better comparability, all models were implemented using PyTorch Geometric (fey2019fast; https://github.com/rusty1s/pytorch_geometric) and trained on a single NVIDIA GeForce GTX 1080 Ti GPU.
Table 2: Optimal hyper-parameters per model. Columns: hidden size, learning rate, dropout, $L_2$-regularization strength.
Experiments are performed on semi-supervised text classification benchmarks. In particular, we consider three citation networks, CiteSeer and Cora from (sen2008collective) and PubMed from (namata2012query), and two co-authorship networks, Coauthor CS and Coauthor Physics from (shchur2018pitfalls). Statistics of these datasets are summarized in Table 1.
4.2 Experimental Setup
For the sake of an unbiased and fair comparison, we follow a rigorous evaluation protocol, similar to shchur2018pitfalls and klicpera2018predict. (Note that results for competing methods might differ from those reported in related work due to a different experimental setup.) We restrict all graphs to their largest connected components and normalize all feature vectors. Self-loops are added and different normalizations are applied to the adjacency matrices individually for each method as proposed by the respective authors. For each dataset, we sample 20 nodes per class for training and 500 nodes for validation. The remaining nodes are used as test data. Each model is evaluated on 20 random data splits with 5 random initializations, resulting in 100 runs per model and dataset. Using the same random seed for all models ensures that all models are evaluated on the same splits.
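The split protocol described above (20 training nodes per class, 500 validation nodes, the rest for testing, reproducible via a seed) can be sketched as follows. The function name `sample_split` is hypothetical, and the choice to draw the validation nodes uniformly from the nodes left after training selection is an assumption, since the text does not state whether validation sampling is stratified.

```python
import random
from typing import Dict, List, Sequence, Tuple


def sample_split(
    labels: Sequence[int],
    num_train_per_class: int = 20,
    num_val: int = 500,
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Return disjoint train/validation/test node index lists."""
    rng = random.Random(seed)  # a fixed seed yields the same split for every model
    by_class: Dict[int, List[int]] = {}
    for node, y in enumerate(labels):
        by_class.setdefault(y, []).append(node)

    train: List[int] = []
    for y in sorted(by_class):
        # Assumes every class has at least `num_train_per_class` nodes.
        train.extend(rng.sample(by_class[y], num_train_per_class))

    train_set = set(train)
    rest = [v for v in range(len(labels)) if v not in train_set]
    rng.shuffle(rest)
    val, test = rest[:num_val], rest[num_val:]
    return train, val, test


# Hypothetical label vector: 900 nodes, 3 balanced classes.
toy_labels = [i % 3 for i in range(900)]
train, val, test = sample_split(toy_labels, seed=1)
```

Calling `sample_split` with seeds 0 to 19 would give 20 different splits, and reusing the same seeds across models keeps the comparison on identical data, as required by the protocol.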
Model architectures including sequences and types of layers, activation functions, locations of dropout and $L_2$-regularization are fixed as recommended by the respective authors. All remaining hyperparameters are optimized per model by selecting the parameter combination with best average accuracy on CiteSeer and Cora validation sets. Final results are reported only for the test sets using optimal parameters. All models are trained with Adam (kingma2014adam) using default parameters and early stopping based on validation accuracy and loss as in velivckovic2017graph, with a patience of 100 for a maximum of 10000 epochs.
For all PushNet variants, we fix the architecture as described in the previous section. Dropout is applied to all APPR matrices and to the inputs of all dense layers. However, for PushNet-PTP and PushNet-PP, dropout is only applied after propagation, such that propagated features can be cached. $L_2$-regularization is applied to all dense layers. As a default setting, we consider three different scales. Due to memory constraints, we use a different setting on the Physics dataset for all PushNet variants and on the CS and PubMed datasets for PushNet and PushNet-TPP. We further add self-loops to the adjacency matrices and perform symmetric normalization as in GCN. All APPR matrices are normalized per row. We use the following parameter grid for tuning hyper-parameters of all models:
Number of hidden dimensions: [8, 16, 32, 64]
Learning rate: [0.001, 0.005, 0.01]
Dropout probability: [0.3, 0.4, 0.5, 0.6]
Strength of L2-regularization: [1e-4, 1e-3, 1e-2, 1e-1]
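The grid above contains 4 x 3 x 4 x 4 = 192 combinations per model. A minimal sketch of enumerating it and selecting the combination with the best average validation accuracy on CiteSeer and Cora is given below; `enumerate_grid`, `select_best` and the `evaluate` callback (which would train a model and return validation accuracies per dataset) are hypothetical names introduced for this example.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence

# Parameter grid as listed in the text.
GRID = {
    "hidden": [8, 16, 32, 64],
    "lr": [0.001, 0.005, 0.01],
    "dropout": [0.3, 0.4, 0.5, 0.6],
    "l2": [1e-4, 1e-3, 1e-2, 1e-1],
}


def enumerate_grid(grid: Dict[str, list]) -> List[dict]:
    """Expand the grid into a list of concrete configurations."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def select_best(
    configs: List[dict],
    evaluate: Callable[[dict], Dict[str, float]],
    datasets: Sequence[str] = ("CiteSeer", "Cora"),
) -> dict:
    """Pick the configuration with the highest mean validation accuracy."""

    def mean_val_acc(cfg: dict) -> float:
        scores = evaluate(cfg)  # hypothetical: dataset name -> validation accuracy
        return sum(scores[d] for d in datasets) / len(datasets)

    return max(configs, key=mean_val_acc)


configs = enumerate_grid(GRID)
```

Since `max` returns the first maximal element, ties are resolved in favor of the configuration that appears earliest in the enumeration order; a different tie-breaking rule may be preferable in practice.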
Except for APPNP, all competitors use propagation layers. JK and GIN use an additional dense layer for prediction. For GAT layers, the number of attention heads is fixed to 8. Optimal hyper-parameters for all models are reported in Table 2.
Table 3: Accuracy/micro-F1 scores on semi-supervised node classification datasets in terms of mean and standard deviation over 100 independent runs. JK-GAT and SGC are out of GPU memory on the largest dataset, Coauthor Physics. Columns: CiteSeer, Cora, PubMed, Coauthor CS, Coauthor Physics.
4.3 Node Classification Accuracy
Accuracy/micro-F1 scores for all datasets are provided in Table 3. It can be observed that our models consistently provide the best results on all datasets and that the strongest model, PushNet-TPP, outperforms all competitors on all datasets. Improvements of our best model compared to the best competing model are statistically significant on all datasets according to a Wilcoxon signed-rank test. (In fact, results are significant at an even stricter level on all datasets except PubMed. P-values for all datasets are reported in Table 4.) On CiteSeer and PubMed, PushNet-PTP is able to push performance even further. PushNet with feature transformations before and after propagation is less performant but still outperforms all competitors on all datasets. PushNet-PP, the simplest of our models, performs worst as expected. However, it is still competitive, outperforming all competitors on CiteSeer. Boxplots shown in Figure 3
We argue that improvements over existing methods are primarily due to push-based propagation. Figure 1(a) compares APPR neighborhoods with $k$-hop neighborhoods in terms of the fraction of $k$-neighbors considered. It can be seen that $k$-hop neighborhoods used by competitors draw a sharp artificial boundary, while APPR adaptively selects nodes from larger neighborhoods and discards nodes from smaller ones, individually for each source node. Visually, deviations to the left of the boundary correspond to discarded irrelevant nodes, while deviations on the right-hand side indicate additional nodes beyond the receptive field of competitors that can be leveraged by our method. Stacking more message passing layers to reach these nodes would degrade performance due to over-smoothing, as demonstrated in xu2018representation and li2018deeper.
Among the competing methods, APPNP performs best in general, providing the best baseline performance on all datasets but CS, where JK-GAT achieves the best results. GAT outperforms GCN on three datasets, CiteSeer, Cora and Physics. JK performs worse than its respective base model in most cases. Similar observations were already made in klicpera2018predict. SGC mostly performs worse than GCN due to its simplicity, outperforming it only on CiteSeer. GIN also provides worse results than GCN in most cases, possibly due to overfitting caused by larger model complexity. It outperforms GCN only on Physics, providing results similar to APPNP.
On CiteSeer, models using cached features perform very well; even the simple models SGC and PushNet-PP, which effectively perform logistic regression on propagated raw features, provide superior performance. On the remaining datasets, the additional feature transformation provided by PushNet-PTP is necessary to achieve high accuracy.
Macro-F1 scores reveal similar insights and are omitted due to space constraints.
Figure 4 compares all methods based on average runtime per epoch and accuracy. (We note that for APPNP and all PushNet variants, (A)PPR matrix computation is not included in runtime analysis such that runtime comparison is solely based on propagation, transformation and prediction for all compared models. We consider (A)PPR computation as a preprocessing step, since it is only required to be performed once per graph and can then be reused by all models for this graph. Computation is very fast for each of the considered graphs and can be performed on CPU.) SGC has the lowest runtime on all datasets but runs out of memory on Physics since it propagates raw features over $k$-hop neighborhoods. PushNet-PP performs second fastest, followed by PushNet-PTP, which generally provides a good tradeoff between runtime and accuracy. PushNet and PushNet-TPP are slower than competitors but still provide comparable runtime at a superior level of accuracy. Among the competitors, APPNP, GAT and JK-GAT require the most computation time. JK-GAT also runs out of memory on Physics.
4.5 Influence of Locality
To study the influence of the locality parameter $\alpha$ on the performance of our models, we run experiments with our base model PushNet with various single values of $\alpha$ and different aggregations of multiple values. Figure 1(b) illustrates how the fraction of $k$-neighbors considered for propagation varies with $\alpha$. Generally, a larger value leads to stronger localization and the shape gets closer to a step function, as for $k$-hop neighborhoods. Figure 5 additionally shows the average performance on CiteSeer and Cora. For a single $\alpha$, small values achieve the best accuracy. For larger $\alpha$, runtime drops considerably but at the cost of decreased accuracy and larger variance. Among multi-scale aggregations, sum performs best. It slightly improves the performance over single values of $\alpha$ and provides additional robustness, producing smaller variance and no outlier scores. Runtime is very close to that of the smallest single $\alpha$ considered. The remaining aggregation functions lead to increased runtime, and max does not even lead to an increase in accuracy.
4.6 Influence of Sparsity
Similarly to $\alpha$, the approximation threshold $\epsilon$ controls the effective neighborhood size considered for propagation. We study the effect on our models with a similar setup as above. Figure 1(c) illustrates how a larger value of $\epsilon$ leads to stronger sparsification of APPR neighborhoods. Varying $\epsilon$ leads to a shift of the curve, indicating that neighbors with small visiting probabilities are discarded mostly from $k$-neighborhoods with moderate or large $k$. Figure 5 shows that accuracy remains relatively stable on CiteSeer and Cora, decreasing monotonically for increasing $\epsilon$. Simultaneously, runtime decreases steadily. While smaller values of $\epsilon$ provide marginally better accuracy, our results suggest that $\epsilon$ may be increased safely to allow for faster runtime and to account for limited GPU memory.
We presented a novel push-based asynchronous neural message passing algorithm which allows for efficient feature aggregation over adaptive node neighborhoods. A multi-scale approach additionally leverages correlations on increasing levels of locality, and variants of our model capture different inductive biases. Semi-supervised node classification experiments on five real-world benchmark datasets exhibit consistent improvements of our models over all competitors with statistical significance, demonstrating the effectiveness of our approach. Ablation studies investigate the influence of varying locality and sparsity parameters as well as combinations of multi-scale representations. In future work, we intend to investigate additional instances of the push-based message passing framework, extensions to dynamic graphs and applications to further tasks such as link prediction and graph classification.
This work was done during an internship at CT RDA BAM IBI-US, Siemens Corporate Technology, Princeton, NJ, USA.