Introduction
Many real-world datasets, such as social networks, can be modeled using a graph wherein the nodes represent entities in the network and the edges capture the interactions between the corresponding entities. Further, every node can have attributes associated with it, and some nodes can have known labels. Given such a graph, Collective Classification (CC) [Neville and Jensen 2000; Lu and Getoor 2003; Sen et al. 2008] is the task of assigning labels to the remaining unlabeled nodes in the graph. A key task here is to extract relational features for every node which consider not only the attributes of the node but also the attributes and labels of its partially labeled neighborhood. Neural network based models have become popular for computing such node representations by aggregating node and neighborhood information.
The key idea is to exploit the inherent relational structure among the nodes, which encodes valuable information about homophily, influence, community structure, etc. [Jensen, Neville, and Gallagher 2004]. Traditionally, various neighborhood statistics on structural properties [Gallagher and Eliassi-Rad 2010] and distributions over labels and features [Neville and Jensen 2003; Lu and Getoor 2003; McDowell and Aha 2013] were used as relational features to predict labels. Further, iterative inference techniques were widely adopted to propagate these label predictions until convergence [Sen et al. 2008]. Recently, [Kipf and Welling 2016] proposed Graph Convolutional Networks with a re-parameterized Laplacian-based graph kernel (GCN) for the node-level semi-supervised classification task. GraphSAGE [Hamilton, Ying, and Leskovec 2017] further extended GCN and proposed a few additional neighborhood aggregation functions to achieve state-of-the-art results for inductive learning.
These graph convolution kernels are based on differentiable extensions of the popular Weisfeiler-Lehman (WL) kernels. In this work, we first show that a direct adaptation of WL kernels for the CC task is inherently limited, as node features get exponentially morphed with neighborhood information when considering farther hops. More importantly, learning to aggregate information from a multi-hop neighborhood in an end-to-end differentiable manner is not easily scalable: the exponential increase in neighborhood size with the number of hops severely limits the model due to excessive memory and computation requirements. In this work, we propose a Higher Order Propagation Framework (HOPF) that provides a solution to both these problems. Our main contributions are:


A modular graph kernel that generalizes many existing methods. Through this, we identify a hitherto unobserved phenomenon which we refer to as Node Information Morphing (NIM). We discuss its implications for the limitations of existing methods and then propose a novel family of kernels, the Node Information Preserving (NIP) kernels, to address these limitations.

A hybrid semi-supervised learning framework for higher order propagation (HOPF) that couples differentiable kernels with an iterative inference procedure to aggregate neighborhood information over farther hops. This allows differentiable kernels to exploit label information and to overcome the excessive memory constraints imposed by multi-hop information aggregation.
An extensive experimental study on 11 datasets from different domains. We demonstrate the NIM issue and show that the proposed iterative NIP model is robust and overall outperforms existing models.
Background
In this section, (i) we define the notations and terminology used, (ii) we present the generic differentiable kernel for capturing higher order information in the CC setting, (iii) we discuss existing works in light of the generic kernel, and (iv) we analyze the Node Information Morphing (NIM) issue.
Definitions and notations
Let $G = (V, E)$ denote a graph with a set of vertices, $V$, and edges, $E$. Let $N = |V|$. The set $E$ is represented by an adjacency matrix, $A$, and let $D$ denote the diagonal degree matrix, defined as $D_{ii} = \sum_j A_{ij}$.
A collective classification dataset defined on the graph $G$ comprises a set of labeled nodes, $V_L$, a set of unlabeled nodes, $V_U$, with $V = V_L \cup V_U$, a feature matrix $X \in \mathbb{R}^{N \times F}$ and a label matrix $Y \in \mathbb{R}^{N \times L}$, where $F$ and $L$ denote the number of features and labels, respectively. Let $\hat{Y}$ denote the predicted label matrix.
In this work, neural networks defined over $K$-hop neighborhoods have $K$ aggregation or convolution layers, whose outputs are denoted by $h^{1}, \ldots, h^{K}$. We denote the learnable weights associated with the $k$-th layer as $W_F^{k}$ and $W_A^{k}$; the input-layer weights map from the $F$ input features, and the output-layer weights map to the $L$ labels. Iterative inference steps are indexed by $t$.
Generic propagation kernel
We define the generic propagation (graph) kernel as follows:

$h^{k} = \sigma_k\left(\alpha^{k} h^{k-1} W_F^{k} + F(A)\, \beta^{k} h^{k-1} W_A^{k}\right)$ (1)

where $\alpha^{k} h^{k-1}$ and $\beta^{k} h^{k-1}$ are the node and neighbor features considered at the $k$-th propagation step (layer), $F(A)$ is a function of the adjacency matrix of the graph, and $W_F^{k}$ and $W_A^{k}$ are the weights associated with the $k$-th layer of the neural network. One can view the first term in the equation as processing the information of a given node and the second term as processing the neighbors' information. The kernel recursively computes the output of the $k$-th layer by combining the features computed up to the $(k-1)$-th layer. $\sigma_k$ is the activation function of the $k$-th layer, and $\alpha^{k}$ and $\beta^{k}$ can be scalars, vectors or matrices depending on the kernel.
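As a concrete illustration, a single step of this generic kernel can be sketched in NumPy (a minimal sketch, not the paper's implementation: mean aggregation $F(A) = D^{-1}A$, scalar $\alpha$, $\beta$ and a ReLU activation are assumed choices here):

```python
import numpy as np

def propagation_step(h_prev, A, W_f, W_a, alpha=1.0, beta=1.0,
                     act=lambda x: np.maximum(x, 0.0)):
    """One layer of the generic kernel:
    h^k = act(alpha * h^{k-1} W_F + F(A) (beta * h^{k-1}) W_A),
    with F(A) chosen as mean aggregation D^{-1} A for this sketch."""
    deg = A.sum(axis=1, keepdims=True)
    F_A = A / np.maximum(deg, 1.0)            # row-normalized adjacency D^{-1} A
    node_term = (alpha * h_prev) @ W_f        # the node's own information
    neigh_term = F_A @ (beta * h_prev) @ W_a  # aggregated neighbor information
    return act(node_term + neigh_term)

# Toy 3-node path graph, 2 features per node, 4 hidden units.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
h0 = np.array([[1., 0.],
               [0., 1.],
               [0., 0.]])
rng = np.random.default_rng(0)
h1 = propagation_step(h0, A, rng.standard_normal((2, 4)),
                      rng.standard_normal((2, 4)))
```

Swapping `F_A` for another function of the adjacency matrix, or making `alpha` and `beta` learnable, recovers other members of the family.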
Label predictions, $\hat{Y}$, can be obtained by projecting $h^{K}$ onto the label space, followed by a sigmoid or softmax layer corresponding to the multi-label or multi-class classification task, respectively. The weights of the model are learned via backpropagation by minimizing an appropriate classification loss on the labeled nodes.

[Table 1: Instantiations of the generic kernel. For each model (WL, SSICA, GCN, GCNMEAN, GSPool, GSMEAN, GSLSTM (LSTM gates / LSTM aggregation), NIPMEAN, INIPMEAN), the table lists the choices of $F(A)$, $\alpha^{k}$, $\beta^{k}$ and the layer weights, along with Yes/No flags for whether the model preserves node information, is end-to-end differentiable, and performs iterative inference over labels; SSICA and INIPMEAN are the only iterative models.]
Relation to existing works:
Appropriate choices of $F(A)$, $\alpha$, $\beta$ and the weights in the generic kernel yield different models. Table 1 lists these choices for some of the popular models, as well as for our proposed approaches. Iterative collective inference techniques, such as the ICA family, combine node information with aggregated label summaries of the immediate neighbors to make predictions. The aggregation can be based on an averaging kernel, $F(A) = D^{-1}A$, or a label count kernel, $F(A) = A$, etc., with the labels serving as the neighbors' features. This neighborhood information is then propagated iteratively to capture higher order information. ICA also has a semi-supervised variant [McDowell and Aha 2012] where, after each iteration, the model is relearned with the updated labels of the neighbors. Table 1 shows how the modular components can be chosen to obtain semi-supervised ICA (SSICA) as a special instantiation of our framework.
The Weisfeiler-Lehman (WL) family of recursive kernels [Weisfeiler and Lehman 1968; Shervashidze et al. 2011] was initially defined for graph isomorphism tests, and most recent CC methods use differentiable extensions of it. In its basic form, it is the simplest instantiation of our generic propagation kernel, with no learnable parameters, as shown in Table 1.
The normalized symmetric Laplacian kernel (GCN) used in [Kipf and Welling 2016] can be seen as an instance of the generic kernel with node weight $\alpha_v = (1 + D_{vv})^{-1}$, individual neighbor weights $\beta_{uv} = ((1 + D_{uu})(1 + D_{vv}))^{-1/2}$, and shared layer weights $W_F^{k} = W_A^{k}$. We also consider its mean aggregation variant (GCNMEAN), where $F(A) = D^{-1}A$. In theory, by stacking multiple graph convolutional layers, higher order information up to any hop can be captured in a differentiable way. However, in practice, the model proposed in [Kipf and Welling 2016] is only full-batch trainable and thus cannot scale to large graphs when memory is limited.
GraphSAGE (GS) [Hamilton, Ying, and Leskovec 2017] is the recent state-of-the-art for inductive learning. GraphSAGE also proposed variants of differentiable WL kernels, viz. GSMEAN, GSPool and GSLSTM. These variants can be viewed as special instances of our generic framework, as listed in Table 1. GSPool applies a max-pooling function to aggregate neighborhood information, whereas GSLSTM uses an LSTM to combine the neighbors' information sequenced in random order, similar to [Moore and Neville 2017]. GS has a mean averaging variant similar to the GCNMEAN model, but it treats a node separately from its neighbors, i.e., the node is not included in its own neighborhood aggregation. Finally, it either concatenates or adds up the node and neighborhood information. GSLSTM is over-parameterized for small datasets. With GSMAX and GSLSTM there is a loss of information, as max pooling considers only the largest input and the LSTM focuses more on the recent neighbors in the random sequence.

Node Information Morphing (NIM): Analysis
In this section, we show that existing models which extract relational features do not completely retain the original node information. With multiple propagation steps, the original node information $h^{0}$ is decayed and morphed with neighborhood information. We term this issue Node Information Morphing (NIM).
For ease of illustration, we demonstrate the NIM issue by ignoring the non-linearity and the weights. Based on the commonly observed instantiations of our generic propagation kernel (Eqn 1), where $\alpha^{k} = \beta^{k} = 1$, we consider the following equation:

$h^{k} = h^{k-1} + F(A)\, h^{k-1}$ (2)

On unrolling the above expression, one can derive the following binomial form:

$h^{K} = (I + F(A))^{K} h^{0} = \sum_{k=0}^{K} \binom{K}{k} F(A)^{k} h^{0}$ (3)

From Eqn 3, it can be seen that the relative importance of the information associated with the node's $k$-th hop, $F(A)^{k} h^{0}$, is $\binom{K}{k}$. Hence, the relative weight of the original node information $h^{0}$ is only $1/2^{K}$ of the total, i.e., the importance of $h^{0}$ decays exponentially with $K$. A similar exponential decay holds for GCN and the other WL kernel variants mentioned in Table 1.
Skip connections and Node Information Morphing:
It can be similarly derived that information morphing happens not only for $h^{0}$ but also for every $h^{k}$, $k < K$. This decay can be lessened by leveraging skip connections. Consider the propagation kernel in Eqn 2 with skip connections, as shown below:

$h^{k} = \left(h^{k-1} + F(A)\, h^{k-1}\right) + h^{k-1} = (2I + F(A))\, h^{k-1}$ (4)

Expanding the above equation as before gives:

$h^{K} = (2I + F(A))^{K} h^{0} = \sum_{k=0}^{K} \binom{K}{k}\, 2^{K-k} F(A)^{k} h^{0}$ (5)

The relative importance of $h^{0}$ then becomes $2^{K}/3^{K} = (2/3)^{K}$, which decays more slowly than $(1/2)^{K}$ for all $K > 0$. Though this helps in retaining information longer, it does not solve the problem completely. Skip connections were used in GCN to reduce the drop in performance of the model with multiple hops. The addition of skip connections in GCN was originally motivated by the conventional desire to avoid the degradation that comes with increasing neural network depth, not by the intention to address information morphing. In fact, their standard 2-layer model cannot accommodate skip connections because of the varying output dimensions of the layers. Similarly, the GraphSAGE models, which use a concatenation operation to combine node and neighborhood information, also lessen the decay effect in comparison to summation-based combination models. This is because concatenating information from the previous layer can be perceived as a skip connection, as noted by its authors. Though the above analysis is done on a linear propagation model, the insight is applicable to non-linear models as well. Our empirical results also confirm this.
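Under the linear simplification above, the relative weight of the original features $h^{0}$ after $K$ steps can be computed directly: it is $1$ out of $2^{K}$ total weight without skip connections, and $2^{K}$ out of $3^{K}$ with them. A small numeric check (assuming the scalar forms $(I + F(A))$ and $(2I + F(A))$):

```python
from math import comb

def h0_weight_plain(K):
    """h^K = (I + F(A))^K h^0: the coefficient of h^0 is 1 out of a
    total weight of sum_k C(K, k) = 2^K."""
    return 1 / sum(comb(K, k) for k in range(K + 1))

def h0_weight_skip(K):
    """With skip connections, h^K = (2I + F(A))^K h^0: the coefficient
    of h^0 is 2^K out of sum_k C(K, k) * 2^(K - k) = 3^K."""
    return 2**K / sum(comb(K, k) * 2**(K - k) for k in range(K + 1))

# The skip variant retains node information longer: (2/3)^K > (1/2)^K.
decays = [(K, h0_weight_plain(K), h0_weight_skip(K)) for K in (1, 2, 4, 8)]
```

Both weights still vanish as $K$ grows, which is why skip connections alone only delay, rather than prevent, node information morphing.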
Proposed work
In this section, we propose (i) a solution to the NIM issue and (ii) a generic semi-supervised learning framework for higher order propagation.
Node Information Preserving models
To address the NIM issue, we propose a specific class of instantiations of the generic kernel, which we call the Node Information Preserving (NIP) models. One way to avoid the NIM issue is to explicitly retain $h^{0}$ at every propagation step, as in the equation below, obtained from Eqn 1 by replacing the node-term input with $h^{0}$:

$h^{k} = \sigma_k\left(h^{0} W_F^{k} + F(A)\, \beta^{k} h^{k-1} W_A^{k}\right)$ (6)

For different choices of $F(A)$ and $\beta^{k}$, we get different kernels of this family. In particular, dropping the weights and non-linearity and weighting the two terms by $(1-\lambda)$ and $\lambda$ yields a kernel similar to Random Walk with Restart (RWR) [Tong, Faloutsos, and Pan 2006]:

$h^{k} = (1-\lambda)\, h^{0} + \lambda\, F(A)\, h^{k-1}$ (7)
The NIP formulation has two significant advantages: (a) it enables explicitly capturing the correlation between the $k$-hop reachable neighbors and the node, and (b) it creates a direct gradient path to the node information from every layer, thus allowing for better training. We propose a specific instantiation of the generic NIP kernel, NIPMEAN, below:

$h^{k} = \sigma_k\left(h^{0} W_F^{k} + D^{-1}A\, h^{k-1} W_A^{k}\right)$ (8)

NIPMEAN is similar to GCNMEAN, but the node term is computed from $h^{0}$ at every layer rather than from $h^{k-1}$.
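A minimal sketch of stacked NIP-style layers may help make the preservation property concrete (illustrative NumPy code with the hidden dimension tied to the input dimension for brevity; the function and weight names are not from the paper):

```python
import numpy as np

def nip_mean_layer(h0, h_prev, A, W_f, W_a):
    """One NIP-MEAN-style layer, h^k = ReLU(h^0 W_F + D^{-1} A h^{k-1} W_A):
    the node term reads the original features h^0 at every depth."""
    deg = A.sum(axis=1, keepdims=True)
    neigh = (A / np.maximum(deg, 1.0)) @ h_prev   # mean over neighbors
    return np.maximum(h0 @ W_f + neigh @ W_a, 0.0)

# Two connected nodes, shared 2x2 weights, 4 stacked hops: h^0 feeds
# directly into each layer instead of being repeatedly re-mixed.
A = np.array([[0., 1.],
              [1., 0.]])
h0 = np.eye(2)
rng = np.random.default_rng(1)
W_f = rng.standard_normal((2, 2))
W_a = rng.standard_normal((2, 2))
h = h0
for _ in range(4):
    h = nip_mean_layer(h0, h, A, W_f, W_a)
```

Because `h0` enters each layer directly, the gradient with respect to the node's own features does not have to flow through the whole stack.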
Higher Order Propagation Framework: HOPF
Building any end-to-end differentiable model requires all the relational information to be in memory. This hinders models with a large number of parameters and those that process data in large batches. For graphs with high link density and a power-law degree distribution, processing even a couple of hops of information becomes infeasible. Even with regular graphs, the memory grows exponentially with the number of hops, $K$. Thus, using a differentiable kernel for even a small number of hops over a moderate-size graph becomes infeasible.
To address this critical issue of scalability, we propose a novel Higher Order Propagation Framework (HOPF) which incorporates an iterative mechanism over the differentiable kernels. In each iteration $t$ of HOPF, the differentiable kernel computes a $K$-hop neighborhood summary. Every iteration starts from a summary, $c^{t-1}$, of the information computed until the $(t-1)$-th iteration, which is concatenated with the node features as given below:

$h^{0,t} = \mathrm{CONCAT}(X, c^{t-1})$ (9)

After $T$ iterations the model would have incorporated $TK$-hop neighborhood information. Here, we fix $T$ based on the required number of hops we want to capture, but it can also be based on some convergence criterion on the inferred labels. For the empirical results reported in this work, we have chosen $c^{t-1}$ to be the (predicted) labels $\hat{Y}^{t-1}$, along the lines of the ICA family of algorithms. Other choices for $c^{t-1}$ include the $K$-hop relational information, $h^{K,t-1}$.
We explain HOPF's mechanism with the toy chain graph illustrated in Fig. 1. The graph has 6 nodes with attributes ranging over A-F, and the graph kernel used is of the second order ($K = 2$). The figure is intended to explain how differentiable and non-differentiable layers are interleaved to allow propagation up to the diameter. We first analyze it with respect to node 1. In the first iteration, node 1 has learned to aggregate attributes from nodes 2 and 3, viz. B-C, along with its own. This provides it with an aggregate of information from A, B and C. At the start of each subsequent iteration, label predictions are made for all the nodes using the $K$-th order (in Fig. 1, $K = 2$) differentiable kernel learned in the previous iteration. These labels are concatenated with the node attributes to form the features for the current iteration. By treating the labels as non-differentiable entities, we stop the gradients from propagating to the previous iteration, and hence the model is only $K$-hop differentiable.

With the concatenated label information, the model can be made to relearn from scratch or continue on top of the pretrained model from the last iteration. Following this setup, one can observe that the information of nodes D, E, and F, which is not accessible with a second order differentiable kernel (blue paths), is now accessible via the non-differentiable paths (red and green paths). In the second iteration, information from the nodes at the 3rd and 4th hops (D and E) becomes available, and in the subsequent iteration, information from the 5th hop (F) becomes available. The paths encoded in blue, purple and orange represent different iterations in the figure and are differentiable only during their ongoing iteration, not as a whole.
Iterative NIP Mean Kernel: INIPMEAN
In this section, we propose a special instance of HOPF which addresses the NIM issue with NIP kernels in a scalable fashion. Specifically, we consider the following NIP kernel instantiation, INIPMEAN, with the mean aggregation function, obtained by setting $F(A) = D^{-1}A$, replacing the node-term input with $h^{0}$, setting $\beta^{k} = 1$, and choosing the predicted labels $\hat{Y}^{t-1}$ as the iteration summary $c^{t-1}$:

$h^{k} = \sigma_k\left(h^{0} W_F^{k} + D^{-1}A\, h^{k-1} W_A^{k}\right), \quad h^{0} = \mathrm{CONCAT}(X, \hat{Y}^{t-1})$ (10)
In Algorithm 1 (INIPMEAN), the iterative learning and inference steps are described in lines 7-10 and 12-16, respectively. Both learning and inference happen in mini-batches, $B$, sampled from the labeled set, $V_L$, or the unlabeled set, $V_U$, respectively. The predict function described in lines 17-27 is used during learning and inference to obtain label predictions for the nodes in $B$. The procedure first extracts $K$-hop relational features ($h^{K}$), then projects them to the label space and applies a sigmoid or a softmax depending on the task (line 27).

To extract $K$-hop relational features for $B$, the model, via the get_subgraph function, first gathers all nodes in $B$ along with their neighbors reachable in fewer than $K$ hops, and represents this entire subgraph by an adjacency matrix. A $K$-hop representation is then obtained with the kernel as in lines 21-24. At each learning phase, the weights of the kernel ($W_F^{k}$s and $W_A^{k}$s) are updated via backpropagation to minimize an appropriate loss function.
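The overall loop can be sketched as follows (an illustrative Python sketch, not the paper's implementation: `hopf_iterations` replaces the trained differentiable kernel of Algorithm 1 with a parameter-free mean-aggregation stand-in, so only the interleaving of iterations, concatenated labels, and label clamping is shown):

```python
import numpy as np

def hopf_iterations(X, A, Y_obs, labeled_mask, T=3, K=2):
    """Sketch of the HOPF loop: T non-differentiable iterations, each
    running K aggregation hops on the node attributes concatenated with
    the previous iteration's label predictions."""
    L = Y_obs.shape[1]
    deg = A.sum(axis=1, keepdims=True)
    F_A = A / np.maximum(deg, 1.0)
    Y_pred = np.zeros_like(Y_obs)
    for _ in range(T):
        h = np.concatenate([X, Y_pred], axis=1)  # features + last labels
        for _ in range(K):                       # K differentiable hops
            h = F_A @ h
        Y_pred = h[:, -L:].copy()                # stand-in 'prediction'
        Y_pred[labeled_mask] = Y_obs[labeled_mask]  # clamp known labels
    return Y_pred

# 5-node chain with only node 0 labeled: with K = 1 hop per iteration,
# T iterations let the label reach roughly T hops down the chain.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
X = np.ones((5, 1))
Y_obs = np.zeros((5, 1))
Y_obs[0, 0] = 1.0
mask = np.array([True, False, False, False, False])
Y_hat = hopf_iterations(X, A, Y_obs, mask, T=5, K=1)
```

With only one differentiable hop per iteration, label information from node 0 reaches node 4 purely through the non-differentiable iterations, mirroring the chain example of Fig. 1.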
Scalability analysis:
In most real-world graphs exhibiting power-law degree distributions, the size of the neighborhood of each node grows exponentially with the depth of the neighborhood being considered. Storing all the node attributes, the edges of the graph, the intermediate activations, and all the associated parameters becomes a critical bottleneck. Here we analyze the efficiency of the proposed work in scaling to large graphs, in terms of the reduction in the number of parameters and in space and time complexity.
Number of parameters: The ratio of labeled to unlabeled nodes in a graph is often very small. As observed in [Kipf and Welling 2016; Hamilton, Ying, and Leskovec 2017], the model tends to overfit easily and perform poorly at test time when additional parameters (layers) are introduced to capture deeper neighborhoods. In our proposed framework with iterative learning and inference, the parameters of the kernel at iteration $t$ are used to initialize the kernel at iteration $t+1$ and are then discarded; hence the number of model parameters depends on $K$ and not on $TK$. Thus the model can obtain information from an arbitrary number of hops, $TK$, with the constant number of learnable parameters of a $K$-hop kernel, where $K \ll TK$. In the inductive setup, however, the parameter complexity is similar to that of GCN and GraphSAGE, as the kernel parameters from all iterations are required to make predictions for unseen nodes.
Space and Time complexity: For a graph $G = (V, E)$, we consider aggregating information up to the $K$-hop neighborhood. Let the number of nodes be $N$ and the average degree be $d$. For making full-batch updates over the graph (as in GCN), the computational complexity of information aggregation grows with $K|E|$, and the memory required must hold the activations of all nodes across the $K$ layers along with the edges of the graph. Even for moderate-size graphs, such a memory requirement quickly becomes impractical. Updating parameters in mini-batches trades off memory requirements against computation time. If batches of size $B$ (where $B \ll N$) are considered, the memory requirement reduces, but in the worst case the computational complexity increases exponentially, as a neighborhood of size $d^{K}$ needs to be aggregated for each batch independently. In a highly connected graph (such as PPI, Reddit, Blog, etc.), the $K$-hop neighborhood set of even a small batch may already be the whole network, making the task computationally expensive and often infeasible. To make this tractable, GraphSAGE considers only partial neighborhood information. Though useful, in many cases this can significantly hurt performance, as shown on the citation networks in Figure 2. Not only does it hurt performance, it also requires additional hyperparameter tuning for the neighborhood sample size. In comparison, the proposed work reduces the complexity from exponential to linear in the total number of hops considered, by performing $T$ iterations of a constant $K$-hop differentiable kernel, so that the total number of hops captured is $TK$. In our experiments, we found that even small values of $K$ and $T$ were sufficient to outperform existing methods on most of the datasets. The best models were the ones whose $K$ was the largest hop count at which the differentiable kernel alone performed best.

Miscellaneous related works
Many extensions of classical methods have been proposed to capture higher order relational properties of the data. Glocalized kernels [Morris, Kersting, and Mutzel 2017] are a variant of the $k$-dimensional Weisfeiler-Lehman kernel [Weisfeiler and Lehman 1968] for graph-level tasks that use a stochastic approximation to aggregate information from distant nodes. The differentiable kernels discussed here are all 1-dimensional WL kernels, whose direct adaptation suffers from Node Information Morphing. The relational neighbor classifier [Macskassy and Provost 2003] builds upon the homophily assumption in the graph structure and diffuses the available label data to predict the labels of the unlabeled nodes. To make this process more efficient, propagation kernels [Neumann et al. 2016] provide additional schemes for diffusing the available information across the graph. However, none of these provide a mechanism to adapt to the dataset by learning the aggregation filter. From a dynamical systems perspective, predictive state representations [Sun et al. 2016] also make use of iterative refinement of the model's internal representations for sequential modeling tasks; however, no extension to graph models has been proposed. In computer vision applications, iterative Markov random fields [Subbanna, Precup, and Arbel 2014; Yu and Clausi 2005] have also been shown to be useful for incrementally using local structure to capture global statistics. In this work, we restrict our focus to addressing the limitations of the current state-of-the-art differentiable graph kernels in providing higher order information for collective classification tasks. Moreover, HOPF additionally leverages label information, which we find useful. Message Passing Neural Network (MPNN) [Gilmer et al. 2017] is a message passing framework comprising message and readout components, defined for graph-level tasks. HOPF is explicitly defined for node-level tasks and aims at scaling existing graph networks. HOPF's generic propagation kernel is more detailed than MPNN's message component and can additionally support iterative learning and inference.
Experiments
In this section, we describe the datasets used for our experiments, the experimental setup and the models compared.
Dataset details
We extensively evaluate the proposed models and the baselines on 11 datasets from various domains. In this work, we treat these networks as undirected graphs, but the proposed framework can also handle directed graphs with non-negative edge weights. The datasets used are described below, and summary statistics are provided in Table 2.
Social networks: We use Facebook (FB) from [Pfeiffer III, Neville, and Bennett2015, Moore and Neville2017], BlogCatalog (BLOG) from [Wang et al.2010] and Reddit dataset from [Hamilton, Ying, and Leskovec2017]. In the Facebook dataset, the nodes are Facebook users and the task is to predict the political views of a user given the gender and religious view of the user as features. In the BlogCatalog dataset, the nodes are users of a social blog directory, the user’s blog tags are treated as node features and edges correspond to friendship or fan following. The task here is to predict the interests of users. In Reddit, the nodes are the Reddit posts, the features are the averaged glove embeddings of text content in the post and edges are created between posts if the same users comment on both. The task here is to predict the subReddit community to which the post belongs.
Citation Networks: We use four citation graphs: Cora [Lu and Getoor2003], Citeseer [Bhattacharya and Getoor2007], Pubmed [Namata et al.2012] and Cora2 [Mccallum2001]. In all the four datasets, the articles are the nodes and the edges denote citations. The bagofword representation of the article is used as node attributes. The task is to predict the research area of the article. Apart from Cora2, which is a multilabel classification dataset from [Mccallum2001], others are multiclass datasets.
Biological network: We use two proteinprotein interaction network: Yeast and Human. Yeast dataset is part of the KDD cup 2001 challenge [Hatzis and Page2001] which contain interactions between proteins. The task is to predict the function of these genes. Similarly, the Human dataset, introduced in [Hamilton, Ying, and Leskovec2017], is a proteinprotein interaction (PPI) network from Human Tissues. The dataset contains PPI from 24 human tissues and the task is to predict the gene’s functional ontology. Features consist of positional gene sets, motif gene sets, and immunology signatures.
Movie network: We constructed a movie network from Movielens2k dataset available as a part of HetRec 2011 workshop [Cantador, Brusilovsky, and Kuflik2011]. The dataset is an extension of the MovieLens10M dataset with additional movie tags. The nodes are the movies and edges are created between movies if they share a common actor or director. The movie tags form the movie features. The task here is to predict all possible genres of the movies.
Product network: We constructed an Amazon DVD co-purchase network, which is a subset of the Amazon_060 co-purchase data by [Leskovec and Sosič 2016]. The network construction procedure is similar to the one in [Moore and Neville 2017]. The nodes correspond to DVDs, and edges are constructed if two DVDs are co-purchased. The DVD genres are treated as DVD features. The task here is to predict whether or not a DVD will have an Amazon sales rank under 7500. To the best of our knowledge, there exists no previous work in collective classification that reports results on as many datasets over as wide a range of domains. (Code is available at https://github.com/PriyeshV/HOPF.)
Table 2: Dataset statistics. |V|: nodes, |E|: edges, F: features, L: labels; the last column indicates whether the task is multi-label (T) or multi-class (F).

Dataset | Network | |V| | |E| | F | L | Multilabel
Cora | Citation | 2708 | 5429 | 1433 | 7 | F
Citeseer | Citation | 3312 | 4715 | 3703 | 6 | F
Cora2 | Citation | 11881 | 34648 | 9568 | 79 | T
Pubmed | Citation | 19717 | 44327 | 500 | 3 | F
Yeast | Biology | 1240 | 1674 | 831 | 13 | T
Human | Biology | 56944 | 1612348 | 50 | 121 | T
Reddit | Social | 232965 | 5376619 | 602 | 41 | T
Blog | Social | 69814 | 2810844 | 5413 | 46 | T
Fb | Social | 6302 | 73374 | 2 | 2 | F
Amazon | Product | 16553 | 76981 | 30 | 2 | F
Movie | Movie | 7155 | 388404 | 5297 | 20 | T
Experiment setup:
The experiments follow a semi-supervised setting with limited labeled data. We hold out a fraction of the nodes in the graph as test nodes and randomly create 5 sets of training data by sampling from the remaining graph. Further, a portion of these training nodes is used as the validation set. We do not use the validation set for (re)training. To account for the imbalance in the training set, we use a weighted cross entropy loss (see Appendix), similar to [Moore and Neville 2017], for all the models. In Table 3, we report the test results for the transductive experiments, averaged over models trained on the different training sets. We also report results on the transfer (inductive) learning task introduced in [Hamilton, Ying, and Leskovec 2017], under their setting, where the task is to classify proteins in new human tissues (graphs) unseen during training. For detailed information on implementation and hyperparameter details, kindly refer to the Appendix.
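The class-weighted loss referred to above can be sketched as follows (an assumption: inverse-class-frequency weighting is one common choice for imbalanced training sets; the paper's exact scheme is in its appendix and may differ):

```python
import numpy as np

def weighted_cross_entropy(probs, labels, eps=1e-9):
    """Cross entropy with per-class weights set inversely to the class
    frequency in `labels`, so rare classes are not drowned out by
    frequent ones (one common weighting; the exact scheme may differ)."""
    counts = labels.sum(axis=0)                 # per-class counts
    w = counts.sum() / np.maximum(counts, 1.0)  # inverse frequency
    w = w / w.sum()                             # normalize the weights
    per_node = -(w * labels * np.log(probs + eps)).sum(axis=1)
    return per_node.mean()

# 3 samples of class 0 and 1 of class 1: misclassifying the rare class
# costs more than misclassifying one of the frequent samples.
labels = np.array([[1., 0.], [1., 0.], [1., 0.], [0., 1.]])
miss_rare = np.array([[.9, .1], [.9, .1], [.9, .1], [.9, .1]])
miss_freq = np.array([[.1, .9], [.9, .1], [.9, .1], [.1, .9]])
loss_rare = weighted_cross_entropy(miss_rare, labels)
loss_freq = weighted_cross_entropy(miss_freq, labels)
```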
Table 3: Transductive results (micro-F1) on the 11 datasets, with the aggregate measures (average Shortfall and Rank) in the last two columns.

MODELS | Blog | FB | Movie | Cora | Citeseer | Cora2 | Pubmed | Yeast | Human | Amazon | Reddit | Shortfall | Rank
BL_NODE | 37.929 | 64.683 | 50.329 | 59.852 | 65.196 | 40.583 | 83.682 | 59.681 | 41.111 | 57.118 | 64.121 | 16.9 | 8.82
BL_NEIGH | 19.746 | 51.413 | 35.601 | 77.43 | 70.181 | 63.862 | 83.16 | 53.522 | 60.939 | 59.699 | 66.236 | 17.3 | 8.45
GCN | 34.068 | 50.397 | 39.059 | 76.969 | 72.991 | 63.956 | 85.722 | 62.565 | 58.298 | 75.667 | 61.777 | 11.0 | 6.64
GCNS | 39.101 | 63.682 | 51.194 | 77.523 | 71.903 | 63.152 | 86.432 | 60.34 | 62.057 | 77.637 | 73.746 | 4.1 | 4.36
GCNMEAN | 38.541 | 62.651 | 51.143 | 76.081 | 72.357 | 62.842 | 85.792 | 61.787 | 64.662 | 74.324 | 63.674 | 5.6 | 6
GSMEAN | 39.433 | 64.127 | 50.557 | 76.821 | 70.967 | 62.8 | 84.23 | 59.771 | 63.753 | 79.051 | 68.266 | 4.9 | 6
GSMAX | 40.275 | 64.571 | 50.569 | 73.272 | 71.39 | 53.476 | 85.087 | 62.727 | 65.068 | 78.203 | 70.302 | 5.5 | 4.73
GSLSTM | 37.744 | 64.619 | 41.261 | 65.73 | 63.788 | 38.617 | 82.577 | 58.353 | 64.231 | 63.169 | 68.024 | 14.4 | 8.45
NIPMEAN | 39.433 | 64.286 | 51.316 | 76.932 | 71.148 | 63.901 | 86.203 | 61.583 | 68.688 | 77.262 | 69.136 | 3.6 | 4
SSICA | 38.517 | 64.349 | 52.433 | 75.342 | 68.973 | 63.098 | 84.798 | 68.444 | 43.629 | 81.92 | 65.789 | 6.6 | 5.73
INIPMEAN | 39.398 | 62.889 | 51.864 | 78.854 | 71.541 | 66.23 | 85.341 | 69.917 | 68.652 | 81.64 | 75.045 | 0.9 | 2.81
Table 4: Inductive (transfer) learning results on the Human PPI dataset (micro-F1).

Dataset | Node | Neighbor | NIPMEAN | GCNMEAN | GCN | GSMean | GSMax | GSLSTM | SSICA | INIPMEAN
PPI | 44.51 | 83.891 | 92.243 | 86.049 | 88.585 | 79.634 | 78.054 | 87.111 | 61.51 | 92.477
Models compared: We compare the proposed NIP and HOPF models with various differentiable WL kernels, semi-supervised ICA, and two baselines, BL_NODE and BL_NEIGH, as defined in Table 1. BL_NODE is a K-layer feedforward network that considers only the node's information, ignoring the relational information, whereas BL_NEIGH ignores the node's information and considers only the neighbors' information. BL_NEIGH is a powerful baseline which we introduce; it is helpful for understanding the usefulness of relational information in a dataset. In cases where BL_NEIGH performs worse than BL_NODE, the dataset has little or no useful relational information extractable with the available labeled data, and vice versa. In such datasets, we observe no significant gain in considering beyond one or two hops. All the models in Table 3 and Table 4 except SSICA, GCN and the GraphSAGE models have skip connections. GraphSAGE models combine node and neighborhood information by concatenation instead of summation.
Results and Discussions
In this section, we make some observations from the results of our experiments as summarized in Tables 3 and 4.
Statistical significance:
In order to report statistical significance of models’ performance across different datasets we resort to Friedman’s test and Wilcoxon signed rank test as discussed by previous work [Demšar2006]
. Levering Friedmans’ test, we can reject the null hypothesis that all the models perform similarly with
. More details and report about the statistical significance of our proposed models, NIPMEAN and INIPMEAN, over their base variants is presented in the subsequent discussions. .Model Consistency
These rank-based significance tests do not provide a metric to measure the robustness of a model across datasets. One popular approach is to use count-based statistics such as average rank and number of wins. The average rank of the models across datasets is provided in Table 3, where a lower rank indicates better performance. It is evident from Table 3 that the proposed algorithm INIPMEAN achieves the best rank and wins on 4/11 datasets, followed by SSICA, which has 2 wins, and NIPMEAN, which has 1 win and the second best rank. By this simple measure of counting the number of wins, the proposed method outperforms existing methods.
However, we argue that this is not helpful at measuring the robustness of models. For example, there could be an algorithm which is consistently the second best algorithm on all the datasets with minute difference from the best and yet have zero wins. To capture this notion of consistency, we introduce a measure, shortfall, which captures the relative shortfall in performance compared to the best performing model on a given dataset.
$\mathrm{shortfall} = \frac{\mathrm{best}[\mathrm{dataset}] - \mathrm{performance}[\mathrm{dataset}]}{\mathrm{best}[\mathrm{dataset}]} \times 100$ (11)

where best[dataset] is the micro-F1 of the best performing model for the dataset and performance[dataset] is the model's performance on that dataset. In Table 3, we report the average shortfall across datasets; a lower shortfall indicates more consistent performance. Even under this measure, the proposed algorithm INIPMEAN outperforms existing methods. In particular, notice that while SSICA seemed to be the second best algorithm by the naive count of wins, it does poorly when we consider the shortfall metric. This is because SSICA is not consistent across datasets and, in particular, gives a very poor performance on some datasets, which is undesirable. On the other hand, INIPMEAN not only wins on 4/11 datasets but also does consistently well on all the datasets, and hence has the lowest average shortfall.
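In code, the shortfall of Eqn 11 (read here as a percentage, consistent with the Shortfall column of Table 3) is simply:

```python
def shortfall(performance, best):
    """Relative shortfall (%) of a model from the best micro-F1 on a
    dataset: 100 * (best - performance) / best, as in Eqn 11."""
    return 100.0 * (best - performance) / best

def average_shortfall(model_scores, best_scores):
    """Average shortfall of one model across datasets; lower is better
    and rewards consistency rather than isolated wins."""
    vals = [shortfall(p, b) for p, b in zip(model_scores, best_scores)]
    return sum(vals) / len(vals)

# A model that is best on one dataset but 10% short on another averages
# a 5% shortfall, even though it has one win.
avg = average_shortfall([80.0, 63.0], [80.0, 70.0])
```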
Baselines Vs. Collective classification (CC) models
As mentioned earlier, the baselines BL_NEIGH and BL_NODE use only neighbor information and only node information, respectively. In datasets where BL_NEIGH significantly outperforms BL_NODE, all CC models outperform both these baselines by jointly utilizing the node and neighborhood information. In datasets such as Cora, Citeseer, Cora2, Pubmed and Human, where the performance of BL_NEIGH is comparable to or better than that of BL_NODE, CC models improve over BL_NEIGH in the transductive setup. Similarly, on the inductive task, where BL_NEIGH clearly outperforms BL_NODE, CC methods end up improving further still. In the Reddit and Amazon datasets, where BL_NODE and BL_NEIGH perform comparably, CC methods still learn to exploit useful correlations between the two sources of information to obtain further improvements.
WL kernels vs. NIP kernels
We make the following observations:
Node information morphing in WL kernels: The poor performance of BL_NEIGH compared to BL_NODE on the Blog, FB and Movie datasets suggests that the neighborhood information is noisy and the node features are more crucial there. The original GCN, which aggregates information from the neighbors but uses neither CONCAT nor skip connections, typically suffers a severe drop in performance on datasets with high degree. Despite having access to the node information, GCN performs worse than BL_NODE on these datasets. The improved performance of GCN over BL_NEIGH on Blog and Movie supports the claim that node information is essential.
Solving node information morphing with skip connections in WL kernels: The original GCN architecture does not allow skip connections from earlier layers to later ones. We modify the original architecture and introduce these skip connections (GCNS), so that each layer retains direct access to the node information from earlier convolutions. With skip connections, GCNS outperforms the base GCN on 8/11 datasets. We observed a performance boost on the Blog, FB, Movie and Amazon datasets even when considering only 2 hops, thereby decreasing the shortfall on these datasets. GCNS closed the performance gap with BL_NODE on these datasets, and on the Amazon dataset it improved further still. GCNMEAN, which also has skip connections, performs quite similarly to GCNS on all datasets and does not suffer from node information morphing as much as GCN. It is important to note that skip connections are required not only for going deeper but, more importantly, to avoid information morphing even at smaller hops.
GS models do not suffer from the node information morphing issue as they concatenate node and neighborhood information. The authors of GS also noted a significant performance boost from including the CONCAT combination. GSMEAN's counterpart among the summation models is GCNMEAN, which gives similar performance on most datasets, except for Reddit and Amazon where GSMEAN with concatenation performs better than GCNMEAN. GSMAX performs very similarly to GSMEAN, GCNMEAN, and GCNS across the board, and their shortfall numbers are also very similar. GSLSTM typically performs poorly, which might be because earlier neighbors' information is morphed by more recent neighbors later in the list.
Solving node information morphing with NIP kernels: NIPMEAN, a MEAN pooling kernel from the NIP propagation family, outperforms its WL family counterpart GCNMEAN on 9/11 datasets. Under a Wilcoxon signed-rank test, NIPMEAN > GCNMEAN with p < 0.01. It achieves a significant improvement over GCNMEAN on the Human, Reddit and Amazon datasets. It similarly outperforms GSMEAN on 9/11 datasets even though GSMEAN has twice the number of parameters. NIPMEAN provides the most consistent performance among the non-iterative models, with a shortfall as low as 3.6. NIPMEAN's clear improvement over its WL counterparts demonstrates the benefit of the NIP family of kernels, which explicitly preserve the node information and mitigate the node information morphing issue.
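To make the contrast concrete, the following is a minimal numpy sketch of the two styles of mean aggregation; the exact HOPF/NIP formulation differs, and the update rules, weight matrices, and normalization below are simplifying assumptions for illustration only:

```python
import numpy as np

def wl_mean_layer(H, A, W):
    """WL-style mean kernel (as in GCN-like models): node state and
    neighborhood state are averaged together, so over several hops the
    original node features get progressively morphed."""
    deg = A.sum(axis=1, keepdims=True) + 1.0  # |N(v)| plus self
    M = (A @ H + H) / deg                     # mean over {v} union N(v)
    return np.maximum(M @ W, 0.0)             # ReLU

def nip_mean_layer(X, H, A, Wx, Wn):
    """NIP-style mean kernel (sketch): the raw node features X are
    re-injected at every hop, explicitly preserving node information."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    M = (A @ H) / deg                         # mean over N(v) only
    return np.maximum(X @ Wx + M @ Wn, 0.0)
```

Even after one WL-style layer, a node's own features are already blended with its neighbors', whereas the NIP-style layer keeps a dedicated term for them at every hop.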
Iterative inference models vs. differentiable kernels
Iterative inference models, SSICA and INIPMEAN, exploit label information from the neighborhood and scale beyond the memory limits of the differentiable kernels. This was evident in our experiments on the large Reddit dataset, which was computationally time-consuming even with partial neighbor sampling, owing to its high link density. The iterative models, however, scale beyond 2 hops, considering 5 hops for SSICA and 10 hops for INIPMEAN. This is computationally possible because of the linear time scaling and constant memory complexity of iterative models. Hence, they achieve superior performance with less computation time on Reddit. On a particular fold, SSICA's micro-f1 improved over iterations from 57.118 to a remarkable 81.92, and INIPMEAN showed a similar improvement over iterations on the same fold.
The benefit of label information over attributes can be analyzed with SSICA, which aggregates only the label information of immediate neighbors. On the Yeast dataset, SSICA gains an improvement over the non-iterative models, which do not use label information. However, SSICA does not perform well on some datasets, as it does not leverage neighbors' features and is restricted to learning only first-order local information, unlike the multi-hop differentiable WL or NIP kernels.
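The iterative inference loop behind ICA-style models such as SSICA can be sketched as below. This is a simplified illustration, not the exact SSICA procedure: the `predict` classifier, the mean label aggregation, and the update schedule are all assumptions.

```python
import numpy as np

def iterative_inference(A, Y_seed, labeled_mask, predict, n_iters=5):
    """ICA-style inference sketch: repeatedly re-estimate the labels of
    unlabeled nodes from their neighbors' current label estimates.

    A            : (n, n) adjacency matrix
    Y_seed       : (n, L) one-hot labels, zero rows for unlabeled nodes
    labeled_mask : (n,) boolean, True where the label is known
    predict      : maps (n, L) neighbor-label summaries to label scores
    """
    Y = Y_seed.copy()
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    for _ in range(n_iters):
        neigh_labels = (A @ Y) / deg              # mean label of neighbors
        scores = predict(neigh_labels)
        Y[~labeled_mask] = scores[~labeled_mask]  # known labels stay fixed
    return Y
```

With `predict` set to a trained relational classifier, label evidence propagates roughly one hop per iteration at constant memory cost, which is what lets iterative models reach far-away hops cheaply.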
Iterative differentiable kernels vs. the rest
INIPMEAN, an extension of NIPMEAN with iterative learning and inference, can leverage attribute information and exploit non-linear correlations between the labels and attributes from different hops. INIPMEAN improves over NIPMEAN on seven of the eleven datasets, with a significant boost in performance on the Cora2, Reddit, Amazon, and Yeast datasets. Under a Wilcoxon signed-rank test, INIPMEAN is significantly better than NIPMEAN (p < 0.05). INIPMEAN also successfully leverages label information like SSICA and obtains a similar performance boost on the Yeast and Reddit datasets. It also outperforms SSICA on eight of eleven datasets with a statistical significance of p < 0.02 under the Wilcoxon test. The benefits of using neighbors' attributes along with labels are visible on the Amazon and Human datasets, where INIPMEAN improves over SSICA, which uses label information alone. Moreover, by leveraging both attributes and labels in a differentiable manner, it further improves over the second best model on Cora2. This hybrid model, INIPMEAN, emerges as the most robust model across all datasets, with the lowest shortfall.
Inductive learning on Human dataset
For the inductive learning task in Table 4, the CC models obtain an improvement over BL_NODE by leveraging relational information. The INIPMEAN and NIPMEAN kernels achieve the best performance, improving over GCNMEAN.
Runtime analysis
We adapt the scalability setup from GCN [Kipf and Welling2016] to compare the average training time per epoch between the fully differentiable model, NIPMEAN, and the iterative differentiable model, INIPMEAN, when making predictions with multi-hop representations. We consider two iterative NIPMEAN variants: INIPMEAN with 1 differentiable layer (C=1) and INIPMEAN with 2 differentiable layers (C=2). To obtain multi-hop representations, we increase the number of iterations, T, accordingly. Note that INIPMEAN with C=2 can only provide multi-hop representations in multiples of 2. The training time includes the time to prefetch neighbors with queues, the forward pass of the network, loss computation, and backward gradient propagation, similar to the setup of [Kipf and Welling2016]. We record the wall-clock running time for these models to process a synthetic graph with 100k nodes, 500k edges, 100 features, and 10 labels on a 4GB GPU. The batch size and hidden layer size were both set to 128. The average run time over runs across different hops is plotted in Figure 3. The fully differentiable model, NIPMEAN, incurred an exponential increase in compute time as the number of hops grew and ran out of memory after a few hops, whereas INIPMEAN with C=1 and C=2 showed linear growth in compute time with increasing hops. This agrees with the time complexity given earlier for these models. Not only does the time for non-iterative methods increase exponentially with hops, but the memory complexity also increases exponentially with each new layer, since the gradients and activations of all the new neighbors introduced at that hop must be stored. In comparison, the runtime of the proposed iterative solution grows linearly and has a smaller memory footprint.
The choice of K and the decision to use iterative learning depend on a variety of factors such as memory availability and the relevance of the labels. The choices of C, K and T should be determined by performance on a validation set. C should be set to the smaller of the maximum number of differentiable layers that fit in memory and the hop beyond which performance saturates or drops. T should be increased until the iterative stages cover the desired K hops when K differentiable layers do not fit in memory, or set to an arbitrary constant if performance improves with iterations even when K fits in memory.
Conclusion
In this work, we proposed HOPF, a novel framework for collective classification that combines differentiable graph kernels with an iterative stage. Deep learning models for relational learning tasks can now leverage HOPF to use complete information from larger neighborhoods without succumbing to over-parameterization and memory constraints. For future work, we can further optimize the framework by committing only high-confidence labels, as in cautious ICA [McDowell, Gupta, and Aha2007], to reduce erroneous information propagation, and we can increase the supervised information flow to unlabeled nodes by incorporating ghost edges [Gallagher et al.2008]. The framework can also be extended to unsupervised tasks by incorporating structural regularization with Laplacian smoothing on the embedding space.

References
 [Bhattacharya and Getoor2007] Bhattacharya, I., and Getoor, L. 2007. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD) 1(1):5.
 [Cantador, Brusilovsky, and Kuflik2011] Cantador, I.; Brusilovsky, P.; and Kuflik, T. 2011. 2nd workshop on information heterogeneity and fusion in recommender systems (hetrec 2011). In Proceedings of the 5th ACM conference on Recommender systems, RecSys 2011. New York, NY, USA: ACM.

 [Demšar2006] Demšar, J. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30.
 [Gallagher and EliassiRad2010] Gallagher, B., and Eliassi-Rad, T. 2010. Leveraging label-independent features for classification in sparsely labeled networks: An empirical study. Advances in Social Network Mining and Analysis 1–19.
 [Gallagher et al.2008] Gallagher, B.; Tong, H.; EliassiRad, T.; and Faloutsos, C. 2008. Using ghost edges for classification in sparsely labeled networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 256–264. ACM.
 [Gilmer et al.2017] Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212.

 [Glorot and Bengio2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256.
 [Hamilton, Ying, and Leskovec2017] Hamilton, W. L.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, 1025–1035.
 [Hatzis and Page2001] Hatzis, C., and Page, D. 2001. 2001 KDD cup challenge Dataset. In pages.cs.wisc.edu/ dpage/kddcup2001. KDD.
 [Jensen, Neville, and Gallagher2004] Jensen, D.; Neville, J.; and Gallagher, B. 2004. Why collective inference improves relational classification. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 593–598. ACM.
 [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.
 [Kipf and Welling2016] Kipf, T. N., and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. CoRR abs/1609.02907.
 [Leskovec and Sosič2016] Leskovec, J., and Sosič, R. 2016. SNAP: A general-purpose network analysis and graph-mining library. ACM Trans. Intell. Syst. Technol. 8(1):1:1–1:20.
 [Lu and Getoor2003] Lu, Q., and Getoor, L. 2003. Link-based classification. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), 496–503.
 [Macskassy and Provost2003] Macskassy, S. A., and Provost, F. 2003. A simple relational classifier. Technical report, NEW YORK UNIV NY STERN SCHOOL OF BUSINESS.
 [Mccallum2001] Mccallum, A. 2001. CORA Research Paper Classification Dataset. In people.cs.umass.edu/ mccallum/data.html. KDD.
 [McDowell and Aha2012] McDowell, L. K., and Aha, D. W. 2012. Semi-supervised collective classification via hybrid label regularization. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 – July 1, 2012.
 [McDowell and Aha2013] McDowell, L. K., and Aha, D. W. 2013. Labels or attributes?: Rethinking the neighbors for collective classification in sparsely-labeled networks. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 847–852. ACM.
 [McDowell, Gupta, and Aha2007] McDowell, L. K.; Gupta, K. M.; and Aha, D. W. 2007. Cautious inference in collective classification. In AAAI, volume 7, 596–601.
 [Moore and Neville2017] Moore, J., and Neville, J. 2017. Deep collective inference. In AAAI, 2364–2372.
 [Morris, Kersting, and Mutzel2017] Morris, C.; Kersting, K.; and Mutzel, P. 2017. Glocalized Weisfeiler-Lehman graph kernels: Global-local feature maps of graphs. In Data Mining (ICDM), 2017 IEEE International Conference on, 327–336. IEEE.
 [Namata et al.2012] Namata, G.; London, B.; Getoor, L.; Huang, B.; and EDU, U. 2012. Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs.
 [Neumann et al.2016] Neumann, M.; Garnett, R.; Bauckhage, C.; and Kersting, K. 2016. Propagation kernels: efficient graph kernels from propagated information. Machine Learning 102(2):209–245.
 [Neville and Jensen2000] Neville, J., and Jensen, D. 2000. Iterative classification in relational data. In Proc. AAAI2000 Workshop on Learning Statistical Models from Relational Data, 13–20.
 [Neville and Jensen2003] Neville, J., and Jensen, D. 2003. Collective classification with relational dependency networks. In Proceedings of the Second International Workshop on MultiRelational Data Mining, 77–91.
 [Pfeiffer III, Neville, and Bennett2015] Pfeiffer III, J. J.; Neville, J.; and Bennett, P. N. 2015. Overcoming relational learning biases to accurately predict preferences in large scale networks. In Proceedings of the 24th International Conference on World Wide Web, 853–863. International World Wide Web Conferences Steering Committee.
 [Sen P et al.2008] Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Gallagher, B.; and Eliassi-Rad, T. 2008. Collective classification in network data. AI Magazine 29(3):93.
 [Shervashidze et al.2011] Shervashidze, N.; Schweitzer, P.; Leeuwen, E. J. v.; Mehlhorn, K.; and Borgwardt, K. M. 2011. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12(Sep):2539–2561.

 [Subbanna, Precup, and Arbel2014] Subbanna, N.; Precup, D.; and Arbel, T. 2014. Iterative multilevel MRF leveraging context and voxel information for brain tumour segmentation in MRI. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 400–405.
 [Sun et al.2016] Sun, W.; Venkatraman, A.; Boots, B.; and Bagnell, J. A. 2016. Learning to filter with predictive state inference machines. In International Conference on Machine Learning, 1197–1205.
 [Tong, Faloutsos, and Pan2006] Tong, H.; Faloutsos, C.; and Pan, J.Y. 2006. Fast random walk with restart and its applications. In Proceedings of the Sixth International Conference on Data Mining, ICDM ’06, 613–622. Washington, DC, USA: IEEE Computer Society.
 [Wang et al.2010] Wang, X.; Tang, L.; Gao, H.; and Liu, H. 2010. Discovering overlapping groups in social media. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, 569–578. IEEE.
 [Weisfeiler and Lehman1968] Weisfeiler, B., and Lehman, A. 1968. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia 2(9):12–16.
 [Yang, Cohen, and Salakhutdinov2016] Yang, Z.; Cohen, W. W.; and Salakhutdinov, R. 2016. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19–24, 2016, 40–48.
 [Yu and Clausi2005] Yu, Q., and Clausi, D. A. 2005. Combining local and global features for image segmentation using iterative classification and region merging. In Computer and Robot Vision, 2005. Proceedings. The 2nd Canadian Conference on, 579–586. IEEE.
Appendix A
Hyperparams  CORA  CITE  CORA2  YEAST  HUMAN  BLOG  FB  AMAZON  MOVIE  PUBMED  REDDIT

Learning Rate  1E-02  1E-02  1E-02  1E-02  1E-02  1E-02  1E-02  1E-02  1E-02  1E-02  1E-02
Batch Size  128  128  128  128  512  512  128  512  64  128  512
Dimensions  16  16  128  128  128  128  8  8  128  16  128
L2 weight  1E-03  1E-03  1E-06  1E-06  0  1E-06  0  0  1E-06  1E-03  0
Dropouts  0.5  0.5  0.25  0.25  0  0  0  0  0  0.5  0
WCE  Yes  Yes  Yes  Yes  No  Yes  Yes  Yes  Yes  Yes  No
Activation  ReLU  ReLU  ReLU  ReLU  ReLU  ReLU  ReLU  ReLU  ReLU  ReLU  ReLU
Implementation details
For optimal performance, both iterative and non-iterative models are processed in minibatches. They use queues to prefetch the exponentially growing neighborhood information of the nodes in a minibatch. The propagation steps are computed with sparse-dense matrix operations. Minibatching also makes it possible to distribute the gradient computation efficiently in a multi-GPU setup; we leave this enhancement for future work. The choice of data structure for the kernel is also crucial for processing the graph, i.e., the trade-off between adjacency lists and adjacency matrices. Working with max-pool or LSTM aggregators is difficult with an adjacency matrix, as each node's neighborhood information needs to be flattened dynamically. LSTM-based models also have to deal with nodes of highly varying degree and the limitation of processing nodes sequentially even when they are at the same hop distance. The code for the HOPF framework processes neighborhood information with adjacency matrices and is primarily suited for weighted mean kernels.
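A minimal sketch of one sparse-dense propagation step on a prefetched minibatch, assuming a weighted mean kernel; the function name and shapes are illustrative, not the actual HOPF API:

```python
import numpy as np
from scipy import sparse

def propagate_minibatch(A_batch, H_neigh, W):
    """One propagation step for a minibatch of nodes.

    A_batch : sparse (batch, n_neigh) adjacency slice for the minibatch
    H_neigh : dense (n_neigh, d) features of the prefetched neighborhood
    W       : dense (d, d') weight matrix

    The dense product is taken first so that the sparse matrix multiplies
    a small dense result; rows are then degree-normalized (weighted mean).
    """
    deg = np.maximum(np.asarray(A_batch.sum(axis=1)), 1.0)
    return (A_batch @ (H_neigh @ W)) / deg
```

Keeping the adjacency slice sparse is what makes the per-batch cost proportional to the number of edges touched rather than to batch × neighborhood size.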
Weighted Cross Entropy Loss (WCE)
Models in previous works [Yang, Cohen, and Salakhutdinov2016, Kipf and Welling2016] were trained with a balanced labeled set, i.e., an equal number of samples per label is provided for training. Such assumptions on the availability of training samples and on a similar label distribution at test time are unrealistic in most scenarios. To test the robustness of CC models in a more realistic setup, we create training datasets by drawing random subsets of nodes from the full ground-truth data. Randomly drawn training samples are highly likely to suffer from severe class imbalance, which can skew the weight updates towards the dominant labels during training. To overcome this problem, we generalize the weighted cross entropy defined in [Moore and Neville2017] to both the multi-class and multi-label settings, and we use it as the loss function for all methods, including the baselines. The weight for label ℓ is given in the equation below, where L is the total number of labels, n_ℓ is the number of training samples with label ℓ, and N = Σ_ℓ n_ℓ. The weight of each label is inversely proportional to the number of samples having that label.

w_ℓ = N / (L · n_ℓ)    (12)
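A small numpy sketch of this weighting and the resulting loss; the exact normalization constant is a simplifying assumption (any constant preserves the inverse proportionality), and the function names are illustrative:

```python
import numpy as np

def label_weights(Y):
    """Per-label weights for weighted cross entropy (cf. Eqn. 12):
    the weight of each label is inversely proportional to the number
    of training samples carrying that label.

    Y : (n_samples, L) one-hot (multi-class) or multi-hot (multi-label).
    """
    counts = np.maximum(Y.sum(axis=0), 1.0)  # n_l, guarded against 0
    L = Y.shape[1]
    return counts.sum() / (L * counts)       # w_l = N / (L * n_l)

def weighted_cross_entropy(Y, P, eps=1e-12):
    """Mean cross entropy with the per-label weights applied."""
    w = label_weights(Y)
    return float(-np.mean((w * Y * np.log(P + eps)).sum(axis=1)))
```

Rare labels thus contribute larger per-sample gradients, counteracting the skew of randomly drawn training sets.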
Hyperparameters
The hyperparameters for the models are the number of neural network layers (hops), the dimensions of the layers, dropout for all layers, and L2 regularization. We train all models for a maximum of 2000 epochs using Adam [Kingma and Ba2014] with the initial learning rate set to 1e-2. We use a variant of the patience method with learning rate annealing for early stopping. Specifically, we train the model for a minimum of 50 epochs and start with a patience of 30 epochs, dropping the learning rate and the patience by half whenever the patience runs out (i.e., when the validation loss does not reduce within the patience window). We stop training when the model loses patience for 2 consecutive turns.
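The stopping rule described above can be sketched as follows; this is a simplified simulation of the schedule (the helper name and the window-reset detail after annealing are assumptions), returning the epoch at which training would stop:

```python
def patience_schedule(val_losses, min_epochs=50, init_patience=30, max_strikes=2):
    """Simulate the patience-with-annealing early stopping rule: train at
    least `min_epochs`; each time validation loss fails to improve within
    the patience window, halve the patience (the learning rate is halved
    at the same point during real training); stop after `max_strikes`
    consecutive exhaustions of the patience."""
    best, best_epoch = float("inf"), 0
    patience, strikes = init_patience, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch = loss, epoch
            strikes = 0                      # improvement resets the strikes
        elif epoch >= min_epochs and epoch - best_epoch >= patience:
            strikes += 1
            if strikes >= max_strikes:
                return epoch                 # patience lost twice in a row
            patience = max(patience // 2, 1)
            best_epoch = epoch               # restart window after annealing
    return len(val_losses)
```

For example, a run whose validation loss never improves trains for the 50-epoch minimum, loses its 30-epoch patience once, and then stops when the halved 15-epoch window also elapses.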
We found that all weighted average kernels, along with the GSMAX model, share similar optimal hyperparameters, as their formulations and parameter counts are similar. This is in agreement with the GCN and GraphSAGE works, where all models had similar hyperparameters. However, GSLSTM, which has more parameters and a different aggregation function, required additional hyperparameter tuning. For the reported results, we searched for the optimal hyperparameter setting of a two layer GCNS model on all datasets using the validation set. We then used the same hyperparameters for all other models except GSLSTM, for which we searched separately. We report the performance of each model with its ideal number of differentiable graph layers, based on validation performance. The maximum number of hops beyond which performance saturated or decreased was: 3 hops for Amazon, 4 hops for Cora2 and Human, and 2 hops for the remaining datasets. For the Reddit dataset, we sampled 25 and 10 partial neighbors at the first and second hop respectively, the default GraphSAGE setting, as the dataset has extremely high link density.
We row-normalize the node features and initialize the weights as in [Glorot and Bengio2010]. Since the percentage of different labels in the training samples can be significantly skewed, like [Moore and Neville2017] we weigh the loss for each label inversely proportionally to its total fraction, as in Eqn. 12. We added all these components to the baseline codes too, ensuring that all models share the same setup in terms of the weighted cross entropy loss, number of layers, dimensions, patience based stopping criterion, and dropout. In fact, we observed an improvement of 25.91 percentage points for GraphSAGE on their dataset. GraphSAGE's LSTM model gave an out-of-memory error on Blog, Movielens, and Cora2, as the initial feature size was large and, with the large number of parameters of the LSTM model, the parameter size exploded. Hence, for these datasets alone we reduced the feature size.