Many real-world datasets such as social networks can be modeled as a graph in which the nodes represent entities in the network and the edges capture the interactions between the corresponding entities. Further, every node can have attributes associated with it, and some nodes can have known labels. Given such a graph, Collective Classification (CC) [Neville and Jensen2000, Lu and Getoor2003, Sen P et al.2008] is the task of assigning labels to the remaining unlabeled nodes in the graph. A key task here is to extract relational features for every node that consider not only the attributes of the node but also the attributes and labels of its partially labeled neighborhood. Neural network based models have become popular for computing such node representations by aggregating node and neighborhood information.
The key idea is to exploit the inherent relational structure among the nodes, which encodes valuable information about homophily, influence, community structure, etc. [Jensen, Neville, and Gallagher2004]. Traditionally, various neighborhood statistics on structural properties [Gallagher and Eliassi-Rad2010], and distributions on labels and features [Neville and Jensen2003, Lu and Getoor2003, McDowell and Aha2013] were used as relational features to predict labels. Further, iterative inference techniques were widely adopted to propagate these label predictions until convergence [Sen P et al.2008]. Recently, [Kipf and Welling2016] proposed Graph Convolutional Networks (GCN), which use a re-parameterized Laplacian based graph kernel, for the node-level semi-supervised classification task. GraphSAGE [Hamilton, Ying, and Leskovec2017] further extended GCN and proposed a few additional neighborhood aggregation functions to achieve state-of-the-art results for inductive learning.
These graph convolution kernels are based on differentiable extensions of the popular Weisfeiler-Lehman (WL) kernels. In this work, we first show that a direct adaptation of WL kernels for the CC task is inherently limited, as node features get exponentially morphed with neighborhood information when considering farther hops. More importantly, learning to aggregate information from a $K$-hop neighborhood in an end-to-end differentiable manner is not easily scalable. The exponential increase in neighborhood size with an increase in hops severely limits the model due to excessive memory and computation requirements. In this work, we propose a Higher-order Propagation framework (HOPF) that provides a solution for both these problems. Our main contributions are:
A modular graph kernel that generalizes many existing methods. Through this, we discuss a hitherto unobserved phenomenon which we refer to as Node Information Morphing. We discuss its implications on the limitations of existing methods and then discuss a novel family of kernels called the Node Information Preserving (NIP) kernels to address these limitations.
A hybrid semi-supervised learning framework for higher order propagation (HOPF) that couples differentiable kernels with an iterative inference procedure to aggregate neighborhood information over farther hops. This allows differentiable kernels to exploit label information and further overcome excessive memory constraints imposed by multi-hop information aggregation.
An extensive experimental study on 11 datasets from different domains. We demonstrate the NIM issue and show that the proposed iterative NIP model is robust and overall outperforms existing models.
In this section, (i) we define the notations and terminologies used, (ii) we present the generic differentiable kernel for capturing higher order information in the CC setting, (iii) we discuss existing works in the light of the generic kernel, and (iv) we analyze the Node Information Morphing (NIM) issue.
Definitions and notations
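Let $G = (V, E)$ denote a graph with a set of vertices, $V$, and edges, $E \subseteq V \times V$. Let $|V| = n$. The set $E$ is represented by an adjacency matrix $A \in \mathbb{R}^{n \times n}$, and let $D \in \mathbb{R}^{n \times n}$ denote the diagonal degree matrix defined as $D_{ii} = \sum_{j} A_{ij}$.
A collective classification dataset defined on graph $G$ comprises a set of labeled nodes, $S$, a set of unlabeled nodes, $U$ with $U = V - S$, a feature matrix $X \in \mathbb{R}^{n \times f}$ and a label matrix $Y \in \{0,1\}^{n \times l}$, where $f$ and $l$ denote the number of features and labels, respectively. Let $\hat{Y} \in \mathbb{R}^{n \times l}$ denote the predicted label matrix.
In this work, neural networks defined over $K$-hop neighborhoods have $K$ aggregation or convolution layers with $d$ dimensions each, whose outputs are denoted by $h_1, \ldots, h_K$. We denote the learnable weights associated with the $k$-th layer as $W_k^{\phi}$ and $W_k^{\psi} \in \mathbb{R}^{d \times d}$. The weights of the input layer ($W_1^{\phi}$, $W_1^{\psi}$) and output layer, $W_L$, are in $\mathbb{R}^{f \times d}$ and $\mathbb{R}^{d \times l}$ respectively. Iterative inference steps are indexed by $t \in (1, T)$.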
Generic propagation kernel
We define the generic propagation (graph) kernel as follows:
$$h_k = \sigma_k\big(\alpha \cdot (\Phi_k W_k^{\phi}) + \beta \cdot (F(A)\, \Psi_k W_k^{\psi})\big) \qquad (1)$$
where $\Phi_k$ and $\Psi_k$ are the node and neighbor features considered at the $k$-th propagation step (layer), $F(A)$ is a function of the adjacency matrix of the graph, and $W_k^{\phi}$ and $W_k^{\psi}$ are weights associated with the $k$-th layer of the neural network. One can view the first term in the equation as processing the information of a given node and the second term as processing the neighbors' information. The kernel recursively computes the outputs of the $k$-th layer by combining the features computed till the $(k-1)$-th layer. $\sigma_k$ is the activation function of the $k$-th layer, and $\alpha$ and $\beta$ can be scalars, vectors or matrices depending on the kernel.
Label predictions, $\hat{Y}$, can be obtained by projecting $h_K$ onto the label space followed by a softmax or sigmoid layer, corresponding to the multi-class or multi-label classification task respectively. The weights of the model are learned via backpropagation by minimizing an appropriate classification loss on $\hat{Y}$.
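As a concrete illustration, one propagation step of the generic kernel can be sketched in plain Python. This is a minimal sketch rather than the authors' implementation: the function names (`matmul`, `mean_adjacency`, `propagate`), the use of nested lists instead of a tensor library, and the argument names `phi`, `psi`, `f_a`, `w_phi`, `w_psi`, `alpha` and `beta` (node features, neighbor features, adjacency function, the two layer weights and the two mixing coefficients) are all assumptions made for illustration. The coefficients are restricted to scalars here.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mean_adjacency(adj):
    """Row-normalise the adjacency matrix (D^-1 A); isolated nodes keep a zero row."""
    out = []
    for row in adj:
        deg = sum(row)
        out.append([v / deg if deg else 0.0 for v in row])
    return out


def propagate(phi, psi, f_a, w_phi, w_psi, alpha=1.0, beta=1.0, act=math.tanh):
    """One layer: act(alpha * phi @ w_phi + beta * f_a @ psi @ w_psi)."""
    node_term = matmul(phi, w_phi)                # the node's own information
    neigh_term = matmul(matmul(f_a, psi), w_psi)  # aggregated neighbor information
    return [[act(alpha * n + beta * m) for n, m in zip(n_row, m_row)]
            for n_row, m_row in zip(node_term, neigh_term)]


if __name__ == "__main__":
    # Hypothetical 3-node path graph 0 - 1 - 2 with one scalar feature per node.
    adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    x = [[1.0], [2.0], [3.0]]
    identity = [[1.0]]
    print(propagate(x, x, mean_adjacency(adj), identity, identity, act=lambda v: v))
```

Passing the same matrix for `phi` and `psi` mimics the WL-style kernels, while passing the original features as `phi` at every layer gives a node-information-preserving variant.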
Relation to existing works:
Appropriate choices of $\alpha$, $\beta$, $\Phi_k$, $\Psi_k$ and $F(A)$ in the generic kernel yield different models. Table 1 lists the choices for some of the popular models, as well as our proposed approaches. Iterative collective inference techniques, such as the ICA family, combine node information with aggregated label summaries of immediate neighbors to make predictions. Aggregation can be based on an averaging kernel, $F(A) = D^{-1}A$, or a label count kernel, $F(A) = A$, etc., with labels as neighbors' features ($\Psi_k = \hat{Y}$). This neighborhood information is then propagated iteratively to capture higher order information. ICA also has a semi-supervised variant [McDowell and Aha2012] where after each iteration the model is re-learned with updated labels of neighbors. Table: 1 shows how the modular components can be chosen to see semi-supervised ICA (SS-ICA) as a special instantiation of our framework.
The Weisfeiler-Lehman (WL) family of recursive kernels [Weisfeiler and Lehman1968, Shervashidze et al.2011] was initially defined for graph isomorphism tests, and most recent CC methods use differentiable extensions of it. In its basic form, it is the simplest instantiation of our generic propagation kernel with no learnable parameters, as shown in Table: 1.
The normalized symmetric Laplacian kernel (GCN) used in [Kipf and Welling2016] can be seen as an instance of the generic kernel with node weight, $\alpha = (D+I)^{-1}$, individual neighbors' weights, $F(A) = (D+I)^{-\frac{1}{2}} A (D+I)^{-\frac{1}{2}}$, and $W_k^{\phi} = W_k^{\psi}$. We also consider its mean aggregation variant (GCN-MEAN), where $F(A) = D^{-1}A$. In theory, by stacking multiple graph convolutional layers, any higher order information can be captured in a differentiable way in $O(K \times |E|)$ computations. However, in practice, the proposed model in [Kipf and Welling2016] is only full batch trainable and thus cannot scale to large graphs when memory is limited.
GraphSAGE (GS) [Hamilton, Ying, and Leskovec2017] is the recent state-of-the-art for inductive learning. GraphSAGE has also proposed variants of $K$-th order differentiable WL kernels, viz: GS-MEAN, GS-Pool and GS-LSTM. These variants can be viewed as special instances of our generic framework as mentioned in Table 1. GS-Pool applies a max-pooling function to aggregate neighborhood information whereas GS-LSTM uses an LSTM to combine neighbors' information sequenced in random order similar to [Moore and Neville2017]. GS has a mean averaging variant, similar to the GCN-MEAN model, but treats nodes separately from their neighbors, i.e. $W_k^{\phi} \neq W_k^{\psi}$. Finally, it either concatenates or adds up the node and neighborhood information. GS-LSTM is over-parameterized for small datasets. With GS-MAX and GS-LSTM there is a loss of information, as max pooling considers only the largest input and the LSTM focuses more on the recent neighbors in the random sequence.
Node Information Morphing (NIM): Analysis
In this section, we show that existing models which extract relational features do not completely retain the original node information. With multiple propagation steps, the original node information, $h_0$, is decayed and morphed with neighborhood information. We term this issue Node Information Morphing (NIM).
For ease of illustration, we demonstrate the NIM issue by ignoring the non-linearity and weights. Based on the commonly observed instantiations of our generic propagation kernel (Eqn: 1), where $\Phi_k = \Psi_k = h_{k-1}$, we consider the following equation:
$$h_k = \alpha\, h_{k-1} + \beta\, F(A)\, h_{k-1} \qquad (2)$$
On unrolling the above expression, one can derive the following binomial form:
$$h_K = \big(\alpha I + \beta F(A)\big)^K h_0 = \sum_{k=0}^{K} \binom{K}{k}\, \alpha^{K-k} \beta^{k} F(A)^{k}\, h_0 \qquad (3)$$
From Eqn: 3, it can be seen that the relative importance of information associated with the node's $0$-hop information, $h_0$, is $\alpha^K / (\alpha+\beta)^K$. Hence, for any positive $\beta$ the importance of $h_0$ decays exponentially with $K$. It can be seen that the decay rate for GCN is and for the other WL kernel variants mentioned in Table: 1.
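The decay can be checked numerically. The sketch below assumes the linear recursion $h_k = \alpha h_{k-1} + \beta F(A) h_{k-1}$ with scalar $\alpha$ and $\beta$, whose unrolled form weights the $k$-hop term by $\binom{K}{k}\alpha^{K-k}\beta^{k}$. The helper names `hop_coefficients` and `node_share` are hypothetical and not part of the original work.

```python
from math import comb


def hop_coefficients(alpha, beta, K):
    """Binomial weights on the 0..K hop terms after K linear propagation steps."""
    return [comb(K, k) * alpha ** (K - k) * beta ** k for k in range(K + 1)]


def node_share(alpha, beta, K):
    """Fraction of the total weight that stays on the node's own (0-hop) features."""
    coeffs = hop_coefficients(alpha, beta, K)
    return coeffs[0] / sum(coeffs)


if __name__ == "__main__":
    # With alpha = beta = 1 the node's own share halves with every extra hop.
    for K in range(1, 6):
        print(K, node_share(1.0, 1.0, K))
```

Because the coefficients sum to $(\alpha+\beta)^K$, `node_share` equals $(\alpha/(\alpha+\beta))^K$, so the exponential decay does not depend on the particular graph in this simplified setting.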
Skip connections and Node Information Morphing:
It can be similarly derived and seen that the information morphing happens not only at $h_0$ but also for every $h_k$, $k \in [0, K-1]$. This decay of neighborhood information can be lessened by leveraging skip connections. Consider the propagation kernel in Eqn: 2 with skip connections as shown below:
$$h_k = \big(\alpha\, h_{k-1} + \beta\, F(A)\, h_{k-1}\big) + h_{k-1} \qquad (4)$$
The above equation on expanding as above gives:
$$h_K = \big((\alpha + 1) I + \beta F(A)\big)^K h_0 \qquad (5)$$
The relative importance of weights of $h_0$ then becomes $(\alpha+1)^K / (\alpha+\beta+1)^K$, which decays slower than $\alpha^K / (\alpha+\beta)^K$ for all $\alpha, \beta > 0$. Though this helps in retaining information longer, it doesn't solve the problem completely. Skip connections were used in GCN to reduce the drop in performance of their model with multiple hops. The addition of skip connections in GCN was originally motivated from the conventional perspective, to avoid a reduction in performance with increasing neural network layers, and not with the intention to address information morphing. In fact, their standard 2-layer model cannot accommodate skip connections because of the varying output dimensions of layers. Similarly, GraphSAGE models which utilized the concatenation operation to combine node and neighborhood information also lessened the decay effect in comparison to summation based combination models. This can be attributed to the fact that concatenation of information from the previous layer can be perceived as skip connections, as noted by its authors. Though the above analysis is done on a linear propagation model, this insight is applicable to the non-linear models as well. Our empirical results also confirm this.
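The claim that skip connections slow the decay without removing it follows from a one-line comparison. The derivation below is a sketch under the assumptions of the linear analysis: scalar $\alpha \geq 0$ and $\beta > 0$, with per-layer retention factors of $\alpha/(\alpha+\beta)$ without a skip connection and $(\alpha+1)/(\alpha+\beta+1)$ with one.

```latex
\begin{align*}
\frac{\alpha+1}{\alpha+\beta+1} - \frac{\alpha}{\alpha+\beta}
  &= \frac{(\alpha+1)(\alpha+\beta) - \alpha(\alpha+\beta+1)}
          {(\alpha+\beta+1)(\alpha+\beta)} \\
  &= \frac{\beta}{(\alpha+\beta+1)(\alpha+\beta)} \;>\; 0,
\qquad \text{yet} \qquad
\frac{\alpha+1}{\alpha+\beta+1} \;<\; 1 .
\end{align*}
```

Raising both factors to the power $K$ preserves the ordering, so the skip-connected kernel retains more of the node's own information at every depth, but its share still shrinks geometrically as $K$ grows.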
In this section we propose (i) a solution to the NIM issue and (ii) a generic semi-supervised learning framework for higher order propagation.
Node Information Preserving models
To address the NIM issue, we propose a specific class of instantiations of the generic kernel which we call the Node Information Preserving (NIP) models. One way to avoid the NIM issue is to explicitly retain the $h_0$ information at every propagation step as in the equation below. This is obtained from Eqn: 1 by setting $\Phi_k = h_0$ and $\Psi_k = h_{k-1}$.
$$h_k = \sigma_k\big(\alpha\, h_0 W_k^{\phi} + \beta\, F(A)\, h_{k-1} W_k^{\psi}\big) \qquad (6)$$
For different choices of $\alpha$ and $\beta$, we get different kernels of this family. In particular, setting $\beta = 1 - \alpha$ and $F(A) = D^{-1}A$ yields a kernel similar to Random Walk with Restart (RWR) [Tong, Faloutsos, and Pan2006].
The NIP formulation has two significant advantages: (a) it enables capturing the correlation between $k$-hop reachable neighbors and the node explicitly, and (b) it creates a direct gradient path to the node information from every layer, thus allowing for better training. We propose a specific instantiation of the generic NIP kernel below:
$$h_k = \sigma_k\big(h_0 W_k^{\phi} + D^{-1}A\, h_{k-1} W_k^{\psi}\big) \qquad (7)$$
NIP-MEAN is similar to GCN-MEAN but with $\Phi_k = h_0$ and $W_k^{\phi} \neq W_k^{\psi}$.
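To make the difference from WL-style kernels tangible, a stack of NIP-MEAN layers can be sketched as follows. This is an illustrative sketch, not the released implementation: it assumes mean aggregation via a row-normalised adjacency matrix, separate node and neighbor weights per layer, and it feeds the original features to the node term at every layer. The names `nip_mean`, `row_normalise` and `matmul` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_normalise(adj):
    """Mean aggregation operator D^-1 A; isolated nodes keep a zero row."""
    out = []
    for row in adj:
        deg = sum(row)
        out.append([v / deg if deg else 0.0 for v in row])
    return out


def nip_mean(x, adj, weights, act=lambda v: v):
    """Apply len(weights) NIP-MEAN layers; weights is a list of (w_phi, w_psi) pairs.

    Assumption: every w_phi maps the raw feature dimension to the layer width,
    because the node term always reads the original features x (h_0).
    """
    f_a = row_normalise(adj)
    h = x
    for w_phi, w_psi in weights:
        node_term = matmul(x, w_phi)                # original node information, every layer
        neigh_term = matmul(matmul(f_a, h), w_psi)  # mean of neighbors' previous outputs
        h = [[act(n + m) for n, m in zip(n_row, m_row)]
             for n_row, m_row in zip(node_term, neigh_term)]
    return h


if __name__ == "__main__":
    adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # hypothetical path graph 0 - 1 - 2
    x = [[1.0], [2.0], [3.0]]
    eye = [[1.0]]
    print(nip_mean(x, adj, [(eye, eye), (eye, eye)]))
```

In this linear toy setting the node's own feature enters every layer with a fixed coefficient instead of being repeatedly averaged with its neighborhood, which is the property the NIP family is designed to keep.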
Higher Order Propagation Framework: HOPF
Building any end-to-end differentiable model requires all the relational information to be in memory. This hinders models with a large number of parameters and those that process data in large batches. For graphs with high link density and a power law degree distribution, processing information from even a small number of hops becomes infeasible. Even with $p$-regular graphs, the memory grows at $O(p^K)$ with the number of hops, $K$. Thus, using a differentiable kernel for even a small number of hops over a moderate size graph becomes infeasible.
To address this critical issue of scalability, we propose a novel Higher Order Propagation Framework (HOPF) which incorporates an iterative mechanism over the differentiable kernels. In each iteration of HOPF, the differentiable kernel computes a $C$-hop neighborhood summary, where $C < K$. Every iteration starts with a summary, ($\Theta^{t-1}$), of the information computed until the $(t-1)$-th step as given below.
After $T$ iterations the model would have incorporated $K = T \times C$ hop neighborhood information. Here, we fix $T$ based on the required number of hops we want to capture, $K$, but it can also be based on some convergence criteria on the inferred labels. For the empirical results reported in this work, we have chosen $\Theta$ to be the (predicted) labels $\hat{Y}$, along the lines of the ICA family of algorithms. Other choices for $\Theta$ include the $C$-hop relational information, $h_C$.
We explain HOPF's mechanism with a toy chain graph illustrated in Fig: 1. The graph has 6 nodes with attributes ranging over A-F and the graph kernel used is of the second order. The figure is intended to explain how differentiable and non-differentiable layers are interleaved to allow propagation up to the diameter. We first analyze it with respect to node 1. In the first iteration, node 1 has learned to aggregate attributes from nodes 2 and 3, viz. BC, along with its own. This provides it with an aggregate of information from A, B and C. At the start of each subsequent iteration, label predictions are made for all the nodes using a $C^{th}$ (in Fig: 1, $C = 2$) order differentiable kernel learned in the previous iteration. These labels are concatenated with node attributes to form the features for the current iteration. By treating the labels as non-differentiable entities, we stop the gradients from propagating to the previous iteration and hence the model is only $C$-hop differentiable.
With the concatenated label information, the model can be made to re-learn from scratch or continue on top of the pre-trained model from the last iteration. Following this setup, one can observe that the information of nodes D, E, and F, which is not accessible with a $2^{nd}$ order differentiable kernel (blue paths), is now accessible via the non-differentiable paths (red and green paths). In the second iteration, information from nodes at the $3^{rd}$ and $4^{th}$ hop (D and E) becomes available and in the subsequent iteration, information from the $5^{th}$ hop (F) becomes available. The paths encoded in blue, purple and orange represent different iterations in the figure and are differentiable only during their ongoing iteration, not as a whole.
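The reach of information in the chain example can be traced with a small simulation. The sketch below does not train anything: it only tracks, as sets, which nodes' attributes could have influenced each node after every iteration, assuming that each iteration runs a fixed number of propagation steps on top of the summary carried over from the previous iteration. The function name `hopf_reach` and the dictionary-based graph encoding are hypothetical.

```python
def hopf_reach(adj_list, kernel_hops, iterations):
    """Return, for each iteration, the set of source nodes visible to every node.

    adj_list: {node: [neighbors]}; kernel_hops: hops of the differentiable kernel;
    iterations: number of outer (non-differentiable) iterations.
    """
    summary = {v: {v} for v in adj_list}  # before training: own attributes only
    history = []
    for _ in range(iterations):
        h = dict(summary)  # features = attributes plus the previous iteration's summary
        for _ in range(kernel_hops):
            # One propagation step: merge in whatever the neighbors currently hold.
            h = {v: set(h[v]).union(*(h[u] for u in adj_list[v])) for v in adj_list}
        summary = h
        history.append(summary)
    return history


if __name__ == "__main__":
    chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
             "D": ["C", "E"], "E": ["D", "F"], "F": ["E"]}
    for t, reach in enumerate(hopf_reach(chain, kernel_hops=2, iterations=3), start=1):
        print(t, sorted(reach["A"]))
```

With a second-order kernel the first node sees A, B and C after the first iteration, gains D and E in the second, and F in the third, which matches the walk-through of the toy graph.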
Iterative NIP Mean Kernel: I-NIP-MEAN
In this section, we propose a special instance of HOPF which addresses the NIM issue with NIP kernels in a scalable fashion. Specifically, we consider the following NIP kernel instantiation, I-NIP-MEAN, with a mean aggregation function, by setting $\Phi_k = h_0$, $\Psi_k = h_{k-1}$, $F(A) = D^{-1}A$, and $\Theta^{t-1} = \hat{Y}^{t-1}$.
In Algorithm 1 (I-NIP-MEAN), the iterative learning and inference steps are described in lines 7-10 and 12-16 respectively. Both learning and inference happen in mini-batches, sampled from the labeled set or the unlabeled set respectively. The predict function described in lines 17-27 is used during learning and inference to obtain label predictions for the nodes of a mini-batch. The procedure first extracts $C$-hop relational features and then projects them to the label space and applies a sigmoid or a softmax depending on the task, see line 27.
To extract $C$-hop relational features for a mini-batch, the model, via the get_subgraph function, first gathers all the batch nodes along with their neighbors reachable within the required number of hops and represents this entire subgraph by an adjacency matrix. A $C$-hop representation is then obtained with the kernel as in lines 21-24. At each learning phase, the weights of the kernels are updated via back-propagation to minimize an appropriate loss function.
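The subgraph gathering step can be sketched with a breadth-first search. Only the name `get_subgraph` comes from the description of the algorithm; its signature, the adjacency-list input, the inclusive hop limit `max_hops` and the returned pair (sorted node list, dense induced adjacency matrix) are assumptions made for this sketch.

```python
from collections import deque


def get_subgraph(adj_list, batch, max_hops):
    """Collect batch nodes plus everything within max_hops of them.

    Returns (nodes, sub_adj), where sub_adj is the adjacency matrix induced on nodes.
    """
    dist = {v: 0 for v in batch}
    queue = deque(dist)  # multi-source BFS starting from the whole batch
    while queue:
        v = queue.popleft()
        if dist[v] == max_hops:
            continue  # do not expand beyond the hop limit
        for u in adj_list[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    nodes = sorted(dist)
    index = {v: i for i, v in enumerate(nodes)}
    sub_adj = [[0] * len(nodes) for _ in nodes]
    for v in nodes:
        for u in adj_list[v]:
            if u in index:
                sub_adj[index[v]][index[u]] = 1
    return nodes, sub_adj


if __name__ == "__main__":
    chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
    print(get_subgraph(chain, batch=[0], max_hops=2))
```

Only the gathered subgraph has to be held in memory for a mini-batch, which is what keeps the per-iteration cost tied to the kernel's hop count rather than to the total number of hops.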
In most real-world graphs exhibiting a power law degree distribution, the size of the neighborhood of each node grows exponentially with the depth of the neighborhood being considered. Storing all the node attributes, the edges of the graph, intermediate activations, and all the associated parameters becomes a critical bottleneck. Here we analyze the efficiency of the proposed work in scaling to large graphs in terms of the reduction in the number of parameters and the space and time complexity.
Number of parameters: The ratio of available labeled nodes to the unlabeled nodes in a graph is often very small. As observed in [Kipf and Welling2016, Hamilton, Ying, and Leskovec2017], the model tends to easily over-fit and perform poorly during test time when additional parameters (layers) are introduced to capture a deeper neighborhood. In our proposed framework with iterative learning and inference, the parameters of the kernel at iteration $t-1$ are used to initialize the kernel at iteration $t$ and are then discarded; hence the number of model parameters is $O(C)$ and not $O(K)$. Thus the model can obtain information from any arbitrary hop, $K$, with a constant number of learnable parameters, $O(C)$, where $C < K$. But in the inductive setup, the parameter complexity is similar to GCN and GraphSAGE, as the kernel parameters from all iterations are required to make predictions for unseen nodes.
Space and Time complexity: For a graph $G = (V, E)$, we consider aggregating information up to the $K$-hop neighborhood. Let number of nodes , and average degree . For making full batch updates over the graph (like in GCN), computational complexity for information aggregation is , and memory required is . Even for moderate size graphs, dealing with such a memory requirement quickly becomes impractical. Updating parameters in mini-batches trades off memory requirements with computation time. If batches of size (where, ) are considered, memory requirement reduces but in worst case, computation complexity increases exponentially to as neighborhood of size needs to be aggregated for each of batches independently. In a highly connected graph (such as PPI, Reddit, Blog etc.), the neighborhood set of a small may already be the whole network, making the task computationally expensive and often infeasible. To make this tractable, GraphSAGE considers only partial neighborhood information. Though useful, in many cases this can significantly hurt the performance, as shown on citation networks in Figure: 2. Not only does it hurt the performance, but it also results in additional hyperparameter tuning for the neighborhood size. In comparison, the proposed work reduces complexity from exponential to linear in the total number of hops considered, by doing $T$ iterations of a constant $C$-hop differentiable kernel, such that $T \times C = K$. In our experiments, we found that even as small as and was sufficient to outperform existing methods on most of the datasets. The best models were the ones whose $C$ was the largest hop which gave the best performance for the differentiable kernel.
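A back-of-the-envelope count illustrates the exponential versus linear growth. The sketch below is not the complexity analysis of the paper: it assumes a tree-like graph in which every node has the same number of neighbors and simply counts how many nodes must be touched per target node. The function names and the example numbers are hypothetical.

```python
def neighborhood_size(avg_degree, hops):
    """Upper bound on the nodes within `hops` hops of one node in a tree-like graph."""
    return sum(avg_degree ** k for k in range(hops + 1))


def end_to_end_cost(avg_degree, total_hops):
    """Nodes touched when a single differentiable kernel spans all hops."""
    return neighborhood_size(avg_degree, total_hops)


def iterative_cost(avg_degree, kernel_hops, iterations):
    """Nodes touched when a small kernel is re-applied for several iterations."""
    return iterations * neighborhood_size(avg_degree, kernel_hops)


if __name__ == "__main__":
    # Reaching 6 hops with average degree 10: one 6-hop kernel vs. 3 iterations of 2 hops.
    print(end_to_end_cost(10, 6), iterative_cost(10, 2, 3))
```

Under these assumptions the end-to-end count grows as a power of the degree, whereas the iterative count grows only with the number of iterations for a fixed kernel size.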
Miscellaneous related works
Many extensions of classical methods have been proposed to capture higher-order relational properties of the data. Glocalized kernels [Morris, Kersting, and Mutzel2017] are a variant of the $k$-dimensional Weisfeiler-Lehman [Weisfeiler and Lehman1968] kernel for graph level tasks that use a stochastic approximation to aggregate information from distant nodes. The differentiable kernels are all 1-dim WL-Kernels whose direct adaptation suffers from Node Information Morphing. The relational classifier [Macskassy and Provost2003] builds upon the homophily assumption in the graph structure and diffuses the available label data to predict the labels of unlabeled nodes. To make this process more efficient, propagation kernels [Neumann et al.2016] provide additional schemes for diffusing the available information across the graph. However, none of these provide a mechanism to adapt to the dataset by learning the aggregation filter.
From a dynamical systems perspective, predictive state representations [Sun et al.2016] also make use of iterative refinement of internal representations of the model for sequential modeling tasks. However, no extension to graph models has been mentioned. In computer vision applications, iterative Markov random fields [Subbanna, Precup, and Arbel2014, Yu and Clausi2005] have also been shown to be useful for incrementally using the local structure to capture global statistics. In this work, we restrict our focus to addressing the limitations of the current state-of-the-art differentiable graph kernels in providing higher order information for collective classification tasks. Moreover, HOPF additionally leverages label information, which is found to be useful.
Message Passing Neural Network (MPNN) [Gilmer et al.2017] is a message passing framework which consists of message and readout components. These are defined for graph level tasks. HOPF is explicitly defined for node level tasks and aims at scaling existing graph networks. HOPF's generic propagation kernel is more detailed than MPNN's message component and can additionally support iterative learning and inference.
In this section, we describe the datasets used for our experiments, the experimental setup and the models compared.
We extensively evaluate the proposed models and the baselines on 11 datasets from various domains. In this work, we treat these networks as undirected graphs, but the proposed framework can also handle directed graphs with non-negative edges. The datasets used are described below and certain statistics are provided in Table 2.
Social networks: We use Facebook (FB) from [Pfeiffer III, Neville, and Bennett2015, Moore and Neville2017], BlogCatalog (BLOG) from [Wang et al.2010] and the Reddit dataset from [Hamilton, Ying, and Leskovec2017]. In the Facebook dataset, the nodes are Facebook users and the task is to predict the political views of a user given the gender and religious view of the user as features. In the BlogCatalog dataset, the nodes are users of a social blog directory, the user's blog tags are treated as node features and edges correspond to friendship or fan following. The task here is to predict the interests of users. In Reddit, the nodes are the Reddit posts, the features are the averaged GloVe embeddings of the text content in the post and edges are created between posts if the same users comment on both. The task here is to predict the subreddit community to which the post belongs.
Citation Networks: We use four citation graphs: Cora [Lu and Getoor2003], Citeseer [Bhattacharya and Getoor2007], Pubmed [Namata et al.2012] and Cora-2 [Mccallum2001]. In all four datasets, the articles are the nodes and the edges denote citations. The bag-of-words representation of the article is used as node attributes. The task is to predict the research area of the article. Apart from Cora-2, which is a multi-label classification dataset from [Mccallum2001], the others are multi-class datasets.
Biological network: We use two protein-protein interaction networks: Yeast and Human. The Yeast dataset is part of the KDD cup 2001 challenge [Hatzis and Page2001], which contains interactions between proteins. The task is to predict the function of these genes. Similarly, the Human dataset, introduced in [Hamilton, Ying, and Leskovec2017], is a protein-protein interaction (PPI) network from human tissues. The dataset contains PPI from 24 human tissues and the task is to predict the gene's functional ontology. Features consist of positional gene sets, motif gene sets, and immunology signatures.
Movie network: We constructed a movie network from Movielens-2k dataset available as a part of HetRec 2011 workshop [Cantador, Brusilovsky, and Kuflik2011]. The dataset is an extension of the MovieLens10M dataset with additional movie tags. The nodes are the movies and edges are created between movies if they share a common actor or director. The movie tags form the movie features. The task here is to predict all possible genres of the movies.
Product network: We constructed an Amazon DVD co-purchase network which is a subset of the Amazon_060 co-purchase data by [Leskovec and Sosič2016]. The network construction procedure is similar to the one used in [Moore and Neville2017]. The nodes correspond to DVDs and edges are constructed if two DVDs are co-purchased. The DVD genres are treated as DVD features. The task here is to predict whether a DVD will have Amazon sales rank 7500 or not. To the best of our knowledge, no previous work in collective classification reports results on this many datasets over such a wide range of domains. Code is available at https://github.com/PriyeshV/HOPF
The experiments follow a semi-supervised setting with only labeled data. We consider of nodes in the graph as test nodes and randomly create 5 sets of training data by sampling of the nodes from the remaining graph. Further, of these training nodes are used as the validation set. We do not use the validation set for (re)training. To account for the imbalance in the training set, we use a weighted cross entropy loss (see Appendix) similar to [Moore and Neville2017] for all the models. In Table: 3, we report the averaged test results for transductive experiments obtained from models trained on the different training sets. We also report results on the Transfer (Inductive) learning task introduced in [Hamilton, Ying, and Leskovec2017] under the same setting, where the task is to classify proteins in new human tissues (graphs) which are unseen during training. For detailed information on the implementation and hyper-parameters, please refer to the Appendix.
Models compared: We compare the proposed NIP and HOPF models with various differentiable WL kernels, Semi-Supervised ICA and two baselines, BL_NODE and BL_NEIGH, as defined in Table: 1. BL_NODE is a K-layer feedforward network that only considers the node's information, ignoring the relational information, whereas BL_NEIGH ignores the node's information and considers only the neighbors' information. BL_NEIGH is a powerful baseline which we introduce. It is helpful for understanding the usefulness of relational information in datasets. In cases where BL_NEIGH performs worse than BL_NODE, the dataset has little or no useful relational information to extract with the available labeled data, and vice versa. In such datasets, we observe no significant gain in considering beyond one or two hops. All the models in Table: 3 and Table: 4 except SS-ICA, GCN and GraphSAGE models have skip connections. GraphSAGE models combine node and neighborhood information by concatenation instead of summation.
Results and Discussions
In order to report the statistical significance of models' performance across different datasets, we resort to Friedman's test and the Wilcoxon signed rank test as discussed by previous work [Demšar2006]. Leveraging Friedman's test, we can reject the null hypothesis that all the models perform similarly with. More details on the statistical significance of our proposed models, NIP-MEAN and I-NIP-MEAN, over their base variants are presented in the subsequent discussions.
These rank based significance tests do not provide a metric to measure the robustness of a model across datasets. One popular approach is to use count based statistics like average rank and number of wins. The average rank of the models across datasets is provided in the table, where a lower rank indicates better performance. It is evident from Table 3 that the proposed algorithm I-NIP-MEAN achieves the best rank and wins on 4/11 datasets, followed by SS-ICA which has 2 wins and NIP-MEAN which has 1 win and the second best rank. By this simple measure of counting the number of wins of a given algorithm, the proposed method outperforms existing methods.
However, we argue that this is not helpful at measuring the robustness of models. For example, there could be an algorithm which is consistently the second best algorithm on all the datasets with minute difference from the best and yet have zero wins. To capture this notion of consistency, we introduce a measure, shortfall, which captures the relative shortfall in performance compared to the best performing model on a given dataset.
$$\text{shortfall}[\text{dataset}] = \frac{\text{best}[\text{dataset}] - \text{performance}[\text{dataset}]}{\text{best}[\text{dataset}]}$$
where best[dataset] is the micro_f1 of the best performing model for the dataset and performance[dataset] is the model's performance for that dataset. In Table: 3, we report the average shortfall across datasets. A lower shortfall indicates more consistent performance. Even using this measure, the proposed algorithm I-NIP-MEAN outperforms existing methods. In particular, notice that while SS-ICA seemed to be the second best algorithm using the naive method of counting the number of wins, it does very poorly when we consider the shortfall metric. This is because SS-ICA is not consistent across datasets, and in particular it gives a very poor performance on some datasets, which is undesirable. On the other hand, I-NIP-MEAN not only wins on 4/11 datasets but also does consistently well on all the datasets and hence has the lowest average shortfall.
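The contrast between counting wins and measuring shortfall can be reproduced on toy numbers. The sketch below assumes the relative definition (best minus performance, divided by best) averaged over datasets. The scores and model names are made up for illustration and are not results from this work, and the function names `win_counts` and `average_shortfall` are hypothetical.

```python
def win_counts(scores):
    """scores: {model: [score per dataset]} -> {model: number of datasets it wins}."""
    models = list(scores)
    n_datasets = len(next(iter(scores.values())))
    wins = {m: 0 for m in models}
    for d in range(n_datasets):
        best = max(scores[m][d] for m in models)
        for m in models:
            if scores[m][d] == best:
                wins[m] += 1
    return wins


def average_shortfall(scores):
    """Mean relative gap to the best model per dataset (multiply by 100 for percent)."""
    models = list(scores)
    n_datasets = len(next(iter(scores.values())))
    best = [max(scores[m][d] for m in models) for d in range(n_datasets)]
    return {m: sum((best[d] - scores[m][d]) / best[d] for d in range(n_datasets)) / n_datasets
            for m in models}


if __name__ == "__main__":
    # Made-up micro-F1 scores: "steady" never wins but is always close to the best.
    toy = {"steady": [0.79, 0.69, 0.89],
           "spiky_a": [0.80, 0.40, 0.90],
           "spiky_b": [0.50, 0.70, 0.50]}
    print(win_counts(toy))
    print(average_shortfall(toy))
```

In this toy table the model with zero wins has the lowest average shortfall, which is the situation the shortfall measure is meant to expose.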
Baselines Vs. Collective classification (CC) models
As mentioned earlier, the baselines BL_NEIGH and BL_NODE use only neighbor and only node information respectively. In datasets where BL_NEIGH significantly outperforms BL_NODE, all CC models outperform both these baselines by jointly utilizing the node and neighborhood information. In datasets such as Cora, Citeseer, Cora2, Pubmed and Human, where the performance of BL_NEIGH > BL_NODE, CC models improve over BL_NEIGH by up to in the transductive setup. Similarly, on the inductive task where the performance of BL_NEIGH is greater than BL_NODE by , CC methods end up further improving by another . In Reddit and Amazon datasets, where the performance of BL_NODE BL_NEIGH, CC methods still learn to exploit useful correlations between them to obtain a further improvement of and respectively.
WL-Kernels Vs NIP-Kernels
We make the following observations:
Node Information Morphing in WL-Kernels: The poor performance of BL_NEIGH compared to BL_NODE on the Blog, FB and Movie datasets suggests that the neighborhood information is noisy and node features are more crucial. The original GCN, which aggregates information from the neighbors but does not use CONCAT or skip connections, typically suffers a severe drop in performance of up to on datasets with high degree. Despite having the node information, GCN performs worse than BL_NODE on these datasets. The improved performance of GCN over BL_NEIGH in Blog and Movie supports the view that node information is essential.
Solving Node Information Morphing with skip connections in WL-Kernels: The original GCN architecture does not allow for skip connections from to and from to . We modify the original architecture and introduce these skip connections (GCN-S) by extracting features from the convolution's node information. With skip connections, GCN-S outperforms the base GCN on 8/11 datasets. We observed a performance boost of in Blog, FB, Movie and Amazon datasets even when we consider only 2 hops, thereby decreasing the shortfall on these datasets. GCN-S closed the performance gap with BL_NODE on these datasets and in the case of the Amazon dataset, it further improved by another . GCN-MEAN, which also has skip connections, performs quite similarly to GCN-S in all datasets and does not suffer from node information morphing as much as GCN. It is important to note that skip connections are required not only for going deeper but, more importantly, to avoid information morphing even for smaller hops. GS models do not suffer from the node information morphing issue as they concatenate node and neighborhood information. The authors of GS also noted that they observed a significant performance boost with the inclusion of the CONCAT combination. GS-MEAN's counterpart among the summation models is the GCN-MEAN model, which gives similar performance on most datasets, except for Reddit and Amazon where GS-MEAN with concat performs better than GCN-MEAN by . GS-MAX provides very similar performances to GS-MEAN, GCN-MEAN, and GCN-S across the board. Their shortfall performances are also very similar. GS-LSTM typically performs poorly, which might be because of the morphing of earlier neighbors' information by more recent neighbors in the list.
Solving Node Information Morphing with NIP Kernels: NIP-MEAN, a MEAN pooling kernel from the NIP propagation family, outperforms its WL family counterpart, GCN-MEAN, on 9/11 datasets. With the Wilcoxon signed-rank test, NIP-MEAN > GCN-MEAN with p < 0.01. It achieves a significant improvement over GCN-MEAN on the Human, Reddit and Amazon datasets. It similarly outperforms GS-MEAN on 9/11 datasets even though GS-MEAN has twice the number of parameters. NIP-MEAN provides the most consistent performance among the non-iterative models with a shortfall as low as 3.6. NIP-MEAN’s clear improvement over its WL counterparts demonstrates the benefit of using the NIP family of kernels, which explicitly preserve the node information and mitigate the node information morphing issue.
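To make the morphing argument concrete, the following sketch contrasts a plain mean aggregator with a variant that re-injects each node's original features at every hop. It is an illustrative toy in plain Python and is not the GCN, GCN-S or NIP-MEAN implementation: the function names, the fixed mixing weight `alpha`, the self-loops in the adjacency lists and the absence of learned weights are all assumptions made for brevity.

```python
def mean_aggregate(adj, feats):
    """Replace each node's vector by the mean of its neighbors' vectors."""
    dim = len(feats[0])
    return [[sum(feats[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
            for nbrs in adj]


def propagate_plain(adj, feats, hops):
    """Repeated mean aggregation: node features are mixed away hop by hop."""
    h = feats
    for _ in range(hops):
        h = mean_aggregate(adj, h)
    return h


def propagate_preserving(adj, feats, hops, alpha=0.5):
    """Mean aggregation that adds back the ORIGINAL node features at each hop.

    alpha is a hypothetical fixed mixing weight standing in for learned weights.
    """
    h = feats
    for _ in range(hops):
        agg = mean_aggregate(adj, h)
        h = [[alpha * x0 + (1 - alpha) * a for x0, a in zip(f0, fa)]
             for f0, fa in zip(feats, agg)]
    return h


if __name__ == "__main__":
    # Star graph with self-loops: node 0 is a high-degree hub.
    adj = [[0, 1, 2, 3, 4], [1, 0], [2, 0], [3, 0], [4, 0]]
    # One-hot features, so h[i][i] is the share of node i's own information.
    feats = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
    print(propagate_plain(adj, feats, 1)[0][0])       # hub keeps 1/5 of itself
    print(propagate_preserving(adj, feats, 1)[0][0])  # hub keeps at least alpha
```

With one-hot inputs, the diagonal entry of the output shows how much of a node's own signal survives. For the hub it falls to the reciprocal of its neighborhood size under plain averaging, while the preserving variant never drops below `alpha`.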
Iterative inference models Vs. Differentiable kernels
Iterative inference models, SS-ICA and I-NIP-MEAN, exploit label information from the neighborhood and scale beyond the memory limits of differentiable kernels. This was clearly visible in our experiments on the large Reddit dataset. Reddit was computationally time-consuming even with partial neighbors due to its high link density. However, the iterative models scale beyond 2 hops and consider 5 hops and 10 hops for SS-ICA and I-NIP-MEAN respectively. This is computationally possible because of the linear time complexity and constant memory complexity of iterative models. Hence, they achieve superior performance with less computation time on Reddit. We tracked the micro-F1 scores of both SS-ICA and I-NIP-MEAN over iterations on a particular fold. SS-ICA was remarkable: as can be seen from the table, it managed to obtain 81.92 starting from 57.118.
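The constant-memory behaviour of iterative inference can be sketched with a deliberately simple label-averaging loop. This is not SS-ICA or I-NIP-MEAN: it has no learned classifier and uses labels only. The function names and the synchronous update rule are assumptions chosen to show that only one set of label estimates is stored, while each iteration extends the reach of label information by one hop.

```python
def one_hot(index, size):
    """Return a one-hot list of the given size."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def iterative_inference(adj, known, num_labels, iterations):
    """Propagate label estimates for a fixed number of iterations.

    adj:   list of neighbor lists, one per node
    known: dict mapping labeled node -> label index (clamped throughout)
    """
    uniform = [1.0 / num_labels] * num_labels
    probs = [one_hot(known[v], num_labels) if v in known else list(uniform)
             for v in range(len(adj))]
    for _ in range(iterations):
        new_probs = []
        for v, nbrs in enumerate(adj):
            if v in known or not nbrs:
                new_probs.append(probs[v])  # labeled or isolated: unchanged
                continue
            # Average the neighbors' current label estimates.
            new_probs.append([sum(probs[u][c] for u in nbrs) / len(nbrs)
                              for c in range(num_labels)])
        probs = new_probs  # memory stays constant: only current estimates kept
    return probs


if __name__ == "__main__":
    # Path graph 0-1-2-3 where only node 0 is labeled.
    adj = [[1], [0, 2], [1, 3], [2]]
    for t in (1, 2, 3):
        print(t, iterative_inference(adj, {0: 0}, 2, t)[3])
```

In the path example, node 3 is three hops from the only labeled node, so its estimate stays uniform for two iterations and changes at the third.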
The benefit of label information over attributes can be analyzed with SS-ICA, which aggregates only the label information of immediate neighbors. On the Yeast dataset, SS-ICA improves over the non-iterative models, which do not use label information. However, SS-ICA does not give good performance on some datasets as it does not leverage neighbors’ features and is restricted to learning only first-order local information, unlike the multi-hop differentiable WL or NIP kernels.
Iterative Differentiable kernels Vs. Rest
I-NIP-MEAN, which is an extension of NIP-MEAN with iterative learning and inference, can leverage attribute information and exploit non-linear correlations between the labels and attributes from different hops. I-NIP-MEAN improves over NIP-MEAN on seven of the eleven datasets, with a significant boost in performance on the Cora2, Reddit, Amazon, and Yeast datasets. According to the Wilcoxon signed-rank test, I-NIP-MEAN is significantly better than NIP-MEAN (with p < 0.05). I-NIP-MEAN also successfully leverages label information like SS-ICA and obtains a similar performance boost on the Yeast and Reddit datasets. It also outperforms SS-ICA on eight of the eleven datasets with a statistical significance of p < 0.02 as per the Wilcoxon test. The benefits of using neighbors’ attributes along with labels are visible on the Amazon and Human datasets, where the I-NIP-MEAN model improves over SS-ICA, which uses label information alone. Moreover, by leveraging both attributes and labels in a differentiable manner, it further improves over the second best model on Cora2. This superior hybrid model, I-NIP-MEAN, emerges as the most robust model across all datasets with the lowest shortfall.
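The paired comparisons above rely on the Wilcoxon signed-rank test. As a minimal sketch of how such a comparison over a handful of datasets can be computed, the function below evaluates the exact two-sided test by enumerating all sign assignments, which is feasible for eleven datasets. The scores in the example are hypothetical placeholders, not results from this paper, and the function name is an assumption.

```python
from itertools import product


def wilcoxon_signed_rank(xs, ys):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (statistic, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    # Rounding guards against spurious ties or non-ties from float noise.
    diffs = [round(x - y, 6) for x, y in zip(xs, ys)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    stat = min(w_plus, w_minus)
    # Under the null hypothesis every sign pattern is equally likely.
    total = sum(ranks)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= stat + 1e-12:
            extreme += 1
    return stat, extreme / 2 ** n


if __name__ == "__main__":
    # Hypothetical per-dataset scores for two models (NOT from the paper).
    model_a = [61.0, 72.5, 55.0, 80.0, 66.0]
    model_b = [60.0, 70.5, 58.0, 76.0, 61.0]
    print(wilcoxon_signed_rank(model_a, model_b))  # (3.0, 0.3125)
```

With only five hypothetical pairs and one loss, the difference is not significant. Eleven pairs that all favour the same model would give the smallest attainable two-sided p-value, 2/2048.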
Inductive learning on Human dataset
For the inductive learning task in Table: 4, the CC models improve over BL_NODE by leveraging relational information. The I-NIP-MEAN and NIP-MEAN kernels achieve the best performance, improving over GCN-MEAN.
Run time analysis:
We adapt the scalability setup from GCN [Kipf and Welling2016] to compare the average training time per epoch of the fully differentiable model, NIP-MEAN, against the iterative differentiable model, I-NIP-MEAN, when making predictions with multi-hop representations. We consider two iterative NIP-MEAN variants here: I-NIP-MEAN with 1 differentiable layer (C=1) and I-NIP-MEAN with 2 differentiable layers (C=2). To obtain multi-hop representations, we increase the number of iterations accordingly. Note that I-NIP-MEAN with C=2 can only provide multi-hop representations for hop counts that are multiples of 2. The training time includes the time to pre-fetch neighbors with queues, the forward pass of the NN, loss computation and backward gradient propagation, similar to the setup of [Kipf and Welling2016]. We record the wall-clock running time for these models to process a synthetic graph with 100k nodes, 500k edges, 100 features, and 10 labels on a 4GB GPU. The batch size and hidden layer size were set to 128. The plot of the averaged run time over runs across different hops is presented in Figure: 3.
The fully differentiable model, NIP-MEAN, incurred an exponential increase in compute time as the number of hops increased and eventually ran out of memory. In contrast, both I-NIP-MEAN variants (C=1 and C=2) show linear growth in compute time as the number of hops increases. This is in agreement with the time complexity provided earlier for these models. Not only does the time for non-iterative methods increase exponentially with hops, but the memory complexity also increases exponentially with each new layer, as the gradients and activations must be stored for all the new neighbors introduced with a hop. In comparison, the runtime of the proposed iterative solution grows linearly and it also has a smaller memory footprint.
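The growth rates described above can be illustrated with a back-of-the-envelope count of the nodes touched per target node. The sketch assumes a fixed fan-out per node, a tree-like expansion with no shared neighbors, and unit cost per node. These simplifications and the function names are assumptions for illustration and are not the complexity analysis of the paper.

```python
def differentiable_cost(fanout, hops):
    """Nodes touched when a full K-hop neighborhood is expanded at once."""
    return sum(fanout ** k for k in range(hops + 1))


def iterative_cost(fanout, layers, iterations):
    """Nodes touched when a C-layer model is re-applied for T iterations."""
    return iterations * differentiable_cost(fanout, layers)


def effective_hops(layers, iterations):
    """Hops covered by T iterations of a C-layer model (C * T)."""
    return layers * iterations


if __name__ == "__main__":
    fanout = 10
    for hops in (2, 4, 6):
        print(hops,
              differentiable_cost(fanout, hops),    # grows exponentially
              iterative_cost(fanout, 1, hops),      # C=1, T=hops
              iterative_cost(fanout, 2, hops // 2)) # C=2, T=hops/2
```

Under these assumptions a 4-hop expansion with fan-out 10 touches 11111 nodes at once. Four iterations of a 1-layer model touch 44, and two iterations of a 2-layer model touch 222, while covering the same number of hops.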
The choice of K and the decision to use iterative learning depend on a variety of factors such as memory availability and the relevance of the labels. The choice of C, K and T should be determined by performance on a validation set. C should be set to the minimum of (i) the maximum number of differentiable layers that fit into memory and (ii) the hop count beyond which performance saturates or drops. Iterations should be used if K does not fit in memory; T can also be set to an arbitrary constant if performance improves with iteration even when K fits in memory.
In this work, we proposed HOPF, a novel framework for collective classification that combines differentiable graph kernels with an iterative stage. Deep learning models for relational learning tasks can now leverage HOPF to use complete information from larger neighborhoods without succumbing to over-parameterization and memory constraints. For future work, we can further optimize the framework by committing only high-confidence labels, as in cautious ICA [McDowell, Gupta, and Aha2007], to reduce erroneous information propagation, and we can also increase the supervised information flow to unlabeled nodes by incorporating ghost edges [Gallagher et al.2008]. The framework can also be extended to unsupervised tasks by incorporating structural regularization with Laplacian smoothing on the embedding space.
- [Bhattacharya and Getoor2007] Bhattacharya, I., and Getoor, L. 2007. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD) 1(1):5.
- [Cantador, Brusilovsky, and Kuflik2011] Cantador, I.; Brusilovsky, P.; and Kuflik, T. 2011. 2nd workshop on information heterogeneity and fusion in recommender systems (hetrec 2011). In Proceedings of the 5th ACM conference on Recommender systems, RecSys 2011. New York, NY, USA: ACM.
- [Demšar2006] Demšar, J. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30.
- [Gallagher and Eliassi-Rad2010] Gallagher, B., and Eliassi-Rad, T. 2010. Leveraging label-independent features for classification in sparsely labeled networks: An empirical study. Advances in Social Network Mining and Analysis 1–19.
- [Gallagher et al.2008] Gallagher, B.; Tong, H.; Eliassi-Rad, T.; and Faloutsos, C. 2008. Using ghost edges for classification in sparsely labeled networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 256–264. ACM.
- [Gilmer et al.2017] Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212.
- [Glorot and Bengio2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256.
- [Hamilton, Ying, and Leskovec2017] Hamilton, W. L.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 1025–1035.
- [Hatzis and Page2001] Hatzis, C., and Page, D. 2001. 2001 KDD cup challenge Dataset. In pages.cs.wisc.edu/ dpage/kddcup2001. KDD.
- [Jensen, Neville, and Gallagher2004] Jensen, D.; Neville, J.; and Gallagher, B. 2004. Why collective inference improves relational classification. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 593–598. ACM.
- [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.
- [Kipf and Welling2016] Kipf, T. N., and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. CoRR abs/1609.02907.
- [Leskovec and Sosič2016] Leskovec, J., and Sosič, R. 2016. Snap: A general-purpose network analysis and graph-mining library. ACM Trans. Intell. Syst. Technol. 8(1):1:1–1:20.
- [Lu and Getoor2003] Lu, Q., and Getoor, L. 2003. Link-based classification. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), 496–503.
- [Macskassy and Provost2003] Macskassy, S. A., and Provost, F. 2003. A simple relational classifier. Technical report, NEW YORK UNIV NY STERN SCHOOL OF BUSINESS.
- [Mccallum2001] Mccallum, A. 2001. CORA Research Paper Classification Dataset. In people.cs.umass.edu/ mccallum/data.html. KDD.
- [McDowell and Aha2012] McDowell, L. K., and Aha, D. W. 2012. Semi-supervised collective classification via hybrid label regularization. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012.
- [McDowell and Aha2013] McDowell, L. K., and Aha, D. W. 2013. Labels or attributes?: rethinking the neighbors for collective classification in sparsely-labeled networks. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, 847–852. ACM.
- [McDowell, Gupta, and Aha2007] McDowell, L. K.; Gupta, K. M.; and Aha, D. W. 2007. Cautious inference in collective classification. In AAAI, volume 7, 596–601.
- [Moore and Neville2017] Moore, J., and Neville, J. 2017. Deep collective inference. In AAAI, 2364–2372.
- [Morris, Kersting, and Mutzel2017] Morris, C.; Kersting, K.; and Mutzel, P. 2017. Glocalized weisfeiler-lehman graph kernels: Global-local feature maps of graphs. In Data Mining (ICDM), 2017 IEEE International Conference on, 327–336. IEEE.
- [Namata et al.2012] Namata, G.; London, B.; Getoor, L.; and Huang, B. 2012. Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs.
- [Neumann et al.2016] Neumann, M.; Garnett, R.; Bauckhage, C.; and Kersting, K. 2016. Propagation kernels: efficient graph kernels from propagated information. Machine Learning 102(2):209–245.
- [Neville and Jensen2000] Neville, J., and Jensen, D. 2000. Iterative classification in relational data. In Proc. AAAI-2000 Workshop on Learning Statistical Models from Relational Data, 13–20.
- [Neville and Jensen2003] Neville, J., and Jensen, D. 2003. Collective classification with relational dependency networks. In Proceedings of the Second International Workshop on Multi-Relational Data Mining, 77–91.
- [Pfeiffer III, Neville, and Bennett2015] Pfeiffer III, J. J.; Neville, J.; and Bennett, P. N. 2015. Overcoming relational learning biases to accurately predict preferences in large scale networks. In Proceedings of the 24th International Conference on World Wide Web, 853–863. International World Wide Web Conferences Steering Committee.
- [Sen P et al.2008] Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; and Eliassi-Rad, T. 2008. Collective classification in network data. AI Magazine 29(3):93.
- [Shervashidze et al.2011] Shervashidze, N.; Schweitzer, P.; Leeuwen, E. J. v.; Mehlhorn, K.; and Borgwardt, K. M. 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12(Sep):2539–2561.
- [Subbanna, Precup, and Arbel2014] Subbanna, N.; Precup, D.; and Arbel, T. 2014. Iterative multilevel MRF leveraging context and voxel information for brain tumour segmentation in MRI. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 400–405.
- [Sun et al.2016] Sun, W.; Venkatraman, A.; Boots, B.; and Bagnell, J. A. 2016. Learning to filter with predictive state inference machines. In International Conference on Machine Learning, 1197–1205.
- [Tong, Faloutsos, and Pan2006] Tong, H.; Faloutsos, C.; and Pan, J.-Y. 2006. Fast random walk with restart and its applications. In Proceedings of the Sixth International Conference on Data Mining, ICDM ’06, 613–622. Washington, DC, USA: IEEE Computer Society.
- [Wang et al.2010] Wang, X.; Tang, L.; Gao, H.; and Liu, H. 2010. Discovering overlapping groups in social media. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, 569–578. IEEE.
- [Weisfeiler and Lehman1968] Weisfeiler, B., and Lehman, A. 1968. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia 2(9):12–16.
- [Yang, Cohen, and Salakhutdinov2016] Yang, Z.; Cohen, W. W.; and Salakhutdinov, R. 2016. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, 40–48.
- [Yu and Clausi2005] Yu, Q., and Clausi, D. A. 2005. Combining local and global features for image segmentation using iterative classification and region merging. In Computer and Robot Vision, 2005. Proceedings. The 2nd Canadian Conference on, 579–586. IEEE.
Appendix A Appendix
For optimal performance, both iterative and non-iterative models are processed in mini-batches. They make use of queues to pre-fetch the exponential neighborhood information of the nodes in a mini-batch. The propagation steps are computed with sparse-dense computations. Mini-batching also makes it possible to efficiently distribute the gradient computation in a multi-GPU setup; we leave this enhancement for future work. The choice of data structure for the kernel is also crucial for processing the graph, i.e., there is a trade-off between adjacency lists and adjacency matrices. Working with max pooling or LSTMs is difficult with an adjacency matrix, as each node’s neighborhood information needs to be flattened dynamically. Models based on LSTMs also have to deal with nodes of highly varying degree and with the limitations of processing nodes sequentially even when they are at the same hop distance. The code for the HOPF framework processes neighborhood information with adjacency matrices and is primarily suited for weighted mean kernels.
Weighted Cross Entropy Loss (WCE)
Models in previous works were trained with a balanced labeled set, i.e., an equal number of samples for each label is provided for training. Such assumptions on the availability of training samples and on a similar label distribution at test time are unrealistic in most scenarios. To test the robustness of CC models in a more realistic set-up, we consider training datasets created by drawing random subsets of nodes from the full ground truth data. It is highly likely that randomly drawn training samples will suffer from severe class imbalance. This imbalance in class distribution can skew the weight updates towards the dominant labels during training. To overcome this problem, we generalize the weighted cross entropy defined in [Moore and Neville2017] to incorporate both the multi-class and the multi-label setting. We use this as the loss function for all the methods, including the baselines. The weight for each label (Eqn: 12) is computed from the total number of labels and the number of training samples carrying each label: the weight of each label is inversely proportional to the number of samples having that label.
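A minimal sketch of such a label-weighted loss is given below. Only the inverse proportionality between a label's weight and its training count comes from the text. The normalization (total count divided by the number of labels times the label's count), the restriction of the loss to positive targets, and the function names are assumptions and may differ from the paper's exact equation.

```python
import math


def label_weights(label_counts):
    """Per-label weights, inversely proportional to training-set frequency.

    The normalization total / (num_labels * count) is an assumed form; it
    gives every label weight 1.0 when the training set is perfectly balanced.
    """
    total = sum(label_counts)
    num_labels = len(label_counts)
    return [total / (num_labels * count) for count in label_counts]


def weighted_cross_entropy(probs, targets, weights, eps=1e-12):
    """Weighted cross-entropy for one node.

    targets may be one-hot (multi-class) or multi-hot (multi-label); each
    positive label contributes -w * log(p) to the loss.
    """
    return -sum(w * t * math.log(p + eps)
                for p, t, w in zip(probs, targets, weights))


if __name__ == "__main__":
    weights = label_weights([90, 10])  # a skewed two-label training set
    print(weights)                     # the rare label gets the larger weight
    print(weighted_cross_entropy([0.5, 0.5], [0, 1], weights))
```

With counts of 90 and 10, the rare label is weighted nine times more heavily than the dominant one, so an error on it moves the parameters correspondingly more.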
The hyper-parameters for the models are the number of layers of the neural network (hops), the dimensions of the layers, the dropouts for all layers and the L2 regularization. We train all the models for a maximum of 2000 epochs using Adam [Kingma and Ba2014] with the initial learning rate set to 1e-2. We use a variant of the patience method with learning rate annealing for early stopping of the model. Specifically, we train the model for a minimum of 50 epochs, start with a patience of 30 epochs, and halve both the learning rate and the patience when the patience runs out (i.e., when the validation loss does not reduce within the patience window). We stop the training when the model loses patience for 2 consecutive turns.
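The stopping rule can be sketched as a small function that replays a sequence of validation losses. This is one possible reading of the description above. It assumes that epochs without improvement are counted from the start but only acted upon once the minimum number of epochs is reached, and that any improvement resets the count of consecutive exhausted patience windows. The function and parameter names are hypothetical.

```python
def train_with_patience(val_losses, lr=1e-2, min_epochs=50, patience=30,
                        max_epochs=2000, max_strikes=2):
    """Replay validation losses and return (stopping_epoch, final_lr)."""
    best = float("inf")
    wait = 0     # epochs since the last improvement
    strikes = 0  # consecutive patience windows without improvement
    epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, wait, strikes = loss, 0, 0
        else:
            wait += 1
        if epoch >= min_epochs and wait >= patience:
            strikes += 1
            if strikes >= max_strikes:
                break                         # lost patience twice in a row
            lr /= 2                           # anneal the learning rate
            patience = max(1, patience // 2)  # halve the patience window
            wait = 0
    return epoch, lr


if __name__ == "__main__":
    flat = [1.0] * 200  # validation loss never improves after epoch 1
    print(train_with_patience(flat))
    improving = [1.0 / (i + 1) for i in range(100)]
    print(train_with_patience(improving))
```

On a flat loss curve this reading runs out of patience at epoch 50, halves the learning rate and the patience, and stops 15 epochs later. On a steadily improving curve it runs to the end of the supplied losses with the learning rate unchanged.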
We found that all weighted average kernels, along with the GS-MAX model, share similar optimal hyper-parameters, as their formulations and parameters are similar. In fact, this is in agreement with the work on GCN and GraphSAGE, where all their models had similar hyper-parameters. However, GS-LSTM, which has more parameters and a different aggregation function, required additional hyper-parameter tuning. For the reported results, we searched for the optimal hyper-parameter setting for a two-layer GCN-S model on all datasets using the validation set. We then used the same hyper-parameters across all the other models except for GS-LSTM, for which we searched separately. We report the performance of models with their ideal number of differentiable graph layers, based on their performance on the validation set. The maximum number of hops beyond which performance saturated or decreased was 3 hops for Amazon, 4 hops for Cora2 and Human, and 2 hops for the remaining datasets. For the Reddit dataset, as it has extremely high link density, we used 25 and 10 partial neighbors in the first and second hops respectively, which is the default GraphSAGE setting.
We row-normalize the node features and initialize the weights following [Glorot and Bengio2010]. Since the percentage of different labels in the training samples can be significantly skewed, like [Moore and Neville2017] we weigh the loss for each label inversely proportional to its total fraction, as in Eqn: 12. We added all these components to the baseline codes too and ensured that all models have the same setup in terms of the weighted cross entropy loss, the number of layers, dimensions, patience-based stopping criteria and dropouts. In fact, we observed an improvement of 25.91 percent for GraphSAGE on their dataset. GraphSAGE’s LSTM model gave an out-of-memory error on Blog, Movielens, and Cora2, as the initial feature size was large and, with the large number of parameters of the LSTM model, the parameter size exploded. Hence, for these datasets alone we reduced the feature size.
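The two preprocessing steps mentioned above can be sketched in plain Python as follows. The row normalization here divides each feature row by its sum, which is an assumption about the exact normalization used. The initializer is the uniform variant of Glorot initialization. The function names are hypothetical.

```python
import math
import random


def row_normalize(features):
    """Scale each feature row to sum to 1; all-zero rows are left unchanged."""
    normalized = []
    for row in features:
        row_sum = sum(row)
        normalized.append([x / row_sum for x in row] if row_sum != 0
                          else list(row))
    return normalized


def glorot_uniform(fan_in, fan_out, rng=random):
    """Weights drawn from U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


if __name__ == "__main__":
    print(row_normalize([[1.0, 3.0], [0.0, 0.0]]))
    weights = glorot_uniform(100, 128, random.Random(0))
    print(len(weights), len(weights[0]))
```

The Glorot limit shrinks as the layer gets wider, which keeps the variance of activations and gradients roughly comparable across layers at the start of training.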