Introduction
In many real-life applications such as social networks, citation networks, and protein interaction networks, the entities in an environment are not independent but rather influenced by each other through their interactions. Such relational datasets have been popularly modeled as graphs where the entities make up the nodes and the edges represent interactions. The use of graph-based learning algorithms has increasingly gained traction owing to their ability to model structured data. Categorizing such entities requires extracting relational information from their multi-hop neighborhoods and combining that efficiently with their features. Summarizing information from multiple hops is useful in many applications where there exist semantics in local and group-level interactions among entities. Thus, defining and finding the significance of neighborhood information over multiple hops becomes an important aspect of the problem.
Traditionally, handcrafted features were widely used to capture relational information. Popular methods used a mix of count statistics of label distribution [Neville and Jensen 2003, Lu and Getoor 2003], relational properties of nodes like degree and centrality scores [Gallagher and Eliassi-Rad 2010], and attribute summaries of the immediate neighborhood. Limited by manual engineering, these traditional methods only used raw features built from information associated with immediate (first-order or one-hop) neighbors.
The recent surge in deep learning has shown promising results in extracting important semantic features and learning good representations for many machine learning tasks. Deep learning models for such relational node classification tasks can be broadly categorized into models that either learn node representations with structural regularization [Perozzi, Al-Rfou, and Skiena 2014, Wang et al. 2016] or ignore explicit structural regularization and learn to aggregate neighbors’ information [Hamilton, Ying, and Leskovec 2017, Moore and Neville 2017, Kipf and Welling 2016]. The former are limited to networks that exhibit high homophily, as they enforce the representations of a node and its neighbors to be similar. The latter methods, on the other hand, do not make such assumptions.
Initial works [Frasconi, Gori, and Sperduti 1998] on relational feature extraction with neural nets primarily relied on recursive neural nets to process the graph data. As these could deal only with directed ordered acyclic graphs, [Scarselli et al. 2009, Gori, Monfardini, and Scarselli 2005] introduced Graph Neural Networks (GNNs), which used recursive neural nets to propagate information iteratively in any general graph. However, these GNNs were limited to problems where the entire graph can fit into memory. To extend the work to sequence generation problems on graphs, [Li et al. 2015] adapted GRUs [Cho et al. 2014] in the propagation step. Recently, [Moore and Neville 2017] proposed an LSTM-based [Hochreiter and Schmidhuber 1997] sequence embedding model for node classification, but it required randomly ordering first-hop neighbors’ information and thus essentially discarded the topological structure.
To directly deal with the graph’s topological structure, [Bruna et al. 2013] defined convolutional operations in the spectral domain for graph classification tasks, but required a computationally expensive eigendecomposition of the graph Laplacian. To reduce this requirement, [Defferrard, Bresson, and Vandergheynst 2016] approximated the higher-order relational feature computation with first-order Chebyshev polynomials defined on the graph Laplacian. Graph Convolutional Networks (GCNs) [Kipf and Welling 2016] adapted them to semi-supervised node-level classification tasks. GCNs simplified Chebyshev Nets by recursively convolving one-hop neighborhood information with a symmetric graph Laplacian. Recently, [Hamilton, Ying, and Leskovec 2017] proposed a generic framework called GraphSAGE with multiple neighborhood aggregator functions. GraphSAGE works with a partial (fixed-number) neighborhood of nodes to scale to large graphs. GCN and GraphSAGE are the current state-of-the-art approaches for transductive and inductive node classification tasks in graphs with node features. These end-to-end differentiable methods provide impressive results besides being efficient in their memory and computational requirements.
Herein, we argue that, despite their impressive results, these models lack the representation capacity to summarize relevant neighborhood information from multiple hops effectively. We support our argument with an analysis of their representation limitations and also provide a solution to alleviate this issue. Below, we list our primary contributions:

To the best of our knowledge, we are the first to analyze and point out that current state-of-the-art graph convolutional nets lack the representation capacity to regulate different neighborhood information independently.

We also show that these models capture $K$-hop neighborhood information by a $K$-th order binomial. We take advantage of this to build a binomial basis for the $K$-hop neighborhood space with outputs from different graph layers corresponding to different hops. With the binomial basis, we define a simple linear fusion layer that can capture any required combination of hops for the end task.

We propose FGCN, a simple extension to GCNs with the proposed fusion layer. We show that the proposed model outperforms the state-of-the-art models on six datasets while being highly competitive on the other two.
Background
Notations
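Let $G = (V, E)$ denote a graph comprising vertices $V$ and edges $E$, with $|V| = N$ and $|E| = M$, respectively. Let $X \in \mathbb{R}^{N \times F}$ denote the nodes’ features and $Y \in \mathbb{B}^{N \times C}$ denote the nodes’ labels, with $F$ and $C$ referring to the number of features and labels, respectively. Let $A \in \mathbb{R}^{N \times N}$ denote the adjacency matrix representation of the set of edges, and let $D$ denote the diagonal degree matrix defined as $D_{ii} = \sum_j A_{ij}$. Let $L = I - D^{-1/2} A D^{-1/2}$ denote the normalized graph Laplacian and $\hat{L} = \hat{D}^{-1/2} (A + I) \hat{D}^{-1/2}$, where $\hat{D}$ is the degree matrix of $A + I$, denote the renormalized Laplacian [Kipf and Welling 2016].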
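As a concrete illustration of these quantities, the following minimal sketch builds the degree matrix, the normalized Laplacian, and the renormalized operator of [Kipf and Welling 2016] for a small toy graph. The graph itself and the variable names (`L_norm`, `L_hat`, and so on) are assumptions chosen for illustration only.

```python
import numpy as np

# Toy undirected graph with 4 nodes (hypothetical example data).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
N = A.shape[0]
I = np.eye(N)

# Diagonal degree matrix: D_ii = sum_j A_ij.
D = np.diag(A.sum(axis=1))

# Normalized graph Laplacian: I - D^{-1/2} A D^{-1/2}.
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_norm = I - D_inv_sqrt @ A @ D_inv_sqrt

# Renormalized operator: add self-loops, then symmetrically
# normalize with the degrees of (A + I).
A_hat = A + I
D_hat_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
L_hat = D_hat_inv_sqrt @ A_hat @ D_hat_inv_sqrt
```

The sketch assumes every node has at least one neighbor; an isolated node would need special handling before inverting the degrees.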
In this paper, a Graph Convolutional Network defined to capture $K$-hop information will have $K$ graph convolutional layers with $d$-dimensional outputs and a final label layer. The label layer is the $K$-th layer if the last convolution is considered as the label layer; otherwise, it is the $(K+1)$-th layer. Let $W_k$ denote the weights associated with the $k$-th layer, covering the first hidden layer’s weights, the intermediate layers’ weights, and the label layer’s weights. Let $\sigma_k$ denote the activation function associated with the $k$-th layer.
Graph Convolutional Networks
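Graph Convolutional Networks (GCNs), introduced in [Kipf and Welling 2016], are multi-layer convolutional neural networks where the convolutions are defined on a graph structure for the problem of semi-supervised node classification. The conventional two-layer GCN, which captures information up to the $2$nd-hop neighborhood of a node, can be reformulated to capture information up to any arbitrary hop $K$, as given below in Eqn: 1, where $\hat{L}$ is the renormalized Laplacian.
$$ h_0 = X, \qquad h_k = \sigma_k(\hat{L} \, h_{k-1} W_k), \quad \forall k \in [1, K] $$ (1)
GCN was used for the multi-class classification task with the ReLU activation function at the hidden layers and a softmax label layer. We can rewrite the GCN model in terms of the $(k-1)$-th hop node and neighbor features as below by factoring $\hat{L} = \hat{D}^{-1/2} (A + I) \hat{D}^{-1/2}$.
$$ h_k = \sigma_k\big( (\hat{D}^{-1} h_{k-1} + \hat{D}^{-1/2} A \hat{D}^{-1/2} h_{k-1}) \, W_k \big) $$ (2)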
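The recursion and the factoring step can be checked numerically. The sketch below is a minimal NumPy illustration, not the authors' implementation: it assumes the standard renormalized operator of [Kipf and Welling 2016], ReLU hidden layers and a softmax output, and the helper names (`renormalized_operator`, `gcn_forward`) and toy sizes are hypothetical.

```python
import numpy as np

def renormalized_operator(A):
    """Return (L_hat, D_hat) for adjacency A, with self-loops added."""
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1)
    d_inv_sqrt = np.diag(deg ** -0.5)
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt, np.diag(deg)

def gcn_forward(A, X, weights):
    """K-layer GCN: ReLU on hidden layers, softmax on the last layer."""
    L_hat, _ = renormalized_operator(A)
    h = X
    for k, W in enumerate(weights, start=1):
        z = L_hat @ h @ W
        if k < len(weights):
            h = np.maximum(z, 0.0)                    # ReLU
        else:
            z = z - z.max(axis=1, keepdims=True)      # numerically stable softmax
            h = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return h

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 5))
weights = [rng.normal(size=(5, 8)), rng.normal(size=(8, 3))]
probs = gcn_forward(A, X, weights)

# Factoring: L_hat h = D_hat^{-1} h + D_hat^{-1/2} A D_hat^{-1/2} h.
L_hat, D_hat = renormalized_operator(A)
d_inv_sqrt = np.diag(np.diag(D_hat) ** -0.5)
node_term = np.linalg.inv(D_hat) @ X               # node's own features, scaled down
neighbor_term = d_inv_sqrt @ A @ d_inv_sqrt @ X    # weighted neighbor features
```

The node term is scaled by the inverse of the (self-loop augmented) degree, so a node's own features shrink as its degree grows.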
GraphSAGE
Graph Sample and Aggregator (GraphSAGE), proposed in [Hamilton, Ying, and Leskovec 2017], consists of three models built from different differentiable neighborhood aggregator functions. GraphSAGE models were defined for the multi-label semi-supervised inductive learning task, i.e., generalizing to nodes unseen during training. The aggregator functions in GraphSAGE are {mean, max pooling, LSTM}, and we will refer to the corresponding models as GSMEAN, GSMAX and GSLSTM, respectively. Similar to GCN, GraphSAGE models also recursively combine neighborhood information at each layer of the neural network. Unlike GCN, GraphSAGE has an additional label layer, so a model that captures $K$-hop information has $K + 1$ layers.
(3)
GraphSAGE models, unlike GCN, are defined to work with partial neighborhood information. For each node, these models randomly sample and use only a subset of neighbors from different hops. This choice to work with partial neighborhood information allows them to scale to large graphs but restricts them from capturing the complete neighborhood information. Rather than a choice, it can also be seen as a restriction imposed by the use of the Max Pool and LSTM aggregator functions, which require fixed input lengths to compute efficiently. Hence, GraphSAGE constrains the neighborhood subgraph of a node to contain a fixed number of neighbors at each hop.
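The fixed-size neighborhood idea can be sketched as follows. This is a simplified illustration of one mean-aggregator layer with concatenation, not the GraphSAGE reference code; sampling with replacement when a node has too few neighbors is an assumption made here, and the helper names and toy graph are hypothetical.

```python
import numpy as np

def sample_neighbors(adj_list, node, num_samples, rng):
    """Draw a fixed number of neighbors (with replacement if too few)."""
    neighbors = adj_list[node]
    replace = len(neighbors) < num_samples
    return rng.choice(neighbors, size=num_samples, replace=replace)

def mean_concat_layer(adj_list, H, W, num_samples, rng):
    """Concatenate each node's features with the mean of its sampled neighbors."""
    out = []
    for v in range(H.shape[0]):
        sampled = sample_neighbors(adj_list, v, num_samples, rng)
        agg = H[sampled].mean(axis=0)
        out.append(np.concatenate([H[v], agg]))
    return np.maximum(np.stack(out) @ W, 0.0)   # ReLU

rng = np.random.default_rng(0)
adj_list = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
H0 = rng.normal(size=(4, 5))
W1 = rng.normal(size=(10, 8))   # 2 * 5 input rows because of the concatenation
H1 = mean_concat_layer(adj_list, H0, W1, num_samples=2, rng=rng)
```

Because every node contributes exactly `num_samples` neighbors, the per-layer cost is independent of the true degree, at the price of seeing only part of the neighborhood.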
Analysis of recursive propagation models
In this section, we first provide a unified formulation of GCN and GraphSAGE as recursively computed graph propagation models. Then, we analyze this unified formulation and show that they lack the representation capacity to regulate information from different hops independently.
Unified Recursive Graph Propagation Kernel
GCN and GraphSAGE differ from each other in terms of their node features, the neighborhood features, and the combination function. These differences can be abstracted to provide a unified formulation as below.
(4) 
Where and denote the ^{th} hop node and it’s neighbors’ features respectively, denotes the scaling factor for node features, denotes the neighbors’ weights, and denotes the mode of combination of node and it’s neighbors’ features. For brevity, we have made the neighbors’ weighting function to be independent of .
We can view GCN in terms of the components in Eqn: 4 with node features, where = , neighbors’ features, = with = and combining by summation, .
Similarly, GraphSAGE can be seen to combine nodes’ features, with = and different based neighbors’ features by concatenation, = . Specifically, = for GSMEAN, = for GSMAX where
is a one hot vector with
in the position of the node with the maximum value for the ^{th} feature and for GSLSTM, is defined by the LSTM gates which randomly orders neighbors and gives weightage for a neighbor in terms of neighbors seen before.The concatenation combination (denoted by square braces below) can also be expressed in terms of a summation of node and neighbors features with different weight matrices, and
respectively by appropriately padding zero matrices, (
) as shown below.(5) 
The and terms for CONCAT and SUMMATION combinations are similar if weights are shared in the CONCAT formulation as shown in Eqn: 6 and Eqn: 7, respectively. Weight sharing refers to .
(6)  
(7) 
For brevity of analysis made henceforth, we only consider the summation model to discuss the limitations of the recursive propagation kernels without losing any generality on the deductions made. Further, we provide another abstraction to the summation formulation as in Eqn: 7 by Eqn: 8. Henceforth, we refer to Eqn: 8 as the generic recursive propagation kernel in the upcoming analysis.
(8) 
Lack of independent regulatory paths to different hops
Though these propagation models can combine information from multiple hops, their formulation restricts them from independently regulating information from different hops. This is a consequence of recursively computing $k$-th hop information in terms of $(k-1)$-th hop information, which results in interdependence among the weights associated with the different hops. We can see this below in the recursively expanded unified graph kernel.
(9) 
Let’s analyze this with an example of a hop linear kernel with = and = which on expansion yields the following equation:
(10) 
This expansion makes it trivial to note that all the weights influence all the different hop information ( and ) in the model. For example, if we take the case where only firsthop information (just term) is required, then there exists no combination of s that can provide it under the current model. It should be noted that we cannot obtain the ^{st} hop information alone by using a hop kernel as that would also include hop information, .
From the above analysis, we can say that these recursive graph kernels have limited representation capacity, as they cannot capture information from a particular subset of hops without including information from other hops. The limitation of these networks can be attributed to the specific formulation of recursion used to compute the output at every layer. With every layer $k$ of a graph convolutional net, new information about the $k$-th hop is introduced. More importantly, the output at the $k$-th layer passes through a series of computations involving later hops ($k+1, \dots, K$) before reaching the last output layer. The same holds for the previous layers too. Thus, there are no independent computation paths to regulate information from any hop without affecting information from later and earlier hops.
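As a worked illustration, consider a simplified 2-hop linear kernel. The recursion $h_k = (h_{k-1} + A h_{k-1}) W_k$ with $h_0 = X$, a unit node scaling factor, and unnormalized neighbor weights $A$ is an assumption made here for brevity; it is not necessarily the exact kernel expanded in Eqn: 10.

```latex
\begin{aligned}
h_1 &= (X + AX)\,W_1 \\
h_2 &= (h_1 + A h_1)\,W_2 = (I + A)^2\,X\,W_1 W_2 \\
    &= X\,W_1 W_2 \;+\; 2\,AX\,W_1 W_2 \;+\; A^2 X\,W_1 W_2
\end{aligned}
```

Under these assumptions, every hop term is multiplied by the same product $W_1 W_2$, so in general no choice of weights retains $AX$ while zeroing out $X$ and $A^2 X$.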
Inclusion of skip connections: Adding the popular skip connection [He et al. 2016] to these models as in Eqn: 11 improves the multi-hop information regulation capacity.
(11)  
(12) 
On recursively expanding the above equation, it can be seen that adding skip connections to a layer results in directly adding information from all the lower hops, as shown in Eqn: 12. Unlike Eqn: 8, where the output at each layer depended only on the previous layer, amounting to a single computational path, adding skip connections allows for multiple computational paths. At the $k$-th layer, the model thus has the flexibility to select an output from any or all of the lower layers.
However, it can only discard information beyond a particular hop, and is still not sufficient to individually regulate the importance of information from individual hop as all hops, are interdependent. Lets us consider the same example of a hop model as earlier to capture information from ^{st} hop alone ignoring the rest. The best, the hop model with skip connections can do is to learn to ignore information from and hop by setting == and including along with . It can be reasoned as before to see that cannot be set to as depends on the result of thereby having no means to ignore information from . This limits the expressive power to efficiently span the entire space of order neighborhood information. To summarize, skip connections at best can obtain information up to a particular hop by ignoring information from subsequent hops. The combination can be perceived as linear skip connection as noted by the authors of GraphSAGE.
Inclusion of different weights: As with the CONCAT model, the summation models can also be modified to have different weights to compute node and neighborhood features. We provide the analysis for such models with non-shared weights in the supplementary material. From that analysis, it can be seen that models with different weights are powerful enough to obtain information from any particular hop while ignoring the rest, and can also obtain information from a contiguous subset of hops while ignoring the rest. However, when including information from two different hops $i$ and $j$ with $i < j$, information from all hops between $i$ and $j$ will also be included and cannot be ignored.
Proposed Methodology
In this section, we propose a simple yet effective extension to GCNs by adding a fusion component that allows them to capture multi-hop information effectively. We motivate and propose this component as a solution that will enable these graph kernels to span the entire space of the $K$-hop neighborhood. First, we show that the unified kernel is a binomial combination of a node’s and its neighborhood’s information. Thus, at each layer $k$, a $k$-th hop kernel is computed by a binomial combination. In light of this, we propose a simple fusion layer that learns to linearly combine information from these binomial bases to span the entire $K$-hop space.
Binomial basis
The $K$-hop unified propagation kernel defined in Eqn: 8 can be rolled out similar to Eqn: 9 and be expressed as a $K$-th order binomial in terms of a node’s and its neighbors’ features for the linear activation case, as given in Eqn: 13.
(13) 
The higher order binomial term in Eqn: 13 when expanded assigns different weights to different terms as seen in Eqn: 10. These weights correspond to the binomial coefficients of the binomial series, . For example, refer to Eqns: 14 and 15 corresponding to a hop and hop kernel with and for simplicity. It can be seen that for the hop kernel the weights are and for the hop kernel it is . Thus, these recursive propagation kernels combine different hop information weighed by the binomial coefficients.
(14)  
(15) 
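The binomial weighting can be verified numerically. The sketch below assumes a simplified linear kernel $h_k = h_{k-1} + A h_{k-1}$ with unit weights and unit node scaling (an assumption for illustration, not the exact form of Eqns: 14 and 15), and checks that rolling it out $K$ times matches the binomial expansion of $(I + A)^K X$. The function names are hypothetical.

```python
import numpy as np
from math import comb

def rolled_out_kernel(A, X, K):
    """Apply h_k = h_{k-1} + A h_{k-1} for K steps (linear, unit weights)."""
    h = X
    for _ in range(K):
        h = h + A @ h
    return h

def binomial_expansion(A, X, K):
    """Sum over i of C(K, i) A^i X, i.e. the expansion of (I + A)^K X."""
    total = np.zeros_like(X)
    hop = X.copy()                 # A^0 X
    for i in range(K + 1):
        total += comb(K, i) * hop
        hop = A @ hop              # advance to A^{i+1} X
    return total

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))

# Per-hop weights induced by the recursion for 2 and 3 propagation steps.
coeffs_2 = [comb(2, i) for i in range(3)]
coeffs_3 = [comb(3, i) for i in range(4)]
```

In this simplified setting the middle hops always receive the largest fixed weights, which is the kind of built-in bias the text describes.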
These weights induce a bias on the importance of each hop, which again is a limitation of the kernel design. Any such fixed bias over different hops cannot consistently provide good performance across numerous datasets. In the limit of infinite data, we can expect the parameters to correct the scaling factors induced by these biases. However, as with most graph-based semi-supervised learning applications, where the amount of available labeled data for training is limited, an undesirable bias can result in a suboptimal model.
Existing propagation kernels defined over $K$-hop information extract relational information by performing convolution operations on different hop neighbors based on their respective $k$-th order binomial. As discussed earlier, biasing the importance of information, along with recursive weight dependencies, hinders the model from learning relevant information from different hops. These limitations constrain the expressive power of these models and prevent them from spanning the entire space of $K$-th order neighborhood information. Hence, they are restricted to only a subspace of all possible $K$-th order polynomials defined on the neighborhood of nodes.
Linear Fusion Component
To mitigate these issues with existing models, we propose a minimalistic addition to these models: a fusion component. This fusion component consists of parameters that combine the information from the binomial basis defined by the different hops, so as to effectively span the entire space of a $K$-th order neighborhood.
We define the fusion component in Eqn: 16 as a linear weighted combination over the $K$-hop neighborhood space spanned by the binomial basis, i.e., the outputs of the different layers, with learnable coefficients. The coefficients allow the neural network to explicitly learn the optimal combination of information from different hops. As the basis elements are binomials, a parameterized linear combination of these binomials can obtain any combination of the individual hop information.
(16) 
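To see why a linear combination of binomial bases suffices, the following sketch isolates individual hops from the outputs of a simplified linear kernel. The bases $B_k = (I + A)^k X$ and the coefficient vector `theta` are assumptions used for illustration; the fusion layer in the paper learns its coefficients rather than setting them by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))
I = np.eye(4)

# Binomial bases B_k = (I + A)^k X for k = 0, 1, 2 (simplified linear kernel).
bases = [np.linalg.matrix_power(I + A, k) @ X for k in range(3)]

def fuse(bases, theta):
    """Linear fusion: weighted sum of the per-hop outputs."""
    return sum(t * B for t, B in zip(theta, bases))

# Isolating the first hop needs a negative coefficient: A X = B_1 - B_0.
first_hop_only = fuse(bases, theta=[-1.0, 1.0, 0.0])

# Isolating the second hop: A^2 X = B_2 - 2 B_1 + B_0.
second_hop_only = fuse(bases, theta=[1.0, -2.0, 1.0])
```

Both examples require subtracting lower-order bases, which is why the coefficients must be free to take negative values.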
Fusion Graph Convolutional Network
We define the Fusion Graph Convolutional Network (FGCN) in Eqn: 17. FGCN is a minimalistic architecture that adds the fusion component defined in Eqn: 16 to the GCN defined in Eqn: 1. The fusion component, in the penultimate line of Eqn: 17, fuses label scores from each propagation step, thereby combining information from different hops. The fused label scores are then normalized to make the label prediction.
(17) 
The dimensions of , , , , are , , , , respectively. FGCN uses ReLU activations and a softmax label layer accompanied by a multiclass cross entropy for multiclass classification problem or a sigmoid layer followed by a binary cross entropy layer for multilabel classification problem. Since predictions are obtained from every hop, we also subject to a nonlinear activation function with weights same as from .
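One way to read this description is sketched below. It is a hedged interpretation rather than the authors' exact architecture: the shared label weights `W_label`, one fusion coefficient per hop in `theta`, and the hop-0 path that reuses the first layer's weights are all assumptions made for illustration, as are the function names and toy sizes.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fgcn_forward(L_hat, X, hidden_weights, W_label, theta):
    """Fuse label scores from hops 0..K with coefficients theta (length K + 1)."""
    # Hop 0: node features alone, reusing the first layer's weights.
    hops = [relu(X @ hidden_weights[0])]
    h = X
    for W in hidden_weights:
        h = relu(L_hat @ h @ W)                         # one propagation step
        hops.append(h)
    scores = [h_k @ W_label for h_k in hops]            # per-hop label scores
    fused = sum(t * s for t, s in zip(theta, scores))   # linear fusion
    return softmax(fused)

# Toy inputs (hypothetical sizes: 4 nodes, 5 features, 8 hidden units, 3 labels).
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)
d_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
L_hat = d_inv_sqrt @ A_tilde @ d_inv_sqrt
X = rng.normal(size=(4, 5))
hidden_weights = [rng.normal(size=(5, 8)), rng.normal(size=(8, 8))]
W_label = rng.normal(size=(8, 3))
theta = np.array([0.5, -1.0, 2.0])   # one coefficient per hop, free in sign
probs = fgcn_forward(L_hat, X, hidden_weights, W_label, theta)
```

Since each hop's scores feed the output directly, gradients reach every propagation step without passing through the later layers.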
The number of parameters in FGCN is , where the first term is for GCN and the second is for the fusion component. GraphSAGE models with no shared weights have or for the summation and concat combination respectively which is more than FGCN as typically. This simple fusion component with fewer parameters provides additional benefits to FGCN besides explicitly allowing to capture different hop information. It provides for additional direct gradient flow paths to each of the propagation steps allowing it to learn better discriminative features at the lower hops too which also improves its chances of mitigating vanishing gradient. FGCN can be seen as a multiresolution architecture which simultaneously looks at information from different resolutions/hops and models the correlations among them.
The fusion component is similar in spirit to the Chebyshev filters introduced in [Defferrard, Bresson, and Vandergheynst 2016] for the task of classifying complete graphs. The primary difference is that the Chebyshev filters learn coefficients to combine Chebyshev polynomials defined over neighborhood information, whereas in FGCN the coefficients of the filter are used to combine different binomials pertaining to different layers. An additional difference is that the Chebyshev polynomial basis is not associated with weights to filter $k$-th hop information, whereas such weights can potentially enable the model to learn a complex nonlinear feature basis. FGCN also enjoys the benefit of the renormalization trick of GCN, which stabilizes learning and diminishes the effect of the vanishing or exploding gradient problems associated with training neural networks.
Note that existing attention mechanisms [Vaswani et al.2017] are typically defined for a positive combination of information. They have a restricted scoring range based on the activation functions used. In our case, since we needed a mechanism that can scale the amount of addition and subtraction of information from different binomial bases, we opted for the simple linear weighted combination layer.
Experiments
We run detailed experiments to compare the performance of GCN, GraphSAGE, and FGCN across many datasets. The end task is either semi-supervised multi-class or multi-label node classification. Links to code, dataset, and hyperparameter details will be made available.
Datasets
Experiments were conducted on eight publicly available datasets from social, citation, product, movie and biological domains. In Table: 1, network statistics of these datasets are given. Dataset details are provided below.
Dataset  Network  Nodes  Edges  Classes  Multilabel  Features 

CORA  Citation  2708  5429  7  FALSE  1433 
CITE  Citation  3312  4715  6  FALSE  3703 
CORA_ML  Citation  11881  34648  79  TRUE  9568 
HUMAN  Biology  56944  1612348  121  TRUE  50 
BLOG  Social  69814  2810844  46  TRUE  5413 
FB  Social  6302  73374  2  FALSE  2 
AMAZON  Product  16553  76981  2  FALSE  30 
MOVIE  Movie  7155  388404  20  TRUE  5297 
Biological network: We use the protein-protein interaction (PPI) network of human tissues, as presented in GraphSAGE [Hamilton, Ying, and Leskovec 2017]. The dataset contains protein interactions from 24 human tissues. Positional gene sets, motif gene sets, and immunology signatures were used as node features/attributes. The ultimate task is to predict the gene’s functional ontology.
Movie network: The movielens2k dataset available as a part of HetRec 2011 workshop [Cantador, Brusilovsky, and Kuflik2011] contains a large number of movies. The dataset is an extension of the MovieLens10M dataset with additional movie tags. The graph constructed based on it considers each movie as a node and a common actor as an edge. The goal of the task is to predict the genre of the movies.
Product network: We follow a setup similar to [Moore and Neville 2017] and consider the Amazon DVD co-purchase network. It is a subset of the co-purchase data, Amazon_060 [Leskovec and Sosič 2016]. The nodes correspond to DVDs, and an edge is constructed if two DVDs are co-purchased. The DVD genres are treated as DVD features. The task here is to predict whether DVD sales will cross 7500 or not.
Social networks: We consider the BlogCatalog (BLOG) [Wang et al. 2010] and Facebook (FB) [Pfeiffer III, Neville, and Bennett 2015, Moore and Neville 2017] datasets. The nodes in the BlogCatalog dataset represent the users of a social blog. Each user’s blog tags are treated as the attributes of the node, and relations based on friendship or fan following are represented as edges. The task is to determine the interests of users. Similarly, in the Facebook dataset, the nodes denote the Facebook users, and their attributes consist of gender and religious view. The task here is to determine the political view of a user.
Citation Networks: In citation networks, the research articles are the nodes and citations are the edges. We use the Cora, Citeseer, and Cora_ML datasets from this domain. The node attributes for them are the bag-of-words representation of the article. Predicting the research area of the article is the main objective. Unlike the other two, which are multi-class datasets, Cora_ML is a multi-label classification dataset [Mccallum 2001].
Table 2: Micro-F1 scores of all models on the eight datasets. OOM denotes out of memory; Penalty is the average difference from the best score.
Dataset  NODE    GCN     GSMEAN  GSMAX   GSLSTM  FGCN

CORA     60.222  79.039  76.821  73.272  65.730  79.039
CITE     65.861  72.991  70.967  71.390  65.751  72.266
CORA_ML  40.311  63.848  62.800  53.476  OOM     63.993
HUMAN    41.459  62.057  63.753  65.068  64.231  65.538
BLOG     37.876  34.073  39.433  40.275  OOM     39.069
FB       64.683  49.762  64.127  64.571  64.619  64.857
AMAZON   63.710  61.777  68.266  70.302  68.024  74.097
MOVIE    50.712  39.059  50.557  50.569  OOM     52.021
Penalty  10.997  6.276   2.011   2.986   5.232   0.241

Table 3: Results for the inductive experiment on the HUMAN dataset.
NODE    GCN     GSMEAN  GSMAX   GSLSTM  Mean*   Max*    LSTM*   FGCN

44.644  85.708  79.634  78.054  87.111  59.800  60.000  61.200  88.942
Experiment Setup
We report the experiment results for semisupervised learning on a test set, populated by randomly sampling of the dataset. We create the training sets by randomly sampling five sets of nodes from the entire graph. Further, of these training nodes are chosen as the validation set. We do not use the validation set for (re)training. It is ensured that these training samples are mutually exclusive from the held out test data. For all transductive experiments, the reported results are an average of these five different training sets. We report results for inductive learning experiment with the same experimental setup as GraphSAGE on the Human PPI network, where the test nodes and validation nodes have no path to the nodes in the training set.
The hyperparameters for the models are the number of layers of the network (hops), dimensions of the layers, level of dropout for all layers, and L2 regularization, similar to [Kipf and Welling 2016]. We set the same starting learning rate for all the models across all datasets. We train all the models for a maximum of 2000 epochs using Adam [Kingma and Ba 2014] with the learning rate set to 1e-2. We use a variant of the patience method with learning rate annealing for early stopping. Precisely, we train the model for a minimum of 50 epochs, start with a patience of 30 epochs, and halve both the learning rate and the patience when the patience runs out (i.e., when the validation loss does not reduce within the patience window). We stop the training when the model loses patience for two consecutive turns. We added all these components to the baseline codes too, and we even observed an improvement for GraphSAGE on their dataset.
For hyperparameter selection, we search for the optimal setting on a two-layer deep feedforward neural network with the node attributes (NODE) alone. We then use the same hyperparameters across all the other models. We row-normalize the node features and initialize the weights following [Glorot and Bengio 2010]. Since the percentage of different labels in training samples can be significantly skewed, we weigh the loss for each label inversely proportional to its total fraction, like [Moore and Neville 2017]. We use CONCAT operations for all GraphSAGE models as in the original model and also include skip connections for the rest of the models for all kernel layers.
We ensured that all models have the same setup in terms of the weighted cross entropy loss, the number of layers, dimensions, stopping criteria and dropout. We evaluate the models on Micro-F1 scores, similar to GraphSAGE [Hamilton, Ying, and Leskovec 2017]. The best results for models across multiple hops were reported, which were typically 4 hops for Cora_ML and HUMAN, 3 hops for Amazon, 2 hops for Cora, Citeseer and FB, and 1 hop for the Movie and Blog datasets. We report results from different hops rather than a fixed number to be fair to baselines whose performance dropped on specific datasets with increased hops.
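The patience schedule with learning-rate annealing described in this section can be sketched as a small state machine. This is one possible reading of the description; the class name `PatienceScheduler`, the choice to reset the strike count on any improvement, and the integer halving of the patience are assumptions, not details confirmed by the text.

```python
class PatienceScheduler:
    """Early stopping with learning-rate annealing (hypothetical helper)."""

    def __init__(self, lr=1e-2, patience=30, min_epochs=50, max_strikes=2):
        self.lr = lr
        self.patience = patience
        self.min_epochs = min_epochs
        self.max_strikes = max_strikes
        self.best_loss = float("inf")
        self.wait = 0        # epochs without improvement in the current window
        self.strikes = 0     # consecutive times the patience ran out

    def step(self, epoch, val_loss):
        """Update state after an epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            # Improvement: remember it and reset both counters.
            self.best_loss = val_loss
            self.wait = 0
            self.strikes = 0
            return False
        if epoch < self.min_epochs:
            # Always train for the minimum number of epochs first.
            return False
        self.wait += 1
        if self.wait >= self.patience:
            # Patience ran out: halve the learning rate and the patience.
            self.lr /= 2
            self.patience = max(1, self.patience // 2)
            self.wait = 0
            self.strikes += 1
        return self.strikes >= self.max_strikes
```

In a training loop, `step(epoch, val_loss)` would be called once per epoch (1-indexed here), with the optimizer's learning rate synchronized to `scheduler.lr` after each call.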
Our implementation is mini-batch trainable, similar to GraphSAGE. We compare our model, FGCN, against the node-only classifier (NODE), GCN, and all the GraphSAGE variants: GSMEAN, GSMAX, and GSLSTM. The GSLSTM model ran out of memory (OOM) on a few datasets, as marked in the table, owing to its high parameterization.
Experiment results
Among the baselines, GraphSAGE models with more complex aggregator functions and no shared weights for node and neighborhood features significantly outperform the GCN model, which has shared weights and a limiting scaling factor. The GCN model performs poorly on datasets where the number of edges is high. This is primarily due to the effect of its node scaling factor in these high-degree datasets, where a node’s features are heavily under-weighted relative to its neighbors’ information. The effect of node scaling and biased importance is in agreement with the theoretical justification made in [Kipf and Welling 2016] for the design of renormalized GCNs over the mean model. However, its experimental benefits on the highly homophilous datasets, Cora and Citeseer, were not achievable on other datasets, as shown in Table 2. This can be observed on Blog, Amazon, Movie, and FB, where GCN performs worse than NODE, the classifier that uses only the node attributes, by about 3.8, 1.9, 11.7, and 14.9 percentage points, respectively.
FGCN vs. NODE: FGCN outperforms the NODE model on all eight datasets, with margins ranging from about 0.2 points on FB to about 24.1 points on HUMAN.
FGCN vs. GCN: FGCN improves over GCN in the inductive setup (88.942 vs. 85.708) and outperforms GCN on six of the eight datasets in the transductive setup while being comparable on the other two. FGCN seems to have learned to avoid the bias induced by the scaling factor by learning to effectively combine the node features along with additional hop information, resulting in improved performance over GCN by up to about 15.1 percentage points on FB. On datasets where GCN’s performance dropped below the NODE model, the FGCN model has recovered the drop in performance and further improved over NODE on BLOG, Movie, and FB. Moreover, on Amazon, we observe a further improvement of about 10.4 points over the node-only model. With the fusion component, the GCN model, which performed poorly among the propagation kernels, not only obtained a significant boost in performance but also achieved the best overall consistent score, with a penalty (average difference from the best) as low as 0.241.
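The penalty can be recomputed directly from the Micro-F1 scores in Table 2. The sketch below takes the per-dataset best score and averages each model's gap to it; GSLSTM is left out because it ran out of memory on some datasets and the handling of those entries is not specified. The variable names are illustrative.

```python
# Micro-F1 scores copied from Table 2, in the order
# CORA, CITE, CORA_ML, HUMAN, BLOG, FB, AMAZON, MOVIE.
scores = {
    "NODE":   [60.222, 65.861, 40.311, 41.459, 37.876, 64.683, 63.710, 50.712],
    "GCN":    [79.039, 72.991, 63.848, 62.057, 34.073, 49.762, 61.777, 39.059],
    "GSMEAN": [76.821, 70.967, 62.800, 63.753, 39.433, 64.127, 68.266, 50.557],
    "GSMAX":  [73.272, 71.390, 53.476, 65.068, 40.275, 64.571, 70.302, 50.569],
    "FGCN":   [79.039, 72.266, 63.993, 65.538, 39.069, 64.857, 74.097, 52.021],
}

num_datasets = 8

# Best score on each dataset across the listed models.
best = [max(s[i] for s in scores.values()) for i in range(num_datasets)]

# Penalty: average difference from the best score over all datasets.
penalty = {
    model: sum(b - v for b, v in zip(best, vals)) / num_datasets
    for model, vals in scores.items()
}
```

A model that is best everywhere would have a penalty of zero, so the metric rewards consistency across datasets rather than a large win on a single one.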
FGCN vs. GraphSAGE: FGCN outperforms the GraphSAGE variants on seven of the eight datasets while slightly underperforming on one. GraphSAGE models have higher flexibility compared to GCNs as they have no shared weights. This explains the significant experimental improvement obtained with concatenation, as noted in [Hamilton, Ying, and Leskovec 2017]. FGCN significantly improves over GraphSAGE models on the Human and Amazon datasets. On the Amazon dataset, though all of GraphSAGE’s variants significantly outperformed the GCN and NODE models, FGCN further improved over the best GraphSAGE variant by another 3.8 points or so. The GSMEAN model, which is similar to GCN, improves over both GCN and NODE. This can be attributed to not scaling down the node information as GCN does. Despite that, FGCN manages to further improve over GSMEAN. There is no single winner among the GraphSAGE models across datasets; different variants perform best on different datasets. Despite having a single simple aggregation function, FGCN outperforms all of them combined except on BLOG. This suggests that the flexibility to independently regulate information is necessary and that, irrespective of the complex aggregation functions used, the mild lack in representation capacity holds back GraphSAGE from achieving FGCN’s performance.
Overall, FGCN improves over the state-of-the-art results on six datasets while being extremely competitive on the other two. Though the proposed fusion component is, in theory, an optimal solution for GCNs with linear activations, it also seems to be experimentally beneficial for GCNs with the piecewise linear ReLU activations. Such generalization is not unrealistic in practice, as insights from a relaxed linear analysis are often observed to provide significant clarity and potential improvements on the nonlinear front [Saxe, McClelland, and Ganguli 2013, Kawaguchi 2016, Hardt and Ma 2016, Orabona and Tommasi 2017].
FGCN robustly captures multihop information: In real-life datasets, there exists a varying amount of information among the interactions between a node and its neighbors at different hop distances. An ideal relational model should be able to efficiently capture relevant information while filtering out the increasing noise induced by the expanding neighborhood size with each hop. In Figure 1, we demonstrate FGCN’s capability to robustly capture information from multiple hops on different datasets with varied information patterns. We selected only those datasets that have highly relevant relational information for the classification task. These were the datasets that obtained a significant improvement over the node-only classifier with just the inclusion of the first hop information.
It can be seen that the performance seems to saturate after two hops in the citation networks (Cora and Cora_ML) and after one hop in the Amazon co-purchase network. Despite that, FGCN, with its capability to selectively regulate information from multiple hops, remains unaffected by the noise induced by considering additional hops.
In contrast, there is a significant increase in performance when nodes’ higher order neighborhood interactions are considered in the inductive experiment on the protein-protein interaction dataset (HUMAN). In the Human dataset, FGCN was able to extract relevant information and achieve a remarkable performance gain from further hops despite the dataset’s high average degree.
Conclusion
We showed that differentiable recursive graph kernels are higher order binomial combinations of node and neighborhood information. Through analysis, we pointed out that such powerful recursively computed binomial functions lack the representation capacity to capture multihop neighborhood information effectively. Besides highlighting this critical issue, we also proposed a minimalist fusion component that can alleviate it. We empirically demonstrated the effectiveness of coupling the proposed fusion component with GCNs by significantly improving the performance of GCN and achieving highly competitive state-of-the-art results across eight datasets from different domains. For future work, we will incorporate the fusion component with GraphSAGE models and the recent FastGCN [Chen, Ma, and Xiao2018], which samples neighbors to reduce the computation complexity of GCNs.
References
 [Bruna et al.2013] Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
 [Cantador, Brusilovsky, and Kuflik2011] Cantador, I.; Brusilovsky, P.; and Kuflik, T. 2011. 2nd workshop on information heterogeneity and fusion in recommender systems (hetrec 2011). In Proceedings of the 5th ACM conference on Recommender systems, RecSys 2011. New York, NY, USA: ACM.
 [Chen, Ma, and Xiao2018] Chen, J.; Ma, T.; and Xiao, C. 2018. FastGCN: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247.
 [Cho et al.2014] Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
 [Defferrard, Bresson, and Vandergheynst2016] Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, 3844–3852.
 [Frasconi, Gori, and Sperduti1998] Frasconi, P.; Gori, M.; and Sperduti, A. 1998. A general framework for adaptive processing of data structures. IEEE transactions on Neural Networks 9(5):768–786.
 [Gallagher and EliassiRad2010] Gallagher, B., and Eliassi-Rad, T. 2010. Leveraging label-independent features for classification in sparsely labeled networks: An empirical study. Advances in Social Network Mining and Analysis 1–19.

 [Glorot and Bengio2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256.
 [Gori, Monfardini, and Scarselli2005] Gori, M.; Monfardini, G.; and Scarselli, F. 2005. A new model for learning in graph domains. In Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, volume 2, 729–734. IEEE.
 [Hamilton, Ying, and Leskovec2017] Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 1025–1035.
 [Hardt and Ma2016] Hardt, M., and Ma, T. 2016. Identity matters in deep learning. arXiv preprint arXiv:1611.04231.

 [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
 [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
 [Kawaguchi2016] Kawaguchi, K. 2016. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, 586–594.
 [Kingma and Ba2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [Kipf and Welling2016] Kipf, T. N., and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
 [Leskovec and Sosič2016] Leskovec, J., and Sosič, R. 2016. Snap: A general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent Systems and Technology (TIST) 8(1):1.
 [Li et al.2015] Li, Y.; Tarlow, D.; Brockschmidt, M.; and Zemel, R. 2015. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.
 [Lu and Getoor2003] Lu, Q., and Getoor, L. 2003. Link-based classification. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), 496–503.
 [Mccallum2001] Mccallum, A. 2001. CORA Research Paper Classification Dataset. In people.cs.umass.edu/ mccallum/data.html. KDD.
 [Moore and Neville2017] Moore, J., and Neville, J. 2017. Deep collective inference. In AAAI, 2364–2372.
 [Neville and Jensen2003] Neville, J., and Jensen, D. 2003. Collective classification with relational dependency networks. In Proceedings of the Second International Workshop on Multi-Relational Data Mining, 77–91.
 [Orabona and Tommasi2017] Orabona, F., and Tommasi, T. 2017. Training deep networks without learning rates through coin betting. In Advances in Neural Information Processing Systems, 2160–2170.
 [Perozzi, AlRfou, and Skiena2014] Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 701–710. ACM.
 [Pfeiffer III, Neville, and Bennett2015] Pfeiffer III, J. J.; Neville, J.; and Bennett, P. N. 2015. Overcoming relational learning biases to accurately predict preferences in large scale networks. In Proceedings of the 24th International Conference on World Wide Web, 853–863. International World Wide Web Conferences Steering Committee.
 [Saxe, McClelland, and Ganguli2013] Saxe, A. M.; McClelland, J. L.; and Ganguli, S. 2013. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
 [Scarselli et al.2009] Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009. The graph neural network model. IEEE Transactions on Neural Networks 20(1):61–80.
 [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
 [Wang et al.2010] Wang, X.; Tang, L.; Gao, H.; and Liu, H. 2010. Discovering overlapping groups in social media. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, 569–578. IEEE.
 [Wang et al.2016] Wang, S.; Tang, J.; Aggarwal, C.; and Liu, H. 2016. Linked document embedding for classification. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 115–124. ACM.
Appendix A
Inclusion of Different weights
For convenience, we change the notations for weights associated with the node and the neighbor features to and , respectively. The representation capacity of the recursive graph kernel with different weights for the node and the neighbor features as in Equation 18 is better than a kernel with shared weights.
(18) 
The flexibility of this formulation can be explained again with the binomial expansion. At every recursive step, the model can be seen as deciding to use only the node information, only the neighbors’ information, or both, by setting the corresponding node and neighbor weights to zero or nonzero values. For convenience, we say that a weight is set if it has nonzero values. The decisions of the model can be better understood when visualized as a binomial tree where the nodes are labeled with the computation output and the edges are labeled by the decision taken, i.e., the node weight or the neighbor weight that is used. The left tree in Figure 2 illustrates the computation graph for a hop kernel with different weights.
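The binomial expansion can be checked numerically with a small sketch. It assumes a linear recursion of the form h_k = h_{k-1} * w0[k] + a * h_{k-1} * w1[k], with scalars standing in for the feature matrix, the propagation matrix F(A), and the per-layer weights. The names x, a, w0, and w1 are illustrative and are not the paper's notation, and the nonlinearity is dropped.

```python
from itertools import product


def recursive_kernel(x, a, w0, w1):
    """Apply h_k = h_{k-1} * w0[k] + a * h_{k-1} * w1[k] layer by layer."""
    h = x
    for w0_k, w1_k in zip(w0, w1):
        h = h * w0_k + a * h * w1_k
    return h


def hop_coefficients(w0, w1):
    """Coefficient of a**k in the expansion, summed over all 2**K paths."""
    num_layers = len(w0)
    coeffs = [0] * (num_layers + 1)
    for path in product((0, 1), repeat=num_layers):
        weight = 1
        for layer, decision in enumerate(path):
            # decision 0: node (identity) weight; decision 1: neighbor weight
            weight *= w1[layer] if decision else w0[layer]
        coeffs[sum(path)] += weight
    return coeffs


def expanded_kernel(x, a, w0, w1):
    """Sum of the per-hop terms coeff_k * a**k * x."""
    return sum(c * a**k * x for k, c in enumerate(hop_coefficients(w0, w1)))


w0, w1 = [1, 2, 3], [4, 5, 6]
print(recursive_kernel(2, 3, w0, w1) == expanded_kernel(2, 3, w0, w1))
print(hop_coefficients([1, 1, 1], [1, 1, 1]))
```

With all weights equal to one, the per-hop coefficients reduce to the binomial coefficients, which is the sense in which the recursion is a binomial combination of node and neighborhood information. Setting every neighbor weight to zero leaves only the zero-hop term.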
For any $K$-hop kernel with $K$ recursive computational layers, there will be $2^K$ unique paths/decisions to make. The paths lead to $2^K$ leaf nodes which compute different terms. The $k$-hop terms are available at one or more leaves, where the multiplicity of availability is given by the number of ways to choose $k$ from $K$, i.e., $\binom{K}{k}$ (binomial coefficient). The $0$-hop and $K$-hop terms have only one path ($\binom{K}{0} = \binom{K}{K} = 1$), whereas the $k$-hop terms ($0 < k < K$) have more than one path.
Let string ‘0’ denote the identity transformation and ‘1’ denote the F(A) transformation. We say a transformation has happened if the weights associated with it have nonzero values. To comprehend the dependencies among weights, let us trace the weights along the different paths to leaf nodes. We create the tree on the right in Figure 2 from the output computation tree on the left by relabeling nodes with substrings representing the transformations taken to reach that node. For example, a node labeled ‘01’ indicates that the node was reached by taking an identity transformation followed by an F(A) transformation. Hence, the number of 0s and 1s at each leaf node conveys the pattern used to compute each hop's information in the kernel.
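The leaf labeling described above can be sketched by enumerating every 0/1 string of length K and grouping the strings by their number of 1s. The function names and the choice K = 3 are illustrative assumptions. The only facts used are that ‘0’ stands for an identity transformation, ‘1’ for an F(A) transformation, and that the number of 1s gives the hop order.

```python
from itertools import product
from math import comb


def leaf_labels(num_hops):
    """All 0/1 strings of length num_hops, one per root-to-leaf path."""
    return ["".join(bits) for bits in product("01", repeat=num_hops)]


def paths_per_hop(num_hops):
    """Map hop order k to the leaf labels that contain exactly k ones."""
    groups = {k: [] for k in range(num_hops + 1)}
    for label in leaf_labels(num_hops):
        groups[label.count("1")].append(label)
    return groups


K = 3
for k, labels in paths_per_hop(K).items():
    # K - k identity transformations, k F(A) transformations, C(K, k) paths
    assert len(labels) == comb(K, k)
    print(k, K - k, k, len(labels), labels)
```

Each printed row lists the hop order, the number of identity transformations, the number of F(A) transformations, the number of paths, and the leaf labels themselves.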
In Table 4, we tabulate the number of identity transformations and the number of F(A) transformations taken to obtain different hop information at the leaves of a 3-hop kernel, together with the number of paths that compute the same hop information. From the example in the table, we generalize the following claims to any number of hops $K$.

Computing $k$-hop information requires $K-k$ identity transformations and $k$ F(A) transformations.

Each hop's computation has a unique combination of identity and F(A) transformation counts. From this, we can say that the model can learn to obtain any specific hop information without including any information from the rest of the hops, unlike shared-weight models, where all the computations share the same path.

We cannot independently regulate information from two or more required hops without including information from the other hops lying within the range of the required set. This is a consequence of sharing weights among the computation paths, as seen in the figures.
Leaving out the first two trivial claims, we analyze the last claim. Let $S$ denote the set of all required hops in a $K$-hop graph kernel, and let $m$ and $n$ be the minimum and the maximum hops in that set. For the $m$-th hop, $m$ F(A) weights and $K-m$ identity weights should be set (nonzero values), and similarly, for the $n$-th hop, $n$ F(A) weights and $K-n$ identity weights should be set. The $m$-hop and $n$-hop information can be obtained by traversing any path that satisfies the previously mentioned conditions on the number of identity and F(A) transformations. Thus, put together, $n$ F(A) weights and $K-m$ identity weights will be set to obtain the $m$-hop and $n$-hop information when traversed along one path in the computation tree, from the root to the leaf.
Since the model sums up all the leaf nodes, it will also include information from those leaf nodes which can be reached from the root by following the set identity and F(A) weights. We go about the proof by first formulating the condition under which other hops’ information can be obtained. Then, we go on to show that information from every hop lying strictly between $m$ and $n$ will additionally be included.
Let $x$ denote any other hop, with $x \neq m$ and $x \neq n$. Computing the $x$-th hop requires $x$ F(A) weights and $K-x$ identity weights. It can be obtained only if $x \le n$ and $K-x \le K-m$, which essentially means that the F(A) weights it requires should be a subset of the $n$ F(A) weights, and the identity weights it requires a subset of the $K-m$ identity weights, that have already been set while considering the $m$-hop and $n$-hop information.
We then examine the possible values of $x$ under the three conditions listed below:

$x < m$: As $x < n$, the required number of F(A) weights for the $x$-th hop is available, whereas the required number of identity weights is not, as $K-x > K-m$. Hence, information from hop $x$ is not included.

$x > n$: As $x > n$, the required number of F(A) weights is unavailable, though the required number of identity weights can be satisfied as $K-x < K-m$.

$m < x < n$: As $x < n$, the required number of F(A) weights is satisfied, and so is the required number of identity weights, as $K-x < K-m$.
This can be clearly seen from the example of the 3-hop kernel presented in Table 4. When the weights along the unique paths for the 0-hop and 3-hop information are set, all the weights end up being set. As the 0-hop information is obtained by doing the identity transformation at every layer and the 3-hop information is obtained by doing F(A) transformations at every layer, all identity and F(A) weights are set, which necessarily ends up including all the other hop information as all the weights are active. Similarly, it can be shown that when hops 1 and 2 are included, no information from hops 0 and 3 is included.
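The argument above can be checked by brute force on small kernels. The sketch below assumes each layer has one identity weight and one F(A) weight, and that the path chosen for a required hop applies the F(A) transformations first and the identity transformations afterwards. This nested choice of paths is an assumption made for illustration (the text allows any path with the right counts), and the function names are hypothetical.

```python
from itertools import product


def path_for_hop(hop, num_layers):
    """One path yielding `hop`-hop information: F(A) first, identity after."""
    return (1,) * hop + (0,) * (num_layers - hop)


def included_hops(required_hops, num_layers):
    """Hops whose information reaches the output once required hops are set."""
    # Weights that must be nonzero: (layer, decision) pairs on chosen paths.
    active = set()
    for hop in required_hops:
        active.update(enumerate(path_for_hop(hop, num_layers)))
    # A leaf contributes to the sum if every weight on its path is active.
    reached = set()
    for path in product((0, 1), repeat=num_layers):
        if all((layer, d) in active for layer, d in enumerate(path)):
            reached.add(sum(path))
    return sorted(reached)


print(included_hops({0, 3}, 3))
print(included_hops({1, 2}, 3))
print(included_hops({0, 2}, 3))
```

Under these assumptions, requiring hops 0 and 3 in a 3-layer kernel activates every weight and therefore pulls in hops 1 and 2 as well, whereas requiring only hops 1 and 2 leaves hops 0 and 3 out.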
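Table 4: Number of identity transformations, F(A) transformations, and paths for each hop in a 3-hop kernel.
Hop  Identity transformations  F(A) transformations  Paths
0  3  0  1
1  2  1  3
2  1  2  3
3  0  3  1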