Fusion Graph Convolutional Networks

Semi-supervised node classification involves learning to classify unlabelled nodes given a partially labeled graph. In transductive learning, all unlabelled nodes to be classified are observed during training and in inductive learning, predictions are to be made for nodes not seen at training. In this paper, we focus on both these settings for node classification in attributed graphs, i.e., graphs in which nodes have additional features. State-of-the-art models for node classification on such attributed graphs use differentiable recursive functions. These differentiable recursive functions enable aggregation and filtering of neighborhood information from multiple hops (depths). Despite being powerful, these variants are limited in their ability to combine information from different hops efficiently. In this work, we analyze this limitation of recursive graph functions in terms of their representation capacity to effectively capture multi-hop neighborhood information. Further, we provide a simple fusion component which is mathematically motivated to address this limitation and improve the existing models to explicitly learn the importance of information from different hops. This proposed mechanism is shown to improve over existing methods across 8 popular datasets from different domains. Specifically, our model improves the Graph Convolutional Network (GCN) and a variant of Graph SAGE by a significant margin providing highly competitive state-of-the-art results.



In many real-life applications such as social networks, citation networks, and protein interaction networks, the entities in an environment are not independent but rather influence each other through their interactions. Such relational datasets have been popularly modeled as graphs where the entities make up the nodes and the edges represent interactions. The use of graph-based learning algorithms has increasingly gained traction owing to their ability to model structured data. Categorizing such entities requires extracting relational information from their multi-hop neighborhoods and combining that efficiently with their features. Summarizing information from multiple hops is useful in many applications where there exist semantics in local and group level interactions among entities. Thus, defining and finding the significance of neighborhood information over multiple hops becomes an important aspect of the problem.

Traditionally, handcrafted features were widely used to capture relational information. Popular methods used a mix of count statistics of label distributions [Neville and Jensen2003, Lu and Getoor2003], relational properties of nodes such as degree and centrality scores [Gallagher and Eliassi-Rad2010], and attribute summaries of the immediate neighborhood. Limited by manual engineering, these traditional methods only used raw features built from information associated with immediate (first-order or one-hop) neighbors.

The recent surge in deep learning has shown promising results in extracting important semantic features and learning good representations for many machine learning tasks. Deep learning models for such relational node classification tasks can be broadly categorized into models that learn node representations with structural regularization [Perozzi, Al-Rfou, and Skiena2014, Wang et al.2016] and those that ignore explicit structural regularization and learn to aggregate neighbors’ information [Hamilton, Ying, and Leskovec2017, Moore and Neville2017, Kipf and Welling2016]. The former are limited to networks that exhibit high homophily, as they enforce the representations of a node and its neighbors to be similar. The latter methods, on the other hand, do not make such assumptions.

Initial works [Frasconi, Gori, and Sperduti1998] on relational feature extraction with neural nets primarily relied on recursive neural nets to process the graph data. As these could deal only with directed ordered acyclic graphs, [Scarselli et al.2009, Gori, Monfardini, and Scarselli2005] introduced Graph Neural Networks (GNNs), which used recursive neural nets to propagate information iteratively in any general graph. However, these GNNs were limited to problems where the entire graph can fit into memory. To extend the work to sequence generation problems on graphs, [Li et al.2015] adapted GRUs [Cho et al.2014] in the propagation step. Recently, [Moore and Neville2017] proposed an LSTM [Hochreiter and Schmidhuber1997] based sequence embedding model for node classification, but it required randomly ordering first-hop neighbors’ information and thus essentially discarded the topological structure.

To directly deal with the graph’s topological structure, [Bruna et al.2013] defined convolutional operations in the spectral domain for graph classification tasks, but required computationally expensive eigen decomposition of the graph Laplacian. To reduce this requirement, [Defferrard, Bresson, and Vandergheynst2016] approximated the higher order relational feature computation with first order Chebyshev polynomials defined on the graph Laplacian. Graph Convolutional Networks (GCNs) [Kipf and Welling2016] adapted them to semi-supervised node level classification tasks. GCNs simplified Chebyshev Nets by recursively convolving one-hop neighborhood information with a symmetric graph Laplacian. Recently, [Hamilton, Ying, and Leskovec2017] proposed a generic framework called GraphSAGE with multiple neighborhood aggregator functions. GraphSAGE works with a partial (fixed number) neighborhood of nodes to scale to large graphs. GCN and GraphSAGE are the current state-of-the-art approaches for transductive and inductive node classification tasks in graphs with node features. These end-to-end differentiable methods provide impressive results besides being efficient regarding memory and computational requirements.

Herein, we argue that despite their impressive results, these models lack the representation capacity to effectively summarize relevant neighborhood information from multiple hops. We support our argument with an analysis of their representation limitations and also provide a solution to alleviate this issue. Below, we list our primary contributions:

  • To the best of our knowledge, we are the first to analyze and point out that current state-of-the-art graph convolutional nets lack the representation capacity to regulate different neighborhood information independently.

  • We also show that these models capture $K$-hop neighborhood information by a $K$-th order binomial. We take advantage of this to build a binomial basis for the $K$-hop neighborhood space, with outputs from different graph layers corresponding to different hops. With the binomial basis, we define a simple linear fusion layer that can capture any required combination of hops for the end task.

  • We propose F-GCN, a simple extension to GCNs with the proposed fusion layer. We show that the proposed model outperforms the state-of-the-art models on six datasets while being highly competitive on the other two.



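Let $G = (V, E)$ denote a graph comprising vertices, $V$, and edges, $E$, with $|V| = N$ and $|E| = M$ respectively. Let $X \in \mathbb{R}^{N \times F}$ denote the nodes’ features and $Y \in \{0, 1\}^{N \times L}$ denote the nodes’ labels, with $F$ and $L$ referring to the number of features and labels, respectively. Let $A \in \mathbb{R}^{N \times N}$ denote the adjacency matrix representation of the set of edges, and let $D \in \mathbb{R}^{N \times N}$ denote the diagonal degree matrix defined as $D_{ii} = \sum_j A_{ij}$. Let $\hat{L} = D^{-1/2} A D^{-1/2}$ denote the normalized graph Laplacian and $\tilde{L} = \tilde{D}^{-1/2} (A + I) \tilde{D}^{-1/2}$, with $\tilde{D}_{ii} = \sum_j (A + I)_{ij}$, denote the re-normalized Laplacian [Kipf and Welling2016].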
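As an illustration of the notation above, the following minimal sketch builds the symmetrically normalized matrix and its re-normalized counterpart for a toy graph. It assumes the re-normalization trick of [Kipf and Welling2016] (add self-loops, then normalize symmetrically); the function names and the three-node path graph are hypothetical choices made only for this example.

```python
import math

def degree(adj):
    """Row sums of a dense adjacency matrix given as a list of lists."""
    return [sum(row) for row in adj]

def sym_normalize(adj):
    """Return D^{-1/2} A D^{-1/2}; rows and columns of isolated nodes stay zero."""
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree(adj)]
    n = len(adj)
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]

def renormalize(adj):
    """Assumed re-normalization trick: add self-loops, then normalize symmetrically."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    return sym_normalize(with_loops)

# Toy path graph 0 - 1 - 2 (hypothetical data).
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
L_hat = sym_normalize(A)    # normalized matrix without self-loops
L_tilde = renormalize(A)    # re-normalized matrix with self-loops
```

With self-loops, every node keeps a non-zero weight on its own features, which is what later allows the propagation rule to be split into a node term and a neighbor term.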

In this paper, a Graph Convolutional Network defined to capture $K$-hop information will have $K$ graph convolutional layers with $d$-dimensional outputs and a final label layer. The label layer is the $K$-th layer if the last convolution is considered as the label layer; otherwise, it is the $(K+1)$-th layer. Let $W_k$ denote the weights associated with the $k$-th layer: the first hidden layer’s weights map the input features to $d$ dimensions, the label layer’s weights map to the label space, and the intermediate layers’ weights are $d \times d$. Let $\sigma_k$ denote the activation function associated with the $k$-th layer.


Graph Convolutional Networks

Graph Convolutional Networks (GCNs), introduced in [Kipf and Welling2016], are multi-layer convolutional neural networks where the convolutions are defined on a graph structure for the problem of semi-supervised node classification. The conventional two-layer GCN, which captures information up to the 2nd hop neighborhood of a node, can be reformulated to capture information up to any arbitrary hop, as given below in Eqn: 1.


GCN was used for the multi-class classification task with ReLU activation functions in the hidden layers and a softmax label layer. We can rewrite the GCN model in terms of the node and neighbor features at each hop, as below, by factoring the re-normalized Laplacian into its self-loop and neighbor terms.
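The multi-hop GCN described here can be sketched as a short forward pass. This is a minimal illustration that assumes the standard propagation rule of [Kipf and Welling2016], in which each layer multiplies the previous output by the re-normalized Laplacian and a weight matrix, with ReLU in hidden layers and softmax at the label layer. The helper names, the toy matrices, and the hand-picked weights are hypothetical.

```python
import math

def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def gcn_forward(l_tilde, x, weights):
    """K = len(weights) propagation steps; the last one is the softmax label layer."""
    h = x
    for w in weights[:-1]:
        h = relu(matmul(matmul(l_tilde, h), w))
    return softmax_rows(matmul(matmul(l_tilde, h), weights[-1]))

# Two connected nodes with self-loops: the re-normalized matrix is all 0.5.
l_tilde = [[0.5, 0.5], [0.5, 0.5]]
x = [[1.0, 0.0], [0.0, 1.0]]
weights = [[[1.0, 0.0], [0.0, 1.0]],   # hidden layer (hypothetical values)
           [[2.0, 0.0], [0.0, 0.0]]]   # label layer (hypothetical values)
probs = gcn_forward(l_tilde, x, weights)
```

Each additional entry in `weights` adds one propagation step, so a list of length $K$ reaches the $K$-hop neighborhood.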



Graph Sample and Aggregator (GraphSAGE), proposed in [Hamilton, Ying, and Leskovec2017], consists of 3 models built from different differentiable neighborhood aggregator functions. GraphSAGE models were defined for the multi-label semi-supervised inductive learning task, i.e., generalizing to nodes unseen during training. The aggregator function is one of {mean, max pooling, LSTM}, and we will refer to the corresponding models as GS-MEAN, GS-MAX and GS-LSTM, respectively. Similar to GCN, GraphSAGE models also recursively combine neighborhood information at each layer of the neural network. Unlike GCN, GraphSAGE has an additional label layer that follows the last graph convolutional layer.


GraphSAGE models, unlike GCN, are defined to work with partial neighborhood information. For each node, these models randomly sample and use only a subset of neighbors from different hops. This choice to work with partial neighborhood information allows them to scale to large graphs but restricts them from capturing the complete neighborhood information. Rather than viewing it as a choice, it can also be seen as a restriction imposed by the use of the Max Pool and LSTM aggregator functions, which require fixed input lengths to compute efficiently. Hence, GraphSAGE constrains the neighborhood subgraph of a node to contain a fixed number of neighbors at each hop.
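The fixed-size neighborhood idea can be sketched as follows for a mean aggregator with concatenation. This is an illustrative sketch, not GraphSAGE's reference code: the function names are hypothetical, and both sampling with replacement when a node has too few neighbors and falling back to the node itself when it has none are assumptions made for the example.

```python
import random

def sample_neighbors(adj_list, node, num_samples, rng):
    """Draw exactly num_samples neighbors so every node has a fixed-size input."""
    nbrs = adj_list[node]
    if not nbrs:
        return [node] * num_samples          # assumption: isolated node uses itself
    if len(nbrs) >= num_samples:
        return rng.sample(nbrs, num_samples)  # subset without replacement
    return [rng.choice(nbrs) for _ in range(num_samples)]  # with replacement

def mean_aggregate(features, sampled):
    """Element-wise mean of the sampled neighbors' feature vectors."""
    dim = len(features[sampled[0]])
    return [sum(features[n][d] for n in sampled) / len(sampled)
            for d in range(dim)]

def concat_node_and_neighbors(features, adj_list, node, num_samples, rng):
    """CONCAT combination: [node features, aggregated neighbor features]."""
    sampled = sample_neighbors(adj_list, node, num_samples, rng)
    return features[node] + mean_aggregate(features, sampled)

# Toy star graph centered at node 0 (hypothetical data).
adj_list = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 1.0], 3: [0.0, 1.0]}
rng = random.Random(0)
leaf_repr = concat_node_and_neighbors(features, adj_list, 1, 2, rng)
hub_repr = concat_node_and_neighbors(features, adj_list, 0, 2, rng)
```

Because the hub sees only two of its three neighbors per call, the aggregated vector is an estimate of the full-neighborhood mean, which is the scalability trade-off described above.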

Analysis of recursive propagation models

In this section, we first provide a unified formulation of GCN and GraphSAGE as recursively computed graph propagation models. Then, we analyze this unified formulation and show that they lack the representation capacity to regulate information from different hops independently.

Unified Recursive Graph Propagation Kernel

GCN and GraphSAGE differ from each other in terms of their node features, the neighborhood features, and the combination function. These differences can be abstracted to provide a unified formulation as below.


Where and denote the th hop node and it’s neighbors’ features respectively, denotes the scaling factor for node features, denotes the neighbors’ weights, and denotes the mode of combination of node and it’s neighbors’ features. For brevity, we have made the neighbors’ weighting function to be independent of .

We can view GCN in terms of the components in Eqn: 4 with node features, where = , neighbors’ features, = with = and combining by summation, .

Similarly, GraphSAGE can be seen to combine nodes’ features, with = and different based neighbors’ features by concatenation, = . Specifically, = for GS-MEAN, = for GS-MAX where

is a one hot vector with

in the position of the node with the maximum value for the th feature and for GS-LSTM, is defined by the LSTM gates which randomly orders neighbors and gives weightage for a neighbor in terms of neighbors seen before.

The concatenation combination (denoted by square braces below) can also be expressed in terms of a summation of node and neighbors features with different weight matrices, and

respectively by appropriately padding zero matrices, (

) as shown below.


The and terms for CONCAT and SUMMATION combinations are similar if weights are shared in the CONCAT formulation as shown in Eqn: 6 and Eqn: 7, respectively. Weight sharing refers to .


For brevity of analysis made henceforth, we only consider the summation model to discuss the limitations of the recursive propagation kernels without losing any generality on the deductions made. Further, we provide another abstraction to the summation formulation as in Eqn: 7 by Eqn: 8. Henceforth, we refer to Eqn: 8 as the generic recursive propagation kernel in the upcoming analysis.


Lack of independent regulatory paths to different hops

Though these propagation models can combine information from multiple hops, their formulation restricts them from independently regulating information from different hops. This is a consequence of recursively computing the $k$-th hop information in terms of the $(k-1)$-th hop information, which results in interdependence among the weights associated with the different hops. We can see this below in the recursively expanded unified graph kernel.


Let’s analyze this with an example of a -hop linear kernel with = and = which on expansion yields the following equation:


This expansion makes it trivial to note that all the weights influence the information from all the different hops in the model. For example, if only first-hop information is required, then there exists no combination of weights that can provide it under the current model. It should be noted that we cannot obtain the 1st hop information alone by using a 1-hop kernel either, as that would also include the 0-hop information, i.e., the node’s own features.
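To make the weight entanglement concrete, here is a hedged worked example. It assumes a 2-hop linear kernel of the form $h_k = (h_{k-1} + \hat{L} h_{k-1}) W_k$ with $h_0 = X$, where $\hat{L}$ stands for whichever normalized neighborhood operator the model uses; the exact kernel behind the original expansion may differ in its scaling terms.

```latex
\begin{align*}
h_1 &= (X + \hat{L} X)\, W_1 = (I + \hat{L})\, X\, W_1 \\
h_2 &= (h_1 + \hat{L} h_1)\, W_2 = (I + \hat{L})^2\, X\, W_1 W_2 \\
    &= \underbrace{X\, W_1 W_2}_{\text{0-hop}}
     + \underbrace{2\, \hat{L} X\, W_1 W_2}_{\text{1-hop}}
     + \underbrace{\hat{L}^2 X\, W_1 W_2}_{\text{2-hop}}
\end{align*}
```

Under this assumption every hop term is multiplied by the same product $W_1 W_2$, so no choice of weights can keep the 1-hop term while zeroing out the 0-hop and 2-hop terms.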

From the above analysis, we can say that these recursive graph kernels have limited representation capacity, as they cannot capture information from a particular subset of hops without including information from other hops. This limitation can be attributed to the specific formulation of recursion used to compute the output at every layer. With every layer of a graph convolutional net, new information about the $k$-th hop is introduced. More importantly, the output at the $k$-th layer passes through a series of computations involving later hops before reaching the last output layer, and the same holds for the previous layers. This leads to a lack of independent computation paths to regulate information from any hop without affecting information from later and earlier hops.

Inclusion of skip connections: Adding the popular skip connection [He et al.2016] to these models as in Eqn: 11 improves the multi-hop information regulation capacity.


On recursively expanding the above equation, it can be seen that adding skip connections to a layer results in directly adding information from all the lower hops, as shown in Eqn: 12. Unlike Eqn: 8, where the output at each layer depended only on the previous layer, amounting to a single computational path, adding skip connections allows for multiple computational paths. At the $k$-th layer, the model has the flexibility to select an output from any or all of the lower layers.

However, it can only discard information beyond a particular hop, and is still not sufficient to individually regulate the importance of information from individual hops, as all hops are inter-dependent. Let us consider the same example model as earlier, now trying to capture information from the 1st hop alone while ignoring the rest. The best the model with skip connections can do is to learn to ignore information from the higher hops by setting the corresponding weights to zero, which still includes the node’s own (0-hop) information along with the 1st hop. It can be reasoned as before that the first layer’s weights cannot be set to zero, because the 1st hop output depends on them, leaving no means to ignore the 0-hop information. This limits the expressive power to efficiently span the entire space of $K$-th order neighborhood information. To summarize, skip connections at best can obtain information up to a particular hop by ignoring information from subsequent hops. The CONCAT combination can be perceived as a linear skip connection, as noted by the authors of GraphSAGE.

Inclusion of different weights: As with the CONCAT model, the summation models can also be modified to have different weights to compute node and neighborhood features. We provide the analysis for such models with non-shared weights in the supplementary material. From that analysis, it can be seen that models with different weights are powerful enough to obtain the information of any particular hop while ignoring the rest, and can also obtain information from a contiguous subset of hops (i.e., hops $i$ through $j$) while ignoring the rest. However, when including information from two different hops $i$ and $j$ with $i < j$, information from all hops between $i$ and $j$ will also be included and cannot be ignored.

Proposed Methodology

In this section, we propose a simple yet effective extension to GCNs by adding a fusion component that allows them to capture multiple hop information effectively. We motivate and propose this component as a solution that will enable these graph kernels to span the entire space of the $K$-hop neighborhood. First, we show that the unified kernel is a binomial combination of a node’s and its neighborhood’s information. Thus, at each layer $k$, a $k$-th hop kernel is computed by a binomial combination. In light of this, we propose a simple fusion layer that learns to linearly combine information from these binomial bases to span the entire $K$-hop space.

Binomial basis

The $K$-hop unified propagation kernel defined in Eqn: 8 can be rolled out similar to Eqn: 9 and expressed as a $K$-th order binomial in terms of the node’s and its neighbors’ features for the linear activation case, as given in Eqn: 13.


The higher order binomial term in Eqn: 13 when expanded assigns different weights to different terms as seen in Eqn: 10. These weights correspond to the binomial coefficients of the binomial series, . For example, refer to Eqns: 14 and 15 corresponding to a -hop and -hop kernel with and for simplicity. It can be seen that for the -hop kernel the weights are and for the -hop kernel it is . Thus, these recursive propagation kernels combine different hop information weighed by the binomial coefficients.
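The binomial weighting can be checked numerically. The sketch below assumes a linear kernel with the weight matrices set to the identity, so that each step computes $h \leftarrow h + \mathrm{op} \cdot h$ for a neighborhood operator `op`; the small orders shown are illustrative choices, and all function names are hypothetical.

```python
from math import comb

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(c, a):
    return [[c * x for x in row] for row in a]

def recursive_kernel(op, k):
    """Apply h <- h + op @ h for k steps from the identity, giving (I + op)^k."""
    h = identity(len(op))
    for _ in range(k):
        h = add(h, matmul(op, h))
    return h

def binomial_sum(op, k):
    """sum_j C(k, j) * op^j, the binomial expansion of (I + op)^k."""
    total = scale(0, identity(len(op)))
    power = identity(len(op))
    for j in range(k + 1):
        total = add(total, scale(comb(k, j), power))
        power = matmul(power, op)
    return total

# Unnormalized adjacency of a path graph, used as the operator (hypothetical data).
op = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
hop_weights = {k: [comb(k, j) for j in range(k + 1)] for k in (2, 3)}
```

The recursion and the explicit binomial sum agree for every order, so the relative importance of each hop is fixed by the kernel design rather than learned.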


These weights induce a bias on the importance of each hop which again is a limitation of the kernel design. Any such fixed bias over different hops cannot consistently provide good performance across numerous datasets. In the limit of infinite data, we can expect the parameters to correct these scaling factors induced by these biases. However, as with most graph based semi-supervised learning applications where the amount of available labeled data for training is limited, an undesirable bias can result in a sub-optimal model.

Existing propagation kernels defined over $K$-hop information extract relational information by performing convolution operations on different $k$-hop neighbors based on their respective $k$-th order binomial. As discussed earlier, biasing the importance of information, along with recursive weight dependencies, hinders the model from learning relevant information from different hops. These limitations prevent the models from spanning the entire space of $K$-th order neighborhood information. Hence, they are restricted to only a subspace of all possible $K$-th order polynomials defined on the neighborhood of nodes.

Linear Fusion Component

To mitigate these issues with existing models, we propose a minimalistic addition to these models, a fusion component. This fusion component consists of parameters that combine the information from the binomial basis defined by the different hop information, so as to effectively span the entire space of a $K$-th order neighborhood.

We define the fusion component in Eqn: 16 as a linear weighted combination over the $K$-hop neighborhood space spanned by the binomial basis, i.e., the outputs of the different graph layers, with one learnable coefficient per basis element. The coefficients allow the neural network to explicitly learn the optimal combination of information from different hops. As the basis elements are binomials, a parameterized linear combination of these binomials can obtain any combination of the individual hop information.
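A small numeric sketch shows why a linear combination of binomial bases can isolate a single hop. It assumes one scalar coefficient per hop and a linear kernel with identity weights, so the bases are $X$, $(I + L)X$ and $(I + L)^2 X$ for an operator $L$; the names `fuse`, `bases`, and the toy data are hypothetical.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def fuse(bases, coeffs):
    """Linear fusion: element-wise sum_k coeffs[k] * bases[k]."""
    rows, cols = len(bases[0]), len(bases[0][0])
    return [[sum(c * b[i][j] for c, b in zip(coeffs, bases))
             for j in range(cols)] for i in range(rows)]

# Path graph operator and one feature per node (hypothetical data).
L = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[1], [2], [3]]

# Binomial bases produced by the recursive kernel: X, (I+L)X, (I+L)^2 X.
bases = [X]
for _ in range(2):
    prev = bases[-1]
    bases.append(add(prev, matmul(L, prev)))

# Negative coefficients subtract the unwanted hops.
first_hop_only = fuse(bases, [-1, 1, 0])    # (I+L)X - X = L X
second_hop_only = fuse(bases, [1, -2, 1])   # (I+L)^2 X - 2(I+L)X + X = L^2 X
```

The coefficients needed here are negative as well as positive, which is consistent with preferring an unconstrained linear layer over a strictly positive weighting scheme.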


Fusion Graph Convolutional Network

We define the Fusion Graph Convolutional Network (F-GCN) in Eqn: 17. F-GCN is a minimalistic architecture that adds the fusion component defined in Eqn: 16 to the GCN defined in Eqn: 1, so that information from different hops is combined through the fusion component. The fusion component, in the penultimate line of Eqn: 17, fuses label scores from each propagation step. The fused label scores are then normalized to make label predictions.


The dimensions of , , , , are , , , , respectively. F-GCN uses ReLU activations and a softmax label layer accompanied by a multi-class cross entropy for multi-class classification problem or a sigmoid layer followed by a binary cross entropy layer for multi-label classification problem. Since predictions are obtained from every hop, we also subject to a non-linear activation function with weights same as from .

The number of parameters in F-GCN is , where the first term is for GCN and the second is for the fusion component. GraphSAGE models with no shared weights have or for the summation and concat combination respectively which is more than F-GCN as typically. This simple fusion component with fewer parameters provides additional benefits to F-GCN besides explicitly allowing to capture different hop information. It provides for additional direct gradient flow paths to each of the propagation steps allowing it to learn better discriminative features at the lower hops too which also improves its chances of mitigating vanishing gradient. F-GCN can be seen as a multi-resolution architecture which simultaneously looks at information from different resolutions/hops and models the correlations among them.

The fusion component is similar in spirit to the Chebyshev filters introduced in [Defferrard, Bresson, and Vandergheynst2016] for the graph classification task. The primary difference is that the Chebyshev filters learn coefficients to combine Chebyshev polynomials defined over neighborhood information, whereas in F-GCN the coefficients of the filter are used to combine different binomials pertaining to different layers. Moreover, the Chebyshev polynomial basis is not associated with weights to filter $k$-th hop information, whereas such weights can potentially enable the model to learn a complex non-linear feature basis. F-GCN also enjoys the benefit of the re-normalization trick of GCN, which stabilizes learning and diminishes the effect of the vanishing or exploding gradient problems associated with training neural networks.

Note that existing attention mechanisms [Vaswani et al.2017] are typically defined for a positive combination of information. They have a restricted scoring range based on the activation functions used. In our case, since we needed a mechanism that can scale the amount of addition and subtraction of information from different binomial bases, we opted for the simple linear weighted combination layer.


We run detailed experiments to compare the performance of GCN, GraphSAGE, and F-GCN across many datasets. The end task is either semi-supervised multi-class or multi-label node classification. Links to code, datasets, and hyper-parameter details will be made available.


Experiments were conducted on eight publicly available datasets from social, citation, product, movie and biological domains. In Table: 1, network statistics of these datasets are given. Dataset details are provided below.

Dataset Network Nodes Edges Classes Multi-label Features
CORA Citation 2708 5429 7 FALSE 1433
CITE Citation 3312 4715 6 FALSE 3703
CORA_ML Citation 11881 34648 79 TRUE 9568
HUMAN Biology 56944 1612348 121 TRUE 50
BLOG Social 69814 2810844 46 TRUE 5413
FB Social 6302 73374 2 FALSE 2
AMAZON Product 16553 76981 2 FALSE 30
MOVIE Movie 7155 388404 20 TRUE 5297
Table 1: Dataset statistics

Biological network: We use the protein-protein interaction (PPI) network of human tissues, as presented in GraphSAGE [Hamilton, Ying, and Leskovec2017]. The dataset contains protein interactions from 24 human tissues. Positional gene sets, motif gene sets, and immunology signatures were used as features/attributes of the nodes. The ultimate task is to predict the gene’s functional ontology.

Movie network: The movielens-2k dataset available as a part of HetRec 2011 workshop [Cantador, Brusilovsky, and Kuflik2011] contains a large number of movies. The dataset is an extension of the MovieLens10M dataset with additional movie tags. The graph constructed based on it considers each movie as a node and a common actor as an edge. The goal of the task is to predict the genre of the movies.

Product network: We follow a set-up similar to [Moore and Neville2017] and consider the Amazon DVD co-purchase network. It is a subset of the co-purchase data, Amazon_060 [Leskovec and Sosič2016]. The nodes correspond to DVDs, and an edge is constructed if two DVDs are co-purchased. The DVD genres are treated as DVD features. The task here is to predict whether DVD sales will cross 7500 or not.

Social networks: We consider the BlogCatalog (BLOG) [Wang et al.2010] and Facebook (FB) [Pfeiffer III, Neville, and Bennett2015, Moore and Neville2017] datasets. The nodes in the BlogCatalog dataset represent the users of a social blog. Each user’s blog tags are treated as the attributes of the node, and relations based on friendship or fan following are represented as edges. The task is to determine the interests of users. Similarly, in the Facebook dataset, the nodes denote Facebook users, and their attributes consist of gender and religious view. The task here is to determine the political view of a user.

Citation Networks: In citation networks, the research articles are the nodes and citations are the edges. We use the Cora, Citeseer, and Cora_ML datasets from this domain. The node attributes are the bag-of-words representation of the article. Predicting the research area of the article is the main objective. Unlike the other two, which are multi-class datasets, Cora_ML is a multi-label classification dataset [Mccallum2001].

Dataset NODE GCN GS-MEAN GS-MAX GS-LSTM F-GCN
CORA 60.222 79.039 76.821 73.272 65.730 79.039
CITE 65.861 72.991 70.967 71.390 65.751 72.266
CORA_ML 40.311 63.848 62.800 53.476 OOM 63.993
HUMAN 41.459 62.057 63.753 65.068 64.231 65.538
BLOG 37.876 34.073 39.433 40.275 OOM 39.069
FB 64.683 49.762 64.127 64.571 64.619 64.857
AMAZON 63.710 61.777 68.266 70.302 68.024 74.097
MOVIE 50.712 39.059 50.557 50.569 OOM 52.021
Penalty 10.997 6.276 2.011 2.986 5.232 0.241
Table 2: Transductive Experiments
44.644 85.708 79.634 78.054 87.111 59.800 60.000 61.200 88.942
Table 3: Inductive learning experiment with the Human dataset. Our training setup gives better results than what was originally reported for the GraphSAGE models (Mean*, LSTM*, Max*).

Experiment Setup

We report the experiment results for semi-supervised learning on a held-out test set, populated by randomly sampling a fixed fraction of the dataset. We create the training sets by randomly sampling five sets of nodes from the entire graph. Further, a portion of these training nodes is chosen as the validation set. We do not use the validation set for (re)training. It is ensured that these training samples are mutually exclusive from the held-out test data. For all transductive experiments, the reported results are an average over these five different training sets. We report results for the inductive learning experiment with the same experimental setup as GraphSAGE on the Human PPI network, where the test nodes and validation nodes have no path to the nodes in the training set.
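The splitting protocol can be sketched as below. The fractions passed in the example call are placeholders chosen only for illustration, not the values used in the experiments, and the function name and return layout are hypothetical.

```python
import random

def make_splits(num_nodes, test_frac, train_frac, val_frac, num_sets=5, seed=0):
    """Hold out one test set, then draw several training sets from the rest.

    A slice of each training set is reserved for validation and is not
    used for training.
    """
    rng = random.Random(seed)
    nodes = list(range(num_nodes))
    rng.shuffle(nodes)
    n_test = round(test_frac * num_nodes)
    test, pool = nodes[:n_test], nodes[n_test:]   # pool excludes test nodes
    n_train = round(train_frac * num_nodes)
    n_val = round(val_frac * n_train)
    splits = []
    for _ in range(num_sets):
        chosen = rng.sample(pool, n_train)
        splits.append({"val": chosen[:n_val], "train": chosen[n_val:]})
    return test, splits

# Placeholder fractions (assumptions for illustration only).
test_nodes, train_sets = make_splits(
    num_nodes=100, test_frac=0.2, train_frac=0.1, val_frac=0.2)
```

Drawing every training set from the pool that excludes the test nodes is what keeps the training samples mutually exclusive from the held-out data.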

The hyperparameters for the models are the number of layers of the network (hops), the dimensions of the layers, the level of dropout for all layers, and L2 regularization, similar to [Kipf and Welling2016]. We set the same starting learning rate for all the models across all datasets. We train all the models for a maximum of 2000 epochs using Adam [Kingma and Ba2014] with the learning rate set to 1e-2. We use a variant of the patience method with learning rate annealing for early stopping. Precisely, we train the model for a minimum of 50 epochs, start with a patience of 30 epochs, and halve both the learning rate and the patience whenever the patience runs out (i.e., when the validation loss does not reduce within the patience window). We stop the training when the model loses patience for two consecutive turns. We added all these components to the baseline codes too, and we even observed an improvement for GraphSAGE on their dataset (Table 3).
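The stopping rule can be simulated over a sequence of per-epoch validation losses. This is a sketch under stated assumptions: the patience counter starts only after the minimum number of epochs, any improvement resets the count of consecutive patience losses, and the function name and return value are hypothetical.

```python
def annealed_early_stopping(val_losses, lr=1e-2, min_epochs=50,
                            patience=30, max_epochs=2000):
    """Return (stop_epoch, final_lr) for a stream of validation losses."""
    best = float("inf")
    wait = 0       # epochs without improvement in the current window
    strikes = 0    # consecutive times the patience ran out
    epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, wait, strikes = loss, 0, 0
        elif epoch > min_epochs:   # assumption: no counting before min_epochs
            wait += 1
        if wait >= patience:
            strikes += 1
            if strikes == 2:       # lost patience twice in a row: stop
                return epoch, lr
            # Halve both the learning rate and the patience window.
            lr, patience, wait = lr / 2, max(1, patience // 2), 0
    return epoch, lr

# A loss that never improves after the first epoch.
flat = annealed_early_stopping([1.0] * 200)
# A loss that improves at every epoch never triggers the rule.
improving = annealed_early_stopping([1.0 / (i + 1) for i in range(100)])
```

With a flat loss, the first window of 30 epochs ends at epoch 80 and the halved window of 15 epochs ends at epoch 95, where training stops at half the initial learning rate.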

For hyper-parameter selection, we search for the optimal setting on a two-layer deep feedforward neural network that uses the node attributes alone (NODE). We then use the same hyper-parameters across all the other models. We row-normalize the node features and initialize the weights following [Glorot and Bengio2010]. Since the percentage of different labels in the training samples can be significantly skewed, we weigh the loss for each label inversely proportional to its total fraction, as in [Moore and Neville2017]. We use CONCAT operations for all GraphSAGE models as in the original model, and also include skip connections for the rest of the models in all kernel layers.
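The inverse-fraction label weighting can be sketched for the multi-class case as follows. Normalizing the weights to sum to one and giving zero weight to labels absent from the training set are assumptions made for this example, and the function name is hypothetical.

```python
def inverse_fraction_weights(labels, num_classes):
    """Per-class loss weights proportional to 1 / (class fraction)."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    total = len(labels)
    # 1 / (count / total); absent classes get weight 0 (assumption).
    raw = [total / c if c > 0 else 0.0 for c in counts]
    norm = sum(raw)
    return [w / norm for w in raw]   # assumption: weights sum to 1

# Three training nodes of class 0 and one of class 1 (hypothetical data).
weights = inverse_fraction_weights([0, 0, 0, 1], num_classes=2)
```

The rare class ends up with three times the weight of the frequent one, which offsets the 3:1 skew in the training labels.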

We ensured that all models have the same setup in terms of the weighted cross entropy loss, the number of layers, dimensions, stopping criteria and dropouts. We evaluate the models on Micro-F1 scores similar to GraphSAGE [Hamilton, Ying, and Leskovec2017]. The best results for models across multiple hops were reported, which were typically 4 hops for Cora_ML and HUMAN, 3 hops for Amazon, 2 hops for Cora, Citeseer and FB and 1 hop for Movie and Blog datasets. We report results from different hops rather than a fixed number to be fair to baselines that dropped performance on specific datasets with increased hops.
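For reference, Micro-F1 pools true positives, false positives, and false negatives over all nodes and labels before computing a single F1 score. A minimal sketch for binary indicator matrices follows; the function name and the toy labels are hypothetical.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over binary indicator rows (one row per node)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Two nodes, three labels each (hypothetical data).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 0], [0, 1, 0]]
score = micro_f1(y_true, y_pred)
```

Because counts are pooled before averaging, frequent labels contribute more to the score than rare ones.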

Our implementation is mini-batch trainable, similar to GraphSAGE. We compare our model, F-GCN, against the node-only classifier, NODE; GCN; and all the GraphSAGE variants: GS-MEAN, GS-MAX, and GS-LSTM. The GS-LSTM model ran out of memory (OOM) on a few datasets, as marked in the table, owing to its high parameterization.

Experiment results

Among the baselines, GraphSAGE models, with more complex aggregator functions and no shared weights for node and neighborhood features, significantly outperform the GCN model, which has shared weights and a limiting node scaling factor. The GCN model performs poorly on datasets where the number of edges is high. This is primarily due to the effect of its node scaling factor in these high-degree datasets, where the nodes’ features are heavily under-weighted relative to their neighbors’ information. The effect of node scaling and biased importance is in agreement with the theoretical justification made in [Kipf and Welling2016] for the design of re-normalized GCNs over the mean model. However, their experimental benefits on the highly homophilous datasets, Cora and Citeseer, were not achievable on other datasets, as shown in Table 2. This can be observed on the Blog, Amazon, Movie, and FB datasets, where GCN performs worse than NODE, the classifier which only uses the node attributes, by about 3.8, 1.9, 11.7, and 14.9 percentage points, respectively.

F-GCN Vs. NODE: F-GCN significantly outperforms the NODE model across the board.

F-GCN Vs. GCN: F-GCN improves over GCN by in the inductive setup and outperforms GCN on six of the eight datasets in the transductive setup while being comparable on the other two. F-GCN appears to have learned to avoid the bias induced by the scaling factor by effectively combining the node features with additional hop information, resulting in improvements over GCN of up to percentage points on FB. On datasets where GCN's performance dropped below the NODE model, F-GCN not only recovered the drop but also further improved the performance by on Blog, Movie, and FB. Moreover, on Amazon, we observe a further point improvement over the node-only model. With the fusion component, the GCN model, which performed poorly among the propagation kernels, not only obtained a significant boost in performance but also achieved the best overall consistent score, with a penalty (average difference from the best) as low as .

F-GCN Vs. GraphSAGE: F-GCN outperforms the GraphSAGE variants on seven of the eight datasets while slightly underperforming on one. GraphSAGE models have higher flexibility than GCNs as they have no shared weights. This explains the significant experimental improvement obtained with concatenation, as noted in [Hamilton, Ying, and Leskovec2017]. F-GCN significantly improves over GraphSAGE models on the Human and Amazon datasets. On the Amazon dataset, though all of GraphSAGE's variants significantly outperformed the GCN and NODE models, F-GCN further improved over GraphSAGE by another . The GS-MEAN model, which is similar to GCN, improves over GCN and NODE by and respectively. This can be attributed to not scaling down the node information as GCN does. Despite that, F-GCN manages to further improve over GS-MEAN by . There is no single winner among the GraphSAGE models across datasets; different variants perform best on different datasets. Despite having a single simple aggregation function, F-GCN outperforms all of them combined except on Blog. This suggests that the flexibility to independently regulate information is necessary and that, irrespective of the complex aggregation functions used, the mild lack in representation capacity holds GraphSAGE back from achieving F-GCN's performance.

Overall, F-GCN improves over the state-of-the-art results on six datasets while being extremely competitive on the other two. Though the proposed fusion component is, in theory, an optimal solution for GCNs with linear activations, it also seems to be experimentally beneficial for GCNs with the piece-wise linear ReLU activations. Such generalization is not unrealistic in practice, as insights from a relaxed linear analysis are often observed to provide significant clarity and potential improvements on the non-linear front [Saxe, McClelland, and Ganguli2013, Kawaguchi2016, Hardt and Ma2016, Orabona and Tommasi2017].

F-GCN robustly captures multi-hop information: In real-life datasets, the amount of information in the interactions between a node and its neighbors varies with the hop distance. An ideal relational model should be able to efficiently capture relevant information while filtering out the increasing noise induced by the expanding neighborhood size with each hop. In Figure 1, we demonstrate F-GCN's capability to robustly capture information from multiple hops on different datasets with varied information patterns. We selected only those datasets that have highly relevant relational information for the classification task. These were the datasets that obtained a significant improvement over the node-only classifier with just the inclusion of the first hop information.

In the citation networks (Cora and Cora_ML), the performance seems to saturate after two hops, and in the Amazon co-purchase network it saturates after one hop. Despite that, F-GCN, with its capability to selectively regulate information from multiple hops, remains unaffected by the noise induced by considering additional hops.

In contrast, there is a significant increase in performance with the consideration of nodes’ higher order neighborhood interactions for the inductive experiment on the protein-protein interaction dataset (HUMAN). In the Human dataset, F-GCN was able to extract relevant information and achieve remarkable performance gain from further hops despite the dataset’s high average degree.

Figure 1: F-GCN performance with hops 1-4


We showed that differentiable recursive graph kernels are higher order binomial combinations of node and neighborhood information. Through analysis, we pointed out that such powerful recursively computed binomial functions lack the representation capacity to capture multi-hop neighborhood information effectively. Besides highlighting this critical issue, we also proposed a minimalist fusion component that can alleviate this issue. We empirically demonstrated the effectiveness of coupling the proposed fusion component with GCNs by significantly improving the performance of GCN and achieving highly competitive state-of-the-art results across eight datasets from different domains. For future work, we will incorporate the fusion component with GraphSAGE models and the recent Fast-GCN [Chen, Ma, and Xiao2018], which samples neighbors to reduce the computation complexity of GCNs.


Appendix A Appendix

Inclusion of Different weights

For convenience, we change the notations for weights associated with the node and the neighbor features to and , respectively. The representation capacity of the recursive graph kernel with different weights for the node and the neighbor features as in Equation 18 is better than a kernel with shared weights.


The flexibility of this formulation can again be explained with the binomial expansion. At every recursive step (), the model can be seen as deciding to use only the node information, only the neighbors' information, or both, by manipulating the weights ( and to be zero or non-zero values). For convenience, we say that the weights are set if they have non-zero values. The decisions of the model are easier to understand when visualized as a binomial tree where the nodes are labeled with the computation output and the edges are labeled with the decision taken, i.e. or . The left figure in Figure 2 illustrates the computation graph for a hop kernel with different weights.
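The binomial expansion can be illustrated with a small sketch under simplifying assumptions: scalar (rather than matrix) weights per layer and linear activations, so that each layer maps h to w0 * h + w1 * F(A) h. The output is then a polynomial in F(A) applied to the input features, and only its coefficients need to be tracked. The names `kernel_coefficients` and `path_sum_coefficients` are hypothetical and not taken from the paper.

```python
from itertools import product


def kernel_coefficients(w0, w1):
    """Coefficients c[k] of the F(A)^k X term after len(w0) recursive layers.

    Assumes scalar weights and linear activations, so layer l maps
    h -> w0[l] * h + w1[l] * F(A) h.
    """
    coeffs = [1.0]  # h_0 = X, i.e. only the 0-hop term
    for a, b in zip(w0, w1):
        nxt = [0.0] * (len(coeffs) + 1)
        for k, c in enumerate(coeffs):
            nxt[k] += a * c      # identity branch keeps the hop count
            nxt[k + 1] += b * c  # F(A) branch adds one hop
        coeffs = nxt
    return coeffs


def path_sum_coefficients(w0, w1):
    """Same coefficients, obtained by summing over all root-to-leaf paths."""
    num_layers = len(w0)
    coeffs = [0.0] * (num_layers + 1)
    for path in product((0, 1), repeat=num_layers):
        weight = 1.0
        for layer, step in enumerate(path):
            weight *= w1[layer] if step else w0[layer]
        coeffs[sum(path)] += weight  # number of 1s on the path = hop index
    return coeffs


# With every weight equal to 1, the coefficients are the binomial coefficients.
shared = kernel_coefficients([1.0] * 3, [1.0] * 3)  # [1.0, 3.0, 3.0, 1.0]
```

The agreement between the two functions is the tree view in miniature: each leaf of the binomial tree contributes the product of the weights along its path to the hop given by the number of F(A) steps taken.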

For any $K$-hop kernel with a recursive computational layer, there will be $2^K$ unique paths/decisions to make. The paths lead to leaf nodes which compute different terms. The $k$-hop terms are available at one or more leaves, where the multiplicity of availability is given by the number of ways to choose $k$ from $K$, i.e., $\binom{K}{k}$ (binomial coefficient). The $0$-hop and $K$-hop terms have only one path ($\binom{K}{0} = \binom{K}{K} = 1$), whereas the $k$-hop terms ($0 < k < K$) have more than one path.

Let string ‘0’ denote the identity transformation, and ‘1’ denote the F(A) transformation, . We say a transformation has happened if the weights associated with it have non-zero values. To comprehend the dependencies among weights, let us trace the weights along the different paths to leaf nodes. We create the tree on the right in Figure 2 from the output computation tree on the left by relabeling nodes with substrings representing the transformation taken to reach that node. For example, a node labeled ‘01’ indicates that the node was reached by taking an identity transformation followed by an F(A) transformation. Hence, the number of 0s and 1s at each leaf node conveys the pattern to compute each hop information for a hop kernel.
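The relabeling described above can be sketched in a few lines. This is an illustrative enumeration, with hypothetical helper names `leaf_labels` and `tabulate`; it builds the string labels level by level and tallies the leaves by their number of 0s and 1s.

```python
from collections import Counter


def leaf_labels(num_hops):
    """Labels of the leaves of the binomial tree, built level by level."""
    level = [""]  # the root has taken no transformation yet
    for _ in range(num_hops):
        # every node branches into an identity ('0') and an F(A) ('1') child
        level = [label + step for label in level for step in "01"]
    return level


def tabulate(num_hops):
    """Map (number of identities, number of F(A) steps) -> number of leaves."""
    return Counter(
        (label.count("0"), label.count("1")) for label in leaf_labels(num_hops)
    )


table = tabulate(3)
for (n_identity, n_transform), n_paths in sorted(table.items(), reverse=True):
    print(n_identity, n_transform, n_paths)
# prints:
# 3 0 1
# 2 1 3
# 1 2 3
# 0 3 1
```

For a depth of three, the printed rows match the entries of Table 4.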

In Table 4, we tabulate the number of identities, and the number of transformations, taken to obtain different hop information at the leaves for a -hop kernel. in the table denote the number of paths to compute the same. With the example in the table, we generalize the following claims to any hops.

  • computation requires identity transformations () and transformations ().

  • All computation has a unique combination of and . From this, we can say that the model can learn to obtain any specific hop information without the inclusion of any information from the rest of the hops unlike shared weight models, where all the computations shared the same path.

  • We cannot independently regulate information from two or more required hops without the inclusion of information from the other hops lying within the range of the required set. This is a consequence of sharing weights among the computation paths as seen in the Figures.

Figure 2: Binomial Computation Trees for Graph Kernels

Leaving out the first two trivial claims, we analyze the last claim. Let define the set of all required hops in a hop graph kernel. Let and be the minimum and the maximum hops in that set. For the hop, and should be set (non-zero values) and similarly for the hop and should be set. and hop information can be obtained by traversing any path that satisfies the previously mentioned conditions on the number of identity and transformations. Thus, put together and will be set to obtain and when traversed along one path in the computation tree, to the leaf from the root.

Since the model sums up all the leaf nodes, it will also include information from those leaf nodes which can be traversed from the root by following the set s and s. We go about the proof by first formulating the condition under which other hops’ information can be obtained. Then, we go on to show that if then additional hop information will be included.

Let denote hops, with . Computing the hop requires and . can be obtained only if and , which essentially means that should be a subset of and should be a subset of which have already been set while considering and hop information.

We then find the possible values under the following three conditions listed below:

  • : As , the required number of for hop is available whereas the required number of are not as . Hence information from is not included.

  • : As , the required number of are unavailable though the required number of can be satisfied as .

  • : As , the required number of are satisfied and so is the required number of as .

This can be clearly seen from the example on the hop kernel presented in Table 4. When the weights along the unique path for and hop information are set, it sets up all the weights. As hop information is obtained by applying the identity transformation at every layer and hop information is obtained by applying transformations at every layer, all and are set, which necessarily ends up including all the other hop information as all the weights are active. Similarly, it can be shown that when hops and are included, no information from hops and is included.
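The last claim can be checked exhaustively on a small kernel. The sketch below assumes one identity weight and one F(A) weight per layer, shared by all paths through that layer, and tracks only whether each weight is set. The helper names `reachable_hops` and `weights_for_paths` are hypothetical.

```python
from itertools import product


def reachable_hops(set_w0, set_w1):
    """Hops whose leaves are reachable when only the flagged weights are set.

    set_w0[l] / set_w1[l] say whether the identity / F(A) weight of layer l
    is non-zero. A leaf contributes only if every step on its path uses a
    set weight.
    """
    num_layers = len(set_w0)
    hops = set()
    for path in product((0, 1), repeat=num_layers):
        if all(set_w1[l] if step else set_w0[l] for l, step in enumerate(path)):
            hops.add(sum(path))  # number of F(A) steps = hop index
    return hops


def weights_for_paths(paths):
    """Flags of the weights that must be set to traverse all the given paths."""
    num_layers = len(paths[0])
    set_w0 = [any(p[l] == 0 for p in paths) for l in range(num_layers)]
    set_w1 = [any(p[l] == 1 for p in paths) for l in range(num_layers)]
    return set_w0, set_w1


# Both extreme leaves of a depth-3 tree: their unique paths set every weight.
both_ends = reachable_hops(*weights_for_paths([(0, 0, 0), (1, 1, 1)]))
# One and two F(A) steps via paths 100 and 110: no other leaf type is reachable.
middle = reachable_hops(*weights_for_paths([(1, 0, 0), (1, 1, 0)]))

# Exhaustive check over all weight patterns: the reachable hops always form
# a contiguous range between the minimum and the maximum reachable hop.
contiguous = all(
    hops == set(range(min(hops), max(hops) + 1))
    for flags in product((False, True), repeat=6)
    for hops in [reachable_hops(flags[:3], flags[3:])]
    if hops
)
```

Under these assumptions, `both_ends` contains every hop from 0 to 3, `middle` contains only hops 1 and 2, and `contiguous` is true, which is consistent with the claim that hops lying between the minimum and maximum required hops cannot be excluded.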

Identity transformations   F(A) transformations   Paths
3                          0                      1
2                          1                      3
1                          2                      3
0                          3                      1
Table 4: Number of Identity and F(A) transformations