1 Introduction
Graphs, such as social networks, biological networks, and citation networks, are ubiquitous data structures that capture interactions between individual nodes (Hamilton et al., 2017b). Nodes in graphs are often associated with feature vectors. For example, in a typical citation graph, nodes are documents, edges are citation links, and node features are bag-of-words feature vectors. This paper focuses on analyzing such graphs with node features available.
Graphs are challenging to deal with (Shaw & Jebara, 2009). Most real-world graphs have no regular connectivity, and node degrees can range from one to hundreds or even thousands within the same graph. Moreover, graphs carry rich and important information that cannot be revealed by analyzing individual nodes or structure information alone. For example, sparse bag-of-words feature vectors cannot effectively reflect the citations between papers. To understand complex graphs better, it is important to learn graph representations that capture rich information from both node feature vectors and graph structure (i.e., node neighborhood information).
Kipf & Welling (2017) proposed graph convolutional networks (GCN) as an effective graph representation model that naturally combines structure information and node features in the learning process. It represents a node by aggregating all the feature vectors of its neighbors, analogous to the receptive field of a convolutional kernel in convolutional neural networks (CNN). It has proved powerful in many applications, including node classification, link prediction, and recommendation (Kipf & Welling, 2017; Schlichtkrull et al., 2018; Ying et al., 2018). However, GCN aggregates all neighbors and considers neither whether the central node has a dense or sparse connection, nor whether a neighbor's individual features are useful in the aggregation process.

There are two existing approaches to address these two problems of GCN. The first is the sampling-based approach. Instead of considering all neighbors, this approach samples a fixed-size set of neighbors so that the neighborhood better resembles that in CNN. Hamilton et al. (2017b) proposed GraphSAGE, which randomly samples a fixed-size set of neighbors by random walk. FastGCN by Chen et al. (2018) and jumping knowledge networks (JK-networks) by Xu et al. (2018) sample nodes from the whole graph rather than from the neighborhood, aiming to improve efficiency. However, these methods do not weight the selected nodes or features in the aggregation process either.
The second approach focuses on learning to weight neighbor feature vectors, instead of simply aggregating them. Inspired by the attention mechanism (Bahdanau et al., 2015), Velickovic et al. (2018) proposed graph attention networks (GAT), which use a 1D convolutional layer to learn different weights of neighbors for aggregation. However, in GAT, each feature of a feature vector shares the same weight, i.e., the usefulness of individual features is not considered, and useful features are weighted the same as less useful ones. Recently, Gao et al. (2018) proposed learnable graph convolutional networks (LGCN). LGCN applies a learnable graph convolutional layer after a graph embedding layer (a GCN layer), then uses two 1D convolutional layers to perform convolution on features from the GCN layer. The feature map consists of the central node's embedding and reorganized neighbors' embeddings, rather than the original features. Since the learnable convolutional layer operates after the GCN layer, it inherits the limitations of GCN above.
As a powerful representation learning method, CNN works successfully on fixed-size grid (e.g., image) or sequence (e.g., sentence) datasets to tackle problems such as image classification (Krizhevsky et al., 2012; Karpathy et al., 2014), semantic segmentation (Girshick et al., 2014), and machine translation (Bahdanau et al., 2015). GCN and all the extensions above aim to apply CNN-like convolutional operations on graphs. They have made progress in performing convolutional operations on node representations, using the neighborhood for node representation (to imitate the receptive field). However, this differs from the convolution in CNN, where the convolution operation works on features; GCN and its extensions just use the connectivity structure of the graph as the receptive field to perform neighborhood aggregation.
In this paper, we propose a novel GCN extension named graph convolutional networks with node-feature convolution (NFC-GCN), aiming to further generalize the concepts in CNN to graphs. In particular, NFC-GCN applies 1D or 2D convolution to a 2D node-feature map to learn a new representation of the central node. NFC-GCN has three steps of operation, as shown in Fig. 1.

Neighbor sampling: We randomly sample a fixed-size set of neighbors for each central node so that every node has regular connectivity.

Node-feature convolution (NFC): After sampling the neighbors, we propose a node-feature convolutional layer that learns different weights for a 2D node-feature map, producing the first-level node representation via convolution and flattening.

Standard GCN: We feed the learned first-level representation to a standard GCN to learn a second-level node representation.
Therefore, the proposed NFC-GCN embodies ideas from the sampling-based GCN methods (Hamilton et al., 2017b; Chen et al., 2018; Xu et al., 2018) and extends the ideas of the 1D convolutional layer in GAT (Velickovic et al., 2018), while also keeping the virtues of the original, standard GCN (Kipf & Welling, 2017).
Our key idea is the node-feature convolution step, which introduces a convolutional layer that works on a 2D feature map constructed directly from the feature vectors of the central node and its sampled neighbors. This layer enables end-to-end learning of weights for different features from different neighbors. To reduce the model complexity (i.e., the number of model parameters) and reduce overfitting, we keep the filter size small, use multiple filters, and share a filter's parameters across all nodes in the graph. This makes the convolution in the NFC layer resemble the convolution of CNN on images more closely than the convolution in other GCN methods does.
Experiments on three citation graph datasets show that NFC-GCN outperforms existing GCN methods across all three datasets. In addition, it converges in fewer training epochs. Furthermore, we studied deeper models of NFC-GCN and GCN with up to five GCN layers. The results show that NFC-GCN has much smaller performance variation than GCN. This encourages the exploration of deep learning models for graphs, an area with little progress so far.
2 Related Work
In this section, we review and discuss related work on graph representation learning, particularly GCN and its extensions.
2.1 Notations
In this paper, we consider graphs with a feature vector associated with each node. Let $G = (V, E)$ denote an undirected graph with $N$ nodes $v_i \in V$ and edges $(v_i, v_j) \in E$, with an adjacency matrix $A \in \mathbb{R}^{N \times N}$ and a feature matrix $X \in \mathbb{R}^{N \times F}$ containing the $F$-dimensional feature vectors. We first define a list of important notations used throughout this paper, as shown in Table 1.
| Symbol | Definition |
| --- | --- |
| $G = (V, E)$ | Graph with node features |
| $v_i$ | Node |
| $A$ | Adjacency matrix for the network |
| $X$ | Matrix of node features |
| $x_i$ | Feature vector for $v_i$ |
| $c_i$ | First-level representation of $v_i$ |
| $b$ | Fixed node bandwidth via neighbor sampling |
| $M_i$ | Reconstructed feature map for $v_i$ |
| $H^{(l)}$ | Hidden representation of the $l$-th GCN layer |
| $Y$ | Label indicator matrix |
2.2 Graph Representation Learning
Traditional machine learning methods on graphs are task-dependent and require feature engineering. In contrast, the more data-driven graph representation learning approach aims to learn task-independent representations that can capture rich information in graphs for various downstream tasks such as node classification, link prediction, and recommendation. For graphs with associated features as defined above, the graph representation is learned from both the structure information defined by the nodes $V$ and edges $E$, and the features $X$.
Hamilton et al. (2017b) categorize graph representation learning methods into three approaches: factorization-based, random walk-based, and neural network-based.

Random walk-based methods. Inspired by the Word2Vec method (Mikolov et al., 2013), Perozzi et al. (2014) proposed DeepWalk, which generates random paths over a graph. It learns the new node representation by maximizing the co-occurrence probability of the neighbors in the walk. Node2vec (Grover & Leskovec, 2016) and LINE (Tang et al., 2015) extend DeepWalk with more sophisticated walks. Planetoid learns the embedding from both labels and graph structure by injecting label information (Yang et al., 2016).
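To make the random-walk idea concrete, a DeepWalk-style walk generator can be sketched as follows (a minimal sketch with function and variable names of our own choosing; a full implementation would feed these walks, as "sentences", into a Word2Vec-style skip-gram model):

```python
import random

def generate_walks(neighbors, walk_len, walks_per_node, seed=0):
    """Generate uniform random walks over a graph given as an adjacency
    list; each walk starts at a node and repeatedly moves to a uniformly
    chosen neighbor of the current node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in neighbors:
            walk = [start]
            while len(walk) < walk_len and neighbors[walk[-1]]:
                walk.append(rng.choice(neighbors[walk[-1]]))
            walks.append(walk)
    return walks

# toy triangle graph: two walks of length 4 per node
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
walks = generate_walks(adj, walk_len=4, walks_per_node=2)
```

Co-occurrences of nodes within a window over these walks then play the role of word co-occurrences in Word2Vec.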
Neural network-based methods. Graph neural networks (GNNs) were introduced by Gori et al. (2005) and Scarselli et al., using an iterative process that propagates node states until the node representations reach a stable fixed point. More recently, several improved methods have been proposed. Li et al. (2016) introduced gated recurrent units (Cho et al., 2014) into the propagation step to alleviate this restriction. Duvenaud et al. (2015) further introduced a convolution-like propagation rule on graphs, which does not scale to large graphs with wide node degree distributions.
2.3 Graph Convolutional Networks (GCN)
The above graph representation methods mainly consider the graph structure (node and edge) information but do not use the node feature matrix in the learning process. Kipf & Welling (2017) proposed graph convolutional networks (GCN) as an effective graph representation model that can naturally combine structure information and node features in the learning process. It is derived from graph convolution in the spectral domain (Bruna et al., 2014; Cho et al., 2014). It represents a node by aggregating feature vectors from its neighbors (including itself), which is similar to the convolution operation in CNN. The propagation rule of GCN can be summarized by the following expression:
$$H^{(l+1)} = \sigma\big(\hat{A}\, H^{(l)}\, W^{(l)}\big), \quad (1)$$

where

$$\hat{A} = \tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2} \quad (2)$$

is the normalized adjacency matrix of the undirected graph with added self-connections $\tilde{A} = A + I_N$. $I_N$ is the identity matrix and the diagonal entries of $\tilde{D}$ are $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. $W^{(l)}$ is a layer-specific trainable weight matrix, $\sigma(\cdot)$ denotes an activation function such as ReLU, and $H^{(l)}$ is the matrix of activations in the $l$-th layer, with $H^{(0)} = X$ the node feature matrix. In Eq. (1), $\hat{A} H^{(l)}$ can be treated as a weighted average (with weights determined by the node degrees) of the feature vectors of each central node and all its neighbors:

$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{1}{\sqrt{\tilde{d}_i \tilde{d}_j}}\, h_j^{(l)}\, W^{(l)}\Big), \quad (3)$$

where $\mathcal{N}(i)$ is the neighborhood of $v_i$ and $\tilde{d}_i = \tilde{D}_{ii}$. Hence Eq. (1) defines a GCN layer consisting of two steps: 1) averaging the feature vectors of the central node and its neighbors with weights determined by node degrees, then 2) feeding the averaged feature vector to a fully-connected network to get a new representation.
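The two steps above can be made concrete with a small NumPy sketch of a single GCN layer (a minimal dense sketch, assuming a ReLU activation; practical implementations use sparse matrix operations and the names below are our own):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: sigma(A_hat @ H @ W), where A_hat is the
    symmetrically normalized adjacency with added self-connections."""
    A_tilde = A + np.eye(A.shape[0])                 # A + I_N
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))  # D_tilde^{-1/2} entries
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)            # ReLU activation

# toy example: 3-node path graph, 2-d features, 2 hidden units
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.eye(3, 2)
W = np.ones((2, 2))
out = gcn_layer(A, H, W)
```

Step 1 corresponds to the product `A_hat @ H` (degree-weighted averaging) and step 2 to the multiplication by `W` followed by the nonlinearity.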
GCN has significantly advanced the state of the art in graph representation learning, particularly for semi-supervised node classification. However, there are still two major limitations.

Neighborhood selection/weighting. Equation (3) shows that GCN learns the new node representation from the features of all its neighbors, without considering whether the node has dense or sparse connections. Real-world graphs can have node degrees ranging from one to hundreds or even thousands. Therefore, some nodes may need more neighbors to get sufficient information, while others may aggregate too broadly, such that their own features are "washed out" by aggregating too many neighbors in Eq. (3).

Feature selection/weighting. GCN does not select or weight the features within a feature vector. As a result, noisy features can be aggregated into the new representation, which can confuse the classifier and reduce the classification accuracy.
2.4 GCN Extensions
There are two major approaches to deal with the two problems mentioned above: sampling-based methods and feature convolution-based methods.

Sampling-based methods. These methods aim to obtain a fixed number of neighbors for each node, to get closer to the CNN setting of a fixed neighborhood size. GraphSAGE (Hamilton et al., 2017a) uniformly samples a fixed number of neighbors, instead of using all of them. These neighbors are generated by a fixed-length random walk and can come from different numbers of hops, or search depths, away from a central node. Another sampling-based algorithm is FastGCN (Chen et al., 2018). It interprets graph convolutions as integral transforms of embedding functions and directly samples the nodes in each layer independently. JK-networks (Xu et al., 2018) sample nodes for the central node from the whole graph rather than from its neighbors.

Feature convolution-based methods. GCN aggregates the feature vectors of the central node and all its neighbors, where each neighbor is treated differently according to its node degree. Inspired by attention mechanisms (Bahdanau et al., 2015), GAT (Velickovic et al., 2018) learns weights for different neighbors by computing the correlation between the central node's and neighbors' feature vectors via convolution. However, all features in a feature vector share the same weight; useful and less useful features are treated equally, without considering their different importance. LGCN (Gao et al., 2018) takes GCN features as input and uses two 1D convolutional layers to perform convolution on the GCN features of the central node and reorganized features selected from its neighbors. Thus, LGCN inherits the limitations of GCN, and the original features are still aggregated over all neighbors without selection due to the GCN layers in front.
2.5 CNN Revisited
The convolution operation in GCN and its extensions is inspired by that in CNN. Aggregating neighboring node information defined by connectivity is similar to aggregating neighboring pixels in receptive fields for images. While successful on images, CNN relies on grid-like structures, which lack a clear definition in graphs, and on filters for convolution operations over the receptive field. In images, most pixels have regular neighbors within the defined receptive field; in graphs, the number of neighbors per node varies greatly. The sampling-based GCN methods address this problem by sampling a fixed number of neighbors for each node, which helps define a fixed-size "receptive field" for graphs. The feature convolution-based methods employ a 1D convolutional layer on single-node features. This motivated our investigation of a convolutional layer on node-feature maps in graphs.
3 The Proposed NFC-GCN
In this section, we present our proposed approach: graph convolutional networks with node-feature convolution (NFC-GCN). Our method combines ideas from standard GCN as well as its sampling-based and feature convolution-based extensions, which enables us to design convolution operations on feature maps constructed from the feature vectors of the central node and its sampled neighbors. This operation makes further progress in imitating CNN on graphs.
NFC-GCN has three steps: neighbor sampling, node-feature convolution, and standard GCN, as shown in Fig. 1. First, neighbors are sampled for each node up to a fixed node bandwidth, which effectively creates a 2D node-feature map that enables the application of a convolutional layer. Then, the convolutional layer works on the 2D node-feature map to get fixed-size representations, which are flattened into a vector as the first-level NFC representation. Next, this NFC representation is fed into a standard GCN to get the second-level NFC-GCN representation, which can be used for downstream tasks such as node classification, link prediction, or node recommendation. The workflow is shown in Fig. 2, and Algorithm 1 gives the pseudocode for one training epoch, with details for each step discussed below.
3.1 Neighbor Sampling
For subsequent convolution operations, we need a fixed-size feature map for each node in a graph. Since node degrees vary greatly across nodes, we use sampling to tackle this problem. For computational simplicity, we use simple random sampling with a uniform distribution to select neighbors for each node, although more advanced sampling techniques such as those in (Niepert et al., 2016; Hamilton et al., 2017a; Chen et al., 2018) can be applied in future work. When the number of neighbors is less than the desired size, we duplicate the central node.

Feature map. Let $b$ denote the desired node bandwidth for each node and $d_i$ denote the number of neighbors of node $v_i$, i.e., its degree. For each node $v_i$, we construct a local feature map $M_i$ consisting of the $F$-dimensional feature vectors of $b$ nodes: the central node and its $b-1$ sampled neighbors,
$$M_i = \big[x_i, x_{i_1}, \ldots, x_{i_{b-1}}\big]^{\top}, \quad (4)$$

where $\{i_1, \ldots, i_{b-1}\}$ indexes the selected neighbors of node $v_i$, and $x_i$ and $x_{i_k}$ represent the feature vectors of the central node and its $k$-th sampled neighbor, respectively. On the whole, this leads to a virtual 2D feature map of size $Nb \times F$ for the whole graph, which resembles a 2D image.
In practice, sparsely connected nodes may have fewer than $b-1$ neighbors. In this case, we use all existing neighbors and duplicate the central node to reach the desired size $b$ for the local feature map $M_i$. We will analyze the effect of $b$ on node classification performance in Section 4.
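The sampling-and-padding rule above can be sketched as follows (a minimal sketch; the function name and data layout are our own choices, not the authors' implementation):

```python
import random

def local_feature_map(i, neighbors, X, b, seed=0):
    """Build the local feature map M_i for node i: the central node's
    feature vector plus b-1 neighbor vectors, sampled uniformly without
    replacement; when fewer than b-1 neighbors exist, the central node
    is duplicated to pad the map to b rows."""
    rng = random.Random(seed)
    nbrs = neighbors[i]
    if len(nbrs) >= b - 1:
        picked = rng.sample(nbrs, b - 1)           # uniform, no replacement
    else:
        picked = list(nbrs) + [i] * (b - 1 - len(nbrs))  # pad with center
    return [X[j] for j in [i] + picked]            # b rows of F features

# node 0 has one neighbor, so the map is padded with a copy of node 0
X = {0: [1.0, 0.0], 1: [0.0, 1.0]}
M0 = local_feature_map(0, {0: [1], 1: [0]}, X, b=3)
```

Stacking the maps of all $N$ nodes yields the virtual $Nb \times F$ feature map described above.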
3.2 NodeFeature Convolution
The goal of the second stage is to learn to aggregate the neighbors and the central node and obtain new node representations that facilitate the subsequent classification tasks. After obtaining a fixed-size local feature map $M_i$ for each node in the first stage, it is natural to introduce convolutional operations that assign different weights to different features of different neighbors during the aggregation process, as shown in Fig. 1, because the local feature map has a fixed-size grid-like structure. Specifically, we apply a 1D convolutional layer to the local feature map of each node:
$$c_i' = \mathrm{Conv1D}(M_i). \quad (5)$$

The number of input channels is set to $b$. The output is shaped as $K \times n$, where $n$ is determined by the filter size $s$ and stride $t$, and $K$ is the number of filters. These three convolutional parameters, $s$, $t$, and $K$, are also hyperparameters of our approach. Alternatively, we can use a 2D convolutional layer on each local feature map. In this case, the number of input channels is set to 1 and 2D filters slide over each feature map. After either the 1D or the 2D convolutional operation, we flatten the output into a vector, as shown in Eq. (6), to serve as the new node representation for subsequent classification tasks:

$$c_i = \mathrm{flatten}(c_i'). \quad (6)$$
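A NumPy sketch of the 1D case, with each filter spanning all $b$ rows and sliding along the feature dimension before flattening (a minimal sketch; the hyperparameter names follow the text, the rest are our own, and we assume a ReLU nonlinearity):

```python
import numpy as np

def nfc_layer(M, filters, stride):
    """Node-feature convolution: M is the (b, F) local feature map,
    treated as b input channels of a length-F signal; `filters` has
    shape (K, b, s) for K filters of size s. The ReLU responses are
    flattened into the first-level representation c_i."""
    K, b, s = filters.shape
    F = M.shape[1]
    n = (F - s) // stride + 1                  # output length per filter
    out = np.empty((K, n))
    for k in range(K):
        for t in range(n):
            window = M[:, t * stride : t * stride + s]   # (b, s) patch
            out[k, t] = max((filters[k] * window).sum(), 0.0)
    return out.reshape(-1)                     # flatten, as in Eq. (6)

M = np.arange(12.0).reshape(3, 4)              # b = 3 rows, F = 4 features
c = nfc_layer(M, filters=np.ones((2, 3, 2)), stride=2)
```

Because a filter weight multiplies a specific row (node) and column (feature) of the window, different features of different neighbors receive different learned weights.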
3.3 GCN Layers
The last stage is straightforward: we directly feed the learned node representation vectors $c_i$ into GCN layers. The GCN layers aggregate these vectors over each node's neighborhood to learn the new representation of node $v_i$, which can be written as:

$$H^{(1)} = \sigma\big(\hat{A}\, C\, W^{(0)}\big), \quad (7)$$

where $C$ is the matrix stacking the first-level representations $c_i$, and $H^{(1)}$ is the representation after the first GCN layer.
After the GCN layers, the final representation is passed to a one-layer neural network with a softmax activation function. For multi-class classification, the loss function is defined as the cross-entropy error over all labeled examples:

$$\mathcal{L} = -\sum_{i \in \mathcal{Y}_L} \sum_{f=1}^{C} Y_{if} \ln Z_{if}, \quad (8)$$

where $\mathcal{Y}_L$ is the set of node indices that have labels, $C$ is the dimension of the output features, equal to the number of classes, $Y$ is the label indicator matrix, and $Z$ is the softmax output of the network.
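The masked loss can be sketched as follows (a minimal NumPy sketch; we assume $Z$ already holds softmax outputs, and the small epsilon is our own numerical-stability addition):

```python
import numpy as np

def masked_cross_entropy(Z, Y, labeled_idx):
    """Cross-entropy summed over labeled nodes only, as in Eq. (8):
    Z is the (N, C) matrix of softmax outputs, Y the (N, C) one-hot
    label indicator matrix, labeled_idx the labeled node indices."""
    Zl, Yl = Z[labeled_idx], Y[labeled_idx]
    return -np.sum(Yl * np.log(Zl + 1e-12))    # epsilon avoids log(0)

Z = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
loss = masked_cross_entropy(Z, Y, labeled_idx=[0, 1])  # node 2 unlabeled
```

Only the labeled nodes contribute to the gradient, which is what makes the training semi-supervised.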
| Dataset | Nodes | Edges | Features | Classes | Training/Validation/Test (Chen et al., 2018) |
| --- | --- | --- | --- | --- | --- |
| Cora | 2,708 | 5,429 | 1,433 | 7 | 1,208/500/1,000 |
| Citeseer | 3,327 | 4,732 | 3,703 | 6 | 1,827/500/1,000 |
| PubMed | 19,717 | 44,338 | 500 | 3 | 18,217/500/1,000 |
3.4 Relationship with Highly Related Work
Both our method and GAT add one layer before GCN. In Eq. (7), the input of the GCN stage in plain GCN can be treated as the raw feature matrix $X$. In GAT, the input is no longer the raw feature vector but an embedding of it that still contains only the node's own features. In our method, $c_i$ is a higher-level representation that contains information from the node and part of its neighbors (the local graph structure). Moreover, in the GCN stage, the input of our model has been carefully selected by the filters in the node-feature convolution stage, while GCN and GAT consider all the neighbors without any node or feature selection.
| Dataset | Cora | Citeseer | PubMed |
| --- | --- | --- | --- |
| Input | 2708 × 1433 × 6 × 1 | 3327 × 3703 × 6 × 1 | 19717 × 500 × 6 × 1 |
| Convolutional layer | filter = 64, stride = 16 | filter = 64, stride = 32 | filter = , stride = 16 |
| GCN layer 1 | 16 | 16 | 32 |
| GCN layer 2 | 7 | 6 | 3 |
| Classifier layer | 7 | 6 | 3 |
4 Experiments
In this section, we evaluate our models against a variety of strong baselines and previous approaches on three citation networks: Cora, Citeseer, and PubMed. We first describe the datasets, then list the comparison methods, and finally present the experimental results and analyze the advantages and limitations of our method.
4.1 Datasets
We utilize three citation network benchmark datasets—Cora, Citeseer and PubMed (Sen et al., 2008), with the same train/validation/test splits in (Chen et al., 2018), as summarized in Table 2. Detailed descriptions are given below.

Cora The Cora dataset contains 2,708 documents (nodes) classified into 7 classes and 5,429 citation links (edges). We treat the citation links as (undirected) edges and construct a binary, symmetric adjacency matrix. Each document has a 1,433-dimensional sparse bag-of-words feature vector and a class label.

Citeseer The Citeseer dataset contains 3,327 documents classified into 6 classes and 4,732 links. Each document has a 3,703-dimensional sparse bag-of-words feature vector and a class label.

PubMed The PubMed dataset contains 19,717 documents classified into 3 classes and 44,338 links. Each document has a 500-dimensional sparse bag-of-words feature vector and a class label.
4.2 Experimental Setup
We train a three-layer model (one convolutional layer and two GCN layers) and evaluate prediction accuracy on the Cora, Citeseer, and PubMed datasets. We use the dataset splits shown in Table 2, the same as in FastGCN (Chen et al., 2018). We choose a larger proportion of training data because our model has more parameters to learn than GCN. We use an additional validation set of 500 labeled examples for hyperparameter optimization. Throughout the experiments, we use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.002 for Cora and Citeseer and 0.01 for PubMed. We fix the dropout rate at 0.5 for the hidden layers' inputs and add an L2 regularization of 0.0001. We employ an early stopping strategy based on validation accuracy and train for at most 200 epochs. We choose 5 neighbors for each central node; the remaining parameters for our method with a 1D convolutional layer are summarized in Table 3. We also experiment with a 2D convolutional layer in the node-feature convolution stage, setting the filter width to 3 and keeping all remaining parameters the same as for the 1D convolutional layer, as shown in Table 3.
4.3 Baselines
We compare against four state-of-the-art baselines, chosen to ensure sufficient diversity:

GCN. Graph convolutional networks (GCN) is the most important baseline. In the experiments, we use a two-layer GCN model. The main parameters follow the original paper: 0.5 (dropout rate), 16 (number of hidden units), 10 (early stopping patience), 0.1 (learning rate). Since there are more training data, we set the maximum number of training epochs to 400. We use the publicly available GCN code (https://github.com/tkipf/gcn).

Sampling-based methods. We choose FastGCN and GraphSAGE as comparison methods. We use the same data splits as FastGCN and reuse the results of FastGCN and GraphSAGE reported in (Chen et al., 2018).

Aggregation-based methods. We choose the most closely related method, GAT, which learns to assign different weights to different neighbors. In the experiments, we use the publicly available GAT code (https://github.com/PetarV/GAT). We use a two-layer GAT model. The main parameters follow the original paper: 8 (attention heads), 8 (feature dimension per head after the 1D convolution). We stop training within 400 epochs.
5 Results
| Methods | Cora | Citeseer | PubMed |
| --- | --- | --- | --- |
| GCN | 86.3% | 77.8% | 86.8% |
| FastGCN | 85.0% | 77.6% | 88.0% |
| GraphSAGE-mean | 82.2% | 71.4% | 87.1% |
| GAT | 80.4% | 75.7% | 85.0% |
| NFC-GCN (1D) | 88.3% | 78.5% | 89.5% |
| NFC-GCN (2D) | 88.0% | 78.9% | 88.43% |
5.1 Semi-supervised Node Classification


Test accuracy. The results of our comparative evaluation are summarized in Table 4. We report the classification accuracy (average of ten runs) on the test nodes for our methods and reuse results already reported in FastGCN (Chen et al., 2018). We ran GCN and GAT with the same data splits; interestingly, GCN outperforms GAT under our dataset split, which is consistent with the experimental results in (Xu et al., 2018). FastGCN and GraphSAGE-mean cannot perform as well as GCN because they only use a limited number of nodes in the graph.
Figure 3: Models trained for 200 epochs (without early stopping); our method reaches better results in fewer than 50 training epochs.

Figure 4: Models trained for 200 epochs (without early stopping); after the node-feature convolution process, the new node representation remarkably facilitates the subsequent classification task.

Our results demonstrate state-of-the-art performance across all datasets, even though we, like FastGCN and GraphSAGE, use only a limited number of nodes. We evaluated both 1D and 2D convolutional layers in the node-feature convolution stage; both outperform the other methods on all three datasets, and in our experiments the 1D convolutional layer performs better than the 2D one on Cora and PubMed. We improve upon all other methods by a margin of 2.0% on Cora, suggesting that learning to wisely aggregate a fixed-size set of neighbors using the NFC layer can be greatly beneficial. In addition, our method reaches better performance within a small number of training epochs, but its training time is much longer than that of the other methods, because the model is more complex and has many parameters to learn in each training epoch.
| Methods | Cora | Citeseer | PubMed |
| --- | --- | --- | --- |
| GCN_5 | 64.8% | 74.1% | 80.0% |
| GAT_5 | 64.2% | 74.2% | 82.2% |
| NFC_5-GCN | 86.0% | 79.1% | 89.0% |
| Improvement | 21.2% | 4.9% | 6.8% |

Table 5: Results of test accuracy without the GCN layers. GCN_5 aggregates the features of the central node and five neighbors in GCN's manner to get the new representation of the central node. GAT_5 averages the features of the central node and five neighbors with learned weights. NFC_5-GCN uses the convolutional layer to learn the new representation from the features of the central node and five neighbors. The new representation is then fed to a one-layer neural network to get the classification results.
Accuracy and loss versus training epoch. We also show how the training accuracy, validation accuracy, training loss, and validation loss change with each training epoch in Fig. 3 and Fig. 4. We did not use early stopping, for a fair comparison with GCN and GAT over the same training epochs, and we show the results within 200 epochs. Our method obtains good results within a few training epochs, while GCN and GAT need a hundred training epochs or more to stabilize. Moreover, the training and validation accuracy/loss curves of NFC-GCN rise or descend not only quickly but also stably. With the same Adam SGD optimizer minimizing cross-entropy on the training nodes as GCN and GAT, our method performs better in the optimization process. This verifies that the first-level node representation learned by the node-feature convolution can improve the subsequent classification tasks.

Test accuracy without the GCN layers. To show that our method learns more effectively from the neighbors' features, we compare three ways of processing a fixed-size set of neighbors: averaging with weights determined by node degrees (GCN), learning to assign weights to neighbors and then averaging (GAT), and convolution over nodes and features (ours). We then feed the resulting representation of each node directly to the classifier layer. We run this comparison on Cora, Citeseer, and PubMed, using 5 neighbors per central node for all three methods. The results are summarized in Table 5. They show that how the neighbors' features are processed has a significant effect on the final results: our method increases the test accuracy by margins of 21.2% on Cora, 4.9% on Citeseer, and 6.8% on PubMed.
It should be emphasized that our method achieves competitive performance even without the GCN layers. From Table 4 and Table 5, the best performance of the other methods on Cora, Citeseer, and PubMed is 86.3%, 77.8%, and 88.0% respectively, while our results are 86.0%, 79.1%, and 89.0%. Adding the GCN layers can slightly improve the performance further (by about 1%).
5.2 Effect of the Node Bandwidth
| Dataset | Highest | Lowest | Average | Median |
| --- | --- | --- | --- | --- |
| Cora | 168 | 1 | 4.87 | 4 |
| Citeseer | 99 | 1 | 3.7 | 3 |
| PubMed | 171 | 1 | 5.5 | 3 |
| Node bandwidth $b$ | Cora | Citeseer | PubMed |
| --- | --- | --- | --- |
| NFC-GCN ($b = 2$) | 87.53% | 78.30% | 87.58% |
| NFC-GCN ($b = 3$) | 87.79% | 78.34% | 87.83% |
| NFC-GCN ($b = 4$) | 88.13% | 78.37% | 88.02% |
| NFC-GCN ($b = 5$) | 88.18% | 78.48% | 88.29% |
| NFC-GCN ($b = 6$) | 88.30% | 78.52% | 88.23% |
In the node-feature convolution process, we fix the node bandwidth $b$ to get a fixed-size feature map for the central node and enable the subsequent convolution operation. To see the influence of the node bandwidth, we study the effect of varying $b$ in the node-feature convolution process.

Table 6 shows the distribution of node degrees for the three datasets. All three are very sparse graphs, so we vary $b$ from 2 to 6; for a larger $b$, the central node would be duplicated many times. The respective results are summarized in Table 7. They show consistent improvement in accuracy with increasing $b$. A larger $b$ means more feature diversity, which can be especially helpful for the representation learning of nodes with sparse features. However, a larger $b$ also increases the computation cost, so in practice there is a trade-off between classification accuracy and computational complexity when choosing $b$.
5.3 Effect of the Model Depth
Next, we investigate the influence of model depth (number of GCN layers) on classification performance. We vary the number of GCN layers from 1 to 5; the results are summarized in Table 8. The test accuracy on the three citation datasets changes by at most 3.21%.
Besides, we note that the best test accuracy of our method is not much better than GCN's on the Cora and Citeseer datasets, and not better than GCN's on the PubMed dataset (with only 60 labeled training data), because our model has more parameters to learn; NFC-GCN is more suitable for datasets with many labeled training examples. This is one limitation of our method.
Our method is less sensitive to the number of hidden layers, which indicates that the new representation of the central node becomes more robust and easier to classify after the node-feature convolution process. This can be verified in Table 5: after the first convolutional process, the new representation of the central node is fed directly to a one-layer neural network and achieves much better results than GCN and GAT. This shows that the learned features are highly representative of the class. Thus, even when a high-order neighborhood is treated as the context for the central node, it can still be accurately classified. Moreover, we only use a limited number of nodes in our method, which prevents neighborhood explosion to a certain degree.
| NFC-GCN | Cora | Citeseer | PubMed |
| --- | --- | --- | --- |
| 1 GCN layer | 86.58% | 76.70% | 89.49% |
| 2 GCN layers | 88.30% | 78.52% | 88.23% |
| 3 GCN layers | 88.15% | 77.95% | 86.70% |
| 4 GCN layers | 87.51% | 77.43% | 86.28% |
| 5 GCN layers | 86.83% | 76.88% | 85.77% |
| Test accuracy change | 1.72% | 1.82% | 3.21% |

Table 8: Test accuracy of NFC-GCN with 1 to 5 GCN layers.
6 Discussion
NFC-GCN embodies ideas from GCN and its extensions, such as the sampling-based GraphSAGE and the feature-convolution-based GAT. NFC-GCN simply adds a convolution layer before the GCN layers that convolves local feature maps for each node, which differs from the 1D convolution layer in GAT. In the following, we discuss the limitations of and new opportunities for this new architecture.

Limitations. Our method's main limitation is that NFC-GCN needs more training samples to learn its parameters. In addition, although NFC-GCN makes larger progress in gradient descent per training epoch than GCN/GAT, the computation cost of each epoch is higher than that of GCN. Nonetheless, deep learning models are known to need many samples to work well and to be computationally expensive.

Powerful representation learning ability. We have the choice of either 1D or 2D convolution to learn a node's new representation, and feeding this representation directly to a classifier layer already achieves competitive performance compared with GCN and its extensions. One important future direction is to remove the GCN layer entirely and explore an architecture in a more CNN-like manner. This would enable many ideas and tricks from CNNs to be applied to graphs.
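As a concrete illustration of this first-level representation, the NumPy sketch below stacks a central node's feature vector with n sampled neighbors into an (n+1) x d feature map, applies f filters across the node dimension, and feeds the flattened result to a linear classifier. All shapes, the filter layout, and the softmax classifier are our illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, f, c = 4, 16, 8, 3   # neighbors, feature dim, filters, classes

# (n+1) x d feature map: row 0 is the central node, rows 1..n its samples
feature_map = rng.standard_normal((n + 1, d))

# f filters, each spanning all n+1 rows, so every filter learns its own
# weighting of central-node vs. neighbor features
filters = rng.standard_normal((f, n + 1))

# first-level representation: one d-dim vector per filter, then ReLU
first_level = np.maximum(filters @ feature_map, 0.0)   # shape (f, d)

# feed the flattened representation straight to a linear classifier
W, b = rng.standard_normal((f * d, c)), np.zeros(c)
logits = first_level.reshape(-1) @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

In the full model the classifier would be replaced by GCN layers; here the point is only that the filter weights, learned end-to-end, assign adaptive importance to each node's features.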

Good optimization performance. In training, our method reaches its best performance within 50 epochs, while GCN and GAT need about 100 (twice as many), as shown in Fig. 3 and Fig. 4. Besides, NFC-GCN's accuracy and loss curves are steadier than those of GCN and GAT, which is preferable in neural network training. We will further work on more efficient per-epoch computation and perform a deeper analysis of the optimization process of NFC-GCN to understand its convergence behaviour.

New architectures and deeper models.
The NFC layer proposed in NFC-GCN can be used as a single layer or stacked into multiple layers for GCN and its extensions. In particular, NFC-GCN allows for a deeper model and has the potential to achieve better classification accuracy with tricks from CNNs or better hyperparameter tuning. Furthermore, deeper models can enable other powerful machine learning techniques, such as transfer learning, to be better applied to graphs.
7 Conclusions
In this paper, we proposed a novel neural network model for graphs: graph convolutional networks with node-feature convolution (NFC-GCN). The key idea is to construct fixed-size 2D feature maps to enable convolution as in the popular CNN model. We constructed such fixed-size feature maps via a sampling technique. In the proposed node-feature convolution layer, we used multiple filters to perform convolution on feature maps built from the feature vectors of the central node and its neighbors, producing the first-level node representation. We then fed the first-level node representation to a standard GCN model to learn the second-level / final representation suitable for downstream tasks. The filter weights were learned end-to-end with the whole NFC-GCN, such that the model learned to assign adaptive weights to different features of different (central or neighbor) nodes.
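The second stage of this pipeline, propagating the NFC output through a GCN layer, can be sketched as follows. The symmetric normalization follows Kipf & Welling (2017); the toy graph, shapes, and random weights are our illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

N, d_in, d_out = 5, 8, 4   # nodes, first-level dim, output dim

# Z: first-level node representations produced by the NFC layer
Z = rng.standard_normal((N, d_in))

# toy undirected adjacency, with self-loops added as in GCN
A = np.zeros((N, N))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
A_tilde = A + np.eye(N)

# symmetric normalization: D^{-1/2} (A + I) D^{-1/2}
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# one GCN layer on top of the NFC output: H = ReLU(A_hat Z W)
W = rng.standard_normal((d_in, d_out))
H = np.maximum(A_hat @ Z @ W, 0.0)   # second-level representations
```

In training, gradients flow through both W and the NFC filters that produced Z, which is what makes the feature weighting end-to-end.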
Experimental results showed that the proposed NFC-GCN outperformed competing GCN methods (including both sampling-based and feature-convolution-based models) on three popular citation graphs for node classification. Even without the GCN layer, the first-level NFC representation achieved decent performance. Furthermore, NFC-GCN took far fewer epochs to converge than GCN and GAT, and deeper models based on NFC-GCN showed much less performance variation than GCN. On the whole, the proposed NFC-GCN architecture opens many new doors for exploring and advancing representation learning on graphs.
References
Ahmed et al. (2013) Ahmed, Amr, Shervashidze, Nino, Narayanamurthy, Shravan M., Josifovski, Vanja, and Smola, Alexander J. Distributed large-scale natural graph factorization. In Proceedings of the 22nd International Conference on World Wide Web, 2013.
Bahdanau et al. (2015) Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, 2015.
Belkin & Niyogi (2001) Belkin, Mikhail and Niyogi, Partha. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Proceedings of the 14th International Conference on Neural Information Processing Systems, pp. 585–591, 2001.
Bruna et al. (2014) Bruna, Joan, Zaremba, Wojciech, Szlam, Arthur, and LeCun, Yann. Spectral networks and locally connected networks on graphs. In Proceedings of the 2nd International Conference on Learning Representations, 2014.
Chen et al. (2018) Chen, Jie, Ma, Tengfei, and Xiao, Cao. FastGCN: Fast learning with graph convolutional networks via importance sampling. In Proceedings of the 6th International Conference on Learning Representations, 2018.
Cho et al. (2014) Cho, Kyunghyun, van Merrienboer, Bart, Gülçehre, Çaglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014.
Duvenaud et al. (2015) Duvenaud, David K, Maclaurin, Dougal, Iparraguirre, Jorge, Bombarell, Rafael, Hirzel, Timothy, Aspuru-Guzik, Alán, and Adams, Ryan P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems 28, pp. 2224–2232, 2015.
Gao et al. (2018) Gao, Hongyang, Wang, Zhengyang, and Ji, Shuiwang. Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1416–1424, 2018.
Girshick et al. (2014) Girshick, Ross, Donahue, Jeff, Darrell, Trevor, and Malik, Jitendra. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, 2014.
Gori et al. (2005) Gori, Marco, Monfardini, Gabriele, and Scarselli, Franco. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 2, pp. 729–734, 2005.
Grover & Leskovec (2016) Grover, Aditya and Leskovec, Jure. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864, 2016.
Hamilton et al. (2017a) Hamilton, Will, Ying, Zhitao, and Leskovec, Jure. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30, pp. 1025–1035, 2017a.
Hamilton et al. (2017b) Hamilton, William L., Ying, Rex, and Leskovec, Jure. Representation learning on graphs: Methods and applications. IEEE Data Eng. Bull., 40:52–74, 2017b.
Karpathy et al. (2014) Karpathy, Andrej, Toderici, George, Shetty, Sanketh, Leung, Thomas, Sukthankar, Rahul, and Fei-Fei, Li. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
Kingma & Ba (2015) Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015.
Kipf & Welling (2017) Kipf, Thomas N and Welling, Max. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, 2017.
Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097–1105, 2012.
Langley (2000) Langley, P. Crafting papers on machine learning. In Langley, Pat (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
Li et al. (2016) Li, Yujia, Tarlow, Daniel, Brockschmidt, Marc, and Zemel, Richard S. Gated graph sequence neural networks. In Proceedings of the 4th International Conference on Learning Representations, 2016.
Mikolov et al. (2013) Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119, 2013.
Niepert et al. (2016) Niepert, Mathias, Ahmed, Mohamed, and Kutzkov, Konstantin. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning, pp. 2014–2023, 2016.
Perozzi et al. (2014) Perozzi, Bryan, Al-Rfou, Rami, and Skiena, Steven. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710, 2014.
Scarselli et al. (2009) Scarselli, Franco, Gori, Marco, Tsoi, Ah Chung, Hagenbuchner, Markus, and Monfardini, Gabriele. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
Schlichtkrull et al. (2018) Schlichtkrull, Michael, Kipf, Thomas N, Bloem, Peter, van den Berg, Rianne, Titov, Ivan, and Welling, Max. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, 2018.
Sen et al. (2008) Sen, Prithviraj, Namata, Galileo, Bilgic, Mustafa, Getoor, Lise, Galligher, Brian, and Eliassi-Rad, Tina. Collective classification in network data. AI Magazine, 29(3):93, 2008.
Shaw & Jebara (2009) Shaw, Blake and Jebara, Tony. Structure preserving embedding. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 937–944, 2009.
Tang et al. (2015) Tang, Jian, Qu, Meng, Wang, Mingzhe, Zhang, Ming, Yan, Jun, and Mei, Qiaozhu. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077, 2015.
Velickovic et al. (2018) Velickovic, Petar, Cucurull, Guillem, Casanova, Arantxa, Romero, Adriana, Lio, Pietro, and Bengio, Yoshua. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations, 2018.
Xu et al. (2018) Xu, Keyulu, Li, Chengtao, Tian, Yonglong, Sonobe, Tomohiro, Kawarabayashi, Ken-ichi, and Jegelka, Stefanie. Representation learning on graphs with jumping knowledge networks. In Proceedings of the 35th International Conference on Machine Learning, 2018.
Yang et al. (2016) Yang, Zhilin, Cohen, William W., and Salakhutdinov, Ruslan. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on Machine Learning, pp. 40–48, 2016.
Ying et al. (2018) Ying, Rex, He, Ruining, Chen, Kaifeng, Eksombatchai, Pong, Hamilton, William L., and Leskovec, Jure. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018.