1. Introduction
Deep learning methods are becoming increasingly powerful in solving various challenging artificial intelligence tasks. Among these methods, convolutional neural networks (CNNs) (LeCun et al., 1998) have demonstrated promising performance in many image-related applications, such as image classification (Deng et al., 2009), semantic segmentation (Chen et al., 2016), and object detection (Ren et al., 2015; He et al., 2017). A variety of CNN models have been proposed that continuously set new performance records (Krizhevsky et al., 2012b; Simonyan and Zisserman, 2015; Szegedy et al., 2015; He et al., 2016a). In addition to images, CNNs have also been successfully applied to natural language processing tasks such as neural machine translation (Cho et al., 2014; Luong et al., 2015; Gehring et al., 2017). One common characteristic behind these tasks is that the data can be represented by grid-like structures. This enables the use of convolutional operations in which the same local filters scan every position of the input. Unlike traditional hand-crafted filters, the local filters used in convolutional layers are trainable. The networks can automatically decide what kind of features to extract by learning the weights in these trainable filters, thereby avoiding hand-crafted feature extraction (Wang et al., 2012).
In many real-world applications, the data can be naturally represented as graphs, such as social, citation, and biological networks. Figure 1 provides an illustration of graph data. Many interesting discoveries can be made by analyzing these graph data, such as social network analysis (Grover and Leskovec, 2016). An important task on graph data is node classification (Kipf and Welling, 2017; Veličković et al., 2017), in which models make predictions for every node in a graph based on node features and graph topology. As mentioned above, CNNs, with the power of automatic feature extraction, have achieved great success on tasks with grid-like data, which can be considered as special cases of graph data. Therefore, applying deep learning models, especially CNNs, to graph tasks is appealing. However, using regular convolutional operations on generic graphs faces two main challenges. These challenges result from the fact that regular convolutions require every node to have the same number of neighboring nodes and require these neighboring nodes to be ordered. In generic graphs, the numbers of neighboring nodes usually differ across nodes. In addition, among the neighboring nodes of a node, there is no ranking information based on which we can order them to ensure the output is deterministic. In this work, we analyze the necessity of having a fixed number of ordered neighboring nodes in regular convolutional operations and propose solutions to address these challenges.
Several recent studies tried to apply convolutional operations on generic graphs. Graph convolutional networks (GCNs) (Kipf and Welling, 2017) proposed to use a convolution-like operation to aggregate features of all adjacent nodes for each node, followed by a linear transformation to generate a new feature representation for a given node. Specifically, all feature vectors in the neighborhood, including the feature vector of the central node itself, are summed up, weighted by non-trainable weights depending on the number of neighbors. This can be thought of as a convolution-like operation which, however, is intrinsically different from the regular convolutional operation in two aspects. First, it does not use the same local filter to scan every node; that is, nodes that have different numbers of adjacent nodes have filters of different sizes and weights. Second, the weights in the filters are the same for all neighboring nodes in the receptive field as they are determined by the number of neighbors. Consequently, the weights are not learned. Graph attention networks (GATs) (Veličković et al., 2017) employed the attention mechanism (Bahdanau et al., 2015) to obtain different and trainable weights for adjacent nodes by measuring the correlation between their feature vectors and that of the central node. Yet the graph attention operation still differs from the regular convolution, which learns weights in local filters directly. Moreover, the attention mechanism requires extra computation over pairs of feature vectors, resulting in excessive memory and computational resource requirements in practice.
In this work, we make two major contributions to applying CNNs on generic graph data. First, we propose the learnable graph convolutional layer (LGCL) to enable the use of regular convolutional operations on graphs. Note that prior studies modified the original convolutional operations to fit them for graph data. In contrast, our LGCL transforms the graphs to enable the use of regular convolutions. Our models based on LGCL achieve better performance on both transductive and inductive node classification tasks, as demonstrated by our experimental results. Second, we observe another limitation of prior methods; that is, their training process takes the adjacency matrix of the whole graph as an input. This requires excessive memory and computational resources when the graph has a large number of nodes, which is usually the case in real-world tasks. In order to overcome this limitation, we develop a subgraph training method, which is a simple yet effective approach to allow the training of deep learning methods on large-scale graph data. The subgraph training method can significantly reduce the amount of required memory and computational resources, with negligible loss in terms of model performance.
2. Related Work
A few recent studies have tried to apply convolutional operations on graph data. Graph convolutional networks (GCNs) were introduced in (Kipf and Welling, 2017) and achieved state-of-the-art performance on several node classification tasks. The authors defined and used a convolution-like operation termed the spectral graph convolution. This enables CNNs to directly operate on graphs. Basically, each layer in GCNs updates the feature vector representation of each node in the graph by considering the features of neighboring nodes. To be specific, the layer-wise forward-propagation operation of GCNs can be expressed as
$$X_{l+1} = \sigma\left(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} X_l W_l\right), \qquad (1)$$
where $X_l$ and $X_{l+1}$ are the input and output matrices of layer $l$, respectively. For both matrices, the numbers of rows are the same, corresponding to the number of nodes in the graph, while the numbers of columns can be different, depending on the dimensions of the input and output feature space. In Eq. (1), $\hat{A} = A + I$ is used to aggregate feature vectors of adjacent nodes, where $A$ is the adjacency matrix of the graph and $I$ is the identity matrix. Also, $\hat{A}$ is used instead of $A$ because the layers need to add self-loop connections to make sure that the old feature vector of the node itself is taken into consideration when updating the representation of a node. $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$, which is used to normalize $\hat{A}$ so that the scale of feature vectors after aggregation remains the same. $W_l$ is a trainable weight matrix and represents a linear transformation that changes the dimension of feature space. Therefore, the dimension of $W_l$ depends on how many features each node has in the input and output, i.e., the number of columns in $X_l$ and $X_{l+1}$, respectively. $\sigma(\cdot)$ denotes an activation function like ReLU.
We analyze the convolution-like operation, which is the feature aggregation step through pre-multiplying $X_l$ by $\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$. Consider a node $i$ with a feature vector $x_l^i$ corresponding to the $i$-th row in $X_l$. The aggregation output, controlled by the $i$-th row in $\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$, is a weighted sum of the feature vectors of all of its adjacent nodes, including the node itself. We can see that the operation is equivalent to having a local filter for each node, whose receptive field consists of the node itself and all its neighboring nodes. As nodes in a generic graph commonly have different numbers of adjacent nodes, the receptive field size varies, resulting in different local filters. This is a key difference from the regular convolutional operation, where the same local filter is applied to scan each position in grid-like data. Moreover, while using local filters of different sizes for graph data seems reasonable, it is worth noting that there is no trainable parameter in $\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$. In addition, each adjacent node receives the same weight in the weighted sum, which makes it a simple average. While CNNs achieve the power of automatic feature extraction by learning the weights in local filters, this non-trainable aggregation operation in GCNs limits the capability of CNNs on generic graph data.
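To make the aggregation step concrete, the following is a minimal NumPy sketch of one GCN propagation step, assuming the standard symmetric normalization of Kipf and Welling (2017) with self-loops. The function name `gcn_layer`, the toy path graph, and the identity weight matrix are illustrative assumptions and do not come from the paper. The point of the sketch is that the aggregation matrix is computed from the graph alone, so only `W` could be trained.

```python
import numpy as np

def gcn_layer(X, A, W, activation=lambda z: np.maximum(z, 0.0)):
    """One propagation step: sigma(D^-1/2 (A + I) D^-1/2 X W).

    Returns the layer output and the fixed aggregation matrix.
    """
    A_hat = A + np.eye(A.shape[0])        # add self-loop connections
    deg = A_hat.sum(axis=1)               # diagonal of the degree matrix
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Fixed aggregation weights: entry (i, j) equals 1 / sqrt(deg_i * deg_j)
    # for connected nodes and 0 otherwise. Nothing here is trainable.
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return activation(A_norm @ X @ W), A_norm

# Toy path graph 0 - 1 - 2 (hypothetical example data).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3)   # one-hot node features
W = np.eye(3)   # identity stands in for a trained weight matrix
H, A_norm = gcn_layer(X, A, W)
print(np.round(A_norm, 3))
```

With one-hot features and an identity weight matrix, the output simply exposes the aggregation weights, which depend only on node degrees.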
From this perspective, graph attention networks (GATs) (Veličković et al., 2017) tried to enable learnable weights when aggregating neighboring feature vectors by employing the attention mechanism (Bahdanau et al., 2015; Vaswani et al., 2017). Like GCNs, each node still has a local filter with a receptive field covering the node itself and all of its adjacent nodes. When performing the weighted sum of feature vectors, each neighbor receives a different weight by measuring the correlation between its feature vector and that of the central node. Mathematically, for a node $i$ and one of its adjacent nodes $j$, the correlation measurement process between layer $l$ and $l+1$ is given by
$$e_{i,j} = a\left(W_l x_l^i, W_l x_l^j\right), \qquad \alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{m \in \mathcal{N}_i} \exp(e_{i,m})}, \qquad (2)$$
where $x_l^i$ and $x_l^j$ represent the corresponding feature vectors, i.e., the $i$-th and $j$-th rows in $X_l$, respectively, $W_l$ is a shared linear transformation, $a(\cdot)$ represents a single-layer feed-forward neural network, $\mathcal{N}_i$ denotes the set of nodes in the receptive field of node $i$, and $\alpha_{i,j}$ is the weight for node $j$ in the feature aggregation operation of node $i$. Although GATs provide different and trainable weights to different adjacent nodes in this way, the learning process differs from that of regular CNNs, where weights in local filters are learned directly. Also, the attention mechanism requires extra computation between a node and all of its adjacent nodes, which causes memory and computational resource problems in practice.
Unlike these prior models, which modified the regular convolutional operations to fit them for generic graph data, we instead propose to transform graphs into grid-like data to enable the use of CNNs directly. This idea was previously explored in (Niepert et al., 2016). However, the transformation in (Niepert et al., 2016) is implemented in the preprocessing step, while our method includes the transformation in the networks. Additionally, we introduce a subgraph training method in this work, which is a simple yet effective approach to allow large-scale training.
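For comparison with the fixed GCN weights, the attention weights of Eq. (2) can be sketched as follows. This is a hedged illustration rather than the GAT reference implementation: the single-layer network $a(\cdot)$ is assumed to be a dot product with a vector `a` applied to the concatenated transformed features and followed by a LeakyReLU, as in the GAT paper, and the names `attention_weights`, `slope`, and the random toy data are hypothetical.

```python
import numpy as np

def attention_weights(X, A, W, a, slope=0.2):
    """Return alpha with alpha[i, j] = attention weight of node j for node i."""
    N = X.shape[0]
    H = X @ W                                 # shared linear transformation
    F = H.shape[1]
    # a . [H[i] || H[j]] splits into a part for the central node i
    # and a part for the neighbor j.
    e = (H @ a[:F])[:, None] + (H @ a[F:])[None, :]
    e = np.where(e > 0, e, slope * e)         # LeakyReLU (assumed nonlinearity)
    mask = (A + np.eye(N)) > 0                # receptive field: neighbors + self
    e = np.where(mask, e, -np.inf)            # exclude non-adjacent nodes
    e = e - e.max(axis=1, keepdims=True)      # numerical stability
    w = np.exp(e)
    return w / w.sum(axis=1, keepdims=True)   # softmax over each neighborhood

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)        # toy path graph 0 - 1 - 2
X = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
a = rng.normal(size=4)                        # length 2 * F with F = 2
alpha = attention_weights(X, A, W, a)
print(np.round(alpha, 3))
```

Every pair of connected nodes contributes one score, which is the extra pairwise computation referred to in the text.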
3. Methods
In this section, we introduce the learnable graph convolutional layer (LGCL) and the subgraph training strategy on generic graph data. Based on these developments, we propose the large-scale learnable graph convolutional networks (LGCNs).
3.1. Challenges of Applying Convolutional Operations on Graph Data
In order to apply regular convolutional operations on graphs, we need to overcome two main challenges that are caused by two major differences between generic graphs and grid-like data. First, the number of adjacent nodes usually varies for different nodes in a generic graph. Second, we cannot order the neighboring nodes in generic graphs, since there is no ranking information among them. For example, in a social network, each person in the network can be seen as a node and the edges represent friendships between people. Obviously, the number of adjacent nodes differs for each node since people can have different numbers of friends. Meanwhile, it is hard to order these friends without additional information for ranking.
Note that grid-like data can be viewed as a special type of graph data, where each node has a fixed number of ordered neighbors. As convolutional operations apply directly on grid-like data such as images, we analyze why the two characteristics mentioned above are necessary for performing regular convolutions. To see the need for a fixed number of adjacent nodes with ranking information, consider a convolutional filter with a size of $3 \times 3$ scanning an image. We think of the image as a special graph by thinking of each pixel as a node. During the scan, the computation involves a central node with 8 adjacent nodes each time. These nodes become neighbors of the central node by having edges connecting them in the special graph. Meanwhile, we can order these neighboring nodes by their relative positions with respect to the central node. This is crucial to convolutional operations since the correspondence between weights in the filter and nodes in the graph must be maintained during the scan. For instance, in the example above, the upper left weight in the filter should always be multiplied with the neighboring node at the top left of the central node. Without such ranking information, the outputs of convolution operations are no longer deterministic. We can see from the above discussion that it is challenging to directly apply regular convolutional operations on generic graph data. To address these two challenges, we propose an approach to transform generic graphs into grid-like data.
3.2. Learnable Graph Convolutional Layers
To enable the use of regular convolutional operations on generic graphs, we propose the learnable graph convolutional layer (LGCL). Following the notations defined in Section 2, the layer-wise propagation rule of LGCL is formulated as
$$\tilde{X}_l = g(X_l, A, k), \qquad X_{l+1} = c(\tilde{X}_l), \qquad (3)$$
where $A$ is the adjacency matrix, $g(\cdot)$ is an operation that performs the $k$-largest node selection to transform generic graphs to data of grid-like structures, and $c(\cdot)$ denotes a regular 1D CNN that aggregates neighboring information and outputs a new feature vector for each node. We discuss $g(\cdot)$ and $c(\cdot)$ separately below.
$k$-largest Node Selection. We propose a novel method known as the $k$-largest node selection to achieve the transformation from graphs to grid-like data, where $k$ is a hyperparameter of LGCL. After this operation, each node aggregates neighboring information and is represented in a 1D grid-like format with $(k+1)$ positions. The transformed data is then fed into a 1D CNN to generate the updated feature vector.
Suppose $X_l \in \mathbb{R}^{N \times C}$ with row vectors $x_l^1, x_l^2, \ldots, x_l^N$, representing a graph of $N$ nodes where each node has $C$ features. We are given the adjacency matrix $A$ and a fixed $k$. Now consider a specific node $i$ whose feature vector is $x_l^i$ and which has $n$ neighboring nodes. Through a simple look-up operation in $A$, we can obtain the indices of these adjacent nodes, say $i_1, i_2, \ldots, i_n$. Concatenating the corresponding feature vectors $x_l^{i_1}, x_l^{i_2}, \ldots, x_l^{i_n}$ outputs a matrix $M_l^i \in \mathbb{R}^{n \times C}$. Without loss of generality, assume that $n \geq k$. If $n < k$ in practice, we can pad $M_l^i$ using rows of zeros. The $k$-largest node selection is conducted on $M_l^i$; that is, for each column, we rank the values and select the $k$ largest values. This gives us a $k \times C$ output matrix. As the columns in $M_l^i$ represent features, the operation is equivalent to selecting the $k$ largest values for each feature. By inserting $x_l^i$ in the first row, the output becomes $\tilde{M}_l^i \in \mathbb{R}^{(k+1) \times C}$. This is illustrated in the left part of Figure 2. By repeating this process for each node, $g(\cdot)$ transforms $X_l$ to $\tilde{X}_l \in \mathbb{R}^{N \times (k+1) \times C}$.
Note that $\tilde{X}_l$ can be viewed as a 1D grid-like structure by considering $N$, $(k+1)$, and $C$ as the batch size, the spatial size, and the number of channels, respectively. Therefore, the $k$-largest node selection function $g(\cdot)$ successfully achieves the transformation from generic graphs to grid-like data. The operation makes use of the natural ranking information among real numbers and forces each node to have a fixed number of ordered neighbors.
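The selection procedure described above can be sketched in a few lines of NumPy. This is a minimal, loop-based illustration and not the authors' implementation: the function name `k_largest_node_selection` and the toy star graph are hypothetical, and the adjacency matrix is assumed to contain no self-loops.

```python
import numpy as np

def k_largest_node_selection(X, A, k):
    """Transform node features X (N, C) into grid-like data (N, k + 1, C)."""
    N, C = X.shape
    out = np.zeros((N, k + 1, C), dtype=X.dtype)
    for i in range(N):
        neighbors = np.flatnonzero(A[i])   # look up adjacent node indices
        M = X[neighbors]                   # (n, C): one row per neighbor
        if M.shape[0] < k:                 # fewer than k neighbors: zero rows
            pad = np.zeros((k - M.shape[0], C), dtype=X.dtype)
            M = np.vstack([M, pad])
        # Rank each feature column independently and keep the k largest values.
        top_k = -np.sort(-M, axis=0)[:k]
        out[i, 0] = X[i]                   # central node goes in the first row
        out[i, 1:] = top_k
    return out

# Toy star graph: node 0 is connected to nodes 1, 2 and 3.
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]])
X = np.array([[1.0, 9.0],
              [5.0, 2.0],
              [3.0, 7.0],
              [4.0, 4.0]])
X_tilde = k_largest_node_selection(X, A, k=2)
print(X_tilde[0])   # rows: [1, 9], [5, 7], [4, 4]
print(X_tilde[1])   # rows: [5, 2], [1, 9], [0, 0]
```

For node 0, the row `[5, 7]` does not correspond to any single neighbor: the largest value of each feature is taken independently, so the selected rows mix values from different neighboring nodes. Node 1 has a single neighbor, so its last row is zero padding.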
1D Convolutional Neural Networks. As discussed in Section 3.1, regular convolutional operations can be directly applied on grid-like data. As $\tilde{X}_l$ is 1D, we employ a 1D CNN model $c(\cdot)$. The basic functionality of LGCL is to aggregate adjacent information and update the feature vector for each node. Consequently, it requires $X_{l+1} \in \mathbb{R}^{N \times D}$, where $D$ is the dimension of the updated feature space. The 1D CNN $c(\cdot)$ should take $\tilde{X}_l \in \mathbb{R}^{N \times (k+1) \times C}$ as input and output a matrix of dimension $N \times D$, or equivalently, $N \times 1 \times D$. Basically, $c(\cdot)$ reduces the spatial size from $(k+1)$ to 1.
Note that $N$ is considered as the batch size, which is not related to the design of $c(\cdot)$. As a result, we focus on only one data sample, i.e., one node in the graph. Taking the example above, for node $i$, the transformed output is $\tilde{M}_l^i \in \mathbb{R}^{(k+1) \times C}$, which serves as the input to $c(\cdot)$. Because any regular convolutional operation with a filter size larger than one and no padding reduces the spatial size, the simplest $c(\cdot)$ has only one convolutional layer with a filter size of $(k+1)$ and no padding. The numbers of input and output channels are $C$ and $D$, respectively. Meanwhile, any multi-layer CNN can be employed, provided its final output has the dimension of $1 \times D$. The right part of Figure 2 illustrates an example of a two-layer CNN. Again, applying $c(\cdot)$ for all the $N$ nodes outputs $X_{l+1} \in \mathbb{R}^{N \times D}$. In summary, our LGCL transforms generic graphs to grid-like data using the proposed $k$-largest node selection and applies a regular 1D CNN to perform feature aggregation and refine the feature vector for each node.
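The shape bookkeeping of the 1D CNN can be checked with a small NumPy sketch. It assumes stride 1, no padding, no bias, and no nonlinearity between layers; the helper `conv1d_valid`, the filter sizes of the two-layer variant, and the random toy tensors are illustrative assumptions rather than the configuration used in the paper.

```python
import numpy as np

def conv1d_valid(x, w):
    """1D convolution without padding and with stride 1.

    x: (N, L, C_in) input, w: (F, C_in, C_out) filters.
    Returns an array of shape (N, L - F + 1, C_out).
    """
    N, L, _ = x.shape
    F = w.shape[0]
    # Gather every window of F consecutive positions: (N, L - F + 1, F, C_in).
    windows = np.stack([x[:, p:p + F, :] for p in range(L - F + 1)], axis=1)
    return np.einsum('nlfc,fcd->nld', windows, w)

rng = np.random.default_rng(0)
N, k, C, D = 4, 4, 3, 2
X_tilde = rng.normal(size=(N, k + 1, C))        # output of the node selection

# Simplest design: one layer whose filter covers all k + 1 positions.
w_single = rng.normal(size=(k + 1, C, D))
out_single = conv1d_valid(X_tilde, w_single)    # spatial size 5 -> 1

# Two-layer design: filter size 3 twice, spatial size 5 -> 3 -> 1.
w1 = rng.normal(size=(3, C, 6))
w2 = rng.normal(size=(3, 6, D))
out_two = conv1d_valid(conv1d_valid(X_tilde, w1), w2)

X_next = out_single[:, 0, :]                    # drop the singleton spatial axis
print(out_single.shape, out_two.shape, X_next.shape)
```

Unlike the GCN aggregation, every entry of the filters is a trainable weight, and the same filters are shared by all nodes.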
3.3. Learnable Graph Convolutional Networks
It is known that deeper networks usually yield better performance. However, prior deep models on graphs like GCNs only have two layers. While they suffer from performance loss when going deeper (Kipf and Welling, 2017), our LGCL enables a deeper design, resulting in the learnable graph convolutional networks (LGCNs) for graph node classification. We build LGCNs based on the architecture of densely connected convolutional networks (DCNNs) (Huang et al., 2017; He et al., 2016b), which achieved state-of-the-art performance in the ImageNet classification challenge (Krizhevsky et al., 2012a).
In LGCNs, we first apply a graph embedding layer to produce low-dimensional representations of nodes, since the original inputs are usually very high-dimensional feature vectors in some graph datasets, such as Cora (Sen et al., 2008). The graph embedding layer is essentially a linear transformation in the first layer expressed as
$$X_1 = X_0 W_0, \qquad (4)$$
where $X_0 \in \mathbb{R}^{N \times C_0}$ represents the high-dimensional input and $W_0$ changes the dimension of feature space from $C_0$ to $C_1$. As a result, $W_0 \in \mathbb{R}^{C_0 \times C_1}$ and $X_1 \in \mathbb{R}^{N \times C_1}$. Alternatively, a GCN layer can be used for graph embedding. As illustrated in Section 2, the number of training parameters in a GCN layer is equal to that of a regular graph embedding layer.
After the graph embedding layer, we stack multiple LGCLs, according to the complexity of the graph data. As each LGCL only aggregates information from first-order neighboring nodes, i.e., direct neighboring nodes, stacked LGCLs can collect information from a larger set of nodes, which is commonly done in regular CNNs. In order to improve the model performance and facilitate the training process, we apply skip connections to concatenate the inputs and outputs of LGCLs. Finally, a fully-connected layer is used before the softmax function for final predictions.
Following the design principle of LGCNs, $k$ and the number of stacked LGCLs are the most important hyperparameters. The average degree of nodes in the graph can be a good reference for selecting $k$. Meanwhile, the number of LGCLs should depend on the complexity of tasks, such as the number of classes, the number of nodes in a graph, etc. More complicated tasks require deeper models.
Dataset  #Nodes  #Features  #Classes  #Training Nodes  #Validation Nodes  #Test Nodes  Average Degree  Setting 

Cora  2708  1433  7  140  500  1000  4  Transductive 
Citeseer  3327  3703  6  120  500  1000  5  Transductive 
Pubmed  19717  500  3  60  500  1000  6  Transductive 
PPI  56944  50  121  44906 (20 graphs)  6514 (2 graphs)  5524 (2 graphs)  31  Inductive 
3.4. Subgraph Training on Large-Scale Data
Most prior deep models on graphs suffer from another limitation. In particular, during training the inputs are the feature vectors of all the nodes along with the adjacency matrix of the whole graph, whose sizes become large for large graph data. These prior models work properly on small-scale graphs. However, for large-scale graphs, those methods usually result in excessive memory and computational resource requirements, which limit the practical applications of these models.
Similar problems also occur for deep neural networks on other types of data, such as grid-like data. For example, deep models for image segmentation usually use randomly cropped patches when dealing with large images. Motivated by this strategy, we intend to randomly "crop" a graph to obtain smaller graphs for training. However, while a rectangular patch of an image naturally maintains neighboring information among pixels, how to handle irregular connections between nodes in a graph remains challenging.
In this work, we propose a subgraph selection algorithm to address the memory and computational resource problems on large-scale graph data, as shown in Algorithm 1. Given a graph, we first sample some initial nodes. Starting from them, we use the breadth-first search (BFS) algorithm to expand adjacent nodes into the subgraph iteratively. With multiple iterations, high-order neighboring nodes of the initial nodes are included. Note that we use a single parameter in Algorithm 1 to limit the number of nodes added per iteration for simplicity. In practice, we can set it to different values for each iteration. Figure 4 provides an example of the subgraph selection process.
With such randomly "cropped" subgraphs, we are able to train deep models on large-scale graphs. In addition, we can take advantage of the mini-batch training strategy to accelerate the learning process. In each training iteration, we can use the proposed subgraph selection algorithm to sample several subgraphs and put them in a mini-batch. The corresponding feature vectors and adjacency matrices form the inputs to the networks.
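The selection process described above can be sketched as follows. Algorithm 1 itself is not reproduced in this text, so this is a reading of the prose description only: the function `select_subgraph` and its parameters `n_init`, `max_size`, `max_per_iter`, and `candidates` are hypothetical names, and the rule of randomly subsampling the frontier when a limit is exceeded is an assumption.

```python
import numpy as np

def select_subgraph(A, n_init, max_size, max_per_iter=None,
                    candidates=None, rng=None):
    """Sample initial nodes, then expand them with BFS up to max_size nodes.

    Returns the sorted node indices and the adjacency matrix of the subgraph.
    """
    rng = np.random.default_rng() if rng is None else rng
    pool = np.arange(A.shape[0]) if candidates is None else np.asarray(candidates)
    init = rng.choice(pool, size=min(n_init, len(pool)), replace=False)
    selected = {int(v) for v in init}
    frontier = sorted(selected)
    while frontier and len(selected) < max_size:
        # One BFS step: collect unvisited neighbors of the current frontier.
        neighbors = set()
        for v in frontier:
            neighbors.update(int(u) for u in np.flatnonzero(A[v]))
        new = sorted(neighbors - selected)
        if max_per_iter is not None and len(new) > max_per_iter:
            new = [int(u) for u in rng.choice(new, size=max_per_iter, replace=False)]
        room = max_size - len(selected)
        if len(new) > room:
            new = [int(u) for u in rng.choice(new, size=room, replace=False)]
        selected.update(new)
        frontier = new
    idx = np.array(sorted(selected))
    return idx, A[np.ix_(idx, idx)]

# Toy graph with two components: a path 0 - 1 - 2 and an edge 3 - 4.
A = np.zeros((5, 5), dtype=int)
for u, v in [(0, 1), (1, 2), (3, 4)]:
    A[u, v] = A[v, u] = 1

idx_full, A_full = select_subgraph(A, n_init=1, max_size=10, candidates=[0])
idx_small, A_small = select_subgraph(A, n_init=1, max_size=2, candidates=[0])
print(idx_full, idx_small)
```

Starting from node 0, the expansion stops once its connected component is exhausted or the size limit is reached, so nodes 3 and 4 are never selected. Several such subgraphs, with their feature rows `X[idx]`, could then be grouped into a mini-batch.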
4. Experimental Studies
In this section, we evaluate our proposed large-scale learnable graph convolutional networks (LGCNs) on node classification tasks under both transductive and inductive learning settings. In addition to comparisons with prior state-of-the-art models, some performance studies are performed to investigate how to choose hyperparameters. Experiments are also conducted to analyze the training strategy based on the proposed subgraph selection algorithm. Experimental results show that LGCNs yield improved performance, and the subgraph training is much more efficient than whole-graph training. Our code is publicly available at https://github.com/divelab/lgcn/.
4.1. Datasets
In our experiments, we focus on node classification tasks under both transductive and inductive learning settings.
Transductive Learning. Under the transductive setting, the unlabeled testing data are accessible and available during training. To be specific, for node classification, only a part of the nodes in the graph are labeled. The testing nodes, which are also in the same graph, are accessible during training, including their features and connections, except for the labels. This means the training process knows about the graph structure that contains testing nodes. We use three standard benchmark datasets for transductive learning experiments, namely Cora, Citeseer, and Pubmed (Sen et al., 2008), as summarized in Table 1. These three datasets are citation networks with nodes and edges representing documents and citations, respectively. The feature vector of each node corresponds to a bag-of-words representation of a document. For these three datasets, we employ the same experimental settings as those in GCN (Kipf and Welling, 2017). In each dataset, 20 nodes per class are used for training, while 500 nodes are used for validation and 1,000 nodes are used for testing.
Inductive Learning. For inductive learning, the testing data are not available during training, which means the training process does not learn about the structure of test graphs. In inductive learning tasks, we usually have different training, validation, and testing graphs. During training, the model only uses the training graphs without access to validation and testing graphs. We use the protein-protein interaction (PPI) dataset (Zitnik and Leskovec, 2017), which contains 20 graphs for training, 2 graphs for validation, and 2 graphs for testing. Since the graphs for validation and testing are separate, the training process does not use them. There are 2,372 nodes in each graph on average. Each node has 50 features including positional, motif genes and signatures. Each node has multiple labels from 121 classes.
4.2. Experimental Setup
We describe the experimental setup under both transductive and inductive learning settings.
Transductive Learning. In transductive learning tasks, we employ the proposed LGCN models as illustrated in Figure 3. Since transductive learning datasets employ high-dimensional bag-of-words representations as feature vectors of nodes, the inputs go through a graph embedding layer to reduce the dimension. Here, we use a GCN layer as the graph embedding layer. The dimension of the embedding output is 32. Then we apply LGCLs, each of which uses a fixed $k$ and produces 8-component feature vectors. For Cora, Citeseer, and Pubmed, we stack 2, 1, and 1 LGCLs, respectively. We use concatenation in skip connections. Finally, a fully-connected layer is used as a classifier to make predictions. Before the fully-connected layer, we perform a simple sum to aggregate feature vectors of adjacent nodes. Dropout (Srivastava et al., 2014) is applied on both input feature vectors and adjacency matrices in each layer with rates of 0.16 and 0.999, respectively. All LGCN models in transductive learning tasks use the subgraph training strategy. The maximum subgraph size is set to 2,000 nodes.
Inductive Learning. For inductive learning, the same LGCN model as above is used except for some hyperparameters. For the graph embedding layer, the dimension of output feature vectors is 128. We stack two LGCLs. We also employ the subgraph training strategy, with subgraph initial node size equal to 500 and 200. Dropout with a rate of 0.9 is applied in each layer.
For both transductive and inductive learning LGCN models, the following configurations are shared. For all layers, only the identity activation function is used, which means no nonlinearity is involved in the networks. In order to avoid overfitting, regularization is applied. For training, the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.1 is used. Weights in LGCNs are initialized by the Glorot initialization (Glorot and Bengio, 2010). We employ the early stopping strategy based on the validation accuracy and train for at most 1,000 epochs.
Models  Cora  Citeseer  Pubmed 

DeepWalk (Perozzi et al., 2014)  67.2%  43.2%  65.3% 
Planetoid (Yang et al., 2016)  75.7%  64.7%  77.2% 
Chebyshev (Defferrard et al., 2016)  81.2%  69.8%  74.4% 
GCN (Kipf and Welling, 2017)  81.5%  70.3%  79.0% 
LGCN (ours)  83.3 ± 0.5%  73.0 ± 0.6%  79.5 ± 0.2% 
Models  PPI 

GraphSAGE-GCN (Hamilton et al., 2017)  0.500 
GraphSAGE-mean (Hamilton et al., 2017)  0.598 
GraphSAGE-pool (Hamilton et al., 2017)  0.600 
GraphSAGE-LSTM (Hamilton et al., 2017)  0.612 
LGCN (ours)  0.772 ± 0.002 
4.3. Analysis of Results
The experimental results are summarized in Tables 2 and 3 for transductive and inductive learning settings, respectively.
Transductive Learning. For transductive learning experiments, we report node classification accuracies as in (Kipf and Welling, 2017). Table 2 provides the comparisons with other graph models. According to the results, our LGCN models achieve better performance than the current state-of-the-art GCNs by a margin of 1.8%, 2.7%, and 0.5% on the Cora, Citeseer, and Pubmed datasets, respectively.
Inductive Learning. For inductive learning experiments, we report micro-averaged F1 scores as in (Hamilton et al., 2017). From Table 3, we can observe that our LGCN model outperforms GraphSAGE-LSTM by a margin of 16%. Without observing the structure of test graphs in training, the LGCN model still achieves good generalization.
The results above show that the proposed LGCN models on generic graphs consistently yield new state-of-the-art performance in node classification tasks on different datasets. These results demonstrate the effectiveness of applying regular convolutional operations on transformed graph data. In addition, the proposed transformation approach through the $k$-largest node selection is shown to be effective.
4.4. LGCL versus GCN Layers
It may be argued that our LGCN models employ a deeper network architecture than GCNs, which could explain the improved performance. However, the performance of GCNs is reported to decrease when going deeper by stacking more layers. In addition, we conduct another experiment by replacing all LGCLs in LGCN models with GCN layers, and denote the resulting model as LGCN-GCN. All the other settings remain the same in order to ensure the fairness of the comparisons. Table 4 provides the comparison results between LGCN and LGCN-GCN. The results show that LGCN has better performance than LGCN-GCN, which indicates that the LGCL is more effective than the GCN layer.
Models  Cora  Citeseer  Pubmed 

LGCN-GCN  82.2 ± 0.5%  71.1 ± 0.5%  79.0 ± 0.2% 
LGCN  83.3 ± 0.5%  73.0 ± 0.6%  79.5 ± 0.2% 
4.5. Subgraph versus Whole-Graph Training
For the experiments above, we use the subgraph training strategy to learn the LGCN models, which aims at saving memory and training time. However, since the subgraph selection algorithm samples some nodes as a subgraph from the whole graph, the models trained in this way do not learn about the structure of the whole graph during training. Meanwhile, in transductive learning tasks, the information of testing nodes may be ignored, which raises the risk of performance loss. To address this concern, we perform experiments on transductive learning tasks to compare the subgraph training strategy with the previous whole-graph training strategy. Through the experiments, we show the advantages of using the subgraph training strategy, with negligible loss in terms of model performance.
For the subgraph selection process described in Algorithm 1, the algorithm starts with some initial nodes that are randomly selected. In transductive learning tasks, we sample initial nodes only from the nodes with training labels to make sure that training can be conducted. To be specific, we sample 140, 120, and 60 initial nodes when selecting the subgraph for the Cora, Citeseer, and Pubmed datasets, respectively. For each iteration in the subgraph selection algorithm, we do not limit the number of nodes expanded into the subgraph. The maximum number of nodes in the subgraph is set to 2,000 for all three datasets, which is a feasible size for the GPUs we have at hand.
For comparison, we perform experiments using the same LGCN models, but train them using the same whole-graph training strategy as GCNs, which means the inputs are representations of the entire graph. We denote such models as LGCN$_{\text{whole}}$, compared to LGCN$_{\text{sub}}$ with the subgraph training strategy. The comparison results of these two models with GCNs are provided in Table 5. The number of nodes reported represents how many nodes are used for one iteration of training. The time reported here is the training time for running 100 epochs using a single TITAN Xp GPU.
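Model  Metric  Cora  Citeseer  Pubmed 

GCN  # Nodes  2708  3327  19717 
GCN  Accuracy  81.5%  70.3%  79.0% 
GCN  Time  7s  4s  38s 
LGCN$_{\text{whole}}$  # Nodes  2708  3327  19717 
LGCN$_{\text{whole}}$  Accuracy  83.8 ± 0.5%  73.0 ± 0.6%  79.5 ± 0.2% 
LGCN$_{\text{whole}}$  Time  58s  30s  1080s 
LGCN$_{\text{sub}}$  # Nodes  644  442  354 
LGCN$_{\text{sub}}$  Accuracy  83.3 ± 0.5%  73.0 ± 0.6%  79.5 ± 0.2% 
LGCN$_{\text{sub}}$  Time  14s  3.6s  2.6s 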
It can be seen that the actual numbers of nodes in the training subgraph for the Cora, Citeseer, and Pubmed datasets are 644, 442, and 354, respectively, which are far smaller than the maximum subgraph size of 2,000. This indicates that the nodes in the Cora, Citeseer, and Pubmed datasets are sparsely connected. Specifically, starting from several initial nodes with training labels, only a small set of nodes will be selected by expanding neighboring nodes to form connected subgraphs. While these datasets are usually considered as a single large graph, the whole graph is actually composed of several separate subgraphs that have no connection to each other. The subgraph training strategy takes advantage of this fact and makes efficient use of the nodes with training labels. Since only the initial nodes have training labels and all their connectivity information is included in the selected subgraphs, the amount of information loss in the subgraph training is minimized, resulting in negligible performance loss. This is demonstrated by comparing the node classification accuracies of LGCN and LGCN. According to the results, LGCN models only have a subtle performance loss of 0.5% on the Cora dataset, while yielding the same performance on the Citeseer and Pubmed datasets, as compared to the LGCN models.
After investigating the risk of performance loss, we point out the advantages of the subgraph training strategy in terms of training speed. With subgraph training, LGCN models take a subgraph of fewer nodes as input instead of the whole graph, which is expected to greatly improve training efficiency. The results in Table 5 show that the improvement is substantial. Although GCNs require simpler computation, their running time is much longer than that of LGCN models with subgraph training on large-scale graph datasets such as Pubmed (38s versus 2.6s). Powerful deep models are usually used on large-scale data, which makes the subgraph training strategy useful in practice. The subgraph training strategy enables the use of more complex layers, such as the proposed LGCLs, without the concern of long training time. As a result, our large-scale LGCNs with the subgraph training strategy are not only effective but also very efficient.
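To make the size of the improvement concrete, the following short sketch computes speedup ratios from the training times reported in Table 5. The variable and function names are hypothetical. The only inputs are the reported times for GCN, for LGCN with whole-graph training, and for LGCN with subgraph training.

```python
# Training times in seconds for 100 epochs, copied from Table 5.
times_gcn = {"Cora": 7.0, "Citeseer": 4.0, "Pubmed": 38.0}
times_whole = {"Cora": 58.0, "Citeseer": 30.0, "Pubmed": 1080.0}
times_sub = {"Cora": 14.0, "Citeseer": 3.6, "Pubmed": 2.6}


def speedup(baseline, candidate):
    """Return baseline time divided by candidate time for each dataset."""
    return {name: baseline[name] / candidate[name] for name in baseline}


sub_vs_whole = speedup(times_whole, times_sub)
sub_vs_gcn = speedup(times_gcn, times_sub)

for name in times_sub:
    print(f"{name}: {sub_vs_whole[name]:.1f}x vs whole-graph LGCN, "
          f"{sub_vs_gcn[name]:.1f}x vs GCN")
```

Under these numbers, subgraph training is roughly 4 times faster than whole-graph LGCN training on Cora, about 8 times faster on Citeseer, and about 415 times faster on Pubmed. Compared to GCN, it is slower on Cora (a ratio of 0.5), roughly on par on Citeseer, and about 15 times faster on Pubmed, so the advantage grows with graph size.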
4.6. Performance Study of k
As described in Section 3.3, the average degree of nodes in a graph can be helpful when choosing the hyperparameter $k$ in LGCNs. In this part, we conduct experiments to show how different values of $k$ affect the performance of LGCN models. We vary the value of $k$ in LGCLs and observe the node classification accuracies on the Cora, Citeseer, and Pubmed datasets. The values of $k$ are selected from 2, 4, 8, 16, and 32, which cover a reasonable range of integer values.
Figure 5 plots the performance of LGCN models under different values of $k$. As shown in the figure, the LGCN models achieve the best performance on all three datasets when choosing $k = 8$. In the Cora, Citeseer, and Pubmed datasets, the average node degrees are 4, 5, and 6, respectively. This indicates that the best $k$ is usually a bit larger than the average node degree in the dataset. When $k$ is too large, the performance of LGCN models decreases. A possible explanation is that if $k$ is much larger than the average node degree in the graph, too much zero padding is used in the $k$-largest node selection process, which compromises the performance of the subsequent 1D CNN models. For the inductive learning task on the PPI dataset, we also explore different values of $k$. The best performance there is given by a $k$ larger than the average node degree of 31, which is consistent with our results above.
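The effect of zero padding can be illustrated with a small sketch. It assumes that the selection ranks neighbor values independently for each feature, keeps the $k$ largest, and fills missing slots with zeros when a node has fewer than $k$ neighbors. The function names, data layout, and toy graph are hypothetical, and this is not the authors' implementation.

```python
def k_largest_node_selection(features, adjacency, node, k):
    """Build a (k + 1) x C grid for `node`: its own feature vector first,
    then, per feature column, the k largest neighbor values in descending
    order. Columns are zero padded if the node has fewer than k neighbors.
    """
    num_features = len(features[node])
    neighbors = adjacency.get(node, [])
    columns = []
    for c in range(num_features):
        values = sorted((features[n][c] for n in neighbors), reverse=True)[:k]
        values += [0.0] * (k - len(values))  # zero padding
        columns.append(values)
    grid = [list(features[node])]
    for row in range(k):
        grid.append([columns[c][row] for c in range(num_features)])
    return grid


def padding_fraction(adjacency, k):
    """Fraction of all neighbor slots (k per node) that hold padded zeros."""
    padded = sum(max(k - len(nbrs), 0) for nbrs in adjacency.values())
    return padded / (k * len(adjacency))


# Hypothetical toy graph: node 0 has two neighbors, each with two features.
toy_features = {0: [1.0, 5.0], 1: [3.0, 2.0], 2: [2.0, 4.0]}
toy_adjacency = {0: [1, 2], 1: [0], 2: [0]}

grid = k_largest_node_selection(toy_features, toy_adjacency, node=0, k=3)
print(grid)  # [[1.0, 5.0], [3.0, 4.0], [2.0, 2.0], [0.0, 0.0]]

for k in (2, 8):
    print(k, round(padding_fraction(toy_adjacency, k), 3))
```

In this toy graph, one third of the neighbor slots are zeros at $k = 2$, while at $k = 8$ the fraction rises to about 0.83. Most of what the subsequent 1D convolution sees is then padding rather than neighbor information.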
5. Conclusions and Future Work
In this work, we propose the learnable graph convolutional layer (LGCL), which transforms generic graphs to data of grid-like structures and enables the use of regular convolutional operations. The transformation is conducted through a novel $k$-largest node selection process, which uses the ranking between node feature values. Based on our LGCL, we build deeper networks, known as learnable graph convolutional networks (LGCNs), for node classification tasks on graphs. Experimental results show that the proposed LGCN models yield consistently better performance than prior methods under both transductive and inductive learning settings. Our LGCN models achieve new state-of-the-art results on four different datasets, demonstrating the effectiveness of LGCLs.
In addition, we propose a subgraph selection algorithm, resulting in the subgraph training strategy, which addresses the excessive memory and computational requirements on large-scale graph data. With subgraph training, the proposed LGCN models are both effective and efficient. Our experiments indicate that the subgraph training strategy brings a significant advantage in training speed with a negligible performance loss. The new training strategy is very useful because it enables the efficient use of more complex models.
Based on this work, we discuss several possible directions for future work. First, our methods mainly address node classification problems. In practice, many other interesting tasks can be formulated as graph classification problems, where each graph has a label. While these tasks are similar to image classification, current graph convolutional methods, including ours, are not able to perform downsampling on graphs in the way pooling operations do on image data. We need a layer that reduces the number of nodes effectively, which is necessary for graph classification. Second, our methods are mainly applied to generic graph data such as citation networks. For other data such as text, our methods may also be helpful, since text data can be treated as graphs. We will explore these directions in the future.
Acknowledgements.
This work was supported in part by National Science Foundation grants DBI-1641223 and IIS-1633359 and by Defense Advanced Research Projects Agency grant N66001-17-2-4031.
References
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (2015).
 Chen et al. (2016) Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2016. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. Transactions on Pattern Analysis and Machine Intelligence (2016).
 Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. Syntax, Semantics and Structure in Statistical Translation (2014), 103.
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852.
 Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
 Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, and Yann N Dauphin. 2017. A convolutional encoder model for neural machine translation. Annual Meeting of the Association for Computational Linguistics (2017).
 Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 249–256.
 Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 855–864.
 Hamilton et al. (2017) William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NIPS.
 He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. IEEE International Conference on Computer Vision (2017).
 He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016a. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
 He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016b. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
 Huang et al. (2017) Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. 2017. Densely connected convolutional networks. IEEE Conference on Computer Vision and Pattern Recognition (2017).
 Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. The International Conference on Learning Representations (2015).
 Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (2017).
 Krizhevsky et al. (2012a) Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. 2012a. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger (Eds.). 1106–1114.
 Krizhevsky et al. (2012b) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012b. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
 Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. Conference on Empirical Methods in Natural Language Processing (2015).
 Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning. 2014–2023.
 Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 701–710.
 Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91–99.
 Sen et al. (2008) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. 2008. Collective classification in network data. AI magazine 29, 3 (2008), 93.
 Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
 Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper With Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 6000–6010.
 Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2017. Graph Attention Networks. arXiv preprint arXiv:1710.10903 (2017).
 Wang et al. (2012) Tao Wang, David J Wu, Adam Coates, and Andrew Y Ng. 2012. End-to-end text recognition with convolutional neural networks. In Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 3304–3308.
 Yang et al. (2016) Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. 2016. Revisiting semi-supervised learning with graph embeddings. International Conference on Machine Learning (2016).
 Zitnik and Leskovec (2017) Marinka Zitnik and Jure Leskovec. 2017. Predicting multicellular function through multilayer tissue networks. Bioinformatics 33, 14 (2017), i190–i198.