1 Introduction
Graph convolution networks (GCN) [kipf2016semi] generalize convolutional neural networks (CNN) [lecun1995convolutional] to graph data. Given a node in a graph, a GCN first aggregates the node embedding with its neighbor node embeddings, and then transforms the embedding through (hierarchical) feedforward propagation. The two core operations, i.e., aggregating and transforming node embeddings, take advantage of the graph structure and outperform structure-unaware alternatives [perozzi2014deepwalk, tang2015line, grover2016node2vec]. GCNs hence demonstrate prevailing success in many graph-based applications, including node classification [kipf2016semi], link prediction [zhang2018link] and graph classification [ying2018graph].

However, training GCNs remains difficult and is a hurdle to scaling them up further. How to train CNNs more efficiently has recently become a popular topic, pursued by bypassing unnecessary data or reducing expensive operations [wang2019e2, you2019drawing, jiang2019accelerating]. For GCNs, as the graph dataset grows, the large number of nodes and the potentially dense adjacency matrix prevent fitting everything into memory, which puts full-batch training algorithms (i.e., those requiring the full data and the whole adjacency matrix) in jeopardy. That motivates the development of mini-batch training algorithms, i.e., treating each node as a data point and updating locally. In each mini-batch, the embedding of a node at the $l$th layer is computed from the neighborhood node embeddings at the $(l-1)$th layer through the graph convolution operation. As the computation is performed recursively through all layers, the mini-batch complexity increases exponentially with the number of layers. To mitigate the complexity explosion, several sampling-based strategies have been adopted, e.g. GraphSAGE [hamilton2017inductive] and FastGCN [chen2018fastgcn], yet with few performance guarantees. VR-GCN [chen2017stochastic] reduces the sample size through variance reduction and guarantees convergence to the full-sample approach, but it requires storing the full-batch node embeddings of each layer in memory, limiting its efficiency gain. Cluster-GCN [chiang2019cluster] uses graph clustering to partition the large graph into subgraphs and performs subgraph-level mini-batch training, yet again with only empirical support.

In this paper, we propose a novel layer-wise training algorithm for GCNs, called L-GCN. The key idea is to decouple the two key operations in the per-layer feedforward graph convolution: feature aggregation (FA) and feature transformation (FT), whose concatenation and cascade result in the exponentially growing complexity. Surprisingly, the resulting greedy algorithm will not necessarily compromise the network representation capability, as shown by our theoretical analysis inspired by [xu2018powerful] using a graph isomorphism framework. To bypass extra hyperparameter tuning, we then introduce layer-wise and learned GCN training (L²-GCN), which learns a controller for each layer that can automatically adjust the training epochs per layer in L-GCN. Table 1 compares the training complexity between L-GCN, L²-GCN and existing competitive algorithms, demonstrating our approaches’ remarkable advantage in reducing both time and memory complexities. More experiments show that our proposed algorithms are significantly faster than state-of-the-art methods, with a consistent usage of GPU memory not dependent on dataset size, while maintaining comparable prediction performance. Our contributions can be summarized below:

A layer-wise training algorithm for GCNs with much lower time and memory complexities;

Theoretical justification that, under some sufficient conditions, the greedy algorithm does not compromise the graph-representative power;

Learned controllers that automatically configure layer-wise training epoch numbers, in place of manual hyperparameter tuning;

State-of-the-art performance achieved in addition to the light weight, on extensive applications.
2 Related Work
We follow [chiang2019cluster] to categorize existing GCN training algorithms into full-batch and mini-batch (stochastic) algorithms, and compare their pros and cons.
2.1 Full-Batch GCN Training
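The original GCN [kipf2016semi] adopted the full-batch gradient descent algorithm. Let us define an undirected graph as $G = (V, E)$, where $V = \{v_1, \dots, v_N\}$ represents the vertex set with $N$ nodes, and $E = \{e_1, \dots, e_{N_E}\}$ represents the edge set with $N_E$ edges: $e_n = (i, j)$ indicates an edge between vertices $v_i$ and $v_j$. $F \in \mathbb{R}^{N \times D}$ is the feature matrix with the feature dimension $D$, and $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix where $a_{ij} = a_{ji} = 1$ if $(i, j) \in E$ and $0$ otherwise. By constructing an $L$-layer GCN, we express the output $X^{(l)}$ of the $l$th layer and the network loss as:
$$X^{(l)} = \sigma\big(A' X^{(l-1)} W^{(l)}\big), \quad l \in \{1, \dots, L\}, \qquad \mathrm{Loss} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{Loss}\big(x^{(L)}_n W_C,\, y_n\big) \tag{1}$$
where $A'$ is the regularized adjacency matrix, $X^{(0)} = F$, $W^{(l)} \in \mathbb{R}^{D^{(l-1)} \times D^{(l)}}$ is the weight matrix, $D^{(0)} = D$, $\sigma(\cdot)$ is a nonlinear function, $W_C$ is the linear classification matrix, $y_n$ the training labels, and $\mathrm{Loss}(\cdot, \cdot)$ the loss function. For simplicity and without affecting the analysis, we set $D^{(1)} = \dots = D^{(L)} = D$.

For the time complexity of the network propagation in (1), $A' X^{(l-1)}$ costs $O(\|A'\|_0 D)$ in time and $X^{(l-1)} W^{(l)}$ costs $O(N D^2)$ in time, which in total leads to $O(L \|A'\|_0 D + L N D^2)$ time consumed for the entire network. For the memory complexity, storing the $L$ layer embeddings requires $O(L N D)$ in memory. Both time and memory complexities are proportional to $N$, which cannot scale up well for large graphs.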
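As a concrete illustration of the full-batch propagation in (1), the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that the regularized adjacency matrix is the symmetric normalization with self-loops used in [kipf2016semi] and that the nonlinearity is ReLU; the function names and the toy graph are hypothetical.

```python
import math


def normalize_adjacency(adj):
    """Assumed regularization: D^{-1/2} (A + I) D^{-1/2} on a dense matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def gcn_forward(adj_norm, features, weights):
    """Full-batch propagation: every layer touches all N nodes at once."""
    x = features
    for w in weights:
        x = relu(matmul(matmul(adj_norm, x), w))
    return x


# Toy path graph 0 - 1 - 2 with 2-dimensional features and two identity layers.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
adj_norm = normalize_adjacency(adj)
out = gcn_forward(adj_norm, features, [identity, identity])
```

Because each layer multiplies the whole embedding matrix by the adjacency matrix, both the work and the stored activations grow with the number of nodes, which is the scaling limitation described above.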
2.2 Mini-Batch SGD Algorithms
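The vanilla mini-batch SGD algorithm propagates the vertex representations in a mini-batch, rather than for all nodes. We rewrite the network propagation (1) for the $n$th node in the $l$th layer as:
$$x^{(l)}_n = \sigma\Big(\sum_{m \in \mathcal{N}(n)} a'_{nm}\, x^{(l-1)}_m W^{(l)}\Big) \tag{2}$$
where $x^{(l)}_n$ is the $l$th-layer representation of node $n$, $a'_{nm}$ is the $(n, m)$ entry of the regularized adjacency matrix, $W^{(l)}$ is the weight matrix, $\sigma$ is the nonlinear function, and $\mathcal{N}(n) = \{m : a'_{nm} \neq 0\}$. With (2), we can feed the feature matrix in a mini-batch dataloader and run the stochastic gradient descent (SGD) optimizer. Suppose $B$ is the mini-batch size and $S$ the neighborhood size; the time complexity for the propagation per mini-batch is then $O(B S^L D^2)$ and the memory complexity is $O(B S^L D + L D^2)$, with $L$ the number of layers and $D$ the feature dimension. We next discuss a few variants on top of the vanilla mini-batch algorithm: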
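To make the recursive cost of the vanilla mini-batch propagation tangible, the sketch below counts how many nodes must be loaded to compute the output of a single target node as the number of layers grows. It is only an illustration on a hypothetical toy graph (a complete binary tree); `receptive_field_sizes` is an assumed helper name, not part of any library.

```python
def receptive_field_sizes(neighbors, batch, num_layers):
    """Number of distinct nodes needed after 0, 1, ..., num_layers hops."""
    needed = set(batch)
    sizes = [len(needed)]
    for _ in range(num_layers):
        # Each extra layer pulls in the neighbors of everything gathered so far.
        needed = needed | {m for n in needed for m in neighbors[n]}
        sizes.append(len(needed))
    return sizes


# Hypothetical toy graph: a complete binary tree with 15 nodes, rooted at node 0.
neighbors = {i: [] for i in range(15)}
for parent in range(7):
    for child in (2 * parent + 1, 2 * parent + 2):
        neighbors[parent].append(child)
        neighbors[child].append(parent)

sizes = receptive_field_sizes(neighbors, batch=[0], num_layers=3)
print(sizes)  # [1, 3, 7, 15]
```

On this toy tree the node count roughly doubles with every layer, which is the exponential growth in the layer number that the sampling-based variants below try to contain.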

GraphSAGE & FastGCN. [hamilton2017inductive, chen2018fastgcn] Both adopt sampling schemes to reduce complexities. GraphSAGE proposes fixed-size sampling for the neighborhood in each layer. It still suffers from the “neighborhood expansion” problem, making its time and memory complexities grow exponentially with the layer number. FastGCN proposes global importance sampling rather than local neighborhood sampling, alleviating the complexity growth issue. Suppose is the sample size, the time and memory complexities are and for GraphSAGE, and and for FastGCN, respectively. Besides, FastGCN incurs extra complexity for importance weight computation.

VR-GCN. [chen2017stochastic] uses variance reduction to reduce the sample size in each layer, which achieves good performance with smaller graphs. Unfortunately, it requires storing all the intermediate vertex embeddings during training, which brings its memory complexity close to that of full-batch training. Suppose is the reduced sample size, the time and memory complexities of VR-GCN are and , respectively (plus some overhead for computing variance reduction).

Cluster-GCN. [chiang2019cluster] Instead of feeding nodes and their neighbors directly, [chiang2019cluster] first uses a graph clustering algorithm to partition the graph into subgraphs, and then runs the SGD optimizer over each subgraph. The performance of this approach hinges heavily on the chosen graph clustering algorithm. It is also difficult to ensure training stability, e.g., w.r.t. different clustering settings.
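The fixed-size neighborhood sampling idea mentioned for GraphSAGE can be sketched as follows. This is a simplified illustration using uniform sampling without replacement, not the released GraphSAGE code; the helper names and the star-shaped toy graph are assumptions made for the example.

```python
import random


def sample_neighbors(neighbors, node, sample_size, rng):
    """Keep at most `sample_size` uniformly sampled neighbors of `node`."""
    nbrs = neighbors[node]
    if len(nbrs) <= sample_size:
        return list(nbrs)
    return rng.sample(nbrs, sample_size)


def sampled_receptive_field(neighbors, batch, num_layers, sample_size, rng):
    """Nodes touched when every hop expands through sampled neighbors only."""
    frontier = set(batch)
    needed = set(batch)
    for _ in range(num_layers):
        next_frontier = set()
        for node in frontier:
            next_frontier.update(sample_neighbors(neighbors, node, sample_size, rng))
        frontier = next_frontier
        needed |= frontier
    return needed


# Hypothetical star graph: hub 0 connected to 20 leaves.
neighbors = {0: list(range(1, 21))}
for leaf in range(1, 21):
    neighbors[leaf] = [0]

rng = random.Random(0)
field = sampled_receptive_field(neighbors, batch=[0], num_layers=1, sample_size=5, rng=rng)
```

Sampling caps the per-hop fan-out at the sample size (here 5 instead of 20), but the cap still compounds over layers, which is why the growth remains exponential in the layer number for local sampling.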
3 Proposed Algorithm
To discuss the bottleneck of graph convolutional network (GCN) training algorithms, we first analyze the propagation of GCN following [wu2019simplifying] and factorize the propagation (1) into feature aggregation (FA) and feature transformation (FT).
Feature aggregation. To learn the node representation of the $l$th layer, in the first step GCN follows the neighborhood aggregation strategy: in the $l$th layer it updates the representation of each node by aggregating the representations of its neighbors, while the node's own representation is in turn aggregated into those of its neighbors. This is written as:
$$\bar{X}^{(l-1)} = A' X^{(l-1)} \tag{3}$$
where $A'$ is the regularized adjacency matrix, $X^{(l-1)}$ is the $(l-1)$th-layer representation, and $\bar{X}^{(l-1)}$ denotes its aggregated version. With (3), the time and memory complexity is highly dependent on the edge number, and in the mini-batch SGD algorithm it is highly dependent on the sample size. During mini-batch SGD training of an $L$-layer network, $L$ rounds of FA for each node require the representations of its $L$th-order neighbors, which results in sampling a large number of neighbor nodes. FA is therefore the main barrier to reducing the time and memory complexity of GCN in the mini-batch SGD algorithm.
Feature transformation. After FA, in the second step GCN conducts FT in the $l$th layer, which consists of linear and nonlinear transformations:
$$X^{(l)} = \sigma\big(\bar{X}^{(l-1)} W^{(l)}\big) \tag{4}$$
where $W^{(l)}$ is the weight matrix and $\sigma$ the nonlinear function. With (4), the complexity mainly depends on the feature dimension. $L$ rounds of FT for a node only require its own representation in each layer. Given the supervised node labels, the conventional training process for a GCN is formulated as:
(5) 
For the entire propagation of a mini-batch SGD over an $L$-layer GCN, there are $L$ rounds of FA and FT in each batch, as shown in Figure 2(a). Without FT, $L$ rounds of FA can aggregate the structure information, but they lack representation learning and are still time- and memory-consuming. Without FA, $L$ rounds of FT are no more than a multilayer perceptron (MLP), which efficiently learns the representation but lacks structure information.
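The split of one graph convolution into FA and FT can be sketched in a few lines of pure Python. The example is a hedged illustration: the small matrices are hypothetical, ReLU is assumed as the nonlinearity, and the adjacency weights are made up rather than derived from a real normalization.

```python
def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def feature_aggregation(adj_norm, x):
    """FA: mix each node's row with its neighbors' rows (graph-dependent)."""
    return matmul(adj_norm, x)


def feature_transformation(x, w):
    """FT: linear map plus ReLU, applied to every row independently."""
    return [[max(0.0, v) for v in row] for row in matmul(x, w)]


# Hypothetical aggregation weights, features and layer weights.
adj_norm = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = [[1.0, -1.0], [0.5, 0.5]]

gcn_layer = feature_transformation(feature_aggregation(adj_norm, x), w)  # FA then FT
mlp_layer = feature_transformation(x, w)                                 # FT only

# Perturb node 1: FT-only output for node 0 is unaffected, FA+FT output is not.
x_changed = [row[:] for row in x]
x_changed[1] = [30.0, 40.0]
mlp_changed = feature_transformation(x_changed, w)
gcn_changed = feature_transformation(feature_aggregation(adj_norm, x_changed), w)
```

The perturbation at the end shows the two roles: FT alone behaves like an MLP that ignores the graph, whereas FA is what couples a node to its neighbors and therefore what forces neighbor sampling in mini-batch training.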
3.1 L-GCN: Layer-Wise GCN Training
As described earlier, the one-batch propagation of the conventional training for an $L$-layer GCN consists of $L$ rounds of feature aggregation (FA) and feature transformation (FT). Both FA and FT are necessary for capturing graph structures and learning data representations, but the coupling between the two leads to inefficient training. We therefore propose a layer-wise training algorithm (L-GCN) to properly separate the FA and FT processes while training the GCN layer by layer.
We illustrate the L-GCN algorithm in Figure 2(b). For training the $l$th layer, we do FA once for all the $(l-1)$th-layer vertex representations, aggregating their $l$th-order structure information, and then feed the vertex embeddings into a single-layer perceptron and run the mini-batch SGD optimizer for a number of batches. The $l$th layer is trained by solving
(6) 
Note that depends on . After finishing the $l$th-layer training, we save the weight matrix between the current input layer and hidden layer as the weight matrix of the $l$th layer, drop the weight matrices between hidden layer and output layer unless $l = L$, and calculate the $l$th-layer representations. This process is repeated until all layers are trained.
The time and memory complexities are significantly lower compared to the conventional training and the state of the art, as shown in Table 1. For the time complexity, L-GCN only conducts FA $L$ times in the entire training process and FT once per batch, whereas the conventional mini-batch training conducts FA $L$ times and FT $L$ times in each batch. Suppose that the total training batch number is , the time complexity of L-GCN is . The memory complexity is since L-GCN only trains a single-layer perceptron in each batch.
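The layer-by-layer procedure described above can be outlined as a short loop. This is a structural sketch only, under stated assumptions: `train_single_layer` is a hypothetical callback standing in for the mini-batch SGD training of one single-layer perceptron, and `identity_trainer` is a dummy used to keep the example runnable; neither is the authors' code.

```python
def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def train_layerwise(adj_norm, features, num_layers, train_single_layer):
    """Greedy layer-wise training: one FA per layer, then train that layer alone."""
    x = features
    weights = []
    fa_calls = 0
    for layer in range(num_layers):
        x_agg = matmul(adj_norm, x)            # FA: done once for this layer
        fa_calls += 1
        w = train_single_layer(x_agg, layer)   # hypothetical per-layer SGD training
        weights.append(w)                      # keep the input-to-hidden weights
        x = relu(matmul(x_agg, w))             # FT: representations for the next layer
    return weights, x, fa_calls


def identity_trainer(x_agg, layer):
    """Dummy stand-in that 'learns' an identity weight matrix."""
    d = len(x_agg[0])
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


adj_norm = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, final_x, fa_calls = train_layerwise(adj_norm, features, 3, identity_trainer)
```

The counter makes the complexity argument visible: the number of aggregation passes equals the number of layers and does not depend on how many batches the per-layer trainer consumes.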
3.2 Theoretical Justification of L-GCN
We set out to answer the following question theoretically for L-GCN: how close can the performance of a layer-wise trained GCN be to that of a conventionally trained GCN? To establish the theoretical background of our layer-wise training algorithm, we follow the work of Xu et al. [xu2018powerful] and show that a layer-wise trained GCN can be as powerful as a conventionally trained GCN under certain conditions.
In [xu2018powerful], the representation power of an aggregation-based graph neural network (GNN) is evaluated, when the input feature space is countable, as the ability to map any two different nodes into different embeddings. The evaluation of the representation power is extended to the ability to map any two non-isomorphic graphs into different embeddings, where the graphs are generated as the rooted subtrees of the corresponding nodes. An $L$-layer GNN (excluding the linear classifier described earlier) can be represented [xu2018powerful] as:
(7) 
where is the vertexwise aggregating mapping , is the multiset of dimension , and is the readout mapping as:
(8) 
where is the set of node neighbors for the node.
Since GCN belongs to the family of aggregation-based GNNs, we use the same graph isomorphism framework for our analysis. Xu et al. [xu2018powerful] showed that the power of GNNs is upper-bounded by the Weisfeiler-Lehman graph isomorphism test (WL test) [weisfeiler1968reduction], and proved sufficient conditions for a GNN to be as powerful as the WL test, which are described in the following lemma and theorem.
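For readers unfamiliar with the WL test, the following is a compact sketch of 1-dimensional WL color refinement on unlabeled graphs. It is a textbook-style illustration, not code from [xu2018powerful]; the function names and the example graphs are chosen here for demonstration, and colors are kept as nested tuples so that they are directly comparable across graphs.

```python
from collections import Counter


def wl_color_histogram(neighbors, num_iterations):
    """Multiset of node colors after `num_iterations` rounds of refinement."""
    colors = {v: () for v in neighbors}  # all nodes start with the same color
    for _ in range(num_iterations):
        # New color = own color plus the sorted multiset of neighbor colors.
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in neighbors[v])))
            for v in neighbors
        }
    return Counter(colors.values())


def wl_distinguishes(g1, g2, num_iterations=3):
    """True if the WL test decides the two graphs are not isomorphic."""
    return wl_color_histogram(g1, num_iterations) != wl_color_histogram(g2, num_iterations)


path3 = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

print(wl_distinguishes(path3, triangle))         # True
print(wl_distinguishes(two_triangles, hexagon))  # False
```

The second pair is non-isomorphic yet indistinguishable to the WL test because every node is 2-regular in both graphs, which illustrates why the WL test is an upper bound rather than a perfect isomorphism oracle.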
Lemma 1. [xu2018powerful] Let and be any two nonisomorphic graphs, i.e. . If a GNN maps and into different embeddings, the WL test also decide and are not isomorphic.
Theorem 2. [xu2018powerful] Let be a GNN with sufficient number of GNN layers, maps any graphs and that the WL test of isomorphism decides as nonisomorphic, to different embeddings if the following conditions hold: a) The mappings are injective. b) The readout mapping is injective.
We further propose to use the graph isomorphism framework to characterize the “power” of a GNN. In this framework, we observe the fact that for an aggregationbased GNN (such as GCN), with any pair of isomorphic graphs and , we always have due to the identical input and aggregationbased mapping. In contrast, for any pair of nonisomorphic graphs and
, there exists certain probability
that wrongly maps them into identical embeddings, i.e. , as shown in Table 2. Therefore, to further analyze our algorithm, we first define a specific metric to evaluate the capacity of a GNN, as the probability of mapping any non-isomorphic graphs into different embeddings.
Definition 3. Let be a GNN; and are i.i.d. The capacity of , , is defined as the probability to map and into different embeddings if they are nonisomorphic:
(9) 
Higher capacity of a GNN indicates its stronger distinguishing capability between nonisomorphic graphs, which corresponds to more power in graph isomorphism framework. In other words, not so powerful network will have a higher probability to map nonisomorphic graphs into the same embeddings and fail to distinguish them. With Theorem 2 and Definition 3, we have , i.e. the capacity of WL test is the upper bound of the capacity of any aggregationbased GNN. Intuitively, with the metric to evaluate the network power, we further define the training process as the problem of optimizing the network capacity.
Definition 4. Let a GNN with a fixed injective readout function , and are i.i.d. The training process for is formulated as:
(10) 
Therefore, when training the network, the optimizer tries to find the best layer mapping for GNN to map nonisomorphic graphs into different embeddings as much as possible. With training process in Definition 4, we formulate the greedy layerwise training for as:
(11) 
In the following theorem, we provide a sufficient condition for a network trained layer-wise (11) to achieve the same capacity as one trained end to end (10).
Theorem 5. Let be a GNN with a fixed injective readout function . If can be conventionally trained by solving the optimization problem (10) and the resulting is as powerful as the WL test given the conditions in Theorem 2, then can also be layerwise trained by solving the optimization problem (11) with the resulting achieving the same capacity.
We provide the proof in the appendix. For a network architecture that is already powerful enough under conventional training, we can train it to achieve the same capacity through layer-wise training. The idea of the proof is as follows: if an injective mapping exists for each layer, so that the conditions in Theorem 2 are satisfied, we can show that the layer-wise optimization problem (11) also finds an injective mapping. Otherwise, when the network architecture cannot be powerful enough through conventional training, the following theorem establishes that the layer-wise trained network has non-decreasing capacity as the layer number increases.
Theorem 6. Let GNN with a fixed injective readout function , and are i.i.d., and . With layerwise training, if is not guaranteed to be injective for , but it still can distinguish different , i.e. if , then , then we have that the capacity of the network is monotonically nondecreasing with deeper layers:
(12) 
We again direct readers to the appendix for the proof. The theorem indicates that, if the network architecture is not powerful enough through conventional training, we can try to increase its capacity by training a deeper network. Layer-wise training can also train deeper GCNs more efficiently than state-of-the-art methods.
What remains challenging is that the network capacity is not available in an analytical form with regard to network parameters. In this study, we use the cross entropy as the loss function in classification tasks. Further development of loss functions is needed in the future.
3.3 L²-GCN: Training with Learned Controllers
One challenge in applying the layer-wise training algorithm to graph convolutional networks (L-GCN) is that one may need to manually adjust the training epochs for each layer. A possible solution is early stopping; however, it does not work well in L-GCN since the training loss in each layer is not comparable with the final validation loss. Motivated by learning to optimize [andrychowicz2016learning, chen2017learning, li2016learning, cao2019learning], we propose L²-GCN, which trains a learned RNN controller to decide when to stop each layer's training via policy-based REINFORCE [williams1992simple]. The algorithm is illustrated in Figure 3.
Specifically, we model the training process for L-GCN as a Markov Decision Process (MDP) defined as follows: i) Action. The action at time $t$ for the RNN controller is the decision on whether or not to stop the current-layer training. ii) State. The state at time $t$ is the loss in the current epoch, the layer index, and the hidden state of the RNN controller at time $t-1$. iii) Reward. The purpose of the RNN controller is to train the network efficiently with competitive performance, and therefore the nonzero reward is only received at the end of the MDP as the weighted sum of final loss and total training epochs (time complexity). iv) Terminal. Once L-GCN finishes the $L$-layer training, the process terminates.

Given the above settings, a sample trajectory from the MDP will be: . The detailed architecture of the RNN controller is shown in Figure 4. For each time step, the RNN outputs a hidden vector, which is decoded and classified by its corresponding softmax classifier. The RNN controller works in an autoregressive way, where the output of the last step is fed into the next step. In L²-GCN, each time step's output is sampled to decide whether to stop or not. When terminated, a final reward is fed to the controller to update its weights.
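The stopping decision can be illustrated with a deliberately simplified REINFORCE loop. To keep the example short and self-contained, a two-parameter logistic policy is swapped in for the RNN controller, the training loss is simulated by an assumed exponential decay, and a single layer is considered; all names, constants and the loss curve are hypothetical and only the reward structure (final loss plus weighted epoch count, received at the end) follows the description above.

```python
import math
import random


def stop_probability(theta, step, max_steps):
    """Logistic policy (stand-in for the RNN): probability of stopping now."""
    z = theta[0] + theta[1] * (step / max_steps)
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def run_episode(theta, rng, max_steps=50, time_weight=0.01):
    """Sample stop/continue decisions until the policy stops or steps run out."""
    grad_log_prob = [0.0, 0.0]
    steps = 0
    for step in range(max_steps):
        steps = step + 1
        p_stop = stop_probability(theta, step, max_steps)
        action = 1.0 if rng.random() < p_stop else 0.0
        # Gradient of log Bernoulli(action; p_stop) w.r.t. theta (clamp ignored).
        grad_log_prob[0] += action - p_stop
        grad_log_prob[1] += (action - p_stop) * (step / max_steps)
        if action == 1.0:
            break
    final_loss = math.exp(-0.1 * steps)            # simulated loss curve (assumption)
    reward = -(final_loss + time_weight * steps)   # only received at the end
    return reward, grad_log_prob, steps


def train_controller(num_episodes=200, learning_rate=0.05, seed=0):
    """REINFORCE with a moving-average baseline."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    baseline = 0.0
    for _ in range(num_episodes):
        reward, grad, _ = run_episode(theta, rng)
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        theta = [t + learning_rate * advantage * g for t, g in zip(theta, grad)]
    return theta


theta = train_controller()
```

The weight on the epoch count plays the role of the trade-off between accuracy and training time: a larger `time_weight` pushes the policy toward stopping earlier.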
4 Experiments
In this section, we evaluate the predictive performance, training time, and GPU memory usage of our proposed L-GCN and L²-GCN on single- and multi-class classification tasks for six increasingly larger datasets: Cora & PubMed [kipf2016semi], PPI & Reddit [hamilton2017inductive], and Amazon670K & Amazon3M [chiang2019cluster], as summarized in the appendix. For Amazon670K & Amazon3M, we use principal component analysis [hotelling1933analysis] to reduce the feature dimension down to 100, and use the top-level categories as the class labels. The train/validate/test split follows the conventional setting for the inductive supervised learning scenario. We implemented our proposed algorithm in PyTorch [paszke2017automatic]: for layer-wise training, we use the Adam optimizer with a learning rate of 0.001 for Cora & PubMed, and 0.0001 for PPI, Reddit, Amazon670K and Amazon3M; for the RNN controller, we set the controller to make a stopping-or-not decision every 10 epochs (5 for Cora and 50 for PPI), use the controller architecture as in [gong2019autogan] and the Adam optimizer with a learning rate of 0.05. All the experiments are conducted on a machine with a GeForce GTX 1080 Ti GPU (11 GB memory), an 8-core Intel i7-9800X CPU (3.80 GHz) and 16 GB of RAM.
4.1 Comparison with State of the Arts
To demonstrate the efficiency and performance of our proposed algorithms, we compare them with state-of-the-art methods in Table 3. We compare L-GCN and L²-GCN to the state-of-the-art GCN mini-batch training algorithms GraphSAGE [hamilton2017inductive], FastGCN [chen2018fastgcn] and VR-GCN [chen2017stochastic], using their originally released code and published settings, except that the batch size and the embedding dimension of hidden layers are kept the same in all methods to ensure fair comparisons. Specifically, we set the batch size to 256 for Cora and 1024 for others, and we set the embedding dimension of hidden layers to 16 for Cora & PubMed, 512 for PPI and 128 for others. We do not compare the controller with other hyperparameter tuning methods since the controller is widely used in many fields such as neural architecture search [gong2019autogan].
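For convenience, the per-dataset settings stated above can be collected in one place. The snippet below is merely a bookkeeping sketch of those reported values; the function name `experiment_config` and the dictionary keys are hypothetical and do not correspond to any released configuration file.

```python
DATASETS = ("Cora", "PubMed", "PPI", "Reddit", "Amazon670K", "Amazon3M")


def experiment_config(dataset):
    """Return the hyperparameters reported in the text for one dataset."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    small = dataset in ("Cora", "PubMed")
    return {
        "optimizer": "Adam",
        # Layer-wise training learning rate.
        "layer_lr": 0.001 if small else 0.0001,
        # RNN controller learning rate.
        "controller_lr": 0.05,
        # Epochs between two stopping-or-not decisions of the controller.
        "controller_decision_interval": {"Cora": 5, "PPI": 50}.get(dataset, 10),
        "batch_size": 256 if dataset == "Cora" else 1024,
        "hidden_dim": 16 if small else (512 if dataset == "PPI" else 128),
    }


for name in DATASETS:
    print(name, experiment_config(name))
```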
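| | GraphSAGE [hamilton2017inductive] | | | FastGCN [chen2018fastgcn] | | | VR-GCN [chen2017stochastic] | | | L-GCN | | | L²-GCN | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | F1 (%) | Time | Memory | F1 (%) | Time | Memory | F1 (%) | Time | Memory | F1 (%) | Time | Memory | F1 (%) | Time | Memory |
| Cora | 85.0 | 18s | 655M | 85.5 | 6.02s | 659M | 85.4 | 5.47s | 253M | 84.7 | 0.45s | 619M | 84.1 | 0.38s | 619M |
| PubMed | 86.5 | 483s | 675M | 87.4 | 32s | 851M | 86.4 | 118s | 375M | 86.8 | 2.93s | 619M | 85.8 | 1.50s | 631M |
| PPI | 68.8 | 402s | 849M | - | - | - | 98.6 | 63s | 759M | 97.2 | 49s | 629M | 96.8 | 26s | 631M |
| Reddit | 93.4 | 998s | 4343M | 92.6 | 761s | 4429M | 96.0 | 201s | 1271M | 94.2 | 44s | 621M | 94.0 | 34s | 635M |
| Amazon670K | 83.1 | 2153s | 849M | 76.1 | 548s | 1621M | 92.7 | 534s | 625M | 91.6 | 54s | 601M | 91.2 | 30s | 613M |
| Amazon3M | - | - | - | - | - | - | 88.3 | 2165s | 625M | 88.4 | 203s | 601M | 88.4 | 125s | 613M |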
On four common datasets, Cora, PubMed, PPI and Reddit, we demonstrate that our proposed algorithm L-GCN is significantly faster than state-of-the-art methods, with a consistent usage of GPU memory not dependent on dataset size, while maintaining comparable prediction performance. With a learned controller to make the stopping decision, L²-GCN can further reduce the training time (here we do not include search time) by up to about half with tiny performance loss compared to L-GCN. For super large datasets, GraphSAGE and FastGCN fail to converge on Amazon670K, and exceed the time limit on Amazon3M in our experiment, whereas VR-GCN achieves good performance after long training. Our proposed algorithms still stably and efficiently achieve comparable performance on both Amazon670K and Amazon3M.
We did not include in Table 1 the time spent on hyperparameter tuning (search) for any algorithm. Such a comparison was impossible as search time was not accessible for pretrained state-of-the-art models. Although typical controller learning can be expensive (as reported in Table 4), RNN controllers in L²-GCN learned over (especially large) datasets can be “transferable” (shown next), and L²-GCN without controller retraining actually saves time compared to dataset-specific manual tuning. As to the memory usage, the trends in practical GPU memory usage during training did not entirely agree with those in the theoretical analyses (Table 1). We conjecture that the discrepancy more likely stems from implementation: other models were implemented in TensorFlow and ours in PyTorch, and the possible CPU memory usage of some models was unclear.
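The timing claims can be checked directly against the reported training times. The sketch below transcribes the Time columns of the comparison table (the row whose dataset label is missing in the extracted table is assumed to be Reddit, and missing entries are stored as `None`) and derives two hypothetical summary quantities: the speedup of L-GCN over the fastest baseline, and the ratio of L²-GCN time to L-GCN time.

```python
# Training times in seconds, transcribed from the comparison table.
times = {
    "Cora":       {"GraphSAGE": 18.0,   "FastGCN": 6.02,  "VR-GCN": 5.47,   "L-GCN": 0.45,  "L2-GCN": 0.38},
    "PubMed":     {"GraphSAGE": 483.0,  "FastGCN": 32.0,  "VR-GCN": 118.0,  "L-GCN": 2.93,  "L2-GCN": 1.50},
    "PPI":        {"GraphSAGE": 402.0,  "FastGCN": None,  "VR-GCN": 63.0,   "L-GCN": 49.0,  "L2-GCN": 26.0},
    "Reddit":     {"GraphSAGE": 998.0,  "FastGCN": 761.0, "VR-GCN": 201.0,  "L-GCN": 44.0,  "L2-GCN": 34.0},
    "Amazon670K": {"GraphSAGE": 2153.0, "FastGCN": 548.0, "VR-GCN": 534.0,  "L-GCN": 54.0,  "L2-GCN": 30.0},
    "Amazon3M":   {"GraphSAGE": None,   "FastGCN": None,  "VR-GCN": 2165.0, "L-GCN": 203.0, "L2-GCN": 125.0},
}
BASELINES = ("GraphSAGE", "FastGCN", "VR-GCN")

# Speedup of L-GCN over the fastest baseline that reported a time.
speedup_over_best_baseline = {
    name: min(row[b] for b in BASELINES if row[b] is not None) / row["L-GCN"]
    for name, row in times.items()
}

# Fraction of L-GCN training time that L2-GCN needs (search time excluded).
controller_time_ratio = {name: row["L2-GCN"] / row["L-GCN"] for name, row in times.items()}

for name in times:
    print(f"{name}: {speedup_over_best_baseline[name]:.1f}x faster than best baseline, "
          f"L2-GCN/L-GCN time ratio {controller_time_ratio[name]:.2f}")
```

Under these numbers the controller's saving varies by dataset, from a modest reduction on Cora to roughly half on PubMed and PPI.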
4.2 Ablation Study
Transferability. We explore the transferability of the learned controller. Results in Table 4 show that the controller learned from larger datasets could be reused for smaller ones (with similar loss functions) and thus save search time.
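| | Cora | | | PubMed | | |
|---|---|---|---|---|---|---|
| | F1 (%) | Train | Search | F1 (%) | Train | Search |
| Controller-Cora | 84.1 | 0.38s | 16s | - | - | - |
| Controller-PubMed | 84.3 | 0.36s | 0s | 85.8 | 1.50s | 125s |
| Controller-Amazon3M | 84.8 | 0.43s | 0s | 86.3 | 2.43s | 0s |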
Epoch configuration. We consider the influence of different epoch configurations in layer-wise training on performance on six datasets. Table 5 shows that training with different epoch numbers in different layers affects the final performance. For layer-wise training (L-GCN), we configure different numbers of epochs for the two layers of our GCN as reported in Table 5. For layer-wise training with learning to optimize (L²-GCN), we let the RNN controller learn the epoch configuration from randomly sampled subgraphs as training data and report the automatically learned epoch numbers. Experimental results show that, when trained with more epochs for each layer, L-GCN improves performance except on Cora. Moreover, with learning to optimize, the RNN controller in L²-GCN automatically learns epoch configurations with tiny performance loss but far fewer epochs. Figure 5 compares the loss curves of layer-wise training under various configurations and over various datasets.
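| | Cora | | PubMed | | PPI | |
|---|---|---|---|---|---|---|
| | F1 (%) | Epoch | F1 (%) | Epoch | F1 (%) | Epoch |
| L-GCN-Config1 | 83.2 | 60+60 | 86.8 | 100+100 | 93.7 | 400+400 |
| L-GCN-Config2 | 84.7 | 80+80 | 86.3 | 120+120 | 94.1 | 500+500 |
| L-GCN-Config3 | 83.0 | 100+100 | 86.4 | 140+140 | 94.9 | 600+600 |
| L²-GCN | 84.1 | 75+75 | 85.8 | 30+60 | 94.1 | 300+350 |

| | Reddit | | Amazon670K | | Amazon3M | |
|---|---|---|---|---|---|---|
| | F1 (%) | Epoch | F1 (%) | Epoch | F1 (%) | Epoch |
| L-GCN-Config1 | 93.0 | 60+60 | 91.4 | 60+60 | 88.2 | 60+60 |
| L-GCN-Config2 | 93.5 | 80+80 | 91.6 | 80+80 | 88.4 | 80+80 |
| L-GCN-Config3 | 93.8 | 100+100 | 91.7 | 100+100 | 88.3 | 100+100 |
| L²-GCN | 92.2 | 30+60 | 91.2 | 70+30 | 88.0 | 20+80 |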
Deeper networks. We evaluate the necessity of training a deeper network using layer-wise training. Previous attempts seem to suggest the usefulness of training deeper GCNs [kipf2016semi]. However, the datasets used in the experiments there are not large enough to draw a definite conclusion. Here we conduct experiments on three large datasets, PPI, Reddit and Amazon3M, with monotonically increasing layer number and total training epochs (each layer is trained for the same number of epochs), as shown in Table 6. Experimental results show that, with more network layers, the prediction performance of layer-wise training improves. Compared with the 2-layer network, the 4-layer L-GCN gains a performance increase of 4.0, 0.7, and 0.8 (%) on PPI, Reddit and Amazon3M, respectively. With learning to optimize, the RNN controller learns a more efficient epoch configuration while still achieving performance comparable to manually set epoch configurations.
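| | | PPI | | Reddit | | Amazon3M | |
|---|---|---|---|---|---|---|---|
| | | F1 (%) | Epoch | F1 (%) | Epoch | F1 (%) | Epoch |
| 2-layer | L-GCN | 93.7 | 800 | 93.8 | 200 | 88.4 | 160 |
| | L²-GCN | 94.1 | 750 | 92.2 | 90 | 88.0 | 100 |
| 3-layer | L-GCN | 97.2 | 1200 | 94.2 | 300 | 89.0 | 240 |
| | L²-GCN | 96.8 | 650 | 94.0 | 210 | 88.7 | 120 |
| 4-layer | L-GCN | 97.7 | 1600 | 94.5 | 400 | 89.2 | 320 |
| | L²-GCN | 97.3 | 1100 | 94.3 | 250 | 89.0 | 170 |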
Therefore, in layer-wise training, we have shown that deeper networks can have better empirical performance, consistent with the theoretical, non-decreasing network capacity of deeper networks shown in Theorem 6.
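The reported gains can be reproduced from the F1 scores of the layer-wise trained networks at each depth. The snippet below is a small arithmetic check on values transcribed from the depth table (the middle dataset column, whose label is missing in the extracted table, is assumed to be Reddit as stated in the text); the variable names are hypothetical.

```python
# F1 (%) of the layer-wise trained network at depths 2, 3 and 4.
f1_by_depth = {
    "PPI":      {2: 93.7, 3: 97.2, 4: 97.7},
    "Reddit":   {2: 93.8, 3: 94.2, 4: 94.5},
    "Amazon3M": {2: 88.4, 3: 89.0, 4: 89.2},
}

# Gain of the 4-layer network over the 2-layer network, in percentage points.
gains = {name: round(scores[4] - scores[2], 1) for name, scores in f1_by_depth.items()}

# Empirical counterpart of the non-decreasing trend: F1 never drops with depth.
non_decreasing = {
    name: scores[2] <= scores[3] <= scores[4] for name, scores in f1_by_depth.items()
}

print(gains)           # {'PPI': 4.0, 'Reddit': 0.7, 'Amazon3M': 0.8}
print(non_decreasing)  # {'PPI': True, 'Reddit': True, 'Amazon3M': True}
```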
Applying layer-wise training to N-GCN. We also apply layer-wise training to N-GCN [abu2018n], a recent GCN extension. It consists of several GCNs over multiple scales, so layer-wise training is applied to each GCN individually. Results in Table 7 show that with layer-wise training, N-GCN is significantly faster with comparable performance.
| | Conventional Training | | Layer-Wise Training | |
|---|---|---|---|---|
| | F1 (%) | Time | F1 (%) | Time |
| N-GCN | 83.6 | 62s | 83.1 | 4s |
5 Conclusions
In this paper, we propose novel and efficient layer-wise training algorithms for GCN (L-GCN), which separate feature aggregation and feature transformation during training and greatly reduce the complexity. Besides, we analyze theoretical grounds to rationalize the power of L-GCN in the graph isomorphism framework, provide a sufficient condition under which L-GCN can be as powerful as conventional training, and prove that the capacity of L-GCN is non-decreasing as networks get deeper with more layers. Numerical results further support our theoretical analysis: our proposed algorithm L-GCN is significantly faster than state-of-the-art methods, with a consistent usage of GPU memory not dependent on dataset size, while maintaining comparable prediction performance. Finally, motivated by learning to optimize, we propose L²-GCN, which designs an RNN controller to make the stopping decision for each layer's training and trains it to learn this decision, in place of manually configuring the training epochs. With the learned controller making the stopping decision, L²-GCN further reduces the training time by up to about half with tiny performance loss, compared to L-GCN.