1. Introduction
Graph convolutional network (GCN) (Kipf and Welling, 2017) has become increasingly popular in addressing many graph-based applications, including semi-supervised node classification (Kipf and Welling, 2017), link prediction (Zhang and Chen, 2018) and recommender systems (Ying et al., 2018). Given a graph, GCN uses a graph convolution operation to obtain node embeddings layer by layer: at each layer, the embedding of a node is obtained by gathering the embeddings of its neighbors, followed by one or a few layers of linear transformations and nonlinear activations. The final layer embedding is then used for some end tasks. For instance, in node classification problems, the final layer embedding is passed to a classifier to predict node labels, and thus the parameters of GCN can be trained in an end-to-end manner.
Since the graph convolution operator in GCN needs to propagate embeddings using the interaction between nodes in the graph, training is quite challenging. Unlike other neural networks, where the training loss can be perfectly decomposed into individual terms on each sample, the loss term in GCN (e.g., classification loss on a single node) depends on a huge number of other nodes, especially when GCN goes deep. Due to this node dependence, GCN's training is very slow and requires lots of memory, because backpropagation needs to store all the embeddings in the computation graph in GPU memory.
Previous GCN Training Algorithms: To demonstrate the need for developing a scalable GCN training algorithm, we first discuss the pros and cons of existing approaches, in terms of 1) memory requirement (here we consider the memory for storing node embeddings, which is dense and usually dominates the overall memory usage for deep GCN), 2) time per epoch (an epoch means a complete data pass) and 3) convergence speed (loss reduction) per epoch. These three factors are crucial for evaluating a training algorithm. Note that the memory requirement directly restricts the scalability of an algorithm, and the latter two factors combined determine the training speed. In the following discussion we denote $N$ to be the number of nodes in the graph, $F$ the embedding dimension, and $L$ the number of layers to analyze classic GCN training algorithms.
Full-batch gradient descent is proposed in the first GCN paper (Kipf and Welling, 2017). To compute the full gradient, it requires storing all the intermediate embeddings, leading to $O(NFL)$ memory requirement, which is not scalable. Furthermore, although the time per epoch is efficient, the convergence of gradient descent is slow since the parameters are updated only once per epoch.
memory: bad; time per epoch: good; convergence: bad 
Mini-batch SGD is proposed in (Hamilton et al., 2017). Since each update is only based on a mini-batch gradient, it can reduce the memory requirement and conduct many updates per epoch, leading to faster convergence. However, mini-batch SGD introduces a significant computational overhead due to the neighborhood expansion problem: to compute the loss on a single node at layer $L$, it requires that node's neighbor nodes' embeddings at layer $L-1$, which again requires their neighbors' embeddings at layer $L-2$, and so on recursively in the downstream layers. This leads to time complexity exponential in the GCN depth. GraphSAGE (Hamilton et al., 2017) proposed to use a fixed size of neighborhood samples during backpropagation through layers and FastGCN (Chen et al., 2018a) proposed importance sampling, but the overhead of these methods is still large and becomes worse when GCN goes deep.
memory: good; time per epoch: bad; convergence: good 
VR-GCN (Chen et al., 2018b) proposes to use a variance reduction technique to reduce the size of neighborhood sampling nodes. Despite successfully reducing the size of samplings (in our experiments VR-GCN with only 2 samples per node works quite well), it requires storing all the intermediate embeddings of all the nodes in memory, leading to $O(NFL)$ memory requirement. If the number of nodes in the graph increases to millions, the memory requirement for VR-GCN may be too high to fit into GPU.
memory: bad; time per epoch: good; convergence: good.
In this paper, we propose a novel GCN training algorithm by exploiting the graph clustering structure. We find that the efficiency of a mini-batch algorithm can be characterized by the notion of “embedding utilization”, which is proportional to the number of links between nodes in one batch, i.e., within-batch links. This finding motivates us to design the batches using graph clustering algorithms that aim to construct partitions of nodes so that there are more graph links between nodes in the same partition than between nodes in different partitions. Based on the graph clustering idea, we propose Cluster-GCN, an algorithm that designs the batches based on efficient graph clustering algorithms (e.g., METIS (Karypis and Kumar, 1998)). We take this idea further by proposing a stochastic multi-clustering framework to improve the convergence of Cluster-GCN. Our strategy leads to huge memory and computational benefits. In terms of memory, we only need to store the node embeddings within the current batch, which is $O(bFL)$ with the batch size $b$. This is significantly better than VR-GCN and full gradient descent, and slightly better than other SGD-based approaches. In terms of computational complexity, our algorithm achieves the same time cost per epoch as gradient descent and is much faster than neighborhood searching approaches. In terms of convergence speed, our algorithm is competitive with other SGD-based approaches. Finally, our algorithm is simple to implement since we only compute matrix multiplications and no neighborhood sampling is needed. Therefore for Cluster-GCN, we have memory: good; time per epoch: good; convergence: good.
We conducted comprehensive experiments on several large-scale graph datasets and made the following contributions:

Cluster-GCN achieves the best memory usage on large-scale graphs, especially on deep GCN. For example, Cluster-GCN uses 5x less memory than VR-GCN in a 3-layer GCN model on Amazon2M. Amazon2M is a new graph dataset that we construct to demonstrate the scalability of the GCN algorithms. This dataset contains an Amazon product co-purchase graph with more than 2 million nodes and 61 million edges.

Cluster-GCN achieves a similar training speed to VR-GCN for shallow networks (e.g., 2 layers) but can be faster than VR-GCN when the network goes deeper (e.g., 4 layers), since our complexity is linear in the number of layers $L$ while VR-GCN's complexity is exponential in $L$.

Cluster-GCN is able to train a very deep network that has a large embedding size. Although several previous works show that deep GCN does not give better performance, we found that with proper optimization, deeper GCN could help the accuracy. For example, with a 5-layer GCN, we obtain a new benchmark accuracy of 99.36 for the PPI dataset, compared with the highest reported one, 98.71, by (Zhang et al., 2018).
2. Background
Suppose we are given a graph $G = (\mathcal{V}, \mathcal{E}, A)$, which consists of $N = |\mathcal{V}|$ vertices and $|\mathcal{E}|$ edges such that an edge between any two vertices $i$ and $j$ represents their similarity. The corresponding adjacency matrix $A$ is an $N \times N$ sparse matrix with the $(i, j)$ entry equal to 1 if there is an edge between $i$ and $j$ and 0 otherwise. Also, each node is associated with an $F$-dimensional feature vector and $X \in \mathbb{R}^{N \times F}$ denotes the feature matrix for all $N$ nodes. An $L$-layer GCN (Kipf and Welling, 2017) consists of $L$ graph convolution layers and each of them constructs embeddings for each node by mixing the embeddings of the node's neighbors in the graph from the previous layer:
$$Z^{(l+1)} = A' X^{(l)} W^{(l)}, \qquad X^{(l+1)} = \sigma(Z^{(l+1)}), \tag{1}$$
where $X^{(l)} \in \mathbb{R}^{N \times F_l}$ is the embedding at the $l$-th layer for all the $N$ nodes and $X^{(0)} = X$; $A'$ is the normalized and regularized adjacency matrix and $W^{(l)} \in \mathbb{R}^{F_l \times F_{l+1}}$ is the feature transformation matrix which will be learnt for the downstream tasks. Note that for simplicity we assume the feature dimensions are the same for all layers ($F_1 = \cdots = F_L = F$). The activation function $\sigma(\cdot)$ is usually set to be the element-wise ReLU.
Semi-supervised node classification is a popular application of GCN. When using GCN for this application, the goal is to learn the weight matrices in (1) by minimizing the loss function:
$$\mathcal{L} = \frac{1}{|\mathcal{Y}_L|} \sum_{i \in \mathcal{Y}_L} \mathrm{loss}(y_i, z_i^{(L)}), \tag{2}$$
where $\mathcal{Y}_L$ contains all the labels for the labeled nodes; $z_i^{(L)}$ is the $i$-th row of $Z^{(L)}$, indicating the final layer prediction of node $i$, with the ground-truth label being $y_i$. In practice, a cross-entropy loss is commonly used for node classification in multi-class or multi-label problems.
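As a rough illustration of the layer-wise propagation in (1), the following sketch runs a two-layer forward pass on a toy three-node graph in plain Python. The helper names (`matmul`, `normalize_adj`, `relu`, `gcn_forward`), the toy graph, the weights, and the choice of row normalization with self-loops for the normalized adjacency matrix are all assumptions made for illustration; they are not taken from the paper's implementation.

```python
def matmul(a, b):
    """Dense matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adj(adj):
    """Add self-loops and row-normalize (one assumed choice of normalization)."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_loops]

def relu(m):
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in m]

def gcn_forward(adj_norm, features, weights):
    """Apply Z = A' X W per layer, with ReLU between layers.
    The final layer returns Z without the activation."""
    x = features
    for layer, w in enumerate(weights):
        z = matmul(matmul(adj_norm, x), w)
        x = z if layer == len(weights) - 1 else relu(z)
    return x

# Toy path graph 0 - 1 - 2 with 2-dimensional node features (hypothetical data)
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[[1.0, -1.0], [0.5, 0.5]],   # first-layer weights
           [[1.0, 0.0], [0.0, 1.0]]]    # second-layer weights (identity)
z_final = gcn_forward(normalize_adj(adj), features, weights)
```

Each row of `z_final` plays the role of the final-layer prediction of one node, which would then be fed to the loss in (2).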
Table 1: Time and space complexity of GCN training algorithms.
 | GCN (Kipf and Welling, 2017) | Vanilla SGD | GraphSAGE (Hamilton et al., 2017) | FastGCN (Chen et al., 2018a) | VR-GCN (Chen et al., 2018b) | Cluster-GCN
Time complexity |
Memory complexity |
3. Proposed Algorithm
We first discuss the bottleneck of previous training methods to motivate the proposed algorithm.
In the original paper (Kipf and Welling, 2017), full gradient descent is used for training GCN, but it suffers from high computational and memory cost. In terms of memory, computing the full gradient of (2) by backpropagation requires storing all the embedding matrices $\{Z^{(l)}\}_{l=1}^{L}$, which needs $O(NFL)$ space. In terms of convergence speed, since the model is only updated once per epoch, the training requires more epochs to converge.
It has been shown in some recent works (Hamilton et al., 2017; Chen et al., 2018a, b) that mini-batch SGD can improve the training speed and memory requirement of GCN. Instead of computing the full gradient, SGD only needs to calculate the gradient based on a mini-batch for each update. In this paper, we use $\mathcal{B} \subseteq [N]$ with size $b = |\mathcal{B}|$ to denote a batch of node indices, and each SGD step will compute the gradient estimation
$$\frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla \mathrm{loss}(y_i, z_i^{(L)}) \tag{3}$$
to perform an update. Despite faster convergence in terms of epochs, SGD introduces another computational overhead on GCN training (as explained in the following), which makes its per-epoch time much slower than that of full gradient descent.
Why does vanilla mini-batch SGD have slow per-epoch time?
We consider the computation of the gradient associated with one node $i$: $\nabla \mathrm{loss}(y_i, z_i^{(L)})$. Clearly, this requires the embedding of node $i$, which depends on its neighbors' embeddings in the previous layer. To fetch each of node $i$'s neighbor nodes' embeddings, we need to further aggregate each neighbor node's neighbor nodes' embeddings as well. Suppose a GCN has $L$ layers and each node has an average degree of $d$; to get the gradient for node $i$, we need to aggregate features from $O(d^L)$ nodes in the graph. That is, we need to fetch information for a node's hop-$k$ ($k = 1, \cdots, L$) neighbors in the graph to perform one update. Computing each embedding requires $O(F^2)$ time due to the multiplication with $W^{(l)}$, so on average computing the gradient associated with one node requires $O(d^L F^2)$ time.
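The neighborhood expansion described above can be sketched with a small breadth-first traversal. The example below is only an illustration under stated assumptions: the functions `receptive_field` and `complete_binary_tree` and the toy tree graph are hypothetical, chosen so that the growth of the receptive field with the number of layers is easy to see.

```python
def receptive_field(neighbors, node, num_layers):
    """Return the set of nodes whose input features are needed to compute
    the embedding of `node` after `num_layers` graph convolution layers."""
    frontier = {node}
    needed = {node}
    for _ in range(num_layers):
        # Expand one more hop and keep only nodes not seen before
        frontier = {v for u in frontier for v in neighbors[u]} - needed
        needed |= frontier
    return needed

def complete_binary_tree(depth):
    """Undirected complete binary tree stored as an adjacency dict (toy graph)."""
    n = 2 ** (depth + 1) - 1
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for child in (2 * i + 1, 2 * i + 2):
            if child < n:
                neighbors[i].add(child)
                neighbors[child].add(i)
    return neighbors

tree = complete_binary_tree(depth=6)
# Receptive-field size of the root for 1, 2, 3 and 4 layers
sizes = [len(receptive_field(tree, 0, num_layers)) for num_layers in range(1, 5)]
```

On this toy tree the receptive field of the root roughly doubles with every additional layer, which mirrors the exponential dependence on depth discussed in the text.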
Embedding utilization can reflect computational efficiency. If a batch has more than one node, the time complexity is less straightforward since different nodes can have overlapping hop-$k$ neighbors, and the number of embedding computations can be less than the worst case $O(bd^L)$. To reflect the computational efficiency of mini-batch SGD, we define the concept of “embedding utilization”. During the algorithm, if the node $i$'s embedding at the $l$-th layer, $z_i^{(l)}$, is computed and is reused $u$ times for the embedding computations at layer $l+1$, then we say the embedding utilization of $z_i^{(l)}$ is $u$. For mini-batch SGD with random sampling, $u$ is very small since the graph is usually large and sparse. Assuming $u$ is a small constant (almost no overlaps between hop-$k$ neighbors), mini-batch SGD needs to compute $O(bd^L)$ embeddings per batch, which leads to $O(bd^L F^2)$ time per update and $O(Nd^L F^2)$ time per epoch.
We illustrate the neighborhood expansion problem in the left panel of Fig. 1. In contrast, full-batch gradient descent has the maximal embedding utilization: each embedding will be reused $d$ (average degree) times in the upper layer. As a consequence, the original full gradient descent (Kipf and Welling, 2017) only needs to compute $O(NL)$ embeddings per epoch, which means on average only $O(L)$ embedding computations are needed to acquire the gradient of one node.
To make mini-batch SGD work, previous approaches try to restrict the neighborhood expansion size, which however does not improve embedding utilization. GraphSAGE (Hamilton et al., 2017) uniformly samples a fixed-size set of neighbors, instead of using a full-neighborhood set. We denote the sample size as $r$. This leads to $O(r^L)$ embedding computations for each loss term but also makes gradient estimation less accurate. FastGCN (Chen et al., 2018a) proposed an importance sampling strategy to improve the gradient estimation. VR-GCN (Chen et al., 2018b) proposed a strategy to store the previously computed embeddings for all the $N$ nodes and $L$ layers and reuse them for unsampled neighbors. Despite the high memory usage for storing all the $NL$ embeddings, we find their strategy very useful, and in practice even a small $r$ (e.g., 2) can lead to good convergence.
We summarize the time and space complexity in Table 1. Clearly, all the SGD-based algorithms suffer from exponential complexity with respect to the number of layers, and for VR-GCN, even though $r$ can be small, it incurs huge space complexity that could go beyond a GPU's memory capacity. In the following, we introduce our Cluster-GCN algorithm, which achieves the best of both worlds: the same time complexity per epoch as full gradient descent and the same memory complexity as vanilla SGD.
3.1. Vanilla Cluster-GCN
Our Cluster-GCN technique is motivated by the following question: In mini-batch SGD updates, can we design a batch and the corresponding computation subgraph to maximize the embedding utilization? We answer this in the affirmative by connecting the concept of embedding utilization to a clustering objective.
Consider the case that in each batch we compute the embeddings for a set of nodes $\mathcal{B}$ from layer $1$ to $L$. Since the same subgraph $A_{\mathcal{B},\mathcal{B}}$ (links within $\mathcal{B}$) is used for each layer of computation, we can see that the embedding utilization is the number of edges within this batch, $\|A_{\mathcal{B},\mathcal{B}}\|_0$. Therefore, to maximize embedding utilization, we should design a batch $\mathcal{B}$ to maximize the within-batch edges, by which we connect the efficiency of SGD updates with graph clustering algorithms.
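To make the link between embedding utilization and within-batch edges concrete, the sketch below counts the edges that fall inside a batch for two batches of the same size. The function name `within_batch_edges` and the toy graph (two 4-cliques joined by a single bridge edge) are assumptions made purely for illustration.

```python
from itertools import combinations

def within_batch_edges(edges, batch):
    """Count edges whose two endpoints both lie inside the batch."""
    batch = set(batch)
    return sum(1 for u, v in edges if u in batch and v in batch)

# Toy graph: clique on nodes 0..3, clique on nodes 4..7, plus one bridge (3, 4)
edges = (list(combinations(range(4), 2))
         + list(combinations(range(4, 8), 2))
         + [(3, 4)])

clustered = within_batch_edges(edges, [0, 1, 2, 3])  # batch aligned with a cluster
mixed = within_batch_edges(edges, [0, 1, 4, 5])      # batch mixing the two clusters
```

On this toy graph the cluster-aligned batch keeps 6 edges while the mixed batch keeps only 2, so a batch that follows the cluster structure reuses each computed embedding more often.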
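Now we formally introduce Cluster-GCN. For a graph $G$, we partition its nodes into $c$ groups: $\mathcal{V} = [\mathcal{V}_1, \cdots, \mathcal{V}_c]$, where $\mathcal{V}_t$ consists of the nodes in the $t$-th partition. Thus we have $c$ subgraphs as
$$\bar{G} = [G_1, \cdots, G_c] = [\{\mathcal{V}_1, \mathcal{E}_1\}, \cdots, \{\mathcal{V}_c, \mathcal{E}_c\}],$$
where each $\mathcal{E}_t$ only consists of the links between nodes in $\mathcal{V}_t$. After reorganizing nodes, the adjacency matrix is partitioned into $c^2$ submatrices as
$$A = \bar{A} + \Delta = \begin{bmatrix} A_{11} & \cdots & A_{1c} \\ \vdots & \ddots & \vdots \\ A_{c1} & \cdots & A_{cc} \end{bmatrix} \tag{4}$$
and
$$\bar{A} = \begin{bmatrix} A_{11} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & A_{cc} \end{bmatrix}, \qquad \Delta = \begin{bmatrix} 0 & \cdots & A_{1c} \\ \vdots & \ddots & \vdots \\ A_{c1} & \cdots & 0 \end{bmatrix}, \tag{5}$$
where each diagonal block $A_{tt}$ is a $|\mathcal{V}_t| \times |\mathcal{V}_t|$ adjacency matrix containing the links within $G_t$. $\bar{A}$ is the adjacency matrix for graph $\bar{G}$; $A_{st}$ contains the links between two partitions $\mathcal{V}_s$ and $\mathcal{V}_t$; $\Delta$ is the matrix consisting of all off-diagonal blocks of $A$. Similarly, we can partition the feature matrix $X$ and training labels $Y$ according to the partition $[\mathcal{V}_1, \cdots, \mathcal{V}_c]$ as $[X_1, \cdots, X_c]$ and $[Y_1, \cdots, Y_c]$, where $X_t$ and $Y_t$ consist of the features and labels for the nodes in $\mathcal{V}_t$ respectively.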
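The benefit of this block-diagonal approximation $\bar{G}$ is that the objective function of GCN becomes decomposable into different batches (clusters). Let $\bar{A}'$ denote the normalized version of $\bar{A}$; the final embedding matrix becomes
$$Z^{(L)} = \bar{A}' \sigma\big(\bar{A}' \sigma(\cdots \sigma(\bar{A}' X W^{(0)}) W^{(1)} \cdots) W^{(L-2)}\big) W^{(L-1)} = \begin{bmatrix} \bar{A}'_{11} \sigma\big(\bar{A}'_{11} \sigma(\cdots \sigma(\bar{A}'_{11} X_1 W^{(0)}) W^{(1)} \cdots) W^{(L-2)}\big) W^{(L-1)} \\ \vdots \\ \bar{A}'_{cc} \sigma\big(\bar{A}'_{cc} \sigma(\cdots \sigma(\bar{A}'_{cc} X_c W^{(0)}) W^{(1)} \cdots) W^{(L-2)}\big) W^{(L-1)} \end{bmatrix} \tag{6}$$
due to the block-diagonal form of $\bar{A}$ (note that $\bar{A}'_{tt}$ is the corresponding diagonal block of $\bar{A}'$). The loss function can also be decomposed into
$$\mathcal{L}_{\bar{A}'} = \sum_{t} \frac{|\mathcal{V}_t|}{N} \mathcal{L}_{\bar{A}'_{tt}} \quad \text{and} \quad \mathcal{L}_{\bar{A}'_{tt}} = \frac{1}{|\mathcal{V}_t|} \sum_{i \in \mathcal{V}_t} \mathrm{loss}(y_i, z_i^{(L)}). \tag{7}$$
Cluster-GCN is then based on the decomposition form in (6) and (7). At each step, we sample a cluster $\mathcal{V}_t$ and then conduct SGD to update based on the gradient of $\mathcal{L}_{\bar{A}'_{tt}}$, and this only requires the subgraph $A_{tt}$, the $X_t$, $Y_t$ on the current batch and the model weights $\{W^{(l)}\}$. The implementation only requires forward and backward propagation of matrix products (one block of (6)), which is much easier to implement than the neighborhood search procedure used in previous SGD-based training methods.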
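The block-diagonal split of the adjacency matrix can be sketched as follows. This is a minimal illustration, assuming a dense list-of-lists adjacency matrix and a hand-written partition; the helper names (`adjacency_from_edges`, `induced_block`, `split_adjacency`) and the toy graph of two triangles joined by one edge are hypothetical. A real pipeline would obtain the partition from a graph clustering tool and use sparse matrices.

```python
def adjacency_from_edges(num_nodes, edges):
    """Symmetric 0/1 adjacency matrix from an undirected edge list."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1
    return adj

def induced_block(adj, nodes):
    """Diagonal block: rows and columns of `adj` restricted to `nodes`."""
    return [[adj[i][j] for j in nodes] for i in nodes]

def split_adjacency(adj, partition):
    """Return the diagonal blocks and the number of nonzeros that fall
    into the off-diagonal part (the links dropped by the approximation)."""
    blocks = [induced_block(adj, part) for part in partition]
    total = sum(sum(row) for row in adj)
    kept = sum(sum(sum(row) for row in block) for block in blocks)
    return blocks, total - kept

# Toy graph: triangle {0, 1, 2}, triangle {3, 4, 5}, and one bridge edge (2, 3)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = adjacency_from_edges(6, edges)
partition = [[0, 1, 2], [3, 4, 5]]      # hand-written stand-in for a clustering
blocks, dropped_nonzeros = split_adjacency(adj, partition)
```

Each entry of `blocks` is what a single Cluster-GCN batch would operate on, and `dropped_nonzeros` measures how much of the graph the block-diagonal approximation ignores (here the two symmetric entries of the single bridge edge).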
We use graph clustering algorithms to partition the graph. Graph clustering methods such as METIS (Karypis and Kumar, 1998) and Graclus (Dhillon et al., 2007) aim to construct partitions over the vertices in the graph such that within-cluster links are much more numerous than between-cluster links, to better capture the clustering and community structure of the graph. This is exactly what we need because: 1) As mentioned before, the embedding utilization is equivalent to the number of within-cluster links for each batch. Intuitively, each node and its neighbors are usually located in the same cluster, so after a few hops the neighborhood nodes are still in the same cluster with high probability. 2) Since we replace $A$ by its block-diagonal approximation $\bar{A}$ and the error is proportional to the between-cluster links $\Delta$, we need to find a partition that minimizes the number of between-cluster links.
In Figure 1, we illustrate the neighborhood expansion with the full graph $G$ and the graph with clustering partition $\bar{G}$. We can see that Cluster-GCN can avoid heavy neighborhood search and focus on the neighbors within each cluster. In Table 2, we show two different node partition strategies: random partition versus clustering partition. We partition the graph into 10 parts by using random partition and METIS. We then use one partition as a batch to perform an SGD update. We can see that with the same number of epochs, using clustering partition can achieve higher accuracy. This shows that using graph clustering is important and partitions should not be formed randomly.
Time and space complexity.
Since each node in $\mathcal{V}_t$ only links to nodes inside $\mathcal{V}_t$, no node needs to perform neighborhood searching outside $\mathcal{V}_t$. The computation for each batch consists purely of matrix products and some element-wise operations, so the overall time complexity per batch is $O(\|A_{tt}\|_0 F + bF^2)$. Thus the overall time complexity per epoch becomes $O(\|A\|_0 F + NF^2)$. On average, each batch only requires computing $O(bL)$ embeddings, which is linear instead of exponential in $L$. In terms of space complexity, in each batch we only need to load $b$ samples and store their embeddings on each layer, resulting in $O(bLF)$ memory for storing embeddings. Therefore our algorithm is also more memory efficient than all the previous algorithms. Moreover, our algorithm only requires loading a subgraph into GPU memory instead of the full graph (though the graph is usually not the memory bottleneck). The detailed time and memory complexity are summarized in Table 1.
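Table 2: Random partition versus clustering partition of the graph (accuracy after the same number of epochs).
Dataset | random partition | clustering partition
Cora | 78.4 | 82.5
Pubmed | 78.9 | 79.9
PPI | 68.1 | 92.9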
3.2. Stochastic Multiple Partitions
Although vanilla Cluster-GCN achieves good computational and memory complexity, there are still two potential issues:

After the graph is partitioned, some links (the $\Delta$ part in Eq. (4)) are removed. Thus the performance could be affected.

Graph clustering algorithms tend to bring similar nodes together. Hence the distribution of a cluster could be different from the original data set, leading to a biased estimation of the full gradient while performing SGD updates.
In Figure 2, we demonstrate an example of unbalanced label distribution by using the Reddit data with clusters formed by METIS. We calculate the entropy value of each cluster based on its label distribution. Compared with random partitioning, we clearly see that the entropy of most clusters is smaller, indicating that the label distributions of clusters are biased towards some specific labels. This increases the variance across different batches and may affect the convergence of SGD.
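The per-cluster entropy computation described above can be sketched as follows. This is a minimal illustration, assuming single-label nodes and the natural logarithm (the base used in the paper is not stated here); the function name `label_entropy` and the two label lists are hypothetical.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (natural log) of the empirical label distribution."""
    total = len(labels)
    return sum(-(count / total) * math.log(count / total)
               for count in Counter(labels).values())

# Hypothetical label lists for two batches of equal size
balanced_batch = ["a", "b", "c", "a", "b", "c"]   # resembles a random partition
skewed_batch = ["a", "a", "a", "a", "a", "b"]     # resembles a tight cluster

h_balanced = label_entropy(balanced_batch)
h_skewed = label_entropy(skewed_batch)
```

A lower entropy, as for `skewed_batch`, indicates that the batch is dominated by a few labels, which is the kind of bias the paragraph above attributes to clusters.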
To address the above issues, we propose a stochastic multiple clustering approach to incorporate between-cluster links and reduce variance across batches. We first partition the graph into $p$ clusters $\mathcal{V}_1, \cdots, \mathcal{V}_p$ with a relatively large $p$. When constructing a batch $\mathcal{B}$ for an SGD update, instead of considering only one cluster, we randomly choose $q$ clusters, denoted as $t_1, \ldots, t_q$, and include their nodes $\{\mathcal{V}_{t_1} \cup \cdots \cup \mathcal{V}_{t_q}\}$ into the batch. Furthermore, the links between the chosen clusters, $\{A_{ij} \mid i, j \in t_1, \ldots, t_q\}$, are added back. In this way, those between-cluster links are re-incorporated and the combinations of clusters make the variance across batches smaller. Figure 3 illustrates our algorithm: for each epoch, different combinations of clusters are chosen as a batch. We conduct an experiment on Reddit to demonstrate the effectiveness of the proposed approach. In Figure 4, we can observe that using multiple clusters as one batch could improve the convergence. Our final Cluster-GCN algorithm is presented in Algorithm 1.
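The batch construction described above can be sketched as below. This is an illustration under stated assumptions rather than the paper's Algorithm 1: the helper names (`ring_adjacency`, `sub_adjacency`, `multi_cluster_epoch`, `nnz`), the toy ring graph, the hand-written partition, and the shuffle-then-chunk way of choosing clusters without replacement within an epoch are all hypothetical choices.

```python
import random

def ring_adjacency(n):
    """Adjacency matrix of an n-node ring (toy graph)."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        j = (i + 1) % n
        adj[i][j] = adj[j][i] = 1
    return adj

def sub_adjacency(adj, nodes):
    """Induced sub-matrix on `nodes`; it contains every link among them,
    including links between different chosen clusters."""
    return [[adj[i][j] for j in nodes] for i in nodes]

def multi_cluster_epoch(partition, clusters_per_batch, rng):
    """Yield one batch of node ids per group of randomly chosen clusters."""
    order = list(range(len(partition)))
    rng.shuffle(order)
    for start in range(0, len(order), clusters_per_batch):
        chosen = order[start:start + clusters_per_batch]
        yield sorted(v for t in chosen for v in partition[t])

def nnz(matrix):
    """Number of nonzero entries of a 0/1 matrix."""
    return sum(sum(row) for row in matrix)

adj = ring_adjacency(8)
partition = [[0, 1], [2, 3], [4, 5], [6, 7]]   # stand-in for a clustering result
batches = list(multi_cluster_epoch(partition, 2, random.Random(0)))

# Links kept when two neighboring clusters are used separately vs. merged
separate = nnz(sub_adjacency(adj, partition[0])) + nnz(sub_adjacency(adj, partition[1]))
merged = nnz(sub_adjacency(adj, partition[0] + partition[1]))
```

Merging the two clusters recovers the link between nodes 1 and 2 that the single-cluster batches drop, which is the effect of adding between-cluster links back.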
3.3. Issues of training deeper GCNs
Previous attempts at training deeper GCNs (Kipf and Welling, 2017) seem to suggest that adding more layers is not helpful. However, the datasets used in the experiments may be too small to make a proper justification. For example, (Kipf and Welling, 2017) considered a graph with only a few hundred training nodes, for which overfitting can be an issue. Moreover, we observe that the optimization of deep GCN models becomes difficult as it may impede the information from the first few layers being passed through. In (Kipf and Welling, 2017), they adopt a technique similar to residual connections (He et al., 2016) to enable the model to carry the information from a previous layer to the next layer. Specifically, they modify (1) to add the hidden representations of layer $l$ into the next layer:
$$X^{(l+1)} = \sigma(A' X^{(l)} W^{(l)}) + X^{(l)}. \tag{8}$$
Here we propose another simple technique to improve the training of deep GCNs. In the original GCN settings, each node aggregates the representation of its neighbors from the previous layer. However, under the setting of deep GCNs, the strategy may not be suitable as it does not take the number of layers into account. Intuitively, neighbors nearby should contribute more than distant nodes. We thus propose a technique to better address this issue. The idea is to amplify the diagonal parts of the adjacency matrix $A$ used in each GCN layer. In this way, we are putting more weight on the representation from the previous layer in the aggregation of each GCN layer. An example is to add an identity to $A'$ as follows:
$$X^{(l+1)} = \sigma((A' + I) X^{(l)} W^{(l)}). \tag{9}$$
While (9) seems to be reasonable, using the same weight for all the nodes regardless of their numbers of neighbors may not be suitable. Moreover, it may suffer from numerical instability as values can grow exponentially when more layers are used. Hence we propose a modified version of (9) to better maintain the neighborhood information and numerical ranges. We first add an identity to the original $A$ and perform the normalization
$$\tilde{A} = (D + I)^{-1}(A + I), \tag{10}$$
where $D$ denotes the diagonal degree matrix of $A$, and then consider
$$X^{(l+1)} = \sigma((\tilde{A} + \lambda\,\mathrm{diag}(\tilde{A})) X^{(l)} W^{(l)}). \tag{11}$$
Experimental results of adopting the “diagonal enhancement” techniques are presented in Section 4.3, where we show that this new normalization strategy can help to build deep GCNs and achieve state-of-the-art performance.
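The normalization in (10) followed by the diagonal amplification in (11) can be sketched on a dense matrix as follows. This is only an illustration: the function name `diag_enhanced_adjacency`, the parameter name `lam` for the diagonal weight, and the toy path graph are assumptions, and a practical implementation would operate on sparse matrices.

```python
def diag_enhanced_adjacency(adj, lam):
    """Row-normalize A + I by the degree plus one, then add `lam` times
    the diagonal of the normalized matrix back onto the diagonal."""
    n = len(adj)
    enhanced = []
    for i in range(n):
        degree_plus_one = sum(adj[i]) + 1.0
        row = [(adj[i][j] + (1.0 if i == j else 0.0)) / degree_plus_one
               for j in range(n)]
        row[i] += lam * row[i]   # amplify the node's own contribution
        enhanced.append(row)
    return enhanced

# Toy path graph 0 - 1 - 2 (hypothetical data)
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
enhanced = diag_enhanced_adjacency(adj, lam=1.0)
plain = diag_enhanced_adjacency(adj, lam=0.0)   # reduces to the normalization alone
```

Because the amplification is applied after normalization, a node with many neighbors receives a smaller absolute boost than a node with few neighbors, which is the degree-aware behavior the text argues for.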
4. Experiments
We evaluate our proposed method for training GCN on two tasks: multi-label and multi-class classification on four public datasets. The statistics of the datasets are shown in Table 3. Note that the Reddit dataset is the largest public dataset we have seen so far for GCN, and the Amazon2M dataset is collected by ourselves and is much larger than Reddit (see more details in Section 4.2).
Table 3: Data statistics.
Datasets | Task | #Nodes | #Edges | #Labels | #Features
PPI | multi-label | 56,944 | 818,716 | 121 | 50
Reddit | multi-class | 232,965 | 11,606,919 | 41 | 602
Amazon | multi-label | 334,863 | 925,872 | 58 | N/A
Amazon2M | multi-class | 2,449,029 | 61,859,140 | 47 | 100
Table 4: Cluster-GCN settings used for each dataset.
Datasets | #hidden units | #partitions | #clusters per batch
PPI | 512 | 50 | 1
Reddit | 128 | 1500 | 20
Amazon | 128 | 200 | 1
Amazon2M | 400 | 15000 | 10
We include the following state-of-the-art GCN training algorithms in our comparisons:

Cluster-GCN (our proposed algorithm): the proposed fast GCN training method.

VR-GCN (Chen et al., 2018b; GitHub link: https://github.com/thuml/stochastic_gcn): It maintains the historical embedding of all the nodes in the graph and expands to only a few neighbors to speed up training. The number of sampled neighbors is set to be 2 as suggested in (Chen et al., 2018b). (Note that we also tried the default sample size 20 in the VR-GCN package but it performs much worse than sample size 2.)

GraphSAGE (Hamilton et al., 2017; GitHub link: https://github.com/williamleif/GraphSAGE): It samples a fixed number of neighbors per node. We use the default settings of sampled sizes for each layer in GraphSAGE.
We implement our method in PyTorch (Paszke et al., 2017). For the other methods, we use all the original papers' code from their GitHub pages. Since (Kipf and Welling, 2017) has difficulty scaling to large graphs, we do not compare with it here. Also, since (Chen et al., 2018b) shows that VR-GCN is faster than FastGCN, we do not compare with FastGCN here. For all the methods we use the Adam optimizer with learning rate 0.01, dropout rate 20%, and weight decay zero. The mean aggregator proposed by (Hamilton et al., 2017) is adopted and the number of hidden units is the same for all methods. Note that techniques such as (11) are not considered here. In each experiment, we consider the same GCN architecture for all methods. For VR-GCN and GraphSAGE, we follow the settings provided by the original papers and set the batch sizes as 512. For Cluster-GCN, the number of partitions and clusters per batch for each dataset are listed in Table 4. Note that clustering is seen as a preprocessing step and its running time is not taken into account in training. In Section 6, we show that graph clustering only takes a small portion of preprocessing time. All the experiments are conducted on a machine with an NVIDIA Tesla V100 GPU (16 GB memory), a 20-core Intel Xeon CPU (2.20 GHz), and 192 GB of RAM.

4.1. Training Performance for median size datasets
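Table 5: Memory usage for training GCNs with different numbers of layers. Numbers in parentheses are the hidden dimensions.
 | 2-layer | | | 3-layer | | | 4-layer | |
Dataset | VR-GCN | Cluster-GCN | GraphSAGE | VR-GCN | Cluster-GCN | GraphSAGE | VR-GCN | Cluster-GCN | GraphSAGE
PPI (512) | 258 MB | 39 MB | 51 MB | 373 MB | 46 MB | 71 MB | 522 MB | 55 MB | 85 MB
Reddit (128) | 259 MB | 284 MB | 1074 MB | 372 MB | 285 MB | 1075 MB | 515 MB | 285 MB | 1076 MB
Reddit (512) | 1031 MB | 292 MB | 1099 MB | 1491 MB | 300 MB | 1115 MB | 2064 MB | 308 MB | 1131 MB
Amazon (128) | 1188 MB | 703 MB | N/A | 1351 MB | 704 MB | N/A | 1515 MB | 705 MB | N/A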
Training Time vs Accuracy: First we compare our proposed method with other methods in terms of training speed. In Figure 6, the $x$-axis shows the training time in seconds, and the $y$-axis shows the accuracy (F1 score) on the validation sets. We plot the training time versus accuracy for three datasets with 2, 3, and 4 layers of GCN. Since GraphSAGE is slower than VR-GCN and our method, the curves for GraphSAGE only appear for the PPI and Reddit datasets. We can see that our method is the fastest for both PPI and Reddit datasets for GCNs with different numbers of layers.
Table 6: Benchmarking on the sparse tensor operations in PyTorch and TensorFlow. A network with two linear layers is used and the timing includes forward and backward operations. Numbers in the brackets indicate the size of hidden units in the first layer. Amazon data is used.
 | PyTorch | TensorFlow
Avg. time per epoch (128) | 8.81s | 2.53s
Avg. time per epoch (512) | 45.08s | 7.13s
For Amazon data, since nodes' features are not available, an identity matrix is used as the feature matrix $X$. Under this setting, the shape of the parameter matrix $W^{(0)}$ becomes 334863x128. Therefore, the computation is dominated by sparse matrix operations such as $AW^{(0)}$. Our method is still faster than VR-GCN for the 3-layer case, but slower for the 2-layer and 4-layer ones. The reason may come from the speed of sparse matrix operations in different frameworks. VR-GCN is implemented in TensorFlow, while Cluster-GCN is implemented in PyTorch, whose sparse tensor support is still at a very early stage. In Table 6, we show the time for TensorFlow and PyTorch to do forward/backward operations on Amazon data, where a simple two-layer network is used for benchmarking both frameworks. We can clearly see that TensorFlow is faster than PyTorch. The difference is more significant when the number of hidden units increases. This may explain why Cluster-GCN has longer training time on the Amazon dataset.

Memory usage comparison: For training large-scale GCNs, besides training time, the memory usage needed for training is often more important and will directly restrict the scalability. The memory usage includes the memory needed for training the GCN for many epochs. As discussed in Section 3, to speed up training, VR-GCN needs to save historical embeddings during training, so it needs much more memory for training than Cluster-GCN. GraphSAGE also has a higher memory requirement than Cluster-GCN due to the exponential neighborhood growing problem. In Table 5, we compare our memory usage with VR-GCN's memory usage for GCNs with different numbers of layers. When increasing the number of layers, Cluster-GCN's memory usage does not increase a lot. The reason is that when adding one layer, the extra variable introduced is the weight matrix $W^{(L)}$, which is relatively small compared to the subgraph and node features. In contrast, VR-GCN needs to save each layer's history embeddings, and the embeddings are usually dense and will soon dominate the memory usage. We can see from Table 5 that Cluster-GCN is much more memory efficient than VR-GCN. For instance, on Reddit data, to train a 4-layer GCN with hidden dimension 512, VR-GCN needs 2064 MB of memory, while Cluster-GCN only uses 308 MB.
Table 7: The most common categories in Amazon2M.
Categories | number of products
Books | 668,950
CDs & Vinyl | 172,199
Toys & Games | 158,771
4.2. Experimental results on Amazon2M
A new GCN dataset: Amazon2M. By far the largest public data for testing GCN is the Reddit dataset with the statistics shown in Table 3, which contains about 200K nodes. As shown in Figure 6, GCN training on this data can be finished within a few hundred seconds. To test the scalability of GCN training algorithms, we constructed a much larger graph with over 2 million nodes and 61 million edges based on Amazon co-purchasing networks (McAuley et al., 2015b; McAuley et al., 2015a). The raw co-purchase data is from Amazon-3M (http://manikvarma.org/downloads/XC/XMLRepository.html). In the graph, each node is a product, and a graph link represents whether two products are purchased together. Each node feature is generated by extracting bag-of-words features from the product descriptions followed by Principal Component Analysis (Hotelling, 1933) to reduce the dimension to 100. In addition, we use the top-level categories as the labels for that product/node (see Table 7 for the most common categories). The detailed statistics of the dataset are listed in Table 3.

In Table 8, we compare with VR-GCN for GCNs with different numbers of layers in terms of training time, memory usage, and test accuracy (F1 score). As can be seen from the table, 1) VR-GCN is faster than Cluster-GCN with a 2-layer GCN but slower than Cluster-GCN when one more layer is added, while achieving similar accuracy; 2) in terms of memory usage, VR-GCN uses much more memory than Cluster-GCN (5 times more for the 3-layer case), and it runs out of memory when training a 4-layer GCN, while Cluster-GCN does not need much additional memory when increasing the number of layers, and achieves the best accuracy for this data when training a 4-layer GCN.
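Table 8: Comparisons of running time, memory and test F1 score on Amazon2M.
 | Time | | Memory | | Test F1 score |
 | VR-GCN | Cluster-GCN | VR-GCN | Cluster-GCN | VR-GCN | Cluster-GCN
Amazon2M (2-layer) | 337s | 1223s | 7476 MB | 2228 MB | 89.03 | 89.00
Amazon2M (3-layer) | 1961s | 1523s | 11218 MB | 2235 MB | 90.21 | 90.21
Amazon2M (4-layer) | N/A | 2289s | OOM | 2241 MB | N/A | 90.41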
4.3. Training Deeper GCN
In this section we consider GCNs with more layers. We first show the timing comparisons of Cluster-GCN and VR-GCN in Table 9. PPI is used for benchmarking and we run 200 epochs for both methods. We observe that the running time of VR-GCN grows exponentially because of its expensive neighborhood finding, while the running time of Cluster-GCN only grows linearly.
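The growth pattern can be checked directly from the Table 9 timings. The short script below is a sketch that computes the per-layer increment for Cluster-GCN and the layer-to-layer ratio for VR-GCN; the timing values are copied from Table 9, while the helper names `increments` and `ratios` are hypothetical.

```python
# Running time in seconds for 2, 3, 4, 5 and 6 layers (values from Table 9)
cluster_gcn = [52.9, 82.5, 109.4, 137.8, 157.3]
vr_gcn = [103.6, 229.0, 521.2, 1054.0, 1956.0]

def increments(times):
    """Extra seconds incurred by each additional layer."""
    return [round(b - a, 1) for a, b in zip(times, times[1:])]

def ratios(times):
    """Multiplicative growth from one depth to the next."""
    return [round(b / a, 2) for a, b in zip(times, times[1:])]

cluster_steps = increments(cluster_gcn)
vr_ratios = ratios(vr_gcn)
```

Under these numbers, each extra layer costs Cluster-GCN roughly 20 to 30 additional seconds, whereas VR-GCN's running time roughly doubles with every added layer.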
Next we investigate whether using deeper GCNs obtains better accuracy. In Section 3.3, we discussed different strategies of modifying the adjacency matrix $A$ to facilitate the training of deep GCNs. We apply the diagonal enhancement techniques to deep GCNs and run experiments on PPI. Results are shown in Table 11. For the case of 2 to 5 layers, the accuracy of all methods increases with more layers added, suggesting that deeper GCNs may be useful. However, when 7 or 8 GCN layers are used, the first three methods fail to converge within 200 epochs and suffer a dramatic loss of accuracy. A possible reason is that the optimization for deeper GCNs becomes more difficult. We show the detailed convergence of an 8-layer GCN in Figure 5. With the proposed diagonal enhancement technique (11), the convergence can be improved significantly and similar accuracy can be achieved.
State-of-the-art results by training deeper GCNs. With the design of ClusterGCN and the proposed normalization approach, we are now able to train much deeper GCNs that achieve better accuracy (F1 score). We compare the test accuracy with other existing methods in Table 10. For PPI, ClusterGCN achieves the state-of-the-art result by training a 5-layer GCN with 2048 hidden units. For Reddit, a 4-layer GCN with 128 hidden units is used.
Method  2-layer  3-layer  4-layer  5-layer  6-layer

ClusterGCN  52.9s  82.5s  109.4s  137.8s  157.3s 
VRGCN  103.6s  229.0s  521.2s  1054s  1956s 
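The growth pattern in the timing table above can be checked with a few lines of arithmetic. The sketch below uses only the running times listed in the table; the helper names are illustrative.

```python
# Running times in seconds for 2- to 6-layer GCNs, copied from the timing table above.
times = {
    "ClusterGCN": [52.9, 82.5, 109.4, 137.8, 157.3],
    "VRGCN": [103.6, 229.0, 521.2, 1054.0, 1956.0],
}

def successive_differences(values):
    """Extra seconds incurred by each additional layer."""
    return [b - a for a, b in zip(values, values[1:])]

def successive_ratios(values):
    """Multiplicative slowdown caused by each additional layer."""
    return [b / a for a, b in zip(values, values[1:])]

for method, t in times.items():
    diffs = [round(d, 1) for d in successive_differences(t)]
    ratios = [round(r, 2) for r in successive_ratios(t)]
    print(method, "differences:", diffs, "ratios:", ratios)
```

For ClusterGCN each added layer costs a roughly constant 20 to 30 seconds, which matches linear growth. For VRGCN each added layer multiplies the running time by roughly 2, which matches exponential growth.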
Method  PPI  Reddit

FastGCN (Chen et al., 2018a)  N/A  93.7 
GraphSAGE (Hamilton et al., 2017)  61.2  95.4 
VRGCN (Chen et al., 2018b)  97.8  96.3 
GaAN (Zhang et al., 2018)  98.71  96.36 
GAT (Veličković et al., 2018)  97.3  N/A 
GeniePath (Liu et al., 2019)  98.5  N/A 
ClusterGCN  99.36  96.60 
Method  2-layer  3-layer  4-layer  5-layer  6-layer  7-layer  8-layer
ClusterGCN with (1)  90.3  97.6  98.2  98.3  94.1  65.4  43.1 
ClusterGCN with (10)  90.2  97.7  98.1  98.4  42.4  42.4  42.4 
ClusterGCN with (10) + (9)  84.9  96.0  97.1  97.6  97.3  43.9  43.8 
ClusterGCN with (10) + (11),  89.6  97.5  98.2  98.3  98.0  97.4  96.2 
5. Conclusion
We present ClusterGCN, a new GCN training algorithm that is fast and memory efficient. Experimental results show that this method can train very deep GCNs on large-scale graphs; for instance, on a graph with over 2 million nodes, training takes less than an hour using around 2 GB of memory and achieves an accuracy of 90.41 (F1 score). Using the proposed approach, we are able to successfully train much deeper GCNs, which achieve state-of-the-art test F1 scores on the PPI and Reddit datasets.
References
 Chen et al. (2018a) Jie Chen, Tengfei Ma, and Cao Xiao. 2018a. FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling. In ICLR.
 Chen et al. (2018b) Jianfei Chen, Jun Zhu, and Le Song. 2018b. Stochastic Training of Graph Convolutional Networks with Variance Reduction. In ICML.
 Dai et al. (2018) Hanjun Dai, Zornitsa Kozareva, Bo Dai, Alex Smola, and Le Song. 2018. Learning Steady-States of Iterative Algorithms over Graphs. In ICML. 1114–1122.
 Dhillon et al. (2007) Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. 2007. Weighted Graph Cuts Without Eigenvectors: A Multilevel Approach. IEEE Trans. Pattern Anal. Mach. Intell. 29, 11 (2007), 1944–1957.
 Hamilton et al. (2017) William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NIPS.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. CVPR (2016), 770–778.
 Hotelling (1933) H. Hotelling. 1933. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24, 6 (1933), 417–441.
 Karypis and Kumar (1998) George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20, 1 (1998), 359–392.
 Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.
 Liu et al. (2019) Ziqi Liu, Chaochao Chen, Longfei Li, Jun Zhou, Xiaolong Li, Le Song, and Yuan Qi. 2019. GeniePath: Graph Neural Networks with Adaptive Receptive Paths. In AAAI.
 McAuley et al. (2015a) Julian McAuley, Rahul Pandey, and Jure Leskovec. 2015a. Inferring Networks of Substitutable and Complementary Products. In KDD.
 McAuley et al. (2015b) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015b. Image-Based Recommendations on Styles and Substitutes. In SIGIR.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPSW.
 Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In ICLR.

 Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In KDD.
 Zhang et al. (2018) Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. 2018. GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs. In UAI.
 Zhang and Chen (2018) Muhan Zhang and Yixin Chen. 2018. Link Prediction Based on Graph Neural Networks. In NIPS.
6. More Details about the experiments
In this section we describe more detailed settings about the experiments to help in reproducibility.
6.1. Datasets and software versions
We describe more details about the datasets in Table 12. We download the PPI and Reddit datasets from the GraphSAGE website (http://snap.stanford.edu/graphsage/) and Amazon from https://github.com/HanjunDai/steady_state_embedding. Note that for Amazon, we consider GCN in an inductive setting, meaning that the model only learns from training data, whereas (Dai et al., 2018) consider a transductive setting. Regarding software versions, we install CUDA 10.0 and cuDNN 7.0, and use TensorFlow 1.12.0 and PyTorch 1.0.0. We download METIS 5.1.0 from the official website (http://glaros.dtc.umn.edu/gkhome/metis/metis/download) and use a Python wrapper (https://metis.readthedocs.io/en/latest/) for the METIS library.
6.2. Implementation details
Previous works (Chen et al., 2018a, b) propose to precompute the product of the adjacency matrix and the feature matrix used in the first GCN layer. We also adopt this strategy in our implementation. By precomputing this product, we are essentially using the exact 1-hop neighborhood for each node, and the expensive neighbor search in the first layer can be avoided.
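The precomputation described above can be sketched as follows, assuming the product in question is that of a normalized adjacency matrix and the node feature matrix. The normalization, the random toy graph, and all names are illustrative assumptions rather than the implementation used in the experiments.

```python
import numpy as np

def normalize_adjacency(adj):
    """Row-normalize A + I (an assumed normalization for this sketch)."""
    a_hat = adj + np.eye(adj.shape[0])
    return a_hat / a_hat.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, f, h = 5, 3, 2                       # nodes, input features, hidden units
adj = (rng.random((n, n)) < 0.4).astype(float)
adj = np.triu(adj, 1)
adj = adj + adj.T                       # symmetric, no self-loops
features = rng.normal(size=(n, f))
w1 = rng.normal(size=(f, h))

# One-time preprocessing: aggregate the exact 1-hop neighborhood of every node.
ax = normalize_adjacency(adj) @ features

def first_layer(batch_nodes, w):
    """First GCN layer on a mini-batch: only rows of the cached product are needed."""
    return np.maximum(ax[batch_nodes] @ w, 0.0)

batch = np.array([0, 3])
out = first_layer(batch, w1)
```

Because the cached rows already contain the aggregated neighbor features, the first layer needs no neighbor lookup at training time and reduces to a dense matrix product on the batch rows.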
Another implementation detail concerns the technique mentioned in Section 3.2. When multiple clusters are selected, some between-cluster links are added back. Thus the new combined adjacency matrix should be renormalized to maintain the numerical range of the resulting embedding matrix. In our experiments we find that this renormalization is helpful.
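A minimal sketch of this renormalization step is given below. It assumes that the batch is the subgraph induced on the union of the selected clusters and that the normalization is a row normalization of the adjacency matrix with self-loops; the function name and the toy graph are hypothetical.

```python
import numpy as np

def combined_batch_adjacency(adj, parts, chosen_clusters):
    """Induced subgraph on the union of the chosen clusters, renormalized.

    Taking the induced subgraph on the union keeps the links between the
    chosen clusters; the row normalization of A + I is an assumed choice.
    """
    nodes = np.flatnonzero(np.isin(parts, chosen_clusters))
    sub = adj[np.ix_(nodes, nodes)]
    sub_hat = sub + np.eye(len(nodes))
    return nodes, sub_hat / sub_hat.sum(axis=1, keepdims=True)

# Toy path graph with 6 nodes in 3 clusters; links 1-2 and 3-4 cross clusters.
adj = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    adj[u, v] = adj[v, u] = 1.0
parts = np.array([0, 0, 1, 1, 2, 2])    # cluster id of each node

nodes, a_batch = combined_batch_adjacency(adj, parts, chosen_clusters=[0, 1])
```

In this toy example the link between nodes 1 and 2 is restored because both of its clusters are selected, while the link between nodes 3 and 4 is dropped. Normalizing after the clusters are combined makes every row of the batch matrix sum to one again.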
As for the inductive setting, the testing nodes are not visible during the training process. Thus we construct one adjacency matrix containing only training nodes and another containing all nodes. Graph partitioning is applied to the former, and the partitioned adjacency matrix is then renormalized. Note that feature normalization is also conducted. To measure memory usage, we use tf.contrib.memory_stats.BytesInUse() for TensorFlow and torch.cuda.memory_allocated() for PyTorch.
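The construction of the two adjacency matrices for the inductive setting can be sketched as follows. The function name, the mask, and the toy graph are hypothetical and only illustrate the idea.

```python
import numpy as np

def split_adjacency(adj, train_mask):
    """Return (train-only adjacency, training node ids, full adjacency)."""
    train_nodes = np.flatnonzero(train_mask)
    # Keep only rows and columns of training nodes, so no edge touching
    # a held-out node is visible during training.
    adj_train = adj[np.ix_(train_nodes, train_nodes)]
    return adj_train, train_nodes, adj

adj = np.array([[0., 1., 1., 0.],
                [1., 0., 0., 1.],
                [1., 0., 0., 1.],
                [0., 1., 1., 0.]])
train_mask = np.array([True, True, False, True])  # node 2 is a test node

adj_train, train_nodes, adj_full = split_adjacency(adj, train_mask)
```

The training matrix is the one that would be partitioned and renormalized, while the full matrix is kept for evaluation on all nodes.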
6.3. The running time of graph clustering algorithm and data preprocessing
The experiments in Section 4 that compare different GCN training methods report only the running time for training. The preprocessing time for each method is not presented in the tables and figures. While some preprocessing steps, such as data loading and parsing, are shared across methods, others are algorithm specific. For instance, our method needs to run a graph clustering algorithm during the preprocessing stage.
In Table 13, we present more details about the preprocessing time of ClusterGCN on the four GCN datasets. For graph clustering, we adopt METIS, which is a fast and scalable graph clustering library. We observe that the graph clustering algorithm takes only a small portion of the preprocessing time, which shows that applying such an algorithm adds little extra cost and scales to large data sets. In addition, graph clustering only needs to be conducted once to form the node partitions, which can be reused for later training processes.
Datasets  #Partitions  Clustering  Preprocessing 

PPI  50  1.6s  20.3s 
Reddit  1500  33s  286s
Amazon  200  0.3s  67.5s 
Amazon2M  15000  148s  2160s 
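The claim that clustering takes only a small portion of the preprocessing time can be checked directly from the table above. The sketch below uses only the numbers in the table and identifies each row by its number of partitions; the helper name is illustrative.

```python
# (#partitions, clustering seconds, preprocessing seconds), copied from the table above.
rows = [
    (50, 1.6, 20.3),
    (1500, 33.0, 286.0),
    (200, 0.3, 67.5),
    (15000, 148.0, 2160.0),
]

def clustering_share(clustering_s, preprocessing_s):
    """Clustering time as a percentage of the preprocessing time."""
    return 100.0 * clustering_s / preprocessing_s

shares = {p: clustering_share(c, t) for p, c, t in rows}
for partitions, share in shares.items():
    print(f"{partitions:>6} partitions: clustering is {share:.1f}% of preprocessing")
```

In every row the clustering time is below 12% of the preprocessing time.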