Graph convolutional networks (GCNs) [kipf2016semi] have emerged as state-of-the-art (SOTA) algorithms for graph-based learning tasks, such as graph classification [xu2018powerful], node classification [kipf2016semi], link prediction [zhang2018link], and recommendation systems [ying2018graph]. It is well recognized that the superior performance of GCNs on graph-based data largely benefits from their irregular and unrestricted neighborhood connections, enabling GCNs to outperform structure-unaware alternatives such as multilayer perceptrons (MLPs). Specifically, for each node in a graph, GCNs first aggregate the features of its neighbor nodes, and then transform the aggregated feature through (hierarchical) feed-forward propagation to update the feature of the given node.
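As a concrete illustration of these two phases, the sketch below performs neighbor aggregation followed by a feed-forward combination on a hypothetical toy graph. The adjacency list, feature values, and the choice of mean aggregation with identity weights are illustrative assumptions, not the paper's exact formulation (which uses degree-normalized sums, introduced in Sec. 3.1):

```python
import numpy as np

# Hypothetical toy graph: 4 nodes, stored as an adjacency list.
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0],
                     [2.0, 0.0]])

def aggregate(features, neighbors):
    """Phase 1 (aggregation): pool each node's neighbor features (mean here)."""
    out = np.zeros_like(features)
    for node, nbrs in neighbors.items():
        out[node] = features[nbrs].mean(axis=0)
    return out

def combine(agg, weight):
    """Phase 2 (combination): transform aggregated features through weights."""
    return np.maximum(agg @ weight, 0.0)  # ReLU feed-forward update

agg = aggregate(features, neighbors)
updated = combine(agg, np.eye(2))  # identity weights for illustration
```

Stacking such aggregate-then-combine layers yields the hierarchical propagation described above.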
Despite their promising performance, GCN training and inference can be notoriously challenging, hindering their great potential from being unfolded on large real-world graphs. This is because as the graph dataset grows, the large number of node features and the huge adjacency matrix can easily explode the required memory and data movements. For example, a mere 2-layer GCN model with 32 hidden units requires 19 GFLOPs (FLOPs: floating-point operations) to process the Reddit graph [tailor2020degree, canziani2016analysis]. The giant computational cost of GCNs comes from two aspects. First, graphs (or graph data), especially real-world ones, are often extraordinarily large and irregular, as exacerbated by their intertwined complex neighbor connections, e.g., there are a total of 232,965 nodes in the Reddit graph with each node having about 50 neighbors [KKMMN2016]. Second, the extremely high sparsity and unbalanced non-zero data distribution in GCNs' adjacency matrices impose a paramount challenge for effectively accelerating GCNs [geng2020awb, yan2020hygcn], e.g., the sparsity of an adjacency matrix often exceeds 99.9%, while DNNs' sparsity generally ranges from 10% to 50%.
To tackle the aforementioned challenges and unleash the full potential of GCNs, various techniques have been developed. For instance, Tailor et al. [tailor2020degree] leverage quantization-aware training to demonstrate 8-bit GCNs; SGCN [li2020sgcn] is the first to consider GCN sparsification by formulating and solving it as an optimization problem; and NeuralSparse [zheng2020robust] proposes edge pruning and utilizes a DNN to parameterize the sparsification process.
The impressive performance achieved by existing GCN compression works shows that there are redundancies within GCNs to be leveraged for aggressively trimming down their complexity while maintaining their performance. In this work, we attempt to take a new perspective by drawing inspiration from the tremendous success of DNN compression, particularly lottery ticket finding [frankle2018the, liu2018rethinking]. Specifically, [frankle2018the, liu2018rethinking] show that there exist winning tickets (small but critical subnetworks) in dense, randomly initialized networks that can be trained alone to achieve accuracy comparable to the dense ones in a similar number of iterations; later, [You2020Drawing] finds that those winning tickets emerge at the very early training stage, namely early-bird (EB) tickets, and leverages this finding to greatly reduce the training cost of DNNs. While conceptually simple, the unique structure of GCNs makes it not straightforward to apply those findings to compressing GCNs. This is because (1) the graph, instead of the MLPs in GCNs, dominates the complexity, for which the existence of EB tickets remains unknown; and (2) it is unclear how to jointly optimize the two phases of GCN operations (i.e., feature aggregation and combination), even though doing so promises the maximum complexity reduction.
This paper aims to close the above gap to minimize the complexity of GCNs without hurting their competitive performance, and to make the following contributions:
We discover the existence of graph early-bird (GEB) tickets that emerge at the very early stage when sparsifying GCN graphs, and propose a simple yet effective detector to automatically identify the emergence of GEB tickets. To the best of our knowledge, we are the first to show that the EB finding holds for GCN graphs.
We develop a generic efficient GCN training framework dubbed GEBT that significantly boosts GCN training efficiency by (1) drawing joint-EB tickets between the GCN graphs and models and (2) simultaneously sparsifying both the GCN graphs and models, while resulting in efficient GCNs for inference.
Experiments on various GCN models and datasets consistently validate our GEB finding and the effectiveness of the proposed GEBT. For example, our GEBT achieves up to 80.2% ~ 85.6% and 84.6% ~ 87.5% savings in GCN training and inference costs, respectively, while leading to comparable or even better accuracy as compared to SOTA methods.
2 Related Works
Graph Convolutional Networks (GCNs). GCNs have shown impressive capability in processing non-Euclidean and irregular data structures [zhang2018end]. Recently developed GCNs can be categorized into two groups: spectral and spatial methods. Specifically, spectral methods [kipf2017semi, peng2020learning] model the representation in the graph Fourier transform domain based on eigen-decomposition; they are time-consuming and usually handle the whole graph simultaneously, making them difficult to parallelize or scale to large graphs [gao2019graphnas, wu2020comprehensive]. On the other hand, spatial approaches [hamilton2017inductive, simonovsky2017dynamic], which directly perform the convolution in the graph domain by aggregating the neighbor nodes' information, have developed rapidly in recent years. To further improve the performance of spatial GCNs, Veličković et al. [GAT] introduce an attention mechanism to select the information that is relatively critical from all inputs; Zeng et al. [zeng2019accurate] propose mini-batch training to improve GCNs' scalability to large graphs; and [xu2018how] theoretically formalizes an upper bound on the expressiveness of GCNs. Our GEB finding and GEBT enhance the understanding of GCNs, promote efficient GCN training, and are generally applicable to SOTA GCN models.
GCN Compression. The prohibitive complexity and powerful performance of GCNs have motivated growing interest in GCN compression. For instance, Tailor et al. [tailor2020degree] for the first time show the feasibility of adopting 8-bit integer arithmetic representation for GCN inference without sacrificing the classification accuracy; two concurrent pruning works [li2020sgcn, zheng2020robust] aim to sparsify the graph adjacency matrices; and Ying et al. [ying2018hierarchical] propose a DiffPool layer to reduce the size of GCN graphs by clustering similar nodes during training and inference. Our GEBT explores compression from a new perspective and is complementary to existing GCN compression works, i.e., it can be applied on top of them to further reduce GCNs' training/inference costs.
Early-Bird Tickets Hypothesis. Frankle et al. [frankle2018the] show that winning tickets (i.e., small subnetworks) exist in randomly initialized dense networks, and can be retrained to restore a performance comparable to or even better than that of their dense network counterparts. This finding has attracted much research attention, as it implies the potential of training a much smaller network to reach the accuracy of a dense, much larger network, without going through the time- and cost-consuming pipeline of fully training the dense network, pruning it, and then retraining it to restore the accuracy. Later, You et al. [You2020Drawing] demonstrate the existence of EB tickets, i.e., winning tickets that can be consistently drawn at the very early training stages across different models and datasets, and leverage this to largely reduce the training costs of DNNs. More recently, the EB finding has been extended to natural language processing (NLP) models (e.g., BERT) [anonymous2021earlybert] and generative adversarial networks (GANs) [mukund2020winning]. Our GEB finding and GEBT draw inspiration from these prior arts, and for the first time demonstrate that the EB phenomenon holds for GCNs, whose algorithm structures are unique and different from those of DNNs, NLP models, and GANs. Furthermore, we are the first to derive joint EB tickets between GCN graphs and networks.
3 Our Findings and Proposed Techniques
3.1 Preliminaries of GCNs and GCN Sparsification
GCN Notation and Formulation. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ represent a GCN graph, where $\mathcal{V}$ and $\mathcal{E}$ denote the nodes and edges, respectively; and $N = |\mathcal{V}|$ and $M = |\mathcal{E}|$ denote the total number of nodes and edges, respectively. The node degrees are denoted as $d = (d_1, \dots, d_N)$, where $d_i$ indicates the number of neighbors connected to the $i$-th node. We define $D = \mathrm{diag}(d_1, \dots, d_N)$ as the degree matrix whose diagonal elements are formed using $d$. Given the adjacency matrix $A \in \mathbb{R}^{N \times N}$ and the feature matrix $X \in \mathbb{R}^{N \times F_0}$ of the graph $\mathcal{G}$, a two-layer GCN model [kipf2017semi] can then be formulated as:

$$Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right), \quad (1)$$

where $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ (with $\tilde{A} = A + I_N$ and $\tilde{D}$ the degree matrix of $\tilde{A}$) is calculated in a pre-processing step, and thus multiplying by $\hat{A}$ captures GCNs' neighbor aggregation; $W^{(0)}$ and $W^{(1)}$ are the weights of the GCN model for the 1st and 2nd layers, e.g., $W^{(0)} \in \mathbb{R}^{F_0 \times H}$ is an input-to-hidden weight matrix for a hidden layer with $H$ feature maps, and $W^{(1)} \in \mathbb{R}^{H \times C}$ is a hidden-to-output weight matrix with $C$ feature maps (i.e., $C$ output classes). The mapping from the input to the hidden or output layer is called GCN combination, which combines each node's features with those of its neighbors. The softmax function is applied in a row-wise manner [kipf2017semi]. For semi-supervised multiclass classification, the loss function is the cross-entropy error over all labeled examples:

$$\mathcal{L} = -\sum_{i \in \mathcal{Y}_L} \sum_{c=1}^{C} Y_{ic} \ln Z_{ic}, \quad (2)$$

where $\mathcal{Y}_L$ is the set of node indices that have labels, and $Y$ and $Z$ are the ground-truth label matrix and the GCN output predictions, respectively. During GCN training, $W^{(0)}$ and $W^{(1)}$ are updated via gradient descent.
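Eqs. (1) and (2) can be sketched end-to-end in NumPy as follows. The 3-node toy graph, random weights, and layer dimensions are illustrative assumptions only:

```python
import numpy as np

def normalize_adj(A):
    """Pre-processing step: A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def softmax(x):
    """Row-wise softmax, numerically stabilized."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(A_hat, X, W0, W1):
    """Two-layer GCN of Eq. (1): Z = softmax(A_hat ReLU(A_hat X W0) W1)."""
    H = np.maximum(A_hat @ X @ W0, 0.0)
    return softmax(A_hat @ H @ W1)

def cross_entropy(Z, Y, labeled_idx):
    """Eq. (2): cross-entropy over labeled nodes only (semi-supervised)."""
    return -np.sum(Y[labeled_idx] * np.log(Z[labeled_idx] + 1e-12))

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-node path graph
X = rng.normal(size=(3, 4))                                   # F_0 = 4 features
W0, W1 = rng.normal(size=(4, 8)), rng.normal(size=(8, 2))     # H = 8, C = 2
Z = gcn_forward(normalize_adj(A), X, W0, W1)
```

In an actual training loop, gradients of this loss with respect to `W0` and `W1` would drive the gradient-descent updates mentioned above.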
Graph Sparsification. The goal of graph sparsification is to reduce the total number of edges in GCNs' graphs (i.e., the size of the adjacency matrices). A SOTA graph sparsification pipeline [li2020sgcn] first pretrains GCNs on their full graphs, and then sparsifies the graphs based on the pretrained GCNs. Note that the weights of GCNs are not updated during graph sparsification, during which $\hat{A}$ is replaced with a sparsified $\hat{A}_{sp}$ in Eq. (2) to derive the loss function $\mathcal{L}(\hat{A}_{sp})$. As such, the overall loss function during graph sparsification can be written as [li2020sgcn]:

$$\mathcal{L}_{sp} = \mathcal{L}(\hat{A}_{sp}) + \Omega(\hat{A}_{sp}), \quad (3)$$

where $\Omega(\hat{A}_{sp})$ denotes the sparse regularization term, which ideally becomes zero once the sparsity of the graph adjacency matrix reaches the specified pruning ratio $p$ (e.g., $\|\hat{A}_{sp}\|_0 \leq (1 - p)\,\|\hat{A}\|_0$ for a given ratio $p$). As the $\ell_0$ norm is not differentiable, the SOTA graph sparsification work [li2020sgcn] formulates Eq. (3) as an alternating optimization problem for updating the graph adjacency matrices using gradient descent.
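Since the exact optimization of Eq. (3) is involved, the following is a minimal magnitude-based stand-in for graph sparsification, a simplifying assumption on our part rather than SGCN's actual alternating-optimization solver; the toy adjacency values are hypothetical:

```python
import numpy as np

def sparsify_graph(A_hat, prune_ratio):
    """Prune the smallest-magnitude edges of a (normalized) adjacency matrix.

    Hard-thresholding sketch: removes a `prune_ratio` fraction of nonzero
    entries. Note it acts on individual entries, so a symmetric edge may be
    pruned in only one direction.
    """
    edges = np.flatnonzero(A_hat)          # flat indices of nonzero entries
    k = int(len(edges) * prune_ratio)      # number of entries to remove
    if k == 0:
        return A_hat.copy()
    mags = np.abs(A_hat.ravel()[edges])
    drop = edges[np.argsort(mags)[:k]]     # smallest-magnitude entries
    A_sp = A_hat.copy()
    A_sp.ravel()[drop] = 0.0
    return A_sp

A_hat = np.array([[0.0, 0.9, 0.1],
                  [0.9, 0.0, 0.5],
                  [0.1, 0.5, 0.0]])
A_sp = sparsify_graph(A_hat, prune_ratio=0.5)
```

The binary support of `A_sp` is exactly the "graph mask" used by the GEB detector introduced in Sec. 3.2.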
3.2 Finding 1: EB Tickets Exist in GCN Graphs
In this subsection, we first conduct an extensive set of experiments to show that GEB tickets can be observed across popular graph datasets, and then propose a simple yet effective method to detect the emergence of GEB tickets.
Experiment Settings. For this set of experiments, we follow the SOTA graph sparsification work [li2020sgcn] to first pretrain GCNs on unpruned graphs, then train and prune the graphs based on the pretrained GCNs, and finally retrain GCNs from scratch on the pruned graphs to evaluate the achieved accuracy. In addition, we adopt a two-layer GCN as described in Eq. (1); both the GCN and graph training take a total of 100 epochs, and an Adam solver is used with learning rates of 0.01 and 0.001 for training the GCNs and graphs, respectively. For retraining on the pruned graphs, we keep the same settings by default.
Existence of GEB Tickets. We follow the SOTA method [li2020sgcn] to sparsify the graph, but instead prune graphs that have not been fully trained (i.e., before the accuracy reaches its final top value), to see if reliable GEB tickets can be observed, i.e., whether the retraining accuracy reaches that of the ticket drawn from the corresponding fully-trained graph. Fig. 1 shows the accuracies achieved by retraining the pruned graphs drawn from different early epochs, considering three different graph datasets and six pruning ratios. Two intriguing observations can be made: (1) there consistently exist GEB tickets drawn from certain early epochs (e.g., as early as 10 epochs w.r.t. the total of 100 epochs), whose retraining accuracy is comparable to or even better than that of tickets drawn at a later stage, including the "ground-truth" tickets drawn from the fully-trained graphs (i.e., at the 100th epoch); and (2) some GEB tickets (e.g., on Pubmed) can even outperform their unpruned graphs (denoted using dashed lines), potentially thanks to the sparse regularization, as mentioned in [You2020Drawing]. The first observation implies the possibility of "overcooking" when identifying important graph edges at later training stages.
Detection of GEB Tickets. The existence of GEB tickets and the prohibitive cost of GCN training motivate us to explore the possibility of automatically detecting the emergence of GEB tickets. To do so, we develop a simple yet effective detector via measuring the “graph distance” between consecutive epochs during graph sparsification. Specifically, we define a binary mask of the drawn GEB tickets (i.e., pruned graphs), where 1 denotes the reserved edges and 0 denotes the pruned edges, and use the hamming distance between the corresponding masks to measure the “distance” between two graphs.
Fig. 2 (a) visualizes the pairwise "graph distance" matrices among 100 training epochs, where the $(i, j)$-th element within the matrices represents the distance between the pruned graphs drawn at the $i$-th and $j$-th epochs. We see that the distance decreases rapidly (i.e., the color changes from green to yellow) during the first few epochs, indicating that the reserved edges in the pruned graphs quickly converge at the very early training stages. We therefore measure and record the distances among three consecutive epochs (i.e., look back three epochs during training), and stop training the graph when all the recorded distances are smaller than a specified threshold $\epsilon$. Fig. 2 (b) plots the maximum recorded distances as graph training epochs increase, where the red line denotes the threshold we adopt in all experiments with different pruning ratios. The identified GEB tickets are consistently drawn from the early (10th-26th) epochs. These experiments validate the effectiveness of our GEB detector, which has negligible overhead compared with the total graph training cost (i.e., 0.1%).
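The mask-based distance measurement and the look-back stopping rule above can be sketched as follows (the threshold value and three-epoch window follow the text; the specific mask values are hypothetical):

```python
import numpy as np
from collections import deque

def graph_mask(A_sp):
    """Binary mask of a pruned graph: 1 = reserved edge, 0 = pruned edge."""
    return (A_sp != 0).astype(np.uint8)

def mask_distance(m1, m2):
    """Normalized hamming distance between two graph masks."""
    return np.mean(m1 != m2)

def geb_emerged(recent_masks, eta=0.1):
    """GEB detector sketch: the ticket has emerged once all pairwise
    distances among the last few epochs' masks fall below the threshold."""
    masks = list(recent_masks)
    return all(mask_distance(a, b) < eta
               for i, a in enumerate(masks) for b in masks[i + 1:])

# Hypothetical masks from three consecutive sparsification epochs
# (identical here, i.e., the reserved edges have converged).
history = deque(maxlen=3)  # look back three epochs
m = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=np.uint8)
history.extend([m, m.copy(), m.copy()])
```

In a real run, `history` would be refilled with the mask drawn at each sparsification epoch, and graph training would stop as soon as `geb_emerged(history)` returns `True`.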
3.3 Finding 2: Joint-EB Tickets Exist
In this subsection, we first develop a co-sparsification framework to prune the GCN graphs and networks, then show through an extensive set of experiments (more are in the supplement) that joint-EB tickets exist across various models and datasets, and finally propose a simple detector to identify the emergence of joint-EB tickets during co-sparsification of the GCN graphs and networks.
Co-sparsification of the GCN Graph and Network. To explore the possibility of drawing joint-EB tickets between GCN graphs and networks, we first develop a co-sparsification framework, as described in Fig. 5 (c) and Algorithm 2. Specifically, we iteratively update the GCN weights and graph adjacency matrices based on their corresponding loss functions formulated in Eq. (2) and Eq. (3), respectively; after training for a certain number of epochs (e.g., 100 epochs), we simultaneously prune the trained GCN graphs and networks using a magnitude-based pruning method [han2015deep, frankle2018the], and finally retrain the resulting pruned GCNs on the pruned graphs. Fig. 3 shows the achieved accuracy-FLOPs trade-offs of our co-sparsification framework when evaluating GCNs [kipf2017semi] on the Cora and CiteSeer graph datasets. We can see that co-sparsification can achieve up to 90% sparsity in GCN weights while maintaining an accuracy comparable to that of the unpruned GCN graphs and networks.
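The simultaneous magnitude-based pruning step of this framework can be sketched as below; the alternating training of the weights and adjacency matrix is omitted, and the matrix sizes and pruning ratios are illustrative assumptions:

```python
import numpy as np

def magnitude_prune(M, ratio):
    """Zero out the `ratio` fraction of entries of M with smallest magnitude
    [han2015deep]."""
    flat = np.abs(M).ravel()
    k = int(flat.size * ratio)
    if k == 0:
        return M.copy()
    thresh = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    out = M.copy()
    out[np.abs(M) <= thresh] = 0.0
    return out

def co_sparsify(A_hat, W, p_graph, p_weight):
    """Co-sparsification sketch: after alternating training of A_hat and W
    (omitted here), simultaneously prune both the graph and the weights."""
    return magnitude_prune(A_hat, p_graph), magnitude_prune(W, p_weight)

rng = np.random.default_rng(1)
A_hat = rng.random((5, 5))        # stand-in trained adjacency (all nonzero)
W = rng.normal(size=(5, 8))       # stand-in trained GCN weights
A_sp, W_sp = co_sparsify(A_hat, W, p_graph=0.4, p_weight=0.9)
```

The pruned pair `(A_sp, W_sp)` would then be retrained together, which is exactly the subgraph/subnetwork pair examined for joint-EB tickets below.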
Existence of Joint-EB Tickets. The existence of GEB tickets in GCN graphs and EB tickets in DNNs motivates our curiosity about the existence of joint-EB tickets between GCN graphs and networks. Fig. 4 (a) visualizes the retraining accuracies of the GCN subnetworks on subgraphs, with both drawn from different early epochs, which consistently indicates the existence of joint-EB tickets under an extensive set of experiments with different graph datasets, graph pruning ratios, and weight pruning ratios. Furthermore, we can see that the joint-EB tickets emerge at the very early training stages (w.r.t. a total of 100 epochs), i.e., their retraining accuracy is comparable to or even better than that of training the corresponding unpruned GCN graphs and networks, or of training the pruned graphs with unpruned GCN networks [li2020sgcn].
Detection of Joint-EB Tickets. We also develop a simple method to automatically detect the emergence of joint-EB tickets, whose main idea is similar to the GEB ticket detector described in Sec. 3.2, but with an additional binary mask for the drawn GCN subnetwork: the pruned weights are set to 0 while the kept ones are set to 1, and the distance between subnetworks is characterized by the hamming distance between the corresponding binary masks, following [You2020Drawing]. For detecting joint-EB tickets, we measure both the "subgraph distance" $d_{graph}$ and the "subnetwork distance" $d_{net}$ among consecutive epochs, resulting in three choices for the stop criterion (for a given threshold $\epsilon$): (1) $d_{graph} < \epsilon$; (2) $d_{net} < \epsilon$; (3) $\max(d_{graph}, d_{net}) < \epsilon$.
Fig. 4 (b) leverages the third criterion to visualize the distance trajectories of GCN networks on the Cora and CiteSeer datasets at different graph and network pruning ratio pairs. Ablation studies of all three criteria can be found in Sec. 4.4. We can see that all criteria can effectively identify the emergence of joint-EB tickets, e.g., as early as 11 epochs w.r.t. a total of 100 epochs. Interestingly, the drawn joint-EB tickets can achieve a retraining accuracy comparable to or even better than that of the subgraph and subnetwork pairs drawn at later stages, which again implies the possibility of "over-cooking", as in the case of DNNs discussed in [You2020Drawing]. All results in this set of experiments consistently validate the existence of joint-EB tickets and the effectiveness of our joint-EB ticket detector.
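The three candidate stopping criteria admit a compact sketch. The exact forms here, thresholding the subgraph distance, the subnetwork distance, or their maximum, are our reconstructed reading of the criteria, with the last one corresponding to requiring both masks to have stabilized:

```python
def stop_criteria(d_graph, d_net, eps=0.1):
    """Candidate joint-EB stopping criteria, given the latest subgraph
    distance d_graph and subnetwork distance d_net between epochs."""
    return {
        "graph_only": d_graph < eps,         # criterion (1)
        "net_only": d_net < eps,             # criterion (2)
        "joint": max(d_graph, d_net) < eps,  # criterion (3)
    }

# e.g., the subgraph mask has stabilized but the subnetwork mask is still moving:
flags = stop_criteria(d_graph=0.05, d_net=0.2)
```

Under criterion (3), training continues in this example because the subnetwork mask is still changing, which matches the conservative behavior analyzed in Sec. 4.4.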
3.4 Proposed GEBT: Efficient Training + Inference
In this subsection, we present our proposed GEBT technique, which leverages the existence of both GEB tickets and joint-EB tickets to develop a generic efficient GCN training framework. Note that GEBT achieves a "win-win" of efficient training and efficient inference, as the trained GCN graphs and networks resulting from GEBT are naturally efficient. Here, we first describe the GEBT technique via drawing GEB tickets and via drawing joint-EB tickets, respectively, and then provide a complexity analysis to quantify the advantages of GEBT.
GEBT via GEB Tickets. Fig. 5 (b) illustrates the overall pipeline of the proposed GEBT via drawing GEB tickets, which involves three steps: pretrain GCNs on the full graphs, train and sparsify the graphs to identify GEB tickets, and then retrain the GCN networks on the GEB tickets. The GEB ticket detection scheme is described in Algorithm 1. Specifically, we use a magnitude-based pruning method [han2015deep] to derive the graph mask (denoted $m_g$) for calculating the graph distance between subgraphs from consecutive epochs, and store the distances in a first-in-first-out (FIFO) queue. GEBT training stops when the maximum graph distance in the FIFO queue is smaller than the specified threshold $\epsilon$, with a default value of 0.1 in all our experiments, and returns the GEB ticket (i.e., $m_g$) to be retrained.
GEBT via Joint-EB Tickets. Fig. 5 (c) shows the overall pipeline of the proposed GEBT technique via drawing joint-EB tickets. While SOTA efficient GCN training methods consist of three steps, namely (1) fully pretrain the GCN networks on the full graphs, (2) train and prune the graphs based on the pretrained GCNs, and (3) retrain the GCN networks on the pruned graphs from scratch, GEBT via drawing joint-EB tickets has only two steps: it first follows the co-sparsification framework described in Sec. 3.3 to prune and derive the GCN subgraph and subnetwork pairs, and then retrains the subnetwork on the drawn subgraph to restore accuracy. The joint-EB ticket detection scheme is described in Algorithm 2, where a FIFO queue is adopted to record the distances of both subgraphs and subnetworks between consecutive epochs. GEBT training stops when the maximum recorded distance is smaller than a predefined threshold $\epsilon$, and returns the detected joint-EB tickets (i.e., the graph mask $m_g$ and the weight mask $m_w$) for further retraining.
Complexity Analysis of GEBT vs. SOTA Methods. Here we provide a complexity analysis to quantify the advantages of our GEBT technique. The time and memory complexity of GCN inference can be captured by $O(L\|A\|_0 F + L N F^2)$ and $O(L N F + L F^2)$, respectively, where $L$ is the total number of GCN layers, $N$ is the total number of nodes, $\|A\|_0$ is the total number of edges, and $F$ is the total number of features [chiang2019cluster]. Assuming that drawing joint-EB tickets leads to sparsities of $s_g$ and $s_w$ in the GCN graphs and networks, respectively, the inference time and memory complexity of GCNs resulting from our GEBT become $O((1 - s_g) L \|A\|_0 F + (1 - s_w) L N F^2)$ and $O(L N F + (1 - s_w) L F^2)$, respectively. Note that the corresponding training complexity scales up with the total number of required training epochs.
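Plugging numbers into these expressions gives a quick feel for the savings. The sketch below evaluates the dominant terms for a Reddit-scale graph (node and neighbor counts from the introduction); treating graph sparsity as scaling only the edge term and weight sparsity as scaling only the combination terms is our simplifying assumption:

```python
def gcn_complexity(L, N, E, F, s_graph=0.0, s_weight=0.0):
    """Dominant inference cost terms following [chiang2019cluster]:
    time  ~ L*E*F   (aggregation) + L*N*F^2 (combination),
    memory ~ L*N*F  (activations) + L*F^2   (weights).
    s_graph / s_weight are the fractions of edges / weights pruned away.
    """
    time = L * (1 - s_graph) * E * F + L * (1 - s_weight) * N * F * F
    memory = L * N * F + L * (1 - s_weight) * F * F
    return time, memory

# Reddit-scale example: ~233k nodes, ~50 neighbors each, 2 layers, F = 16.
N, E = 232_965, 232_965 * 50
dense_t, dense_m = gcn_complexity(L=2, N=N, E=E, F=16)
sparse_t, sparse_m = gcn_complexity(L=2, N=N, E=E, F=16,
                                    s_graph=0.5, s_weight=0.9)
```

Comparing `sparse_t / dense_t` and `sparse_m / dense_m` for different `(s_graph, s_weight)` pairs mirrors the FLOPs savings reported in Sec. 4.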
4 Experiment Results
4.1 Experiment Setting
Models, Datasets, and Training Settings. We evaluate the proposed methods on the representative full-batch training GCN algorithm [kipf2017semi], using three citation graph datasets (Cora, CiteSeer, Pubmed) [sen2008collective]. The statistics of these three datasets are summarized in Tab. 1. We train all the chosen two-layer GCN models with 16 hidden units for 100 epochs using an Adam optimizer [kingma2014adam] with a learning rate of 0.01, and adopt the same dataset splits as described in [kipf2017semi].
Baselines and Evaluation Metrics. We conduct two kinds of comparisons to measure the effectiveness of the proposed GEBT's improved training and inference efficiency, in terms of node classification accuracy, total training FLOPs, and inference FLOPs:
(1) comparing the proposed methods with other graph sparsifiers, including random pruning [frankle2018the] and SGCN [li2020sgcn]; and
(2) comparing the performance of the sparsified GCNs using sparsified subgraphs with standard SOTA GCN algorithms using unpruned graphs, including GCN [kipf2017semi], GraphSAGE [hamilton2017inductive], GAT [GAT], and GIN [xu2018how].
| Methods | Accuracy (%) | Inference FLOPs (M) | Memory (MB) |
| Improvement | -1.1 ~ +1.1 | 2.2x ~ 10.0x | 0.1% ~ 4.1% |
4.2 GEBT over SOTA Sparsifiers
We compare the proposed GEBT with the existing SOTA GCN sparsification pipeline [li2020sgcn] on the three graph datasets to evaluate its effectiveness. Fig. 6 shows that GEBT consistently outperforms all competitors in terms of the trade-off between measured accuracy and computational cost (i.e., training and inference FLOPs). Specifically, GEBT via GEB tickets achieves a 24.7% ~ 32.1% reduction in training FLOPs while offering comparable accuracy (-1.4% ~ +4.9%) across a wide range of graph pruning ratios, as compared to SGCN. Furthermore, GEBT via joint-EB tickets reaches an even more aggressive 80.2% ~ 85.6% and 84.6% ~ 87.5% reduction in training and inference FLOPs, respectively, over SGCN when pruning the GCN networks up to 90% sparsity, while maintaining a comparable accuracy range (-1.27% ~ +1.35%). This set of experiments verifies the efficiency benefits of the proposed GEBT framework and the high quality of the drawn GEB and joint-EB tickets.
4.3 GEBT over SOTA GCNs
We further compare the performance of GEBT with four SOTA GCN algorithms to evaluate the benefits of its sparsification. As shown in Tab. 2, GEBT again consistently outperforms the baseline GCN algorithms in terms of efficiency-accuracy trade-offs. Specifically, GEBT achieves a 2.2x ~ 10.0x inference FLOPs reduction and 0.1% ~ 4.1% storage savings, while offering comparable accuracy (-1.1% ~ +1.1%), as compared to SOTA GCN algorithms.
4.4 Ablation Studies of Joint-EB Detectors
We adopt a mixture of the "graph distance" and "network distance" to identify joint-EB tickets in all the above experiments, and are further curious whether a detector based on only one of the two distances can still lead to similar benefits. To validate this, we first measure and compare the epoch ranges in which joint-EB tickets emerge when applying the other two criteria, (1) $d_{graph} < \epsilon$ and (2) $d_{net} < \epsilon$, respectively. We can see that the graph distance criterion is more suitable than the network distance criterion, because the latter undergoes a "warming up" process, starting from 0, rising to 1, and then quickly dropping to nearly zero, making joint-EB detection collapse to the initial training stage even if we ignore such "warming up", as shown in Fig. 7 (a). Then, we compare the retraining accuracy of the joint-EB tickets drawn using all three criteria to examine the robustness to different criteria (i.e., drawing epochs). The results in Fig. 7 (b) show that the third criterion works best.
5 Conclusion
GCNs have gained increasing attention thanks to their SOTA performance on graphs. However, the notorious challenge of GCN training and inference limits their application to large real-world graphs and hinders the exploration of deeper and more sophisticated GCN models. To this end, we explore the possibility of drawing lottery tickets when sparsifying GCN graphs. Specifically, we for the first time discover the existence of GEB tickets that emerge at the very early stage when sparsifying GCN graphs, and propose a simple yet effective detector to automatically identify the emergence of such GEB tickets. Furthermore, we develop a generic efficient GCN training framework dubbed GEBT that can significantly boost the efficiency of GCN training and inference by enabling co-sparsification and drawing joint-EB tickets of GCNs. Experiments on various GCN models and datasets consistently validate our GEB finding and the effectiveness of the proposed GEBT. This work opens up a new perspective for understanding GCNs and for efficient GCN training and inference.