1 Introduction
Deep learning and its use in a wide range of applications, from image classification to video processing to speech recognition and natural language processing, have given rise to paradigms such as Convolutional Neural Networks (CNNs) [17] and Long Short-Term Memory (LSTM) [14]. These paradigms typically use data represented in Euclidean space and can efficiently extract latent information from Euclidean data such as images, videos, audio, and text [25]. While deep learning has achieved significant success in tasks with Euclidean data, an increasing number of applications use data generated from non-Euclidean domains, represented as graphs with complex relationships and interdependencies between objects, for which most existing deep learning algorithms fall short. For instance, in e-commerce, a graph-based learning system exploits the interactions between users and products [3, 26] to make highly accurate recommendations. In chemistry, molecules are modeled as graphs for bioactivity identification in drug research [10]. In a citation network, papers can be categorized into different groups while linked to each other via citations [24, 16]. In each of these cases, the graph has a varying number of unordered nodes, and each node has a different number of neighbors, leading to massive data dependencies among nodes.
The irregularity of graph data imposes significant challenges on existing machine learning algorithms and makes critical feature extraction operations, such as convolutions, not directly applicable. As a result, Graph Neural Networks have been proposed, in various forms, to extend deep learning approaches to graph data [11, 21, 23, 20, 7, 25]. Among these, the Graph Convolutional Network (GCN), an approach that marries some ideas of CNNs to the distinct needs of graph data processing, has demonstrated significant potential [4, 8, 16]. With the development of GCNs, their acceleration becomes an urgent issue. GCNs have been mainly adopted in two contexts: (1) big data and data mining, where social media and e-commerce companies use GCNs to identify user preferences and where police officers use GCNs to learn about the social relationships of a criminal suspect; and (2) embedded devices, where GCNs are used in real-time motion capture and 3D modeling in animation and sports game development. Both contexts need high-performance GCN inference. Big data mining requires high throughput, especially during events like "Black Friday", when millions of people shop on Amazon at the same time: to advertise the correct products to every customer, a very large graph needs to be continuously evaluated. Real-time motion capture and modeling require extremely short-latency inference. Although most existing GCNs have two or three layers, the current trend shows that GCNs are becoming deeper (as happened with CNNs); a GCN with 152 layers has been proposed recently, and its efficiency has been demonstrated on the task of point cloud semantic segmentation [19]. Strict timing requirements make the acceleration of GCNs a critical area of research.
There has so far been relatively little work on GCN acceleration. The major kernel of GCN inference is sparse-dense matrix multiplication (SPMM). In the past few years, many efficient accelerators have been proposed for sparse CNNs [12, 28, 15], as summarized in Section 6. These accelerators benefit from the following three features of sparse CNNs: (i) the nonzeros among CNN input and output channels are roughly balanced, leading to an easy workload distribution among parallel PEs; (ii) the sparsity of CNNs is relatively low, so the matrices can typically still be processed in a dense format; (iii) the sparse matrices or tensors in CNNs are relatively small. For this new GCN problem, however, these three assumptions no longer hold. In particular, (a) the distribution of nonzeros can be extremely unbalanced and clustered (see the two examples in Figure 1) in the sparse matrices that describe graphs (e.g., a graph adjacency matrix), given that real-world graphs often follow a power-law distribution; this makes balancing the workload distribution a great challenge, particularly in hardware. (b) The sparse matrices in GCNs can be significantly sparser than those in CNNs, so the hardware design needs to handle sparse data formats (e.g., CSR, CSC, COO) efficiently, which is much more difficult than doing so in software. (c) The imbalance issue is significantly exacerbated by the huge size of real-world graphs; for example, the Reddit graph (233K nodes and 12M links) is significantly larger than the typical images processed by sparse-CNN accelerators. Consequently, providing architectural support for efficient GCN acceleration with a balanced workload distribution among a large number of PEs is a very challenging task. In this work, we propose UWB-GCN, an architecture for accelerating GCN inference through two dynamic workload-balancing strategies: local sharing and remote switching. We first propose a baseline architecture design and then present our online workload-rebalancing techniques, which monitor and adjust the workload distribution by dynamically reconfiguring the task distribution network. The ideal configuration is reused for later iterations, forming a hardware performance auto-tuning paradigm. We implement UWB-GCN in Verilog HDL and evaluate it on a Xilinx VCU118 FPGA. Compared with the baseline design, UWB-GCN improves the average PE utilization rate from 63.4% to 92.6%, leading to a 2.7× speedup. This paper thus makes the following contributions:

We propose UWB-SPMM, an SPMM engine for hardware-based sparse-dense matrix multiplication on matrices with drastically imbalanced nonzero distributions. In particular, we propose a novel hardware-based performance auto-tuning paradigm for workload rebalancing.

We propose UWB-GCN, a GCN accelerator based on the UWB-SPMM engine that can significantly accelerate GCN inference with negligible overhead. Evaluation results show that UWB-GCN provides, on average, 246.7×, 78.9×, and 2.7× speedups compared with CPU, GPU, and the baseline design without workload rebalancing, respectively.
2 Background
In this section, we briefly introduce GCNs, highlighting their differences from traditional network models such as CNNs and the hardware-design challenges arising from these differences.
2.1 Graph Convolutional Network
In this paper, we focus on spectral-based graph convolutional networks [25, 16], since they are among the most fundamental and widely used GCN structures and have numerous variants [13, 8, 16]. Please refer to Section 6 for the history and alternative types of GCNs. Eq. 1 shows the layer-wise forward propagation of a multi-layer spectral GCN:
$X^{(l+1)} = \sigma\big(\tilde{A}\, X^{(l)}\, W^{(l)}\big)$   (1)
$A$ is the graph adjacency matrix, with each row delineating the connections of a vertex with the other vertices. $X^{(l)}$ is the matrix of input features in layer $l$; each column of $X^{(l)}$ represents a feature, while each row denotes a node. $W^{(l)}$ is the weight matrix of layer $l$. $\sigma(\cdot)$ denotes the non-linear activation function, e.g., ReLU [17]. In general, $A$ needs to be normalized via $\tilde{A} = \tilde{D}^{-\frac{1}{2}}(A+I)\tilde{D}^{-\frac{1}{2}}$, where $I$ is the identity matrix and $\tilde{D}$ is the diagonal degree matrix with $\tilde{D}_{ii} = \sum_j (A+I)_{ij}$. The reason is that, without normalization, multiplying the feature vectors by $A$ changes their scale: nodes with more neighbors tend to have larger values under feature extraction. Note that during both training and inference of a GCN, $\tilde{A}$ remains constant. Since $\tilde{A}$ can be computed from $A$ offline, in the remainder of this paper we use $A$ to denote the normalized $\tilde{A}$. In general, $A$ is multiplied only once per layer; however, when multi-hop neighboring information is to be collected, $A$ can be multiplied twice or more (i.e., $A^2$, $A^3$, etc.). Eq. 1 is essentially derived from graph signal processing theory: convolutions on a graph can be converted to the multiplication of a signal $x$ (i.e., a scalar for each node) and a filter $g$ in the frequency domain via the graph Fourier transform:
$g \star x = U\big((U^{T} g) \odot (U^{T} x)\big)$   (2)
where $\odot$ denotes the Hadamard product. $U$ is the collection of eigenvectors of the normalized graph Laplacian $L = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}} = U \Lambda U^{T}$. The diagonal matrix $\Lambda$ comprises the corresponding eigenvalues. If a frequency-domain filter $g_{\theta} = \mathrm{diag}(\theta)$ is defined, then Eq. 2 can be simplified [4] as:

$g_{\theta} \star x = U g_{\theta} U^{T} x$   (3)
Eq. 3 can be further simplified by approximating the filter with Chebyshev polynomials of the diagonal matrix $\Lambda$ [8, 16], which leads to Eq. 1.
Figure 3 illustrates the structure and compute flow of one GCN layer. Multiplying by $A$ integrates the information from connected neighboring nodes; multiplying by $W^{(l)}$ and passing the result through the non-linear activation function $\sigma(\cdot)$ yields the input features of the next layer. After multiple layers, the GCN is able to extract highly abstracted features for various learning purposes.
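As a toy illustration of this propagation rule, the following pure-Python sketch evaluates one layer, $\sigma(A X W)$ with ReLU as $\sigma$, on a 3-node graph. The graph, feature, and weight values are made up, and the dense matrix product ignores sparsity entirely:

```python
# One GCN layer, X' = relu(A @ X @ W), with plain Python lists.
# A: (normalized) adjacency matrix, X: node features, W: layer weights.

def matmul(P, Q):
    """Dense matrix product of two list-of-lists matrices."""
    rows, inner, cols = len(P), len(Q), len(Q[0])
    return [[sum(P[i][k] * Q[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

def gcn_layer(A, X, W):
    # Aggregate neighbor features (A @ X), transform them (.. @ W),
    # then apply the non-linear activation.
    return relu(matmul(matmul(A, X), W))

# 3-node toy graph: 0 -- 1 -- 2, with self-loops on the diagonal.
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
W = [[0.5, -1.0],
     [0.5,  1.0]]
out = gcn_layer(A, X, W)   # next layer's input features
```

Stacking two such calls (with per-layer weights) gives the two-layer GCN profiled in the next subsection.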
2.2 GCN Matrices Profiling
To leverage the specific characteristics of the matrices for performance improvement, we profile the sparsity and dimensions of $A$, $X$, and $W$ for a 2-layer GCN using 5 publicly available datasets that have been widely evaluated for GCNs [16]. The profiling results are listed in Table 1. As can be seen, $A$ is extremely sparse (density below 0.2% for all five datasets). For most of the datasets, the input feature matrix $X_1$ of the first layer is also very sparse, as these are raw features obtained directly from the graph. $X_2$ is much denser (60-90% nonzeros). The $W$ matrices are, in general, dense.
The dimensions of the matrices in GCNs depend on the dataset and can range from thousands to millions or more. Therefore, $A$ can be extremely large and has to be stored in a sparse format. Unlike CNNs, where the number of features per layer stays roughly constant or increases, in GCNs the number of features often drops drastically from layer to layer: there are typically thousands of features in the first layer, but only a few dozen in the second.
These observations are essentially general and widely applicable to other GCN datasets. $A$ being large and sparse follows from the scale and fundamental properties (e.g., the power-law degree distribution) of real-world graphs. In CNNs, the size of the feature maps decreases with depth for higher abstraction, while the number of channels increases for stronger abstracting capability. In GCNs, however, the structure of the graph (i.e., $A$) remains constant across layers, while the number of feature channels decreases as features are aggregated and abstracted. This aggressive information aggregation explains why the sparsity of $X$ drops significantly with depth.
Table 1: Density and dimensions of the GCN matrices for the five datasets.

|         |      | CORA  | CITESEER | PUBMED | NELL    | REDDIT |
|---------|------|-------|----------|--------|---------|--------|
| Density | A    | 0.18% | 0.11%    | 0.028% | 0.0073% | 0.043% |
|         | W    | 100%  | 100%     | 100%   | 100%    | 100%   |
|         | X1   | 1.27% | 0.85%    | 10.0%  | 0.011%  | 51.6%  |
|         | X2   | 78.0% | 89.1%    | 77.6%  | 86.4%   | 60.0%  |
| Dim     | Node | 2708  | 3327     | 19717  | 65755   | 232965 |
|         | F1   | 1433  | 3703     | 500    | 61278   | 602    |
|         | F2   | 16    | 16       | 16     | 64      | 64     |
|         | F3   | 7     | 6        | 3      | 186     | 41     |
3 GCN Baseline Architecture
We propose a baseline architecture design for GCNs. Although we call it a baseline, it is, to the best of our knowledge, the first architecture designed specifically for GCNs. The baseline design shares some similarities with existing sparse-CNN accelerator designs, but must additionally support ultra-high sparsity and large dimensions in the input matrices. In the next section, we show how to achieve near-optimal workload balancing on top of this baseline design.
3.1 Matrix Computation Order
To compute $A X W$, there are two alternative computation orders: $(A X) W$ and $A (X W)$. The choice dictates the number of nonzero multiplications. Based on our profiling observations, $A$ is ultra-sparse and large, $X$ is generally sparse and usually has many columns, and $W$ is small and dense. Multiplying $A$ and $X$ first produces a very large dense matrix, and multiplying it by another dense matrix brings significant computation workload and delay. Alternatively, for $A (X W)$, both multiplications are sparse-dense matrix multiplications¹; the scale of computation is thus drastically smaller. Table 2 lists the amount of computation for the datasets under the two orders. Since the difference is obviously huge, in our design we first perform $X W$ and then left-multiply the result by $A$.

¹ Sparse-matrix dense-matrix multiplication is known as SPMM; sparse-matrix sparse-matrix multiplication is known as SpGEMM.
Table 2: Number of multiply operations under the two computation orders.

| Layer  | Order | CORA   | CITESEER | PUBMED | NELL | REDDIT |
|--------|-------|--------|----------|--------|------|--------|
| Layer1 | (AX)W | 62.3M  | 197.5M   | 163.2M | 257G | 16.3G  |
|        | A(XW) | 999.7K | 1.87M    | 17.5M  | 47M  | 6.1G   |
| Layer2 | (AX)W | 468.2K | 493.0K   | 2.3M   | 800M | 764.3M |
|        | A(XW) | 329.3K | 357.6K   | 1.06M  | 735M | 530.3M |
| All    | (AX)W | 62.8M  | 198.0M   | 165.5M | 258G | 17.1G  |
|        | A(XW) | 1.33M  | 2.23M    | 18.6M  | 782M | 6.6G   |
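The kind of gap shown in Table 2 can be reproduced with a back-of-the-envelope model that counts one multiplication per nonzero of the left operand per column of the right operand. The dimensions and densities below are illustrative, not taken from any of the profiled datasets:

```python
# Estimate multiply counts for (A @ X) @ W vs. A @ (X @ W).
# ops = nonzeros(left operand) * columns(right operand).

def spmm_ops(nnz_left, cols_right):
    # Each nonzero of the left matrix touches one output element
    # per column of the right matrix.
    return nnz_left * cols_right

n_nodes, f_in, f_out = 10_000, 1_000, 16
density_A, density_X = 0.001, 0.01
nnz_A = int(n_nodes * n_nodes * density_A)   # nonzeros of A
nnz_X = int(n_nodes * f_in * density_X)      # nonzeros of X

# Order 1: (A @ X) @ W -- A @ X is dense, so the second stage pays
# for all n_nodes * f_in entries of the intermediate.
order1 = spmm_ops(nnz_A, f_in) + spmm_ops(n_nodes * f_in, f_out)

# Order 2: A @ (X @ W) -- both stages stay sparse-dense, and the
# intermediate X @ W only has f_out (small) columns.
order2 = spmm_ops(nnz_X, f_out) + spmm_ops(nnz_A, f_out)

advantage = order1 / order2   # how much cheaper order 2 is
```

With these made-up numbers, order 2 needs roughly 80× fewer multiplications, mirroring the trend in Table 2.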
3.2 Baseline SPMM
Take $B = X W$ as an example: if $X$ is of size $M \times N$, $W$ is of size $N \times K$, and $B$ is of size $M \times K$, we can reformulate $B$ column by column as:
$B_{:,j} = \sum_{i=1}^{N} X_{:,i} \, W_{i,j}$   (4)
where $X_{:,i}$ is the $i$th column of $X$ and $W_{i,j}$ is the element of $W$ at row $i$ and column $j$. In other words, by broadcasting the $i$th element of column $j$ of $W$ to the entire column $i$ of $X$, we obtain a partial result of column $j$ of $B$. Essentially, $W$ is processed in a streaming fashion: each element finishes all the computation it is involved in at once, and then retires. In this way, we reuse the entire sparse matrix for each column of $W$ ($K$ times in total). Such a design brings additional advantages when the sparse operand is stored in Compressed Sparse Column (CSC) format (see Figure 4). A further benefit is that it provides the opportunity to pipeline multiple SPMM operations, as will be discussed later. Since a complete element of $B$ requires an entire corresponding row of $X$, to avoid expensive parallel reduction in hardware we partition $X$ and $B$ along the rows and assign the partitions to the PEs. Figure 5 depicts the procedure of calculating $X W$: the columns of $X$ and the elements of $W$ in the same color are multiplied, and the partial results are stored in $B$ in the same color.
Workload Mapping: In the baseline design, under the assumption that the nonzeros of the sparse matrix are evenly distributed among its rows, we adopt a direct, static mapping from matrix rows to PEs. For example, in Figure 6, every two rows of the sparse matrix are mapped to a separate PE, so each PE eventually processes three nonzeros.
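A minimal software model of this column-streaming scheme, with the sparse operand held in CSC form, is sketched below. The `col_ptr`/`row_idx`/`vals` arrays are the standard CSC triple; the per-PE row partitioning is omitted for brevity:

```python
# Column-streaming SPMM (Eq. 4): the sparse left matrix is in CSC form;
# each of its columns is scaled by one broadcast scalar of the dense
# right matrix and accumulated into the corresponding output column.

def spmm_csc(col_ptr, row_idx, vals, n_rows, B):
    """Multiply a CSC sparse matrix by a dense matrix B (list of rows)."""
    n_cols_B = len(B[0])
    out = [[0.0] * n_cols_B for _ in range(n_rows)]
    for j in range(n_cols_B):              # one output column at a time
        for i in range(len(col_ptr) - 1):  # stream sparse columns
            scalar = B[i][j]               # broadcast element B[i][j]
            if scalar == 0.0:
                continue
            for p in range(col_ptr[i], col_ptr[i + 1]):
                # row_idx[p] decides which PE/ACC bank this lands in
                out[row_idx[p]][j] += vals[p] * scalar
    return out

# CSC encoding of the sparse matrix [[1, 0], [0, 2], [3, 0]].
col_ptr = [0, 2, 3]
row_idx = [0, 2, 1]
vals    = [1.0, 3.0, 2.0]
B = [[1.0, 2.0],
     [3.0, 4.0]]
C = spmm_csc(col_ptr, row_idx, vals, 3, B)
```

In the accelerator, the inner accumulations for different `row_idx[p]` values run in parallel on different PEs; this loop nest only captures the dataflow order.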
3.3 Baseline Architecture Design
The objective here is an efficient architecture design for sparse-dense matrix multiplication (SPMM), where the sparse input can be ultra-sparse (like the adjacency matrix $A$) or generally sparse (like the feature matrix $X$). Figure 7 illustrates our baseline design, comprising a sparse-matrix memory (SPMMeM), a dense-column memory (DCM), a task distributor & queue (TDQ), a PE array, and an accumulation-buffers array (ACC). SPMMeM buffers the input sparse matrix. DCM buffers the input dense matrix. TDQ distributes tasks to the PEs. The PE array performs the concurrent multiplications. Finally, ACC buffers the partial results of the output matrix for accumulation.
Depending on the sparsity and storage format of the sparse input matrix, we have two alternative designs for TDQ.
TDQ1 (left of Figure 7) is used when the input is generally sparse (e.g., $X$) and stored in dense format. We perform the direct row partition discussed above and map each partition to the input buffer of a PE (see Figure 6). Each cycle, data are forwarded to a PE, assuming evenly distributed nonzeros. Since one PE may account for more than a single row, we allocate multiple task queues (TQs) per PE. As shown in Figure 6(B), in each cycle a PE can receive up to 4 nonzero elements; we therefore use four queues to buffer these nonzeros from different rows. Each cycle, an arbiter selects a non-empty queue, pops an element, checks for a read-after-write (RaW) hazard (discussed later), and forwards the element to the PE for processing.
TDQ2 (right of Figure 7) is used when the input is ultra-sparse (e.g., $A$) and stored in CSC format. Since in CSC the nonzeros are stored contiguously in a dense array, directly processing that array lets us skip all the zeros. However, we then pay the overhead of routing each nonzero to the correct PE, as the row indices are no longer contiguous and are stored in a separate index array. We use a multi-stage Omega network to route the nonzero data stream to the correct PEs according to the row indices from the index array. Each router in the Omega network has a local buffer in case the buffers of the next stage are saturated. Our design attempts to balance the data-forwarding rate and the processing capability of the PEs, which is achieved when the nonzero elements are distributed evenly among the rows. Compared with a global crossbar, the Omega network incurs much less area and hardware complexity, especially with a large number of PEs. Meanwhile, TDQ2 also accepts the streaming data of a particular column of the dense matrix from DCM.
PEs fetch the current partial results of the output matrix from ACC, perform the new multiplication, add the product to the partial results, and write them back to ACC. Each PE is coupled with a bank of ACC that stores the output rows it accounts for. A PE features two units: a multiply-accumulate unit (MAC) and an address-generation unit (AGU) for result-address generation and forwarding. Since the output is roughly dense and stored in dense format, its rows are statically partitioned among the ACC buffers. Synchronization is needed only when an entire column of the output matrix has been completely calculated; consequently, an imbalanced distribution of nonzeros across columns does not cause any performance issues.
An important issue here is the read-after-write (RaW) hazard. Since the computations are all floating-point, the pipelined MAC unit usually takes several cycles to finish an accumulation but can still accept new tasks while processing. If a new task tries to accumulate into the same partial result (i.e., the same output row) as an in-flight task, it would fetch a stale partial result from ACC, causing a RaW hazard. To avoid this hazard, we implement a stall buffer of size $d$, where $d$ is the delay of the MAC unit. We track the row indices currently being processed by the MAC, and the RaW-check unit (see Figure 7) tests whether the incoming element targets one of those rows. If so, we buffer that job and delay it for a few cycles until the hazard is resolved. This is similar to the role of the scoreboard for register RaW hazards in processor design.
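The stall logic can be modeled with a few lines of Python. This is our simplification, not the actual RTL: it tracks the rows in flight in a $d$-cycle MAC pipeline and delays any task that targets one of them:

```python
# Toy model of the RaW-hazard check in front of a pipelined MAC.

from collections import deque

def schedule(tasks, mac_delay):
    """tasks: target output-row ids, at most one issued per cycle.
    Returns the cycle at which each task is issued."""
    in_flight = deque()          # (row, cycle when its accumulation lands)
    issue_cycles = []
    cycle = 0
    for row in tasks:
        while True:
            # retire accumulations that have left the MAC pipeline
            while in_flight and in_flight[0][1] <= cycle:
                in_flight.popleft()
            if all(r != row for r, _ in in_flight):
                break            # no RaW hazard: safe to issue
            cycle += 1           # stall until the conflicting task retires
        issue_cycles.append(cycle)
        in_flight.append((row, cycle + mac_delay))
        cycle += 1
    return issue_cycles
```

With `mac_delay = 3`, the stream `[0, 1, 0, 2]` stalls the second task for row 0 by one cycle, while a stream with all-distinct rows issues back-to-back.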
Overall, for each layer of the GCN, we first execute the SPMM $X W$. Since $X$ is generally sparse and stored in dense format, we use TDQ1; the result $X W$ is dense. We then compute $A (X W)$, again an SPMM; however, as $A$ is ultra-sparse and stored in CSC format, we use TDQ2. The result is relatively dense, but after the ReLU activation a large portion of its entries become zero, so the input feature matrix of the next layer is again generally sparse.
To further improve performance, we exploit the parallelism between the two consecutive SPMMs (i.e., $B = X W$ and $A B$). This is based on the observation that once a column of $B$ has been computed, and since $A$ is constant and ready, we can immediately start multiplying $A$ with that column, without waiting for the entire $B$; this is shown in Figure 8. The design brings two major benefits: (i) we gain extra parallelism and reduce the overall delay through coarse-grained pipelining, and (ii) instead of requiring large off-chip storage for the intermediate matrix, we only need to buffer a single column of $B$ on chip. The same pattern can be reused within a GCN layer if the result is left-multiplied by further sparse matrices; e.g., some GCNs collect information from 2-hop neighbors, so the layer formulation becomes $A(A(X W))$, and the three multiplications can be pipelined.
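A simple two-stage timing model (ours, with illustrative per-column costs) shows the benefit of this column-level pipelining: the $A(XW)$ stage for column $j$ starts as soon as the $XW$ stage has produced column $j$, so the two SPMMs overlap:

```python
# Two-stage column pipeline for A @ (X @ W).
# stage1_cost: cycles to produce one column of B = X @ W;
# stage2_cost: cycles to compute A @ B[:, j] for one column.

def pipelined_cycles(cols, stage1_cost, stage2_cost):
    """Cycle at which each output column of A @ (X @ W) completes."""
    done1 = done2 = 0
    finish = []
    for _ in range(cols):
        done1 += stage1_cost                 # next column of X @ W ready
        done2 = max(done1, done2) + stage2_cost  # A-stage waits for input
        finish.append(done2)
    return finish

serial = 4 * 10 + 4 * 6                      # all of XW, then all of A(XW)
pipelined = pipelined_cycles(4, 10, 6)[-1]   # overlapped schedule
```

With 4 columns at 10 and 6 cycles per stage, the overlapped schedule finishes in 46 cycles instead of 64; the gain grows with the number of columns.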
3.4 Workload Balance Problem
The baseline architecture works well when the nonzeros are evenly distributed among the rows of the sparse matrix. When this assumption does not hold, which is very likely for power-law graphs, the performance of the baseline architecture degrades significantly due to PE workload imbalance.
Figure 9(A) and (B) illustrate two types of workload imbalance, local imbalance and remote imbalance, together with the resulting histograms when mapping to the baseline architecture with 8 PEs. Both types lead to significant performance degradation: the delay increases from the expected 2 cycles to 5 and 7 cycles, respectively.
This imbalance issue is unique to GCNs and has not been faced or resolved by existing work on sparse CNNs [12, 2, 28, 15], because the nonzeros in those sparse matrices are more or less evenly distributed. When dealing with huge, ultra-sparse matrices such as the adjacency matrix of a social-network graph following a power-law distribution, the situation is completely different. Efficiently handling this unique workload-balance problem of the new GCN application is the major research problem of this work. Typically, to achieve workload balance when dealing with sparse data structures such as sparse matrices/tensors, trees, and graphs, the software approach is to first profile the structure (e.g., through symbolic analysis) in a preprocessing stage, and then use the sampled information to guide the partitioning strategy for the real processing. In this work, we instead show how to dynamically adjust the hardware configuration for continuous workload rebalancing. Our design can be applied to a variety of specialized accelerators for processing sparse data structures.
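The effect of such imbalance under the baseline's static row partition can be seen in a few lines of Python; the per-row nonzero counts below are made up to mimic a hub-dominated, power-law-like row distribution:

```python
# With a static row partition, a round takes as long as the busiest PE.

def pe_loads(row_nnz, n_pes):
    """Statically map consecutive rows to PEs and sum their nonzeros."""
    rows_per_pe = (len(row_nnz) + n_pes - 1) // n_pes
    return [sum(row_nnz[p * rows_per_pe:(p + 1) * rows_per_pe])
            for p in range(n_pes)]

row_nnz = [1, 1, 30, 2, 1, 1, 2, 2]      # one hub row dominates
loads = pe_loads(row_nnz, 4)             # 2 rows per PE
cycles = max(loads)                      # delay of the round
ideal = sum(loads) / len(loads)          # perfectly balanced delay
utilization = ideal / cycles             # average PE utilization
```

Here the PE holding the hub row takes 32 cycles while the round would ideally take 10, so average utilization collapses to about 31%; this is the local/remote imbalance that the next section attacks.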
4 UWB-GCN Architecture Design
We treat the two types of imbalance (shown in Figure 9) separately: for local imbalance, we propose dynamic local sharing; for remote imbalance, we propose dynamic remote switching. Both are dynamic techniques that measure and adjust toward a better task-distribution configuration each time a column of the dense input matrix is processed. After several columns, the optimal configuration best matching the nonzero structure of the sparse input matrix is obtained; this configuration is then reused for processing the remaining columns of the dense matrix.
The difference between the two techniques is their granularity. Figure 10 describes our design flow. We use a heat map to represent the utilization of the PEs (from blue for underutilized to red for overutilized). Initially, we employ the equal partitioning of the baseline design: some PEs (e.g., PE2, PE7, and PE8) are overutilized while others (PE1, PE4, and PE6) are underutilized. The ultimate purpose of the design is to balance the colors (i.e., utilization) by adjusting or exchanging the workloads of the PEs (i.e., the areas in Figure 10). We first employ local balancing (right arrow), averaging out some of the overloaded work to neighbors, which improves the situation. However, the offloaded work needs to be returned for aggregation after processing. Due to chip area and design-complexity restrictions, a PE may exchange workload with its direct neighbors, its 2-hop neighbors, or even its 3-hop neighbors, but not with all PEs. When nonzeros are clustered in a region spanning several PEs, the local strategy alone may not work.
To allow an overloaded PE to exchange work with a remote underloaded PE, we propose remote PE switching (down arrow in Figure 10). By interchanging workloads between remote overutilized and underutilized PEs, followed by another round of local sharing (lower-right arrow), we can significantly improve load balance. Note that all of this happens while processing one column of the dense matrix. Our accelerator remembers the resulting plan and incrementally adjusts it when processing the next column, since the same sparse matrix is reused every time. After several rounds, the configuration best matching the sparse structure is obtained, and we use it for the remaining rounds. In the following, we discuss how to realize this strategy in hardware.
4.1 Dynamic Local Sharing
We need to estimate the utilization differences among PEs before adjusting the workload. This is achieved by comparing the number of pending tasks in each PE's task queue (TQ) with those of its neighboring PEs. Figure 11 illustrates how 1-hop local sharing is realized for TDQ1 and TDQ2, respectively. TDQ1: Before a new task is pushed into a PE's TQ, the number of pending tasks (tracked by a waiting-task counter) is compared with those of the neighboring PEs' TQs, and the task is forwarded to the TQ with the fewest pending tasks. If the task is forwarded to a neighbor, the result needs to be returned to the ACC buffer of the original PE for accumulation after the multiplication, as shown in Figure 11(B). The valid return address is calculated in the AGU unit of the PE.
TDQ2: In an Omega network, the final stage of the multi-stage network accounts for forwarding between neighboring PEs. For example, in Figure 11(C), two PEs share the same final-stage switch; let us call them a group (in Figure 12, a group has four PEs sharing the same final-stage switch). We therefore focus on the TQs of the final stage. After determining the pending-task condition, we know the proper destination PE id, and we adjust the address tag of the task before it is pushed into the final-stage TQs. To enable PEs on the group border (e.g., the leftmost or rightmost PEs) to communicate with their out-of-group neighbors, we add extra links in the final stage, as shown in Figure 11(D). Note that Figure 11(D) only shows sharing among 1-hop neighbors. By considering more distant neighbors, we obtain a more balanced design at the cost of higher hardware complexity and area. This is a design trade-off, discussed in more detail in the evaluation.
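In software terms, the TDQ1 variant of 1-hop sharing behaves roughly like the following dispatch rule; the queue model and tie-breaking are our simplifications of the hardware:

```python
# 1-hop local sharing: a task goes to the shallowest queue among its
# home PE and the two 1-hop neighbors; ties favor the home PE. The
# home id is kept with the task so the result can return to the home
# PE's ACC bank for accumulation.

def dispatch(home, queues):
    """Pick the destination among {home-1, home, home+1}."""
    candidates = [home] + [p for p in (home - 1, home + 1)
                           if 0 <= p < len(queues)]
    # key: (pending tasks, prefer-home flag) -- fewest tasks wins,
    # and the home PE wins ties.
    return min(candidates, key=lambda p: (len(queues[p]), p != home))

queues = [[], [], [], []]
stream = [1, 1, 1, 1, 2, 0]      # home-PE ids of incoming tasks
for t, home in enumerate(stream):
    dest = dispatch(home, queues)
    queues[dest].append(("task%d" % t, home))   # remember home for return

depths = [len(q) for q in queues]
```

A burst of tasks all homed at PE1 ends up spread over PE0-PE2 instead of piling up in one queue, which is exactly the local-imbalance case of Figure 9(A).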
4.2 Dynamic Remote Switching
For remote switching, we also adopt the multi-round auto-tuning approach. The idea is to find the most overutilized PE and the most underutilized PE in each round (i.e., for each column of the dense matrix) and switch part or all of their workloads; the percentage to be switched depends on their utilization gap. The architecture design is shown in Figure 12.
First, we discuss how to identify the most overutilized (hotspot) and most underutilized (coldspot) PEs. This is achieved by the PE Status Monitor (PESM). As previously mentioned, each TQ of a PE has a counter tracking the number of pending tasks, which triggers an "empty" signal upon reaching zero. All the counters are connected to a multi-stage MUX tree that integrates them into a single output signal. Right after the jobs of the current round are dispatched, we start monitoring this signal. When it triggers, we know that some PE has become idle. By voting at each level, the MUX tree identifies the PE group with the highest number of triggered "empty" signals, i.e., the coldspot. When all PEs have triggered their empty signals, we record the id of the last PE to finish, i.e., the hotspot.
Having identified a hotspot-coldspot PE tuple for the current round, to avoid thrashing, we only exchange a portion of the workload between them. The number of jobs (i.e., rows of the sparse matrix) to be switched in the $i$th round (i.e., while processing the $i$th column of the dense matrix) is calculated via:
$N_i = N_{i-1} + \dfrac{G_i}{2\,W_0}\, N_0$   (5)
where $G_i$ is the largest workload gap (i.e., the workload difference between the hotspot and coldspot PEs) in the $i$th round, $W_0$ is the initial per-PE workload under equal partitioning, and $N_0$ is the corresponding initial number of rows per PE. In the current design, we track each tuple for two rounds: the PE Status Monitor in Figure 12 has two slots, one for the current PE tuple and one for the tuple from the previous round, and we ensure there is no conflict between them. Each tracked PE tuple is updated per round according to Eq. 5; in this way, the workload-switching ratio for each tracked tuple is adjusted over two or more rounds and is highly likely to converge. The number of rounds we can track depends on the size of the tracking window in the PESM, and is a design trade-off between area and performance. Eq. 5 is computed in the Utilization Gap Tracker in Figure 12. To reduce the hardware cost of the division and multiplication in Eq. 5, we also design a hardware-efficient approximation, which is not discussed in detail here due to space limitations.
Knowing how many rows are to be switched between the remote PEs, we use a Shuffling Lookup Table (SLT) to determine which rows are interchanged between the PE tuple. The IDs of these rows are forwarded to the Remote Balancing Control Register (RBCR), and in the next round the destination PEs of these rows are updated in the Shuffling Switches (SS).
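The following toy loop mimics this multi-round behavior: each round it picks the hotspot and coldspot PEs and moves rows carrying roughly half of their workload gap. It echoes only the spirit of Eq. 5; the exact hardware update rule, its approximation, and all numbers here are simplified or made up:

```python
# Toy auto-tuning loop for remote switching.

def rebalance(pe_rows, rounds):
    """pe_rows: per-PE lists of row workloads (nonzero counts).
    Returns the max PE load observed at the start of each round."""
    history = []
    for _ in range(rounds):
        loads = [sum(rows) for rows in pe_rows]
        history.append(max(loads))
        hot = loads.index(max(loads))        # last PE to finish
        cold = loads.index(min(loads))       # first PE to go idle
        target = (loads[hot] - loads[cold]) / 2   # half the gap
        moved = 0
        # move the hotspot's smallest rows until ~half the gap shifts
        for row in sorted(pe_rows[hot]):
            if moved + row > target:
                break
            pe_rows[hot].remove(row)
            pe_rows[cold].append(row)
            moved += row
    return history

pes = [[9, 7, 5, 3], [1, 1], [2, 2], [1, 1]]
history = rebalance(pes, 3)    # max load shrinks round over round
```

The maximum per-PE load (and hence the round delay) drops from 24 to 10 over three rounds, illustrating why a few tuning columns suffice before the converged configuration is frozen and reused.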
5 Evaluation
We evaluate the baseline and UWB-GCN designs and compare them with the same GCN networks running on other platforms, namely CPU and GPU.
5.1 Evaluation Configuration
To evaluate UWB-GCN, we implement the RTL of both the baseline architecture and UWB-GCN in Verilog HDL. We measure PE utilization, performance, energy efficiency, and hardware resource consumption on a Xilinx Virtex UltraScale+ VCU118 FPGA board. Note that we use the FPGA only as an evaluation platform to demonstrate the performance and efficiency of UWB-GCN; our design is a general architecture that does not leverage any FPGA-specific features or units.
We allocate a counter to each PE to track the number of idle cycles for utilization measurement. The number of operation cycles (i.e., the execution delay) is measured by a cycle-accurate hardware counter, which starts when the first data is forwarded to UWB-GCN and stops when the last output feature is received. The hardware consumption and operating frequency are reported by the Vivado Design Suite 2019.1 after synthesis and implementation. The board-level power consumption is measured by a power meter. For cross-platform comparison, we implement the reference GCN networks in PyTorch and run them on a high-end server CPU (Intel Xeon E5-2698 v4) and an NVIDIA Tesla P100 GPU; PyTorch calls the cuSPARSE library for the SPMM computation. The datasets for evaluation are Cora, Citeseer, Pubmed, Nell, and Reddit, the five most widely used publicly available datasets in GCN research; their characteristics are listed in Table 1.
5.2 UWB-GCN Evaluation
We evaluate the efficiency of our design by comparing the performance, hardware resource consumption, and PE utilization of the baseline design (Base) with four design choices of UWB-GCN: (i) 1-hop local sharing (Design (A)); (ii) 2-hop local sharing (Design (B)); (iii) 1-hop local sharing + remote switching (Design (C)); (iv) 2-hop local sharing + remote switching (Design (D)). The only exception is Nell, for which we use 2-hop and 3-hop local sharing; we explain the reason later.
Figure 13 illustrates the nonzero distribution in the adjacency matrices of Citeseer, Nell, and Reddit; the distributions for Cora and Pubmed have already been shown in Figure 1. As can be seen, the distributions of nonzeros in the graph adjacency matrices are extremely unbalanced, confirming the importance of workload rebalancing.
Figure 14(A-E) compares the overall GCN inference delay and the average PE utilization of the five designs over the five datasets. The lines show overall PE utilization; the bars break down the delay cycles by GCN layer. We also mark the latency lower bound under full PE utilization. For Cora, Citeseer, and Pubmed, 1-hop and 2-hop local sharing improve PE utilization from 53%, 71%, and 69% to 83%, 83%, and 93%, respectively, with corresponding performance improvements. Enabling remote switching further raises PE utilization to 90%, 89%, and 96%, respectively, bringing additional performance gains. Our analysis shows that the remaining 4-10% utilization gap is due to PE underutilization during the auto-tuning phase, e.g., in the first iterations. For Nell, as shown in Figure 13, the nonzeros are highly clustered; one or two PEs are extremely overutilized in the baseline design, leading to only 13% overall utilization. In this case, even 2-hop local sharing is insufficient to rebalance the workload. Therefore, for the Nell dataset only, we use 2-hop and 3-hop local sharing (rather than 1-hop and 2-hop) in our evaluation. Experimental results show that 2-hop and 3-hop local sharing enhance PE utilization from 13% to 44% and 53%, bringing 3.4× and 4.3× performance improvements. With remote switching enabled, utilization further increases to 63% and 77%, leading to 5.7× and 7.2× performance gains. For Reddit, local sharing alone already raises utilization to 99% (from 92% in the baseline).
Figure 14(F-J) further breaks down the SPMM cycles of the five designs over the five datasets. The shaded areas of the bars represent the "sync" cycles due to workload imbalance (i.e., the waiting cycles at the barrier); the unshaded areas represent the "ideal" cycles assuming perfect workload balance. Bars in different colors represent the cycles of the four SPMM operations in the two-layer GCNs [16, 25] (i.e., $XW$ and $A(XW)$ of Layer 1, and $XW$ and $A(XW)$ of Layer 2). The curves show the corresponding PE utilization.
Comparing across the datasets: for Cora, Citeseer and Pubmed, the imbalance mainly occurs in the SPMM operations of the input layer, which are significantly mitigated by the rebalancing techniques of UWB-GCN. For Nell, the imbalance mainly occurs in the SPMM of the hidden layer, which is likewise diminished by UWB-GCN's rebalancing techniques. Reddit by itself is already very balanced. Comparing across the SPMM operations, utilization improves significantly for XW of Layer-1, A(XW) of Layer-1, and A(XW) of Layer-2. For XW of Layer-2, although the input feature matrix is sparse after being filtered by the ReLU activation of Layer-1, its sparsity is much lower than that of X in Layer-1, so utilization is already high in the baseline (except on Cora).
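The four SPMM operations discussed above can be sketched end-to-end for a two-layer GCN. This is a minimal SciPy sketch with random stand-in data (the sizes, densities and weight values are hypothetical, not from the evaluated datasets):

```python
import numpy as np
from scipy.sparse import identity, random as sparse_random

rng = np.random.default_rng(0)
n, f_in, f_hid, f_out = 100, 32, 16, 4  # hypothetical sizes

# Stand-ins for a real dataset: sparse normalized adjacency A (with
# self-loops) and sparse feature matrix X.
A = (sparse_random(n, n, density=0.05, random_state=0) + identity(n)).tocsr()
X = sparse_random(n, f_in, density=0.10, random_state=1).tocsr()
W1 = rng.standard_normal((f_in, f_hid))
W2 = rng.standard_normal((f_hid, f_out))

# Layer 1: two SPMM operations, then ReLU.
XW = X @ W1                  # SPMM: XW of Layer-1 (sparse X, dense W1)
H = np.maximum(A @ XW, 0.0)  # SPMM: A(XW) of Layer-1

# Layer 2: two more SPMM operations; H is sparse again after ReLU.
HW = H @ W2                  # SPMM: XW of Layer-2
Z = A @ HW                   # SPMM: A(XW) of Layer-2
print(Z.shape)
```

Note how the ReLU between the layers re-introduces sparsity into H, which is why the Layer-2 XW operation is also executed as an SPMM, albeit on a less sparse operand.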
Figure 14(K–O) compares the overall hardware resource consumption of the five designs over the five datasets. Resource cost is normalized to the number of Configurable Logic Blocks (CLBs) used in the design, the basic building blocks of an FPGA; in an ASIC design, this could instead be normalized to transistor count. The red area represents the CLB consumption of the TQs in the TDQ modules. The intuition is that a more unbalanced task distribution requires more slots in the buffering queues, so introducing the rebalancing techniques of UWB-GCN should reduce the TQ area cost. This is especially the case for Pubmed, Nell and Reddit. The green area represents all other hardware modules excluding the TQs. This part remains almost unchanged across the five datasets, meaning the area overhead of UWB-GCN's rebalancing logic is very small: only 2.7%, 4.3% and 1.9% of the whole baseline-design area for the 1-hop local-sharing, 2-hop local-sharing, and remote-switching designs, respectively (for Nell, 2-hop and 3-hop). Combining the two parts, the UWB-GCN design reduces overall hardware resource consumption compared to the baseline, largely due to the dramatically reduced per-PE TQ size under a more balanced workload (e.g., for Nell, the TQ depth for A(XW) of Layer-1 is 65128 in the baseline, which reduces to 2675 in Design (D)).
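The relationship between workload balance and TQ depth can be illustrated with a minimal sketch (the sizing rule and the Pareto load model are our own simplifications, not the paper's exact queue model):

```python
import numpy as np

rng = np.random.default_rng(1)
n_pes = 64
total = 100_000  # total nonzero MACs to distribute

# Skewed assignment (clustered nonzeros) vs. a rebalanced assignment.
skewed = rng.pareto(1.2, n_pes) + 0.1
skewed = skewed / skewed.sum() * total
balanced = np.full(n_pes, total / n_pes)

def tq_depth(per_pe_load):
    """Simplified sizing rule: each PE's task queue must at least hold its
    peak backlog, so the required depth tracks the most-loaded PE."""
    return int(np.ceil(per_pe_load.max()))

print("skewed TQ depth:  ", tq_depth(skewed))
print("balanced TQ depth:", tq_depth(balanced))
```

Because the queue must be provisioned for the worst-case PE, flattening the load distribution directly shrinks the required depth, which is the effect visible in the red areas of Figure 14(K–O).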
| Network | Freq. | Metric | Cora | Citeseer | Pubmed | Nell | Reddit |
|---|---|---|---|---|---|---|---|
| Intel Xeon E5-2698 V4 (CPU) | 2.2–3.6 GHz | Latency (ms) | 3.90 | 4.33 | 34.15 | 1.61E3 | 1.08E4 |
| | | Energy (Graph Inference/kJ) | 1.90E3 | 1.71E3 | 216.9 | 4.61 | 0.69 |
| NVIDIA Tesla P100 (GPU) | 1328–1481 MHz | Latency (ms) | 1.78 | 2.09 | 7.71 | 130.65 | 2.43E3 |
| | | Energy (Graph Inference/kJ) | 1.87E3 | 1.59E3 | 432.3 | 25.51 | 1.37 |
| EIE-like: VCU118 FPGA | 285 MHz | Latency (ms) | 0.022 | 0.024 | 0.22 | 59.1 | 56.3 |
| | | Energy (Graph Inference/kJ) | 1.19E6 | 1.11E6 | 1.20E5 | 438.2 | 452.1 |
| Baseline: VCU118 FPGA | 275 MHz | Latency (ms) | 0.023 | 0.025 | 0.23 | 61.0 | 58.9 |
| | | Energy (Graph Inference/kJ) | 1.21E6 | 1.09E6 | 1.16E5 | 433.3 | 447.0 |
| UWB-GCN: VCU118 FPGA | 275 MHz | Latency (ms) | 0.011 | 0.018 | 0.14 | 8.4 | 53.2 |
| | | Energy (Graph Inference/kJ) | 2.38E6 | 1.43E6 | 1.86E5 | 3.06E3 | 497.3 |
5.3 Scalability of UWB-GCN
We evaluate the scalability of UWB-GCN by increasing the number of PEs from 512 to 768 to 1024 for the baseline, local sharing, and local sharing plus remote switching. For Cora, Citeseer, Pubmed and Reddit, we adopt 1-hop local sharing; for Nell, we adopt 3-hop local sharing. Figure 15 shows the performance, PE utilization and hardware consumption for the four SPMM operations of the five datasets. The bars represent hardware consumption (normalized to the number of CLBs), the lines represent performance, the stars represent PE utilization, and the dotted lines mark the full-utilization (100%) upper bound.
For the baseline design, PE utilization drops as PEs are added, because partitioning over more PEs leaves fewer rows per PE and thereby accentuates the imbalance among PEs. In other words, each PE has fewer rows over which to average out inter-row imbalance. In contrast, designs with both local sharing and remote switching maintain relatively stable and high PE utilization. With local sharing only, utilization scales better than the baseline but worse than with both local sharing and remote switching. Overall, with the rebalancing techniques, the performance of UWB-GCN scales almost linearly with the number of PEs, much better than the baseline design.
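The argument that fewer rows per PE amplify imbalance can be illustrated with a small simulation (the heavy-tailed per-row nonzero counts and contiguous row partitioning are assumptions for illustration, not measurements from the datasets):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 49152  # divisible by 512, 768 and 1024
# Hypothetical heavy-tailed per-row nonzero counts (stand-in for a graph).
row_nnz = rng.pareto(1.5, n_rows) + 1.0

utils = {}
for n_pes in (512, 768, 1024):
    # Contiguous row partitioning: each PE owns n_rows / n_pes rows.
    per_pe = row_nnz.reshape(n_pes, -1).sum(axis=1)
    utils[n_pes] = per_pe.sum() / (n_pes * per_pe.max())
    print(f"{n_pes:4d} PEs: {utils[n_pes]:.0%} baseline utilization")
```

With more PEs, the block containing the heaviest rows dominates a smaller partition, so the gap between the slowest PE and the average widens and baseline utilization falls, mirroring the trend in Figure 15.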
5.4 Cross-platform Comparison
Table 3 presents the cross-platform evaluation of our UWB-GCN design. We compare the inference latency (milliseconds), energy efficiency (normalized to Graph Inference/kJ) and operating frequency (MHz) of UWB-GCN against GCN implementations on a high-end Intel CPU, a Pascal-architecture NVIDIA Tesla P100 GPU, the baseline design without workload balancing, and a reproduced EIE reference implementation [12] (tweaked for GCN processing). For UWB-GCN, we use Design (D) with 1024 PEs. As the table shows, despite running at a relatively low frequency, our design achieves substantial average speedups over the high-end CPU, the GPU, the baseline design without workload balancing, and the reference EIE design across the five GCN graph datasets, along with correspondingly better energy efficiency.
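Using the latencies reported in Table 3, the per-dataset speedups of UWB-GCN over each platform can be recomputed directly:

```python
# Latencies in milliseconds, copied from Table 3
# (columns: Cora, Citeseer, Pubmed, Nell, Reddit).
datasets = ["Cora", "Citeseer", "Pubmed", "Nell", "Reddit"]
cpu      = [3.90, 4.33, 34.15, 1.61e3, 1.08e4]
gpu      = [1.78, 2.09, 7.71, 130.65, 2.43e3]
baseline = [0.023, 0.025, 0.23, 61.0, 58.9]
uwb_gcn  = [0.011, 0.018, 0.14, 8.4, 53.2]

results = {}
for name, ref in [("CPU", cpu), ("GPU", gpu), ("Baseline", baseline)]:
    results[name] = [r / u for r, u in zip(ref, uwb_gcn)]
    pretty = ", ".join(f"{d}: {s:.1f}x"
                       for d, s in zip(datasets, results[name]))
    print(f"Speedup over {name}: {pretty}")
```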
6 Related Work
Early studies integrating graphs and neural networks are known as graph neural networks (GNNs), first proposed by Gori et al. [11] and further developed by Micheli [21] and Scarselli et al. [23]. In GNNs, the representation of a target node is inferred by iteratively propagating neighbor information through recurrent neural architectures until a stable fixed point is reached [25]. The whole process is very computation-intensive [7].
More recently, inspired by the success of convolutional neural networks (CNNs) in extracting local features from images and videos, graph convolutional networks (GCNs) have emerged as an alternative approach to graph data. In 2013, Bruna et al. [4] proposed a design for graph convolutional networks based on spectral graph theory, which was followed by a number of variants [13, 8, 16]. Since then, other types of GCNs have been proposed, including spatial-based GCNs [9, 7], graph attention networks [1], and graph generative networks [27].
To the best of our knowledge, this is the first accelerator design focusing on GCNs. There have been many efforts on accelerating sparse CNNs [15, 28, 2, 12, 22, 18]. We briefly summarize these studies and explain why their solutions fall short when applied to GCNs. Kung et al. [18] condense the sparse parameter matrix through column grouping; in case of conflict, only the most significant parameters are kept and the others are discarded, essentially trading off some accuracy for performance. Kim et al. [15] mitigate the workload imbalance problem of sparse CNNs using information from design-time profiling. Han et al. [12] propose EIE, an SPMV accelerator that forwards nonzeros to PEs in column-major order; this is similar to our baseline design with TDQ-1. However, they focus only on SPMV and do not address workload imbalance among the rows of the sparse matrix. Zhang et al. [28] rely on different indexing methods to identify and select nonzeros, but these techniques do not function well when the matrix becomes ultra-sparse, as in GCNs. Albericio et al. [2] extend the DaDianNao design [6] by enabling zero-skipping, but they also focus on sparse CNNs and do not address workload imbalance. These studies do not touch on the workload imbalance issue partially because, compared with GCNs that process graphs, its impact on sparse CNNs is much less significant. Chen et al. [5] propose Eyeriss, which, rather than skipping zeros, saves energy by power-gating computations involving zeros. Finally, Zhuo and Prasanna [29] present an SPMV design for FPGAs using the CSR format, which can be applied to various sparse matrices; however, this design still suffers from irregular sparse structures and the workload imbalance problem.
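For reference, the CSR-format SpMV that [29] builds on can be sketched as follows (a textbook scalar implementation, not their FPGA design):

```python
import numpy as np

def spmv_csr(data, indices, indptr, x):
    """Sparse matrix-vector multiply in CSR: for each row, dot the stored
    nonzeros against the gathered entries of x."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        start, end = indptr[row], indptr[row + 1]
        y[row] = np.dot(data[start:end], x[indices[start:end]])
    return y

# 3x3 example matrix [[1,0,2],[0,3,0],[4,0,5]] in CSR form.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
indices = np.array([0, 2, 1, 0, 2])
indptr = np.array([0, 2, 3, 5])
y = spmv_csr(data, indices, indptr, np.ones(3))
print(y)  # [3. 3. 9.]
```

The per-row inner loop makes the imbalance problem visible: a row with many nonzeros takes proportionally longer, so a PE assigned heavy rows stalls its peers, which is exactly the issue UWB-GCN's rebalancing targets.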
7 Conclusion
In this paper, we propose an architecture design called Ultra-Workload-Balanced-GCN (UWB-GCN) to accelerate graph convolutional network inference. To tackle the major performance issues arising from workload imbalance, we propose dynamic local workload sharing and remote workload switching techniques. These rely on hardware flexibility to realize performance autotuning with small area and delay overhead. This is the first accelerator design for GCNs that relies on hardware autotuning to achieve workload rebalancing for sparse matrix computations. Our RTL design and experiments on a Xilinx VCU118 FPGA show that our design achieves significant average speedups over a high-end CPU, a GPU, and the baseline design without workload rebalancing, across five widely used GCN graph datasets.
8 Acknowledgement
This research was supported by the DMCCFA project under PNNL's Laboratory Directed Research and Development Program; by the U.S. DOE Office of Science, Office of Advanced Scientific Computing Research, under award 66150: "CENATE: Center for Advanced Architecture Evaluation"; and by the High Performance Data Analytics (HPDA) program at PNNL.
References
[1] (2018) Watch your step: learning node embeddings via graph attention. In Advances in Neural Information Processing Systems, pp. 9180–9190.
[2] (2016) Cnvlutin: ineffectual-neuron-free deep neural network computing. ACM SIGARCH Computer Architecture News 44(3), pp. 1–13.
[3] (2017) Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263.
[4] (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
[5] (2016) Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52(1), pp. 127–138.
[6] (2014) DaDianNao: a machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622.
[7] (2018) Learning steady-states of iterative algorithms over graphs. In International Conference on Machine Learning, pp. 1114–1122.
[8] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852.
[9] (2018) Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1416–1424.
[10] (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1263–1272.
[11] (2005) A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Vol. 2, pp. 729–734.
[12] (2016) EIE: efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254.
[13] (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
[14] (1997) Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
[15] (2017) A novel zero weight/activation-aware hardware architecture of convolutional neural network. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 1462–1467.
[16] (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
[17] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[18] (2019) Packing sparse convolutional neural networks for efficient systolic array implementations: column combining under joint optimization. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 821–834.
[19] (2019) Can GCNs go as deep as CNNs?. arXiv preprint arXiv:1904.03751.
[20] (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.
[21] (2009) Neural network for graphs: a contextual constructive approach. IEEE Transactions on Neural Networks 20(3), pp. 498–511.
[22] (2017) SCNN: an accelerator for compressed-sparse convolutional neural networks. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 27–40.
[23] (2008) The graph neural network model. IEEE Transactions on Neural Networks 20(1), pp. 61–80.
[24] (2017) Graph attention networks. arXiv preprint arXiv:1710.10903.
[25] (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596.
[26] (2018) Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 974–983.
[27] (2018) GraphRNN: a deep generative model for graphs. arXiv preprint arXiv:1802.08773.
[28] (2016) Cambricon-X: an accelerator for sparse neural networks. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, p. 20.
[29] (2005) Sparse matrix-vector multiplication on FPGAs. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, pp. 63–74.