1 Introduction
Recently, the success of deep learning methods in many fields has provoked a keen interest in generalizing neural network architectures to non-Euclidean data, such as manifolds and graphs. However, traditional deep neural networks, such as convolutional neural networks (CNNs) [25] and long short-term memory (LSTM) [20], were designed for regular grid-like structures in Euclidean space and are not trivially portable to non-Euclidean data domains like graphs. Therefore, graph neural networks (GNNs) have recently emerged as a powerful approach for graph processing and achieved unparalleled performance on many classic graph processing tasks, such as citation networks [41], [45], social networks [5], and knowledge graphs [17, 46]. The success of graph neural networks has propelled the deployment of GNNs to real-world production systems. For example, Alibaba's AliGraph [54] and Euler [1] platforms leverage GNNs to analyze the e-commerce graph data of billions of users and items. Facebook also released the BigGraph platform to handle graph data on warehouse machines [26].
Unfortunately, large graph-based neural networks have gone beyond what existing deep learning frameworks and graph processing frameworks are designed for [28]. Thereby, high-performance GNN processing frameworks, such as Deep Graph Library (DGL) [31], PyTorch Geometric (PyG) [12], and NeuGraph [28], are becoming prevalent. However, due to the overhead of the massive memory parallelism, processing, and update activities incurred by large graphs, these GNN software frameworks generally adopt large computational nodes equipped with multiple GPUs or CPUs to deal with large-scale graphs, which results in high cost and energy overhead. For example, NeuGraph uses eight GPUs to handle a dataset with millions of vertices [28]. Therefore, although they gain significant performance improvement, the potential performance and energy efficiency of GNNs are still bounded by the hardware architectures assumed by these frameworks.
The past decades witnessed the rise of domain- and application-specific processors, which are considered promising alternatives to general-purpose architectures in specific domains. Intuitively, a specialized GNN architecture is a promising option to improve the efficiency of GNN processing together with the GNN frameworks. However, forging such a general and efficient architecture is a non-trivial task. By observing state-of-the-art GNN processing frameworks, we generalize the architecture of typical GNN algorithms into three key stages: the vertex feature extraction stage, the feature aggregate stage, and the graph update stage. This paradigm enables GNNs to learn and perform many different complex tasks. Thereby, an efficient GNN processor has to be customized to deal with these three stages to support diverse GNN algorithms.
Because current GNN algorithms fuse the merits of neural networks with the concept of iterative graph propagation, neither state-of-the-art neural network processors [8] nor graph processing accelerators [16] are suitable hardware for GNN processing. First of all, traditional deep learning accelerators (DLAs) are designed to support the convolution or matrix multiplication operations that extract features from regular data structures such as images or audio, but they do not support the other two critical processing stages of GNN propagation. The aggregate stage gathers neighbors' features along the edges of the graph, which requires not only the ability to process edge information, but also involves frequent but irregular scalar operations and random memory accesses induced by the random traversal of the large but sparse graph.
Meanwhile, there are also many graph processing accelerators [47, 44, 22, 21] designed for traditional graph algorithms, such as PageRank and Breadth-First Search. However, GNN algorithms can hardly be mapped to the existing graph processors. Traditional graph processors [47] are able to support the aggregate and update stages of the graph propagation model, but they generally do not work well for the feature extraction stage of GNN propagation, because they mostly support only the simple arithmetic operations required by traditional graph algorithms. Prior works on graph processors do not show the capability of processing the high-dimensional and dynamic vertex properties or the tensors of learned neural parameters in current GNN algorithms, because the dimensions of vertex properties in traditional graph algorithms are usually relatively small and invariant across the iteration process.
Table 1: Representative GNN algorithms and their feature extraction, aggregate, and update functions.

Algorithms        Feature extraction    Aggregate    Update
GCN
GraphSage-Pool
R-GCN
GatedGCN
GRN
Therefore, in order to accelerate practical GNN-based applications that process real-world high-throughput and large-scale graphs, a GNN accelerator has to resolve the obstacles that exist in real-world GNN algorithms: (1) How to tailor a unified architecture that efficiently supports the dataflow of the feature extraction, aggregate, and update stages in GNNs. It is observed that the property dimension of a vertex dynamically changes over wide ranges during the propagation of GNN layers, which leads to fluctuation of hardware utilization and on-chip data locality across the feature extraction, aggregate, and update stages. (2) Large graphs containing millions of vertices pose a significant challenge to the design of energy-efficient and compact GNN accelerators with limited on-chip memory space. Particularly, when massive graphs with millions of vertices are partitioned into sparsely-connected subgraphs, intensive random and irregular off-chip memory accesses are induced, leading to poor locality that is hard to harness in the aggregate and update stages. (3) The power-law distribution [4] creates high-degree but imbalanced connection sparsity in large real-world graphs. The accelerator must be able to deal with the imbalanced sparsity distribution, which otherwise leads to processing-element underutilization, poor locality, and redundant memory accesses in hardware.
To cope with these issues, we propose EnGN, a high-throughput and energy-efficient accelerator for large graph neural network processing. First, in EnGN, a ring-edge-reduce (RER) dataflow and the accompanying hardware architecture of RER PE arrays are designed to simultaneously conduct the vertex feature extraction, aggregate, and vertex update stages of GNNs. It is known that aggregating the properties and updating the vertices distributed in large and sparsely-connected graphs leads to poor hardware resource utilization and, more importantly, intensive memory access overhead induced by the poor data locality. However, the proposed RER PEs, connected into a ring topology, leverage the ring-allreduce dataflow to make vertex properties flow between rows of processing elements (PEs) and perform efficient update operations without randomly accessing the vertices and edges from memory.
Second, for the feature extraction stage, EnGN constructs a graph property aware (GPA) dataflow that decouples the vertex property from the hardware computing structure, which makes the mapping of GNNs to the RER array independent of the vertex dimension. Meanwhile, the property dimension of vertices on the graph can change abruptly before or after the aggregate stage, causing a significant change of computation cost across GNN layers. Thus, GPA can dynamically reorder the graph processing stages to reduce the total computation cost of a GNN model on the accelerator.
Third, considering the footprint of large-scale graphs, EnGN adopts a graph tiling strategy to process the partitioned subgraphs with a high degree of data reusability. Graph tiling aims to partition a large-scale graph into subgraphs that fit in the on-chip memory and maximize locality. The tiles are strategically scheduled in EnGN to select either a row-oriented or column-oriented processing dataflow to maximally reuse vertices between tiles and reduce the overhead caused by off-chip memory accesses.
Finally, due to the power-law distribution and sparsity characteristics of real-world graphs, the access frequency of different vertices can vary widely. For example, on the Cora citation graph, the access frequency of a high-degree vertex is 100x that of a low-degree vertex, which causes an access imbalance issue. Thus, EnGN comprises a three-level on-chip memory hierarchy, in which the L2 memory is a degree-aware vertex cache (DAVC) that locally caches the high-radix vertices that are densely connected to other vertices in the graph. DAVC saves considerable memory access cost. In summary, our main contributions are the following:

A compact but high-throughput accelerator is designed for large graph neural networks, which is implemented based on the edge-centric paradigm and supports various large-scale GNNs.

We propose a graph property aware and ring-edge-reduce (RER) dataflow that enables EnGN to handle vertices with arbitrary-dimension properties and perform high-throughput update operations. The on-chip memory hierarchy is designed to be aware of the non-uniform distribution of high-radix and low-radix graph vertices and employs specialized memory space management to enhance data locality on the chip.

We implement the EnGN accelerator in a 45 nm process and make comprehensive evaluations, comparing the performance, power, and energy of EnGN to CPU and GPU baselines empowered by the latest high-performance GNN frameworks. Experimental results show that EnGN outperforms CPU and GPU by up to 303.45x and 4.44x speedup on average, respectively. EnGN also achieves 1370.52x and 93.73x higher energy efficiency on average compared to the CPU and GPU baselines.
2 General GNN processing model
2.1 Graph neural networks
Unlike convolutional neural networks that mainly deal with Euclidean data like images and videos [28], graph neural networks (GNNs) generalize conventional neural networks to operate directly on non-Euclidean data, especially graph data such as social networks, point clouds, and chemical molecules. They have been proven supremely successful on tasks like node classification, graph classification, and link prediction. As summarized in the surveys [51, 43], GNN models can be roughly categorized into graph convolutional networks and graph recurrent networks. The representative GNNs are detailed in this section.
Graph convolutional network (GCN) generalizes the convolution operation from regular image data to non-structural graph data. It can be used for node classification [24] and the analysis of chemical molecule structures [13]. A typical GCN [24] is formulated in Equation 1:
(1)  H^(l+1) = σ( D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l) )

Note that A is the adjacency matrix of the graph, W^(l) is the weight matrix at layer l, and D̃^(-1/2) Ã D̃^(-1/2) is essentially the output of the normalized graph Laplacian [24] over Ã = A + I_N, where I_N is the identity matrix and D̃_ii = Σ_j Ã_ij.
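As a concrete reference, Equation 1 can be sketched in a few lines of NumPy. The function name and the dense-matrix representation are illustrative choices for readability, not part of [24]; a real implementation would use sparse operations.

```python
import numpy as np

def gcn_layer(A, H, W, act=np.tanh):
    """One GCN layer per Equation 1: H' = act(D~^-1/2 A~ D~^-1/2 H W).

    A: (n, n) adjacency matrix, H: (n, c) vertex properties,
    W: (c, h) layer weights."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops: A~ = A + I
    d = A_tilde.sum(axis=1)                     # degrees D~_ii
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # normalized graph Laplacian
    return act(A_hat @ H @ W)
```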
GraphSage-Pool is proposed in [18] and used for citation network analysis and protein-protein interaction tasks. Unlike the GCN models, it leverages an averaging function as the aggregation operator and has the source vertex property (h_v^(k-1)) involved when updating the output for the next iteration. The expression of GraphSage-Pool is defined in Equation 2:

(2)  h_v^(k) = σ( W^(k) · CONCAT( h_v^(k-1), mean({ h_u^(k-1), ∀u ∈ N(v) }) ) )

where CONCAT refers to the function that concatenates a vertex's property with the aggregated property of its neighbor vertices and h_v^(k) represents the destination vertex property in layer k.
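A minimal sketch of Equation 2 follows, assuming a mean aggregator over a dictionary of neighbor lists; the function and parameter names are illustrative, not the API of [18].

```python
import numpy as np

def graphsage_pool_layer(H, neighbors, Wk, act=np.tanh):
    """Equation 2: h_v^k = act(W^k . CONCAT(own property, mean of neighbors)).

    H: (n, c) properties at layer k-1; neighbors: dict vertex -> list of
    neighbor ids; Wk: (2c, h) weights acting on the concatenation."""
    n = H.shape[0]
    agg = np.stack([H[neighbors[v]].mean(axis=0) for v in range(n)])
    concat = np.concatenate([H, agg], axis=1)   # CONCAT(h_v, aggregated)
    return act(concat @ Wk)
```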
Relational graph convolutional network (R-GCN) is an extension of GCN used to handle graphs with different edge types. For instance, the edges can be used to represent different relations and have distinct weight definitions W_r [37]. Similar to GCN, the hidden representation of entities in the (l+1)-th layer of R-GCN can be formulated as Equation 3:

(3)  h_i^(l+1) = σ( Σ_{r∈R} Σ_{j∈N_i^r} (1 / c_{i,r}) W_r^(l) h_j^(l) + W_0^(l) h_i^(l) )

where N_i^r denotes the set of neighbor indices of node i under relation r and c_{i,r} is a normalization constant. c_{i,r} = |N_i^r| is used in prior entity classification work [37].
Gated graph convolutional network (GatedGCN) is proposed in [11] and utilized for community detection. It borrows the idea of gates from recurrent neural networks and constructs a propagation function that receives and processes the properties of the source vertex and destination vertex simultaneously. The propagation function is depicted in Equation 4:

(4)  h_i^(l+1) = ReLU( U^(l) h_i^(l) + Σ_{j∈N(i)} η_ij ⊙ V^(l) h_j^(l) ),  η_ij = σ( A^(l) h_i^(l) + B^(l) h_j^(l) )

where ⊙ refers to element-wise multiplication, and σ and ReLU are typical nonlinear activation functions that have been widely adopted in conventional neural networks
[25].

Graph recurrent network (GRN) is similar to a recurrent neural network (RNN), but aims to learn vertex representations [35]. GRN is mostly used in graph algorithm learning tasks, NLP tasks, and traffic forecasting. For example, [27] integrates typical RNN units (gated recurrent units, GRUs) into the propagation function, as formulated in Equation 5, to perform graph algorithm learning tasks such as shortest path and Eulerian circuits:

(5)  h_v^(t) = GRU( h_v^(t-1), Σ_{u∈N(v)} W h_u^(t-1) )
Although these GNN algorithms are different in terms of architecture and target applications, we notice that they share common computing patterns: 1) GNNs initially condense the vertex property of the source vertex with learned parameters to obtain more compact feature representations. 2) Afterwards, GNNs usually gather neighbor properties to embed the information of the graph topology into the extracted features. 3) GNNs usually leverage learned parameters to further condense the output features obtained in the aggregate stage, making GNNs capable of learning and performing more complex tasks. GNN accelerators must be able to support the computation abstractions concluded above in order to support different GNN architectures efficiently.
2.2 EnGN processing model
According to the goal of each computing pattern, the common computing patterns can be abstracted as feature extraction, aggregate, and update. The feature extraction stage condenses the property of each vertex in the graph using a neural network. Similar to conventional convolution operations, it can refine high-dimension vertex properties into low-dimension ones for more efficient feature representation. The aggregate stage embeds the graph topology by accumulating each vertex's neighbor properties generated in the feature extraction stage into its own vertex property with various arithmetic operations, such as max, min, and add, to produce unified output features. The update stage leverages learned parameters to further condense the output features obtained in the aggregate stage, then applies a nonlinear activation function or a GRU/LSTM function to each vertex of the graph before output, making GNNs capable of learning and performing more complex tasks. Note that when the aggregate stage adopts the sum operation, the learned parameters in the update stage can be computed in the feature extraction stage because of the associative law of matrix multiplication. This also provides an opportunity for EnGN to dynamically adjust the stages of matrix operations to optimize performance, which will be introduced in section 4.
On top of this abstraction, we propose a unified EnGN processing model that can cover general GNN models using the common computing functions, as shown in Algorithm 1. Suppose the graph is represented as G = (V, E), where V and E represent the set of vertices and edges in the graph, respectively, and P is the set of vertex properties of the graph. By default, the input graph is stored as a coordinate list (COO). Each edge in the graph is a tuple (src, dst, val), where val usually stands for the edge property and depends on the graph definition. The EnGN execution flow follows the neighborhood aggregation strategy, which iteratively updates the representations of vertices by aggregating the representations of their neighbors. Since all the vertices in the graph are processed in each iteration of a GNN algorithm, EnGN is organized and presented as an edge-centric processing model to ensure more efficient memory accesses [52].
For each edge, the source vertex property and the destination vertex property are condensed with the learned weights using the Feature_extraction function to obtain a temporary property v_tmp. Then v_tmp is added to the destination property using the Aggregate function. Since there may be multiple edges incident to the same destination vertex, Aggregate is essentially a reduce function. When all the destination vertices are reduced, the activation function, and sometimes a following user-defined nonlinear operator such as an LSTM or GRU layer with learnable weights, is applied through the Update function to filter the output for high-level computing tasks such as node classification and link prediction.
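The loop structure of this edge-centric model can be sketched as follows; the function and variable names are illustrative, sum stands in for the user-defined reduce operator, and a plain activation stands in for Update.

```python
import numpy as np

def engn_layer(props, edges, W, act=np.tanh):
    """Edge-centric sketch of the EnGN processing model.

    props: (n, c) vertex properties; edges: COO list of (src, dst, val)
    tuples; W: (c, h) learned weights. Feature extraction runs per vertex,
    Aggregate reduces over edges sharing a destination, and Update filters
    every destination vertex once all of its edges are processed."""
    tmp = props @ W                              # Feature_extraction
    out = np.zeros((props.shape[0], W.shape[1]))
    for src, dst, _val in edges:                 # stream the edge list
        out[dst] += tmp[src]                     # Aggregate: reduce by dst
    return act(out)                              # Update: apply per vertex
```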
To help understand the EnGN execution model, we present a vivid example of GCN [24] processed by the EnGN architecture, as shown in Figure 1. Suppose an input graph has four vertices and each vertex has a 5-dimension property. The input properties of the vertices are denoted as h0, h1, h2, h3. The Feature_extraction function takes both the vertex properties, i.e., h0, h1, h2, h3, and the associated weight matrix W as input. Then it multiplies the weight matrix with the high-dimension input vertex properties to generate low-dimension temp features. Note that the size of the weight matrix is associated with both the input property dimension and the output temp feature dimension; in this example, the size of the weight matrix is 5x3. With the Feature_extraction function, the input vertex properties are converted to 3-dimension temp features denoted as t0, t1, t2, t3. The Aggregate function receives the results of Feature_extraction and aggregates the property of each vertex's incoming neighbors. As shown in Figure 1, the temp properties of vertices 2 and 3, i.e., t2 and t3, are added to the temp property of vertex 0 because vertices 2 and 3 are incoming neighbors of vertex 0 according to the graph topology. When the aggregation processing is done, Update starts: the aggregated vertex features, i.e., a0, a1, a2, a3, are filtered using an activation function. The filtered output properties, denoted as o0, o1, o2, o3, become the input to the next GNN iteration.
3 EnGN Architecture
3.1 EnGN hardware architecture
On top of the unified EnGN processing model, we develop a customized EnGN accelerator, as shown in Figure 2. It integrates a neural graph processing unit (NGPU) to perform the Feature extraction, Aggregate, and Update operations in a unified architecture. The NGPU contains an array of homogeneous processing elements (PEs), and each PE unit contains a local register file to store temporary results and act as an intermediate for inter-PE communication. Each PE in the same column of the Ring-Edge-Reduce (RER) array is connected to its neighbors in a ring network to perform the aggregate operation, and each row of the RER array can process a vertex property, which means the NGPU can process 32 vertices simultaneously. However, such processing parallelism requires substantial memory bandwidth. Thereby, to avoid performance degradation, EnGN optimizes the memory access patterns for vertex data and edge data movement. For source vertex data accesses in the large graph, we adopt the graph tiling technique and ensure that source vertex fetching only induces accesses to continuous memory addresses. For random destination vertex accesses in the aggregate and update stages, EnGN leverages a hashed edge data layout and a multi-level cache method to avoid write conflicts and improve the data hit rate in the compact on-chip buffers. During processing, to generate the PE control signals, the PE controller of the NGPU reads the edge list of the graph from the edge banks and parses it into a bitstream that controls the PE array to perform the inter-row aggregate operation (① in Figure 2). In addition, as shown in Figure 2, each PE in the NGPU is attached to an XPE that performs activation functions, bias operations, and rounding operations in the update stage. There is also a vector processing unit (VPU) on the chip, which is used to deal with the different feature extraction, aggregate, and update functions of the GNNs illustrated in Table 1.
3.1.1 The RER PE array
The feature extraction stage maps the high-dimension properties of vertices to low dimensions using the learned weight matrix, and this stage is simply a matrix multiplication. Thus, EnGN modifies the classical systolic array architecture to perform the feature extraction stage in the NGPU. As shown in Figure 3, in order to handle the arbitrary-dimension properties of GNN algorithms, we propose the graph property aware (GPA) dataflow to decouple the input property of the vertex from the hardware computing structure. In this manner, each PE in the same column of the PE array is responsible for a single dimension of the vertex property and each PE row handles a single vertex. The properties of a vertex are arranged in columns and aligned in the property bank. The dimensions of the input vertex property become independent of the hardware architecture and can be continuously injected into the PE array regardless of the array size and the property dimension. However, as shown in Figure 3, each column is responsible for generating one dimension of the output property, while the dimension of the output property is determined by the GNN model. Thereby, when the column size of the weight matrix is larger than that of the PE array, we split the weight matrix into partitions that match the column size of the PE array. For example, Figure 3 depicts a weight matrix with 6 columns and 5 rows while the column size of the PE array is 3, where the row size of the weight matrix depends on the 5-dimension property of the input vertices and the column size of the weight matrix is determined by the target GNN model. In this case, the weight matrix is split by column into two parts to match the column size of the PE array. After that, the two weight sub-matrices are placed in the weight banks in row order. In this manner, the processing unit can handle vertices with arbitrary-dimension properties.
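The column-wise weight split of Figure 3 can be illustrated with a short sketch; only the partition logic is shown, hardware placement is omitted, and all names are illustrative.

```python
import numpy as np

def split_weight_by_column(W, pe_cols):
    """Split an (in_dim x out_dim) weight matrix into partitions whose
    column count matches the PE-array column size, as in Figure 3."""
    return [W[:, i:i + pe_cols] for i in range(0, W.shape[1], pe_cols)]

W = np.arange(30.0).reshape(5, 6)     # 5-d input property, 6-d output feature
parts = split_weight_by_column(W, 3)  # PE array with 3 columns -> 2 passes
X = np.ones((1, 5))
# Concatenating the two partial outputs reproduces the full 6-d result.
full = np.hstack([X @ p for p in parts])
```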
3.1.2 The Ring Edge Reduce topology for PE communication
The aggregate procedure needs to collect information according to the edges. Thereby, as shown in Figure 3, each row of the PE array in the NGPU possesses a dedicated edge bank, and each PE in the same row receives the same control signal parsed from the edge list of the graph to gather the corresponding vertex properties. Meanwhile, because each PE needs to broadcast its own vertex features generated by the feature extraction stage to all other PEs in the same column, aggregating the received information simultaneously would result in a large amount of hardware resource consumption. Thereby, inspired by the ring-allreduce concept [9], we propose the ring-edge-reduce (RER) aggregate dataflow to conduct the aggregate stage inside the PE array instead of moving the data out to the buffer. As shown in Figure 3, because each column of PEs performs the same operations without any communication in between, each PE in the same column of the array is connected to its neighbors through an on-chip network with a ring topology. Each PE in the same column only communicates with its two nearest neighbors (north and south). In our design, a PE sends its information to its northern neighbor and receives the information sent from its southern neighbor for property aggregation. In this manner, a PE can select the relevant vertices to aggregate based on the control signals parsed from the edges while the information flows between the PEs.
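A cycle-level sketch of the RER aggregate dataflow for one PE column follows. The rotation schedule is an illustrative simplification, but the invariant matches the text: after n steps every PE has observed every feature exactly once and accumulated only those selected by the edges.

```python
import numpy as np

def ring_edge_reduce(feats, adj):
    """Simulate RER aggregation across one PE column.

    feats: (n, h) features after feature extraction, one per PE.
    adj[d][s] is True when source vertex s is connected to destination d.
    Each step, every PE forwards the feature it currently holds to its
    northern neighbor and selectively accumulates the one it receives."""
    n = feats.shape[0]
    acc = np.zeros_like(feats)
    hold = list(range(n))             # vertex id each PE currently holds
    for _ in range(n):
        for pe in range(n):
            s = hold[pe]
            if adj[pe][s]:            # edge (s -> pe) exists: aggregate
                acc[pe] += feats[s]
        hold = hold[1:] + hold[:1]    # ring rotation, no buffer traffic
    return acc
```

The result matches a dense aggregation `adj @ feats`, but no PE ever randomly addresses the vertex buffer.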
3.1.3 The On-chip Memory Hierarchy
PE register file The register files (RFs) equipped in the PEs are divided into four groups: the source vertex group (SRC RF), the destination vertex group (DST RF), and two shadow groups (Shadow RFs), as depicted in Figure 4. The SRC RF stores the source vertex values generated in the feature extraction stage. The DST RF stores the destination vertex features updated during the aggregate and update stages. In addition, there are two programmer-invisible Shadow RFs holding the SRC and DST vertex values previously generated by the PEs of the same column, which work together with the DST and SRC RFs in a ping-pong manner.
Multiple-level caches Real-world graphs have up to millions of vertices. Although the graph tiling technique adopted by EnGN helps fit the subgraphs into the on-chip buffer, the set of vertices in a subgraph can still outsize the register files of the PE array. Meanwhile, the result banks are used to store the temporary aggregate results, and frequent PE accesses to the long-latency result banks would result in performance degradation. Thus, to improve the throughput of EnGN, as shown in Figure 4, we insert a degree-aware vertex cache (DAVC) between the result banks and the register file of each PE. The register files, the DAVC, and the result banks are regarded as the first, second, and last level of on-chip memory, respectively. The DAVC is a direct-mapped cache with a write-back policy and a cache hit time of one cycle. The DAVC uses the destination vertex id of an edge as the line tag to determine whether an access to the vertex data hits or not. If it hits, the vertex data is directly read into the DST RF of the PE unit. Otherwise, EnGN accesses the last-level result banks. The DAVC adopts the least recently used (LRU) replacement policy [10]. In this manner, the DAVC can alleviate the overhead incurred by result bank accesses. EnGN also allocates a small space in the DAVC to cache some of the high-degree vertices, reducing the on-chip memory traffic and the memory bandwidth required by vertex data refetching under the power-law distribution of real-world graphs. Since EnGN adopts graph tiling with a deterministic strategy, the vertex accesses can be predicted in advance. Thereby, EnGN provides a prefetcher unit to prefetch vertices from off-chip memory and replace the properties of vertices residing in the result banks as soon as all edges associated with the held vertices are processed.
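A toy model of the DAVC lookup path is sketched below. The class name, sizes, the pinned high-degree region, and the `fetch` interface are illustrative simplifications (the replacement policy and write-back machinery are omitted), not EnGN's implementation.

```python
class DegreeAwareVertexCache:
    """Direct-mapped cache indexed by destination vertex id, with a small
    pinned region reserved for high-degree vertices (power-law heads)."""

    def __init__(self, num_lines, high_degree_ids):
        self.num_lines = num_lines
        self.lines = {}                               # index -> (tag, data)
        self.pinned = dict.fromkeys(high_degree_ids)  # always-resident slots

    def access(self, vid, fetch):
        """Return (data, hit); `fetch` models a last-level result-bank read."""
        if vid in self.pinned:                # degree-aware region
            if self.pinned[vid] is None:
                self.pinned[vid] = fetch(vid)
                return self.pinned[vid], False
            return self.pinned[vid], True
        idx = vid % self.num_lines            # direct-mapped index
        line = self.lines.get(idx)
        if line is not None and line[0] == vid:
            return line[1], True              # tag match: hit in one cycle
        data = fetch(vid)                     # miss: access result banks
        self.lines[idx] = (vid, data)         # simple allocate-on-miss
        return data, False
```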
Edge Bank EnGN adopts the graph tiling technique to divide the large graph into intervals and shards (subsection 4.3). The edges of each shard are divided into different banks using a modulo function on the destination vertex id and the PE array size (Edge Hash), as shown in Figure 3. For example, edges with destination vertices 1, 2, and 3 are placed into Row 1, Row 2, and Row 3 of the PE array, respectively. In this manner, each bank is only responsible for a dedicated subset of edge and vertex update operations, and heavy write conflicts caused by simultaneous updates to the same vertex in different PEs are avoided. Meanwhile, we also rearrange the order of the edges within a shard based on the flow direction of the data in the ring topology, to avoid redundant data traffic in the ring and improve the processing throughput of the edges.
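The edge-hash placement can be sketched as a modulo distribution over the PE rows; the function name is illustrative.

```python
def hash_edges_to_banks(edges, num_rows):
    """Distribute edges to per-row edge banks by destination vertex id,
    so each PE row owns a disjoint set of destinations and simultaneous
    updates to the same vertex never collide across rows."""
    banks = [[] for _ in range(num_rows)]
    for src, dst in edges:
        banks[dst % num_rows].append((src, dst))
    return banks
```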
3.2 EnGN dataflow
We present in Figure 5 an example to illustrate the execution flow of a GNN layer on the proposed EnGN hardware architecture based on the GCN model [24]. Similar to the EnGN processing model, the execution of one GNN layer on the accelerator is also divided into three stages: the feature extraction, aggregate, and update stages. The feature extraction and aggregate stages are executed in a pipelined manner, while the update stage can only be triggered when the aggregate stage is completed. Without loss of generality, as shown in Figure 3 we consider a small design whose PE array has 3 columns, and the input graph is the same as in Figure 1. For brevity, we only describe the flow of vertex 0 on its PE.
Feature extraction: At cycle 0, the PE receives the first dimension of the input vertex property at layer l, i.e., h0[0], and the weight data w[0][0] to generate the partial temporary property result t0[0]. The result is stored in the local register file.
At cycle 1, the PE receives the second dimension of the input vertex property, h0[1], and the weight data w[1][0]. Both of them are used to update the temporary property result t0[0].
On the follow-up cycles, the operations are similar to cycle 1. The following new input properties h0[2], h0[3], h0[4] and the associated weight data w[2][0], w[3][0], w[4][0] are continuously fed into the PE until all five dimensions of the vertex have been covered.
At cycle 4, the result t0[0] is generated and stored into the register file, waiting to be used by the aggregate stage.
Aggregate: At cycle 5, since the edge buffer contains vertex 0, the PE uses t0[0] to generate the aggregate result a0[0]. At the same time, the PE sends t0[0] to its northern neighbor and prepares to receive the t1[0] sent from its southern neighbor.
At cycle 6, although the PE receives t1[0] from its southern neighbor, t1[0] will not be utilized to update a0[0] because vertex 1 does not exist in the corresponding edge list provided by the edge buffer, which means vertex 1 is not connected to vertex 0. Meanwhile, the PE sends t1[0] to its northern neighbor and prepares to receive t2[0].
At cycle 7, the PE receives t2[0] from its southern neighbor and leverages it to update the aggregate result a0[0], because vertex 2 is in the corresponding edge list provided by the edge buffer. Meanwhile, t2[0] is sent onward to the northern neighbor.
At cycle 8, the PE receives t3[0] and leverages it to update the aggregate result a0[0]. At this point, all vertices have been traversed by each PE. Thus, a0[0] summarizes all the relevant edge information and is stored in the register file to wait for the update stage.
Update: At cycle 9, since all edges associated with vertex 0 have been traversed, the aggregate result a0[0] will no longer change. Thereby, the update stage is triggered, and a0[0] is processed by the user-defined apply function, i.e., the activation function in GCN, to generate the result o0[0] as the input to the next layer l+1.
4 EnGN Optimization
4.1 Observations of GNN computing
To further optimize the EnGN design, we explore the characteristics of GNN algorithms and seek key observations that may guide the EnGN architecture optimization. Instead of elaborating on the separate EnGN processing functions, we illustrate the whole execution flow of a compact GNN on the EnGN accelerator. Suppose the input graph with n vertices and e edges is depicted by an adjacency matrix A_{n×n}. The vertex property of the graph is X_{n×c} with c channels, and the learned filters, i.e., the weights, are W_{c×h}, where h is the output property dimension. Then, the output of the GNN, i.e., O_{n×h}, can be represented as Equation 6:

(6)  O_{n×h} = A_{n×n} × ( X_{n×c} × W_{c×h} )
According to the formulation of GNNs, we obtain three major exploitable observations:
1. The dimension of the vertex property in GNNs generally exhibits serious variation across the iterations compared to traditional graph algorithms.
The dimension of the vertex property is determined by both the application (i.e., the number of vertices in the graph) and the architecture of the GNN model (i.e., the condensed feature dimension) according to Equation 6. Since GNN is an iterative algorithm, the output feature of the current iteration becomes the input feature of the next iteration. Thereby, the vertex property dimension mostly relies on the weight array size, i.e., the architecture of the GNN model, and it varies across the different iterations. In each iteration, the vertex property dimension may either increase or decrease after the computation.
2. The order of feature extraction processing and aggregate processing in GNNs is exchangeable when the operator in aggregate processing is sum.
When the operator used in aggregate is sum, which is widely adopted in GNN algorithms, we notice that the computation in Equation 6 can be changed to Equation 7 without affecting the result because of the associative law of matrix multiplication. Since the amount of operations under the distinct computing orders is different, we may choose the order that incurs less computation in each iteration.

(7)  O_{n×h} = ( A_{n×n} × X_{n×c} ) × W_{c×h}
3. The weight size of GNNs is independent of the size of the input graph and is usually small, while the input graphs of GNNs can be large and typically dominate the memory accesses.
According to Equation 6, the weight size of a GNN is determined by the dimensions of the input vertex property and the output feature property in each iteration. Thus, it is irrelevant to the number of vertices in the graph. In this case, the weight size can be much smaller than the graphs, which may include millions of vertices; this is also a key distinction from conventional neural networks. Input graphs therefore dominate the memory accesses, and dealing with large graphs in GNNs is critical to the accelerator performance.
4.2 Dimensionaware stage reordering
According to Observation 1 and 2, the processing order of GNN stages, the feature extraction, aggregate and update stages, will not affect the computing results, but it can change the total number of operations in a GNN. We analyze the quantity of operations when using different computing order, and aim to find the best way to schedule the stages. For feature_extraction, the number of operations i.e. multiplyaccumulate in Equation 6 and Equation 7 are the same and it is equal to . Similarly, update does not change with the computing order. Nevertheless, for aggregate, the order of GNN computing leads to different number of operations i.e. accumulation in aggregate. When Equation 6 is used, the number of operations is . When Equation 7 is chosen, the amount of operations becomes .
Since the property dimension varies as observed in the last subsection, F is generally not equal to H. To reduce the total computation, when the input vertex property dimension F is larger than the output feature dimension H, we should choose Equation 6 for GNN computing; otherwise, we should use Equation 7. Following this idea, we propose a dimension-aware stage reordering strategy based on comparing the input and output property dimensions. The strategy can be implemented by altering the instruction sequence that defines the computing order of GNNs, so it incurs no additional hardware overhead.
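The reordering rule reduces to a comparison of the two property dimensions. A minimal sketch (function names are ours; we assume the aggregate cost is proportional to the number of edges times the aggregated dimension, as described above):

```python
def aggregate_ops(num_edges, dim):
    """Accumulations in the aggregate stage: one dim-wide addition per edge."""
    return num_edges * dim

def choose_order(in_dim, out_dim):
    """Dimension-aware stage reordering: aggregate in the smaller dimension.

    If the input dimension exceeds the output dimension, run feature
    extraction first so aggregation accumulates shorter vectors;
    otherwise aggregate first over the raw input properties.
    """
    return "extract_then_aggregate" if in_dim > out_dim else "aggregate_then_extract"

# CoraFull-like case from the paper: 8710-dimensional inputs, 32 hidden units
assert choose_order(8710, 32) == "extract_then_aggregate"
assert aggregate_ops(1000, 32) < aggregate_ops(1000, 8710)
```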
4.3 Graph tiling and scheduling
According to Observation 3, a real-world graph can be very large, dominates the memory accesses in GNNs, and cannot fit into the limited on-chip memory of EnGN. To address this issue, EnGN tiles the large graph into intervals and shards using the grid-partition approach proposed in [56]. The basic idea of grid partition is to divide all vertices into disjoint intervals; the edges are then grouped into disjoint shards according to the intervals that their source and destination vertices fall into. Each shard must fit into the on-chip memory of the EnGN accelerator to ensure efficient computing without external memory accesses.
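The grid partition can be sketched in a few lines. The helper below (names and the equal-width interval assumption are ours) maps each edge of a COO edge list to the shard indexed by its source and destination intervals, following the idea of [56]:

```python
from collections import defaultdict

def grid_partition(edges, num_vertices, Q):
    """Partition a graph into a Q x Q grid of shards.

    Vertices are split into Q disjoint, equal-width intervals; edge (u, v)
    lands in the shard indexed by (interval of u, interval of v).
    """
    width = -(-num_vertices // Q)          # ceiling division: interval width
    shards = defaultdict(list)
    for u, v in edges:
        shards[(u // width, v // width)].append((u, v))
    return shards

# toy graph: 8 vertices, Q = 2 -> intervals [0..3] and [4..7]
edges = [(0, 5), (1, 2), (6, 7), (3, 6)]
shards = grid_partition(edges, num_vertices=8, Q=2)
# shard (0, 1) holds edges from interval 0 to interval 1: (0, 5) and (3, 6)
```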
With tiling, EnGN processes at the granularity of a tile. For each tile, the number of vertices is typically larger than the row size of the PE array, while each PE row can only handle a single vertex at a time according to the dataflow proposed in the prior section. The vertices are therefore processed in batches, with the batch size equal to the row size of the PE array. The batch processing of a tile is described in Figure 6. Instead of conducting feature_extraction and aggregate sequentially, we overlap them: aggregate starts as soon as a batch of vertices completes feature_extraction.
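The overlapped batch schedule can be sketched as follows (a software model of the schedule only; the function and step names are ours, not EnGN's hardware interface):

```python
def process_tile(vertices, pe_rows):
    """Schedule sketch: split a tile's vertices into PE-row-sized batches and
    overlap stages, so aggregate of batch i runs beside extraction of batch i+1."""
    batches = [vertices[i:i + pe_rows] for i in range(0, len(vertices), pe_rows)]
    if not batches:
        return []
    schedule = []
    for i, batch in enumerate(batches):
        step = {"extract": batch}
        if i > 0:
            step["aggregate"] = batches[i - 1]   # overlapped with extraction
        schedule.append(step)
    schedule.append({"aggregate": batches[-1]})  # drain the final batch
    return schedule

# 5 vertices on a 2-row PE array: 3 extraction steps plus one drain step
schedule = process_tile(list(range(5)), pe_rows=2)
```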
Although tiling ensures that EnGN processes only data that fit in the on-chip buffers, there are still data dependencies between different tiles. The order of tile execution affects data reuse and, accordingly, the amount of external memory accesses. Tile scheduling is therefore an important design option that needs to be optimized intensively.
The graph is split into a 2D array of tiles. The tiles in each row have the same source vertices, while the tiles in the same column have the same destination vertices. Intuitively, we may schedule in either row-major or column-major order. In column-major order, new source vertices must be loaded tile by tile, while the destination vertices of the same interval reside in the on-chip buffer until the whole column of tiles completes execution; then the next column of tiles can be scheduled. In row-major order, source vertex properties can be buffered until the whole row of tiles is processed. In addition, we notice that there is also shared data between neighboring columns or rows, and we propose to schedule in an S-shape, as shown in Figure 6. For example, the bottom tile of a column shares the same source vertices with the bottom tile of the next column; similar data sharing is observed in row-major order.
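The S-shaped traversal is a boustrophedon walk over the tile grid. A small sketch (function name ours) that enumerates tile coordinates so that consecutive columns are walked in opposite directions, letting the boundary tiles share an interval:

```python
def s_shape_order(Q, major="column"):
    """Enumerate tile coordinates (row, col) of a Q x Q grid in S-shape.

    Consecutive columns (or rows) are walked in opposite directions, so the
    last tile visited in one column sits in the same row as the first tile
    visited in the next column, and thus shares its source interval.
    """
    order = []
    for outer in range(Q):
        inner = range(Q) if outer % 2 == 0 else range(Q - 1, -1, -1)
        for i in inner:
            order.append((i, outer) if major == "column" else (outer, i))
    return order

# column-major S-shape on a 3x3 grid:
# (0,0),(1,0),(2,0), (2,1),(1,1),(0,1), (0,2),(1,2),(2,2)
```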
The different tile scheduling strategies mainly differ in their external memory accesses, so we quantitatively analyze the I/O cost. Suppose the source vertices of one tile are a unit and the graph is partitioned into Q intervals, i.e., a Q×Q grid of tiles. In column-major order, each column requires loading Q tiles of source vertices, so the total amount of load is Q²; when the data reuse between neighboring columns is considered, it becomes Q² − Q + 1. Since the destination vertices of each column are reused on chip, the total amount of write is Q. In row-major order, the total amount of read is roughly the same, but the amount of write is much larger: the tiles in a row generate many intermediate outputs that must be frequently swapped to external memory between tile executions, so the total amount of write is Q². Since the dimension of the vertex property also affects the I/O cost, and the input and output property dimensions usually differ according to Observation 1, we further take the property dimension into consideration; the resulting I/O cost is summarized in Table 2.
Suppose that the latencies of reading and writing external memory are equal. Comparing the overhead of the two tile scheduling strategies, we obtain the following formulation:
(8)  (Q² − Q + 1) · F + Q · H  <  Q · F + (2Q² − Q) · H  ⇔  (Q − 1) · F < 2Q · H
Based on Equation 8, it can be concluded that column-major scheduling outperforms row-major scheduling when F is smaller than roughly 2H; otherwise, row-major scheduling is preferred. Since F and H are mostly determined by the GNN model and the outcome of the comparison varies, we employ adaptive scheduling to minimize the external memory accesses. The scheduling option is explicitly encoded in the instructions, which are generated at compilation time based on the GNN model.
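The adaptive choice can be sketched as a cost model. The per-tile constants below are our reconstruction of the I/O accounting (source reads with S-shape reuse, destination writes, and partial-result swapping), chosen to be consistent with the F versus 2H rule of Equation 8:

```python
def column_major_cost(Q, F, H):
    """Reads: (Q^2 - Q + 1) source tiles of width F (S-shape reuse saves Q - 1
    loads); writes: Q destination intervals of width H."""
    return (Q * Q - Q + 1) * F + Q * H

def row_major_cost(Q, F, H):
    """Reads: Q source intervals of width F plus (Q^2 - Q) reloads of partial
    results of width H; writes: Q^2 partial-result tiles of width H."""
    return Q * F + (Q * Q - Q) * H + Q * Q * H

def choose_schedule(Q, F, H):
    """Pick the tile order with the lower modeled external-memory traffic."""
    if column_major_cost(Q, F, H) <= row_major_cost(Q, F, H):
        return "column"
    return "row"

# for large Q the comparison reduces to F versus 2H:
assert choose_schedule(100, F=3, H=32) == "column"   # F < 2H
assert choose_schedule(100, F=500, H=32) == "row"    # F > 2H
```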
Table 2: I/O cost of the two tile scheduling strategies (Q intervals; F and H are the input and output property dimensions).

                  Read Size              Write Size
Column-oriented   (Q² − Q + 1) · F       Q · H
Row-oriented      Q · F + (Q² − Q) · H   Q² · H
5 Evaluation
5.1 Experimental setup
Accelerator simulator We built a detailed cycle-accurate simulator in C++ for fast performance and functional simulation. The simulator faithfully models each hardware module of the EnGN accelerator, and the timing behavior of all modules is co-verified against the synthesized RTL design. Meanwhile, the open-source DRAM simulator DRAMSim2 [34] is used to model the cost of off-chip memory accesses.
EnGN configuration & implementation EnGN has a multi-bank 256KB property buffer, a multi-bank 512KB weight buffer, a multi-bank 256KB edge buffer, and a multi-bank 256KB result buffer. The total size of the distributed vertex cache in EnGN is set to 64KB, so it stores 1024 vertices with 32-dimension properties. To measure the chip area, power, and critical-path delay of EnGN, we synthesized the design using Synopsys Design Compiler (DC) with the TSMC 45nm process technology. Meanwhile, we conducted place-and-route using the Synopsys IC Compiler (ICC) and estimated the power cost using Synopsys PrimeTime (PT).
Evaluation environment and baselines We compare the performance and energy efficiency of EnGN with two baselines:
CPU baseline is a dual Intel Xeon E5-2630@2.20GHz system with 512GB DRAM. Since directly mapping GNN models onto deep learning frameworks results in low performance, we adopted two state-of-the-art graph neural network frameworks, DGL [31] and Pytorch Geometric (PyG), to measure the performance of GNN algorithms on the CPU, denoted as CPU-DGL and CPU-PyG.
GPU baseline is a modern GPU card (NVIDIA GTX 1080Ti, 11GB GDDR5X). Since deep learning frameworks such as Pytorch and Tensorflow are unable to fully exploit the performance of the GPU [31, 12], we again leveraged the DGL and PyG frameworks to measure the performance of GNN algorithms on the GPU, denoted as GPU-DGL and GPU-PyG.

Table 3: GNN models and datasets (the Feature/Relation column gives the vertex feature dimension for the classification datasets and the number of relation types for the RDF datasets).

Model      hidden  Dataset   Vertex    Edge        Feature/Relation  Label
GCN        32      Cora      2708      10858       1433              7
GSPool     32      PubMed    19717     88651       500               3
Gated-GCN  32      CoraFull  19793     65311       8710              67
GRN        32      Reddit    232965    114615892   602               41
RGCN       32      AIFB      8285      29043       91                4
                   MUTAG     23644     192098      47                2
                   BGS       333845    2166243     207               2
                   AM        1666764   13643406    267               11
GNN models and datasets For benchmarks, we selected five GNN models covering two representative workloads: semi-supervised classification and knowledge-graph entity classification. The GNN models and datasets are listed in Table 3, where the feature and label columns represent the dimension of one vertex and the number of labeled classes, respectively, and the relation column in RDF format [33] denotes different types of edges. For the semi-supervised classification workload, we leverage four GNN models with different computation patterns: GCN [24], GraphSage-Pool (GSPool) [18], Gated-GCN [11], and GRN [48]. All of them have 2 layers with 32 hidden units in our experiments. These four models are applied to four popular datasets of different scales: Cora (CA) [38], PubMed (PB) [38], an extended version of Cora denoted CoraFull (CF) [3], and one large online discussion forum graph, Reddit (RD) [18], to measure the processing time of the GNN models on the different baselines. For the knowledge-graph workload, we leverage one classical GNN model, RGCN [37] (2 layers, 32 hidden units), to perform entity classification on four typical knowledge-graph datasets [32]: AIFB (AB) [37], MUTAG (MG) [37], BGS (BS) [37], and AM (AM) [37].
Evaluation metric In the evaluation, we use the inference time of the graph neural network as the end-to-end latency metric and giga operations per second (GOP/s) as the throughput metric. Meanwhile, we measure the power efficiency of EnGN in GOP/s per Watt (GOP/s/W).
5.2 Experimental results
Layout and area We show the physical layout of EnGN in Figure 7. The overall area and peak power are obtained from the placed-and-routed design: EnGN operates at a clock rate of 800MHz with a worst-case peak power of 5.68W.
Performance First, we evaluated EnGN against the CPU and GPU baselines running the state-of-the-art high-performance GNN software frameworks DGL and Pytorch Geometric. Due to memory limitations, large graphs such as Reddit cannot run on the GPU. Figure 8 shows the inference time of the five GNN models on EnGN and the four baselines (CPU-DGL, CPU-PyG, GPU-DGL, and GPU-PyG) on the eight datasets. EnGN achieves better performance regardless of GNN model type and dataset size. Compared to the CPU-DGL and CPU-PyG implementations, EnGN achieves 120.48x and 494.24x performance improvements on average, respectively; the larger gap to CPU-PyG arises because PyG is optimized for the GPU instead of the CPU. Compared to the GPU baselines, disregarding the big graphs that cannot be handled on the GPU, our design gains 0.31x to 25.22x and 0.2x to 10.55x improvements over GPU-DGL and GPU-PyG, respectively. For large graphs like Reddit and AM, the memory capacity of a single GPU cannot satisfy the requirements of the GNN model; by contrast, EnGN supports large-scale graphs through its graph tiling strategy.
Throughput Figure 9 (a) shows the throughput comparison with the CPU and GPU on two citation networks, Cora and PubMed, processed with two different GNN models, GCN and GSPool. On GCN, EnGN provides a throughput of 907.8 GOP/s on Cora and 783.9 GOP/s on PubMed. CPU-DGL achieves 14.9 GOP/s and 4.97 GOP/s, which is 60x and 157x lower than EnGN; because PyG is not optimized for the CPU, CPU-PyG only reaches 1.28 GOP/s and 2.75 GOP/s. Meanwhile, GPU-DGL and GPU-PyG achieve 73.5 GOP/s and 75.8 GOP/s on Cora, still lower than EnGN by 10.9x and 10.6x. On a model with high computation requirements such as GraphSage-Pool, the throughput of EnGN reaches 1570 GOP/s and 1489 GOP/s on Cora and PubMed, respectively, whereas CPU-DGL only reaches 32 GOP/s and 9.66 GOP/s, and GPU-DGL only 154.5 GOP/s and 106.8 GOP/s. The improvement of EnGN is mostly attributed to: (1) the highly parallel vertex and edge processing supported by the RER PE array, and (2) the customized memory hierarchy and the dimension-aware stage reordering.
Power & energy estimation We also compared the energy consumption of EnGN, CPU-DGL, CPU-PyG, GPU-DGL, and GPU-PyG, estimated as the product of power consumption (in Watts) and execution time (in milliseconds). The power consumption of the CPU is taken from the thermal design power (TDP) in the official datasheet, and that of the GPU is reported by NVPROF. The peak power of EnGN is 5.68W in the worst design case, while the Xeon chip in our baseline consumes 85W on average and the GPU power is 26.41x higher than EnGN's. As shown in Figure 10, EnGN is on average 1303.03x and 1348.17x more energy efficient than CPU-DGL and CPU-PyG, respectively. EnGN also achieves better energy efficiency than GPU-DGL and GPU-PyG, which is attributed to the low parallelism of the GPU when executing GNN algorithms: the energy consumption of GPU-DGL and GPU-PyG is 111.4x and 69.5x that of EnGN on average, respectively. Although PyG is optimized for the GPU and brings higher performance than DGL, its energy efficiency is still lower than EnGN's.
EnGN also dominates the power-efficiency comparison. Figure 9 (a) shows the efficiency of EnGN compared to the CPU and GPU baselines running the software frameworks DGL and PyG. The boost brought by EnGN is twofold: 1) the massively parallel PE array optimized for vertex and edge processing lets EnGN perform high-throughput GNN analytics; 2) the RER dataflow, the hierarchical on-chip memory, and the graph tiling and scheduling policy together yield high data reusability for the massive number of vertices that are sparsely and irregularly connected in the graphs.
5.3 Effects of EnGN optimizations
The benefits of dimension-aware stage reordering As mentioned in subsection 4.2, the proposed dimension-aware stage reordering (DASR) technique reduces the total computation. Because the aggregate stage of GraphSage-Pool adopts the average operator, we only use the GCN and RGCN models to compare DASR against two fixed stage orders: (1) feature_extraction, aggregate, update (FAU), and (2) aggregate, feature_extraction, update (AFU). Figure 11 (a) illustrates that DASR reduces the number of operations in the aggregate stage by 94.85x and 1.46x on average compared to the two fixed strategies FAU and AFU, respectively. Notably, on the CoraFull dataset DASR reduces the operation count by 272x. The reason is that the input vertex property dimension is 8710 (Table 3), much higher than in the other datasets: when the feature extraction stage is performed after the aggregate stage, the high dimension incurs massive accumulation (8710 operations per vertex) in the aggregate stage. In contrast, performing feature extraction first compresses the dimension to 16, leaving only 16 accumulations per vertex in the aggregate stage.
Graph tiling scheduling This subsection explores the impact of the graph tiling scheduling strategy in EnGN. In this evaluation, we use the column-major and row-major update strategies as baselines to evaluate our scheduling strategy on the GCN model (16 hidden units) with four datasets: Cora, PubMed, CoraFull, and Reddit. Under EnGN's on-chip memory constraint, these datasets are split into 43, 78, 618, and 1214 intervals (the Q value) by the tiling strategy. Figure 11 (b) illustrates the total I/O cost induced by the EnGN scheduling strategy and by the column-oriented (Column) and row-oriented (Row) strategies. On Cora and PubMed, our graph tiling scheduling strategy reduces the total I/O cost by 17.76x and 11.49x compared to the column-major strategy. However, Cora and PubMed only contain 7 and 3 class labels, which is less than the output dimension of the first layer, so the performance of our scheduling is comparable to the row-major strategy. In contrast, CoraFull and Reddit contain 67 and 41 class labels, respectively. In this case, our scheduling reduces the total memory access cost by 139.9x and 2.89x compared to the column-major and row-major strategies on CoraFull, and by 12.72x and 2.35x over the two baselines on Reddit. The benefit stems from the fact that the column-major and row-major strategies stick to a fixed update policy, while our graph tiling scheduling can switch between row-major and column-major dataflow according to the dimension changes in the GNN.
The merits of the degree-aware vertex cache We examine the effects of caching vertices in the degree-aware vertex cache (DAVC) of EnGN. In the evaluation, we use four datasets of different scales, Cora, PubMed, CoraFull, and Reddit, to estimate the energy saving brought by DAVC with respect to three cache designs using other replacement strategies: LRU [10], LFU [39], and FIFO [15]. We also disable DAVC to show the additional energy consumed by increased accesses to the long-latency on-chip result banks, denoted as No Cache. The energy per access is reported by CACTI 7.0 [2]. Figure 12 (a) shows that DAVC reduces the number of result-bank (L3) accesses by 1.11x, 1.21x, 1.12x, and 1.98x on average compared to the LRU, LFU, FIFO, and No Cache configurations, respectively. Benefiting from this reduction, Figure 12 (b) illustrates that DAVC saves about 3.93%, 6.31%, and 4.07% of the on-chip memory energy incurred by vertex accesses compared to the LRU, LFU, and FIFO strategies on average, and 25.28% compared to No Cache. The reason is straightforward: the LRU policy captures the recency of vertices but cannot discriminate well between frequently and infrequently accessed ones, while the LFU policy captures frequency but cannot distinguish vertices accessed far in the past from recent ones. In contrast, DAVC combines an LRU policy with caching high-degree vertices, which are accessed frequently by other vertices of the graph due to their many edges and the resulting spatial locality. In this case, fetching the relevant vertices from DAVC instead of the result banks is much less expensive.
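The DAVC policy can be sketched in software as a pinned region for high-degree vertices backed by an LRU region (a behavioral model only; the class name, capacities, and pinning rule are our assumptions, not EnGN's RTL):

```python
from collections import OrderedDict

class DegreeAwareVertexCache:
    """Sketch of a degree-aware vertex cache: the highest-degree vertices are
    pinned and always resident; the remaining slots are managed by LRU."""

    def __init__(self, degrees, pinned_slots, lru_slots):
        top = sorted(degrees, key=degrees.get, reverse=True)[:pinned_slots]
        self.pinned = set(top)                # always resident
        self.lru = OrderedDict()
        self.lru_slots = lru_slots
        self.hits = self.misses = 0

    def access(self, v):
        if v in self.pinned:
            self.hits += 1
            return
        if v in self.lru:
            self.hits += 1
            self.lru.move_to_end(v)           # refresh recency
            return
        self.misses += 1                      # fetch from the result banks
        self.lru[v] = True
        if len(self.lru) > self.lru_slots:
            self.lru.popitem(last=False)      # evict least recently used

degrees = {0: 50, 1: 40, 2: 2, 3: 1, 4: 1}
cache = DegreeAwareVertexCache(degrees, pinned_slots=2, lru_slots=2)
for v in [0, 1, 0, 2, 3, 2, 4, 0]:
    cache.access(v)
# high-degree vertices 0 and 1 always hit; only low-degree vertices miss
```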
6 Related Work
6.1 GNN software frameworks
There is a large body of work on building efficient systems for graph applications on single-node machines (CPUs) [29, 50, 56, 22], distributed systems [49, 30, 44, 42], and GPUs [14, 23]. However, these graph processing frameworks target traditional algorithms and lack support for graph neural network computation. Even though TuX2 [55] aims to bridge the gap between graph processing and traditional machine learning, it still cannot support the inference and training stages of emerging GNN algorithms. Thereby, NeuGraph [28] was proposed to recast graph-specific optimizations as dataflow optimizations on top of Tensorflow. Meanwhile, Pytorch Geometric [12] provides a geometric learning library for deep learning on irregularly structured input data based on Pytorch, and the Deep Graph Library [31] provides fast implementations of GNN models based on Pytorch and MXNet. NeuGraph, Pytorch Geometric, and DGL generally run on power-hungry CPUs and GPUs, which incurs high cost and energy-efficiency issues. More importantly, GPUs suffer from under-utilization of their stream processors during parallel GNN computation because of the irregular graph data structure, which makes the energy-efficiency issues more serious. To address these issues, we build the EnGN accelerator, designed for energy-efficient processing of large graph neural networks.
6.2 Deep learning & Graph accelerator
The resurgence of deep neural networks (DNNs) and their substantial progress in various applications, including image, video, and speech, spurred a flourishing of DNN hardware accelerators [40, 36]. For example, DianNao [6] maps DNNs onto an array of multiply-add units and employs a data tiling policy to exploit the locality in the parameters. DaDianNao [8] employs a large on-chip eDRAM for both high bandwidth and data locality. EIE [19] performs inference on compressed networks and accelerates the resulting sparse matrix-vector multiplication. Eyeriss [7] is a low-power real-time DNN accelerator that exploits zero-valued neurons using run-length coding for memory compression. However, these DNN accelerators are designed for traditional networks such as convolutional neural networks and cannot handle GNNs, because they lack the graph propagation model.
The wide gap between general-purpose architectures and the unique features of graph processing has promoted the rapid development of graph-specific accelerators based on FPGAs and ASICs. For example, GraphGen [53] presented a vertex-centric framework to support various graph computations on FPGA. Graphicionado [16] presented a domain-specific accelerator for graph analytics based on a well-defined, popular vertex programming model. However, these traditional graph accelerators are designed for traditional graph algorithms and lack the computation abstractions required by neural networks, such as tensor and activation operations.
7 Conclusions
In this paper, we present EnGN, a high-throughput and energy-efficient accelerator specialized for large-graph neural network processing. To provide high-throughput processing and to handle the arbitrary dimension changes in GNN algorithms, we propose the ring-edge-reduce (RER) update dataflow, and the accompanying RER PE-array architecture is designed to simultaneously conduct high-throughput processing in the feature-extraction, aggregate, and update stages of GNNs. Meanwhile, the proposed graph tiling and scheduling techniques, cooperating with a well-designed three-level memory hierarchy, enable EnGN to process large graphs efficiently. Experimental results show that EnGN achieves 303.45x and 4.44x performance speedups while consuming 1370.52x and 93.73x less energy on average compared to CPUs and GPUs, respectively.
References
 [1] Euler: a distributed graph deep learning framework. GitHub repository, alibaba/euler.
 [2] (2017) CACTI 7: new tools for interconnect exploration in innovative off-chip memories. ACM Transactions on Architecture and Code Optimization 14 (2), pp. 14:1–14:25.
 [3] (2017) Deep Gaussian embedding of graphs: unsupervised inductive learning via ranking. arXiv:1707.03815.
 [4] (2000) Graph structure in the web. Computer Networks 33 (1–6), pp. 309–320.
 [5] (2018) FastGCN: fast learning with graph convolutional networks via importance sampling. CoRR abs/1801.10247.
 [6] (2014) DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of ASPLOS '14, pp. 269–284.
 [7] (2016) Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proceedings of ISCA '16, pp. 367–379.
 [8] (2014) DaDianNao: a machine-learning supercomputer. In Proceedings of MICRO-47, pp. 609–622.
 [9] (2019) Bandwidth reduction using importance weighted pruning on ring all-reduce. CoRR abs/1901.01544.
 [10] (1998) LRU is better than FIFO. In Proceedings of SODA '98, pp. 78–81.
 [11] (2016) Language modeling with gated convolutional networks. CoRR abs/1612.08083.
 [12] PyTorch Geometric: a geometric deep learning extension library for PyTorch. GitHub repository, rusty1s/pytorch_geometric.
 [13] (2017) Neural message passing for quantum chemistry. CoRR abs/1704.01212.
 [14] (2012) PowerGraph: distributed graph-parallel computation on natural graphs. In Proceedings of OSDI 12, pp. 17–30.
 [15] (2013) FIFO cache analysis for WCET estimation: a quantitative approach. In Proceedings of DATE 2013, pp. 296–301.
 [16] (2016) Graphicionado: a high-performance and energy-efficient accelerator for graph analytics. In Proceedings of MICRO-49, pp. 56:1–56:13.
 [17] (2017) Knowledge transfer for out-of-knowledge-base entities: a graph neural network approach. CoRR abs/1706.05674.
 [18] (2017) Inductive representation learning on large graphs. CoRR abs/1706.02216.
 [19] (2016) EIE: efficient inference engine on compressed deep neural network. CoRR abs/1602.01528.
 [20] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
 [21] On the design of an efficient hardware accelerator for large scale graph analytics. Technical report.
 [22] (2018) GraFBoost: using accelerated flash storage for external graph analytics. In Proceedings of ISCA '18, pp. 411–424.
 [23] (2014) CuSha: vertex-centric graph processing on GPUs. In Proceedings of HPDC '14, pp. 239–252.
 [24] (2016) Semi-supervised classification with graph convolutional networks. CoRR abs/1609.02907.
 [25] (2017) ImageNet classification with deep convolutional neural networks. Communications of the ACM 60 (6), pp. 84–90.
 [26] (2019) PyTorch-BigGraph: a large-scale graph embedding system. CoRR abs/1903.12287.
 [27] (2016) Gated graph sequence neural networks. CoRR abs/1511.05493.
 [28] (2019) NeuGraph: parallel deep neural network computation on large graphs. In Proceedings of USENIX ATC 19, pp. 443–458.
 [29] (2010) Pregel: a system for large-scale graph processing. In Proceedings of SIGMOD '10, pp. 135–146.
 [30] (2016) Energy efficient architecture for graph analytics accelerators. SIGARCH Computer Architecture News 44 (3), pp. 166–177.
 [31] DGL: a Python package built to ease deep learning on graphs, on top of existing DL frameworks. GitHub repository, dmlc/dgl.
 [32] (2016) A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In International Semantic Web Conference.
 [33] (2016) A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. pp. 186–194.
 [34] (2011) DRAMSim2: a cycle accurate memory system simulator. IEEE Computer Architecture Letters 10 (1), pp. 16–19.
 [35] (2018) Recent advances in recurrent neural networks. CoRR abs/1801.01078.
 [36] (2010) Hardware accelerators for biocomputing: a survey. In Proceedings of ISCAS 2010, pp. 3789–3792.
 [37] (2018) Modeling relational data with graph convolutional networks. In The Semantic Web, pp. 593–607.
 [38] (2008) Collective classification in network data. Technical report.
 [39] (2004) LFU-K: an effective buffer management replacement algorithm. In Database Systems for Advanced Applications, pp. 670–681.
 [40] (2017) Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE.
 [41] (2019) Simplifying graph convolutional networks. In Proceedings of ICML 2019, PMLR 97, pp. 6861–6871.
 [42] (2015) GraM: scaling graph computation to the trillions. In Proceedings of SoCC '15, pp. 408–421.
 [43] (2019) A comprehensive survey on graph neural networks. CoRR abs/1901.00596.
 [44] (2017) OmniGraph: a scalable hardware accelerator for graph processing. pp. 623–624.
 [45] (2018) How powerful are graph neural networks? CoRR abs/1810.00826.
 [46] (2019) Cross-lingual knowledge graph alignment via graph matching neural network. CoRR abs/1905.11605.
 [47] (2018) An efficient graph accelerator with parallel data conflict management. In Proceedings of PACT '18, pp. 8:1–8:12.
 [48] (2018) GraphRNN: a deep generative model for graphs. CoRR abs/1802.08773.
 [49] (2015) NUMA-aware graph-structured analytics. SIGPLAN Notices 50 (8), pp. 183–193.
 [50] (2014) Medusa: simplified graph processing on GPUs. IEEE Transactions on Parallel and Distributed Systems 25 (6), pp. 1543–1552.
 [51] (2018) Graph neural networks: a review of methods and applications. ArXiv abs/1812.08434.
 [52] (2017) Accelerating graph analytics on CPU-FPGA heterogeneous platform. In Proceedings of SBAC-PAD 2017, pp. 137–144.
 [53] (2015) Accelerating large-scale single-source shortest path on FPGA. In Proceedings of IPDPSW '15, pp. 129–136.
 [54] (2019) AliGraph: a comprehensive graph neural network platform. CoRR abs/1902.08730.
 [55] (2016) Gemini: a computation-centric distributed graph processing system. In Proceedings of OSDI '16, pp. 301–316.
 [56] (2015) GridGraph: large-scale graph processing on a single machine using 2-level hierarchical partitioning. In Proceedings of USENIX ATC 15, pp. 375–386.