Recently, the success of deep learning methods in many fields has provoked keen interest in generalizing neural network architectures to non-Euclidean data, such as manifolds and graphs. However, traditional deep neural networks, such as convolutional neural networks (CNNs) and long short-term memory (LSTM) networks, are designed for regular grid-like structures in Euclidean space and are not trivially portable to non-Euclidean data domains like graphs. Therefore, graph neural networks (GNNs) have recently emerged as a powerful approach for graph processing and have achieved unparalleled performance on many classic graph processing tasks, such as citation networks, social networks, and knowledge graphs [17, 46]. The success of graph neural networks has propelled the deployment of GNNs in real-world production systems. For example, Alibaba’s AliGraph and Euler platforms leverage GNNs to analyze e-commerce graph data covering billions of users and items. Facebook also released the BigGraph platform to handle graph data in warehouse machines.
Unfortunately, large graph-based neural networks have gone beyond what existing deep learning frameworks and graph processing frameworks are designed for. Consequently, high-performance GNN processing frameworks, such as Deep Graph Library (DGL), PyTorch Geometric (PyG), and NeuGraph, are becoming prevalent. However, due to the overhead of the massive memory parallelism, processing, and update activities incurred by large graphs, these GNN software frameworks generally adopt large compute nodes equipped with multiple GPUs or CPUs to deal with large-scale graphs, which results in high cost and energy overhead. For example, NeuGraph uses eight GPUs to handle a dataset with millions of vertices. Therefore, although these frameworks gain significant performance improvements, the performance and energy efficiency of GNNs are still bounded by the hardware architectures that the frameworks assume.
The past decades have witnessed the rise of domain-specific and application-specific processors, which are considered promising alternatives to general-purpose architectures in specific domains. Intuitively, a specialized GNN architecture is a promising option to improve the efficiency of GNN processing together with the GNN frameworks. However, forging such a general and efficient architecture is a non-trivial task. By observing state-of-the-art GNN processing frameworks, we generalize the architecture of typical GNN algorithms into three key stages: the vertex feature extraction stage, the feature aggregate stage, and the graph update stage. This paradigm enables GNNs to learn and perform many different complex tasks. Therefore, an efficient GNN processor has to be customized to handle these three stages in order to support diverse GNN algorithms.
Because current GNN algorithms fuse the merits of neural networks with the concept of iterative graph propagation, neither state-of-the-art neural network processors nor graph processing accelerators are suitable hardware for GNN processing. First of all, traditional deep learning accelerators (DLAs) are designed to support the convolution or matrix multiplication operations that extract features from regular data structures such as images or audio, but they do not support the other two critical processing stages of GNN propagation. The aggregate stage gathers neighbors’ features along the edges of the graph, which not only requires the ability to process edge information, but also involves frequent and irregular scalar operations and random memory accesses induced by the random traversal of a large but sparse graph.
Meanwhile, there are also many graph processing accelerators [47, 44, 22, 21] designed for traditional graph algorithms, such as PageRank and Breadth-First Search. However, GNN algorithms can hardly be mapped to the existing graph processors. Traditional graph processors are able to support the aggregate and update stages of the graph propagation model, but they generally do not work well for the feature extraction stage of GNN propagation, because they mostly support only the simple arithmetic operations required by traditional graph algorithms. Prior graph processors do not show the capability to process the high-dimensional and dynamic vertex properties or the tensors of learned neural parameters in current GNN algorithms, because the dimensions of vertex properties in traditional graph algorithms are usually relatively small and invariant across iterations.
Therefore, in order to accelerate practical GNN-based applications that process real-world high-throughput and large-scale graphs, a GNN accelerator has to resolve the obstacles that exist in real-world GNN algorithms: (1) How to tailor a unified architecture that efficiently supports the dataflow of the feature extraction, aggregate, and update stages in GNNs. The property dimension of a vertex changes dynamically over a wide range during the propagation through GNN layers, which leads to fluctuating hardware utilization and on-chip data locality in the feature extraction, aggregate, and update stages. (2) Large graphs containing millions of vertices pose a significant challenge to the design of energy-efficient and compact GNN accelerators with limited on-chip memory space. In particular, when massive graphs with millions of vertices are partitioned into sparsely connected sub-graphs, intensive random and irregular off-chip memory accesses are induced, which leads to poor locality that is hard to harness in the aggregate and update stages. (3) The power-law distribution creates high-degree but imbalanced connection sparsity in large real-world graphs. The accelerator must be able to deal with this imbalanced sparsity distribution, which leads to under-utilized processing elements, poor locality, and redundant memory accesses in hardware.
To cope with these issues, we propose EnGN, a high-throughput and energy-efficient accelerator for large graph neural network processing. First, in EnGN, a ring-edge-reduce (RER) dataflow and the accompanying hardware architecture of RER PE-arrays are designed to simultaneously conduct vertex property feature extraction, aggregation, and vertex update on GNNs. Aggregating the properties and updating the vertices distributed in large and sparsely connected graphs is known to lead to poor hardware resource utilization and, more importantly, to intensive memory access overhead induced by poor data locality. In contrast, the proposed RER PEs, connected in a ring topology, leverage the ring-all-reduce dataflow to make vertex properties flow between rows of processing elements (PEs) and perform efficient update operations without randomly accessing the vertices and edges in memory.
Second, for the feature extraction stage, EnGN constructs a graph property aware (GPA) dataflow that decouples the vertex property from the hardware computing structure, which makes the GNN mapping to the RER array independent of the vertex dimension. Meanwhile, the property dimension of the vertices can change abruptly before or after the aggregate stage, which can cause a significant change in computation cost across the GNN layers. Thus, GPA can dynamically reorder the graph processing stages to reduce the total computation cost of the GNN model on the accelerator.
Third, considering the footprint of large-scale graphs, EnGN adopts a graph tiling strategy to process the partitioned sub-graphs with a high degree of data reuse. Graph tiling partitions a large-scale graph into sub-graphs that fit in the on-chip memory and maximize locality. The tiles are scheduled in EnGN with either a row-oriented or a column-oriented processing dataflow, whichever maximizes vertex reuse between tiles and reduces the overhead caused by off-chip memory accesses.
Finally, due to the power-law distribution and sparsity characteristics of real-world graphs, the access frequency of different vertices may vary widely. For example, on the Cora citation graph, the access frequency of a high-degree vertex is 100x that of a low-degree vertex, which causes an access imbalance issue. Thus, EnGN comprises a three-level on-chip memory hierarchy, in which the L2 memory is a degree-aware vertex cache (DAVC) that locally caches the high-radix vertices that are densely connected to other vertices in the graph. DAVC considerably reduces the memory access cost. In summary, our main contributions are the following:
A compact but high-throughput accelerator is designed for large graph neural networks. It is implemented based on the edge-centric paradigm and supports various large-scale GNNs.
We propose a graph property aware dataflow and a ring-edge-reduce (RER) dataflow that enable EnGN to handle vertices with properties of arbitrary dimension and to perform high-throughput update operations. The on-chip memory hierarchy is designed to be aware of the non-uniform distribution of high-radix and low-radix graph vertices and employs specialized memory space management to enhance on-chip data locality.
We implement the EnGN accelerator in a 45nm process, conduct comprehensive evaluations, and compare the performance, power, and energy of EnGN to CPU and GPU baselines powered by the latest high-performance GNN frameworks. Experimental results show that EnGN outperforms CPU and GPU by up to 303.45x and 4.44x speedup on average, respectively. EnGN also achieves higher energy efficiency by 1370.52x and 93.73x on average compared to the CPU and GPU baselines.
2 General GNN processing model
2.1 Graph neural networks
Unlike convolutional neural networks, which mainly deal with Euclidean data like images and videos, graph neural networks (GNNs) generalize conventional neural networks to operate directly on non-Euclidean data, especially graph data such as social networks, point clouds, and chemical molecules. GNNs have proven highly successful on tasks like node classification, graph classification, and link prediction. As summarized in the surveys [51, 43], GNN models can be roughly categorized into graph convolutional networks and graph recurrent networks. The representative GNNs are detailed in this section.
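Graph convolutional network (GCN) generalizes the convolution operation from regular image data to non-structural graph data. It can be used for node classification and chemical molecule structure analysis. A typical GCN is formulated in Equation 1:
$$X^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X^{(l)} W^{(l)}\right) \quad (1)$$
Note that $A$ is the adjacency matrix of the graph, $W^{(l)}$ is the weight matrix at layer $l$, and $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ is essentially the output of the normalized graph Laplacian over $\tilde{A}$, where $\tilde{A} = A + I$, $I$ is the identity matrix, and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.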
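The GCN propagation rule can be made concrete with a small dense sketch. This is only an illustration in plain Python, not code from the EnGN work: the helper names `matmul` and `gcn_layer` are hypothetical, ReLU is assumed as the activation, and the matrices are dense lists of rows for readability.

```python
import math


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, features, weight):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 X W). ReLU is an assumption."""
    n = len(adj)
    # Add self-loops: A_tilde = A + I.
    a_tilde = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    # Degree of every vertex in A_tilde (row sums).
    deg = [sum(row) for row in a_tilde]
    # Symmetric normalization of the adjacency matrix.
    a_hat = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    out = matmul(a_hat, matmul(features, weight))
    return [[max(0.0, v) for v in row] for row in out]


# Two connected vertices with 1-d properties and a 1x1 weight matrix.
example = gcn_layer([[0, 1], [1, 0]], [[1.0], [3.0]], [[2.0]])
```

In this toy graph every entry of the normalized adjacency matrix is 0.5, so both vertices end up with the average of the two condensed features.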
GraphSage-Pool is used for citation network analysis and protein-protein interaction tasks. Unlike the GCN models, it leverages the averaging function as an aggregation operator and involves the source vertex property when updating the output in the next iteration. The expression of GraphSage-Pool is defined in Equation 2.
where refers to the function that concatenates a vertex’s property with the aggregated property of its neighbor vertices and represents the destination vertex property in layer .
Relational graph convolutional network (R-GCN) is an extension of GCN that is used to handle graphs with different edge types. For instance, the edges can represent different relations, each with a distinct weight definition. Similar to GCN, the hidden representation of entities in the $(l+1)$-th layer in R-GCN can be formulated in Equation 3:
$$h_i^{(l+1)} = \sigma\left(\sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}\right) \quad (3)$$
where $N_i^r$ denotes the set of neighbor indices of node $i$ under relation $r \in R$ and $c_{i,r}$ is a normalization constant. $c_{i,r} = |N_i^r|$ is used in prior entity classification work.
Gated graph convolutional network (Gated-GCN) is utilized for community detection. It borrows the idea from gated recurrent neural networks and constructs a propagation function that receives and processes the properties of the source vertex and the destination vertex simultaneously. The propagation function is depicted in Equation 4.
where refers to element-wise multiplication, and
are typical nonlinear activation functions that have been widely adopted in conventional neural networks.
Graph recurrent network (GRN) is similar to a recurrent neural network (RNN), but aims to learn vertex representations. GRN is mostly used in graph algorithm learning tasks, NLP tasks, and traffic forecasting. For example, prior work integrates typical RNN units (gated recurrent units) into the propagation function, as formulated in Equation 5, to perform graph algorithm learning tasks such as shortest path and Eulerian circuits.
Although these GNN algorithms differ in architecture and target applications, we notice that they share common computing patterns. 1) GNNs initially condense the vertex property of the source vertex with learned parameters to obtain more compact feature representations. 2) Afterwards, GNNs usually gather neighbor properties to embed the information of the graph topology into the extracted features. 3) GNNs usually leverage learned parameters to further condense the output features obtained in the aggregate stage, making GNNs capable of learning and performing more complex tasks. GNN accelerators must support the computation abstractions summarized above in order to support different GNN architectures efficiently.
2.2 EnGN processing model
According to the goal of each computing pattern, the common computing patterns can be abstracted as feature extraction, aggregate, and update. The feature extraction stage condenses the property of each vertex in the graph using a neural network. Similar to conventional convolution operations, it refines high-dimensional vertex properties into low-dimensional ones for more efficient feature representation. The aggregate stage embeds the graph topology by accumulating each vertex’s neighbor properties generated in the feature extraction stage into its own vertex property, using arithmetic operations such as max, min, and add, to produce unified output features. The update stage leverages learned parameters to further condense the output features obtained in the aggregate stage, and then applies a non-linear activation function or a GRU/LSTM function to each vertex of the graph before output, making GNNs capable of learning and performing more complex tasks. Note that when the aggregate stage adopts the sum operation, the learned parameters of the update stage can be applied in the feature extraction stage instead, because matrix multiplication is associative. This also gives EnGN an opportunity to dynamically adjust the order of the matrix operations to optimize performance, which will be introduced in section 4.
On top of this abstraction, we propose a unified EnGN processing model that covers general GNN models using the common computing functions, as shown in Algorithm 1. Suppose the graph is represented as $G = (V, E)$, where $V$ and $E$ represent the sets of vertices and edges in the graph, respectively. Each vertex additionally carries a property, and these properties form the vertex property set of the graph. By default, the input graph is stored as a coordinate list (COO). Each edge in the graph is a tuple of a source vertex, a destination vertex, and a value, where the value usually stands for the edge property and depends on the graph definition. The EnGN execution flow follows the neighborhood aggregation strategy, which iteratively updates the representation of each vertex by aggregating the representations of its neighbors. Since all the vertices in the graph are processed in each iteration of a GNN algorithm, EnGN is organized and presented as an edge-centric processing model to ensure more efficient memory accesses.
For each edge, both the source vertex property and the destination vertex property are condensed with the learned weights using feature_extraction to obtain a temporary property. The temporary property is then added to the destination property using the aggregate function. Since multiple edges may be incident to the same destination vertex, aggregate is essentially a reduce function. When all the destination vertices have been reduced, the update function filters the output with the activation function, sometimes followed by a user-defined non-linear operator such as an LSTM or GRU layer with learnable weights, for high-level computing tasks such as node classification and link prediction.
To help understand the EnGN execution model, we present an example of GCN processed by the EnGN architecture, as shown in Figure 1. Suppose an input graph has four vertices and each vertex has a 5-dimension property. The feature_extraction function takes the property of each of the four vertices and the associated weight matrix as input. It multiplies the high-dimension input vertex property by the weight matrix to generate low-dimension temp features. Note that the size of the weight matrix is determined by both the input property dimension and the output temp feature dimension. In this example, the size of the weight matrix is $5 \times 3$. With the feature extraction function, the input vertex properties are converted to 3-dimension temp features, one per vertex. The aggregate function receives the results of the feature_extraction function and aggregates the properties of each vertex’s incoming neighbors. As shown in Figure 1, the temp properties of vertex 2 and vertex 3 are added to the temp property of vertex 0, because vertices 2 and 3 are incoming neighbors of vertex 0 according to the graph topology. When the aggregation processing is done, the update function starts. It filters the aggregated vertex features using an activation function. The filtered output properties become the input to the next GNN iteration.
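The edge-centric processing model and the four-vertex example can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's Algorithm 1: the function bodies, the scalar edge value, the ReLU activation, and the choice to compute the temp features once per vertex before walking the edge list are all assumptions made for brevity. Only the two edges into vertex 0 are taken from the text; the property and weight values are made up.

```python
def feature_extraction(prop, weight):
    """Condense one vertex property (vector) with the weight matrix."""
    return [sum(p * w for p, w in zip(prop, col)) for col in zip(*weight)]


def aggregate(dst_acc, temp):
    """Reduce function: element-wise sum into the destination accumulator."""
    return [a + t for a, t in zip(dst_acc, temp)]


def update(acc):
    """Filter the aggregated feature with an activation (ReLU assumed)."""
    return [max(0.0, a) for a in acc]


def gnn_layer(edges, props, weight):
    """Process one layer over a COO edge list of (src, dst, val) tuples."""
    temp = [feature_extraction(p, weight) for p in props]
    # Each vertex starts from its own temp feature, as in the vertex 0 example.
    acc = [list(t) for t in temp]
    for src, dst, val in edges:
        scaled = [val * t for t in temp[src]]  # val is an assumed scalar edge weight
        acc[dst] = aggregate(acc[dst], scaled)
    return [update(a) for a in acc]


# Four vertices with 5-d properties, a 5x3 weight matrix, and the two
# incoming edges of vertex 0 mentioned in the text (2 -> 0 and 3 -> 0).
props = [[float(i + 1)] * 5 for i in range(4)]
weight = [[1.0] * 3 for _ in range(5)]
outputs = gnn_layer([(2, 0, 1.0), (3, 0, 1.0)], props, weight)
```

Each entry of `outputs` is a 3-dimension feature, which would serve as the input property of the next layer.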
3 EnGN Architecture
3.1 EnGN hardware architecture
On top of the unified EnGN processing model, we develop a customized EnGN accelerator, as shown in Figure 2. It integrates a neural graph processing unit (NGPU) to perform the feature extraction, aggregate, and update operations in a unified architecture. The NGPU has an array of homogeneous processing elements (PEs). Each PE contains a local register file that stores temporary results and acts as an intermediary for inter-PE communication. Each PE in the same column of the Ring-Edge-Reduce (RER) array is connected to its neighbors in a ring network to perform the aggregate operation, and each row of the RER array processes one vertex property, which means the NGPU can process 32 vertices simultaneously. However, such processing parallelism requires substantial memory bandwidth. Therefore, to avoid performance degradation, EnGN optimizes the memory access patterns for moving vertex data and edge data. For source vertex accesses in a large graph, we adopt the graph tiling technique so that source vertex fetching only induces accesses to contiguous memory addresses. For random destination vertex accesses in the aggregate and update stages, EnGN leverages a hashed edge data layout and a multi-level cache to avoid write conflicts and improve the hit rate in the compact on-chip buffer. During processing, the PE controller of the NGPU generates the PE control signals: it reads the edge list of the graph from the edge banks and parses it into a bit-stream that controls the PE-array to perform the inter-row aggregate operation (① in Figure 2). In addition, as shown in Figure 2, each PE in the NGPU is paired with an XPE that performs the activation functions, bias operations, and rounding operations of the update stage. The chip also contains a vector processing unit (VPU), which handles the different feature extraction, aggregate, and update functions of GNNs listed in Table 1.
3.1.1 The RER PE array
The feature extraction stage maps the high-dimension properties of vertices to low-dimension ones using the learned weight matrix, so this stage is simply a matrix multiplication. Thus, EnGN modifies the classical systolic array architecture to perform the feature extraction stage in the NGPU. As shown in Figure 3, in order to handle the arbitrary-dimension properties of GNN algorithms, we propose the graph property aware (GPA) dataflow, which decouples the input property of the vertex from the hardware computing structure. In this manner, each PE in the same column of the PE-array is responsible for a single dimension of the vertex property, and each PE in the same row handles a single vertex. The properties of a vertex are arranged in columns and aligned in the property bank. The dimension of the input vertex property thus becomes independent of the hardware architecture, and the properties can be continuously injected into the PE-array regardless of the array size and the property dimension. However, as shown in Figure 3, each column is responsible for generating one dimension of the output property, while the dimension of the output property is determined by the GNN model. Therefore, when the number of columns in the weight matrix is larger than that of the PE-array, we split the weight matrix into partitions that match the column size of the PE-array. For example, Figure 3 depicts a weight matrix with 6 columns and 5 rows, while the column size of the PE-array is 3; the row size of the weight matrix follows from the 5-d property of the input vertex, and the column size of the weight matrix is determined by the target GNN model. In this case, the weight matrix is split by column into two parts to match the column size of the PE-array. The two weight sub-matrices are then placed in the weight banks in row order. In this manner, the processing unit can handle vertices with properties of arbitrary dimension.
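The column-wise weight splitting can be illustrated with a short Python sketch. It is a software analogy only: the function names are hypothetical, and the sketch simply checks that processing a 5x6 weight matrix in two passes on a 3-column array yields the same output property as a single full-width multiplication.

```python
def split_weight_by_columns(weight, pe_cols):
    """Split a weight matrix into column partitions no wider than the PE array."""
    total_cols = len(weight[0])
    return [[row[start:start + pe_cols] for row in weight]
            for start in range(0, total_cols, pe_cols)]


def extract(prop, weight):
    """Vector-matrix product: one output dimension per weight column."""
    return [sum(p * w for p, w in zip(prop, col)) for col in zip(*weight)]


def extract_in_passes(prop, weight, pe_cols):
    """Run feature extraction one weight partition at a time and concatenate."""
    out = []
    for part in split_weight_by_columns(weight, pe_cols):
        out.extend(extract(prop, part))
    return out


# 5 rows (input property dimension) and 6 columns (output dimension),
# mapped onto a PE array that is 3 columns wide.
weight = [[r * 6 + c for c in range(6)] for r in range(5)]
parts = split_weight_by_columns(weight, 3)
```

The input property itself is never split, which mirrors the statement that the input dimension is independent of the array size.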
3.1.2 The Ring Edge Reduce topology for PE communication
The aggregate procedure collects vertex information according to the edges. Therefore, as shown in Figure 3, each row of the PE-array in the NGPU possesses a dedicated edge bank, and each PE in the same row receives the same control signal, parsed from the edge list of the graph, to gather the corresponding vertex property. Meanwhile, because each PE needs to broadcast its own vertex features generated by the feature extraction stage to all other PEs in the same column, aggregating all the received information simultaneously would consume a large amount of hardware resources. Therefore, inspired by the ring-all-reduce concept, we propose the ring-edge-reduce (RER) aggregate dataflow, which conducts the aggregate stage inside the PE array instead of moving the data out to the buffer. As shown in Figure 3, because the PE columns perform the same operations without any communication between them, each PE in the same column of the array is connected to its neighbors through an on-chip network with a ring topology. Each PE only communicates with its two nearest neighbors in the column (north and south). In our design, a PE sends its information to its northern neighbor and receives the information sent from its southern neighbor for property aggregation. In this manner, a PE can select the relevant vertices to aggregate, based on the control signal parsed from the edges, while the information flows between the PEs.
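A single PE column under the ring-edge-reduce dataflow can be simulated with the following minimal sketch. Several details are assumptions made for illustration: the southern neighbor of row i is taken to be row i + 1 (wrapping around), each PE handles the vertex with the same index as its row, the control bit-stream is modeled as a Python set of source vertex ids per row, and a vertex contributes to itself only if its own id is in that set.

```python
def ring_edge_reduce(temp, in_neighbors):
    """Simulate one PE column.

    temp[i] is the feature value held by the PE in row i for this column.
    in_neighbors[i] is the set of source vertex ids that row i aggregates.
    """
    n = len(temp)
    held = list(temp)        # value currently held by each PE
    origin = list(range(n))  # vertex id that each held value came from
    acc = [0] * n
    for _ in range(n):
        for i in range(n):
            # The control signal decides whether the held value is relevant.
            if origin[i] in in_neighbors[i]:
                acc[i] += held[i]
        # Every PE passes its value north and receives from the south,
        # so row i now holds what row i + 1 held before.
        held = held[1:] + held[:1]
        origin = origin[1:] + origin[:1]
    return acc


# Vertex 0 aggregates itself plus vertices 2 and 3; the others only keep
# their own value (an assumed topology apart from the edges into vertex 0).
result = ring_edge_reduce([1, 10, 100, 1000], [{0, 2, 3}, {1}, {2}, {3}])
```

After as many steps as there are rows, every PE has seen every vertex feature exactly once without any random memory access, which is the property the ring topology is meant to provide.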
3.1.3 The On-chip Memory Hierarchy
PE register file The register files (RF) in the PEs are divided into four groups: a source vertex group (SRC RF), a destination vertex group (DST RF), and two shadow groups (Shadow RF), as depicted in Figure 4. The SRC RF stores the source vertex values generated in the feature extraction stage. The DST RF stores the destination vertex features updated during the aggregate and update stages. In addition, the two programmer-invisible Shadow RFs hold the SRC and DST vertex values previously generated by the PEs of the same column, and they work together with the DST and SRC RFs in a ping-pong manner.
Multiple-level caches Real-world graphs have up to millions of vertices. Although the graph tiling technique adopted by EnGN helps fit the sub-graphs into the on-chip buffer, the set of vertices in a sub-graph still exceeds the capacity of the register files of the PE array. Meanwhile, the result banks are used to store the temporary aggregate results, and frequent PE accesses to the long-latency result banks degrade performance. Thus, to improve the throughput of EnGN, as shown in Figure 4, we insert a degree-aware vertex cache (DAVC) between the result banks and the register file of each PE. The register file, DAVC, and the result banks are regarded as the first-level, second-level, and last-level on-chip memories, respectively. DAVC is a direct-mapped cache with a write-back policy and a hit time of one cycle. DAVC uses the destination vertex id of an edge as the line tag to determine whether an access to the vertex data hits in the DAVC. On a hit, the vertex data is read directly into the DST RF of the PE. Otherwise, EnGN accesses the last-level result banks. DAVC adopts the least recently used (LRU) replacement policy. In this manner, the DAVC alleviates the overhead incurred by result bank accesses. EnGN also allocates a small space in the DAVC to cache some of the high-degree vertices, which reduces the on-chip memory traffic and the memory bandwidth required to refetch vertex data under the power-law distribution of real-world graphs. Since EnGN adopts graph tiling with a deterministic strategy, vertex accesses can be predicted in advance. Therefore, EnGN provides a prefetcher unit that prefetches vertices from off-chip memory and replaces the property of a vertex residing in the result banks as soon as all edges associated with that vertex have been processed.
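The lookup path of a degree-aware vertex cache can be sketched as follows. This is a simplified software model under stated assumptions, not the DAVC hardware: the class name, the modulo index function, and the way high-degree vertices are pinned in a separate reserved region are hypothetical, and the write-back policy, LRU bookkeeping, and prefetcher are left out.

```python
class VertexCacheSketch:
    """Direct-mapped vertex cache with a reserved region for high-degree vertices."""

    def __init__(self, num_lines, high_degree_ids):
        self.num_lines = num_lines
        self.high_degree_ids = set(high_degree_ids)
        self.pinned = {}                # vertex id -> data, never evicted here
        self.tags = [None] * num_lines  # destination vertex id stored per line
        self.lines = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def read(self, vertex_id, result_bank):
        """Return the vertex data, falling back to the result bank on a miss."""
        if vertex_id in self.pinned:
            self.hits += 1
            return self.pinned[vertex_id]
        index = vertex_id % self.num_lines  # assumed index function
        if self.tags[index] == vertex_id:
            self.hits += 1
            return self.lines[index]
        # Miss: access the last-level result bank and fill the cache.
        self.misses += 1
        data = result_bank[vertex_id]
        if vertex_id in self.high_degree_ids:
            self.pinned[vertex_id] = data
        else:
            self.tags[index] = vertex_id
            self.lines[index] = data
        return data


# Vertex 0 is treated as high-degree; vertices 4 and 8 map to the same line.
bank = {0: "v0", 4: "v4", 8: "v8"}
cache = VertexCacheSketch(num_lines=4, high_degree_ids=[0])
for vid in [0, 4, 8, 0, 8, 4]:
    cache.read(vid, bank)
```

In this access sequence the pinned vertex 0 still hits after vertices 4 and 8 have fought over the same line, which is the effect the reserved high-degree space is intended to have.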
Edge Bank EnGN adopts the graph tiling technique to divide a large graph into intervals and shards (subsection 4.3). The edges of each shard are distributed to different banks using a modulo function of the destination vertex id and the PE array size (Edge Hash), as shown in Figure 3. For example, three edges with different destination hashes will be placed into Row-1, Row-2, and Row-3 of the PE array, respectively. In this manner, each bank is only responsible for a dedicated set of edges and vertex update operations, so heavy write conflicts caused by simultaneous updates to the same vertex from different PEs are avoided. Meanwhile, we also rearrange the order of the edges within a shard according to the flow direction of the data in the ring topology to avoid redundant data traffic in the ring and improve the edge processing throughput.
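The edge hashing step can be sketched as a simple modulo assignment. The function name and the zero-based row numbering are assumptions for illustration, and the reordering of edges within a shard along the ring direction is not modeled.

```python
def hash_edges_to_banks(edges, num_rows):
    """Assign each (src, dst) edge to the bank of row dst % num_rows."""
    banks = [[] for _ in range(num_rows)]
    for src, dst in edges:
        banks[dst % num_rows].append((src, dst))
    return banks


# All edges that update the same destination vertex land in the same bank,
# so two different rows never write to the same vertex.
banks = hash_edges_to_banks([(2, 0), (3, 0), (0, 1), (1, 2), (0, 5)], num_rows=4)
```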
3.2 EnGN dataflow
We present in Figure 5 an example that illustrates the execution flow of a GNN layer on the proposed EnGN hardware architecture, based on the GCN model. Similar to the EnGN processing model, the execution of one GNN layer on the accelerator is divided into three stages: feature extraction, aggregate, and update. The feature extraction and aggregate stages are executed in a pipelined manner, while the update stage can only be triggered when the aggregate stage is completed. Without loss of generality, as shown in Figure 3, we consider a small PE array, and the input graph is the same as in Figure 1. For brevity, we only describe the flow at the PE that handles vertex 0.
Feature extraction: At cycle 0, the PE receives the first dimension of the input vertex property at the current layer and the corresponding weight data, and generates the temporary property result. The result is stored in the local register file.
At cycle 1, the PE receives the second dimension of the input vertex property and the corresponding weight data. Both of them are used to update the temporary property result.
In the follow-up cycles, the operations at the PE are similar to those of cycle 1. The remaining input property dimensions and the associated weight data are continuously fed into the PE until all dimensions of the vertex property have been covered.
At cycle 4, the final temporary property result is generated and stored in the register file, waiting to be used by the aggregate stage.
Aggregate: At cycle 5, since the edge buffer contains vertex 0, the PE uses its own temporary property to generate the aggregate result. At the same time, the PE sends its temporary property to its neighbor in the ring and prepares to receive the temporary property sent from its other neighbor.
At cycle 6, although the PE receives the temporary property of vertex 1, this property is not used to update the aggregate result, because vertex 1 does not appear in the corresponding edge list provided by the edge buffer, which means vertex 1 is not connected to vertex 0. Meanwhile, the PE passes data on to its neighbor in the ring and prepares to receive the next temporary property.
At cycle 7, the PE receives the temporary property of vertex 2 and uses it to update the aggregate result, because vertex 2 is in the corresponding edge list provided by the edge buffer. Meanwhile, data continues to be passed along the ring.
At cycle 8, the PE receives the temporary property of vertex 3 and uses it to update the aggregate result. At this point, all vertices have been traversed by each PE. Thus, the aggregate result summarizes all the relevant edge information and is stored in the register file to wait for the update stage.
Update: At cycle 9, since all edges associated with vertex 0 have been traversed, the aggregate result will no longer change. Therefore, the update stage is triggered, and the aggregate result is processed by the user-defined apply function, i.e. the activation function in GCNs, to generate the result that serves as the input to the next layer.
4 EnGN Optimization
4.1 Observations of GNN computing
To further optimize the EnGN design, we explore the characteristics of GNN algorithms and seek key observations that can guide the optimization of the EnGN architecture. Instead of elaborating on the separate EnGN processing functions, we illustrate the whole execution flow of a compact GNN on the EnGN accelerator. Suppose the input graph with $N$ vertices and $M$ edges is described by an adjacency matrix $A \in \mathbb{R}^{N \times N}$. The vertex property of the graph is $X \in \mathbb{R}^{N \times F}$ with $F$ channels, and the learned filters, i.e. the weights, are $W \in \mathbb{R}^{F \times H}$, where $H$ is the output property dimension. Then the output of the GNN, i.e. $Y \in \mathbb{R}^{N \times H}$, can be represented as Equation 6:
$$Y = \sigma\left(A (X W)\right) \quad (6)$$
According to the formulation of GNNs, we obtain three major exploitable observations:
1. The dimension of the vertex property in GNNs generally varies substantially across iterations, unlike in traditional graph algorithms.
The dimension of the vertex property is determined by both the application (i.e. the number of vertices in the graph) and the architecture of the GNN model (i.e. the condensed feature dimension) according to Equation 6. Because a GNN is an iterative algorithm, the output feature of the current iteration becomes the input feature of the next iteration. Therefore, the vertex property dimension mostly relies on the weight array size, i.e. the architecture of the GNN model, and it varies across iterations. In each iteration, the vertex property dimension may either increase or decrease after the computation.
2. The order of feature extraction processing and aggregate processing in GNNs is exchangeable when the operator in aggregate processing is sum.
When the aggregate operator is sum, which is widely adopted in GNN algorithms, the computation in Equation 6 can be changed to Equation 7 without affecting the result, because matrix multiplication is associative. Since the number of operations differs between the two computing orders, we can choose the order that incurs less computation in each iteration.
$$Y = \sigma\left((A X) W\right) \quad (7)$$
Here $A$ is the adjacency matrix, $X$ the vertex property matrix, $W$ the weight matrix, and $Y$ the output.
3. The weight size of GNNs is independent of the size of the input graph and is usually small, while the input graphs in GNNs can be large and typically dominate the memory accesses.
According to Equation 6, the weight size of GNNs is determined by the dimension of the input graph vertex property and the dimension of the output feature property in each iteration. Thus, it is independent of the number of vertices in the graph. The weights can therefore be much smaller than the graphs, which may include millions of vertices, and this is also a key distinction from conventional neural networks. Input graphs dominate the memory accesses, so dealing with large graphs in GNNs is critical to accelerator performance.
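Observation 2 can be checked numerically on a toy example. The sketch below uses small integer matrices chosen arbitrarily for illustration; the names `adj`, `x`, and `w` stand for the adjacency matrix, the vertex property matrix, and the weight matrix, and the activation is omitted because it is applied after the product in either order.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


adj = [[1, 0, 1], [0, 1, 0], [1, 1, 1]]            # 3 vertices, sum aggregation
x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # 4-d input properties
w = [[1, 0], [0, 1], [1, 1], [2, -1]]              # 4x2 weight matrix

# Feature extraction first, then aggregate.
extract_first = matmul(adj, matmul(x, w))
# Aggregate first, then feature extraction.
aggregate_first = matmul(matmul(adj, x), w)
```

Both orders produce the same 3x2 result, but the aggregation touches 2-dimension features in the first order and 4-dimension properties in the second, which is where the difference in operation count comes from.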
4.2 Dimension-aware stage re-ordering
According to Observations 1 and 2, the processing order of the GNN stages (feature extraction, aggregate, and update) does not affect the computing results, but it can change the total number of operations in a GNN. We analyze the number of operations under the different computing orders and aim to find the best way to schedule the stages. For feature_extraction, the number of operations, i.e. multiply-accumulates, is the same in Equation 6 and Equation 7: it is the product of the number of vertices, the input property dimension, and the output property dimension. Similarly, the cost of update does not change with the computing order. For aggregate, however, the computing order leads to a different number of operations, i.e. accumulations. When Equation 6 is used, the number of operations is the number of edges multiplied by the output property dimension. When Equation 7 is chosen, it becomes the number of edges multiplied by the input property dimension.
While the property dimension varies as observed in last subsection, is not equal to . To reduce the total computing, when the input vertex property dimension is larger than output feature dimension , we should choose for Equation 6 GNN computing. Otherwise, we should use Equation 7 for computing. Following this idea, we propose a dimension-aware stage reordering strategy based on the input and output property dimension comparison. The dimension-aware strategy can be implemented by altering the instruction sequence that defines the computing order of GNNs, so it will not incur additional hardware overhead.
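The selection rule above can be sketched in a few lines. This is an illustrative sketch only: the cost model (one accumulation per edge per property element in a sum aggregate), the function names, and the example sizes are assumptions, not the paper's exact expressions.

```python
def aggregate_ops(num_edges, dim):
    """Accumulations in a sum aggregate under the assumed cost model:
    one accumulation per edge per property element."""
    return num_edges * dim

def choose_stage_order(in_dim, out_dim):
    """Pick the stage order that aggregates over the smaller dimension."""
    if in_dim > out_dim:
        # Shrink the property first, then aggregate the smaller features.
        return "feature_extraction->aggregate->update"
    # Otherwise aggregate the input properties first.
    return "aggregate->feature_extraction->update"

# Hypothetical layer: 10,000 edges, input dimension 1024, output dimension 16.
num_edges, in_dim, out_dim = 10_000, 1024, 16
order = choose_stage_order(in_dim, out_dim)
ops_extract_first = aggregate_ops(num_edges, out_dim)
ops_aggregate_first = aggregate_ops(num_edges, in_dim)
print(order, ops_extract_first, ops_aggregate_first)
```

Because the decision depends only on the two dimensions of a layer, it can be made once per layer when the instruction sequence is generated.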
4.3 Graph tiling and scheduling
According to Observation 3, a real-world graph can be very large, dominates the memory accesses in GNNs, and cannot fit in the limited on-chip memory of EnGN. To address this issue, EnGN tiles the large graph into intervals and shards using a grid partition approach proposed in . The basic idea of the grid partition is to divide all the vertices into disjoint intervals. Then the edges of the graph, with the source and the destination vertices each limited to one interval, are partitioned into disjoint shards. Each shard must fit in the on-chip memory of the EnGN accelerator to ensure efficient computing without external memory accesses.
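A grid partition of this kind can be sketched as follows. The sketch assumes contiguous, equal-sized vertex intervals and an edge list of `(source, destination)` pairs; the function name `grid_partition` and the toy graph are hypothetical, and the on-chip capacity check is left out.

```python
def grid_partition(num_vertices, edges, num_intervals):
    """Split vertices into contiguous intervals and edges into a Q x Q grid of shards.

    shards[i][j] holds the edges whose source lies in interval i and whose
    destination lies in interval j.
    """
    interval_size = -(-num_vertices // num_intervals)  # ceiling division
    shards = [[[] for _ in range(num_intervals)] for _ in range(num_intervals)]
    for src, dst in edges:
        shards[src // interval_size][dst // interval_size].append((src, dst))
    return interval_size, shards

# Toy graph with 6 vertices split into Q = 3 intervals: {0,1}, {2,3}, {4,5}.
edges = [(0, 1), (0, 5), (2, 3), (4, 0), (5, 4), (3, 5)]
interval_size, shards = grid_partition(6, edges, 3)
for i, row in enumerate(shards):
    for j, shard in enumerate(row):
        print(i, j, shard)
```

In practice the number of intervals would be chosen so that every shard, together with its source and destination properties, fits in the on-chip buffers.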
With the tiling, EnGN processes the graph at the granularity of a tile. For each tile, the number of vertices remains larger than the row size of the PE array, while each row of PEs can handle only a single vertex at a time according to the dataflow proposed in the prior section. The vertices are therefore processed in batches, and the batch size is equal to the row size of the PE array. The batch processing of a tile is described in Figure 6. Instead of conducting feature_extraction and aggregate sequentially, we overlap them: aggregate starts as soon as a batch of vertices completes feature_extraction.
Although tiling ensures that EnGN processes only data accommodated in the on-chip buffers, there are still data dependencies between different tiles. The order of tile execution affects the data reuse and, accordingly, the amount of external memory accesses. Tile scheduling is therefore also an important design option that needs to be carefully optimized.
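The overlap between the two stages can be illustrated with a simple step table. This is a sketch under simplifying assumptions that are not from the paper: each stage takes one time step per batch, and the function name `overlapped_schedule` is hypothetical.

```python
def overlapped_schedule(num_vertices, pe_rows):
    """List (step, feature_extraction batch, aggregate batch) tuples.

    Batch b enters aggregate one step after it finishes feature_extraction,
    so the two stages work on neighboring batches in the same step.
    """
    num_batches = -(-num_vertices // pe_rows)  # ceiling division
    steps = []
    for step in range(num_batches + 1):
        fe_batch = step if step < num_batches else None
        agg_batch = step - 1 if step >= 1 else None
        steps.append((step, fe_batch, agg_batch))
    return steps

# Hypothetical tile with 10 vertices on a PE array with 4 rows: 3 batches.
schedule = overlapped_schedule(num_vertices=10, pe_rows=4)
for step, fe_batch, agg_batch in schedule:
    print(step, fe_batch, agg_batch)
```

Under these assumptions the tile finishes in one step more than the number of batches, instead of twice the number of batches when the stages run back to back.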
The graph is split into a 2D array of tiles. The tiles in each row have the same source vertices, while the tiles in the same column have the same destination vertices. Intuitively, we may schedule in either a row manner or a column manner. In the column-major order, new source vertices must be reloaded tile by tile, while the destination vertices in the same interval reside in the on-chip buffer until the column of tiles completes execution. Then, the next column of tiles can be scheduled. In the row-major order, source vertex properties can be buffered until the whole row of tiles is processed. In addition, we notice that data are also shared between neighboring columns or rows, and we propose to schedule with an S-shape as shown in Figure 6. For example, the bottom tile of a column shares the same source vertices with the bottom tile in the next column. Similar data sharing can be observed in the row manner.
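An S-shaped traversal over the columns can be generated as in the sketch below. It assumes tiles are indexed as `(row, col)` with the row selecting the source interval and the column selecting the destination interval; the function name `s_shape_order` is hypothetical.

```python
def s_shape_order(num_intervals):
    """Column-by-column traversal that reverses direction in every other column."""
    order = []
    for col in range(num_intervals):
        rows = range(num_intervals)
        if col % 2 == 1:
            rows = reversed(rows)  # walk odd columns bottom-up
        order.extend((row, col) for row in rows)
    return order

# For a 3 x 3 grid, the last tile of each column and the first tile of the
# next column lie in the same row, i.e. they share their source interval.
for tile in s_shape_order(3):
    print(tile)
```

The row-wise variant is obtained by swapping the roles of rows and columns.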
The tile scheduling strategies mainly differ in their external memory accesses, and we quantitatively analyze the I/O cost. Suppose the amount of source vertices in each tile is a unit. For column-major order, each column of tiles requires loading tiles of source vertices and the total amount of load is . When neighboring column data reuse is considered, the amount of data to be loaded becomes . Because the destination vertices in each column can be reused, the total amount of write is . For row-major order, the amount of read is the same, but the amount of write is much larger, because the tiles in a row generate many intermediate outputs that must be frequently swapped to external memory between tile executions. The total amount of write is . Because the dimension of the vertex property also affects the I/O cost, and the dimensions of the input and output vertex properties usually differ according to Observation 1, we further take the vertex property dimension into consideration and summarize the I/O cost in Table 2.
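The read-side effect of the S-shape can be checked with a small simulation. The sketch assumes that only the most recently used source interval stays on chip and that loading one interval costs one unit; these assumptions and all names are illustrative, so the counts below are not necessarily the paper's expressions.

```python
def column_major(q):
    """Plain column-major tile order over a q x q grid of (row, col) tiles."""
    return [(row, col) for col in range(q) for row in range(q)]

def s_shape(q):
    """Column-major order with every other column traversed in reverse."""
    return [(row if col % 2 == 0 else q - 1 - row, col)
            for col in range(q) for row in range(q)]

def source_loads(order):
    """Count source-interval loads when only the last used interval is resident."""
    loads, resident = 0, None
    for row, _col in order:
        if row != resident:  # the row index selects the source interval
            loads += 1
            resident = row
    return loads

q = 4
print(source_loads(column_major(q)), source_loads(s_shape(q)))
```

Under these assumptions the plain order reloads a source interval for every tile, while the S-shape saves one load at each of the column boundaries.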
Suppose that the latencies of external memory reads and writes are equal. Comparing the overhead of the two different tile scheduling strategies, we obtain the following formulation:
Based on Equation 8, it can be concluded that column-major order scheduling outperforms row-major order scheduling when F is smaller than 2H. Otherwise, row-major order scheduling is preferred. Because F and H are mostly determined by the GNNs and the comparison varies, we employ adaptive scheduling to minimize the external memory accesses. The adaptive scheduling option is explicitly encoded in the instructions, which are generated at compilation time based on the GNN models.
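The adaptive choice reduces to a per-layer comparison that a compiler pass could evaluate. The following is a minimal sketch, assuming F and H are the two property dimensions compared in the text and are available as plain integers for each layer; the function name and the example dimensions are hypothetical.

```python
def choose_tile_schedule(f_dim, h_dim):
    """Select the tile traversal order from the comparison of F against 2H."""
    return "column-major" if f_dim < 2 * h_dim else "row-major"

# Hypothetical (F, H) pairs for the layers of a model.
layers = [(1433, 16), (16, 41)]
plan = [choose_tile_schedule(f_dim, h_dim) for f_dim, h_dim in layers]
print(plan)
```

Since the dimensions are fixed by the model, the plan can be computed once and encoded into the generated instructions.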
Table 2: I/O cost (read size and write size) of the tile scheduling strategies.
5.1 Experimental setup
Accelerator simulator. We built a cycle-accurate simulator in C++ for fast performance and functional simulation. The simulator faithfully models each hardware module of the EnGN accelerator, and the timing behavior of all modules is co-verified with the synthesized RTL design. Meanwhile, the open-source DRAM simulator DRAMSim is used to model the cost of off-chip memory accesses.
EnGN has a multi-bank 256KB property buffer, a multi-bank 512KB weight buffer, a multi-bank 256KB edge buffer, and a multi-bank 256KB result buffer. The total size of the distributed vertex cache in EnGN is set to 64KB so that it stores 1024 vertices with 32-dimensional properties. To measure the chip area, power, and critical path delay of EnGN, we synthesized the EnGN design using Synopsys Design Compiler (DC) with the TSMC 45nm process technology. Meanwhile, we conducted place-and-route using Synopsys IC Compiler (ICC) and estimated the power cost using Synopsys PrimeTime (PT).
Evaluation environment and baselines. We compare the performance and energy efficiency of EnGN with two baselines:
CPU baseline is a pair of Intel Xeon E5 processors with 512GB DRAM memory. Since directly mapping GNN models to deep learning frameworks results in low performance, we adopted two state-of-the-art graph neural network frameworks, DGL and PyTorch Geometric (PyG), to measure the performance of GNN algorithms on CPU, denoted as CPU-DGL and CPU-PyG.
GPU baseline is a modern GPU card (NVIDIA GTX 1080Ti, 11GB GDDR5X). Since deep learning frameworks such as PyTorch and TensorFlow are unable to fully exploit the performance of the GPU [31, 12], we again leveraged the DGL and PyG frameworks to measure the performance of GNN algorithms on GPU, denoted as GPU-DGL and GPU-PyG.
GNN models and Datasets. For benchmarks, we selected five different GNN models to handle two representative workloads: semi-supervised classification and knowledge graph. The GNN models and datasets are listed in Table 3, where the feature and label columns represent the dimension of one vertex and the number of labeled classes, respectively, and the relation in RDF format denotes different types of edges. For the semi-supervised classification workload, we leverage four different GNN models: GCN, GraphSage-Pool (GS-Pool), Gated-GCN, and GRN, because of their different computation patterns. All of them have 2 layers with 32 hidden units in the experiments. These four models are applied to four popular datasets of different scales: Cora (CA), PubMed (PB), an extended version of Cora denoted as Cora-Full (CF), and one large online discussion forum graph, Reddit (RD), to measure the processing time of GNN models on the different baselines. For the knowledge graph workload, we leverage one classical GNN model, R-GCN, to perform entity classification in knowledge graphs. R-GCN has 2 layers with 32 hidden units. We applied R-GCN to four typical knowledge graph datasets: AIFB (AB), MUTAG (MG), BGS (BS), and AM (AM).
Evaluation Metric. In the evaluation, we used the inference time of the graph neural network as the end-to-end latency metric and billion operations per second (GOP/s) as the throughput metric. Meanwhile, we measured the power efficiency of EnGN in GOPs per watt (GOPs/W).
5.2 Experimental results
Layout and Area. We show the physical layout of EnGN in Figure 7. The overall area of EnGN is and the peak power consumption is , operating at a clock rate of 800MHz.
Performance. First, we evaluated EnGN by comparing it with the CPU and GPU baselines running the state-of-the-art high-performance GNN software frameworks DGL and PyTorch Geometric. Figure 8 shows the inference time of five GNN models on EnGN with respect to four baselines, CPU-DGL, CPU-PyG, GPU-DGL, and GPU-PyG, on the above eight datasets. EnGN achieves better performance regardless of GNN model type and dataset size. Compared to the CPU-DGL and CPU-PyG implementations, EnGN achieves 120.48x and 494.24x performance improvements on average, respectively. The speedup over CPU-PyG is larger than that over CPU-DGL because PyG is optimized for GPU rather than CPU. Compared to the GPU-DGL and GPU-PyG implementations, and disregarding the large graphs that cannot be handled on the GPU baseline, our design gains a 0.31x to 25.22x and a 0.2x to 10.55x improvement, respectively. For large graphs like Reddit and AM, the memory capacity of a single GPU cannot satisfy the requirement of the GNN model, so these graphs cannot run on the GPU. By contrast, EnGN can support large-scale graphs using the graph tiling strategy.
Throughput. Figure 9 (a) shows the throughput comparison with CPU and GPU using two citation networks, Cora and PubMed, processed with two different GNN models, GCN and GS-Pool. On the small-scale GCN, EnGN provides a throughput of 907.8 GOP/s on the Cora dataset and 783.9 GOP/s on the PubMed dataset. CPU-DGL achieves 14.9 GOP/s and 4.97 GOP/s, which is 60x and 157x lower than EnGN. Because PyG is not optimized for CPU, CPU-PyG only reaches 1.28 GOP/s and 2.75 GOP/s. Meanwhile, GPU-DGL and GPU-PyG achieve 73.5 GOP/s and 75.8 GOP/s on the Cora dataset, respectively, which is still lower than EnGN by 10.9x and 10.6x. When running a model with a higher computation requirement such as GraphSage-Pool, the throughput of EnGN reaches 1570 GOP/s and 1489 GOP/s on the Cora and PubMed datasets, respectively, whereas CPU-DGL only reaches 32 GOP/s and 9.66 GOP/s and GPU-DGL only reaches 154.5 GOP/s and 106.8 GOP/s. The performance improvement of EnGN is mostly attributed to (1) the highly parallel vertex and edge processing supported by the RER PE array, and (2) the customized memory hierarchy and the dimension-aware stage re-ordering.
Power & Energy estimation. We also compared the energy consumption of EnGN, CPU-DGL, CPU-PyG, GPU-DGL, and GPU-PyG, estimated as the product of the power consumption (in watts) and the execution time (in milliseconds). The power consumption of the CPU is taken from the thermal design power (TDP) in the official datasheet, and that of the GPU is reported by NVPROF. The peak power of EnGN is 5.68W in the worst-design case, while the Xeon chip in our baseline consumes 85W on average and the GPU power is 26.41x higher than that of EnGN. As shown in Figure 10, EnGN is on average 1303.03x and 1348.17x more energy efficient than CPU-DGL and CPU-PyG, respectively. EnGN also achieves better energy efficiency than GPU-DGL and GPU-PyG, which is attributed to the low parallelism of the GPU when executing GNN algorithms: the energy consumption of GPU-DGL and GPU-PyG is 111.4x and 69.5x that of EnGN on average, respectively. Although PyG, being optimized for GPU, delivers higher performance than DGL on GPU, its energy efficiency is still lower than that of EnGN.
EnGN also dominates the power efficiency comparison. Figure 9 (a) shows the efficiency result of EnGN compared to the CPU and GPU baselines running the software frameworks DGL and PyG. The reason for the boost brought by EnGN is twofold: (1) the massively parallel PEs optimized for processing the vertices and edges of a graph let EnGN perform high-throughput GNN analytics, and (2) the RER dataflow, the hierarchical on-chip memory, and the graph tiling and scheduling policy contribute to high data reusability for the massive number of vertices that are sparsely and irregularly connected in the graphs.
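The energy estimate described above is a simple product, sketched below. The power values are the ones quoted in the text; the execution times are hypothetical placeholders chosen only to make the example run, and the function names are illustrative.

```python
def energy_mj(power_watt, time_ms):
    """Energy in millijoules: power in watts times execution time in milliseconds."""
    return power_watt * time_ms

def energy_ratio(base_power_w, base_time_ms, accel_power_w, accel_time_ms):
    """How many times more energy the baseline consumes than the accelerator."""
    return energy_mj(base_power_w, base_time_ms) / energy_mj(accel_power_w, accel_time_ms)

# Power figures from the text: 85 W for the CPU, 5.68 W peak for EnGN.
# The execution times (100 ms and 1 ms) are hypothetical, not measured results.
ratio = energy_ratio(85.0, 100.0, 5.68, 1.0)
print(f"{ratio:.1f}x")
```

The ratio is the product of the power ratio and the speedup, which is why a modest power advantage combined with a large speedup yields a large energy advantage.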
5.3 Effects of EnGN optimizations
The benefits of dimension-aware stage re-ordering. As mentioned in subsection 4.2, our proposed dimension-aware stage re-ordering technique can reduce the total computing cost. In this evaluation, because the aggregate stage of GraphSage-Pool adopts the average operator, we only adopted the GCN and R-GCN models to measure the performance of the dimension-aware stage re-ordering (DASR) strategy compared to two fixed stage orders: (1) feature_extraction, aggregate, and update (FAU), and (2) aggregate, feature_extraction, and update (AFU). Figure 11 (a) illustrates that the DASR strategy reduces the number of operations in the aggregate stage by 94.85x and 1.46x on average compared to the two fixed strategies FAU and AFU, respectively. Note that on the Cora-Full dataset, our DASR strategy can reduce the number of operations by 272x. The reason is that the input dimension of the vertex property is 8710 (Table 3), which is higher than in the other datasets. When the feature extraction stage is performed after the aggregate stage, the higher dimension incurs massive accumulate operations (8710 for a vertex) in the aggregate stage. In contrast, when the feature extraction stage is performed before the aggregate stage, the dimension is compressed to 16 and only 16 accumulate operations are needed for a vertex in the aggregate stage.
Graph tiling scheduling. This subsection explores the impact of the graph tiling scheduling strategy in EnGN. In this evaluation, we used the column-major and row-major update strategies as baselines to evaluate our scheduling strategy on the GCN model (16 hidden units) using four datasets: Cora, PubMed, Cora-Full, and Reddit. Under the on-chip memory constraint of EnGN, these datasets are split into 43, 78, 618, and 1214 intervals (Q value), respectively, according to the tiling strategy. Figure 11 (b) illustrates the total I/O cost induced by the EnGN scheduling strategy, the column-oriented (Column) strategy, and the row-oriented (Row) strategy. On the Cora and PubMed datasets, our graph tiling scheduling strategy reduces the total I/O cost by 17.76x and 11.49x compared to the column-major strategy. However, Cora and PubMed only contain 7 and 3 class labels, which is fewer than the output dimension of the first layer, so the performance of graph tiling scheduling is comparable to the row-major strategy. In contrast, Cora-Full and Reddit contain 67 and 41 class labels, respectively. In this case, graph tiling scheduling reduces the total memory access cost by 139.9x and 2.89x compared to the column-major and row-major strategies, respectively, on the Cora-Full dataset, and by 12.72x and 2.35x over the two baselines on the Reddit dataset. The performance benefit arises because the column-major and row-major strategies stick to a fixed policy to update the graph, while our graph tiling scheduling can adjust the update dataflow from row-major to column-major according to the dimension changes in GNNs.
The merits of Degree Aware Vertex Cache. We examine the effects of caching vertices in the degree aware vertex cache (DAVC) of EnGN. In the evaluation, we used four datasets of different scales, Cora, PubMed, Cora-Full, and Reddit, to estimate the energy saving brought by the DAVC with respect to three cache designs with other replacement strategies: LRU, LFU, and FIFO. Meanwhile, we also disable the DAVC to show the additional energy consumption due to increased accesses to the long-latency on-chip result banks, denoted as No Cache. The energy per access is reported by CACTI 7.0. Figure 12 (a) shows that DAVC reduces the number of result bank (L3) accesses by 1.11x, 1.21x, 1.12x, and 1.98x compared to the LRU, LFU, FIFO, and No Cache methods on average, respectively. Benefiting from the reduction in memory accesses, Figure 12 (b) illustrates that DAVC saves about 3.93%, 6.31%, and 4.07% of the on-chip memory energy consumption incurred by vertex accesses compared to the LRU, LFU, and FIFO strategies on average, respectively. Compared to the No Cache method, DAVC saves 25.28% of the energy consumption on average. The reasons are as follows. The LRU policy captures the recency of vertex accesses but cannot discriminate well between frequently and infrequently accessed vertices. The LFU policy considers the frequency of vertex accesses but, when recording the frequency, cannot distinguish between accesses that occurred far back in the past and more recent ones. In contrast, our proposed DAVC combines the LRU policy with a strategy of caching high-degree vertices, which are accessed frequently by other vertices in the graph due to the edge requirement and spatial locality. In this case, fetching the relevant vertices from the DAVC instead of the result banks is less expensive.
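The idea of combining high-degree caching with LRU replacement can be illustrated with a toy model. This sketch is an assumption about how such a policy might be organized, not the DAVC hardware design: the class name, the split into pinned and LRU-managed slots, and the treatment of pinned vertices as preloaded are all hypothetical.

```python
from collections import OrderedDict

class DegreeAwareCache:
    """Toy cache: the highest-degree vertices are pinned, the rest share LRU slots."""

    def __init__(self, capacity, degrees, pinned_slots):
        by_degree = sorted(degrees, key=degrees.get, reverse=True)
        self.pinned = set(by_degree[:pinned_slots])  # assumed preloaded
        self.lru = OrderedDict()
        self.lru_capacity = capacity - len(self.pinned)
        self.hits = 0
        self.misses = 0

    def access(self, vertex):
        if vertex in self.pinned:
            self.hits += 1
        elif vertex in self.lru:
            self.hits += 1
            self.lru.move_to_end(vertex)  # mark as most recently used
        else:
            self.misses += 1
            if self.lru_capacity > 0:
                if len(self.lru) >= self.lru_capacity:
                    self.lru.popitem(last=False)  # evict least recently used
                self.lru[vertex] = True

# Hypothetical star-like graph: vertex 0 has the highest degree.
degrees = {0: 5, 1: 1, 2: 1, 3: 1}
trace = [0, 1, 0, 2, 0, 3, 0, 1]

degree_aware = DegreeAwareCache(capacity=2, degrees=degrees, pinned_slots=1)
plain_lru = DegreeAwareCache(capacity=2, degrees=degrees, pinned_slots=0)
for v in trace:
    degree_aware.access(v)
    plain_lru.access(v)
print(degree_aware.hits, degree_aware.misses, plain_lru.hits, plain_lru.misses)
```

Setting `pinned_slots=0` turns the same class into a plain LRU cache, which gives a like-for-like comparison on the same access trace.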
6 Related Work
6.1 GNNs software framework
There is a large amount of work that aims at building efficient systems for graph applications on single-node machines (CPUs) [29, 50, 56, 22], distributed systems [49, 30, 44, 42], and GPUs [14, 23]. However, these graph processing frameworks target traditional graph algorithms and lack support for graph neural network computation. Even though TuX2 aims to bridge the gap between graph and traditional machine learning algorithms, it is still unable to support the inference and training stages of emerging GNN algorithms. Thereby, NeuGraph is proposed to recast graph-specific optimizations as dataflow optimizations based on TensorFlow. Meanwhile, PyTorch Geometric provides a geometric learning library for deep learning on irregularly structured input data based on PyTorch. The Deep Graph Library provides a fast implementation of GNN models based on PyTorch and MXNet.
NeuGraph, PyTorch Geometric, and DGL generally run on power-hungry CPUs and GPUs, which incurs energy-efficiency issues and high cost. More importantly, GPUs suffer from under-utilization of their stream processors during parallel GNN computation because of the irregular graph data structure, which makes the energy-efficiency issues more serious. To address these issues, we build the EnGN accelerator, which is designed for large graph neural networks and supports energy-efficient graph neural network processing.
6.2 Deep learning & Graph accelerator
The resurgence of deep neural networks (DNNs) and their substantial progress in various applications including image, video, and speech has spurred the flourishing of DNN hardware accelerators [40, 36]. For example, DianNao maps DNNs onto an array of multiply-add units and employs a data tiling policy to exploit the locality in the parameters. DaDianNao employs a larger on-chip eDRAM for both high bandwidth and data locality. EIE performs inference using a compression technique and accelerates the resulting modified sparse matrix-vector multiplication. Eyeriss is a low-power real-time DNN accelerator that exploits zero-valued neurons by using run-length coding for memory compression. However, these DNN accelerators are designed for traditional DNNs such as convolutional neural networks and cannot handle GNNs because they lack a graph propagation model on the accelerator.
The wide gap between general-purpose architectures and the unique features of graph processing has promoted the rapid development of graph processing-specific accelerators based on FPGA and ASIC. For example, GraphGen presented a vertex-centric framework to support various graph computations on FPGA. Graphicionado presented a domain-specific accelerator for graph analytics based on a well-defined, popular vertex programming model. However, traditional graph accelerators are designed for traditional graph algorithms and lack the computation abstractions required by neural networks, such as tensor and activation operations.
In this paper, we present EnGN, a high-throughput and energy-efficient accelerator specialized for large graph neural network processing. To provide high-throughput processing and handle the arbitrary dimension changes in GNN algorithms, we proposed the ring-edge-reduce (RER) update dataflow and the accompanying hardware architecture of RER PE arrays, which simultaneously conduct high-throughput processing in the feature extraction, aggregate, and update stages of GNNs. Meanwhile, the proposed graph tiling and scheduling technique, cooperating with a well-designed three-level memory hierarchy, enables EnGN to process large graphs efficiently. Experimental results show that EnGN achieves 303.45x and 4.44x performance speedup while consuming 1370.52x and 93.73x less energy on average when compared to CPUs and GPUs, respectively.
- A distributed graph deep learning framework: alibaba/euler. GitHub repository (original date 2019-01-10).
- (2017-06) CACTI 7: new tools for interconnect exploration in innovative off-chip memories. ACM Trans. Archit. Code Optim. 14 (2), pp. 14:1–14:25.
- (2017-07) Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking. arXiv e-prints, arXiv:1707.03815.
- (2000-06) Graph structure in the web. Comput. Netw. 33 (1-6), pp. 309–320.
- (2018) FastGCN: fast learning with graph convolutional networks via importance sampling. CoRR abs/1801.10247.
- (2014) DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, New York, NY, USA, pp. 269–284.
- (2016) Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, Piscataway, NJ, USA, pp. 367–379.
- (2014) DaDianNao: a machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, Washington, DC, USA, pp. 609–622.
- (2019) Bandwidth reduction using importance weighted pruning on ring allreduce. CoRR abs/1901.01544.
- (1998) LRU is better than FIFO. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’98, Philadelphia, PA, USA, pp. 78–81.
- (2016) Language modeling with gated convolutional networks. CoRR abs/1612.08083.
- Geometric deep learning extension library for PyTorch: rusty1s/pytorch_geometric. GitHub repository (original date 2017-10-06).
- (2017) Neural message passing for quantum chemistry. CoRR abs/1704.01212.
- (2012) PowerGraph: distributed graph-parallel computation on natural graphs. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), Hollywood, CA, pp. 17–30.
- (2013-03) FIFO cache analysis for WCET estimation: a quantitative approach. In 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 296–301.
- (2016) Graphicionado: a high-performance and energy-efficient accelerator for graph analytics. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-49, Piscataway, NJ, USA, pp. 56:1–56:13.
- (2017) Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach. CoRR abs/1706.05674.
- (2017) Inductive representation learning on large graphs. CoRR abs/1706.02216.
- (2016) EIE: efficient inference engine on compressed deep neural network. CoRR abs/1602.01528.
- (1997-11) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780.
- On the Design of an Efficient Hardware Accelerator for Large Scale Graph Analytics. Technical report.
- (2018) GraFboost: using accelerated flash storage for external graph analytics. In Proceedings of the 45th Annual International Symposium on Computer Architecture, ISCA ’18, Piscataway, NJ, USA, pp. 411–424.
- (2014) CuSha: vertex-centric graph processing on GPUs. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC ’14, New York, NY, USA, pp. 239–252.
- (2016) Semi-supervised classification with graph convolutional networks. CoRR abs/1609.02907.
- (2017-05) ImageNet classification with deep convolutional neural networks. Commun. ACM 60 (6), pp. 84–90.
- (2019) PyTorch-biggraph: A large-scale graph embedding system. CoRR abs/1903.12287.
- (2016) Gated graph sequence neural networks. CoRR abs/1511.05493.
- (2019-07) NeuGraph: parallel deep neural network computation on large graphs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), Renton, WA, pp. 443–458.
- (2010) Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, New York, NY, USA, pp. 135–146.
- (2016-06) Energy efficient architecture for graph analytics accelerators. SIGARCH Comput. Archit. News 44 (3), pp. 166–177.
- Python package built to ease deep learning on graph, on top of existing DL frameworks: dmlc/dgl. GitHub repository (original date 2018-04-20).
- (2016) A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In International Semantic Web Conference.
- (2016-10) A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. pp. 186–194.
- (2011-01) DRAMSim2: a cycle accurate memory system simulator. IEEE Comput. Archit. Lett. 10 (1), pp. 16–19.
- (2018) Recent advances in recurrent neural networks. CoRR abs/1801.01078.
- (2010-05) Hardware accelerators for biocomputing: a survey. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pp. 3789–3792.
- (2018) Modeling relational data with graph convolutional networks. In The Semantic Web, A. Gangemi, R. Navigli, M. Vidal, P. Hitzler, R. Troncy, L. Hollink, A. Tordai, and M. Alam (Eds.), Cham, pp. 593–607.
- (2008) Collective classification in network data. Technical report.
- (2004) LFU-k: an effective buffer management replacement algorithm. In Database Systems for Advanced Applications, Y. Lee, J. Li, K. Whang, and D. Lee (Eds.), Berlin, Heidelberg, pp. 670–681.
- (2017) Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE.
- (2019-06) Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 6861–6871.
- (2015) GraM: scaling graph computation to the trillions. In Proceedings of the Sixth ACM Symposium on Cloud Computing, SoCC ’15, New York, NY, USA, pp. 408–421.
- (2019) A comprehensive survey on graph neural networks. CoRR abs/1901.00596.
- (2017-09) OmniGraph: a scalable hardware accelerator for graph processing. pp. 623–624.
- (2018) How powerful are graph neural networks?. CoRR abs/1810.00826.
- (2019) Cross-lingual knowledge graph alignment via graph matching neural network. CoRR abs/1905.11605.
- (2018) An efficient graph accelerator with parallel data conflict management. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, PACT ’18, New York, NY, USA, pp. 8:1–8:12.
- (2018) GraphRNN: A deep generative model for graphs. CoRR abs/1802.08773.
- (2015-01) NUMA-aware graph-structured analytics. SIGPLAN Not. 50 (8), pp. 183–193.
- (2014-06) Medusa: simplified graph processing on GPUs. IEEE Transactions on Parallel and Distributed Systems 25 (6), pp. 1543–1552.
- (2018) Graph neural networks: a review of methods and applications. ArXiv abs/1812.08434.
- (2017) Accelerating graph analytics on CPU-FPGA heterogeneous platform. In 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 137–144.
- (2015) Accelerating large-scale single-source shortest path on FPGA. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, IPDPSW ’15, Washington, DC, USA, pp. 129–136.
- (2019) AliGraph: A comprehensive graph neural network platform. CoRR abs/1902.08730.
- (2016) Gemini: a computation-centric distributed graph processing system. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, Berkeley, CA, USA, pp. 301–316.
- (2015-07) GridGraph: large-scale graph processing on a single machine using 2-level hierarchical partitioning. In 2015 USENIX Annual Technical Conference (USENIX ATC 15), Santa Clara, CA, pp. 375–386.