Alleviating Datapath Conflicts and Design Centralization in Graph Analytics Acceleration

Previous graph analytics accelerators have achieved great throughput improvement by alleviating irregular off-chip memory accesses. However, on-chip datapath conflicts and design centralization have become the critical issues hindering further throughput improvement. In this paper, a general solution, the Multiple-stage Decentralized Propagation network (MDP-network), is proposed to address these issues, inspired by the key idea of trading latency for throughput. Besides, a novel High throughput Graph analytics accelerator, HiGraph, is proposed by deploying MDP-network to address each issue in practice. Experiments show that, compared with the state-of-the-art accelerator, HiGraph achieves up to 2.2x speedup (1.5x on average) as well as better scalability.


1. Introduction

Graphs exhibit powerful representation capacity for real-world data in a broad range of scenarios, and are accompanied by many graph analytics applications, such as placement (Hu and others, 2004), circuit partitioning (Selvakkumaran and Karypis, 2006), and technology mapping (Francis and others, 1990) in the EDA flow. Not surprisingly, the demand for high throughput in the execution of graph analytics workloads is ever-growing as the scale of graph data increases.

Recently, plenty of graph analytics accelerators have been designed to improve throughput by alleviating the critical performance bottleneck, i.e., irregular accesses to off-chip memory. For example, Graphicionado (Ham and others, 2016) and GraphDynS (Yan and others, 2019) leverage a large-capacity on-chip memory to buffer all vertices' property data on the chip, significantly alleviating irregular off-chip accesses. Even for large-scale graphs, graph slicing can be used to partition the graph into a set of small slices so that each slice can be processed within the limited capacity of on-chip memory (Ham and others, 2016).

Unfortunately, on-chip datapath conflicts and design centralization are becoming increasingly significant bottlenecks to further throughput improvement. To exploit the high degree of parallelism in graph analytics workloads, most accelerators adopt multiple parallel execution channels. However, due to the irregular connection pattern across vertices, interaction across execution channels is inevitable, which introduces the following two inefficiencies into previous designs. The first is Datapath Conflicts: multiple datapaths that process different vertices compete for the same accessible resource or the same dataflow channel, causing serious datapath stalls. As a result, overall performance degrades significantly due to the severe blocking of execution channels. The second is Design Centralization: the implementation becomes extremely difficult as the number of execution channels increases, because of the overly intensive interaction across all execution channels, causing frequency decline. For example, an on-chip crossbar is a prevalent solution to direct the dataflow between different execution channels (Ham and others, 2016; Yan and others, 2019). However, as the channel count increases, it suffers not only from frequency decline, which hinders the pursuit of high throughput, but also from a dramatic increase in area and power consumption (Cagla and others, 2015).

In this work, we observe that the execution channel in state-of-the-art accelerators is highly pipelined (Ham and others, 2016; Yan and others, 2019), which reveals that increasing the traversal latency of a single edge does not have a significant impact on overall performance. Therefore, inspired by the key idea of trading latency for throughput, we propose a general solution, the Multiple-stage Decentralized Propagation network (MDP-network). MDP-network decentralizes the intensive interactions across execution channels into multiple stages and buffers data in each stage. Data in MDP-network is propagated deterministically to the next stage until reaching its destination. On one hand, the multiple-stage and deterministic propagation alleviates datapath conflicts.

On the other hand, the inefficiency of design centralization is avoided since the number of interactive execution channels in each stage is limited to a small number. To facilitate adoption, we provide an open-source automatic generator of MDP-network (https://github.com/OpenSource88/MDP-network.git). Finally, a novel High throughput Graph analytics accelerator, HiGraph, is proposed by deploying MDP-network to tackle datapath conflicts and design centralization in practice.

The main contributions of this paper are as follows:

  • We identify the inefficiencies including datapath conflicts and design centralization in graph analytics acceleration.

  • We propose MDP-network, the Multiple-stage Decentralized Propagation network, to alleviate datapath conflicts and design centralization. Besides, an automation tool is developed to generate MDP-network and is open-sourced to facilitate its deployment.

  • We propose HiGraph, a novel high throughput graph analytics accelerator coupled with MDP-network, and implement it in RTL. The experimental results show that, compared to the state-of-the-art design, HiGraph achieves up to 2.2x speedup (1.5x on average) as well as better scalability.

2. Background and Motivation

2.1. CSR Format and VCPM

The Compressed Sparse Row (CSR) format is a widely used, storage-efficient representation of graph structures in software frameworks (Gonzalez and others, 2014; Shun and others, 2013) and accelerators (Ham and others, 2016; Yan and others, 2019). Fig. 1 illustrates the three data arrays, Offset, Edge, and Property, used to encode a graph. Each Offset Array entry stores the position of the corresponding vertex's first outgoing edge in the Edge Array. The Edge Array maintains the destination vertex ID and weight of each outgoing edge. The Property Array holds the current property value of each vertex.
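For concreteness, the following minimal Python sketch builds the three CSR arrays described above; the example adjacency data is hypothetical and only the array roles follow Fig. 1.

```python
# Minimal sketch of the CSR encoding described above (example graph is
# hypothetical; array names follow Fig. 1).
edges = {  # adjacency list: src -> [(dst, weight), ...]
    0: [(1, 3), (2, 1)],
    1: [(2, 5)],
    2: [(0, 2), (1, 4), (3, 1)],
    3: [],
}

num_vertices = len(edges)
offset = [0]                 # Offset Array: start position of each vertex's edges
edge_array = []              # Edge Array: (destination ID, weight) per outgoing edge
for v in range(num_vertices):
    edge_array.extend(edges[v])
    offset.append(offset[-1] + len(edges[v]))

prop = [0.0] * num_vertices  # Property Array: current property value per vertex

# Outgoing edges of vertex u occupy edge_array[offset[u] : offset[u + 1]],
# which is why reading the Offset Array is a one-to-two access
# (entries u and u+1), as discussed in Section 4.1.
u = 2
print(edge_array[offset[u]:offset[u + 1]])  # [(0, 2), (1, 4), (3, 1)]
```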

Existing graph analytics software frameworks (Malewicz and others, 2010; Fu and others, 2014) and accelerators (Ham and others, 2016; Yan and others, 2019) usually employ the Vertex-Centric Programming Model (VCPM) to implement iterative graph algorithms. VCPM consists of a scatter phase and an apply phase, as shown in Fig. 2. In the scatter phase, each active vertex first aggregates the effect of its property and edge weight via a user-defined function ProcessEdge(), then broadcasts the accumulated influence to its outgoing neighbors, updating an additional tProperty Array with a user-defined function Reduce(). In the apply phase, data in the tProperty Array is synchronized into the Property Array using a user-defined function Apply(). VCPM updates the Property Array iteratively until all vertices are inactive.
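The Python sketch below outlines the VCPM loop described above, with SSSP-style placeholders for the user-defined ProcessEdge()/Reduce()/Apply() functions; the actual pseudocode in Fig. 2 may differ in detail.

```python
def run_vcpm(offset, edge_array, prop, active, process_edge, reduce_fn, apply_fn):
    """VCPM sketch: iterate scatter + apply until no vertex remains active."""
    n = len(prop)
    while active:
        tprop = list(prop)                            # tProperty Array
        # Scatter phase: each active vertex pushes influence along its edges.
        for u in active:
            for dst, w in edge_array[offset[u]:offset[u + 1]]:
                res = process_edge(prop[u], w)        # user-defined ProcessEdge()
                tprop[dst] = reduce_fn(tprop[dst], res)  # user-defined Reduce()
        # Apply phase: synchronize tProperty into Property and track activity.
        next_active = []
        for v in range(n):
            new_val = apply_fn(prop[v], tprop[v])     # user-defined Apply()
            if new_val != prop[v]:
                next_active.append(v)
            prop[v] = new_val
        active = next_active
    return prop

# Example instantiation (SSSP-style): distances relax along weighted edges.
# run_vcpm(offset, edge_array, [0] + [float("inf")] * 3, [0],
#          process_edge=lambda d, w: d + w,
#          reduce_fn=min,
#          apply_fn=lambda old, new: min(old, new))
```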

Figure 1. An example graph in CSR format.
Figure 2. Pseudocode of VCPM.
Figure 3. The structure of parallel execution channels in VCPM-based graph analytics accelerator (top) and three types of datapath conflicts (bottom).

2.2. Inefficiencies of Previous Designs

By adopting multiple parallel execution channels, as shown in Fig. 3, previous VCPM-based graph analytics accelerators have achieved great throughput improvement. However, due to the irregular connection pattern across vertices, interaction across execution channels is inevitable, which brings the following two inefficiencies to previous designs.

Datapath Conflicts: multiple datapaths that process different vertices compete for the same accessible resource or the same dataflow channel, causing serious datapath stalls. Fig. 3 demonstrates three types of datapath conflicts in graph analytics accelerators with multiple parallel execution channels. Note that, to meet the data-access throughput requirement of such a design, the buffer for each data array is divided into several parts and organized in an interleaved fashion.

The first datapath conflict occurs in ① Offset Array Access, where two consecutive buffer parts are accessed concurrently to obtain the start and end positions of one active vertex's edges in the Edge Array. The second happens in ② Edge Array Access, where a list of edges is accessed simultaneously from multiple buffer parts. The third occurs in ③ Dataflow Propagation, where the dataflow is directed according to the destination vertex ID of each edge. Datapath conflicts arise when multiple datapaths require the same accessible resource or the same dataflow channel simultaneously. As a result, the execution channels that fail in arbitration are blocked, which inevitably degrades overall performance.

Design Centralization: the implementation becomes extremely difficult as the number of execution channels increases, due to the overly intensive interaction across all execution channels, causing frequency decline. In previous graph analytics accelerators, arbitration solutions such as the crossbar are prevalently used to handle the interaction of multiple channels (Ham and others, 2016; Yan and others, 2019). Fig. 4 demonstrates that the frequency declines sharply with an increasing number of crossbar ports, limiting throughput improvement. The frequency is determined by the critical path delay derived from the synthesis results of Synopsys Design Compiler.

Opportunity: we observe that the execution channel in the state-of-the-art accelerator is highly pipelined, which reveals that increasing the traversal latency of a single edge has only a marginal impact on overall performance. This inspires us to trade latency for throughput. Thus, it is practical to insert additional stages into the datapath to gradually guide data to the destination execution channel, which alleviates datapath conflicts. Besides, we can alleviate design centralization by limiting the number of interactive channels in each stage. Based on the above insights, we propose a general solution, MDP-network, to alleviate datapath conflicts and design centralization.

Figure 4. Frequency versus number of crossbar ports.

3. Multiple-stage Decentralized Propagation Network

Figure 5. Design theory of MDP-network.

3.1. Design Theory

In this subsection, we first abstract the interaction across execution channels that brings datapath conflicts and design centralization. Then, we introduce the design theory of MDP-network starting from a naive solution.

Fig. 5 (a) abstracts the interaction across execution channels in Fig. 3, where data from two input channels is directed to two output channels. Using an arbitration-based design, e.g., a crossbar, incurs datapath conflicts.

A naive solution to datapath conflicts is an n-Write-1-Read (nW1R) First-In-First-Out queue (FIFO), which can accept n datums and output one datum in each cycle. As shown in Fig. 5 (b), a 2W1R FIFO is used to direct data to the destination channel in each cycle, so no input channel stalls unless the FIFO is full. However, this solution encounters two critical issues when the number of channels increases, as shown in Fig. 5 (c). One is the large buffer capacity requirement. For example, when the number of write ports is 32, the FIFO can accept data only when the remaining capacity is no less than 32, which causes a large capacity requirement and low buffer utilization. The other is design centralization: the increasing number of write ports makes the FIFO implementation more difficult and can even result in frequency decline.
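As an illustration, the toy Python model below captures the nW1R FIFO behaviour discussed above; the conservative accept condition (at least n free entries) follows our reading of the text rather than any exact RTL.

```python
from collections import deque

class NW1RFifo:
    """Toy model of an n-Write-1-Read FIFO: up to n pushes and one pop per cycle."""
    def __init__(self, n_write_ports, capacity):
        self.n = n_write_ports
        self.capacity = capacity
        self.q = deque()

    def can_accept(self):
        # The FIFO must reserve room for a full write bundle, so with 32 write
        # ports it only accepts new data when at least 32 entries are free,
        # even if fewer datums actually arrive this cycle.
        return self.capacity - len(self.q) >= self.n

    def cycle(self, incoming):
        popped = self.q.popleft() if self.q else None   # one read per cycle
        if self.can_accept():
            for datum in incoming[:self.n]:             # up to n writes per cycle
                self.q.append(datum)
            accepted = True
        else:
            accepted = False                            # input channels stall
        return popped, accepted

# With n = 32 and capacity 64, half the buffer can sit idle before writes stall,
# illustrating the low buffer utilization noted above.
fifo = NW1RFifo(n_write_ports=32, capacity=64)
```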

Inspired by the idea of trading latency for throughput, we propose a general solution, MDP-network, to address this general problem. As shown in Fig. 5 (d), MDP-network decentralizes the intensive interactions across execution channels into multiple stages and buffers data in each stage. Data in MDP-network is deterministically propagated to the next stage until reaching its destination.

Benefiting from the multiple-stage and deterministic propagation dataflow, a head-of-line datum does not block the data behind it, and data is propagated to the destination channels stage by stage, resulting in fewer datapath conflicts and better throughput. Moreover, since the number of interactive execution channels in each stage of MDP-network is limited to a small number, the implementation complexity is reduced. As a result, MDP-network improves the scalability of the multiple-parallel-execution-channel design without lowering the frequency.
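The following Python fragment sketches the deterministic, stage-by-stage routing decision, assuming a radix-2 organization in which stage i inspects address bit (log2(n) - i), consistent with Algorithm 1 below and the toy example in Section 3.2.

```python
def route_one_stage(dest_addr, stage, total_stages):
    """Return which of the two outputs of a radix-2 stage a datum takes.

    Stage i (1-indexed) examines bit (total_stages - i) of the destination
    address, so after total_stages hops the datum reaches its channel.
    """
    bit = total_stages - stage
    return (dest_addr >> bit) & 1

# Example: 4 output channels -> 2 stages; destination channel 2 (0b10)
# takes output 1 in stage 1 (bit 1 set) and output 0 in stage 2 (bit 0 clear).
assert route_one_stage(0b10, stage=1, total_stages=2) == 1
assert route_one_stage(0b10, stage=2, total_stages=2) == 0
```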

Input: n // number of total channels
// Illustrates radix 2 as an example
Step 1: 2W2R module construction
    Use two 2W1R FIFOs to construct a 2W2R module;
Step 2: Input ports connection
for i = 1 to log2(n) do // stage
    pair_list ← [ ];
    target_group ← 2^(i-1);
    group_base ← n / target_group;
    channel_step ← group_base / 2;
    for j = 0 to target_group - 1 do // group
        real_base ← group_base × j;
        for k = 0 to channel_step - 1 do // pair
            channel_1 ← real_base + k;
            channel_2 ← real_base + k + channel_step;
            pair_list.append([channel_1, channel_2]);
        end for
    end for
    Input ports of stage i within the same pair are connected to one 2W2R module using the (log2(n) - i)th bit of the address;
end for
Algorithm 1 MDP-network Generation Algorithm

3.2. Automatic Generator for MDP-network

Algorithm 1 describes the simplified workflow of the automatic MDP-network generator. We define radix as the number of FIFO write ports and take radix 2 as an example here. To facilitate the deployment of MDP-network, the algorithm consists of two steps: 2W2R module construction (Step 1) and input ports connection (Step 2). It is straightforward to construct a 2W2R module following the rule shown in Fig. 5 (b). Once the construction is finished, input ports connection chooses the corresponding 2W2R module for the input ports of each MDP-network stage.

Fig. 5 (d) presents a toy example of input ports connection for four channels. MDP-network uses two address bits, addr[0:1], to specify the four destination channels and constructs log2(4) = 2 stages. In the first stage, we classify all input channels into one group (target_group = 1) since they have the same target range, i.e., output channels 0-3. channel_step is the difference between the two input channel IDs connected to one 2W2R module (channel_step = 2). So we connect input ports {0, 2} and {1, 3} to two 2W2R modules, respectively, using addr[1]. Note that we draw two 2W1R modules in the same color in Fig. 5 (d) if they come from one 2W2R module. In the second stage, target_group is 2 and channel_step is 1. Then we connect input ports {0, 1} from Group 1 and {2, 3} from Group 2 to two 2W2R modules using addr[0]. With MDP-network, data is propagated stage by stage until reaching its destination.
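A Python sketch of the port-pairing loop of Algorithm 1 (radix 2) is given below; running it with n = 4 reproduces the pairs {0, 2}, {1, 3} in stage 1 and {0, 1}, {2, 3} in stage 2 from the toy example, while the actual generator additionally emits the RTL wiring.

```python
import math

def mdp_port_pairs(n):
    """Compute, per stage, which input-port pairs share one 2W2R module (radix 2)."""
    stages = int(math.log2(n))
    plan = []
    for i in range(1, stages + 1):                 # stage
        pair_list = []
        target_group = 2 ** (i - 1)
        group_base = n // target_group
        channel_step = group_base // 2
        for j in range(target_group):              # group
            real_base = group_base * j
            for k in range(channel_step):          # pair
                channel_1 = real_base + k
                channel_2 = real_base + k + channel_step
                pair_list.append([channel_1, channel_2])
        addr_bit = stages - i                      # address bit used at this stage
        plan.append((pair_list, addr_bit))
    return plan

print(mdp_port_pairs(4))
# [([[0, 2], [1, 3]], 1), ([[0, 1], [2, 3]], 0)]
```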

Figure 6. Architecture of HiGraph.

4. HiGraph

In this section, we describe the architecture of HiGraph, a novel high throughput graph analytics accelerator. HiGraph buffers all data arrays of a graph in an interleaved fashion and adopts a parallel-execution-channel design.

As shown in Fig. 6, the HiGraph architecture contains a front-end part and a back-end part. The front-end part processes the source vertex of an edge, including reading active vertex IDs from the ActiveVertex Array and obtaining the corresponding edge offsets from the Offset Array. The back-end part reads the Edge Array and tProperty Array, executes ProcessEdge() and Reduce(), and then writes the updated values back to the tProperty Array.

As mentioned in Section 2.2, there are three types of interaction across execution channels. Next, we analyze their access patterns and describe how MDP-network is deployed to ease datapath conflicts and design centralization.

4.1. MDP-network for Offset Array Access

Referring to Fig. 3 ①, the access pattern in reading the Offset Array is one-to-two, indicating that the datapath reads two consecutive buffer parts indexed by u.ID and u.ID+1 of a source vertex u.

Fig. 6 shows that, to deal with such an access pattern, we first deploy MDP-network to guide source vertices to their corresponding output channels, e.g., the source vertex with u.ID = 0 is directed to channel 0. It is then apparent that each source vertex needs to occupy its corresponding read channel and the next one, e.g., vertices in channel 0 require read channels 0 and 1. In this way, source vertices in one channel only conflict with those in neighboring channels. Thus we insert an Odd-Even Arbiter to determine which channels' vertices should be issued in the current cycle.

The arbitration rule is alternating priority: odd and even channels alternately have the higher priority to issue vertices. In this way, vertices in channels with the higher priority can always be issued immediately, without considering datapath conflicts, and occupy their read channels. The remaining vertices are issued only when their destination read channels are not occupied, or when their target addresses are the same as those of the vertices that already occupy the read channels.
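A behavioural Python sketch of this alternating-priority rule is shown below; the bookkeeping of occupied read channels is illustrative, and the real arbiter logic may differ.

```python
def odd_even_arbitrate(requests, cycle, num_channels):
    """Sketch of the alternating-priority (Odd-Even) arbiter described above.

    requests: dict {channel_id: address}; a vertex in channel c needs read
    channels c and (c + 1) % num_channels. Odd and even channels alternately
    get priority; the others issue only if their read channels are free or
    already opened with the same address.
    """
    high_parity = cycle % 2                       # which parity wins this cycle
    occupied = {}                                 # read channel -> opened address
    issued = []
    ordered = sorted(requests, key=lambda c: (c % 2 != high_parity, c))
    for ch in ordered:
        addr = requests[ch]
        needed = (ch, (ch + 1) % num_channels)    # corresponding and next read channel
        if all(occupied.get(rc, addr) == addr for rc in needed):
            for rc in needed:
                occupied[rc] = addr
            issued.append(ch)
    return issued

# Cycle 0: even channels have priority (their read-channel pairs never overlap),
# so channels 0 and 2 issue while channel 1 waits for read channel 1.
print(odd_even_arbitrate({0: 7, 1: 3, 2: 9}, cycle=0, num_channels=4))  # [0, 2]
```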

4.2. MDP-network for Edge Array Access

Referring to Fig. 3 ②, the access pattern in reading the Edge Array is one-to-multiple, indicating that one {Off, nOff} pair requires access to multiple consecutive buffer parts to fetch edges.

Fig. 6 demonstrates the variant of MDP-network for the Edge Array access pattern. We insert Replay Engines to divide each {Off, nOff} into several {Off, Len} requests with an appropriate length. Then we use MDP-network to guide the data to the destination channels, with one extra operation required here: as the target range becomes smaller while data propagates stage by stage, we correspondingly split the input length into smaller output lengths so that each {Off, Len} fits within the smaller target range. For example, when Off 4 with Len 9 needs to be propagated to target 0-7 and target 8-15 in stage 1, it is split into Off 4 with Len 4 and Off 8 with Len 5. After several stages of propagation, the target range becomes smaller and more specific. Meanwhile, fewer channels share the same target range. In other words, through the propagation of MDP-network, the competition for subsequent datapaths is reduced stage by stage. In the last stage, we only need to integrate a set of small and simple units (i.e., Dispatchers) to distribute access requests to consecutive output channels.
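The length-splitting step can be sketched in Python as follows; the boundary arithmetic reproduces the Off 4 / Len 9 example above and reflects our interpretation of the text.

```python
def split_request(off, length, range_size):
    """Split {Off, Len} so each piece stays within one target range of range_size."""
    pieces = []
    while length > 0:
        boundary = (off // range_size + 1) * range_size   # end of current target range
        take = min(length, boundary - off)
        pieces.append((off, take))
        off += take
        length -= take
    return pieces

# Stage 1 of the example: targets 0-7 and 8-15 (range size 8).
print(split_request(4, 9, 8))   # [(4, 4), (8, 5)]
```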

4.3. MDP-network for Dataflow Propagation

Referring to Fig. 3 ③, the pattern of dataflow propagation is that multiple input data are directed to multiple output channels. As MDP-network is designed exactly for this pattern, we directly deploy the original MDP-network in this stage to alleviate datapath conflicts and design centralization.

5. Evaluation

5.1. Experimental Setup

Methodology. We implement HiGraph in RTL with Verilog. We use Synopsys Design Compiler with the TSMC 12nm standard VT library for synthesis. We give the synthesis tool an operating voltage of 0.8V and a target clock period of 1ns. The slowest module has a critical path of 0.93ns including setup and hold time, putting the HiGraph design comfortably at 1GHz. The ID and property data of each vertex are quantized to 19 bits to make full use of the on-chip memory capacity. The layout of HiGraph is shown in Fig. 7.

Figure 7. Layout of HiGraph.
                     HiGraph   HiGraph-mini   GraphDynS
Frequency            1GHz      1GHz           1GHz
#Front-end channels  32        4              4
#Back-end channels   32        32             32
On-chip memory       16MB      16MB           32MB
Table 1. Configurations used for HiGraph and baselines.

Baselines. To compare HiGraph with the state-of-the-art work, we prototype GraphDynS in RTL. We set the number of front-end channels in GraphDynS to four, since a larger number would cause frequency decline due to the delicate arbitration in reading the Offset Array. For a fair comparison, we also set up HiGraph-mini with the same number of front-end channels. Table 1 shows the configurations of these implementations.

Datasets. Table 2 describes the datasets used for our evaluation: a mixture of real-world graphs (VT, EP, SL, TW) and synthetic graphs (R14, R16). Four graph algorithms, BFS (Breadth-First Search), SSSP (Single Source Shortest Path), SSWP (Single Source Widest Path), and PR (PageRank), are used to evaluate HiGraph. For the evaluation on unweighted graphs, random integer weights are assigned.

Name #Vertices #Edges Avg. Degree Description
Real-world Graphs
Vote (VT) (Leskovec and others, 2010) 7K 0.10M 15 Wikipedia Who-votes-on-whom
Epinions (EP) (Richardson et al., 2003) 76K 0.51M 7 Epinions Who-trusts-whom
Slashdot (SL) (Leskovec et al., 2009) 82K 0.95M 12 Slashdot Social Network
Twitter (TW) (McAuley and Leskovec, 2012) 81K 1.77M 22 Twitter Social Circles
Synthetic Graphs
RMAT14 (R14) (Ang et al., 2010) 16K 1.05M 64 Synthetic Graph
RMAT16 (R16) (Ang et al., 2010) 66K 4.19M 64 Synthetic Graph
Table 2. Benchmark Datasets.

5.2. Overall Results

Speedup: Fig. 8 shows the speedups of HiGraph and HiGraph-mini normalized to GraphDynS. With the same number of front-end channels, HiGraph-mini achieves 1.19x to 1.85x speedup over GraphDynS (1.46x on average). With the MDP-network optimization, HiGraph increases the number of front-end channels without frequency decline. Thus, with more front-end channels, HiGraph achieves up to 2.23x speedup over GraphDynS (1.54x on average).

Figure 8. Speedup over GraphDynS.

Throughput: Fig. 9 compares the throughput of HiGraph against that of GraphDynS. Throughput is defined as the number of edges processed per second (GTEPS, giga-traversed edges per second). The ideal throughput is 32 GTEPS, i.e., 32 back-end channels each consuming one edge per cycle at 1GHz. HiGraph achieves up to 25.0 GTEPS, reaching 78.1% of the ideal throughput. Compared to GraphDynS, the throughput is improved by 2.7 GTEPS to 13.1 GTEPS (6.7 GTEPS on average).

Figure 9. Throughput.
Figure 10. Effects of our optimizations: (a) Throughput. (b) Starvation cycles of vPE.
Figure 11. Throughput versus number of back-end channels.
Figure 12. Throughput versus buffer size of FIFO in each channel.

5.3. Effects of Optimizations

A detailed evaluation on the RMAT14 dataset is presented to provide more insight into the effects of our optimizations, as shown in Fig. 10 (a). The baseline is the design without any of our optimizations. Opt-O, Opt-E, and Opt-D indicate the use of MDP-network for Offset Array Access, Edge Array Access, and Dataflow Propagation, respectively. Two obvious phenomena can be observed in Fig. 10 (a). First, when Opt-D is included, the design gains more performance improvement, up to 6.2 GTEPS. This is because Opt-D not only alleviates the datapath conflicts in the back-end part, but also has synergy with the optimizations in the front-end part. Second, the front-end optimizations gain almost no performance improvement on the PR algorithm. This is because the Offset Array and Edge Array are read in order in PR, so no datapath conflicts arise in the front-end part.

To show the reduction of datapath conflicts, we report the number of starvation cycles of the vPEs, which perform the Reduce() function, the final step of the scatter phase. Fig. 10 (b) shows a massive number of starvation cycles for the baseline on the RMAT14 dataset, revealing that plenty of vPEs are starved for data due to datapath conflicts. Benefiting from our optimizations, datapath conflicts are alleviated so that more data is delivered to the vPEs without stalls. Thus, the number of starvation cycles is reduced significantly, by up to 58%, which validates the effects of our optimizations.

Fig. 11 shows the scalability of HiGraph and GraphDynS on the PR algorithm with the RMAT14 dataset as the number of back-end channels increases. GraphDynS does not support more than 64 channels due to the significant frequency decline shown in Fig. 4, so we only evaluate GraphDynS with 32 and 64 channels. For HiGraph, we synthesize configurations with 32 to 256 channels using Synopsys Design Compiler and find that the most critical path only rises from 0.93ns to 0.97ns, still meeting the 1GHz requirement. The results in Fig. 11 demonstrate that HiGraph's scalability is much better than that of GraphDynS.

Discussion: For large-scale graph processing, the graph can be partitioned into small slices so that each slice is processed on chip (Ham and others, 2016). Therefore, our optimizations can also improve throughput in large-scale graph analytics. Besides, the time spent replacing slices can be overlapped with computation using a double-buffering design.

5.4. Design Option of MDP-network

A detailed evaluation of design options is presented to provide more insight into the design of MDP-network.

We run experiments on various radices, i.e., the number of write ports of the FIFO used to construct each stage of MDP-network. We find that an overly large radix still suffers from design centralization, which degrades performance. By contrast, the performance changes only slightly with relatively small radices. Thus, we choose radix 2 in our design.

We also run experiments on the buffer size of MDP-network. We keep all designs in HiGraph the same except for the dataflow propagation stage, in which we replace MDP-network with a FIFO-plus-crossbar design. Fig. 12 demonstrates that MDP-network consistently outperforms FIFO-plus-crossbar across various buffer sizes on the PR algorithm with the RMAT14 dataset. We choose 160 entries as the buffer size of the FIFO in each channel because the throughput rarely increases with larger buffers. We synthesize MDP-network with a buffer size of 160: the area is 0.375mm² and the power is 621.2mW. We also synthesize the FIFO-plus-crossbar design with a buffer size of 128: the area is 0.292mm² and the power is 508.1mW. The area and power of MDP-network are slightly higher due to the larger buffer, showing that replacing the crossbar with MDP-network incurs little overhead.

6. Related Work

Prior works focus on optimizing irregular off-chip memory accesses to pursue high throughput in graph analytics (Ham and others, 2016; Yan and others, 2019; Rahman and others, 2020; Ozdal and others, 2016). By preprocessing (Kyrola and others, 2012) and partitioning (Ham and others, 2016), a large graph can be divided into a set of slices so that the irregularly accessed data fits in on-chip memory. Moreover, Centaur (Addisie and others, 2020) is an accelerator that maps only high-degree vertices to on-chip memory. To pursue even higher throughput, prior accelerators (Ham and others, 2016; Yan and others, 2019) provide parallel-execution-channel designs according to the execution characteristics of graph analytics workloads. Unfortunately, little attention has been paid to datapath conflicts and design centralization, which have become the critical issues. With our MDP-network, these issues are alleviated and high throughput is realized.

7. Conclusion

In this paper, we identify the inefficiencies in graph analytics acceleration, namely datapath conflicts and design centralization. To this end, we propose MDP-network, inspired by the idea of trading latency for throughput. Besides, an automatic generator for MDP-network is developed and open-sourced. Finally, a novel high throughput graph analytics accelerator, HiGraph, is proposed by deploying MDP-network to tackle datapath conflicts and design centralization in practice. HiGraph achieves up to 2.2x speedup (1.5x on average) compared to the state-of-the-art accelerator.

Acknowledgements.
This work was supported by the National Natural Science Foundation of China (Grant No. 61732018, 61872335, and 61802367), the Austrian-Chinese Cooperative R&D Project (FFG and CAS) (Grant No. 171111KYSB20200002), the CAS Project for Young Scientists in Basic Research (Grant No. YSBR-029), and the CAS Project for Youth Innovation Promotion Association. The corresponding author is Mingyu Yan, yanmingyu@ict.ac.cn.

References

  • A. Addisie et al. (2020) Centaur: hybrid processing in on/off-chip memory architecture for graph analytics. In DAC, pp. 1–6. Cited by: §6.
  • J. A. Ang, B. W. Barrett, K. B. Wheeler, and R. C. Murphy (2010) Introducing the Graph 500. Cray Users Group. Cited by: Table 2.
  • C. Cagla et al. (2015) Modeling and design of high-radix on-chip crossbar switches. In NOCS, pp. 1–8. Cited by: §1.
  • R. J. Francis et al. (1990) Chortle: A technology mapping program for lookup table-based field programmable gate arrays. In DAC, pp. 613–619. Cited by: §1.
  • Z. Fu et al. (2014) MapGraph: A high level API for fast development of high performance graph analytics on gpus. In GRADES, pp. 2:1–2:6. Cited by: §2.1.
  • J. E. Gonzalez et al. (2014) GraphX: graph processing in a distributed dataflow framework. In OSDI, pp. 599–613. Cited by: §2.1.
  • T. J. Ham et al. (2016) Graphicionado: A high-performance and energy-efficient accelerator for graph analytics. In MICRO, pp. 1–13. Cited by: §1, §1, §1, §2.1, §2.1, §2.2, §5.3, §6.
  • B. Hu et al. (2004) Fine granularity clustering-based placement. IEEE TCAD 23 (4), pp. 527–536. Cited by: §1.
  • A. Kyrola et al. (2012) GraphChi: large-scale graph computation on just a PC. In OSDI, pp. 31–46. Cited by: §6.
  • J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney (2009) Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math. 6 (1), pp. 29–123. Cited by: Table 2.
  • J. Leskovec et al. (2010) Predicting positive and negative links in online social networks. In WWW, pp. 641–650. Cited by: Table 2.
  • G. Malewicz et al. (2010) Pregel: a system for large-scale graph processing. In SIGMOD, pp. 135–146. Cited by: §2.1.
  • J. J. McAuley and J. Leskovec (2012) Learning to discover social circles in ego networks. In NIPS, pp. 548–556. Cited by: Table 2.
  • M. M. Ozdal et al. (2016) Energy efficient architecture for graph analytics accelerators. In ISCA, pp. 166–177. Cited by: §6.
  • S. Rahman et al. (2020) GraphPulse: an event-driven hardware accelerator for asynchronous graph processing. In MICRO, pp. 908–921. Cited by: §6.
  • M. Richardson, R. Agrawal, and P. M. Domingos (2003) Trust management for the semantic web. In ISWC, Vol. 2870, pp. 351–368. Cited by: Table 2.
  • N. Selvakkumaran and G. Karypis (2006) Multiobjective hypergraph-partitioning algorithms for cut and maximum subdomain-degree minimization. IEEE TCAD 25 (3), pp. 504–517. Cited by: §1.
  • J. Shun et al. (2013) Ligra: a lightweight graph processing framework for shared memory. In PPoPP, pp. 135–146. Cited by: §2.1.
  • M. Yan et al. (2019) Alleviating irregularity in graph analytics acceleration: a hardware/software co-design approach. In MICRO, pp. 615–628. Cited by: §1, §1, §1, §2.1, §2.1, §2.2, §6.