Advances in mobile computing, coupled with the proliferation of online social networks, have given rise to a new class of applications and computing challenges. These applications tend to be relational by nature: they express or encode relations, communications, connectivity and interactions between people, places, objects or systems. As such, the data of interest in these applications are often best represented as graphs. Graph-based applications range from social network analysis to anomaly detection. For computing purposes, graphs are commonly represented in one of two forms: (1) as an adjacency matrix or (2) as an adjacency list. An adjacency matrix works well for densely connected graphs, i.e., graphs whose number of edges is close to the maximal number of edges. In general, computing on dense graphs can be easily parallelized, and GPU and SIMD architectures have proven to be the platforms of choice for executing such graph-based applications. Unfortunately, the vast majority of large graphs in these applications are sparse. For the efficient storage of large sparse graphs, adjacency lists or other compressed representation schemes are used. Memory access and load balancing are among the key bottlenecks to the efficient processing of large sparse graph algorithms and applications. The memory access patterns often lack spatial and temporal locality, resulting in high cache miss rates. Current cache-based processor architectures are simply not well suited to the computational flow of graph processing. In addition to the storage problem, computing on large sparse graphs presents a number of challenges, including the need for effective programming abstractions and models of computation that leverage the graph structure in the application. In this work, we present a domain-specific architecture tailored to graph-based algorithms and applications.
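To make the storage trade-off between the two representations concrete, the following is a minimal sketch that builds both a dense adjacency matrix and a compressed sparse row (CSR) adjacency list from the same edge list. The function names and the tiny example graph are illustrative assumptions, not part of the proposed architecture; CSR is used here as one common instance of a compressed adjacency-list scheme.

```python
from typing import List, Tuple

Edge = Tuple[int, int]

def to_adjacency_matrix(num_nodes: int, edges: List[Edge]) -> List[List[int]]:
    """Dense form: one cell per possible directed edge, O(V^2) storage."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst in edges:
        matrix[src][dst] = 1
    return matrix

def to_csr(num_nodes: int, edges: List[Edge]) -> Tuple[List[int], List[int]]:
    """Compressed sparse row form: O(V + E) storage."""
    # Count outgoing edges per node, then prefix-sum into row offsets.
    row_ptr = [0] * (num_nodes + 1)
    for src, _ in edges:
        row_ptr[src + 1] += 1
    for i in range(num_nodes):
        row_ptr[i + 1] += row_ptr[i]
    # Fill the column index array, one slot per edge.
    col_idx = [0] * len(edges)
    next_slot = row_ptr[:-1].copy()
    for src, dst in edges:
        col_idx[next_slot[src]] = dst
        next_slot[src] += 1
    return row_ptr, col_idx

def neighbors(row_ptr: List[int], col_idx: List[int], node: int) -> List[int]:
    """Outgoing neighbors of `node`, read as one contiguous slice."""
    return col_idx[row_ptr[node]:row_ptr[node + 1]]

# Hypothetical 4-node directed graph used only for illustration.
edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
matrix = to_adjacency_matrix(4, edges)
row_ptr, col_idx = to_csr(4, edges)
```

The matrix always holds V × V cells regardless of how many edges exist, whereas the CSR arrays grow only with V + E, which is why compressed forms are preferred once the graph is large and sparse.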
II Proposed Graph Processor Architecture
Figure 1 shows an illustration of the proposed architecture. The three key modules of the architecture are (1) the graph processor, (2) the co-processor and (3) the main memory. The graph processor (1) module has a Memory Interface unit (1a) to coordinate batch accesses to the main memory or external memory units, a Dispatch Logic (1b) to perform scatter operations on data from the main memory, an Output Logic (1c) to gather output data from the graph processor, and a systolic array of simple processing elements called Node Arithmetic Logic Engines (NALEs) (1d) to carry out the actual graph computations. The co-processor (2) performs three key functions. It (1) executes non-graph parts of the application, (2) schedules the graph part of the application and (3) monitors the execution flow of the graph.
Graph Processor Micro-Architecture: Figure 2 shows the micro-architecture of a NALE. The NALE is optimized for fast MAC (Multiply-And-Accumulate) operations with a three-state output comparator for fast node value sorting. It has two FIFO structures, one to communicate with neighbors and one internal FIFO to emulate multiple graph nodes (node cluster mode execution).
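As a rough behavioral illustration of the NALE described above, the sketch below models a multiply-and-accumulate unit, a three-state comparator, and the two FIFOs in software. The class name, method names, and the way the internal FIFO rotates between emulated nodes are all assumptions made for illustration; the actual micro-architecture is the one shown in Figure 2.

```python
from collections import deque

class NALE:
    """Hypothetical software model of one Node Arithmetic Logic Engine."""

    def __init__(self) -> None:
        self.neighbor_fifo = deque()  # (value, weight) pairs arriving from neighbors
        self.cluster_fifo = deque()   # parked accumulators of other emulated nodes
        self.acc = 0                  # accumulator of the currently active node

    def mac(self, value: int, weight: int) -> int:
        """Multiply-and-accumulate: acc += value * weight."""
        self.acc += value * weight
        return self.acc

    @staticmethod
    def compare(a: int, b: int) -> int:
        """Three-state comparison: -1 if a < b, 0 if equal, +1 if a > b."""
        return (a > b) - (a < b)

    def step(self) -> bool:
        """Consume one input if available; return False when there is nothing to do."""
        if not self.neighbor_fifo:
            return False
        value, weight = self.neighbor_fifo.popleft()
        self.mac(value, weight)
        return True

    def switch_node(self) -> None:
        """Node cluster mode (assumed semantics): park the active node, resume the next."""
        self.cluster_fifo.append(self.acc)
        self.acc = self.cluster_fifo.popleft()

# Feed two weighted inputs and drain them.
nale = NALE()
nale.neighbor_fifo.extend([(2, 3), (4, 5)])
while nale.step():
    pass
```

The three-way comparison result is the kind of signal a sorting step can branch on directly, which is presumably why the comparator is paired with the MAC datapath.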
Each NALE operates independently of the others, firing as soon as its inputs are ready. Communicating only through FIFOs allows each NALE to run at its own clock speed. Furthermore, this approach allows us to adopt a GasP asynchronous design methodology that can seamlessly scale to hundreds of thousands of NALEs. Figure 3(a) illustrates the clockless handshake logic between NALEs. In addition to the scalability benefits, the absence of a global clock allows the underlying data dependencies to dictate application execution time. Figure 3(b) shows the synthesizable equivalent of the GasP circuit.
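The data-driven firing rule can be illustrated with a small software analogy: a stage fires only when its input FIFO holds a token and the downstream FIFO has room, and stages are polled in arbitrary order because there is no global clock. This is a behavioral sketch under assumed names (`Stage`, `run_pipeline`) and an assumed FIFO capacity of one; it does not model the GasP circuit itself.

```python
import random
from collections import deque
from typing import Callable, List, Optional

class Stage:
    """Hypothetical processing element with a bounded input FIFO."""

    def __init__(self, fn: Callable[[int], int], capacity: int = 1) -> None:
        self.fn = fn
        self.inbox = deque()
        self.capacity = capacity
        self.downstream: Optional["Stage"] = None

    def can_fire(self) -> bool:
        # Handshake analogy: input is valid AND the successor can accept a token.
        has_input = bool(self.inbox)
        has_room = (self.downstream is None
                    or len(self.downstream.inbox) < self.downstream.capacity)
        return has_input and has_room

def run_pipeline(fns: List[Callable[[int], int]], inputs: List[int], seed: int = 0) -> List[int]:
    rng = random.Random(seed)
    stages = [Stage(fn) for fn in fns]
    for up, down in zip(stages, stages[1:]):
        up.downstream = down
    pending = deque(inputs)
    results: List[int] = []
    while pending or any(s.inbox for s in stages):
        # The source injects a token whenever the first FIFO has room.
        if pending and len(stages[0].inbox) < stages[0].capacity:
            stages[0].inbox.append(pending.popleft())
        # Pick any ready stage: firing order depends on readiness, not on a clock.
        ready = [s for s in stages if s.can_fire()]
        if ready:
            stage = rng.choice(ready)
            token = stage.fn(stage.inbox.popleft())
            if stage.downstream is None:
                results.append(token)
            else:
                stage.downstream.inbox.append(token)
    return results

outputs = run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])
```

Changing `seed` changes the order in which stages fire but not the results, which mirrors the point that data dependencies, rather than a global clock, determine the execution.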
Model of computation and compilation: An asynchronous model of computation is adopted to take full advantage of the graph processor. Given a graph application specification and the number of NALEs available for its computation, the execution preprocessing flow follows five key steps, illustrated in Figure 4: (1) the application is profiled to extract the graph topology, (2) nodes are clustered, (3) dependencies between clusters are analyzed, (4) clusters are placed onto NALEs, and (5) the application is compiled.
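The five preprocessing steps can be sketched as a small pipeline of functions. The specific heuristics below (contiguous block clustering, one cluster per NALE, a plan expressed as dictionaries) are placeholder assumptions chosen to keep the example short; the text does not specify which clustering or placement algorithms are actually used.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]

def profile(edges: List[Edge]) -> List[int]:
    """Step 1: extract the topology (here, just the sorted node set)."""
    return sorted({n for edge in edges for n in edge})

def cluster(nodes: List[int], num_nales: int) -> List[List[int]]:
    """Step 2: group nodes into at most num_nales contiguous blocks (assumed heuristic)."""
    size = max(1, -(-len(nodes) // num_nales))  # ceiling division
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]

def cluster_dependencies(clusters: List[List[int]], edges: List[Edge]) -> Dict[int, Set[int]]:
    """Step 3: cluster c depends on cluster d if an edge runs from d into c."""
    owner = {n: c for c, members in enumerate(clusters) for n in members}
    deps: Dict[int, Set[int]] = {c: set() for c in range(len(clusters))}
    for src, dst in edges:
        if owner[src] != owner[dst]:
            deps[owner[dst]].add(owner[src])
    return deps

def place(clusters: List[List[int]], num_nales: int) -> Dict[int, int]:
    """Step 4: map each cluster to a NALE index (assumed round-robin)."""
    return {c: c % num_nales for c in range(len(clusters))}

def compile_plan(clusters, deps, placement) -> List[dict]:
    """Step 5: emit a per-cluster record naming its NALE and the NALEs it waits on."""
    return [
        {"nale": placement[c], "nodes": members,
         "waits_on": sorted(placement[d] for d in deps[c])}
        for c, members in enumerate(clusters)
    ]

# Hypothetical 5-node graph mapped onto 2 NALEs.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]
nodes = profile(edges)
clusters = cluster(nodes, num_nales=2)
deps = cluster_dependencies(clusters, edges)
plan = compile_plan(clusters, deps, place(clusters, num_nales=2))
```

Edges that stay inside a cluster produce no dependency entry, so a clustering that keeps tightly connected nodes together reduces the amount of inter-NALE communication the plan has to express.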
III Architecture Evaluation
Experimental setup: To obtain high-fidelity performance and power measurements for the proposed architecture, we prototype it on an FPGA alongside a conventional CPU and a GPU of comparable complexity. The Xilinx Virtex7-XC7VX980T FPGA device is used as our prototyping platform. We implement a synthesizable RTL version of the graph processor. For the CPU, we use the 7-stage RISC core in the Heracles RTL simulator. For the GPU architecture, we adopt MIAOW, an open-source general-purpose graphics processor (GPGPU) based on the AMD Southern Islands ISA. The three architectures are evaluated on execution time and power.
Graph algorithms and applications: For the evaluation, we consider a set of representative graph algorithms, namely, Single Source Shortest Path (SSSP), Breadth First Search (BFS), Depth First Search (DFS), PageRank (PR), Minimal Enclosing Triangles (MiniTri), and Connected Components (CC). We use three different graph datasets: (1) the California road network (CA), which has 1,965,206 vertices, 2,766,607 edges and an average degree of 1.41, (2) the Facebook social network (FB), with 2,937,612 vertices, 41,919,708 edges and an average degree of 14.3, and (3) the LiveJournal social network (LJ), with 4,847,571 vertices, 85,702,475 edges and an average degree of 17.6.
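The sparsity of these three graphs can be checked with a few lines of arithmetic. The sketch below assumes the average degree is computed as edges divided by vertices (the definition is not stated in the text, but it gives values close to the reported ones) and that density is measured against the maximal number of directed edges without self-loops.

```python
# (vertices, edges) as listed in the evaluation setup.
datasets = {
    "CA": (1_965_206, 2_766_607),
    "FB": (2_937_612, 41_919_708),
    "LJ": (4_847_571, 85_702_475),
}

def average_degree(vertices: int, edges: int) -> float:
    """Assumed definition: number of edges per vertex."""
    return edges / vertices

def density(vertices: int, edges: int) -> float:
    """Fraction of the V * (V - 1) possible directed edges that are present."""
    return edges / (vertices * (vertices - 1))

summary = {
    name: (average_degree(v, e), density(v, e))
    for name, (v, e) in datasets.items()
}

for name, (avg, dens) in summary.items():
    print(f"{name}: average degree {avg:.2f}, density {dens:.2e}")
```

Under these assumptions every graph fills only a tiny fraction of its possible edges, so a dense adjacency matrix would consist almost entirely of zeros.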
References
- A. Kyrola, G. Blelloch, and C. Guestrin, “GraphChi: Large-scale graph computation on just a PC,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’12. Berkeley, CA, USA: USENIX Association, 2012, pp. 31–46.
- A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry, “Challenges in parallel graph processing,” Parallel Processing Letters, vol. 17, no. 01, pp. 5–20, 2007.
- M. Roncken, S. M. Gilla, H. Park, N. Jamadagni, C. Cowan, and I. Sutherland, “Naturalized communication and testing,” in 21st IEEE International Symposium on Asynchronous Circuits and Systems, May 2015, pp. 77–84.
- M. A. Kinsy, M. Pellauer, and S. Devadas, “Heracles: A tool for fast RTL-based design space exploration of multicore processors,” in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA 2013, pp. 125–134.
- R. Balasubramanian et al., “Enabling GPGPU low-level hardware explorations with MIAOW: An open-source RTL implementation of a GPGPU,” ACM Trans. Archit. Code Optim., vol. 12, no. 2, pp. 21:25, Jun. 2015.