Triangles are basic substructures of networks and play critical roles in network analysis. Due to the importance of triangles, the triangle counting (TC) problem, which counts the number of triangles in a given graph, is essential for analyzing networks. It is generally considered the first fundamental step in calculating metrics such as the clustering coefficient and the transitivity ratio, as well as in other tasks such as community discovery, link prediction, and spam filtering. The TC problem is not computationally hard, but TC algorithms are memory-bandwidth intensive and thus time-consuming. As a result, researchers from both academia and industry have proposed many TC acceleration methods, ranging from sequential to parallel, from single-machine to distributed, and from exact to approximate. From the computing hardware perspective, these acceleration strategies are generally executed on CPUs, GPUs, or FPGAs, all based on the Von Neumann architecture [1, 2, 3]. However, because most graph processing algorithms have low computation-to-memory ratios and highly random data access patterns, the frequent data transfers between the computational units and memory components consume a large amount of time and energy.
The in-memory computing paradigm performs computation where the data resides. It can save most of the off-chip data communication energy and latency by exploiting the large inherent bandwidth and inherent parallelism of internal memory [4, 5]. As a result, in-memory computing has emerged as a viable way to carry out computationally expensive and memory-intensive tasks [6, 7]. It becomes even more promising when integrated with emerging non-volatile STT-MRAM memory technologies. This integration, called Processing-In-MRAM (PIM), offers fast write speed, low write energy, and high write endurance, among many other benefits [8, 9].
In the literature, there have been explorations of in-memory graph algorithm acceleration [10, 11, 12, 13]; however, existing TC algorithms, including the intersection-based and matrix-multiplication-based ones, cannot be directly implemented in memory. For large sparse graphs, a highly efficient PIM architecture, efficient graph data compression, and data mapping mechanisms are all critical to the efficiency of PIM acceleration. Although there are compression formats for sparse graphs, such as compressed sparse column (CSC), compressed sparse row (CSR), and coordinate list (COO), these representations cannot be directly applied to in-memory computation either. In this paper, we propose and design the first in-memory TC accelerator that overcomes the above barriers. Our main contributions can be summarized as follows:
We propose a novel TC method that uses massive bitwise operations to enable in-memory implementations.
We propose strategies for data reuse and exchange, and data slicing for efficient graph data compression and mapping onto in-memory computation architectures.
We build a TC accelerator with the sparsity-aware processing-in-MRAM architecture. A device-to-architecture co-simulation demonstrates highly encouraging results.
The rest of the paper is organized as follows: Section II provides some preliminary knowledge of TC and in-memory computing. Section III introduces the proposed TC method with bitwise operations, and Section IV elaborates a sparsity-aware processing-in-MRAM architecture which enables highly efficient PIM accelerations. Section V demonstrates the experimental results and Section VI concludes.
II-A Triangle Counting
Given a graph, the triangle counting (TC) problem seeks to determine the number of triangles in it. Sequential algorithms for TC fall into two groups. In the matrix-multiplication-based algorithms, a triangle is viewed as a closed path of length three, namely a path through three vertices that begins and ends at the same vertex. If A is the adjacency matrix of graph G, then the diagonal element (A^3)_{i,i} gives the number of paths of length three beginning and ending at vertex i. Given that a triangle is counted once for each of its three vertices, and that the graph is undirected (that is, a triangle u-v-w is also counted as u-w-v), the number of triangles in G can be obtained as trace(A^3)/6, where trace(·) is the sum of the elements on the main diagonal of a matrix. The set-intersection-based algorithms iterate over each edge and find the common elements of the adjacency lists of its head and tail vertices. Many CPU-, GPU-, and FPGA-based optimization techniques have been proposed [1, 2, 3]. These works show promising results for accelerating TC; however, they all suffer from the performance and energy bottlenecks caused by the significant amount of data transfers in TC.
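As a quick sanity check of the matrix-multiplication formulation, the following self-contained Python sketch counts triangles via trace(A^3)/6 on a small example graph (the graph and helper names are ours, chosen for illustration):

```python
# Adjacency matrix of a 4-vertex undirected graph with two triangles
# (1-2-3 and 2-3-4 in 1-based vertex labels); stored 0-based here.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]
n = len(A)

def matmul(X, Y):
    # plain O(n^3) matrix product, enough for the illustration
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A3 = matmul(matmul(A, A), A)
trace = sum(A3[i][i] for i in range(n))
# each triangle is counted 6 times (3 starting vertices x 2 directions)
print(trace // 6)  # -> 2
```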
II-B In-Memory Computing with STT-MRAM
STT-MRAM is a promising candidate for next-generation main memory because of properties such as near-zero leakage, non-volatility, high endurance, and compatibility with the CMOS manufacturing process. Prototype STT-MRAM chip demonstrations and commercial MRAM products are already available from companies such as Everspin and TSMC. STT-MRAM stores data as magnetic resistance instead of using conventional charge-based storage and access, which enables MRAM to provide inherent computing capabilities for bitwise logic with only minor changes to the peripheral circuitry.
As the left part of Fig. 1 shows, a typical STT-MRAM bit-cell consists of an access transistor and a Magnetic Tunnel Junction (MTJ), controlled by a bit-line (BL), word-line (WL), and source-line (SL). The relative magnetic orientations of the pinned ferromagnetic layer (PL) and the free ferromagnetic layer (FL) can be stable in parallel (P state) or anti-parallel (AP state), corresponding to low resistance (R_P) and high resistance (R_AP), respectively. A READ operation is performed by enabling the WL signal, applying a voltage across BL and SL, and sensing the current (I_P or I_AP) that flows through the MTJ. By comparing the sensed current with a reference current placed between I_AP and I_P, the data stored in the MTJ cell (logic ‘0’ or logic ‘1’) can be read out. A WRITE operation is performed by enabling WL and then applying an appropriate voltage across BL and SL to pass a current greater than the critical MTJ switching current. To perform a bitwise logic operation, as demonstrated in the right part of Fig. 1, two word lines are enabled simultaneously and a read voltage is applied across BL and SL; the current that feeds into the k-th sense amplifier (SA) is then the summation of the currents flowing through the two selected cells. With different reference sensing currents, various logic functions of the enabled word lines can be implemented.
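The current-summation sensing described above can be mimicked in software. The sketch below is an illustrative model only: the current values and the mapping of logic ‘1’ to the high-current P state are our assumptions, not measured device parameters.

```python
# Illustrative model of two-row current-summation sensing (all numeric
# values are hypothetical). Enabling two word lines sums the two cell
# currents on the shared bit line; comparing the summed current against
# a reference placed between (I_P + I_AP) and 2*I_P implements AND.
I_P, I_AP = 20.0, 5.0                      # cell currents in microamps (assumed)
I_REF_AND = ((I_P + I_AP) + 2 * I_P) / 2   # midpoint between "one 1" and "two 1s"

def cell_current(bit):
    # assumption: logic '1' stored in the low-resistance P state -> high current
    return I_P if bit else I_AP

def sense_and(bit_a, bit_b):
    # the sense amplifier outputs '1' only if both enabled cells store '1'
    return 1 if cell_current(bit_a) + cell_current(bit_b) > I_REF_AND else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, sense_and(a, b))  # matches a & b
```

Moving the reference current changes the realized function (e.g., a lower threshold between I_AP + I_AP and I_P + I_AP would yield OR), which is why the same array can implement several bitwise logic operations.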
III Triangle Counting with Bitwise Operations
In this section, we seek to perform TC with massive bitwise operations, which is the enabling technique for an in-memory TC accelerator. Let A be the adjacency matrix representation of an undirected graph G in which each edge is stored only once: A_{i,j} = 1 (with i < j) if there is an edge between vertices i and j, so A is upper-triangular. If we compute B = A × A, then the value of B_{i,j} represents the number of distinct paths of length two between vertices i and j. If there is an edge between vertices i and j, and j can also be reached from i through a path of length two with intermediate vertex k, then vertices i, j, and k form a triangle. As a result, the number of triangles in G equals the sum of the elements of A ∘ (A × A) (the symbol ‘∘’ denotes element-wise multiplication here), namely

TC(G) = Σ_{i,j} A_{i,j} · (A × A)_{i,j}.    (1)

Since A_{i,j} is either zero or one, only the non-zero elements of A contribute to the sum, and we have

TC(G) = Σ_{A_{i,j} ≠ 0} (A × A)_{i,j} = Σ_{A_{i,j} ≠ 0} Σ_k A_{i,k} · A_{k,j}.    (2)

According to Equation (2), each inner sum combines the i-th row with the j-th column of A. Because each element of A is either zero or one, the bitwise Boolean AND result is equal to that of the mathematical multiplication, thus

TC(G) = Σ_{A_{i,j} ≠ 0} BitCount(Row_i AND Col_j),    (3)

in which BitCount returns the number of ‘1’s in a vector consisting of ‘0’s and ‘1’s; for example, BitCount('0110') = 2.
Therefore, TC can be completed with only AND and BitCount operations (massive in number for large graphs). Specifically, for each non-zero element A_{i,j}, the i-th row (Row_i) and the j-th column (Col_j) are combined with an AND operation, and the AND result is sent to a bit-counter module for accumulation. Once all the non-zero elements have been processed in this way, the accumulated BitCount value is exactly the number of triangles in the graph.
Fig. 2 demonstrates an illustrative example of the proposed TC method. As the left part of the figure shows, the graph has four vertices, five edges, and two triangles (1-2-3 and 2-3-4), and its adjacency matrix is given. The non-zero elements in A are A_{1,2}, A_{1,3}, A_{2,3}, A_{2,4}, and A_{3,4}. For A_{1,2}, row Row_1 = ‘0110’ and column Col_2 = ‘1000’ are combined with an AND operation; the AND result ‘0000’ is then sent to the bit counter, which returns zero. Similar operations are performed on the other four non-zero elements. After the last non-zero element has been processed, the accumulated BitCount result is two, so the graph has two triangles.
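The worked example can be reproduced with a short Python sketch of the AND-plus-BitCount method (we assume each edge is stored once with i < j, matching the example's five non-zero elements; `count_triangles` is our name):

```python
# Bitwise TC: rows and columns of the upper-triangular adjacency matrix
# are kept as bit vectors (Python ints here), edges play the role of the
# non-zero elements A_{i,j}.
def count_triangles(edges, n):
    rows = [0] * (n + 1)   # rows[i]: bit k set iff edge (i, k), i < k
    cols = [0] * (n + 1)   # cols[j]: bit k set iff edge (k, j), k < j
    for i, j in edges:
        rows[i] |= 1 << j
        cols[j] |= 1 << i
    total = 0
    for i, j in edges:                    # iterate the non-zero elements
        common = rows[i] & cols[j]        # bitwise AND of Row_i and Col_j
        total += bin(common).count("1")   # BitCount
    return total

edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]  # the 4-vertex example
print(count_triangles(edges, 4))  # -> 2
```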
The proposed TC method has the following advantages. First, it avoids time-consuming multiplications: when the operands are either zero or one, the multiplication can be implemented with AND logic. Second, the proposed method does not need to store intermediate results larger than one (such as the elements of A × A), which are cumbersome to store and compute with. Third, it does not need complex control logic. Given these three advantages, the proposed TC method is well suited to in-memory implementation.
IV Sparsity-Aware Processing-In-MRAM Architecture
To alleviate the memory bottleneck caused by frequent data transfers in traditional TC algorithms, we implement an in-memory TC accelerator based on the novel TC method presented in the previous section. Next, we will discuss several dataflow mapping techniques to minimize space requirements, data transfers and computation in order to accelerate the in-memory TC computation.
IV-A Data Reuse and Exchange
Recall that the proposed TC method iterates over each non-zero element in the adjacency matrix, and loads corresponding rows and columns into computational memory for AND operation, followed by a BitCount process. When the size of the computational memory array is given, it is important to reduce the unnecessary space and memory operations. We observe that for AND computation, the non-zero elements in a row reuse the same row, and the non-zero elements in a column reuse the same column. The proposed data reuse mechanism is based on this observation.
Assume the non-zero elements are iterated by rows. The currently processed row then only needs to be loaded once, while the corresponding columns are loaded in sequence. Once all the non-zero elements in a row have been processed, that row will no longer be used in future computation, so we can overwrite it with the next row to be processed. The columns, however, might be used again by non-zero elements from other rows. Therefore, before loading a column into memory for computation, we first check whether it has already been loaded; if not, the column is loaded into a spare memory space. If the memory is full, we select one column to be replaced by the current column. We choose the least recently used (LRU) column for replacement; more optimized replacement strategies are possible.
As demonstrated in Fig. 2, in steps 1 and 2, the two non-zero elements A_{1,2} and A_{1,3} of Row_1 are processed, and the corresponding columns Col_2 and Col_3 are loaded into memory. Next, while processing A_{2,3} and A_{2,4}, Row_2 overwrites Row_1 and reuses the already-loaded Col_3 in step 3, then loads Col_4 in step 4. In step 5, to process A_{3,4}, Row_2 is overwritten by Row_3, and Col_4 is reused. Overwriting the rows and reusing the columns effectively reduce unnecessary space utilization and memory WRITE operations.
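The reuse policy above can be sketched as a small cache simulation (the `capacity` parameter and function name are ours; the element order follows the Fig. 2 example):

```python
from collections import OrderedDict

# Row-overwrite / column-LRU sketch: the current row occupies one fixed
# slot and is overwritten by the next row, so only columns compete for
# the cache. Returns (hits, misses) over the column accesses.
def simulate_reuse(elements_by_row, capacity):
    cache = OrderedDict()          # columns currently resident, LRU order
    hits = misses = 0
    for row, cols in elements_by_row:
        for col in cols:
            if col in cache:
                hits += 1
                cache.move_to_end(col)         # mark as recently used
            else:
                misses += 1
                if len(cache) >= capacity:
                    cache.popitem(last=False)  # evict least recently used
                cache[col] = True
    return hits, misses

# Non-zero elements of the example, grouped by row: A12, A13, A23, A24, A34
example = [(1, [2, 3]), (2, [3, 4]), (3, [4])]
print(simulate_reuse(example, capacity=2))  # -> (2, 3)
```

In the example, Col_3 and Col_4 are each loaded once and reused once, matching the two hits reported above.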
IV-B Data Slicing
To utilize the sparsity of the graph to reduce the memory requirement and unnecessary computation, we propose a data slicing strategy for graph data compression.
Assume Row_i is the i-th row and Col_j is the j-th column of the adjacency matrix A of graph G, and let n be the number of vertices. The slice size is s (each slice contains s bits), so each row and column has ⌈n/s⌉ slices. The k-th slice of Row_i, represented as Row_i^k, is the set of elements {A_{i,(k-1)·s+1}, …, A_{i,k·s}}. We define a slice to be valid if and only if it contains at least one non-zero element.
Recall that in our proposed TC method, for each non-zero element in the adjacency matrix, we compute the AND result of the corresponding row and column. With row and column slicing, we perform the AND operation at the granularity of slices. For each non-zero element, we only process the valid slice pairs: only when both the row slice and the column slice are valid do we load the slice pair into the computational memory array and perform the AND operation.
Fig. 3 demonstrates an example: after row and column slicing, only two slice pairs are valid, so we only load these slices for AND computation. This scheme can reduce the required computation significantly, especially for large sparse graphs.
Memory requirement of the compressed graph data. With the proposed row and column slicing strategy, we need to store the indexes of the valid slices and the data of these slices. Assuming the number of valid slices is n_v, the slice size is s, and an integer (four bytes) is used to store each valid slice index, the space needed for all valid slice indexes is 4·n_v bytes, and the space needed to store the data of the valid slices is n_v·s/8 bytes. Therefore, the overall space needed for graph G is n_v·(4 + s/8) bytes, which is determined by the sparsity of G and the slice size; the slice size we adopt is reported in the experimental result section. Given that most graphs are highly sparse, the space needed to store the graph can be trivial. Moreover, the proposed compressed graph data format is friendly for direct mapping onto computational memory arrays to perform in-memory logic computation.
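A minimal Python sketch of the slicing and the space estimate (rows are given as bit strings; `compress` and `storage_bytes` are our illustrative names):

```python
# Row/column slicing: keep only slices that contain at least one '1',
# indexed by (row_index, slice_index). slice size s is a free parameter.
def compress(bitrows, s):
    valid = {}
    for r, bits in enumerate(bitrows):
        for k in range(0, len(bits), s):
            chunk = bits[k:k + s]
            if "1" in chunk:                 # valid slice
                valid[(r, k // s)] = chunk
    return valid

def storage_bytes(num_valid, s):
    # 4 bytes per slice index plus s bits (s/8 bytes) of slice data
    return num_valid * (4 + s / 8)

rows = ["00110000", "00000000", "10000001"]
v = compress(rows, s=4)
print(sorted(v))                  # -> [(0, 0), (2, 0), (2, 1)]
print(storage_bytes(len(v), 4))   # -> 13.5
```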
IV-C Processing-In-MRAM Architecture
Fig. 4 demonstrates the overall architecture of processing-in-MRAM. The graph data will be sliced and compressed, and represented by the valid slice index and corresponding slice data. According to the valid slice indexes in the data buffer, we load the corresponding valid slice pairs into computational STT-MRAM array for bitwise computation. The storage status of STT-MRAM array (such as which slices have been loaded) is also recorded in the data buffer and utilized for data reuse and exchange.
As for the computational memory array organization, each chip consists of multiple banks that work as computational arrays. Each bank comprises multiple computational memory sub-arrays, which are connected to a global row decoder and a shared global row buffer. The read circuits and write drivers of the memory array are modified to support bitwise logic functions. Specifically, the operands are stored in different rows of the memory arrays, and the rows associated with the operands are activated simultaneously for computing. The sense amplifiers are enhanced with AND reference circuits to realize either READ or AND operations: by generating the corresponding reference current, the output of a sense amplifier is the AND result of the data stored in the enabled WLs.
IV-D Pseudo-code for In-Memory TC Acceleration
Algorithm 1 demonstrates the pseudo-code for TC acceleration with the proposed processing-in-MRAM architecture. It iterates over each edge of the graph, partitions the corresponding rows and columns into slices, and then loads the valid slice pairs onto computational memory for AND and BitCount computation. If there is not enough memory space, it adopts an LRU strategy and replaces the least recently used slice.
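Putting the pieces together, the following is a compact software sketch of this flow (slice handling only; the LRU replacement step is omitted for brevity, and all names are ours):

```python
# Sketch of the overall flow: iterate the edges, slice the corresponding
# row and column, and AND/BitCount only the valid slice pairs.
def tc_sliced(edges, n, s):
    rows = [[0] * n for _ in range(n + 1)]   # rows[i][k-1] = A_{i,k}, 1-based vertices
    cols = [[0] * n for _ in range(n + 1)]   # cols[j][k-1] = A_{k,j}
    for i, j in edges:                       # each edge stored once, i < j
        rows[i][j - 1] = 1
        cols[j][i - 1] = 1

    def valid_slices(v):
        # keep only slices that contain at least one non-zero bit
        return {k // s: v[k:k + s] for k in range(0, n, s) if any(v[k:k + s])}

    total = 0
    for i, j in edges:
        rs, cs = valid_slices(rows[i]), valid_slices(cols[j])
        for k in rs.keys() & cs.keys():      # only valid slice pairs are loaded
            total += sum(a & b for a, b in zip(rs[k], cs[k]))  # AND + BitCount
    return total

edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
print(tc_sliced(edges, n=4, s=2))  # -> 2
```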
V Experimental Results
V-a Experimental Setup
To validate the effectiveness of the proposed approaches, we conduct comprehensive device-to-architecture evaluations with two in-house simulators. At the device level, we jointly use the Brinkman model and the Landau-Lifshitz-Gilbert (LLG) equation to characterize the MTJ. The key parameters for the MTJ simulation are listed in Table I. For the circuit-level simulation, we design a Verilog-A model of the 1T1R STT-MRAM device and characterize the circuit with the FreePDK CMOS library. We design a bit-counter module in Verilog HDL to obtain the number of non-zero elements in a vector: the vector is split into sub-vectors, each sub-vector is fed into a look-up table to get its count of non-zero elements, and the counts of all sub-vectors are summed. We synthesize the module with the Synopsys tool chain and conduct post-synthesis simulation based on the FreePDK library. After obtaining the device-level simulation results, we integrate the parameters into the open-source NVSim simulator and obtain the memory array performance. In addition, we develop a simulator in Java for the processing-in-MRAM architecture, which simulates the proposed function mapping, data slicing, and data mapping strategies. Finally, a behavioural-level simulator, also written in Java, takes the architecture-level results and the memory array performance to calculate the latency and energy spent by the TC in-memory accelerator. To provide a solid comparison with other accelerators, we select real-world graphs from the SNAP dataset (see TABLE II) and run the comparative baseline intersection-based algorithm on an Inspur blade system with the Spark GraphX framework on a single-core Intel E5430 CPU. Our TC in-memory acceleration algorithm also runs on a single-core CPU with a fixed-capacity STT-MRAM computational array.
TABLE I: Key parameters for MTJ simulation: MTJ surface length and width, spin Hall angle, resistance-area product of the MTJ, oxide barrier thickness, Gilbert damping constant, and perpendicular magnetic anisotropy.
TABLE II: Real-world graph datasets from SNAP, listing the number of vertices, edges, and triangles of each dataset.
V-B Benefits of Data Reuse and Exchange
TABLE III shows the memory space required for the bitwise computation without incurring any data exchange. Even the largest graph, com-lj, needs only a moderate amount of memory, and on average the per-vertex memory requirement of the in-memory computation is small.
When the STT-MRAM computational memory is smaller than the sizes listed in TABLE III, data exchange happens. For the memory capacity used in our experiments, the three largest graphs have to perform data exchange, as shown in Fig. 5. The figure also lists the percentages of data hits and data misses. Recall that the first time a data slice is loaded it is always a miss, while a data hit implies that the slice data has already been loaded; the data hit ratio therefore equals the fraction of memory WRITE operations saved by the proposed data reuse strategy.
V-C Benefits of Data Slicing
As shown in TABLE IV, only a small fraction of the slices in the five largest graphs are valid. Therefore, the proposed data slicing strategy significantly reduces the required computation.
V-D Performance and Energy Results
TABLE V compares the performance of our proposed in-memory TC accelerator against a CPU baseline implementation and existing GPU and FPGA accelerators. One can see a dramatic reduction of the execution time in the last column compared with the previous three. Indeed, even without PIM, data slicing, reuse, and exchange alone yield a substantial speedup over the baseline CPU implementation; with PIM, a further acceleration is obtained, and the proposed design also outperforms the GPU and FPGA accelerators. It is important to note that we achieve this with a single-core CPU and a modest STT-MRAM computational array.
TABLE V: Execution time comparison on each dataset: CPU baseline, GPU, FPGA, and this work.
In this paper, we propose a new triangle counting (TC) method that uses massive bitwise logic computation, making it suitable for in-memory implementation. We further propose a sparsity-aware processing-in-MRAM architecture for efficient in-memory TC acceleration: through data slicing, the required computation is substantially reduced, the compressed graph data can be directly mapped onto the STT-MRAM computational memory array for bitwise operations, and the proposed data reuse and exchange strategy eliminates a large fraction of the memory WRITE operations. Device-to-architecture co-simulation demonstrates that the proposed TC in-memory accelerator outperforms state-of-the-art GPU and FPGA accelerators in execution time and achieves a significant energy efficiency improvement over the FPGA accelerator.
Moreover, the proposed graph data compression and data mapping strategies are not restricted to STT-MRAM or to the TC problem; they can also be applied to in-memory accelerators built on other non-volatile memories.
-  M. A. Hasan and V. S. Dave. Triangle counting in large networks: a review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(2):e1226, 2018.
-  V. S Mailthody, K. Date, Z. Qureshi, C. Pearson, R. Nagi, J. Xiong, and W. Hwu. Collaborative (CPU+GPU) algorithms for triangle counting and truss decomposition. In Proc. IEEE HPEC, pages 1–7, 2018.
-  S. Huang, M. El-Hadedy, C. Hao, Q. Li, V. S. Mailthody, K. Date, J. Xiong, D. Chen, R. Nagi, and W. Hwu. Triangle counting and truss decomposition using FPGA. In Proc. IEEE HPEC, pages 1–7, 2018.
-  V. Seshadri and O. Mutlu. In-DRAM bulk bitwise execution engine. CoRR, abs/1905.09822, 2019.
-  S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie. Pinatubo: a processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories. In Proc. ACM/IEEE DAC, pages 173:1–173:6, 2016.
-  B. Li, B. Yan, and H. Li. An overview of in-memory processing with emerging non-volatile memory for data-intensive applications. In Proc. ACM GLSVLSI, pages 381–386, 2019.
-  S. Angizi, J. Sun, W. Zhang, and D. Fan. AlignS: A processing-in-memory accelerator for DNA short read alignment leveraging SOT-MRAM. In Proc. ACM/IEEE DAC, pages 1–6, 2019.
-  M. Wang, W. Cai, K. Cao, J. Zhou, J. Wrona, S. Peng, H. Yang, J. Wei, W. Kang, Y. Zhang, and W. Zhao. Current-induced magnetization switching in atom-thick tungsten engineered perpendicular magnetic tunnel junctions with large tunnel magnetoresistance. Nature communications, 9(1):671, 2018.
-  S. Jain, A. Ranjan, K. Roy, and A. Raghunathan. Computing in memory with spin-transfer torque magnetic RAM. IEEE Transactions on Very Large Scale Integration Systems (VLSI), 26(3):470–483, 2018.
-  L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen. GraphR: Accelerating graph processing using ReRAM. In Proc. IEEE HPCA, pages 531–543, 2018.
-  S. Angizi, J. Sun, W. Zhang, and D. Fan. GraphS: A graph processing accelerator leveraging SOT-MRAM. In Proc. DATE, pages 378–383, 2019.
-  G. Dai, T. Huang, Y. Wang, H. Yang, and J. Wawrzynek. GraphSAR: A sparsity-aware processing-in-memory architecture for large-scale graph processing on ReRAMs. In Proc. ASPDAC, pages 120–126, 2019.
-  Y. Zhuo, C. Wang, M. Zhang, R. Wang, D. Niu, Y. Wang, and X. Qian. GraphQ: Scalable pim-based graph processing. In Proc. IEEE MICRO, pages 712–725, 2019.
-  J. Yang, X. Wang, Q. Zhou, Z. Wang, H. Li, Y. Chen, and W. Zhao. Exploiting spin-orbit torque devices as reconfigurable logic for circuit obfuscation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 38(1):57–69, 2018.
-  J. Yang, P. Wang, Y. Zhang, Y. Cheng, W. Zhao, Y. Chen, and H. H. Li. Radiation-induced soft error analysis of STT-MRAM: A device to circuit approach. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 35(3):380–393, 2015.
-  X. Dong, C. Xu, Y. Xie, and N. P. Jouppi. NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 31(7):994–1007, 2012.
-  J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.