I Introduction
Triangles are a basic substructure of networks and play critical roles in network analysis. Due to their importance, the triangle counting (TC) problem, which counts the number of triangles in a given graph, is essential for analyzing networks. It is generally considered the first fundamental step in calculating metrics such as the clustering coefficient and transitivity ratio, and it underlies tasks such as community discovery, link prediction, and spam filtering [1]. The TC problem is not conceptually hard, but TC algorithms are memory-bandwidth intensive and thus time-consuming. As a result, researchers from both academia and industry have proposed many TC acceleration methods ranging from sequential to parallel, single-machine to distributed, and exact to approximate. From the computing hardware perspective, these acceleration strategies are generally executed on CPUs, GPUs or FPGAs, and are based on the von Neumann architecture [1, 2, 3]. However, because most graph processing algorithms have a low computation-to-memory ratio and highly random data access patterns, there are frequent data transfers between the computational units and memory components, which consume a large amount of time and energy.
The in-memory computing paradigm performs computation where the data resides. It can save most of the off-chip data communication energy and latency by exploiting the large inherent bandwidth and inherent parallelism of internal memory [4, 5]. As a result, in-memory computing has emerged as a viable way to carry out computationally expensive and memory-intensive tasks [6, 7]. This becomes even more promising when integrated with emerging non-volatile STT-MRAM memory technologies. This integration, called Processing-In-MRAM (PIM), offers fast write speed, low write energy, and high write endurance, among many other benefits [8, 9].
In the literature, there have been some explorations of in-memory graph algorithm acceleration [10, 11, 12, 13]; however, existing TC algorithms, including the intersection-based and matrix-multiplication-based ones, cannot be directly implemented in memory. For large sparse graphs, a highly efficient PIM architecture, efficient graph data compression, and data mapping mechanisms are all critical for the efficiency of PIM acceleration. Although there are compression formats for sparse graphs, such as compressed sparse column (CSC), compressed sparse row (CSR), and coordinate list (COO) [10], these representations cannot be directly applied to in-memory computation either. In this paper, we propose and design the first in-memory TC accelerator that overcomes the above barriers. Our main contributions can be summarized as follows:

We propose a novel TC method that uses massive bitwise operations to enable in-memory implementations.

We propose strategies for data reuse and exchange, as well as data slicing, for efficient graph data compression and mapping onto in-memory computation architectures.

We build a TC accelerator with the sparsity-aware processing-in-MRAM architecture. A device-to-architecture co-simulation demonstrates highly encouraging results.
The rest of the paper is organized as follows: Section II provides preliminary knowledge of TC and in-memory computing. Section III introduces the proposed TC method with bitwise operations, and Section IV elaborates a sparsity-aware processing-in-MRAM architecture that enables highly efficient PIM acceleration. Section V demonstrates the experimental results and Section VI concludes.
II Preliminary
II-A Triangle Counting
Given a graph, the triangle counting (TC) problem seeks to determine the number of triangles. The sequential algorithms for TC can be classified into two groups. In the matrix-multiplication-based algorithms, a triangle is a closed path of length three, namely a path over three vertices that begins and ends at the same vertex. If $A$ is the adjacency matrix of graph $G$, then $(A^3)_{ii}$ represents the number of paths of length three beginning and ending at vertex $i$. Given that a triangle has three vertices and will be counted once for each vertex, and that the graph is undirected (that is, each triangle is also counted in both traversal directions), the number of triangles in $G$ can be obtained as $\mathrm{trace}(A^3)/6$, where $\mathrm{trace}$ is the sum of the elements on the main diagonal of a matrix. In the set-intersection-based algorithms, the computation iterates over each edge and finds the common elements of the adjacency lists of the head and tail vertices. Many CPU, GPU and FPGA based optimization techniques have been proposed [1, 2, 3]. These works show promising results for accelerating TC; however, these strategies all suffer from the performance and energy bottlenecks brought by the significant amount of data transfers in TC.

II-B In-Memory Computing with STT-MRAM
STT-MRAM is a promising candidate for next-generation main memory because of properties such as near-zero leakage, non-volatility, high endurance, and compatibility with the CMOS manufacturing process [8]. In particular, prototype STT-MRAM chip demonstrations and commercial MRAM products have been made available by companies such as Everspin and TSMC. STT-MRAM stores data as magnetic resistance instead of conventional charge-based storage and access. This enables MRAM to provide inherent computing capabilities for bitwise logic with minute changes to the peripheral circuitry [9][14].
As the left part of Fig. 1 shows, a typical STT-MRAM bitcell consists of an access transistor and a Magnetic Tunnel Junction (MTJ), which is controlled by a bitline (BL), wordline (WL) and sourceline (SL). The relative magnetic orientations of the pinned ferromagnetic layer (PL) and the free ferromagnetic layer (FL) can be stable in parallel (P state) or antiparallel (AP state), corresponding to low resistance ($R_P$) and high resistance ($R_{AP}$, $R_{AP} > R_P$), respectively. A READ operation is done by enabling the WL signal, applying a voltage across BL and SL, and sensing the current ($I_P$ or $I_{AP}$) that flows through the MTJ. By comparing the sensed current with a reference current ($I_{ref}$, $I_{AP} < I_{ref} < I_P$), the data stored in the MTJ cell (logic '0' or logic '1') can be read out. A WRITE operation is performed by enabling WL, then applying an appropriate voltage across BL and SL to pass a current that is greater than the critical MTJ switching current. To perform a bitwise logic operation, as demonstrated in the right part of Fig. 1, two wordlines $WL_i$ and $WL_j$ are enabled simultaneously and a read voltage is applied across the bitlines and SL; the current that feeds into the $k$th sense amplifier (SA) is then the summation of the currents flowing through the two selected cells on bitline $k$. With different reference sensing currents, various logic functions of the enabled wordlines can be implemented.
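The sensing scheme above can be illustrated with a toy numerical model. Here logic '1' is stored as the high-resistance AP state, so a '1' cell passes the lower current; the current values and the reference placement are assumptions chosen purely for illustration:

```python
# Per-cell sense currents (arbitrary units); I_P > I_AP because the
# parallel (P) state has the lower resistance.
I_P, I_AP = 10.0, 4.0

def cell_current(bit):
    # a cell storing logic '1' (AP state) passes the lower current I_AP
    return I_AP if bit else I_P

def sense_and(bit_i, bit_j):
    """Two-row AND: wordlines i and j are enabled together, so the sense
    amplifier sees the sum of both cell currents on the shared bitline.
    The sum is smallest (2*I_AP) only when both cells store '1', so a
    reference placed between 2*I_AP and I_AP + I_P implements AND."""
    i_ref = (2 * I_AP + (I_AP + I_P)) / 2  # midpoint reference
    return int(cell_current(bit_i) + cell_current(bit_j) < i_ref)

print([sense_and(a, b) for a in (0, 1) for b in (0, 1)])  # -> [0, 0, 0, 1]
```

Moving the reference between $I_{AP}+I_P$ and $2I_P$ would instead implement OR, which is how different reference currents yield different logic functions.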
III Triangle Counting with Bitwise Operations
In this section, we seek to perform TC with massive bitwise operations, which is the enabling technique for an in-memory TC accelerator. Let $A$ be the adjacency matrix of an undirected graph $G$ with $n$ vertices, where $A_{ij}$ indicates whether there is an edge between vertices $i$ and $j$. If we compute $A^2$, then the value of $(A^2)_{ij}$ represents the number of distinct paths of length two between vertices $i$ and $j$. In the case that there is an edge between vertex $i$ and vertex $j$, and $i$ can also reach $j$ through a path of length two with intermediate vertex $k$, then vertices $i$, $j$, and $k$ form a triangle. As a result, the number of triangles $N_\Delta$ in $G$ is determined by the sum of the elements of $A \circ A^2$ (the symbol '$\circ$' denotes element-wise multiplication here); since each triangle is counted once per vertex and in both traversal directions, namely

$$N_\Delta = \frac{1}{6}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(A \circ A^2\right)_{ij} \qquad (1)$$

Since $A_{ij}$ is either zero or one, we have

$$\left(A \circ A^2\right)_{ij} = A_{ij}\cdot\sum_{k=1}^{n}A_{ik}\cdot A_{kj} \qquad (2)$$

According to Equation (2),

$$N_\Delta = \frac{1}{6}\sum_{A_{ij}\neq 0}\ \sum_{k=1}^{n}A_{ik}\cdot A_{kj} \qquad (3)$$

Because each element of $A$ is either zero or one, the bitwise Boolean AND result is equal to that of the mathematical multiplication, thus

$$N_\Delta = \frac{1}{6}\sum_{A_{ij}\neq 0}\mathrm{BitCount}\left(A_{i,:}\ \mathrm{AND}\ (A_{:,j})^{T}\right) \qquad (4)$$

in which $\mathrm{BitCount}$ returns the number of '1's in a vector consisting of '0's and '1's; for example, $\mathrm{BitCount}(\text{'0110'}) = 2$. Therefore, TC can be completed with only AND and BitCount operations (massive numbers of them for large graphs). Specifically, for each nonzero element $A_{ij}$, the $i$th row ($A_{i,:}$) and the $j$th column ($A_{:,j}$) are combined with an AND operation, and the result is sent to a bit-counter module for accumulation. For an undirected graph, each edge needs to be processed only once: traversing only the nonzero elements above the main diagonal, with each row restricted to neighbors of larger index and each column to neighbors of smaller index, counts every triangle exactly once. Once all these nonzero elements are processed, the accumulated BitCount value is exactly the number of triangles in the graph.
Fig. 2 demonstrates an illustrative example of the proposed TC method. As the left part of the figure shows, the graph has four vertices, five edges and two triangles, and the adjacency matrix $A$ is given. The five nonzero elements above the main diagonal of $A$ correspond to the five edges. For the first of them, the row '0110' and the column '1000' are combined with an AND operation; the result '0000' is sent to the bit counter and contributes zero. Similar operations are performed on the other four nonzero elements. After the last nonzero element is processed, the accumulated BitCount result is two, so the graph has two triangles.
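A direct software sketch of Equation (4) (using the full symmetric matrix and the division by six, rather than the single-pass edge traversal) can be written in a few lines. Packing each row as an integer bitmask is an illustrative encoding, not the paper's memory layout:

```python
def count_triangles(adj_rows):
    """Count triangles via AND + BitCount, following Equation (4).

    adj_rows[i] is row i of the symmetric adjacency matrix A, packed as an
    integer bitmask; since A is symmetric, column j equals row j.
    """
    n = len(adj_rows)
    total = 0
    for i in range(n):
        for j in range(n):
            if adj_rows[i] >> j & 1:  # for each nonzero A_ij ...
                # ... AND row i with column j, then BitCount the result
                total += bin(adj_rows[i] & adj_rows[j]).count("1")
    return total // 6  # each triangle is counted six times

# hypothetical 4-vertex graph with edges (0,1), (0,2), (0,3), (1,2), (2,3);
# its two triangles are {0,1,2} and {0,2,3}
A = [0b1110, 0b0101, 0b1011, 0b0101]
print(count_triangles(A))  # -> 2
```

The inner AND/BitCount pair is exactly the operation that is later offloaded to the memory array.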
The proposed TC method has the following advantages. First, it avoids time-consuming multiplication: when the operands are either zero or one, the multiplication can be implemented with AND logic. Second, the proposed method does not need to store intermediate results that are larger than one (such as the elements of $A^2$), which are cumbersome to store and compute with in memory. Third, it does not need complex control logic. Given these three advantages, the proposed TC method is well suited to in-memory implementation.
IV Sparsity-Aware Processing-In-MRAM Architecture
To alleviate the memory bottleneck caused by frequent data transfers in traditional TC algorithms, we implement an in-memory TC accelerator based on the novel TC method presented in the previous section. Next, we discuss several dataflow mapping techniques that minimize space requirements, data transfers and computation in order to accelerate in-memory TC.
IV-A Data Reuse and Exchange
Recall that the proposed TC method iterates over each nonzero element in the adjacency matrix, and loads the corresponding rows and columns into computational memory for the AND operation, followed by a BitCount process. When the size of the computational memory array is fixed, it is important to avoid unnecessary space usage and memory operations. We observe that for the AND computation, the nonzero elements in a row reuse the same row, and the nonzero elements in a column reuse the same column. The proposed data reuse mechanism is based on this observation.
Assume that the nonzero elements are iterated by rows; then the currently processed row only needs to be loaded once, while the corresponding columns are loaded in sequence. Once all the nonzero elements in a row have been processed, this row will no longer be used in future computation, so we can overwrite it with the next row to be processed. The columns, however, might be used again by nonzero elements from other rows. Therefore, before loading a certain column into memory for computation, we first check whether this column has already been loaded; if not, the column is loaded into a spare memory space. In case the memory is full, we need to select one column to be replaced by the current column. We choose the least recently used (LRU) column for replacement; more optimized replacement strategies are possible.
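The column handling above can be sketched as a small LRU cache. The class name, capacity parameter and load callback are illustrative assumptions, not the paper's interface:

```python
from collections import OrderedDict

class ColumnCache:
    """Sketch of the data reuse/exchange policy: columns loaded into the
    computational array are kept and reused; when the array is full, the
    least recently used (LRU) column is evicted."""

    def __init__(self, capacity, load_column):
        self.capacity = capacity        # number of column slots in memory
        self.load_column = load_column  # fetches a column from storage (a WRITE)
        self.slots = OrderedDict()      # column index -> column data, LRU order
        self.writes = self.hits = 0

    def get(self, j):
        if j in self.slots:             # data hit: column already resident
            self.slots.move_to_end(j)   # mark as most recently used
            self.hits += 1
            return self.slots[j]
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)   # evict the LRU column
        self.slots[j] = self.load_column(j)  # data miss: WRITE into the array
        self.writes += 1
        return self.slots[j]
```

With capacity 2 and column accesses 0, 1, 0, 2, 1, the second access to column 0 is a hit, while column 1 has been evicted by the time it is requested again, so four WRITEs and one hit occur.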
As demonstrated in Fig. 2, when the nonzero elements of the first row are processed, the corresponding columns are loaded into memory one by one. When processing moves on to the nonzero elements of the next row, that row overwrites the previous one, while any column that is already resident in memory is reused rather than reloaded. Overwriting the rows and reusing the columns effectively reduces unnecessary space utilization and memory WRITE operations.
IV-B Data Slicing
To exploit the sparsity of the graph to reduce the memory requirement and avoid unnecessary computation, we propose a data slicing strategy for graph data compression.
Assume $A_{i,:}$ is the $i$th row and $A_{:,j}$ is the $j$th column of the adjacency matrix of graph $G$, and the slice size is $c$ (each slice contains $c$ bits); then each row and column consists of $\lceil n/c \rceil$ slices. The $k$th slice of $A_{i,:}$ is the set of elements $\{A_{i,(k-1)c+1}, \ldots, A_{i,kc}\}$. We define a slice to be valid if and only if it contains at least one nonzero element.
Recall that in our proposed TC method, for each nonzero element in the adjacency matrix, we compute the AND result of the corresponding row and column. With row and column slicing, we perform the AND operation at the granularity of slices. For each nonzero element, we only process the valid slice pairs: only when both the row slice and the column slice are valid do we load the slice pair into the computational memory array and perform the AND operation.
Fig. 3 demonstrates an example: after row and column slicing, only a small number of slice pairs are valid, and therefore only these slices are loaded for AND computation. This scheme can reduce the needed computation significantly, especially for large sparse graphs.
Memory requirement of the compressed graph data. With the proposed row and column slicing strategy, we need to store the indexes of the valid slices and the data of these slices. Assuming that the number of valid slices is $N_v$ and the slice size is $c$ bits, and that we use an integer (four bytes) to store each valid slice index, the space needed for all valid slice indexes is $4N_v$ bytes, and the space needed to store the data of the valid slices is $N_v \cdot c/8$ bytes. Therefore, the overall space needed for graph $G$ is $(4 + c/8)\,N_v$ bytes, which is determined by the sparsity of $G$ and the slice size; the slice size used in this paper is specified in the experimental result section. Given that most graphs are highly sparse, the space needed to store the graph can be trivial. Moreover, the proposed compressed format is friendly for direct mapping onto the computational memory arrays to perform in-memory logic computation.
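The slicing and the size formula above can be sketched as follows; the function names are hypothetical, and rows are given as 0/1 strings purely for illustration:

```python
def compress_row(row_bits, c):
    """Split a row of the adjacency matrix into slices of c bits and keep
    only the valid (nonzero) slices, as (slice index, slice data) pairs."""
    slices = []
    for k in range(0, len(row_bits), c):
        s = row_bits[k:k + c]
        if "1" in s:                 # a slice is valid iff it contains a '1'
            slices.append((k // c, s))
    return slices

def compressed_size_bytes(num_valid_slices, c):
    # 4 bytes per valid-slice index plus c/8 bytes of slice data per slice
    return num_valid_slices * (4 + c / 8)

row = "000011000000000010000000"          # 24-bit row, two valid 8-bit slices
print(compress_row(row, 8))               # -> [(0, '00001100'), (2, '10000000')]
print(compressed_size_bytes(2, 8))        # -> 10.0
```

For a highly sparse graph, the number of valid slices grows roughly with the number of edges rather than with $n^2$, which is why the compressed footprint stays small.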
IV-C Processing-In-MRAM Architecture
Fig. 4 demonstrates the overall processing-in-MRAM architecture. The graph data is sliced and compressed, and represented by the valid slice indexes and the corresponding slice data. According to the valid slice indexes in the data buffer, we load the corresponding valid slice pairs into the computational STT-MRAM array for bitwise computation. The storage status of the STT-MRAM array (such as which slices have been loaded) is also recorded in the data buffer and used for data reuse and exchange.
As for the computational memory array organization, each chip consists of multiple banks that work as computational arrays. Each bank comprises multiple computational memory sub-arrays, which are connected to a global row decoder and a shared global row buffer. The read circuit and write driver of the memory array are modified to process bitwise logic functions. Specifically, the operands are stored in different rows of the memory arrays, and the rows associated with the operands are activated simultaneously for computing. The sense amplifiers are enhanced with AND reference circuits to realize either READ or AND operations. By generating the corresponding reference current, the output of the sense amplifier is the AND result of the data stored in the enabled WLs.
IV-D Pseudocode for In-Memory TC Acceleration
Algorithm 1 demonstrates the pseudocode for TC acceleration with the proposed processing-in-MRAM architecture. It iterates over each edge of the graph, partitions the corresponding rows and columns into slices, then loads the valid slice pairs into computational memory for AND and BitCount computation. In case there is not enough memory space, it adopts an LRU strategy to replace the least recently used slice.
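A software sketch of this flow (without the LRU replacement bookkeeping) might look as follows; the slice width c and the integer-bitmask row encoding are illustrative assumptions:

```python
def tc_sliced(adj_rows, n, c):
    """Edge-by-edge TC with slice-level AND/BitCount: for each nonzero
    A_ij, only slice pairs where both the row slice and the column slice
    are valid (nonzero) are loaded and combined. adj_rows[i] is row i as
    an int bitmask; column j equals row j since A is symmetric."""
    mask = (1 << c) - 1
    total = 0
    for i in range(n):
        for j in range(n):
            if adj_rows[i] >> j & 1:             # edge (i, j)
                for k in range(0, n, c):         # walk the slices
                    row_slice = adj_rows[i] >> k & mask
                    col_slice = adj_rows[j] >> k & mask
                    if row_slice and col_slice:  # valid slice pair only
                        total += bin(row_slice & col_slice).count("1")
    return total // 6  # each triangle is counted six times
```

Skipping invalid slice pairs never changes the result, because a pair with an all-zero slice contributes zero to the BitCount anyway; it only removes useless loads and AND operations.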
V Experimental Results
V-A Experimental Setup
To validate the effectiveness of the proposed approaches, comprehensive device-to-architecture evaluations along with two in-house simulators are developed. At the device level, we jointly use the Brinkman model and the Landau-Lifshitz-Gilbert (LLG) equation to characterize the MTJ [15]. The key parameters for the MTJ simulation are demonstrated in TABLE I. For the circuit-level simulation, we design a Verilog-A model of the 1T1R STT-MRAM device, and characterize the circuit with the FreePDK CMOS library. We design a bit-counter module in Verilog HDL to obtain the number of nonzero elements in a vector. Specifically, we split the vector into fixed-width sub-vectors, feed each sub-vector into a lookup table to get its nonzero-element count, and then sum up the counts of all sub-vectors. We synthesize the module with Synopsys tools and conduct post-synthesis simulation based on the FreePDK library. After obtaining the device-level simulation results, we integrate the parameters into the open-source NVSim simulator [16] and obtain the memory array performance. In addition, we develop a simulator in Java for the processing-in-MRAM architecture, which simulates the proposed function mapping, data slicing and data mapping strategies. Finally, a behavioral-level simulator is developed in Java that takes the architectural-level results and the memory array performance to calculate the latency and energy spent by the TC in-memory accelerator. To provide a solid comparison with other accelerators, we select real-world graphs from the SNAP dataset collection [17] (see TABLE II), and run the comparative baseline intersection-based algorithm on an Inspur blade system with the Spark GraphX framework on a single-core Intel E5430 CPU. Our TC in-memory acceleration algorithm also runs on a single-core CPU, together with the STT-MRAM computational array.

Parameter  Value

MTJ Surface Length  
MTJ Surface Width  
Spin Hall Angle  
ResistanceArea Product of MTJ  
Oxide Barrier Thickness  
TMR  
Saturation Field  
Gilbert Damping Constant  
Perpendicular Magnetic Anisotropy  
Temperature 
Dataset  # Vertices  # Edges  # Triangles

ego-Facebook  4039  88234  1612010
email-Enron  36692  183831  727044
com-Amazon  334863  925872  667129
com-DBLP  317080  1049866  2224385
com-Youtube  1134890  2987624  3056386
roadNet-PA  1088092  1541898  67150
roadNet-TX  1379917  1921660  82869
roadNet-CA  1965206  2766607  120676
com-LiveJournal  3997962  34681189  177820130
V-B Benefits of Data Reuse and Exchange
TABLE III shows the memory space required for the bitwise computation. For example, the largest graph, com-LiveJournal, needs 16.8 MB to avoid incurring any data exchange. On average, only a small amount of memory per vertex is needed for the in-memory computation.
Dataset  Memory (MB)

ego-Facebook  0.182
email-Enron  1.02
com-Amazon  7.4
com-DBLP  7.6
com-Youtube  16.8
roadNet-PA  9.96
roadNet-TX  12.38
roadNet-CA  16.78
com-LiveJournal  16.8
When the STT-MRAM computational memory size is smaller than the values listed in TABLE III, data exchange will happen. For example, with a smaller memory budget, the three largest graphs have to perform data exchange, as shown in Fig. 5. In this figure, we also list the percentages of data hits and data misses. Recall that the first time a data slice is loaded it is always a miss, and a data hit implies that the slice data has already been loaded. The hit ratio therefore directly measures the fraction of memory WRITE operations saved by the proposed data reuse strategy.
V-C Benefits of Data Slicing
As shown in TABLE IV, the average percentage of valid slices in the five largest graphs is only about 0.01%. Therefore, the proposed data slicing strategy can reduce the needed computation by orders of magnitude.
Dataset  Valid Slices

ego-Facebook  7.017%
email-Enron  1.607%
com-Amazon  0.014%
com-DBLP  0.036%
com-Youtube  0.013%
roadNet-PA  0.013%
roadNet-TX  0.010%
roadNet-CA  0.007%
com-LiveJournal  0.006%
V-D Performance and Energy Results
TABLE V compares the performance of the proposed in-memory TC accelerator against a CPU baseline implementation and existing GPU and FPGA accelerators. One can see a dramatic reduction of the execution time in the last column relative to the previous three columns. Indeed, even without PIM, data slicing, reuse, and exchange already achieve a substantial average speedup over the baseline CPU implementation. With PIM, another order-of-magnitude acceleration is obtained, and the proposed accelerator clearly outperforms the GPU and FPGA accelerators on the graphs they report. It is important to note that we achieve this with a single-core CPU and a modest STT-MRAM computational array.
Dataset  CPU  GPU [3]  FPGA [3]  This Work (w/o PIM)  This Work (TCIM)

ego-Facebook  5.399  0.15  0.093  0.169  0.005
email-Enron  9.545  0.146  0.22  0.8  0.021
com-Amazon  20.344  N/A  N/A  0.295  0.011
com-DBLP  20.803  N/A  N/A  0.413  0.027
com-Youtube  61.309  N/A  N/A  2.442  0.098
roadNet-PA  77.320  0.169  1.291  0.704  0.043
roadNet-TX  94.379  0.173  1.586  0.789  0.053
roadNet-CA  146.858  0.18  2.342  3.561  0.081
com-LiveJournal  820.616  N/A  N/A  33.034  2.006
VI Conclusion
In this paper, we propose a new triangle counting (TC) method that uses massive bitwise logic computation, making it suitable for in-memory implementation. We further propose a sparsity-aware processing-in-MRAM architecture for efficient in-memory TC acceleration: with data slicing, the required computation is reduced by orders of magnitude, while the compressed graph data can be directly mapped onto the STT-MRAM computational memory array for bitwise operations, and the proposed data reuse and exchange strategy eliminates a large fraction of the memory WRITE operations. We use device-to-architecture co-simulation to demonstrate that the proposed TC in-memory accelerator outperforms the state-of-the-art GPU and FPGA accelerators in execution time, and achieves a substantial energy efficiency improvement over the FPGA accelerator.
Moreover, the proposed graph data compression and data mapping strategies are not restricted to STT-MRAM or to the TC problem; they can also be applied to other in-memory accelerators based on other non-volatile memories.
References
 [1] M. A. Hasan and V. S. Dave. Triangle counting in large networks: a review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(2):e1226, 2018.
 [2] V. S. Mailthody, K. Date, Z. Qureshi, C. Pearson, R. Nagi, J. Xiong, and W. Hwu. Collaborative (CPU+GPU) algorithms for triangle counting and truss decomposition. In Proc. IEEE HPEC, pages 1–7, 2018.
 [3] S. Huang, M. El-Hadedy, C. Hao, Q. Li, V. S. Mailthody, K. Date, J. Xiong, D. Chen, R. Nagi, and W. Hwu. Triangle counting and truss decomposition using FPGA. In Proc. IEEE HPEC, pages 1–7, 2018.
 [4] V. Seshadri and O. Mutlu. In-DRAM bulk bitwise execution engine. CoRR, abs/1905.09822, 2019.
 [5] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie. Pinatubo: a processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories. In Proc. ACM/IEEE DAC, pages 173:1–173:6, 2016.
 [6] B. Li, B. Yan, and H. Li. An overview of in-memory processing with emerging non-volatile memory for data-intensive applications. In Proc. ACM GLSVLSI, pages 381–386, 2019.
 [7] S. Angizi, J. Sun, W. Zhang, and D. Fan. AlignS: A processing-in-memory accelerator for DNA short read alignment leveraging SOT-MRAM. In Proc. ACM/IEEE DAC, pages 1–6, 2019.
 [8] M. Wang, W. Cai, K. Cao, J. Zhou, J. Wrona, S. Peng, H. Yang, J. Wei, W. Kang, Y. Zhang, and W. Zhao. Current-induced magnetization switching in atom-thick tungsten engineered perpendicular magnetic tunnel junctions with large tunnel magnetoresistance. Nature Communications, 9(1):671, 2018.
 [9] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan. Computing in memory with spin-transfer torque magnetic RAM. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(3):470–483, 2018.
 [10] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen. GraphR: Accelerating graph processing using ReRAM. In Proc. IEEE HPCA, pages 531–543, 2018.
 [11] S. Angizi, J. Sun, W. Zhang, and D. Fan. GraphS: A graph processing accelerator leveraging SOT-MRAM. In Proc. DATE, pages 378–383, 2019.
 [12] G. Dai, T. Huang, Y. Wang, H. Yang, and J. Wawrzynek. GraphSAR: A sparsity-aware processing-in-memory architecture for large-scale graph processing on ReRAMs. In Proc. ASPDAC, pages 120–126, 2019.
 [13] Y. Zhuo, C. Wang, M. Zhang, R. Wang, D. Niu, Y. Wang, and X. Qian. GraphQ: Scalable PIM-based graph processing. In Proc. IEEE MICRO, pages 712–725, 2019.
 [14] J. Yang, X. Wang, Q. Zhou, Z. Wang, H. Li, Y. Chen, and W. Zhao. Exploiting spin-orbit torque devices as reconfigurable logic for circuit obfuscation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 38(1):57–69, 2018.
 [15] J. Yang, P. Wang, Y. Zhang, Y. Cheng, W. Zhao, Y. Chen, and H. H. Li. Radiation-induced soft error analysis of STT-MRAM: A device to circuit approach. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 35(3):380–393, 2015.
 [16] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi. NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 31(7):994–1007, 2012.
 [17] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.