I Introduction
Sparse linear algebra operations are important in many domains, including simulating physical body dynamics [27], multigrid methods [23], network routing [64], and integer factorization [7]. The difference in delivered GFLOPs (Giga Floating Point Operations per second) between dense and sparse codes is 10–50×, depending on the sparsity pattern. An exacerbating issue impacting performance is the need to keep the matrices in sparse formats that involve a level of indirection. The challenge is both finding and matching the matrix elements that need to be operated on, which is an extra step compared to the dense case. Indeed, the overhead of accessing the indices into the sparse structure, not including the matching cost, can exceed the work of the mathematical operations by factors of 2–5 [28].
Our work is motivated by the promise of Field Programmable Gate Arrays (FPGAs) as computation engines for high-performance computing. With thousands of functional units and the ability to internally route data in an application-dependent manner, FPGAs present an opportunity to scale performance and obtain better performance-per-watt. An important challenge in accelerating sparse computation with FPGAs is the need to keep the computation units busy; the level of indirection in sparse computation hinders performance on an FPGA.
This paper introduces REAP, a cooperative CPU-FPGA approach for sparse computation that effectively uses the strengths of both the CPU and the FPGA. With many layers of caches, CPUs are good at handling small-scale, unpredictable memory access patterns. In contrast, FPGAs can be synthesized with a large number of functional units and can be reprogrammed with application-specific routing and logic. REAP uses the CPU to reorganize the sparse matrices into a form that can be streamed into the FPGA. We also show how having the CPU provide the FPGA with metadata about element locations aids the FPGA in matching operands.
REAP takes as input matrices in standard formats, such as Compressed Sparse Row (CSR) and Coordinate (COO) format. Using a standard format enhances portability, maintenance, and data curation. The CPU interfaces with the FPGA using our proposed intermediate representation, which we call the REAP Intermediate Representation (RIR). The CPU rearranges the sparse matrix elements into discrete chunks, which we call RIR bundles. A bundle contains metadata and the matrix elements that share a feature, such as a row or a column. The FPGA can then route the RIR bundles to pipelines of processing elements (PEs), which process the data in each bundle to create partial results. The partial results are also maintained in bundles, which can then be sent to additional units to be matched with the results of successive bundles. We have generalized RIR bundles for both SpGEMM and sparse Cholesky factorization. This format allows the transformation of a sparse pattern into a linear sequence of bundles. The linearization of the matrix elements means the FPGA can stream the elements from main memory at a very high effective bandwidth. For Cholesky factorization, we found that adding some metadata-only bundles further improves performance.
Using a combination of synthesis from Hardware Description Languages (HDLs) and simulation, our results show that the cooperative approach of using a CPU to precondition the matrix elements and then streaming them into the FPGA outperforms a multicore CPU even at modest matrix element densities. In particular, we show we can always obtain speedups over CPUs when the density ratio is over 1:1000 for both Sparse Generalized Matrix Multiplication (SpGEMM) and Cholesky factorization. We observe a geometric mean speedup of 3.2× over a single-CPU version using Intel’s Math Kernel Library for SpGEMM, and a geometric mean speedup of 1.8× with Cholesky factorization over a single CPU using the CHOLMOD library [12]. A surprising finding is that REAP is faster than a multicore CPU with an equal number of floating point units in the FPGA and CPU. Our hypothesis for this behavior is that streaming reformatted bundles into the FPGA yields better throughput through the floating point units than moving data through the CPU’s caches.
An important finding of our work is that these speedups are not obtainable without sufficient bandwidth between the memory and the FPGA.
Summary. REAP’s cooperative CPU-FPGA approach increases the effective bandwidth of the FPGA design by letting the CPU perform the scheduling and marshaling of the input. Hence, the FPGA enjoys regular accesses to memory. Further, relegating all the computation to the FPGA, where there are abundant computation resources including floating-point DSP units, improves performance. Finally, decoupling the CPU’s and FPGA’s work in a coarse-grained fashion allows their overlapped execution for higher performance.
II Synergistic CPU-FPGA Acceleration
This section provides an overview of REAP. We first show the overall architecture of the design and then describe how sparse data is regularized for the FPGA.
High-Level FPGA design in REAP.
Figure 1 provides an overview of the organization of the computation in the FPGA for SpGEMM. The Cholesky factorization design is similar; it is described in detail in Section III-B. Both designs center around replicated pipelines, which are organized vertically in Figure 1. Pipelines consist of multiple stages; there is thus parallelism both in the number of pipelines we can fit on an FPGA and within each pipeline. A single memory feeds the pipelines and acts as a sink for the results. This FPGA design provides better performance than the CPU only if the added parallelism of these replicated pipelines exceeds the CPU’s advantage from its higher clock rate, instruction-level parallelism, and multiple cores. A very high effective memory bandwidth for matrix elements is required because the pipelines’ data originate from and terminate at the same memory component, as shown in Figure 1.
Regularizing the FPGA data access. Keeping the pipelines busy is the key to performance. A traditional problem with sparse computations is the number of indirections needed to access the matrix elements. When accesses are regular, the FPGA can provide significant acceleration compared to a CPU because it can execute thousands of floating point operations using digital signal processing (DSP) units with support for fused multiply-add operations, and its high-throughput distributed on-chip memory can store intermediate results, thus avoiding write-backs to DRAM.
However, common sparse formats (e.g., CSR) involve some level of indirection. For example, to access an element (i, j) in the CSR representation of a sparse matrix A, one needs to consult the row pointer to obtain the location where row i begins (i.e., row_ptr[i]), then search the column indices up to the beginning of the next row to check whether the item is present (i.e., search from row_ptr[i] to row_ptr[i+1]), and then access the data element at the matched index.
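The two levels of indirection can be sketched as follows. This is a minimal illustration of a CSR lookup, not REAP code; the array names row_ptr, col_idx, and values are conventional, not taken from the paper.

```python
def csr_get(row_ptr, col_idx, values, i, j):
    """Return A[i][j] from CSR arrays, or 0.0 if absent.

    Two indirections: row_ptr locates the slice holding row i,
    then col_idx must be searched within that slice for column j.
    """
    start, end = row_ptr[i], row_ptr[i + 1]
    for k in range(start, end):      # scan row i's column indices
        if col_idx[k] == j:
            return values[k]         # access element at the matched index
    return 0.0

# 3x3 matrix with nonzeros A[0][1]=5, A[1][0]=2, A[2][2]=7
row_ptr = [0, 1, 2, 3]
col_idx = [1, 0, 2]
values = [5.0, 2.0, 7.0]
assert csr_get(row_ptr, col_idx, values, 0, 1) == 5.0
assert csr_get(row_ptr, col_idx, values, 0, 0) == 0.0
```

Every lookup pays for the pointer chase and the index search before any arithmetic happens, which is the overhead the rest of this section targets.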
This irregularity in the computation, involving indirect accesses, prevents effective pipelining of operations, resulting in low throughput and resource utilization on the FPGA. Our design uses the CPU to both schedule and pack the matrix elements into an intermediate representation that increases regularity and resource utilization when the computation runs on the FPGA.
REAP’s intermediate sparse representation. The key idea behind many sparse formats, including Compressed Sparse Row (CSR), is to group nonzero elements based on some shared feature, which allows compact storage of distinct features. The REAP Intermediate Representation (RIR) is inspired by the same observation: it co-locates the values and the auxiliary indices/metadata based on the shared feature, which increases locality and minimizes indirect accesses. It is straightforward to convert other sparse formats, such as CSC, ELL, and diagonal formats, to RIR.
To provide the FPGA’s pipelines with a regular data stream, REAP performs a data preprocessing pass in software on the CPU. In Figure 1, the CPU takes the matrices from its DRAM (not shown) and reformats them into the RIR format, placing them in the FPGA memory. The CPU also uses information from this preprocessing pass to schedule the computation on the FPGA. The FPGA then stores the results back in its own DRAM, which the CPU can subsequently read.
Figure 2(a) depicts the RIR format, where the shared feature is the attribute common to all the distinct elements. For example, in the case of a CSR representation, all the elements belonging to a particular row are packed together, with the row index as the shared feature. The column index and the value are co-located with each shared feature.
Figure 2(b) shows the translation of two commonly used formats, CSR and CSC, to RIR.
One key feature of RIR is that it allows easy conversion to and from commonly used formats. To support any sparse format, one has to provide compress and decompress routines. The compress routine takes the sparse data in the format of interest and converts it to RIR. Similarly, the decompress routine converts data in RIR back to a format of interest. This reformatting is done entirely in software by the CPU. Another important advantage of RIR is that it keeps the FPGA design independent of the sparse format while providing good performance.
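A software sketch of the two routines follows, under the assumption that a bundle is a record holding the shared feature, an element count, and the distinct (index, value) pairs; REAP’s actual bundle encoding also carries scheduling metadata not modeled here.

```python
def compress_csr_to_rir(row_ptr, col_idx, values):
    """Convert CSR arrays to a list of RIR-style bundles.

    Each bundle co-locates a shared feature (the row index) with its
    distinct features (column index, value pairs) plus an element count.
    """
    bundles = []
    for row in range(len(row_ptr) - 1):
        start, end = row_ptr[row], row_ptr[row + 1]
        if start == end:
            continue                      # skip empty rows
        elems = [(col_idx[k], values[k]) for k in range(start, end)]
        bundles.append({"shared": row, "count": len(elems), "elems": elems})
    return bundles

def decompress_rir_to_csr(bundles, nrows):
    """Inverse routine: rebuild the CSR arrays from RIR-style bundles."""
    row_ptr, col_idx, values = [0], [], []
    by_row = {b["shared"]: b["elems"] for b in bundles}
    for row in range(nrows):
        for c, v in by_row.get(row, []):
            col_idx.append(c)
            values.append(v)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, values
```

A CSC-to-RIR pair would look identical with the column index as the shared feature, which is what keeps the FPGA side format-agnostic.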
RIR includes scheduling. RIR bundles can sometimes carry purely scheduling information rather than data. This information guides the FPGA in scheduling the computation and organizing the data in memory. It is especially useful in sparse linear algebra kernels with data dependencies, such as sparse Cholesky factorization. In those kernels, the output of one computation serves as the input in later rounds; it would be inefficient to send the outputs to the CPU and read them back later. Instead, the CPU performs symbolic analysis and uses it to guide the FPGA in organizing the data in memory, without the need to transfer any data between the CPU and FPGA.
Fine-grained communication between the CPU and the FPGA can be expensive, especially when the communication happens over I/O channels (e.g., PCIe) [16]. The use of RIR with coarse-grained scheduling information minimizes this overhead. In contrast to executing everything on the FPGA, this synergistic CPU-FPGA approach, which makes two passes over the data, enhances locality, increases effective memory bandwidth, and provides higher throughput.
In the next section, we show how we adopt this approach, with specific design details, for two important sparse kernels: sparse matrix multiplication and sparse Cholesky factorization. Many other sparse linear algebra kernels can be accelerated with the same approach, which we leave for future work.
III Case Studies
To better show how REAP can be applied in practice, we present a detailed design for two important sparse linear algebra kernels: sparse generalized matrix multiplication and sparse Cholesky factorization. We chose these two kernels because they are core components of many applications, such as graph traversal [2], Monte-Carlo simulations [57], and Kalman filters [13]. Further, SpGEMM and Cholesky differ in their degrees of parallelism. Our goal is to demonstrate that our cooperative technique applies both to strongly parallel kernels and to ones with data dependencies.

III-A Accelerating SpGEMM with REAP
A SpGEMM kernel consists of two main tasks: multiply and merge (also known as accumulation). The multiply task generates partial products. A partial product is a result of multiplying a nonzero value of input A with a matched nonzero value of input B. A match occurs when the column index of A matches the row index of B. The merge task accumulates all partial products that have the same coordinates (i.e., row and column) and produces the final result. Some SpGEMM algorithms offer good throughput for the multiply task but have difficulty with the merge task [45]. It is necessary to balance the work between the multiply and merge tasks to achieve high throughput.
We use a row-by-row formulation of SpGEMM that computes one row of the result matrix at a time. This formulation has two advantages: (1) it provides improved data reuse compared to the inner-product approach and (2) it ensures lower complexity in the maintenance of partial products compared to the outer-product approach [45].
Reducing storage for partial products. Algorithm 1 illustrates our row-by-row SpGEMM formulation for multiplying two sparse matrices A and B. Intuitively, each row of A is compared with all the rows of B, matching elements (i.e., where the column index of an element in A equals the row index of an element in B) are multiplied to generate partial products, and the partial products belonging to the same column index are accumulated to produce a row of the final result matrix. The matrix A is read once, and the matrix B is streamed into the FPGA for each row of A.
Given that matrices A and B are sparse, it is not necessary to stream all the rows of B for a given row of A. When we read a row of A, we identify the column indices of its nonzero elements and only stream those rows of B that match one of those column indices (lines 3–7 in Algorithm 1). For example, if there is only one nonzero element in a row of A, then it is not necessary to stream all rows of B; streaming the single row of B that matches the column index of that nonzero element is sufficient.
Given a row of A and a row of B, the algorithm identifies the nonzero elements of A in that row that match elements in the row of B and performs the multiplications (lines 10–14 in Algorithm 1). Each multiplication of matching elements produces a partial product, which maintains the value and column index of the result. After streaming all the required rows of B, the partial products are sorted and merged by accumulating the values with the same column index. The merged values after accumulation provide the nonzero elements of a row in the final result matrix (lines 17–21 in Algorithm 1).
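The row-by-row formulation can be sketched in software as follows. This is an illustrative model using dict-of-dicts matrices; it accumulates partial products in a hash map, whereas the FPGA design sorts them and then merges adjacent entries.

```python
from collections import defaultdict

def spgemm_row_by_row(A, B):
    """Row-by-row SpGEMM on dict-of-dicts sparse matrices.

    A[i] maps column -> value for row i's nonzeros. For each row of A,
    only the rows of B matching A's nonzero column indices are visited;
    partial products with the same column index are accumulated.
    """
    C = {}
    for i, a_row in A.items():
        acc = defaultdict(float)                 # merge by column index
        for k, a_val in a_row.items():           # k selects a row of B
            for j, b_val in B.get(k, {}).items():
                acc[j] += a_val * b_val          # partial product
        if acc:
            C[i] = dict(acc)
    return C

A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
assert spgemm_row_by_row(A, B) == {0: {1: 16.0}, 1: {0: 15.0}}
```

Note that each output row is independent of the others, which is exactly the property the replicated pipelines exploit.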
Hence, the main components to accelerate SpGEMM in our formulation involve (1) extraction of the nonzero elements of two matrices in a given row and (2) efficient execution of the pipeline consisting of multiply, sort, and merge operations. In our synergistic CPUFPGA approach, the CPU performs the first task above whereas the FPGA performs the second task.
CPU provides regular data and scheduling information in the RIR format. The CPU reorganizes the input data to make it easier for the FPGA to attain higher throughput. The CPU has information about the FPGA design (i.e., the number of pipelines and the PEs in each unit) and uses it to lay out the data. It reads both sparse matrices in CSR format and creates RIR bundles for each row of the input matrices. During this process, it also lays out the bundles in memory (DRAM) according to the scheduling information.
Figure 3 illustrates how the data is organized by the CPU to make it regular and to encode the scheduling information for the FPGA. The CPU is aware of the number of parallel pipelines in the FPGA (i.e., three in Figure 1) and uses it to perform the scheduling task. Each pipeline processes a row of A. Hence, the CPU lays out three rows of A followed by all the rows of B necessary to produce their partial products. In summary, the CPU and FPGA work in tandem to attain higher throughput.
FPGA design for SpGEMM. Figure 1 shows the architecture of our SpGEMM design. The FPGA design consists of five modules that process the input data in the RIR format. The input controller reads the input matrices and the scheduling information in RIR and distributes the work to the pipelines. Each pipeline includes PEs that perform matching and multiplication, sorting, and merging. Distinct pipelines operate on distinct rows of matrix A, and all required rows of B are streamed to every pipeline. Each pipeline thus computes a small part of the result matrix independently.
Match and multiply unit. The match and multiply unit consists of two smaller units: a match unit and a single-precision multiplier. We use Content Addressable Memory (CAM) to perform matching. Each CAM is populated with the column indices of A as the keys and, as the values, the addresses of the memory locations where A’s row and value are stored. The match and multiply units are connected by a buffer that acts as the work queue for the multiplier unit. Once a column index of an element in the row of A matches the row index of an element of B’s row, the values of the two elements of A and B are pushed to the multiplier work queue. The multipliers produce partial products.
Sorting of partial products. The partial products are then sorted by the sort unit before being merged. To design a high-speed sorting unit, we use shift registers and attach comparators to each register. Initially, all the entries in the shift register are invalid. When a partial product is created, we need to find where the new element should be inserted. The sorting unit compares the new value (i.e., the column index of the partial product) with all the existing values, identifies the appropriate position to insert it, and shifts other values to make space if necessary.
Merging of partial products. The merge units accumulate all the incoming partial products that have the same column index. The partial products are kept in a queue for the merge. As all the partial products belong to the same row in the pipeline and they are sorted according to their column index, the new element for the merge is just compared with the top element. If the new element has the same column index as the top element in the queue, it is accumulated. Otherwise, the top element is removed and sent to the output controller and the new element is added to the queue.
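The behavior of the sort and merge units can be modeled in software as a sorted insertion that accumulates on a column match. This collapses the two hardware units into one routine for illustration; the actual design keeps them as separate pipeline stages, with the merge unit only comparing against the head of the sorted queue.

```python
def insert_sorted(queue, col, val):
    """Software analogue of the shift-register sort unit plus the merge
    unit: keep (col, val) partial products ordered by column index,
    accumulating the value on an exact column match."""
    for pos, (c, v) in enumerate(queue):
        if c == col:
            queue[pos] = (c, v + val)      # same coordinates: accumulate
            return
        if c > col:
            queue.insert(pos, (col, val))  # shift later values to make space
            return
    queue.append((col, val))

q = []
for col, val in [(3, 1.0), (1, 2.0), (3, 0.5), (2, 4.0)]:
    insert_sorted(q, col, val)
assert q == [(1, 2.0), (2, 4.0), (3, 1.5)]
```

The queue ends up holding exactly the nonzeros of one result row, sorted by column index, ready for the output controller.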
Improving scalability. Our goal is to handle matrices of any size with any sparsity pattern. Some rows can have a large number of nonzeros compared to the resources in a pipeline of our FPGA design. In such scenarios, it is necessary to split the row into smaller pieces. The RIR format encodes the number of elements in each bundle, which allows us to break a large row into smaller RIR bundles. When the CPU packs the input data into an RIR bundle, it enforces a limit on the number of elements in it, which is a design parameter. Our match and multiply units in the FPGA design use CAMs, and an increase in the number of CAMs impacts the frequency of the design. In our SpGEMM design, we use an RIR bundle size of 32. When the number of nonzero elements in a row exceeds the RIR bundle size, the CPU breaks the row into multiple bundles. The RIR bundle also includes additional metadata to indicate the end of a row in the input matrix. This organization of the data and the scheduling information provided by the CPU enable the input and join controllers of the FPGA design to work efficiently.
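The splitting step can be sketched as follows; the bundle fields (count, end-of-row flag) follow the description above, while the exact bit-level encoding is a design detail not specified here.

```python
BUNDLE_SIZE = 32  # design parameter; REAP's SpGEMM design uses 32

def split_row_into_bundles(row, elems):
    """Break one (possibly long) row into RIR bundles of at most
    BUNDLE_SIZE elements; the last bundle is flagged as end-of-row so
    the FPGA controllers know when a row's partial results are complete."""
    bundles = []
    for start in range(0, len(elems), BUNDLE_SIZE):
        chunk = elems[start:start + BUNDLE_SIZE]
        bundles.append({
            "shared": row,
            "count": len(chunk),
            "end_of_row": start + BUNDLE_SIZE >= len(elems),
            "elems": chunk,
        })
    return bundles

b = split_row_into_bundles(7, [(c, 1.0) for c in range(70)])
assert [x["count"] for x in b] == [32, 32, 6]
assert [x["end_of_row"] for x in b] == [False, False, True]
```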
Summary. To maximize throughput, it is essential to keep all units in a pipeline, and all pipelines, active. The organization of the data and the schedule by the CPU minimizes the complexity of the input controller, whose job is to distribute the inputs, and enables effective utilization of the pipelines and the PEs in the FPGA.
III-B Accelerating Cholesky with REAP
Cholesky factorization [24] is an important method for solving systems of equations Ax = b. If the matrix A is symmetric and positive definite, we can decompose it as A = LL^T. The Cholesky method computes the lower triangular matrix L from A. In left-looking Cholesky, the columns of L are computed from left to right. The two main challenges of sparse Cholesky factorization are:
1) There are data dependencies between the computations of the columns of L. We cannot begin computing column k until all of its dependencies are resolved. This is in contrast to SpGEMM, where each row of the result matrix can be computed independently and in any order.

2) The nonzero structure of the sparse matrix L is modified as we compute L. In other words, L is both an input and an output. This makes Cholesky more challenging than SpGEMM, where the sparsity pattern is fixed before the computation starts.
Computing Cholesky factorization. Algorithm 2 shows the steps in a left-looking Cholesky factorization. The matrix L is computed column-by-column, based on the corresponding column of A and the previously computed columns of L, in the outer loop beginning in line 2. To compute the kth column of L, one needs the kth column of A and the previously computed columns of L, restricted to the rows at or below row k. Given that A is sparse, we first read the nonzero elements in the given column of A (line 3). The column vector DOT holds partial results; it is initialized to the nonzero elements of the column in A. An interesting aspect of Cholesky factorization is that it is possible to identify the nonzero elements in a column of L from a purely symbolic analysis of the structure of A, without its numerical values [21]. Hence, the algorithm performs a symbolic analysis to identify the row indices that are nonzero in a column of L (i.e., ROWL in line 4 in Algorithm 2). To compute a nonzero element in column k with row index i, we compute the dot product of the ith and kth rows of L, which is then subtracted from the value in A’s kth column at row index i. DOT stores this partial value (lines 5–7). Finally, the diagonal and non-diagonal elements of the kth column of L are computed as shown (lines 8–11 in Algorithm 2).
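A dense software sketch of the left-looking algorithm clarifies the data flow; the sparse version in Algorithm 2 visits only the nonzero entries identified by the symbolic analysis, but the per-column arithmetic is the same.

```python
from math import sqrt

def left_looking_cholesky(A):
    """Dense left-looking Cholesky: columns of L are computed left to
    right; column k subtracts dot products with the previously computed
    columns, then applies a square root (diagonal) and divisions."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # DOT holds the partial column: A[i][k] minus dot(L[i,:k], L[k,:k])
        dot = [A[i][k] for i in range(n)]
        for j in range(k):
            ljk = L[k][j]
            if ljk != 0.0:               # sparse version skips zeros
                for i in range(k, n):
                    dot[i] -= L[i][j] * ljk
        L[k][k] = sqrt(dot[k])           # diagonal element
        for i in range(k + 1, n):
            L[i][k] = dot[i] / L[k][k]   # non-diagonal elements
    return L

A = [[4.0, 2.0], [2.0, 5.0]]
L = left_looking_cholesky(A)
assert L == [[2.0, 0.0], [1.0, 2.0]]     # check: L @ L^T == A
```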
Challenges for FPGA acceleration. Sparse Cholesky factorization is challenging for FPGAs because the computation depends on the sparsity pattern: the locations of nonzeros in a column of L depend on the structure of A. FPGAs are inefficient at the irregular accesses that such symbolic analysis requires. In the absence of symbolic analysis, it is unclear how to schedule the computation efficiently on the FPGA.
Symbolic analysis and data reorganization by the CPU. REAP’s synergistic CPU-FPGA approach addresses these issues and provides acceleration for Cholesky factorization using FPGAs. In our approach, the CPU performs the symbolic analysis based on the construction of the elimination tree [22, 63]. The CPU preprocesses the matrix and performs a symbolic analysis of A to identify the nonzero rows in any column of L (GetPattern in Algorithm 2). Figure 4(b) shows the result of the symbolic analysis, indicating the rows that are nonzero in each column of L.
To regularize the data, the CPU generates RIR bundles for A. Figure 4(c) shows the RIR bundles for matrix A, labeled RA, where k is the column index (shared feature) and the row and value pairs are the distinct features of the bundle. The CPU also creates metadata RIR bundles to convey the scheduling information for computing a column of L. As matrix L is exclusively allocated and stored in the FPGA, the CPU determines the size of the matrix and creates a metadata RIR bundle for L. The metadata bundle for column k is a vector of triples, labeled RL in Figure 4(c); each triple includes the row index (i) of a nonzero value in the column of L and the start and end indices of the ith row of L. Overall, the CPU performs lines 3–4 in Algorithm 2 and the FPGA performs the rest.
Acceleration of computation in the FPGA. Figure 5 shows the overall architecture for the computation of Cholesky factorization on the FPGA. It computes a column of L in parallel. The FPGA design consists of a collection of pipelines, each of which computes one nonzero element of a column of L. Each pipeline internally contains a sequence of processing elements that perform the dot product and a division or square root operation.
FPGA’s input controller uses the RIR bundles for A and the metadata bundles for L to distribute the computation. The matrix L is maintained in the memory accessible by the FPGA. While computing column k of matrix L, the input controller broadcasts (1) row k of matrix L and (2) the RIR bundle for column k of A to all the pipelines. Additionally, each pipeline retrieves the unique row of L corresponding to its nonzero element in column k of L.
Each dot product unit in the pipeline is equipped with a CAM to match the indices and several multipliers, as shown in Figure 5(c). The design is fully pipelined by adding intermediate buffers between each component. Each dot product PE has multiple multipliers to increase parallelism within the PE. The results of the dot products are then used by the second kind of PE, labeled “Div SqRoot”, to compute the final values for each row of a column of L, as shown in Figure 5(d). This PE realizes lines 8–11 in Algorithm 2.
The non-diagonal elements of a column of L also depend on the value of the diagonal element. To make the computation of each pipeline completely independent, each pipeline computes the diagonal element itself, which increases throughput at the cost of performing some redundant computation.
Summary. The CPU performs the symbolic analysis, regularizes the data, and determines the schedule of the computation, which enables the FPGA design to compute an entire column of L in parallel and attain higher throughput.
IV FPGA Prototype
We realized REAP prototypes of SpGEMM in 12,424 lines of hand-coded Verilog and Cholesky in 9,072 lines, including the testbenches. We synthesized these designs for an Intel Altera DE5-Net Arria 10 board using Quartus 16.1. The rest of this section describes some of the design choices.
Reading and writing RIR bundles. We used Intel’s FIFO block in the Quartus development environment and designed read/write controllers to access RIR bundles. We used dual-ported FIFOs, with the read and write signals synchronized to the same clock.
The write controller performs the following tasks: (1) it writes all the distinct features and keeps track of the number of elements written to the FIFO; (2) once all the distinct features are written, it pushes the shared feature and the metadata that records the number of distinct elements written. The read controller reads the bundle in the reverse order: it reads the metadata first, the shared feature next, and finally the distinct elements of the bundle. To handle overflow, the controllers make use of an almost-full signal (for writes) and an empty signal (for reads).
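The write and reverse-order read can be modeled as follows, using a plain Python list as the serialized bundle buffer; the reader consumes it from the tail, mirroring the reverse-order read described above. This is a behavioral sketch only, not a model of the FIFO hardware.

```python
def write_bundle(buf, bundle):
    """Write order used by the controller: distinct features first,
    then the shared feature, then the element count as metadata."""
    for elem in bundle["elems"]:
        buf.append(elem)
    buf.append(bundle["shared"])
    buf.append(len(bundle["elems"]))

def read_bundle(buf):
    """Reverse-order read: metadata first, shared feature next,
    and finally the distinct elements of the bundle."""
    count = buf.pop()
    shared = buf.pop()
    elems = [buf.pop() for _ in range(count)][::-1]
    return {"shared": shared, "elems": elems}

buf = []
write_bundle(buf, {"shared": 5, "elems": [(1, 2.0), (3, 4.0)]})
assert read_bundle(buf) == {"shared": 5, "elems": [(1, 2.0), (3, 4.0)]}
```

Writing the count last means the reader learns the bundle length before touching any element, so it never has to scan for a terminator.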
Floating Point Operations. All the arithmetic computations, including multiplication, addition, subtraction, division, and square root, in both designs use dedicated hardware (i.e., the DSP units) for single-precision floating point. We did not use double precision because there is no support for it in the vendor-supplied Intellectual Property (IP) blocks. We did not explore synthesizing double-precision FP arithmetic from FPGA logic cells, as this has been shown to have much worse frequency and area than dedicated FP units [3].
REAP with High-Level Synthesis. As high-level synthesis (HLS) is becoming an attractive alternative to hand-coded Verilog designs, we explored ideas from REAP with OpenCL HLS tools for FPGAs. We observed that the performance of HLS designs tends to be a lot slower than hand-coded Verilog and can vary with minor changes to the design. In the evaluation, we demonstrate that ideas from REAP, such as working on preprocessed data, can speed up HLS designs.
V Experimental Evaluation
Name | SpGEMM | Cholesky | Rows | NNZ (Density)
mario_002 | S1 | -- | 389K | 2.10M (0.001%)
m133-b3 | S2 | -- | 200K | 800K (0.001%)
filter3D | S3 | -- | 106K | 2.7M (0.02%)
cop20K | S4 | -- | 121K | 2.6M (0.01%)
offshore | S5 | -- | 259K | 4.2M (0.006%)
poisson3Da | S6 | -- | 13K | 352K (0.19%)
cage12 | S7 | -- | 130K | 2.0M (0.011%)
2cubes_sphere | S8 | -- | 101K | 1.64M (0.015%)
bcsstk13 | S9 | C2 | 2K | 83K (2.09%)
bcsstk17 | S10 | C3 | 10K | 428K (0.35%)
cant | S11 | C4 | 62K | 4M (0.102%)
consph | S12 | -- | 83K | 6M (0.086%)
mbeacxc | S13 | -- | 496 | 49K (20.29%)
pdb1HYs | S14 | -- | 36K | 4.3M (0.32%)
rma10 | S15 | -- | 46K | 2.3M (0.108%)
descriptor_xingo6u | S16 | -- | 20K | 73K (0.017%)
g7jac060sc | S17 | -- | 17K | 203K (0.064%)
ns3Da | S18 | -- | 20K | 1.6M (0.403%)
TSOPF_RS_b162_c3 | S19 | -- | 15K | 610K (0.25%)
cbuckle | S20 | C6 | 13K | 676K (0.36%)
Pre_poisson | -- | C1 | 12K | 715K (0.32%)
gyro | -- | C5 | 17K | 1M (0.33%)
bcsstk18 | -- | C7 | 11K | 80K (0.056%)
bcsstk36 | -- | C8 | 23K | 1.1M (0.215%)
Platform | Configuration
CPU: Intel Xeon 6130 | 16 cores, 2.1 GHz, 32 GB DDR4 (2666 MHz); cache (KB): L1: 32, L2: 1024, L3: 22528
FPGA: DE5-Net Arria 10 | 1,150K logic elements, 67 Mbit embedded memory, 8 GB DDR3 (933 MHz), 1518 DSP blocks
Intel PAC Card: Arria 10 GX FPGA | 1,150K logic elements, 65.7 Mbit on-chip memory, OpenCL 1.0
We evaluated REAP by comparing the performance of two sparse kernels, SpGEMM and Cholesky, against two well-optimized CPU libraries: Intel MKL [30] for SpGEMM and CHOLMOD [12] for Cholesky.
Benchmarks. We used twenty-four sparse matrices from the SuiteSparse matrix collection [20], shown in Table I. These matrices span multiple application domains and vary in size and sparsity pattern. The second and third columns give the matrix ID names used in the evaluation results. A blank entry means the matrix was not evaluated for that algorithm (i.e., it does not satisfy the precondition for Cholesky factorization).
Simulation framework. Although we have synthesized designs on the FPGA, we are limited by the memory bandwidth for our experiments on the DE5-Net FPGA board. Recently, Intel [31] and Xilinx [59] released their newest high-end FPGAs with High Bandwidth Memory (HBM) interfaces using newer packaging techniques. As these boards only recently became available, we did not have access to them for our experiments. To explore the design space, we therefore also use simulation to measure the execution time of REAP designs as we scale the memory bandwidth. Our framework consists of two parts. First, for the FPGA part, we use trace-driven simulation with our in-house cycle-accurate SystemC simulator for high fidelity. All the cycle counts and FPGA frequencies used in the simulator are extracted from the RTL implementation synthesized with Quartus 16.1 for the DE5-Net Arria 10 board. The details of the FPGA board are available in Table II. To simulate the FPGA DRAM, we use a queuing model where data transfers are not allowed to exceed the bandwidth set in the design. Second, for the CPU part of REAP, we use a single-threaded C++ implementation. Our current framework does not support concurrent execution of the CPU and FPGA. Hence, we first ran the CPU part to collect CPU traces, and the simulator then uses the traces for its execution.
Baseline. We use the measured execution time of Intel MKL’s SpGEMM implementation on a CPU as the baseline for our REAP SpGEMM designs. For sparse Cholesky, we measured CHOLMOD, a specialized library for Cholesky factorization. Table II shows the specifics of the CPU and the FPGA used. To evaluate the SpGEMM kernel, we multiply each sparse matrix by itself (i.e., A × A), which is a standard method for evaluating SpGEMM performance.
V-A Performance Evaluation with SpGEMM
We compare three variants of our SpGEMM design to Intel MKL.
REAP32. This FPGA design has 32 pipelines and runs at a 250 MHz frequency. We use an RIR bundle and CAM size of 32. The DRAM bandwidth for this design matches that available to a single-core CPU, which is 14 GB/s on our machine for both reads and writes. We measured memory bandwidth with the pmbw tool [5].
REAP64. This variant has 64 pipelines and runs at 250 MHz. The DRAM bandwidth is scaled to reflect the DRAM bandwidth available to the 16-core CPU, which is the peak measured memory bandwidth (147 GB/s for reads and 73 GB/s for writes) of our CPU. All other parameters are the same as in the REAP32 version.
REAP128. This version has 128 pipelines and runs at 220 MHz. The DRAM bandwidth remains the same as REAP64, which is the peak memory bandwidth measured for our CPU system. All other parameters are the same as REAP64.
CPU. We run Intel MKL with different numbers of threads (1, 2, 4, 8, and 16) to show how performance on the CPU changes with core count. We achieved the best CPU performance running Intel MKL with 16 threads and hyperthreading disabled. Variants with more threads ran slower, likely due to interference.
Figure 6 shows the SpGEMM speedup of our REAP FPGA designs and software-only CPU versions with various core counts relative to Intel MKL running on a single core. REAP overlaps the reformatting on the CPU and the computation on the FPGA after the initial round; in the initial round, the FPGA is idle while the CPU reformats the data. Figure 6 shows the overall time, taking into account both the CPU and the FPGA time. Due to space constraints, we only show the worst- and best-performing CPU configurations alongside the three REAP variants. CPU-2 effectively has the same number of floating-point multiply/add units as REAP-32, while REAP-32's effective bandwidth is the same as CPU-1's. REAP-64 and REAP-128 have, respectively, a quarter and half the number of floating-point multiply/add units of CPU-16, with the same memory bandwidth.
The main observations from this figure are: (1) REAP-32, where the FPGA has the same memory bandwidth as single-core Intel MKL, outperforms CPU-1 on all the matrices, with a geometric mean speedup of 3.2×. (2) REAP-32 also beats CPU-2 for the majority of the benchmarks, and REAP-64 outperforms CPU-16 for half of the matrices with a quarter of the available floating-point units. This higher performance is achieved despite the FPGA running at a frequency almost 8× lower than the CPU's. (3) REAP-128 beats CPU-16 for all benchmarks except three. This, again, is achieved even though REAP uses half the number of floating-point units of a 16-core CPU and runs at a significantly lower frequency.
CPU preprocessing vs. FPGA computation. Figure 7 reports the percentage of time spent in the CPU preprocessing task and the FPGA computation with our REAP-32 design (the two sum to 100%). In reality, most of the execution time on the CPU and FPGA is effectively overlapped. From this figure, we observe that the FPGA computation time dominates the CPU's preprocessing time in most cases, as expected. In the cases where CPU preprocessing time exceeds the FPGA's time, the matrices have low density; there, the time spent extracting and organizing the nonzero elements exceeds the computation time. This confirms that accessing the nonzero elements can sometimes be as costly as doing the actual computation in SpGEMM.
GFLOPS Analysis. One important goal of our design is to improve throughput by better utilizing resources (e.g., floating-point units). Figure 8 shows the GFLOPS for both REAP and Intel MKL doing SpGEMM. The GFLOPS are normalized by the number of floating-point units available to our design and to the CPU. We only show the median, geometric mean, 25th percentile, and 75th percentile results across all benchmarks. We observe that, for the same number of available floating-point units, REAP achieves higher GFLOPS in all cases. REAP utilizes its resources better thanks to the preprocessing done by the CPU. Further, REAP's GFLOPS scales better with the number of floating-point units than the CPU's does.
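To make the normalization concrete, the sketch below shows one way to compute GFLOPS per floating-point unit (helper names are ours; the FLOP count assumes one multiply and one add per matched nonzero pair, as in SpGEMM):

```python
def spgemm_flops(a_idx, b_ptr):
    """Useful FLOPs of C = A*B in CSR: one multiply plus one add for
    every (a_ik, b_kj) nonzero pair, i.e., 2 * sum over nonzeros a_ik
    of nnz(row k of B). a_idx is A's CSR column array, b_ptr B's row
    pointer array."""
    return 2 * sum(b_ptr[k + 1] - b_ptr[k] for k in a_idx)

def normalized_gflops(flops, seconds, num_fp_units):
    """GFLOPS per floating-point unit available to the design."""
    return flops / seconds / 1e9 / num_fp_units
```

For example, a design with 32 floating-point units sustaining 8 GFLOPS overall scores 0.25 normalized GFLOPS; the comparison in Figure 8 is on this per-unit basis.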
Hardware Scalability. Figure 8 also shows how the frequency and logic utilization vary as we add more pipelines. While the number of pipelines grew from 2 to 128, the logic utilization increased by only 8× and the frequency only dropped from 280 MHz to 220 MHz. This is because we benefit extensively from the DSP units and on-chip memory. The simple interconnection network between PEs and the minimal use of CAMs are other reasons for the good scalability of our FPGA design.
Sensitivity to sparsity. Another interesting observation is in Figure 9, where the X-axis shows the density of the input matrix ((number of nonzeros / total number of elements) × 100) in log scale and the Y-axis shows the relative speedup of the different REAP variants compared to the CPU versions. For SpGEMM, our design achieves a relatively better speedup when the input matrix is sparser; in other words, REAP favors sparse matrices. The dashed line marks where the CPU version beats REAP. The CPU beats REAP only when the matrix is relatively dense.
V-B Performance Evaluation with Cholesky
We compare our Cholesky design with CHOLMOD [12], the state-of-the-art library for sparse Cholesky factorization. CHOLMOD offers different optimizations and configurations for sparse Cholesky factorization while supporting both symbolic and numerical factorization. The library supports a variety of factorization options, such as simplicial or supernodal, and LL^{T} or LDL^{T}. For a fair comparison, we compare against the simplicial LL^{T} implementation with no ordering, which is the closest to our implementation. Further, both our design and CHOLMOD benefit from the symbolic analysis. We have not included the time spent building the elimination tree. We compare REAP results with the CHOLMOD configuration that runs only the numeric computation.
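For reference, the simplicial LL^{T} factorization follows the classic column-by-column recurrence; below is a minimal dense Python sketch (our own illustration, not CHOLMOD's or our design's code; a sparse simplicial solver performs the same dot-product and scaling steps, but only over nonzero entries):

```python
import math

def cholesky_llt(A):
    """Simplicial LL^T factorization (dense sketch, column by column).

    Returns lower-triangular L with A = L * L^T; A must be symmetric
    positive definite.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal: remove contributions of previously computed columns.
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(d)
        # Off-diagonals: dot product of rows i and j of L, then scale.
        for i in range(j + 1, n):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L
```

The dot products over previously computed columns are exactly the dependencies that limit pipeline parallelism: column j cannot be finalized until the columns it depends on are done.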
REAP variants for Cholesky. We have two design variants.
REAP-32. This design has 32 pipelines and runs at 250 MHz. We set the DRAM bandwidth to match the DRAM bandwidth of a single-core CPU. Each PE performing the dot product is equipped with 8 multipliers. We set the bundle size and CAM size to 32.
REAP-64. This design has 64 pipelines and runs at 238 MHz. The DRAM bandwidth is scaled to reflect the DRAM bandwidth available to the 16-core CPU. We have also doubled the number of multipliers (i.e., 16) in the PEs performing the dot product. All other parameters, such as bundle size and CAM size, are the same as in REAP-32 above.
Figure 10 shows the relative speedup of the REAP designs compared to CHOLMOD on a single-core CPU. The REAP-32 design outperforms the CPU for all but one benchmark, with a geometric mean speedup of 1.18×. REAP-64 outperforms the CPU version for all the benchmarks, with an average (geometric mean) speedup of 1.85×. Unlike SpGEMM, the performance of the sparse Cholesky kernel is limited by the data dependencies inherent in the algorithm.
Figure 11 shows the percentage of time spent in CPU preprocessing and FPGA computation, similar to our experiments with SpGEMM. The FPGA execution time significantly dominates the CPU execution time for Cholesky. All the numeric computation takes place on the FPGA; the CPU only performs the symbolic computation, which involves no floating-point operations.
To study the impact of dependencies on the performance of our designs, we measured the change in performance as pipelines are added and measured the idle time across all pipelines. We observed that, as the number of pipelines increases, the idle cycles increase almost linearly with it. Hence, adding more resources does not improve the performance of Cholesky. There is active research on overcoming the issue of dependencies in matrix factorization, which is orthogonal to our work.
V-C REAP with OpenCL HLS Designs
To evaluate the benefits of preprocessing with REAP's RIR bundles in an OpenCL HLS framework, we used an Intel Programmable Acceleration Card (PAC) running Intel's FPGA OpenCL version 1.0. However, shared memory between the FPGA and CPU, where the CPU's load/store instructions and an OpenCL kernel's read/write statements access the same memory, is not well supported in the current Intel OpenCL toolchain; accessor functions are required instead. Thus, to evaluate REAP with HLS, we first ran the first pass on the CPU, and the FPGA then performed the computation on the RIR bundles generated by that pass. Unsurprisingly, the HLS designs are significantly slower than the hand-coded designs. However, the HLS version of REAP outperforms the HLS version without any CPU preprocessing for all benchmarks, with geometric mean improvements of 16% and 35% for SpGEMM and Cholesky, respectively. This demonstrates that the idea of synergistic CPU-FPGA execution benefits both hand-coded designs and HLS tools.
VI Related Work
There exists a large body of work on optimizing the performance of sparse linear algebra kernels [17, 6, 51, 26, 55]. We restrict our comparison to prior research that optimizes SpGEMM and Cholesky.
SpGEMM on CPUs. The Intel Math Kernel Library (MKL) [30] provides math routines for sparse multiplication that are widely used and highly optimized for the CPU with OpenMP and vectorized with AVX extensions. There is a large body of work on cache-friendly and distributed algorithms [53, 52, 1, 8, 9, 43, 29], system-level optimizations with parallelism on modern hardware [49, 41], and the exploration of various sparse formats for efficient storage [36, 10, 40, 19, 48, 35, 4, 11]. Recent work by Zhen et al. [58] proposes a framework that accelerates the SpGEMM kernel for both CPUs and GPUs by automatically selecting the best sparse format and parallel algorithm based on the characteristics of the input.
SpGEMM on GPUs. Although GPUs can outperform CPUs for dense kernels, sparse computation introduces new challenges. Commercial sparse libraries such as cuSPARSE [18] and CUSP [47] apply GPU-specific optimizations to improve the performance of sparse computation. Recent research [39, 38, 41] identifies three aspects to improve performance on a GPU: memory allocation for the result matrix, random memory accesses due to parallel operations, and load balancing that has to consider sparsity.
SpGEMM on FPGAs. Prior work has also explored sparse matrix-matrix multiplication architectures on FPGAs [37, 44, 60, 61]. These designs focus on user control over the number of PEs and the block size to obtain a particular energy-delay or power-delay product. Similar to our designs, they also exploit dedicated DSP blocks and on-chip memory to accelerate SpGEMM computation. However, they assume that the FPGA's on-chip resources can accommodate the entire matrices, which is not practical in many real-world scenarios. Recent work also uses 3D-stacked logic-in-memory to accelerate the processing of sparse matrix data [62]. The use of such logic-enhanced CAMs helps with index matching, which is orthogonal to our work.
GEMM/SpGEMM with ASICs. Google's tensor processing unit (TPU) ASIC accelerates the inference phase of neural networks [33]. The building block of a TPU is a matrix-multiply unit; however, the TPU is designed for dense matrices. Similarly, ExTensor [25] is an ASIC that supports high-dimensional sparse data known as tensors and helps match the nonzero elements quickly.
Closely related work. SMASH [34], a software-hardware co-design, is closely related. Its software encodes the sparse data as a hierarchy of bitmaps; this encoding is then used by the hardware to avoid many unnecessary accesses and reduce overhead. Also related are recent approaches for outer-product based matrix multiplication [45, 42]. Like our approach, they include multiply and merge phases and generate partial products. The outer-product algorithm eliminates the index-matching phase and allows maximum reuse of the input matrices. However, its main drawback is the accumulation phase: partial products belonging to the entire final result matrix must be stored, and accumulation must wait until all partial products are ready. The latency of the entire algorithm thus quickly becomes dominated by accumulating partial results. In contrast, our row-by-row formulation strikes a good balance between data reuse and the accumulation of partial products, since they belong to a single row. Hence, our approach provides better throughput than an outer-product formulation.
Cholesky factorization. Generating efficient sparse kernels for Cholesky factorization is an active research area in the high-performance computing community. TACO [17] is a code generation framework that helps generate sparse kernels in general. CHOLMOD [12] provides a comprehensive framework for sparse linear solvers, including high-performance sparse Cholesky for both CPUs and GPUs, with highly optimized sequential implementations. Intel MKL's PARDISO [50] and SuperLU [32] provide parallel implementations of Cholesky for shared-memory architectures. There are recent efforts to use the inspector-executor paradigm to generate efficient sparse kernels for Cholesky and other sparse kernels in software [14, 15]. Prior research has also explored the use of GPUs to improve the performance of Cholesky factorization [46, 56].
Although there is substantial work on accelerating Cholesky on CPUs, there is a lack of work on FPGA designs for sparse Cholesky, as it is challenging to obtain good performance for it. An exception is prior work that analytically shows FPGAs have the potential to improve the performance of sparse Cholesky [54]. In contrast, this paper demonstrates a generic design primitive that accelerates Cholesky and related sparse kernels with some preprocessing performed by the CPU.
VII Conclusion
We introduce REAP, a system for high-performance, efficient sparse linear algebra. REAP uses a cooperative strategy between a CPU and an FPGA, where the CPU reorganizes the sparse matrices from a standard format into one that can be streamed into the FPGA. The FPGA then performs the computations in a systolic style, taking advantage of the inherent parallelism of matrix operations. Our evaluation over a diverse set of matrices demonstrates that our cooperative approach significantly improves the performance of SpGEMM and sparse Cholesky factorization compared to state-of-the-art CPU implementations, even when the FPGA and a multi-core CPU have an equal number of floating-point units. A key takeaway from our research is that increasing bandwidth alone does not yield the highest performance; having the CPU reformat the data into a more sequential access pattern gives the FPGA a performance boost that greatly outweighs the formatting cost. We hope our cooperative techniques inspire the HPC community to build applications in this style to better exploit HBM as it becomes more pervasive and affordable.
Acknowledgment
This paper is based on work supported in part by NSF CAREER Award CCF-1453086, NSF Award CCF-1917897, NSF Award CCF-1908798, NSF CNS-1836901, NSF OAC-1925482, and Intel Corporation.
References
 [1] (201708) Exploiting locality in sparse matrixmatrix multiplication on manycore architectures. IEEE Transactions on Parallel and Distributed Systems 28 (8), pp. 2258–2271. Cited by: §VI.
 [2] Basic linear algebra subprograms. 2018, http://www.netlib.org/blas/.. Cited by: §III.
 [3] (2008) Architectural modifications to enhance the floatingpoint performance of fpgas. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 16 (2), pp. 177–187. Cited by: §IV.
 [4] (2009) Patternbased sparse matrix representation for memoryefficient smvm kernels. In Proceedings of the 23rd International Conference on Supercomputing, New York, NY, USA. Cited by: §VI.
 [5] (2013-03) pmbw - parallel memory bandwidth measurement / benchmark. External Links: Document, Link Cited by: §VA.
 [6] (200307) Sparse matrix solvers on the gpu: conjugate gradients and multigrid. ACM Trans. Graph. 22 (3). External Links: ISSN 07300301, Link, Document Cited by: §VI.
 [7] (2013) The filtering step of discrete logarithm and integer factorization algorithms. Cited by: §I.
 [8] (2008) On the representation and multiplication of hypersparse matrices. In Proceedings IEEE International Symposium on Parallel and Distributed Processing, Vol. , pp. 1–11. Cited by: §VI.
 [9] (201105) Reducedbandwidth multithreaded algorithms for sparse matrixvector multiplication. In 2011 IEEE International Parallel Distributed Processing Symposium, Vol. . Cited by: §VI.
 [10] (2009) Parallel sparse matrixvector and matrixtransposevector multiplication using compressed sparse blocks. In Proceedings of the Twentyfirst Annual Symposium on Parallelism in Algorithms and Architectures, SPAA ’09, New York, NY, USA, pp. 233–244. External Links: ISBN 9781605586069, Link, Document Cited by: §VI.
 [11] (2009) Parallel sparse matrixvector and matrixtransposevector multiplication using compressed sparse blocks. In Proceedings of the TwentyFirst Annual Symposium on Parallelism in Algorithms and Architectures, New York, NY, USA. Cited by: §VI.
 [12] (2008) Algorithm 887: cholmod, supernodal sparse cholesky factorization and update/downdate. ACM Transactions on Mathematical Software, pp. 1–14. Cited by: §I, §VB, §V, §VI.
 [13] (2010) A linear algebraic approach to kalman filtering. In 2010 International Conference on Measuring Technology and Mechatronics Automation, Vol. 1, pp. 122–125. Cited by: §III.
 [14] (2017) Sympiler: transforming sparse matrix codes by decoupling symbolic analysis. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’17, New York, NY, USA, pp. 13:1–13:13. External Links: ISBN 9781450351140, Link, Document Cited by: §VI.
 [15] (2018) ParSy: inspection and transformation of sparse matrix computations for parallelism. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC ’18, pp. 62:1–62:15. Cited by: §VI.
 [16] (2016) A quantitative analysis on microarchitectures of modern cpufpga platforms. In Proceedings of the 53rd Annual Design Automation Conference, DAC ’16. Cited by: §II.
 [17] (2018) Unified sparse formats for tensor algebra compilers. CoRR abs/1804.10112. External Links: Link, 1804.10112 Cited by: §VI, §VI.
 [18] (2014) CuSPARSE library. https://developer.nvidia.com/cusparse. Cited by: §VI.
 [19] (1989) ITPACKV 2d user’s guide. Cited by: §VI.
 [20] (201112) The university of florida sparse matrix collection. ACM Trans. Math. Softw. 38 (1), pp. 1:1–1:25. Cited by: TABLE I, §V.
 [21] (2006) Direct methods for sparse linear systems. Vol. 2. Siam. Cited by: §IIIB.
 [22] (1987) Symbolic cholesky factorization on a localmemory multiprocessor. Parallel Computing 5 (12), pp. 85–95. Cited by: §IIIB.
 [23] (2010) A streaming approach for sparse matrix products and its application in galerkin multigrid methods. Electronic Transactions on Numerical Analysis 37 (263275), pp. 3–5. Cited by: §I.
 [24] (2018) ScientificComputing, an introductory survey. SIAM. Cited by: §IIIB.
 [25] (2019) ExTensor: an accelerator for sparse tensor algebra. In Proceedings of the 52Nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’52, New York, NY, USA, pp. 319–333. Cited by: §VI.
 [26] (2019) Accelerating sparse deep neural networks on fpgas. In 2019 IEEE High Performance Extreme Computing Conference (HPEC), Vol. , pp. 1–7. Cited by: §VI.
 [27] (2011) Sparse matrix implicit numerical integration of the stiff differential/algebraic equations: implementation. Nonlinear Dynamics 65 (4), pp. 369–382. Cited by: §I.
 [28] (2004) Sparsity: optimization framework for sparse matrix kernels. The International Journal of High Performance Computing Applications 18 (1), pp. 135–158. Cited by: §I.
 [29] (2016) Improving performance of sparse matrix dense matrix multiplication on largescale parallel systems. Parallel Computing 59, pp. 71 – 96. Cited by: §VI.
 [30] (2018) Intel math kernel library (mkl). Cited by: §V, §VI.
 [31] Intel stratix 10. External Links: Link Cited by: §V.
 [32] (1999) An asynchronous parallel supernodal algorithm for sparse gaussian elimination. pp. . Cited by: §VI.
 [33] (2017) Indatacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, New York, NY, USA. Cited by: §VI.
 [34] (2019) SMASH: codesigning software compression and hardwareaccelerated indexing for efficient sparse matrix operations. In Proceedings of the 52Nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’52, New York, NY, USA, pp. 600–614. External Links: ISBN 9781450369381, Link, Document Cited by: §VI.
 [35] (2011) CSX: an extended compression format for spmv on shared memory systems. Cited by: §VI.
 [36] (2013) SMAT: an input adaptive autotuner for sparse matrixvector multiplication. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, New York, NY, USA. External Links: ISBN 9781450320146 Cited by: §VI.
 [37] (201012) Design space exploration for sparse matrixmatrix multiplication on fpgas. In 2010 International Conference on FieldProgrammable Technology, Vol. , pp. 369–372. External Links: Document, ISSN Cited by: §VI.
 [38] (201405) An efficient gpu general sparse matrixmatrix multiplication for irregular data. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium, Vol. , pp. 370–381. Cited by: §VI.
 [39] (201511) A framework for general sparse matrixmatrix multiplication on gpus and heterogeneous processors. J. Parallel Distrib. Comput. 85 (C), pp. 47–61. External Links: ISSN 07437315, Link, Document Cited by: §VI.
 [40] (2015) CSR5: an efficient storage format for crossplatform sparse matrixvector multiplication. In Proceedings of the 29th ACM on International Conference on Supercomputing, ICS ’15, pp. 339–350. External Links: Document Cited by: §VI.
 [41] (201212) Sparse matrixmatrix multiplication on modern architectures. In 2012 19th International Conference on High Performance Computing, Vol. , pp. 1–10. External Links: Document, ISSN Cited by: §VI, §VI.

 [42] (2017-01) Fine-grained accelerators for sparse machine learning workloads. In 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), Cited by: §VI.
 [43] (2007-05-01) When cache blocking of sparse matrix vector multiply works and why. Applicable Algebra in Engineering, Communication and Computing 18 (3), pp. 297–311. Cited by: §VI.
 [44] (201603) Hardware accelerator for analytics of sparse data. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE), Vol. , pp. 1616–1621. Cited by: §VI.
 [45] (201802) OuterSPACE: an outer product based sparse matrix multiplication accelerator. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Vol. , pp. 724–736. Cited by: §IIIA, §IIIA, §VI.
 [46] (2016) Accelerating sparse cholesky factorization on gpus. 59. Cited by: §VI.
 [47] (2015) S. Dalton, N. Bell, L. Olson, and M. Garland. CUSP: generic parallel algorithms for sparse matrix and graph computations. Version 0.5.1. Cited by: §VI.
 [48] (2003) Iterative methods for sparse linear systems. 2nd edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA. External Links: ISBN 0898715342 Cited by: §VI.
 [49] (2013) Performance evaluation of sparse matrix multiplication kernels on intel xeon phi. CoRR abs/1302.1078. External Links: Link, 1302.1078 Cited by: §VI.
 [50] (200003) Efficient sparse lu factorization with leftright looking strategy on shared memory multiprocessors. BIT 40, pp. 158–176. External Links: Document Cited by: §VI.

 [51] (2019) Parallelism-flexible convolution core for sparse convolutional neural networks on FPGA. IPSJ Transactions on System LSI Design Methodology 12, pp. 22–37. Cited by: §VI.
 [52] (1998-03) Caching-efficient multithreaded fast multiplication of sparse matrices. In Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, pp. 117–123. External Links: Document, ISSN 1063-7133 Cited by: §VI.
 [53] (199803) Cachingefficient multithreaded fast multiplication of sparse matrices. In Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, Vol. , pp. 117–123. Cited by: §VI.
 [54] (201706) Sparse cholesky factorization on fpga using parameterized model. pp. 11. Cited by: §VI.
 [55] (2015) A vector caching scheme for streaming fpga spmv accelerators. In Applied Reconfigurable Computing, K. Sano, D. Soudris, M. Hübner, and P. C. Diniz (Eds.), Cham, pp. 15–26. External Links: ISBN 9783319162140 Cited by: §VI.
 [56] (201006) On the limits of gpu acceleration. pp. . Cited by: §VI.
 [57] (2004) A sparse matrix approach to bayesian computation in large linear models. Computational Statistics and Data Analysis (), pp. 493–516. Cited by: §III.
 [58] (2019) IAspgemm: an inputaware autotuning framework for parallel sparse matrixmatrix multiplication. In Proceedings of the ACM International Conference on Supercomputing, ICS ’19. Cited by: §VI.
 [59] Xilinx virtex ultrascale. External Links: Link Cited by: §V.
 [60] (2017) Sparse matrix multiplication on an associative processor. External Links: 1705.07282 Cited by: §VI.
 [61] (2017) Sparse matrix multiplication on CAM based accelerator. CoRR abs/1705.09937. External Links: Link, 1705.09937 Cited by: §VI.
 [62] (2013Sep.) Accelerating sparse matrixmatrix multiplication with 3dstacked logicinmemory hardware. In 2013 IEEE High Performance Extreme Computing Conference (HPEC), Vol. , pp. 1–6. External Links: ISSN Cited by: §VI.
 [63] (1994) On constructing the elimination tree. Discrete applied mathematics 48 (1), pp. 93–98. Cited by: §IIIB.
 [64] (2002) All pairs shortest paths using bridging sets and rectangular matrix multiplication. Journal of the ACM (JACM) 49 (3), pp. 289–317. Cited by: §I.