Hardware Acceleration of HPC Computational Flow Dynamics using HBM-enabled FPGAs

01/05/2021
by Tom Hogervorst, et al.
Delft University of Technology

Scientific computing is at the core of many High-Performance Computing applications, including computational flow dynamics. Because of the utmost importance of simulating increasingly large computational models, hardware acceleration is receiving increased attention due to its potential to maximize the performance of scientific computing. A Field-Programmable Gate Array is a reconfigurable hardware accelerator that is fully customizable in terms of the computational resources and memory storage requirements of an application during its lifetime. Therefore, it is an ideal candidate to accelerate scientific computing applications because of the possibility to fully customize the memory hierarchy, which is important in irregular applications such as the iterative linear solvers found in scientific libraries. In this paper, we study the potential of using FPGAs in HPC in light of the rapid advances in reconfigurable hardware, such as the increase in on-chip memory size, the increasing number of logic cells, and the integration of High-Bandwidth Memory on board. To perform this study, we first propose a novel ILU0 preconditioner tightly integrated with a BiCGStab solver kernel, designed using a mixture of High-Level Synthesis and hand-coded Register-Transfer Level design. Second, we integrate the developed preconditioned iterative solver in Flow from the Open Porous Media (OPM) project, a state-of-the-art open-source reservoir simulator. Finally, we perform a thorough evaluation of the FPGA solver kernel, both in standalone mode and integrated into the reservoir simulator, that includes all the on-chip URAM and BRAM, on-board High-Bandwidth Memory, and off-chip CPU memory data transfers required in a complex simulator software such as OPM's Flow. We evaluate the performance on the Norne field, a real-world reservoir model using a grid with more than 10^5 cells and 3 unknowns per cell.


I Introduction

The advent of data proliferation and the slowing down of Moore's law is driving many application designers to implement an algorithm's computationally demanding functionality using a hardware accelerator. While GPUs benefit from a larger user base because of their massive number of parallel cores, on-board HBM, and ease of programming, FPGAs have recently been gaining a lot of attention because of their application-specific customizability. Furthermore, the recent integration of HBM in FPGA-based systems, coupled with the maturing of HLS programming tools [19], has positioned the FPGA as a viable competitor to GPUs in emerging machine learning-related domains, but also in HPC applications where GPUs were traditionally the sole candidate. The acquisition of Altera by Intel in 2016 [14] and the recent announcement of AMD acquiring Xilinx [4], two of the biggest FPGA providers, can be considered an indication of this trend.

In this context, it is highly relevant to reassess the potential of FPGAs for HPC, when we consider the addition of HBM on board, the increase of on-chip memory (e.g., URAM [26]), and the continuing increase of logic cell resources enabling one to implement complex functionality. To perform this evaluation, we look at a ubiquitous scientific computing problem used in many HPC application domains, namely the solution of systems of non-linear partial differential equations in time and space. These are typically not possible to solve exactly through analytical methods. Hence, the equations are discretized in time and over a domain space using finite difference, finite volume, or finite element methods. Finally, the derived system of equations is solved, often iteratively, using a combination of preconditioners and iterative linear solvers available in one of the many scientific solver libraries, such as DUNE [6], PETSc [5], or others. Due to the complexity of implementing linear solvers on FPGAs, only a few previous works have studied their acceleration on reconfigurable hardware. Moreover, many assumptions were made to simplify those designs and implementations, such as focusing solely on single-precision floating-point support or using simple solvers without preconditioners, and all of them ignored the integration into the whole scientific computing pipeline. However, to realistically assess the potential of FPGAs, it is important to eliminate the aforementioned restrictions. Consequently, in this work we develop a complete solver with a preconditioner on FPGA with support for double precision, which is critically important in a real-world setting because it enables convergence to a solution. Furthermore, we integrate our work in a state-of-the-art reservoir simulator, OPM Flow [21], and validate our work by running a real-world use case containing a large number of grid cells (more than 10^5).

To accomplish our study goals, we develop a novel BiCGStab [27] solver implementation with ILU0 [7] as preconditioner on FPGA. This combination of preconditioner and linear solver was chosen to get an apples-to-apples comparison with the CPU performance on the Norne field data set, as this combination was the best performing in that case. To maximize the performance, we use a mixed programming model in which we combine HLS with manually written RTL. Please note that there might be solvers more efficient for a parallel architecture such as the FPGA (and GPU) than an ILU0-preconditioned one. However, this is left for future work because it was most important to make certain the results are close to the ones obtained by the already validated software-only simulation. Therefore, we selected the same combination of preconditioner and solver as used by default in OPM Flow and designed the solver to support double precision because high accuracy is required; otherwise, the selected real-world case would not converge. Furthermore, integration is key to benchmark the performance of the overall system. Previous work has only benchmarked solvers in isolation, whereas we consider the whole system for benchmarking, where including the off-chip and off-board memory transfers is crucial. Finally, we make our work available in the OPM git repository [20].

The contributions of this paper are summarized as follows:

  • We propose a novel CSR-based encoding to optimize the SpMV on FPGAs, which is the core functionality required to implement iterative solvers.

  • We develop a hardware model to predict and guide the design of the solver, helping us to understand the trade-off between area and performance when scaling the resources.

  • We design a novel ILU0-BiCGStab preconditioned solver on an HBM-enabled FPGA using a mixed programming model of HLS and RTL.

  • We integrate the proposed solver in OPM Flow, both for FPGA and GPU. Furthermore, we make all sources available in the OPM git repository.

  • We provide extensive evaluation results for both standalone solver kernel benchmarking and complete reservoir simulation execution on a real-world use case running on three different types of systems: CPU, FPGA, and GPU.

In the next section II, we introduce the background required to understand the paper, while in section III we highlight related works. We then follow a bottom-up explanation approach in which we first describe the SpMV kernel in section IV, then the solver in section V, and discuss the performance hardware model used to guide the design in section VI. Finally, in section VII, we present the experimental results.

II Background

II-A Sparse Matrix Iterative Solvers

The central problem in this paper is to solve Ax = b for the vector x, where A is a known sparse matrix and b is a known dense vector. There are different methods for solving such a problem, which can be grouped into direct solvers and iterative solvers. Direct solvers, as the name implies, use linear algebra techniques to solve the problem for x directly and precisely. However, direct methods are not suitable for solving large sparse linear systems, as is usually the case in a real-world HPC application, because they fill in the zero values of the sparse matrix in order to solve the problem, resulting in a very high computational cost and memory usage. Iterative solvers are a heuristic alternative to direct solvers: they start from an initial guess of the solution, and then iteratively improve this guess until it is deemed close enough to the actual solution. In this paper, we use the BiCGStab (bi-conjugate gradient stabilized) solver with ILU0 (incomplete LU factorization with no fill-in) as preconditioner because this combination is the most robust and the default software configuration used in OPM Flow. Consequently, besides the goal of studying the acceleration potential of FPGAs, another implicit goal, required by the reservoir engineers using the simulator, was to obtain simulation results close to the software counterpart. Thus, although it might not be the best choice for acceleration on a parallel architecture, fulfilling this conformance criterion led us to use the same solver used in the software stack.

Algorithm 1 shows the pseudo-code of the BiCGStab solver. This code makes use of several other functions:

  • An spmv function, which performs a Sparse Matrix-Vector (SpMV) multiplication

  • A dot function, which calculates the dot product (or inner product) of two vectors

  • An axpy function, which scales its first input vector by a scalar value, and adds it to its second input vector (i.e., it performs y ← y + α·x).

  • A preconditioner function. This is an optional part of the solver, and does not have a set behavior. For more information about preconditioners, see section II-C.

The results of the solver are two vectors: x, which is the estimate of the solution of Ax = b, and r, a residual vector (r = b − Ax), which is an indication of how close the result vector is to the exact solution. It is worth noting in Algorithm 1 how the exit condition of the solver is set, or in other words, how the algorithm determines when to stop iterating because the results are sufficiently close to the actual result. The solver stops iterating when the norm of the residual is lower than a certain convergence threshold. There are two ways in which this convergence threshold can be chosen: either it is a constant value, in which case we speak of an absolute exit condition, or it is derived from the norm of the initial residual scaled by a desired improvement factor, in which case we speak of a relative exit condition. An absolute exit condition can be useful when one needs to be certain that the result of a solver is at least within a certain numerical distance from the actual result. However, when solving for multiple different matrices from different domains, a matrix with lower values all across the board might start off at a lower error and also reach the absolute exit condition more quickly. In such a case, a relative exit condition leads to a fairer comparison between solves, which is why a relative exit condition is used in this paper.
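To make the two exit-condition styles concrete, the following minimal C++ sketch contrasts an absolute threshold with a relative one derived from the initial residual norm; the variable names are our own and are not taken from any solver library.

#include <cmath>
#include <vector>

// Euclidean norm of a vector.
double norm(const std::vector<double>& v) {
    double sum = 0.0;
    for (double x : v) sum += x * x;
    return std::sqrt(sum);
}

// Absolute exit condition: stop when ||r|| drops below a fixed constant.
bool converged_absolute(const std::vector<double>& r, double abs_tol) {
    return norm(r) < abs_tol;
}

// Relative exit condition: stop when ||r|| has improved by a desired factor
// with respect to the initial residual norm ||r0||.
bool converged_relative(const std::vector<double>& r, double norm_r0, double reduction) {
    return norm(r) < reduction * norm_r0;   // e.g., reduction = 1e-3
}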

Ignoring for now the preconditioner, what remains in the BiCGStab solver is a series of vector operations and SpMVs, as well as a few scalar operations. From profiling the solver, it is clear that the majority of its computational time is spent on performing SpMV and SpMV-like computations. Therefore, the first goal in designing a performant solver kernel is to perform the SpMV efficiently.

Input: x and b are N-element arrays; A is an N×N sparse matrix; tolerance is the relative convergence factor
Output: x and r, both N-element arrays

1:  r ← b − A·x
2:  r̂ ← r
3:  p ← 0;  v ← 0
4:  ρ_old ← 1;  α ← 1;  ω ← 1
5:  ρ ← dot(r̂, r)
6:  threshold ← tolerance² · ρ
7:  while dot(r, r) > threshold do
8:      β ← (ρ / ρ_old) · (α / ω)
9:      p ← r + β·(p − ω·v)
10:     p̂ ← preconditioner(p)
11:     v ← A·p̂
12:     α ← ρ / dot(r̂, v)
13:     ρ_old ← ρ
14:     x ← x + α·p̂
15:     s ← r − α·v
16:     norm_s ← dot(s, s)
17:     ŝ ← preconditioner(s)
18:     t ← A·ŝ
19:     ω ← dot(t, s) / dot(t, t)
20:     x ← x + ω·ŝ
21:     r ← s − ω·t
22:     ρ ← dot(r̂, r)
23:     if norm_s < threshold then
24:         break
25:     end if
26: end while
Algorithm 1 Pseudo-code of the preconditioned BiCGStab solver

II-B Sparse Matrix-Vector Multiplication

The SpMV is a widely used, rather elementary operation in which a sparse matrix is multiplied by a dense vector. How this operation is performed algorithmically depends on how the matrix is stored. A sparse matrix consists of many more zeroes than non-zero values, which means that storing all of them is not efficient. There is a variety of formats in which only the non-zero values of a sparse matrix are stored. Among these formats, Compressed Sparse Column (CSC) and Compressed Sparse Row (CSR) are two of the most commonly used. Both store matrices efficiently without putting any restrictions on the content of the matrix, whereas other formats may require the matrix to have certain sparsity-pattern features, such as symmetry or values being repeated multiple times, or may store additional non-zeroes, e.g., sliced ELLPACK [16], in order to be efficient; CSR and CSC do not have such restrictions. In this paper, we focus only on the CSR representation, as the column-major CSC format would be more difficult to use during the application of the ILU0 preconditioner, because ILU0 goes through the matrix row-by-row (see section II-C).

The CSR format consists of three arrays: an array with all non-zero values of the matrix stored row-major, an array with the column indices of those non-zero values, and an array of row pointers, which contains an index for each row identifying which non-zero value is the first value of that row, plus one additional entry containing the total number of non-zeroes stored. The pseudo-code showing how the SpMV is performed on a matrix stored in the CSR format is given in algorithm 2. It is worth noting that there is a high degree of parallelism available in this SpMV. The simple explanation is that the calculations for all rows can be done in parallel, because the result of one row is never used during the calculation of another row. Nevertheless, this is true only in theory, because the irregular memory accesses in the x vector, which depend on the column index representing the location of the non-zero value in the system matrix, restrict in practice the amount of parallelism that can be exploited efficiently. The main bottleneck is the requirement to replicate the x vector to enable multiple read units to feed independent row processing units. This replication is impractical for realistic large use cases for two main reasons: 1) limited on-chip (i.e., cache) memory and 2) significant preprocessing computations required to align the data read and fed into the multiple processing units, which is needed to exploit the maximum available bandwidth in state-of-the-art accelerators with HBM.

Input: nnzValues, colIndices, rowPointers, x
nnzValues is an M-element array of floating point values
colIndices is an M-element array of integers; rowPointers is an N+1-element array of integers
x is an N-element array of floating point values
Output: y is an N-element array of floating point values

1: for i = 0 to N−1 do
2:     y[i] ← 0
3:     for j = rowPointers[i] to rowPointers[i+1] − 1 do
4:         y[i] ← y[i] + nnzValues[j] · x[colIndices[j]]
5:     end for
6: end for
Algorithm 2 Pseudo-code of a CSR SpMV
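As a software reference for Algorithm 2, the sketch below implements the CSR SpMV in plain C++; the array names mirror the CSR description above and are illustrative rather than the kernel's actual interface.

#include <vector>

struct CsrMatrix {
    int N;                           // number of rows
    std::vector<double> nnzValues;   // M non-zero values, stored row-major
    std::vector<int>    colIndices;  // column index of each non-zero (M entries)
    std::vector<int>    rowPointers; // N+1 entries; rowPointers[N] == M
};

void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    for (int i = 0; i < A.N; ++i) {
        double sum = 0.0;
        for (int j = A.rowPointers[i]; j < A.rowPointers[i + 1]; ++j)
            sum += A.nnzValues[j] * x[A.colIndices[j]];
        y[i] = sum;                  // rows are independent, hence the available parallelism
    }
}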

II-C The ILU0 Preconditioner

How hard a series of linear equations is to solve depends on various aspects of those equations, or of the matrix that represents them. In other words, how many non-zeroes the matrix has, where they are located in the matrix, and their values in relation to one another, among other things, all play a role. The condition number of a matrix A, which is the ratio of the biggest eigenvalue of the matrix to the smallest one, is a measure of how relative changes in the right-hand side b of a linear equation Ax = b result in relative changes of the solution x. In addition, it is a measure of how changes in A affect the solution and how numerical errors affect the solution. Often, convergence of iterative methods is worse for matrices with high condition numbers. Therefore, while iterative solvers such as BiCGStab are a powerful method for solving sparse matrix problems, they might still take many iterations before they approach the result within the desired precision when solving matrices with a high condition number.

One technique for reducing the number of iterations needed by BiCGStab and other Krylov methods is utilizing a preconditioner. Generally, a preconditioner is a transformation or matrix M that approximates A, but is less computationally expensive to solve than A. If M is a matrix, then the preconditioned method solves M⁻¹Ax = M⁻¹b. Note that if M is a good approximation of A, then this system will have a much better condition number than the original one and hence the iterative method will converge faster. A simple example of a preconditioner is the Jacobi method with M = diag(A), the diagonal values of A. Many different types of preconditioners exist, from very simple ones that only very weakly approximate the problem's solution, to complex ones that almost solve the system by themselves. Unfortunately, how well a preconditioner performs depends not only on the type of preconditioner, but also on the matrix being solved. For example, the Jacobi preconditioner is fairly accurate if the original matrix has most of its non-zeroes clustered around the diagonal, and if the values on the diagonal have a higher magnitude than those off the diagonal, but it does not work as well for matrices that do not have those characteristics.
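For illustration, applying the Jacobi preconditioner mentioned above reduces to an element-wise division by the diagonal of A; the following minimal C++ sketch shows that application (names are ours, not from any library).

#include <vector>

// z = M^-1 * r with M = diag(A); diagA holds the diagonal entries of A.
void apply_jacobi(const std::vector<double>& diagA,
                  const std::vector<double>& r,
                  std::vector<double>& z)
{
    for (std::size_t i = 0; i < r.size(); ++i)
        z[i] = r[i] / diagA[i];
}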

In this paper, we utilize the zero fill-in incomplete LU factorization (ILU0) as preconditioner, which falls somewhere in the middle: it approximates the system reasonably well while not being too computationally intensive to apply. Using ILU0 consists of two steps: first, before the solver is started, an incomplete LU decomposition is performed. To understand what this is, we must first introduce the LU decomposition. LU decomposition is a direct solving method that calculates an upper-triangular matrix U and a lower-triangular matrix L so that A = LU. Such a decomposition can be done using Gaussian elimination, among other methods. Once such a decomposition has been found, Ax = b can be solved for x by first solving Ly = b for y and then solving Ux = y for x. Since L and U are triangular matrices, these two solves can be done using forward and backward substitution, respectively. However, as we discussed in the previous section, direct solving methods become very computationally intensive for sparse matrices because they start filling in the zero values of the matrix, and LU decomposition is no exception. Incomplete LU(N) decomposition decreases the high computational cost of LU decomposition for sparse matrices by reducing the amount of fill-in that happens, where N is an indicator of how much fill-in is allowed. The case where N = 0 (ILU0) means that the decomposition generates non-zero values in L and U only at the indices where A had non-zero values.
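The no-fill-in rule is easiest to see on a dense array with an explicit sparsity mask, as in the illustrative sketch below; an actual implementation (e.g., the ones in DUNE or OPM) works directly on the sparse CSR structure, but the update rule is the same. The interface here is an assumption made for the example.

#include <vector>

// In-place ILU0 on a dense array A, restricted to the original sparsity
// pattern nz: updates are only applied where A already had a non-zero,
// so no fill-in is created.
void ilu0_dense(std::vector<std::vector<double>>& A,
                const std::vector<std::vector<bool>>& nz)
{
    const int n = static_cast<int>(A.size());
    for (int k = 0; k < n; ++k) {
        for (int i = k + 1; i < n; ++i) {
            if (!nz[i][k]) continue;              // structural zero: skip
            A[i][k] /= A[k][k];                   // multiplier, stored in the L part
            for (int j = k + 1; j < n; ++j)
                if (nz[i][j] && nz[k][j])         // update only existing non-zeroes
                    A[i][j] -= A[i][k] * A[k][j];
        }
    }
    // Afterwards, the strictly lower part of A holds L (unit diagonal implied)
    // and the upper part including the diagonal holds U.
}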

Once the ILU0 decomposition is complete, the solver can start. The solver then applies the ILU0 preconditioner in every iteration. Pseudo-code of the ILU0 application is shown in algorithm 3. The ILU0 application consists of two parts, represented by the two for loops starting in lines 1 and 11: a forward substitution on the lower-triangular part of the LU matrix, and a backward substitution on the upper-triangular part of that matrix. Both loops are very similar in the operations they perform; the main differences are that the forward substitution traverses the matrix top-to-bottom, while the backward substitution traverses the matrix bottom-to-top, and that the backward substitution concludes by dividing the result array by the values on the diagonal of the LU matrix. It is worth noting that while these loops ostensibly have the same basic structure as the SpMV loop, there is one major difference: the results of one row depend on rows before it, which means the operations on the rows cannot all be performed in parallel. See section II-D for more on how this issue can be circumvented somewhat by reordering the matrix.

Input: LUmat, p
LUmat is an N×N sparse matrix in the CSR format (arrays LUnnzs, LUcolInds, LURowPointers)
p is an N-element array of floating point values
Output: x is an N-element array of floating point values

1:  for i = 0 to N−1 do
2:      x[i] ← p[i]
3:      for j = LURowPointers[i] to LURowPointers[i+1] − 1 do
4:          col ← LUcolInds[j]
5:          if col ≥ i then
6:              break
7:          end if
8:          x[i] ← x[i] − LUnnzs[j] · x[col]
9:      end for
10: end for
11: for i = N−1 down to 0 do
12:     diagIdx ← LURowPointers[i]
13:     for j = LURowPointers[i+1] − 1 down to LURowPointers[i] do
14:         col ← LUcolInds[j]
15:         if col ≤ i then
16:             diagIdx ← j
17:             break
18:         end if
19:         x[i] ← x[i] − LUnnzs[j] · x[col]
20:     end for
21:     x[i] ← x[i] / LUnnzs[diagIdx]
22: end for
Algorithm 3 Pseudo-code of applying the ILU0 preconditioner
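A plain C++ rendering of the two substitutions is sketched below; it assumes the L and U factors are stored together in one CSR matrix with column indices sorted per row, the diagonal belonging to U and the L part having an implicit unit diagonal. This is our reading of the listing above, not the kernel code itself.

#include <vector>

void apply_ilu0(const std::vector<double>& LUnnzs,
                const std::vector<int>& LUcolInds,
                const std::vector<int>& LURowPointers,
                const std::vector<double>& p,   // right-hand side
                std::vector<double>& x)         // result
{
    const int N = static_cast<int>(p.size());
    // Forward substitution: solve L*y = p (y overwrites x).
    for (int i = 0; i < N; ++i) {
        x[i] = p[i];
        for (int j = LURowPointers[i]; j < LURowPointers[i + 1]; ++j) {
            int col = LUcolInds[j];
            if (col >= i) break;                // only the strictly lower part is used
            x[i] -= LUnnzs[j] * x[col];
        }
    }
    // Backward substitution: solve U*x = y, traversing the matrix bottom-to-top.
    for (int i = N - 1; i >= 0; --i) {
        double diag = 1.0;
        for (int j = LURowPointers[i + 1] - 1; j >= LURowPointers[i]; --j) {
            int col = LUcolInds[j];
            if (col > i)        x[i] -= LUnnzs[j] * x[col];
            else if (col == i)  { diag = LUnnzs[j]; break; }   // reached the diagonal
            else                break;
        }
        x[i] /= diag;
    }
}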

To quantify the effect of the ILU0 preconditioner in practice on our solver performance, we ran software implementations of the solver without preconditioner and with the Jacobi and ILU0 preconditioners on several benchmark matrices. The matrices were obtained from the SuiteSparse matrix collection [24, 9].

Precond.    |    None          |    Jacobi        |    ILU0
Matrix      | loose    strict  | loose    strict  | loose    strict
Hummocky    |  5.9      925    |  1.2     125.8   |  3.1      32.3
bcsstk25    |  5.8     3396    |  3.7      34.9   | 18.9    4331
bodyy6      |  4.4      224    |  1.2      86.3   |  3.7      15.7
wathen120   |  8.0      112    |  5.2      25.2   | 12.8      18.7
gridgena    |  2.0      403    |  5.9     643.3   | 18.7     324.7
qa8fm       | 12.7      114    | 13.1      79.7   | 44.8      60.3
TABLE I: CPU solver time in ms under the loose and strict relative exit conditions

Table I shows the results of these tests, which were obtained by running the preconditioned solver single-threaded on an Intel i7-9700 CPU with 16 GB of RAM. Two exit conditions were used: the stricter exit condition, requiring the norm of the residual to be reduced by a larger factor relative to its initial value, and the looser exit condition, requiring a smaller reduction. Not shown in the table is the number of iterations in which the solvers converged, which for all matrices except bcsstk25 was lower when ILU0 preconditioning was used than with the other two preconditioning options. However, a lower iteration count does not directly result in a higher solver performance. The first reason for this is the time the pre-processing of the preconditioner takes, which accounts for between 60% and 90% of the time taken by the lower-accuracy solver with the ILU0 preconditioner. The Jacobi preconditioner has barely any pre-processing time, and is powerful enough to be preferred over no preconditioner for all but one matrix, and preferred over ILU0 for all matrices when the solver's precision is low. Only when the solver's precision was set higher did the ILU0 preconditioner's power show: the number of iterations required to reach this accuracy was high enough that the pre-processing time no longer dominated the run time. For most matrices, the ILU0 preconditioner reduced the iteration count enough that the solve performance was better than that of the other preconditioner options tested. The bcsstk25 matrix is an exception, as its solve was only accelerated by the Jacobi preconditioner. We suspect the matrix's narrow banded structure and high values on the diagonal make it particularly well estimated by the Jacobi preconditioner. The second reason why a lower iteration count does not translate directly into higher solver performance is not captured in our software tests: different preconditioners might be easier or harder to parallelize, making them more or less attractive to accelerate in hardware. The single vector calculation that constitutes applying the Jacobi preconditioner is easy to parallelize, and therefore attractive for hardware acceleration. The ILU0 application, on the other hand, is not that inherently parallel, but can be made more parallel by using graph coloring.

II-D Matrix Reordering

As covered in section II-C, the ILU0 application function does not inherently have a lot of parallelism for a hardware accelerator to exploit. However, by reordering the matrix, some parallelism in this function can be obtained. The goal of such a reordering is to group rows together that do not depend on one another (i.e., that do not have non-zero values in columns whose indices equal the row indices of other rows in the group). During our work, we have focused on two ways of doing such a reordering: level scheduling and graph coloring.

II-D1 Level Scheduling

Level scheduling (LS) is a way of reordering a matrix by moving both the rows and the columns of the matrix. That is, the rows are grouped together into levels that can be operated on in parallel, and then the columns are reordered in the same way to keep the matrix consistent in the (re)ordering of the unknowns, which is required during ILU decomposition. LS does not only group independent rows together into 'levels', but also keeps the order of dependencies of the original matrix intact [22]. In other words, if a certain row j that depends on row i comes after row i in the original order, then after level scheduling, row j' (the row to which row j was moved during level scheduling) will still come after row i' for all possible values of i and j. A level-scheduling algorithm will go through the matrix and gather all rows that do not depend on any other rows into the first level. Then it will go through the matrix again, and group all rows that only depend on rows in level 0 into level 1. Then it will fill level 2 with rows only dependent on rows from levels 0 and 1, and so on until every row has been added to a level.
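A compact sketch of the level computation is given below: each row's level is one more than the highest level among the earlier rows it depends on, so rows sharing a level are mutually independent. The function signature and the use of the CSR arrays from Algorithm 2 are our own choices for the example.

#include <algorithm>
#include <vector>

// Assigns a level to every row based on its lower-triangular dependencies.
// Rows are afterwards grouped (and reordered) by ascending level.
std::vector<int> level_schedule(const std::vector<int>& colIndices,
                                const std::vector<int>& rowPointers, int N)
{
    std::vector<int> level(N, 0);
    for (int i = 0; i < N; ++i)
        for (int j = rowPointers[i]; j < rowPointers[i + 1]; ++j) {
            int col = colIndices[j];
            if (col < i)                                   // dependency on an earlier row
                level[i] = std::max(level[i], level[col] + 1);
        }
    return level;
}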

II-D2 Graph Coloring

Graph Coloring (GC) is a term that we use to describe the reordering of matrices into groups ('colors') of independent rows that does not require maintaining the order of dependencies of the matrix. To be more specific, we use the Jones-Plassmann algorithm [15] to do graph coloring. This algorithm assigns a random value to every row in the matrix, then groups into a single color all rows that have a higher number than all not-yet-grouped rows they depend on, and keeps repeating this process until all rows have been assigned a color. In this work, graph coloring was chosen over the other method because it is parallelizable and it allows more control over the maximum number of rows and columns in a color, which alleviates some of the inherent hardware design limitations. The disadvantage of graph coloring is that, because of its random nature, it is not guaranteed to find the minimum number of colors, and thus the highest amount of parallelism.
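The sketch below is one possible, simplified rendering of the Jones-Plassmann idea in C++; the adjacency representation, random seed, and tie-breaking details are our own assumptions, not the implementation used in this work.

#include <random>
#include <vector>

// neighbours[i] lists the rows that row i shares a non-zero with (symmetric adjacency).
std::vector<int> jones_plassmann(const std::vector<std::vector<int>>& neighbours)
{
    const int N = static_cast<int>(neighbours.size());
    std::vector<double> weight(N);
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for (auto& w : weight) w = dist(rng);             // random value per row

    std::vector<int> colour(N, -1);
    int current = 0, coloured = 0;
    while (coloured < N) {
        std::vector<int> selected;
        for (int i = 0; i < N; ++i) {
            if (colour[i] != -1) continue;             // already grouped
            bool localMax = true;
            for (int nb : neighbours[i])
                if (colour[nb] == -1 && weight[nb] >= weight[i]) { localMax = false; break; }
            if (localMax) selected.push_back(i);       // larger than all uncoloured neighbours
        }
        for (int i : selected) colour[i] = current;    // assign after the scan so rows in the
        coloured += static_cast<int>(selected.size()); // same colour stay mutually independent
        ++current;
    }
    return colour;
}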

II-D3 Reordering Scheme Selection

Selecting one of the two matrix reordering methods depends on how important it is to obtain a solution that is as close as possible to the one obtained without reordering. If this is very important, then level scheduling is the preferred reordering scheme. However, when it is less important, because an iterative solver is already estimating the result instead of solving the system precisely, graph coloring can be used as long as the result is close enough to the actual solution.

II-E Sparstitioning

To increase the scalability of our design and allow it to work on larger matrices, we use sparstitioning [23]. Sparstitioning is a method for splitting both the matrix and the vector of a matrix-vector operation into partitions, so that no unnecessary data is read while the operation runs on each partition. However, we make two changes to the sparstitioning algorithm. Firstly, we modify how the partitions are chosen, to conform with the matrix reordering scheme: every color/level becomes one partition. Secondly, we modify how the vector partitions are stored. When sparstitioning is used for an SpMV, the vector that is operated on is stored in a partitioned format, which introduces some duplication of the vector data. However, storing the vector in this way is not convenient because the non-partitioned vector is needed for the vector operations of the solver. Therefore, instead of storing the vector partitions themselves, our design stores an array that denotes, for every partition, which indices of the vector are part of that partition. Consequently, only those values are read from the vector when computations start on that partition.
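As an illustration of the second modification, the following sketch builds, for every colour, the list of vector indices that colour's rows touch; the names and the exact data layout are assumptions made for the example only.

#include <algorithm>
#include <vector>

// rowsPerColour[c] lists the matrix rows assigned to colour c.
// Returns, per colour, the sorted unique column indices (i.e. the x entries)
// that must be read before operating on that colour.
std::vector<std::vector<int>> partition_indices(const std::vector<int>& colIndices,
                                                const std::vector<int>& rowPointers,
                                                const std::vector<std::vector<int>>& rowsPerColour)
{
    std::vector<std::vector<int>> partIdx(rowsPerColour.size());
    for (std::size_t c = 0; c < rowsPerColour.size(); ++c) {
        for (int row : rowsPerColour[c])
            for (int j = rowPointers[row]; j < rowPointers[row + 1]; ++j)
                partIdx[c].push_back(colIndices[j]);
        std::sort(partIdx[c].begin(), partIdx[c].end());
        partIdx[c].erase(std::unique(partIdx[c].begin(), partIdx[c].end()), partIdx[c].end());
    }
    return partIdx;
}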

Therefore, because of the matrix reordering and sparstitioning optimizations required to design an HPC iterative solver, we modify the standard CSR matrix storage format in two ways. Firstly, the number of non-zeroes, rows, and vector partition indices used in each partition of the matrix needs to be known. These will be collectively referred to as the sizes of each color. Secondly, the vector partition indices themselves are needed, so that the solver knows which values of the vector are accessed by each color. To account for these CSR modifications, we introduce our own format, CSRO, which we formally introduce in section IV-B.

III Related Works

III-A Sparse Matrix-Vector Multiplication on FPGAs

As described in section II-A, in this paper we focus on CG-type solvers, and in particular on the more general variant, BiCGStab (i.e., not restricted to symmetric matrices as CG is). Previous research into accelerating (parts of) CG solvers using FPGAs has until now focused mostly on accelerating the SpMV kernel, for one good reason: the SpMV function has a wide range of uses in a broad variety of domains, and because of its irregular memory accesses, it is not performed efficiently by classical von-Neumann architectures like CPUs. This makes SpMV a perfect candidate to be accelerated using customizable dataflow architectures that can fully exploit application-specific configurable memory hierarchies. Furthermore, in the context of linear algebra iterative solvers, SpMV takes up most of the time. Consequently, a plethora of previous work addressed the topic of accelerating SpMV on FPGAs. Therefore, we only list a few of the performant designs without trying to be exhaustive.

  • The SpMV unit proposed by Zhuo and Prasanna [29] achieves a performance of between 350 MFLOPS and 2.3 GFLOPS, depending on the number of processing units, the input matrix, the frequency, and the bandwidth. However, to realize their design, the authors use zero-padding and column-wise blocking of the matrix. These design choices make the SpMV less efficient due to storage overhead that becomes critical for large matrices, and very inefficient to use in combination with an ILU0 preconditioner.

  • Fowers et al. [13] implemented an SpMV unit that uses a novel sparse matrix encoding, designed with the fact that the matrix will be divided over multiple processing units in mind, and a banked vector buffer that allows many simultaneous memory accesses to the input vector. That implementation reaches a performance of up to 3.9 GFLOPS, but does not actually outperform GPU or CPU implementations for large matrices.

  • Dorrance et al. obtain a performance of up to 19.2 GFLOPS [10] for their CSC-based SpMV accelerator, which is, as far as we could find, the highest performance of any FPGA-based implementation of SpMV. This high performance is thanks to the high computational efficiency of the implementation, paired with its 64 processing units and high bandwidth (36 GB/s). However, as can be seen in algorithm 3, when applying the ILU0 preconditioner, the result of one row is needed to calculate the results of the rows after it. So, unlike the SpMV, the apply_ILU0 cannot be done as easily with CSC as it can with CSR. In fact, CSC is completely unsuitable to use in conjunction with ILU0, and as a result we will not use that storage format in this paper.

Please note that the scope of the paper is not only to design the fastest SpMV unit possible, but also to integrate it in a larger and more complex application, namely a preconditioned linear solver. Consequently, some of the design choices might be sub-optimal for the fastest possible SpMV in isolation (e.g., using a CSR-like format instead of the more parallelizable CSC format); instead, we focused on the complete integration and looked for the best SpMV design options from a system-level view, and not for SpMV in isolation as most previous work does. Nevertheless, the SpMV designed and implemented in this work is on par with the best SpMV solution we could find in the literature when we consider normalized resources and bandwidth as comparison parameters. Our SpMV is able to achieve up to 4.5 and 8.5 GFLOPS in double and single precision, respectively.

III-B Sparse Matrix CG Solvers on FPGAs

Research into designing a complete CG solver on an FPGA has been done as well. However, probably due to the complexity of iterative solver algorithms coupled with the limited area, small on-chip memory, and lack of HBM on previous-generation FPGAs, we could find only a handful of related works on solvers for reconfigurable hardware.

  • Morris et al. [17] implement a CG solver on a dual-FPGA system using both an HLL-to-HDL compiler and custom floating point units written in VHDL. Despite the fact that using the HLL-to-HDL compiler is probably less optimized than writing VHDL directly, they obtain a speed-up over software of 30%, showcasing the potential of FPGAs in accelerating sparse CG solvers.

  • Wu et al. [28] implemented a CG solver on an FPGA consisting of an efficient SpMV unit, a vector update unit, and a diagonal preconditioner. They obtain a speedup of between 4 and 9 times over their software implementation.

  • Chow et al. [8] introduce a novel way of statically scheduling accesses to the on-chip memory in a CG solver on an FPGA. Despite the initial cost of performing this scheduling, an average speedup of 4.4 over a CPU solver, and of 3.6 over a GPU one, is achieved.

However, different from the work in this paper, all of this previous research focused on small-scale matrices that are not sufficient for a real-world HPC application setting. Furthermore, another thing that all of these implementations share is that no preconditioner, or only a very simple one, has been used. Preconditioners can improve the performance of a solver significantly, as they greatly reduce the number of iterations that the solver needs to run until it reaches its desired precision. Despite costing additional time per iteration, the ILU0 preconditioner used in this work improves the performance of all kinds of CG solvers over cases without preconditioners or with the diagonal preconditioner [25]. Consequently, not using a preconditioner in an iterative solver for a real-world HPC application that involves very large and sparse matrices is not an efficient approach. Finally, using the CG solver limits the applicability of the design to only problems that use symmetric matrices to describe the physical system.

IV SpMV Multiplication

IV-A Design Objective

The main objective of our SpMV kernel on the FPGA is to ensure that the unit has a high computational efficiency, which is required to fully exploit the HBM available on state-of-the-art FPGA generations such as the Xilinx Alveo U280 [3] data center accelerator card. The remainder of this section highlights the design choices of the SpMV implementation for the HPC domain when using the latest HBM-enhanced FPGA accelerators.

IV-B The Matrix Format

As described in section III-A, we have chosen to base our kernel on the row-major CSR matrix storage format. Furthermore, CSR is also the default format used in Flow, the real-world HPC software application that we will use in section VII to validate our results and benchmark the performance. However, this format poses a challenge for any HPC FPGA-based design, because CSR contains one integer per matrix row to describe how many values are in each row. Since our design objective is to maximize the throughput of the SpMV unit, we want to feed as many non-zero values and column indices into the SpMV unit as it can sustain on every clock cycle. Unfortunately, when using CSR's rowPointers it is not straightforward to determine to which row each value belongs. The number of rows that a given number of non-zero values belong to can vary wildly based on the matrix's sparsity pattern. Consequently, the number of rowPointers that need to be read every cycle also varies, and additionally depends on the values of the rowPointers read the cycle before. For this reason, we introduce a novel sparse matrix storage format based on CSR to be used by our accelerator.

Similar to CSR, this new format contains the matrix's non-zero values and their column indices as two arrays, but instead of the rowPointers, it contains a new row offset for every non-zero value in the matrix. Because of the added offset array, this new format is tentatively called the Compressed Sparse Row-Offsets (CSRO) format. A value in the newRowOffset array is 0 if the non-zero value at that same index is in the same row as the non-zero value before it. If a non-zero value is not in the same row as the value before it, then the rowOffset at the same index is equal to 1 plus the number of empty rows between that non-zero value's row and its predecessor's row. To help illustrate how a matrix is stored in the CSRO format, figure 2 shows a matrix as it would be represented in the CSRO format. For comparison, figure 1 shows how that same matrix would be represented in the CSR format. Note that the newRowOffset value of the first value in the CSRO format is 1. This is a convention adopted to show that the first row starts with the first value.

Fig. 1: An Example Matrix Stored in the CSR Format.
Fig. 2: An Example Matrix Stored in the Novel CSRO Format.
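Deriving the newRowOffset array from the CSR rowPointers follows directly from the rules above; the sketch below is an illustrative host-side conversion with our own naming, not the preprocessing code shipped with the kernel.

#include <vector>

// Builds the CSRO newRowOffset array: 0 for a value in the same row as its
// predecessor, otherwise 1 plus the number of empty rows skipped; by convention
// the very first value gets offset 1.
std::vector<int> build_new_row_offsets(const std::vector<int>& rowPointers, int N)
{
    const int M = rowPointers[N];                // total number of non-zeroes
    std::vector<int> newRowOffset(M, 0);
    int prevRow = -1;                            // "row before the first row"
    for (int row = 0; row < N; ++row) {
        if (rowPointers[row] == rowPointers[row + 1]) continue;  // empty row
        int first = rowPointers[row];            // first non-zero of this row
        newRowOffset[first] = row - prevRow;     // 1 + number of empty rows skipped
        prevRow = row;
    }
    return newRowOffset;
}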

IV-B1 The Blocked CSR

The OPM platform and its Flow simulator do not use regular CSR, but rather blocked CSR, in which each non-zero value in the matrix is not one value, but a 3x3 block of values. This is done to group together all relevant physical information about a single point of the modelled reservoir, but that is not relevant for this paper. What is relevant is that the nine values in the 3x3 blocks are not easily used on an FPGA. Not only do these blocks often contain zeroes (about half of all values in the blocks of the Norne test case are zeroes), but the odd number of values in a block poses a problem when reading them over the ports of the FPGA or storing them in its internal memories, all of which are built around an even number of values. Since the new matrix format that we introduce and the sparstitioning already require extensive reordering of the non-zero values of the matrix, we take this opportunity to also remove the blocking from the matrices in those steps. As a result, the FPGA solver kernel primarily works on non-blocked matrices. The only exception is the division at the end of the backward substitution step in the apply_ILU0. Here, a division by the value on the diagonal needs to be done, and this value on the diagonal is a block. These diagonal blocks are the only blocks that could not be removed, and as a result the hardware kernel still uses those blocks as inputs.

IV-C Design Overview

IV-C1 The SpMV Pipeline

The upper limit of the throughput of our SpMV unit design is determined by the number of multipliers in the unit. Thus, to achieve a high computational efficiency, our design aims to stream values into those multipliers every cycle. Figure 3 shows how data is streamed into the multipliers: non-zero values and column indices are read from the memories that hold them, and the column indices are used as the addresses at which to read elements from the input vector, while the non-zero values are delayed by the latency of the vector memory. In this way, the non-zero matrix values and the corresponding vector elements arrive at the multiplier at the same time. There is one memory that holds the vector partition for every two multiplier units, because the BRAM blocks of an FPGA have at most two read ports. Which values of the vector are part of a certain partition is determined by the vector partition indices of that partition. Those values are calculated by the software during pre-processing. Each vector partition memory contains the entire vector partition, so the data is replicated and stored in each one. While not the most space-efficient solution, this does allow a high number of simultaneous memory accesses at different addresses, which is what the SpMV unit requires.

Fig. 3: A High-Level Schematic of the SpMV Unit.

After the multiplications have finished, the SpMV unit needs to add together the multiplication results that belong to the same matrix row to obtain the SpMV result of that row. For this, we designed a selective adder tree (SAT) and a reduce unit, which receive instructions about which values to add together from a control unit. This control unit in turn obtains this information from the new row offset data. The selective adder tree adds together, in a single cycle, all the values that belong to the same row, while the reduce unit adds together the SAT results from different clock cycles that belong to the same row. SAT results that already represent the final result of a row (i.e., if all the values of that row were in the same clock cycle) do not need to go into the reduce unit. Finally, a merge unit merges all of the results of the adder tree and reduce unit into a set number of output ports that are written to a result memory.
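The row-accumulation behaviour behind the SAT and reduce units can be mimicked in software as below; this is only a sequential model of the bookkeeping (the hardware performs the in-batch additions in a single cycle), and the interface is invented for the example.

#include <vector>

// Consumes one batch of products and the matching newRowOffset entries,
// appending one result per finished row. 'carry' holds the partial sum of the
// row that is still open across batches.
void accumulate_rows(const std::vector<double>& products,
                     const std::vector<int>& newRowOffset,
                     std::vector<double>& rowResults,
                     double& carry, bool& carryValid)
{
    for (std::size_t k = 0; k < products.size(); ++k) {
        if (newRowOffset[k] > 0) {                        // a new row starts here
            if (carryValid) rowResults.push_back(carry);  // previous row is complete
            for (int e = 1; e < newRowOffset[k]; ++e)
                rowResults.push_back(0.0);                // skipped empty rows give zero results
            carry = 0.0;
            carryValid = true;
        }
        carry += products[k];                             // reduce within the current row
    }
}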

IV-C2 SpMV Top Level

While the SpMV pipeline is the most important part of the SpMV unit, as it does the calculations, some additional units are needed to keep the pipeline running correctly. These units are as follows:

  • The external read unit handles all reads from outside of the FPGA chip. At the start of the SpMV run it reads the sizes of all colors, and uses these to read all matrix data that is needed during the SpMV.

  • For the SpMV unit to be able to use sparstitioning with vector partition indices, the vector values at those indices need to be readable quickly. To achieve this, our FPGA design contains a single large URAM memory to store the multiplicand vector once. With the size currently chosen for this memory in the design, up to 262144 values can be stored, meaning that a matrix may not have more than 262144 columns if it is to be multiplied by the kernel. The internal read unit coordinates reads from this memory. It receives the vector partition indices of the current color from the external read unit, reads the vector values at those indices, and sends those values into the vector partition memories of the SpMV pipeline.

  • The results of the SpMV pipeline can be out of order, as the time results spend in the reduce and merge units differs between them. The write unit stores the results of the SpMV pipeline in a memory block that has been cyclically partitioned, so that it can write multiple results into it and at the same time still read from it. It also keeps track of the index up to which all results have been received. Whenever it has received enough values to fill up a cache line, it queues up that line to be sent to the HBM memory.

An overview of the SpMV top-level unit is shown in figure 4. Not shown in this figure is the control logic around it that instructs the units how much data to read or write and to/from where. To do this, the control logic keeps track of which color it is currently operating on and how many still remain.

Fig. 4: Schematic SpMV Top-level

IV-C3 Non-partitioned SpMV

One of the tasks the SpMV top-level unit must perform is the reading of the vector partition data that sparstitioning adds to the matrix data. For a large part, this read operation can be done as a look-ahead, meaning that while the SpMV pipeline operates on one color, the vector partition data of the next color can already be read from the URAM. This data is then stored in a BRAM block in the internal read unit. However, after the SpMV pipeline completes, this vector partition data needs to be transferred to the vector partition memories of the SpMV pipeline, and the vector partition indices for the color after that need to be read from the off-chip memory. This is overhead added by our choice to use sparstitioning. To determine how much this overhead slows down the overall performance, we added the option to the design to not use sparstitioning. This version of the design has the downside of a more limited maximum matrix size that it can operate on, as the complete multiplicand vector needs to be stored multiple times in the vector memories of the SpMV pipeline. The advantage is that the additional steps of reading vector partition indices and transferring vector partition data no longer need to occur.

Fig. 5: Overview of the Tasks in the Sparstitioned SpMV Unit.
Fig. 6: Overview of the Tasks in the Non-Sparstitioned SpMV Unit.

To visualize the difference in operation between the sparstitioned and the non-sparstitioned SpMV unit, figure 5 shows the steps the sparstitioned SpMV unit takes, while figure 6 shows those that the non-sparstitioned SpMV unit takes. Tasks placed horizontally next to one another in these figures are performed at the same time. Obviously, the flow in figure 6 is much simpler. Also note how in figure 5 the reading of the next vector partition (and the reading of the indices to read from) is already done while the SpMV unit is still working on another partition. This hides the time these reads would normally take, but also necessitates an additional memory to read the next vector partition into, and a step in which the data is transferred from this memory into the SpMV unit.

IV-D GPU SpMV Kernels

For the GPU, kernels in CUDA and OpenCL are tested. Only doubles are used for the non-zero values. NVIDIA's cuSPARSE [1] features a blocked SpMV function called cusparseDbsrmv(). For OpenCL, algorithm 1 from [12] is used. In this algorithm, a warp of 32 threads is assigned to a single block row, and threads are assigned to elements in the block row. To cover the entire block row, a warp iterates, handling ⌊32/bs²⌋ blocks at a time, where bs is the block size. A warp only operates on complete blocks, so some threads might always be idle. For 3x3 blocks, a warp can cover 3 blocks with 27 threads, and has 5 inactive threads. During initialization, a thread calculates the index of the block row, how many blocks the row has, and which block it is supposed to operate on. It also calculates its position within a block (row and column). Since a warp only covers complete blocks, a thread's assigned position in a block never changes. While iterating, it multiplies the target element from A with the value from x, adding the product to a running total. Once the whole block row is complete, threads with the same vertical position (row) in the blocks reduce the values in their local registers and write the reduced output to global memory. The reduction can be done with shuffling or via shared memory.

V Sparse Matrix Solver Design

To expand the SpMV kernel into a kernel that can perform preconditioned BiCGStab, units to perform both the preconditioning and the vector operations need to be added. In this section, the design of those units is discussed, followed by a description of the complete design that links all units and performs the complete preconditioned solve.

V-A Applying the ILU0 Preconditioner

As shown in algorithms 2 and 3, the basic structures of the SpMV and of the forward and backward substitutions that the ILU0 application performs are similar. This, combined with the fact that the SpMV and the ILU0 application are never active at the same time (since the result of one is needed to start the other), makes re-using the hardware of the SpMV unit for the ILU0 apply a desirable feature to save resources. Unlike the SpMV, which goes through the matrix it operates on row-by-row, the ILU0 application goes through its LU matrix in a non-sequential order. First, it goes through all of the values below the diagonal top-to-bottom, and then it goes through all values above and including the diagonal bottom-to-top (last row first). During the forward substitution it skips the second half of the values of each row, since it only uses the lower-triangular part there. Then, it goes back up through the matrix using only the second half of the values, since it only needs the upper-triangular part during the backward substitution. Consequently, it needs to read all LU matrix data twice, ignoring about half of it each time; which data it needs to use and which it does not depends on the current row number and the column index. This memory access pattern is not specifically inefficient for the FPGA, but inefficient in general, so in our design the lower-triangular matrix L, the upper-triangular matrix U, and the diagonal values are split from the LU matrix and used as three separate data structures. First, L is read to perform the forward substitution, and then U and the diagonal values are read to perform the backward substitution. Since the latter step happens bottom-to-top, both the U matrix and the diagonal values are stored in reverse order.

The main difference between the substitutions and the SpMV is that the substitutions modify the vector that they are operating on, while the SpMV does not. All rows that are grouped into the same color by graph coloring can be in the pipeline at the same time, but between colors the results of the previous color need to be integrated into the vector that is currently being operated on. To achieve this, the results of the ILU0 application are not only written to the URAM vector memory, but also forwarded directly into the vector partition memories, so that they can be used during the next color without requiring the unit to wait with reading its vector partition until the previous color finishes its computation.

There are additional operations that the forward and backward substitutions need to perform beyond the SpMV. For the forward substitution, this is a subtraction of the SpMV unit result from the p vector; for the backward substitution, it is this subtraction followed by a division by the diagonal value. The ILU0 unit is added to the SpMV unit to perform these operations. The reading of the p vector can be done directly from the URAM, since that is where the p vector is stored during the ILU0 application. The diagonal values are read from off-chip memory by the external read unit. Figure 7 shows how the original SpMV top-level was modified to allow the unit to also apply the ILU0 preconditioning step. Note that this unit is not yet the top level of the complete solver.

Fig. 7: Schematic of the SpMV/ILU0 unit.

V-B The Vector Operations

Over the course of a BiCGSTAB run, a number of vector operations of three different kinds need to be applied:

  • A dot product, which calculates the inner product of two vectors.

  • An axpy, which performs y ← y + α·x.

  • A norm, which calculates the 2-norm of a vector (i.e., the square root of the dot product of the vector with itself).

Of these, the dot product and norm are very similar in structure, as a norm is the square root of the inner product of a vector with itself. All three operations are highly parallel, and perform the same basic operations (one multiplication and one addition) for every element of their input vectors. For this reason, and to provide more flexibility in exploiting the task-level parallelism in the solver, we designed a unit that can perform either an axpy or a dot product. This dot_axpy unit contains an equal number of multiply and add floating point units, which can be connected in different ways, depending on which function needs to be performed. When an axpy needs to be performed, the adders and multipliers are connected in parallel pipelines that each perform the axpy on one element of the input vectors at a time. For this function, an additional input port is used for the scaling constant α, besides the two input vector ports and one output vector port. Figure 8 gives an overview of how the adders and multipliers are arranged in the dot_axpy unit when it performs an axpy operation.

Fig. 8: Schematic Overview of dot_axpy Unit in axpy Mode.

For the dot products, all multipliers work in parallel, and their results are then added together into a single value by a tree of adders. The result of this adder tree is added to the results of previous cycles in a final adder, which adds each new input to its most recent output. After the adder tree has processed all values of the input vectors, logic around the final adder continuously feeds two subsequent valid outputs of that adder back into it, until one single final result is left. Figure 9 gives an overview of how the adders and multipliers are arranged in the dot_axpy unit when it performs a dot product.

Fig. 9: Schematic overview of dot_axpy unit in dot mode.
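A software analogue of the two dot_axpy modes, with the 8 lanes implied by the 512-bit port width discussed in section V-C, could look as follows; this is a behavioural sketch with names of our own, not the HLS code of the unit.

#include <vector>

constexpr int LANES = 8;   // 512-bit cache line = 8 double-precision values

// axpy mode: each lane updates one element of y per cycle.
void axpy(double alpha, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] += alpha * x[i];
}

// dot mode: lane-local running sums, followed by a final adder-tree style reduction.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double partial[LANES] = {0.0};
    for (std::size_t i = 0; i < a.size(); ++i)
        partial[i % LANES] += a[i] * b[i];
    double result = 0.0;
    for (int l = 0; l < LANES; ++l)
        result += partial[l];
    return result;
}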

During the vector operations, a degree of task-level parallelism exists in the BiCGStab solver algorithm. This parallelism can be found in the following operations (using line numbers from algorithm 1):

  • The vector subtraction in line 1 can be pipelined with the SpMV operation in that same line, so that the subtraction for each element of b happens as soon as the SpMV result in the same position as that element is done. The dot product in line 5 can be pipelined with the subtraction of line 1 in the same way.

  • The two axpy operations in line 9 can be pipelined.

  • The dot product in line 12 can be pipelined with the SpMV in line 11.

  • The axpy operations in lines 14 and 15 can be done in parallel, and the dot product in line 16 can be pipelined with the axpy in line 15.

  • The two dot products in line 19 can be done in parallel, and can both be pipelined with the SpMV in line 18.

  • The axpy operations in lines 20 and 21 can be done in parallel. The dot product in line 22 can be pipelined with the axpy in line 21, and so can the dot product in line 7 in the following iteration.

Together, these instances of possible task-level parallelism cover all vector operations that the BiCGStab performs. The number of vector operations that can be performed in parallel in each of these instances is, in order: 2, 2, 1, 3, 2, and 4. The complete vector unit contains multiple dot_axpy units to exploit this task-level parallelism, but to avoid having dot_axpy units sit idle too often, and to not increase the number of read/write ports too much (see section V-C), we chose to implement 2 dot_axpy units in the vector_ops unit.

V-C Read and Write Ports

The Xilinx Alveo U280 data center accelerator card has two different types of large memory connected to the FPGA chip: an off-chip DDR memory that consists of two 16 GB DDR4 memory banks, each of which is connected to the FPGA via a single read/write port, and an 8 GB HBM memory stack that is connected to the FPGA via 32 read/write ports. Each memory port is accessed at the kernel level either via a 512-bit wide AXI4 interface (for the DDR memory) or via a 256-bit wide AXI3 interface (for the HBM memory). This memory architecture has led to the following design decisions:

  • Reading and writing delays from/to the HBM are slightly shorter than those from/to the DDR memory, so all vectors (which all need to be read and written multiple times during the solver run) are stored in the HBM memory.

  • To reduce the number of memories to which the host CPU needs to transfer its data, and to thereby reduce the initial data transfer time, all matrix data and the initial B vector are stored on the DDR memory.

  • Some vector operations read from and write to the same vector, for example the axpy in line 14 of algorithm 1. To avoid conflicts and to make data buffering easier, each AXI port is used either to read or write, hence the x, r, and p vectors are stored in two memory locations connected to different ports. During an operation to update one of those vectors, it is read from the port that it was written to most recently, and written to the other port. This essentially makes use of the ping pong buffering technique used to stream data efficiently.

  • All HBM ports are located on one of the two narrow sides of the (rectangular) FPGA chip, where the HBM crossbar and memory stack are located, so, to avoid routing congestion, we limit the number of HBM ports. We chose to use six HBM ports, because this number of read and write ports can be used efficiently during the vector operations.

  • Since the width of the read ports is 512 bits, we design our computation pipelines to accept one full cache line every cycle. For the HBM ports, which are natively 256 bits wide, we let the implementation tools handle the AXI port width conversion. This means each dot_axpy unit has 8 double-precision multipliers and adders.
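The following minimal C++ sketch illustrates the ping-pong port selection assumed above for the x, r, and p vectors; the struct and member names are illustrative and are not taken from the actual kernel.

    // Each double-buffered vector has two copies on different memory ports;
    // an update reads from the most recently written copy and writes the other.
    struct PingPongVector {
        int port[2];        // the two ports holding the two copies of the vector
        int last_written;   // index (0 or 1) of the copy written most recently

        int read_port()  const { return port[last_written]; }
        int write_port() const { return port[1 - last_written]; }
        void commit()          { last_written = 1 - last_written; }  // after the update completes
    };

For an axpy that updates x in place, the kernel would stream x in from read_port(), stream the result out to write_port(), and commit() once the operation finishes, so the next operation automatically reads the freshest copy.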

V-D The Complete FPGA Solver

A top-level solver unit was designed to encompass both the SpMV/ILU0 unit and the vector_ops unit. This top-level unit instructs those units which tasks to perform and with which input/output vectors. It also regulates the access of both units to the internal URAM vector memory and handles the filling of this memory during the initialization step of the solver. Figure 10 gives a schematic overview of the top-level unit of the BiCGStab solver kernel. To avoid overly complicating the figure, the signals between the units and the memory are not pictured. Besides the units already covered in the previous sections, the figure also shows a block of floating-point operators and a memory for floating-point variables. The operators are used for operations that only need to be performed on a single value at a time, such as the square root in the norm calculation or the multiplications and division in line 8 of algorithm 1. The memory is used to store the results of these operations and contains the scalar coefficients of the algorithm (such as ρ, α, ω, and β) and the convergence threshold conv_treshold.

Fig. 10: Schematic Overview of the Complete Solver Design.

V-E The GPU Solvers

For completeness of the evaluation and comparison of our work, we also developed two GPU solvers. The first uses CUDA and NVIDIA's cuSPARSE library [1]; cuSPARSE provides functions for vector operations like dot, axpy, and norm, as well as a blocked ILU0 preconditioner. The second is implemented in OpenCL [18] and follows a structure similar to the CUDA version, defining kernels for the vector operations and for applying the ILU0 preconditioner. One notable difference is that, in the OpenCL version, the construction of the ILU0 preconditioner (the decomposition) is done on the CPU, since it has not been implemented on the GPU yet, whereas the CUDA version performs the ILU0 decomposition on the GPU via cuSPARSE.

VI Performance Modeling

To predict the performance of the hardware design while it was in development, as well as to gauge how well the design scales, we constructed a performance model that we run in Matlab. This model emulates the data flowing through the computation units and calculates how many clock cycles the execution would take. This section first describes which parts of the design are modeled and which are not, and then presents the predicted modelling results.

VI-A Model Details

The performance model is composed of functions that model the individual hardware kernels: it has a function to estimate the performance of the SpMV, one for the ILU0, and one for the vector operations. To make the performance estimations as accurate as possible, the model functions work on the same data that is sent to the kernel, including the partitioning data. The exception is that, since the model does not actually perform the solve, it does not read the non-zero matrix values or any vector data; nevertheless, the partitioning and sparsity-pattern information is needed to calculate the performance. Furthermore, the functions that calculate the performance of the apply_ILU0 and SpMV operations share the same structure, in the same way that those operations share the same module in the hardware kernel. These functions do the following for every color:

  • First, it calculates the time to transfer the vector partition data into the SpMV unit. It does this by dividing the number of values in the partition by the number of ports allocated to the internal URAM memory from which the partition is read.

  • If apply_ILU0 is performed, the function calculates the time the transfer of the P vector into the ILU0 unit memory would take in the same manner.

  • Then, it calculates the time the SpMV pipeline takes. During this step, the function’s emphasis is on the time the reading of the matrix data takes. It assumes a set bandwidth for each of three ports from which the three matrix arrays (non-zero values, column indices and new row offsets) are read, and simulates reading into three FIFOs at the speed set by that bandwidth. Whenever enough data is available in the FIFOs for all three arrays to fill an input line of the SpMV pipeline, that data is removed from the FIFOs.

  • After enough data has been read from all three ports for the current color, and the simulated FIFOs are empty, a set delay is added for the SpMV pipeline to write back the result data. Since much less data needs to be written back than needed to be read, no delay beyond the regular write overhead of the final lines is deemed necessary.

  • If apply_ILU0 is performed, a set delay for the ILU0 unit is added, and the write delay is calculated with the on-chip URAM instead of an off-chip memory.

The cycle-count results of all colors are added together to find the total number of clock cycles that the SpMV or apply_ILU0 takes. The calculations done to estimate the cycle count of the vector operations are comparatively simpler, but follow the same method: first, the reading of data into FIFOs at a set bandwidth is modeled; data is removed from a FIFO whenever enough is in it to fill a line of inputs of the vector operation unit; and finally, when all reading is done, the latency of the operation's pipeline is added, as well as some write-overhead cycles for the axpy operation.
A top-level function adds together the cycle-count results of apply_ILU0, SpMV, and the vector operations to arrive at a final estimate. This top-level function has a number of variables that can be changed to perform domain space exploration and to assess the scalability of the kernel. These are listed below; a small sketch of how they combine is given after the list.

  • The number of multipliers and adders in both the vector operation and SpMV/ILU0 units.

  • The bandwidth to the off-chip memory in GB/s.

  • The bandwidth to the on-chip URAM in number of ports (how many values of the vector stored on-chip can be read simultaneously).

  • The delay of the adder and multiplier units.

  • The matrix that is modeled, and how many iterations the solve of that matrix takes.
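As an illustration of how these inputs combine, the following C++ sketch estimates the per-color cycle count of the SpMV along the lines described above (vector-partition load, matrix-data streaming at a set bandwidth, and a fixed write-back delay). It is a stand-in for the Matlab model, not a port of it; all names, constants, and the byte count per non-zero are assumptions.

    #include <algorithm>
    #include <cstdint>

    struct ModelParams {
        double   ext_bandwidth_gbs;  // assumed off-chip read bandwidth, GB/s
        double   clock_mhz;          // kernel clock frequency, MHz
        unsigned uram_ports;         // vector values readable from URAM per cycle
        unsigned lane_width;         // non-zeros accepted by the SpMV pipeline per cycle
        unsigned writeback_delay;    // fixed write-back overhead, cycles
    };

    // Estimated cycles for one color of the SpMV.
    std::uint64_t spmv_color_cycles(std::uint64_t partition_values,
                                    std::uint64_t nnz_in_color,
                                    std::uint64_t bytes_per_nnz,
                                    const ModelParams& p) {
        // Cycles to load the vector partition into the internal URAM memory.
        std::uint64_t load = (partition_values + p.uram_ports - 1) / p.uram_ports;

        // Cycles to stream the matrix arrays through the FIFOs at the set bandwidth...
        double bytes_per_cycle = (p.ext_bandwidth_gbs * 1e9) / (p.clock_mhz * 1e6);
        auto   read = static_cast<std::uint64_t>(
                          (nnz_in_color * bytes_per_nnz) / bytes_per_cycle);

        // ...but never fewer than the cycles the pipeline needs to consume them.
        std::uint64_t compute = (nnz_in_color + p.lane_width - 1) / p.lane_width;

        return load + std::max(read, compute) + p.writeback_delay;
    }

Summing this estimate over all colors, and adding the vector-operation estimates, gives the per-iteration cycle count that the top-level function scales by the number of iterations.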

Please note that the largest source of performance uncertainty in the design is the time the memory operations take. Memory accesses do not happen at a constant pace but in bursts, whose size and spacing may vary based on various factors related to the memory. As a result, the weakest part of the model is that it does not accurately capture this memory behaviour. Similarly, it does not support modeling the reading of two arrays of data from the same port, nor the performance penalties from reading and writing on the same memory bank at the same time.

Fig. 11: Domain Space Exploration for the ILU0-BiCGStab Solver. One internal port has a maximum bandwidth of around 18 GB/s at the current kernel frequency of 280 MHz. The configuration achieved in this paper for the FPGA solver was 2 internal ports, 8 PUs, and 50 GB/s external bandwidth. This corresponds to the point in the top-middle figure on the line estimating a 1.6x speedup, which was confirmed by the experimental results presented next.

VI-B Modelling Results

We use the model to analyze the scalability of the design and to understand what the critical points are in achieving the best performance. We perform a domain space exploration by varying the model parameters, starting with the number of adders and multipliers set to 2, the external bandwidth at 10 GB/s, and the number of internal ports at 2. When increasing a parameter no longer yields a performance improvement, we switch to increasing the parameter that has become the new bottleneck. We repeat this exercise, increasing all parameters up to maximum values that we select as practical upper bounds for our design. Please note that, even though the maximum theoretical bandwidth is 460 GB/s, due to the complexity of the design, which influences the mapping and routing of resources from/to the memory, we restrict the bandwidth parameter to 100 GB/s. Figure 11 shows the modelling results.
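A compact C++ sketch of this greedy procedure is shown below. The cost function is a deliberately crude stand-in (relative time limited by the slowest of compute, external reads, and internal vector reads), and all bounds and constants are illustrative assumptions rather than values from the Matlab model; the loop bumps whichever parameter currently yields the largest improvement, which in effect is the bottleneck parameter.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Config { int pus = 2; double ext_bw = 10.0; int ports = 2; };

    // Crude stand-in cost: relative time limited by the slowest resource.
    double est_time(const Config& c, double nnz) {
        double compute  = nnz / c.pus;             // processing units
        double ext_read = nnz * 12.0 / c.ext_bw;   // ~12 bytes per non-zero (assumed)
        double int_read = nnz / c.ports;           // internal URAM vector reads
        return std::max({compute, ext_read, int_read});
    }

    int main() {
        const double nnz = 2.8e6;                  // roughly the NORNE matrix size
        Config c;
        double t = est_time(c, nnz);
        for (;;) {
            // Candidate moves: bump one parameter at a time, within practical bounds.
            std::vector<Config> moves;
            if (c.pus    < 32)    { Config m = c; m.pus    *= 2;    moves.push_back(m); }
            if (c.ext_bw < 100.0) { Config m = c; m.ext_bw += 10.0; moves.push_back(m); }
            if (c.ports  < 16)    { Config m = c; m.ports  *= 2;    moves.push_back(m); }
            // Keep the move that helps most, i.e. the one relieving the bottleneck.
            Config best = c; double best_t = t;
            for (const Config& m : moves)
                if (est_time(m, nnz) < best_t) { best = m; best_t = est_time(m, nnz); }
            if (best_t >= t) break;                // no parameter improves performance
            c = best; t = best_t;
            std::printf("PUs=%d, BW=%.0f GB/s, ports=%d -> relative time %.2e\n",
                        c.pus, c.ext_bw, c.ports, t);
        }
        return 0;
    }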

Short name | Host CPU (#Total cores) | Frequency | Host memory (#Channels / Bandwidth) | Accelerator | Accelerator memory (Bandwidth)
FPGA | Xeon E5-2640V3 (16) | 2.6 GHz | DDR4-2133 (4 / 59 GB/s) | Xilinx Alveo U280 | DDR4+HBM (498 GB/s)
GPU | Xeon E5-2640V3 (16) | 2.6 GHz | DDR4-2133 (4 / 59 GB/s) | Nvidia K40m | GDDR5 (288 GB/s)
GPU | Xeon E5-2698V4 (40) | 2.2 GHz | DDR4-2400 (4 / 77 GB/s) | Nvidia V100-SXM2-16GB | HBM2 (900 GB/s)
CPU | Xeon E5-2698V4 (40) | 2.2 GHz | DDR4-2400 (4 / 77 GB/s) | - | -
TABLE II: Characteristics of the platforms used to benchmark the kernels

VII Experimental Results

Evaluation of the performance of our design is done in three stages. First, the performance of the stand-alone SpMV unit is evaluated, followed by the performance of the complete stand-alone solver. Finally, the FPGA solver is integrated into the flow simulator, and its performance is compared to that of a CPU and a GPU. All the FPGA bitstream implementations were done in Xilinx Vitis 2019.2 and run on a Xilinx Alveo U280 data center accelerator card [3]. In this section, performance for the SpMV unit is reported in Giga FLoating-point OPerations per second (GFLOP/s). These numbers are calculated by dividing the total number of floating-point operations performed by the time it took to perform them, where the total number of required operations is two times the number of non-zero values in the matrix being multiplied (since one multiplication and one addition need to be performed for each non-zero value).
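In other words, with nnz the number of non-zeros of the matrix and t the measured kernel time in seconds,

    \mathrm{GFLOP/s} = \frac{2 \cdot \mathrm{nnz}}{t \cdot 10^{9}}.

As a concrete reading of the numbers reported below, the 3.59 GFLOP/s measured for the double-precision partitioned kernel on the NORNE matrix (2,818,879 non-zeros, Table V) corresponds to a kernel time of roughly 1.6 ms per SpMV.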
Table II lists the platforms we used to benchmark the kernels and the flow simulator. Each platform will be referred to in the rest of the paper by its short name (e.g., FPGA).

VII-A SpMV Results

Fig. 12: SpMV performance. Please note that for the FPGA, the maximum aggregated HBM memory bandwidth is 460 GB/s, however, due to design limitations caused by implementation issues, the maximum bandwidth used by the kernel is around 50 GB/s.

The SpMV unit has many configurable parameters that will impact both performance and FPGA resource usage. For this performance evaluation, we choose to compare between single-precision (float) and double-precision (double) implementations, as well as between partitioned (_p) and non-partitioned (_np) implementations. Other parameters were set to the values in Table III.

Parameter double_p double_np float_p float_np
Num mults 8 8 16 16
Max rows 262144 65536 262144 65536
Read ports 4 4 4 4
TABLE III: SpMV parameters of different SpMV implementations

The resource utilization of the four chosen SpMV kernel implementations is shown in Table IV. From this table, we can observe that the BRAM utilization sets the limit on how many times this kernel could be replicated on the same FPGA device. All implementations are able to reach a frequency of 300 MHz, which is the maximum frequency at which the AXI memory interfaces can run.

Resource double_p double_np float_p float_np
LUT 4.08 % 4.12 % 4.92 % 5.00 %
LUTRAM 0.70 % 0.69 % 0.82 % 0.78 %
REG 4.14 % 4.45 % 4.96 % 5.24 %
BRAM 15.27 % 12.38 % 18.97 % 12.38 %
URAM 4.17 % 6.67 % 2.50 % 6.67 %
DSP 0.93 % 0.93 % 0.74 % 0.74 %
TABLE IV: FPGA resource utilization of different implementations of the SpMV kernel as % of the resources available in the FPGA’s dynamic region

The sizes and types of the matrices used to test the SpMV unit are listed in Table V. All of the listed matrices were obtained from the SuiteSparse matrix collection [24, 9], apart from the NORNE matrix, which models a real-life oil field and is a test case in the OPM framework. As such, this matrix was not obtained from the SuiteSparse matrix collection, but was instead exported from the flow simulator. Please note that all but one of the matrices are square, and three of the matrices are non-symmetric. These matrices were chosen to show that the SpMV unit works on such non-symmetric and non-square matrices. Some of these matrices are not used as benchmarks for the BiCGSTAB solver, however, as it does not work for non-square matrices and has additional limitations, as explained in the next section.

Name Rows Columns NNZs Density% Symm.
nos1 237 237 2115 1.811 yes
epb1 14734 14734 95053 0.044 no
Hummocky 12380 12380 121340 0.079 yes
bodyy6 19366 19366 134748 0.036 yes
lp_ken_18 105127 154699 358171 0.002 no
gridgena 48962 48962 512084 0.021 yes
wathen120 36441 36441 565761 0.043 yes
finan512 74752 74752 596992 0.011 yes
qa8fm 66127 66127 1660579 0.038 yes
crystm03 24696 24696 1751310 0.096 yes
vanbody 47072 47072 2336898 0.105 yes
NORNE 133293 133293 2818879 0.016 no
cfd2 123440 123440 3087898 0.020 yes
TABLE V: Characteristics of the matrices used to benchmark the SpMV kernel

In Figure 12, the performance for all the matrices used to benchmark the kernel is shown; the corresponding tabulated values can be found in Table VI. For the FPGA and GPU, these performance numbers do not include the data-transfer time between the host CPU and the on-board memories. From this figure, we can see that the single-precision versions of the FPGA kernel achieve almost 2 times more FLOP/s than their double-precision counterparts. This can be explained by the fact that the amount of data that needs to be read for each single-precision operation is half of the amount needed for a double-precision one. We also observe, as expected, that the non-partitioned version achieves a slightly higher performance than the corresponding partitioned version, with the trade-off that the non-partitioned version cannot operate on the largest of the benchmark matrices.

Matrix | FPGA double_p | FPGA double_np | FPGA float_p | FPGA float_np | GPU cusparse | GPU OpenCL LS | GPU OpenCL GC | CPU (double)
nos1 0.34 0.34 0.34 0.34 0.20 0.08 0.08 0.00
epb1 2.50 3.17 3.57 5.76 14.62 0.15
Hummocky 2.79 3.64 4.05 6.11 15.17 5.92 6.22 0.29
bodyy6 2.52 3.36 3.67 6.78 17.97 8.17 8.69 0.22
lp_ken_18 1.25 1.40
gridgena 3.21 3.81 5.33 7.64 34.14 22.76 22.76 0.63
wathen120 3.51 3.82 6.24 7.67 40.41 23.09 23.09 0.62
crystm03 3.69 4.34 6.69 8.29 33.36 16.44 16.22 0.88
finan512 2.89 3.95 21.71 17.56 15.92 1.41
qa8fm 3.85 6.90 55.35 43.13 43.70 2.18
vanbody 4.15 4.60 7.84 8.69
NORNE 3.59 5.90 90.93 74.18 77.23 1.81
cfd2 3.80 6.89 57.18 43.80 43.49 6.04
TABLE VI: Performance in GFLOP/s for the SpMV kernel

VII-B Stand-Alone FPGA Solver Results

For the ILU0 BiCGSTAB solver, we are only interested in a version of the solver that can run on larger matrices with high precision, so only the double-precision, partitioned version of the solver is benchmarked. The design parameters as listed for the double_p SpMV implementation in Table III hold for this solver implementation as well, except for the number of read ports, which is increased by one.
In Table VII, the resource utilization of the ILU0 BICGSTAB solver is compared to that of the SpMV kernel with the same parameters. In both cases, these values refer to the percentage of resources used over the available ones in the dynamic region (which is reconfigurable by the user), and they include all the supporting circuitry to control the solver and access the memory on the FPGA board. Hence, not included in these values are the resources used by the FPGA’s static region, which is provided by Xilinx. Due to the increased complexity of the solver over the SpMV kernel, it could only reach a frequency of 280 MHz.

Resource SpMV Solver
LUT 4.08 % 6.93 %
LUTRAM 0.70 % 1.13 %
REG 4.14 % 6.41 %
BRAM 15.27 % 24.64 %
URAM 4.17 % 4.17 %
DSP 0.93 % 3.18 %
TABLE VII: FPGA resource utilization of the solver compared to the SpMV kernel as % of the resources available in the FPGA’s dynamic region

Some of the matrices used to test the SpMV unit could not be used to test the solver, because they were non-square, had zero values on their diagonals, or exceeded the on-board memory capacity of the design. Table VIII lists the matrices that were used to test the BICGSTAB solver, along with their original sizes and the sizes of the derived L/U matrices used by the FPGA solver.

Name | Dim. | FPGA matrix NNZs | FPGA L-matrix NNZs | FPGA U-matrix NNZs | GPU matrix NNZs
nos1 237 1017 390 390 1017
bodyy6 19366 134197 121414 70762 134197
gridgena 48962 513060 421068 315336 513060
crystm03 24696 583770 279519 279522 583770
wathen120 36441 565761 468529 379446 565761
qa8fm 66127 1660564 1431908 1251238 1660564
NORNE 133293 1314999 726369 513512 2818879
TABLE VIII: Characteristics of the testing matrices

The performance results obtained after running the stand-alone ILU0 BiCGSTAB solver are shown in Figure 13. For each matrix, we present the best result obtained with either the graph-coloring or the level-scheduling reordering algorithm. The detailed timings are listed in Table IX, which contains the run time as reported by the profiling info (Solver) of the runs (only for the FPGA, used in Figure 13), the run time as measured by the host (Host, used in the figure for the CPU and GPU results), and the time spent transferring (Transfer) the input data and the output results between the host CPU's memory and the accelerators' on-board DRAM/HBM memories. The number of iterations (Iter#) needed to solve the system is also reported. The Host time is, in all cases, higher than the Solver time, as the former includes the overhead of starting the solver kernel and waiting for it to finish. The transfer times are not included in the other two times.

Fig. 13: Solver execution time for selected matrices. For readability of the CPU/FPGA results, some GPU results have been clipped. Refer to Table IX for the actual values. Please note that for the FPGA, the maximum aggregated HBM memory bandwidth is 460 GB/s and the maximum aggregated DDR4 memory bandwidth is around 37 GB/s, however, due to design limitations caused by implementation issues, the maximum bandwidth used by the kernel is around 52 GB/s.
Matrix | FPGA: Solver / Host / Transfer / Iter# | CPU DUNE: Host / Iter# | GPU cusparse: Host / Transfer / Iter# | GPU OpenCL LS: Host / Transfer / Iter# | GPU OpenCL GC: Host / Transfer / Iter#
nos1 0.1 0.4 0.4 1.0 0.0 0.5 0.5 0.0 0.5 3.3 0.1 1.0 0.7 0.1 1.5
bodyy6 4.1 4.3 3.2 2.0 1.4 0.5 17.6 0.5 0.5 136.5 0.5 1.0 1.9 0.5 2.0
gridgena 9.6 10.2 12.4 1.5 19.3 4.5 528.7 1.2 4.5 1007.7 1.3 2.5 1.4 1.3 1.5
crystm03 4.4 4.9 9.4 0.5 3.2 0.5 2.1 1.6 0.5 4.2 1.7 0.5 2.3 1.7 1.0
wathen120 13.3 13.5 12.1 2.0 2.6 0.5 46.3 1.1 0.5 334.3 1.3 1.0 2.3 1.3 2.0
qa8fm 20.9 21.2 30.6 1.0 12.4 0.5 55.4 3.4 0.5 179.6 3.6 0.5 1.9 3.7 1.0
NORNE 58.0 58.6 27.3 4.5 70.2 4.5 14.0 2.7 4.5 32.4 3.2 4.5 8.7 3.0 6.5
TABLE IX: Timing results (in milliseconds) for solving selected matrices

In Table X we report the bandwidth utilization for each memory port, as recorded by the profiling facilities present in the FPGA solver. Each column shows a memory port, along with its mapping (either to DDR4 or HBM memory). The bandwidth utilization is given as a percentage of the maximum bandwidth available for each individual port, which is 18.8 GB/s for read ports 0-1 (DDR4 memory, 512-bit AXI bus @ 300 MHz) and 14.1 GB/s for the remaining ports (HBM memory, 256-bit AXI bus @ 450 MHz).

Matrix | DDR4: Read 0 / Read 1 | HBM: Read 2 / Read 3 / Read 4 / Write 0 / Write 1 / Write 2 | Aggregated read bandwidth (GB/s) | % of max aggregated read bandwidth
nos1 42.8 14.9 50.6 50.5 49.5 79.4 79.4 79.1 32.0 40.2
bodyy6 82.6 36.8 70.8 64.4 72.4 100 100 100 51.6 64.7
gridgena 81.7 36.2 70.0 65.8 70.2 100 100 100 51.1 64.1
crystm03 79.8 43.2 68.3 69.2 69.3 100 100 100 52.2 65.4
wathen120 80.6 35.6 68.9 67.0 70.0 100 100 100 50.7 63.7
qa8fm 81.4 36.0 68.6 66.5 72.1 100 100 100 51.1 64.2
NORNE 83.0 35.8 66.8 62.5 69.6 100 100 100 50.3 63.1
TABLE X: Read and write ports bandwidth utilization of the FPGA solver, as % of the available bandwidth

VII-C Flow Results

The ILU0 BiCGSTAB solvers were integrated into the flow simulator in OPM [20] by replacing the call to the simulator's own iterative solver with a call to a function that does the required pre-processing, sends the resulting data to the accelerator, and then reads back the results when it is done. The source code used was version 2020.10-rc4, with custom modifications to integrate our solvers. The original flow solver is an ILU0 BICGSTAB solver based on the DUNE project [11], and in Table XI the performance of this software solver is compared to that of our solvers when running the simulation on the NORNE test case (file NORNE_ATW2013.DATA) [2] that is part of the OPM project. For all runs, the parameter '--matrix-add-wellcontributions=true' was added to the command line (to have a fair comparison with the FPGA results, because the FPGA solver currently does not support separate well contributions). For the Norne case, including the well contributions increases the CPU run time by approximately twenty percent compared to the default configuration. Moreover, for the FPGA run, the parameter '--threads-per-process=8' was also added. This only impacts the assembly part of the simulation, not the preconditioner nor the linear solver, in effect reducing the computational time spent outside the preconditioning and linear-solve parts. The reordering algorithm has been left at the default for the FPGA (level scheduling), because graph coloring produces in this case more colors than the FPGA solver supports. For the executions using the FPGA solver, all the application's threads were pinned to CPU #0, i.e., the one directly connected to the FPGA, to avoid fluctuations in bandwidth during the transfers between the host and the FPGA memory.
The default reduction of 1e-2 was used for the results shown in Table XI and Figure 13, while a reduction of 1e-6 was used for the results shown in Table XII. For an explanation of the characteristics of each platform used to run flow, please refer to Table II.

Function | CPU | FPGA | GPU (K40m): cusparse / OpenCL LS / OpenCL GC | GPU (V100): cusparse / OpenCL LS / OpenCL GC
Total time 540.6 751.8 489.7 799.6 749.3 320.1 507.6 458.5
   Solver time 540.6 751.7 489.6 799.5 749.2 320.0 507.6 458.5
   Assembly time 99.4 94.6 133.7 137.5 163.6 115.6 112.7 114.7
     Well assembly time 43.0 47.1 21.15 21.6 26.5 13.9 14.6 16.8
   Linear solve time 339.5 559.9 248.9 551.4 460.6 117.6 293.6 207.8
     Linear setup time 41.8 45.3 38.1 39.5 47.7 33.6 35.4 43.6
   Update time 57.3 54.4 67.3 70.1 84.5 53.6 56.9 71.2
   Output write time 3.5 3.5 3.2 3.2 3.2 2.9 2.9 3.0
Number of linear solves 1458 1441 1437 1507 1818 1449 1507 1861
Number of linear iterations 21991 21012 21282 22626 42160 21020 22626 40991
Speedup total time 1.00 0.72 1.10 0.68 0.72 1.69 1.07 1.18
Speedup linear solver time 1.00 0.61 1.36 0.62 0.74 2.89 1.16 1.63
TABLE XI: Comparison of performance for flow with reduction=1e-2 (default) using CPU, FPGA and GPU (in seconds)
Function | CPU | FPGA | GPU (K40m): cusparse / OpenCL LS / OpenCL GC | GPU (V100): cusparse / OpenCL LS / OpenCL GC
Total time 1254.1 1512.9 900.0 1893.0 1333.3 484.5 950.5 579.4
   Solver time 1254.0 1512.9 900.0 1893.0 1333.2 484.5 950.5 579.3
   Assembly time 82.2 84.9 112.0 114.7 115.9 101.8 101.7 101.7
     Well assembly time 36.1 42.3 17.1 17.5 17.6 11.9 12.1 12.5
   Linear solve time 1081.3 1333.8 691.4 1680.1 1119.5 303.8 768.8 397.0
     Linear setup time 35.2 38.9 31.1 31.4 31.3 27.8 28.2 28.7
   Update time 46.1 50.9 56.9 57.4 57.3 45.6 45.6 45.8
   Output write time 3.5 3.5 3.0 3.2 3.2 2.8 2.9 2.9
Number of linear solves 1207 1207 1202 1207 1207 1207 1207 1207
Number of linear iterations 79163 78275 78447 82673 117463 78958 82673 117294
Speedup total time 1.00 0.83 1.39 0.66 0.94 2.59 1.32 2.16
Speedup linear solver time 1.00 0.81 1.56 0.64 0.97 3.56 1.41 2.72
TABLE XII: Comparison of performance for flow with reduction=1e-6 using CPU, FPGA and GPU (in seconds)
Device | Step (accumulated time) | CPU | FPGA | GPU (K40m): cusparse / OpenCL LS / OpenCL GC | GPU (V100): cusparse / OpenCL LS / OpenCL GC
Host Reorder vectors - 1.10 - 1.08 1.14 - 1.14 1.30
Create preconditioner 38.5 128.75 22.27 60.09 95.35 5.76 58.00 81.00
   BILU0 reorder matrix - - - 20.47 27.72 - 21.30 23.60
   BILU0 decomposition - - - 31.42 57.34 - 30.20 49.60
   BILU0 copy to GPU - - - 5.04 5.97 - 3.79 4.63
Memory setup - 10.61 - - - - - -
CPU to accelerator transfer - 65.95 4.70 5.10 6.18 3.83 4.46 5.59
Accelerator Total solve 274.1 304.90 175.10 448.50 324.10 71.80 197.30 88.10
   ILU apply 112.6 - 134.50 395.10 68.00 63.70 182.30 62.30
   SpMV 127.3 - 28.40 37.30 25.50 2.60 4.40 7.70
   Rest 28.9 - 8.50 12.70 223.80 4.00 8.30 14.20
Host Accelerator to CPU transfer - 0.76 0.27 0.35 0.16 0.26 0.34 0.16
TABLE XIII: Breakdown of linear solve time (in seconds) spent running the solvers in flow with reduction=1e-2 (default)

To gain a better understanding of which parts of running the FPGA solver cause it to be slower than the software version, we timed the different functions executed by this solver during a flow run; the results are shown in Table XIII. During the run of the flow simulator, the sparsity pattern of the matrix that needs to be solved does not change. This makes it possible for functions that depend only on the sparsity pattern, such as finding the reordering and partitioning pattern, to be done only once during initialization. From Table XIII, we observe that the majority of the solver time is actually spent running the FPGA solver, but a significant portion of the run time is also taken up by the creation of the preconditioner and the data transfers between the host and the FPGA.

VIII Conclusion

In this paper, we have evaluated the potential of using FPGAs in HPC, which is a highly relevant topic because of the rapid advances in reconfigurable hardware. To perform this study, we began by proposing a novel CSR-based encoding to optimize a new SpMV kernel on FPGA, which can be easily integrated with an ILU0 preconditioner. We subsequently developed a hardware model to predict and guide the design of the ILU0-preconditioned BiCGStab solver, which helped us understand the trade-offs between area and performance when scaling the resources. Next, we implemented the ILU0-BiCGStab preconditioned solver targeted at an HBM-enabled FPGA using a mixed programming model of HLS and RTL to maximize the performance of the double-precision kernel. To validate our work, we integrated the complete solver in the OPM reservoir simulator, both for FPGA and GPU. Finally, we provided extensive evaluation results for both the standalone SpMV and solver kernels as well as the complete reservoir simulation execution on a real-world use case running on three different platforms: CPU, FPGA, and GPU.

We find that the FPGA is on par with the CPU execution and 3x slower than the GPU implementation when comparing only the kernel executions. Although the obtained results show that an FPGA is not yet competitive with the GPU, we believe that with better on-chip support for double precision, an increased number of computation units and memory ports, increased internal parallelism, as well as other optimizations on the host software side (i.e., reduction of the preconditioner computation time and increased efficiency of the host-FPGA data exchange and task-execution control), the FPGA can become a better alternative to speed up scientific computing. This has been illustrated by scaling our hardware performance model beyond what was possible to achieve in this work due to the encountered hardware limitations. In the future, we will work to eliminate these restrictions and provide a more efficient implementation. An alternative research direction could focus on developing other preconditioned solvers that might be more suitable for an FPGA implementation.

To encourage research in this direction, the source code for the RTL kernels and the integration with OPM Flow is made available in the OPM git repository [20]. A pull request was created (#2998), which can be used to reproduce the results of this paper. Please note that, since the pull request was created against the most current version of the code at the time of writing (December 2020), the performance results obtained with that code may not be directly comparable with the results reported in this paper, which were gathered with a previous release of OPM's Flow, as stated in Section VII-C.

References

  • [1] NVIDIA cuSPARSE library (online resource).
  • [2] Norne test case (NORNE_ATW2013.DATA), Open Porous Media project (online resource).
  • [3] Alveo U280 data center accelerator card (online resource).
  • [4] AMD to acquire Xilinx (online resource).
  • [5] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, A. Dener, V. Eijkhout, W. D. Gropp, D. Karpeyev, D. Kaushik, M. G. Knepley, D. A. May, L. C. McInnes, R. T. Mills, T. Munson, K. Rupp, P. Sanan, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang (2020) PETSc users manual. Technical Report ANL-95/11, Revision 3.14, Argonne National Laboratory.
  • [6] P. Bastian, M. Blatt, A. Dedner, N. Dreier, C. Engwer, R. Fritze, C. Gräser, C. Grüninger, D. Kempf, R. Klöfkorn, M. Ohlberger, and O. Sander (2021) The DUNE framework: basic concepts and recent developments. Computers & Mathematics with Applications 81, pp. 75-112.
  • [7] E. Chow and A. Patel (2015) Fine-grained parallel incomplete LU factorization. SIAM Journal on Scientific Computing 37 (2), pp. C169-C193.
  • [8] G. C. Chow, P. Grigoras, P. Burovskiy, and W. Luk (2014) An efficient sparse conjugate gradient solver using a Beneš permutation network. In 2014 24th International Conference on Field Programmable Logic and Applications (FPL), pp. 1-7.
  • [9] T. A. Davis and Y. Hu (2011) The University of Florida Sparse Matrix Collection. ACM Transactions on Mathematical Software 38 (1).
  • [10] R. Dorrance, F. Ren, and D. Marković (2014) A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAs. In Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 161-170.
  • [11] DUNE project (online resource).
  • [12] R. Eberhardt and M. Hoemmen (2016) Optimization of block sparse matrix-vector multiplication on shared-memory parallel architectures. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 663-672.
  • [13] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt (2014) A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication. In 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 36-43.
  • [14] Intel acquisition of Altera (online resource).
  • [15] M. T. Jones and P. E. Plassmann (1993) A parallel graph coloring heuristic. SIAM Journal on Scientific Computing 14 (3), pp. 654-669.
  • [16] A. Monakov, A. Lokhmotov, and A. Avetisyan (2010) Automatically tuning sparse matrix-vector multiplication for GPU architectures. In High Performance Embedded Architectures and Compilers, Berlin, Heidelberg, pp. 111-125.
  • [17] G. R. Morris, V. K. Prasanna, and R. D. Anderson (2006) A hybrid approach for mapping conjugate gradient onto an FPGA-augmented reconfigurable supercomputer. In 2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 3-12.
  • [18] A. Munshi, B. Gaster, T. G. Mattson, J. Fung, and D. Ginsburg (2011) OpenCL Programming Guide. 1st edition, Addison-Wesley Professional.
  • [19] R. Nane, V. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, and K. Bertels (2016) A survey and evaluation of FPGA high-level synthesis tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35 (10), pp. 1591-1604.
  • [20] Open Porous Media reservoir simulator (online resource).
  • [21] A. F. Rasmussen, T. H. Sandve, K. Bao, A. Lauser, J. Hove, B. Skaflestad, R. Klöfkorn, M. Blatt, A. B. Rustad, O. Sævareid, K. Lie, and A. Thune (2020) The Open Porous Media Flow reservoir simulator. Computers & Mathematics with Applications.
  • [22] Y. Saad (2003) Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics.
  • [23] B. Sigurbergsson, T. Hogervorst, T. D. Qiu, and R. Nane (2019) Sparstition: a partitioning scheme for large-scale sparse matrix vector multiplication on FPGA. International Conference on Application-specific Systems, Architectures and Processors (ASAP) 30, pp. 51-58.
  • [24] SuiteSparse matrix collection (online resource).
  • [25] E. Topsakal, R. Kindt, K. Sertel, and J. Volakis (2001) Evaluation of the BiCGStab(l) algorithm for the finite-element/boundary-integral method. IEEE Antennas and Propagation Magazine 43 (6), pp. 124-131.
  • [26] UltraRAM (online resource).
  • [27] H. A. van der Vorst (1992) Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing 13 (2), pp. 631-644.
  • [28] G. Wu, X. Xie, Y. Dou, and M. Wang (2013) High-performance architecture for the conjugate gradient solver on FPGAs. IEEE Transactions on Circuits and Systems II: Express Briefs 60 (11), pp. 791-795.
  • [29] L. Zhuo and V. K. Prasanna (2005) Sparse matrix-vector multiplication on FPGAs. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, pp. 63-74.