Multigrid Solvers in Reconfigurable Hardware

04/01/2019 · Safaa Kasbah, et al. · Lebanese American University

The problem of finding the solution of Partial Differential Equations (PDEs) plays a central role in modeling real world problems. Over the past years, Multigrid solvers have shown their robustness over other techniques, due to their high convergence rate, which is independent of the problem size. For this reason, many attempts to exploit the inherent parallelism of Multigrid have been made to achieve the desired efficiency and scalability of the method. Yet, most efforts fail in this respect due to many factors (time, resources) governed by software implementations. In this paper, we present a hardware implementation of the V-cycle Multigrid method for finding the solution of a 2D-Poisson equation. We use Handel-C to implement our hardware design, which we map onto available Field Programmable Gate Arrays (FPGAs). We analyze the implementation performance using the FPGA vendors' tools. We demonstrate the robustness of Multigrid over other iterative solvers, such as Jacobi and Successive Over Relaxation (SOR), in both hardware and software. We compare our findings with a C++ version of each algorithm. The obtained results show better performance than existing software versions.


1 Introduction

Physical, chemical and biological phenomena are modeled using Partial Differential Equations (PDEs). Interpreting and solving PDEs is the key to understanding the behavior of the modeled system. The broad field of modeling real systems has drawn researchers' attention to designing efficient algorithms for solving PDEs. The Multigrid method has been shown to be the fastest method due to its high convergence rate, which is independent of the problem size. However, the computation of such solvers is complex and time consuming. Many attempts to exploit the inherent parallelism of Multigrid have been made to achieve the desired efficiency and scalability of the method. Yet, most efforts fail in this respect due to many factors (time and resources) governed by software implementations when parallelizing the algorithm.

Over the past years, researchers have benefited from the continuous advances in hardware devices and software tools to accelerate the computation of complex problems [2]. At early stages, algorithms were designed and implemented to run on a general purpose processor (software). Techniques for optimizing and parallelizing the algorithm, when possible, were then devised to achieve better performance. As applications get more complex, the performance provided by processors degrades. Better performance can be achieved using dedicated hardware, where the algorithm is digitally mapped onto a silicon chip, an Integrated Circuit (IC). Though it provides better performance than the processor technology, the IC technology (hardware) lacks flexibility.

In the last decade, a new computing paradigm, Reconfigurable Computing (RC), has emerged [14]. RC systems overcome the limitations of the processor and IC technologies: they benefit from the flexibility offered by software and the performance offered by hardware [23], [14]. RC has successfully accelerated a wide variety of applications, including cryptography and signal processing [22]. This achievement requires reconfigurable hardware, such as an FPGA, and a software design environment that aids in the creation of configurations for the reconfigurable hardware [14].

In this paper, we present a hardware implementation of the V-cycle Multigrid algorithm for the solution of a 2D-Poisson equation using different classes of FPGAs: Xilinx Virtex II Pro, Altera Stratix, and the Spartan3L embedded on the RC10 board from Celoxica. We use Handel-C, a higher-level hardware design language, to code our design, which is analyzed, synthesized, and placed and routed using the FPGA vendors' proprietary software (DK Design Suite, Xilinx ISE 8.1i and Quartus II 5.1). We demonstrate the robustness of the Multigrid algorithm over the Jacobi and SOR algorithms, in both hardware and software. We compare our implementation results with an existing software version of each algorithm, since there are no hardware implementations of MG, Jacobi and SOR in the literature.

The rest of the paper is organized as follows: In Sections 2 and 3, we present a general overview of Multigrid solvers and Reconfigurable Computing, respectively. In Section 4, we describe our hardware implementation of the V-cycle MG for the solution of the 2D-Poisson equation. The implementation results are presented in Section 5, where we: a) report MG results, b) compare these results with a software version written in C++ and running on a general purpose processor, c) report Jacobi and SOR hardware implementation results and compare them with their software versions, and d) compare the results obtained in a) and c), showing how MG outperforms Jacobi and SOR in both hardware and software versions. Section 6 concludes the work and addresses possible directions for future work.

2 Multigrid Solvers

Multigrid methods are fast linear iterative solvers used for finding the optimal solution of a particular class of partial differential equations. Similar to classical iterative methods (Jacobi, Successive Over Relaxation (SOR) and Gauss-Seidel), an MG method "starts with an approximate solution to the differential equation; and in each iteration, the difference between the approximate solution and the exact solution is made smaller" [7].

In general, the error, i.e., the difference between the exact and approximate solutions, will have components of different wavelengths: high-frequency components and low-frequency components [7]. Classical iterative methods reduce high-frequency (oscillatory) components of the error rapidly, but reduce low-frequency (smooth) components much more slowly [42].

The Multigrid strategy overcomes this weakness of classical iterative solvers by observing that components that appear smooth on a fine grid may appear oscillatory when sampled on a coarser grid [8]. The high-frequency components of the error are reduced by applying any of the classical iterative methods. The low-frequency components of the error are reduced by a coarse-grid correction procedure [11], [42].

A Multigrid cycle starts by applying a classical iterative method (Jacobi, Gauss-Seidel or Successive Over Relaxation) to find an approximate solution for the system. The residual operator is then applied to find the difference between the actual solution and the approximate solution; the result of this operator measures the goodness of the approximation. Since it is easier to solve a problem with fewer unknowns [9], [30], a special operator, Restriction, maps the residual to a coarser grid (with fewer unknowns); this is applied repeatedly until the scheme reaches the bottom of the grid hierarchy. Then, the coarse grid solver operator is applied to find the error on the coarsest grid. Afterwards, the interpolation operator is applied to map the coarse grid correction to the next finer grid in an attempt to improve the approximate solution. This procedure is applied until the top grid level is reached, giving a solution with residual zero. Finishing with several iterations back to the finest grid gives the so-called V-cycle Multigrid [11], [18], [25]; a code sketch of the full cycle follows.
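For concreteness, the following is a minimal C++ sketch of one V-cycle under the description above. It is an illustration only, not the paper's Handel-C implementation: the row-major grid layout, the function names, and the n <= 2 cutoff are our assumptions, and the five component routines (declared here) are sketched in Section 2.1 below.

#include <cstddef>
#include <vector>

using Grid = std::vector<double>;  // (n+1) x (n+1) values, row-major

// The five MG components of Section 2.1 (sketches given there).
void smooth(Grid& u, const Grid& f, int n);                // Gauss-Seidel
Grid compute_residual(const Grid& u, const Grid& f, int n);
Grid restrict_full_weighting(const Grid& r, int n);        // fine -> coarse
Grid prolongate_bilinear(const Grid& e, int nc);           // coarse -> fine
void solve_coarsest(Grid& u, const Grid& f, int n);        // exact solve

// One V-cycle for the discrete Poisson system A u = f on an n x n grid.
void v_cycle(Grid& u, const Grid& f, int n) {
    if (n <= 2) { solve_coarsest(u, f, n); return; }       // coarsest level

    smooth(u, f, n);                                       // pre-smoothing
    Grid r   = compute_residual(u, f, n);                  // r = f - A u
    Grid r2h = restrict_full_weighting(r, n);              // restrict to 2h
    Grid e2h((n / 2 + 1) * (n / 2 + 1), 0.0);              // coarse error, initially 0
    v_cycle(e2h, r2h, n / 2);                              // solve A e = r recursively
    Grid eh  = prolongate_bilinear(e2h, n / 2);            // interpolate back to h
    for (std::size_t k = 0; k < u.size(); ++k) u[k] += eh[k];  // correct u
    smooth(u, f, n);                                       // post-smoothing
}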

Figure 1: V-cycle, W-cycle, and Full-cycle MG

2.1 Multigrid Components

A Multigrid algorithm uses five algorithmic components: Smoother/Relaxation, Residual computation, Restriction, Coarse grid solver, Interpolation.

Relaxation/Smoother: The most important component of the Multigrid algorithm. This component is responsible for generating an approximate solution by reducing (smoothing/relaxing) the high-frequency error components introduced when approximating the solution.

Iterative methods that can be applied as the smoother in the Multigrid solver include Gauss-Seidel, Jacobi, and Successive Over Relaxation (SOR) [7], [19]. The Gauss-Seidel method has the fastest convergence rate and is thus the best candidate. It can be used in both the pre-smoothing and post-smoothing steps. When used as a pre-smoother, the Gauss-Seidel method is responsible for reducing the high-frequency error components to obtain a smoother error [8]. When used as a post-smoother, it is responsible for removing new high-frequency error components that might be produced by the coarse grid correction and interpolation.

In general, the Gauss-Seidel method generates a solution using

u^{(k+1)}_{i,j} = \frac{1}{4}\left( u^{(k+1)}_{i-1,j} + u^{(k)}_{i+1,j} + u^{(k+1)}_{i,j-1} + u^{(k)}_{i,j+1} - h^2 f_{i,j} \right)    (1)

When used as the pre-smoothing and post-smoothing steps in the Multigrid method, the solution takes the same form on each grid level:

u_{i,j} \leftarrow \frac{1}{4}\left( u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - h^2 f_{i,j} \right)    (2)

where i and j are the row and column indices of the grid and h is the mesh size (Barrett et al., 1994).
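As an illustration, a lexicographic Gauss-Seidel sweep implementing update (2) in C++; the row-major layout, mesh size h = 1/n, and fixed Dirichlet boundary values are our assumptions:

#include <vector>

using Grid = std::vector<double>;

// One Gauss-Seidel sweep for the 5-point discrete Poisson equation on an
// n x n grid. New values at (i-1, j) and (i, j-1) are used as soon as they
// are computed, which is what distinguishes Gauss-Seidel from Jacobi.
void smooth(Grid& u, const Grid& f, int n) {
    const double h2 = 1.0 / (static_cast<double>(n) * n);   // h^2
    auto idx = [n](int i, int j) { return i * (n + 1) + j; };
    for (int i = 1; i < n; ++i)
        for (int j = 1; j < n; ++j)
            u[idx(i, j)] = 0.25 * (u[idx(i - 1, j)] + u[idx(i + 1, j)] +
                                   u[idx(i, j - 1)] + u[idx(i, j + 1)] -
                                   h2 * f[idx(i, j)]);
}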

Residual Computation: Let v be an approximate solution to the exact solution u. The residual is defined as:

r = f - Av    (3)

where e = u - v is the error, so that Ae = r.

The residual must be computed before it can be restricted to the coarser grid.
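A matching sketch of (3) for the same discretization (r = f - Au with A the 5-point Laplacian; boundary entries are left at zero):

#include <vector>

using Grid = std::vector<double>;

// r = f - A u for the 5-point Laplacian, mesh size h = 1/n.
Grid compute_residual(const Grid& u, const Grid& f, int n) {
    const double inv_h2 = static_cast<double>(n) * n;       // 1 / h^2
    auto idx = [n](int i, int j) { return i * (n + 1) + j; };
    Grid r(u.size(), 0.0);
    for (int i = 1; i < n; ++i)
        for (int j = 1; j < n; ++j)
            r[idx(i, j)] = f[idx(i, j)] -
                inv_h2 * (u[idx(i - 1, j)] + u[idx(i + 1, j)] +
                          u[idx(i, j - 1)] + u[idx(i, j + 1)] -
                          4.0 * u[idx(i, j)]);
    return r;
}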

Restriction: This component is responsible for transporting the residual of the fine grid to the coarser grid. There are various techniques for restricting the residual: full weighting, half weighting and injection [30].

Let h be the mesh size on the finer grid \Omega^h and 2h the mesh size on the coarser grid \Omega^{2h}.
Let r^h be the residual on the finer grid.
Let I_h^{2h} be a restriction operator from \Omega^h to \Omega^{2h} (full weighting in our case).
Now, we can compute the coarse-grid residual using r^{2h} = I_h^{2h} r^h.
Coarse grid solver: This operator, often called coarse-grid correction, is performed on the coarsest grid. Applying this operator along with the smoothing operator has a substantial effect on the reduction of the residual for all frequencies. However, the coarse grid solver is applied only on the coarsest grid, making its cost negligible compared to the overall computational cost of the Multigrid method [44]. On the coarsest grid, the error equation

A^{2h} e^{2h} = r^{2h}    (4)

is solved exactly.
Interpolation/Prolongation: Transports the correction obtained on the coarser grid to the finer grid. There are various techniques to do so, e.g., bilinear and 7-point interpolation.

Let h be the mesh size on the finer grid \Omega^h.
Let e^h be the prolongation of the coarser-grid error e^{2h} to the finer grid.
Let I_{2h}^h be a prolongation operator from \Omega^{2h} to \Omega^h (bilinear in our case).
Now, we can compute e^h using e^h = I_{2h}^h e^{2h}.
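A sketch of bilinear prolongation under the same layout assumptions: coincident fine points copy the coarse value, edge midpoints average two coarse neighbors, and cell centers average four.

#include <vector>

using Grid = std::vector<double>;

// Bilinear prolongation from an nc x nc coarse grid to an n = 2*nc fine grid.
Grid prolongate_bilinear(const Grid& e, int nc) {
    const int n = 2 * nc;                          // fine grid dimension
    auto fi = [n](int i, int j)  { return i * (n + 1) + j; };
    auto ci = [nc](int i, int j) { return i * (nc + 1) + j; };
    Grid ef((n + 1) * (n + 1), 0.0);
    for (int I = 0; I < nc; ++I)
        for (int J = 0; J < nc; ++J) {
            const double e00 = e[ci(I, J)],     e01 = e[ci(I, J + 1)];
            const double e10 = e[ci(I + 1, J)], e11 = e[ci(I + 1, J + 1)];
            ef[fi(2 * I, 2 * J)]         = e00;                          // coincident
            ef[fi(2 * I, 2 * J + 1)]     = 0.5 * (e00 + e01);            // edge
            ef[fi(2 * I + 1, 2 * J)]     = 0.5 * (e00 + e10);            // edge
            ef[fi(2 * I + 1, 2 * J + 1)] = 0.25 * (e00 + e01 + e10 + e11); // center
        }
    return ef;
}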

The simplest Multigrid algorithm is based on a two-grid improvement scheme: a fine grid, \Omega^h, with mesh size h, and a coarse grid, \Omega^{2h}, with mesh size 2h.

In this work, we implement the V-cycle Multigrid to find the solution of a 2-D Poisson equation. Briefly, the V-cycle Multigrid algorithm starts with an initial approximation to the expected solution, goes down to the coarsest grid, and then goes back to the finest grid [11].

Figure 2: V-cycle MG

2.2 Multigrid Solution of Poisson’s Equation in 2-D

The V-cycle Multigrid algorithm is applied to find the solution of a 2-D Poisson equation of the form:

\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x, y)    (5)

or of the form

\nabla^2 u = f

when written in vector notation [35].
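Discretizing (5) on a uniform grid with mesh size h by the standard five-point stencil gives, at each interior point (i, j),

\frac{u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - 4u_{i,j}}{h^2} = f_{i,j},

which is the linear system Au = f that the smoother in (1)-(2) iterates on.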

3 Reconfigurable Computing

Today, it is possible to benefit from the advantages of both software and hardware thanks to the Reconfigurable Computing paradigm [14]. Actually, the first idea to fill the gap between the two computing approaches, software and hardware, goes back to the 1960s, when Gerald Estrin proposed the concept of RC [40].

The basic idea of Reconfigurable Computing is the "ability to perform certain computations in hardware to increase the performance, while retaining much of the flexibility of a software solution" [14].

Reconfigurable computing systems can be of either fine-grained or coarse-grained architecture. An FPGA is a fine-grained reconfigurable unit, while a reconfigurable array processor is a coarse-grained reconfigurable unit. In a fine-grained architecture each bit can be configured, while in a coarse-grained architecture the operations and the interconnection of each processor can be configured. An example of a coarse-grained system is MorphoSys, which is intended for accelerating data path applications by combining a general purpose microprocessor and an array of coarse-grained reconfigurable cells [3].

The realization of the RC paradigm is made possible by the presence of programmable hardware such as large scale Complex Programmable Logic Devices (CPLDs) and Field Programmable Gate Arrays (FPGAs) [37]. Reconfigurable computing involves the modification of the logic within the programmable device to suit the application at hand.

3.1 Hardware Compilation

There are certain procedures to be followed before implementing a design on an FPGA. First, the user prepares his/her design using either a schematic editor or one of the Hardware Description Languages (HDLs), such as VHDL (Very High Speed Integrated Circuit Hardware Description Language) and Verilog. With schematic editors, the designer draws the design by choosing from a variety of available components (multiplexers, adders, registers, etc.) and connects them by drawing wires between them. A number of companies supply schematic editors where the designer can drag and drop symbols into a design and clearly annotate each component [39]. Schematic design is simple and easy for relatively small designs; however, the emergence of large and complex designs has substantially decreased the popularity of schematic design while increasing the popularity of HDL design.

Using an HDL, the designer has the choice of describing either the structure or the behavior of the design. Both VHDL and Verilog support structural and behavioral descriptions of a design at different levels of abstraction. In structural design, a detailed description of the system's components, sub-components and their interconnects is specified; the system appears as a collection of gates and interconnects [39]. Though it has the advantage of yielding an optimized design, structural representation becomes hard as the complexity of the system increases. In behavioral design, the system is considered as a black box with inputs and outputs only, without paying attention to its internal structure [24]. In other words, the system is described in terms of how it behaves rather than in terms of its components and the interconnections between them. Though it requires more effort, structural representation is more advantageous than behavioral representation in the sense that the designer can specify the information at the gate level, allowing optimal use of the chip area [41]. It is possible to have more than one structural representation of the same behavioral program.

Noting that modern chips are too complex to be designed using the schematic approach, we choose an HDL rather than the schematic approach to describe our designs.

Whether the designer uses a schematic editor or an HDL, the design is fed to an Electronic Design Automation (EDA) tool to be translated to a netlist. The netlist can then be fitted on the FPGA using a process called place and route, usually completed by the FPGA vendors’ tools. Then the user has to validate the place and route results by timing analysis, simulation and other verification methodologies. Once the validation process is complete, the binary file generated is used to (re)configure the FPGA device. More about this process is found in the coming sections.

Implementing a logic design on an FPGA is depicted in the figure below:

Figure 3: FPGA Design Flow

The above process consumes a remarkable amount of time, much of it spent on the design the user must provide in an HDL, most probably VHDL or Verilog. The complexity of designing in HDLs, which have been compared to the equivalent of assembly language, is overcome by raising the abstraction level of the design; this move has been made by a number of companies such as Celoxica, Cadence and Synopsys, which offer higher-level languages with concurrency models to allow faster design cycles for FPGAs than traditional HDLs. Examples of higher-level languages are Handel-C, SystemC, and Superlog [31], [39].

3.2 Handel-C Language

Handel-C is a high level language for the implementation of algorithms in hardware. It compiles programs written in a C-like syntax, with additional constructs for exploiting parallelism [39]. The Handel-C compiler comes packaged with the Celoxica DK Design Suite, which also includes functions and a memory controller for accessing the external memory on the FPGA. A big advantage, compared to other C-to-FPGA tools, is that Handel-C targets hardware directly and provides a few hardware optimizing features [12]. In contrast to other HDLs, such as VHDL, Handel-C does not support gate-level optimization. As a result, a Handel-C design uses more resources on an FPGA than a VHDL design and usually takes more time to execute. In the following subsections, we describe the Handel-C features that we have used in our design [12], [32].

3.2.1 Types and Type Operator

Almost all ANSI-C types are supported in Handel-C, with the exception of float and double. Yet, floating point arithmetic can still be performed using the floating point library provided by Celoxica. Also, Handel-C supports all ANSI-C storage class specifiers and type qualifiers except volatile and register, which have no meaning in hardware. Handel-C offers additional types for creating hardware components such as memories, ports, buses and wires. Handel-C variables can only be initialized if they are global or if declared as static or const.

3.2.2 Values and Widths

Unlike conventional C types, Handel-C types are not limited to fixed widths, since when targeting hardware there is no need to be tied to a predefined width. Variables can be of different widths, thus minimizing hardware usage. For instance, a variable that only ever holds small values can be declared with just the number of bits its range requires.

However, care should be taken when performing arithmetic or comparisons on variables of different widths. Handel-C offers three operators for dealing with variables of different widths: 1) concatenation '@', 2) take '<-' and 3) drop '\\'. For instance, to add a variable a to another variable of type long, we have to pad a to 32 bits using the concatenation operator @.

3.2.3 par Statement

The notion of time in Handel-C is fundamental. Each assignment happens in exactly one clock cycle; everything else is "free" [12].

An essential feature in Handel-C is the 'par' construct, which executes instructions in parallel. Table 1 shows the effect of using 'par'.

Sequential code              Parallel code
a=1;                         par {
b=1;                             a=1;
c=1;                             b=1;
                                 c=1;
                             }
No. of clock cycles = 3      No. of clock cycles = 1
Table 1: Effect of using the 'par' construct

3.2.4 Handel-C Targets

Handel-C supports two targets. The first is a simulator that allows development and testing of code without the need to use hardware, P1 in Figure 4. The second is the synthesis of a netlist for input to place and route tools which are provided by the FPGA’s vendors, P2 in Figure 4.

Figure 4: Handel-C targets

The remainder of this section describes the phases involved in P2; as is clear from P1, we can test and debug our design when it is compiled for simulation.

The flow of the second target involves the following steps:

Compile to netlist: The input to this phase is the source code. A synthesis engine, usually provided by the FPGA vendor, translates the original behavioral design into gates and flip-flops. The resultant file is called the netlist; generally, it is in the Electronic Design Interchange Format (EDIF). An estimate of the logic utilization can be obtained from this phase.

Place and Route (PAR): The input to this phase is the EDIF file generated in the previous phase, i.e., after synthesis. All the gates and flip-flops in the netlist are physically placed and mapped to the FPGA resources. The FPGA vendor's tool should be used to place and route the design. All design information regarding timing, chip area and resource utilization is generated and can be controlled for optimization at this phase.

Programming and configuring the FPGA: After synthesis and place and route, a binary file will be ready to be downloaded into the FPGA chip [15], [34].

3.3 Field Programmable Gate Arrays

An FPGA is a programmable digital logic chip. It consists of arrays of logic blocks with an interconnection network of wires. Both the logic blocks and the interconnects can be programmed by the designer so that the FPGA can perform whatever logical function is needed. Generally, the internal components of an FPGA can communicate with the outside world through the Input/Output blocks (IOB).

3.3.1 FPGA Structure

Three important elements that characterize the architecture of any FPGA are: building/logic blocks, routing channels, and switching elements. The great strength of FPGAs is their flexible architecture which enables the designer to program each of these elements.

The building/logic blocks are the basic elements of an FPGA. Each of these blocks is configured to perform a logic function. The interconnection between the logic blocks is provided by channels of wiring segments of varying lengths [13]. The switching elements are used to determine the choice of active logic modules and their interconnects. The designer can activate or deactivate these elements to suit the requirements of the application at hand [27].

There are several programming technologies for implementing the programmable switches in an FPGA: anti-fuse, SRAM and EPROM-based technology. In the anti-fuse technology, a strong electric current is used to create a connection between the two terminals of the anti-fuse device [33]. In the SRAM-based technology, the configuration can be reprogrammed by downloading different configuration bits into the SRAM memory cells [27]. The anti-fuse technology is faster, but less flexible, than the SRAM-based technology, whose programmability is volatile; i.e., the FPGA needs to be reconfigured whenever the power is turned off [28].

The architecture of an FPGA can be classified according to the size and flexibility of the logic cell as well as the structure of the routing scheme [33]. The four basic architectures are: Symmetrical Array, Row-based, Fine-Grain Cellular Architecture (Sea-of-Gates), and Complex or Hierarchical PLD.

3.3.2 Reconfigurability of FPGAs

FPGA reconfiguration can be static, semi-static or dynamic. Dynamic reconfiguration, also known as run-time reconfiguration, is the most powerful form, since a dynamically reconfigurable FPGA can be programmed/modified on-the-fly while the system is operating [4]. A dynamically reconfigurable FPGA may be either partially reconfigured (local run-time reconfiguration) or programmed in a full reconfiguration (global run-time reconfiguration). In the first case, only a portion of the FPGA is "reconfigured while the other part is still running. In the second case, all the system is reconfigured. An external storage is needed to keep the intermediate results until the configured functions run" [36].

4 Hardware Implementation of V-cycle MG

All available MG solvers are realized as software running on general purpose processors [30], [42]. Available software packages have been implemented in C, Fortran-77, Java and other languages, where parallelized versions of these packages require inter-processor communication standards such as the Message Passing Interface (MPI) [23]. Each of these packages attempts to achieve an efficient and scalable version of the algorithm by compromising between the accuracy of the solution and the speed of realizing the solution.

The V-cycle MG, Jacobi, and SOR algorithms have been designed, implemented and simulated using Handel-C. We have targeted a Xilinx Virtex II Pro FPGA, an Altera Stratix FPGA, and an RC10 board from Celoxica. The tools provided by the device vendors were used to synthesize and place and route the design [1], [12], [43].

Finding the solution of PDEs using any of the aforementioned techniques (MG, Jacobi, SOR) requires floating point arithmetic operations, which are 1) far more complex than fixed point operations and 2) consume more area. For this reason, Handel-C does not support a floating point type. Yet, floating point arithmetic can be performed using the Pipelined Floating Point Library provided in the Platform Developer's Kit.

An unexpected crash in the Handel-C simulator persists whenever the number of floating point arithmetic operations exceeds four. The only possible way to avoid the simulator's failure was to convert/unpack the floating point numbers to integers and perform integer arithmetic on the unpacked numbers. Though it costs more logic, the integer arithmetic on the unpacked floating point numbers has a minor effect on the total number of the design's clock cycles.
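As an illustration of this unpacking, a minimal fixed-point sketch in C++; the Q16.16 scaling and the helper names are our assumptions, not the paper's exact scheme:

#include <cstdint>
#include <iostream>

// Hypothetical Q16.16 helpers: a real number is "unpacked" into a 32-bit
// integer holding value * 2^16, so all subsequent arithmetic is integer-only.
using fix32 = int32_t;
constexpr int FRAC_BITS = 16;

fix32 to_fix(double x)   { return static_cast<fix32>(x * (1 << FRAC_BITS)); }
double from_fix(fix32 x) { return static_cast<double>(x) / (1 << FRAC_BITS); }

fix32 fix_add(fix32 a, fix32 b) { return a + b; }   // same scale: plain integer add
fix32 fix_mul(fix32 a, fix32 b) {
    // Widen to 64 bits for the product, then rescale back to Q16.16.
    return static_cast<fix32>((static_cast<int64_t>(a) * b) >> FRAC_BITS);
}

int main() {
    fix32 a = to_fix(0.25), b = to_fix(1.5);
    std::cout << from_fix(fix_add(a, b)) << "\n";   // prints 1.75
    std::cout << from_fix(fix_mul(a, b)) << "\n";   // prints 0.375
}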

The Multigrid method can be parallelized by parallelizing each of its components, i.e., the smoother, coarse grid solver, restriction and prolongation. Each of these components is parallelized using the Handel-C construct 'par', applied whenever more than one instruction can be executed in parallel without affecting the logic of the source code. Figures 5 and 6 show the two MG operators 'Restrict Residual' and 'Correct'. A snapshot of the parallel version of the 'Smoother', 'Find Residual' and 'Prolongate' components is shown in Figure 7. Their implementation style is very similar to that of the 'Restrict Residual' and 'Correct' operators.

Both the Jacobi and SOR methods have been parallelized in the same way, i.e., using the par construct whenever possible. The results obtained show: a) the robustness of the MG algorithm over the Jacobi and SOR algorithms in both hardware and software implementations, and b) a substantial improvement in the MG, Jacobi, and SOR performance when compared to the traditional way of executing instructions on a GPP.

Figure 5: MG Correct operator, illustrating the effect of using the par construct: (a), (b), (c) and (d) show sequential code, flowcharts, parallel code and the combined flowchart/concurrent process model, respectively. The dots in (d) represent replicated instances. Dashed lines show the parallel tasks.
Figure 6: MG Restrict Residual operator, illustrating the effect of using the par construct: (a), (b), (c) and (d) show sequential code, flowcharts, parallel code and the combined flowchart/concurrent process model, respectively. The dots in (d) represent replicated instances. Dashed lines show the parallel tasks.
Figure 7: V-cycle MG, iterative version showing each component's parallelization. The dots in each component's combined flowchart/concurrent process model represent replicated instances.

5 Experimental Results

The Handel-C simulator, along with the FPGA vendors' tools, was used to obtain the results. We draw a comparison of execution times between our implementations and a software version written in C++. The obtained results are based on the following criteria:

  • Speed of convergence: the time it takes the method of choice to find the solution of the PDE at hand; in other words, the time needed to execute the MG, Jacobi or SOR algorithm. In the hardware implementation, the speed of convergence is measured as the design's clock cycles divided by the frequency at which the design operates. The first parameter is found using the simulator, while the second is found in the timing analysis report generated by the FPGA vendor's tool.

  • Accuracy of the solution: The convergence of each algorithm depends greatly on the required accuracy of the solution. Increasing the accuracy increases the computation as well as the logic utilization.

  • Chip area: This criterion measures the number of occupied slices on the FPGA on which the design is implemented. The number of occupied slices is reported by the FPGA vendor's place and route tool.

We compare the timing performance between our hardware implementations of Multigrid, Jacobi, SOR and a C++ software version of the same algorithms on GPPs.

The following selections were used for all Multigrid performance tests:

  • Restriction: Full Weighting,

  • Interpolation: Bilinear,

  • Number of smoothing steps:

  • Smoother used: Gauss-Seidel

  • Accuracy: 0.001 for all Handel-C test cases and C++ test cases up to problem size 64x64.

As for the SOR performance tests, the over-relaxation parameter, omega, is set to 1.5.

The V-cycle MG execution times for different problem sizes, along with the maximum frequency at which each design operates, are shown in Table 2. The execution time is calculated as: No. of clock cycles / Max. Frequency.

Mesh Size Execution Time (s) Fmax (MHz)
8x8 0.000063 159.74
16x16 0.00026 153.52
32x32 0.00118 136.15
64x64 0.00555 115.97
128x128 0.031 83.91
256x256 0.188 54.60
512x512 1.308 31.45
1024x1024 9.3 17.60
2048x2048 70.97 9.28
Table 2: Execution Time and Max frequency for different problem sizes
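As a worked example of this formula (our arithmetic, assuming execution times in seconds and Fmax in MHz): the 8x8 design takes 0.000063 s at Fmax = 159.74 MHz, implying roughly 0.000063 \times 159.74 \times 10^6 \approx 10{,}000 clock cycles.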

Figure 8 shows the results of comparing the execution time of a C++ version of the V-cycle Multigrid algorithm and our proposed Handel-C version. The superiority of the hardware implementation over the software implementation is clear. However, for problem sizes greater than 64x64, it becomes difficult to measure the execution time of the software (C++) version at the same accuracy of 0.001. Our concern was to force the C++ version of MG to converge at any price, which was only possible by sacrificing the accuracy of the solution: we had to gradually relax this tolerance until we reached an accuracy of 2.0 for a problem size of 2048x2048, in contrast to an accuracy of 0.001 for a problem size of 8x8. The Handel-C results, on the other hand, were independent of the accuracy of the solution: the accuracy was held constant all the way from a problem size of 8x8 to 2048x2048. This explains the degeneration of the speedup indicated in (b).

Figure 8: MG execution time results in both versions, Handel-C and C++

In Table 3 we draw a comparison between the accuracy of the solution for each of the C++ and Handel-C test cases. The speedup of the design is calculated as the ratio of Execution Time (C++) / Execution Time (Handel-C).

Mesh Size Accuracy (C++) Accuracy (Handel-C) Speedup (MG) Speedup (SOR) Speedup (Jacobi)
8x8 0.001 0.001 142.86 1.758 223.81
16x16 0.001 0.001 185.59 188 56.21
32x32 0.001 0.001 119.23 6.706 5.68
64x64 0.001 0.001 58.56 5.69 2.89
128x128 0.001 0.001 20.77 1.514 1.41
256x256 1 0.001 5.25 1.43 2.29
512x512 1.1 0.001 2.92 3.03 2.39
1024x1024 1.3 0.001 1.58 2.58 0.75
2048x2048 2 0.001 1.14 3.37 1.39
Table 3: Required accuracy of the solution for the C++ and Handel-C test cases, and the designs' speedups
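Reading Tables 2 and 3 together gives a feel for the absolute times (our arithmetic, not reported separately above): for the 128x128 MG case, the Handel-C design takes 0.031 s and the speedup is 20.77, so the C++ version takes roughly 20.77 \times 0.031 \approx 0.64 s.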

The superiority of the hardware implementation over the software implementation for Jacobi and SOR is shown in Figures 9 and 10. This demonstrates the ability to realize an accelerated version of the algorithm when implemented in hardware.

Figure 9: Jacobi execution time results in both versions, Handel-C and C++
Figure 10: SOR execution time results in both versions, Handel-C and C++
Figure 11: Robustness of MG over Jacobi and SOR

Tables 4, 5 and 6 show, respectively, the Virtex II Pro, Spartan3L and Altera Stratix FPGA synthesis results for different problem sizes in MG, SOR, and Jacobi. When targeting the Xilinx Virtex II Pro FPGA, the largest problem size that we could achieve was 2048x2048, where 99% of the slices were utilized. Meanwhile, the largest possible problem size was 512x512 when targeting the Spartan3L FPGA.

Mesh Size Number of Occupied Slices Total equivalent gate count
MG SOR Jacobi MG SOR Jacobi
8x8 264 128 146 5990 2918 3229
16x16 295 136 159 6497 3033 3397
32x32 415 219 299 9321 4807 5090
64x64 536 265 380 12376 5978 7849
128x128 789 315 499 18107 7125 11864
256x256 1247 610 839 29244 14538 17864
512x512 2125 1098 1286 51115 23012 23649
1024x1024 3875 1601 1890 94484 31848 31327
2048x2048 4926 2289 3198 180879 53476 35839
Table 4: Xilinx Virtex II Pro Synthesis Results using Xilinx ISE
Mesh Size Number of Occupied Slices Total equivalent gate count
MG SOR Jacobi MG SOR Jacobi
8x8 687 302 416 355687 279010 356109
16x16 717 499 599 356163 281001 357631
32x32 769 589 7326 357224 282997 359989
64x64 832 745 9010 358921 284000 342768
128x128 1049 877 1198 361956 285872 389999
256x256 1507 1201 1665 367673 297134 397987
512x512 3187 2010 2810 375293 299858 498030
Table 5: Spartan3L Synthesis Results using Xilinx ISE
Mesh Size Total Logic Elements LE Usage by No. of LUT Inputs Total Registers
MG SOR Jacobi MG SOR Jacobi MG SOR Jacobi
8x8 725 519 610 402 250 354 228 120 189
16x16 818 601 709 554 310 401 265 155 232
32x32 925 810 880 625 501 556 301 199 300
64x64 1068 999 1001 709 637 681 360 280 385
128x128 1307 1274 1286 841 720 801 467 347 390
256x256 1739 1510 1590 1070 890 950 670 498 476
512x512 2653 2286 2589 1357 1087 1101 816 501 560
1024x1024 3491 2901 3342 1809 1450 1499 1002 569 689
2048x2048 4501 3286 3927 2201 1798 1941 482 640 819
Table 6: Altera Stratix Synthesis Results using Quartus II

6 Conclusions and Future Work

In this paper, we have presented a hardware implementation of the V-cycle Multigrid method for solving the Poisson equation in two dimensions. The Handel-C hardware compiler was used to code and implement our designs (MG, Jacobi, and SOR) and map them onto high-performance FPGAs: the Virtex II Pro, the Altera Stratix, and the Spartan3L embedded in the RC10 FPGA-based system from Celoxica. The implementation performance was analyzed using the FPGA vendors' proprietary software. Moreover, we compared our implementation results with the results of available software versions written in C++ and running on General Purpose Processors. The obtained results demonstrate that 1) the MG algorithm outperforms the Jacobi and SOR algorithms, in both hardware and software, and 2) MG on hardware outperforms MG on a GPP, where a speedup of 142.86 was achieved for a problem size of 8x8 and a speedup of 1.14 for 2048x2048; this degeneration of the speedup is due to the relaxation of the required accuracy of the solution in the C++ version for large problem sizes. Possible future directions include realizing a pipelined version of the algorithm, moving to a lower-level HDL such as VHDL, mapping the algorithm onto a coarse-grained reconfigurable system (e.g., MorphoSys) [3], and benefiting from the advantages of formal modeling [29]. We can also extend the benefit of MG by implementing the W-cycle algorithm and Algebraic MG.

References

  • (1) Altera http://www.altera.com
  • (2) D. Bailey and J. M. Borwein, Future Prospects for Computer-assisted Mathematics. Canadian Mathematical Society Notes, vol. 37, no.8, pp. 2-6, 2005.
  • (3) N. Bagherzadeh, F. Kurdahi, H. Singh, G. Lu, M. Lee and E. Filho, MorphoSys: Design and implementation of the MorphoSys reconfigurable computing processor, Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, 2000.
  • (4) M. Barr, ”A Reconfigurable Computing Primer”, Miller. Freeman Inc., 1998.
  • (5) R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. Van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd ed. Philadelphia, PA: SIAM. 1994.
  • (6) S. Barsky, BARSKY, 2003. www.mgnet.org/mgnet-codes-barsky.html
  • (7) A. Borzi, Introduction to Multigrid Methods, Institut für Mathematik und Wissenschaftliches Rechnen, 1999. www.kfunigraz.ac.at/imawww/borzi/mgintro.pdf
  • (8) J. Bramble, Multigrid methods, Pitman Research Notes in Mathematics Series, Longman Scientific. vol. 294. 1993.
  • (9) A. Brandt, Multi-Level Adaptive Solutions to Boundary Value Problems, Math. Comp., vol. 31, pp. 333-390, 1977.
  • (10) A. Brandt, The Gauss Center Research in Multiscale scientific computation, Electronic Transactions on Numerical Analysis. Vol 6, pp 1-34, 1997.
  • (11) W. L. Briggs, V.E. Henson and S.F. Mccormick, A Multigrid tutorial, SIAM, Philadelphia, PA, second ed. 2000
  • (12) Celoxica www.celoxica.com
  • (13) M. Chrzanowska-Jeske, Architecture and Synthesis Issues in FPGAs, Proceedings of NORTHCON'93 Electrical and Electronics Convention, pp. 102-105, 1993.
  • (14) K. Compton and S. Hauck, Reconfigurable Computing: A Survey of Systems and Software. In ACM Computing Surveys: Vol. 34 , pp.171-210, 2002.
  • (15) J. Cong, FPGA Synthesis and Reconfigurable Computing. University of California, Los Angeles. 1997. www.ucop.edu/research/micro//.pdf
  • (16) I. Damaj and H. Diab, Performance Evaluation of Linear Algebraic Functions Using Reconfigurable Computing, The International Journal of Supercomputing, Kluwer, vol. 24, no. 1, pp. 91-107, 2003.
  • (17) I. Damaj, J. Hawkins and A. Abdallah, Mapping High-Level Algorithms onto Massively Parallel Reconfigurable Hardware, IEEE International Conference on Computer Systems and Applications, pp. 14-22, 2003.
  • (18) K. Datta and S. Merchant, Multigrid Methods for Titanium Heart Simulation. Retrieved January 2006, from University of California, Berkeley, Department of Computer Science web site: http://www.cs.berkeley.edu/ kdatta/classes/cs267.doc
  • (19) J. Demmel, Solving the Discrete Poisson Equation using Jacobi, SOR, Conjugate Gradients, and the FFT. 1996. www.cs.berkeley.edu/ demmel/cs267/lecture24/lecture24.html.
  • (20) C. Douglas, MadPack: A Family of Abstract Multigrid or Multilevel Solvers, Computational and Applied Mathematics, vol. 14, pp. 3-20, 1995.
  • (21) R. Edward, J. Goodman and A. Sokal, Multigrid Method for the Random-Resistor Problem, Physical Review Letters. vol. 61, no. 12, 1988.
  • (22) A. J. Elbirt, and C. Paar, An FPGA Implementation and Performance Evaluation of the Serpent Block Cipher, ACM/SIGDA International Symposium on FPGAs, pp.33-40. 2000.
  • (23) R. Enzler, The Current Status of Reconfigurable Computing, Technical Report, Electronics Lab., Swiss Federal Institute of Technology (ETH) Zurich, 1999.
  • (24) FPGA4Fun, FPGA design entry, 2005. www.fpga4fun.com/DesignEntry.html
  • (25) W. Hackbusch, Multi-Grid Methods and Applications, Springer-Verlag, Berlin, 1985.
  • (26) V.E. Henson, Multigrid Methods For Nonlinear Problems: An Overview, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, 2000. www.llnl.gov/CASC/people/henson/postscript/ .pdf
  • (27) Y. Khalilollahi, Switching Elements, the Key to FPGA Architecture, WESCON Conference Record, pp. 682-687, 1994.
  • (28) T. Lecklider, Drilling Down Into FPGAs, 2006. www.evaluationengineering.com/archive/articles/ .asp
  • (29) P. Luksch, Algebraic Multigrid, Department of Informatics, Technische Universität München, 2002. www.lrr.in.tum.de/Par/appls/apps/amg.html
  • (30) MGNet Homepage: http://www.mgnet.org/
  • (31) D. Pellerin and S. Thibault, Practical FPGA programming in C. Upper Saddle River, N.J. : Prentice Hall Professional Technical Reference, 2005.
  • (32) C. Peter, Overview: Hardware Compilation and the Handel-C Language, Oxford University Computing Laboratory, 2002. http://web.comlab.ox.uk/oucl/work/christian.peter/overview_handelc.html
  • (33) A. K. Sharma, Programmable logic handbook PLDs, CPLDs and FPGAs. Ed. McGraw Hill. 1998.
  • (34) J. Schewel, A Hardware/Software Co-Design System using Configurable Computing Technology, 1998. http://ipdps.cc.gatech.edu/1998/it/schewel.pdf
  • (35) M. L. Smedinghoff, Solving the Poisson Equation with Multigrid, 2005. http://cd-amr.fnal.gov/aas/poisson.pdf
  • (36) Terminology, Bournemouth University, School of Design, Engineering and Computing. http://dec.bournemouth.ac.uk/drhw_lib/terminology.html
  • (37) T.J. Todman, G.A. Constantinides, S.J.E Wilton, O. Mencer,W. Luk, and P.Y.K. Cheung, Reconfigurable computing:architectures and design methods, IEE Proceedings - Computers and Digital Techniques, vol. 152, no. 2, pp. 193-197, 2005.
  • (38) T. Torsti, M. Heiskanen, M. Puska, and R. Nieminen, MIKA: Multigrid-based program package for electronic structure calculations, International Journal of Quantum Chemistry, vol. 91, no. 2, pp. 171-176, 2003.
  • (39) J. Turley, How Chips are Designed, Prentice Hall Professional Technical Reference, 2003. www.phptr.com/articles/
  • (40) F. Vahid and T. Givargis. Embedded systems design: a unified hardware/software introduction. New York: Wiley. 2002
  • (41) S. K. Valentina, Designing A Digital System with VHDL, Academic Open Internet Journal, vol. 11, 2004.
  • (42) P. Wesseling, An Introduction to Multigrid Methods, John Wiley and Sons, New York.
  • (43) Xilinx www.xilinx.com
  • (44) J. Yström, Multigrid Methods Improve Solvers for Both Speed and Memory, 2005. www.comsol.fr/femlab2005/sample.doc