1 Introduction
Applying finite element methods on boundary value problems involves the solution of large and sparse systems of linear (or linearized) algebraic equations of the form
(1) 
an expensive computational task that requires most of the total processing time. To accelerate the computations, parallel processing in conjunction with domain decomposition methods Smith (DDM) are commonly used. DDM are separated in three major categories: Schwarzlike, Schur complement and full matrix methods. Applying Schwarzlike or Schur complement methods amounts to splitting the original system into smaller ones that are solved separately and in parallel while full matrix methods solve in parallel the original system (Eq.1). Elaborating on the advantages and disadvantages of the various DDM is outside the scope of this paper but the interested reader is referred to Saad_book (Chapter 13) and references therein.
In this paper, the parallel implementation is based on a Krylovtype solver that iterates on the original system (full matrix based DDM). Krylovtype iterative solvers are commonly used for such large and sparse systems due to their small memory requirements and high parallel efficiency. The GMRES (Generalized Minimal Residual) method Saad
is applied here; an iterative solver commonly used for general nonsymmetric matrices that is based on efficiently parallelizable operations like matrixvector products and vector inner products. Furthermore, to enhance the convergence of the method, a preconditioning technique by deflation is employed in our computational experiments. Henceforth, the full matrix DDM implemented for the parallel solution of a finite element problem along with the preconditioned GMRES solver is referred to as
parallel GMRES (PGMRES).Concerning the hardware, distributed, shared and hybrid memory systems were until recently the main computing architectures used for parallel computations. Sequential algorithms were transformed to run in those parallel architectures exploiting their high computing capabilities. Nevertheless, the constant need for solving efficiently demanding problems has led to the development of a new kind of parallel computing systems. Driven by the demands of the gaming industry, graphics processing units (GPUs) have evolved over the years offering remarkable floating point arithmetic performance
papadra , often exceeding 1 TFlops for a single GPU device. Their use for general purpose computations is known as general purpose computing on GPUs (GPGPU) and has become extremely popular over the last few years. With the release of the first NVIDIA’s Computing Unified Device Architecture (CUDAncuda ) Software Development Kit (SDK) in 2007, the tricky GPGPU programming became relatively easy. CUDA is a parallel computing framework developed by NVIDIA, which gives developers the capability to directly access GPU’s memory and instruction set.The past few years have seen intensive research on numerical methods and parallel processing. Efficient parallel implementations of the GMRES on distributed memory corebased clusters are reported by Haiwu GMRES_distr , Li Li , Erhel Erhel and Pashos et al. Pashos1 , while Calvetti et al. Calvetti implemented on shared memory architectures. In the field of GPGPU matrix computations, Galoppo et al. Galoppo studied the peak performance of a GPU dense matrix decomposition algorithm. Furthermore, Volkov et al. Volkov improved the standard CUBLAS algorithm for matrixmatrix multiplication, claiming to approach peak performance on a G80 series NVIDIA GPU. Sparse matrix algorithms on manycore GPUs were presented by Garland Garland , while Boltz et al. Boltz were among the first to report an implementation of a GPUbased iterative solver. These early works generally focused on tackling the various problems that arised in early numerical computations on GPUs and underlined the high computing capabilities and potential that these systems have. Focusing on the GMRES, several works like sparse_GPU  Wang implement GMRES in GPUbased computing architectures, proposing efficient kernels and special storage formats for accelerating the computations several times. Furthermore, a hybrid GMRES implementation is presented by Moret Moret , based on GRID systems, while multiGPU implementations of various numerical methods, like those presented in Xian and Ament , show promising results regarding the parallel computing capabilities of multiGPU systems.
Despite the extensive literature around PGMRES solver, there is a significant shortage of comparisons between various implementations of PGMRES on different computing architectures (distributed, shared and GPUs). This article focuses on illuminating the implementation of a single, custom parallel finite element method with the PGMRES in various computing architectures. The main goal is to compare the major advantages and drawbacks of each parallel computing implementation as far as parallel speedup, execution time and memory issues are concerned. That said, three codes are developed:

PGMRESclassic: PGMRES capable of executing in multiprocessing units. The data passing is performed with Message Passing Interface (MPI).

PGMRESGPU: A single GPUbased PGMRES. The parallel tasks (communication, synchronisation) are performed via CUDA.

PGMRESmGPU: A multiple GPUbased PGMRES. The parallel tasks (communication, synchronisation) in each GPU and between the GPUs are performed via CUDA.
This paper begins with an introduction to the partitioning technique (Section 2.1) used here for the distribution of computational work between the various processing units. Section 2.2 describes the architectures used in the experiments, giving specific details about the hardware available for computations. The Bratu benchmark problem is briefly described in Section 3.1 followed by a short presentation of the preconditioned GMRES in Sections 3.2 and 3.3, while experimental results and comparison of the various architectures are shown in Section 4. Finally, conclusions are presented in Section 5 and further research directions are proposed.
2 Computing
2.1 Partitioning
In order to apply PGMRES, a graph partitioning technique for the original matrix is required metis . Graph partitioning aims at the optimal distribution of the computational load between the multiple cores towards high computational speed. For an efficient partitioning, a balanced load is critical otherwise it will lead to a performance bottleneck, caused by the most heavy loaded core. Furthermore, solving the partitioned problem requires communication and synchronisation between the cores. High and frequent data exchange load between the cores has negative effect on the parallel speedup and can make parallel processing inefficient compared to sequential execution of a program. Thus, partitioning techniques should minimize this effect. The graph partitioning corresponds to the finite element mesh partitioning, i.e. splitting the original domain into several subdomains each assigned to a single core. Each subdomain is processed independently, assembling a corresponding local matrix and residual vector. In order to reduce the effects of the choice of the partitioning technique, i.e. the ”quality” of the produced subdomains, a simple, 3D, cubic domain is used in the present computations. At this point we should point out that the communication required by the various processes can be described as local, if it occurs between processes sharing neighboring subdomains or as global, if it occurs between all concurrently running processes.
2.2 Hardware
In this paper, a SPMD (single program, multiple data) technique is used for parallel processing involving both cores and GPUs. Shared memory and distributed memory systems are both included in the computational experiments. All computations are carried out in two clusters, hosted by the School of Chemical Engineering of the National Technical University of Athens, Pegasus and Andromeda. The cluster Pegasus h_pegasus is a distributed memory system and consists of 16 nodes interconnected via Myrinet2000 network. Each of these nodes comprises 2 Xeon CPUs at 3 GHz with 2GB RAM for a total of 32 CPUs and 32GB RAM. The cluster Andromeda h_andromeda is a distributed memory hybrid (CPU/GPU) cluster. It includes 4 nodes (HP ProLiant SL390s G7) each consisting of 2 Xeon 6core CPUs X5660 at 2.80 GHz with 16GB RAM for a total of 48 cores and 64GB RAM connected via Gigabit Ethernet network. Furthermore 2 NVIDIA GPUs Tesla M2050 with 3GB RAM are on each node for a total of 8 GPUs and 24GB RAM. Each NVIDIA GPU hosts 14 multiprocessors, each containing 32 GPUcores for a total of 448 GPUcores. In this work, only one node (shared memory architecture) of Andromeda is used in the experiments.
3 Benchmark problem and GMRES
3.1 The Bratu problem
The benchmark problem employed here is the single parameter, 3D Bratu problem (2)  (5). This boundary value problem was posed in a 3D cubic domain in cartesian coordinates with homogeneous Dirichlet boundary conditions on the and boundaries and homogeneous Neumann boundary conditions on the boundaries:
(2) 
(3) 
(4) 
(5) 
In the literature Pashos2 , the 3D Bratu problem is often studied in terms of its solution space structure and, in particular, the dependence of singularities on parameter values (like ). In this implementation though, singularities are not of interest and a constant value is chosen. The above problem is discretized and solved with the Galerkin/finite element method fem . The domain is divided into a mesh made of a finite number of 27node cubic elements. The unknown function u is approximated in terms of finite element basis functions, , that are quadratic polynomials in each spatial direction:
(6) 
where is the total number of nodes on the mesh covering and are the values of the unknown function u at the nodes. Each basis function, is so defined as to be equal to unity at node and zero at all other nodes. The Galerkin method forms weighted residuals by multiplying equation (2) by the same basis functions used to approximate the solution (cf. Eq. (6)) and then integrating over the domain :
(7) 
Following substitution from Eq. (6), Eq. (7) is transformed into a system of nonlinear algebraic equations:
(8) 
where n is the unit normal vector to the boundary . Due to the boundary conditions (3) and (4), all the residual equations (8) corresponding to nodes on the faces and faces of are replaced by equations setting the value of the local nodal unknowns equal to . Moreover, the boundary integral term in equations (8) vanishes in the residuals of all interior nodes of the domain since, by their definition, the basis functions corresponding to interior nodes are zero at the domain boundary. Due to boundary condition (5), the boundary integral term for the nodes corresponding to the faces of the problem domain vanishes as well. Therefore, there is no boundary integral contribution in the residual equations (8). The nonlinear equation set (8) is linearized and solved by Newton iteration. At the iteration, the following linear equation set is solved:
(9) 
where is the update vector; . , , is the residuals vector; is given by Eq. (8) with the boundary integral term eliminated. , , is the Jacobian matrix; . The superscript k in Eq. (9) denotes evaluation at the current approximation of the solution .
The Newton iteration terminates when a selected norm of the update vector falls below a tolerance. The linear system (9) is solved, at each iteration by the preconditioned GMRES with restarting, GMRES(m) Saad (see below).
Henceforth the term ”degrees of freedom  DOF” is used to refer to the unknowns the equations are solved for. Also, the matrix
J, the right hand side R and the solution vector du in Eq. (9) will be denoted, respectively, by A, b and x, to conform with the established notation (used in Eq. (1)).3.2 GMRES(m)
Starting from an initial guess, , of the solution of (9), GMRES employs Arnoldi method, combined with an orthogonalization technique  the modified GramSchmidt method is used here  to construct an orthonormal basis of the mdimensional Krylov subspace as follows
(10) 
where and the GMRES residual vector. The new approximation of the solution is computed as:
(11) 
where is a vector of size , which comes out of the least squares problem
(12) 
In Eq. (12), , and is an upper Hessenberg matrix that satisfies the equation
(13) 
where is an upper Hessenberg matrix obtained from the by deleting its last row. The least squares problem (12) is solved by transforming into an upper triangular matrix using plane rotations.
The residual vector at every step ( = 1, 2 . . ., ) of the GMRES is
(14) 
where is a by product of the plain rotations. Thus, at every step, the convergence of the GMRES can be monitored from Eq.(14) without computing the .
GMRES allows the Krylov subspace dimension to increase up to and always terminates in at most iterations. Since the GMRES requires memory storage and floating point operations, for larger values of m it becomes very expensive. Therefore a restarting variant of the GMRES  the GMRES(m) Saad  is used in practise. When reaches a certain preset value, the algorithm restarts, using the last approximation from (11) as a new initial guess Pashos1 . Thus, two iterations are performed: The ”inner” iterations and the ”outer” iterations that correspond to the restarts of the GMRES(m). The termination criteria is the relative error defined as
(15) 
to fall below a tolerance. In Eq. (15), is the Euclidean norm of the residual vector after the last outer iteration of GMRES(m) and the Euclidean norm of the initial residual vector . The initial guess is usually taken equal to zero.
3.3 Preconditioning by deflation
The convergence rate of GMRES depends on the smallest eigenvalues of the Jacobian matrix in Eq.(
9) and deteriorates as the eigenvalues tend to zero. Moreover, the minimum and maximum eigenvalues of the Jacobian can be approximated by the corresponding eigenvalues of the Hessenberg matrix in Eq.(13). These eigenvalues are known as the Ritz values mat_computations . However, the restarted GMRES(m) loses information that is related to the smallest Ritz values at each restart and several preconditioning techniques have been developed that aim at preserving those information through the process Moret Morgan .A preconditioner is used to improve the convergence properties of the GMRES(m), especially near singular points, where convergence is difficult and sometimes even impossible. Using the preconditioner, the original linear system is transformed to an equivalent one that has better convergence properties. In our implementation, (9) is preconditioned from the right as
(16) 
In (16), is a vector of size and is the preconditioner matrix which is constructed from a deflation technique burrage Erhel and is given from
(17) 
where is the largest eigenvalue of the Jacobian matrix, are identity matrices, is an orthonormal basis of the dimensional invariant subspace, , corresponding to the smallest (in absolute value) eigenvalues of the Jacobian and such as
(18) 
The largest eigenvalue needed in Eq. (17
), and the eigenvectors of the Jacobian needed in Eq. (
18), are approximated by those of the Hessenberg matrix . At each restart of the GMRES(m), eigenvectors of the Jacobian are approximated from eigenvectors, of that correspond to the smallest eigenvalues. These eigenvectors are computed using the inverse power method. In practice, we use . The new vectors:(19) 
are orthonormalized against those of and added to . So the dimension of increases by . In practice, the dimension of is small, therefore the computational cost of the eigenvalue problem for is negligible.
The preconditioner needs two arrays of size for the storage of and involved in Eq. (17) and (18). In order to reduce the memory requirements and computational cost, an upper limit on is set ( in our case). When the update of the preconditioner is continued (i.e. new vectors are added to ) but at the same time, vectors of that correspond to the largest eigenvalues of the matrix are deflated.
The orthonormal basis of is constructed from the eigenvectors of times the orthonormal basis of the Krylov subspace. Thus, the vectors approximate the eigenvectors of the Jacobian matrix that correpsond to the smallest eigenvalues.
4 Results
In this section, the results obtained from four different computational experiments are presented:
Case A: PGMRESclassic on a distributed memory architecture (Pegasus cluster). A maximum number of twelve single core processors (one per node) from Pegasus cluster is used.
Case B: PGMRESclassic on a shared memory architecture (Andromeda cluster). A maximum number of twelve cores belongs to each Andromeda node.
Case C: PGMRESGPU on a single GPU (one node of Andromeda cluster)
Case D: PGMRESmGPU on two GPUs (one node of Andromeda cluster)
The main quantity monitored during the experiments is the speedup performance achieved from the implementation of PGMRES on various architectures. Furthermore, the effect of the required communication time on parallel implementations is analyzed. The findings are strongly dependent on the size of the problem and therefore, all implementations are tested in a wide range of sizes, starting from a few thousands DOF up to a million.
PGMRESclassic is written in C with MPI and compiled with Intel C/C++ standard icc compiler, without any compilation flags, combined with OpenMPI library for communication in Andromeda and with mpichgm library for communication in Pegasus. PGMRESGPU and PGMRESmGPU are written in C with CUDA. The C code is compiled with Intel C/C++ standard icc compiler while for the GPU codes, NVIDIA’s Cuda standard nvcc compiler is used. No compilation flags are used. All computations are made with double precision floating point data for real variables.
In all PGMRES versions, a Compressed Sparse Row (CSR or CRS) format is adopted for storage of large sparce matrices arising from the finite element problem. The following table contains information about the different size, sparsity pattern, total number of entries and memory requirements of the compressed matrix A, as the number of DOF grows from some thousands to over a million.
DOF  Matrix elements  Nonzero entries  Sparsity  Memory (MBytes) 

29791  8.88E+08  1.60E+06  1.80E03  19 
132651  1.76E+10  7.72E+06  4.39E04  88 
226981  5.15E+10  1.33E+07  2.58E04  153 
357911  1.28E+11  2.13E+07  1.66E04  245 
531441  2.82E+11  3.06E+07  1.08E04  380 
753571  5.68E+11  4.55E+07  8.01E05  524 
1030301  1.06E+12  6.26E+07  5.89E05  720 
1367631  1.87E+12  8.35E+07  4.46E05  961 
1771561  3.14E+12  1.06E+08  3.46E05  1249 
As far as the GMRES(m) is concerned, in all experiments the number of both inner () and outer iterations is kept constant and equal to 50 and 100 respectively. The number of outer iterations, in particular, is taken constant in order to keep the computational results unaffected by accuracy variations and roundoff errors that may occur. Therefore, no termination criteria are used. Nevertheless, relative errors (see Eq. (15)) remain for all problems below .
4.1 GMRES(m) vs preconditioned GMRES(m)
A preconditioner is used to enhance the convergence behavior of the GMRES. To elucidate the effect of the preconditioner, Figure 2 shows the Euclidean norm of the residual vector () during the solution of a system with DOF=130000 versus the outer iterations of GMRES(m) with and without preconditioner. Figure 2 clearly depicts the acceleration in the convergence of the preconditioned version versus the nonpreconditioned one.
4.2 Case A: PGMRESclassic on distributed memory architecture
In Fig.3 the speedup achieved is shown for various DOF. This speedup is defined as the serial execution time, , (one MPI process), divided by the run time, , required by p (in number) MPI processes, for the same DOF,
(20) 
Increasing the number of processors leads to a smaller number of computations executed by each MPI process. As a result, the computing time required by multiple processors is lower than the time required by the sequential solution from a single processor. The ideal would be that the speedup and the number of cores relate linearly, so for algorithms with order of growth , like the GMRES, a speedup of is expected. On the contrary, in Fig.3 a superlinear speedup is achieved, occurring for rather small problems, where the acceleration surpasses p. This observation contradicts the analysis made about the correlation between the speedup and the size of the problem. The superlinear speedup is due to cacheeffects Pashos1 . Every processor has a L2 cache with capacity of 2MB. Therefore, problems with less than 30000 DOF assemble a matrix with size of less than 20 MB (see Table 1). The partitioning technique splits the data into chunks of memory smaller than 2 MB that lie permanently in the fast cache memory and not in the slower main memory (RAM) as in the larger problems (DOF ).
The speedup is, in general, affected by the capabilities of the network used for the communication. In the Pegasus cluster  where the Myrinet network is used  communication time is negligible as it can be seen in Table 2 in which the percentage of the communication time for twelve MPI processes vs DOF is presented. Recall (see section 2.1) that the existing communication types are two, one at local level and the other at global level.
DOF  Computation %  Local Communication %  Global Communication %  Total Communication % 

29791  71.85  9.61  18.54  28.15 
132651  76.83  3.99  19.18  23.17 
226981  92.23  2.41  5.37  7.77 
357911  92.14  2.98  4.88  7.86 
531441  94.81  1.89  3.30  5.19 
753571  95.13  1.75  3.12  4.87 
1030301  96.15  1.21  2.64  3.85 
1367631  96.52  1.22  2.26  3.48 
1771561  97.14  1.13  1.73  2.86 
4.3 Case B: PGMRESclassic on shared memory architecture
In Fig.4 the speedup achieved is shown at various DOF. The speedup behaviour of the shared memory architecture changes radically as the number of cores increases. The most noticeable aspect of Fig.4 though, is the fact that the speedup remains actually constant and lower than the ideal for a number of cores greater than six and for all the experiments compared to the previous distributed architecture where the speedup had a linear variation in the range of two to twelve MPI processes and large (DOF >30000) problems. This main difference can be explained by the limitations that the shared main memory puts in this kind of architectures. As the number of parallel executing cores increases, memory accesses become sequential and the cores remain idle, waiting for the data to be fetched. Consequently, core to memory connection becomes a bottleneck and the high computing power of the multicore system is only partially exploited.
4.4 Case C: PGMRESGPU
No partitioning is used in this implementation, because all instructions are executed concurrently by numerous threads with direct main memory access (see Section. For maximum performance, each instruction must be assigned to a single GPU thread that runs concurrently with the others. Therefore, as the size of the problem grows, so does the number of GPU threads used for computations. The GPU threads are organised in blocks that are executed in parallel by a single GPU multiprocessor. The number of threads per block is predefined by the user. The choice here is 512 threads per block. The number of blocks depends on the number of DOF of the problem and is calculated each time by:
(21) 
After one core finishes the assembly of the A and b in Eq. (1), they are uploaded to the GPU where the computations take place. In this case the speedup, is defined as:
(22) 
where is the run time in one core of Andromeda and is the run time in a single Tesla M2050 GPU.
As shown in Fig.5, the speedup obtained from a single GPU versus DOF shows varies. Firstly, where the number of DOF remains relatively small, the speedup increases sharply and linearly. This trend is explained by the massive parallelism of the GPU with thousands of threads executing concurrently instructions on different segments of data, accelerating the computations as the problem size grows until . For DOF > though, this behavior changes a bit; the speedup increases further but this time at a slower rate. The maximum speedup obtained from a single GPU is 6.5, greater than the maximum speedup measured in the shared memory, but still lower than the distributed memory model’s.
Furthermore, another version of the singleGPU implementation was tested. In this version, the partial sums from each dot product and norm computed by the multiple blocks were transferred to the host, followed by the reduction. The final result, computed by the host, was uploaded afterwards back to the GPU. In addition, the least squares problem (Eq. (12)) was solved on the host. This implementation involving reduction and synchronisation of the GPU threads by the host is less efficient than the initial singleGPU one. The maximum speedup observed in this version was 4.6, far smaller than the original one.
4.5 Case D: PGMRESmGPU
With the introduction of CUDA Toolkit 4.0 ncuda2 , the use of multiple GPUs for general purpose GPU computations became relatively easy. Before 4.0, each GPU had to be controlled by a single host thread, thus making the use of multiple GPUs a complicated and usually inefficient task. The main drawback was the communication between the GPUs requiring data transfers and synchronization between the two core threads. After the introduction of the new Toolkit, multiple GPUs can be controlled by a single core thread making synchronization and communication relatively easy and inexpensive. Furthermore, kernel execution is nonblocking and control is returned to the host before the kernel finishes computing. Therefore multiple kernels can be executed concurrently on multiple devices. The CUDA function CudaSetDevice enables the switching between available GPUs.
In this approach the mesh is partitioned in the same way as in the PGMRESclassic implementation. Each of the two resulting subdomains is assigned to a single GPU but here the data passing is implemented with CUDA instead of MPI. Two speedups are measured, and defined, respectively, as the execution time, , required by a single core of the Andromeda cluster divided by the execution time, , required by two Tesla M2050 GPUs and as the execution time, , required by a single GPU divided by the execution time, , required by the multiGPU implementation. The results for the speedup versus DOF are presented in Figs.2a and b.
(23) 
(24) 
A maximum speedup of almost 10 compared to a single core and about 1.9 compared to a single GPU is observed. The combined computational capability of the two GPUs with a peak double precision floating point performance of over 1 Teraflops accelerated the computations several times. Compared to a single GPU, the overall speedup variation comprises clearly two different regions. In the first region, the increased computational power of the two GPUs contributes to the observation of an almost ideal speedup () compared to a single GPU. In the second region though, for DOF >, the frequent communication and the larger amount of memory that has to be exchanged between the GPUs causes the GPUtoGPU bus to saturate and as a result the communication to become a bottleneck that slows the computations down. In this implementation, the time required for synchronisation and data exchange of any kind (local and global communication) is measured and summed for the total communication time presented in Table 3. For problems with DOF >20000, the decreasing number of operations per GPU overcomes the delay caused by the communication, thus achieving a speedup greater than 1. According to Table 3, as the problem size grows, the time required for the communication becomes less significant compared to the time required for computations. Nevertheless, for larger problems due to the higher processing power of the GPU compared to a core, the communication is still a major drawback, not allowing linear speedups to be achieved.
DOF  Computation %  Total Communication % 

29791  79.84  20.16 
132651  83.45  16.53 
226981  84.63  15.37 
357911  85.38  14.62 
531441  86.71  13.29 
753571  87.39  12.61 
1030301  87.95  12.05 
1367631  77.17  22.93 
1771561  76.49  23.51 
4.6 Comparison
Table 4 shows the values of a factor defined as relative speed:
(25) 
is the maximum wall clock execution time from all implementations for a specific problem size (i.e. number of DOF), is the minimum wall clock execution time for the implementation i for the same DOF and i is the implementation of the PGMRES (Cases AD).
Apparently, the relative speed is equal to one for the slowest implementation at a specific number of DOF. As shown in Table 4, for small problems, PGMRESclassic on the shared memory architecture (Case B) is the fastest, while PGMRESGPU (Case C) is the slowest. As the number of DOF increases, PGMRESclassic on the distributed memory architecture (Case A) becomes the slowest while PGMRESmGPU becomes the fastest one. At DOF >, comparable relative speeds of PGMRESclassic on Pegasus (distributed memory) and on one node of Andromeda (shared memory) are noticed, despite the Pegasus superiority in terms of speedup. This is due to the older hardware (CPUs, memory) of Pegasus compared to Andromeda’s. Andromeda’s single core performance is approximately 3.5 times faster than Pegasus.
DOF  PGMRESclassic (DM)  PGMRESclassic (SM)  PGMRESGPU  PGMRESmGPU 

29791  1.90  3.07  1.00  1.13 
132651  1.00  2.02  1.19  1.79 
226981  1.00  1.66  1.15  1.97 
357911  1.00  1.34  1.27  2.33 
531441  1.00  1.17  1.27  2.26 
753571  1.00  1.32  1.43  2.37 
1030301  1.00  1.39  1.73  2.52 
1367631  1.00  1.42  1.93  2.61 
1771561  1.00  1.51  2.01  2.83 
5 Conclusions
In this work is presented an implementation of the preconditioned GMRES for large sparse linear systems ranging from a few thousands DOF to over one million on various parallel computing architectures. The partitioning technique used for the distribution of the computations to the various processes is briefly described and the basic aspects of the GPUs architecture are presented.
The main issue addressed in this work is the method that someone should choose in order to efficiently solve large sparse linear systems of equations depending on the hardware available. As far as one node of the distributed memory core/GPU cluster Andromeda is concerned, the shared main memory puts a serious limitation to the number of cores that can run concurrently, before memory access becomes a bottleneck. The use of a single GPU device can accelerate the computations several times. Using multiple GPUs even greater speedups are achieved. Furthermore, the use of multiple GPUs allows for the solution of even larger problems, because of the mesh partition. Nevertheless, those facts presuppose that the total GPU’s main memory meets the memory requirements of the largescale problem under study. Otherwise, the use of the distributed memory cluster Pegasus is preferred where the achieved speedup is not affected by the communication because of its fast network. Ongoing research involves a hybrid parallel implementation of the PGMRES capable to run on both multiple cores and GPUs.
References

(1)
B. Smith, P. Bjorstad, and W. Gropp.
Domain Decomposition. Parallel multilevel methods for elliptic partial differential equations
. Cambridge University Press, 1996.  (2) Y. Saad. Iterative methods for sparse linear systems. PWS Publishing Company, Boston, 1996.
 (3) Y. Saad and M.H. Schultz. Gmres: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comp., 7, (1986) 856869.
 (4) M. Papadrakakis, G. Stavroulakis, and A. Karatarakis. A new era in scientific computing, domain decomposition methods in hybrid cpu/gpu architectures. Comput. Methods Appl. Mech. Engrg., 200, (2011) 14901508.
 (5) https://developer.nvidia.com/content/cuda10 (last accessed 19.03.2013).
 (6) H.E. Haiwu, C. Bergere, and S. Petiton. A hybrid gmres/ls  arnoldi method to accelerate the parallel solution of linear systems. Comput. Math. Appl., 51, (2006) 16471662.
 (7) Guangye Li. A block variant of the gmres method on massively parallel processors. Parallel Comput., 23, (1997) 10051019.
 (8) J. Erhel. A parallel gmres version for general sparse matrices. Electron. Trans. Numer. Anal., 3, (1995) 160176.
 (9) G. Pashos, M.E. Kavousanakis, A.N. Spyropoulos, J.A. Palyvos, and A.G. Boudouvis. Simultaneous solution of largescale linear systems and eigenvalue problems with a parallel gmres method. J. Comput. Appl. Math., 227, (2009) 196205.
 (10) D. Calvetti, J. Petersen, and L. Reichel. A parallel implementation of the gmres method. Numerical Linear Algebra, eds. L. Reichel, A. Ruttan and R.S. Varga, de Gruyter, (1993) 3146.
 (11) N. Galoppo, N.K. Govindaraju, M. Henson, and D. Manocha. Lugpu efficient algorithms for solving dense linear systems on graphics hardware. ACM/IEEE conference on Supercomputing, (2005) 3.
 (12) V. Volkov and J. Demmel. Lu, qr and cholesky factorizations using vector capabilities of gpus. Technical Report UCB/EECS200849, EECS Department, University of California, Berkeley, (2008).
 (13) M. Garland. Sparse matrix computations on manycore gpu’s. In Proceedings of the 45th annual Design Automation Conference (DAC ’08), (2008) 26.
 (14) J. Bolz, I. Farmer, E. Grinspun, and P. Schroder. Sparse matrix solvers on the gpu: conjugate gradients and multigrid. ACM SIGGRAPH 2003 Papers, San Diego, California, ACM, New York, 3, (2003) 917924.
 (15) M. Wang, H. Klie, M. Parashar, and H. SudG. Solving sparse linear systems on nvidia tesla gpus. Allen et al. (Eds.): ICCS 2009, Part I, LNCS 5544, 3, (2009) 864873.
 (16) M. Wang, H. Klie, M. Parashar, and H. Sudan. Solving sparse linear systems on nvidia tesla gpus. Allen et al. (Eds.): ICCS 2009, Part I, LNCS 5544, (2009) 864873.
 (17) I. Moret. A note on the superlinear convergence of gmres. SIAM J. Numer. Anal., 34, (1997) 513516.
 (18) W. Xian and A. Takayuki. Multigpu performance of incompressible flow computation by lattice boltzmann method on gpu cluster. Parallel. Comput., 37, (2011) 521535.
 (19) M. Ament, G. Knittel, D. Weiskopf, and W. Straßer. A parallel preconditioned conjugate gradient solver for the poisson problem on a multigpu platform. 18th Euromicro Conference on Parallel, Distributed and Networkbased Processing, (2010).
 (20) G. Karypis and V. Kumar. A fast and highly quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20, (1999) 359392.
 (21) http://febui.chemeng.ntua.gr/pegasus.htm (last accessed 19.03.2013).
 (22) http://febui.chemeng.ntua.gr/andromeda.htm (last accessed 19.03.2013).
 (23) G. Pashos, E.D. Koronaki, A.N. Spyropoulos, and A.G. Boudouvis. Accelerating an inexact newton/gmres scheme by subspace decomposition. Appl. Numer. Math., 60, (2010) 397410.
 (24) G. Strang and G.J. Fix. An Analysis of the Finite Element Method. Englewood Cliffs, NJ, 1993.
 (25) G.H. Golub and C.F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, 1983.
 (26) R.B. Morgan. A restarted gmres method augmented with eigenvectors. SIAM J. Matrix Anal. Appl., 16, (1995) 11541171.
 (27) K. Burrage and J. Erhel. On the performance of various adaptive preconditioned gmres strategies. Numer. Linear Algebra Appl., 5, (1998) 101121.
 (28) http://developer.nvidia.com/cudatoolkit40 (last accessed 19.03.2013).