1 Introduction
Finite difference (FD) modeling methods have been actively applied to seismic wave propagation problems on the last fifty years. In this area, advanced methods use staggered grids (SG) in combination to fourth and higherorder FD formulas to resolve the wave equation in first order formulation, and then solving for all dependent variables [2], [17], [19], [26], [35], [37] [41], [42]. In particular, the velocitypressure formulation of the acoustic wave equation allows stating free surface and rigid wall boundary conditions as simple Dirichlet conditions, and is also well suited for the implementation of absorbing boundaries. In a SG, discrete wave fields may be defined on separate nodal meshes, which are displaced by the half of grid spacing in one or more coordinate directions. On these grids, an unknown wave field is placed at the center of those it depends, and thus SG differentiation yields more accuracy in comparison to a traditional nodal FD stencil with an error dependent on full .
FD SG methods mentioned above, among many other wave propagation schemes, employ explicit secondorder Leapfrog time integrations that demand minimal storage for wave field histories, but strongly limit time stepping to enforce stability. Multistep explicit time integrations with third and higherorder accuracy have been progressively explored in search for wider CourantFriedrichsLewy (CFL) stability ranges, while reducing numerical dispersion and other anomalies [3], [4], [43], [51],[52]. On the other hand, alternating direction implicit (ADI) methods apply a semiimplicit integration to the wave equation that allows for large time stepping, and only involves previously computed wave fields. Thus, ADI methods are onestep and low memory demanding. These methods were developed by Peaceman and Rachford [34] to solve parabolic equations, and then used in hyperbolic problems [8], [21], [30]. Since then, several ADI schemes have been focused on solving the acoustic wave equation by using compact finite differences (CFD) to discretize the spatial derivatives. Recent applications are [1], [13], [14], [22], [24], and [28]. A CFD formula linearly couples discrete evaluations of a function and its derivative of certain order at subsequent grid points, and requires solving a banded systems of linear equations (SEL) for the latter values along a whole grid line. Given this formulation, a CFD stencil presents a shorter length compared to a traditional FD with similar accuracy, and this property shorten the linear system bandwidth. For instance, a set of fourth order CFD can be applied through the solution of a tridiagonal system via a fast solver like Thomas’ algorithm (TA). The literature on CFD is broad, but the seminal paper by Lele [25] presents Taylorbased constructions of central and lateral formulas of several order of accuracy, and their optimization for maximum resolution and low dispersion on oneway advection problems. More recent contributions on CFD can be found on cited works above. Notice that ADI CFD methods have shown a potential for fast and accurate wave simulations, combining the large ADI CFL condition with the CFD lowdispersive propagation modeling.
Time explicit FD wave propagation methods based on local spatial stencils are well suited for domain decomposition and code parallelization. Main reasons are their formulation on structured meshes and their low data dependency among grid subdomains. Particularly, implementations on Graphics Processing Units (GPU) have become useful and powerful wave simulation tools in 2D and 3D, given that realistic applications may result computationally intensive even in two spatial dimensions [31], [33], [39], [47], [49], [54]. Conventional ADI methods and CFD schemes addressing for high computational efficiency must rely on fast SEL solvers. In the case of parallel implementations, SEL solvers must be amenable to parallelism and efficiently deal with data dependencies and scalability issues. On GPUs, most of ADI methods have been developed for parabolic equations as those found in [15], [46], [50], and [53]. Alternatively, CFD implementations on GPU have been mostly assembled for computational fluid dynamics based on solving the NavierStokes equations, as given by [16], [29], and [45]. All these implementations with fourthorder spatial accuracy have adopted the Cyclic Reduction (CR) tridiagonal solver, and apply hardwaredependent specializations and optimizations for better performance. However, hybrid CRTA solvers have been also proposed in [18] and [23]. In addition, solution methods based on approximate matrix inversion have been alternatively used by [48], and Authors in [12] and [11] developed a problembased solution approach. Thus, last word on GPU tridiagonal solvers has not yet been said, and this motivates recent the reviews and exploratory studies in [20] and [27].
This work builds on the contribution of Cordova et.al [9]. Grid types, FD spatial discretizations, and the CNADI time integration used by the former methods in [9] are the same that support our current numerical schemes. However, latter schemes differ in two main numerical aspects. First, our new CFD nodal method employs exclusively fourth order compact stencils at all grid points, as given by Lele in [25] and summarized in Appendix A, whereas the former method reduces accuracy to third order at boundary closures. On wave problems exhibiting boundary layers, or any source of strong gradients near boundaries, the fully fourth order method may perform more precisely. Second, the former mimetic SG method, although globally fourth order accurate, presents a formulation in terms of compactlike operators that apply differentiation in two steps. This formulation feature has no advantage at the computation level, so we here reformulate this SG method in terms of standard mimetic operators, originally presented in [5][6] and detailed in Appendix B. From the implementation point of view, this work undertakes the parallelization of both new schemes on multicore (CPU) and manycore (GPU) architectures, and evaluate their performance to measure limiting speedup. It is worthy of mention that the literature on parallel ADI FD methods for wave propagation in acoustic or more general media is rather scarce, and we have not found schemes based on CFD or MDF, rather than the pioneer implementations by Moya F. in [32]. This deficiency highly motivates this work.
The remainder of this paper is organized in the following way. Section 2 briefly presents the formulation of the CFD and mimetic numerical methods for acoustic wave motion, along with the CPU complexity analysis of associated pseudocodes. In Section 3, we describe the progressive code optimization and parallelization of our methods to improve computational performance in available CPU and GPU scenarios. Section 4 states the numerical tests used to assess accuracy and convergence of these schemes, and to compare the speedup of execution times achieved by all their parallel implementations in Section 5. Our main conclusions are finally given in Section 6.
2 Numerical methods
In a medium with density and adiabatic compression , the propagation of acoustic waves obyes the following system of nonhomogeneous differential equations
Above, the dependent variables are the scalar pressure
and the particle velocity vector
, while may correspond to an additional forcing term. Appropriate initial and boundary conditions must be considered in order to find a particular solution to (2), and this velocitypressure formulation is well suited to state boundary conditions of relevance in seismic applications. For instance, a free surface can be modeled by a homogeneous Dirichlet condition on , and most absorbing conditions require the computation of along the domain boundaries. However, we present the following formulation of our numerical methods on the square domain , where a Dirichlet condition is imposed at , and velocities at those edges must be computed during the simulation. Brief comments on the easy reformulation that also computes boundary values of are appropriately given nonetheless. Initial distributions of all three dependent variables are given at .2.1 The nodal CFD method
To simplify notation, we use a square lattice to discretize for a given integer , and then cell sides have a common length . Grid values of the continuous fields ,, and are hold by matrices , , , respectively. The result of FD differentiation of discrete pressures are collocated on matrices and , according to the direction of differentiation, and similar structures are used in the case of particle velocities. Using the CFD operators and presented in Appendix A, pressure differentiation at all grid nodes is computed by the matrix operations
However, pressure is known along boundary edges because of Dirichlet conditions, thus this formulation also makes use of the reduced matrix that only holds interior grid values. Time updating of these inner discrete pressures is based on the discretization of momentum conservation in (2), and requires approximations to and . In space, this discretization in carried out by reduced operators and , also introduced in Appendix A, applied to discrete velocities
(1) 
Reduced matrices and are obtained from whole matrices after removing first and last rows of , and first and last columns of , respectively. By doing so, we avoid unnecessary velocity differentiation along boundaries.
The application of the Alternate Direction Implicit (ADI) formulation to the acoustic model (2) starts by defining the discrete vector operators
(2) 
Next, the CrankNicolson discretization of time derivatives leads to the fully implicit method
(3) 
where is the time step, represents the discrete solution at a given discrete time, and collects the source contribution at grid interior. Coupled equations of this sort are efficiently solved after the application of an ADI splitting, that only requires the solution of banded SEL for CFD differentiation along a single coordinate direction. The PeacemanRachford ADI decomposition of (3), as used in [9], states a calculation in two stages
(4) 
(5) 
each of which alternates the application of operators and . The intermediate approximation is common on these ADI formulations, but only approximations at time levels and are consistent solutions to (2). According to the definition of , the first stage allows an explicit calculation of vertical velocity in terms of its previous values at time , and the differentiation of pressure along the direction. However, this stage also couples the calculation (row by row) of intermediate and in the following two sets of linear systems
We denote the matrices on the right hand side of above equations by and , respectively, which can be computed from known solutions at . That is, and . Next, rows of intermediate pressures and horizontal velocities can be computed by forward and backward substitutions from the LU decomposition of tridiagonal and , on the following uncoupled approximation.
This fixed point iteration stops when and , for a given tolerance .
The second stage of the ADI decomposition (5) allows computing all wavefields at time through a similar procedure. The vertical velocity is computed explicitly using its intermediate value and the approximation , according to the definition of . The columns of pressure and velocity are successively computed by the fixed point ADI iteration
(6) 
where matrices and are obtained from intermediate variables, as one can see. In above formulation, matrix inversion is only used for notation convenience in auxiliary matrices and , but LU is actually employed to resolve the linear systems involved in their calculations. As mentioned earlier, the additional linear systems embedded into each fixedpoint iteration in (2.1) and (6), are also solved by forward/backward substitutions, using the previous LU factorization of and .
The above CDF formulation is amenable for including absorbing damping layers that require the computation of discrete wave fields along boundaries. By simply replacing reduced operators and , and reduced matrices , , and , by their full matrix versions in previous formulation steps as convenient, the boundary wave fields can be also obtained. These values can be further damped by the absorbing technique proposed in [7]. Even more, the versatile strategy developed in [44] for the damped wave equation can be also adopted after a straighforward reformulation of our CFD method.
A highlevel implementation of above nodal CFD method is sketched in algorithm 1. This algorithm presents two sequential inner loops in lines and , corresponding to the two ADI stages that finally calculate from discrete solutions at . For clarity, each of these stages is broken down into a separate coding procedure given in Appendix C. At each of these procedures, there is no data dependency among the linear systems solved, either row wise or column wise. Thus, each ADI set of tridiagonal systems can be solved simultaneously by our parallel methods.
The algorithm 1 employs the following parameters and matrices:

: Initial conditions for pressure and velocities on the square lattice that discretizes the region .

: Matrices of grid values of material properties and , respectively.

: CFL stability parameter and the maximum value of wave speed .

: Physical time of the global simulation.
Moreover, we use some special index notations in this pseudocode. Matrix represents the reduced form of matrix according to the formulations given in sections 2.1 and 2.2. Matrix indexing by refers to every element along the corresponding dimension, while only denotes the interior elements . Finally, represents the element wise vector/matrix multiplication, as used in Octave and MATLAB.
2.1.1 Complexity analysis of the nodal CFD method
As expected, the computational cost of this method is driven by the solution of the ADI embedded SEL. LU has been traditionally employed for solving tridiagonal linear systems due to its computational speed, in addition to a high numerical stability for diagonally dominant matrices (i.e., the Gaussian elimination can proceed with no pivoting). The algorithm only performs steps to solve a given system, that results in an algorithmic complexity of . In this work, we also omit any pivoting strategy, even though CFD matrices and , are not strictly diagonally dominant. Numerical results do not reflect any instability signs arising from this restraint.
The operation complexity depends on the grid density , that sets the dimension of vectors and matrices used by the nodal CFD method. In case of reduced matrices , , , and others, , for . The total number of time iterations simply results , where is the physical simulation time. In this analysis, let us consider the matrix operations in Table 1.
Operation  Data type  Computational  Order 

cost  
Assignment  Dense matrix  
Addition/Substraction  Dense matrix  
Product  Scalar x Dense matrix  
Sparse x Sparse matrices  
Dense x Sparse matrices  
SEL solutions by LU  tridiagonal systems  
Frobenius norm  Dense matrix 
Table 2 lists the amount of matrix operations performed by the nodal CFD method during time iterations. From there, one can easily see that the time complexity of this scheme is of polynomial order, and highly dominated by the row wise and column wise ADI cycles (see statements 11 and 15, in the algorithm 1). In particular, the complexity dominant term is , where is the maximum number of iterations spent by each ADI stage to converge. In all numerical experiments carried out on this work, varies from 6 to 8, and thus we expect being of in similar wave propagation applications.
Section  Maximum  

iterations  
Initialization  1  5  
computation  1  2  1  
1  1  1  1  1  
ADI initialization  4  1  1  2  
ADIrows  4  4  2  2  2  2  
computation  1  2  1  
computation  1  1  1  1  1  
ADI initialization  4  1  1  2  
ADIcolumns  4  4  2  2  2  2 
2.1.2 Additional comments on the CFD implementation
Along our optimization and parallelization process, we implement and test slightly different versions of the ADI embedded iterations (2.1) and (6), in search for a higher level of parallelism. In particular, matrix might be computed from pressure in the second step of (2.1), so this calculation can proceeds simultaneously with the first step that updates . A similar strategy can be applied in (6) for a simultaneous computation of and , making this ADI formulation more suitable for parallel platforms. This simple idea is similar to the one behind the iterative Jacobi linear solver, fully parallelizable in comparison to its Gauss Seidel counterpart. However, this Jacobilike ADI actually requires few more iterations for convergence, than the above implementation that delays computation until is available. In addition, this ADI choice also leads to a reduction of the CFL stability limit of this nodal method around , and this deficiency is even higher in the case of the mimetic SG method. Thus, we discard this implementation choice in current developments.
2.2 The staggered MFD method
In this formulation, the grid distribution of discrete wavefields takes place on a rectangular staggered grid. In analogy with the nodal mesh used by the previous CFD method, let us consider the 1D grid of cells of length along the direction, with nodes , and cell centers for . These grid nodes are collected in vector , while is another vector comprising cell centers in addition to grid edges, that is . After defining similar grid vectors along the direction, the staggered distribution of discrete pressures correspond to the mesh locations given by , where represents the standard cartesian product operator used in set theory. Alternatively, grid sites of discrete horizontal and vertical velocities are given by and , respectively. According to these staggered distributions, matrix dimensions of discrete wavefields are , and , in contrast to the previous nodal formulation where all these matrices are square. Due to the boundary conditions, we again make use of reduced matrices and , obtained from original and after removing rows from the former and columns from the latter, as before.
The staggered FD differentiation employs the mimetic fourthorder operators and given in Appendix B, for an approximation of partial derivatives along both directions of all three wavefields
(7)  
(8) 
Next, we replace the nodal CFD definition of discrete vector operators and by the following staggered approach
(9) 
With above definitions at hand, the formulation of this staggered MFD method proceeds by replicating the time discretization and PeacemanRachford twostage ADI decomposition, applied in the previous section. A pseudocode of this method is given by the algorithm 2. For completeness, the two ADI stages are broken down into separate procedures, and given by the algorithm 4 in Appendix C. The algorithm 2 employs the same parameters used by the nodal algorithm 1, as well as similar matrix and indexing notation.
2.2.1 Complexity analysis of the staggered MFD method
The complexity analysis of this method considers the same matrix operations defined in Table 1. The new Table 3 lists the total amount of these operations performed by the staggered method during time iterations. Similar to the nodal scheme, the operational cost is a dependent polynomial, with a leading term given by the solution of the ADI stages (referred in statements 11 and 15 of the algorithm 2), that we denote as . Specifically, . Thus, the total amount of operations is less than those employed by the nodal method. However, the tridiagonal structure of embedded linear systems on the former method, and our LU based solution strategy, make equivalent the operational complexity of both methods.
Section  Maximum  

iterations  
Initialization  1  5  2  
computation  1  1  
1  1  1  1  
ADI initialization  4  1  1  
ADIrows  4  4  2  2  2  
computation  1  1  
computation  1  1  1  1  
ADI initialization  1  1  1  
ADIcolumns  4  4  2  2  2 
3 Computational optimization
The implementation of the methods presented in Section 2 takes place on an Intel(R) Core(TM) i74770 CPU with cores running at GHz clock speed, and MB of cache memory. This architecture counts with a NVIDIA GTX 960 Maxwell GPU card and the Ubuntu Linux LTS operating system.
3.1 CPU implementations
3.1.1 Sequential versions
Original codes of the proposed methods were developed in Octave and used the LU decomposition of CFD matrices and to solve ADI embedded SEL [9]. To better use the capabilities of our CPU architecture and apply some code optimization techniques, we migrate these original codes to C++, and use the version of the gcc compiler in this process.
The first optimization step is based on using different compilation flags, among which the flag significantly improved our C++ codes, leading to an immediate performance improvement. This flag automatically introduces the following software optimizations: loop unrolling, function inlining and automatic vectorization. A second set of optimizing strategies are related to matrix orientation, and its application highly affects to the new CFD code. The original code in [9] solved the inherent SEL’s in a vector oriented fashion, i.e. matrix updates and system solutions are performed a row/column at a time. To take better performance advantage of the algebra libraries, a matrix oriented approach is taken for the C++ implementation. That is, updates and system solutions proceed simultaneously on the full matrix. It can be seen that most of vector operations in both algorithms CFD and MDF decouple, so they can be performed as a single matrix operation. In particular, the CFD SEL’s with multiple right hand sides and same matrix, can be solved simultaneously.
3.1.2 Parallel versions
For multiprocessing, we use the Octave parallel package to distribute computation along rows/columns among several processes. At first, one process is used for each row/column, but this solution performs extremely bad. Little work is assigned to each process, resulting in most of computing power being wasted in managing these processes and their communication. A second attempt consists of using a small number of processes (up to in an eight core CPU), and then lists of columns/rows are assigned to each of these processes. Performance results are slightly better, however the singlethreaded implementation is still faster for all tested numerical meshes, with the grid of biggest size (, as detailed in Section 4), as the only exception. In general, results of using this Octave parallel package are poor, because of the need of data replication as required by the multiple launched processes. Up to date, no mechanism is available for fast memory sharing/synchronization among processes in this parallel package.
Due to results discussed above, we return to the original vector orientation of this problem, as it parallelizes naturally thanks to the decoupled operations at the vector level. The partition proceeds at the row/column level, distributing each vector in a thread pool that performs the required operations. Our C++ implementations follow and exploit this data orientation. The next optimization step applied to these vectororiented versions is loop parallelization using OpenMP. OpenMP allows us to take full advantage of Intel multicore CPU architectures. In our OpenMP loop parallelization, we have taken into account two intrinsic aspects of our numerical discretization: (i) data dependencies, and (ii) stencil dependencies. Each parallel loop collects a common set of updating instructions applied on a particular grid zone (boundary, near boundary, or interior nodes). Scheduling of OpenMP threads is established by the flag ’auto’, which invokes as many threads as the number of available CPU cores. We observe in our experiments that this option provides the best performance for a wide range of scenarios. Increasing the number of threads over the number of cores usually entails an overhead due to communication conflicts, whereas using less threads than available cores might lead to an underuse of a multicore platform.
3.2 GPU implementation
Once CPU parallelization has been appropriately exploited, we address a new GPU implementation of our numerical schemes. On the available architecture, the programming language is CUDA C based on the nvcc compiler. In this migration process of our original sequential codes, we also use the Toolkit and functions of the cuBLAS library, and only develop specific kernels for particular functionalities not supported by cuBLAS.
We next comment on crucial implementation details. Due to the relatively small sizes of the matrices involved in our test cases (up to ), the vector oriented workload does not yield enough parallelism to fill the GPU. Thus, we return to the matrix approach in our GPU implementations, and instead of partitioning at the row/column level, we focus on optimizing the matrix operations themselves at the element level. This allows achieving a much higher level of parallelism, given that parallelization is performed throughout the full extent of the algorithm, and not only at the row/column solvers.^{1}^{1}1In order to preserve generality of our implementations, we do not unroll the computation chain. This could yield much higher speedups at the expense of problem specificity.
As already mentioned, we implement some missing banded matrix functionalities in cuBLAS. For instance, the Level 3 (L3) product of a banded matrix times a dense matrix is not provided, while the Level 2 (L2) multiplication of a banded matrix times a vector is indeed available. We build the L3 operation over the L2 one by decomposing a matrixmatrix multiplication into independent matrixvector products and accumulation operations. Our first approach, using streams to overlap the matrixvector products, achieves a very low occupancy due to the cost of kernel launching, relative to the workload of each thread. We found that best results are obtained when the whole matrix is computed in a single launch, as the work per thread increases. Here, a higher number of threads are effectively launched, increasing the opportunities to obtain better occupancy. Same results are achieved by using streambased distances and matrix norm computations.
On the other hand, our GPU implementations need to apply reduction operations, like for instance, those involved in matrix norm computations. In this case, parallelization of the reduction process is performed in a hierarchical fashion. Each thread adds a pair of elements, and then half of the threads add the partial results computed. This process is repeated for iterations, until the whole list is reduced to a single value. Several practical considerations must be considered to achieve optimal performance: minimal divergence, array indexing to optimize data locality, unrolling of some operations, etc. Bigger reductions are computed in two steps. First, one reduces in parallel all the columns of the matrix to a vector in shared memory. Then, this vector is reduced again to a single value. Here, we have used this reduction technique to calculate the matrix Frobenius norm in our new kernels, as required to test ADI convergence according to algorithms in Appendix C.
Finally, we want to notice that the stopping criterion of ADI iterations in algorithms 3 and 4 of Appendix C, imposes an important constraint for GPU performance. The number of steps required until convergence is actually limited by the constant parameter , that varies from 6 to 8, in our experimental set. However, convergence on the Frobenius norm of matrices , , and , updated by GPU processing, require that a floating point value must be transferred form device to host, for testing against tolerance. This is wasteful as the PCIe bus is slow and has a high latency. To circumvent this problem, we study the average number of steps until convergence. In our cases, this number is roughly 6, so we thus perform this minimum number of iterations until convergence start being tested, and saving most of the unnecessary device to host transfers.
Concerning our new CUDA schemes, we want to emphasize that checking for inner ADI convergence forces the synchronization of kernel threads, slowing down the global computation, and leading to a bottleneck that hampers performance. As it was said, convergence is almost always achieved in a constant number of iterations for each test case, so we opt for executing a minimum number of iterations before any checking take place.
4 Numerical tests
We next present the family of numerical tests that allow assessing accuracy and convergence properties of both the CFD and MDF schemes. Same tests are used to evaluate their computation speedup gained through the optimization stages described previously, but this discussion is given in next section. Under appropriate initial and Dirichlet boundary conditions (BCs), the propagation model (2) has the following exact solution
(10) 
This solution is time harmonic with period and presents a spatial behaviour controlled by the polynomial coefficient , being a possitive integer, and the spatial period of the sinusoidal component. These four parameters allow adjusting the difficulty of the particular problem numerically solved. In particular, the test solution can switch from a purely sinusoidal function for , to one with significant gradients at the boundary neighbourhood in the case of and with large magnitudes.
We first solve the case of where BCs reduce to homogeneous, consider a unitary wave speed (i.e., ), and take and . We perform ten numerical simulations of each scheme for an increasing number of grid cells , so grid resolution improves with the reduction of . In each case, the global simulation time corresponds five time periods, namely , and the time step is taken as the maximum permitting stable calculations, where . The term is the well known CFL bound that varies between our two FD ADI schemes, and were found in [9]. Especifically, we employ the values of and in our CFD and MDF simulations, respectively. Next, a second set of harder tests with non homogeneous BCs are also numerically solved, where . In this case, we only inspect the representative cases of and , and also assume to simplify this current parametrization.
Figure 3 shows absolute errors in Frobenius norm found on numerical tests performed under homogeneous and heterogeneous BCs. In general, these errors steadily decay with validating our new parallel implementations. On both tests with mild difficulty, i.e., and , the MFD method is slightly more accurate on most of the explored grids with , because staggering halves the grid spacing during numerical differentiation, as compared to the nodal CFD scheme. However, when solving the hardest test with , MDF errors on simulation with suffer a clear stagnation, making useless any further grid refining. Thus, we limit this experimental study, and subsequent speedup analysis, to meshes where .
16  

24  5.41  3.17  2.70 
32  4.78  2.36  2.26 
48  5.92  2.34  2.02 
64  5.06  1.70  1.85 
96  3.64  1.87  1.92 
128  0.22  1.96  1.98 
256  4.02  1.98  1.98 
512  2.52  2.00  2.00 
1024  2.70  2.00  1.99 
Average  4.02  2.07  2.02 
In addition, we estimate the convergence rates of the CFD and MDF solutions as reduces, and list them in tables 4 and 5
, respectively. These estimations are based on the standard error weighting from two consecutive simulations. For each test, we also average the resulting rates after omitting the highest and lowest values, to quantify the overall convergence behaviour avoiding main outliers. For both schemes, convergence degrades as the problem becomes harder, being almost quartic on the homogeneous test (limited by the spatial discretization), and reducing to quadratic on the hardest heterogeneous test (more constrained by the order of CN time integration). We would like to remark that accuracy and convergence results reported in
[9], clearly show a MDF superiority on similar tests under homogeneous BCs. This disadvantage of the former CFD method arises from a reduction of nominal accuracy from fourth to third order at grid boundaries. Here, we opt for a fully fourth order set of CFD stencils at all grid points when implementing our current method.16  

24  4.53  5.54  4.04 
32  4.94  2.17  1.93 
48  5.29  1.92  1.92 
64  7.37  1.94  1.94 
96  0.80  1.98  1.97 
128  1.94  1.99  2.01 
256  5.70  2.00  1.96 
512  0.15  1.91  1.40 
1024  3.84  0.40  0.17 
Average  3.87  1.99  1.88 
5 Comparison of performance results
Figures (a)a and (b)b show speedups in log2 scale, relatives to the original sequential Octave codes, of each implementation delivered by the optimization process discussed in Section 3. These results were obtained after solving the numerical test under heterogeneous BCs for . Speedup values observed on the alternative heterogeneous or homogeneous tests, are very similar to the ones shown in these figures, confirming that evaluation of the forcing function , and related operations at boundaries, minorly affect computation times.
The new CFD and MDF implementations using parallel Octave report a poor performance in most of our tests, being the former better that finally improves to a 1.19x ratio at =1024. Conversely, sequential C++ implementations yield a global improvement that reaches 5.39x (CFD) and 2.6x (MFD) at the finest grid. This improvement is mainly due to software optimizations performed by the the compiler. Now, C++ parallel codes that use OpenMP achieve a higher improvement at all grids, and limiting speedups reach 11.1x (CFD) and 6.74x (MFD), at = 1024. In this case, the main reason that favors speedup is using optimized LAPACK solvers, combined to the parallelization of some operations, and to the reduction of L1L2 cache access failures. The OpenMP implementation was almost trivial to develop from the vector oriented approach, as vectors decouple and no synchronization is needed among threads. Finally, we observe the best performance results on the GPU implementations, where the CFD GPU speedup grows with until 38.85x at the best resolved grid. This is a clear indication of the high level of parallelism achieved by our GPU implementation, on the range of grid sizes tackle in this work. On the other hand, speedup of the CUDA MFD method seems to converge steadily on grids with to a limit of , with a minor improvement at the limiting mesh, and this behavior reveals a soon high occupancy of GPUs.
As a complementary set of speedup results, we list in Table 6 only the ratios achieved by the parallel CFD and MFD methods, developed either in C++ using 8 threads or in CUDA, relative to their sequential C++ implementations. In most of our grids, these speedups improve with the mesh size. In particular, the improvements accomplished by the GPU implementations grow steadily, given the high device occupancy by our bigger matrices. It is worth reminding that grid size is limited in our tests by the numerical precision achieved by our methods, and higher refinements would not lead to more accurate results, at least for the MDF scheme. Thus, we do not test our parallel implementations with bigger matrices. Again, using the sequential C++ computations as reference, the OpenMP implementations only reach limiting improvements of 2.06x (CFD) and 2.59x (MFD) at , due to matrix sizes are not big enough to balance expenses on either, the parallel LAPACK LU solutions of the nodal scheme, or the several matrixvector products of the staggered method. On the contrary, performances gained by the GPU implementations achieve maximum speedups of 7.21x (CFD) and 15.80x (MFD). Keeping matrices in GPU memory and minimizing hostdevice transfers were essential to obtain such performances. In these cases, the usage of streams was not effective giving that launching times were too high with respect to the kernel working times.
The testing process of ADI convergence (inner loops of algorithms 3 and 4 in Appendix C) may result a critical restricting factor to performance, as it demands several small devicehost DMA transfers. As the number of loop iterations until convergence is roughly constant, we minimize this bottleneck by delaying checking the stopping criterion until a fixed number of iterations is performed.
Nodal CFD  16  24  32  48  64  96  128  256  512  1024 

C++(8 threads)  2.75  2.75  2.76  2.75  2.64  2.55  2.52  2.27  2.07  2.06 
CUDA  0.13  0.18  0.23  0.41  0.79  1.22  1.56  2.54  4.82  7.21 
Staggered MFD  
C++(8 threads)  2.29  2.12  1.96  1.78  1.98  2.04  2.03  2.15  2.31  2.59 
CUDA  0.13  0.29  0.45  0.85  1.17  2.73  3.08  5.94  8.08  15.81 
6 Conclusions
This paper first focuses on the computer parallelization of two fourthorder FD methods to model 2D acoustic waves on available multicore (CPU) and manycore (GPU) architectures. The first scheme uses CFD operators on nodal grids and requires solving a tridiagonal system of linear equations (SEL) to compute coupled derivative values along each coordinate grid line. Alternatively, the second scheme employs MFD stencils on SG for local differentiation free of SEL solution. Both methods share a CrankNicolson (CN) time integration combined to a PeacemanRachford splitting that lead to an efficient ADI updating scheme. These FD ADI methods have shown better stability and convergence properties than an explicit Leapfrog integration in a previous work [9], but the computing expenses of ADIembedded iterations demand parallelization to be competitive on practical propagation problems. Along the implementation process of each method, we present a sequential C++ code with optimal compilation flags, and develop three different parallel versions. Precisely, an Octave and a C+ multithreading code, both fully exploiting arrayoriented storing for matrix and vector operations, and a CUDA implementation on a NVIDIA GTX 960 Maxwell.
Performance speedups of parallel implementations were assessed relative to the computing times of the sequential C++ code, but we additionally report results with respect to the original Octave version. These results can be insightful for someone replicating our optimization procedures for similar numerical methods. Results of GPU methods show a significant speedup gain over their parallel counterparts. Main reasons are a high occupancy of GPUs on explored grid densities, better realized by the CFD method, and a minimization of CPUGPU transfers by delaying the convergence testing of ADIembedded iterations, until an expected number of cycles is actually performed. Relative to Octave, the GPU CFD method achieves a speedup of 38.85x at the finest grid, while the GPU MDF reaches a higher 41x. Same performances, but measured with respect to optimal sequential C++ versions, are 7.21x for the former and 15.81x for the latter. Keeping matrices in GPU memory and minimizing hosttodevice transfers result essential in obtaining such performances.
Secondly, numerical experiments carried out in this study also enlighten convergence properties of these methods on a test suite of increasing difficulty. On a problem with an exact harmonic solution, the MDF accuracy was slightly better, and both methods exhibit convergence rates close to , as the differentiation order used on the spatial discretization. However, both convergences decay to second order on smooth problems with significant gradients at boundaries, that may reveal a stronger dependence on the time integration errors. In any case, these second order experimental rates are consistent to the theoretical global convergence of , and limited by the time discretization.
Acknowledgments
We are thankful to CAF editors and reviewers for insightful comments and revisions to this manuscript. First author was partially supported by the Generalitat de Catalunya under agreement 2017SGR962 and the RIS3CAT DRAC project (001P0 01723). The research leading to these results has received funding from the European Union’s Horizon 2020 Programme, grant agreement No. 828947, and from the Mexican Department of Energy, CONACYTSENER Hidrocarburos grant agreement No. BS69926. This project has also received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie SklodowskaCurie grant agreement No. 777778 (MATHROCKS). O. Rojas also thank the European Union’s Horizon 2020 Programme under the ChEESE Project, grant agreement No. 823844.
References
 [1] (2015) Comparison of finite difference schemes for the wave equation based on dispersion. Journal of Applied Mathematics and Physics 3, pp. 1544–1562. External Links: Document Cited by: §1.
 [2] (2012) A new family of finitedifference schemes to solve the heterogeneous acoustic wave equation. Geophysics 77 (5), pp. T187–T199. Cited by: §1.
 [3] (1997) A modified laxwendroff correction for wave propagation in media described by zener elements. Geophysical Journal International 131 (2), pp. 381–386. External Links: Document Cited by: §1.
 [4] (2016) Threedimensional viscoelastic timedomain finitedifference seismic modelling using the staggered adams–bashforth time integrator. Geophysical Journal International 204 (3), pp. 1781–1788. External Links: Document Cited by: §1.
 [5] (2003) A matrix analysis approach to higherorder approximations for divergence and gradients satisfying a global conservation law. SIAM Journal Matrix Analysis Applications 25 (1), pp. 128–142. External Links: Link Cited by: Appendix B: Staggered finite differences, §1.
 [6] (2001) Fourth and sixthorder conservative finite difference approximations of the divergence and gradient. Applied Numerical Mathematics: Transactions of IMACS 37 (1–2), pp. 171–187. Cited by: Appendix B: Staggered finite differences, §1.
 [7] (1985) A nonreflecting boundary condition for discrete acoustic and elastic wave equations. Geophysics 50 (4), pp. 705–708. Cited by: §2.1.
 [8] (1975) Higher order compact implicit schemes for the wave equation. Mathematics of Computation 29 (132), pp. 985–994. Cited by: §1.
 [9] (2016) Compact finite difference modeling of 2d acoustic wave propagation. Journal of Computational and Applied Mathematics 295, pp. 83–91. External Links: Link Cited by: Appendix B: Staggered finite differences, Appendix B: Staggered finite differences, §1, §2.1, §3.1.1, §3.1.1, §4, §4, §6.
 [10] (201710) Diferencias finitas compactas nodales y centro distribuidas aplicadas a la simulación de ondas acústicas. Ph.D. Thesis, Universidad Central de Venezuela. Cited by: Appendix B: Staggered finite differences.

[11]
(2010)
Graphics processing unit pricing of exotic crosscurrency interest rate derivatives with a foreign exchange volatility skew model
. Concurrency and Computation Practice and Experience 26 (9). Cited by: §1.  [12] (2009) A parallel implementation on gpus of adi finite difference methods for parabolic pdes with applications in finance. Canadian Applied Mathematics Quarterly 17 (4), pp. 627–660. Cited by: §1.
 [13] (2013) Application of a fourthorder compact adi method to solve a twodimensional linear hyperbolic equation. International Journal of Computer Mathematics 90 (2), pp. 273–291. Cited by: §1.
 [14] (2016) An unconditionally stable spatial sixthorder ccdadi method for the twodimensional linear telegraph equation. Numerical Algorithms 72 (4), pp. 1103–1117. External Links: Document Cited by: §1.
 [15] (2011) GPU in financial computing part iii: adi solvers on gpus with application to stochastic volatility. WILMOTT Magazine, pp. 51–53. Cited by: §1.
 [16] (2013) An efficient gpu implementation of cyclic reduction solver for highorder compressible viscous flow simulations. 92. Cited by: §1.
 [17] (2007) Computational methods for largescale 3d acoustic finitedifference modeling tutorial. Geophysics 72 (5). Cited by: §1.
 [18] (2014) GPU implementation of finite difference solvers. In Proceedings of the 7th Workshop on High Performance Computational Finance, WHPCF ’14, Piscataway, NJ, USA, pp. 1–8. External Links: Document Cited by: §1.
 [19] (1996) Simulating seismic wave propagation in 3d elastic media using staggeredgrid finite differences. Bulletin of the Seismological Society of America 86 (4), pp. 1091–1106. Cited by: §1.
 [20] W. M. W. Hwu (Ed.) (2011) GPU computing gems jade edition. 1st edition, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. External Links: ISBN 9780123859631, 9780123859648 Cited by: §1.
 [21] (1978) High order difference schemes for the wave equation. International Journal for Numerical Methods in Engineering 12 (10), pp. 1623–1628. Cited by: §1.
 [22] (2009) The new alternating directiion implicit difference methods for the wave equations. Journal of Computational and Applied Mathematics 230, pp. 213–223. Cited by: §1.
 [23] (2011Sept) A scalable tridiagonal solver for gpus. In 2011 International Conference on Parallel Processing, pp. 444–453. External Links: Document Cited by: §1.
 [24] (2007) Highorder schemes for acoustic waveform simulation. Applied Numerical Mathematics 57 (4), pp. 402–414. External Links: Document, Link Cited by: §1.
 [25] (1992) Compact finite difference schemes with spectrallike resolution. Journal of Computational Physics 103, pp. 16–42. Cited by: Appendix A: Compact finite difference matrices, Appendix A: Compact finite difference matrices, §1, §1.
 [26] (1988) Fourthorder finitedifference psv seismograms. Geophysics 53, pp. 1425–1436. Cited by: §1.
 [27] (2014) A guide for implementing tridiagonal solvers on gpus. pp. 29–44. External Links: Document Cited by: §1.
 [28] (2014) On the dispersion, stability and accuracy of a compact higherorder finite difference scheme for 3d acoustic wave equation. Journal of Computational and Applied Mathematics 270, pp. 571–583. Cited by: §1.
 [29] (2017) A parallel multigrid solver for incompressible flows on computing architectures with accelerators. The Journal of Supercomputing 73 (11), pp. 4931–4956. External Links: Document Cited by: §1.
 [30] (1973) High accuracy adi methods for hyperbolic equations with variable coefficients. IMA Journal of Applied Mathematics 11 (1), pp. 105–109. Cited by: §1.
 [31] (2010) Accelerating a threedimensional finitedifference wave propagation code using gpu graphics cards. Geophysical Journal International 182 (1), pp. 389–402. Cited by: §1.
 [32] (2016) Parallelization of finite difference methods: nodal and mimetic solutions of the wave equation. Master’s Thesis. External Links: Link Cited by: §1.
 [33] (2017) A performance analysis of a mimetic finite difference scheme for acoustic wave propagation on gpu platforms. Concurrency and Computation: Practice and Experience 29 (4), pp. e3880. External Links: Document Cited by: §1.
 [34] (1955) The numerical solution of parabolic and elliptic differential equations. Journal of the Society for Industrial and Applied Mathematics 3 (1), pp. 28–41. External Links: Document Cited by: §1.
 [35] (201301) Accuracy of the staggeredgrid finitedifference method of the acoustic wave equation for marine seismic reflection modeling. Chinese Journal of Oceanology and Limnology 31 (1), pp. 169–177. External Links: Document, ISSN 19935005, Link Cited by: §1.
 [36] (2008) Modelling of rupture propagation using highorder mimetic finite differences. Geophysical Journal International 172 (2), pp. 631–650. External Links: Document, ISSN 1365246X, Link Cited by: Appendix B: Staggered finite differences, Appendix B: Staggered finite differences.
 [37] (2014) Low dispersive modeling of rayleigh waves on partly staggered grids. Computational Geosciences 18 (1), pp. 29–43. Cited by: Appendix B: Staggered finite differences, Appendix B: Staggered finite differences, §1.
 [38] (2009) Modeling of rupture propagation under different friction laws using highorder mimetic operators. Ph.D. Thesis, Claremont Graduate University joint to San Diego State University, California, USA. Cited by: Appendix B: Staggered finite differences.
 [39] (2014) Finitedifference staggered grids in gpus for anisotropic elastic wave propagation simulation. Comput. Geosci. 70 (C), pp. 181–189. Cited by: §1.
 [40] (2011) A novel higher order finite difference time domain method based on the castillogrone mimetic curl operator with applications concerning the timedependent maxwell equations. Ph.D. Thesis, San Diego State University, San Diego, USA. Cited by: Appendix B: Staggered finite differences.
 [41] (2004) Finitedifference modeling of viscoelastic and anisotropic wave propagation using the rotated staggered grid. Geophysics 69, pp. 583–591. Cited by: §1.
 [42] (2000) Modeling the propagation of elastic waves using a modified finitedifference grid. Wave Motion 31, pp. 77–92. Cited by: §1.
 [43] (199501) Dispersion analysis of numerical wave propagation and its computational consequences. Journal of Scientific Computing 10, pp. 1–27. Cited by: §1.
 [44] (1987) Absorbing boundary conditions and surface waves. Geophysics 52 (1), pp. 60–71. Cited by: §2.1.
 [45] (2015) A novel approach to evaluating compact finite differences and similar tridiagonal schemes on gpuaccelerated clusters. Master’s Thesis, Clemson University. Cited by: §1.
 [46] (2009) Acceleration of the 3d adifdtd method using graphics processor units. 2009 IEEE MTTS International Microwave Symposium Digest. Cited by: §1.
 [47] (2016) Numerical modeling of 2d seismic wave propagation in fluid saturated porous media using graphics processing unit (gpu): study case of realistic simple structural hydrocarbon trap. AIP Conference Proceedings 1755 (1), pp. 100001. External Links: Document Cited by: §1.
 [48] (2012) A gpu application for highorder compact finite difference scheme. Computers & Fluids 55, pp. 29 – 35. External Links: Document Cited by: §1.
 [49] (2011) GPU accelerated 2d staggeredgrid finite difference seismic modelling. Journal of Software 6 (8), pp. 1554–1561. Cited by: §1.
 [50] (2013) Parallelizing alternating direction implicit solver on gpus. Procedia Computer Science 18, pp. 389 – 398. External Links: Document Cited by: §1.
 [51] (2002) Timesplitting methods for elastic models using forward time schemes. Monthly Weather Review 130 (8), pp. 2088–2097. Cited by: §1.
 [52] (2012) Three‐dimensional elastic wave numerical modelling in the presence of surface topography by a collocated‐grid finite ‐ difference method on curvilinear grids. Geophysical Journal International 190 (1), pp. 358–378. Cited by: §1.
 [53] (2013) Parallelization of implicit cche2d model using cuda programming techniques. In World Environmental and Water Resources Congress, Cincinnati, Ohio. Cited by: §1.
 [54] (2013) Multigpu implementation of a 3d finite difference time domain earthquake code on heterogeneous supercomputers. Procedia Computer Science 18, pp. 1255 – 1264. External Links: Document Cited by: §1.
Appendix A: Compact finite difference matrices
Lele in [25] presents a Taylorbased construction of compact finite difference (CFD) formulas of second, fourth and higher order accuracy on either nodal or staggered grids. In this work, we use the fourth order accurate CFD matrix operators and that allow differentiation of a smooth function on , whose evaluations at nodes , for and , are the components of vector . Nodal approximations to result from solving the banded linear system , for the following matrices and
(11)  
(12) 
Above, matrix subindexes reflect the accuracy choice among the alternative CFD operators given in [25]. This formulation is quite general yielding a threeparametric family of CFD operators with fourth order accuracy, at least, and those given above result from reducing matrix bandwidth to assure tridiagonally.
An alternative pair of CFD operators can be derived from each and set, by omitting the approximations to at grid edges. In this case, the reduced vector only collects values of at interior grid nodes, and the new CFD approximation to at point result from the linear combination of original formulas at nodes and . The same formula is now used at , after switching signs of CFD coefficients to account for backward differentiation. This simple process leads to the reduced matrix operators and of dimensions and , respectively, that take the following form in the case of 11 given above
(13)  
(14) 
Note that is also tridiagonal which permits using same algorithm on the calculation of both and .
Appendix B: Staggered finite differences
For fourth order finite differentiation on staggered grids (SG), we employ the 1D mimetic operators developed in [5] and [6], that have been applied to wave propagation by [9], [38],[36], [37] and [40]. On the interval , we retake the SG used in section 2.2 with cells of length , where nodes and cell centers correspond to the evaluation points of two given scalar functions and . Specifically, nodal evaluations are collected in vector , while evaluations of at cell centers and grid edges conform vector . Staggered differentiation is carried out by two separate FD matrix operators and , where components of approximate at every cell center, while yields approximations to at all nodes. For clarity, the following figure illustrates the spatial distributions of components of , , and on a 1D SG
By construction, each or operator with fourth order accuracy, comprises an individual set of SG FD stencils dependent on three free parameters , and . An optimal parametrization has been recently proposed in [10] to preserve the self adjointness between mimetic or , an important property satisfied by their continuous differential counterparts divergence and gradient operators. However, we here adopt same parameter values , , and used in [9], [36], and [37] that lead to and with minimun bandwidth, to favor computational efficiency.
(15) 
Comments
There are no comments yet.