Minimod: A Finite Difference solver for Seismic Modeling

by   Jie Meng, et al.

This article introduces a benchmark application for seismic modeling using finite difference method, which is namedMiniMod, a mini application for seismic modeling. The purpose is to provide a benchmark suite that is, on one hand easy to build and adapt to the state of the art in programming models and changing high performance hardware landscape. On the other hand, the intention is to have a proxy application to actual production geophysical exploration workloads for Oil Gas exploration, and other geosciences applications based on the wave equation. From top to bottom, we describe the design concepts, algorithms, code structure of the application, and present the benchmark results on different current computer architectures.


page 2

page 3

page 16

page 17

page 18


Devito: an embedded domain-specific language for finite differences and geophysical exploration

We introduce Devito, a new domain-specific language for implementing hig...

MD-Bench: A generic proxy-app toolbox for state-of-the-art molecular dynamics algorithms

Proxy-apps, or mini-apps, are simple self-contained benchmark codes with...

Modeling pre-Exascale AMR Parallel I/O Workloads via Proxy Applications

The present work investigates the modeling of pre-exascale input/output ...

Task inefficiency patterns for a wave equation solver

The orchestration of complex algorithms demands high levels of automatio...

A Massively Parallel Time-Domain Coupled Electrodynamics-Micromagnetics Solver

We present a new, high-performance coupled electrodynamics-micromagnetic...

1 Introduction

Minimod is a Finite Difference-based proxy application which implements seismic modeling (see Chapter 2) with different approximations of the wave equation (see Chapter 3). Minimod is self–contained and designed to be portable across multiple High Performance Computing (HPC) platforms. The application suite provides both non–optimized and optimized versions of computational kernels for targeted platforms (see Chapter 5). The target specific kernels are provided in order to conduct benchmarking and comparisons for emerging new hardware and programming technologies.

Minimod is designed to:

  • Be portable across multiple software stacks.

  • Be self–contained.

  • Provide non–optimized version of the computational kernels.

  • Provide optimized version of computational kernels for targeted platforms.

  • Evaluate node–level parallel performance.

  • Evaluate distributed–level parallel performance.

The first four items are covered in Section 5 and the remainder items are covered in Section 6.

New HPC technologies evaluation is a constant task that plays a key role when decisions are taken in terms of computing capacity acquisitions. Evaluations in the form of benchmarking provide information to compare competing technologies wrt relevant workloads. Minimig is also use for this purpose, and insight collected with it has been part the last major acquisitions by Total (see Figure 1).

Figure 1: Evolution of computing capacity wrt geophysical algorithms.

It can be observed in Figure 1, how more complex geophysical algorithms drive larger capacity installations in Total. The main driver of performance demand by the geophysical algorithms presented in figure are: the accuracy of the wave equation approximation and the addition of optimization or inverse problem schemes. The former is captured in Minimig, where the later is out of scope of this article. Performance trends obtained by conducting experiments with Minimig (or similar tools) influenced the decisions for the last ten years, this mainly motivated by the transition of the main workloads from Ray-based to wave-based methods.

2 Seismic Modeling

Seismic depth imaging is the main tool used to extract information describing the geological structures of the subsurface from recorded seismic data, it is effective till certain depth after which it becomes inaccurate. At its core it is an inverse problem which consists in finding the best model minimizing the distance between the observed data (recorded seismic data) and the predicted data (produced by computational means). The process to estimate the predicted data is known as

forward modeling. It is based on the resolution of the wave equation for artificial perturbations of the subsurface given initial and boundary conditions. This simulation is repeated as many times as perturbations were introduced during seismic data acquisition. In Figure 2 on of such experiments is represented, in this case for a marine setup. The perturbation (namely source) is introduce by an airgun dragged behind a ship, then the waves propagate through the medium. At each interface between layers of materials with different characteristics part of the energy is reflected. These reflections are recorded at sea level (at surface for a onshore setup) by a network of sensors (in the figure depicted in red) also pulled by the ship.

Figure 2: The mechanical medium represented by the subsurface model is perturbed and the wave propagation simulated. In this case, waves are depicted in discretized form as rays, for simplicity.

Solving the forward modeling efficiently is crucial for geophysical imaging as one needs to get solutions for many sources and many iterations as we progressively the subsurface model improves. Constant progresses in data acquisition and in rocks physics labs, more powerful computers and integrated team including physicists, applied mathematicians and computer scientists have greatly contributed to the development of advanced numerical algorithms integrating more and more complex physics. For the last 20 years, the field has been very active in the definition and introduction of different wave equation approximations and corresponding numerical methods for solving forward problem. But the real change came with the implementation of the full wave equation, thanks to the petascale era in HPC, giving access to a complete representation of the wavefield. It allowed geo-scientist to re-design imaging algorithm both in time dynamic and time harmonic domain. The most popular numerical scheme used, nowadays by the industry, is based on finite difference methods (FDM) on regular grids [9], [4]. We refer to [17] for examples of FDM in the geophysics frameworks and to [8] for 3D applications.

3 Finite Differences

Various numerical methods have been explored for the modeling of seismic wave propagation including the finite difference, finite element, finite volume and hybrid methods. Among those methods, the finite difference method is the most popular one for its simplicity and easy and straightforward implementation.

The first step of implementing the governing equations is to called discretizations, basically consist on write the equations on forms that allow the direct implementation of differential operators. The discretizations of the governing equations are impose on three different kind of grids, depending on the symmetry of the problem. We use the standard collocated grid, and two versions of staggered grid, namely Yee [21], [18], [19] and Lebedev [11].

The first equation to be described is the second order acoustic wave equation with constant density, solving for the pressure wavefield ,


where is the velocity of the pressure wavefield, expanded to 3D domain is , likewise for the source .

The second equation is the first order acoustic wave equation with variable density ,


where is the pressure wavefield, and

is a vector wavefield for the particle velocities (time derivatives of displacement) along the different coordinate axis.

The third equation is the acoustic transversely isotropic first order system, see [3] for details.

Finally, we have the elastic equations with variable density ,


where is a vector wavefield for the stresses using Voigt notation and is a vector wavefield for the particle velocities. The derivative operator is


and is the transpose of

without transpose of the derivatives. This is a subtle difference since a derivative is anti-symmetric. We have two different symmetry classes, isotropic and transversely isotropic, which only differs in the sparsity pattern of the stiffness tensor


The above described discretizations are implemented with the following names as kernels:

  • Acoustic_iso_cd: Standard second order acoustic wave-propagation in isotropic media with constant density.

  • Acoustic_iso: first order acoustic wave-propagation in isotropic media on a staggered Yee-grid variable density.

  • Acoustic_tti: first order acoustic wave-propagation in transversely isotropic media on a staggered Lebedev-grid.

  • Elastic_iso: first order elastic wave-propagation in isotropic media on a staggered Yee-grid.

  • Elastic_tti: first order elastic wave-propagation in transversely isotropic media on a staggered Lebedev-grid.

  • Elastic_tti_approx: Non-standard first order elastic wave-propagation in transversely isotropic media on a staggered Yee-grid

All discretizations use CPML [10] at the boundary of the computational domain, with the option of using free surface boundary conditions at the surface. Full unroll of the discretization is given for acoustic_iso_cd, as example, this is the simplest kernel in Minimod for simulating acoustic wave-propagation in isotropic media with a constant density domain , i.e. equation (1). The equation is discretized in time using a second-order centered stencil, resulting in the semi-discretized equation:



The equation is discretized in space using a 25-point stencil in space, with nine points in each direction of three dimensions:

where and are discretization parameters that approximates second derivatives in the different spatial directions. The parameters can be derived from the Taylor expansion of the derivatives in the x, y and z-direction respectively, where the approximation would be of order 8. The derivatives can also use optimized stencils, that reduce the dispersion error at the expense of formal order.

4 Computing costs

Being the core algorithm of Finite Difference, stencil-based computation algorithms represent the kernels of many well–known scientific applications, such as geophysics and weather forecasting.

However, the peak performance of stencil-based algorithms are limited because of the imbalance between computing capacity of processors units and data transfer throughput of memory architectures. In Figure 3 the memory access problem is shown. The computing part of the problem is basically the low re-use of the memory accessed elements.

Figure 3: Memory layout for an simple stencil example, to access the required values multiple useless cache lines (bottom) need to be accessed with incurred penalties. Figure extracted from [6].

In order to deal with the above described limitation, a great amount of research have been devoted to optimize stencil computations to achieve higher performance. For example, de la Cruz and Araya–Polo proposed the semi–stencil algorithm [6] which improves memory access pattern and efficiently reuses accessed data by dividing the computation into several updates. Rivera et al. [16]

showed that tiling 3D stencil combined with array padding could significantly reduce miss rates and achieve performance improvements for key scientific kernels. Recently, Nguyen et al. 

[14] introduced higher dimension cache optimizations.

Advanced programming models have been explored to improve stencil performance and productivity. In 2012, Ghosh et al. [7] analyzed the performance and programmability of three high-level directive-based GPU programming models (PGI, CAPS, and OpenACC) on an NVIDIA GPU for kernels of the same type as described in previous sections and for Reverse Time Migration (RTM, [1]), widely used method in geophysics. In 2017, Qawasmeh et al. [15] implemented an MPI plus OpenACC approach for seismic modeling and RTM. Domain–specific languages (DSLs) for stencil algorithms have also been proposed. For example, Louboutin et al. introduced Devito [12], which a new domain-specific language for implementing differential equation solvers. Also, de la Cruz and Araya–Polo proposed an accurate performance model for a wide range of stencil sizes which captures the behavior of such 3D stencil computation pattern using platform parameters  [5].

5 Minimod Description

5.1 Source Code Structure

In this section, we introduce the basic structure of the source code in Minimod. As we described in Section 3, the simulation in Minimod consists of solving the wave equation, the temporal requires the spatial part of the equation to be solve at each timestep for some number of timesteps. The pseudo-code of the algorithm is shown in algorithm 1, for the second order isotropic constant density equation. We apply a Perfectly Matched Layer (PML) [2] boundary condition to the boundary regions. The resulting domain consists of an “inner” region where Equation 5 is applied, and the outer “boundary” region where a PML calculation is applied.

Data: : source
Result: : wavefield at timestep , for to
1 ;
2 for  to  do
3       for each point in wavefield  do
4             Solve Eq. 5 (left hand side) for wavefield ;
6       end for
7       (Eq. 5 right hand side);
9 end for
Algorithm 1 Minimod high-level description

As described in algorithm 1, the most computationally expensive component of minimod is the computation of the wavefield for each point. We list the code structure of the wavefield calculation in algorithm 2.

Data: : wavefields at previous two timsteps
Result: : wavefield at current timestep
1 for  to  do
2       if  and  then
3             for  to  do
4                   if  and  then
5                         // Bottom Damping (i, j, z1...z2)
6                         // Inner Computation (i, j, z3...z4)
7                         // Top Damping (i, j, z5...z6)
9                  else
10                         // Back and Front Damping (i, j, zmin...zmax)
12                   end if
14             end for
16      else
17             // Left and Right Damping (i, ymin...ymax, zmin...zmax)
19       end if
21 end for
Algorithm 2 Wavefield solution step

The general structure listed above is the backbone for all the propagators included in Minimod. To keep the code simple and flexible, each propagator is compiled separately. This can be selected by setting the propagator variable in the version file before compiling. Figure 4 presents a tree structure of Minimod code suite.

Figure 4: Code tree structure of Minimod package.

5.2 Targets

Each propagator has also its own implementation depending the hardware targeted. The target directory located in each propagator is the way of setting targets. In the source code structure, the following implementations of the target are provided:

  • seq_opt–none : implement kernels without any optimization (so as the short form of sequential none optimization mode). The purpose of the sequential implementation is to be used as a baseline to compare the optimized results. Also, this is useful to analyze not parallel-related characteristics.

  • omp_opt–none : implement kernels using OpenMP directives for multi-threading execution on CPU. The goal of this target implementation is to explore the advanced CPU features on multi-core or many-core systems.

  • acc_opt–gpu : implement kernels using OpenACC directives for offloading computing to GPU (so as the short form of accelerator optimization using GPU). The goal of this target implementation is to explore the accelerating on GPU using OpenACC programming model standard.

  • omp_offload_opt–gpu : implement kernels using OpenMP directives for offloading to GPU (so as the short form of OpenMP offloading optimization using GPU). The goal of implementing this target is to explore the accelerating on GPU using OpenMP programming standard.

In addition to change the propagator that is to be used in tests, one may also change the “version” file to use a different target by setting the “target” variable to the desired multi-threaded or accelerator implementations.

5.3 Compilation and usage

After the target propagators, compilers, and accelerating implementation settings are selected, the source code is ready for compilation, as follows:

     To compile the sequential mode of Minimod package:
     $> source
     $> make

     To compile the multi-threading mode with OpenMP directives:
     $> source
     $> make _USE_OMP=1

     To compile the offloading to GPU mode with OpenMP directives:
     $> source
     $> make _USE_OMP=tesla

     To compile the multi-threading mode with OpenACC directives:
     $> source
     $> make _USE_ACC=multicore

     To compile the offloading to GPU mode with OpenACC directives:
     $> source
     $> make _USE_ACC=tesla


The parameters of Minimod are shown in the following verbatim section. Those are the basic parameters for seismic modeling and they are set as command-line options. The main parameters include: grid sizes, grid spacing on each dimension, the number of time steps and the maximum frequency.

[]$ ./modeling_acoustic_iso_cd_seq_opt-none --help

 --ngrid                100,100,100   # Grid size
 --dgrid                20,20,20      # Dmesh: grid spacing
 --nsteps               1000          # Nb of time steps for modeling
 --fmax                 25            # Max Frequency
 --verbose              .false.       # Print informations

In terms of expected results, the following verbatim section presents an example to show how to run the application and the run-time results of single-thread Minimod acoustic-iso-cd kernel. As we can see, the results report all the parameters that are used in the modeling and at the end the kernel time and modeling time of running the application.

[]]$ ./modeling_acoustic_iso_cd_seq_opt-none --ngrid 240,240,240 --nsteps 300

 nthreads           =            1

 ngrid              =          240         240         240
 dgrid              =    20.0000000       20.0000000       20.0000000
 nsteps             =          300
 fmax               =    25.0000000
 vmin               =    1500.00000
 vmax               =    4500.00000
 cfl                =   0.800000012

 stencil            =            4           4           4
 source_loc         =          120         120         120
 ndamping           =           27          27          27
 ntaper             =            3           3           3

 nshots             =            1
 time_rec           =    0.00000000
 nreceivers         =        57600
 receiver_increment =            1           1
 source_increment   =            1           1           0

 time step         100 /         300
 time step         200 /         300
 time step         300 /         300
Time Kernel       30.47
Time Modeling     31.01

6 Benchmarks

In this section examples of Minimod experimental results are presented. The purpose is illustrate performance and scalability of the propagators with regard to current HPC platforms.

6.1 Experimental set-up

The different propagators of Minimod are evaluated on Fujitsu A64FX architecture, AMD EYPC system, Intel Skylake and IBM Power8 system, as well as Nvidia’s V100 GPUs. The specifications of hardware and software configurations of the experimental platforms are shown in Table 1.

Hardware Software
CPUs A64FX Armv8-A SVE architecture Fujitsu Compiler 1.1.13 (frt)
CPU cores 48 computing cores OpenMP (-Kopenmp)
Memory 32 GB HBM2 auto-parallelisation
L2 8 MB (–Kparallel)
L1 64 KB
Device Fabrication 7nm
TDP 160W
CPUs AMD EYPC 7702 GCC 8.2 (gfortran)
CPU cores 64 computing cores OpenMP
Memory 256 GB
L3 256 MB (per socket)
L2 32 MB
L1 2+2 MB
Device Fabrication 14nm
TDP 200W
CPUs 2x Intel Xeon Gold 5118 intel compiler 17.0.2 (ifort)
CPU cores 24 (12 per CPU)
Memory 768 GB
L3 16 MB (per socket)
L2 1024 KB
L1 32+32 KB
Device Fabrication 14nm
TDP 2 x 105W
CPUs 2 x IBM Power8 (ppc64le) PGI 19.7 (pgfortran)
CPU cores 20 computing cores (10 per CPU) OpenMP (-mp)
Memory 256 GB
L3 8 MB (per two cores)
L2 512 KB (per two cores)
L1 64+32 KB
Device Fabrication 22nm
TDP 2 x 190W
GPU 1 x Nvidia V100 PGI 19.7 (pgfortran)
cores 2560 Nvidia CUDA cores OpenACC (-ta=tesla)
Memory 16 GB HBM2
L2 6144 KB
Device fabrication 12nm FFN
Power consumption 290W
Table 1: Hardware and software configuration of the experimental platforms. From top to bottom, the first section is Fujitsu A64FX Arm8-A architecture. The second section is AMD EYPC Rome architecture. The third section is Intel Skylake architecture. The fourth section is IBM Power8 architecture. And the bottom section is the specification of Nvidia’s V100 GPU.

6.2 Performance characterization

In our experiments, we use roofline model proposed by Williams et al. [20] to understand the hardware limitations as well as evaluating kernel optimization. In the roofline model, the performance of various numerical methods are upper bounded by the peak floating point operations (flop) rate and the memory bandwidth while running on single-core, multi-core or accelerator processor architecture.

Figure 5: Roofline model analyses on AMD EYPC System.

Figure 5 shows the peak performance in term of GFlops per seconds and memory bandwidth of cache and main dynamic random-access memory (DRAM) of the AMD EYPC system listed in Table 1 where we conducted experiments on. The arithmetic intensity in the roofline plot is calculated by the number of floating point operations that are performed in the stencil calculation divided by the number of words that we need to read from and write to memory [6].

6.3 Single compute node-level parallelism

We use Minimod to experiment the single compute node-level parallelism on different computer systems. As shown in Figure 7. The system-level performance tests are conducted on IBM power, Fujitsu A64FX systems, and compared with using NVIDIA’s V100 GPUs as accelerators. The different propagators in Minimod (acoustic_iso_cd, acoustic_iso, acoustic_tti, elastic_iso, and elastic_tti) are tested, and results are shown in Figure 6.

As we observe in Figure 6, the Fujitsu A64FX processor (as shown in the orange bars) provides better performance for all the propagators in comparison to both IBM power system (as shown in the dark blue bars), Intel skylake system (as shown in the light blue bars), as well as AMD EYPC Rome systems (as shown in the yellow bars). In fact, the performance of Fujitsu A64FX is closer to the performance of the system with Nvidia’s V100 GPU accelerator (as shown in the green bars).

The single node-level scalability tests are conducted on IBM power, AMD EYPC, and Fujitsu A64FX systems. The problem size for the strong scalability tests are set to be 240 x 240 x 240. As presented in Figure 7, the results are compared between the three modern computer systems and also compares against the ideal case. Across the three systems, Fujitsu A64FX system again wins IBM power and AMD EYPC Rome systems in the single-node scalability tests.

Figure 6: System-level performance comparison results of Minimod with different propagators running on IBM power system (dark blue bars), on Intel Skylake system (light blue bars), on AMD EPYC system (yellow bars), on Fujitsu A64FX system (orange bars), and on NVIDIA’s V100 GPU (green bars).
Figure 7: Single compute node-level scalability comparison results of Minimod running on IBM power system (blue curve), on AMD EYPC Rome system (yellow curve), and on Fujitsu A64FX system (red curve), and both are compared against the ideal scale-up (green curve).

6.4 Distributed Memory Approach

Figure 8: MPI weak scalability of Minimod running on IBM power system for both the ideal problem-sizes (baseline with linear increment on X dimension) and practical problem-sizes (the problem-size increments are balanced on each dimension) running on 1 to 8 MPI ranks respectively.

The distributed version of Minimod is implemented using Message Passing Interface (MPI). The domain decomposition is defined using regular Cartesian topology, and the domain decomposition parameters need to match the total number of MPI ranks: for example, for the three-dimensional domain decomposition in equals , the rank number needs to be . As for the time being, only acoustic_iso_cd propagator is available within the distributed version of Minimod.

Figure 9: MPI strong scalability comparison results of Minimod running on IBM power system (blue curve) and on Fujitsu A64FX system (red curve), and both are compared against the ideal scale-up (green curve).

The implementation of MPI communication between sub-domains uses non-blocking send and receives. The communication operates in “expected message” mode that has no overlap of communication with computation. Each subdomain performs the following steps: first, to pack the messages to be transmitted in buffers; second, to perform communication by posting all sends and receives, and finally wait till the communication is complete and unpacks the arrived data.

We evaluated both weak scalability and strong scalability of the distributed version of Minimod for acoustic_iso_cd propagator. The results of weak scalability running Minimod on IBM power system is shown in Figure 8, which presents the evaluation results using both the ideal problem sizes and the practical problem sizes running on 1 to 8 MPI ranks, respectively.

For the weak scalability test running ideal problem sizes, we used a baseline of with linear increment on X dimension (for example, for the test running on 6 MPI ranks we used a problem-size of ). And for practical problem-sizes, we used the same baseline while the problem-size increments are balanced on each dimension (for example, for the test running on 6 MPI ranks we used a problem-size of ). The green curves in Figure 8 present the efficiencies in comparison to the baseline result.

We can see from Figure 8 that the weak scalability holds well for running from 1 rank scale to up to 8 ranks for the ideal problem sizes. And for the practical problem sizes which is more close to the real seismic acquisition situation, the weak scalability efficiencies for 2 ranks and 4 ranks are higher than 100% because of the slightly smaller problem sizes compared to the baseline case ( for 2 ranks and for 4 ranks), while it starts diminishing when it reaches 8 ranks mainly because of the increase of problem sizes.

The results of strong scalability are shown in Figure 9. The strong scalability tests are conducted on both IBM power and Fujitsu A64FX systems. The problem size for the strong scalability tests is set to , on the rank numbers of 8, 16, 32, 64, 128, and 256 respectively.

As presented in Figure 9, the results of the kernel execution on the IBM power and the Fujitsu A64FX systems are compared with the ideal scaling trend. The strong scalability results on both systems are very close when the MPI rank number is smaller than 64, while the kernel shows slightly better scalability results on the IBM system than on the Fujitsu system when running with 128 and 256 MPI ranks. In comparison to the ideal case, scalability on the IBM power system reached 63% while on the Fujitsu system reached 60% of the ideal scalability.

6.5 Profiling

Figure 10: Profiling results of Minimod using HPCToolkit.

Profiling and analyses was conducted on Minimod, for example using the HPCToolkit [13] from Rice University. Figure 10 shows a screenshot of the trace view in HPCToolkit profiling Minimod acoustic iso kernel implemented in multi-threading mode using OpenMP. The biggest panel on the top left presents sequences of samples of each trace line rendered. The different colors represent the time spends on different subroutines which are listed on the right panel. The bottom panel in Figure 10 is the depth view of the target Minimod application which presents the call path at every time step.

As an illustrative example for profiling Minimod, Figure 11 shows the profiling results from HPCToolkit trace view for the sequential implementation of the simplest kernel acoustic_iso_cd (acoustic wave-propagation in isotropic media with constant density) in Minimod without any optimization. To better understand the behavior of the kernel, what is shown in the picture is a case with one thread with the stencil computation on a grid size of . As shown in Figure 11, the majority of the time is spent on running the “target_pml_3d” which is the implementation of perfectly-matched region, as shown in the dark color areas in the top left panel. And the green vertical line is for the “target_inner_3d”, where the thread performs computation for the inner region of stencil.

Figure 11: Example of profiling sequential mode of Minimod acoustic-iso-cd kernel using HPCToolkit.

An advantage of HPCToolkit it that can profile the results of Minimod GPU mode for each time sampling traces. Figure 12 shows the the profiling results of the entire execution of Minimod acoustic-iso-cd kernel in OpenACC offloading to GPU mode. Different than the CPU profiling trace views, the GPU profiling trace view on HPCToolkit top-left panel window is composed of two rows. The top row shows the CPU (host) thread traces and the bottom row is for the GPU (device) traces.

Figure 12: GPU profiling results of Minimod acoustic-iso-cd kernel using HPCToolkit.

A zoomed-in view of this GPU profiling results is presented in Figure 13. We selected time step shows the GPU that is running the “target_pml_3d” kernel where the blank white spaces in the GPU row shows the idleness. The same as in the profiling results for CPU, different colors here represent the time spends on different GPU calls.

Figure 13: A detailed view of GPU profiling Minimod acoustic-iso-cd kernel using HPCToolkit.

7 Conclusion

This article introduces a proxy application suite for seismic modeling using finite difference method named Minimod. The design concepts, underline algorithms, and code structures of Minimod are described. The benchmark results of Minimod are shown on different computer architectures for both single compute node-level parallelism and distributed memory approaches.

8 Acknowledgements

We would like to thank Total and subsidiaries for allowing us to share this material. We would also like to express our appreciation to Diego Klahr for his continuous support, and our colleague Elies Bergounioux in France for discussions on the adaptability of proxy applications in production. We also thank Ryuichi Sai from Rice University for his contribution on the profiling results using HPCToolkits. We would like acknowledge Pierre Lagier from Fujitsu for his help with the experiments conducted with latest Fujitsu technology. Last but not least, many thanks to our former colleague Maxime Hugues for his initial implementation of the presented software.


  • [1] M. Araya-Polo, J. Cabezas, M. Hanzich, M. Pericas, F. Rubio, I. Gelado, M. Shafiq, E. Morancho, N. Navarro, E. Ayguade, J. M. Cela, and M. Valero (2011) Assessing accelerator-based hpc reverse time migration. IEEE Transactions on Parallel and Distributed Systems 22 (1), pp. 147–162. Cited by: §4.
  • [2] J. Berenger (1994) A perfectly matched layer for the absorption of electromagnetic waves. Journal of Computational Physics 114 (2), pp. 185 – 200. External Links: ISSN 0021-9991, Document Cited by: §5.1.
  • [3] K. Bube, T. Nemeth, P. Stefani, R. Ergas, W. Lui, T. Nihei, and L. Zhang (2012) On the instability in second-order systems for acoustic vti and tti media. Geophysics 77, pp. 171–186. Cited by: §3.
  • [4] M. Dablain (1986) The application of high-order differencing to the scalar wave equation. Geophysics 51, pp. 54–66. Cited by: §2.
  • [5] R. de la Cruz and M. Araya-Polo (2011) Towards a multi-level cache performance model for 3d stencil computation. Procedia Computer Science 4, pp. 2146 – 2155. Note: Proceedings of the International Conference on Computational Science, ICCS 2011 External Links: ISSN 1877-0509, Document Cited by: §4.
  • [6] R. de la Cruz and M. Araya-Polo (2014-04) Algorithm 942: semi-stencil. ACM Trans. Math. Softw. 40 (3). External Links: ISSN 0098-3500, Document Cited by: Figure 3, §4, §6.2.
  • [7] S. Ghosh, T. Liao, H. Calandra, and B. M. Chapman (2012-11) Experiences with openmp, pgi, hmpp and openacc directives on iso/tti kernels. In 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, Vol. , pp. 691–700. External Links: Document, ISSN Cited by: §4.
  • [8] R. W. Graves (1996) Simulating seismic wave propagation in 3d elastic media using staggered-grid finite differences.. Geophysics 86, pp. 1091–1106. Cited by: §2.
  • [9] K. R. Kelly, R. W. Ward, S. Treitel, and R. M. Alford (1976) Synthetic seismograms: a finite-difference approach. Geophysics 41, pp. 2–27. Cited by: §2.
  • [10] D. Komatitsch and R. Martin (2007) An unsplit convolutional perfectly matched layer improved at grazing incidence for the seismic wave equation. Geophysics 72, pp. 155–167. Cited by: §3.
  • [11] V.I. Lebedev (1964) Difference analogues of orthogonal decompositions, basic differential operators and some boundary problems of mathematical physics. ii. USSR Computational Mathematics and Mathematical Physics 4, pp. 36–50. Cited by: §3.
  • [12] M. Louboutin, M. Lange, F. Luporini, N. Kukreja, P. A. Witte, F. J. Herrmann, P. Velesko, and G. J. Gorman (2019) Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration. Geoscientific Model Development 12 (3), pp. 1165–1187. External Links: Document Cited by: §4.
  • [13] J. Mellor-Crummey, R. Fowler, and D. Whalley (2001) Tools for application-oriented performance tuning. In Proceedings of the 15th International Conference on Supercomputing, ICS ’01, New York, NY, USA, pp. 154–165. External Links: ISBN 158113410X, Document Cited by: §6.5.
  • [14] A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey (2010) 3.5-d blocking optimization for stencil computations on modern cpus and gpus. In SC ’10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, Vol. , pp. 1–13. Cited by: §4.
  • [15] A. Qawasmeh, M. R. Hugues, H. Calandra, and B. M. Chapman (2017) Performance portability in reverse time migration and seismic modelling via openacc. The International Journal of High Performance Computing Applications 31 (5), pp. 422–440. External Links: Document Cited by: §4.
  • [16] G. Rivera and Chau-Wen Tseng (2000) Tiling optimizations for 3d scientific computations. In SC ’00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, Vol. , pp. 32–32. Cited by: §4.
  • [17] J. Virieux and S. Operto (2009) An overview of full-waveform inversion in exploration geophysics. Geophysics 74. Cited by: §2.
  • [18] J. Virieux (1984) SH-wave propagation in heterogeneous media: velocity-stress finite-difference method. Geophysics 49, pp. 1933–1957. Cited by: §3.
  • [19] J. Virieux (1986) P-sv wave propagation in ‘heierogeneous media: velocity-stress finite-difference method. Geophysics 51, pp. 889–901. Cited by: §3.
  • [20] S. Williams, A. Waterman, and D. Patterson (2009-04) Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, pp. 65–76. External Links: Document Cited by: §6.2.
  • [21] K. Yee (1966) ”Numerical solution of initial boundary value problems involving maxwell’s equations in isotropic media.”. IEEE Transactions on Antennas and Propagation 14, pp. 302–307. Cited by: §3.