1 Introduction
Minimod is a Finite Differencebased proxy application which implements seismic modeling (see Chapter 2) with different approximations of the wave equation (see Chapter 3). Minimod is self–contained and designed to be portable across multiple High Performance Computing (HPC) platforms. The application suite provides both non–optimized and optimized versions of computational kernels for targeted platforms (see Chapter 5). The target specific kernels are provided in order to conduct benchmarking and comparisons for emerging new hardware and programming technologies.
Minimod is designed to:

Be portable across multiple software stacks.

Be self–contained.

Provide non–optimized version of the computational kernels.

Provide optimized version of computational kernels for targeted platforms.

Evaluate node–level parallel performance.

Evaluate distributed–level parallel performance.
The first four items are covered in Section 5 and the remainder items are covered in Section 6.
New HPC technologies evaluation is a constant task that plays a key role when decisions are taken in terms of computing capacity acquisitions. Evaluations in the form of benchmarking provide information to compare competing technologies wrt relevant workloads. Minimig is also use for this purpose, and insight collected with it has been part the last major acquisitions by Total (see Figure 1).
It can be observed in Figure 1, how more complex geophysical algorithms drive larger capacity installations in Total. The main driver of performance demand by the geophysical algorithms presented in figure are: the accuracy of the wave equation approximation and the addition of optimization or inverse problem schemes. The former is captured in Minimig, where the later is out of scope of this article. Performance trends obtained by conducting experiments with Minimig (or similar tools) influenced the decisions for the last ten years, this mainly motivated by the transition of the main workloads from Raybased to wavebased methods.
2 Seismic Modeling
Seismic depth imaging is the main tool used to extract information describing the geological structures of the subsurface from recorded seismic data, it is effective till certain depth after which it becomes inaccurate. At its core it is an inverse problem which consists in finding the best model minimizing the distance between the observed data (recorded seismic data) and the predicted data (produced by computational means). The process to estimate the predicted data is known as
forward modeling. It is based on the resolution of the wave equation for artificial perturbations of the subsurface given initial and boundary conditions. This simulation is repeated as many times as perturbations were introduced during seismic data acquisition. In Figure 2 on of such experiments is represented, in this case for a marine setup. The perturbation (namely source) is introduce by an airgun dragged behind a ship, then the waves propagate through the medium. At each interface between layers of materials with different characteristics part of the energy is reflected. These reflections are recorded at sea level (at surface for a onshore setup) by a network of sensors (in the figure depicted in red) also pulled by the ship.Solving the forward modeling efficiently is crucial for geophysical imaging as one needs to get solutions for many sources and many iterations as we progressively the subsurface model improves. Constant progresses in data acquisition and in rocks physics labs, more powerful computers and integrated team including physicists, applied mathematicians and computer scientists have greatly contributed to the development of advanced numerical algorithms integrating more and more complex physics. For the last 20 years, the field has been very active in the definition and introduction of different wave equation approximations and corresponding numerical methods for solving forward problem. But the real change came with the implementation of the full wave equation, thanks to the petascale era in HPC, giving access to a complete representation of the wavefield. It allowed geoscientist to redesign imaging algorithm both in time dynamic and time harmonic domain. The most popular numerical scheme used, nowadays by the industry, is based on finite difference methods (FDM) on regular grids [9], [4]. We refer to [17] for examples of FDM in the geophysics frameworks and to [8] for 3D applications.
3 Finite Differences
Various numerical methods have been explored for the modeling of seismic wave propagation including the finite difference, finite element, finite volume and hybrid methods. Among those methods, the finite difference method is the most popular one for its simplicity and easy and straightforward implementation.
The first step of implementing the governing equations is to called discretizations, basically consist on write the equations on forms that allow the direct implementation of differential operators.
The discretizations of the governing equations are impose on three different kind of grids, depending on the symmetry of the problem. We use the standard collocated grid, and two versions of staggered grid, namely Yee [21], [18], [19] and Lebedev [11].
The first equation to be described is the second order acoustic wave equation with constant density, solving for the pressure wavefield ,
(1) 
where is the velocity of the pressure wavefield, expanded to 3D domain is , likewise for the source .
The second equation is the first order acoustic wave equation with variable density ,
(2) 
where is the pressure wavefield, and
is a vector wavefield for the particle velocities (time derivatives of displacement) along the different coordinate axis.
The third equation is the acoustic transversely isotropic first order system, see [3] for details.
Finally, we have the elastic equations with variable density ,
(3) 
where is a vector wavefield for the stresses using Voigt notation and is a vector wavefield for the particle velocities. The derivative operator is
(4) 
and is the transpose of
without transpose of the derivatives. This is a subtle difference since a derivative is antisymmetric. We have two different symmetry classes, isotropic and transversely isotropic, which only differs in the sparsity pattern of the stiffness tensor
.The above described discretizations are implemented with the following names as kernels:

Acoustic_iso_cd: Standard second order acoustic wavepropagation in isotropic media with constant density.

Acoustic_iso: first order acoustic wavepropagation in isotropic media on a staggered Yeegrid variable density.

Acoustic_tti: first order acoustic wavepropagation in transversely isotropic media on a staggered Lebedevgrid.

Elastic_iso: first order elastic wavepropagation in isotropic media on a staggered Yeegrid.

Elastic_tti: first order elastic wavepropagation in transversely isotropic media on a staggered Lebedevgrid.

Elastic_tti_approx: Nonstandard first order elastic wavepropagation in transversely isotropic media on a staggered Yeegrid
All discretizations use CPML [10] at the boundary of the computational domain, with the option of using free surface boundary conditions at the surface. Full unroll of the discretization is given for acoustic_iso_cd, as example, this is the simplest kernel in Minimod for simulating acoustic wavepropagation in isotropic media with a constant density domain , i.e. equation (1). The equation is discretized in time using a secondorder centered stencil, resulting in the semidiscretized equation:
(5) 
where
The equation is discretized in space using a 25point stencil in space, with nine points in each direction of three dimensions:
where and are discretization parameters that approximates second derivatives in the different spatial directions. The parameters can be derived from the Taylor expansion of the derivatives in the x, y and zdirection respectively, where the approximation would be of order 8. The derivatives can also use optimized stencils, that reduce the dispersion error at the expense of formal order.
4 Computing costs
Being the core algorithm of Finite Difference, stencilbased computation algorithms represent the kernels of many well–known scientific applications, such as geophysics and weather forecasting.
However, the peak performance of stencilbased algorithms are limited because of the imbalance between computing capacity of processors units and data transfer throughput of memory architectures. In Figure 3 the memory access problem is shown. The computing part of the problem is basically the low reuse of the memory accessed elements.
In order to deal with the above described limitation, a great amount of research have been devoted to optimize stencil computations to achieve higher performance. For example, de la Cruz and Araya–Polo proposed the semi–stencil algorithm [6] which improves memory access pattern and efficiently reuses accessed data by dividing the computation into several updates. Rivera et al. [16]
showed that tiling 3D stencil combined with array padding could significantly reduce miss rates and achieve performance improvements for key scientific kernels. Recently, Nguyen et al.
[14] introduced higher dimension cache optimizations.Advanced programming models have been explored to improve stencil performance and productivity. In 2012, Ghosh et al. [7] analyzed the performance and programmability of three highlevel directivebased GPU programming models (PGI, CAPS, and OpenACC) on an NVIDIA GPU for kernels of the same type as described in previous sections and for Reverse Time Migration (RTM, [1]), widely used method in geophysics. In 2017, Qawasmeh et al. [15] implemented an MPI plus OpenACC approach for seismic modeling and RTM. Domain–specific languages (DSLs) for stencil algorithms have also been proposed. For example, Louboutin et al. introduced Devito [12], which a new domainspecific language for implementing differential equation solvers. Also, de la Cruz and Araya–Polo proposed an accurate performance model for a wide range of stencil sizes which captures the behavior of such 3D stencil computation pattern using platform parameters [5].
5 Minimod Description
5.1 Source Code Structure
In this section, we introduce the basic structure of the source code in Minimod. As we described in Section 3, the simulation in Minimod consists of solving the wave equation, the temporal requires the spatial part of the equation to be solve at each timestep for some number of timesteps. The pseudocode of the algorithm is shown in algorithm 1, for the second order isotropic constant density equation. We apply a Perfectly Matched Layer (PML) [2] boundary condition to the boundary regions. The resulting domain consists of an “inner” region where Equation 5 is applied, and the outer “boundary” region where a PML calculation is applied.
As described in algorithm 1, the most computationally expensive component of minimod is the computation of the wavefield for each point. We list the code structure of the wavefield calculation in algorithm 2.
The general structure listed above is the backbone for all the propagators included in Minimod. To keep the code simple and flexible, each propagator is compiled separately. This can be selected by setting the propagator variable in the version file before compiling. Figure 4 presents a tree structure of Minimod code suite.
5.2 Targets
Each propagator has also its own implementation depending the hardware targeted. The target directory located in each propagator is the way of setting targets. In the source code structure, the following implementations of the target are provided:

seq_opt–none : implement kernels without any optimization (so as the short form of sequential none optimization mode). The purpose of the sequential implementation is to be used as a baseline to compare the optimized results. Also, this is useful to analyze not parallelrelated characteristics.

omp_opt–none : implement kernels using OpenMP directives for multithreading execution on CPU. The goal of this target implementation is to explore the advanced CPU features on multicore or manycore systems.

acc_opt–gpu : implement kernels using OpenACC directives for offloading computing to GPU (so as the short form of accelerator optimization using GPU). The goal of this target implementation is to explore the accelerating on GPU using OpenACC programming model standard.

omp_offload_opt–gpu : implement kernels using OpenMP directives for offloading to GPU (so as the short form of OpenMP offloading optimization using GPU). The goal of implementing this target is to explore the accelerating on GPU using OpenMP programming standard.
In addition to change the propagator that is to be used in tests, one may also change the “version” file to use a different target by setting the “target” variable to the desired multithreaded or accelerator implementations.
5.3 Compilation and usage
After the target propagators, compilers, and accelerating implementation settings are selected, the source code is ready for compilation, as follows:
To compile the sequential mode of Minimod package: $> source set_env.sh $> make To compile the multithreading mode with OpenMP directives: $> source set_env.sh $> make _USE_OMP=1 To compile the offloading to GPU mode with OpenMP directives: $> source set_env.sh $> make _USE_OMP=tesla To compile the multithreading mode with OpenACC directives: $> source set_env.sh $> make _USE_ACC=multicore To compile the offloading to GPU mode with OpenACC directives: $> source set_env.sh $> make _USE_ACC=tesla
The parameters of Minimod are shown in the following verbatim section. Those are the basic parameters for seismic modeling and they are set as commandline options. The main parameters include: grid sizes, grid spacing on each dimension, the number of time steps and the maximum frequency.
[]$ ./modeling_acoustic_iso_cd_seq_optnone help ngrid 100,100,100 # Grid size dgrid 20,20,20 # Dmesh: grid spacing nsteps 1000 # Nb of time steps for modeling fmax 25 # Max Frequency verbose .false. # Print informations
In terms of expected results, the following verbatim section presents an example to show how to run the application and the runtime results of singlethread Minimod acousticisocd kernel. As we can see, the results report all the parameters that are used in the modeling and at the end the kernel time and modeling time of running the application.
[]]$ ./modeling_acoustic_iso_cd_seq_optnone ngrid 240,240,240 nsteps 300 nthreads = 1 ngrid = 240 240 240 dgrid = 20.0000000 20.0000000 20.0000000 nsteps = 300 fmax = 25.0000000 vmin = 1500.00000 vmax = 4500.00000 cfl = 0.800000012 stencil = 4 4 4 source_loc = 120 120 120 ndamping = 27 27 27 ntaper = 3 3 3 nshots = 1 time_rec = 0.00000000 nreceivers = 57600 receiver_increment = 1 1 source_increment = 1 1 0 time step 100 / 300 time step 200 / 300 time step 300 / 300 Time Kernel 30.47 Time Modeling 31.01
6 Benchmarks
In this section examples of Minimod experimental results are presented. The purpose is illustrate performance and scalability of the propagators with regard to current HPC platforms.
6.1 Experimental setup
The different propagators of Minimod are evaluated on Fujitsu A64FX architecture, AMD EYPC system, Intel Skylake and IBM Power8 system, as well as Nvidia’s V100 GPUs. The specifications of hardware and software configurations of the experimental platforms are shown in Table 1.
Hardware  Software  
CPUs  A64FX Armv8A SVE architecture  Fujitsu Compiler 1.1.13 (frt) 
CPU cores  48 computing cores  OpenMP (Kopenmp) 
Memory  32 GB HBM2  autoparallelisation 
L2  8 MB  (–Kparallel) 
L1  64 KB  
Device Fabrication  7nm  
TDP  160W  
CPUs  AMD EYPC 7702  GCC 8.2 (gfortran) 
CPU cores  64 computing cores  OpenMP 
Memory  256 GB  
L3  256 MB (per socket)  
L2  32 MB  
L1  2+2 MB  
Device Fabrication  14nm  
TDP  200W  
CPUs  2x Intel Xeon Gold 5118  intel compiler 17.0.2 (ifort) 
CPU cores  24 (12 per CPU)  
Memory  768 GB  
L3  16 MB (per socket)  
L2  1024 KB  
L1  32+32 KB  
Device Fabrication  14nm  
TDP  2 x 105W  
CPUs  2 x IBM Power8 (ppc64le)  PGI 19.7 (pgfortran) 
CPU cores  20 computing cores (10 per CPU)  OpenMP (mp) 
Memory  256 GB  
L3  8 MB (per two cores)  
L2  512 KB (per two cores)  
L1  64+32 KB  
Device Fabrication  22nm  
TDP  2 x 190W  
GPU  1 x Nvidia V100  PGI 19.7 (pgfortran) 
cores  2560 Nvidia CUDA cores  OpenACC (ta=tesla) 
Memory  16 GB HBM2  
L2  6144 KB  
Device fabrication  12nm FFN  
Power consumption  290W 
6.2 Performance characterization
In our experiments, we use roofline model proposed by Williams et al. [20] to understand the hardware limitations as well as evaluating kernel optimization. In the roofline model, the performance of various numerical methods are upper bounded by the peak floating point operations (flop) rate and the memory bandwidth while running on singlecore, multicore or accelerator processor architecture.
Figure 5 shows the peak performance in term of GFlops per seconds and memory bandwidth of cache and main dynamic randomaccess memory (DRAM) of the AMD EYPC system listed in Table 1 where we conducted experiments on. The arithmetic intensity in the roofline plot is calculated by the number of floating point operations that are performed in the stencil calculation divided by the number of words that we need to read from and write to memory [6].
6.3 Single compute nodelevel parallelism
We use Minimod to experiment the single compute nodelevel parallelism on different computer systems. As shown in Figure 7. The systemlevel performance tests are conducted on IBM power, Fujitsu A64FX systems, and compared with using NVIDIA’s V100 GPUs as accelerators. The different propagators in Minimod (acoustic_iso_cd, acoustic_iso, acoustic_tti, elastic_iso, and elastic_tti) are tested, and results are shown in Figure 6.
As we observe in Figure 6, the Fujitsu A64FX processor (as shown in the orange bars) provides better performance for all the propagators in comparison to both IBM power system (as shown in the dark blue bars), Intel skylake system (as shown in the light blue bars), as well as AMD EYPC Rome systems (as shown in the yellow bars). In fact, the performance of Fujitsu A64FX is closer to the performance of the system with Nvidia’s V100 GPU accelerator (as shown in the green bars).
The single nodelevel scalability tests are conducted on IBM power, AMD EYPC, and Fujitsu A64FX systems. The problem size for the strong scalability tests are set to be 240 x 240 x 240. As presented in Figure 7, the results are compared between the three modern computer systems and also compares against the ideal case. Across the three systems, Fujitsu A64FX system again wins IBM power and AMD EYPC Rome systems in the singlenode scalability tests.
6.4 Distributed Memory Approach
The distributed version of Minimod is implemented using Message Passing Interface (MPI). The domain decomposition is defined using regular Cartesian topology, and the domain decomposition parameters need to match the total number of MPI ranks: for example, for the threedimensional domain decomposition in equals , the rank number needs to be . As for the time being, only acoustic_iso_cd propagator is available within the distributed version of Minimod.
The implementation of MPI communication between subdomains uses nonblocking send and receives. The communication operates in “expected message” mode that has no overlap of communication with computation. Each subdomain performs the following steps: first, to pack the messages to be transmitted in buffers; second, to perform communication by posting all sends and receives, and finally wait till the communication is complete and unpacks the arrived data.
We evaluated both weak scalability and strong scalability of the distributed version of Minimod for acoustic_iso_cd propagator. The results of weak scalability running Minimod on IBM power system is shown in Figure 8, which presents the evaluation results using both the ideal problem sizes and the practical problem sizes running on 1 to 8 MPI ranks, respectively.
For the weak scalability test running ideal problem sizes, we used a baseline of with linear increment on X dimension (for example, for the test running on 6 MPI ranks we used a problemsize of ). And for practical problemsizes, we used the same baseline while the problemsize increments are balanced on each dimension (for example, for the test running on 6 MPI ranks we used a problemsize of ). The green curves in Figure 8 present the efficiencies in comparison to the baseline result.
We can see from Figure 8 that the weak scalability holds well for running from 1 rank scale to up to 8 ranks for the ideal problem sizes. And for the practical problem sizes which is more close to the real seismic acquisition situation, the weak scalability efficiencies for 2 ranks and 4 ranks are higher than 100% because of the slightly smaller problem sizes compared to the baseline case ( for 2 ranks and for 4 ranks), while it starts diminishing when it reaches 8 ranks mainly because of the increase of problem sizes.
The results of strong scalability are shown in Figure 9. The strong scalability tests are conducted on both IBM power and Fujitsu A64FX systems. The problem size for the strong scalability tests is set to , on the rank numbers of 8, 16, 32, 64, 128, and 256 respectively.
As presented in Figure 9, the results of the kernel execution on the IBM power and the Fujitsu A64FX systems are compared with the ideal scaling trend. The strong scalability results on both systems are very close when the MPI rank number is smaller than 64, while the kernel shows slightly better scalability results on the IBM system than on the Fujitsu system when running with 128 and 256 MPI ranks. In comparison to the ideal case, scalability on the IBM power system reached 63% while on the Fujitsu system reached 60% of the ideal scalability.
6.5 Profiling
Profiling and analyses was conducted on Minimod, for example using the HPCToolkit [13] from Rice University. Figure 10 shows a screenshot of the trace view in HPCToolkit profiling Minimod acoustic iso kernel implemented in multithreading mode using OpenMP. The biggest panel on the top left presents sequences of samples of each trace line rendered. The different colors represent the time spends on different subroutines which are listed on the right panel. The bottom panel in Figure 10 is the depth view of the target Minimod application which presents the call path at every time step.
As an illustrative example for profiling Minimod, Figure 11 shows the profiling results from HPCToolkit trace view for the sequential implementation of the simplest kernel acoustic_iso_cd (acoustic wavepropagation in isotropic media with constant density) in Minimod without any optimization. To better understand the behavior of the kernel, what is shown in the picture is a case with one thread with the stencil computation on a grid size of . As shown in Figure 11, the majority of the time is spent on running the “target_pml_3d” which is the implementation of perfectlymatched region, as shown in the dark color areas in the top left panel. And the green vertical line is for the “target_inner_3d”, where the thread performs computation for the inner region of stencil.
An advantage of HPCToolkit it that can profile the results of Minimod GPU mode for each time sampling traces. Figure 12 shows the the profiling results of the entire execution of Minimod acousticisocd kernel in OpenACC offloading to GPU mode. Different than the CPU profiling trace views, the GPU profiling trace view on HPCToolkit topleft panel window is composed of two rows. The top row shows the CPU (host) thread traces and the bottom row is for the GPU (device) traces.
A zoomedin view of this GPU profiling results is presented in Figure 13. We selected time step shows the GPU that is running the “target_pml_3d” kernel where the blank white spaces in the GPU row shows the idleness. The same as in the profiling results for CPU, different colors here represent the time spends on different GPU calls.
7 Conclusion
This article introduces a proxy application suite for seismic modeling using finite difference method named Minimod. The design concepts, underline algorithms, and code structures of Minimod are described. The benchmark results of Minimod are shown on different computer architectures for both single compute nodelevel parallelism and distributed memory approaches.
8 Acknowledgements
We would like to thank Total and subsidiaries for allowing us to share this material. We would also like to express our appreciation to Diego Klahr for his continuous support, and our colleague Elies Bergounioux in France for discussions on the adaptability of proxy applications in production. We also thank Ryuichi Sai from Rice University for his contribution on the profiling results using HPCToolkits. We would like acknowledge Pierre Lagier from Fujitsu for his help with the experiments conducted with latest Fujitsu technology. Last but not least, many thanks to our former colleague Maxime Hugues for his initial implementation of the presented software.
References
 [1] (2011) Assessing acceleratorbased hpc reverse time migration. IEEE Transactions on Parallel and Distributed Systems 22 (1), pp. 147–162. Cited by: §4.
 [2] (1994) A perfectly matched layer for the absorption of electromagnetic waves. Journal of Computational Physics 114 (2), pp. 185 – 200. External Links: ISSN 00219991, Document Cited by: §5.1.
 [3] (2012) On the instability in secondorder systems for acoustic vti and tti media. Geophysics 77, pp. 171–186. Cited by: §3.
 [4] (1986) The application of highorder differencing to the scalar wave equation. Geophysics 51, pp. 54–66. Cited by: §2.
 [5] (2011) Towards a multilevel cache performance model for 3d stencil computation. Procedia Computer Science 4, pp. 2146 – 2155. Note: Proceedings of the International Conference on Computational Science, ICCS 2011 External Links: ISSN 18770509, Document Cited by: §4.
 [6] (201404) Algorithm 942: semistencil. ACM Trans. Math. Softw. 40 (3). External Links: ISSN 00983500, Document Cited by: Figure 3, §4, §6.2.
 [7] (201211) Experiences with openmp, pgi, hmpp and openacc directives on iso/tti kernels. In 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, Vol. , pp. 691–700. External Links: Document, ISSN Cited by: §4.
 [8] (1996) Simulating seismic wave propagation in 3d elastic media using staggeredgrid finite differences.. Geophysics 86, pp. 1091–1106. Cited by: §2.
 [9] (1976) Synthetic seismograms: a finitedifference approach. Geophysics 41, pp. 2–27. Cited by: §2.
 [10] (2007) An unsplit convolutional perfectly matched layer improved at grazing incidence for the seismic wave equation. Geophysics 72, pp. 155–167. Cited by: §3.
 [11] (1964) Difference analogues of orthogonal decompositions, basic differential operators and some boundary problems of mathematical physics. ii. USSR Computational Mathematics and Mathematical Physics 4, pp. 36–50. Cited by: §3.
 [12] (2019) Devito (v3.1.0): an embedded domainspecific language for finite differences and geophysical exploration. Geoscientific Model Development 12 (3), pp. 1165–1187. External Links: Document Cited by: §4.
 [13] (2001) Tools for applicationoriented performance tuning. In Proceedings of the 15th International Conference on Supercomputing, ICS ’01, New York, NY, USA, pp. 154–165. External Links: ISBN 158113410X, Document Cited by: §6.5.
 [14] (2010) 3.5d blocking optimization for stencil computations on modern cpus and gpus. In SC ’10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, Vol. , pp. 1–13. Cited by: §4.
 [15] (2017) Performance portability in reverse time migration and seismic modelling via openacc. The International Journal of High Performance Computing Applications 31 (5), pp. 422–440. External Links: Document Cited by: §4.
 [16] (2000) Tiling optimizations for 3d scientific computations. In SC ’00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, Vol. , pp. 32–32. Cited by: §4.
 [17] (2009) An overview of fullwaveform inversion in exploration geophysics. Geophysics 74. Cited by: §2.
 [18] (1984) SHwave propagation in heterogeneous media: velocitystress finitedifference method. Geophysics 49, pp. 1933–1957. Cited by: §3.
 [19] (1986) Psv wave propagation in ‘heierogeneous media: velocitystress finitedifference method. Geophysics 51, pp. 889–901. Cited by: §3.
 [20] (200904) Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, pp. 65–76. External Links: Document Cited by: §6.2.
 [21] (1966) ”Numerical solution of initial boundary value problems involving maxwell’s equations in isotropic media.”. IEEE Transactions on Antennas and Propagation 14, pp. 302–307. Cited by: §3.