1 Introduction
A wide variety of physical phenomena can be formalized in terms of partial differential equations (PDE) such as sound, heat, diffusion, electrostatics, electrodynamics, fluid dynamics, elasticity, and quantum mechanics. The development of computationally efficient methods for obtaining numerical solutions of PDEs through stencil kernels has been mentioned as a key computational science and engineering challenge to be addressed as one of the "seven dwarfs of computation" for at least the next decade, in 2009
[13]. In fact, largescale PDE inversion algorithms that can be solved by finitedifference (FD) schemes used in exploration seismology such as full waveform inversion (FWI) and reverse time migration (RTM) constitute some of the current most computationally demanding problems in industrial and academic research.In general, a stencil on structured grids is defined as a function that updates a point based on the values of its neighbors. The stencil structure remains constant as it moves from one point in space to the next. In the context of a waveequation solver, the stencil is described by the support (gridlocations) and the coefficients of FD schemes. Using parallel designs such as graphics processing units (GPU) has relatively recently become the preferred choice to improve existing code for the current commercial and scientific community that performs stencil computations.
However, a significant barrier that has become increasingly more notable is the difficulty in programming these systems. As the hardware architectures grow in complexity, exploiting the potential of these devices requires higher knowhow on parallel programming. The issue has further been compounded by a rapidly changing hardware design space, with a wide range of parallel architectures. For example, some designs offer many simple processors vs. fewer complex processors, some depend on multithreading, and some even replace caches with explicitly addressed local stores. As no conventional wisdom has yet emerged, it is unsustainable for domain scientists to rewrite their applications for each new type of architecture regarded that developing and validating a PDE solver usually takes decades of effort.
To address the problem of algorithm sustainability, taking into account the uncertainty in future architectures, one solution involves decoupling the work of a domain scientist and a computer scientist. In this approach, Domain Specific Languages (DSL) are developed by highperformance computing (HPC) specialists, and the specifics of the problem and the numerical solution method are specified in the DSL by the domain scientist. Using sourcetosource translation, the numerical solver can be targeted towards different hardware backends. This ensures that only the backend that interfaces with the new architectures need to be written and supported by the translator. The underlying implementation of the solver remains the same, thereby introducing a separation of concerns that results in a direct payoff in productivity.
Interest in building generic DSLs for solving PDEs is not new with early attempts dating back as far as 1970 [1, 2, 3]. More recently, two prominent finite element software packages, FEniCS [6] and Firedrake [11], have demonstrated the power of symbolic computation using the DSL paradigm. The optimization of regular grid and stencil computations has also produced a vast range of libraries and DSLs that aim to ease the efficient automated creation of highperformance codes [4, 5, 9, 16].
In this work, we present an implementation for automatic GPU codegeneration to Devito. This objective can be translated into extending Devito’s backend in such a way that the generated stencils are compatible with this target architecture. Currently, two backends exist in Devito: the default backend to run it on standard CPU architectures; and an alternative backend using the YASK stencil compiler to generate optimized C++ code for Intel^{®} Xeon^{®} and Intel^{®} Xeon Phi^{™} architectures [8]. Our strategy is to utilize one of the Oxford Parallel Domain Specific Languages (OPDSL), called OPS, to build a third backend for Devito. OPS is a programming abstraction embedded in C/C++ for writing multiblock structured mesh algorithms, and it is composed by the corresponding software library (an Application Programming Interface – API) and code translation tools (compilers) to enable automatic parallelization of the intermediarylevel code produced (here, by Devito) using different parallel programming approaches.
As a result, it is expected that executable artifacts wrote in CUDA, OpenACC, OpenCL, OpenMP, and MPI get automatically and transparently composed for a diverse range of hardware from highlevel symbolic descriptions of PDEs. It has been shown that OPS generated code is capable of matching or outperforming handcoded and tuned implementations [12], which implies considerable confidence in such an approach being capable of delivering high performance, code maintainability and future proofing.
It is possible to speculate that it would take much longer not only to compose complex FD problems but also to produce their various handcoded parallel implementations, each of which would have to be then debugged and validated. The authors claim that the time savings on combining code generation with automatic parallel implementation for stateoftheart hardware will have a significant impact on the efforts for modeling seismic inversion algorithms.
The remaining of this paper is organized as follows. In Section 2 we present both the Devito and the OPS compiler, altogether with the model for isotropic wave propagation considered in our study. Section 2.3 describes how the code generated by Devito should be modified in order to match the syntax of the OPS compiler, and also the roofline model for evaluating the performance of the generated kernels on the GPU devices considered in this work. In Section 4 we show and comment on the the results. Section 5 encloses this paper with concluding remarks.
2 Background
2.1 Devito
Devito is a tool to solve partial differential equations (PDEs) which is a mathematical tool to describe numerous problems that are heavily constrained by physical laws. Some areas in which it has uses are: geophysics, earth and climate science, material science, chemical and mechanical engineering, medical imaging and physics, even in economics. It uses a domain specific language (DSL) as method to simplify development process for the user, and also solve it using finite difference method that is a numerical method.
Devito automatically generates C/C++ code with different levels of optimization for finitedifference schemes from a symbolic Python representation of partial differential equations, with a performance that is competitive with, and often better than, handoptimized implementations. To illustrate this, consider the Equation 1 that is a wave propagation with a source injection and its initial conditions.
(1) 
where:

, represents the square slowness model as a function of the three space coordinates ;

, is the spatially varying acoustic wave field in each time step;

, is the source term representing the source injection;
As Devito uses Sympy library for an easier symbolic representation, writing this equation is as simple as shown in Algorithm 1, which represents a small part of the solution.
Devito performs justintime compilation and execution, so the domain expert can focus on the mathematical formulations, instead of writing lowlevel code. Following the example, the C code automatically generated from Devito using Python can be seen in Algorithm 2.
The user doesn’t even need to see this code, it will all be handled by Devito’s compiler and the result from its execution will be available for the developer. Programming the Algorithm 1 is much simpler and maintainable than Algorithm 2 and it enables the code execution in different architectures using the same python code.
In this work, we leveraged Devito to support the OPS library (described in Subsection 2.2) for computing stencil kernels in a GPU environment using the CUDA parallel computing platform.
2.2 Ops
OPS provides highlevel code abstraction aimed at multiblock suctured grid computations. It can be embedded in C/C++ and its API provides a basic structure for grid computations such as: blocks, datasets defined on these blocks representing constants and state variables, and parallel loops across a block, accessing data defined on the grid points. Which are used to deliver code for different parallel architectures: MPI, OpenMP, OpenACC, CUDA and OpenCL.
The diagram in Figure 1 shows the traditional work flow of OPS programs: starting from the desired structured mesh application then programming the C/C++ algorithm using OPS API, compiling and linking it with OPS libraries and executing the desired platform.
OPS and Devito integration enables automatic code generation for GPU architectures from a high level representation.
2.3 DevitoOPS Integration
To accomplish Devito and OPS integration we need to understand the process Devito uses to generate C/C++ code. Devito generates an intermediate representation to perform a sequence of operations to the expressions and iterations, this includes:

Equations lowering;

Local analysis;

Clustering;

Symbolic optimization;

Iteration/expression tree (IET) construction;

Synthesis;

Operator specialization through backends;
the last step is where Devito will specialize data types aiming an interested API, which is OPS in this research. Devito with OPS backend share all the compilation pipeline until the specialization.
In this section, we stress seven (ivii) essential building blocks required to accomplish our prototype solution.
The integration starts with generating OPS Expression’s (i), which are expressions translated into OPS syntax. An expression that initially is represented in C/C++ language as
u[t+1][x][y] = u[t][x][y] + 1
has an OPS representation syntax given by:
ut10[OPS_ACC0(0,0)] = ut00[OPS_ACC1(0,0)] + 1
The array access u in the first representation will be replaced for ut10 when indicating a one position forward in the time dimension, and replaced for ut00 when accessing the current time dimension. The term OPS_ACC#(0,0) is a macro that OPS syntax uses when translating the index to the desired architecture.
Producing this transformation in Devito requires that the parts of a given expression are separated into nodes. For example, an Indexed object containing the indices that corresponds to displacements over dimensions at Devito level, corresponds to Carrays. The Algorithm 3 illustrates the recursive method used to transform Devito expressions.
We are interested in transforming expressions from offloadable loops. These expressions can be parallelized into a device code that will efficiently get executed by GPU architectures. Parallelizable expressions of the same nest can be grouped inside an outlined function that we call OPS User Kernel (ii), called by ops_par_loop (iii) in the OPS API syntax. The iteration range defines the range in which a OPS User Kernel will operate over the mesh. It is described as an integer array that defines the boundaries in each spatial dimension. The mesh that will be written into or read from throughout the kernel operation is the dataset that is represented by ops_dat (iv) in the OPS API syntax.
Others API calls needed to generate a compilable OPS code ultimately are:

ops_init and ops_end are calls that will mark the beginning and ending of OPS syntax usage. All OPS declarations must be located between these two calls.

ops_block is used to group datasets together.

ops_partition triggers a multiblock partitioning across a distributed memory set of processes.
The diagram in Figure 2 represents an overview of the Devito and OPS integration.
The main contribution of this work is a prototype solution that will automatically generate kernel code for a GPU environment. This code can be coupled in a manually generated C code, that is capable of calling this generated kernels. In Section 5 we discuss in future works how to fully generate the host code.
3 Experiment
3.1 Acoustic Wave Propagation
The current investigation involved generating isotropic 3D wave propagation stencil kernels in an automatic fashion for two NVIDIA architectures and analyzing the performance of the generated algorithms by the roofline model [13]. Kepler and Volta were selected as target GPU architectures, with specifications summarized in Table 1. They use CUDA cores, which is compatible with the syntax supported by the OPS programming interface. Executable artifacts were produced by NVCC compiler with flags Xcompiler="std=c99" O3, altogether with specific micro architectural flag depending on the specific architecture.
Titan Z  Tesla V100  

Memory Bandwidth (GB/s)  336 x2  900 
Single Precision Peak Performance (GFLOPS)  4746  14000 
Double Precision Peak Performance (GFLOPS)  1582  7000 
Memory (GB)  6 x2  16 
GTX Titan Z is a graphics card launched in 2014. It combines two graphics processors for increased performance, although here we only consider one of those cores. This card uses Kepler microarchitecture and specific compilation flags gencode arch=compute_35, code=sm_35.
Tesla V100 is a PCIe 16GB launched in 2017. The micro architectural flag specific for Tesla is gencode arch=compute_70,code=sm_70.
The performance of the produced solutions was analyzed in terms of their floatingpoint performance, operational intensity and memory performance through the roofline model. This model reveals the rate between the extent of performance usage and the theoretical peak performance of the evaluated devices.
The maximum performance of each architecture was calculated using Equation 2 considering the hardware specifications described in Table 1. Any algorithm running in the same architecture will be bound to this very same roof.
(2) 
The Operational Intensity (OI) measures the Dynamic Random Access Memory (DRAM) bandwidth needed by a kernel in a particular architecture. In the devices considered in this paper, each read or write transaction between the DRAM and the caches have a 32 bytes size. Using this definition, the Equation 3 is used to determine the OI.
(3) 
A kernel performance measures the number of floatingpoint operations per second. Performance can be directly calculated using Equation 4.
(4) 
4 Results
Data obtained in previous studies indicated that Devito is able to efficiently utilise Intel architectures^{1}^{1}1Intel^{®} Xeon^{®} E52690v2 with 10 physical cores, and Intel^{®} Xeon^{®} Phi^{TM} accelerator card. with a high degree of efficiency, while maintaining the ability to increase accuracy by switching to higher order stencil discretization dynamically [8]. Luporini et al. show that remarkable speedups from 3x up to 4x can be attainable for those architectures on scenarios with what they call "aggressive" optimizations to avoid redundant computation over 3D grids with space order discretization levels varying from 4 to 16. In our study, we measure the performance of a new backend for Devito on the NVIDIA^{®} architectures GTX Titan Z^{TM} and Tesla V100^{TM} considering scenarios with no symbolic optimizations (basic DSE), and with an aggressive symbolic optimization implemented by Devito (aggressive DSE). An isotropic acoustic wave propagation model with absorbing boundaries as described by Equation 1 is utilized.
In this study, we measured the rate between attainable performance and the peak machine performance according to specifications, for both the considered devices. We take into account the roofline model described in Section 3 to evaluate how efficiently the generated algorithms utilize the GPU for varying space order levels of the generated propagation stencil kernels. For each of the considered space orders we profiled the propagation kernel using nvprof ^{2}^{2}2The nvprof profiling tool enables you to collect and view profiling data from the commandline, and is present in the NVIDIA^{®} CUDA^{®} Toolkit. in order to obtain: (a) the number of single precision floatingpoint operations, (b) the number of memory transactions, and (c) the kernel execution time.
For each space order, the produced stencil kernel ran five times for 30.000 time steps. Table 2 shows the values collected for GTX Titan Z and Table 3 shows the values collected for V100, for basic and aggressive symbolic optimization levels, and space orders levels of 8, 12, 16 and 24. The values for OI are obtained according to Equation 3 whereas the values for performance are obtained according to Equation 4.
Figures 3 and 4 display the OI (FLOP/Byte) versus performance (GFLOP/s) from the values found in Tables 2 and 3, respectively. Each of the points in those plots are characterized by two values: (i) the space order, and (ii) the percentage from the device peak performance. The performance bounds were obtained from vendor peak performance specifications in Table 1.








Basic Optimization  
8  1,450,112,268  22,722,746  553.92  1.99  78.54  
12  2,013,392,118  28,068,109  854.39  2.24  70.70  
16  2,375,372,938  29,871,728  907.72  2.48  78.51  
24  2,898,342,158  33,348,001  1,150.01  2.71  75.61  
Aggressive Optimization  
8  641,887,345  22,637,047  135,73  0.89  141.88  
12  760,134,906  27,737,029  179.15  0.86  127.29  
16  842,931,505  29,704,549  180,55  0.89  140.06  
24  929,761,776  32,926,331  219,76  0.88  126.92 








Basic Optimization  
8  1,450,996,129  9,245,436  553.92  4.90  693.77  
12  2,013,446,796  9,112,947  854.39  6.90  740.48  
16  2,375,384,531  7,722,032  907.72  9.61  816.86  
24  2,898,311,328  11,862,338  1,150.01  7.64  719.60  
Aggressive Optimization  
8  641,882,304  9,256,098  15.31  2.18  1258.16  
12  760,133,342  9,289,727  20.37  2.56  1119.42  
16  842,930,745  8,026,245  20.21  3.28  1251.51  
24  929,760,267  11,670,483  18.48  2.49  1509.60 
Considering the results for GTX Titan Z in Figure 3, we can see that the operation intensity increases with higher space order levels for basic optimization, whereas the operation intensity are nearly the same for an aggressive optimization.
Considering the results for GTX Titan Z in Figure 3, we can see that for the basic optimization, the operation intensity increase with higher the space orders, while using the aggressive optimization they almost did not differ. One can also see that aggressive optimization produces code with better performance than with basic optimization in all scenarios, enabling approximately 24% of peak performance to be achieved versus 6% for the basic scenario.
Executing the experiment in the V100 graphic card, we achieve better performance, as illustrated in Figure 4. Performance gains using aggressive optimization goes from approximately 16% to 63%. It is worth noting that there is a decrease in OI for so 24, this result was not expected as there are more operations in higher so. Looking at the data from Table 3 we can verify that the amount of data transferred in so 24 is 45% higher than the so 16, while the difference in data transfer in the other scenarios was at most 15%. We can thus conclude that the amount of data needed for so 24 is much larger than expected, which indicates that memory accesses in GPU are not coalesced for this case.
In both GTX Titan Z and V100 tests, the aggressive optimization led to three times higher peak performance than the basic optimization. The results from the aggressive optimization corroborate results presented in a related experiment, Luporini et al. [8] that enabled Devito to generate code for the YASK framework and obtained peak performances going from 53% to 63% for Intel^{®} Xeon^{®} and Intel^{®} Xeon^{®} Phi^{™} architectures.
Another analysis possible due to the Roofline model is for future optimizations. All the points are located before the ridge point at both the roofline plots in Figures 3 and 4, which indicates that all the tested cases are memory bounded instead of compute bounded. This means that the produced codes should get greater benefits from optimizations targeted to perform memory exchanges more efficiently than from optimizations focused on increasing throughput. Therefore, enabling FLOPs reduction and data locality such as common subexpression elimination, factorization, and code motion should be considered as a priority for future works.
5 Conclusion
The opensource project Devito^{®} [7, 8] has been attracting the attention of academic [10, 14] and industrial [15]
community. As a DSL for seismic inversion applications, it already provides a set of automated performance optimizations during code generation that allow user applications to fully utilize the target hardware without changing the model specification, such as vectorization, sharedmemory parallelism, loop blocking, autotuning, common subexpression elimination (CSE), crossiteration redundancy elimination (CIRE), expression hoisting and factorization. Devito also supports distributedmemory parallelism via MPI, and several haloexchange schemes are available. Classic optimizations such as computationcommunication overlap (relying on asynchronous progress engine) are implemented. It can be integrated with a wide variety of methods (e.g. LBFGSB
^{3}^{3}3Largescale Boundconstrained Optimization ) for solving minimization problems, such as in FWI. It can perform FWI on distributed memory parallel computers with Dask. It also implements support for standard CPU architectures, and for Intel^{®} Xeon^{®} and Intel^{®} Xeon Phi^{™} architectures. However, the support to code specialization for GPU architectures is yet a work in progress.In this study, we created an extension of Devito to enable code generation for the OPS syntax. We also evaluated the new backend in terms of processor performance concerning offchip memory traffic for varying space order discretization levels on the NVIDIA^{®} devices GTX Titan Z^{TM} and Tesla V100^{TM}. We found that the implemented backend achieves up to 62.82% of the peak performance on V100, which is consistent with results from work using Devito to generate YASK framework code [8]. We also observed that isotropic 3D wave propagation stencil kernels generated with aggressive symbolic optimizations have three times higher peak performance than with no symbolic optimizations. This study, therefore, indicates that it is possible to use the available power of GPU architectures in Devito for solving seismic inversion algorithms.
This work is the first study to our knowledge that investigates a seamless coupling between Devito and OPS compilers. However, some limitations are worth noting as the capability of the implemented solution still only covers source injection and forward propagation. The forward model is the basis for further implementations of inversion processes using Devito operators. Yet, in order to enable a seamless sourcetosource translation of FWI algorithms, future work should provide support for receiver interpolation and backward propagation as well. Moreover, the automatic generation of host code, responsible for calling the device code that will execute in GPU, is currently in implementation. Finally, to complete Devito integration, it is necessary to automatically translate, compile, and execute the GPU code through the Devito pipeline and return the result from the execution to the Devito workflow.
References
 [1] Cárdenas, A.F., Karplus, W.J.: PDEL—a language for partial differential equations. Communications of the ACM 13(3), 184–191 (mar 1970). https://doi.org/10.1145/362052.362059
 [2] Cook, G O Jr. (Lawrence Livermore National Lab., C.U.: ALPAL: A tool for the development of largescale simulation codes (1988)
 [3] van Engelen, R., Wolters, L., Cats, G.: CTADEL: A Generator of MultiPlatform High Performance Codes for PDEbased Scientific Applications. In: Proceedings of the 10th international conference on Supercomputing  ICS ’96. pp. 86–93. ACM Press, New York, New York, USA (2003). https://doi.org/10.1145/237578.237589
 [4] Hawick, K., Playne, D.P.: Simulation Software Generation using a DomainSpecific Language for Partial Differential Field Equations. In: Proceedings of the International Conference on Software Engineering Research and Practice (SERP). p. 7. The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp) (2013). https://doi.org/ProQuest document ID: 1491419516
 [5] Henretty, T., Veras, R., Franchetti, F., Pouchet, L.N., Ramanujam, J., Sadayappan, P.: A stencil compiler for shortvector SIMD architectures. In: Proceedings of the 27th international ACM conference on International conference on supercomputing  ICS ’13. p. 13. ACM Press, New York, New York, USA (2013). https://doi.org/10.1145/2464996.2467268
 [6] Logg, A., Olgaard, K.B., Rognes, M.E., Wells, G.N.: FFC: the FEniCS Form Compiler. In: Logg, A., Mardal, K.A., Wells, G.N. (eds.) Automated Solution of Differential Equations by the Finite Element Method, Volume 84 of Lecture Notes in Computational Science and Engineering, chap. 11. Springer (2012)
 [7] Louboutin, M., Lange, M., Luporini, F., Kukreja, N., Witte, P.A., Herrmann, F.J., Velesko, P., Gorman, G.J.: Devito (v3.1.0): An embedded domainspecific language for finite differences and geophysical exploration. Geoscientific Model Development 12(3), 1165–1187 (2019). https://doi.org/10.5194/gmd1211652019
 [8] Luporini, F., Lange, M., Louboutin, M., Kukreja, N., Hückelheim, J., Yount, C., Witte, P., Kelly, P.H.J., Gorman, G.J., Herrmann, F.J.: Architecture and performance of Devito, a system for automated stencil computation (jul 2018)
 [9] Membarth, R., Hannig, F., Teich, J., Kostler, H.: Towards domainspecific computing for stencil codes in HPC. In: Proceedings  2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012. pp. 1133–1138. IEEE (nov 2012). https://doi.org/10.1109/SC.Companion.2012.136
 [10] Mojica, O.F., Kukreja, N.: Towards automatically building starting models for fullwaveform inversion using global optimization methods: A PSO approach via DEAP + Devito (may 2019)
 [11] Rathgeber, F., Ham, D.A., Mitchell, L., Lange, M., Luporini, F., McRae, A.T.T., Bercea, G.T., Markall, G.R., Kelly, P.H.J.: Firedrake: automating the finite element method by composing abstractions (jan 2015). https://doi.org/10.1145/2998441
 [12] Reguly, I.Z., Mudalige, G.R., Giles, M.B., Curran, D., McIntoshSmith, S.: The OPS domain specific abstraction for multiblock structured grid computations. In: Proceedings of WOLFHPC 2014: 4th International Workshop on DomainSpecific Languages and HighLevel Frameworks for High Performance Computing  Held in Conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Stor. pp. 58–67 (2014). https://doi.org/10.1109/WOLFHPC.2014.7
 [13] Williams, S., Waterman, A., Patterson, D.: Roofline. Communications of the ACM 52(4), 65 (apr 2009). https://doi.org/10.1145/1498765.1498785
 [14] Witte, P.A., Louboutin, M., Kukreja, N., Luporini, F., Lange, M., Gorman, G.J., Herrmann, F.J.: A largescale framework for symbolic implementations of seismic inversion algorithms in Julia. GEOPHYSICS 84(3), F57–F71 (may 2019). https://doi.org/10.1190/geo20180174.1
 [15] Yount, C., Tobin, J., Breuer, A., Duran, A.: YASK  Yet another stencil kernel: A framework for HPC stencil codegeneration and tuning. In: 2016 Sixth International Workshop on DomainSpecific Languages and HighLevel Frameworks for High Performance Computing (WOLFHPC). pp. 30–39. IEEE (nov 2017). https://doi.org/10.1109/WOLFHPC.2016.08
 [16] Zhang, Y., Mueller, F.: Autogeneration and autotuning of 3D stencil codes on GPU clusters. In: Proceedings of the Tenth International Symposium on Code Generation and Optimization  CHO ’12. p. 155. ACM Press, New York, New York, USA (2012). https://doi.org/10.1145/2259016.2259037