Investigating the OPS intermediate representation to target GPUs in the Devito DSL

06/26/2019
by Vincenzo Pandolfo, et al.

The Devito DSL is a code generation tool for the solution of partial differential equations using the finite difference method, specifically aimed at seismic inversion problems. In this work we investigate the integration of OPS, an API to generate highly optimized code for applications running on structured meshes on various platforms, within Devito as a means of bringing it to the GPU realm. We provide an implementation of an OPS backend in Devito, obtaining considerable speedups compared to the core Devito backend.


2.1 Devito

Devito is a DSL and code generation tool for the solution of PDEs using the finite difference method. In particular, it is focused on the generation of highly efficient kernels for the solution of seismic inversion problems. It uses SymPy [7] to define high-level operators from symbolic equations, which are then converted into C code that runs efficiently on a specific target architecture.

The use of symbolic equations allows Devito to be extremely user-friendly to domain specialists, enabling them to define complex operators in little code while hiding the complexity of code optimisation. This makes it extremely quick for users to develop new solvers (or rewrite existing ones) that target different architectures.

2.1.1 Top level overview

The main building blocks of Devito programs are Operators, which contain the problem specification: the data they operate on and the expressions to be evaluated over this data. Once a user has defined their Operator, Devito uses it to generate stencil code that is run in optimised loops to apply the equations to the input data.

The Devito CFD tutorial [8] provides a simple example of an Operator for 2D linear convection:

from devito import Eq, Grid, Operator, TimeFunction, solve
from examples.cfd import init_hat  # helper from the Devito examples

# nx, ny, dx, dy, c, nt and dt are assumed to be defined beforehand.
grid = Grid(shape=(nx, ny), extent=(2., 2.))
u = TimeFunction(name='u', grid=grid)

# Initialize the input data
init_hat(field=u.data[0], dx=dx, dy=dy, value=2.)

# Specify the 'interior' subdomain so that the stencil is only
# applied to the interior of the domain.
eq = Eq(u.dt + c*u.dxl + c*u.dyl, subdomain=grid.interior)

stencil = solve(eq, u.forward)

# Create an operator that updates the forward stencil point
op = Operator(Eq(u.forward, stencil, subdomain=grid.interior))

# Apply the operator for a number of timesteps
op(time=nt, dt=dt)

Functions, identified by a name (u in the example), are abstractions describing spatially varying discrete functions defined over a computational Grid. An extension of these are TimeFunctions, which describe spatially varying, time-dependent discrete functions.

SparseFunction and SparseTimeFunction are sparse functions that operate only on a subset of the grid.

u.forward, u.dxl and u.dyl are examples of useful finite difference shortcuts provided by Devito to aid the formulation of FD stencils.
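As a quick illustration, these shortcuts can be inspected directly from a Python session (a minimal sketch using a small placeholder grid):

from devito import Grid, TimeFunction

grid = Grid(shape=(4, 4))
u = TimeFunction(name='u', grid=grid, space_order=2)

# u.forward is u at the next time step; u.dxl is the left (backward)
# first derivative in x, expanded into its finite difference weights.
print(u.forward)
print(u.dxl)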

Figure 2.1: The input data (left) and the result (right) of the execution of the example Operator in section 2.1.1 [8]

2.1.2 Optimizations

Once Operators are defined by the user, they undergo a sequence of optimization steps, most notably those executed by the Devito Symbolic Engine (DSE) and the Devito Loop Engine (DLE), which are covered in more detail shortly [9].

The goal of this project is to offload part of the optimization process to OPS (described in more detail later), in order to allow Devito to target GPUs without having to write the required optimizations directly. The nature of the Devito optimisation pipeline makes it possible to swap out parts of it with different components, so that external tooling can be used at the right stage when appropriate.

Devito Symbolic Engine

The DSE is tasked with reducing the operation count of groups of equations sharing the same domain (called Clusters) by means of symbolic manipulation. This stage does not directly impact the nature of the loop nests generated by Devito; it focuses on optimizing the expressions themselves with methods such as common sub-expression elimination, FD weight factorization and alias detection.
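To illustrate the kind of rewrite performed by common sub-expression elimination, here is a plain-SymPy sketch (using sympy.cse directly, not the DSE itself):

import sympy as sp

x, y = sp.symbols('x y')
exprs = [2*sp.sqrt(x + y), sp.sqrt(x + y) + 1]

# cse factors the repeated sqrt(x + y) out into a temporary x0,
# so the expensive sqrt is evaluated only once.
temps, reduced = sp.cse(exprs)
print(temps)    # [(x0, sqrt(x + y))]
print(reduced)  # [2*x0, x0 + 1]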

Once this step is performed, this intermediate representation is transformed into an Iteration/Expression Tree (IET), an abstract syntax tree that represents the structure of the loop nests.

IET and Devito Loop Engine

When the IET is built, equations are embedded in Expressions, nested inside Iterations representing the loop nests. The IET is then used for further optimizations performed by the DLE. These include SIMD vectorization, loop blocking and parallelism via OpenMP.

Different backends can provide alternative loop optimization engines. One example of such a backend is the YASK Loop Engine (YLE), which transforms suitable elements of the IET into a format that YASK can process, thereby offloading most of the optimization work to it.

The OPS backend discussed in this report is similar in nature to the YASK one, as IETs are transformed in such a way that the generated code uses the OPS API, leaving the final optimization tasks to the OPS compiler.

2.1.3 Compilation and execution

Once all these steps are performed, variable declarations, headers and profiling instrumentation are inserted into the IET. An IET visitor then generates a CGen [10] AST that is transformed into a source code file, JIT-compiled and loaded into the Python environment. The code can then be executed directly from an Operator via a function used as the default entry point.

Figure 2.2: The Devito pipeline [9]

2.1.4 IET nodes

Most of the focus of this report is on manipulation of IETs and therefore of IET nodes. We will now give a brief overview of some IET nodes that will be used in this report.

  • Expression: encapsulates a ClusterizedEq, a SymPy equation with an associated iteration and data space. It is rendered in code generation as an assignment

  • Call: represents a function call

  • Callable: represents a function definition, including its signature and body

  • Iteration: a for loop

  • IterationTree: represents a series of nested iterations

  • Element: general purpose node that can contain different cgen objects

2.2 OPS

The Oxford Parallel library for Structured mesh solvers (OPS) [6] is a domain-specific active library for the development of applications operating on multi-block structured meshes. It provides an abstraction and API that enable automatic parallelization and optimization of computations on multi-block structured grids. The OPS abstraction is composed of four main components:

  • Blocks: high-level definition of the grid, with dimensionality but no size

  • Dataset: the actual data contained in blocks

  • Halos: interface between datasets on different blocks

  • Computations: operations to apply to grid points

The API is straightforward, as we will see:


  • ops_block ops_decl_block(num_dims, ...) defines a block with num_dims dimensions

  • ops_dat ops_decl_dat(block, size[], ...) defines a dataset on block with size size. Ownership of the data is passed to OPS and the data is then only accessible via ops_dat handles

  • void ops_par_loop( void (*kernel)(...), block, ndim, range[], arg1, ..., argN) defines a parallel loop over block with the given dimensions and range that applies kernel with the given arguments

  • ops_arg ops_arg_dat(...) defines an argument that can be used to pass a dataset to ops_par_loop

  • ops_arg ops_arg_gbl(...) defines an argument that can be used to pass a global scalar or small array to ops_par_loop

Once users have written their code using OPS, it goes through the OPS source-to-source translator, a Python library that generates optimized code for the target architecture. To do so, it parses calls to ops_par_loop to extract the required information and transforms them into highly specific, optimized loop code. This code is then linked against the OPS libraries (the OPS backend) to generate high-performance executables.

This backend provides important features to OPS, some examples being:

Distributed memory parallelism: using MPI, OPS can automatically partition blocks and datasets over multiple processes

Checkpointing and recovery: thanks to OPS taking ownership of the data, it can keep track of the transformations applied to the data in order to periodically store it on disk (checkpointing) and automatically recover in case of failures.

Tiling: using runtime execution data, OPS can analyze data dependencies and generate effective tiling strategies.

These optimizations can be applied by OPS on multiple architectures with no extra work from the user, who only needs to write their code once using the OPS abstraction. This characteristic makes OPS a great candidate as a Devito backend, since it sits at a much lower level: source code generation in Devito can target OPS instead of a specific architecture, making Devito easier to reason about and freeing Devito developers from the burden of low-level optimizations, allowing them to concentrate on more domain-specific issues.

2.3 Related work

2.3.1 YASK

YASK (Yet Another Stencil Kernel) is a framework that allows the generation of high performance code targeted at Intel Xeon and Xeon Phi processors. It provides multiple optimizations, including:

  • Vector-folding

    • Traditionally, multi-dimensional vectors are stored as a sequence along one dimension. This is inefficient for multi-dimensional stencils. Vector-folding is a data storage technique that minimizes memory accesses by storing multi-dimensional data in a vector [11].

  • Multi-level parallelism via OpenMP

    • Outer loops are parallelized across cache-sized blocks, which are then further split into inner loops. This increases temporal cache locality, avoiding eviction of data in use by concurrent threads.

  • Multi-socket and nodes processing via MPI

  • Space tiling

    • Also known as loop blocking, space tiling optimizes memory access by applying loop transformations to split the iteration space in chunks.

  • Time tiling

    • Time tiling applies the idea of space tiling across time steps. YASK accomplishes this using a technique called temporal wave-front tiling [12].

It provides an API in both C++ and Python.

Devito YASK backend

An ongoing project within Devito is integrating YASK as a backend, in a similar fashion to what this project has done with OPS. Currently, around 70% of the Devito API is supported by the YASK backend [9].

2.3.2 Loo.py

Loo.py [13] is a Python library that provides a data model for the definition of array computations and a set of transformations that can be applied to these definitions to generate optimized CUDA/OpenCL code. The transformations to apply are decided by the user and include loop tiling, vectorization and instruction-level parallelism, among others.

Computations are defined using the isl library syntax [14] and then transformed by the user using the provided (or custom) transformations. This way, the user has complete control over what happens to their code, leaving to Loo.py only the task of applying the transformations and generating the code, without trying to do anything clever to further optimize it.

The resulting code can then be executed using pyOpenCL, or printed to be compiled and executed by other means if a platform other than OpenCL is targeted (for example, CUDA).

2.3.3 Mint

Mint [15] is a source-to-source C translator that generates optimized CUDA code based on annotations (pragmas). Mint specifically targets stencil computation: by restricting the application space in which it operates it can perform more effective optimizations for that problem class.

Optimizations are performed depending on the results of an analysis of the stencil done by a component called the stencil analyzer. The results are then passed to the On-chip Memory Optimizer, which uses this information to optimize memory accesses using registers and shared memory, avoiding neighbouring points being loaded multiple times by independent threads.

Another optimization performed by Mint is Loop Aggregation (or loop blocking) in which chunks of the iteration space are assigned to a single thread in order to minimize shared memory accesses.

Mint has achieved good performance, reaching around 80% of the performance of hand-optimized CUDA code.

2.3.4 Lift

Lift [16] is a Scala library whose goal is achieving performance portability for DSL and library developers by providing a set of parallel primitives that can be composed to generate high performance stencil code for GPUs.

Lift applications are expressed as a sequence of functional primitives and are optimized through a set of 1D rewrite rules. Users can compose these rewrite rules to express optimized multidimensional stencil code. The technicalities of the optimizations applied by Lift are hidden from the user who, however, retains complete control over how high-level optimizations are applied.

The resulting stencil program is then converted into a sequence of low-level OpenCL-specific primitives that are transformed into optimized OpenCL code.

Lift has achieved performance comparable to hand-optimized OpenCL code [17].

3.1 Overview

Devito allows the creation of different backends that can modify the default behaviour to various degrees. We utilise this feature to introduce an OPS backend whose goal is to generate, compile and run code that uses the OPS API.

To do this, a new OperatorOPS class that subclasses Devito's default Operator has been created. It is automatically loaded as the Operator to be used by Devito when the ops backend is selected (by setting the DEVITO_BACKEND environment variable).

All necessary operations (before execution) are performed as a sequence of method invocations within the Operator constructor. This allows subclasses to override only parts of the pipeline that are relevant to what needs to be achieved.

In the case of OperatorOPS, the main method to override is _finalize_iet: this method is called after all symbolic manipulation and IET construction (including loop generation) is completed, and is tasked with specializing the IET depending on the backend in use. For the core backend (the default Devito backend) this means running the Devito Loop Engine (DLE). For our purposes it means transforming offloadable loop nests into OPS kernels, invoked through calls to ops_par_loop as described in the next sections.

Figure 3.1: Code generation pipeline for the OPS backend
Finding offloadable loops

For a loop nest to be offloaded to OPS it has to be affine, meaning that all of its array accesses are either constants or affine functions of the iteration variables. Non-affine loops can lead to data races if parallelized and therefore cannot easily be optimized further.
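As an aside, affinity of an individual index expression can be checked symbolically. The following is a toy sketch of the idea (not the find_affine_trees implementation): an expression is affine in the iteration variables exactly when all of its second derivatives with respect to them vanish.

import sympy as sp

x, y = sp.symbols('x y')

def is_affine(index_expr, loop_vars):
    # Affine iff every second derivative w.r.t. the loop variables is zero.
    return all(sp.diff(index_expr, a, b) == 0
               for a in loop_vars for b in loop_vars)

print(is_affine(2*x + y - 1, (x, y)))  # True: offloadable access
print(is_affine(x*y, (x, y)))          # False: non-affine access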

To find offloadable trees, the find_affine_trees utility function (originally from the YASK backend, which uses it for the very same purpose) is used. This returns a list of all offloadable IterationTrees. We iterate over this list and for each such tree we generate:

  • An OPS kernel

  • Dataset declarations

  • Parallel loop information declaration (stencils, iteration range)

  • Parallel loop invocation

The parallel loop invocation substitutes the Sections containing the offloaded IterationTrees with OPS loop invocations (thereby removing the actual loop from the IET) using a Transformer visitor: this visitor takes an IET and a map of substitutions and returns a new IET with these substitutions applied, as in the sketch below.
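The following toy sketch shows the substitution pattern (illustrative only, not Devito's actual Transformer): rebuild the tree bottom-up, swapping any node found in the substitution map.

class Node:
    def __init__(self, name, children=()):
        self.name, self.children = name, tuple(children)
    def __repr__(self):
        args = ', '.join(map(repr, self.children))
        return self.name if not self.children else f"{self.name}({args})"

def transform(node, mapper):
    # Replace the node if it is in the map, otherwise rebuild it
    # from its (recursively transformed) children.
    if node in mapper:
        return mapper[node]
    return Node(node.name, [transform(c, mapper) for c in node.children])

loop = Node("Iteration")
iet = Node("Section", [loop])
print(transform(iet, {loop: Node("ops_par_loop")}))  # Section(ops_par_loop)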

We then return a List node containing, in this order:

  • OPS initialization (a call to ops_init and creation of the ops_block)

  • Dataset declarations

  • Parallel loop information declaration

  • The newly created IET

  • A call to ops_timing_output to print profiling information from OPS

  • A call to ops_exit to terminate OPS

We also include the headers for OPS and our generated kernels by adding them to the _includes field in the operator.

3.2 OPS Kernel creation

When offloading loops to OPS, the first step is to extract the Expressions representing the actual computation into a Kernel function that can be passed to OPS.

To do this, the expressions have to be translated to use OPS accessor macros instead of loop indices when indexing data within arrays, and then placed in the body of a Callable (a Devito IET node representing a function definition) whose parameters are the functions and constants used in the expressions. These are found using a FindSymbols visitor, already present in Devito.

3.2.1 Expression translation

An implementation of Devito-to-OPS expression translation was already developed [18] and its logic remained mostly unchanged, with some alterations to enable clearer naming of indexed functions (facilitating the insertion of the correct OPS_ACC accessor during a second pass), to mark arrays as either read or write, and to access constants by dereferencing their pointers (since everything must be passed into an OPS kernel as a pointer).

To translate them, expressions are recursively visited and rebuilt. Changes are applied to Indexed objects (array accesses) and Constants. It is important to note that when recursing into an equality, the LHS sub-tree is set to be a write (so that any Indexed in it is changed accordingly) while the RHS is left as a read (the default state).

Indexed translation

Indexed objects are recreated to use OPS accessor macros instead of regular array accesses. The first step is determining the name of the new Indexed object: this is the function name for regular Functions while, if the Indexed object accesses a TimeFunction, the name is composed of the function name and the relevant time accessor (e.g. u[t0] becomes ut0).

We then check if an Array of the same name has already been created in the context; if not, we create it and memorise it to be reused when encountered again. These Array objects are specific to OPS and extend Devito Arrays by storing whether they are read-only and, if so, prepending the const modifier to their type definition, as required when defining the parameters of an OPS kernel.

The indices used to access the array are then extracted and inserted as a simple text Macro serving as the index of the new array. This text contains comma-separated integers detailing the shift from the loop variable (for example, for ut0[x - 1][y + 1] we would have "-1,1" as the new index). These are not yet surrounded by the OPS_ACCx macro call, as its number depends on the order of the parameters, which is not yet known.
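A minimal sketch of the offset extraction (plain SymPy, illustrative of the idea rather than the actual implementation):

import sympy as sp

x, y = sp.symbols('x y')
indices = (x - 1, y + 1)            # e.g. the access ut0[x - 1][y + 1]

# Offset of each index expression from its iteration variable:
offsets = [int(idx - var) for idx, var in zip(indices, (x, y))]
print(','.join(map(str, offsets)))  # -1,1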

Once the expressions are completely translated, the parameters are found using the FindSymbols visitor and sorted by type (arrays before scalars) and alphabetical order. This order is then used to insert the correct accessors into the Indexed objects, in a recursive pass similar to the one described before. For example, if our kernel signature is void Kernel0(const float * ut0, float * ut1), we use OPS_ACC0 to access elements of ut0 and OPS_ACC1 to access elements of ut1.
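The ordering and accessor numbering can be sketched as follows (the names are taken from the example signature above):

# Parameters are ordered arrays first, then scalars, each alphabetically;
# the position of each array in that order selects its OPS_ACCn macro.
arrays = ['ut1', 'ut0']             # as found by FindSymbols (any order)
scalars = ['h_x', 'dt', 'h_y']

params = sorted(arrays) + sorted(scalars)
acc = {name: f'OPS_ACC{i}' for i, name in enumerate(sorted(arrays))}
print(params)  # ['ut0', 'ut1', 'dt', 'h_x', 'h_y']
print(acc)     # {'ut0': 'OPS_ACC0', 'ut1': 'OPS_ACC1'}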

Constants

Constants are simply renamed to have the dereference operator at the start of their name. This accomplishes two things: when generating the parameters for the OPS kernel, constants are pointers, and when they are used in the expressions, they are dereferenced.

3.2.2 Header file creation

Once the expressions have been translated and the parameters identified, Callables are created and inserted into a list within the Operator. OPS kernels are numbered to avoid name clashes (multiple loops within an operator could potentially be offloaded to OPS).

At compile time the kernels are inserted into a List, an IET node that, as the name suggests, contains a list of nodes. This is then passed to a CGen visitor that generates the code string, which is written to a .h file included by the .c file containing the Devito kernel code. Both files are then passed to the OPS translator at compile time.

void Kernel0(const float * ut0, float * ut1, const float *dt, const float *h_x, const float *h_y)
{
  ut1[OPS_ACC1(0,0)] = -1.0F*(*dt*ut0[OPS_ACC0(0,0)]/((*h_y**h_y)) + *dt*ut0[OPS_ACC0(0,0)]/((*h_x**h_x))) + 5.0e-1F*(*dt*ut0[OPS_ACC0(0,-1)]/((*h_y**h_y)) + *dt*ut0[OPS_ACC0(-1,0)]/((*h_x**h_x)) + *dt*ut0[OPS_ACC0(0,1)]/((*h_y**h_y)) + *dt*ut0[OPS_ACC0(1,0)]/((*h_x**h_x))) + ut0[OPS_ACC0(0,0)];
}
Listing 1: Devito-generated OPS kernel

3.3 Invoking OPS kernels

Once OPS Kernels have been generated, the next step is to invoke these kernels as part of a parallel loop. In OPS, this means invoking ops_par_loop and passing in the kernel to be used, the iteration space, the starting indices in all dimensions and the arguments required by the Kernel.

In order to achieve this, various code generation features were added to Devito. Among these are the ability to nest function calls, the introduction of new data types and the possibility of initialising symbols with the result of function calls. These will be covered in more detail in the next sections.

3.3.1 Data declaration

The first step is to transform the data into ops_args that can be passed to OPS kernels.

There are two types of ops_args: datasets and globals. Datasets represent the main data that OPS operates upon (in the context of Devito, these roughly correspond to Functions) while globals can be small constant arrays or scalars.

Globals can be passed directly to kernels, while datasets need to be transformed into ops_dats first.

3.3.2 ops_dat creation

TimeFunction

When talking about how Devito functions map to OPS datasets it is important to make a distinction between Functions and TimeFunctions. This is due to the fact that the outer time loop is not offloaded to OPS and therefore TimeFunctions at different time steps need to be treated as separate datasets.

Calculating the value of a time function at a given time step relies on the values at previous time steps. To access the needed states, the options are either to store all the computed values at every time step, or to store only the required states (for example, if the only needed state is the one at the previous time step, we would store u[t] and u[t + 1]) and rotate through them as needed when computing future states, discarding the ones that are no longer needed.

If we are storing u[t] and u[t + 1], this means that when calculating u[t + 2] (using the values at u[t + 1]) we can overwrite the data for u[t], as it is no longer needed. Practically, this means that to store u in memory an array of shape (2, nx, ny) (assuming a 2D grid) is used, and that the time indices are rotated at every iteration. The indices are calculated as t mod k, (t + 1) mod k, ..., where k is the number of time steps stored and t is the actual time step value.

Going back to the 2D example, at every time iteration u[t1] = foo(u[t0]) is calculated using these time accessors, with the actual values of t1 and t0 rotating between 0 and 1 (and therefore not requiring any other time step besides the ones used in the expression).
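A minimal sketch of this rotation for k = 2 stored time steps:

# At iteration t we read from index t0 and write to index t1;
# both rotate modulo the number of stored steps.
k = 2
for t in range(4):
    t0, t1 = t % k, (t + 1) % k
    print(f'iteration {t}: u[{t1}] = foo(u[{t0}])')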

As seen previously, the translated expression in the OPS kernel takes these as two separate array arguments, ut0 and ut1 (i.e. they are treated as separate datasets), and therefore an ops_dat needs to be created for each time step used in the kernel.

The creation of the actual ops_dats is then done similarly to how it’s done for normal Functions with the difference that the information relative to the time dimension is removed.

In order to pass the correct pointer to the ops_decl_dat call, a new Devito type, FunctionTimeAccess, has been created. This represents a pointer to the array holding a TimeFunction at a specific time index (e.g. &u[0]).

Extracting relevant information about Functions

In order to create an ops_dat the following information is required:

  • Array size

  • Base indexes

  • Padding in the negative direction

  • Padding in the positive direction

  • C type

All this information can be extracted from data held by the Devito representation of the function; the only manipulation required is flipping the sign of the padding in the negative direction, since Devito represents it with positive integers while OPS uses negative ones.
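For example, the sign flip can be sketched as follows (the halo extents here are placeholders matching the generated snippet shown further below):

# Devito stores halo/padding extents as positive integers per dimension;
# OPS expects the negative-direction padding as negative values.
halo = ((2, 2), (2, 2))              # (left, right) extent per dimension
u_d_m = [-left for left, _ in halo]  # padding in the negative direction
u_d_p = [right for _, right in halo] # padding in the positive direction
print(u_d_m, u_d_p)                  # [-2, -2] [2, 2]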

Most of this information needs to be passed to OPS as arrays of integers. These can be created by adding Expressions to the IET containing SymPy equations (Eq) with a SymbolicArray (a new type representing a locally initialized array) on the left-hand side and a ListInitializer on the right-hand side. A ListInitializer is a pre-existing Devito construct representing inline list initializations of arrays (e.g. {1, 2, 3}) that has been upgraded to allow integers (previously it would only allow strings or SymPy expressions).

Once all the required arrays are created, the next step is to actually create the ops_dat. This is done using an Element, an IET node that can contain an arbitrary cgen [10] element. In this specific case, we use an Initializer, a cgen node representing an initialization statement of the form type name = value;.

On the left hand side is an OPSDat symbol and, on the right hand side, the Call node representing the call to ops_decl_dat.

In the case of TimeFunctions we have to create multiple ops_dats. To do this, we create a 1D SymbolicArray of size equal to the number of time indices used, and then initialize all its elements as previously described.

It is important to note that Devito finds the symbols that need to be passed as parameters to the Operator by visiting the IET. Therefore, since the symbols representing these Functions are no longer part of an Expression containing a SymPy equation, functionality to find these symbols among the arguments of calls within an Element has been added.

The resulting code, once generated, will look as follows:

int u_dim[2] = {3337, 3337};
int u_base[2] = {0, 0};
int u_d_p[2] = {2, 2};
int u_d_m[2] = {-2, -2};
ops_dat u_dat[2];
u_dat[0] = ops_decl_dat(block_0,1,(int *)u_dim,(int *)u_base,
    (int *)u_d_m,(int *)u_d_p,&u[0],"float","ut0");
u_dat[1] = ops_decl_dat(block_0,1,(int *)u_dim,(int *)u_base,
    (int *)u_d_m,(int *)u_d_p,&u[1],"float","ut1");

Note that in this snippet u is a TimeFunction, therefore we have two ops_dats describing the data at each relevant time step.

IET visitor extension: visiting Elements

After the IET is ready, Devito derives the parameters for the Operator by visiting the IET using a FindSymbols visitor. This visitor has three modes of operation: symbolics (collects AbstractSymbol objects), free symbols, and defines (collects bound objects, i.e. variables initialized within the IET).

The parameters are then the union of the results from running FindSymbols in symbolics and free-symbols modes, excluding the symbols found in defines mode.

However, Devito did not have the capability to find these symbols in Elements (since they can contain arbitrary cgen nodes and there had been no need for it before). As a result, when Expressions offloaded to OPS are substituted with OPS API calls, any symbol used in them is not found, leading to undeclared identifier errors at compile time: these symbols are used but never passed to the Operator C function.

To overcome this, Element nodes have been upgraded to be able to find the symbols contained inside them. This has been done for cgen.Initializers by simply defining the relevant properties to return the appropriate values taken from the node on the RHS, or on the LHS in the case of defines.

3.3.3 Parallel loop invocation

Once all the datasets are initialized, the next step is to invoke the ops_par_loop within the time loop. This replaces the space loop and is what tells OPS to perform the computation.

ops_par_loop takes as arguments the OPS kernel, the number of dimensions, the iteration space, and ops_args representing the arguments to the kernel. These will be created from the ops_dats initialized outside the loop and from the already known constants.

Kernel pointer

The first argument to ops_par_loop is a function pointer to the OPS kernel to be used in the loop. To pass this in, a FunctionPointer type has been created that simply prints out the function name.

Iteration space

The iteration space can be extracted from the IterationTree containing the expressions being moved into an OPS kernel. It is then used to initialize a SymbolicArray, in a similar way to what is described in 3.3.2, with the lower and upper bounds for each dimension as the elements of the array.

Kernel arguments

Kernel arguments are passed to the ops_par_loop call by nesting in it calls to ops_arg_dat in the case of datasets and ops_arg_gbl in the case of small global scalars or arrays.

In the case of globals, we pass to ops_arg_gbl a pointer to our data, the size (in our case, since the only globals we have are scalars, we default to 1), a string representing the C type of the data and whether the data is read and/or written (again, we are dealing with constant scalars, so we default to OPS_READ).

For datasets, we will have to pass the relevant ops_dat, an ops_stencil (defining what neighbouring data points are used during execution of the kernel over a point) and, similarly to the globals case, the C type (in string format) and the R/W setting.

In the case of TimeFunction arguments, as seen before, we have multiple ops_dats in an array. To index this array, the time indexing variables are used so that we rotate through the datasets correctly at every time step.

Stencil creation

OPS stencils define what data points are accessed in a dataset when the kernel is run for a single data point.

Taking a Devito expression before OPS translation as an example, we have that for

u[t1][x][y] = u[t0][x][y] + u[t0][x - 1][y]

the OPS stencil for ut0 would contain the offsets (0, 0) and (-1, 0), while the stencil for ut1 would contain only (0, 0).

To create these stencils, we loop over all the Indexed objects in an expression and analyse their accessors. To find them, retrieve_indexed is used, a function already present in Devito that simply visits the equation graph and returns a collection of all the Indexed objects found.

For each function that is indexed, a set is created. This contains tuples of tuples pairing each dimension being accessed with its offset. Looking at our previous example, we would have accesses[u] = {((t0, 0), (x, 0), (y, 0)), ((t1, 1), (x, 0), (y, 0)), ((t0, 0), (x, -1), (y, 0))}.

This is done for every equation that will go into the kernel, so that for every function we have a complete set of all accesses across equations. The next step is to generate ops_stencil objects. This is done through a call to ops_decl_stencil, which takes the dimensions, the number of points, an array containing a flattened list of these points and the stencil name as a string. The first two arguments are easily derived from the accesses set, while the name follows a simple convention: s{dimensions}d_{function_name}_{points}pt.

The array is created by iterating over the set of points for a function, extracting the offset values of the space dimensions and appending them to a list, which is then passed to a ListInitializer used to initialize a SymbolicArray of integers in a Devito Expression. The OPSStencil symbol is then initialized with a cgen.Initializer with the Call to ops_decl_stencil on the RHS, as previously described, and stored for later use in the ops_arg_dat call.

In the case of TimeFunctions we first need to group the elements by the accessor used in the time dimension, and then proceed as previously detailed. Looking at the earlier example, the sets to be used would be ut0 = {((x, 0), (y, 0)), ((x, -1), (y, 0))} and ut1 = {((x, 0), (y, 0))}, as in the sketch below.
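A plain-SymPy sketch of collecting these accesses and grouping them by time accessor (illustrative only; Devito uses retrieve_indexed on its own expression types):

from collections import defaultdict
import sympy as sp

x, y, t0, t1 = sp.symbols('x y t0 t1')
u = sp.IndexedBase('u')
eq = sp.Eq(u[t1, x, y], u[t0, x, y] + u[t0, x - 1, y])

accesses = defaultdict(set)
for idx in eq.atoms(sp.Indexed):
    time, *space = idx.indices
    offsets = tuple(int(i - v) for i, v in zip(space, (x, y)))
    accesses[f'u{time}'].add(offsets)

for name, points in sorted(accesses.items()):
    print(f's2d_{name}_{len(points)}pt', sorted(points))
# s2d_ut0_2pt [(-1, 0), (0, 0)]
# s2d_ut1_1pt [(0, 0)]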

For a diffusion operator with space order 2 the following stencils will be generated:

int s2d_ut0_5pt[10] = {0, 1, 1, 0, 0, -1, -1, 0, 0, 0};
ops_stencil S2D_UT0_5PT =
    ops_decl_stencil(2,5,(int *)s2d_ut0_5pt,"S2D_UT0_5PT");
int s2d_ut1_1pt[2] = {0, 0};
ops_stencil S2D_UT1_1PT =
    ops_decl_stencil(2,1,(int *)s2d_ut1_1pt,"S2D_UT1_1PT");
IET visitor extension: nested calls

Devito doesn't natively support nested calls (e.g. foo(bar())), so functionality to do so has been added during the course of this project.

The first step is to enable nested calls to be visited by IET visitors: this is done by defining the children property of the Call class to return a list of all arguments that are instances of Call.

Then, a visit_Call method needs to be created or modified for the relevant visitors: these define the logic applied when a Call node is visited.

The CGen visitor, used to generate the actual C code, has been updated so that when the argument string for a call is generated in the _args_check method, we recurse down whenever a Call argument is encountered and append the resulting code string for that subtree to the argument list.

Similarly, the FindSymbols visit_Call method has been updated so that, instead of simply applying the search rule to the current node, the Call arguments are also visited and the symbols found are added to the result set.
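A toy sketch of this pattern (not Devito's actual classes): exposing nested Call arguments as children lets a generic traversal reach them.

class Call:
    def __init__(self, name, arguments=()):
        self.name, self.arguments = name, tuple(arguments)
    @property
    def children(self):
        # Only arguments that are themselves Calls are visitable children.
        return tuple(a for a in self.arguments if isinstance(a, Call))

def find_calls(call):
    yield call.name
    for child in call.children:
        yield from find_calls(child)

loop = Call('ops_par_loop',
            ['Kernel0', Call('ops_arg_dat'), Call('ops_arg_gbl')])
print(list(find_calls(loop)))
# ['ops_par_loop', 'ops_arg_dat', 'ops_arg_gbl']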

By default, Devito assumes that a function call is not nested and that its result is not used, therefore it always uses a cgen.Statement (which adds a semicolon at the end of the generated text). However, when nesting calls we do not want the Call text to be rendered as a C statement, but as simple text: to define when this is the case, a semicolon field has been added to the Call class, to be used by the CGen visitor. This is also useful when initializing data using a cgen.Initializer, to avoid double semicolons on the initialization line.

3.4 Summary

This chapter has shown how to generate:

  • OPS kernels living in a header file that use OPS accessor macros

  • Declarations of OPS datasets, stencils and iteration ranges

  • Invocations of OPS kernels with the appropriate arguments

This is what is required to use the OPS source-to-source translator, compile the resulting code and link it with the OPS libraries.

4.1 Devito compilation and execution pipeline

When a user executes an Operator by invoking the apply method, the compilation process is triggered (if it is the first execution of the operator). This involves compiling the Devito-generated code into a shared library, loading it into Python and executing the entry point defined by Devito.

Code generation

To compile our code, the first step is transforming the IET into a C code string. This is done by retrieving the ccode property of the Operator, which uses the CGen visitor to generate C code from an IET node such as the Operator itself. The result is a string containing a complete and valid C file, with headers, type declarations and a function representing the entry point of the computation.

Compilation and execution

This C string is then passed to the jit_compile method, which performs just-in-time (JIT) compilation by calling the compile_from_string function from the codepy [19] library. A Compiler object (extending codepy's GCCToolchain class) is passed to compile_from_string to define which compiler to use and which compilation flags are needed. This generates a .so file that can be loaded into Python using NumPy's C-Types Foreign Function Interface [20]. Once loaded, the function can simply be executed as a normal Python function.
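The overall JIT pattern can be sketched as follows (a minimal stand-in using gcc and a throwaway kernel; Devito drives the equivalent steps through codepy):

import os
import subprocess
import tempfile

import numpy.ctypeslib as npct

src = 'int add(int a, int b) { return a + b; }'

workdir = tempfile.mkdtemp()
c_file = os.path.join(workdir, 'kernel.c')
with open(c_file, 'w') as f:
    f.write(src)

# Compile the C string into a shared object...
subprocess.check_call(['gcc', '-shared', '-fPIC', c_file,
                       '-o', os.path.join(workdir, 'kernel.so')])

# ...then load it through NumPy's ctypes FFI and call it directly.
lib = npct.load_library('kernel.so', workdir)
print(lib.add(2, 3))  # 5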

4.2 OPS compilation pipeline

OPS programs have a two-step compilation process. The first step runs the OPS translator, a source-to-source compiler that, given C++ source code using the OPS API, generates code targeting multiple platforms, including CUDA, MPI, OpenMP and OpenACC. The second step compiles the generated code and links it against the OPS libraries to create executables for the different platforms targeted.

The standard OPS workflow to execute these two steps uses Makefiles. For any new program, a user simply creates their own Makefile that includes the OPS-provided makefiles and sets a few variables describing where the code lives, as shown in the example Makefile in listing 2. Running make then starts the compilation process and generates all the needed executables.

The assumption that both the translator and the Makefiles make is that executables for all target platforms need to be generated. This makes the Makefiles quite complex to unravel if we only want to compile for a single platform manually (or, in our case, automatically within Devito) but, from a user perspective, quite straightforward to use.

include $(OPS_INSTALL_PATH)/../makefiles/Makefile.common
include $(OPS_INSTALL_PATH)/../makefiles/Makefile.mpi
include $(OPS_INSTALL_PATH)/../makefiles/Makefile.cuda
include $(OPS_INSTALL_PATH)/../makefiles/Makefile.hdf5

HEADERS=diffusion_so$(SPACE_ORDER).h

OPS_FILES=diffusion_so$(SPACE_ORDER).c

OPS_GENERATED=diffusion_so$(SPACE_ORDER)_ops.cpp

APP=diffusion_so$(SPACE_ORDER)
MAIN_SRC=diffusion_so$(SPACE_ORDER)

include $(OPS_INSTALL_PATH)/../makefiles/Makefile.c_ap
Listing 2: An example Makefile for an OPS application

4.3 OPS compilation in Devito

An integration that allows OPS code to run directly from Devito has been successfully implemented as part of this project. However, for the CUDA case, it relies on a minor change to how the OPS libraries are built.

Changes to OPS

The OPS build process has been altered so that the OPS CUDA libraries can be used in the creation of shared libraries loadable from Python. The required change was simply to add the -fPIC flag to the nvcc Xcompiler list of flags used to build the OPS libraries.

The -fPIC flag tells the compiler to generate position-independent code, so that the generated machine code does not depend on being placed at specific locations (for example, for jumps) and can therefore be used in shared libraries and dynamic linking.

Compiling in Devito

When compiling in Devito, the generated OPS code is written to files and, using the subprocess Python module, the OPS translator is invoked on them to generate all the required source files for the different target platforms.

Then, depending on the DEVITO_OPS_TARGET environment variable that defines what platform to target, the generated files are used to create a shared library that can be loaded and executed by Devito.

For most target platforms the process is then very similar to what is described in section 4.1, with the difference that we are compiling multiple files and the code strings passed to compile_from_string have to be read from disk, since what we are now compiling are the files generated by the OPS translator rather than the Devito-generated code. For each OPS target platform a new Compiler class has been created that replicates what the OPS Makefile does for that platform.

CUDA compilation

Compiling for CUDA differs slightly in that the device code (the actual CUDA kernel) needs to be compiled with nvcc first, and the resulting object linked when compiling the host code. The compilation pipeline therefore looks a bit different in this case.

Two Compiler classes have been created, one for the host (OPSCUDAHostCompiler) and one for the device (OPSCUDADeviceCompiler). These are used to generate object files for the C++ host code and the CUDA kernel respectively, by setting the object parameter to True when calling compile_from_string. The two objects are then linked using the link_extension method on the host compiler, passing in the two .o files.

Figure 4.1: CUDA compilation pipeline in Devito
OPS translator concerns

The biggest obstacle to a clean integration is the OPS translator. Currently this component has two big flaws: it uses Python 2.7 (which will stop being maintained in 2020 [21]) and it is organized as a monolithic script for standalone execution, making it cumbersome to use within Devito.

Upgrading the translator to Python 3 would be a good first step towards simplifying its integration within Devito, but a better solution would be a complete rework of the OPS translator as a proper Python module. It would need to expose methods to translate OPS code for user-defined target platforms without having to touch the file system (currently, it generates code for all platforms supported by OPS and writes all the resulting files to disk), granting the user more refined control.

This would allow easier use of the translator within Devito to target specific platforms using OPS, facilitating the use of the Devito Compilers, which currently require Devito to read the generated files from disk.

5.1 Performance evaluation

5.1.1 Hardware used

CUDA benchmarks

To run CUDA benchmarks, a dedicated machine has been used. The specifications are as follows:

CPU: Intel Core i7-4770k
GPU: NVIDIA GTX 1080
CUDA driver version: 9.1

Specifically, the GPU specifications are as follows [22]:
CUDA Cores: 2560
Graphics clock: 1607 MHz
Processor clock: 1733 MHz
Memory size: 8 GB
Memory interface width: 256 bit
Memory bandwidth: 320 GB/s
Single precision performance (theoretical): 8228 GFLOPS

OpenMP benchmarks

To run OpenMP benchmarks, an Azure H8 instance has been used. The relevant specifications for this machine are as follows [23]:

CPU: Intel Xeon E5 2667 v3 (8 vCPU)
Memory: 56 GB
Memory bandwidth: 40 GB/s

5.1.2 Examined problem

The problem used to benchmark the OPS backend is the diffusion equation, a simple differential equation that exercises all the features implemented in the Devito OPS backend and is complex enough to give a good indication of the performance to expect with the more common, and more complex, seismic imaging problems.

The diffusion equation is expressed as follows:

    ∂u/∂t = v (∂²u/∂x² + ∂²u/∂y²)

and can be expressed in Devito as such:

from devito import Eq, Grid, Operator, TimeFunction, solve

# nx, ny, so (space order) and v (diffusion coefficient) are assumed
# to be defined beforehand.
grid = Grid(shape=(nx, ny))
u = TimeFunction(name='u', grid=grid, time_order=1, space_order=so)
eqn = Eq(u.dt, v * (u.dx2 + u.dy2))
stencil = solve(eqn, u.forward)
op = Operator(Eq(u.forward, stencil))

For the purposes of our benchmarks, the main parameter is the space order, as increasing it generates more operationally intensive and complex code, giving a range of useful results.

Other parameters being varied are the grid size and the Devito Symbolic Engine (DSE) optimization level. All benchmarks run for 1000 timesteps.
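A hedged sketch of the benchmark sweep (the grid size, space orders and diffusion coefficient below are illustrative placeholders, not the exact benchmarked values):

from devito import Eq, Grid, Operator, TimeFunction, solve

def run_diffusion(shape, so, v=0.5, dt=1e-5, nt=1000):
    grid = Grid(shape=shape)
    u = TimeFunction(name='u', grid=grid, time_order=1, space_order=so)
    eqn = Eq(u.dt, v * (u.dx2 + u.dy2))
    op = Operator(Eq(u.forward, solve(eqn, u.forward)))
    op(time=nt, dt=dt)  # run for nt timesteps

for so in (2, 4, 8, 12):  # sweep over space orders
    run_diffusion((2048, 2048), so)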

5.1.3 Benchmarking methodology

opescibench [24] has been used to run our benchmarks and plot our results. It is a tool specifically designed to benchmark Devito operators, providing objects to run parametrized operators, collect performance results and plot them.

To enable profiling of information such as operational intensity and GFLOPS, advanced profiling has been turned on by setting the DEVITO_PROFILING environment variable; the default profiling setting only collects timing information.

To benchmark the Devito core backend the latest official version of Devito (at the time of writing) has been used. This can be found on GitHub at https://github.com/opesci/devito. The OPS backend has been developed independently on a fork of the Devito repository, as the explorative nature and time frame of the project did not allow it to be merged in the master Devito repository. The Devito+OPS backend repository can be found at https://github.com/vincepandolfo/devito.

5.1.4 Roofline model

To present our results, the roofline performance model [25] has been used. It provides an easy-to-understand visual reference for the memory and compute bounds of the observed machine in relation to the operational intensity (FLOP/byte ratio) of a computation.

Operational intensity and GFLOPS for every run are calculated automatically by Devito and plotted on the roofline, annotated with the total runtime and the space order (SO) of the run.

For the memory bounds, the vendor-provided memory bandwidths are used. For the compute bounds, both the theoretical maximum performance (marked on the plots as ideal peak) and the measured performance have been used.

To obtain measured performance bounds, the LINPACK benchmark [26] and a gravitational n-body simulation [27] (bundled with the CUDA samples) have been run on the Xeon CPU and the GTX 1080 GPU respectively.

5.1.5 CUDA results

At first, benchmarks were run using only Devito's default DSE mode, advanced. This produced the results shown in figure 5.1, with a maximum peak utilization of 20% for the lowest space order and a minimum of 7% for the highest. The results for the other grid size are similar.

Figure 5.1: OPS backend, CUDA, advanced DSE (panels (a) and (b) show the two grid sizes)

After discussing the possible causes of this with the Devito and OPS developers, it was suggested to optimize the kernels by reducing the number of divisions, as division is an extremely expensive operation on the GPU.

This reduction can be performed by the Devito Symbolic Engine (DSE) in aggressive mode: among other optimizations, it substitutes common divisions with multiplications, as shown in listing 3. However, for more complex kernels the DSE in aggressive mode would apply further transformations that would not necessarily be beneficial to performance on a GPU.

This is not done by default in Devito as it is not necessary on CPUs, but on GPUs it makes a great difference. This suggests that some exploration of which optimizations are most beneficial is necessary in order to create a GPU-specific DSE mode.

Figure 5.2: OPS backend, CUDA, aggressive DSE (panels (a) and (b) show the two grid sizes)

Running our experiments again with the outlined changes to the kernels gives much better results, as shown in figure 5.2. The maximum and minimum peak utilization are now between 3 and 4 times higher than in the advanced case, with similar results on the other grid size: a very significant increase in performance.

These results are promising and warrant further exploration of OPS by the Devito team. It is worth noting that the problem examined is simple in nature and does not represent the complexity found in most seismic imaging problems; however, these results are still indicative of the performance to expect in the main loop of such problems.

void Kernel0(const float * ut0, float * ut1, const float *dt,
    const float *h_x, const float *h_y)
{
  float r0 = 1.0F / (*h_y**h_y);
  float r1 = 1.0F / (*h_x**h_x);
  ut1[OPS_ACC1(0,0)] = -1.0F*(*dt*ut0[OPS_ACC0(0,0)]*r1 +
    *dt*ut0[OPS_ACC0(0,0)]*r0) + 5.0e-1F*(*dt*ut0[OPS_ACC0(-1,0)]*r1 +
    *dt*ut0[OPS_ACC0(1,0)]*r1 + *dt*ut0[OPS_ACC0(0,-1)]*r0 +
    *dt*ut0[OPS_ACC0(0,1)]*r0) + ut0[OPS_ACC0(0,0)];
}
Listing 3: 2D diffusion kernel with space order 2 (aggressive DSE)

5.1.6 OpenMP results

Since OPS targets multiple platforms, it is worth investigating its performance on CPUs compared to what is already offered by Devito: if good results were obtained, OPS could potentially substitute the core Devito backend. This benchmark focuses on performance using OpenMP.

OpenMP is targeted by both the Devito core and OPS backends, therefore it is easy to draw a performance comparison between the two using the same code. To enable OpenMP in the core backend, the DEVITO_OPENMP environment variable must be set to 1.

Benchmarks have only been run for the larger grid size, as a smaller grid would be completely contained within the cache, making no use of any memory access optimization that might be performed by the different backends.

Figure 5.3: core backend, OpenMP ((a) advanced DSE, (b) aggressive DSE)

Figure 5.3 shows the results for this run, with the maximum peak percentage achieved at space order 4 and the minimum at space order 12 when setting the DSE to advanced mode. As suggested in the previous section, using the aggressive DSE did not provide any benefit in the CPU case, and actually yielded worse performance for the highest space order.

The same benchmarks have then been run using the OPS backend targeting OpenMP (this is done by setting the DEVITO_OPS_TARGET environment variable to OpenMP, as described in 4.3).

The results were quite disappointing across both DSE modes, with runtimes from 10 to 25 times slower than the core backend.

These results, if confirmed by further exploration and not found to be due to problems in the Devito OPS backend, suggest that it might not be worth substituting the core OpenMP backend with OPS as no benefit would be gained by doing so.

Figure 5.4: OPS backend, OpenMP ((a) advanced DSE, (b) aggressive DSE)

5.1.7 Summary

While the observed performance of the OPS backend on CPU is less than promising, very good results were obtained with CUDA, with good percentages of peak performance and execution times 4 to 7 times faster than the core backend on the hardware used.

5.2 Software evaluation

One of the considerations to make when adopting a new library is its soundness as a piece of software. In the case of this project this is even more important, as the investigated library would constitute a critical part of the GPU backend if adopted by Devito.

The questions we will try to answer in this section are:

  • Is OPS being actively developed?

  • Is it easy to use?

  • Is it well engineered?

  • Does it deliver the required features?

Answering these questions will give important insight into the value of OPS, not only as a performance library but as a usable piece of software.

Is OPS being actively developed?

Yes, as evidenced by the GitHub insights in figure 5.5.

Figure 5.5: GitHub insights for the OPS repository (accessed 15th Jun 2019)
Is it easy to use?

As mentioned in section 4.3, there are some concerns about the usability of the OPS translator, the main one being that it is written in Python 2.7 and therefore not directly usable from Devito (which is written in Python 3). Another usability concern is that it acts as a black box, with no control given to the user, who is therefore forced to manually read and manipulate files instead of being able to select what code to generate and receive it in a convenient form (such as a string).

These problems likely stem from the fact that OPS has been developed as a DSL for complete applications, without planning for its use within other code generation libraries.

However, it would not be too difficult for an experienced Python developer to create new interfaces providing this kind of functionality, thereby increasing the usability of the OPS translator within other code generation libraries such as Devito.

Is it well engineered?

The internal structure of the translator reflects its nature as a black box: it is structured as a script rather than a library, possibly because OPS was conceived as a DSL for complete applications rather than as an intermediate representation for other libraries. It is separated into multiple files, each containing a function that generates source code for one target platform. These functions are hundreds of lines long and mostly mirror the final content of the generated files (minus the parts that change depending on the user input).

The code is therefore quite simple to understand, as it is a sequence of string concatenations, but its sheer size makes it difficult to maintain. This design could be improved by introducing OOP concepts to create a higher-level definition (such as an abstract syntax tree) of what the generated code looks like, for example by exploiting a code generation library such as cgen [10].

Does it deliver the required features?

As shown in the previous section, while the results with OpenMP were not promising, OPS has shown good performance on GPUs using CUDA, which is what we are most interested in. We would therefore say that it does deliver the features required by Devito.

While the software quality of some components of OPS might not be excellent, it is definitely possible to improve it significantly without excessive effort. Even at the current state, as shown in section 4.3, integration within Devito is not excessively complicated. OPS does provide the features needed by Devito to target GPUs and therefore can be considered a valid candidate to be integrated within Devito.

6.1 Conclusions

This work has investigated the OPS intermediate representation as a way of allowing Devito operators to be executed on GPU architectures. In particular the contributions made are:

  • Implemented an OPS backend in Devito

    • This provides the core features required to use OPS in Devito, including code generation, compilation and execution. It also gives Devito developers a starting point to bring the OPS backend into production and conduct further experiments on GPUs.

  • Performance evaluation of the OPS backend

    • This has shown significant speedups on GPUs compared to the same operators executed on CPUs with the core Devito backend. The performance observed on GPUs has been satisfactory and therefore encourages further work in this space.

  • Software evaluation of OPS

    • The main flaws in OPS as a piece of software have been outlined, along with possible ways of mitigating them and improving the software quality of the OPS library, in particular of the OPS translator. While these flaws are a reason for concern, they should not stop the adoption of the OPS API in Devito.

Given what has been presented in this report, it is the opinion of the writer that the OPS intermediate representation is a valid path for Devito to target GPUs; further exploration in this space to fully integrate OPS in future Devito releases is therefore recommended.

6.2 Future work

In this section we detail some limitations of what is presented in this thesis and how these could be addressed with further evaluation.

The necessary steps to bring the OPS backend in production are then detailed.

6.2.1 Further evaluation

While the results presented in this report are fairly indicative of what performance could be expected in applications targeted by Devito, benchmarks for common seismic operators have not been performed due to time constraints that did not allow the required features to be developed.

Currently, the OPS backend is not able to run these operators correctly. This is due to them having parts of the computation within the time loop that are not offloaded to OPS. In some cases (for example, the computation for sources and receivers in the acoustic wave equation problem) offloading is not actually possible, so additional features to manipulate OPS-owned data as needed would be required.

Implementing these features would allow more experimental work to be performed, giving a clearer picture of GPU performance for the problem class targeted by Devito.

6.2.2 Adoption

The code developed as part of this project is experimental and has not gone through the Devito development cycle to be merged in the master branch for use in production. However, it provides a solid base from where to start integration of OPS in the next release of Devito.

This would include:

  • Rewriting some parts of the code to better align with the Devito design

  • Splitting the already developed backend in smaller individual features to allow easier code reviews

  • Writing a comprehensive testing suite for the new features provided by the OPS backend

  • Developing the missing features outlined in the previous section

This should bring Devito to a stage where it can target GPUs effectively, with the added bonus of not having to directly worry about GPU performance optimizations (apart from, possibly, very specific cases), as these will be offloaded to OPS.

References

  • [1] F. Rathgeber, D. A. Ham, L. Mitchell, M. Lange, F. Luporini, A. T. T. Mcrae, G.-T. Bercea, G. R. Markall, and P. H. J. Kelly, “Firedrake: Automating the finite element method by composing abstractions,” ACM Trans. Math. Softw., vol. 43, no. 3, pp. 24:1–24:27, 2016.
  • [2] M. S. Alnæs, J. Blechta, J. Hake, A. Johansson, B. Kehlet, A. Logg, C. Richardson, J. Ring, M. E. Rognes, and G. N. Wells, “The fenics project version 1.5,” Archive of Numerical Software, vol. 3, no. 100, 2015.
  • [3] M. S. Alnæs, A. Logg, K. B. Ølgaard, M. E. Rognes, and G. N. Wells, “Unified form language: A domain-specific language for weak formulations of partial differential equations,” ACM Transactions on Mathematical Software, vol. 40, no. 2, 2014.
  • [4] M. Louboutin, M. Lange, F. Luporini, N. Kukreja, P. A. Witte, F. J. Herrmann, P. Velesko, and G. J. Gorman, “Devito: an embedded domain-specific language for finite differences and geophysical exploration,” CoRR, vol. abs/1808.01995, Aug 2018.
  • [5] C. Yount, J. Tobin, A. Breuer, and A. Duran, “Yask—yet another stencil kernel: A framework for hpc stencil code-generation and tuning,” 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), pp. 30–39, 2016.
  • [6] I. Z. Reguly, G. R. Mudalige, M. B. Giles, D. Curran, and S. McIntosh-Smith, “The ops domain specific abstraction for multi-block structured grid computations,” in Proceedings of the Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, WOLFHPC ’14, (Piscataway, NJ, USA), pp. 58–67, IEEE Press, 2014.
  • [7] A. Meurer, C. P. Smith, M. Paprocki, O. Čertík, S. B. Kirpichev, M. Rocklin, A. Kumar, S. Ivanov, J. K. Moore, S. Singh, T. Rathnayake, S. Vig, B. E. Granger, R. P. Muller, F. Bonazzi, H. Gupta, S. Vats, F. Johansson, F. Pedregosa, M. J. Curry, A. R. Terrel, v. Roučka, A. Saboo, I. Fernando, S. Kulal, R. Cimrman, and A. Scopatz, “Sympy: symbolic computing in python,” PeerJ Computer Science, vol. 3, p. e103, Jan. 2017.
  • [8] “Devito cfd tutorial series.” https://nbviewer.jupyter.org/github/opesci/devito/blob/master/examples/cfd/01_convection.ipynb. Accessed: 24th Jan 2019.
  • [9] F. Luporini, M. Lange, M. Louboutin, N. Kukreja, J. Hückelheim, C. Yount, P. A. Witte, P. H. J. Kelly, G. J. Gorman, and F. J. Herrmann, “Architecture and performance of devito, a system for automated stencil computation,” CoRR, vol. abs/1807.03032, 2018.
  • [10] “Cgen - c/c++ source generation from an ast.” https://github.com/inducer/cgen. Accessed: 25th Jan 2019.
  • [11] C. Yount, “Vector folding: Improving stencil performance via multi-dimensional simd-vector representation,” in 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems, pp. 865–870, Aug 2015.
  • [12] C. Yount, A. Duran, and J. Tobin, “Multi-level spatial and temporal tiling for efficient hpc stencil computation on many-core processors with large shared caches,” Future Generation Computer Systems, vol. 92, pp. 903 – 919, 2019.
  • [13] A. Klöckner, “Loo.py: transformation-based code generation for gpus and cpus,” CoRR, vol. abs/1405.7470, 2014.
  • [14] S. Verdoolaege, “isl: An integer set library for the polyhedral model,” in Mathematical Software – ICMS 2010 (K. Fukuda, J. v. d. Hoeven, M. Joswig, and N. Takayama, eds.), (Berlin, Heidelberg), pp. 299–302, Springer Berlin Heidelberg, 2010.
  • [15] D. Unat, X. Cai, and S. B. Baden, “Mint: Realizing cuda performance in 3d stencil methods with annotated c,” pp. 214–224, 01 2011.
  • [16] B. Hagedorn, L. Stoltzfus, M. Steuwer, S. Gorlatch, and C. Dubach, “High performance stencil code generation with lift,” in Proceedings of the 2018 International Symposium on Code Generation and Optimization, CGO 2018, (New York, NY, USA), pp. 100–112, ACM, 2018.
  • [17] M. Steuwer, T. Remmelg, and C. Dubach, “Lift: A functional data-parallel ir for high-performance gpu code generation,” in 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 74–85, Feb 2017.
  • [18] V. Mickus and V. Pandolfo, “Ops expressions translation #760.” https://github.com/opesci/devito/pull/760. Accessed: 2nd Jun 2019.
  • [19] A. Kloeckner, “codepy.” Accessed: 7th June 2019.
  • [20] “C-types foreign function interface (numpy.ctypeslib).” https://docs.scipy.org/doc/numpy/reference/routines.ctypeslib.html. Accessed: 10th June 2019.
  • [21] “Pep 373 python 2.7 release schedule.” https://legacy.python.org/dev/peps/pep-0373/. Accessed: 7th June 2019.
  • [22] NVIDIA, “Geforce gtx 1080 | specifications.” https://www.geforce.co.uk/hardware/desktop-gpus/geforce-gtx-1080/specifications. Accessed: 6th June 2019.
  • [23] “Azure linux vm sizes - hpc | microsoft docs.” https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-hpc. Accessed: 13th June 2019.
  • [24] “opescibench.” https://github.com/opesci/opescibench. Accessed: 6th June 2019.
  • [25] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for floating-point programs and multicore architectures,” tech. rep., Lawrence Berkeley National Lab.(LBNL), Berkeley, CA (United States), 2009.
  • [26] J. J. Dongarra, “Performance of various computers using standard linear equations software,” SIGARCH Comput. Archit. News, vol. 20, pp. 22–44, June 1992.
  • [27] L. Nyland, M. Harris, and J. Prins, “Fast n-body simulation with cuda,” GPU Gem, Vol. 3, pp. 677–695, 01 2009.