1 Introduction
High Performance Computing (HPC) systems and architectures are evolving rapidly. Traditional single processorbased CPU clusters are moving towards multicore/multithreaded CPUs. At the same time new architectures based on manycore processors such as graphics processing units (GPUs) and Intel’s Xeon Phi are emerging as important systems and further developments are expected with energyefficient designs from ARM and IBM. According to the IT industry, such advances are expected to deliver compute hardware capable of exascaleperformance (i.e. floatingpoint operations per second) by 2018 (Thibodeau_2009, ). Yet many frameworks aimed at computational/numerical modelling are currently not ready to exploit such new and potentially disruptive technologies.
Traditional approaches to numerical model development involve the production of static, handwritten code to perform the numerical discretisation and solution of the governing equations. Normally this is written in a language such as C or Fortran that is considerably less abstract when compared to a nearmathematical domain specific language. Explicitly inserting the necessary calls to MPI or OpenMP libraries enables the execution of the code on multicore or multithread hardware. However, should a user wish to run the code on alternative platforms such as GPUs, they would likely need to rewrite large sections of the code, including calls to new libraries such as CUDA or OpenCL, and optimise it for that particular hardware backend (Rathgeber_etal_2012, ). As HPC hardware evolves, an increasing burdon faced by computational scientists becomes apparent; in order to keep up with trends in HPC, not only must a model developer be a domain specialist in their area of study, but also an expert in numerical algorithms, software engineering, and parallel computing paradigms (JacobsPiggott_2015, ; Rathgeber_etal_Submitted, ).
One way to address this issue is to introduce a separation of concerns using high level abstractions, such as domain specific languages (DSLs) and active libraries (Rathgeber_etal_Submitted, ; LoggWells_2010, ; Logg_etal_2012, ; Alnaes_etal_2014, ; Luporini_etal_2015, ). This paradigm shift allows a domain specialist to describe their problem as a highlevel, nearmathematical specification. The task of taking this specification and transforming it into executable computer code can then be handled in the subsequent abstraction layer; unlike the traditional approach of handwriting the C/Fortran code that discretises the governing equations, this layer generates the code automatically from the problem specification. Finally, the generated code can be readily targetted towards a specific hardware platform through sourcetosource translation. Hence, domain specialists focus on the equations they wish to solve and the setup of their problem, whilst the parallel computing experts can introduce support for new backends as they become available. At no point does the code have to undergo a fundamental rewrite if the desired backend changes. Use of such strategies can have significant benefits for the productivity of both the user and developer, by removing the need to spend time rewriting code and/or the problem specification (LoggWells_2010, ).
Given the motivation for the use of automated solution techniques, in this paper we present a new framework, OpenSBLI, for the automated derivation and parallel execution of finite differencebased models. This is an opensource release of the recent developments in the SBLI codebase developed at the University of Southampton, involving the replacement of SBLI’s Fortranbased core with flexible Pythonbased code generation capabilities, and the coupling of SBLI to the OPS active library (Giles_etal_2015, ; Reguly_etal_2014, ; Mudalige_etal_2014, ; Jammy_etal_2015, ) which targets the generated code towards a particular backend using sourcetosource translation. Currently, OpenSBLI can generate OPScompliant C code to discretise and solve the governing equations, using arbitraryorder central finite difference schemes and a choice of either the forward Euler scheme or a thirdorder RungeKutta timestepping scheme. OpenSBLI then uses OPS to produce code targetted towards different backends. It is worth noting that backend APIs such as OpenMP (version 4.0 and above) are also capable of running on CPU, GPU and Intel Xeon Phi architectures, for example. However, currently OPS has no support for OpenMP version 4.0 and above. Moreover, codes that are written by hand in OpenMP would still potentially need to be rewritten if different algorithms or equations were to be considered. Thus, the benefits of code generation still play a crucial role here, regardless of which backend is chosen.
The application of SBLI has sofar concentrated on problems in aeronautics and aeroacoustics, in particular looking at shockboundary layer interactions (see e.g. TouberSandham_2009 ; DeTullioSandham_2010 ; Redford_etal_2012 ; Wang_etal_2015 and the references therein for more details). While such applications entail solving the 3D compressible NavierStokes equations, in principle other equations expressible in Einstein notation and solved using finite differences are also supported by the new code generation functionality, highlighting another advantage of such a flexible approach to numerical model development. Note also that while OpenSBLI does not yet feature shockcapturing schemes and Large Eddy Simulation models (unlike the legacy SBLI code), these will be implemented in the future as part of the project’s roadmap. The main purposes of this initial release is the algorithmic changes to legacy SBLI’s core.
2 Design
Legacy versions of SBLI comprise static handwritten Fortran code, parallelised with MPI, that implements a fourthorder central differencing scheme and a lowstorage, third or fourthorder RungeKutta timestepping routine. It is capable of solving the compressible NavierStokes equations coupled with various turbulence parameterisations (e.g. Large Eddy Simulation models) and diagnostic routines. In contrast, OpenSBLI is written in Python, and by replacing the legacy core with modern code generation techniques, the existing functionality of SBLI is enriched with new flexibility; the compressible NavierStokes equations can still be solved in OpenSBLI for the sake of continuity, but the set of equations that can be readily solved essentially becomes a superset of that of the legacy code. Furthermore, the use of the OPS library allows the generated code to easily be targetted towards sequential, MPI, or an MPI+OpenMP hybrid backend (for CPU parallel execution), CUDA and OpenCL (for GPU parallel exection), and OpenACC (for parallel execution on accelerators), without the need to rewrite the model code. OPS is readily extensible in terms of new backends, making the code generation technique an attractive way of futureproofing the codebase and preparing the framework for exascalecapable hardware when it arrives. The main achievement of OpenSBLI is the ability to express model equations at a highlevel with the help of the SymPy library (SymPy_2016, ), expanding the equations based on the index notation, and coupling this functionality with the generation of OPSCbased model code and also with the OPS library which performs code targetting. OpenSBLI’s focus on the generation of computational kernels essentially forms a bridge between the highlevel equations and the computational parallel loops (‘parloops’) that iterate over the grid points to solve the governing equations.
For any given simulation that is to be performed with OpenSBLI, the problem (comprising the equations to be solved, the grid to solve them on, their associated bondary and initial conditions, etc) must be defined in a setup file, which is nothing but a Python file which instantiates the various relevant components of the OpenSBLI framework. All components follow the principle of objectoriented design, and each class is explained in detail throughout the subsections that follow. An overview of the class relationships is also provided in Figure 1.
2.1 Equation specification
In a similar fashion to other problem solving environments such as OpenFOAM (OpenFOAM_2014, ), Firedrake (Rathgeber_etal_Submitted, ), FEniCS (LoggWells_2010, ; Logg_etal_2012, ), OPESCIFD (Sun_2016, ), Devito (Kukreja_etal_2016, ; Lange_etal_2016, ), deal.II (Bangerth_etal_2007, ) and FreeFEM++ (Hecht_2012, ), OpenSBLI comprises a highlevel interface for specifying the differential equations that are to be solved. These equations (and any accompanying formulas for temperaturedependent viscosity, for example) can be expressed in Einstein notation, also known as index notation. The adoption of such an abstraction is advantageous since it removes the need for the user to expand the equations by hand which can be an errorprone task. Furthermore, much like the Devito domain specific language (DSL) (Kukreja_etal_2016, ; Lange_etal_2016, ) for finite difference stencil compilation, OpenSBLI makes use of the SymPy symbolic algebra library that supplies the basic components required for the modelling functionality that has been implemented in the present work. This functionality includes the automatic expansion of indices based on their contraction structure, such that repeated indices are expanded into a sum about that index, and the implementation of various types of differential operator.
2.1.1 Expressing
Consider the conservation of mass equation
(1) 
where is the
th component of the velocity vector
, is the density field, and is the coordinate field in the th dimension. In an OpenSBLI problem setup file, the user would specify this as a string, giving the lefthand side and righthand side of the equation in the following format:mass = "Eq(Der(rho, t), Conservative(rho*u_j, x_j))"
The functions Der and Conservative here are OpenSBLIspecific derivative operators, each defined in their own class derived from SymPy’s Function class. Other highlevel interfaces such as OpenFOAM offer similar differential operators such as div and grad, for example (OpenFOAM_2014, ). General derivatives are represented using the Der operator, whereas the Conservative
operator ensures that the derivative will not be expanded using the product rule. A skewsymmetric form of the derivative is also available using the
Skew function, discussed later in Section 3.3. All of these are essentially ‘handler’/placeholder objects that OpenSBLI uses for spatial/temporal discretisation after parsing and expanding the equations about the Einstein indices. Special functions such as the Kronecker delta function and the LeviCivita symbol are also available, derived from SymPy’s LeviCivita and KroneckerDelta classes in order to handle Einstein expansion; these too are expanded later by OpenSBLI.2.1.2 Parsing
Once all of the governing equations have been expressed by the user in string format, they are collected together in OpenSBLI’s Problem class (see A). This class also accepts substitutions, formulas, and constants. For long equations, such optional substitutions
(such as the definition of the stress tensor) can be written as a separate string (in the same way as the governing equations) to allow better equation readability, and then automatically substituted into the equations (such as the conservation of momentum and energy equations) at expansiontime instead of performing such errorprone manipulations by hand. The constitutive equations which define a relationship between the prognostic and nonprognostic variables are given as
formulas, for example temperaturedependent viscosity relations, and an equation of state for pressure. The constants are the spatially and temporally independent variables which are represented as strings. Upon instantiation of the Problem class, the process is invoked to transform the equations into their final expanded form.For each equation in string form, a new OpenSBLI Equation object is created. During its initialisation, SymPy’s parse_expr function converts the equation string into a SymPy Eq data type. Any of the OpenSBLI derivative operators such as Der and Conservative (currently in string format) are replaced by actual instances of the Der and Conservative classes. Similarly, any substitutions given in the Problem are parsed and substituted directly into the expression using SymPy’s xreplace function. All other terms in the parsed expression are represented by OpenSBLI’s EinsteinTerm class, derived from SymPy’s Symbol class, which contains its own methods and attributes for determining/expanding Einstein indices. For example, the class’s initialisation method __init__ splits up the term u_j where there are underscore markers, and stores the Einstein index j in a list as a SymPy Idx object. The get_expanded method later replaces the alphabetical Einstein indices with actual numerical indices, replacing _j with 0 and 1, in the 2D case. Finally, any constants in the Problem object are also represented as an EinsteinTerm object, but are flagged as constant terms in OpenSBLI, so that they are not spatially or temporallydependent. The coordinate vector components (and the time term ) are a special case of an EinsteinTerm; these are marked with a is_coordinate flag so that, during the expansion phase, the EinsteinTerms are made dependent on the coordinate field (and time, if appropriate) to ensure that differentiation is performed correctly.
2.1.3 Expanding
After the parsing and substitution stage, the equations are expanded about repeated indices. Note that this process is performed by OpenSBLI, although various SymPy classes underpin the functionality. Following the example, (1) would be expanded as
(2) 
OpenSBLI loops over each EinsteinTerm stored in the parsed Equation object, and maps it to a SymPy Indexed object. For example, the term u_k would first be mapped to u[k]. The index k in the term is then expanded over 0, , (where is the dimension of the problem) by replacing it with each integer dimension, yielding a SymPy MutableDenseNDimArray array of size (for a vector function, or for a tensor of rank 2) of expanded variables which is stored as a class attribute. For example, expanding the vector u[k] yields the expansion array [u0, u1] in 2D. Upon expansion, the terms are also made spatiallydependent (i.e. indexed by , , coordinates, depending on the dimension) and, if applicable, temporallydependent (i.e. indexed also by ). The only exceptions to this are constants such as the Reynolds number . The expansion array from the previous example then becomes [u0[x0, x1, t], u1[x0, x1, t]] (and [x0, x1] for the constant coordinate field).
Each equation is expanded by locating any repeated indices and then summing over them as appropriate. For example, after mapping each EinsteinTerm (e.g. u_k) to an Indexed object (e.g. u[k]), the mass equation is represented internally as
Eq(Der(rho, t), Conservative(rho*u[k], x[k]))
Since the index k is repeated, the expansion arrays are used to expand this expression to
Eq(Der(rho[x0, x1, t], t), Conservative(rho[x0, x1, t]*u0[x0, x1, t], x0[x0, x1, t])  Conservative(rho[x0, x1, t]*u1[x0, x1, t]), x1[x0, x1, t]))
Finally, the Der and Conservative functions are applied, with the expression becoming
Eq(Derivative(rho[x0, x1, t], t), Derivative(rho[x0, x1, t]*u0[x0, x1, t], x0)  Derivative(rho[x0, x1, t]*u1[x0, x1, t], x1))
which is equivalent to (2). Similar expansion can also be applied for any other equations involving e.g. diagnostic fields. Note how the calls to Der and Conservative have been replaced by calls to SymPy’s Derivative class (which in turn uses SymPy’s diff function); while it is SymPy that handles the differentiation, it is OpenSBLI that handles the exact formulation of the derivative (i.e. OpenSBLI has ensured that the derivative has not been expanded using the product rule here).
Any nested derivatives are also handled here. It is not currently possible to specify, for example, diff(diff(u_j, x_i), x_j) using SymPy’s diff function directly because the fact that u_j is dependent on x_i and x_j is not taken into account. In contrast, the use of Der and EinsteinTerms like u_j in OpenSBLI allows the derivative to be computed correctly since the terms are made dependent through the use of Indexed objects as previously described. OpenSBLI users must instead use the Der function Der(Der(u_j, x_i), x_j). For each nested derivative (or nested function in general), the inner function is evaluated first along with all other nonnested functions. Only then is the outer function applied.
For the purposes of debugging, OpenSBLI includes a LatexWriter class that takes the expanded equations as input and writes them out in LaTeX format so developers can more easily spot errors, for example where indices have been expanded incorrectly.
2.2 Grid
The governing equations are discretised on a regular grid of solution points that span the domain of interest; an example is provided in Figure 2. All gridrelated functionality is handled by the Grid class, which must be instantiated by the user in the problem setup file. The dimensionality of the problem , the number of points in each dimension, and the grid spacing must all be supplied. A problem of dimension would generate a grid of solution points in total, where represents the userdefined number of grid points in direction .
For the sake of looping over each solution point and computing the necessary derivatives via the finite difference method, each (nonconstant) term is processed further by OpenSBLI; the index of each spatial coordinate (e.g. x0) is mapped onto an index over the grid points in that spatial direction (e.g. i0) which will iterate from 0 to (for a given direction ) when the computational kernel is eventually generated.
In addition to the solution points within the physical domain, a set of halo points (or ‘ghost’ points), which border the outermost grid points, are also created automatically depending on the boundary conditions and the spatial order of accuracy. These halo points are necessary to ensure that the derivatives near the boundary can be computed with the same stencil as the ‘inner’ points. The exact number of halo points required therefore depends on the number of stencil points; for example, in Figure 2 the stencil for a secondorder central difference (using 3 points in each direction) would require one halo point at each end of the domain. The values that these halo points hold depend on the type of boundary condition applied, and this is discussed in more detail in Section 2.6.
Every field/term in the governing equations that is represented by the grid indices holds a socalled ‘work array’ which essentially contains the field’s numerical value at each of the grid points, including the halos. The implementation of initial and boundary conditions is done by accessing and modifying this work array, as will be described in Sections 2.5 and 2.6.
2.3 Computational kernels
The Kernel
class defines a sequence of computational steps that should be performed to solve the governing equations. For instance, one kernel may be created to compute the spatial derivative of a field, while another kernel handles the initialisation of the field values based on a given initial condition, and another handles the enforcement of boundary conditions that involve computations. During the instantiation of a kernel, the relevant variables and fields are classified as inputs, outputs and input/outputs (i.e. both an input and an output), and the kernel’s range of evaluation (i.e. the range of grid indices over which the kernel is applied). This helps to minimise data transfer, since only those variables/fields required to perform the computation are passed to the generated kernel code.
2.4 Discretisation schemes
Once a grid is created, the equations are discretised upon that grid. For spatial discretisation purposes OpenSBLI offers a central differencing scheme for first and secondorder derivatives; all the stencil coefficients are computed using SymPy, which allows stencils of an arbitrary order of accuracy to be created. For temporal discretisation purposes, OpenSBLI features the (firstorder) forward Euler scheme as well as the same lowstorage, thirdorder RungeKutta timestepping scheme (CarpenterKennedy_1994, ) present in the legacy SBLI code.
To use a particular scheme, one should instantiate a discretisation scheme derived from the generic base class called Scheme, which essentially stores the finite difference stencil coefficients or the weights used in a particular timestepping scheme. Spatial and temporal schemes should be instantiated separately.
For the purpose of spatial discretisation, handled by the OpenSBLI SpatialDiscretisation class, an Evaluations object is created for each of the formulas, and the derivatives in the equations. Each Evaluations object automatically finds and stores the dependencies of a given term (e.g. requires the dependencies and ). Once all the Evaluations have been created, they are sorted with respect to their dependencies being evaluated (e.g. if depends on , then should be evaluated first). The next step involves defining the range of grid point indices over which each evaluation should be performed, and also assigning a temporary work array for each evaluation. All of the evaluations are then described by a Kernel object (see Section 2.3). It is here, while creating the kernels, that the (continuous) spatial derivatives are automatically replaced by their discrete counterparts. It should be noted that, for the evaluation of formulas, these kernels are fused together if they have no interdependencies to avoid race conditions when running on threaded architectures. Finally, to evaluate the residual for the purposes of temporal discretisation, the derivatives in the expanded equations (represented by an Evaluations object) are substituted by their temporary work arrays, and a Kernel is created for evaluating the residual of each equation.
The temporal discretisation, handled by the TemporalDiscretisation class, involves applying the various stages of the timestepping scheme supplied using the residuals computed by the spatial discretisation process. Similarly, a Kernel object is created for the evaluations in the timestepping scheme.
2.5 Initial conditions
In order for the prognostic fields to be advanced forward in time, initial conditions can be applied using the GridBasedInitialisation class. This is accomplished in much the same way as specifying equations, but involves assignment of grid variables and work arrays of grid point values. For example, in the simulation setup file the coordinate can be defined using the grid point index and :
x0 = "Eq(grid.grid_variable(x0), grid.Idx[0]*grid.deltas[0])",
which in turn defines the initial value for each prognostic variable, by assigning this to the array of values at each grid point (also known as the variable’s work array), e.g.:
rho = "Eq(grid.work_array(rho), 2.0*sin(x0))".
2.6 Boundary conditions
OpenSBLI currently comprises two types of boundary condition, implemented in the classes PeriodicBoundaryCondition and SymmetryBoundaryCondition. Users may apply different boundary conditions in different directions if they so wish. Periodic boundaries are defined such that, for each prognostic field , where is the number of points in the domain. This condition is achieved via the exchange of halo point data at each end of the domain. Symmetry boundary conditions enforce the condition that for scalar fields and for vector fields (in the direction ), which is achieved using a computational kernel.
2.7 Input and output
The state of the prognostic fields can be written to disk every iterations as defined by the user, or only at the end of the simulation. This functionality is handled by the FileIO class. OpenSBLI adopts the HDF5 format (FolkPourmal_2010, ; Collette_2013, ) as it features parallel read/write capabilities and therefore has the potential to overcome the serial input/output bottleneck currently plaguing many largescale parallel applications (Padua_2011, ; Fu_etal_2011, ). Future releases of OpenSBLI will come with the ability to read in mesh files and the state fields from an HDF5 file, enabling the restarting of simulations from ‘checkpoints’ as well as the assignment of initial conditions that cannot be simply defined by a formula.
2.8 Code generation
OpenSBLI currently generates code in the OPSC language which performs the simulation; this is essentially standard C++ code that includes calls to the OPS library. Such functionality is accomplished using the OpenSBLI OPSCCodePrinter class (derived from SymPy’s CCodePrinter class, used to perform the generation of OPSC code statements) and the OPSC class (which agglomerates the literal strings of OPSC statements and kernel functions and writes them to file). The generated code’s structure follows a generic template that maps out the order in which the simulation steps/computations are to be called. The template is represented as a multiline Python string template, with each line containing a placeholder for the code that performs a particular step. Examples include $header which is replaced by any generic boilerplate header code (e.g. #include <stdlib.h> and kernel function prototypes), $initialisation which is replaced by the grid and field setup (e.g. by declaring an OPS block using the ops_decl_block function), and $bc_calls which is replaced by calls to the boundary condition kernel(s). This template can be readily changed to incorporate additional functionality, such as the inclusion of turbulence models. Once all component placeholders have been replaced by OPSC code, the code is written out to disk. For the case of the OPSC language, two files are written; one is a C++ header file containing the computational kernels, and the other is the C++ source file containing various constant definitions (e.g. the timestep size delta_t, and the constants of the Butcher tableau for the timestepping scheme), OPS data structures, and calls to the kernels specified in the header file.
OpenSBLI’s local Python objects (most pertinently, the kernel objects that describe the computations to be performed on the grid) are essentially translated to OPSC data structures and function calls during the preparation of the code. For instance, when declaring computational stencils that define a particular central differencing scheme, the local grid indices stored in the Central scheme object are used to write out an ops_stencil definition during code generation. Similarly, ops_halo structures and calls to ops_halo_transfer are produced to facilitate the implementation of the periodic boundary conditions. All fields are declared as ops_dat datasets; for an example of where these are used, see the function ops_argument_call in the file opsc.py which generates/accumulates calls to the OPS function ops_arg_dat through the use of ‘printf’style string formatting, filling in the ‘placeholder’ arguments (e.g. %s in Python) with values from the local OpenSBLI objects. Finally, calls to OpenSBLI Kernel objects are represented in OPSC as regular C++ functions (see Figure 3) which are passed to the ops_par_loop function (see Figure 4), which executes the function efficiently over the range of grid points within the desired block; OpenSBLI is currently a singleblock code so only one block, containing all the grid points, is used. Further details on the OPS data structures and functionality can be found in the work by Reguly_etal_2014 .
Some optimisations are performed during the code generation stage by OpenSBLI to avoid unnecessary and expensive division operations in the kernels; rational numbers (e.g. finite difference stencil weights that are rational) and constant EinsteinTerms raised to negative powers (e.g. ) are evaluated and stored (e.g. by overriding the _print_Rational method in the OPSCCodePrinter class).
Once the code generation process is complete, the OPS library is called to target the code towards various backends. These include the sequential code, MPI and hybrid MPI+OpenMP parallellised versions of the code for CPUs, CUDA and OpenCL versions of the code for GPUs, and an OpenACC version for accelerators. The test cases presented in this paper (see Section 3) consider the sequential, MPI, and CUDA backends. Targetting ‘handwritten’/manuallygenerated model code towards a particular architecutre is something that is wellknown as a timeconsuming, errorprone and often unsustainable activity; often numerical models have to be completely rewritten, involving many ifelse statements and #ifdefstyle pragmas to ensure that the correct branch of the code is followed for a given backend. As the number of backends grows, the code becomes unsustainable. In contrast, with the abstraction introduced here through code generation, support for a new backend only needs to be added to the OPS library; the toplevel, abstract definition of the equations and their implementation need not be modified due to the separation of concerns, thereby highlighting one of the key advantages of automated model development.
When comparing the number of lines and the complexity of the code that gets generated by OpenSBLI, another advantage of automated model development becomes clear; in the case of the 3D TaylorGreen vortex test case, the problem specification file containing 100 lines generates OPSC code that is approximately 1,500 lines long (excluding blank lines and comments). As more parameterisations (e.g. Large Eddy Simulation turbulence models) and diagnostic field computations are added, it is expected that this number would grow even further relative to the number of lines required in the setup file.
3 Verification and Validation
In order to verify the correctness of OpenSBLI and be confident in the ability of the solution algorithms to accurately represent the underlying physics, three representative test cases covering 1, 2 and 3 dimensions were created and are presented here.
3.1 Propagation of a wave
This 1D test case considers the firstorder wave equation, given by
(3) 
where is the quantity that is transported at constant speed . The expected behaviour is that an arbitrary initial profile at time is displaced by a distance , such that for some finish time . The constant was set to 0.5 ms in this case, and the equation was solved on the line 0 1 m. Eighthorder central differencing was used to discretise the domain in space in conjunction with a thirdorder RungeKutta scheme for temporal discretisation. The grid spacing was set to 0.001 m, and the timestep size was set to 4 10 s, yielding a Courant number of 0.2. A smooth, periodic initial condition was used, and periodic boundary conditions were enforced at both ends of the domain.
The simulation was run in serial (on an Intel® Core™ i74790 CPU) until a finish time of = 1 s. The initial and final states of the solution field are shown in Figure 5. As desired, the error in the solution is very small at , and provides some confidence in the implementation of the solution method and the periodic boundary conditions.
3.2 Method of manufactured solutions
The method of manufactured solutions (MMS) is a rigorous way to check the correctness of a numerical method’s implementation (SalariKnupp_2000, ; Roache_2002, ; Roy_2005, ). The overall algorithm involves constructing a manufactured solution for the prognostic variable(s) and substituting this into the governing equation. Since the manufactured solution will not, in general, be the exact solution to the equation, a nonzero residual term will be present. This residual term is then subtracted from the RHS such that the manufactured solution essentially becomes the exact/analytical solution of the modified equation (i.e. the one with the source term). A suite of simulations can then be performed using increasingly fine grids to check that the numerical solution converges to the manufactured solution at the expected rate determined by the discretisation scheme.
For this test, the 2D advectiondiffusion equation (with a source term ) given by
(4) 
is considered.
The constant is the diffusivity coefficient which is set to 0.75 ms here. The prescribed field is the th velocity component, with = 1.0 ms and = 0.5 ms. The prognostic field is to be determined and has an initial condition of . In a similar fashion to the works of SalariKnupp_2000 ; Roache_2002 ; Roy_2005 , the manufactured/‘analytical’ solution employs a mixture of sine and cosine functions since these are continuous and infinitely differentiable. The SAGE framework (SteinJoyner_2005, ) was used to symbolically determine the residual/source term .
The domain is a 2D square with dimensions m and m such that the manufactured solution is periodic. Furthermore, periodic boundary conditions are applied on all sides of the domain. Six central differencing schemes of order 2, 4, 6, 8, 10 and 12 are considered for the spatial discretisation, and a thirdorder RungeKutta scheme is used throughout to advance the equation in time. To perform the convergence analysis, the grid spacing was halved for each successive case such that = = , , , and . The timestep size was also halved for each case to maintain a maximum bound of 0.025 on the Courant number; this was purposefully kept small and nearconstant to minimise the influence of temporal discretisation error (Vedovoto_etal_2011, ). All simulations were run in serial (on an Intel® Core™ i74790 CPU) until a finish time of = 100 s to ensure that a steadystate solution was attained.
Figure 6 demonstrates how converges towards the manufactured solution as the grid is refined. The convergence rate for each order of the central difference scheme is illustrated in Figure 7. The anomaly in the twelfthorder convergence plot was likely caused by reaching the limit of machine precision. Overall, these results provide confidence in the correctness of the automaticallygenerated code/model.
3.3 3D TaylorGreen vortex
The TaylorGreen vortex is a wellknown hydrodynamic problem (Brachet_etal_1983, ; DeBonis_2013, ; BullJameson_2014, ) characterised by transition to turbulence, decay of turbulence, and the energy dissipation during its evolution. It is frequently used to evaluate the ability of a numerical method to capture the underlying physical processes. During the initial stages of evolution, the dynamics display structural changes (rolling up, streching and interaction of the vortices). This process is inviscid in nature. Later the vortices break down and transition into fullyturbulent dynamics. As there are no external forces or turbulencegenerating mechanisms, the smallscale structures dissipate all the energy, and the fluid eventually comes to rest (Brachet_etal_1983, ). The numerical method employed should be able to capture each of these stages accurately.
The 3D compressible NavierStokes equations were solved in nondimensional form, written in Einstein notation as
(5) 
(6) 
and
(7) 
for the conservation of mass, momentum and energy, respectively. The (dimensionless) quantity is the fluid density, is the th (scalar) component of the velocity vector , is the pressure field, is the total energy. The components of the stress tensor are given by
(8) 
where is the Kronecker Delta function and is the Reynolds number. The components of the heat flux term are given by
(9) 
where is the temperature field, is the ratio of specific heats, is the Mach number, and is the Prandtl number. The various quantities are nondimensionalised using the reference velocity , the reference length , the reference density , and the reference temperature .
The equation of state linking , and , is defined by
(10) 
and the total energy is given by
(11) 
The pressure is nondimensionalised by .
Central finite difference schemes are nondissipative and are therefore suitable for accurately capturing turbulent dynamics. However, the lack of dissipation can make the scheme unstable. To improve the stability, a skewsymmetric formulation (BlaisdellMansourReynolds_1991, ; BlaisdellMansourReynolds_1993, ; BlaisdellSpyropoulosQin_1996, ; Pirozzoli_2011, ) was applied to the convective terms in (5), (6) and (7); the convective term then becomes
(12) 
where should be set to , and for the continuity, momentum and energy equations, respectively. It should also be noted that the both the convective and viscous terms are discretised using the same spatial order. In all of the simulations performed, the Laplacian in the viscous term is expanded using a finite difference representation of the second derivative (i.e. not treated by successive first derivatives).
As per the work of DeBonis_2013 and BullJameson_2014 , the equations were solved in a 3D cube, with , , and . Periodic boundary conditions were applied on all surfaces. The following initial conditions were imposed at time = 0:
(13) 
(14) 
(15) 
(16) 
In all the simulations, = 1,600, = 0.71, = 0.1, and = 1.4. The reference quantities , and were set to 1.0, and the reference temperature was evaluated using the equation of state (10).
A fourthorder accurate central differencing scheme was used to spatially discretise the domain, and a thirdorder RungeKutta timestepping scheme was used to march the equations forward in time. A set of simulations was performed over a range of resolutions, namely , , and uniformlyspaced grid points. For the 64 case, a nondimensional timestep size of 3.385 10 DeBonis_2013 was used. Each time the number of grid points was doubled, the timestep size was halved to maintain a constant upper bound on the Courant number. The generated code was targetted towards the CUDA backend using OPS and executed on an NVIDIA Tesla K40 GPU until a nondimensional time of = 20, except for the 512 case; this was targetted towards the MPI backend and run in parallel over 1,440 processes on the UK National Supercomputing Service (ARCHER) due to lack of available memory on the GPU, and provided a good example of how the backend can be readily changed.
The component of the vorticity field at various times can be found in Figure 8. At nondimensional time = 2.5 vortex evolution and stretching are clearly visible, progressing onto highly turbulent dynamics where the relatively smooth structures rollup and eventually breakdown at around = 9. This point is characterised by peak enstrophy in the system. The final stage of the simulation features the decay of the turbulent structures such that the enstrophy tends towards its initial value.
Following the definitions of DeBonis_2013 , the integrals of the kinetic energy
(17) 
and enstrophy
(18) 
were computed throughout the simulations. Note that is the whole domain and is the LeviCivita function. These quantities are shown in Figures 9 and 10 for the various grid resolutions, and are plotted against the reference data from a spectral element simulation by Wang_etal_2013 using a 512 grid for comparison. Figure 10 highlights the inviscid nature of the TaylorGreen vortex problem for 34. The transition to turbulence occurs from 39 (which is associated with the peak in enstrophy in Figure 9). Finally, dissipation occurs at 9. The results show a clear agreement with the reference data, and represents a solid first step towards the validation of OpenSBLI.
4 Conclusion
Advances in compute hardware are driving a need to change the current state of numerical model development. By developing a new modelling framework based on automated solution techniques, we have effectively futureproofed the core of the SBLI codebase; no longer does a computational scientist need to rewrite significant portions of code in order to get it up and running on a new piece of hardware. Instead, the model is derived from a highlevel specification independent of the architecture that it will run on, and the underlying code is automatically generated and tailored to a particular backend, the responsibility for which would rest with computer scientists who are experts in parallel programming paradigms. Furthermore, the ease at which the governing equations can be changed is a fundamental advantage of using such abstract specifications. This was highlighted here by considering three test cases, each of which comprised a different set of equations. The discretisation, code generation and code targetting is performed automatically, thereby reducing development costs and potentially avoiding errors, bugs, and nonperformant/nonoptimal operations. In addition, code that solves the different variants of the same governing equations can be easily generated. For example, in the compressible NavierStokes equations, viscosity can be treated either as a constant or as a spatiallyvarying term. In static, handwritten codes this flexibility comes at the cost of writing different routines for the various formulations, unlike with automated code generation techniques. This is particularly useful when wanting to switch between Cartesian and generalised coordinates. This particular framework also facilitates the fast and efficient switching between different spatial orders of accuracy, and reduces the development time and effort when wishing to try out new numerical formulations of the equations (or a new spatial/temporal scheme) on a wide variety of test cases.
4.1 Future work
Explicit schemes such as the one implemented here can be readily extendible to a range of application areas such as computational aeroacoustics, aerothermodynamics, problems involving shocks, and hypersonic flow. Incompressible flows may also be handled with the explicit, compressible solver in OpenSBLI so long as the Mach number is sufficiently small. However, this puts tight restrictions on the timestep size thereby limiting the efficiency of the solver, and thus limits the range of applications that OpenSBLI can handle within the context of CFD in general.
Extending to implicit timestepping schemes requires backend support from OPS for matrix inversion, for example. Once this support is implemented, extending OpenSBLI to handle implicit timestepping schemes is straightforward.
The treatment of incompressible flows can be accomplished using schemes such as pressure projection methods Chorin_1967 ; Chorin_1968 . Such a method requires (1) the solution of a tentative velocity field, (2) the solution to a pressure Poisson equation (using either direct or iterative solvers), and (3) the update/correction of the velocity field. The equations defining each step would need to be given by the user in OpenSBLI, in a similar fashion to implementing a projection method in the Unified Form Language (UFL) in FEniCS FEniCS_2011 ; Alnaes_etal_2014
. OpenSBLI would also need to recognise that a projection method has been chosen, possibly via a flag set in the problem definition file. For step (2) of the method, direct solvers can be implemented directly once support for matrix inversion and fast Fourier transforms (for example) are included in OPS (which in turn would need to link to various linear algebra packages such as PETSc
Balay_etal_2014 ). This is similar to how an implicit timestepping scheme would be implemented in OpenSBLI. On the other hand, explicit solution schemes require an iterative solution to the pressure Poisson equation; this is possible by writing the relevant kernel support with an exit criterion (which exits the kernel once a desired tolerance for the solution residual has been attained) and modifying how the code is generated with this in mind. Other methods for incompressible flows such as artificial compressibility methods Chorin_1967 ; Chorin_1967_2 ; OhwadaAsinari_2010 can be implemented with the current functionality by modifying the input equations in the problem setup file accordingly.The work considered here only focussed on the MPI and CUDA backends for CPU and GPU execution, respectively. Future work will consider the CPU and GPU performance on other backends, such as OpenMP. For problems such as Mandelbrot Set computation and matrix multiplication, OpenMP has been demonstrated to perform well against other APIs such as CUDA and OpenACC Ledur_etal_2013 . Future work will also look at comparing the performance of the legacy Fortranbased SBLI code against the OpenSBLIgenerated code in order to demonstrate the potential speedups that can be obtained.
5 Code Availability
OpenSBLI is an opensource release of the original SBLI code developed at the University of Southampton, and is available under the GNU General Public Licence (version 3). Prospective users can download the source code from the project’s Git repository: https://github.com/opensbli/opensbli
6 Acknowledgments
CTJ was supported by a European Union Horizon 2020 project grant entitled “ExaFLOW: Enabling Exascale Fluid Dynamics Simulations” (grant reference 671571). SPJ was supported by an EPSRC grant entitled “Futureproof massivelyparallel execution of multiblock applications” (EP/K038567/1). The data behind the results presented in this paper will be available from the University of Southampton’s institutional repository. The authors acknowledge the use of the UK National Supercomputing Service (ARCHER), with computing time provided by the UK Turbulence Consortium (EPSRC grant EP/L000261/1). The authors would also like to thank the NVIDIA Corporation for donating the Tesla K40 GPU used throughout this research.
Appendix A Example of a simulation setup file
The code in Figure 11 contains the key components of a simulation setup file. Specifically, this is taken from the TaylorGreen vortex simulation. Other examples of setup files can be found in the apps directory of OpenSBLI.
References
References

(1)
P. Thibodeau,
Scientists,
IT Community Await Exascale Computers (2009).
URL http://www.computerworld.com/article/2550451/computerhardware/scientistsitcommunityawaitexascalecomputers.html  (2) F. Rathgeber, G. R. Markall, L. Mitchell, N. Loriant, D. A. Ham, C. Bertolli, P. H. Kelly, PyOP2: A HighLevel Framework for PerformancePortable Simulations on Unstructured Meshes, in: High Performance Computing, Networking Storage and Analysis, SC Companion, IEEE Computer Society, 2012, pp. 1116–1123.
 (3) C. T. Jacobs, M. D. Piggott, FiredrakeFluids v0.1: numerical modelling of shallow water flows using an automated solution framework, Geoscientific Model Development 8 (3) (2015) 533–547. doi:10.5194/gmd85332015.

(4)
F. Rathgeber, D. A. Ham, L. Mitchell, M. Lange, F. Luporini, A. T. T. McRae,
G.T. Bercea, G. R. Markall, P. H. J. Kelly,
Firedrake: automating the finite
element method by composing abstractions, ACM Transactions on Mathematical
Software.
URL http://arxiv.org/abs/1501.01809  (5) A. Logg, G. N. Wells, DOLFIN: Automated finite element computing, ACM Transactions on Mathematical Software 37 (2). doi:10.1145/1731022.1731030.
 (6) A. Logg, K.A. Mardal, G. N. Wells, et al., Automated Solution of Differential Equations by the Finite Element Method, Springer, 2012. doi:10.1007/9783642230998.

(7)
M. S. Alnæs, A. Logg, K. B. Ølgaard, M. E. Rognes, G. N. Wells, Unified Form Language: A domainspecific language for weak formulations of partial differential equations, ACM Transactions on Mathematical Software 40 (2).
 (8) F. Luporini, A. L. Varbanescu, F. Rathgeber, G.T. Bercea, J. Ramanujam, D. A. Ham, P. H. J. Kelly, CrossLoop Optimization of Arithmetic Intensity for Finite Element Local Assembly, ACM Transactions on Architecture and Code Optimization 11 (4). doi:10.1145/2687415.

(9)
M. Giles, I. Reguly, G. Mudalige,
OPS C++ User’s Manual,
University of Oxford (2015).
URL http://www.oerc.ox.ac.uk/projects/ops  (10) I. Z. Reguly, G. R. Mudalige, M. B. Giles, D. Curran, S. McIntoshSmith, The OPS Domain Specific Abstraction for MultiBlock Structured Grid Computations, in: Proceedings of the 2014 Fourth International Workshop on DomainSpecific Languages and HighLevel Frameworks for High Performance Computing, IEEE Computer Society, 2014, pp. 58–67. doi:10.1109/WOLFHPC.2014.7.
 (11) G. R. Mudalige, I. Z. Reguly, M. B. Giles, A. C. Mallinson, W. P. Gaudin, J. A. Herdman, Performance Analysis of a Highlevel Abstractionsbased Hydrocode on Future Computing Systems, in: Proceedings of the 5th international workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS ’14). Held in conjunction with IEEE/ACM Supercomputing 2014 (SC’14), 2014.
 (12) S. P. Jammy, G. R. Mudalige, I. Z. Reguly, N. D. Sandham, M. Giles, Block structured compressible navier stokes solution using the ops highlevel abstraction, in: Proceedings of the 27th International Conference on Parallel Computational Fluid Dynamics (Parallel CFD 2015), 2015.
 (13) E. Touber, N. D. Sandham, Largeeddy simulation of lowfrequency unsteadiness in a turbulent shockinduced separation bubble, Theoretical and Computational Fluid Dynamics 23 (2) (2009) 79–107. doi:10.1007/s001620090103z.
 (14) N. De Tullio, N. D. Sandham, Direct numerical simulation of breakdown to turbulence in a Mach 6 boundary layer over a porous surface, Physics of Fluids 22 (9). doi:10.1063/1.3481147.
 (15) J. Redford, N. D. Sandham, G. T. Roberts, Numerical simulations of turbulent spots in supersonic boundary layers: Effects of Mach number and wall temperature, Progress in Aerospace Sciences 52 (2012) 67–79. doi:10.1016/j.paerosci.2011.08.002.
 (16) B. Wang, N. D. Sandham, W. Hu, W. Liu, Numerical study of oblique shockwave/boundarylayer interaction considering sidewall effects, Journal of Fluid Mechanics 767 (2015) 526–561. doi:10.1017/jfm.2015.58.

(17)
SymPy Development Team, SymPy: Python library for
symbolic mathematics (2016).
URL http://www.sympy.org  (18) OpenFOAM, OpenFOAM User Guide, Version 2.3.1 (2014).

(19)
T. Sun, OPESCIFD: Automatic Code
Generation Package for Finite Difference Models, Master’s thesis, Imperial
College London (2016).
URL https://arxiv.org/abs/1605.06381 
(20)
N. Kukreja, M. Louboutin, F. Vieira, F. Luporini, M. Lange, G. Gorman,
Devito: automated fast finite
difference computation, in: In Proceedings of the Sixth International
Workshop on DomainSpecific Languages and HighLevel Frameworks for High
Performance Computing, 2016.
URL https://arxiv.org/abs/1608.08658 
(21)
M. Lange, N. Kukreja, M. Louboutin, F. Luporini, F. Vieira, V. Pandolfo,
P. Velesko, P. Kazakas, G. Gorman,
Devito: Towards a generic Finite
Difference DSL using Symbolic Python, in: In Proceedings of the PyHPC 2016
Conference, 2016.
URL https://arxiv.org/abs/1609.03361  (22) W. Bangerth, R. Hartmann, G. Kanschat, deal.II–A generalpurpose objectoriented finite element library, ACM Transactions on Mathematical Software 33 (4). doi:10.1145/1268776.1268779.
 (23) F. Hecht, New development in FreeFem++, Journal of Numerical Mathematics 20 (34) (2012) 251–265. doi:10.1515/jnum20120013.
 (24) M. H. Carpenter, C. A. Kennedy, FourthOrder 2NStorage RungeKutta Schemes, NASA Technical Memorandum 109112, National Aeronautics and Space Administration, Langley Research Center (1994).
 (25) M. Folk, E. Pourmal, Balancing Performance and Preservation Lessons Learned with HDF5, in: Proceedings of the 2010 Roadmap for Digital Preservation Interoperability Framework Workshop, USDPIF ’10, 2010, pp. 11:1–11:8. doi:10.1145/2039274.2039285.
 (26) A. Collette, Python and HDF5: Unlocking Scientific Data, O’Reilly Media, 2013. doi:10.1007/9783642230998.
 (27) D. Padua, Encyclopedia of Parallel Computing, Vol. 4, Springer US, 2011. doi:10.1007/9783642230998.
 (28) J. Fu, M. Misun, R. Latham, C. D. Carothers, Parallel I/O Performance for ApplicationLevel Checkpointing on the Blue Gene/P System, in: Proceedings of the 2011 IEEE International Conference on Cluster Computing, 2011, pp. 465–473.
 (29) K. Salari, P. Knupp, Code verification by the method of manufactured solutions, Tech. Rep. SAND20001444, Sandia National Laboratories (2000).
 (30) P. J. Roache, Code Verification by the Method of Manufactured Solutions, Journal of Fluids Engineering 124 (1) (2002) 4–10. doi:10.1115/1.1436090.
 (31) C. J. Roy, Review of code and solution verification procedures for computational simulation, Journal of Computational Physics 205 (1) (2005) 131–156. doi:10.1016/j.jcp.2004.10.036.
 (32) W. Stein, D. Joyner, SAGE: System for Algebra and Geometry Experimentation, ACM SIGSAM Bulletin 39 (2).
 (33) J. M. Vedovoto, A. da Silveira Neto, A. Mura, L. F. F. da Silva, Application of the method of manufactured solutions to the verification of a pressurebased finitevolume numerical scheme, Computers & Fluids 51 (1) (2011) 85 – 99. doi:10.1016/j.compfluid.2011.07.014.
 (34) M. E. Brachet, D. I. Meiron, S. A. Orszag, B. G. Nickel, R. H. Morf, U. Frisch, Smallscale structure of the TaylorGreen vortex, Journal of Fluid Mechanics 130 (1983) 411–452. doi:10.1017/S0022112083001159.
 (35) J. DeBonis, Solutions of the TaylorGreen Vortex Problem Using HighResolution Explicit Finite Difference Methods, in: 51st AIAA Aerospace Sciences Meeting including the New Horizons Forum and Aerospace Exposition, Aerospace Sciences Meetings, 2013. doi:10.2514/6.2013382.
 (36) J. R. Bull, A. Jameson, Simulation of the Compressible Taylor Green Vortex using HighOrder Flux Reconstruction Schemes, in: 7th AIAA Theoretical Fluid Mechanics Conference, AIAA Aviation, 2014. doi:10.2514/6.20143210.
 (37) G. A. Blaisdell, N. N. Mansour, W. C. Reynolds, Numerical simulations of homogeneous compressible turbulence, Report TF50, Department of Mechanical Engineering, Stanford University (1991).
 (38) G. A. Blaisdell, N. N. Mansour, W. C. Reynolds, Compressibility effects on the growth and structure of homogeneous turbulent shear flow, Journal of Fluid Mechanics 256 (1993) 443–485. doi:10.1017/S0022112093002848.
 (39) G. A. Blaisdell, E. T. Spyropoulos, J. H. Qin, The effect of the formulation of nonlinear terms on aliasing errors in spectral methods, Applied Numerical Mathematics 21 (3) (1996) 207 – 219. doi:10.1016/01689274(96)000050.
 (40) S. Pirozzoli, Numerical Methods for HighSpeed Flows, Annual Review of Fluid Mechanics 43 (1) (2011) 163–194. doi:10.1146/annurevfluid122109160718.
 (41) Z. Wang, K. Fidkowski, R. Abgrall, F. Bassi, D. Caraeni, A. Cary, H. Deconinck, R. Hartmann, K. Hillewaert, H. Huynh, N. Kroll, G. May, P.O. Persson, B. van Leer, M. Visbal, Highorder cfd methods: current status and perspective, International Journal for Numerical Methods in Fluids 72 (8) (2013) 811–845. doi:10.1002/fld.3767.
 (42) A. Chorin, The numerical solution of the NavierStokes equations for an incompressible fluid, Bulletin of the American Mathematical Society 73 (6) (1967) 928–931.
 (43) A. Chorin, Numerical Solution of the NavierStokes Equations, Mathematics of Computation 22 (104) (1968) 745–762.

(44)
The FEniCS Project,
Incompressible
NavierStokes equations, Online.
URL https://fenicsproject.org/documentation/dolfin/1.0.1/python/demo/pde/navierstokes/python/documentation.html  (45) S. Balay, S. Abhyankar, M. Adams, J. Brown, P. Brune, K. Buschelman, V. Eijkhout, W. Gropp, D. Kaushik, M. Knepley, L. Curfman McInnes, K. Rupp, B. Smith, H. Zhang, PETSc Users Manual, Tech. Rep. Revision 3.5, Argonne National Laboratory (2014).
 (46) A. Chorin, A numerical method for solving incompressible viscous flow problems, Journal of Computational Physics 2 (1) (1967) 12–26. doi:10.1016/00219991(67)90037X.
 (47) T. Ohwada, P. Asinari, Artificial compressibility method revisited: Asymptotic numerical method for incompressible navierstokes equations, Journal of Computational Physics 229 (5) (2010) 1698–1723. doi:10.1016/j.jcp.2009.11.003.
 (48) C. L. Ledur, C. M. D. Zeve, J. C. S. dos Anjos, Comparative Analysis of OpenACC, OpenMP and CUDA using Sequential and Parallel Algorithms, in: Proceedings of the 11th Workshop on Parallel and Distributed Processing (WSPPD), 2013, 2013.
Comments
There are no comments yet.