1 Introduction
Partial differential equations that involve vectorial or tensorial quantities are very common in science. For example, the vacuum Maxwell’s equations,
(1)  \(\nabla \cdot \vec{E} = 0\)
(2)  \(\nabla \cdot \vec{B} = 0\)
(3)  \(\partial_t \vec{E} = \nabla \times \vec{B}\)
(4)  \(\partial_t \vec{B} = -\nabla \times \vec{E}\)
involve manipulations of the vector fields \(\vec{E}\) and \(\vec{B}\). Numerically, these fields are represented by arrays of three numbers per point on a discretized spatial grid. Manipulations of the fields using a language such as C or FORTRAN involve loops over each component and over the grid size. When solving Einstein’s equations of general relativity baumgarteShapiroBook, the necessary tensorial equations can become quite involved. As a moderate example, consider the evolution equation of the spatial metric in certain formulations of Einstein’s equations,
(5)  \(\partial_t g_{ij} = -2 N K_{ij} + \nabla_i N_j + \nabla_j N_i\)
Here, \(g_{ij}\) and \(K_{ij}\) are the spatial metric and the extrinsic curvature; both are represented by spatially varying, symmetric 3x3 matrices. The scalar quantity \(N\) denotes the lapse function and \(N^i\) the shift vector, both of which are also spatially varying. And finally, \(\nabla_i\) denotes the covariant derivative operator compatible with \(g_{ij}\). Equations (1)–(5) depend on one or two indices, respectively. Intermediate expressions in general relativity can easily depend on more indices; for instance, the Christoffel symbols are defined as
(6)  \(\Gamma^k_{ij} = \frac{1}{2}\sum_{l} g^{kl}\left(\partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij}\right)\)
where \(\partial_i\) denotes the partial derivative, and the 3x3 symmetric matrix \(g^{kl}\) is the inverse of the matrix \(g_{ij}\), both of which are spatially varying. Because of the symmetry in the index pair \((i,j)\), Eq. (6) represents 18 independent equations, each one with nine terms on the right-hand side.
Henceforth, we adopt the Einstein sum convention that repeated indices are summed over (i.e. we will no longer write \(\sum_l\) in equations like Eq. (6)). Furthermore, Latin lowercase letters from the middle of the alphabet (\(i, j, k, \ldots\)) will range over the three spatial dimensions.
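The component counts quoted above (six independent components of the symmetric \(g_{ij}\), 18 independent equations for Eq. (6)) follow from a standard counting argument, sketched here as a small standalone function (the function name is ours, not part of SpEC or TLoops):

```cpp
// Standard counting of independent components for a dimension-d,
// rank-r object that is symmetric on one index pair: the symmetric
// pair contributes d*(d+1)/2 combinations, and every remaining free
// index contributes a factor of d.
int symmetric_pair_components(int d, int r) {
    int n = d * (d + 1) / 2;             // the symmetric index pair
    for (int k = 2; k < r; ++k) n *= d;  // remaining free indices
    return n;
}
```

With \(d=3\), a symmetric rank-2 tensor like \(g_{ij}\) has six components, and the additional free index of \(\Gamma^k_{ij}\) multiplies this by three, giving 18.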
Upon spatial discretization, each spatially dependent tensor is represented on a spatial grid or, for multi-domain methods, on multiple spatial grids. On each such grid, assumed to have \(N\) points, Eq. (5) would then, schematically, be represented by code such as that in Listing 1.
The schematic Listing 1 indexes tensorial objects with parentheses for the tensor indices; the grid points of the underlying grid are indexed with square brackets. We furthermore assume in Listing 1 that the covariant derivative of the shift was already precomputed (the covariant derivative is given by \(\nabla_i N_j = \partial_i N_j - \Gamma^k_{ij} N_k\)).
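Since Listing 1 itself is not reproduced in this excerpt, the following self-contained sketch indicates the shape of such explicitly coded loops for Eq. (5). All names here (Field, SymTensorField, evolve_metric, CovN) are ours; SpEC's actual listing will differ in detail:

```cpp
#include <vector>

// Hypothetical, explicitly coded analogue of Listing 1 for Eq. (5):
//   d/dt g_{ij} = -2 N K_{ij} + CovN_{ij} + CovN_{ji},
// where CovN holds the precomputed covariant derivative of the shift.
// Each tensor component is an array over the Np grid points; tensor
// components are indexed with [i][j], grid points with [p].
using Field = std::vector<double>;
using SymTensorField = Field[3][3];

void evolve_metric(SymTensorField& dtg, const Field& N,
                   const SymTensorField& K, const SymTensorField& CovN,
                   int Np) {
    for (int i = 0; i < 3; ++i)
        for (int j = i; j < 3; ++j)       // exploit the symmetry of dtg
            for (int p = 0; p < Np; ++p)
                dtg[i][j][p] = -2.0 * N[p] * K[i][j][p]
                             + CovN[i][j][p] + CovN[j][i][p];
}
```

Note the triple nesting: two loops over tensor indices (restricted by the symmetry) and one over grid points, exactly the pattern TLoops is designed to eliminate from user code.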
Our focus in this paper is the numerical relativity code SpEC SpE (e0cm), a mature code in active use for the computation of gravitational waveforms for ground-based detectors. Expressions such as that of Listing 1 are ubiquitous in SpEC and present a major challenge to development, adaptation, and maintenance. The library presented in this paper, TLoops, removes from SpEC the need for explicit source-code loops over tensor indices. Equation (5) can then be written as a single line, as illustrated in Listing 2.
The variables i_, j_, etc., are predefined by TLoops. Overloaded indexing operators and assignment operators are defined such that the single line in Listing 2 expands to all relevant loops, both over tensor indices and over grid points. TLoops also handles sums. For instance, Eq. (6) can be coded as the single expression in Listing 3.
There already exist several packages implementing similar functionality Landry (2012); Arndt et al. (2017); jer (1998); Reynders and Cummings (1998); Tisdale (1999); Veldhuizen (1996). Consistent with our own observations, benchmarks of these packages show impaired performance relative to explicitly coded loops FTe (2012), presumably because compiler optimizations are oriented towards the latter.
The true (and, to our knowledge, unique) advantage of our package is its ability to automatically generate source code equivalent to templated expressions. When a certain compiler flag is defined, TLoops stores each unique tensor expression it encounters within the linker code of each compiled library. A packaged executable, CodeWriter, thus has access to the full list of possible tensor expressions, from which it generates legal non-templated code performing equivalent operations. We present here two examples: a (loop-based) C implementation, and a GPU (CUDA) implementation. In either case, the original C++ template code does not need any source-code modifications; the new C or CUDA code is incorporated at link time.
Because of the latter functionality, TLoops can be used to immediately port large numbers of tensor operations to the GPU without the need to explicitly write kernels. These tensor operations are normally substantially faster than CPU code, and allow data to be kept on the GPU between calls to other GPU kernels, allowing segments of code to be hand-ported without extraneous CPU-GPU synchronizations.
The remainder of this paper is divided into three parts: First, we introduce SpEC and outline the C++ template techniques that enable the compact code in Listings 2 and 3. Second, we present our techniques to allow replacement of the template-generated code with automatically generated non-templated code. Finally, we show detailed benchmarks of the new code.
2 Direct evaluation of tensor loops using expression templates
2.1 Spectral Einstein Code
The code presented in this paper is based on the Spectral Einstein Code (SpEC) SpE (e0cm), written in C++. In SpEC, arrays over grid points are represented by the class DataMesh, and tensorial objects are represented by template<class T> class Tensor. SpEC’s class DataMesh already contains expression templates that handle loops over grid points. Therefore, in SpEC, Listing 1 is coded as displayed in Listing 4. Indexing a Tensor<T>, e.g. K(i,j), returns a (const or non-const) reference to T.
SpEC’s class Tensor is aware of symmetric indices. For instance, if K is initialized as symmetric, K(i,j) and K(j,i) both return a reference to the same element.
SpEC’s class DataMesh implements automatic resizing when assigned to. Furthermore, when used on the right-hand side of assignments (as in Listing 4), DataMesh checks the consistency of the sizes of all DataMesh’es involved. These consistency checks, combined with the absence of explicit loops over grid points and of explicit grid-point indexing, already significantly reduce the possibility of coding errors. However, two major shortcomings remain:

- Loops over tensor indices must be coded manually, which is tedious and error-prone. Specifically, the loops over indices must be consistent with the symmetries of the respective Tensor (cf. the loop over j in Listing 4).
- The existing SpEC expression templates operate on class DataMesh. For each combination (i,j), the inner loop thus represents an independent expression on DataMesh, triggering a full traversal of all grid points. In Listing 4 this requires six traversals of the associated memory, whereas Eq. (6) would require 18 traversals.
TLoops corrects both these shortcomings by removing the need to write explicit loops entirely. This is done using a hierarchy of expression templates to represent tensorial manipulations. In the immediately following sections we detail the specifics of those templates.
2.2 Tensors and SpEC’s Tensor class
For our purposes, tensors are objects with \(r\) indices, each taking \(d\) distinct values. The integer \(r\) is called the rank of the tensor, and \(d\) its dimension. For instance, \(K_{ij}\) indicates a rank-2 tensor of dimension 3. If a tensor is symmetric on a pair of indices, then the ordering of the two indices in the pair is irrelevant. For instance, \(\Gamma^k_{ij}\) defined in Eq. (6) is symmetric on its two lower indices, i.e. \(\Gamma^k_{ij}=\Gamma^k_{ji}\) for any values of \(i,j,k\). The significance of the index placement (up/down) is irrelevant for the purposes of this paper. Many tensors in general relativity are symmetric on some or all indices, including \(g_{ij}\) and \(K_{ij}\) in Eq. (5) (tensors can also be anti-symmetric, a property not implemented in SpEC and therefore of no relevance here). In differential geometry, tensors must satisfy additional conditions related to coordinate transformations, which are not satisfied by the Christoffel symbols \(\Gamma^k_{ij}\). While \(\Gamma^k_{ij}\) are not tensors in the mathematical sense, they are nevertheless represented in SpEC with class Tensor.
In general relativity one commonly encounters both spacetime and spatial tensors. Indices of spacetime tensors range over the three spatial dimensions and time. An example is the spacetime metric,
(7)  \(\psi_{ab}, \qquad a, b = 0, \ldots, 3,\)
where we use letters from the start of the alphabet (\(a, b, \ldots\)) to indicate spacetime indices. The zeroth index value (e.g. \(a=0\)) indicates the time dimension, while \(a=1,2,3\) indicate the space dimensions. The spatial metric is a subset of the spacetime metric,
(8)  \(g_{ij} = \psi_{(i+1)(j+1)}, \qquad i, j = 0, 1, 2.\)
Because SpEC always indexes starting with 0, Eq. (8) must add 1 to the spatial indices to obtain the relevant components of \(\psi_{ab}\).
SpEC’s Tensor class represents multi-index objects whose indices each take on the values \(0,\ldots,d-1\). The represented objects may be symmetric on some of their indices, like \(g_{ij}\) or \(\Gamma^k_{ij}\). Internally, a Tensor<X> holds an array of elements, each an object X, of appropriate size given the symmetries (e.g. the symmetric tensor \(g_{ij}\) has six elements). A Tensor<X> furthermore holds a lookup table to translate indices (i,j) into the actual storage location inside the array. Symmetries are implemented by having the lookup table entries for (i,j) and (j,i) point to the same element. Tensor is indexed with parentheses, i.e. Gamma(0,1,2) represents \(\Gamma^0_{12}\). Listing 5 demonstrates some indexing operations performed on Tensor, while Listing 4 already demonstrated actual computations performed with Tensor.
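A minimal sketch of such a symmetry-aware lookup follows (a hypothetical index formula; SpEC's actual table layout may differ): map the index pair (i,j) of a symmetric, dimension-3, rank-2 tensor to one of six storage slots, with (i,j) and (j,i) mapping to the same slot.

```cpp
#include <algorithm>

// Hypothetical storage-slot formula for a symmetric, dimension-3,
// rank-2 tensor: order the pair so that a <= b, then enumerate the
// upper triangle row by row:
// (0,0)->0, (0,1)->1, (0,2)->2, (1,1)->3, (1,2)->4, (2,2)->5.
int sym_slot(int i, int j) {
    int a = std::min(i, j);
    int b = std::max(i, j);
    return a * 3 - a * (a - 1) / 2 + (b - a);
}
```

Because the pair is sorted before the slot is computed, sym_slot(i,j) and sym_slot(j,i) always coincide, which is exactly the behavior described for K(i,j) and K(j,i) above.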
Listings 4 and 5 use class DataMesh, another SpEC-specific class. DataMesh represents a multi-dimensional rectangular array, holding one double per grid point, with dimension \(d\), extents \(N_1,\ldots,N_d\), and size \(N=\prod_{k=1}^{d} N_k\). SpEC implements expression templates: arithmetic operators between DataMesh objects and/or double’s are overloaded to return recursively defined types encoding the operation and the data type of the operands (DataMesh or double). The instantiations of the expression templates furthermore collect references to the memory locations of all involved data. The assignment operator then recurses through the template to evaluate the expression.
Certain design choices of SpEC present challenges for the development of TLoops. Because SpEC is a well-established and intensely used code, these choices cannot be changed, and we have to work within them:

- Dimension, rank and symmetry of a Tensor<X> are assigned dynamically at runtime, and not statically through template arguments at compile time. This gives flexibility when using instances of Tensor, because dimension/rank/symmetry can be changed as needed. Unfortunately, this also implies that dimension/rank/symmetry are not available to C++’s type system at compile time. Part of this paper therefore deals with injecting compile-time information into the tensor expressions (e.g. Listings 2–5) so that the information needed to construct loops over tensor indices can be deduced at compile time.

- Tensor<T> is a template class which stores an array of T’s. Because each DataMesh allocates its own storage independently, a Tensor<DataMesh> has an independent double* array of size \(N\) for each tensor component, rather than one contiguous array holding all components. While SpEC’s design choice makes it convenient to use Tensors, it is not necessarily computationally optimal. Specifically, for GPU implementations of tensor loops, the increased number of memory locations degrades performance, cf. Section 5.

- Finally, because of SpEC’s age, and the need to run on various supercomputers with varying degrees of up-to-date compilers, SpEC restricts itself to C++03 with only a small set of newer C++11 features, and no C++17-specific features.
2.3 Capturing a TLoops expression as a type
2.3.1 Classes for indexing a Tensor
All expression templates begin by capturing the structure of the expression as a type. Types are available to the compiler and thus enable metaprogramming at compile time. In this section we detail the classes we have developed to accomplish this.
Two classes represent indexing with tensor indices, i.e. the variables \(i, j, \ldots\) appearing in tensorial equations like Eqs. (5) and (8): class TIndex represents an index (e.g. \(i\)) across the entire expression, whereas class TIndexSlot is specific to each occurrence of \(i\).
Listing 6 defines two classes and a set of variables. class TIndex<dim, label> serves as a marker to enable type capture. As such, TIndex is templated on the dimension of the index. The additional label argument distinguishes indices of the same dimension (e.g. \(i\) and \(j\)). TIndex also contains a counter mValue and functionality to iterate this counter over the allowed index values of the given TIndex. This functionality will become relevant when we discuss evaluation of a tensor expression in Section 2.4.
class TIndexSlot<dim, label, offset> tags each individual occurrence of an index in an expression. This is required because different occurrences can have different offsets (i.e. “\(i\)” and “\(i+1\)”), which therefore require different encodings. TIndexSlot inherits from TIndex in order to allow easy down-casting, which is convenient for compile-time consistency checks of the index structure.
Finally, Listing 6 defines variables i_, j_, etc., that map to specific TIndexSlots. These enable the user to inject type information about indexing into source code, conveniently resembling mathematical expressions in tensor calculus.
2.3.2 Types representing an indexed Tensor
The classes and variables defined in the previous subsection (TIndex, TIndexSlot, i_, ...) are used to index a tensor, e.g. g(i_, 1). This is accomplished with a suitable Tensor::operator(), which returns a type that carries all information about the indexing. This return type is built from helper classes introduced schematically in Listing 7. The first two helper classes (TInequality and TSymmetry) handle symmetries of tensors.
The marker class TInequality<pos1,pos2> indicates a symmetry between one pair of indices in a tensor, namely the indices at the pos1’th and pos2’th position (where counting starts from zero). For example, consider the metric \(g_{ij}\) and its time derivative \(\partial_t g_{ij}\), both of which are symmetric in their only two indices. When iterating over all components of these tensors, this implies a condition \(i \ge j\), cf. Listing 4. TInequality<0,1> precisely represents this inequality, in that it indicates that in loops, the value of the 0th index should be equal to or larger than the value of the 1st index.
A generic tensor with any structure of symmetric indices (recall that we do not consider tensors with anti-symmetric pairs of indices, or with cyclic symmetries like those of the Riemann tensor) can then be represented by a set of TInequality’s. For a tensor without symmetries, this set is empty. For a tensor with one symmetric pair of indices, the set contains one TInequality, like TInequality<0,1> for \(g_{ij}\) or TInequality<1,2> for \(\Gamma^k_{ij}\). More generic symmetries are represented by multiple TInequality’s. For instance, a rank-four tensor separately symmetric on the first two and last two indices is represented by TSymmetry<TInequality<0,1>, TInequality<2,3>>, and a completely symmetric rank-three tensor by TSymmetry<TInequality<0,1>, TInequality<1,2>>.
Such sets of inequalities are represented by class TSymmetry<...>, which is defined only in specializations for TInequality<.,.>, and which utilizes compile-time asserts to enforce a monotonically increasing ordering of the TInequality’s.
Creation of the TSymmetry<...> marker classes is handled via the free template functions Sym<i,j>(), Sym<i,j,k>(), and Sym<i,j,k,l>(), which return the TSymmetry class representing full symmetrization on the indicated slots. For the case of several distinct symmetries, e.g. a tensor separately symmetric on two pairs of indices, operator&& is suitably overloaded to allow Sym<0,1>() && Sym<2,3>().
Knowledge of the symmetries of a tensor is only important on the left-hand side of an assignment, as only the indices on the left-hand side are looped over. Therefore, it is optional to specify symmetries for tensors on the right-hand side, as in Listing 2. Doing so does not throw an error, but also has no effect.
With symmetries of a tensor handled by TSymmetry, we now turn to the indexing of a tensor. This is handled by class TIndexStructure<TSymmetry<...symm>, ...indices>, where the template pack indices has a length equal to the number of indices. Each type in this template pack is either a TIndexSlot or an int. The former indicates indexing with an implicit index (as used in Listing 2), whereas the latter indicates the usual indexing by an integer (as used in Listing 4). Indexing with implicit indices and integers can be mixed. Consider, for example, a rank-3 tensor symmetric on the last two indices, which is indexed on the first slot with the integer 1. In mathematical notation, this is represented by \(\Gamma^1_{ij}\), which is mapped by TLoops to Gamma(1, i_, j_) with indexing structure TIndexStructure<TSymmetry<TInequality<1,2>>, 1, TIndexSlot<3,0,0>, TIndexSlot<3,1,0>>.
The classes TSymmetry<...TIneq> and TIndexStructure<TSymmetry_t, ...indices> are recursively defined in the number of TInequality’s and tensor indices. The specializations for the empty case are trivial. One then adds additional inequalities/indices at the front via variadic template arguments. Each specialization defines several member types and member variables that will be useful subsequently, and which are shown in Listing 8.
The member types and variables of TSymmetry and TIndexStructure are used in subsequent steps to implement functionality. For instance, when adding two indexed tensors, the suitable operator+ will static_assert that both indexed tensors have the same set of free indices. This mirrors the mathematical meaning, where \(g_{ij} + K_{ij}\) is correct, whereas \(g_{ij} + K_{jk}\) is erroneous.
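The compile-time index check can be sketched with a type-level comparison (the types here are ours, far simpler than TLoops' TIndexSet_t machinery):

```cpp
#include <type_traits>

// Simplified sketch: the set of free indices of an indexed expression
// is encoded as a type parameter, and operator+ statically requires
// both operands to carry the same set.
struct ij {};  // tag: free indices (i,j)
struct jk {};  // tag: free indices (j,k)

template <class IndexSet>
struct Indexed {
    double value;  // stand-in for the actual tensor data
};

template <class L, class R>
double operator+(Indexed<L> a, Indexed<R> b) {
    static_assert(std::is_same<L, R>::value,
                  "free tensor indices of both summands must agree");
    return a.value + b.value;
}
```

Indexed<ij>{1.0} + Indexed<ij>{2.0} compiles and yields 3.0, while an attempt to form Indexed<ij>{} + Indexed<jk>{} is rejected at compile time, mirroring that an expression like \(g_{ij}+K_{jk}\) is meaningless.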
2.3.3 Expression tree for implicit tensor-loop expressions
TIndexStructure represents the full index structure of an indexed tensor, i.e. the information needed to prepare at compile time (via metaprogramming) the necessary loops. In order to execute the operation, the memory locations of all Tensor<DataMesh> instances are needed in addition. The memory locations are stored in a recursive expression tree assembled with a template class iBinaryOp<L, Op, R> taking three template parameters: the first and last, L and R, represent the left and right operands. These operands could be of type iBinaryOp themselves, thus enabling the recursion that represents nested expressions. The middle template parameter Op represents the mathematical operation to be performed.
The ‘i’ in iBinaryOp indicates that the relevant expression is indexed by at least one symbolic tensor index (i_, j_, ...). This is important because, for tensorial expressions, only a small subset of mathematical operations is permissible: (i) addition and subtraction, for which the tensor indices in both operands must agree; (ii) multiplication; and (iii) negation. In contrast, SpEC expression templates operating on DataMesh utilize a much larger set of mathematical operators, including division, and a greatly enhanced set of unary operators (like square root and trigonometric functions).
There are several groups of partial specializations of iBinaryOp that enable its full functionality. These specializations are schematically indicated in Listing 9.

- Set (1) contains the recursive operators that combine two indexed expressions. As explained above, mathematically there are only four allowed operators, which are represented by the marker classes AddOp, SubOp, MultOp and negateOp. For instance, the ‘+’ operators in Listing 2 are represented by iBinaryOp’s of set (1). The specialization iBinaryOp<EmptyType,negateOp,iBinaryOp> illustrates our convention of indicating unary operators with the type class EmptyType { } in lieu of the first template argument L of iBinaryOp.

- Set (2) recursively combines an indexed expression with a scalar expression (either a double, or a DataMesh, or a (scalar-valued) expression template of DataMesh, represented by class BinaryOp). Only multiplication and division are mathematically permissible, leaving only three cases each. The multiplication 0.5*... in Listing 3 is represented by a specialization of set (2a), and the term 2*N*K(i_, j_) is represented by a specialization of set (2c), combining the DataMesh expression 2*N with the indexed expression K(i_, j_).
The construction of recursive iBinaryOp’s is handled by the relevant overloaded operators. The recursive types of sets (1) and (2) are returned by suitably defined operator+, operator-, and operator*. Leaf nodes of set (3) are returned by the suitably templated Tensor<DataMesh>::operator(), which is provided in two versions: with and without a first argument of type TSymmetry. Such a TSymmetry argument must be provided on the left-hand side of assignments, like dg(Sym<0,1>(), i_, j_) in Listing 2, because the symmetry is required to determine the loop bounds. The instance of TSymmetry passed into Tensor<DataMesh>::operator() is encoded in the templated return type of this operator, within the TIndexStructure parameter inside the set (3) type in Listing 9. On the right-hand side, compile-time information about the symmetry of the tensors is not needed; therefore, presently, it is optional to specify the symmetry through an extra first argument to Tensor<DataMesh>::operator() (all examples in this paper omit the symmetry specifiers on the right-hand side).
All iBinaryOp specializations have certain member types and member variables which are useful when assembling the types recursively and when evaluating the implicit tensor-loop expression. These members are indicated in Listing 10.
TIndexSet_t is a template type that represents the set of all free indices; this set is used in operator+ and operator- to verify that both operands have the same free indices. ExpandIndices_t is the DataMesh expression type that results when all implicit tensor indices are replaced by concrete values, i.e. when each Tensor<DataMesh> is replaced by the DataMesh of one of its components. This type will be used when evaluating the tensor-loop expression, cf. Sec. 2.4. Finally, the references lhs and rhs are also needed during evaluation of the tensor-loop expression, as they contain the concrete memory locations of all relevant data.
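The recursive structure of iBinaryOp can be sketched with a toy scalar expression template (a simplified analogue; TLoops' actual classes additionally carry the index structure, and the helper make() is our own):

```cpp
// Operation marker classes, analogous to AddOp and MultOp:
struct AddOp  { static double apply(double a, double b) { return a + b; } };
struct MultOp { static double apply(double a, double b) { return a * b; } };

struct Leaf {                    // leaf node: a reference to concrete data
    const double& ref;
    double eval() const { return ref; }
};

template <class L, class Op, class R>
struct iBinaryOp {               // recursive node: operands by reference,
    const L& lhs;                // operation encoded as a type
    const R& rhs;
    double eval() const { return Op::apply(lhs.eval(), rhs.eval()); }
};

template <class Op, class L, class R>
iBinaryOp<L, Op, R> make(const L& l, const R& r) { return {l, r}; }
```

Nesting make() calls builds the expression tree as a type, and eval() recurses through it, just as the assignment operator recurses through the real iBinaryOp tree.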
2.4 Evaluation of a TLoops template
The preceding sections describe the individual elements that make up an implicit tensor-loop assignment, as in Listing 2: the left-hand side of this expression expands to an iBinaryOp of set (3), whereas the right-hand side expands to an iBinaryOp of arbitrary complexity. These two elements are combined via the member assignment operator operator=() of the left-hand side’s type.
Listing 11 indicates the structure of these assignment operators. There are several such assignment operators, depending on the type of operation (=, +=, -=, *=, /=) and on the right-hand side type (iBinaryOp, DataMesh, double). Only some combinations of these are mathematically permissible, and only those are defined. As appropriate, these operators check that the free tensor indices on the left-hand side and the right-hand side match, and they resize the data on the left-hand side. All these operators then call a templated free function TLoopApply for the actual computations. This allows us to handle the different types of assignment (=, +=, -=, ...) without code duplication.
Listing 12 executes the actual calculations, and as such requires detailed explanations:

- TLoopApply() starts with safety checks: CheckUniqueIndices is a compile-time check that there are no repeated tensor indices on the left-hand side. This test catches, for instance, the typo “dg(Sym<0,1>(), i_, i_)” on the left-hand side of Listing 2, which is mathematically forbidden. CheckSymmetries() verifies that the stated symmetries in the assignment (e.g. Sym<0,1>() in Listing 2) agree with the runtime symmetry state of the respective tensor. Because of SpEC’s design decision that symmetries of Tensor<X> are set at runtime, this test can necessarily only trigger runtime errors.

- The loop for(lhs.Reset(); lhs; ++lhs) forwards directly to the corresponding member functions of TIndexStructure shown in Listing 8. TIndexStructure uses recursive template-pack expansion to recurse through all tensor indices. The loop modifies the static member variables int TIndex<dim, label>::mValue of the TIndex types occurring on the left-hand side, cf. Listing 6. During the ++lhs increment, these variables are reset whenever they reach their upper bound. In this case, the TInequality parameters indicate the position of a potential other index with which an inequality (arising from a tensorial symmetry) must be satisfied. If so, int TIndexStructure::GetNthIndex<int>() retrieves the current value of this other index, which is used in resetting the index under consideration. Overall, the assignment dg(Sym<0,1>(), i_, j_) = ... in Listing 2 results in the loops
for(j=0; j<3; ++j) { for(i=j; i<3; ++i) { ... } }
where ‘i’ represents TIndex<3,0>::mValue and ‘j’ represents TIndex<3,1>::mValue.
- Inside the loop in Listing 12, we must now index each Tensor<DataMesh> on the right-hand side ‘rhs’ with the current set of index values as stored inside the respective TIndex<dim,label>::mValue. Upon such indexing, each Tensor<DataMesh> in the right-hand side expression becomes a standard SpEC DataMesh, and the expression tree becomes a regular DataMesh expression-template tree of type RHS::ExpandIndices_t (cf. Listing 10). The helper class BinaryOpHolder recursively descends through the structure of ‘rhs’ and builds an instance of the DataMesh expression with all data references pointing to the appropriate elements of each Tensor<DataMesh>.

- Finally, the actual assignment happens in ApplyOp::modify(). This member function of the marker classes SetEqualOp, PlusEqualOp, MinusEqualOp takes its second argument (i.e. the DataMesh expression-template representation) and assigns/adds/subtracts it to/from its first argument (the DataMesh returned by indexing the left-hand side with the current set of tensor indices), thus triggering execution of SpEC’s DataMesh expression-template code.
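The loop expansion described above, for a symmetric left-hand side such as dg(Sym<0,1>(), i_, j_), can be sketched directly (names are ours):

```cpp
#include <utility>
#include <vector>

// The loop structure generated for a symmetric index pair on the
// left-hand side: the inner index starts at the current value of the
// outer one, so each unordered pair is visited exactly once.
std::vector<std::pair<int, int>> symmetric_pairs(int dim) {
    std::vector<std::pair<int, int>> visited;
    for (int j = 0; j < dim; ++j)
        for (int i = j; i < dim; ++i)   // i >= j, as required by TInequality<0,1>
            visited.push_back(std::make_pair(i, j));
    return visited;
}
```

For dim = 3 this visits exactly the six independent components of a symmetric rank-2 tensor, matching the element count from Section 2.2.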
2.5 Sum operations
Let us now turn to an exposition of contractions as in Listing 3.
The goal is to transform the expression Sum(k_, op[k_]) (schematically) into the expression op[0]+op[1]+op[2]. Here ‘op’ indicates a tensor-loops expression which may have an arbitrary number of free indices; these should remain intact in the output expression. In our code, this is implemented with a template class PartialSum<curr_dim, TIndex_t, iBinaryOp>. An instance of this class is responsible for handling the index value curr_dim of the tensor index TIndex_t. This class recursively decrements curr_dim via inheritance from PartialSum<curr_dim-1, TIndex_t, iBinaryOp>, and during the recursion assembles the full sum. The corresponding code is shown schematically in Listing 13.
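The recursive unrolling can be sketched as follows (a simplification: the summand here is a plain callable over the index value rather than a full iBinaryOp, and the names are ours):

```cpp
// Simplified sketch of the PartialSum recursion (cf. Listing 13).
template <int curr_dim>
struct PartialSum {
    template <class F>
    static double eval(const F& op) {
        // add the contribution of index value curr_dim-1, then recurse
        return PartialSum<curr_dim - 1>::eval(op) + op(curr_dim - 1);
    }
};
template <>
struct PartialSum<0> {               // terminates the recursion
    template <class F>
    static double eval(const F&) { return 0.0; }
};

// Sum over a dimension-3 index: schematically op[0] + op[1] + op[2].
template <class F>
double Sum(const F& op) { return PartialSum<3>::eval(op); }
```

Because the recursion depth is a template parameter, the sum is fully unrolled at compile time; no runtime loop over the contracted index remains.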
In principle, the framework presented here could detect implicit sums even without the explicit Sum(...), by watching via metaprogramming for duplicate TIndex<.,.> in operator*. Walter Landry’s FTensor behaves in this way and presents the choice as a design feature Landry (2012). We choose not to implement such implicit-sum functionality for two reasons. First, it would leave the evaluation order to C++ operator precedence, and it is not guaranteed that C++ precedence rules result in an optimal evaluation. The requirement to explicitly place Sum(...) in the code forces the user to make an explicit choice of how terms are grouped, thus exhibiting the FLOP implications more clearly. Second, sums exponentially increase the number of FLOPs in an expression. The explicit occurrence of Sum, especially when repeated multiple times in the same expression, acts as a signal for potentially very expensive operations.
3 Automatic code generation
In this section we describe TLoops’ automatic code-generation functionality, which represents TLoops expressions with equivalent C or CUDA code. First, SpEC is compiled with certain options that encode each TLoops expression it uses. Next, an executable called CodeWriter iterates through the encoded expressions and outputs new code for each. SpEC is then recompiled with this new code, which replaces the expression templates described in Section 2 at link time. This gives in total four different SpEC compilation variants: NonAccel, as always; CodeWriter, to output the new equivalent code; AccelCPU, to link in the automatically generated C code; and AccelCUDA, to link in the automatically generated CUDA code.
Let us first give a high-level overview of the tools which generate this code. As described in Section 2, TLoops expressions are represented as trees. Each node in the tree represents either an operator, in which case it has one or two subnodes, or actual data (of type double, DataMesh, or indexed Tensor<DataMesh>), in which case it has no subnodes and we call it a “leaf”. The root node represents the type of assignment (=, +=, -=, *=, or /=). In Section 2 this expression tree was built at compile time with recursive templates, with the single goal of executing the encoded calculation.
In Figure 1 we illustrate the tree structure appropriate to an example operation. The top panel shows the TLoops source-code expression. The second and third panels illustrate the expression tree. In the second panel, the tree is illustrated by a shorthand representation of the expression template. BOp, in particular, stands in for the iBinaryOp introduced in Listing 9 and the surrounding text.
The third panel illustrates the tree recursion performing automatic C code generation. First, we generate an appropriate set of for loops from the index structure of the left-hand-side tensor. We next iterate through the tree nodes representing operators, outputting variables that represent concrete data (such as d0 for a double) from the child leaves of each node.
We now turn to equivalent code output in multiple languages (C and CUDA), illustrated in Figure 1. Performing this output using compile-time templates proved cumbersome, since such template-based code is difficult to write and debug, and is too inflexible for our diverse goals. We have therefore developed a secondary runtime representation of the expression tree as concretely instantiated C++ classes, which works as follows:

- The abstract class TExpressionLeafBase represents a leaf in the expression tree, with concrete derived classes for each type of leaf, such as double, DataMesh, or indexed Tensor<DataMesh>.

- The abstract class TExpressionOperatorBase represents operators, with one concrete derived class for each (+, *, sqrt, etc.).

- class TExpressionNode represents a node in the expression, which may be a leaf or an operator. This is the class which forms the actual tree structure, and which handles the recursion. It holds pointers to any child TExpressionNodes, as well as a TExpressionLeafBase* pointer, if a leaf, or a TExpressionOperatorBase* pointer, if an operator.
With the above class representation of an expression in hand, we are now ready to output actual code. Conceptually, each class triggers output of whatever code fragment it represents:

- A TExpressionOperatorBase has member functions to output its string representation (‘+’, ‘sqrt’, etc.). These member functions place the operands in the right places, e.g. on either side of binary operators like ‘+’, or within the parentheses of unary operators like ‘sqrt()’.

- A TExpressionLeafBase has member functions to output any code fragments directly involving the operand. For example, the member function std::string VarDeclaration();, which outputs variable declarations for the C-style code, outputs const double d0; in the case of the first double on the right-hand side, or const double* TDm1[3]; in the case of the second Tensor on the right-hand side, which has rank 1 and dimension 3.

- A TExpressionNode has member functions PrintExpression and PrintCUDAExpression. In the case of a leaf, these call the appropriate code-output function from TExpressionLeafBase. In the case of an operator, they call the output functions of the associated TExpressionOperatorBase along with those of the next child TExpressionNode, with parentheses formatted appropriately depending on whether the operator is unary or binary.
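The recursive print mechanism can be sketched with a much-simplified node hierarchy (our own reduced analogue of the TExpression classes; here a node is either a named leaf or a binary operator):

```cpp
#include <memory>
#include <string>
#include <utility>

// A node is either a leaf holding a variable name or a binary operator
// with two children; PrintExpression() recursively emits C-style code.
struct Node {
    virtual ~Node() {}
    virtual std::string PrintExpression() const = 0;
};

struct LeafNode : Node {
    std::string name;
    explicit LeafNode(std::string n) : name(std::move(n)) {}
    std::string PrintExpression() const override { return name; }
};

struct BinaryNode : Node {
    std::string op;
    std::unique_ptr<Node> left, right;
    BinaryNode(std::string o, std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : op(std::move(o)), left(std::move(l)), right(std::move(r)) {}
    std::string PrintExpression() const override {
        return "(" + left->PrintExpression() + " " + op + " "
                   + right->PrintExpression() + ")";
    }
};
```

A tree built from these nodes prints a fully parenthesized C expression; emitting CUDA instead of C only changes the strings each node produces, not the recursion.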
So far, we have described the structure and operation of a fully initialized TExpressionTree, and how such a tree yields the desired output code. We now turn to the construction of these TExpressionTrees. First, we must interface the templated representation of an expression (iBinaryOp<L, Op, R>) with this new C++ class-based representation. To do so, we proceed as follows:

We define a set of C++ functions that are templated on the expression-template representation. These functions call one another recursively, and in this way they recurse through the expression-template representation in the appropriate order, returning at each stage the relevant part of the class-based representation, i.e. the correct TExpressionNodes.

We further define a templated wrapper class ConcreteTExpressionTreeHouse which constructs the class representation and which provides convenient functions to interact with it. This class is derived from an abstract base class TExpressionTreeHouse, which hides the type of the concrete expression template. Thus, having a pointer to TExpressionTreeHouse, surrounding code can interact with the expression in a type-agnostic way, making it possible to loop over different expressions and output code.
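The recursive construction can be sketched with function-template overloads: overload resolution on the nested template type drives the recursion. The types and node class below are hypothetical, simplified stand-ins; SpEC's real iBinaryOp carries much more information (index structure, operand storage, and so on).

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-ins for TLoops' expression-template types.
struct AddOp {};
struct TensorLeaf {};
struct DoubleLeaf {};
template <class L, class Op, class R> struct iBinaryOp {};

// Simplified class-based node produced from the templated representation.
struct Node {
  std::string label;
  std::unique_ptr<Node> left, right;
};

// One overload per template pattern: overload resolution performs the
// recursion through the nested template type, building nodes inside-out.
std::unique_ptr<Node> MakeNode(TensorLeaf) {
  auto n = std::make_unique<Node>();
  n->label = "Tensor";
  return n;
}
std::unique_ptr<Node> MakeNode(DoubleLeaf) {
  auto n = std::make_unique<Node>();
  n->label = "double";
  return n;
}
template <class L, class R>
std::unique_ptr<Node> MakeNode(iBinaryOp<L, AddOp, R>) {
  auto n = std::make_unique<Node>();
  n->label = "+";
  n->left = MakeNode(L{});   // recurse into the left sub-expression type
  n->right = MakeNode(R{});  // recurse into the right sub-expression type
  return n;
}
```

Calling MakeNode on an iBinaryOp<TensorLeaf, AddOp, DoubleLeaf> yields a "+" node with "Tensor" and "double" children, mirroring how the real code returns the correct TExpressionNodes at each stage.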
At this stage, we are faced with the task of constructing one ConcreteTExpressionTreeHouse for each distinct expression type in SpEC, and collecting pointers to the TExpressionTreeHouseBase in one large list, so that we can iterate over the trees. For this, we utilize existing code in SpEC named Factory, which implements the factory design pattern Gamma et al. (1994).
Conceptually, for each abstract base class in SpEC, there is a database called Factory within which each concrete derived class registers itself, providing an ID string along with a pointer to a function that creates the concrete derived class. After registration, Factory can be passed ID strings, causing it to call the relevant create function and return a pointer to a newly created instance of the polymorphic class.
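A minimal sketch of this factory scheme, in the style described, might look as follows. SpEC's actual Factory is more general (one database per abstract base class); the class names and the Creator signature here are illustrative.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Illustrative base class; stands in for TExpressionTreeHouseBase.
struct TExpressionTreeHouseBase {
  virtual ~TExpressionTreeHouseBase() = default;
  virtual std::string Name() const = 0;
};

class Factory {
 public:
  using Creator = std::function<std::unique_ptr<TExpressionTreeHouseBase>()>;

  // Called by each concrete derived class to register itself under an ID.
  static bool Register(const std::string& id, Creator create) {
    Registry()[id] = std::move(create);
    return true;
  }

  // Given an ID string, call the matching create function.
  static std::unique_ptr<TExpressionTreeHouseBase> Create(const std::string& id) {
    return Registry().at(id)();
  }

  // The database of ID string -> create function.
  static std::map<std::string, Creator>& Registry() {
    static std::map<std::string, Creator> registry;
    return registry;
  }
};

// A concrete derived class, named after a hypothetical expression type.
struct AddTreeHouse : TExpressionTreeHouseBase {
  std::string Name() const override { return "Add_Td_Td"; }
};
```

After AddTreeHouse registers itself under "Add_Td_Td", passing that ID string to Factory::Create returns a fresh instance through the base-class pointer.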
We use the SpEC Factory as follows. When SpEC is compiled with a dedicated flag, each TLoopApply template function (the function which triggers evaluation of the expression template, and which is thus itself uniquely templated on the expression) activates extra code that defines a static variable.
In the above, ... represents a unique string constructed from the complete template type by recursive calls to a function TNameHelper, which maps expression-template types to string fragments.
During standard execution, these extra variables have no effect. They instead become important when linked into the CodeWriter executable. Upon initialization of the static variables at the start of CodeWriter execution, each registered variable triggers the relevant call to Factory::Register, thus building a database containing all expression templates occurring within the object files. Each entry in this database is now represented by a concrete derived class of the CodeWriter base class, making the list of expression templates available at runtime to the executable. At that point CodeWriter iterates through all these concrete derived classes, creates one instance of the expression-tree class representation from each, and calls the relevant member functions to output C and CUDA code.
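The registration trick can be sketched as below. Each distinct instantiation of the function template owns its own static variable, whose initializer performs the registration exactly once; the registry here is an illustrative stand-in for Factory::Register, and the expression types are hypothetical. (The function-local static in this sketch registers on first call; as described above, the real code uses statics initialized at program start.)

```cpp
#include <cassert>
#include <set>
#include <string>

// Illustrative stand-in for the Factory: we only record ID strings.
std::set<std::string>& RegisteredIDs() {
  static std::set<std::string> ids;
  return ids;
}
bool RegisterID(const std::string& id) {
  RegisteredIDs().insert(id);
  return true;
}

// Each distinct Expr instantiation gets its own static 'registered' variable,
// so every expression type linked into the executable registers itself once.
template <class Expr>
void TLoopApply() {
  static const bool registered = RegisterID(Expr::Name());
  (void)registered;  // suppress unused-variable warnings
  // ... normal evaluation of the expression template would follow ...
}

// Hypothetical expression types standing in for distinct iBinaryOp types.
struct ExprA { static std::string Name() { return "Add_Td_Td"; } };
struct ExprB { static std::string Name() { return "Mult_Sc_Td"; } };
```

Calling TLoopApply<ExprA>() any number of times registers "Add_Td_Td" only once, while each new expression type adds one entry to the database.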
Listing 14 illustrates the iteration-through-templates procedure performed by CodeWriter. In that Listing, the line marked //* retrieves from the Factory a list of all possible ID strings which represent derived classes of TExpressionTreeHouseBase, and thus all expression templates. CodeWriter then iterates through that list and, in the line marked //**, constructs a concrete instance of ConcreteTExpressionTreeHouse for each expression.
CodeWriter outputs into three files. The first contains functions whose arguments are templated on the appropriate TLoops expression template; these functions route to either the CUDA or the CPU code depending on which of AccelCPU or AccelCUDA is defined. The second extracts the actual arrays of pointers from the iBinaryOps passed to the function and passes them on. In the CUDA case this array of pointers must be copied to the GPU via an API call, which can be a significant extra expense; to avoid it we make the copy only once, repeating it only if the structure of the Tensor changes. The third file contains the actual functions and, in the CUDA case, tuning arguments for the kernels, which are chosen based on the tensor structure of the left-hand side (see Section 5 for details).
4 Design Considerations of Automatically Generated CUDA Code
Section 3 detailed the tools we have developed to automatically generate C and CUDA code for TLoops operations. In this section we describe the structure of that code with an eye to its performance.
First, let us give some examples of C-style code generation. Consider, for example, the TLoops expression
which symmetrically contracts the rank-2 tensors A and B over their shared index. CodeWriter generates the following C-style code from Listing 15:
Recall that N is the spatial gridsize. In Listing 16, the loop variable a in the marked for loop is initialized to b rather than to 0, due to the Sym<0,1>() flag in Listing 15. The for loops run up to 4 due to the use of a_, b_ rather than i_, j_ indices, which would generate loops running to 3. Indexing may be further controlled by ‘fixing’ indices (e.g. by specifying an integer value, such as 1, instead of an index such as i_), or by specifying index “offsets” such as i_+1. Thus, the expression of Listing 17 generates the following C-style code; note that in this case the Sym<0,1>() flag has no effect.
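To make the shape of the generated C-style code concrete, the following is a hand-written sketch in the spirit of Listing 16, for a symmetric contraction of two dimension-4 tensors. The array-of-pointers tensor layout matches the description in the text, but the function name and exact loop structure are illustrative, not CodeWriter's literal output.

```cpp
#include <cassert>

// Sketch of generated C-style code for C_{ab} = A_{ai} B_{ib} with Sym<0,1>().
// Each A[i][j] is a pointer to a length-N array over the spatial grid.
void SymContract(const double* A[4][4], const double* B[4][4],
                 double* C[4][4], int N) {
  for (int b = 0; b < 4; ++b)
    for (int a = b; a < 4; ++a)      // a starts at b: the Sym<0,1>() flag
      for (int x = 0; x < N; ++x) {  // loop over the spatial grid
        double sum = 0.0;
        for (int i = 0; i < 4; ++i)  // serial loop over the contracted index
          sum += A[a][i][x] * B[i][b][x];
        C[a][b][x] = sum;            // only components with a >= b are written
      }
}
```

With A set to the identity at every gridpoint, the contraction simply reproduces B in the written (a >= b) components.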
Let us now demonstrate our automated CUDA code, starting with a simplified sample generated from the expression in Listing 19. This differs from Listing 15 only in that the symmetry flag has been removed. In CUDA, instructions to the GPU are collected into function-like entities called kernels. The kernel generated from Listing 19 closely resembles the following:
In the real code we use __restrict__ flags on the pointer arguments for performance reasons. On the GPU, it is advantageous to store the tensor indices in a single, flattened array, since pointer indirections are relatively expensive. Similarly, unrolling expressions which on the CPU would have appeared as for loops prevents unnecessary serialization.
The variables threadIdx.y, blockIdx.y, etc., are used by CUDA to manage parallel data access. In CUDA, computations are abstracted as a three-dimensional grid of blocks, in turn composed of threads. Each thread represents a discrete computational process that executes the instructions in the kernel. While different threads issue the same instructions, they normally do so upon different data, since they may address memory using their unique block and thread indices (threadIdx.y, etc.). for loops are generally replaced with if statements such as that in Listing 20, which ensure that no thread makes an out-of-bounds array access.
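The indexing pattern just described can be simulated on the CPU as follows. In real CUDA code, blockIdx/threadIdx are supplied by the hardware and all (block, thread) pairs execute in parallel; here we enumerate them serially to show the logic, and the names are simplified stand-ins.

```cpp
#include <cassert>
#include <vector>

struct Dim3 { int x; };  // stand-in for CUDA's built-in index variables

// The kernel body: each (block, thread) pair computes one gridpoint address,
// and the if-guard replaces a for loop, preventing out-of-bounds accesses.
void KernelBody(Dim3 blockIdx, Dim3 blockDim, Dim3 threadIdx,
                const double* a, const double* b, double* c, int N) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;  // unique per-thread address
  if (x < N)
    c[x] = a[x] + b[x];
}

// The "launch" enumerates every (block, thread) pair the GPU would run.
void Launch(int nblocks, int blocksize, const double* a, const double* b,
            double* c, int N) {
  for (int blk = 0; blk < nblocks; ++blk)
    for (int t = 0; t < blocksize; ++t)
      KernelBody({blk}, {blocksize}, {t}, a, b, c, N);
}
```

With 2 blocks of 4 threads covering 5 gridpoints, the three surplus threads simply fail the guard and write nothing.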
Physically, GPU resources are divided into streaming multiprocessors (SMs) composed of tightly coupled processing cores. Cores in a given SM execute instructions in lockstep, and share certain memory resources with one another besides the global memory accessible to the entire GPU. Dividing threads into blocks, which are always local to a particular SM, enables such resources to be safely utilized. Although we make no use of such resources, the blocksize is nevertheless important for us, since a single SM may operate upon only a certain number of blocks at one time. The SM hides latency by switching between those blocks when one stalls (for example because of a data dependency). A poor blocksize choice can result in low “occupancy”, which impairs this ability, since there are insufficient blocks to switch to.
For this and other reasons, it is important to appropriately “tune” the kernel launch, via the choice of the number of blocks in the grid (nblocks) and the number of threads in each block (blocksize). In the case of Listing 19, those arguments are fixed by the corresponding “wrapper” code:
We expose the parallelism of the data-independent spatial gridpoints by devoting an entire logical dimension of the CUDA grid to them. We then use the four remaining thread addresses to parallelize over the indices of the left-hand-side tensor, since the corresponding arrays are also data-independent. In principle, we could also implement the Sum operator as a parallel reduction; however, since this dramatically complicates automatic code generation and offers no advantage at dimensions 3 or 4, we instead perform contractions in serial, as demonstrated in Listing 20.
There are two cases in which we cannot parallelize over all the LHS components. The first is that of a symmetry. For example, Listing 15, which has a symmetry between the a and b indices, generates the following kernel:
Because the number of passes through the for loop marked //* in Listing 22 depends on the value of b, parallelization of that loop would require different CUDA threads within a block to execute different instructions. Since all the threads in a particular SM share the same control circuitry, this is not possible. CUDA deals with such so-called “divergent execution paths” by having the SM run each path in serial. We avoid this by serializing explicitly.
The second case we cannot parallelize is the unusual one of an LHS tensor of rank greater than four. This exhausts the number of independent CUDA thread addresses, and so we must serialize over the extra indices.
Device          Bandwidth (theoretical, GB/s)  Bandwidth (measured, GB/s)  Processing Power (GFLOP/s)
CPU             42.7                           —                           8.0
M2090           177.6                          123                         665.5
K80 (one card)  280                            170                         932–1456
P100            720                            449                         4036–4670
We tune our kernels using the following simple rules, designed to achieve maximum or high occupancy on all the GPUs in Table 1. We use blocksize_x and nblocks_x (compare Listing 21) to parallelize across the spatial grid, which leaves us four independent parameters with which to parallelize across LHS tensor indices. Each of these is used to parallelize a different index, and is thus set either to 1 (no index) or to the dimension of the relevant index (3 or 4). The product blocksize_y*blocksize_z therefore takes one of a few small values.
We must then set blocksize_x in order to control the total blocksize. This must be a multiple of 32, or else cores will be left idle, since GPU instructions are issued to groups of 32 threads in lockstep. An optimal total blocksize, allowing each SM to fully utilize its compute resources, is one of a few values that depend on the particular GPU in question; on the M2090, for example, there are several such values, two of which achieve maximum occupancy across all three cards. When blocksize_y*blocksize_z divides one of these optimal totals exactly, we set blocksize_x to the corresponding quotient; otherwise, we set blocksize_x to a near-optimal value. Note that this algorithm limits the maximum gridsize that our code can handle, though the limit is always in the millions. This limit could easily be removed by, for example, serializing over extremely large grids, but this has not been necessary for our purposes.
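The warp-alignment constraint can be illustrated with a small helper. This is a hypothetical sketch, not SpEC's actual tuning routine: it picks the largest blocksize_x not exceeding a target total such that the whole block is a multiple of the 32-thread warp size, whereas the real code consults device-specific tables of optimal totals.

```cpp
#include <cassert>

// Pick blocksize_x so that yz * blocksize_x is a multiple of the warp size,
// without exceeding target_total. Illustrative only.
int ChooseBlocksizeX(int yz, int target_total) {
  const int warp = 32;
  int bx = target_total / yz;              // largest candidate within the target
  while (bx > 1 && (bx * yz) % warp != 0)  // shrink until the total is warp-aligned
    --bx;
  return bx;
}
```

For example, with blocksize_y*blocksize_z = 3 and a 256-thread target, the helper settles on blocksize_x = 64, giving a 192-thread block (the largest warp-aligned total below 256).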
Each SM has a single “register file” of extremely fast RAM used to store variables allocated within a kernel. During kernel launch, those registers are logically allocated to individual threads as necessary. If a kernel’s register demands are such that running all possible threads would exhaust the register file, CUDA will restrict the number of blocks assigned to each SM to compensate, thus lowering occupancy and possibly invalidating the above calculations, in which we assume this does not happen. Our benchmarks (cf. Table 2) show this assumption is usually, but not always, borne out.
The synchronization of Tensors presents an additional complication for CUDA which is not present on the CPU. Recall that the individual arrays representing components of a Tensor are not contiguous on the CPU, since that class does not assume those components have identical memory footprints, and in fact permits modification of those footprints after construction, for example by reshaping the component DataMeshes. The CPU Tensor instead maintains an array of pointers, each addressing a particular tensor element.
This design choice is sufficiently inextricable from SpEC that we must work around it. But the obvious solution of maintaining an equivalent array of pointers on the GPU has a significant performance impact if handled naively. The pointer array can change after construction, so we cannot simply create a GPU equivalent once and assume it will always be correct. On the other hand, the high latency of GPU array allocation and synchronization makes copying a fresh array with each kernel launch unacceptable.
Instead, each Tensor is paired with a set of “GPUPointers” that store a copy of the CPU pointer array, the GPU pointer array, and a reference to the relevant Tensor. When the GPU pointer array is retrieved, we first check that the copied CPU array is identical to the one actually present in the Tensor, synchronizing only if it is not. This keeps the number of necessary synchronizations to a bare minimum. If the GPU array is never retrieved we do not create it at all, so that no extra overhead is incurred if a Tensor never encounters a TLoops kernel.
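The lazy-synchronization scheme can be sketched as below. The class and member names are illustrative, and the device upload is simulated by a host-side copy where a real implementation would call the CUDA API.

```cpp
#include <cassert>
#include <vector>

// Sketch of the "GPUPointers" caching scheme described above.
class GpuPointers {
 public:
  int syncs = 0;  // number of (expensive) uploads performed, for inspection

  // Return the device-side pointer array, re-uploading only if the CPU-side
  // pointer array has changed since the last call.
  const std::vector<double*>& Get(const std::vector<double*>& cpuNow) {
    if (cpuNow != cachedCpu_) {  // cheap host-side comparison
      cachedCpu_ = cpuNow;
      gpuCopy_ = cpuNow;         // stand-in for the host-to-device copy
      ++syncs;
    }
    return gpuCopy_;
  }

 private:
  std::vector<double*> cachedCpu_;  // copy of the last-uploaded CPU array
  std::vector<double*> gpuCopy_;    // simulated device-side array
};
```

Repeated retrievals with an unchanged pointer array cost only the comparison; an upload happens again only when the Tensor's pointer array actually changes (e.g. after a reshape).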
5 Benchmarks
5.1 Methodology
We now turn to benchmarks of both the automatically generated code and the expression-template implementation. Given the wide range of potential expressions, hardware, compilers, and compiler options, a fully comprehensive study is not possible, but we aim here to give a broad picture of our code’s behaviour.
Our kernels make no explicit attempt to reuse data once loaded, and we therefore expect them to be bandwidth-bound or latency-bound (i.e. the limiting factor in their performance is either the memory bandwidth of the device or the latency between successive instructions). A useful performance metric in this case is the “effective bandwidth” BW: the ratio of the amount of information that must be read and written by an operation to the time taken to execute that operation in practice. We measure BW in GB/s. BW is maximal for a kernel which simply copies data, and decreases as operations spend significant time on computations or extraneous memory operations. Calling N the spatial gridsize, N_T the total number of tensor elements involved in the operation, n_d the number of doubles occurring outside of a DataMesh array, and t the execution time, we have (at 8 bytes per double)

(9)  BW = 8 (N N_T + n_d) / t.
For example, for the expression in Listing 15, N_T counts the elements of A, B, and C, with C contributing fewer elements than A or B due to the symmetry; likewise, for the expression in Listing 17, N_T counts the elements of D, F, and E.
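As a numerical sketch of Eq. (9): the helper below converts gridsize, element count, and execution time into GB/s. The example values (a million gridpoints, N_T = 42, a 1 ms runtime) are hypothetical, chosen only to exercise the formula.

```cpp
#include <cassert>
#include <cmath>

// Effective bandwidth of Eq. (9) in GB/s: each of the N * N_T gridpoint
// doubles plus the n_d standalone doubles is 8 bytes; t is in seconds.
double EffectiveBandwidthGBps(double N, double N_T, double n_d, double t) {
  return 8.0 * (N * N_T + n_d) / t / 1.0e9;
}
```

With the hypothetical values above, 8 × 42 × 10^6 bytes moved in 10^-3 s gives BW = 336 GB/s.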
We begin with basic operations typical in relativity. We benchmark each such operation using three-dimensional tensor indices, at three levels of complexity. These operations are assignments,
(10)  
(11)  
(12) 
additions,
(13)  
(14)  
(15) 
outer products,
(16)  
(17)  
(18) 
and contractions
(19)  
(20)  
(21) 
We furthermore benchmark two practical expressions that actually occur in numerical relativity. The first mixes scalars, tensors, and outer products,
(22) 
and the second (Eq. (6)) computes the spatial Christoffel symbols of the second kind, thus also including contractions.
Finally, we present under the label “GH” our TLoops port of the actual SpEC module which solves the generalized harmonic equations in the eponymous formulation of relativity theory. Roughly speaking, this code computes the second time-derivative of the spacetime metric as a function of its first and second spatial derivatives. As such, the input data is primarily the metric, its spatial derivatives, and its first time-derivative; in total there are 22 input arrays. The equations involved consist of 25 separate TLoops expressions, with as many as four LHS indices, and as many as two contractions on the RHS. Overall, we estimate 381 spatial-gridsize arrays in the numerator of Eq. (9) for the GH operation, so that GH is about 4–100 times more bandwidth-intensive than the other benchmarked expressions. This is reflected in the raw execution times: GH takes about 10 times longer to execute than the other benchmarks, though the overall execution time is normalized away by plotting BW.

TLoops can also handle transcendental functions and many other unary functions. Such functions occur often enough that automatic code generation is warranted, but they use only a marginal fraction of overall runtime, and so we do not benchmark them at present.
We benchmark each expression on the following combination of hardware and codepath:

Automatically generated CUDA code executing on the three NVIDIA GPUs in Table 1, namely M2090, K80, and P100.

Automatically generated C code executing on the host processor (labeled ‘AccelCPU’).

The TLoops expression templates executing on the host processor (labeled ‘NonAccel’).

The original SpEC code without TLoops code simplifications, as an overall baseline (labeled ‘SpEC’).
The CPU code was compiled using gcc 4.8.1 with the -O3, -fPIC, and -std=c++11 compiler flags. We also took benchmarks using Intel 15.0.2 and the compiler flags -O3 -xHost -fPIC -fp-model precise -std=c++11. The Intel code usually gives comparable or worse results than gcc (cf. Figure 4), and so for visual simplicity only the gcc results are displayed in Figures 2, 3, and 5. The CPU timings were performed on a single core of an Intel Xeon E5-2620 CPU, which has a clock frequency of 2.0 GHz, a theoretical bandwidth of 42.7 GB/s, and a theoretical double-precision processing power of 8.0 GFLOP/s (cf. Table 1). We sampled gridsizes at multiples of 32, with decreasing resolution at increasing gridsize.
We have not made a systematic study of CPUs and compiler options, and do not intend for these results to reflect the potential CPU performance of our code. Improved results could almost certainly be achieved using a CPU with a higher clock frequency and e.g. vectorized instructions over multiple cores. This machine and these compiler options are, however, representative of realistic conditions under which SpEC might presently run. In particular, limitations to parallelism imposed by SpEC’s implementation of multidomain pseudospectral methods restrict it to a single CPU core per expression.
The M2090’s host processor is the same as that used for the CPU test; we compiled the CPU code in this case using the Intel compiler with the same options as above. The K80 and P100 are hosted by somewhat faster POWER8 processors, and the code in these cases was compiled using xlc 13.1.4 with the flags -fPIC, -O3, and -std=c++11. The CPU’s performance should not be relevant to the GPU tests. We compiled the GPU code with CUDA 6.5 on the M2090 (using -arch=sm_20); on the K80 (-arch=sm_37) and P100 (-arch=sm_60) we used CUDA 8.0.
5.2 Benchmarks of simple expressions
We are now ready to present benchmarks for the expressions corresponding to Eqs. (10)–(18) on each of the various hardware and code-path combinations. We execute each benchmark 21 times, discard the first, and take the median.
The results are summarized in Figure 2. Each panel of that figure corresponds to one particular expression, with the horizontal axis indicating gridsize and the vertical axis indicating performance. We express performance in terms of the effective bandwidth BW (cf. the discussion in Section 5.1 above).
Let us now discuss and interpret Figure 2. Focusing first on the GPUs, ‘M2090’, ‘K80’, and ‘P100’ all show the same overall trends. Execution time is roughly constant in the gridsize until occupancy is saturated, after which point it increases linearly. Since the number of memory transactions increases linearly with gridsize throughout, BW shows a linear increase up to the point of saturation, after which it is constant. Since all the GPUs have essentially the same single-thread performance, they perform essentially identically until their respective points of saturation; however, newer cards (especially the P100) can support more parallel threads, resulting in a later point of saturation and a higher BW.
The saturation gridsize is most importantly determined by the left-hand-side tensor rank: higher-rank tensors saturate earlier. For example, in Figure 2 the saturation gridsizes are almost identical for operations whose left-hand sides share the same rank, such as the addition operations Eqs. (13)–(15). This reflects our code’s parallelism over tensor indices, which is crucial for achieving good performance at realistic gridsizes. The pattern would not persist past rank 4 or for symmetric indices, since we serialize in these cases. Post-saturation performance is mostly independent of the operation and is usually quite close, within a small factor in the worst case, to the measured bandwidths from bandwidthTest.
In contrast, the three CPU execution paths do not suffer from high latency, and so their BW curves are generally quite flat with respect to gridsize. Nevertheless, several patterns are visible in the relative execution speed of the three CPU execution paths. Except sometimes for assignments (Eqs. (10)–(12)) at very large gridsize, the expression-template code (NonAccel) usually gives worse performance than either the automatically generated C code (AccelCPU) or SpEC without any TLoops simplifications (SpEC), by a factor of between about 3 and 10. These results are roughly in line with those obtained from Walter Landry’s FTensor Landry (2012); FTe (2012), which is similar to TLoops running in NonAccel mode; the gap is presumably due to the compiler being less able to optimize the various expression templates.
More unexpectedly, SpEC and AccelCPU do not perform identically. While performance is usually comparable, SpEC is sometimes noticeably superior, especially for less complex operations at small gridsizes. AccelCPU differs from SpEC in two ways. First, the for loops over tensor indices appear directly in the source code in SpEC, whereas AccelCPU routes through a few extra classes before reaching them. While we consider it unlikely, the impaired performance for less complex operations may be due to extra overhead from this routing. Second, SpEC handles the loop over gridpoints using expression templates, while AccelCPU uses an additional for loop. This may result in differing behaviour regarding e.g. the creation of temporaries and the use of cached memory in the machine code.
Comparing now the performance of the GPUs with the CPU execution paths, we note that the CPU generally gives better performance at small gridsize, but is eventually surpassed by the GPU. This is the expected behaviour: the CPU has superior single-thread performance, but the GPU has more capacity for parallelism. Compiler optimizations available to the CPU will likely also result in more efficient reuse of memory than on the GPU; this makes operations on the CPU less costly in memory traffic, but also makes the floating-point performance of the hardware more relevant. Thus, the CPU has less of an advantage at small gridsize for more computationally intensive operations.
5.3 Benchmarks of more complex expressions
Let us now turn to the more complex operations corresponding to Eqs. (19)–(22), Eq. (6), and the GH equations, each described in Section 5.1. We proceed here as in Section 5.2, benchmarking six hardware and code-execution-path combinations as a function of gridsize, with results presented in Figure 3.
Figure 3 shows broadly similar features throughout: GPU performance increases linearly up to a saturation point and is then constant, while CPU performance is nearly flat. In particular, the operation of Eq. (22) (lower left panel of Figure 3) behaves essentially identically to Eq. (17) (lower left panel of Figure 2), to which it is indeed very similar in form.
The contraction operations (Eqs. (19), (20), and (21)) in the top row of Figure 3 show some new behaviour. On the CPU, we first notice that performance, while still independent of gridsize, worsens sharply as we move from left to right between panels. These operations are more strongly compute-bound than those discussed until now, although the form of our automatically generated code does not expose this. Since the CPU performs memory operations relatively better than floating-point computations, its performance degrades for operations involving more of the latter.
We also notice that AccelCPU gives better performance than SpEC for these operations, the opposite of the behaviour observed previously. We can only guess at the reason. Perhaps there is more opportunity for compiler optimization in operations involving more floating-point arithmetic, but the expression templates over gridsize used by SpEC prevent those optimizations from being made.
Turning attention to the GPU curves, we see that the low-gridsize performance is almost exactly identical between panels. Due to the massive parallelism it must support, the CUDA compiler cannot make such aggressive optimizations as can a modern C++ compiler, and so the automatically generated code presumably behaves in the bandwidth-bound manner in which it is written. The saturation gridsize, however, does change, even though the number of LHS indices remains constant. Similarly, the post-saturation performance gets lower as we move from left to right.
This stems from the fact that the SpEC class Tensor is a list of arrays (one array over the spatial grid per tensor component), rather than a single contiguous one. Since the tensors are not contiguous, each memory access actually involves two pointer indirections: one to retrieve the appropriate component array, and one for the spatial gridpoint. For example, an instruction such as d = A[i][j][x] must first load A[i] from global memory, then A[i][j], and finally A[i][j][x]. Since the first two loads are not in principle necessary, we do not include them in the numerator of Eq. (9). That numerator therefore underestimates the true number of memory transactions, and our computed BW will be correspondingly lower.
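The two layouts can be contrasted directly. The sketch below (illustrative names, a rank-2 dimension-3 tensor on a grid of N points) shows that the pointer-table layout needs an extra load from the table before the data load, while the contiguous layout computes the address with pure arithmetic.

```cpp
#include <cassert>
#include <vector>

// SpEC-style layout: a table of per-component pointers. Reading A[i][j][x]
// requires a load from the pointer table before the data load itself.
double LoadViaPointerTable(double* table[3][3], int i, int j, int x) {
  double* component = table[i][j];  // extra pointer indirection
  return component[x];              // the actual data load
}

// Contiguous layout: one flat array, so only the data load touches memory.
double LoadContiguous(const double* flat, int i, int j, int x, int N) {
  return flat[(i * 3 + j) * N + x];
}
```

Both functions return the same value when the pointer table addresses slices of the same flat storage; only the number of memory operations differs.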
            M2090           K80             P100
Operation   Regs   % Occ    Regs   % Occ    Regs   % Occ
(10)        10     100      10     93.8     12     93.8
(11)        10     100      10     93.8     12     93.8
(12)        10     83.3     10     93.8     12     93.3
(13)        14     100      14     93.8     14     93.8
(14)        18     100      16     93.8     15     93.8
(15)        20     100      21     93.8     18     93.8
(16)        14     100      14     93.8     14     93.8
(17)        18     83.3     16     93.8     16     93.8
(18)        21     83.3     21     93.8     18     93.8
(19)        28     72.9     29     93.8     24     93.8
(20)        38     41.7     42     93.8     32     93.8
(21)        50     41.7     56     93.8     48     62.5
(22)        21     87.5     42     93.8     32     93.8
(6)         34     62.5     32     93.8     32     93.8
The extra indirections also result in additional thread latency, since the thread must stall between the successive loads, and since the large number of loads may exhaust the SM’s memory pipeline. This last effect could in principle be alleviated by staggering loads to avoid memory dependencies, but this would complicate automatic code generation considerably. Future improvements will instead focus on making tensors contiguous.
The extra pointers finally result in extra thread-local memory allocations, increasing the kernel’s per-thread register count. Each streaming multiprocessor (SM) in a GPU physically has a single register file that is logically allocated to threads as needed. Each SM is also theoretically capable of simultaneously executing a certain number of warps, each consisting of 32 threads, but only if the per-thread register count is small enough that these warps do not collectively exhaust the register file.
On the M2090, K80, and P100 respectively, this occurs when the per-thread register count exceeds 21, 64, and 32. The SMs on the K80 and P100 have equally sized register files (of 256 kB, compared to 128 kB on the M2090), but the P100 SMs can also potentially execute more warps, resulting in a lower register threshold for maximum occupancy. If this threshold is exceeded by a large margin, occupancy will significantly decrease, resulting in an earlier point of saturation and worse asymptotic performance. We never exceed this threshold on the K80 (Table 2), but it does sometimes become relevant for operations involving contractions.
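The occupancy mechanism just described can be modeled with a short calculation: an SM keeps as many 32-thread warps resident as fit in its register file, capped by a hardware warp limit. The register-file size and warp cap used in the example below are typical round numbers assumed for illustration, not the exact specifications of any of the three cards.

```cpp
#include <algorithm>
#include <cassert>

// Resident warps per SM under a simple register-pressure model.
int ResidentWarps(int regsPerThread, int registerFileRegs, int maxWarps) {
  const int threadsPerWarp = 32;
  int warpsByRegs = registerFileRegs / (regsPerThread * threadsPerWarp);
  return std::min(warpsByRegs, maxWarps);
}
```

With an assumed 64k-register file and a 64-warp cap, a kernel using 32 registers per thread achieves full occupancy, while one using 128 registers per thread is limited to a quarter of it.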
Finally, let us turn attention to the GH operation, in the lower right panel of Figure 3. On the GPU, the transition from linear to constant performance growth is not nearly so sharp as for the single-expression operations. This presumably reflects an averaging out between the many saturation points of the differing expressions within GH. GH also displays (in all cases) noticeably worse performance than its predecessors in this discussion. The GH operation consists of many successive individual kernels, many of which are complex contractions; thus, the above discussion of contractions applies here as well. On top of this, the many individual kernel launches add latency to the GPU execution time.
5.4 Impact of CPU compiler
In Figure 4, we show some benchmarks illustrating the relative performance of the gcc and Intel compilers operating upon our code. Generically, but not always, gcc gives better performance. The difference is most stark for the C++11 expression templates of NonAccel, which run over an order of magnitude faster using gcc throughout. gcc also gets uniformly better performance out of the AccelCPU code, though the difference is less dramatic. Without TLoops (“SpEC”), the two compilers do behave differently, but their relative performance varies between operations.
5.5 Impact of templating over Tensorindices
From the perspective of automatic GPU porting, an alternative approach to TLoops would be to automatically generate code from SpEC’s existing spatial-gridpoint expression templates. For example, in Listing 4, one might automatically generate code only for the interior operation, and not for the full expression including the for loops over i and j. This would be much simpler to write, and would require no source-code modifications at all.
Chronologically, this approach was the first we tried; TLoops was motivated by its strongly negative performance impact on the SpEC code proper. The performance decrease accounting for this is illustrated in Figure 5. Here, we benchmark various expressions using the P100 GPU and the two TLoops CPU execution pathways. For the lines marked Tensors, TLoops is used to represent the full expression, as in Listing 2. For those marked Arrays, it is used only to represent operations over individual Tensor elements, which are therefore surrounded by explicit for loops in source code, as in Listing 1. While the respective performance of the two strategies is comparable on the CPU, on the GPU the full TLoops expressions are vastly superior, particularly at realistic gridsizes.
Templating over tensor indices is advantageous on the GPU for two reasons. First, launching a GPU kernel carries an overhead of about 20 μs. In the array-loop approach this overhead must be paid once per free and contracted index in the operation. Automatically ported operations will usually be small, and in practice launch overhead is very often the dominant expense; a TLoops operation, on the other hand, launches only one kernel. Second, TLoops operations are parallelized over the left-hand-side tensor indices as well as the spatial grid, whereas the array-loop approach can parallelize only over the spatial grid. In principle the array-loop approach could achieve some index-level parallelism via concurrent execution of GPU kernels; however, the aforementioned launch overhead synchronizes the device, preventing concurrent execution in practice.
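The launch-overhead argument amounts to simple arithmetic, sketched below with the ~20 μs figure quoted above. The nine-kernel count in the example is a hypothetical illustration (one kernel per component of a rank-2, dimension-3 left-hand side under the array-loop approach), not a measured figure.

```cpp
#include <cassert>

// Total launch overhead in microseconds for a given number of kernel launches.
double LaunchOverheadUs(int nKernels, double perLaunchUs) {
  return nKernels * perLaunchUs;
}
```

Under these assumptions, the array-loop approach pays 180 μs of launch overhead where a single TLoops kernel pays 20 μs, before any useful work is done.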
6 Conclusion
We have presented a software package, TLoops, which allows tensor-algebraic expressions to be compiled and executed directly in C++ code. TLoops can also automatically generate equivalent C-style or CUDA code from these expressions, which can be linked back in a second compilation. We have shown that this automatically generated code gives identical or comparable performance to the code SpEC uses by default, and that the CUDA code often outperforms the CPU. Even at moderate gridsizes of a few thousand, the CUDA code often comes close to the peak (memory-bound) performance of the GPU.
Significant opportunity remains for improvement. The code at present is intertwined with the rest of SpEC; we hope to separate it into an independent open-source library. There is also room for performance improvements. In particular, we are working on adopting a contiguous Tensor class within SpEC, which will allow for simpler, faster automatically generated code.
We hope the simplifications to coding effort made possible by TLoops may speed the development of future code, inside and outside of numerical relativity.
7 Conflicts of Interest
The authors have no conflicts of interest to report.
8 Acknowledgments
We thank Nils Deppe and Mark Scheel for helpful discussions. Calculations were performed with the SpEC code SpE (e0cm). We gratefully acknowledge support from NSERC of Canada, from the Canada Research Chairs Program, and from the Canadian Institute for Advanced Research. Some calculations were performed at the SciNet HPC Consortium Loken et al. (2010). SciNet is funded by: the Canada Foundation for Innovation (CFI) under the auspices of Compute Canada; the Government of Ontario; Ontario Research Fund (ORF) – Research Excellence; and the University of Toronto.
References
 SpE (e0cm) http://www.blackholes.org/SpEC.html.
 Landry (2012) W. Landry, Implementing a high performance tensor library, http://www.wlandry.net/Presentations/FTensor.pdf, 2012.
 Arndt et al. (2017) D. Arndt, W. Bangerth, D. Davydov, T. Heister, L. Heltai, M. Kronbichler, M. Maier, J.-P. Pelteret, B. Turcksin, D. Wells, The deal.II library, version 8.5, Journal of Numerical Mathematics (2017).
 jer (1998) Tensor objects in finite element programming, International Journal for Numerical Methods in Engineering 41 (1998) 113–126.
 Reynders and Cummings (1998) J. V. Reynders, III, J. C. Cummings, The POOMA framework, Comput. Phys. 12 (1998) 453–459.
 Tisdale (1999) R. Tisdale, SVMT, http://www.netwood.net/~edwin/svmt/, 1999.
 Veldhuizen (1996) T. Veldhuizen, Blitz++, http://www.oonumerics.org/blitz, 1996.
 FTe (2012) http://www.wlandry.net/Projects/FTensor#Benchmarks, 2012.
 Gamma et al. (1994) E. Gamma, R. Helm, R. Johnson, J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley Professional Computing Series, Pearson Education, 1994. URL: https://books.google.ca/books?id=6oHuKQe3TjQC.
 Loken et al. (2010) C. Loken, D. Gruner, L. Groer, R. Peltier, N. Bunn, M. Craig, T. Henriques, J. Dempsey, C.-H. Yu, J. Chen, L. J. Dursi, J. Chong, S. Northrup, J. Pinto, N. Knecht, R. V. Zon, SciNet: Lessons Learned from Building a Power-efficient Top-20 System and Data Centre, J. Phys.: Conf. Ser. 256 (2010) 012026.