1. Introduction
Solving partial differential equations (PDEs) is one of the pillars of computational science and engineering, and solving PDEs accurately on a computer is one of the grand challenges in high-performance computing. Simulations may have billions of degrees of freedom, hence highly scalable codes that make efficient use of the invested energy are required. However, highly efficient software often requires expert knowledge, the resulting code might not reflect the underlying numerical scheme, and code tends to become complex: while on the one hand a software package might want to support several PDEs, several finite element types, or multiphysics simulations, on the other hand developers have to deal with the subtleties of modern hardware architectures, for example vector instruction latency or level-1 cache bandwidth. Moreover, the evolution of hardware architectures may require the adaptation of all implemented combinations or even a radical change of data structures.
A strategy to deal with this complexity is to choose a proper abstraction. For example, in linear algebra many operations may be implemented with only a few simple innermost kernels (e.g. (Van Zee and van de Geijn, 2015)). The advantage is twofold: firstly, only a few kernels need to be optimised by an expert, and secondly, all projects that build upon these kernels may benefit from improvements. Abstracting algorithms in terms of linear algebra operations, especially the General Matrix-Matrix Multiplication (GEMM), is becoming popular for the numerical solution of PDEs, too. High-order discontinuous Galerkin (DG) or spectral element methods usually consist of small element-local operators, such that their implementation may be expressed as a sequence of small GEMMs (Vincent et al., 2016; Uphoff et al., 2017; Hutchinson et al., 2016; Breuer et al., 2017). The matrices are usually small enough that data movement considerations within a GEMM become irrelevant. At the same time, the overhead of calling BLAS and the choice of GEMM microkernels becomes significant, such that specialised code generators for small GEMMs are employed (Heinecke et al., 2016b), or even specialised code generators for small tensor contractions are developed (Breuer et al., 2017). An alternative approach is to express tensor operations in an appropriate intermediate representation, which captures the underlying loop structure, and to use compiler techniques and cost models in order to generate efficient code (Stock et al., 2011; Luporini et al., 2015; Kempf et al., 2018).
The abstraction of tensor operations is not new but has been researched for decades in the field of computational chemistry. Here, one may distinguish two different levels of abstraction: The first level of abstraction is the move from an expression involving several tensors to binary tensor operations. On this level of abstraction, algebraic transformations may be employed. In the so-called strength reduction, for example, the optimal sequence of tensor operations is determined (under a memory constraint) such that the total number of floating point operations is minimised, which is then usually orders of magnitude smaller than for a naive implementation (Lam et al., 1997). The most prominent representative is the Tensor Contraction Engine (TCE) (Baumgartner et al., 2005), but a framework called Barracuda for GPUs also exists (Nelson et al., 2015). The second level of abstraction is the implementation of binary tensor operations, especially of tensor contractions. For the latter, a variety of algorithms exist, such as nested-loop code, Loop-over-GEMMs (LoG), Transpose-Transpose-GEMM-Transpose (TTGT), and GEMM-like Tensor-Tensor multiplication (GETT) (Springer and Bientinesi, 2018, and references therein).
In computational chemistry tensors may become very large, such that intermediate results may not remain in memory but must be loaded from disk (Baumgartner et al., 2005). So despite the fact that quantum chemistry models and high-order DG methods may both be abstracted as tensor operations, the situation is quite different for the latter, as shall be outlined in the following:
- All matrices fit inside memory or even inside low-level caches. There is no need to optimise for memory usage or to fuse loops to trade off memory usage against operation count (Baumgartner et al., 2005).
- Tensor operations do not need to be distributed among nodes.
- All tensor dimensions and respective sparsity patterns in element-local operations are known a priori. Furthermore, spending a lot of time optimising individual tensor operations can be justified, as those are called millions of times during simulations (Uphoff et al., 2017).
Other publications and open-source packages focus on binary tensor contractions (Solomonik et al., 2013; Springer and Bientinesi, 2018; Shi et al., 2016; Li et al., 2015; Matthews, 2018), on GPUs (Nelson et al., 2015), or on loop transformations (Stock et al., 2011; Luporini et al., 2015; Kempf et al., 2018), where the latter lack support for sparse matrices in element-local operators and are, to our understanding, not designed for use with code generators for small GEMMs.
We therefore present Yet Another Tensor Toolbox (YATeTo), which we explicitly designed for use in PDE solvers. In YATeTo, tensor operations are described in a high-level language that is motivated by the Einstein convention. An abstract syntax tree (AST) is formed, on which algebraic transformations and other optimisations are performed. Afterwards, code generators for small GEMMs are called, and finally C++11 code is generated which calls either generated code or BLAS libraries. Our design does not prescribe the use of a specific implementation of binary tensor contractions, but allows choosing the implementation based on operation type and hardware architecture. Note that our main focus is high-order DG and spectral element methods, and that our ultimate goal is that the performance of the YATeTo-generated code matches the performance of hand-crafted code; the toolbox itself, however, may be used for arbitrary applications where small tensor operations are required and tensor dimensions and respective sparsity patterns are known a priori.
2. Related work
Many components of YATeTo build on previous algorithms and ideas. In this section we summarise related work and necessary background.
2.1. High-level language and representation
The need for highlevel languages for tensor contractions was already clear to A. Einstein, who wrote in the foundations of general relativity:
“For this we introduce the convention: if an index occurs twice in one term of an expression, it is always to be summed over, unless the contrary is expressly noted.” (Einstein, 1916; translated from the German)
The so-called Einstein convention states that one shall sum over every index that appears twice in a term. Elegant domain-specific languages may be defined using this convention.
For example, in the Cyclops tensor framework one may write (Solomonik et al., 2013)
  W["MnIj"] = V["MnEf"] * T["EfIj"];
  Z["AbIj"] = W["MnIj"] * T["AbMn"];
which is valid C++ (using operator overloading) and implements the contractions $W^{mn}_{ij} = \sum_{e,f} V^{mn}_{ef} T^{ef}_{ij}$ and $Z^{ab}_{ij} = \sum_{m,n} W^{mn}_{ij} T^{ab}_{mn}$.
The Barracuda framework defines a similar high-level language (Nelson et al., 2015).
In the TCE, tensor operations are represented internally as function sequences consisting of two types of formulae (Lam et al., 1997): The first type is a multiplication formula of the form $f_r[\ldots] = f_s[\ldots] \times f_t[\ldots]$. Depending on the indices inside the square brackets, a different operation is represented. E.g. $C[i,j,k,l] = A[i,j] \times B[k,l]$ represents a tensor product, whereas $C[i,j] = A[i,j] \times B[i,j]$ represents a Hadamard product. (Note that in the latter two indices appear twice but are not summed over; that is, this representation does not adhere to the Einstein convention.) The second type is a summation formula of the form $C[\ldots] = \sum_i A[\ldots]$. Here, the dimension of the left-hand side is always one less than that of the right-hand side, and the dimension to be summed over is determined by the indices in the brackets.
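The distinction between the two formula types can be illustrated with NumPy; the tensors below are made up for illustration and are not taken from the TCE.

```python
import numpy as np

# Illustrative sketch of the two TCE formula types, written with einsum.
A = np.arange(4.0).reshape(2, 2)
B = np.arange(4.0, 8.0).reshape(2, 2)

# Multiplication formula, tensor-product variant: no index is repeated,
# so C[i,j,k,l] = A[i,j] * B[k,l] has dim(A) + dim(B) dimensions.
tensor_product = np.einsum('ij,kl->ijkl', A, B)

# Multiplication formula, Hadamard variant: i and j appear twice but are
# NOT summed over, unlike in the Einstein convention.
hadamard = np.einsum('ij,ij->ij', A, B)

# Summation formula: the left-hand side has one dimension fewer than the
# right-hand side; here the summed dimension is j.
row_sums = np.einsum('ij->i', A)
```

The Hadamard case is exactly the situation the text highlights: repeated indices that remain unsummed, which the Einstein convention cannot express directly.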
2.2. Strength reduction
There are typically many mathematically equivalent ways to implement a tensor operation. In (Baumgartner et al., 2005) a motivating multi-tensor contraction example is given, where each dimension has size 10. The naive implementation with 10 nested loops requires on the order of $10^{10}$ operations, whereas an implementation that saves intermediate results requires only on the order of $10^{6}$ operations. Hence, the second implementation is four orders of magnitude cheaper than the first.
Naturally, one asks if and how optimal implementations can be found automatically. A formal optimisation problem is set up in (Lam et al., 1997) using the two types of formulae mentioned in Section 2.1. It is shown that this optimisation problem is NP-complete, and an efficient search procedure is developed in (Lam et al., 1997), which exhaustively searches all valid formulae and hence finds the optimal implementation. As the number of dimensions arising in practice is typically small, this search procedure is feasible. The number of dimensions is also small enough in our application examples (see Section 5), hence there is no need to develop new algorithms for strength reduction.
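The simplest instance of strength reduction is choosing the parenthesisation of a matrix chain. The sketch below is illustrative only (it is not the search procedure of Lam et al.): it compares the multiplication counts of the two evaluation orders of a three-matrix product.

```python
# Toy strength reduction for the chain D = A*B*C with
# A: n0 x n1, B: n1 x n2, C: n2 x n3.

def cost_left(n0, n1, n2, n3):
    """Multiplications for (A*B)*C."""
    return n0 * n1 * n2 + n0 * n2 * n3

def cost_right(n0, n1, n2, n3):
    """Multiplications for A*(B*C)."""
    return n1 * n2 * n3 + n0 * n1 * n3

def best_order(n0, n1, n2, n3):
    """Exhaustively compare both orders and return the cheaper one."""
    l, r = cost_left(n0, n1, n2, n3), cost_right(n0, n1, n2, n3)
    return ('(A*B)*C', l) if l <= r else ('A*(B*C)', r)
```

For shapes 10x100, 100x1, 1x100 the left order needs 2,000 multiplications while the right order needs 110,000, i.e. the evaluation order alone changes the operation count by orders of magnitude.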
2.3. Loop-over-GEMM (LoG)
The authors of (Di Napoli et al., 2014) give a summary of tensor contraction classes and possible mappings to BLAS. A binary contraction is classified by the number of free indices of each input tensor (that is, those indices which are not contracted). The most interesting case is when both tensors have at least one free index, as in this case a mapping to GEMM is always possible if data is allowed to be copied or non-stride-one GEMMs are considered.
Consider the following contraction from (Springer and Bientinesi, 2018):
Following (Shi et al., 2016), we have the following options in a LoG implementation: we may batch modes, yielding a non-free loop index; we may flatten modes, yielding a single free index; and we may transpose modes. Note that transposes may only be applied to exactly two free modes.
A possible LoG implementation loops over a batched index and the contracted index and, inside the loop, calls GEMM for subtensors of C, B, and A; see (Springer and Bientinesi, 2018, Listing 8) for an implementation.
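A minimal sketch of the Loop-over-GEMM idea, for a contraction we made up for illustration (it is not the example from Springer and Bientinesi): in C[a,b,c] = sum_k A[a,k,c] * B[k,b], the mode c is batched and each slice is a plain GEMM.

```python
import numpy as np

# Illustrative Loop-over-GEMM: batch mode c, GEMM over (a, k) x (k, b).
rng = np.random.default_rng(0)
na, nb, nc, nk = 3, 4, 5, 6
A = rng.standard_normal((na, nk, nc))
B = rng.standard_normal((nk, nb))

C = np.empty((na, nb, nc))
for c in range(nc):
    C[:, :, c] = A[:, :, c] @ B   # one small GEMM per batched slice

# Reference result computed directly with einsum.
C_ref = np.einsum('akc,kb->abc', A, B)
```

Each loop iteration touches only a subtensor, which is exactly why the bounding-box layouts discussed later interact with how many indices a LoG may fuse.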
3. Tensors in discontinuous Galerkin methods
The discontinuous Galerkin method has become a popular method for solving hyperbolic PDEs. Compared to other classic methods, such as Finite Difference methods, Finite Volume methods, and continuous Finite Element methods, the DG method simultaneously handles complex geometries, achieves high-order convergence rates, and has an explicit semi-discrete form (Hesthaven and Warburton, 2008). Additionally, the high number of element-local operations and the small stencil make it a promising candidate for exploiting the capabilities of modern supercomputing hardware. Before discussing the details of our tensor toolbox, we want to motivate why tensor contractions are a natural abstraction for DG methods.
As an example, we consider the following linear system of PDEs:
(2) 
Multiplying the latter equation with a test function , integrating over a domain , and integrating by parts yields the corresponding weak form
(3) 
where is the outward-pointing normal vector. From here a sequence of steps follows to obtain a semi-discrete form, such as partitioning the domain into finite elements (e.g. tetrahedra or hexahedra), introducing a numerical flux to weakly couple the finite elements, transforming the physical elements to a reference element, and introducing a polynomial approximation of the quantities on the reference element (Atkins and Shu, 1998; Dumbser and Käser, 2006; Hesthaven and Warburton, 2008). After these steps, the semi-discrete form is a system of ODEs, which may, for example, be solved with a Runge-Kutta scheme or the ADER approach (Dumbser and Käser, 2006).
We skip some steps for simplicity, and assume that the quantities are discretised as at a given point in time on an element . Here, are a set of polynomial basis functions and is a matrix, storing the coefficients of the basis expansion for every quantity, where is the number of quantities. By inserting our discretisation we may write the second line of (3) in the following way
(4) 
The integrals in (4) may be precomputed and stored as (so-called) stiffness matrices of size . (Note that in an actual implementation these matrices need to be computed only for the reference element.) The implementation in terms of GEMM is then given by a sequence of matrix chain products.
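The structure of such a kernel can be sketched as follows; the names, shapes, and the number of summands are illustrative assumptions, not the paper's actual kernel. Each summand K_d @ Q @ A_d is a chain of two small GEMMs, with the stiffness matrices acting on the basis dimension and coefficient matrices acting on the quantity dimension.

```python
import numpy as np

# Hypothetical shapes: nb basis functions, nq quantities.
rng = np.random.default_rng(1)
nb, nq = 10, 9
K = [rng.standard_normal((nb, nb)) for _ in range(3)]  # stiffness matrices
A = [rng.standard_normal((nq, nq)) for _ in range(3)]  # coefficient matrices
Q = rng.standard_normal((nb, nq))                      # degrees of freedom

# Each summand is a matrix chain product implemented as two GEMMs;
# the chain may be evaluated as (K_d @ Q) @ A_d or K_d @ (Q @ A_d).
result = sum(K[d] @ Q @ A[d] for d in range(3))
```

Since nb and nq are small, the whole kernel is a short sequence of small GEMMs, which is the situation code generators for small GEMMs target.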
When one uses the unit cube as reference element, it is possible to use a spectral basis, where the basis functions and test functions are given by tensor products of one-dimensional basis functions. The degrees of freedom are then stored in a 4-dimensional tensor, such that . In each dimension we have basis functions, where is the polynomial degree, and has entries. Inserting the discretisation and the test functions in the second line of (3), we obtain
where and . We note here that the index order lmnp is chosen arbitrarily. In fact, the index permutations plmn, lpmn, and lmpn are equally viable, and it depends on the application which one is best. E.g. if the number of quantities is large and the number of basis functions per dimension is smaller than the SIMD vector width, then it might be beneficial to store the index p contiguously in memory. Otherwise, if the number of basis functions per dimension is large and the number of quantities is small, then one of the indices l, m, n should be contiguous in memory.
As a final example, one may solve multiple problems concurrently: e.g. one may add another dimension to the degrees of freedom in (4) to obtain a 3D tensor, which stores multiple degrees of freedom (Breuer et al., 2017). The matrix chain product then becomes a tensor contraction sequence and may be implemented using the Loop-over-GEMM approach.
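A sketch of this fusion, with made-up shapes: a batch dimension s is appended to the degrees of freedom, and the former matrix chain product becomes a tensor contraction that a Loop-over-GEMM handles by looping over s.

```python
import numpy as np

# Hypothetical shapes: nb basis functions, nq quantities, ns fused problems.
rng = np.random.default_rng(2)
nb, nq, ns = 10, 9, 4
K = rng.standard_normal((nb, nb))      # a stiffness matrix
A = rng.standard_normal((nq, nq))      # a coefficient matrix
Q = rng.standard_normal((nb, nq, ns))  # batched degrees of freedom

out = np.empty((nb, nq, ns))
for s in range(ns):                    # the loop of the Loop-over-GEMM
    out[:, :, s] = K @ Q[:, :, s] @ A

# The same contraction written directly in Einstein-like notation.
out_ref = np.einsum('ij,jks,kl->ils', K, Q, A)
```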
4. Yet Another Tensor Toolbox
The input to our tensor toolbox is a domain-specific language (DSL), which naturally resembles the Einstein notation. An abstract syntax tree (AST) is derived from an expression. Subsequently, the tree is shaped using a sequence of visitors (Gamma et al., 1995). Afterwards, we transform the AST into a control flow graph (CFG) for standard compiler techniques (Seidl et al., 2012). Note that the CFG is very simple, as the DSL does not feature loops or branches. Finally, a code generator is called, which generates either C++11 code directly or makes use of existing code generators or BLAS libraries. An overview is shown in Figure 1. YATeTo is open-source and available at www.github.com/SeisSol/yateto.
In the following, we are going to introduce our DSL and elaborate algorithms and design choices of each of the three stages.
4.1. High-level description of tensor operations
The DSL we created is embedded into Python 3. An embedded DSL has the advantage that we do not need our own parser and lexer, which decreases the complexity of the software. Furthermore, a user may use any feature of Python to form expressions, for example including loops, branches, classes, and lambda expressions.
The basic data types are the classes Tensor and Scalar.
A tensor definition includes a name and a shape tuple, e.g. as in the following:
  N = 8
  A = Tensor('A', (N, N))
  B = Tensor('B', (N, N, N))
  w = Tensor('w', (N,))
  C = Tensor('C', (N, N))
A and C are matrices, B is a third-order tensor, and w is a vector, where all dimensions have size N.
From a Tensor, an object of type IndexedTensor may be derived by supplying an index string. The convention is that the length of the index string must equal the number of dimensions of the tensor and that only lower- and upper-case letters may be used. (Note that this limits the number of distinct indices that may appear in an expression to 52. We think that this is not an issue in practice, hence we favour simplicity over generality.) Objects of type IndexedTensor may appear in expressions, e.g.
  kernel = C['ij'] <= 2.0 * C['ij'] + A['lj'] * B['ikl'] * w['k']
The design is inspired by (Solomonik et al., 2013) (see Section 2.1), and represents the operation $C_{ij} \leftarrow 2.0\, C_{ij} + \sum_{k,l} A_{lj} B_{ikl} w_k$.
The variable kernel contains an AST with nodes Assign, Add, ScalarMultiplication, and Einsum, where the latter represents a tensor operation using the Einstein convention.
The AST is built by overloading the binary operators '*' and '+'. One of the following four actions is taken whenever one of the operators is invoked for operands op1 and op2, where nodeType = Einsum for '*' and nodeType = Add for '+':
- type(op1) = nodeType and type(op2) = nodeType: append op2's children to op1.
- type(op1) = nodeType and type(op2) ≠ nodeType: append op2 to op1.
- type(op1) ≠ nodeType and type(op2) = nodeType: prepend op1 to op2's children.
- type(op1) ≠ nodeType and type(op2) ≠ nodeType: return a new node of type nodeType with children op1 and op2.
These actions ensure that a node of type nodeType never has children of type nodeType; instead, all tensors belonging to the same nodeType are merged. This is especially important for Einsum and the algorithms presented in Section 4.2 and Section 4.3, which require all tensors participating in a tensor operation to be known. Note that we do not apply distributive laws: an addition nested inside an Einstein sum is always evaluated before the Einstein sum.
For scalar multiplications we take similar actions to ensure that ScalarMultiplication is always the parent node of an Einsum.
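The merge rule can be sketched in a few lines; the classes below are a simplified stand-in for YATeTo's actual node types.

```python
# Minimal sketch (not YATeTo's real classes) of the AST merge rule.
class Node:
    def __init__(self, *children):
        self.children = list(children)

class Einsum(Node):
    pass

class Leaf(Node):
    def __init__(self, name):
        super().__init__()
        self.name = name

def merge(op1, op2, node_type):
    """Implements the four actions taken on operator invocation."""
    a, b = type(op1) is node_type, type(op2) is node_type
    if a and b:
        op1.children.extend(op2.children)  # append op2's children to op1
        return op1
    if a:
        op1.children.append(op2)           # append op2 to op1
        return op1
    if b:
        op2.children.insert(0, op1)        # prepend op1 to op2's children
        return op2
    return node_type(op1, op2)             # fresh node with both operands

# A * B * C collapses into ONE Einsum node with three children,
# instead of a nested binary tree.
expr = merge(merge(Leaf('A'), Leaf('B'), Einsum), Leaf('C'), Einsum)
```

The flattening is what lets later passes see all participating tensors of one Einstein sum at once.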
4.2. Equivalent sparsity patterns
The notion of equivalent sparsity patterns (EQSPPs) and an algorithm to compute them was introduced in (Uphoff and Bader, 2016) for matrix chain products (MCPs). The EQSPPs are defined as the minimal sparsity patterns of the involved matrices that leave the result of an MCP unchanged. In this definition, no cancellation is assumed, which means that adding nonzero entries never results in zero (Cohen, 1998). First, this assumption allows us to work with boolean matrices alone. Second, we assume that the sparsity pattern is known but the values are unknown (or known only at runtime), which has the following implication: in an inner product, which is the core of matrix multiplication, we can always find values such that the two vectors are non-orthogonal. That is, if the values of at least one matrix are unknown, we cannot guarantee cancellation.
The concept of EQSPPs can be illustrated with the following example (where the entries may be either scalars or dense block matrices):
Here, the multiplication with the first matrix removes blocks from the right-hand side, and the multiplication with the third matrix removes further blocks from the right-hand side. Thus, the equivalent sparsity pattern of the second matrix is only nonzero in the top-left block.
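For the matrix-chain case, the no-cancellation argument can be sketched with boolean patterns (this is an illustration of the idea, not YATeTo's general tensor algorithm): an entry B[k,l] of the middle matrix can only contribute to A @ B @ C if column k of A and row l of C each contain a nonzero.

```python
import numpy as np

# Boolean sparsity patterns with zero blocks, mimicking the example above.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0]], dtype=bool)   # zero right block
B = np.ones((4, 4), dtype=bool)            # dense middle matrix
C = np.array([[1, 1],
              [1, 1],
              [0, 0],
              [0, 0]], dtype=bool)         # zero bottom block

col_used = A.any(axis=0)   # columns of A that contain a nonzero
row_used = C.any(axis=1)   # rows of C that contain a nonzero

# Equivalent sparsity pattern of B under the no-cancellation assumption.
B_eq = B & col_used[:, None] & row_used[None, :]
```

Only the top-left 2x2 block of B survives, matching the example in the text.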
In (Uphoff and Bader, 2016) an algorithm to compute EQSPPs for MCPs is presented, which is based on a graph-based representation. The extension of the MCP graph to general tensor operations is not straightforward; thus, in the remainder of this section we derive an algorithm that is not based on a graph.
4.2.1. Formalisation of tensor operations
So far we have defined tensor operations via handwaving and examples, but here we need to introduce a formal definition of the “*”-expression in our DSL.
First of all, we formalise the labelling of the dimensions of a tensor. It consists of a set of indices and an assignment of indices to dimensions. We denote the set of indices with calligraphic letters; the index set must be an element of the power set of an alphabet of lower- and upper-case letters. We require the number of indices to equal the number of dimensions. (Note that this requirement excludes the trace of a matrix or similar expressions. However, one may rewrite the trace such that it conforms with the requirement.) The assignment of indices to dimensions is a bijective mapping, and each index ranges over the size of its dimension.
Example 1.
Let A be a third-order tensor. The index set of A['jik'] is {i, j, k}, and the bijection assigns j to the first dimension, i to the second, and k to the third.
Second, assume a “*”-expression is formed by tensors and a result tensor. Einstein’s convention requires us to sum over all indices that appear twice, which we formalise in the following: The index space of a tensor is the Cartesian product of its index ranges. In order to span an iteration space, we introduce the global index set, which is the union of the index sets of all participating tensors, and the corresponding global index space. Similar to a tensor, we assign global indices to dimensions of the global index space with a bijective function. (The latter may, for example, be obtained by ordering the global index set lexically.) In order to “access” entries of a tensor, we introduce projection and permutation functions.
Entries of are accessed with the notation , where , or shortly with , where the permutation and projection function is understood from context. We define the restriction of the global index set to an index as .
With this formalism set up, we are able to precisely state the meaning of a “*”-expression:
(5) 
Example 2.
Consider a third-order tensor A and a matrix B. In Einstein notation, the mode-2 product is written as $C_{ijk} = A_{ilk} B_{jl}$ and in our DSL as C['ijk'] <= A['ilk'] * B['jl'].
In our formal notation, the global index space is , where . The permutation and projection functions are , , and . We may compute the entries of C with
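The mode-2 product of the example can be checked directly with NumPy (the shapes below are illustrative):

```python
import numpy as np

# C['ijk'] <= A['ilk'] * B['jl'], i.e. C_ijk = sum_l A_ilk * B_jl.
rng = np.random.default_rng(3)
A = rng.standard_normal((2, 3, 4))   # indices i, l, k
B = rng.standard_normal((5, 3))      # indices j, l

C = np.einsum('ilk,jl->ijk', A, B)   # sum over l, the repeated index
```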
4.2.2. Computation of EQSPPs for tensor operations
As we assume no cancellation in sums, it is sufficient to model a sparsity pattern as a boolean tensor. The EQSPP of a tensor is marked with a hat; the “hat” tensor is the original tensor masked with its EQSPP. In operations involving boolean tensors we identify the logical operations with their arithmetic counterparts.
Having set up the necessary formalism, we are ready to formally define EQSPPs:
Definition 4.1.
Assuming no cancellation, we call equivalent sparsity patterns w.r.t. (5) if
- the masked operation yields the same result as (5), and
- the number of nonzeros is minimal, that is, we cannot set a nonzero to zero without violating the first condition.
As we assume no cancellation, we may disregard the sums over indices, and instead need only consider a product over all tensors on the right-hand side:
Lemma 4.2.
The EQSPPs w.r.t. (5) are equivalent to the EQSPPs w.r.t.
Proof.
Clearly, , hence condition 1 is fulfilled.
In order to check condition 2, assume there is another set of EQSPPs which has fewer nonzeros. Then the sets differ in at least one entry (otherwise the original set would not be minimal). As the entry does not get cancelled in sums, it follows that there exists an index where the results differ. ∎
Our main result is that we may cast the computation of EQSPPs as boolean tensor operations on the original sparsity patterns:
Proposition 1.
Proof.
We are going to show that (6) computes the EQSPPs for the product of tensors (see Lemma 4.2). In order to satisfy condition 1, we only need to show that the sparsity patterns of and are identical. The sparsity pattern of is given by , where and the sparsity pattern of is given by , where
First, we show that . Note that
(7) 
where the latter equality follows from idempotence and if by the definition of . Using (7) we obtain
(8) 
where the last identity follows from and the absorption law. Equation 8 is true for all , hence .
Second, in order to satisfy condition 2, we show that the number of nonzeros is minimal: Assume that there exists another set of sparsity patterns with fewer nonzeros. Then the sets differ in at least one entry, and it immediately follows that
which violates condition 1. ∎
4.2.3. Implementation and discussion
We have applied our EQSPP algorithm to an application example from (Uphoff and Bader, 2016), where we extended the degrees of freedom by an additional dimension (cf. Section 5.1). The original SPPs and EQSPPs can be seen in Figure 2. We observe that the nonzero entries of and induce additional zeros in which may be disregarded when computing the tensor operation. For example, in the evaluation of we may restrict the loop ranges of and .
The implementation of the EQSPP computation via (6) is straightforward, as it requires the same kind of tensor operations that we support within YATeTo, the only difference being that the tensors are boolean. E.g. we may apply strength reduction (cf. Section 4.3) in order to reduce the amount of computation. Still, computing EQSPPs is more expensive than simply evaluating the original tensor operation once. Nevertheless, the cost of computing EQSPPs is negligible in comparison to the possibly millions of times a kernel is called within a DG scheme.
4.3. Strength reduction
We already mentioned in Section 2.2 that finding the sequence of tensor operations with minimal operation count is an NP-complete problem, but that an efficient enumeration procedure exists for tensor dimensions appearing in practice (Lam et al., 1997). The same enumeration procedure may also be used when sparse tensors are involved, but in this case the sparsity patterns of intermediate results are required, or at least an estimate of the sparsity (Lam et al., 1999).
In YATeTo we assume that tensors are small enough that it is feasible to explicitly compute the sparsity patterns of all intermediate products during strength reduction. The number of operations is then determined in the following way (Lam et al., 1999): for a multiplication formula, the number of operations equals the number of nonzeros of the product pattern; for a summation formula, it equals the number of nonzeros of the input minus the number of nonzeros of the result.
An example of an intermediate AST after strength reduction is shown in Figure 2(b).
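The counting rules can be sketched on boolean patterns; the matrices below are made up for illustration.

```python
import numpy as np

# Sparsity patterns of two small matrices (True = structural nonzero).
A = np.array([[1, 0], [1, 1]], dtype=bool)
B = np.array([[1, 1], [0, 1]], dtype=bool)

# Multiplication formula T[i,j,k] = A[i,j] * B[j,k] (no summation yet):
# its pattern is the logical AND of the broadcast operand patterns, and
# the cost is one multiply per nonzero of T.
T = A[:, :, None] & B[None, :, :]
mult_ops = int(T.sum())

# Summation formula C[i,k] = sum_j T[i,j,k]: the additions needed to
# collapse the nonzeros of T onto the nonzeros of C.
C = T.any(axis=1)
add_ops = int(T.sum()) - int(C.sum())
```

For these patterns the multiplication formula costs 5 multiplies and the summation formula a single addition, instead of the 8 multiplies and 4 additions of the dense case.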
4.4. Memory layouts
Following Figure 1, the next step is to compute the memory layouts of tensors. The memory layout influences the possibility to fuse indices in a LoG implementation. Thus the layout influences the cost function in Section 4.5 and needs to be fixed at this point already. In the following we give an overview of the two classes of memory layouts that are currently supported.
4.4.1. Dense layout
We call the tuple $(n_1, \ldots, n_d)$ of dimension sizes the shape of a tensor. A simple memory layout of a dense tensor is the “column-major” storage, where entry $(i_1, \ldots, i_d)$ is stored at position $\sum_{j=1}^{d} i_j s_j$ of the underlying 1D array, and the so-called stride is given by $s_j = \prod_{k=1}^{j-1} n_k$ (with $s_1 = 1$). That is, the tensor is stored linearly in a 1D array, such that the first dimension varies fastest and the last dimension varies slowest in memory. The stride $s_j$ gives us the number of floating-point values we have to skip when increasing index $i_j$ by one.
From Section 4.2 we expect that we also have tensors with large zero blocks, for which it would be wasteful to store the full tensor. Hence, our next evolution is the bounding-box column-major storage, which models the case where the tensor is nonzero within a bounding-box index set and zero otherwise; only the bounding box is stored.
Finally, it might be beneficial to align the fastest dimension. That is, the number of bytes in a fibre of the first dimension should be divisible by the architecture’s SIMD vector length. In some cases one has to add artificial zeros to the first dimension; hence we allow the bounding interval of the first dimension to be larger than the size of the first dimension. Given an initial (minimal) bounding interval, we align it as in (Uphoff and Bader, 2016), that is, the start of the interval is rounded down and the end is rounded up to the next multiple of $V$, where $V$ is the number of floating-point values that fit into a SIMD vector (e.g. using AVX-512 one sets $V = 8$ for double precision or $V = 16$ for single precision).
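One plausible reading of the alignment rule (a sketch under the round-down/round-up assumption stated above, not necessarily the exact formula of Uphoff and Bader) is:

```python
def align_interval(start, stop, V):
    """Align a half-open bounding interval [start, stop) of the first
    dimension to the SIMD width V (floating-point values per vector):
    round start down and stop up to the nearest multiple of V."""
    return (start // V) * V, -(-stop // V) * V  # floor / ceiling division

# E.g. with AVX-512 and double precision, V = 8:
lo, hi = align_interval(3, 13, 8)
```

The aligned interval may contain artificial zeros; as discussed next, this is exactly what can break the compatibility condition for fusing indices.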
We conclude the presentation of dense layouts with a possible pitfall of using bounding boxes and alignment. It might be beneficial to fuse multiple indices when mapping a tensor contraction to a Loop-over-GEMM (see Section 2.3). Fusing indices is always possible when (Shi et al., 2016)
(9) 
Other cases would require more involved compatibility conditions. For example, in a GEMM where artificial zeros are added to a dimension of one operand and to a dimension of the result, one may only fuse indices whenever the operand and the result have the same number of artificial zeros in the respective dimensions. Otherwise one needs temporary storage and an additional data copy. In order to avoid possible data copies and complications in subsequent optimisation steps, we only allow fused indices when (9) is fulfilled. Consequently, aligned layouts or bounding-box layouts have to be considered carefully, as they may prohibit fusing indices, which leads to smaller (and possibly slower) GEMMs.
4.4.2. Compressed sparse column (CSC) layout
Sparse matrices may appear in discontinuous Galerkin methods; e.g. the stiffness matrices are sparse when an orthogonal set of basis functions is used. We have limited support for the CSC format: CSC matrices may appear in GEMM or Loop-over-GEMM calls, but only in a sparse × dense or a dense × sparse GEMM.
4.4.3. Other (sparse) formats
It might be surprising that we describe equivalent sparsity patterns in great detail but currently do not offer a genuine sparse tensor memory layout. First, we want to point out that if an EQSPP induces a large zero block, then we are able to exploit this block using the bounding-box memory layout, or we may let a LoG operate only on a subtensor. Second, to the authors’ knowledge, an efficient code generator or efficient routines for general small sparse tensors do not currently exist.
4.5. Optimal index permutations
The strength reduction step converts each Einsum node into a binary tree consisting only of IndexSum and Product nodes. In a first step, we identify contractions in the tree, which are (possibly multiple) IndexSum nodes followed by a single Product node. A subtree that was identified as a contraction is replaced by a new node of type Contraction. For example, we might end up with the (sub)tree shown in Figure 4.
Each Contraction node shall now be mapped to an implementation as a Loop-over-GEMM. In general, the mapping to GEMM is not unique; several implementations are possible (e.g. (Shi et al., 2016, Table II)). A systematic way to list all possible implementations is given in (Springer and Bientinesi, 2018, Listing 7). As accurately predicting the performance of each implementation is difficult, heuristics may be employed to select the most promising candidate, e.g. one may choose the variant with the largest GEMM (Springer and Bientinesi, 2018). The algorithm to list all LoG implementations in (Springer and Bientinesi, 2018) assumes that the order of all indices is known. In our case we have temporary tensors for which an order of indices is not prescribed (e.g. the temporaries in Figure 4). The index order has to be chosen carefully, as it determines possible mappings to GEMM, and there are even cases where non-unit-stride GEMMs are required (Shi et al., 2016, Table II). Moreover, the index order of a temporary result influences possible mappings to GEMM in nodes that involve the temporary result.
4.5.1. Optimal index permutations
Our approach to selecting the index orders of temporary tensors is to solve the following discrete optimisation problem: Let be the set of vertices of an AST with root . We denote the set of children of a vertex with , and the set of descendants of a vertex with . For each vertex we have a set of permissible index permutations . We introduce one variable per vertex, which contains the index permutation chosen for that vertex. The Cartesian product of all permissible index permutation sets is called the configuration space. The optimisation problem is to find a configuration which minimises a cost function over all possible configurations, that is,
(10) 
We define the cost function of a tree with root recursively in the following way:
(11) 
The functions can be thought of as a measure of the cost of the operation represented by a vertex in the AST. We give a detailed definition in Section 4.5.2.
We split the minimisation into two stages. In the first stage we minimise over the root variable and the variables of its children; in the second stage we minimise over all other variables, i.e. the variables associated with the remaining descendants. One can then show that
where we introduced
(12) 
The subproblem introduced in (12) can be interpreted as finding the optimal configuration for a subtree, assuming that the index permutation of the root node is fixed. Such problems are said to have an optimal substructure, because the optimal solution can be constructed from optimal solutions to subproblems (Cormen et al., 2009).
Dynamic programming is a commonly used approach for problems with optimal substructure. We consider a bottom-up dynamic programming algorithm, which works in the following way: The AST is traversed in post-order. For each vertex, we solve problem (12) for every permissible index permutation. The minimum cost as well as a minimising configuration is memoized in a dictionary. If the vertex is a leaf node, we simply evaluate all its index permutations. For internal nodes, we evaluate all configurations of the vertex and its children. In order to evaluate the cost, we evaluate the local cost function and look up the memoized costs of the subproblems. (The latter are available due to the post-order traversal.) The runtime of this algorithm can be bounded by $O(|V|\,(d!)^{c+1})$, where $|V|$ is the number of vertices in the AST, $d$ is the maximum number of indices in a vertex (hence $d!$ is the maximum size of each permissible index permutation set), and $c$ is the maximum number of children.
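The memoization pattern can be sketched on a toy problem (the tree, the permutation sets, and the cost are made up; only the post-order traversal and per-permutation memoization mirror the algorithm above):

```python
def solve(tree, perms, cost):
    """Bottom-up DP sketch: tree maps a vertex to its children, perms maps
    a vertex to its permissible permutations, and cost(v, pv, c, pc) is a
    toy local cost. Returns memo: (vertex, perm) -> minimal subtree cost."""
    memo = {}

    def post(v):
        for c in tree.get(v, []):
            post(c)                                  # post-order traversal
        for p in perms[v]:                           # fix v's permutation
            total = 0
            for c in tree.get(v, []):                # best child choice,
                total += min(cost(v, p, c, pc) + memo[(c, pc)]  # memoized
                             for pc in perms[c])
            memo[(v, p)] = total

    post('root')
    return memo

tree = {'root': ['x', 'y']}
perms = {'root': ['ab', 'ba'], 'x': ['ab', 'ba'], 'y': ['ab']}
# Toy cost: zero if parent and child agree on the permutation, else one.
memo = solve(tree, perms, lambda v, p, c, pc: 0 if p == pc else 1)
```

Each vertex is visited once and each (vertex, permutation) pair is solved once, which is where the stated runtime bound comes from.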
4.5.2. Cost function
The missing pieces in the last section are the cost functions $\gamma_v$. Here, we choose a simple heuristic. Our basic assumption is that our ASTs will consist mostly of LoGs and that these will dominate the runtime. Hence, all other operations are assigned cost zero. Further assumptions are listed in the following:

Non-unit stride GEMMs are inferior to unit stride GEMMs.

Transposes of $A$ ($B$) in the GEMM $C \leftarrow AB$ should be avoided when using column-major (row-major) layout. Transposes of $C$ should be avoided due to missing support in code generation backends.

Large GEMMs are faster than small GEMMs, i.e. one should fuse as many indices as possible.
From these assumptions we define the cost of a LoG as the 4-tuple $(s, t_L, t_R, f)$, where $s$ is the number of required slices with non-unit stride, $t_L$ is the number of left-transposes (assuming column-major layout), $t_R$ is the number of right-transposes, and $f$ is the number of fused indices. Cost comparison is based on lexicographic comparison of $(s, t_L + t_R, -f)$, where the lower number of left-transposes is deciding when two 3-tuples are equal.
The cost function is then
$$\gamma_v = \begin{cases} \infty & \text{if constraints are violated,} \\ \mathrm{MinLoG}(v) & \text{if } v \text{ is a LoG node,} \\ 0 & \text{otherwise,} \end{cases}$$
where MinLoG enumerates feasible LoG implementations and returns the cost of a LoG implementation with minimum cost, or infinite cost if no mapping exists. With “violated constraints” we summarise additional constraints that exist, e.g. for Add or Assign nodes.
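The lexicographic cost ordering described above can be made concrete with a small comparison key. This is our reading of the text, not YATeTo's actual data structure; the function name is illustrative.

```python
def log_cost_key(s, t_left, t_right, fused):
    """Comparison key for the LoG cost tuple (s, t_L, t_R, f).

    Costs are compared lexicographically on (non-unit-stride slices,
    total transposes, -fused indices); the number of left-transposes
    breaks ties when the 3-tuples are equal.
    """
    return (s, t_left + t_right, -fused, t_left)

# A unit-stride GEMM fusing three indices beats one fusing only two:
assert log_cost_key(0, 0, 0, 3) < log_cost_key(0, 0, 0, 2)
# With equal 3-tuples, fewer left-transposes wins:
assert log_cost_key(0, 0, 1, 2) < log_cost_key(0, 1, 0, 2)
```

Negating the number of fused indices turns the "fuse as many indices as possible" preference into a minimisation, so a single tuple comparison implements all four assumptions.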
4.5.3. Discussion
The cost function we employ is clearly limited, as it is a heuristic and makes sense for LoG nodes only. However, choosing another cost function does not change the dynamic programming scheme, as long as the cost function is structurally equivalent to Equation 11. For example, one could replace our LoG cost with a runtime model based on micro-benchmarks, using the methods developed in (Peise and Bientinesi, 2012).
The algorithm needs to visit every node only once, hence the algorithm is feasible also for large ASTs. Problematic are tensors with many dimensions (say, large $d$), as the number of permutations is $d!$ and we need to check all of them. But as stated in the introduction, we assume that our tensors fit into low-level caches, which constrains $d$. E.g. $d = 12$ is the maximum number of dimensions we may choose such that a $2 \times \cdots \times 2$ tensor (double precision) still fits into a 32 KiB L1 cache. In other words, YATeTo is not designed for high-dimensional tensors.
An example of an AST after mapping to LoG can be seen in Figure 2(c).
4.6. Prefetching
The library LIBXSMM, a generator for small matrix-matrix multiplications, allows software prefetching instructions to be included in the generated assembly code (Heinecke et al., 2016b). This may improve performance, particularly on Intel's Knights Landing architecture (Heinecke et al., 2016a; Uphoff et al., 2017). A possible prefetching strategy is to insert vprefetch1 instructions for a matrix B after writes to the result matrix C, where the memory offsets calculated for the matrix C are also used for the matrix B. The rationale is that B has the same or a similar shape as C.
In YATeTo, users may add a list of tensors they wish to be prefetched to a kernel. The FindPrefetchCapabilities visitor determines the number of bytes that may be prefetched for every node of the kernel. For a LoG node this is equal to the number of bytes in the result tensor. (Other nodes have no prefetch capability, but this may be added if appropriate code generators are available.) AssignPrefetch then greedily assigns each to-be-prefetched tensor P to one of the nodes, such that the number of bytes of P matches the prefetch capability of the node.
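The greedy assignment can be sketched as follows. This is a hedged illustration of the idea only: the function and the matching rule (pick the smallest capability that can accommodate the tensor) are our assumptions, not YATeTo's API.

```python
def assign_prefetch(tensors, capabilities):
    """Greedily match tensors to nodes by prefetchable bytes.

    tensors: {tensor name: size in bytes}
    capabilities: {node name: prefetch capability in bytes}
    """
    assignment, free_nodes = {}, dict(capabilities)
    for name, nbytes in sorted(tensors.items(), key=lambda t: -t[1]):
        # candidate nodes whose capability can cover the tensor ...
        candidates = [n for n, cap in free_nodes.items() if cap >= nbytes]
        if candidates:
            # ... of which we take the tightest match
            node = min(candidates, key=lambda n: free_nodes[n])
            assignment[name] = node
            del free_nodes[node]        # each node prefetches one tensor
    return assignment

# two LoG nodes with 1024 B and 800 B capability; P needs exactly 800 B
mapping = assign_prefetch({'P': 800}, {'volume': 1024, 'flux': 800})
```

Here P is matched to the 800-byte node, leaving the larger node free for further tensors.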
4.7. Control flow graph
Following Figure 1, we convert the AST to a control flow graph (CFG) representation (Seidl et al., 2012). The aim here is first to obtain a standardised sequential representation for the later code generation step, and second to manage temporary buffers, or to avoid them altogether.
The data structure of the CFG consists of program points (vertices), which save the current state of the kernel, and actions (edges), which transform the kernel's state. An action may be either an assignment or a plus-equals operation. The left-hand side is always a tensor. The right-hand side may be either a tensor or one of the operations modelled in the AST (such as LoG or Product). The right-hand side may be augmented by multiplication with a scalar.
The individual optimisation steps use standard techniques from compiler optimisation (Seidl et al., 2012), thus we only briefly summarise them in the following:

MergeScalarMultiplications: Nested scalar multiplications are combined into a single scalar factor, e.g. actions of the form $B \leftarrow s(tA)$ are replaced by $B \leftarrow (st)A$.

LivenessAnalysis: Determine live variables (Seidl et al., 2012).

SubstituteForward: If possible, replace occurrences of $B$ by tmp after an action $\mathrm{tmp} \leftarrow B$.

SubstituteBackward: If possible, replace the left-hand side tmp by $B$ if an action $B \leftarrow \mathrm{tmp}$ follows.

RemoveEmptyStatements: Remove statements such as $B \leftarrow B$.

MergeActions: Try to merge actions, e.g. merge $\mathrm{tmp} \leftarrow X$ and $B \leftarrow B + \mathrm{tmp}$ to $B \leftarrow B + X$, when there is no intermediate action which depends on tmp.

DetermineLocalInitialization: Greedily map temporary buffers to temporary variables. If a buffer is used for multiple variables, set the buffer size to the maximum size required by its assigned variables.
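Two of these passes can be illustrated on a toy action list. This sketch mirrors the idea only; the real CFG stores program points and richer right-hand sides, and the tuple encoding below is our own simplification.

```python
def simplify(actions):
    """Toy MergeActions + RemoveEmptyStatements over (lhs, rhs, is_plus_equals)."""
    out = []
    for lhs, rhs, plus in actions:
        # MergeActions: fold  tmp <- X ; B (+)= tmp   into   B (+)= X.
        # Restricting the merge to directly adjacent actions guarantees
        # that no intermediate action depends on tmp.
        if out and not out[-1][2] and rhs == out[-1][0] and lhs != out[-1][0]:
            _, prev_rhs, _ = out.pop()
            rhs = prev_rhs
        out.append((lhs, rhs, plus))
    # RemoveEmptyStatements: drop assignments such as  B <- B
    return [a for a in out if not (a[0] == a[1] and not a[2])]

# tmp <- A*B ; C += tmp   becomes   C += A*B
steps = simplify([('tmp', 'A*B', False), ('C', 'tmp', True)])
```

After the merge, the temporary disappears entirely, which is exactly the buffer-avoidance effect the CFG passes aim for.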
In order to illustrate the individual steps, we present matrix multiplication as an example:
C['ij'] <= C['ij'] + 0.5 * A['ik'] * B['kj']
A simple, but correct, traversal of the corresponding AST yields the sequential CFG shown in Figure 4(a). This inefficient CFG is transformed into a single call to GEMM (with $\alpha = 0.5$ and $\beta = 1$), as shown in Figure 4(e).
4.8. Code generation
The final step is to generate machine code. In principle, the aim is to reuse existing code generators, such as LIBXSMM (Heinecke et al., 2016b) or PSpaMM (Brei, 2018; Wauligmann and Brei, 2019), or flavours of BLAS. Besides, YATeTo generates C++ glue code, and it is able to generate generic fallback code whenever an efficient implementation is not available. The fallback code consists of nested loops and is architecture independent, but we expect its performance to be rather poor and compiler-dependent.
Operations are divided into four types: copyscaleadd, indexsum, product, and log. The first type is a combination of the BLAS routines COPY and AXPY, and may be implemented in terms of these two functions. The product and indexsum types correspond to the multiplication and summation formulae presented in Section 2.1. Having a generic implementation for these two types ensures that we cover a large class of tensor operations (Lam et al., 1997). The log type implements Loop-over-GEMMs.
Our concept requires that it be simple to include additional code generators, depending on the type of operation and on the architecture. We use the factory method pattern (Gamma et al., 1995) to distinguish between code generators. For each type, the factory method may return a different code generator depending on the operation's description (e.g. dense or sparse matrices) and the architecture. For the LoG type, YATeTo offers a generic C++ loop-based implementation, which internally calls another factory method for GEMM. But our structure would also allow calling specialised LoG generators, such as the ones developed in (Breuer et al., 2017) for contractions involving 3D tensors and matrices.
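The factory-method dispatch can be sketched as follows. All class and architecture names are hypothetical placeholders (standing in for backends such as LIBXSMM, PSpaMM, or the generic fallback); this is not YATeTo's actual class hierarchy.

```python
class GenericLoopGen:
    """Architecture-independent nested-loop fallback."""
    def generate(self, m, n, k):
        # emit a C-like loop nest as text (placeholder for real codegen)
        return f'for i<{m} for j<{n} for p<{k}: C[i,j] += A[i,p]*B[p,j]'

class SmallGemmGen(GenericLoopGen):   # stands in for a LIBXSMM-style backend
    pass

class SparseGemmGen(GenericLoopGen):  # stands in for a PSpaMM-style backend
    pass

def gemm_generator_factory(description, arch):
    """Pick a generator from the operation's description and the architecture."""
    if description.get('sparse'):
        return SparseGemmGen()
    if arch in ('hsw', 'knl', 'skx'):  # architectures with a tuned backend
        return SmallGemmGen()
    return GenericLoopGen()            # fallback for everything else

gen = gemm_generator_factory({'sparse': False}, 'unknown-arch')
```

The point of the pattern is that adding a backend only touches the factory, not the callers, which is what makes it "simple to include additional code generators".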
4.9. Application interface
A class is generated for every kernel. The class has an argument-free execute function, and pointers to the input and output tensors of the kernel are public members, where the name of a member variable is given by the name of the tensor. So in order to invoke a kernel, one has to create a kernel object, set the pointers to the input and output tensors, and call the execute function. A kernel class also stores the following information as static members: the minimal number of flops, sometimes called non-zero flops, and the number of flops the implementation requires, sometimes called hardware flops. (Strictly speaking, the term “minimal” is incorrect due to Strassen's algorithm or other bounds on the number of operations; minimal is understood as minimal w.r.t. the cost function defined in Section 4.3.) The number of non-zero flops is computed by means of strength reduction with the sparse cost function in Section 4.3, and the hardware flops are returned by the code generators. These flop counters are useful for evaluating the performance of the kernels.
In addition to the kernel classes, a unit test is generated for every kernel. The unit test compares the optimised kernel to the naive implementation of the same kernel. If the Frobenius norm of the difference of both implementations matches up to a tolerance, the unit test passes. (We allow a tolerance due to finite arithmetic imprecision caused by the difference in order of addition.)
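The pass criterion of the generated unit tests can be illustrated with a short NumPy sketch. The kernels and the tolerance are stand-ins; only the Frobenius-norm comparison reflects the text.

```python
import numpy as np

def frobenius_close(result, reference, tol=1e-12):
    """Accept if the Frobenius norm of the difference is within tolerance."""
    diff = np.linalg.norm(result - reference)          # Frobenius norm
    return diff <= tol * max(np.linalg.norm(reference), 1.0)

rng = np.random.default_rng(0)
A, B = rng.random((8, 4)), rng.random((4, 8))
naive = np.einsum('ik,kj->ij', A, B)   # straightforward reference kernel
optimised = A @ B                       # stand-in for the generated kernel
assert frobenius_close(optimised, naive)
```

A non-zero tolerance is needed because, as noted above, the optimised kernel may sum contributions in a different order than the naive one.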
Moreover, a class is generated for every tensor, where information about the memory layout is stored. The most useful static member, which is declared constexpr, is the number of required floating point numbers. This information may be used to allocate stack or heap memory for a tensor. For convenience, each tensor class contains a function that returns a view object for a memory region. Using a view object the user may access a d-dimensional tensor with operator(), i.e. V(i1,...,id). Tensors whose entries are known at compile time (e.g. stiffness matrices) may be stored within the generated code as a static array, already formatted according to the selected memory layout.
To integrate YATeTo into the build process of an application, users need to have Python 3 and NumPy installed. The generated code follows C++11 syntax and depends on a small header-only library, i.e. the generated code itself requires only a recent C++ compiler. The kernels may also be called from C or Fortran by manually wrapping the kernel calls in a C function. We plan to generate C and Fortran interfaces automatically in the future.
5. Application to an ADERdG method
We integrated YATeTo into two codes in order to evaluate the practicability of our design, as well as to evaluate the performance and optimisation opportunities. Both codes employ a discontinuous Galerkin method with ADER time-stepping based on the discrete Cauchy–Kovalewski procedure (Käser et al., 2007). Furthermore, both codes solve a linear PDE in two or three dimensions of the form of Equation 2. The differences lie in the PDE (elastic wave equation, acoustic wave equation) and the underlying finite elements (tetrahedra, rectangles).
In the following, we do not derive the numerical schemes but only indicate changes w.r.t. the literature references, in order to stay within the scope of this paper. We validated both codes with a plane wave problem (e.g. (Käser et al., 2007)), and checked that the code converges (for an appropriately chosen timestep) and that the observed convergence order matches the theoretical order of the scheme. Moreover, we tried to keep the number of indices to a minimum by mostly omitting dependencies on space and time. We use the Einstein convention for subscript indices, but not for superscript indices.
5.1. SeisSol
The earthquake simulation code SeisSol (www.github.com/SeisSol/SeisSol) solves the seismic wave equation with support for elastic (Dumbser and Käser, 2006), viscoelastic (Käser et al., 2007), and viscoplastic rheological models (Wollherr et al., 2018). The code operates on unstructured tetrahedral meshes and the quantities are approximated with Dubiner's basis functions. In essence, one may write the computational kernels as matrix chain products of small matrices, where small means that the maximum dimension size of each matrix is bounded by the number of basis functions $B = (N+1)(N+2)(N+3)/6$, where $N$ is the maximum polynomial degree of the basis functions and $O = N+1$ is the theoretical convergence order. SeisSol already makes use of a code generator tailored for matrix chain products, which simplifies the implementation of the various rheological models (Uphoff and Bader, 2016).
For elastic rheological models, the code achieves about 50 % of peak performance on compute clusters based on Intel's Haswell or Knights Landing architecture (Heinecke et al., 2016a; Uphoff et al., 2017). Moreover, the authors of (Breuer et al., 2017) investigate the possibility of fusing multiple simulations in a single run, using the same ADER-DG method as SeisSol and an elastic rheological model. They report a speedup for a fourth-order scheme when fusing 8 simulations in one run (compared to 8 individual simulations). Their main optimisation ingredient is a specialised code generator for tensor contractions involving 3D tensors and sparse matrices.
5.1.1. AderDg
The numerical scheme of SeisSol is given in detail in (Dumbser and Käser, 2006; Käser et al., 2007; Uphoff et al., 2017), and is stated in compact form in Equations 13–16 for an elastic rheological model. The equations already include multiple simulations, as in (Breuer et al., 2017). Multiple simulations are easily modelled using the Einstein convention, as one simply has to add another index to the degrees of freedom, which we have denoted with a blue s.
The first step in the numerical scheme is to compute the time-derivatives of the degrees of freedom, using a Cauchy–Kovalewski procedure:
(13) 
Two indices denote the space–time element, and a further index denotes the order of the time-derivative. The stiffness matrices are sparse, and the star matrices are sparse as well; the latter are element-dependent linear combinations of the Jacobians from Equation 2. For every simulation one needs to store a matrix for the degrees of freedom and for the derivatives, yielding tensors with an additional dimension whose size is the number of simulations.
The evolution in time is predicted with a Taylor expansion, without considering an element's boundaries. In the DG scheme, the time-integral over a timestep of size $\Delta t$ is required, which is computed as
$$I = \sum_{j=0}^{N} \frac{\Delta t^{\,j+1}}{(j+1)!}\, \frac{\partial^j Q}{\partial t^j}. \qquad (14)$$
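The time integration of the Taylor expansion amounts to a weighted sum of the derivative tensors with weights $\Delta t^{\,j+1}/(j+1)!$, which follows from integrating each Taylor term over $[0, \Delta t]$. A minimal sketch (tensor shapes and names are illustrative):

```python
import math
import numpy as np

def time_integral(derivatives, dt):
    """Integrate the Taylor series over [0, dt].

    derivatives[j] holds the j-th time-derivative of the degrees of
    freedom; integrating t^j/j! over [0, dt] gives dt^(j+1)/(j+1)!.
    """
    return sum(dt ** (j + 1) / math.factorial(j + 1) * d
               for j, d in enumerate(derivatives))

# For Q(t) = Q0 + t*Q1 the exact integral over [0, 0.1] is
# 0.1*Q0 + 0.005*Q1:
Q0, Q1 = np.ones((3, 2)), 2 * np.ones((3, 2))
I = time_integral([Q0, Q1], 0.1)
```
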
The timeintegral is inserted into the DGscheme, yielding the update scheme for the degrees of freedom.
(15)  
(16) 
The matrices appearing in Equations 15 and 16 are sparse, except for the matrices that contain the solution to the 1D Riemann problem, which is solved along the four faces of a tetrahedron; these are dense.
5.1.2. Equivalent sparsity patterns
Several of the involved matrices contain large zero blocks; in particular, only a leading block of columns is non-zero. YATeTo automatically determines that the EQSPP of the corresponding operand is a subtensor. Hence, the affected contractions are automatically mapped to GEMMs over the non-zero blocks instead of full-size GEMMs. The fraction of the original flops decreases with increasing order, e.g. to 50 % for a fourth-order scheme or 37.5 % for a sixth-order scheme.
The sparsity patterns of the involved stiffness matrices are equivalent and have a staircase structure. The effect is that higher time-derivatives have progressively fewer non-zero coefficients per quantity (Breuer et al., 2014b). We may set the sparsity patterns of the derivatives accordingly in order to exploit the vanishing coefficients, or we use existing visitors of YATeTo in order to simultaneously define the derivative kernels and determine the sparsity patterns of the derivatives automatically, as shown in Figure 6.
5.1.3. Strength reduction
By solving the matrix chain multiplication order problem, the number of required flops in Equation 16 can be reduced considerably (Uphoff et al., 2017). The optimal matrix chain order is reproduced by our implementation of the strength reduction algorithm. Interestingly, for 8–32 fused simulations a different order of evaluation is optimal, which is likewise found automatically by strength reduction.
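The underlying matrix-chain-order problem is the classic dynamic programme over dense flop counts; the sketch below illustrates it (it is a stand-in for the strength reduction step, which additionally accounts for sparsity).

```python
from functools import lru_cache

def chain_cost(dims):
    """Minimum scalar multiplications for a matrix chain.

    dims has length n+1 for n matrices; matrix i has shape
    dims[i] x dims[i+1].
    """
    @lru_cache(maxsize=None)
    def best(i, j):
        if j - i == 1:               # a single matrix costs nothing
            return 0
        # try every split point k and keep the cheapest parenthesisation
        return min(best(i, k) + best(k, j) + dims[i] * dims[k] * dims[j]
                   for k in range(i + 1, j))
    return best(0, len(dims) - 1)

# (10x30)(30x5)(5x60): (AB)C costs 1500 + 3000 = 4500 multiplications,
# whereas A(BC) would cost 27000.
cost = chain_cost((10, 30, 5, 60))
```
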
5.1.4. Optimal index permutations
We take Equation 16 as an example, but note that the same discussion applies to the other kernels. From YATeTo we obtain the following mappings to LoG, for either a single simulation or for multiple simulations (Greek letters denote temporary tensors):
We observe that the optimal index permutation algorithm from Section 4.5 fulfils the goals of the cost function: Nonunit stride GEMMs are not present, transposes are kept to a minimum and indices are fused if possible.
In SeisSol, the implementation is in fact a bit different: several of the involved matrices (as well as the star matrices) are stored transposed, such that a transpose-free scheme is obtained. For multiple simulations we can also obtain a transpose-free scheme, but we face the opposite situation: the aforementioned matrices are stored in normal order, and other operands have to be stored transposed instead, here and in the other kernels.
5.1.5. Findings
YATeTo is able to reproduce the major optimisations of SeisSol: zero blocks are exploited and the optimal matrix chain multiplication order is found. For multiple simulations, zero blocks are also exploited automatically, strength reduction reveals the optimal (and different) evaluation order, and a transpose-free scheme is found by inspection of the mappings to GEMM produced by YATeTo.
5.2. LinA
The code LinA was developed for educational purposes and contains a basic implementation of the ADER-DG method. LinA solves the linearised equations of acoustics in two dimensions (LeVeque, 2002), which involve three quantities: pressure and the particle velocities in horizontal and vertical direction. A uniform mesh with rectangular elements is used.
Our goal here is to evaluate the application of YATeTo to a DG-spectral-element-like method (cf. (Kopriva, 2009, Chapter 5.4) for an introduction to DGSEM). As briefly introduced in Section 3, quantities are represented using a tensor product of 1D basis functions. An advantage of DGSEM is that integrals over the reference elements may be split into a tensor product of 1D integrals. E.g. the computation of the gradient in the weak form can be done with $\mathcal{O}(d\,N^{d+1})$ instead of $\mathcal{O}(N^{2d})$ computations per element, where $d$ is the number of dimensions and $N$ is the number of 1D basis functions. On the downside, one needs to store $N^d$ degrees of freedom per quantity instead of $\binom{N+d-1}{d}$ per quantity, which is the minimum number of degrees of freedom needed to represent the same space of polynomials.
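The tensor-product structure can be made concrete with NumPy: differentiating along one axis applies the 1D differentiation matrix to that axis only, which is equivalent to a Kronecker-product operator but far cheaper. D and U below are random stand-ins for an actual differentiation matrix and element coefficients.

```python
import numpy as np

N, d = 4, 3
rng = np.random.default_rng(1)
D = rng.random((N, N))        # 1D differentiation matrix (stand-in)
U = rng.random((N, N, N))     # nodal coefficients of one element

# derivative along the first axis: one small GEMM per axis,
# O(N^{d+1}) work instead of O(N^{2d}) for a full d-D operator
dUdx = np.einsum('il,ljk->ijk', D, U)

# equivalent to applying the Kronecker factor (D ⊗ I ⊗ I) to vec(U)
full = np.kron(np.kron(D, np.eye(N)), np.eye(N)) @ U.reshape(-1)
```

The einsum route touches $N^4 = 256$ multiply-adds here, while the explicit Kronecker operator is a $64 \times 64$ matrix; the gap widens rapidly with $N$ and $d$.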
We implemented the ADER-DGSEM method in LinA using YATeTo (www.github.com/TUM-I5/LinA), and extended the code to three dimensions, which implies an additional quantity, or 4 quantities in total. For the 1D basis functions we use a nodal basis with Gauss–Lobatto points and Legendre polynomials (Hesthaven and Warburton, 2008). In the following we are going to abuse notation and use the same symbols for tensors as in SeisSol, even though the tensors have different shapes and different sparsity patterns.
5.2.1. Numerical scheme
The first step in LinA, as in SeisSol, is to predict the elementlocal evolution in time:
(17)  
(18) 
Indices denote the location of an element on the grid and is the time index. The matrices are matrices and and are sparse matrices. Both and