Yet Another Tensor Toolbox for discontinuous Galerkin methods and other applications

03/27/2019, by Carsten Uphoff et al., Technische Universität München

The numerical solution of partial differential equations is at the heart of many grand challenges in supercomputing. Solvers based on high-order discontinuous Galerkin (DG) discretisation have been shown to scale on large supercomputers with excellent performance and efficiency, if the implementation exploits all levels of parallelism and is tailored to the specific architecture. However, every year new supercomputers emerge and the list of hardware-specific considerations grows, simultaneously with the list of desired features in a DG code. Thus we believe that a sustainable DG code needs an abstraction layer to implement the numerical scheme in a suitable language. We explore the possibility to abstract the numerical scheme as small tensor operations, to describe them in a domain-specific language (DSL) resembling the Einstein notation, and to map them to existing code generators which generate small matrix-matrix multiplication routines. The compiler for our DSL implements classic optimisations that are used for large tensor contractions, and we present novel optimisation techniques such as equivalent sparsity patterns and optimal index permutations for temporary tensors. Our application examples, which include the earthquake simulation software SeisSol, show that the generated kernels achieve over 50 % of peak performance, while the DSL considerably simplifies the implementation.


1. Introduction

Solving partial differential equations (PDEs) is one of the pillars of computational science and engineering, and solving PDEs accurately on a computer is one of the grand challenges in high performance computing. Simulations may have billions of degrees of freedom, hence highly scalable codes that make efficient use of the invested energy are required. But highly efficient software often requires expert knowledge, the resulting code might not reflect the underlying numerical scheme, and code tends to become complex: While on the one hand a software package might want to support several PDEs, several finite element types, or multi-physics simulations, on the other hand developers have to deal with the subtleties of modern hardware architectures, for example vector instruction latency or level 1 cache bandwidth. Moreover, the evolution of hardware architectures may require the adaptation of all implemented combinations or even a radical change of data structures.

A strategy to deal with the mentioned complexity is to choose a proper abstraction. For example in linear algebra, a lot of operations may be implemented with only a few simple inner-most kernels (e.g. (Van Zee and van de Geijn, 2015)). The advantage is two-fold, as firstly only a few kernels need to be optimised by an expert, and secondly all projects that build on these kernels may benefit from improvements. Abstracting algorithms in terms of linear algebra operations, especially the General Matrix-Matrix Multiplication (GEMM), is becoming popular for the numerical solution of PDEs, too. High-order discontinuous Galerkin (DG) or spectral element methods usually consist of small element-local operators, such that their implementation may be expressed as a sequence of small GEMMs (Vincent et al., 2016; Uphoff et al., 2017; Hutchinson et al., 2016; Breuer et al., 2017). The matrices are usually small enough that data movement considerations within a GEMM become irrelevant. At the same time, the overhead of calling BLAS and the choice of GEMM micro-kernels becomes significant, such that specialised code generators for small GEMMs are employed (Heinecke et al., 2016b), or even specialised code generators for small tensor contractions are developed (Breuer et al., 2017). An alternative approach is to express tensor operations in an appropriate intermediate representation, which captures the underlying loop structure, and to use compiler techniques and cost models in order to generate efficient code (Stock et al., 2011; Luporini et al., 2015; Kempf et al., 2018).

The abstraction of tensor operations is not new but has been researched for decades in the field of computational chemistry. Here, one may distinguish two different levels of abstraction: The first level of abstraction is the move from an expression involving several tensors to binary tensor operations. On this level of abstraction, algebraic transformations may be employed. In so-called strength reduction, for example, the optimal sequence of tensor operations is determined (under a memory constraint) such that the total number of floating point operations is minimised, which is then usually orders of magnitude smaller than for a naive implementation (Lam et al., 1997). The most prominent representative is the Tensor Contraction Engine (TCE) (Baumgartner et al., 2005), but there also exists a framework for GPUs called Barracuda (Nelson et al., 2015). The second level of abstraction is the implementation of binary tensor operations, especially of tensor contractions. For the latter, a variety of algorithms exist, such as nested-loop code, Loop-over-GEMMs (LoG), Transpose-Transpose-GEMM-Transpose (TTGT), and GEMM-like Tensor-Tensor multiplication (GETT) (Springer and Bientinesi, 2018, and references therein).

In computational chemistry tensors may become very large, such that intermediate results may not remain in memory but must be loaded from disk (Baumgartner et al., 2005). So despite the fact that quantum chemistry models and high-order DG methods may be abstracted as tensor operations, the situation is quite different for the latter, as shall be outlined in the following:


  • All matrices fit inside memory or even inside low-level caches. There is no need to optimise for memory usage or fuse loops to trade off memory usage and operation count (Baumgartner et al., 2005).

  • Tensor operations do not need to be distributed among nodes.

  • Data copies or index permutations may be amortised for large GEMMs (Goto and van de Geijn, 2008; Springer and Bientinesi, 2018), as these scale with $\mathcal{O}(n^2)$ compared to $\mathcal{O}(n^3)$ floating point operations, but this is usually not the case for small GEMMs (Heinecke et al., 2016b). (Which excludes TTGT.)

  • All tensor dimensions and respective sparsity patterns in element-local operations are known a priori. Furthermore, spending a lot of time optimising individual tensor operations can be justified, as those are called millions of times during simulations (Uphoff et al., 2017).

  • Sparsity can be dealt with explicitly, as it is feasible to compute the sparsity pattern of a tensor operation during the invocation of a code generator. There is no need to estimate the sparsity, as proposed in (Lam et al., 1999).

  • Some optimisations are special to small GEMMs or DG methods, such as padding matrices for optimal vectorisation or tailored software prefetching schemes (Heinecke et al., 2016a).

Other publications and open-source packages focus on binary tensor contractions (Solomonik et al., 2013; Springer and Bientinesi, 2018; Shi et al., 2016; Li et al., 2015; Matthews, 2018), GPUs (Nelson et al., 2015), or focus on loop transformations (Stock et al., 2011; Luporini et al., 2015; Kempf et al., 2018), where the latter lack support for sparse matrices in element-local operators and are to our understanding not designed for use with code generators for small GEMMs.

We therefore present Yet Another Tensor Toolbox (YATeTo), which we explicitly designed for use in PDE solvers. In YATeTo, tensor operations are described in a high-level language that is motivated by the Einstein convention. An abstract syntax tree (AST) is formed, on which algebraic transformations and other optimisations are done. Afterwards, code generators for small GEMMs are called and finally C++11 code is generated which calls either the generated kernels or BLAS libraries. Our design does not prescribe the use of a specific implementation of binary tensor contractions, but allows choosing the implementation based on operation type and hardware architecture. Note that our main focus is high-order DG and spectral element methods, and that our ultimate goal is that the performance of the YATeTo-generated code matches the performance of hand-crafted code, but the toolbox itself may be used for arbitrary applications where small tensor operations are required and tensor dimensions and respective sparsity patterns are known a priori.

2. Related work

Many components of YATeTo build on previous algorithms and ideas. In this section we summarise related work and necessary background.

2.1. High-level language and representation

The need for high-level languages for tensor contractions was already clear to A. Einstein, who wrote in the foundations of general relativity:

“We therefore introduce the convention: if an index occurs twice in one term of an expression, it is always to be summed over, unless the contrary is expressly noted.” (Einstein, 1916; translated from the German original)

The so-called Einstein convention states that one shall sum over every index that appears twice in a term. Elegant domain-specific languages may be defined using this convention. For example, in the Cyclops tensor framework one may write (Solomonik et al., 2013)

    W["MnIj"] = V["MnEf"] * T["EfIj"];
    Z["AbIj"] = W["MnIj"] * T["AbMn"];

which is valid C++ (using operator overloading) and implements

$$W_{mnij} = \sum_{e,f} V_{mnef} T_{efij}, \qquad Z_{abij} = \sum_{m,n} W_{mnij} T_{abmn}.$$

The Barracuda framework defines a similar high-level language (Nelson et al., 2015).

In the TCE, tensor operations are represented internally as function sequences consisting of two types of formulae (Lam et al., 1997): The first type is a multiplication formula of the form $f_r[\ldots] = f_s[\ldots] \times f_t[\ldots]$. Depending on the indices inside the square brackets a different operation is represented. E.g. $C[i,j,k,l] = A[i,j] \times B[k,l]$ represents a tensor product whereas $C[i,j] = A[i,j] \times B[i,j]$ represents a Hadamard product. (Note that in the latter two indices appear twice but are not summed over, that is, this representation does not adhere to the Einstein convention.) The second type is a summation formula of the form $f_r[\ldots] = \sum_i f_s[\ldots]$. Here, the dimension of the left-hand side is always one less than the right-hand side, and the dimension to be summed over is determined by the indices in the brackets.
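The two formula types can be mimicked with NumPy's einsum for intuition (a sketch for illustration only; the TCE operates on its own intermediate representation):

    import numpy as np

    A = np.random.rand(4, 4)
    B = np.random.rand(4, 4)

    # Multiplication formulae: the index sets decide the operation.
    tensor_product = np.einsum('ij,kl->ijkl', A, B)  # C[i,j,k,l] = A[i,j] * B[k,l]
    hadamard = np.einsum('ij,ij->ij', A, B)          # C[i,j] = A[i,j] * B[i,j]

    # Summation formula: reduce exactly one dimension.
    summed = tensor_product.sum(axis=3)              # C[i,j,k] = sum_l C[i,j,k,l]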

2.2. Strength reduction

In many cases there are several mathematically equivalent ways to implement tensor operations. In (Baumgartner et al., 2005) the following motivating example is given, where each dimension has size 10:

$$S_{abij} = \sum_{c,d,e,f,k,l} A_{acik}\, B_{befl}\, C_{dfjk}\, D_{cdel} = \sum_{c,k} A_{acik} \sum_{d,f} C_{dfjk} \sum_{e,l} B_{befl}\, D_{cdel}$$

The first implementation corresponds to the naive implementation with 10 nested loops, which requires $\mathcal{O}(N^{10})$ operations for dimension size $N$. The second implementation saves intermediate results and only requires $\mathcal{O}(N^6)$ operations. Hence, the second implementation is four orders of magnitude cheaper than the first implementation.

Naturally, one asks if and how optimal implementations can be found automatically. A formal optimisation problem is set up in (Lam et al., 1997) using the two types of formulae mentioned in Section 2.1. It is shown that this optimisation problem is NP-complete and an efficient search procedure is developed in (Lam et al., 1997), which exhaustively searches all valid formulae and hence finds the optimal implementation. As the number of dimensions arising in practice is typically small, this search procedure is feasible. The number of dimensions is also small enough in our application examples, see Section 5, hence there is no need to develop new algorithms for strength reduction.

2.3. Loop-over-GEMM (LoG)

The authors of (Di Napoli et al., 2014) give a summary of tensor contraction classes and possible mappings to BLAS. A binary contraction is classified by the number of free indices of each input tensor (that is, those indices which are not contracted). The most interesting case is when both tensors have at least one free index, as in this case a mapping to GEMM is always possible, if data is allowed to be copied or non-stride-one GEMMs are considered.

Consider the following setting from (Springer and Bientinesi, 2018), in which both input tensors have free indices in addition to a contracted index. Following (Shi et al., 2016), we have the following options in a LoG implementation: We may batch modes, yielding a non-free index (denoted with square brackets, e.g. [c]); we may flatten modes, yielding a single free index (denoted with parentheses, e.g. (ab)); and we may transpose modes, denoted with a superscript T. Note that transposes may only be applied to exactly 2 free modes.

A possible LoG implementation loops over the indices c and k; inside the loop, GEMM is called for subtensors of C, B, and A. Such an implementation is listed in (Springer and Bientinesi, 2018, Listing 8).
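To make the LoG pattern concrete, the following sketch implements a hypothetical contraction $C_{abck} = \sum_l A_{alc} B_{lbk}$ (chosen only for illustration; it is not the contraction used in the cited papers):

    import numpy as np

    # Hypothetical contraction C[a,b,c,k] = sum_l A[a,l,c] * B[l,b,k].
    a, b, c, k, l = 8, 8, 4, 4, 8
    A = np.random.rand(a, l, c)
    B = np.random.rand(l, b, k)
    C = np.empty((a, b, c, k))

    # Loop over the free indices c and k; each iteration is a plain GEMM
    # on matrix slices of A, B, and C.
    for ic in range(c):
        for ik in range(k):
            C[:, :, ic, ik] = A[:, :, ic] @ B[:, :, ik]

    # Reference implementation for comparison.
    assert np.allclose(C, np.einsum('alc,lbk->abck', A, B))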

3. Tensors in discontinuous Galerkin methods

The discontinuous Galerkin method has become a popular method for solving hyperbolic PDEs. The DG method has several advantages compared to other classic methods, such as Finite Difference methods, Finite Volume methods, and continuous Finite Element methods: It simultaneously handles complex geometries, achieves high-order convergence rates, and has an explicit semi-discrete form (Hesthaven and Warburton, 2008). Additionally, the high number of element-local operations and the small stencil make it a promising candidate to exploit the capabilities of modern supercomputing hardware. Before discussing the details of our tensor toolbox, we want to motivate why tensor contractions are a natural abstraction for DG methods.

As an example, we consider the following linear system of PDEs:

$$\frac{\partial q_p}{\partial t} + A_{pq} \frac{\partial q_q}{\partial x} + B_{pq} \frac{\partial q_q}{\partial y} + C_{pq} \frac{\partial q_q}{\partial z} = 0 \qquad (2)$$

Multiplying the latter equation with a test function $\phi_k$, integrating over a domain $\Omega$, and integrating by parts yields the corresponding weak form

$$\int_\Omega \phi_k \frac{\partial q_p}{\partial t}\, dV + \int_{\partial\Omega} \phi_k \left( n_x A_{pq} + n_y B_{pq} + n_z C_{pq} \right) q_q \, dS$$
$$- \int_\Omega \left( \frac{\partial \phi_k}{\partial x} A_{pq} + \frac{\partial \phi_k}{\partial y} B_{pq} + \frac{\partial \phi_k}{\partial z} C_{pq} \right) q_q \, dV = 0 \qquad (3)$$

where $n = (n_x, n_y, n_z)$ is the outward-pointing normal vector. From here a sequence of steps follows to obtain a semi-discrete form, such as partitioning the domain in finite elements (e.g. tetrahedra or hexahedra), introducing a numerical flux to weakly couple the finite elements, transforming the physical elements to a reference element, and introducing a polynomial approximation of the quantities on the reference element (Atkins and Shu, 1998; Dumbser and Käser, 2006; Hesthaven and Warburton, 2008). After these steps, the semi-discrete form is a system of ODEs, which may, for example, be solved with a Runge-Kutta scheme or the ADER approach (Dumbser and Käser, 2006).

We skip some steps for simplicity, and assume that the quantities are discretised as $q_p \approx \phi_k Q_{kp}$ at a given point in time on an element. Here, the $\phi_k$, $k = 1, \ldots, B$, are a set of polynomial basis functions and $Q$ is a $B \times q$ matrix, storing the coefficients of the basis expansion for every quantity, where $q$ is the number of quantities. By inserting our discretisation we may write the second line of (3) in the following way

$$\int_\Omega \frac{\partial \phi_k}{\partial x} \phi_l \, dV\; Q_{lq} A_{qp} + \int_\Omega \frac{\partial \phi_k}{\partial y} \phi_l \, dV\; Q_{lq} B_{qp} + \int_\Omega \frac{\partial \phi_k}{\partial z} \phi_l \, dV\; Q_{lq} C_{qp} \qquad (4)$$

The integrals in (4) may be pre-computed and stored as (so-called) stiffness matrices $K^x$, $K^y$, and $K^z$ of size $B \times B$. (Note that these matrices need to be computed only for the reference element in an actual implementation.) The implementation in terms of GEMM is then given by the sequence of matrix chain products $K^x Q A^T + K^y Q B^T + K^z Q C^T$.
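A minimal NumPy sketch of this volume kernel; the sizes and the random matrices are placeholders for the precomputed stiffness matrices and Jacobians:

    import numpy as np

    nb, q = 20, 9  # e.g. 20 basis functions (order 4, tetrahedra), 9 quantities
    Kx, Ky, Kz = (np.random.rand(nb, nb) for _ in range(3))
    A, B, C = (np.random.rand(q, q) for _ in range(3))
    Q = np.random.rand(nb, q)

    # Volume kernel as a sum of small matrix chain products.
    V = Kx @ Q @ A.T + Ky @ Q @ B.T + Kz @ Q @ C.T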

When one uses the unit cube as reference element, then it is possible to use a spectral basis, where the basis functions and test functions are given by $\phi_{lmn}(x,y,z) = \theta_l(x)\,\theta_m(y)\,\theta_n(z)$. The degrees of freedom are then stored in a 4-dimensional tensor $Q$, such that $q_p \approx \theta_l \theta_m \theta_n Q_{lmnp}$. In each dimension we have $N+1$ basis functions, where $N$ is the polynomial degree, and $Q$ has $(N+1)^3 q$ entries. Inserting the discretisation and the test functions in the second line of (3), the x-derivative term, for example, becomes

$$K_{lu}\, M_{mv}\, M_{nw}\, Q_{uvwp}\, A_{pq},$$

where $K_{lu} = \int_0^1 \theta'_l \theta_u \, dx$ and $M_{mv} = \int_0^1 \theta_m \theta_v \, dx$. We note here that the index order lmnp is chosen arbitrarily. In fact, the index permutations plmn, lpmn, and lmpn are equally viable, and it depends on the application which is best. E.g. if $q$ is large and $N+1$ is smaller than the SIMD vector width then it might be beneficial to store index $p$ continuous in memory. Otherwise, if $N+1$ is large and $q$ is small then either of $l$, $m$, $n$ should be continuous in memory.

As a final example, one may solve multiple problems concurrently. E.g. one may add another dimension $s$ to the degrees of freedom in (4) to obtain a 3D tensor $Q_{slq}$, which stores multiple degrees of freedom (Breuer et al., 2017). The matrix chain product then becomes a tensor contraction sequence, e.g. $C_{skp} = K^x_{kl} Q_{slq} A_{qp}$, and may be implemented using the Loop-over-GEMM approach, e.g. by looping over one of the free indices and mapping the remaining contraction to GEMM, as sketched below.
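For illustration, a NumPy cross-check of such a fused contraction, looping over the simulation index (sizes are arbitrary):

    import numpy as np

    s, k, l, q, p = 8, 20, 20, 9, 9
    K = np.random.rand(k, l)
    Q = np.random.rand(s, l, q)
    Astar = np.random.rand(q, p)

    # Loop-over-GEMM: loop over s; each slice Q[i] is a plain matrix.
    C = np.empty((s, k, p))
    for i in range(s):
        C[i] = K @ Q[i] @ Astar

    assert np.allclose(C, np.einsum('kl,slq,qp->skp', K, Q, Astar))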

4. Yet Another Tensor Toolbox

Figure 1. Overview of YATeTo, showing the path from the high-level language to the generated code. The first stage (left box) operates on an abstract syntax tree, in the second stage (middle box) the representation is changed to a simple control flow graph without branches, and in the last stage (right box) different back-ends are called which output C++ or assembly.

The input to our tensor toolbox is a domain-specific language (DSL), which naturally resembles the Einstein notation. An abstract syntax tree (AST) is derived from an expression. Subsequently, the tree is shaped using a sequence of visitors (Gamma et al., 1995). Afterwards, we transform the AST to a control flow graph (CFG) for standard compiler techniques (Seidl et al., 2012). Note that the CFG is very simple, as the DSL features neither loops nor branches. Finally, a code generator is called, which either generates C++11 code itself or makes use of existing code generators or BLAS libraries. An overview is shown in Figure 1. YATeTo is open-source and available at www.github.com/SeisSol/yateto.

In the following, we are going to introduce our DSL and elaborate algorithms and design choices of each of the three stages.

4.1. High-level description of tensor operation

The DSL we created is embedded into Python 3. An embedded DSL has the advantage that we do not need our own parser and lexer, which decreases the complexity of the software. Furthermore, a user may use any feature of Python to form expressions, for example including loops, branches, classes, and lambda expressions.

The basic data types are the classes Tensor and Scalar. A tensor definition includes a name and a shape tuple, e.g. as in the following:

    N = 8
    A = Tensor('A', (N, N))
    B = Tensor('B', (N, N, N))
    w = Tensor('w', (N,))
    C = Tensor('C', (N, N))

A and C are matrices, B is a third order tensor, and w is a vector, where all dimensions are of size N.

From a Tensor, an object of type IndexedTensor may be derived by supplying an index string. The convention is that the length of the index string must be equal to the number of dimensions of the tensor and that only lower- and upper-case letters may be used. (Note that this limits the number of distinct indices that may appear in an expression to 52. We think that this is not an issue in practice, hence we favour simplicity over generality.) Objects of type IndexedTensor may appear in expressions, e.g.

    kernel = C['ij'] <= 2.0 * C['ij'] + A['lj'] * B['ikl'] * w['k']

The design is inspired by (Solomonik et al., 2013) (see Section 2.1), and represents the operation

$$C_{ij} \leftarrow 2 C_{ij} + \sum_{k,l} A_{lj} B_{ikl} w_k.$$

The variable kernel contains an AST with nodes Assign, Add, ScalarMultiplication, and Einsum, where the latter represents a tensor operation using the Einstein convention.

The AST is built by overloading the binary operators ’*’ and ’+’. One of the following four actions is taken whenever one of the operators is invoked for operands op1 and op2, where nodeType = Einsum for ’*’ and nodeType = Add for ’+’:

  1. type(op1) = nodeType and type(op2) = nodeType: Append op2's children to op1.

  2. type(op1) = nodeType and type(op2) ≠ nodeType: Append op2 to op1.

  3. type(op1) ≠ nodeType and type(op2) = nodeType: Prepend op1 in op2.

  4. type(op1) ≠ nodeType and type(op2) ≠ nodeType: Return a new node of type nodeType with children op1 and op2.

These actions ensure that a node of type nodeType does not have children of type nodeType, but all tensors w.r.t. the same nodeType are merged. This is important, especially for Einsum and the algorithms presented in Section 4.2 and Section 4.3, which require all tensors participating in a tensor operation to be known. Note that we do not apply distributive laws, e.g. in an expression such as A['ij'] * (B['jk'] + C['jk']) the addition is always evaluated before the Einstein sum.

For scalar multiplications we take similar actions to ensure that ScalarMultiplication is always the parent node of an Einsum.
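A condensed sketch of how the four actions above can be realised via operator overloading (hypothetical class names; the actual YATeTo implementation differs in detail):

    class Node:
        def __init__(self, *children):
            self.children = list(children)

        def __mul__(self, other):
            return _merge(Einsum, self, other)

        def __add__(self, other):
            return _merge(Add, self, other)

    class Einsum(Node): pass
    class Add(Node): pass

    def _merge(nodeType, op1, op2):
        # Ensure a node of type nodeType never has children of the same type.
        if isinstance(op1, nodeType) and isinstance(op2, nodeType):
            op1.children += op2.children          # action 1
            return op1
        if isinstance(op1, nodeType):
            op1.children.append(op2)              # action 2
            return op1
        if isinstance(op2, nodeType):
            op2.children.insert(0, op1)           # action 3
            return op2
        return nodeType(op1, op2)                 # action 4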

4.2. Equivalent sparsity patterns

(a) Original sparsity patterns, stored in memory.
(b) Equivalent sparsity patterns, used during contractions.
Figure 2. We show a kernel, depicted with mode-n products, which is similar to an application example in (Uphoff and Bader, 2016) but with an additional dimension added to the degrees of freedom. The degrees of freedom (I, Q) are given as full tensors (Figure 2a), but we detect that large blocks in I do not influence the final result, as they are multiplied with zeros in the tensor contractions (Figure 2b).

The notion of equivalent sparsity patterns (EQSPPs) and an algorithm to compute them was introduced in (Uphoff and Bader, 2016) for matrix chain products (MCP). They are defined as the minimal sparsity patterns of the involved matrices that leave the result of an MCP unchanged. In this definition, no cancellation is assumed, which means that adding non-zero entries never results in zero (Cohen, 1998). First, this assumption allows us to work with boolean matrices alone. Second, we assume that the sparsity pattern is known but the values are unknown (or known only at run-time), which has the following implication: In an inner product, which is the core of matrix multiplication, we can always find a pair of non-orthogonal vectors. That is, if the values of at least one matrix are unknown, we cannot guarantee cancellation.

The concept of EQSPPs can be illustrated with the following example (where the entries may be either scalars or dense block matrices):

$$\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} a & 0 \\ 0 & 0 \end{pmatrix}$$

Here, the multiplication with the first matrix removes c and d from the right-hand side and the multiplication with the third matrix removes b and d from the right-hand side. Thus, the equivalent sparsity pattern of the second matrix is only non-zero for the top-left block (a).

In (Uphoff and Bader, 2016) an algorithm to compute EQSPPs for MCPs is presented, which is based on a graph representation. The extension of the MCP graph to general tensor operations is not straightforward, thus we derive an algorithm that is not based on a graph in the remainder of this section.

4.2.1. Formalisation of tensor operations

So far we have defined tensor operations via handwaving and examples, but here we need to introduce a formal definition of the “*”-expression in our DSL.

First of all, we formalise the labelling of the dimensions of a tensor A. It consists of a set of indices and an assignment of indices to dimensions. We denote the set of indices with calligraphic letters, e.g. $\mathcal{A}$, which must be a subset of an alphabet $\Sigma$ (i.e. an element of its power-set), where e.g. $\Sigma = \{a, \ldots, z, A, \ldots, Z\}$. We require $|\mathcal{A}| = \dim(A)$. (Note that this requirement excludes the trace of a matrix or similar expressions. However, one may write the trace as a contraction with the identity matrix, e.g. $I_{ij} A_{ij}$, which conforms with the requirement.) The assignment of indices to dimensions is a bijective mapping $\iota: \{1, \ldots, \dim(A)\} \to \mathcal{A}$. The range of an index $j$ is always $\{0, \ldots, n_j - 1\}$, where $n_j$ is the size of the dimension to which $j$ is assigned.

Example 1.

Let $A \in \mathbb{R}^{3 \times 4 \times 5}$. The index set of A['jik'] is $\mathcal{A} = \{i, j, k\}$, the map is given by $\iota(1) = j$, $\iota(2) = i$, $\iota(3) = k$, and the index ranges are $n_j = 3$, $n_i = 4$, $n_k = 5$.

Second, assume a “*”-expression is formed by tensors $A^{(1)}, \ldots, A^{(n)}$ and the result tensor $C$. Einstein's convention requires us to sum over all indices that appear twice, which we formalise in the following: The index space of tensor $A^{(i)}$ is $\mathcal{I}^{(i)} = \prod_{j \in \mathcal{A}^{(i)}} \{0, \ldots, n_j - 1\}$, where $n_j$ is the index range of the dimension labelled $j$. In order to span an iteration space, we introduce the global index set $\mathcal{G} = \mathcal{C} \cup \bigcup_{i=1}^{n} \mathcal{A}^{(i)}$, and the global index space $\mathcal{I} = \prod_{j \in \mathcal{G}} \{0, \ldots, n_j - 1\}$. Similar to a tensor, we assign global indices to dimensions of the global index space with a bijective function $g: \{1, \ldots, |\mathcal{G}|\} \to \mathcal{G}$. (The function $g$ may for example be obtained by ordering the global index set lexically.) In order to “access” entries of a tensor, we introduce projection and permutation functions $\pi^{(i)}: \mathcal{I} \to \mathcal{I}^{(i)}$, which select and reorder the components of a global multi-index according to $\mathcal{A}^{(i)}$.

Entries of $A^{(i)}$ are accessed with the notation $A^{(i)}[\pi^{(i)}(x)]$, where $x \in \mathcal{I}$, or shortly with $A^{(i)}[x]$, where the permutation and projection function is understood from context. We define the restriction of the global index space to an index set $\mathcal{A} \subseteq \mathcal{G}$ as $\mathcal{I}|_{\mathcal{A}} = \prod_{j \in \mathcal{A}} \{0, \ldots, n_j - 1\}$.

With the set-up formalism, we are able to precisely state the meaning of a “*”-expression: for all $c$ in the index space of $C$,

$$C[c] = \sum_{\substack{x \in \mathcal{I} \\ \pi^{(0)}(x) = c}} \; \prod_{i=1}^{n} A^{(i)}[\pi^{(i)}(x)], \qquad (5)$$

where $\pi^{(0)}$ denotes the projection onto the indices of $C$.
Example 2.

Consider tensors $A \in \mathbb{R}^{3 \times 4 \times 5}$ and $B \in \mathbb{R}^{6 \times 4}$. In Einstein notation, the mode-2 product is written as $C_{ijk} = A_{ilk} B_{jl}$ and in our DSL as C['ijk'] <= A['ilk'] * B['jl'].

In our formal notation, the global index set is $\mathcal{G} = \{i, j, k, l\}$, where $n_i = 3$, $n_j = 6$, $n_k = 5$, $n_l = 4$. The permutation and projection functions are $\pi^{(0)}(i,j,k,l) = (i,j,k)$, $\pi^{(1)}(i,j,k,l) = (i,l,k)$, and $\pi^{(2)}(i,j,k,l) = (j,l)$. We may compute the entries of C with

$$C[i,j,k] = \sum_{l=0}^{3} A[i,l,k]\, B[j,l].$$
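The example can be cross-checked with NumPy's einsum, which follows the same convention:

    import numpy as np

    A = np.random.rand(3, 4, 5)
    B = np.random.rand(6, 4)

    # Mode-2 product: C[i,j,k] = sum_l A[i,l,k] * B[j,l]
    C = np.einsum('ilk,jl->ijk', A, B)
    assert C.shape == (3, 6, 5)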

4.2.2. Computation of EQSPPs for tensor operations

As we assume no cancellation in sums, it is sufficient to model a sparsity pattern as a boolean tensor. The EQSPP of a tensor $A^{(i)}$ is called $E_i$, where $E_i$ is a boolean tensor of the same shape as $A^{(i)}$. The “hat”-tensor $\hat{A}^{(i)}$ is the original tensor masked with $E_i$. In operations involving boolean tensors we identify 0 with false and 1 with true.

Having set up the necessary formalism, we are ready to formally define EQSPPs:

Definition 4.1.

Assuming no cancellation, we call $E_1, \ldots, E_n$ equivalent sparsity patterns w.r.t. (5) if

  1. $\hat{C} = C$, where $\hat{C}$ is computed from (5) with each $A^{(i)}$ replaced by its masked version $\hat{A}^{(i)}$,

  2. The number of non-zeros of $E_1, \ldots, E_n$ is minimal, that is, we cannot set a non-zero to zero without implying $\hat{C} \neq C$.

As we assume no cancellation, we may disregard the sums over indices, and instead need to only consider a product over all tensors on the right-hand side:

Lemma 4.2.

The EQSPPs w.r.t. (5) are equivalent to the EQSPPs w.r.t. the product

$$P[x] = \prod_{i=1}^{n} A^{(i)}[\pi^{(i)}(x)], \qquad x \in \mathcal{I}.$$

Proof.

Clearly, summing $\hat{P}$ over the contracted indices yields $\hat{C}$, hence condition 1 is fulfilled.

In order to check condition 2, assume there is another set of EQSPPs, $E'_1, \ldots, E'_n$, which has fewer non-zeros than $E_1, \ldots, E_n$. Then $\hat{P}' \neq \hat{P}$ (otherwise $E_1, \ldots, E_n$ would not be minimal). As $\hat{P}' \neq \hat{P}$ and a non-zero entry of the product does not get cancelled in sums, it follows that there exists an index where $\hat{C}' \neq C$. ∎

Our main result is that we may cast the computation of EQSPPs as boolean tensor operations on the original sparsity patterns:

Proposition 1.

The EQSPPs w.r.t. (5) are given by

$$E_i[a] = S_i[a] \wedge \bigvee_{\substack{x \in \mathcal{I} \\ \pi^{(i)}(x) = a}} \; \bigwedge_{j \neq i} S_j[\pi^{(j)}(x)], \qquad (6)$$

where $S_i$ is the sparsity pattern of $A^{(i)}$, $i = 1, \ldots, n$.

Proof.

We are going to show that (6) computes the EQSPPs for the product of tensors (see Lemma 4.2). In order to satisfy condition 1, we only need to show that the sparsity patterns of $P$ and $\hat{P}$ are identical. The sparsity pattern of $P$ is given by $S_P[x] = \bigwedge_{i=1}^{n} S_i[\pi^{(i)}(x)]$, and the sparsity pattern of $\hat{P}$ is given by $S_{\hat{P}}[x] = \bigwedge_{i=1}^{n} E_i[\pi^{(i)}(x)]$.

First, we show that $S_{\hat{P}} = S_P$. Note that

$$E_i[\pi^{(i)}(x)] \wedge S_P[x] = S_P[x], \qquad (7)$$

where the equality follows from idempotence and, if $S_P[x] = 1$, from the fact that the disjunction in (6) contains the term for $x$ itself, by the definition of $E_i$. Using (7) we obtain

$$S_{\hat{P}}[x] = \bigwedge_{i=1}^{n} E_i[\pi^{(i)}(x)] = S_P[x], \qquad (8)$$

where the last identity follows from $E_i \leq S_i$ and the absorption law. Equation 8 is true for all $x \in \mathcal{I}$, hence $S_{\hat{P}} = S_P$.

Second, in order to satisfy condition 2, we show that the number of non-zeros is minimal: Assume that there exists another set of sparsity patterns $E'_1, \ldots, E'_n$ with fewer non-zeros than $E_1, \ldots, E_n$. Clearly, $E'_i \leq S_i$ and $E_i \leq S_i$, because entries outside the original sparsity pattern are zero anyway. Hence, there must be at least one pair $(i, a)$ such that $E'_i[a] = 0$ but $E_i[a] = 1$. As $E_i[a] = 1$ it immediately follows from (6) that there exists an $x \in \mathcal{I}$ with $\pi^{(i)}(x) = a$ and $S_j[\pi^{(j)}(x)] = 1$ for all $j$; setting entry $a$ of $A^{(i)}$ to zero thus changes $P[x]$ from non-zero to zero,

which violates condition 1. ∎

4.2.3. Implementation and discussion

We have applied our EQSPP algorithm to an application example from (Uphoff and Bader, 2016), where we extended the degrees of freedom by an additional dimension (cf. Section 5.1). The original SPPs and EQSPPs can be seen in Figure 2. We observe that the sparsity patterns of the other matrices induce additional zeros in I which may be disregarded when computing the tensor operation. For example, in the evaluation of the kernel we may restrict the loop ranges of the affected indices.

The implementation of EQSPP-computation with (6) is straightforward, as we require the same kind of tensor operations that we support within YATeTo, the only difference being that the tensors are boolean. E.g. we may apply strength reduction (cf. Section 4.3) in order to reduce the amount of computation. Still, computing EQSPPs with our method is more expensive than simply evaluating the original tensor operation. Nevertheless, the cost of computing EQSPPs is negligible in comparison to the possibly millions of times a kernel is called within a DG scheme.
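As a toy illustration of the no-cancellation reasoning behind (6), consider a matrix chain product with boolean sparsity patterns (a sketch only; YATeTo performs the equivalent computation on its AST):

    import numpy as np

    # Boolean sparsity patterns of the chain product A @ B @ C.
    SA = np.array([[1, 0], [0, 0]], dtype=bool)
    SB = np.ones((2, 2), dtype=bool)
    SC = np.array([[1, 0], [0, 0]], dtype=bool)

    # Under the no-cancellation assumption, B[j,k] influences the result
    # iff column j of A and row k of C each contain a non-zero.
    EB = SB & np.outer(SA.any(axis=0), SC.any(axis=1))
    print(EB)  # only the top-left entry remains, as in the example of Section 4.2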

4.3. Strength reduction

We already mentioned in Section 2.2 that finding the sequence of tensor operations with minimal operation count is an NP-complete problem, but an efficient enumeration procedure exists for tensor dimensions appearing in practice (Lam et al., 1997). The same enumeration procedure may also be used when sparse tensors are involved, but in this case the sparsity patterns of intermediate results are required, or at least an estimate of the sparsity (Lam et al., 1999).

In YATeTo we assume that tensors are small enough that it is feasible to explicitly compute the sparsity patterns of all intermediate products during strength reduction. The number of operations is then determined in the following way (Lam et al., 1999): For a multiplication formula the number of operations is equal to the number of non-zeros in the product. For a summation formula the number of operations is equal to the number of non-zeros in the summand minus the number of non-zeros in the result.
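The counting rule can be sketched on boolean sparsity patterns as follows (illustrative only):

    import numpy as np

    SA = np.random.rand(8, 8) < 0.3   # sparsity pattern of A[i,k]
    SB = np.random.rand(8, 8) < 0.3   # sparsity pattern of B[k,j]

    # Multiplication formula T[i,k,j] = A[i,k] * B[k,j]:
    ST = SA[:, :, None] & SB[None, :, :]
    muls = ST.sum()                   # one multiply per non-zero of T

    # Summation formula C[i,j] = sum_k T[i,k,j]:
    SC = ST.any(axis=1)
    adds = ST.sum() - SC.sum()        # k non-zeros need k-1 additions
    print(muls, adds)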

An example of an intermediate AST after strength reduction is shown in Figure 3b.

(a) Initial AST from DSL.

(b) AST after strength reduction.

(c) Final AST.
Figure 3. Overview of major stages during AST transformation of an Einsum node (left). After determining the equivalent sparsity patterns, the Einsum node is transformed during strength reduction, yielding an operation-minimal tree (middle). This tree is binary and consists of Product and IndexSum nodes only. Finally, a mapping to Loop-over-GEMM is found which minimises the cost function described in Section 4.5 (right).

4.4. Memory layouts

Following Figure 1, the next step is to compute memory layouts of tensors. The memory layout influences the possibility to fuse indices in a LoG-implementation. Thus the layout influences the cost function in Section 4.5 and needs to be fixed at this point already. In the following we give an overview of the two classes of memory layouts that are currently supported.

4.4.1. Dense layout

Let $d = \dim(A)$. We call the tuple $(n_1, \ldots, n_d)$ the shape of a tensor. A simple memory layout of a dense tensor is the “column-major” storage:

$$A[i_1, \ldots, i_d] = A_{\mathrm{mem}}\left[ \sum_{j=1}^{d} s_j i_j \right],$$

where the so-called stride is given by $s_j = \prod_{k=1}^{j-1} n_k$ with $s_1 = 1$. That is, the tensor is stored linearly in a 1D array, such that the first dimension varies fastest and the last dimension varies slowest in memory. The stride $s_j$ gives us the number of floating point numbers we have to skip when we increase index $i_j$ by one.
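In code, the stride computation for this layout might look as follows (hypothetical helpers, not YATeTo's API):

    def column_major_strides(shape):
        # Strides for column-major storage: first dimension varies fastest.
        strides = [1]
        for n in shape[:-1]:
            strides.append(strides[-1] * n)
        return strides

    def address(index, strides):
        return sum(i * s for i, s in zip(index, strides))

    assert column_major_strides((3, 4, 5)) == [1, 3, 12]
    assert address((2, 1, 0), [1, 3, 12]) == 5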

From Section 4.2 we expect that we also have tensors with large zero blocks, where it would be wasteful to store the full tensor. Hence, our next evolution is the bounding-box column-major storage:

$$A[i_1, \ldots, i_d] = A_{\mathrm{mem}}\left[ \sum_{j=1}^{d} s_j (i_j - l_j) \right],$$

where $s_j = \prod_{k=1}^{j-1} (u_k - l_k)$ and $l_j \leq i_j < u_j$. This memory layout models the case where $A$ is non-zero within the index set $[l_1, u_1) \times \cdots \times [l_d, u_d)$ and zero otherwise.

Finally, it might be beneficial to align the fastest dimension. That is, the number of bytes in a fibre in the first dimension should be divisible by the architecture's SIMD vector length. In some cases, one has to add artificial zeros to the first dimension, hence we allow that the bounding interval of the first dimension is larger than the size of the first dimension. Given an initial (minimal) bounding interval $[l_1, u_1)$ we align the bounding interval as in (Uphoff and Bader, 2016), that is,

$$[l_1, u_1) \mapsto \left[ \left\lfloor \frac{l_1}{V} \right\rfloor V,\; \left\lceil \frac{u_1}{V} \right\rceil V \right),$$

where $V$ is the number of floating point values that fit into a SIMD vector (e.g. using AVX-512 one sets $V = 8$ for double precision or $V = 16$ for single precision).
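The alignment of the bounding interval can be sketched as (hypothetical helper; V is the SIMD width in floating point values):

    def align_interval(l, u, V):
        # Enlarge [l, u) so that both ends are multiples of V.
        return (V * (l // V), V * -(-u // V))

    assert align_interval(3, 13, 4) == (0, 16)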

We conclude the presentation of dense layouts with a possible pitfall of using bounding boxes and alignment. It might be beneficial to fuse multiple indices when mapping a tensor contraction to a Loop-over-GEMM (see Section 2.3). Fusing indices $i_j$ and $i_{j+1}$ is always possible when (Shi et al., 2016)

$$s_{j+1} = s_j n_j, \qquad (9)$$

i.e. when memory is contiguous across the fused dimensions. Other cases would require more involved compatibility conditions. For example, in a GEMM where artificial zeros are added for one dimension in an input matrix and the same dimension in the result matrix, one may only fuse this dimension with the next one whenever both matrices have the same number of artificial zeros in that dimension. Otherwise one needs temporary storage and an additional data copy. In order to avoid possible data copies and complications arising in subsequent optimisation steps we only allow fused indices when (9) is fulfilled. Conversely, aligned layouts or bounding box layouts have to be considered carefully as they may prohibit fusing indices, which leads to smaller (and possibly slower) GEMMs.

4.4.2. Compressed sparse column (CSC) layout

Sparse matrices may appear in discontinuous Galerkin methods, e.g. the stiffness matrices are sparse when an orthogonal set of basis functions is used. We have limited support for the CSC format: CSC matrices may appear in GEMM or Loop-over-GEMM calls, but only in a sparse × dense or a dense × sparse GEMM.

4.4.3. Other (sparse) formats

It might be surprising that we describe equivalent sparsity patterns in great detail but currently do not offer a genuine sparse tensor memory layout. First of all, we want to point out that if an EQSPP induces a large zero block, then we are able to exploit this block using the bounding box memory layout, or we may let a LoG operate only on a subtensor. Second, to the authors' knowledge, efficient code generators or efficient routines for general small sparse tensors do not currently exist.

4.5. Optimal index permutations

Figure 4. A possible intermediate AST consisting of four Contraction nodes. We have the freedom to choose the index permutations of the temporary tensors, denoted with the Greek letters α, β, and γ in the following.

The strength reduction step converts each Einsum node to a binary tree consisting only of IndexSum and Product nodes. In a first step, we identify contractions in the tree, which are (multiple) IndexSum nodes followed by a single Product node. A sub-tree that was identified to be a contraction is replaced by a new node of type Contraction. For example, we might end up with the (sub-)tree shown in Figure 4.

Each Contraction node shall now be mapped to an implementation as Loop-over-GEMM. In general, the mapping to GEMM is not unique, but several implementations are possible (e.g. (Shi et al., 2016, Table II)). A systematic way to list all possible implementations is given in (Springer and Bientinesi, 2018, Listing 7). As accurately predicting the performance of each implementation is difficult, heuristics may be employed to select the most promising candidate, e.g. one may choose the variant with the largest GEMM (Springer and Bientinesi, 2018).

The algorithm to list all LoG-implementations in (Springer and Bientinesi, 2018) assumes that the order of all indices is known. In our case we have temporary tensors in which an order of indices is not prescribed (e.g. α, β, and γ in Figure 4). The index order has to be chosen carefully, as it decides possible mappings to GEMM, and there are even cases where non-unit stride GEMMs are required (Shi et al., 2016, Table II). Moreover, the index order of a temporary result influences possible mappings to GEMM in nodes that involve the temporary result.

4.5.1. Optimal index permutations

Our approach to select the index orders of temporary tensors is to solve the following discrete optimisation problem: Let $V$ be the set of vertices of an AST with root $r$. We denote the set of children of a vertex $v$ with $C(v)$, and the set of descendants of a vertex $v$ with $D(v)$. For each vertex $v$ we have a set of permissible index permutations $P_v$. We introduce variables $x_v \in P_v$ for all $v \in V$, i.e. variable $x_v$ contains the index permutation for vertex $v$. The Cartesian product of all permissible index permutations, $\mathcal{X} = \prod_{v \in V} P_v$, is called the configuration space. The optimisation problem is to find a configuration which minimises a cost function over all possible configurations, that is,

$$\min_{x \in \mathcal{X}} c_r(x). \qquad (10)$$

We define the cost function of a tree with root $v$ recursively in the following way:

$$c_v(x) = \Phi_v\left(x_v, (x_u)_{u \in C(v)}\right) + \sum_{u \in C(v)} c_u(x). \qquad (11)$$

The functions $\Phi_v$ can be thought of as a measure of the cost of the operation represented by a vertex in the AST. We give a detailed definition of $\Phi_v$ in Section 4.5.2.

We split the minimisation in two stages. In the first stage we minimise over the root variable and its children, in the second stage we minimise over all other variables, i.e. the variables associated with the vertices in the set $D(r) \setminus C(r)$. One can then show that

$$\min_{x \in \mathcal{X}} c_r(x) = \min_{x_r,\, (x_u)_{u \in C(r)}} \left( \Phi_r\left(x_r, (x_u)_{u \in C(r)}\right) + \sum_{u \in C(r)} m_u(x_u) \right),$$

where we introduced

$$m_v(p) = \min_{x} \; c_v(x) \quad \text{subject to} \quad x_v = p. \qquad (12)$$

The sub-problem introduced in (12) can be interpreted as finding the optimal configuration for a sub-tree, assuming that the index permutation of the root node is fixed. In fact, $\min_{x \in \mathcal{X}} c_r(x) = \min_{p \in P_r} m_r(p)$. Such problems are said to have an optimal substructure, because the optimal solution can be constructed out of optimal solutions to sub-problems (Cormen et al., 2009).

Dynamic programming is a commonly used approach for problems with optimal substructure. We consider a bottom-up dynamic programming algorithm, which works in the following way: The AST is traversed in post-order. For each vertex $v$, we solve problem (12) for every permissible index permutation $p \in P_v$. The minimum cost as well as a minimising configuration is memoized in a dictionary. If vertex $v$ is a leaf node, we simply evaluate all index permutations in $P_v$. For internal nodes, we evaluate all configurations in $P_v \times \prod_{u \in C(v)} P_u$. In order to evaluate the cost, we evaluate the function $\Phi_v$ and for the sub-problems we look up the memoized costs. (The latter are available due to the post-order traversal.) The run-time of this algorithm can be bounded by $\mathcal{O}(|V| \cdot (d!)^{k+1})$, where $|V|$ is the number of vertices in an AST, $d$ is the maximum number of indices in a vertex (and hence $d!$ is the maximum size of each permissible index permutation set), and $k$ is the maximum number of children.
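The bottom-up dynamic program can be sketched as follows (hypothetical node structure and cost callback; the YATeTo implementation differs):

    from itertools import permutations, product

    def optimal_permutations(node, cost):
        # node.indices is a string, node.children a list of nodes;
        # cost(node, perm, child_perms) measures the node's operation.
        # Returns {perm: (min_cost, chosen_child_perms)} for `node`.
        child_tables = [optimal_permutations(c, cost) for c in node.children]
        table = {}
        for perm in permutations(node.indices):
            best = (float('inf'), None)
            # Evaluate all child configurations; the sub-costs are
            # memoized in child_tables thanks to post-order traversal.
            for child_perms in product(*(t.keys() for t in child_tables)):
                total = cost(node, perm, child_perms) + sum(
                    t[p][0] for t, p in zip(child_tables, child_perms))
                if total < best[0]:
                    best = (total, child_perms)
            table[perm] = best
        return table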

4.5.2. Cost function

The missing pieces in the last section are the cost functions $\Phi_v$. Here, we choose a simple heuristic. Our basic assumption is that our ASTs will consist mostly of LoGs and that those will dominate the run-time. Hence, all other operations are assigned cost zero. Further assumptions are listed in the following:

  • Non-unit stride GEMMs are inferior to unit stride GEMMs.

  • Transposes of $A$ ($B$) in the GEMM $C = AB$ should be avoided when using column-major (row-major) layout. Transposes of $B$ ($A$) should be avoided due to missing support in code generation back-ends.

  • Large GEMMs are faster than small GEMMs, i.e. one should fuse as many indices as possible.

From these assumptions we define the cost of a LoG as the 4-tuple $(n_s, t_A, t_B, f)$, where $n_s$ is the number of required slices with non-unit stride, $t_A$ is the number of left-transposes (assuming column-major layout), $t_B$ is the number of right-transposes, and $f$ is the number of fused indices. Cost comparison is based on lexicographic comparison of the 3-tuple $(n_s, t_B, -f)$, where the lower number of left-transposes $t_A$ is deciding when two 3-tuples are equal.

The cost function is then

$$\Phi_v = \begin{cases} \infty & \text{if constraints are violated,} \\ \mathrm{MinLoG}(v) & \text{if } v \text{ is a LoG node,} \\ 0 & \text{otherwise,} \end{cases}$$

where MinLoG enumerates feasible LoG-implementations and returns a LoG-implementation with minimum cost, or infinite cost if no mapping exists. With “violated constraints” we summarise additional constraints that exist, e.g. for Add or Assign nodes.

4.5.3. Discussion

The cost function we employ is clearly limited, as it is a heuristic and makes sense for LoG nodes only. However, choosing another cost function does not change the dynamic programming scheme, as long as the cost function is structurally equivalent to Equation 11. For example, one could swap our LoG-cost for a run-time model based on micro-benchmarks, using the methods developed in (Peise and Bientinesi, 2012).

The algorithm needs to visit every node only once, hence the algorithm is feasible also for large ASTs. Problematic are tensors with many dimensions (say large $d$), as the number of permutations is $d!$ and we need to check all of them. But as stated in the introduction, we assume that our tensors fit into low-level caches, which constrains $d$. E.g. $d = 12$ is the maximum number of dimensions we may choose such that a tensor with at least two entries per dimension (double precision) still fits into a 32 KiB L1 cache. In other words, YATeTo is not designed for high-dimensional tensors.

An example of an AST after mapping to LoG can be seen in Figure 3c.

4.6. Prefetching

The library LIBXSMM, a generator for small matrix-matrix multiplications, allows software prefetching instructions to be included in the generated assembly code (Heinecke et al., 2016b). This may improve performance, particularly on the Intel Knights Landing architecture (Heinecke et al., 2016a; Uphoff et al., 2017). A possible prefetching strategy is to insert vprefetch1 instructions for a matrix B after writes to the result matrix C, where the memory offsets calculated for the matrix C are also used for the matrix B. The rationale is that B has the same or a similar shape as C.

In YATeTo the users may add a list of tensors they wish to be prefetched to a kernel. The FindPrefetchCapabilities visitor determines the number of bytes that may be prefetched for every node of the kernel. For a LoG-node this is equal to the number of bytes in the result tensor. (Other nodes have no prefetch capability, but this may be added if appropriate code generators are available.) AssignPrefetch then greedily assigns each to-be-prefetched tensor P to one of the nodes, such that the number of bytes of P matches the prefetch capability of the node.

4.7. Control flow graph

Following Figure 1, we convert the AST to a control flow graph (CFG) representation (Seidl et al., 2012). The aim here is first to obtain a standardised sequential representation for the later code generation step, and second to manage temporary buffers or avoid them altogether.

The data structure of the CFG consists of program points (vertices), which save the current state of the kernel, and actions (edges), which transform the kernel's state. An action may be either an assignment or a plus-equals operation. The left-hand side is always a tensor. The right-hand side may be either a tensor or one of the operations modelled in the AST (such as LoG or Product). The right-hand side may be augmented by multiplication with a scalar.

The individual optimisation steps use standard techniques from compiler optimisation (Seidl et al., 2012), thus we only briefly summarise them in the following:

  1. MergeScalarMultiplications: Actions of the form tmp0 = Op and tmp1 = α · tmp0 are replaced by tmp1 = α · Op.

  2. LivenessAnalysis: Determine live variables (Seidl et al., 2012).

  3. SubstituteForward: If possible, replace tmp by X after an action tmp = X.

  4. SubstituteBackward: If possible, replace the left-hand side tmp by X if there follows an action X = tmp.

  5. RemoveEmptyStatements: Remove statements such as C = C.

  6. MergeActions: Try to merge actions, e.g. merge tmp = α · Op and C += tmp to C += α · Op, when there is no intermediate action which depends on tmp.

  7. DetermineLocalInitialization: Greedily map temporary buffers to temporary variables. If buffers are used for multiple variables, set the buffer size to the maximum size required by its assigned variables.

In order to illustrate the individual steps, we present matrix multiplication as an example:

    C['ij'] <= C['ij'] + 0.5 * A['ik'] * B['kj']

    _tmp0 = LoopOverGEMM(A, B)
    _tmp1 = 0.5 * _tmp0
    _tmp2 = C
    _tmp2 += _tmp1
    C = _tmp2

(a) Initial CFG.

    _tmp1 = 0.5 * LoopOverGEMM(A, B)
    _tmp2 = C
    _tmp2 += _tmp1
    C = _tmp2

(b) After MergeScalarMultiplications.

    _tmp1 = 0.5 * LoopOverGEMM(A, B)
    C = C
    C += _tmp1
    C = C

(c) After SubstituteForward.

    _tmp1 = 0.5 * LoopOverGEMM(A, B)
    C += _tmp1

(d) After RemoveEmptyStatements.

    C += 0.5 * LoopOverGEMM(A, B)

(e) After MergeActions.
Figure 5. Illustration of CFG optimisation of a matrix multiplication example.

A simple, but correct, traversal of the corresponding AST yields the sequential CFG shown in Figure 5a. This inefficient CFG is transformed to a single call to GEMM (with α = 0.5 and β = 1), as shown in Figure 5e.

4.8. Code generation

The final step is to generate machine code. In principle, the aim is to re-use existing code generators, such as LIBXSMM (Heinecke et al., 2016b) or PSpaMM (Brei, 2018; Wauligmann and Brei, 2019), or flavours of BLAS. Besides, YATeTo generates C++ glue code, and it is able to generate generic fallback code whenever an efficient implementation is not available. The fallback code consists of nested loops and is architecture independent, but we expect its performance to be rather poor and compiler-dependent.

Operations are divided into four types: copyscaleadd, indexsum, product, and log. The first type is a combination of the BLAS routines COPY and AXPY, and may be implemented in terms of these two functions. The product and indexsum types correspond to the multiplication formulae and summation formulae presented in Section 2.1. Having a generic implementation for these two types ensures that we cover a large class of tensor operations (Lam et al., 1997). The log type implements Loop-over-GEMMs.

Our concept requires that it should be simple to include additional code generators depending on the type of operation and on the architecture. We use the factory method pattern (Gamma et al., 1995) to distinguish between code generators. For each type, the factory method may return a different code generator depending on the operation's description (e.g. dense or sparse matrices) and the architecture. For the log type, YATeTo offers a generic C++-loop-based implementation, which internally calls another factory method for GEMM. But our structure also allows calling specialised LoG generators, such as the ones developed in (Breuer et al., 2017) for contractions involving 3D tensors and matrices.

4.9. Application interface

A class is generated for every kernel. The class has an argument-free execute function, and pointers to the input and output tensors of the kernel are public members, where the name of a member variable is given by the name of the tensor. So in order to invoke a kernel one has to create a kernel object, set the pointers to the input and output tensors, and call the execute function. A kernel class also stores the following information as static members: the minimal number of flops, also sometimes called non-zero flops (strictly speaking, the term “minimal” is incorrect due to Strassen's algorithm or other bounds on the number of operations; minimal is understood as minimal w.r.t. the cost function defined in Section 4.3), and the number of flops the implementation requires, also sometimes called hardware flops. The number of non-zero flops is computed by means of strength reduction with the sparse cost function in Section 4.3, and the hardware flops are returned by the code generators. These flop counters are useful to evaluate the performance of the kernels.

In addition to the kernel classes, a unit test is generated for every kernel. The unit test compares the optimised kernel to a naive implementation of the same kernel. If the Frobenius norm of the difference of both implementations is below a tolerance, the unit test passes. (We allow a tolerance due to floating point rounding caused by the different order of additions.)

Moreover, a class is generated for every tensor, where information about the memory layout is stored. The most useful static member, which is declared constexpr, is the number of required floating point numbers. This information may be used to allocate stack or heap memory for a tensor. For convenience, each tensor class contains a function that returns a view object for a memory region. Using a view object the user may access a d-dimensional tensor with operator(), i.e. V(i1,...,id). Tensors whose entries are known at compile time (e.g. stiffness matrices) may be stored within the generated code as static array, already formatted according to the selected memory layout.

To integrate YATeTo into the build process of an application, users need to have Python 3 and NumPy installed. The generated code follows C++11 syntax and depends on a small header-only library, i.e. the generated code itself requires only a recent C++-compiler. The kernels may also be called from C or Fortran, by manually wrapping the kernel calls in a C function. We plan to generate a C and Fortran interface automatically in the future.

5. Application to an ADER-dG method

We integrated YATeTo in two codes in order to evaluate the practicability of our design, as well as to evaluate the performance and optimisation opportunities. Both codes employ a discontinuous Galerkin method with ADER time-stepping based on the discrete Cauchy-Kovalewski procedure (Käser et al., 2007). Furthermore, both codes solve a linear PDE in two or three dimensions of the form of Equation 2. The differences lie in the PDE (elastic wave equation, acoustic wave equation) and the underlying finite elements (tetrahedra, rectangles).

In the following, we do not derive the numerical schemes but only indicate changes w.r.t. literature references, in order to stay within the scope of this paper. We validated both codes with a plane wave problem (e.g. (Käser et al., 2007)), and checked that the code converges (for an appropriately chosen time-step) and that the observed convergence order matches the theoretical order of the scheme. Moreover, we tried to keep the number of indices to a minimum by mostly omitting dependencies on space and time. We use the Einstein convention for subscript indices, but not for superscript indices.

5.1. SeisSol

The earthquake simulation code SeisSol (www.github.com/SeisSol/SeisSol) solves the seismic wave equation with support for elastic (Dumbser and Käser, 2006), viscoelastic (Käser et al., 2007), and viscoplastic rheological models (Wollherr et al., 2018). The code operates on unstructured tetrahedral meshes and the quantities are approximated with Dubiner's basis functions. In essence, one may write the computational kernels as matrix chain products of small matrices, where small means that the maximum dimension size of each matrix is smaller than or equal to the number of basis functions $B = (N+1)(N+2)(N+3)/6$, where $N$ is the maximum polynomial degree of the basis functions and $N+1$ is the theoretical convergence order. SeisSol already makes use of a code generator tailored for matrix chain products, which simplifies the implementation of the various rheological models (Uphoff and Bader, 2016).

For elastic rheological models, the code achieves about 50 % of peak performance on compute clusters based on Intel's Haswell or Knights Landing architecture (Heinecke et al., 2016a; Uphoff et al., 2017). Moreover, the authors of (Breuer et al., 2017) investigate the possibility to fuse multiple simulations in a single run, using the same ADER-DG method as SeisSol and an elastic rheological model. They report a speed-up for a fourth-order scheme when fusing 8 simulations in one run (compared to 8 individual simulations). Their main optimisation ingredient is a specialised code generator for tensor contractions involving 3D tensors and sparse matrices.

5.1.1. ADER-DG

The numerical scheme of SeisSol is given in detail in (Dumbser and Käser, 2006; Käser et al., 2007; Uphoff et al., 2017), and is stated in a compact form in Equations 13–16 for an elastic rheological model. The equations already include multiple simulations, as in (Breuer et al., 2017). Multiple simulations are easily modelled using the Einstein convention, as one simply has to add another index s to the degrees of freedom.

The first step in the numerical scheme is to compute the time-derivatives of the degrees of freedom, using a Cauchy-Kovalewski procedure:

$$\frac{\partial^{i+1} Q_{skp}}{\partial t^{i+1}} = \sum_{j=1}^{3} \tilde{K}^{(j)}_{kl}\, \frac{\partial^{i} Q_{slq}}{\partial t^{i}}\, A^{\star(j)}_{qp} \qquad (13)$$

Index $i$ denotes the order of the time-derivative, index $j$ denotes the spatial dimension, and index $s$ denotes the simulation. The matrices $\tilde{K}^{(j)}$ are sparse $B \times B$ matrices, and the $A^{\star(j)}$ are sparse $q \times q$ matrices, which are element-dependent linear combinations of the Jacobians $A$, $B$, and $C$ from Equation 2. For every simulation one needs to store the degrees of freedom and the derivatives, yielding 3-dimensional tensors whose first dimension is the number of simulations.

The evolution in time is predicted with a Taylor expansion, without considering an element's boundaries. In the DG-scheme, the time-integral over a time-step with size $\Delta t$ is required, which is computed as

$$I_{skp} = \sum_{i=0}^{\mathcal{O}-1} \frac{\Delta t^{i+1}}{(i+1)!}\, \frac{\partial^{i} Q_{skp}}{\partial t^{i}} \qquad (14)$$

The time-integral is inserted into the DG-scheme, yielding the update scheme for the degrees of freedom:

$$Q_{skp} \leftarrow Q_{skp} + \sum_{j=1}^{3} K^{(j)}_{kl}\, I_{slq}\, A^{\star(j)}_{qp} - \sum_{i=1}^{4} F^{-,(i)}_{kl}\, I_{slq}\, \hat{A}^{-(i)}_{qp} \qquad (15)$$

$$Q_{skp} \leftarrow Q_{skp} - \sum_{i=1}^{4} F^{+,(i)}_{kl}\, I^{(i)}_{slq}\, \hat{A}^{+(i)}_{qp} \qquad (16)$$

The matrices $K^{(j)}$, $F^{-,(i)}$, and $F^{+,(i)}$ are sparse $B \times B$ matrices, where $I^{(i)}$ denotes the time-integrated degrees of freedom of the $i$-th face-neighbouring element. $\hat{A}^{-(i)}$ and $\hat{A}^{+(i)}$ contain the solution to the 1D Riemann problem, which is solved along the four faces of a tetrahedron, and these are dense $q \times q$ matrices.

5.1.2. Equivalent sparsity patterns

The matrices $\tilde{K}^{(j)}$ contain large zero blocks; in particular, only the first columns are non-zero, as taking a derivative reduces the polynomial degree. YATeTo automatically determines that the EQSPP of each derivative is a subtensor. Hence, the contractions with the stiffness matrices and the star matrices are mapped to GEMMs over the non-zero blocks instead of GEMMs over the full matrices. The percentage of the original flops thus shrinks with increasing order, e.g. to 50 % for a fourth-order scheme or 37.5 % for a sixth-order scheme.

The sparsity pattern of each derivative is equivalent to its EQSPP and it has a staircase structure. The effect is that higher derivatives have fewer and fewer non-zero coefficients per quantity (Breuer et al., 2014b). We may set the sparsity patterns of the derivatives accordingly in order to exploit the vanishing coefficients, or we use existing visitors of YATeTo in order to simultaneously define the derivative kernels and determine the sparsity patterns of the derivatives automatically, as shown in Figure 6.

    D = [Q]
    for i in range(1, order):
        derivativeSum = Add()
        for j in range(3):
            derivativeSum += db.kDivMT[j]['kl'] * D[-1]['slq'] * db.star[j]['qp']
        derivativeSum = DeduceIndices(D[-1]['skp'].indices).visit(derivativeSum)
        derivativeSum = EquivalentSparsityPattern().visit(derivativeSum)
        dQ = Tensor('dQ({})'.format(i), qShape, spp=derivativeSum.eqspp())
        g.add('derivative({})'.format(i), dQ['skp'] <= derivativeSum)
        D.append(dQ)

Figure 6. Exemplary code which models vanishing coefficients in the Cauchy-Kovalewski procedure from Equation 13. First a partial AST is built and stored in derivativeSum. Then the first two visitors in the transformation process shown in Figure 1 are used to obtain the sparsity pattern of derivative i. Finally, a tensor using the derived sparsity pattern is defined and a complete AST is added to the code generator.

5.1.3. Strength reduction

By solving the matrix chain multiplication order problem, the number of required flops in Equation 16 can be reduced substantially (Uphoff et al., 2017). The optimal matrix chain order is reproduced by our implementation of the strength reduction algorithm. Interestingly, for 8–32 fused simulations a different order of evaluation is optimal, which is found automatically by strength reduction.

5.1.4. Optimal index permutations

We take Equation 16 as an example, but note that the same discussion can be applied to the other kernels. From YATeTo we obtain mappings to LoG for either a single simulation or for multiple simulations, where Greek letters denote temporary tensors.

We observe that the optimal index permutation algorithm from Section 4.5 fulfils the goals of the cost function: Non-unit stride GEMMs are not present, transposes are kept to a minimum and indices are fused if possible.

In SeisSol, the implementation is in fact a bit different: The stiffness and flux matrices (as well as the star matrices and the Riemann matrices) are stored transposed, such that a transpose-free scheme is obtained. For multiple simulations we can also obtain a transpose-free scheme in the following manner: The star matrices and the Riemann matrices are stored transposed as for a single simulation, but for the other matrices we face the opposite situation: The stiffness and flux matrices are stored in normal order, and the temporary tensors of the respective kernels are stored transposed instead.

5.1.5. Findings

YATeTo is able to reproduce the major optimisations from SeisSol: Zero-blocks are exploited and the optimal matrix chain multiplication order is found. For multiple simulations, zero blocks are also exploited automatically, strength reduction revealed the optimal—and different—evaluation order, and a transpose-free scheme is found by inspection of mappings to GEMM produced by YATeTo.

5.2. LinA

The code LinA was developed for education purposes, and contains a basic implementation of the ADER-DG method. LinA solves the linearised equations of acoustics in two dimensions (LeVeque, 2002), which involve three quantities: the pressure and the particle velocities in horizontal and vertical direction. A uniform mesh with rectangular elements is used.

Our goal here is to evaluate the application of YATeTo to a DG-spectral-element-like method (cf. (Kopriva, 2009, Chapter 5.4) for an introduction to DG-SEM). As briefly introduced in Section 3, quantities are represented using a tensor product basis, i.e. $q_p \approx \theta_l \theta_m \theta_n Q_{lmnp}$. An advantage of DG-SEM is that integrals over the reference elements may be split into a tensor product of 1D integrals. E.g. the computation of the gradient in the weak form can be done with $\mathcal{O}(d(N+1)^{d+1})$ instead of $\mathcal{O}((N+1)^{2d})$ computations per element, where $d$ is the number of dimensions. On the downside, one needs to store $(N+1)^d$ degrees of freedom per quantity instead of $\binom{N+d}{d}$ per quantity, which is the minimum amount of degrees of freedom needed to represent the same space of polynomials.

We implemented the ADER-DG-SEM method in LinA using YATeTo (www.github.com/TUM-I5/LinA), and extended the code to three dimensions, which implies an additional quantity, i.e. 4 quantities in total. For the 1D basis functions we use a nodal basis with Gauss-Lobatto points and Legendre polynomials (Hesthaven and Warburton, 2008). In the following we are going to abuse notation and use the same symbols for tensors as in SeisSol, even though the tensors are of different shape and have different sparsity patterns.

5.2.1. Numerical scheme

The first step in LinA, as in SeisSol, is to predict the element-local evolution in time:

(17)
(18)

Indices denote the location of an element on the grid and the time index. The matrices are