1 Introduction
The need for rapid implementation of high performance, robust, and portable finite element methods has led to approaches based on automated code generation. This has been proven successful in the context of the FEniCS [Logg et al. (2012)] and Firedrake [Rathgeber et al. (2015)] projects. In these frameworks, the weak variational form of a problem is expressed in a high level mathematical syntax by means of the domainspecific language UFL [Alnæs et al. (2014)]. This mathematical specification is used by a domainspecific compiler, known as a form compiler, to generate lowlevel C or C++ code for the integration over a single element of the computational mesh of the variational problem’s left and right hand side operators. The code for assembly operators must be carefully optimized: as the complexity of a variational form increases, in terms of number of derivatives, premultiplying functions, or polynomial order of the chosen function spaces, the operation count increases, with the result that assembly often accounts for a significant fraction of the overall runtime.
As demonstrated by the substantial body of research on the topic, automating the generation of such high performance implementations poses several challenges. This is a result of the complexity inherent in the mathematical expressions involved in the numerical integration, which varies from problem to problem, and the particular structure of the loop nests enclosing the integrals. Generalpurpose compilers, such as those by GNU and Intel, fail to exploit the structure inherent in the expressions, thus producing suboptimal code (i.e., code which performs more floatingpoint operations, or “flops”, than necessary; we show this in Section 7). Research compilers, for instance those based on polyhedral analysis of loop nests, such as PLUTO [Bondhugula et al. (2008)], focus on parallelization and optimization for cache locality, treating issues orthogonal to the question of minimising flops. The lack of suitable thirdparty tools has led to the development of a number of domainspecific code transformation (or synthesizer) systems. quadrature1 show how automated code generation can be leveraged to introduce optimizations that a user should not be expected to write “by hand”. FFCTC and Francis employ mathematical reformulations of finite element integration with the aim of minimizing the operation count. In Luporini, the effects and the interplay of generalized code motion and a set of low level optimizations are analysed. It is also worth mentioning two new new form compilers, UFLACS [Alnæs (2016)] and TSFC [Homolya and Mitchell (2016)], which particularly target the compilation time challenges of the more complex variational forms. The performance evaluation in Section 7 includes most of these systems.
However, in spite of such a considerable research effort, there is still no answer to one fundamental question: can we automatically generate an implementation of a form which is optimal in the number of flops executed? In this paper, we formulate an approach that solves this problem for a particular class of forms and provides very good approximations in all other cases. In particular, we will define “local optimality”, which relates operation count with inner loops. In summary, our contributions are as follows:

We formalize the class of finite element integration loop nests and we build the space of legal transformations impacting their operation count.

We provide an algorithm to select points in the transformation space. The algorithm uses a cost model to: (i) understand whether a transformation reduces or increases the operation count; (ii) choose between different (noncomposable) transformations.

We demonstrate that our approach systematically leads to a local optimum. We also explain under what conditions of the input problem global optimality is achieved.

We integrate our approach with a compiler, COFFEE^{1}^{1}1COFFEE stands for COmpiler For Fast Expression Evaluation. The compiler is opensource and available at https://github.com/coneoproject/COFFEE, which is in use in the Firedrake framework.

We experimentally evaluate using a broader suite of forms, discretizations, and code generation systems than has been used in prior research. This is essential to demonstrate that our optimality model holds in practice.
In addition, in order to place COFFEE on the same level as other code generation systems from the viewpoint of low level optimization (which is essential for a fair performance comparison):

We introduce a transformation based on symbolic execution that allows irrelevant floating point operations to be skipped (for example those involving zerovalued quantities).
After reviewing basic concepts in finite element integration, in Section 3.1 we introduce a set of definitions mapping mathematical properties to the level of loop nests. This step is an essential precursor to the definition of the two algorithms – sharing elimination (Section 3.2) and preevaluation (Section 3.3) – through which we construct the space of legal transformations. The main transformation algorithm in Section 4 delivers the local optimality claim by using a cost model to coordinate the application of sharing elimination and preevaluation. We elaborate on the correctness of the methodology in Section 5. The numerical experiments are showed in Section 7. We conclude discussing the limitations of the algorithms presented and future work.
2 Preliminaries
We review finite element integration using the same notation and examples adopted in quadrature1 and Francis.
Consider the weak formulation of a linear variational problem:
(1) 
where and are, respectively, a bilinear and a linear form. The set of trial functions and the set of test functions are suitable discrete function spaces. For simplicity, we assume . Let be the set of basis functions spanning . The unknown solution can be approximated as a linear combination of the basis functions . From the solution of the following linear system it is possible to determine a set of coefficients to express :
(2) 
in which and discretize and respectively:
(3) 
The matrix and the vector are assembled and subsequently used to solve the linear system through (typically) an iterative method.
We focus on the assembly phase, which is often characterized as a twostep procedure: local and global assembly. Local assembly is the subject of this article. It consists of computing the contributions of a single element in the discretized domain to the equation’s approximated solution. During global assembly, these local contributions are coupled by suitably inserting them into and .
We illustrate local assembly in a concrete example, the evaluation of the local element matrix for a Laplacian operator. Consider the weighted Poisson equation:
(4) 
in which is unknown, while is prescribed. The bilinear form associated with the weak variational form of the equation is:
(5) 
The domain of the equation is partitioned into a set of cells (elements) such that and . By defining as the set of basis functions with support on the element (i.e. those which do not vanish on this element), we can express the local element matrix as
(6) 
The local element vector can be determined in an analogous way.
2.1 Monomials
It has been shown (for example in Kirby:TC) that local element tensors can be expressed as a sum of integrals over
, each integral being the product of derivatives of functions from sets of discrete spaces and, possibly, functions of some spatially varying coefficients. An integral of this form is called monomial.2.2 Quadrature mode
Quadrature schemes are typically used to numerically evaluate . For convenience, a reference element and an affine mapping to any element are introduced. This implies that a change of variables from reference coordinates to real coordinates is necessary any time a new element is evaluated. The basis functions are then replaced with local basis functions such that . The numerical integration of (6) over an element can then be expressed as follows:
(7) 
where is the number of integration points, the quadrature weight at the integration point , the dimension of ,
the number of degrees of freedom associated to the local basis functions, and
the determinant of the Jacobian of the aforementioned change of coordinates.2.3 Tensor contraction mode
By exploiting the linearity, associativity and distributivity of the relevant mathematical operators, we can rewrite (7) as
(8) 
A generalization of this transformation was introduced in [Kirby and Logg (2007)]. Since it only involves reference element terms, the quadrature sum can be preevaluated and reused for each element. The evaluation of the local tensor can then be abstracted as
(9) 
in which the preevaluated reference tensor, , and the celldependent geometry tensor, , are exposed.
2.4 Qualitative comparison
Depending on form and discretization, the relative performance of the two modes, in terms of the operation count, can vary quite dramatically. The presence of derivatives or coefficient functions in the input form increases the rank of the geometry tensor, making the traditional quadrature mode preferable for sufficiently complex forms. On the other hand, speedups from adopting tensor mode can be significant in a wide class of forms in which the geometry tensor remains sufficiently small. The discretization, particularly the polynomial order of trial, test, and coefficient functions, also plays a key role in the resulting operation count.
These two modes are implemented in the FEniCS Form Compiler [Kirby and Logg (2006)]
. In this compiler, a heuristic is used to choose the most suitable mode for a given form. It consists of analysing each monomial in the form, counting the number of derivatives and coefficient functions, and checking if this number is greater than a constant found empirically
[Logg et al. (2012)]. We will return to the efficacy of this approach in section 7. One of the objectives of this paper is to produce a system that goes beyond the dichotomy between quadrature and tensor modes. We will reason in terms of loop nests, code motion, and code preevaluation, searching the entire implementation space for an optimal synthesis.3 Transformation Space
In this section, we characterize global and local optimality for finite element integration as well as the space of legal transformations that needs be explored to achieve them. The method by which exploration is performed is discussed in Section 4.
3.1 Loop nests, expressions and optimality
In order to make the article selfcontained, we start with reviewing basic compiler terminology.
Definition 1 (Perfect and imperfect loop nests).
A perfect loop nest is a loop whose body either 1) comprises only a sequence of nonloop statements or 2) is itself a perfect loop nest. If this condition does not hold, a loop nest is said to be imperfect.
Definition 2 (Independent basic block).
An independent basic block is a sequence of statements such that no data dependencies exist between statements in the block.
We focus on perfect nests whose innermost loop body is an independent basic block. A straightforward property of this class is that hoisting invariant expressions from the innermost to any of the outer loops or the preheader (i.e., the block that precedes the entry point of the nest) is always safe, as long as any dependencies on loop indices are honored. We will make use of this property. The results of this section could also be generalized to larger classes of loop nests, in which basic block independence does not hold, although this would require refinements beyond the scope of this paper.
By mapping mathematical properties to the loop nest level, we introduce the concepts of a linear loop and, more generally, a (perfect) multilinear loop nest.
Definition 3 (Linear loop).
A loop defining the iteration space through the iteration variable , or simply , is linear if in its body

appears only as an array index, and

whenever an array is indexed by (), all expressions in which this appears are affine in .
Definition 4 (Multilinear loop nest).
A multilinear loop nest of arity is a perfect nest composed of loops, in which all of the expressions appearing in the body of the innermost loop are affine in each loop separately.
We will show that multilinear loop nests, which arise naturally when translating bilinear or linear forms into code, are important because they have a structure that we can take advantage of to reach a local optimum.
We define two other classes of loops.
Definition 5 (Reduction loop).
A loop is said to be a reduction loop if in its body

appears only as an array index, and

for each augmented assignment statement (e.g., an increment), arrays indexed by appear only on the right hand side of .
Definition 6 (Orderfree loop).
A loop is said to be an orderfree loop if its iterations can be executed in any arbitrary order.
Consider Equation 7 and the (abstract) loop nest implementing it illustrated in Figure 1. The imperfect nest comprises an orderfree loop (over elements in the mesh), a reduction loop (performing numerical integration), and a multilinear loop nest (over test and trial functions). In the body of , one or more statements evaluate the local tensor for the element . Expressions (the right hand side of a statement) result from the translation of a form in high level matrix notation into code. In particular, is the number of monomials (a form is a sum of monomials), () represents the product of a coefficient function (e.g., the inverse Jacobian matrix for the change of coordinates) with test or trial functions, and is a function of coefficients and geometry. We do not pose any restrictions on function spaces (e.g., scalar or vectorvalued), coefficient expressions (linear or nonlinear), differential and vector operators, so can be arbitrarily complex. We say that such an expression is in normal form, because the algebraic structure of a variational form is intact: products have not yet been expanded, distinct monomials can still be identified, and so on. This brings us to formalize the class of loop nests that we aim to optimize.
Definition 7 (Finite element integration loop nest).
A finite element integration loop nest is a loop nest in which the following appear, in order: an imperfect orderfree loop, an imperfect (perfect only in some special cases), linear or nonlinear reduction loop, and a multilinear loop nest whose body is an independent basic block in which expressions are in normal form.
We then characterize optimality for a finite element integration loop nest as follows.
Definition 8 (Optimality of a loop nest).
Let be a generic loop nest, and let be a transformation function such that is semantically equivalent to (possibly, ). We say that is an optimal synthesis of if the total number of operations (additions, products) performed to evaluate the result is minimal.
The concept of local optimality, which relies on the particular class of flopdecreasing transformations, is also introduced.
Definition 9 (Flopdecreasing transformation).
A transformation which reduces the operation count is called flopdecreasing.
Definition 10 (Local optimality of a loop nest).
Given , and as in Definition 8, we say that is a locally optimal synthesis of if:

the number of operations (additions, products) in the innermost loops performed to evaluate the result is minimal, and

is expressed as composition of flopdecreasing transformations.
The restriction to flopdecreasing transformations aims to exclude those apparent optimizations that, to achieve flopoptimal innermost loops, would rearrange the computation at the level of the outer loops causing, in fact, a global increase in operation count.
We also observe that Definitions 8 and 10 do not take into account memory requirements. If the execution of loop nest were memorybound – the ratio of operations to bytes transferred from memory to the CPU being too low – then optimizing the number of flops would be fruitless. Henceforth we assume we operate in a CPUbound regime, evaluating arithmeticintensive expressions. In the context of finite elements, this is often true for more complex multilinear forms and/or higher order elements.
Achieving optimality in polynomial time is not generally feasible, since the subexpressions can be arbitrarily unstructured. However, multilinearity results in a certain degree of regularity in and . In the following sections, we will elaborate on these observations and formulate an approach that achieves: (i) at least a local optimum in all cases; (ii) global optimality whenever the monomials are “sufficiently structured”. To this purpose, we will construct:

an algorithm to select points in the transformation space (Section 4)
3.2 Sharing elimination
We start with introducing the fundamental notion of sharing.
Definition 11 (Sharing).
A statement within a loop nest presents sharing if at least one of the following conditions hold:
 Spatial sharing

There are at least two symbolically identical subexpressions
 Temporal sharing

There is at least one nontrivial subexpression (e.g., an addition or a product) that is redundantly executed because it is independent of .
To illustrate the definition, we show in Figure 2 how sharing evolves as factorization and code motion are applied to a trivial multilinear loop nest. In the original loop nest (Figure 2(a)), spatial sharing is induced by the symbol . Factorization eliminates spatial sharing and creates temporal sharing (Figure 2(b)). Finally, generalized code motion [Luporini et al. (2015)], which hoists subexpressions that are redundantly executed by at least one loop in the nest^{2}^{2}2Traditional loopinvariant code motion, which is commonly applied by generalpurpose compilers, only checks invariance with respect to the innermost loop., leads to optimality (Figure 2(c)).



In this section, we study sharing elimination, a transformation that aims to reduce the operation count by removing sharing through the application of expansion, factorization, and generalized code motion. If the objective were reaching optimality and the expressions lacked structure, a transformation of this sort would require solving a large combinatorial problem – for instance to evaluate the impact of all possible factorizations. Our sharing elimination strategy, instead, exploits the structure inherent in finite element integration expressions to guarantee, after coordination with other transformations (an aspect which we discuss in the following sections), local optimality. Global optimality is achieved if stronger preconditions hold. Setting local optimality, rather than optimality, as primary goal is essential to produce simple and computationally efficient algorithms – two necessary conditions for integration with a compiler.
3.2.1 Identification and exploitation of structure
Finite element expressions can be seen as composition of operations between tensors. Often, the optimal implementation strategy for these operations is to be determined out of two alternatives. For instance, consider , with being the transposed inverse Jacobian matrix for the change of (twodimensional) coordinates, and a generic twodimensional vector. The tensor operation will reduce to the scalar expression , in which and represent components of that depend on . To minimize the operation count for expressions of this kind, we have two options:
Strategy 1.
Eliminating temporal sharing through generalized code motion.
Strategy 2.
Eliminating spatial sharing first – through product expansion and factorization – and temporal sharing afterwards, again through generalized code motion.
In the current example, we observe that, depending on the size of , applying Strategy 2 could reduce the operation count since the expression would be recast as and some hoistable subexpressions would be exposed. On the other hand, Strategy 1 would have no effect as only depends on a single loop, . In general, the choice between the two strategies depends on multiple factors: the loop sizes, the increase in operation count due to expansion (in Strategy 2), and the gain due to code motion. A second application of Strategy 2 was provided in Figure 2. These examples motivate the introduction of a particular class of expressions, for which the two strategies assume notable importance.
Definition 12 (Structured expression).
We say that an expression is “structured along a loop nest ” if and only if, for every symbol depending on at least one loop in , the spatial sharing of may be eliminated by factorizing all occurrences of in the expression.
Proposition 1.
An expression along a multilinear loop nest is structured.
Proof.
If were an arbitrary loop nest, a given symbol could appear everywhere (e.g., times in a summand and times in another summand with , as argument of a higher level function, in the denominator of a division), thus posing the challenge of finding the factorization that maximizes temporal sharing. If is instead a finite element integration loop nest, thanks to Proposition 1 the space of flopdecreasing transformations is constructed by “composition” of Strategy 1 and Strategy 2, as illustrated in Algorithm 1.
Finally, we observe that the subexpressions can sometimes be considered “weakly structured”. This happens when a relaxed version of Definition 12 applies, in which the factorization of only “minimizes” (rather than “eliminates”) spatial sharing (for instance, in the complex hyperelastic model analyzed in Section 7). Weak structure will be exploited by Algorithm 1 in the attempt to achieve optimality.
3.2.2 Algorithm
Algorithm 1 describes sharing elimination assuming as input a tree representation of the loop nest. It makes use of the following notation and terminology:

multilinear operand: any or in the input expression.

multilinear symbol: a symbol appearing within a multilinear operand depending on or (e.g., test functions, first order derivatives of test functions, etc.).
Examples will be provided in Section 3.2.3.
Algorithm 1 (Sharing elimination).
The input of the algorithm is a tree representation a finite element integration loop nest.

Perform a depthfirst visit of the loop tree to collect and partition multilinear operands into disjoint sets, . is such that all multilinear operands in each share the same set of multilinear symbols , whereas there is no sharing across different partitions. For all multilinear operands in such that , apply Strategy 1.
Note: as a consequence of Proposition 1, and represent the number of products in the innermost loop induced by if Strategy 1 or Strategy 2 were applied 
For each subexpression depending on exactly one linear loop, collect the multilinear symbols and the temporaries produced at step (1). Partition them into disjoint sets, , such that includes all instances of a given symbol in . Apply Strategy 2 factorizing the symbols in each , provided that this leads to a reduction in operation count; otherwise, apply Strategy 1
Note: the last check ensures the flopdecreasing nature of the transformation. In the cases in which expansion outweighs code motion, Strategy 1 is preferred.
Note: the expansion cost is a function of the products wrapping a symbol (how many of them and their arity), so it can be determined through tree visits. 
Build the sharing graph . Each represents a multilinear symbol or a temporary produced by the previous steps. An edge , indicates that a product would appear if the subexpressions including and were expanded.
Note: the following steps will only impact bilinear forms, since otherwise . 
Partition into disjoint sets, , such that includes all instances of a given symbol in the expression. Transform by merging into a unique vertex (taking the union of the edges), provided that factorizing would not cause an increase in operation count.

Map
to an Integer Linear Programming (ILP) model for determining how to optimally apply Strategy
2. The solution is the set of symbols that will be factorized by Strategy 2. Let ; the ILP model then is as follows: 
Perform a depthfirst visit of the loop tree and, for each yet unhandled or hoisted expression, apply the most profitable between Strategy 1 and Strategy 2.
Note: this pass speculatively assumes that expressions are (weakly) structured along the reduction loop. If the assumption does not hold, the operation count will generally be suboptimal because only a subset of factorizations and code motion opportunities may eventually be considered.
Although the primary goal of Algorithm 1 is operation count minimization within the multilinear loop nest, the enforcement of flopdecreasing transformations (steps (2) and (4)) and the rescheduling of subexpressions within outer loops (last step) also attempt to optimize the loop nest globally. We will further elaborate this aspect in Section 5.
3.2.3 Examples
Consider again Figure 2(a). We have , with , , and . For all , we have , although applying Strategy 1 in step (1) has no effect. The sharing graph is , and . The ILP formulation leads to the code in Figure 2(c).
In Figure 3, Algorithm 1 is executed in a very simple realistic scenario, which originates from the bilinear form of a Poisson equation in two dimensions. We observe that , with and . In addition, , so Strategy 1 is applied to both partitions (step (1)). We then have (step (3)) . Since there are no more factorization opportunities, the ILP formulation becomes irrelevant.
for (e = 0; e < E; e++)
= ...
= ...
...
for (i = 0; i < I; i++)
for (j = 0; j < J; j++)
for (k = 0; k < K; k++)
+=

for (e = 0; e < E; e++)
...
for (i = 0; i < I; i++)
for (k = 0; k < K; k++)
=
=
for (j = 0; j < J; j++)
=
=
for (j = 0; j < J; j++)
for (k = 0; k < K; k++)
+= +

For reasons of space, further examples, including the hyperelastic model evaluated in Section 7 and other nontrivial ILP instances, are made available online^{3}^{3}3Sharing elimination examples: https://gist.github.com/FabioLuporini/14e79457d6b15823c1cd.
3.3 Preevaluation of reductions
Sharing elimination uses three operators: expansion, factorization, and code motion. In this section, we discuss the role and legality of a fourth operator: reduction preevaluation. We will see that what makes this operator special is the fact that there exists a single point in the transformation space of a monomial (i.e., a specific factorization of test, trial, and coefficient functions) ensuring its correctness.
We start with an example. Consider again the loop nest and the expression in Figure 1. We pose the following question: are we able to identify subexpressions for which the reduction induced by can be preevaluated, thus obtaining a decrease in operation count proportional to the size of , ? The transformation we look for is exemplified in Figure 4 with a simple loop nest. The reader may verify that a similar transformation is applicable to the example in Figure 3(a).

for (i = 0; i < I; i++)
for (k = 0; k < K; k++)
+= ( + )
for (e = 0; e < E; e++)
for (k = 0; k < K; k++)
=

Preevaluation can be seen as the generalization of tensor contraction (Section 2.3) to a wider class of subexpressions. We know that multilinear forms can be seen as sums of monomials, each monomial being an integral over the equation domain of products (of derivatives) of functions from discrete spaces. A monomial can always be reduced to the product between a “reference” and a “geometry” tensor. In our model, a reference tensor is simply represented by one or more subexpressions independent of , exposed after particular transformations of the expression tree. This leads to the following algorithm.
Algorithm 2 (Preevaluation).
Consider a finite element integration loop nest . We dissect the normal form input expression into distinct subexpressions, each of them representing a monomial. Each subexpression is then factorized so as to split constants from dependent terms. This transformation is feasible^{4}^{4}4For reasons of space, we omit the detailed sequence of steps (e.g., expansion, factorization), which is however available at https://github.com/coneoproject/COFFEE/blob/master/coffee/optimizer.py in fabio_luporini_2016_49279., as a consequence of the results in Kirby:TC. These dependent terms are hoisted outside of and stored into temporaries. As part of this process, the reduction induced by is computed by means of symbolic execution. Finally, is removed from .
The preevaluation of a monomial introduces some critical issues:

Depending on the complexity of a monomial, a certain number, , of temporary variables is required if preevaluation is performed. Such temporary variables are actually dimensional arrays of size , with and being, respectively, the arity and the extent (iteration space size) of the multilinear loop nest (e.g., and in the case of bilinear forms). For certain values of , preevaluation may dramatically increase the working set, which may be counterproductive for actual execution time.

The transformations exposing dependent terms increase the arithmetic complexity of the expression (e.g., expansion tends to increase the operation count). This could outweigh the gain due to preevaluation.

A strategy for coordinating sharing elimination and preevaluation is needed. We observe that sharing elimination inhibits preevaluation, whereas preevaluation could expose further sharing elimination opportunities.
We expand on point (1) in the next section, while we address points (2) and (3) in Section 4.
3.4 Memory constraints
We have just observed that the code motion induced by monomial preevaluation may dramatically increase the working set size. Even more aggressive code motion strategies are theoretically conceivable. Imagine is enclosed in a time stepping loop. One could think of exposing (through some transformations) and hoisting timeinvariant subexpressions for minimizing redundant computation at each time step. The working set size would then increase by a factor , and since
, the gain in operation count would probably be outweighed, from a runtime viewpoint, by a much larger memory pressure.
Since, for certain forms and discretizations, hoisting may cause the working set to exceed the size of some level of local memory (e.g. the last level of private cache on a conventional CPU, the shared memory on a GPU), we introduce the following memory constraints.
Constraint 1.
The size of a temporary due to code motion must not be proportional to the size of .
Constraint 2.
The total amount of memory occupied by the temporaries due to code motion must not exceed a certain threshold, .
4 Selection and composition of transformations
In this section, we build a transformation algorithm that, given a memory bound, systematically reaches a local optimum for finite element integration loop nests.
4.1 Transformation algorithm
We address the two following issues:

Coordination of preevaluation and sharing elimination. Recall from Section 3.3 that preevaluation could either increase or decrease the operation count in comparison with that achieved by sharing elimination.

Optimizing over composite operations. Consider a form comprising two monomials and . Assume that preevaluation is profitable for but not for , and that and share at least one term (for example some basis functions). If preevaluation were applied to , sharing between and would be lost. We then need a mechanism to understand which transformation – preevaluation or sharing elimination – results in the highest operation count reduction when considering the whole set of monomials (i.e., the expression as a whole).
Let be a cost function that, given a monomial , returns the gain/loss achieved by preevaluation over sharing elimination. In particular, we define , where and represent the operation counts resulting from applying sharing elimination and preevaluation, respectively. Thus preevaluation is profitable for if and only if . We return to the issue of deriving and in Section 4.2. Having defined , we can now describe the transformation algorithm (Algorithm 3).
Algorithm 3 (Transformation algorithm).
The algorithm has three main phases: initialization (step 1); determination of the monomials preserving the memory constraints that should be preevaluated (steps 24); application of preevaluation and sharing elimination (step 5).

Perform a depthfirst visit of the expression tree and determine the set of monomials . Let be the subset of monomials such that . The set of monomials that will potentially be preevaluated is .
Note: there are two fundamental reasons for not preevaluating straight away: 1) the potential presence of spatial sharing between and , which impacts the search for the global optimum; 2) the risk of breaking Constraint 2. 
Build the set of all possible bipartitions of . Let be the dictionary that will store the operation counts of different alternatives.

Discard if the memory required after applying preevaluation to the monomials in exceeds (see Constraint 2); otherwise, add .
Note: is in practice very small, since even complex forms usually have only a few monomials. This pass can then be accomplished rapidly as long as the cost of calculating and is negligible. We elaborate on this aspect in Section 4.2. 
Take .

Apply preevaluation to all monomials in . Apply sharing elimination to all resulting expressions.
Note: because of the reuse of basis functions, preevaluation may produce some identical tables, which will be mapped to the same temporary variable. Sharing elimination is therefore transparently applied to all expressions, including those resulting from preevaluation.
The output of the transformation algorithm is provided in Figure 4.1, assuming as input the loop nest in Figure 1.
4.2 The cost function
We tie up the remaining loose end: the construction of the cost function .
We recall that , with and representing the operation counts after applying sharing elimination and preevaluation. Since is deployed in a working compiler, simplicity and efficiency are essential characteristics. In the following, we explain how to derive these two values.
The most trivial way of evaluating and would consist of applying the actual transformations and simply count the number of operations. This would be tolerable for , as Algorithm 1 tends to have negligible cost. However, the overhead would be unacceptable if we applied preevaluation – in particular, symbolic execution – to all bipartitions analyzed by Algorithm 3. We therefore seek an analytic way of determining .
The first step consists of estimating the
increase factor, . This number captures the increase in arithmetic complexity due to the transformations exposing preevaluation opportunities. For context, consider the example in Figure 6. One can think of this as the (simplified) loop nest originating from the integration of the action of a mass matrix. The subexpression *+*+* represents the coefficient over (tabulated) basis functions (array ). In order to apply preevaluation, the expression needs be transformed to separate from all dependent quantities (see Algorithm 2). By product expansion, we observe an increase in the number of dependent terms of a factor .
In general, however, determining is not so straightforward since redundant tabulations may result from common subexpressions. Consider the previous example. One may add one coefficient in the same function space as , repeat the expansion, and observe that multiple subexpressions (e.g., and ) will reduce to identical tables. To evaluate , we then use combinatorics. We calculate the combinations with repetitions of elements, where: (i) is the number of (derivatives of) coefficients appearing in a product; (ii) is the number of unique basis functions involved in the expansion. In the original example, we had (for , , and ) and , which confirms . In the modified example, there are two coefficients, so , which means .
If (the extent of the reduction loop), we already know that preevaluation will not be profitable. Intuitively, this means that we are introducing more operations than we are saving from preevaluating . If , we still need to find the number of terms such that . The mass matrix monomial in Figure 6 is characterized by the dot product of test and trial functions, so trivially . In the example in Figure 3, instead, we have after a suitable factorization of basis functions. In general, therefore, depends on both form and discretization. To determine this parameter, we look at the refactorized expression (as established by Algorithm 2), and simply count the terms amenable to preevaluation.
5 Formalization
We demonstrate that the orchestration of sharing elimination and preevaluation performed by the transformation algorithm guarantees local optimality (Definition 10). The proof reuses concepts and explanations provided throughout the paper, as well as the terminology introduced in Section 3.2.2.
Proposition 2.
Consider a multilinear form comprising a set of monomials , and let be the corresponding finite element integration loop nest. Let be the transformation algorithm. Let be the set of monomials that, according to , need to be preevaluated, and let . Assume that the preevaluation of different monomials does not result in identical tables. Then, is a local optimum in the sense of Definition 10 and satisfies Constraint 2.
Proof.
We first observe that the cost function predicts the exact gain/loss in monomial preevaluation, so and can actually be constructed.
Let denote the operation count for and let be the subset of innermost loops (all loops in Figure 4.1). We need to show that there is no other synthesis satisfying Constraint 2 such that . This holds if and only if

The coordination of preevaluation with sharing elimination is optimal. This boils down to prove that

preevaluating any would result in

not preevaluating any would result in


Sharing elimination leads to a (at least) local optimum.
We discuss these points separately


Let represent the set of tables resulting from applying preevaluation to a monomial . Consider two monomials and the respective sets of preevaluated tables, and . If , at least one table is assignable to the same temporary. , therefore, may not be optimal, since only distinguishes monomials in “isolation”. We neglect this scenario (see assumptions) because of its purely pathological nature and its – with high probability – negligible impact on the operation count.

Let and be two monomials sharing some generic multilinear symbols. If were carelessly preevaluated, there may be a potential gain in sharing elimination that is lost, potentially leading to a nonoptimum. This situation is prevented by construction, because exhaustively searches all possible bipartitions on order to determine an optimum which satisfies Constraint 2^{5}^{5}5Note that the problem can be seen as an instance of the wellknown Knapsack problem. Recall that since the number of monomials is in practice very small, this pass can rapidly be accomplished.


Consider Algorithm 1. Proposition 1 ensures that there are only two ways of scheduling the multilinear operands in : through generalized code motion (Strategy 1) or factorization of multilinear symbols (via Strategy 2). If applied, these two strategies would lead, respectively, to performing and multiplications at every loop iteration. Since Strategy 1 is applied if and only if and does not change the structure of the expression (it requires neither expansion nor factorization), step (1) cannot prune the optimum from the search space.
After structuring the sharing graph in such a way that only flopdecreasing transformations are possible, the ILP model is instantiated. At this point, proving optimality reduces to establishing the correctness of the model, which is relatively straightforward because of its simplicity. The model aims to minimize the operation count by selecting the most promising factorizations. The second set of constraints is to select all edges (i.e., all multiplications), exactly once. The first set of inequalities allows multiplications to be scheduled: once a vertex is selected (i.e., once a symbol is decided to be factorized), all multiplications involving can be grouped.
∎
Throughout the paper we have reiterated the claim that Algorithm 3 achieves a globally optimal flop count if stronger preconditions on the input variational form are satisfied. We state here these preconditions, in increasing order of complexity.

There is a single monomial and only a specific coefficient (e.g., the coordinates field). This is by far the simplest scenario, which requires no particular transformation at the level of the outer loops, so optimality naturally follows.

There is a single monomial, but multiple coefficients are present. Optimality is achieved if and only if all subexpressions depending on coefficients are structured (see Section 3.2.1). This avoids ambiguity in factorization, which in turn guarantees that the output of step (7) in Algorithm 1 is optimal.

There are multiple monomials, but either at most one coefficient (e.g., the coordinates field) or multiple coefficients not inducing sharing across different monomials are present. This reduces, respectively, to cases (1) and (2) above.

There are multiple monomials, and coefficients are shared across monomials. Optimality is reached if and only if the coefficientdependent subexpressions produced by Algorithm 1 – that is, the byproduct of factorizing test/trial functions from distinct monomials – preserve structure.
6 Code Generation
Sharing elimination and preevaluation, as well as the transformation algorithm, have been implemented in COFFEE, the compiler for finite element integration routines adopted in Firedrake. In this section, we briefly discuss the aspects of the compiler that are relevant for this article.
6.1 Expressing transformations through the COFFEE language
COFFEE implements sharing elimination and preevaluation by composing building block transformation operators, which we refer to as rewrite operators. This has several advantages. The first is extensibility. New transformations, such as sum factorization in spectral methods, could be expressed by composing the existing operators, or with small effort building on what is already available. Second, generality: COFFEE can be seen as a lightweight, low level computer algebra system, not necessarily tied to finite element integration. Third, robustness: the same operators are exploited, and therefore tested, by different optimization pipelines. The rewrite operators, whose (Python) implementation is based on manipulation of abstract syntax trees (ASTs), comprise the COFFEE language. A nonexhaustive list of such operators includes expansion, factorization, reassociation, generalized code motion.
6.2 Independence from form compilers
COFFEE aims to be independent of the high level form compiler. It provides an interface to build generic ASTs and only expects expressions to be in normal form (or sufficiently close to it). For example, Firedrake has transitioned from a version of the FEniCS Form Compiler [Kirby and Logg (2006)] modified to produce ASTs rather than strings, to a newly written compiler^{6}^{6}6TSFC, the twostage form compiler https://github.com/firedrakeproject/tsfc, while continuing to emply COFFEE. Thus, COFFEE decouples the mathematical manipulation of a form from code optimization; or, in other words, relieves form compiler developers of the task of fine scale loop optimization of generated code.
6.3 Handling blocksparse tables
For several reasons, basis function tables may be blocksparse (e.g., containing zerovalued columns). For example, the FEniCS Form Compiler implements vectorvalued functions by adding blocks of zerovalued columns to the corresponding tabulations; this extremely simplifies code generation (particularly, the construction of loop nests), but also affects the performance of the generated code due to the execution of “useless” flops (e.g., operations like a + 0). In quadrature1, a technique to avoid iteration over zerovalued columns based on the use of indirection arrays (e.g. A[B[i]], in which A is a tabulated basis function and B a map from loop iterations to nonzero columns in A) was proposed. This technique, however, produces noncontiguous memory loads and stores, which nullify the potential benefits of vectorization. COFFEE, instead, handles blocksparse basis function tables by restructuring loops in such a manner that low level optimization (especially vectorization) is only marginally affected. This is based on symbolic execution of the code, which enables a series of checks on array indices and loop bounds which determine the zerovalued blocks which can be skipped without affecting data alignment.
7 Performance Evaluation
7.1 Experimental setup
Experiments were run on a single core of an Intel I72600 (Sandy Bridge) CPU, running at 3.4GHz, 32KB L1 cache (private), 256KB L2 cache (private) and 8MB L3 cache (shared). The Intel Turbo Boost and Intel Speed Step technologies were disabled. The Intel icc 15.2 compiler was used. The compilation flags used were O3, xHost. The compilation flag xHost tells the Intel compiler to generate efficient code for the underlying platform.
The Zenodo system was used to archive all packages used to perform the experiments: Firedrake [Mitchell et al. (2016)], PETSc [Smith et al. (2016)], petsc4py [Firedrake (2016)], FIAT [Rognes et al. (2016)], UFL [Alnæs et al. (2016)], FFC [Logg et al. (2016)], PyOP2 [Rathgeber et al. (2016b)] and COFFEE [Luporini et al. (2016)]. The experiments can be reproduced using a publicly available benchmark suite [Rathgeber et al. (2016a)].
We analyze the execution time of four realworld bilinear forms of increasing complexity, which comprise the differential operators that are most common in finite element methods. In particular, we study the mass matrix (“Mass”) and the bilinear forms arising in a Helmholtz equation (“Helmholtz”), in an elastic model (“Elasticity”), and in a hyperelastic model (“Hyperelasticity”). The complete specification of these forms is made publicly available^{7}^{7}7https://github.com/firedrakeproject/firedrakebench/blob/experiments/forms/firedrake_forms.py.
We evaluate the speedups achieved by a wide variety of transformation systems over the “original” code produced by the FEniCS Form Compiler (i.e., no optimizations applied). We analyze the following transformation systems:
 quad

Optimized quadrature mode. Work presented in quadrature1, implemented in in the FEniCS Form Compiler.
 tens

Tensor contraction mode. Work presented in FFCTC, implemented in the FEniCS Form Compiler.
 auto

Automatic choice between tens and quad driven by heuristic (detailed in Fenics and summarized in Section 2.4). Implemented in the FEniCS Form Compiler.
 ufls

UFLACS, a novel backend for the FEniCS Form Compiler whose primary goals are improved code generation and execution times.
 cfO1

Generalized loopinvariant code motion. Work presented in Luporini, implemented in COFFEE.
 cfO2

Optimal loop nest synthesis with handling of blocksparse tables. Work presented in this article, implemented in COFFEE.
The values that we report are the average of three runs with “warm cache”; that is, with all kernels retrieved directly from the Firedrake’s cache, so code generation and compilation times are not counted. The timing includes however the cost of both local assembly and matrix insertion, with the latter minimized through the choice of a mesh (details below) small enough to fit the L3 cache of the CPU.
For a fair comparison, small patches were written to the make quad, tens, and ufls compatible with Firedrake. By executing all simulations in Firedrake, we guarantee that both matrix insertion and mesh iteration have a fixed cost, independent of the transformation system employed. The patches adjust the data storage layout to what Firedrake expects (e.g., by generating an array of pointers instead of a pointer to pointers, by replacing flattened arrays with bidimensional ones).
For Constraint 2, discussed in Section 3.4, we set ; that is, the size of the processor L2 cache (the last level of private cache). When the threshold had an impact on the transformation process, the experiments were repeated with . The results are documented later, individually for each problem.
Following the methodology adopted in quadrature1, we vary the following parameters:

the polynomial degree of test, trial, and coefficient (or “premultiplying”) functions,

the number of coefficient functions
While constants of our study are

the space of test, trial, and coefficient functions: Lagrange

the mesh: tetrahedral with a total of 4374 elements

exact numerical quadrature (we employ the same scheme used in quadrature1, based on the GaussLegendreJacobi rule)
7.2 Performance results
We report the results of our experiments in Figures 7, 8, 9, and 10 as threedimensional plots. The axes represent , , and code transformation system. We show one subplot for each problem instance , with the code transformation system varying within each subplot. The best variant for each problem instance is given by the tallest bar, which indicates the maximum speedup over nontransformed code. We note that if a bar or a subplot are missing, then the form compiler failed to generate code because it either exceeded the system memory limit or was otherwise unable to handle the form.
The rest of the section is organized as follows: we first provide insights into the general outcome of the experimentation; we then comment on the impact of a fundamental lowlevel optimization, namely autovectorization; finally, we motivate, for each form, the performance results obtained.
High level view
Our transformation strategy does not always guarantee minimum execution time. In particular, about 5 of the test cases (3 out of 56, without counting marginal differences) show that cfO2 was not optimal in terms of runtime. The most significant of such test cases is the elastic model with . There are two reasons for this. First, low level optimization can have a significant impact on the actual performance. For example, the aggressive loop unrolling in tens eliminates operations on zeros and reduces the working set size by not storing entire temporaries; on the other hand, preserving the loop structure can maximize the chances of autovectorization. Second, the transformation strategy adopted when is exceeded plays a key role, as we will later elaborate.
Autovectorization
We chose the mesh dimension and the function spaces such that the inner loop sizes would always be a multiple of the machine vector length. This ensured autovectorization in the majority of code variants^{8}^{8}8We verified the vectorization of inner loops by looking at both compiler reports and assembly code.. The biggest exception is quad, due to the presence of indirection arrays in the generated code. In tens, loop nests are fully unrolled, so the standard loop vectorization is not feasible; the compiler reports suggest, however, that block vectorization [Larsen and Amarasinghe (2000)] is often triggered. In ufls, cfO1, and cfO2 the iteration spaces have identical structure, with loop vectorization being regularly applied.
Mass matrix
We start with the simplest of the bilinear forms investigated, the mass matrix. Results are in Figure 7. We first notice that the lack of improvements when is due to the fact that matrix insertion outweighs local assembly. For , cfO2 generally shows the highest speedups. It is worth noting why auto does not always select the fastest implementation: auto always opts for tens, while for quad tends to be preferable. On the other hand, cfO2 always makes the optimal decision about whether to apply preevaluation or not. Surprisingly, despite the simplicity of the form, the performance of the various code generation systems can differ significantly.
Helmholtz
As in the case of Mass matrix, when the matrix insertion phase is dominant. For , the general trend is that cfO2 outperforms the competitors. In particular:

preevaluation makes cfO2 notably faster than cfO1, especially for high values of ; auto correctly selects tens, which is comparable to cfO2.

auto picks tens; the choice is however suboptimal when and . This can indirectly be inferred from the large gap between cfO2 and tens/auto: cfO2 applies sharing elimination, but it correctly avoids preevaluation because of the excessive expansion cost.
 and

auto reverts to quad, which would theoretically be the right choice (the flop count is much lower than in tens); however, the generated code suffers from the presence of indirection arrays, which break autovectorization and “traditional” code motion.
The slowdowns (or marginal improvements) seen in a small number of cases exhibited by ufls can be attributed to the presence of sharing in the generated code.
An interesting experiment we additionally performed was relaxing the memory threshold by setting . We found that this makes cfO2 generally slower for , with a maximum slowdown of 2.16 with . This effect could be worse when running in parallel, since the L3 cache is shared and different threads would end up competing for the same resource.
Elasticity
The results for the elastic model are displayed in Figure 9. The main observation is that cfO2 never triggers preevaluation, although in some occasions it should. To clarify this, consider the test case , in which tens/auto show a considerable speedup over cfO2. cfO2 finds preevaluation profitable in terms of operation count, although it is eventually not applied to avoid exceeding . However, running the same experiments with resulted in a dramatic improvement, even higher than that obtained by tens. The reason is that, despite exceeding by roughly 40, the saving in operation count is so large (5 in this specific problem) that preevaluation would in practice be the winning choice. This suggests that our objective function should be improved to handle the cases in which there is a significant gap between potential cache misses and reduction in operation count.
We also note that:

the differences between cfO2 and cfO1 are due to the perfect sharing elimination and the zerovalued blocks avoidance technique presented in Section 6.3.

when , auto prefers tens over quad, which leads to suboptimal operation counts and execution times.

ufls often results in better execution times than quad and tens. This is due to multiple factors, including avoidance of indirection arrays, preservation of loop structure, and a more effective code motion strategy.
Hyperelasticity
In the experiments on the hyperelastic model, shown in Figure 10, cfO2 exhibits the largest gains out of all problem instances considered in this paper. This is a positive result, since it indicates that our transformation algorithm scales well with form complexity. The fact that all code transformation systems (apart from tens) show quite significant speedups suggests two points. First, the baseline is highly inefficient. With forms as complex as in the hyperelastic model, a trivial translation of integration routines into code should always be avoided as even the best generalpurpose compiler available (the Intel compiler on an Intel platform at maximum optimization level) fails to exploit the structure inherent in the expressions. Second, the strategy for removing spatial and temporal sharing has a tremendous impact. Sharing elimination as performed by cfO2 ensures a critical reduction in operation count, which becomes particularly pronounced for higher values of .
8 Conclusions
We have developed a theory for the optimization of finite element integration loop nests. The article details the domain properties which are exploited by our approach (e.g., linearity) and how these translate to transformations at the level of loop nests. All of the algorithms shown in this paper have been implemented in COFFEE, a compiler publicly available fully integrated with the Firedrake framework. The correctness of the transformation algorithm was discussed. The performance results achieved suggest the effectiveness of our methodology.
9 Limitations and future work
We have defined sharing elimination and preevaluation as high level transformations on top of a specific set of rewrite operators, such as code motion and factorization, and we have used them to construct the transformation space. There are three main limitations in this process. First, we do not have a systematic strategy to optimize subexpressions which are independent of linear loops. Although we have a mechanism to determine how much computation should be hoisted to the level of the integration (reduction) loop, it is not clear how to effectively improve the heuristics used at step (6) in Algorithm 1. Second, lower operation counts may be found by exploiting domainspecific properties, such as redundancies in basis functions; this aspect is completely neglected in this article. Third, with Constraint 1 we have limited the applicability of code motion. This constraint was essential given the complexity of the problem tackled.
Another issue raised by the experimentation concerns selecting a proper threshold for Constraint 2. To solve this problem would require a more sophisticated cost model, which is an interesting question deserving further research.
We also identify two additional possible research directions: a complete classification of forms for which a global optimum is achieved; and a generalization of the methodology to other classes of loop nests, for instance those arising in spectral element methods.
References
 [1]
 Alnæs (2016) Martin Sandve Alnæs. 2016. UFLACS  UFL Analyser and Compiler System. https://bitbucket.org/fenicsproject/uflacs. (2016).

Alnæs et al. (2014)
M. S. Alnæs, A. Logg, K. B. Ølgaard, M. E. Rognes, and G. N.
Wells. 2014.
Unified Form Language: A domainspecific language for weak formulations of partial differential equations.
ACM Trans Math Software 40, 2, Article 9 (2014), 9:1–9:37 pages. DOI:http://dx.doi.org/10.1145/2566630  Alnæs et al. (2016) Martin Sandve Alnæs, Anders Logg, Garth Wells, Lawrence Mitchell, Marie E. Rognes, Miklós Homolya, Aslak Bergersen, David A. Ham, Johannes Ring, chrisrichardson, and KentAndre Mardal. 2016. ufl: The Unified Form Language. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49282
 Bondhugula et al. (2008) Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’08). ACM, New York, NY, USA, 101–113. DOI:http://dx.doi.org/10.1145/1375581.1375595
 Firedrake (2016) Firedrake. 2016. petsc4py: The Python interface to PETSc. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49283
 Homolya and Mitchell (2016) Miklós Homolya and Lawrence Mitchell. 2016. TSFC  Twostage form compiler. https://github.com/firedrakeproject/tsfc. (2016).
 Kirby and Logg (2006) Robert C. Kirby and Anders Logg. 2006. A Compiler for Variational Forms. ACM Trans. Math. Softw. 32, 3 (Sept. 2006), 417–444. DOI:http://dx.doi.org/10.1145/1163641.1163644
 Kirby and Logg (2007) Robert C. Kirby and Anders Logg. 2007. Efficient Compilation of a Class of Variational Forms. ACM Trans. Math. Softw. 33, 3, Article 17 (Aug. 2007). DOI:http://dx.doi.org/10.1145/1268769.1268771
 Larsen and Amarasinghe (2000) Samuel Larsen and Saman Amarasinghe. 2000. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation (PLDI ’00). ACM, New York, NY, USA, 145–156. DOI:http://dx.doi.org/10.1145/349299.349320
 Logg et al. (2016) Anders Logg, Martin Sandve Alnæs, Marie E. Rognes, Garth Wells, Johannes Ring, Lawrence Mitchell, Johan Hake, Miklós Homolya, Florian Rathgeber, Fabio Luporini, Graham Markall, Aslak Bergersen, Lizao Li, David A. Ham, KentAndre Mardal, Jan Blechta, GheorgheTeodor Bercea, Tuomas Airaksinen, Nico Schlömer, Hans Petter Langtangen, Colin J Cotter, Ola Skavhaug, Thomas Hisch, mliertzer, Joachim B Haga, and Andrew T. T. McRae. 2016. ffc: The FEniCS Form Compiler. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49276
 Logg et al. (2012) Anders Logg, KentAndre Mardal, Garth N. Wells, and others. 2012. Automated Solution of Differential Equations by the Finite Element Method. Springer. DOI:http://dx.doi.org/10.1007/9783642230998
 Luporini et al. (2016) Fabio Luporini, Lawrence Mitchell, Miklós Homolya, Florian Rathgeber, David A. Ham, Michael Lange, Graham Markall, and Francis Russell. 2016. COFFEE: A Compiler for Fast Expression Evaluation. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49279
 Luporini et al. (2015) Fabio Luporini, Ana Lucia Varbanescu, Florian Rathgeber, GheorgheTeodor Bercea, J. Ramanujam, David A. Ham, and Paul H. J. Kelly. 2015. CrossLoop Optimization of Arithmetic Intensity for Finite Element Local Assembly. ACM Trans. Archit. Code Optim. 11, 4, Article 57 (Jan. 2015), 25 pages. DOI:http://dx.doi.org/10.1145/2687415
 Mitchell et al. (2016) Lawrence Mitchell, David A. Ham, Florian Rathgeber, Miklós Homolya, Andrew T. T. McRae, GheorgheTeodor Bercea, Michael Lange, Colin J Cotter, Christian T. Jacobs, Fabio Luporini, Simon Wolfgang Funke, Henrik Büsing, Tuomas Kärnä, Anna Kalogirou, Hannah Rittich, Eike Hermann Mueller, Stephan Kramer, Graham Markall, Patrick E. Farrell, Geordie McBain, and Asbjørn Nilsen Riseth. 2016. firedrake: an automated finite element system. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49284
 Ølgaard and Wells (2010) Kristian B. Ølgaard and Garth N. Wells. 2010. Optimizations for quadrature representations of finite element tensors through automated code generation. ACM Trans. Math. Softw. 37, 1, Article 8 (Jan. 2010), 23 pages. DOI:http://dx.doi.org/10.1145/1644001.1644009
 Rathgeber et al. (2015) Florian Rathgeber, David A. Ham, Lawrence Mitchell, Michael Lange, Fabio Luporini, Andrew T. T. McRae, GheorgheTeodor Bercea, Graham R. Markall, and Paul H. J. Kelly. 2015. Firedrake: automating the finite element method by composing abstractions. CoRR abs/1501.01809 (2015). http://arxiv.org/abs/1501.01809
 Rathgeber et al. (2016a) Florian Rathgeber, Fabio Luporini, and Lawrence Mitchell. 2016a. firedrakebench: firedrake bench optimality paper release. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49290
 Rathgeber et al. (2016b) Florian Rathgeber, Lawrence Mitchell, Fabio Luporini, Graham Markall, David A. Ham, GheorgheTeodor Bercea, Miklós Homolya, Andrew T. T. McRae, Hector Dearman, Christian T. Jacobs, gbts, Simon Wolfgang Funke, Kaho Sato, and Francis Russell. 2016b. PyOP2: Framework for performanceportable parallel computations on unstructured meshes. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49281
 Rognes et al. (2016) Marie E. Rognes, Anders Logg, David A. Ham, Miklós Homolya, Nico Schlömer, Jan Blechta, Andrew T. T. McRae, Aslak Bergersen, Colin J Cotter, Johannes Ring, and Lawrence Mitchell. 2016. fiat: The Finite Element Automated Tabulator. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49280
 Russell and Kelly (2013) Francis P. Russell and Paul H. J. Kelly. 2013. Optimized Code Generation for Finite Element Local Assembly Using Symbolic Manipulation. ACM Trans. Math. Software 39, 4 (2013).
 Smith et al. (2016) Barry Smith, Satish Balay, Matthew Knepley, Jed Brown, Lois Curfman McInnes, Hong Zhang, Peter Brune, sarich, stefanozampini, Dmitry Karpeyev, Lisandro Dalcin, tisaac, markadams, Victor Minden, VictorEijkhout, vijaysm, Karl Rupp, Fande Kong, and SurtaiHan. 2016. petsc: Portable, Extensible Toolkit for Scientific Computation. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49285
5 Formalization
We demonstrate that the orchestration of sharing elimination and preevaluation performed by the transformation algorithm guarantees local optimality (Definition 10). The proof reuses concepts and explanations provided throughout the paper, as well as the terminology introduced in Section 3.2.2.
Proposition 2.
Consider a multilinear form comprising a set of monomials , and let be the corresponding finite element integration loop nest. Let be the transformation algorithm. Let be the set of monomials that, according to , need to be preevaluated, and let . Assume that the preevaluation of different monomials does not result in identical tables. Then, is a local optimum in the sense of Definition 10 and satisfies Constraint 2.
Proof.
We first observe that the cost function predicts the exact gain/loss in monomial preevaluation, so and can actually be constructed.
Let denote the operation count for and let be the subset of innermost loops (all loops in Figure 4.1). We need to show that there is no other synthesis satisfying Constraint 2 such that . This holds if and only if

The coordination of preevaluation with sharing elimination is optimal. This boils down to prove that

preevaluating any would result in

not preevaluating any would result in


Sharing elimination leads to a (at least) local optimum.
We discuss these points separately


Let represent the set of tables resulting from applying preevaluation to a monomial . Consider two monomials and the respective sets of preevaluated tables, and . If , at least one table is assignable to the same temporary. , therefore, may not be optimal, since only distinguishes monomials in “isolation”. We neglect this scenario (see assumptions) because of its purely pathological nature and its – with high probability – negligible impact on the operation count.

Let and be two monomials sharing some generic multilinear symbols. If were carelessly preevaluated, there may be a potential gain in sharing elimination that is lost, potentially leading to a nonoptimum. This situation is prevented by construction, because exhaustively searches all possible bipartitions on order to determine an optimum which satisfies Constraint 2^{5}^{5}5Note that the problem can be seen as an instance of the wellknown Knapsack problem. Recall that since the number of monomials is in practice very small, this pass can rapidly be accomplished.


Consider Algorithm 1. Proposition 1 ensures that there are only two ways of scheduling the multilinear operands in : through generalized code motion (Strategy 1) or factorization of multilinear symbols (via Strategy 2). If applied, these two strategies would lead, respectively, to performing and multiplications at every loop iteration. Since Strategy 1 is applied if and only if and does not change the structure of the expression (it requires neither expansion nor factorization), step (1) cannot prune the optimum from the search space.
After structuring the sharing graph in such a way that only flopdecreasing transformations are possible, the ILP model is instantiated. At this point, proving optimality reduces to establishing the correctness of the model, which is relatively straightforward because of its simplicity. The model aims to minimize the operation count by selecting the most promising factorizations. The second set of constraints is to select all edges (i.e., all multiplications), exactly once. The first set of inequalities allows multiplications to be scheduled: once a vertex is selected (i.e., once a symbol is decided to be factorized), all multiplications involving can be grouped.
∎
Throughout the paper we have reiterated the claim that Algorithm 3 achieves a globally optimal flop count if stronger preconditions on the input variational form are satisfied. We state here these preconditions, in increasing order of complexity.

There is a single monomial and only a specific coefficient (e.g., the coordinates field). This is by far the simplest scenario, which requires no particular transformation at the level of the outer loops, so optimality naturally follows.

There is a single monomial, but multiple coefficients are present. Optimality is achieved if and only if all subexpressions depending on coefficients are structured (see Section 3.2.1). This avoids ambiguity in factorization, which in turn guarantees that the output of step (7) in Algorithm 1 is optimal.

There are multiple monomials, but either at most one coefficient (e.g., the coordinates field) or multiple coefficients not inducing sharing across different monomials are present. This reduces, respectively, to cases (1) and (2) above.

There are multiple monomials, and coefficients are shared across monomials. Optimality is reached if and only if the coefficientdependent subexpressions produced by Algorithm 1 – that is, the byproduct of factorizing test/trial functions from distinct monomials – preserve structure.
6 Code Generation
Sharing elimination and preevaluation, as well as the transformation algorithm, have been implemented in COFFEE, the compiler for finite element integration routines adopted in Firedrake. In this section, we briefly discuss the aspects of the compiler that are relevant for this article.
6.1 Expressing transformations through the COFFEE language
COFFEE implements sharing elimination and preevaluation by composing building block transformation operators, which we refer to as rewrite operators. This has several advantages. The first is extensibility. New transformations, such as sum factorization in spectral methods, could be expressed by composing the existing operators, or with small effort building on what is already available. Second, generality: COFFEE can be seen as a lightweight, low level computer algebra system, not necessarily tied to finite element integration. Third, robustness: the same operators are exploited, and therefore tested, by different optimization pipelines. The rewrite operators, whose (Python) implementation is based on manipulation of abstract syntax trees (ASTs), comprise the COFFEE language. A nonexhaustive list of such operators includes expansion, factorization, reassociation, generalized code motion.
6.2 Independence from form compilers
COFFEE aims to be independent of the high level form compiler. It provides an interface to build generic ASTs and only expects expressions to be in normal form (or sufficiently close to it). For example, Firedrake has transitioned from a version of the FEniCS Form Compiler [Kirby and Logg (2006)] modified to produce ASTs rather than strings, to a newly written compiler^{6}^{6}6TSFC, the twostage form compiler https://github.com/firedrakeproject/tsfc, while continuing to emply COFFEE. Thus, COFFEE decouples the mathematical manipulation of a form from code optimization; or, in other words, relieves form compiler developers of the task of fine scale loop optimization of generated code.
6.3 Handling blocksparse tables
For several reasons, basis function tables may be blocksparse (e.g., containing zerovalued columns). For example, the FEniCS Form Compiler implements vectorvalued functions by adding blocks of zerovalued columns to the corresponding tabulations; this extremely simplifies code generation (particularly, the construction of loop nests), but also affects the performance of the generated code due to the execution of “useless” flops (e.g., operations like a + 0). In quadrature1, a technique to avoid iteration over zerovalued columns based on the use of indirection arrays (e.g. A[B[i]], in which A is a tabulated basis function and B a map from loop iterations to nonzero columns in A) was proposed. This technique, however, produces noncontiguous memory loads and stores, which nullify the potential benefits of vectorization. COFFEE, instead, handles blocksparse basis function tables by restructuring loops in such a manner that low level optimization (especially vectorization) is only marginally affected. This is based on symbolic execution of the code, which enables a series of checks on array indices and loop bounds which determine the zerovalued blocks which can be skipped without affecting data alignment.
7 Performance Evaluation
7.1 Experimental setup
Experiments were run on a single core of an Intel I72600 (Sandy Bridge) CPU, running at 3.4GHz, 32KB L1 cache (private), 256KB L2 cache (private) and 8MB L3 cache (shared). The Intel Turbo Boost and Intel Speed Step technologies were disabled. The Intel icc 15.2 compiler was used. The compilation flags used were O3, xHost. The compilation flag xHost tells the Intel compiler to generate efficient code for the underlying platform.
The Zenodo system was used to archive all packages used to perform the experiments: Firedrake [Mitchell et al. (2016)], PETSc [Smith et al. (2016)], petsc4py [Firedrake (2016)], FIAT [Rognes et al. (2016)], UFL [Alnæs et al. (2016)], FFC [Logg et al. (2016)], PyOP2 [Rathgeber et al. (2016b)] and COFFEE [Luporini et al. (2016)]. The experiments can be reproduced using a publicly available benchmark suite [Rathgeber et al. (2016a)].
We analyze the execution time of four realworld bilinear forms of increasing complexity, which comprise the differential operators that are most common in finite element methods. In particular, we study the mass matrix (“Mass”) and the bilinear forms arising in a Helmholtz equation (“Helmholtz”), in an elastic model (“Elasticity”), and in a hyperelastic model (“Hyperelasticity”). The complete specification of these forms is made publicly available^{7}^{7}7https://github.com/firedrakeproject/firedrakebench/blob/experiments/forms/firedrake_forms.py.
We evaluate the speedups achieved by a wide variety of transformation systems over the “original” code produced by the FEniCS Form Compiler (i.e., no optimizations applied). We analyze the following transformation systems:
 quad

Optimized quadrature mode. Work presented in quadrature1, implemented in in the FEniCS Form Compiler.
 tens

Tensor contraction mode. Work presented in FFCTC, implemented in the FEniCS Form Compiler.
 auto

Automatic choice between tens and quad driven by heuristic (detailed in Fenics and summarized in Section 2.4). Implemented in the FEniCS Form Compiler.
 ufls

UFLACS, a novel backend for the FEniCS Form Compiler whose primary goals are improved code generation and execution times.
 cfO1

Generalized loopinvariant code motion. Work presented in Luporini, implemented in COFFEE.
 cfO2

Optimal loop nest synthesis with handling of blocksparse tables. Work presented in this article, implemented in COFFEE.
The values that we report are the average of three runs with “warm cache”; that is, with all kernels retrieved directly from the Firedrake’s cache, so code generation and compilation times are not counted. The timing includes however the cost of both local assembly and matrix insertion, with the latter minimized through the choice of a mesh (details below) small enough to fit the L3 cache of the CPU.
For a fair comparison, small patches were written to the make quad, tens, and ufls compatible with Firedrake. By executing all simulations in Firedrake, we guarantee that both matrix insertion and mesh iteration have a fixed cost, independent of the transformation system employed. The patches adjust the data storage layout to what Firedrake expects (e.g., by generating an array of pointers instead of a pointer to pointers, by replacing flattened arrays with bidimensional ones).
For Constraint 2, discussed in Section 3.4, we set ; that is, the size of the processor L2 cache (the last level of private cache). When the threshold had an impact on the transformation process, the experiments were repeated with . The results are documented later, individually for each problem.
Following the methodology adopted in quadrature1, we vary the following parameters:

the polynomial degree of test, trial, and coefficient (or “premultiplying”) functions,

the number of coefficient functions
While constants of our study are

the space of test, trial, and coefficient functions: Lagrange

the mesh: tetrahedral with a total of 4374 elements

exact numerical quadrature (we employ the same scheme used in quadrature1, based on the GaussLegendreJacobi rule)
7.2 Performance results
We report the results of our experiments in Figures 7, 8, 9, and 10 as threedimensional plots. The axes represent , , and code transformation system. We show one subplot for each problem instance , with the code transformation system varying within each subplot. The best variant for each problem instance is given by the tallest bar, which indicates the maximum speedup over nontransformed code. We note that if a bar or a subplot are missing, then the form compiler failed to generate code because it either exceeded the system memory limit or was otherwise unable to handle the form.
The rest of the section is organized as follows: we first provide insights into the general outcome of the experimentation; we then comment on the impact of a fundamental lowlevel optimization, namely autovectorization; finally, we motivate, for each form, the performance results obtained.
High level view
Our transformation strategy does not always guarantee minimum execution time. In particular, about 5 of the test cases (3 out of 56, without counting marginal differences) show that cfO2 was not optimal in terms of runtime. The most significant of such test cases is the elastic model with . There are two reasons for this. First, low level optimization can have a significant impact on the actual performance. For example, the aggressive loop unrolling in tens eliminates operations on zeros and reduces the working set size by not storing entire temporaries; on the other hand, preserving the loop structure can maximize the chances of autovectorization. Second, the transformation strategy adopted when is exceeded plays a key role, as we will later elaborate.
Autovectorization
We chose the mesh dimension and the function spaces such that the inner loop sizes would always be a multiple of the machine vector length. This ensured autovectorization in the majority of code variants^{8}^{8}8We verified the vectorization of inner loops by looking at both compiler reports and assembly code.. The biggest exception is quad, due to the presence of indirection arrays in the generated code. In tens, loop nests are fully unrolled, so the standard loop vectorization is not feasible; the compiler reports suggest, however, that block vectorization [Larsen and Amarasinghe (2000)] is often triggered. In ufls, cfO1, and cfO2 the iteration spaces have identical structure, with loop vectorization being regularly applied.
Mass matrix
We start with the simplest of the bilinear forms investigated, the mass matrix. Results are in Figure 7. We first notice that the lack of improvements when is due to the fact that matrix insertion outweighs local assembly. For , cfO2 generally shows the highest speedups. It is worth noting why auto does not always select the fastest implementation: auto always opts for tens, while for quad tends to be preferable. On the other hand, cfO2 always makes the optimal decision about whether to apply preevaluation or not. Surprisingly, despite the simplicity of the form, the performance of the various code generation systems can differ significantly.
Helmholtz
As in the case of Mass matrix, when the matrix insertion phase is dominant. For , the general trend is that cfO2 outperforms the competitors. In particular:

preevaluation makes cfO2 notably faster than cfO1, especially for high values of ; auto correctly selects tens, which is comparable to cfO2.

auto picks tens; the choice is however suboptimal when and . This can indirectly be inferred from the large gap between cfO2 and tens/auto: cfO2 applies sharing elimination, but it correctly avoids preevaluation because of the excessive expansion cost.
 and

auto reverts to quad, which would theoretically be the right choice (the flop count is much lower than in tens); however, the generated code suffers from the presence of indirection arrays, which break autovectorization and “traditional” code motion.
The slowdowns (or marginal improvements) seen in a small number of cases exhibited by ufls can be attributed to the presence of sharing in the generated code.
An interesting experiment we additionally performed was relaxing the memory threshold by setting . We found that this makes cfO2 generally slower for , with a maximum slowdown of 2.16 with . This effect could be worse when running in parallel, since the L3 cache is shared and different threads would end up competing for the same resource.
Elasticity
The results for the elastic model are displayed in Figure 9. The main observation is that cfO2 never triggers preevaluation, although in some occasions it should. To clarify this, consider the test case , in which tens/auto show a considerable speedup over cfO2. cfO2 finds preevaluation profitable in terms of operation count, although it is eventually not applied to avoid exceeding . However, running the same experiments with resulted in a dramatic improvement, even higher than that obtained by tens. The reason is that, despite exceeding by roughly 40, the saving in operation count is so large (5 in this specific problem) that preevaluation would in practice be the winning choice. This suggests that our objective function should be improved to handle the cases in which there is a significant gap between potential cache misses and reduction in operation count.
We also note that:

the differences between cfO2 and cfO1 are due to the perfect sharing elimination and the zerovalued blocks avoidance technique presented in Section 6.3.

when , auto prefers tens over quad, which leads to suboptimal operation counts and execution times.

ufls often results in better execution times than quad and tens. This is due to multiple factors, including avoidance of indirection arrays, preservation of loop structure, and a more effective code motion strategy.
Hyperelasticity
In the experiments on the hyperelastic model, shown in Figure 10, cfO2 exhibits the largest gains out of all problem instances considered in this paper. This is a positive result, since it indicates that our transformation algorithm scales well with form complexity. The fact that all code transformation systems (apart from tens) show quite significant speedups suggests two points. First, the baseline is highly inefficient. With forms as complex as in the hyperelastic model, a trivial translation of integration routines into code should always be avoided as even the best generalpurpose compiler available (the Intel compiler on an Intel platform at maximum optimization level) fails to exploit the structure inherent in the expressions. Second, the strategy for removing spatial and temporal sharing has a tremendous impact. Sharing elimination as performed by cfO2 ensures a critical reduction in operation count, which becomes particularly pronounced for higher values of .
8 Conclusions
We have developed a theory for the optimization of finite element integration loop nests. The article details the domain properties which are exploited by our approach (e.g., linearity) and how these translate to transformations at the level of loop nests. All of the algorithms shown in this paper have been implemented in COFFEE, a compiler publicly available fully integrated with the Firedrake framework. The correctness of the transformation algorithm was discussed. The performance results achieved suggest the effectiveness of our methodology.
9 Limitations and future work
We have defined sharing elimination and preevaluation as high level transformations on top of a specific set of rewrite operators, such as code motion and factorization, and we have used them to construct the transformation space. There are three main limitations in this process. First, we do not have a systematic strategy to optimize subexpressions which are independent of linear loops. Although we have a mechanism to determine how much computation should be hoisted to the level of the integration (reduction) loop, it is not clear how to effectively improve the heuristics used at step (6) in Algorithm 1. Second, lower operation counts may be found by exploiting domainspecific properties, such as redundancies in basis functions; this aspect is completely neglected in this article. Third, with Constraint 1 we have limited the applicability of code motion. This constraint was essential given the complexity of the problem tackled.
Another issue raised by the experimentation concerns selecting a proper threshold for Constraint 2. To solve this problem would require a more sophisticated cost model, which is an interesting question deserving further research.
We also identify two additional possible research directions: a complete classification of forms for which a global optimum is achieved; and a generalization of the methodology to other classes of loop nests, for instance those arising in spectral element methods.
References
 [1]
 Alnæs (2016) Martin Sandve Alnæs. 2016. UFLACS  UFL Analyser and Compiler System. https://bitbucket.org/fenicsproject/uflacs. (2016).

Alnæs et al. (2014)
M. S. Alnæs, A. Logg, K. B. Ølgaard, M. E. Rognes, and G. N.
Wells. 2014.
Unified Form Language: A domainspecific language for weak formulations of partial differential equations.
ACM Trans Math Software 40, 2, Article 9 (2014), 9:1–9:37 pages. DOI:http://dx.doi.org/10.1145/2566630  Alnæs et al. (2016) Martin Sandve Alnæs, Anders Logg, Garth Wells, Lawrence Mitchell, Marie E. Rognes, Miklós Homolya, Aslak Bergersen, David A. Ham, Johannes Ring, chrisrichardson, and KentAndre Mardal. 2016. ufl: The Unified Form Language. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49282
 Bondhugula et al. (2008) Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’08). ACM, New York, NY, USA, 101–113. DOI:http://dx.doi.org/10.1145/1375581.1375595
 Firedrake (2016) Firedrake. 2016. petsc4py: The Python interface to PETSc. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49283
 Homolya and Mitchell (2016) Miklós Homolya and Lawrence Mitchell. 2016. TSFC  Twostage form compiler. https://github.com/firedrakeproject/tsfc. (2016).
 Kirby and Logg (2006) Robert C. Kirby and Anders Logg. 2006. A Compiler for Variational Forms. ACM Trans. Math. Softw. 32, 3 (Sept. 2006), 417–444. DOI:http://dx.doi.org/10.1145/1163641.1163644
 Kirby and Logg (2007) Robert C. Kirby and Anders Logg. 2007. Efficient Compilation of a Class of Variational Forms. ACM Trans. Math. Softw. 33, 3, Article 17 (Aug. 2007). DOI:http://dx.doi.org/10.1145/1268769.1268771
 Larsen and Amarasinghe (2000) Samuel Larsen and Saman Amarasinghe. 2000. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation (PLDI ’00). ACM, New York, NY, USA, 145–156. DOI:http://dx.doi.org/10.1145/349299.349320
 Logg et al. (2016) Anders Logg, Martin Sandve Alnæs, Marie E. Rognes, Garth Wells, Johannes Ring, Lawrence Mitchell, Johan Hake, Miklós Homolya, Florian Rathgeber, Fabio Luporini, Graham Markall, Aslak Bergersen, Lizao Li, David A. Ham, KentAndre Mardal, Jan Blechta, GheorgheTeodor Bercea, Tuomas Airaksinen, Nico Schlömer, Hans Petter Langtangen, Colin J Cotter, Ola Skavhaug, Thomas Hisch, mliertzer, Joachim B Haga, and Andrew T. T. McRae. 2016. ffc: The FEniCS Form Compiler. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49276
 Logg et al. (2012) Anders Logg, KentAndre Mardal, Garth N. Wells, and others. 2012. Automated Solution of Differential Equations by the Finite Element Method. Springer. DOI:http://dx.doi.org/10.1007/9783642230998
 Luporini et al. (2016) Fabio Luporini, Lawrence Mitchell, Miklós Homolya, Florian Rathgeber, David A. Ham, Michael Lange, Graham Markall, and Francis Russell. 2016. COFFEE: A Compiler for Fast Expression Evaluation. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49279
 Luporini et al. (2015) Fabio Luporini, Ana Lucia Varbanescu, Florian Rathgeber, GheorgheTeodor Bercea, J. Ramanujam, David A. Ham, and Paul H. J. Kelly. 2015. CrossLoop Optimization of Arithmetic Intensity for Finite Element Local Assembly. ACM Trans. Archit. Code Optim. 11, 4, Article 57 (Jan. 2015), 25 pages. DOI:http://dx.doi.org/10.1145/2687415
 Mitchell et al. (2016) Lawrence Mitchell, David A. Ham, Florian Rathgeber, Miklós Homolya, Andrew T. T. McRae, GheorgheTeodor Bercea, Michael Lange, Colin J Cotter, Christian T. Jacobs, Fabio Luporini, Simon Wolfgang Funke, Henrik Büsing, Tuomas Kärnä, Anna Kalogirou, Hannah Rittich, Eike Hermann Mueller, Stephan Kramer, Graham Markall, Patrick E. Farrell, Geordie McBain, and Asbjørn Nilsen Riseth. 2016. firedrake: an automated finite element system. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49284
 Ølgaard and Wells (2010) Kristian B. Ølgaard and Garth N. Wells. 2010. Optimizations for quadrature representations of finite element tensors through automated code generation. ACM Trans. Math. Softw. 37, 1, Article 8 (Jan. 2010), 23 pages. DOI:http://dx.doi.org/10.1145/1644001.1644009
 Rathgeber et al. (2015) Florian Rathgeber, David A. Ham, Lawrence Mitchell, Michael Lange, Fabio Luporini, Andrew T. T. McRae, GheorgheTeodor Bercea, Graham R. Markall, and Paul H. J. Kelly. 2015. Firedrake: automating the finite element method by composing abstractions. CoRR abs/1501.01809 (2015). http://arxiv.org/abs/1501.01809
 Rathgeber et al. (2016a) Florian Rathgeber, Fabio Luporini, and Lawrence Mitchell. 2016a. firedrakebench: firedrake bench optimality paper release. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49290
 Rathgeber et al. (2016b) Florian Rathgeber, Lawrence Mitchell, Fabio Luporini, Graham Markall, David A. Ham, GheorgheTeodor Bercea, Miklós Homolya, Andrew T. T. McRae, Hector Dearman, Christian T. Jacobs, gbts, Simon Wolfgang Funke, Kaho Sato, and Francis Russell. 2016b. PyOP2: Framework for performanceportable parallel computations on unstructured meshes. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49281
 Rognes et al. (2016) Marie E. Rognes, Anders Logg, David A. Ham, Miklós Homolya, Nico Schlömer, Jan Blechta, Andrew T. T. McRae, Aslak Bergersen, Colin J Cotter, Johannes Ring, and Lawrence Mitchell. 2016. fiat: The Finite Element Automated Tabulator. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49280
 Russell and Kelly (2013) Francis P. Russell and Paul H. J. Kelly. 2013. Optimized Code Generation for Finite Element Local Assembly Using Symbolic Manipulation. ACM Trans. Math. Software 39, 4 (2013).
 Smith et al. (2016) Barry Smith, Satish Balay, Matthew Knepley, Jed Brown, Lois Curfman McInnes, Hong Zhang, Peter Brune, sarich, stefanozampini, Dmitry Karpeyev, Lisandro Dalcin, tisaac, markadams, Victor Minden, VictorEijkhout, vijaysm, Karl Rupp, Fande Kong, and SurtaiHan. 2016. petsc: Portable, Extensible Toolkit for Scientific Computation. (April 2016). DOI:http://dx.doi.org/10.5281/zenodo.49285
6 Code Generation
Sharing elimination and preevaluation, as well as the transformation algorithm, have been implemented in COFFEE, the compiler for finite element integration routines adopted in Firedrake. In this section, we briefly discuss the aspects of the compiler that are relevant for this article.
6.1 Expressing transformations through the COFFEE language
COFFEE implements sharing elimination and preevaluation by composing building block transformation operators, which we refer to as rewrite operators. This has several advantages. The first is extensibility. New transformations, such as sum factorization in spectral methods, could be expressed by composing the existing operators, or with small effort building on what is already available. Second, generality: COFFEE can be seen as a lightweight, low level computer algebra system, not necessarily tied to finite element integration. Third, robustness: the same operators are exploited, and therefore tested, by different optimization pipelines. The rewrite operators, whose (Python) implementation is based on manipulation of abstract syntax trees (ASTs), comprise the COFFEE language. A nonexhaustive list of such operators includes expansion, factorization, reassociation, generalized code motion.
6.2 Independence from form compilers
COFFEE aims to be independent of the high level form compiler. It provides an interface to build generic ASTs and only expects expressions to be in normal form (or sufficiently close to it). For example, Firedrake has transitioned from a version of the FEniCS Form Compiler [Kirby and Logg (2006)] modified to produce ASTs rather than strings, to a newly written compiler^{6}^{6}6TSFC, the twostage form compiler https://github.com/firedrakeproject/tsfc, while continuing to emply COFFEE. Thus, COFFEE decouples the mathematical manipulation of a form from code optimization; or, in other words, relieves form compiler developers of the task of fine scale loop optimization of generated code.
6.3 Handling blocksparse tables
For several reasons, basis function tables may be blocksparse (e.g., containing zerovalued columns). For example, the FEniCS Form Compiler implements vectorvalued functions by adding blocks of zerovalued columns to the corresponding tabulations; this extremely simplifies code generation (particularly, the construction of loop nests), but also affects the performance of the generated code due to the execution of “useless” flops (e.g., operations like a + 0). In quadrature1, a technique to avoid iteration over zerovalued columns based on the use of indirection arrays (e.g. A[B[i]], in which A is a tabulated basis function and B a map from loop iterations to nonzero columns in A) was proposed. This technique, however, produces noncontiguous memory loads and stores, which nullify the potential benefits of vectorization. COFFEE, instead, handles blocksparse basis function tables by restructuring loops in such a manner that low level optimization (especially vectorization) is only marginally affected. This is based on symbolic execution of the code, which enables a series of checks on array indices and loop bounds which determine the zerovalued blocks which can be skipped without affecting data alignment.
7 Performance Evaluation
7.1 Experimental setup
Experiments were run on a single core of an Intel I72600 (Sandy Bridge) CPU, running at 3.4GHz, 32KB L1 cache (private), 256KB L2 cache (private) and 8MB L3 cache (shared). The Intel Turbo Boost and Intel Speed Step technologies were disabled. The Intel icc 15.2 compiler was used. The compilation flags used were O3, xHost. The compilation flag xHost tells the Intel compiler to generate efficient code for the underlying platform.
The Zenodo system was used to archive all packages used to perform the experiments: Firedrake [Mitchell et al. (2016)], PETSc [Smith et al. (2016)], petsc4py [Firedrake (2016)], FIAT [Rognes et al. (2016)], UFL [Alnæs et al. (2016)], FFC [Logg et al. (2016)], PyOP2 [Rathgeber et al. (2016b)] and COFFEE [Luporini et al. (2016)]. The experiments can be reproduced using a publicly available benchmark suite [Rathgeber et al. (2016a)].
We analyze the execution time of four realworld bilinear forms of increasing complexity, which comprise the differential operators that are most common in finite element methods. In particular, we study the mass matrix (“Mass”) and the bilinear forms arising in a Helmholtz equation (“Helmholtz”), in an elastic model (“Elasticity”), and in a hyperelastic model (“Hyperelasticity”). The complete specification of these forms is made publicly available^{7}^{7}7https://github.com/firedrakeproject/firedrakebench/blob/experiments/forms/firedrake_forms.py.
We evaluate the speedups achieved by a wide variety of transformation systems over the “original” code produced by the FEniCS Form Compiler (i.e., no optimizations applied). We analyze the following transformation systems:
 quad

Optimized quadrature mode. Work presented in quadrature1, implemented in in the FEniCS Form Compiler.
 tens

Tensor contraction mode. Work presented in FFCTC, implemented in the FEniCS Form Compiler.
 auto

Automatic choice between tens and quad driven by heuristic (detailed in Fenics and summarized in Section 2.4). Implemented in the FEniCS Form Compiler.
 ufls

UFLACS, a novel backend for the FEniCS Form Compiler whose primary goals are improved code generation and execution times.
 cfO1

Generalized loopinvariant code motion. Work presented in Luporini, implemented in COFFEE.
 cfO2

Optimal loop nest synthesis with handling of blocksparse tables. Work presented in this article, implemented in COFFEE.
The values that we report are the average of three runs with “warm cache”; that is, with all kernels retrieved directly from the Firedrake’s cache, so code generation and compilation times are not counted. The timing includes however the cost of both local assembly and matrix insertion, with the latter minimized through the choice of a mesh (details below) small enough to fit the L3 cache of the CPU.
For a fair comparison, small patches were written to the make quad, tens, and ufls compatible with Firedrake. By executing all simulations in Firedrake, we guarantee that both matrix insertion and mesh iteration have a fixed cost, independent of the transformation system employed. The patches adjust the data storage layout to what Firedrake expects (e.g., by generating an array of pointers instead of a pointer to pointers, by replacing flattened arrays with bidimensional ones).
For Constraint 2, discussed in Section 3.4, we set ; that is, the size of the processor L2 cache (the last level of private cache). When the threshold had an impact on the transformation process, the experiments were repeated with . The results are documented later, individually for each problem.
Following the methodology adopted in quadrature1, we vary the following parameters:

the polynomial degree of test, trial, and coefficient (or “premultiplying”) functions,

the number of coefficient functions
While constants of our study are

the space of test, trial, and coefficient functions: Lagrange

the mesh: tetrahedral with a total of 4374 elements

exact numerical quadrature (we employ the same scheme used in quadrature1, based on the GaussLegendreJacobi rule)
7.2 Performance results
We report the results of our experiments in Figures 7, 8, 9, and 10 as threedimensional plots. The axes represent , , and code transformation system. We show one subplot for each problem instance , with the code transformation system varying within each subplot. The best variant for each problem instance is given by the tallest bar, which indicates the maximum speedup over nontransformed code. We note that if a bar or a subplot are missing, then the form compiler failed to generate code because it either exceeded the system memory limit or was otherwise unable to handle the form.
The rest of the section is organized as follows: we first provide insights into the general outcome of the experimentation; we then comment on the impact of a fundamental lowlevel optimization, namely autovectorization; finally, we motivate, for each form, the performance results obtained.
High level view
Our transformation strategy does not always guarantee minimum execution time. In particular, about 5 of the test cases (3 out of 56, without counting marginal differences) show that cfO2 was not optimal in terms of runtime. The most significant of such test cases is the elastic model with . There are two reasons for this. First, low level optimization can have a significant impact on the actual performance. For example, the aggressive loop unrolling in tens eliminates operations on zeros and reduces the working set size by not storing entire temporaries; on the other hand, preserving the loop structure can maximize the chances of autovectorization. Second, the transformation strategy adopted when is exceeded plays a key role, as we will later elaborate.
Autovectorization
We chose the mesh dimension and the function spaces such that the inner loop sizes would always be a multiple of the machine vector length. This ensured autovectorization in the majority of code variants^{8}^{8}8We verified the vectorization of inner loops by looking at both compiler reports and assembly code.. The biggest exception is quad, due to the presence of indirection arrays in the generated code. In tens, loop nests are fully unrolled, so the standard loop vectorization is not feasible; the compiler reports suggest, however, that block vectorization [Larsen and Amarasinghe (2000)] is often triggered. In ufls, cfO1, and cfO2 the iteration spaces have identical structure, with loop vectorization being regularly applied.
Mass matrix
We start with the simplest of the bilinear forms investigated, the mass matrix. Results are in Figure 7. We first notice that the lack of improvements when is due to the fact that matrix insertion outweighs local assembly. For , cfO2 generally shows the highest speedups. It is worth noting why auto does not always select the fastest implementation: auto always opts for tens, while for quad tends to be preferable. On the other hand, cfO2 always makes the optimal decision about whether to apply preevaluation or not. Surprisingly, despite the simplicity of the form, the performance of the various code generation systems can differ significantly.
Helmholtz
As in the case of Mass matrix, when the matrix insertion phase is dominant. For , the general trend is that cfO2 outperforms the competitors. In particular:

preevaluation makes cfO2 notably faster than cfO1, especially for high values of ; auto correctly selects tens, which is comparable to cfO2.

auto picks tens; the choice is however suboptimal when and . This can indirectly be inferred from the large gap between cfO2 and tens/auto: cfO2 applies sharing elimination, but it correctly avoids preevaluation because of the excessive expansion cost.
 and

auto reverts to quad, which would theoretically be the right choice (the flop count is much lower than in tens); however, the generated code suffers from the presence of indirection arrays, which break autovectorization and “traditional” code motion.
The slowdowns (or marginal improvements) seen in a small number of cases exhibited by ufls can be attributed to the presence of sharing in the generated code.
An interesting experiment we additionally performed was relaxing the memory threshold by setting . We found that this makes cfO2 generally slower for , with a maximum slowdown of 2.16 with . This effect could be worse when running in parallel, since the L3 cache is shared and different threads would end up competing for the same resource.
Elasticity
The results for the elastic model are displayed in Figure 9. The main observation is that cfO2 never triggers preevaluation, although in some occasions it should. To clarify this, consider the test case , in which tens/auto show a considerable speedup over cfO2. cfO2 finds preevaluation profitable in terms of operation count, although it is eventually not applied to avoid exceeding . However, running the same experiments with resulted in a dramatic improvement, even higher than that obtained by tens. The reason is that, despite exceeding by roughly 40, the saving in operation count is so large (5 in this specific problem) that preevaluation would in practice be the winning choice. This suggests that our objective function should be improved to handle the cases in which there is a significant gap between potential cache misses and reduction in operation count.
We also note that:

the differences between cfO2 and cfO1 are due to the perfect sharing elimination and the zerovalued blocks avoidance technique presented in Section 6.3.

when , auto prefers tens over quad, which leads to suboptimal operation counts and execution times.

ufls often results in better execution times than quad and tens. This is due to multiple factors, including avoidance of indirection arrays, preservation of loop structure, and a more effective code motion strategy.
Hyperelasticity
In the experiments on the hyperelastic model, shown in Figure 10, cfO2 exhibits the largest gains out of all problem instances considered in this paper. This is a positive result, since it indicates that our transformation algorithm scales well with for
Comments
There are no comments yet.