1 Introduction
The realization of efficient solution procedures for partial differential equations (PDEs) using finite element methods on modern computer systems requires the combination of diverse skills across mathematics, programming languages and high-performance computing. Automated code generation is a promising approach to managing this complexity, and it has been increasingly adopted in software systems and libraries. Recent successful examples include FEniCS
(Logg et al., 2012), Firedrake (Rathgeber et al., 2016) and FreeFem++ (Hecht, 2012). These software packages provide users with high-level interfaces for high productivity, while relying on optimizations and transformations in the code generation pipeline to produce efficient low-level code. The challenge, as in all compilers, is to choose abstraction layers that enable optimizations achieving high performance across a broad set of programs and machines.

One particular challenge for generating high-performance code on modern hardware is vectorization. Modern CPUs increasingly rely on SIMD instructions to achieve higher throughput and better energy efficiency. Finite element computation requires the assembly of vectors and matrices which represent differential forms on discretized function spaces. This process consists of applying a local function, often called an element kernel, to each mesh entity and incrementing the global data structure with the local contribution. Typical local assembly kernels suffer from issues that can preclude effective vectorization: complicated loop structures, poor data access patterns, and short loop trip counts that are not multiples of the vector width. As we show in this paper, general-purpose compilers perform poorly in generating efficient, vectorized code for such kernels. Padding and data layout transformations can enable the vectorization of element kernels
(Luporini et al., 2015), but the effectiveness of such approaches is not consistent across different examples. Since padding may also incur larger overheads on wider vector architectures, new strategies are needed as vector widths increase in new generations of hardware.

Matrix-free methods avoid building large sparse matrices in applications of the finite element method, trading computation for storage. They have become popular on modern hardware due to their higher arithmetic intensity (defined as the number of floating-point operations per byte of data transfer). Vectorization is particularly important for computationally intensive high-order methods, for which matrix-free methods are often applied. Previous work on improving the vectorization of matrix-free operator application, or equivalently residual evaluation, mostly focuses on exposing library interfaces to the users. Kronbichler and Kormann (2017) first perform a change of basis from nodal points to quadrature points, and provide overloaded SIMD types for users to write a quadrature-point-wise expression for residual evaluation. However, since the transformation is done manually, new operators require manual reimplementation. Knepley and Terrel (2013) also transpose to a quadrature-point basis but target GPUs instead. Both works vectorize by grouping elements into batches, either to match the SIMD vector length on CPUs or the shared memory capacity on GPUs. In contrast, Müthing et al. (2017) apply an intra-kernel vectorization strategy, exploiting the fact that in 3D, evaluating both a scalar field and its three derivatives fills the four lanes of an AVX2 vector register. More recently, Kempf et al. (2018) target high-order Discontinuous Galerkin (DG) methods on hexahedral meshes, using automated code generation to search for vectorization strategies while taking advantage of the specific memory layout of the data.
In this work, we present a generic and portable solution based on cross-element vectorization. Our vectorization strategy, implemented in Firedrake, is similar to that of Kronbichler and Kormann (2017) but is fully automated through code generation, like that of Kempf et al. (2018). We extend the scope of code generation in Firedrake to incorporate the outer iteration over mesh entities, and leverage Loopy (Klöckner, 2014), a loop code generator based loosely on the polyhedral model, to systematically apply a sequence of transformations that promote vectorization by grouping mesh entities into batches so that each SIMD lane operates on one entity independently. This automated code generation mechanism enables us to explore the effectiveness of our techniques on operators spanning a wide range of complexity and to evaluate our methodology systematically. Compared with an intra-kernel vectorization strategy, this approach is conceptually well-defined, more portable, and produces more predictable performance. Our experimental evaluation demonstrates that the approach consistently achieves a high fraction of hardware peak performance while being fully transparent to end users.
The contributions of this work are as follows:

We present the design of a code transformation pipeline that permits the generation of high-performance, vectorized code for a broad class of finite element models.

We provide a thorough evaluation of our code generation strategy and demonstrate that it achieves a substantial fraction of theoretical peak performance across a broad range of test cases.
The rest of this paper is organized as follows. After reviewing the preliminaries of code generation for the finite element method in Section 2, we describe our implementation of cross-element vectorization in Firedrake in Section 3. In Section 4, we demonstrate the effectiveness of our approach with experimental results. Finally, we review our contributions and identify future research priorities in Section 5.
2 Preliminaries
The computation of multilinear forms using the basis functions spanning the discretized function spaces is called finite element assembly. When applying matrix-free methods, one only needs to assemble linear forms, or residual forms, because matrix-vector products are essentially the assembly of linear forms which represent the actions of bilinear forms. Optimizing linear form assembly is therefore crucial for improving the performance of matrix-free methods. In Firedrake, one can invoke the matrix-free approach without changing the high-level problem formulation by setting solver options, as detailed by Kirby and Mitchell (2018).
The general structure of a linear form is
(1)  $L = L(v; w_1, \ldots, w_k),$

where $w_1, \ldots, w_k$ are arbitrary coefficient functions and $v$ is the test function. $L$ is linear with respect to $v$, but possibly nonlinear with respect to the coefficient functions.
Let $\{\phi_i\}$ be the set of basis functions spanning the test space. Define $F_i = L(\phi_i; w_1, \ldots, w_k)$; then the assembly of $L$ constitutes the computation of the vector $F$. In Firedrake, this is treated as a two-step process: local assembly and global assembly.
2.1 Local assembly
Local assembly of linear forms is the evaluation of the integrals, as defined by the weak form of the differential equation, on each entity (cell or facet) of the mesh. In Firedrake, users define the problem in the Unified Form Language (UFL) (Alnæs et al., 2014), which captures the weak form and the function space discretization. The Two-Stage Form Compiler (TSFC) (Homolya et al., 2018) then takes this high-level, mathematical description and generates efficient C code. As an example, consider the linear form of the weak form of the positive-definite Helmholtz operator:
(2)  $L(v; u) = \int_\Omega \nabla u \cdot \nabla v + u v \,\mathrm{d}x.$
Listing 1 shows the UFL syntax to assemble the linear form into the vector result, on a triangulation of a unit square. We choose the first-order Lagrange element as our approximation space. Listing 2 shows a C representation of this kernel generated by TSFC. We note the following key features of this element kernel:

The kernel takes three array arguments in this case: coords holds the coordinates of the current triangle, w_0 holds the coefficients of $u$, and A stores the result.

The first part of the kernel (lines 7 to 15) computes the inverse and the determinant of the Jacobian of the coordinate transformation from the reference element to the current element. This is required for pulling the differential forms back to the reference element. The Jacobian is constant for each triangle because the coordinate transformation is affine in this case; otherwise, the Jacobian would need to be computed at each quadrature point.

The constant arrays t0 and t9 are the same for all elements: t0 represents the tabulation of the basis functions evaluated at the quadrature points, and t9 represents the quadrature weights.

The ip loop iterates over the quadrature points, evaluating the integrand in (2) and summing to approximate the integral. The j loops iterate over the degrees of freedom, once inside the quadrature loop and once upon output to the assembled array A. The extents of these loops depend on the integrals performed and the choice of function spaces, respectively.
TSFC performs optimization passes on the loop nests. In particular, it applies loop-invariant code motion, which pulls invariant expressions out of the loop nests into temporary arrays. This reduces the number of operations required, while changing the structure of otherwise perfectly nested loops.
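To make the structure of such an element kernel concrete, the following is a pure-Python model of a P1 Helmholtz kernel on one triangle. It is a sketch, not the TSFC-generated C of Listing 2: the midpoint quadrature rule, the variable names and the hoisting shown here are our own illustrative choices.

```python
def helmholtz_kernel(coords, w):
    # coords: the 3 vertices of the triangle; w: the 3 coefficients of u.
    (x0, y0), (x1, y1), (x2, y2) = coords
    # Jacobian of the affine map from the reference triangle (constant per cell)
    J = [[x1 - x0, x2 - x0], [y1 - y0, y2 - y0]]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    Jinv = [[J[1][1] / det, -J[0][1] / det], [-J[1][0] / det, J[0][0] / det]]
    # Tabulations on the reference element (playing the role of t0, t9):
    # midpoint quadrature rule (exact for degree 2) and P1 basis functions.
    qp = [(0.5, 0.0), (0.5, 0.5), (0.0, 0.5)]
    qw = [1.0 / 6.0] * 3
    phi = [[1 - x - y, x, y] for (x, y) in qp]
    gref = [(-1.0, -1.0), (1.0, 0.0), (0.0, 1.0)]  # reference gradients, constant for P1
    # Physical gradients J^{-T} grad(phi): loop-invariant, hoisted out of the ip loop
    g = [[Jinv[0][0] * gx + Jinv[1][0] * gy,
          Jinv[0][1] * gx + Jinv[1][1] * gy] for (gx, gy) in gref]
    A = [0.0] * 3
    for ip in range(3):                      # quadrature loop
        u = sum(w[k] * phi[ip][k] for k in range(3))
        gu = [sum(w[k] * g[k][d] for k in range(3)) for d in range(2)]
        for j in range(3):                   # basis-function loop
            A[j] += qw[ip] * abs(det) * (u * phi[ip][j]
                                         + gu[0] * g[j][0] + gu[1] * g[j][1])
    return A
```

Note the short trip counts (3) of both inner loops, which are the obstacle to in-kernel vectorization discussed in Section 3.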
2.2 Global assembly
During global assembly, the local contribution from each mesh entity, computed by the element kernel, is accumulated into the global data structure. In Firedrake, PyOP2 (Rathgeber et al., 2012) is responsible for representing and realizing the iteration over mesh entities, marshalling data in and out of the element kernels. The computation is organized as PyOP2 parallel loops, or parloops. A parloop specifies a computational kernel, a set of mesh entities to which the kernel is applied, and all data required by the kernel. The data objects may be defined directly on the mesh entities, or accessed indirectly through maps from the mesh entities. For instance, the signature for the global assembly of the Helmholtz operator is:
Here helmholtz is the element kernel shown in Listing 2, generated by TSFC; cells is the set of all triangles in the mesh; and r, coords and x are the global data objects needed to create the arguments for the element kernel, where r holds the result vector, coords holds the coordinates of the vertices of the triangles, which are needed for computing the Jacobian, and x holds the vector representation of the function $u$ (as weights of basis functions). These global data objects correspond to the kernel arguments A, coords and w_0 respectively. The map cell2vert provides indirection from mesh entities to the global data objects, and each data argument is annotated with an access descriptor (R for read-only, RW for read-write access). In this example, all three arguments share the same map because the first-order Lagrange element on triangles only has degrees of freedom on the vertices, and the coordinate field is also defined on the vertices.
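The gather–compute–scatter structure that a parloop realizes can be sketched in pure Python. This is a model only (PyOP2 generates C, as shown in Listing 3); the argument names mirror the signature above.

```python
def parloop(kernel, cells, cell2vert, r, coords, x):
    """Model of PyOP2 global assembly: for each cell, gather inputs
    through the map, call the element kernel, scatter-accumulate the result."""
    for c in cells:
        verts = cell2vert[c]
        t3 = [coords[v] for v in verts]   # gather coordinates (access R)
        t4 = [x[v] for v in verts]        # gather coefficients (access R)
        t2 = kernel(t3, t4)               # local assembly
        for j, v in enumerate(verts):     # scatter into r (access RW)
            r[v] += t2[j]
```

Because neighbouring cells share vertices, the scatter accumulates contributions from several cells into the same entries of r, which is exactly the potential race condition noted below.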
Listing 3 shows the C code generated by PyOP2 for the above example. The code is then JIT-compiled when the result is needed in Firedrake. In the context of vectorization, this approach, with the inlined element kernel, forms the baseline in our experimental evaluation. We note the following key features of the global assembly kernel:

The outer loop is over mesh entities.

For each entity, the computation divides into three parts: gathering the input data from global data structures (t3 and t4 in this case, which correspond to the kernel arguments coords and w_0); calling the local assembly kernel; and scattering the output data (t2) to the global data structure.

The gathering and scattering of data make use of indirect addressing via base pointers (dats) and indices (maps).

Different mesh entities might share the same degrees of freedom.

Global assembly interacts with local assembly via a function call (Line 23). This call can be inlined by the compiler, but it creates an artificial boundary for loop transformations at the source code level. This is the software engineering challenge that previously limited vectorization to the scope of a single local assembly kernel.
3 Vectorization
As one would expect, the loop nests and loop trip counts vary considerably for different integrals, meshes and function spaces that users might choose. This complexity is one of the challenges that our system specifically, and Firedrake more generally, must face in order to deliver predictable performance on modern CPUs, which have increasingly rich SIMD instruction sets.
In the prior approach to vectorization in our framework, the local assembly kernels generated by TSFC are further transformed to facilitate vectorization, as described by Luporini et al. (2015). The arrays are padded so that the trip counts of the innermost loops match multiples of the SIMD unit length. However, padding becomes less effective for low polynomial degrees on wide SIMD units. For instance, AVX-512 instructions act on eight double-precision floats, but the loops for degree-1 polynomials on triangles only have trip counts of 3, as shown in Listing 2. Moreover, loop-invariant code motion is very effective in reducing the number of floating-point operations, but hoisted instructions are not easily vectorized as they are no longer in the innermost loops. This effect is more pronounced on tensor-product elements, where TSFC is able to apply sum factorization (Homolya et al., 2017) to achieve better algorithmic complexity.
3.1 Cross-element vectorization and Loopy
Another strategy is to vectorize across several elements in the outer loop over the mesh entities, as proposed by Kronbichler and Kormann (2017). This approach computes the contributions from several mesh entities using SIMD instructions, with each SIMD lane handling one entity. This is always possible regardless of the complexity of the local element kernel, because the computation on each entity is independent and identical. One potential downside is increased memory pressure, as the working set is larger.
For a compiler, the difficulty in performing cross-element vectorization (or, more generally, outer-loop vectorization) lies in automating a sequence of loop transformations and the necessary data layout transformations robustly. This is further complicated by the indirect memory accesses in data gathering and scattering, and by the need to unroll and interchange loops across the indirections, which requires significantly more semantic knowledge than is available to the C compiler.
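A pure-Python model of the batching idea follows (a toy kernel, not Firedrake's generated code): the gathered data is transposed into struct-of-arrays layout so the lane index varies fastest, and every kernel statement loops over the lanes.

```python
SIMD = 4  # number of elements per batch, one per vector lane

def gather_batch(x, cell2dof, cells):
    # Gather coefficients for a batch of cells, transposed into
    # struct-of-arrays layout t[dof][lane] so that accesses in the
    # lane loop are unit-stride.
    ndof = len(cell2dof[cells[0]])
    return [[x[cell2dof[c][j]] for c in cells] for j in range(ndof)]

def batched_kernel(t):
    # Toy element kernel (sum of coefficients) computed for SIMD
    # elements at once: the innermost lane loop has a fixed trip count
    # and no cross-lane dependencies, so it maps to SIMD instructions.
    acc = [0.0] * SIMD
    for j in range(len(t)):
        for lane in range(SIMD):
            acc[lane] += t[j][lane]
    return acc
```

The control flow of the batched kernel is identical to the scalar one; only the data layout changes, which is why this strategy works for arbitrarily complicated element kernels.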
Loopy (Klöckner, 2014) is a loop code generator embedded in Python which targets both CPUs and GPUs. Loopy provides abstractions based on integer sets for loop-based computations, and enables powerful transformations based on the polyhedral model (Verdoolaege, 2010). Loop-based computations in Loopy are represented as Loopy kernels. A Loopy kernel is a subprogram consisting of a loop domain and a partially-ordered list of scalar assignments acting on multi-dimensional arrays. The loop domain is specified as the set of integral points in the convex intersection of quasi-affine constraints, as described by the Integer Set Library (Verdoolaege, 2010).
To integrate with Loopy, the code generation mechanisms in Firedrake were modified as illustrated in Figure 1. Instead of generating source code directly, TSFC and PyOP2 are modified to generate Loopy kernels. We have augmented the Loopy internal representation with the ability to support a generalized notion of kernel fusion through the nested composition of kernels, specifically through subprograms and inlining. This allows PyOP2 to inline the element kernel such that the global assembly Loopy kernel encapsulates the complete computation of global assembly. This holistic view of the overall computation enables robust loop transformations for vectorization across the boundary between global and local assembly.
Listing 4 shows an abridged version of the global assembly Loopy kernel for the Helmholtz operator, with the element kernel fused. We highlight the following key features of Loopy kernels:

Loop indices, such as n and i1, are called inames in Loopy; they define the iteration space. The bounds of the loops are specified by the affine constraints in domains.

Loop transformations operate on kernels by rewriting the loop domain and the statements making up the kernel. In addition, each iname carries a set of tags governing its realization in generated code, perhaps as a sequential loop, as a vector lane index, or through unrolling.

Multi-dimensional arrays occur as arguments and temporaries. The memory layout of the data can be specified by assigning tags to the array dimensions.

Dependencies between statements specify their partial order. Statement scheduling can also be controlled by assigning priorities to statements and inames.
For example, to achieve cross-element vectorization (batching 4 elements into one SIMD vector in this example), we invoke the following sequence of Loopy transformations on the global assembly Loopy kernel, exploiting domain knowledge of finite element assembly:

Split the outer loop n over mesh entities into n_outer and n_simd, with n_simd having a trip count of 4. The objective is to generate SIMD instructions for the n_simd loops, such that each vector lane computes one iteration of the n_simd loops.

Assign the tag SIMD to the new iname n_simd. This tag instructs Loopy to force the n_simd loop to be innermost, privatizing data by vector expansion where necessary.
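The semantics of the split, including the remainder iterations when the number of elements is not a multiple of the batch size, can be modeled in a few lines of Python. This is a sketch of what Loopy's split transformation means for the iteration space, not of its implementation:

```python
def split_loop(n_elements, body, simd=4):
    # n -> n_outer * simd + n_simd: the n_simd loop has a fixed trip
    # count of `simd` and is the one tagged for SIMD execution.
    for n_outer in range(n_elements // simd):
        for n_simd in range(simd):
            body(n_outer * simd + n_simd)
    # Leftover iterations are handled by a scalar remainder loop.
    for n in range((n_elements // simd) * simd, n_elements):
        body(n)
```

Every original iteration is visited exactly once; the transformation only regroups them into fixed-width batches.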
We highlight the changes to the Loopy kernel after these transformations in Listing 5. Loopy supports code generation for different environments from the same kernel by choosing different targets. We introduced an OpenMP target to Loopy which extends its existing C-language target to support OpenMP pragmas, facilitating SIMD instruction generation.
Listing 6 shows the generated C code for the Helmholtz operator vectorized by grouping together 4 elements. Apart from the previously mentioned changes, we note the following details:

The n_simd loops are pushed to the innermost level. Moreover, this transformation vector-expands temporary arrays such as t2, t3 and t4 by a factor of 4, with the expanded dimension varying fastest in memory. This ensures that their accesses in the n_simd loops always have unit stride.

Loopy provides a mechanism to declare arrays to be aligned to specified memory boundaries (64 bytes in this example).

The n_simd loops are decorated with pragma omp simd to instruct C compilers to generate SIMD instructions. The exception is the write-back to the global array (Line 36), which is sequentialized due to potential race conditions, as different mesh entities may share the same degrees of freedom.

The remainder loop, which handles the case where the number of elements is not divisible by 4, is omitted here for simplicity.
After cross-element vectorization, all local assembly instructions (Lines 24–36) are inside the n_simd loops, which always have trip counts of 4. All loop-varying array accesses have stride 1 in the fastest-moving dimension, and there are no loop-carried dependencies in the n_simd loops. As a result, the n_simd loops, and therefore all local assembly instructions, are vectorizable without further consideration of dependencies. We verified this by inspecting the x86 assembly code and by running the program with the Intel Software Development Emulator.
3.2 Vector extensions
A more direct way to instruct the compiler to emit SIMD instructions, without depending on an OpenMP implementation, is to use vector extensions (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html), which support vector data types. These were first introduced in the GNU compiler (GCC), but are also supported in recent versions of the Intel C compiler (ICC) and Clang. Analogous mechanisms exist in various vector-type libraries, e.g. VCL (Fog, 2017). To evaluate and compare with the directive-based approach of Section 3.1, we created a new code generation target in Loopy to support vector data types. When inames and corresponding array axes are jointly tagged as vector loops, Loopy generates code that computes on data in vector registers directly, instead of scalar loops over the vector lanes. It is worth noting that the initial intermediate representation of the loop is identical in each case; the different specializations are achieved through code transformation.
Listing 7 shows the C code generated for the Helmholtz operator, vectorized by batching 4 elements using the vector extension target. Here, all vectorized (innermost) loops for local assembly are replaced by operations on vector variables. For instructions which do not fit the vector computation model, most notably the indirect data gathering (Line 18), or instructions containing built-in mathematical functions which are not supported on vector data types (Line 32), Loopy falls back to generating scalar loops over the vector lanes, decorated with pragma omp simd. In addition, because vector extensions do not automatically broadcast scalars, any vector instruction with a scalar right-hand side is modified by adding the zero vector to the expression, as shown in Lines 25 and 27.
4 Performance Evaluation
We follow the performance evaluation methodology of Luporini et al. (2017), measuring the assembly time of a range of operators of increasing complexity and polynomial degree. Due to the large number of combinations of experimental parameters (operators, meshes, polynomial degrees, vectorization strategies, compilers, hyperthreading), we report only an illustrative portion of the results here; the entire suite of experiments is available on the interactive online repository CodeOcean (Sun, 2019a).
4.1 Experimental setup
We performed experiments on a single node of two Intel systems, based on the Haswell and Skylake microarchitectures, as detailed in Table 1. Because we observe that hyperthreading usually improves performance by 5% to 10% for our applications, we set the number of MPI processes to the number of logical cores of the CPU to utilize all available computational resources. Experimental results with hyperthreading turned off are available on CodeOcean. The batch size, i.e., the number of elements grouped together for vectorization, is chosen to match the SIMD length. We use three C compilers: GCC 7.3, ICC 18.0 and Clang 5.0. The two vectorization strategies described in Section 3 are tested on all platforms. We use the listed base frequency to calculate the peak performance in Table 1. In reality, modern Intel CPUs dynamically reduce frequencies under heavy workloads with AVX2 and AVX-512 instructions, which results in lower achievable performance. Running the optimized LINPACK benchmark binary provided by Intel gives a reasonable indication of peak performance for real applications.
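The peak performance figures in Table 1 follow directly from the hardware parameters listed in the same table. The following sketch reproduces both values, assuming 2 FLOPs (a multiply and an add) per fused multiply-add:

```python
def peak_gflops(cores, base_ghz, fma_issues_per_cycle, simd_doubles):
    # Each core issues `fma_issues_per_cycle` FMA instructions per cycle;
    # each FMA performs 2 floating-point operations on every SIMD lane.
    return cores * base_ghz * fma_issues_per_cycle * 2 * simd_doubles
```

For example, the Haswell node gives 8 × 2.6 × 2 × 2 × 4 = 332.8 GFLOP/s, and the Skylake node 20 × 2.4 × 2 × 2 × 8 = 1536.0 GFLOP/s, matching Table 1.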
For the benefit of reproducibility, we have archived the specific versions of the Firedrake components used for experimental evaluation on Zenodo (Zenodo/Firedrake, 2019). An installation of Firedrake with components matching the ones used for evaluation in this paper can be obtained by following the instructions at https://www.firedrakeproject.org/download.html, with the following command:
The evaluation framework is archived at (Sun, 2019b).
Table 1. Details of the two experimental platforms.
  Haswell Xeon E5-2640 v3  Skylake Xeon Gold 6148
Base frequency  2.6 GHz  2.4 GHz
Physical cores  8  20
SIMD instruction set  AVX2  AVX-512
Doubles per SIMD vector  4  8
Cross-element vectorization batch size  4  8
FMA (fused multiply-add) units per core  2  2
FMA instructions issued per cycle  2  2
Peak performance, double precision (cores × base frequency × FMA issues per cycle × 2 FLOPs per FMA × doubles per SIMD vector)  332.8 GFLOP/s  1536.0 GFLOP/s
LINPACK performance, double precision (Intel LINPACK benchmark: https://software.intel.com/en-us/articles/intel-mkl-benchmarks-suite)  262.5 GFLOP/s  976.7 GFLOP/s
Memory bandwidth (STREAM triad benchmark, 2 threads per core)  38.5 GB/s  81.0 GB/s
GCC/Clang arch flag  -march=native  -march=native
ICC SIMD flags  -xCORE-AVX2  -xCORE-AVX512 -qopt-zmm-usage=high
Other compiler flags  -O3 -ffast-math -fopenmp  -O3 -ffast-math -fopenmp
Intel Turbo Boost  OFF  OFF
Table 2. Arithmetic intensity (AI) and speedup over the baseline on Haswell (H) and Skylake (S), for each operator, polynomial degree P and mesh type.
tri  quad  tet  hex
P  AI  D  Q  H  S  AI  D  Q  H  S  AI  D  Q  H  S  AI  D  Q  H  S
mass  1  1.2  3  3  0.7  1.0  2.5  2  2  1.2  1.6  2.7  4  4  0.9  0.9  6.6  2  2  1.1  2.0 
2  1.7  6  6  1.3  1.1  2.4  3  3  1.0  1.4  5.8  10  14  2.4  2.4  5.4  3  3  1.8  2.4  
3  3.0  10  12  2.7  2.2  2.7  4  4  0.7  1.0  8.6  20  24  0.9  1.6  4.9  4  4  1.4  1.4  
4  5.6  15  25  3.4  3.9  2.9  5  5  2.1  2.0  38.8  35  125  1.1  1.4  4.7  5  5  2.3  2.3  
5  7.5  21  36  1.2  1.9  3.0  6  6  2.4  2.5  54.9  56  216  0.8  0.9  4.8  6  6  2.4  2.5  
6  9.7  28  49  0.9  1.7  3.3  7  7  2.6  2.2  79.8  84  343  1.0  1.0  5.0  7  7  2.4  2.8  
helmholtz  1  1.8  3  3  1.0  1.3  5.6  2  2  2.1  2.8  3.8  4  4  1.3  1.3  16.1  2  2  2.1  3.2 
2  5.7  6  6  2.6  3.1  6.4  3  3  1.7  2.6  27.1  10  14  2.9  5.5  16.0  3  3  2.8  3.8  
3  9.6  10  12  2.8  4.5  7.1  4  4  0.9  1.3  37.2  20  24  1.6  3.2  15.3  4  4  1.9  2.7  
4  17.8  15  25  3.0  5.1  7.6  5  5  3.0  4.8  162.7  35  125  2.2  2.6  15.0  5  5  3.3  4.1  
5  23.3  21  36  2.0  3.1  7.9  6  6  3.0  4.7  226.0  56  216  1.4  1.8  15.2  6  6  3.3  4.2  
6  29.8  28  49  1.4  2.5  8.6  7  7  3.1  4.8  325.7  84  343  1.4  1.9  15.8  7  7  3.3  4.2  
laplacian  1  0.5  3  1  0.8  0.8  4.1  2  2  1.7  2.0  1.9  4  1  1.2  1.0  13.3  2  2  1.9  2.4 
2  2.7  6  3  1.7  2.1  5.4  3  3  1.6  2.1  10.2  10  4  2.1  3.5  12.7  3  3  2.6  3.4  
3  4.0  10  6  2.2  3.3  5.8  4  4  1.3  2.0  23.7  20  14  2.2  3.4  12.1  4  4  1.5  1.6  
4  6.9  15  12  2.7  4.6  6.1  5  5  2.8  4.0  31.1  35  24  2.9  3.3  12.0  5  5  2.7  3.3  
5  12.6  21  25  2.1  3.1  6.3  6  6  2.8  4.1  121.6  56  125  3.2  2.5  12.3  6  6  2.6  3.4  
6  16.8  28  36  1.9  2.8  6.8  7  7  2.7  4.3  185.4  84  216  2.8  2.3  12.9  7  7  2.6  3.9  
elasticity  1  0.5  3  1  0.8  1.0  5.2  2  2  1.8  2.2  1.9  4  1  1.2  1.0  16.3  2  2  2.0  2.3 
2  3.0  6  3  1.7  2.4  6.2  3  3  1.8  2.3  11.5  10  4  2.1  3.5  14.6  3  3  2.6  3.2  
3  4.4  10  6  2.3  3.4  6.5  4  4  1.3  2.1  25.3  20  14  2.1  3.4  13.5  4  4  1.5  1.6  
4  7.3  15  12  2.7  4.5  6.7  5  5  2.8  4.2  32.4  35  24  2.8  2.9  13.2  5  5  2.8  3.4  
5  13.1  21  25  2.1  3.2  6.9  6  6  2.8  4.2  124.7  56  125  3.1  2.5  13.4  6  6  2.6  3.5  
6  17.3  28  36  1.8  2.8  7.4  7  7  2.7  4.2  188.7  84  216  2.8  2.3  13.9  7  7  2.6  3.9  
hyperelasticity  
1  0.5  3  1  1.1  1.5  18.4  2  3  2.3  3.2  1.6  4  1  1.9  1.7  76.6  2  3  2.7  4.3  
2  9.8  6  6  2.8  4.5  22.2  3  5  1.6  2.3  62.4  10  14  3.4  6.1  79.1  3  5  2.2  3.7  
3  26.8  10  25  3.6  6.7  23.8  4  7  2.3  4.0  310.0  20  125  3.4  6.3  78.7  4  7  2.4  3.7  
4  40.7  15  49  3.7  7.2  25.1  5  9  3.6  6.2  593.2  35  343  3.5  4.2  79.5  5  9  3.7  5.9  
5  55.9  21  81  2.7  5.1  25.7  6  11  3.6  6.1  895.2  56  729  2.9  2.8  82.1  6  11  3.3  5.5  
6  74.3  28  121  2.4  4.3  27.4  7  13  3.7  6.1  1398.3  84  1331  2.6  2.7  85.8  7  13  3.4  5.7 
We measure the execution time of assembling the residuals of five operators: the mass matrix (“mass”), the Helmholtz equation (“helmholtz”), the vector Laplacian (“laplacian”), an elastic model (“elasticity”), and a hyperelastic model (“hyperelasticity”). The mathematical descriptions of the operators are detailed in the supplemental material. These operators stem from real-world applications and cover a wide range of complexity: the generated C code for the global assembly kernels exceeds hundreds of kilobytes for the hyperelasticity operator at high polynomial degrees.
We performed experiments on both 2D and 3D domains, with two types of mesh used in each case: triangles (“tri”) and quadrilaterals (“quad”) for 2D problems, and tetrahedra (“tet”) and hexahedra (“hex”) for 3D problems. The arithmetic intensities of the operators are listed in Table 2. The memory footprint is calculated assuming perfect caching; it is thus a lower bound, which results in an upper-bound estimate of the arithmetic intensity. The triangular and tetrahedral meshes use an affine coordinate transformation (requiring only one Jacobian evaluation per element). The quadrilateral and hexahedral meshes use a bilinear (trilinear) coordinate transformation (requiring Jacobian evaluation at every quadrature point), which usually results in higher arithmetic intensities at low orders. In Firedrake, tensor-product elements (McRae et al., 2016) benefit from optimizations such as sum factorization to achieve lower asymptotic algorithmic complexity. They are therefore more competitive for higher-order methods (Homolya et al., 2017).

We record the maximum execution time of the generated global assembly kernels across all MPI processes. This time does not include the time spent in synchronization and MPI data exchange for halo updates. Each experiment is run five times, and the average execution time is reported. Exclusive access to the compute nodes is ensured, and threads are pinned to individual logical cores. Startup costs such as code generation time and compilation time are excluded. We use automatic vectorization by GCC without batching, compiled with the same optimization flags listed earlier, as the baseline for comparison. Compared with the cross-element strategy, the baseline represents the out-of-the-box performance of compiler auto-vectorization for the local element kernel. We note that cross-element vectorization does not alter the algorithm of local assembly except for the vector expansion, as illustrated by Listing 2 and Listing 6. Consequently, the total number of floating-point operations remains the same, and the performance benefit from cross-element vectorization is composable with the operation-reduction optimizations performed by the form compiler on the local assembly kernels.
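The upper-bound arithmetic intensity estimate described above can be sketched as follows; the example numbers in the usage below are hypothetical, not taken from Table 2:

```python
def ai_upper_bound(flops_per_element, doubles_moved):
    # Perfect caching assumed: every double crosses the memory bus
    # exactly once, so 8 * doubles_moved bytes is a lower bound on
    # traffic and the resulting FLOPs/byte ratio an upper bound.
    return flops_per_element / (8.0 * doubles_moved)
```

For instance, a (hypothetical) kernel performing 480 FLOPs per element while moving 15 doubles would have an arithmetic intensity of at most 480 / 120 = 4 FLOPs/byte.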
4.2 Experimental results and discussion
Figures 2 to 5 show the performance of the helmholtz and elasticity operators on Haswell and Skylake, vectorized with OpenMP pragmas as described in Section 3.1, and with vector extensions as described in Section 3.2. We indicate the fraction of peak performance achieved on the left axis, and the fraction of the LINPACK benchmark performance on the right axis. Figures 6 and 7 compare the roofline models (Williams et al., 2009) of the baseline and of our cross-element vectorization implementation using GCC and vector extensions on Haswell and Skylake. The speedups achieved are also summarized in Table 2. We analyze the data in the following respects:
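For reference, the roofline bound underlying Figures 6 and 7 is a simple function of arithmetic intensity; the numbers in the usage note below are the Haswell peak and STREAM bandwidth from Table 1:

```python
def roofline(ai, peak_gflops, stream_bw_gbs):
    # Attainable GFLOP/s under the roofline model: memory-bound below
    # the ridge point (peak / bandwidth), compute-bound above it.
    return min(peak_gflops, ai * stream_bw_gbs)
```

On Haswell, a kernel with AI of 1 FLOP/byte is bounded by 1 × 38.5 = 38.5 GFLOP/s (memory-bound), while at AI of 100 the bound is the 332.8 GFLOP/s compute peak.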
4.2.1 Compiler comparison and vector extensions
When vectorizing with OpenMP pragmas, ICC gives the best performance for almost all test cases, followed by Clang, while GCC is significantly less competitive. The performance disparity is more pronounced on Skylake than on Haswell. However, when using vector extensions, Clang and GCC improve significantly and are able to match the performance of ICC on both Haswell and Skylake, whereas ICC performs similarly with OpenMP pragmas and with vector extensions.
We use the Intel Software Development Emulator (https://software.intel.com/en-us/articles/intel-software-development-emulator) to count the number of instructions executed at runtime for code generated by the different compilers. The data indicate that, although floating-point operations are fully vectorized by all compilers, GCC and Clang generate more load and store instructions between vector registers and memory when using OpenMP pragmas for vectorization. One possible reason is that GCC and Clang allocate short arrays on the stack rather than in vector registers directly, placing more load on the memory subsystem.
In light of these results, we conclude that vectorization with vector extensions provides greater performance portability across compilers and CPUs for our application. It is therefore our preferred strategy for implementing cross-element vectorization, and is the default option for the rest of our analysis.
4.2.2 Vectorization speedup
Significant speedups are achieved on almost all of the test cases under consideration. Slowdowns occur in two situations. At low polynomial degrees, the kernels tend to have low arithmetic intensity, so the increase in available floating-point bandwidth through cross-element vectorization cannot compensate for the increase in the size of the working set. For simple operators such as mass on tri and tet, the kernels have simple loop structures, and the compilers can sometimes successfully apply other optimizations such as unrolling and loop interchange to achieve vectorization without batching elements in the outer loop. The pattern of speedups is consistent across Haswell and Skylake. Higher speedups are generally achieved on more complicated operators (e.g. hyperelasticity) and on tensor-product elements (quad and hex), which generally correspond to more complicated loop structures and higher arithmetic intensity due to the Jacobian recomputation at each quadrature point.
4.2.3 Achieved fraction of peak performance
We observe that the fraction of peak performance varies smoothly with polynomial degree for cross-element vectorization in all test cases. This fulfils an important design requirement for Firedrake: small changes in problem setup by the users should not cause unexpected performance degradation. This is also visible in Figures 6 and 7, where the results cluster more tightly on the roofline plots after cross-element vectorization. The baseline, by contrast, shows inconsistent performance, especially at low polynomial degrees. For instance, for the helmholtz operator of degree 3 on quad, the quadrature loops and the basis function loops all have trip counts of 4, which matches the vector length on Haswell and results in better performance.
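For reference when reading the roofline plots, the model bounds the attainable floating-point throughput of a kernel by its arithmetic intensity:

```latex
P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \times B\right)
```

where $P_{\text{peak}}$ is the peak floating-point performance, $I$ is the arithmetic intensity in flop/byte, and $B$ is the memory bandwidth; kernels that sit under the memory-bound slope of this roof cannot benefit fully from wider vectors.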
On simplicial meshes (tri and tet), higher-order discretization leads to kernels with very high arithmetic intensity because of the quadratic and cubic growth in the number of basis functions, and thus in the loop trip counts. This is due to the current limitation that simplicial elements in Firedrake are not sum-factorized. In these test cases, we observe that the baseline performance approaches that of cross-element vectorization for sufficiently high polynomial degrees. This is not a serious concern for our optimization approach because the break-even degrees are very high except for simple operators such as mass, and ultimately tensor-product elements are more competitive for higher-order methods in terms of algorithmic complexity.
We also observe a small number of test cases where the achieved performance is marginally higher than that of the LINPACK benchmark on Skylake, as shown in Figure 7. One possible explanation is thermal throttling, since our test cases typically run for a shorter period of time than LINPACK. We also note that these test cases correspond to high-order hyperelasticity operators on tet, which are not practically important use cases, since tensor-product elements require far fewer floating-point operations at the same polynomial degree.
4.2.4 Tensorproduct elements
We observe higher and more consistent speedups for tensor-product elements (quad and hex) on both Haswell and Skylake. On these meshes, sum factorization moves more computation out of the innermost loop, which produces loop nests that are more challenging for the baseline strategy of vectorizing within the element kernel. The same applies to the evaluation of the Jacobian of the coordinate transformation, which becomes a nested loop over quadrature points after sum factorization for tensor-product elements.
The base elements of quad and hex are 1D interval elements, so the extents of the loops over degrees of freedom increase only linearly with polynomial degree, as shown in Table 2. As a result, the baseline performance does not improve as quickly at higher polynomial degrees on quad and hex as it does on tri and tet, which explains the stable speedups observed for cross-element vectorization on tensor-product elements.
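To make the role of sum factorization concrete, the following sketch (illustrative only, not the code generated by Firedrake; the sizes N and NQ1 and the function names are hypothetical) evaluates a field on a quadrilateral from its tensor-product coefficients, first naively and then via two 1D contractions:

```c
#include <assert.h>
#include <math.h>

enum { N = 4, NQ1 = 5 };  /* 1D basis functions (degree 3) and 1D quadrature points */

/* Naive evaluation of u[qx][qy] = sum_ij A1[qx][i] A1[qy][j] c[i][j] at all
   NQ1 x NQ1 quadrature points: O(NQ1^2 * N^2) operations. */
static void eval_naive(const double A1[NQ1][N], const double c[N][N],
                       double u[NQ1][NQ1])
{
    for (int qx = 0; qx < NQ1; qx++)
        for (int qy = 0; qy < NQ1; qy++) {
            u[qx][qy] = 0.0;
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    u[qx][qy] += A1[qx][i] * A1[qy][j] * c[i][j];
        }
}

/* Sum-factorized evaluation: two 1D contractions through the temporary t,
   costing O(NQ1 * N^2) + O(NQ1^2 * N) operations. The first contraction is
   hoisted entirely out of the final loop nest over quadrature points. */
static void eval_sumfact(const double A1[NQ1][N], const double c[N][N],
                         double u[NQ1][NQ1])
{
    double t[NQ1][N];
    for (int qx = 0; qx < NQ1; qx++)      /* t[qx][j] = sum_i A1[qx][i] c[i][j] */
        for (int j = 0; j < N; j++) {
            t[qx][j] = 0.0;
            for (int i = 0; i < N; i++)
                t[qx][j] += A1[qx][i] * c[i][j];
        }
    for (int qx = 0; qx < NQ1; qx++)      /* u[qx][qy] = sum_j A1[qy][j] t[qx][j] */
        for (int qy = 0; qy < NQ1; qy++) {
            u[qx][qy] = 0.0;
            for (int j = 0; j < N; j++)
                u[qx][qy] += A1[qy][j] * t[qx][j];
        }
}
```

The hoisted contraction is exactly the kind of loop structure that frustrates intra-kernel vectorization of the baseline but is unproblematic for cross-element batching.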
5 Conclusion and future work
We have presented a portable, general-purpose solution for delivering consistent vectorization performance on modern CPUs for matrix-free finite element assembly, covering a very broad class of finite element operators on a large range of elements and polynomial degrees. We described the implementation of cross-element vectorization in Firedrake, which is transparent to end users. Although the technique of cross-element vectorization is conceptually simple and has been applied in hand-written kernels before, our implementation based on code generation is automatic, robust, and composable with other optimization passes.
The write-back to the global data structure is not vectorized in our approach due to possible race conditions. The Conflict Detection instructions newly introduced in the Intel AVX-512 instruction set could mitigate this limitation (Zhang, 2016, Section 2.3). This could be achieved by instructing Loopy to use the relevant intrinsics when generating code for loops with specific tags.
We have focused on the matrix-free finite element method because it is compute-intensive and more likely to benefit from vectorization. However, our methods and implementation also support matrix assembly. Firedrake relies on PETSc (Balay et al., 2017) to handle distributed sparse matrices, and PETSc requires certain data layouts for the input array when updating the global matrices. When several elements are batched together for cross-element vectorization, we need to generate code that explicitly unpacks and transposes the local assembly results into individual arrays before calling PETSc functions to update the global sparse matrices for each element. Future improvements could include eliminating this overhead, possibly by extending the PETSc API.
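The unpacking step can be sketched as follows (a hypothetical layout and helper function, not Firedrake's actual generated code): with the batched layout storing the vl lanes of each matrix entry contiguously, one element's local matrix is gathered into the dense row-major array that a call such as PETSc's MatSetValues expects:

```c
#include <assert.h>

/* Hypothetical batched layout: A_batch[(i*ndof + j)*vl + b] holds entry
   (i, j) of the local matrix of element b within a batch of vl elements.
   Gather element b into a contiguous row-major ndof x ndof array A_elem,
   the layout expected when passing local values to the sparse matrix
   library (e.g. PETSc's MatSetValues). */
static void unpack_element(int ndof, int vl, const double *A_batch,
                           int b, double *A_elem)
{
    for (int i = 0; i < ndof; i++)
        for (int j = 0; j < ndof; j++)
            A_elem[i * ndof + j] = A_batch[(i * ndof + j) * vl + b];
}
```

The generated code performs one such gather per batched element before the corresponding global matrix updates; eliminating this extra data movement is the overhead referred to above.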
The newly introduced abstraction layer, together with the Loopy integration in the code generation and optimization pipeline, opens up multiple possibilities for future research in Firedrake. These include code generation with intrinsic instructions, loop tiling, and GPU acceleration, all of which are already supported in Loopy.
The authors would like to thank Tobias Grosser, Richard Vera, J. Ramanujam and P. Sadayappan for their valuable insights during our discussions which started at Dagstuhl Seminar 18111 on Loop Optimization. The authors are grateful to James Cownie and Andrew Mollinson at Intel Corp. for providing access to the Skylake platform.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
This work was supported by the Engineering and Physical Sciences Research Council [grant numbers EP/L016796/1, EP/R029423/1], and the Natural Environment Research Council [grant number NE/K008951/1]. It was further funded by the US Navy Office of Naval Research under grant number N00014-14-1-0117 and the US National Science Foundation under grant number CCF-1524433. AK gratefully acknowledges a hardware gift from Nvidia Corporation.
Here we describe the operators used as the test cases for experimental evaluation. They are defined as bilinear forms, and we take their action in UFL to obtain the corresponding linear forms.
- mass

\[ a(u, v) = \int_\Omega u \, v \,\mathrm{d}x \tag{3} \]

Here $u$ and $v$ are scalar-valued trial and test functions.

- helmholtz

\[ a(u, v) = \int_\Omega \left( \nabla u \cdot \nabla v + u \, v \right) \mathrm{d}x \tag{4} \]

Here $u$ and $v$ are scalar-valued trial and test functions.

- laplacian

\[ a(u, v) = \int_\Omega \nabla u : \nabla v \,\mathrm{d}x \tag{5} \]

Here $u$ and $v$ are vector-valued trial and test functions.

- elasticity

The linear elasticity model solves for a displacement vector field. Here $u$ and $v$ are vector-valued trial and test functions, and $\epsilon(u) = \frac{1}{2}\left(\nabla u + \nabla u^{T}\right)$ is the symmetric strain rate tensor. The bilinear form is defined as:

\[ a(u, v) = \int_\Omega \epsilon(u) : \epsilon(v) \,\mathrm{d}x \tag{6} \]

- hyperelasticity

In this simple hyperelastic model, we define the strain energy function over the vector field $u$:

\[ \Psi(u) = \frac{\lambda}{2} \left[ \operatorname{tr}(E) \right]^{2} + \mu \operatorname{tr}\!\left(E^{2}\right), \tag{7} \]

where $I$ is the identity matrix, $\lambda$ and $\mu$ are the Lamé parameters of the material, $F = I + \nabla u$ is the deformation gradient, $C = F^{T} F$ is the right Cauchy-Green tensor, and $E = \frac{1}{2}(C - I)$ is the Euler-Lagrange strain tensor. We define the Piola-Kirchhoff stress tensors as:

\[ S = \frac{\partial \Psi}{\partial E} = \lambda \operatorname{tr}(E) \, I + 2 \mu E, \qquad P = F S. \tag{8} \]

Finally, we arrive at the residual form of this nonlinear problem:

\[ r(u; v) = \int_\Omega \left( P : \nabla v - B \cdot v \right) \mathrm{d}x = 0, \tag{9} \]

where $B$ is the external forcing. To solve this nonlinear problem, we need to linearize the residual form at an approximate solution $u_0$, which gives us the bilinear form $a(u, v)$:

\[ a(u, v) = \left. \frac{\mathrm{d}}{\mathrm{d}s} \, r(u_0 + s u;\, v) \right|_{s=0}, \tag{10} \]

where the trial function is $u$, the test function is $v$, and $u_0$ is a coefficient of the operator. We use the automatic differentiation of UFL to compute the operator symbolically.
Notes
 ^{1} Working paper
 ^{2} https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
 ^{3} Fused multiply-add operations.
 ^{4} Calculated as
 ^{5} Intel LINPACK Benchmark. https://software.intel.com/en-us/articles/intel-mkl-benchmarks-suite
 ^{6} STREAM triad benchmark, 2 threads per core.
 ^{7} https://software.intel.com/en-us/articles/intel-software-development-emulator
References
 Alnæs et al. (2014) Alnæs MS, Logg A, Ølgaard KB, Rognes ME and Wells GN (2014) Unified Form Language: A domain-specific language for weak formulations of partial differential equations. ACM Trans. Math. Softw. 40(2): 9:1–9:37. DOI:10.1145/2566630.
 Balay et al. (2017) Balay S, Abhyankar S, Adams M, Brown J, Brune P, Buschelman K, Dalcin LD, Eijkhout V, Gropp W, Kaushik D and others (2017) PETSc users manual revision 3.8. Technical report, Argonne National Laboratory (ANL), Argonne, IL (United States).
 Fog (2017) Fog A (2017) VCL — A C++ Vector Class Library. URL https://www.agner.org/optimize/vectorclass.pdf. Retrieved March 19, 2019.
 Hecht (2012) Hecht F (2012) New development in FreeFEM++. Journal of Numerical Mathematics 20(3-4): 251–266.
 Homolya et al. (2017) Homolya M, Kirby RC and Ham DA (2017) Exposing and exploiting structure: optimal code generation for high-order finite element methods. ArXiv: 1711.02473 [cs.MS].
 Homolya et al. (2018) Homolya M, Mitchell L, Luporini F and Ham DA (2018) TSFC: a structure-preserving form compiler. SIAM Journal on Scientific Computing 40(3): C401–C428. DOI:10.1137/17M1130642.
 Kempf et al. (2018) Kempf D, Heß R, Müthing S and Bastian P (2018) Automatic Code Generation for High-Performance Discontinuous Galerkin Methods on Modern Architectures. arXiv preprint arXiv:1812.08075.
 Kirby and Mitchell (2018) Kirby RC and Mitchell L (2018) Solver composition across the PDE/linear algebra barrier. SIAM Journal on Scientific Computing 40(1): C76–C98. DOI:10.1137/17M1133208.
 Klöckner (2014) Klöckner A (2014) Loo.py: transformation-based code generation for GPUs and CPUs. In: Proceedings of the ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming. ACM, p. 82.
 Knepley and Terrel (2013) Knepley MG and Terrel AR (2013) Finite element integration on GPUs. ACM Transactions on Mathematical Software (TOMS) 39(2): 10.
 Kronbichler and Kormann (2017) Kronbichler M and Kormann K (2017) Fast matrix-free evaluation of discontinuous Galerkin finite element operators. arXiv preprint arXiv:1711.03590.
 Logg et al. (2012) Logg A, Mardal KA and Wells GN (eds.) (2012) Automated Solution of Differential Equations by the Finite Element Method: the FEniCS Book. Springer. ISBN 9783642230981. DOI:10.1007/9783642230998.
 Luporini et al. (2017) Luporini F, Ham DA and Kelly PHJ (2017) An algorithm for the optimization of finite element integration loops. ACM Transactions on Mathematical Software (TOMS) 44(1): 3.
 Luporini et al. (2015) Luporini F, Varbanescu AL, Rathgeber F, Bercea GT, Ramanujam J, Ham DA and Kelly PHJ (2015) Cross-loop optimization of arithmetic intensity for finite element local assembly. ACM Transactions on Architecture and Code Optimization (TACO) 11(4): 57.
 McRae et al. (2016) McRae ATT, Bercea GT, Mitchell L, Ham DA and Cotter CJ (2016) Automated generation and symbolic manipulation of tensor product finite elements. SIAM Journal on Scientific Computing 38(5): S25–S47. DOI:10.1137/15M1021167.
 Müthing et al. (2017) Müthing S, Piatkowski M and Bastian P (2017) High-performance Implementation of Matrix-free High-order Discontinuous Galerkin Methods. arXiv preprint arXiv:1711.10885.
 Rathgeber et al. (2016) Rathgeber F, Ham DA, Mitchell L, Lange M, Luporini F, McRae ATT, Bercea GT, Markall GR and Kelly PHJ (2016) Firedrake: automating the finite element method by composing abstractions. ACM Transactions on Mathematical Software 43(3): 24:1–24:27. DOI:10.1145/2998441.
 Rathgeber et al. (2012) Rathgeber F, Markall GR, Mitchell L, Loriant N, Ham DA, Bertolli C and Kelly PHJ (2012) PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes. In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. IEEE. ISBN 9780769549569, pp. 1116–1123. DOI:10.1109/SC.Companion.2012.134.
 Sun (2019a) Sun T (2019a) Cross-element vectorization in Firedrake. https://www.codeocean.com/. DOI:10.24433/CO.8386435.v1.
 Sun (2019b) Sun T (2019b) tjsun/firedrakevectorization: Scripts for experimental evaluation for the manuscript on crosselement vectorization. DOI:10.5281/zenodo.2590705. URL https://doi.org/10.5281/zenodo.2590705.
 Verdoolaege (2010) Verdoolaege S (2010) isl: An integer set library for the polyhedral model. In: International Congress on Mathematical Software. Springer, pp. 299–302.
 Williams et al. (2009) Williams S, Waterman A and Patterson D (2009) Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM 52(4): 65–76. URL http://dl.acm.org/citation.cfm?id=1498785.
 Zenodo/Firedrake (2019) Zenodo/Firedrake (2019) Software used in 'A study of vectorization for matrix-free finite element methods'. DOI:10.5281/zenodo.2595487. URL https://doi.org/10.5281/zenodo.2595487.
 Zhang (2016) Zhang B (2016) Guide to automatic vectorization with Intel AVX-512 instructions in Knights Landing processors. URL https://colfaxresearch.com/knl-avx512. Retrieved March 19, 2019.