Log In Sign Up

A study of vectorization for matrix-free finite element methods

by   Tianjiao Sun, et al.
Imperial College London

Vectorization is increasingly important to achieve high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses difficulties to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while generating efficient, vectorized code for the finite element method is a challenge for numerical software systems and libraries. In this work, we study cross-element vectorization in the finite element framework Firedrake via code transformation and demonstrate the efficacy of such an approach by evaluating a wide range of matrix-free operators spanning different polynomial degrees and discretizations on two recent CPUs using three mainstream compilers. Our experiments show that our approaches for cross-element vectorization achieve 30% of theoretical peak performance for many examples of practical significance, and exceed 50% for cases with high arithmetic intensities, with consistent speed-up over (intra-element) vectorization restricted to the local assembly kernels.


GPU Accelerated Finite Element Assembly with Runtime Compilation

In recent years, high performance scientific computing on graphics proce...

Enriched finite element approach for modeling discontinuous electric field in multimaterial problems

This work is devoted to the development of an efficient and robust techn...

Assembly of stiffness matrices via atomics

Finite element methods require the composition of the global stiffness m...

nlfem: A flexible 2d Fem Code for Nonlocal Convection-Diffusion and Mechanics

In this work we present the mathematical foundation of an assembly code ...

Code generation for productive portable scalable finite element simulation in Firedrake

Creating scalable, high performance PDE-based simulations requires a sui...

Acceleration of tensor-product operations for high-order finite element methods

This paper is devoted to GPU kernel optimization and performance analysi...

1 Introduction

The realization of efficient solution procedures for partial differential equations (PDEs) using finite element methods on modern computer systems requires the combination of diverse skills across mathematics, programming languages and high-performance computing. Automated code generation is one of the promising approaches to manage this complexity. It has been increasingly adopted in software systems and libraries. Recent successful examples include FEniCS 

(Logg et al., 2012), Firedrake (Rathgeber et al., 2016) and FreeFem++ (Hecht, 2012). These software packages provide users with high-level interfaces for high productivity while relying on optimizations and transformations in the code generation pipeline to generate efficient low-level code. The challenge, as in all compilers, is to use appropriate abstraction layers that enable optimizations to be applied that achieve high performance on a broad set of programs and machines.

One particular challenge for generating high-performance code on modern hardware is vectorization. Modern CPUs increasingly rely on SIMD instructions to achieve higher throughput and better energy efficiency. Finite element computation requires the assembly of vectors and matrices which represent differential forms on discretized function spaces. This process consists of applying a local function, often called an element kernel

, to each mesh entity, and incrementing the global data structure with the local contribution. Typical local assembly kernels suffer from issues that can preclude effective vectorization. These issues include complicated loop structures, poor data access patterns, and short loop trip counts that are not multiples of the vector width. As we show in this paper, general purpose compilers perform poorly in generating efficient, vectorized code for such kernels. Padding and data layout transformations are required to enable the vectorization of the element kernels 

(Luporini et al., 2015), but the effectiveness of such approaches is not consistent across different examples. Since padding may also result in larger overheads for wider vector architectures, new strategies are needed as vector width increases for the new generation of hardware.

Matrix-free methods avoid building large sparse matrices in applications of the finite element method and thus trade computation for storage. They have become popular for use on modern hardware due to their higher arithmetic intensity (defined as the number of floating-point operations per byte of data transfer). Vectorization is particularly important for computationally intensive high order methods, for which matrix-free methods are often applied. Previous works on improving vectorization of matrix-free operator application, or equivalently, residual evaluation, mostly focus on exposing library interfaces to the users. Kronbichler and Kormann (2017) first perform a change of basis from nodal points to quadrature points, and provide overloaded SIMD types for users to write a quadrature-point-wise expression for residual evaluation. However, since the transformation is done manually, new operators require manual reimplementation. Knepley and Terrel (2013) also transpose to quadrature-point basis but target GPUs instead. Both works vectorize by grouping elements into batches, either to match the SIMD vector length in CPUs or the shared memory capacity on GPUs. In contrast, Müthing et al. (2017) apply an intra-kernel vectorization strategy and exploit the fact that in 3D, evaluating both a scalar field and its three derivatives fills the four lanes of an AVX2 vector register. More recently, Kempf et al. (2018) target high order Discontinuous Galerkin (DG) methods on hexahedral meshes using automated code generation to search for vectorization strategies, while taking advantage of the specific memory layout of the data.

In this work, we present a generic and portable solution based on cross-element vectorization. Our vectorization strategy, implemented in Firedrake, is similar to that of Kronbichler and Kormann (2017) but is fully automated through code generation like that of Kempf et al. (2018). We extend the scope of code generation in Firedrake to incorporate the outer iteration over mesh entities and leverage Loopy (Klöckner, 2014), a loop code generator based loosely on the polyhedral model, to systematically apply a sequence of transformations which promote vectorization by grouping mesh entities into batches so that each SIMD lane operates on one entity independently. This automated code generation mechanism enables us to explore the effectiveness of our techniques on operators spanning a wide range of complexity and systematically evaluate our methodology. Compared with an intra-kernel vectorization strategy, this approach is conceptually well-defined, more portable, and produces more predictable performance. Our experimental evaluation demonstrates that the approach consistently achieves a high fraction of hardware peak performance while being fully transparent to end users.

The contributions of this work are as follows:

  • We present the design of a code transformation pipeline that permits the generation of high-performance, vectorized code on a broad class of FEM models.

  • We provide a thorough evaluation of our code generation strategy and demonstrate that it achieves a substantial fraction of theoretical peak performance across a broad range of test cases.

The rest of this paper is arranged as follows. After reviewing the preliminaries of code generation for the finite element method in Section 2, we describe our implementation of cross-element vectorization in Firedrake in Section 3. In Section 4, we demonstrate the effectiveness of our approach with experimental results. Finally, we review our contributions and identify future research priorities in Section 5.

2 Preliminaries

The computation of multilinear forms using the basis functions spanning the discretized function spaces is called finite element assembly. When applying the matrix-free methods, one only needs to assemble linear forms, or residual forms, because matrix-vector products are essentially the assembly of linear forms which represent the actions of bilinear forms. Optimizing linear form assembly is therefore crucial for improving the performance of matrix-free methods. In Firedrake, one can invoke the matrix-free approach without changing the high-level problem formulation by setting solver options as detailed by Kirby and Mitchell (2018).

The general structure of a linear form is


where , are arbitrary coefficient functions, and is the test function. is linear with respect to , but possibly nonlinear with respect to the coefficient functions.

Let be the set of basis functions spanning . Define , then the assembly of constitutes the computation of the vector . In Firedrake, this is treated as a two-step process: local assembly and global assembly.

2.1 Local assembly

Local assembly of linear forms is the evaluation of the integrals as defined by the weak form of the differential equation on each entity (cell or facet) of the mesh. In Firedrake, the users define the problem in Unified Form Language (UFL) (Alnæs et al., 2014) which captures the weak form and the function space discretization. Then the Two-Stage Form Compiler (TSFC) (Homolya et al., 2018) takes this high-level, mathematical description and generates efficient C code. As an example, consider the linear form of the weak form of the positive-definite Helmholtz operator:

1mesh = UnitSquareMesh(10, 10)
2V = FunctionSpace(mesh, "Lagrange", 1)
3v = TestFunction(V)
4u = Function(V)
5L = (dot(grad(u), grad(v)) + u*v) * dx
6result = assemble(L)
Listing 1: Assembling the linear form of the Helmholtz operator in UFL.
2static inline void helmholtz(double A[3], const double *restrict coords , const double *restrict w_0)
4  static const double t0[3][3]  = {{ ... }};
5  static const double  t9[3]  = { ... };
7  double  t11  = 0.0;
8  double  t1  = ((double)(-1) * coords[0]);
9  double  t2  = (t1 + coords[2]);
10  double  t3  = ((double)(-1) * coords[1]);
11  double  t4  = (t3 + coords[5]);
12  double  t5  = (t1 + coords[4]);
13  double  t6  = (t3 + coords[3]);
14  double  t7  = ((t2 * t4) + ((double)(-1) * (t5 * t6)));
15  double  t8  = fabs(t7);
17  double t13[3];
18  for (int  j  = 0; j < 3; j += 1)
19    t13[j] = 0.0;
21  for (int  ip  = 0; ip < 3; ip += 1)
22  {
23    double  t10  = (t9[ip] * t8);
24    t11 += t10;
25    double  t12  = (t10 * (((t0[ip][0] * w_0[0]) + (t0[ip][1] * w_0[1])) + (t0[ip][2] * w_0[2])));
26    for (int  j  = 0; j < 3; j += 1)
27      t13[j] += t0[ip][j] * t12;
28  }
29  double  t14  = ((double)(1) / t7);
30  double  t15  = (t4 * t14);
31  double  t16  = ((double)(-1) * w_0[0]);
32  double  t17  = (t16 + w_0[1]);
33  double  t18  = (((double)(-1) * t6) * t14);
34  double  t19  = (t16 + w_0[2]);
35  double  t20  = ((t15 * t17) + (t18 * t19));
36  double  t21  = (((double)(-1) * t5) * t14);
37  double  t22  = (t2 * t14);
38  double  t23  = ((t21 * t17) + (t22 * t19));
39  double  t24  = (((t20 * t18) + (t23 * t22)) * t11);
40  static const double  t25[4]  = {-1.0, 0.0, 1.0};
41  double  t26  = (((t20 * t15) + (t23 * t21)) * t11);
42  static const double  t27[4]  = {-1.0, 1.0, 0.0};
44  for (int  j  = 0; j < 3; j += 1)
45    A[j] += (t13[j] + (t27[j] * t26)) + (t25[j] * t24);
Listing 2: Local assembly kernel the linear form of the Helmholtz operator in C.

Listing 1 shows the UFL syntax to assemble the linear form as the vector result, on a triangulation of a unit square. We choose to use the first-order Lagrange element as our approximation space. Listing 2 shows a C representation of this kernel generated by TSFC. We note the following key features of this element kernel:

  • The kernel takes three array arguments in this case: coords holds the coordinates of the current triangle, w_0 holds , the coefficients of , and A stores the result.

  • The first part of the kernel (line 7 to line 15) computes the inverse and the determinant of the Jacobian for the coordinate transformation from the reference element to the current element. This is required for pulling back the differential forms to the reference element. The Jacobian is constant for each triangle because the coordinate transformation is affine in this case. Otherwise, the Jacobian needs to be computed at each quadrature point.

  • The constant arrays t0, t9 are the same for all elements. t0 represents the tabulation of the evaluation of basis functions at quadrature points, t9 represents the quadrature weights.

  • The ip loop iterates over the quadrature points, evaluating the integrand in (2) and summing to approximate the integral. The j

    loops iterate over the degrees of freedom, once inside the quadrature loop, and once upon output to the assembled array

    A. The extents of these loops depend on the integrals performed and the choice of function spaces respectively.

  • TSFC performs optimization passes on the loop nests. In particular, it applies loop-invariant code motion which pulls invariant expression out of the loop nests into temporary arrays. This reduces the number of operations required while changing the structure of otherwise perfectly nested loops.

2.2 Global assembly

1static inline void helmholtz(double A[3], const double *restrict coords , const double *restrict w_0)
3  // ... element kernel as defined previously ... //
6void wrap_helmholtz(int const start, int const end, double *__restrict__ dat0, double const *__restrict__ dat1, double const *__restrict__ dat2, int const *__restrict__ map0)
8  double t2[3];
9  double t3[3 * 2];
10  double t4[3];
12  for (int n = start; n <= -1 + end; ++n)
13  {
14    for (int i5 = 0; i5 <= 2; ++i5)
15    {
16      t4[i5] = dat2[map0[3 * n + i5]];
17      for (int i6 = 0; i6 <= 1; ++i6)
18        t3[2 * i5 + i6] = dat1[2 * map0[3 * n + i5] + i6];
19    }
20    for (int i1 = 0; i1 <= 2; ++i1)
21      t2[i1] = 0.0;
23    helmholtz(t2, t3, t4);
25    for (int i15 = 0; i15 <= 2; ++i15)
26      dat0[map0[3 * n + i15]] += t2[i15];
27  }
Listing 3: Global assembly code for action of the Helmholtz operator in C.

During global assembly, the local contribution from each mesh entity, computed by the element kernel, is accumulated into the global data structure. In Firedrake, PyOP2 (Rathgeber et al., 2012) is responsible for representing and realizing the iteration over mesh entities, marshalling data in and out of the element kernels. The computation is organized as PyOP2 parallel loops, or parloops. A parloop specifies a computational kernel, a set of mesh entities to which the kernel is applied, and all data required for the kernel. The data objects could be directly defined on the mesh entities, or indirectly access through maps from the mesh entities. For instance, the signature for the global assembly of the Helmholtz operator is:

parloop(helmholtz, cells, r(cell2vert, RW),
        coords(cell2vert, R), x(cell2vert, R)).

Here helmholtz is the element kernel as shown in Listing 2, generated by TSFC; cells is the set of all triangles in the mesh; r, coords and x are the global data objects that are needed to create the arguments for the element kernel, where r holds the result vector, coords holds the coordinates of the vertices of the triangles which are needed for computing the Jacobian, and x holds the vector representation of function (as weights of basis functions). These global data objects correspond to the kernel arguments A, coords and w_0 respectively. The map cell2vert provides indirection from mesh entities to the global data objects, and each data argument is annotated with an access descriptor (R for read-only, RW for read-write access). In this example, all three arguments share the same map because first-order Lagrange element on triangles only have degrees-of-freedom defined on the vertices, while the coordinate fields are also defined on the vertices.

Listing 3 shows the C code generated by PyOP2 for the above example. The code is then JIT-compiled when the result is needed in Firedrake. In the context of vectorization, this approach, with the inlined element kernel, forms the baseline in our experimental evaluation. We note the following key features of the global assembly kernel:

  • The outer loop is over mesh entities.

  • For each entity, the computation can be divided into three parts: gathering the input data from global data structures (t3 and t4 in this case, which correspond to kernel arguments coords and w_0), calling the local assembly kernel, scattering the output data (t2) to the global data structure.

  • The gathering and scattering of data make use of indirect addressing via base pointers (dats) and indices (maps).

  • Different mesh entities might share the same degrees of freedom.

  • Global assembly interacts with local assembly via a function call (Line 23). This call can be inlined by the compiler, but it creates an artificial boundary for loop transformations at the source code level. This is the software engineering challenge that limits vectorization to a single local assembly kernel previously.

3 Vectorization

As one would expect, the loop nests and loop trip counts vary considerably for different integrals, meshes and function spaces that users might choose. This complexity is one of the challenges that our system specifically, and Firedrake more generally, must face in order to deliver predictable performance on modern CPUs, which have increasingly rich SIMD instruction sets.

In the prior approach to vectorization in our framework, the local assembly kernels generated by TSFC are further transformed to facilitate vectorization, as described by Luporini et al. (2015). The arrays are padded so that the trip counts of the innermost loops match multiples of the length of SIMD units. However, padding becomes less effective for low polynomial degrees on wide SIMD units. For instance, AVX512 instructions act on 8 double-precision floats, but the loops for degree 1 polynomials on triangles only have trip counts of 3, as shown in Listing 2

. Moreover, loop-invariant code motion is very effective in reducing the number of floating-point operations, but hoisted instructions are not easily vectorized as they are no longer in the innermost loops. This effect is more pronounced on tensor-product elements where TSFC is able to apply

sum factorization (Homolya et al., 2017) to achieve better algorithmic complexity.

3.1 Cross-element vectorization and Loopy

Another strategy is to vectorize across several elements in the outer loop over the mesh entities, as proposed previously by Kronbichler and Kormann (2017). This approach computes the contributions from several mesh entities using SIMD instructions, where each SIMD lane handles one entity. This is always possible regardless of the complexity of the local element kernel because the computation on each entity is independent and identical. One potential downside is the increase in memory pressure as the working set is larger.

For a compiler, the difficulty in performing cross-element vectorization (or, more generally, outer-loop vectorization) is to automate a sequence of loop transformations and necessary data layout transformations robustly. This is further complicated by the indirect memory access in data gathering and scattering, and the need to unroll and interchange loops across the indirections, which requires significantly more semantic knowledge than what is available to the C compiler.

Loopy (Klöckner, 2014) is a loop generator embedded in Python which targets both CPUs and GPUs. Loopy provides abstractions based on integer sets for loop-based computations and enables powerful transformations based on the polyhedral model (Verdoolaege, 2010). Loop-based computations in Loopy are represented as Loopy kernels. A Loopy kernel is a subprogram consisting of a loop domain and a partially-ordered list of scalar assignments acting on multi-dimensional arrays. The loop domain is specified as the set of integral points in the convex intersection of quasi-affine constraints, as described by the Integer Set Library (Verdoolaege, 2010).

To integrate with Loopy, the code generation mechanisms in Firedrake were modified as illustrated in Figure 1. Instead of generating source code directly, TSFC and PyOP2 are modified to generate Loopy kernels. We have augmented the Loopy internal representation with the ability to support a generalized notion of kernel fusion through the nested composition of kernels, specifically through subprograms and inlining. This allows PyOP2 to inline the element kernel such that the global assembly Loopy kernel encapsulates the complete computation of global assembly. This holistic view of the overall computation enables robust loop transformations for vectorization across the boundary between global and local assembly.

Figure 1: Integration of Loopy in Firedrake for global assembly code generation.
1KERNEL: helmholtz
4start: type: int32
5end: type: int32
6dat0: type: float64, shape: (None)
7// ... More arguments ... //
10[end, start] -> { [n] : start <= n < end }
11{ [i2] : 0 <= i2 <= 2 }
12// ... More domains ... //
18t4: type: float64, shape: (3), dim_tags: (stride:1)
19// ... More temporaries ... //
22 for n, i2
23     t4[i2] = dat2[map0[n, i2]]
24   // ... More instructions ... //
25   for i15
26     dat0[map0[n, i15]] += t0[0, i15]
27 end n, i15
Listing 4: Global assembly Loopy kernel of the Helmholtz operator.

Listing 4 shows an abridged version of the global assembly Loopy kernel for the Helmholtz operator, with the element kernel fused. We highlight the following key features of Loopy kernels:

  • Loop indices, such as n, i1, are called inames in Loopy, which define the iteration space. The bounds of the loops are specified by the affine constraints in domains.

  • Loop transformations operate on kernels by rewriting the loop domain and the statements making up the kernel. In addition, each iname carries a set of tags governing its realization in generated code, perhaps as a sequential loop, as a vector lane index, or through unrolling.

  • Multi-dimensional arrays occur as arguments and temporaries. The memory layout of the data can be specified by assigning tags to the array dimensions.

  • Dependencies between statements specify their partial order. Statement scheduling can also be controlled by assigning priorities to statements and inames.

For example, to achieve cross-element vectorization (by batching 4 elements into one SIMD vector in this example) we invoke the following sequence of Loopy transformations on the global assembly Loopy kernel, exploiting the domain knowledge of finite element assembly:

  • Split the outer loop n over mesh entities into n_outer and n_simd, with n_simd having trip count of 4. The objective is to generate SIMD instructions for the n_simd loops, such that each vector lane computes one iteration of the n_simd loops.

  • Assign the tag SIMD to the new iname n_simd. This tag informs Loopy to force the n_simd loop to be innermost, privatizing data by vector-expansion if necessary.

1KERNEL: helmholtz_simd
4start: type: int32
5end: type: int32
6dat0: type: float64, shape: (None)
7// ... More arguments ... //
10[end, start] -> { [n_outer, n_simd] :
11  n_simd >= start - 4n_outer
12  and 0 <= n_simd <= 3
13  and n_simd < end - 4n_outer }
16n_simd: SIMD
19t4: type: float64, shape: (3, 4),
20          dim_tags: (stride:4, stride:1)
21// ... More temporaries ... //
24 for n_outer, n_simd, i2
25     t4[i2, n_simd] = dat2[map0[n_outer*4 + n_simd, i2]]
26   // ... More instructions ... //
27   for i15
28     dat0[map0[n_outer*4+n_simd, i15]] += t2[i7, n_simd]
29 end n_outer, n_simd, i15
Listing 5: Changes to global assembly Loopy kernel of the Helmholtz operator after cross-element vectorization
1// ... Constant array declarations ... //
3void wrap_helmholtz(int const start, int const end, double *__restrict__ dat0, double const *__restrict__ dat1, double const *__restrict__ dat2, int const *__restrict__ map0)
5  double form_t1[4] __attribute__ ((aligned (64)));
6  double t2[3 * 4] __attribute__ ((aligned (64)));
7  // ... More temporary array declarations ... //
8  for (int n_outer = (start / 4); n_outer <= ((-4 + end) / 4); ++n_outer) {
9    for (int i5 = 0; i5 <= 2; ++i5) {
10      for (int i6 = 0; i6 <= 1; ++i6) {
11        #pragma omp simd
12        for (int n_simd = 0; n_simd <= 3; ++n_simd)
13          t3[n_simd + 8 * i5 + 4 * i6] = dat1[2 * map0[12 * n_outer + 3 * n_simd + i5] + i6];
14      }
15      #pragma omp simd
16      for (int n_simd = 0; n_simd <= 3; ++n_simd)
17        t4[n_simd + 4 * i5] = dat2[map0[12 * n_outer + 3 * n_simd + i5]];
18    }
19    for (int i1 = 0; i1 <= 2; ++i1) {
20      #pragma omp simd
21      for (int n_simd = 0; n_simd <= 3; ++n_simd)
22        t2[n_simd + 4 * i1] = 0.0;
23    }
24    #pragma omp simd
25    for (int n_simd = 0; n_simd <= 3; ++n_simd) {
26      form_t11[n_simd] = 0.0;
27      form_t1[n_simd] = -1.0 * t3[n_simd];
28      // ... More similar instructions ... //
29      form_t8[n_simd] = fabs(form_t7[n_simd]);
30    }
31    // ... More similar loop nests ... //
32    for (int form_j_1 = 0; form_j_1 <= 2; ++form_j_1) {
33      #pragma omp simd
34      for (int n_simd = 0; n_simd <= 3; ++n_simd)
35        t2[n_simd + 4 * form_j_1] += form_t25[form_j_1] * form_t24[n_simd] + form_t13[n_simd + 4 * form_j_1] + form_t27[form_j_1] * form_t26[n_simd];
36    }
37    for (int i15 = 0; i15 <= 2; ++i15)
38      for (int n_simd = 0; n_simd <= 3; ++n_simd)
39        dat0[map0[12 * n_outer + 3 * n_simd + i15]] += t2[n_simd + 4 * i15];
40  }
Listing 6: Global assembly code for action of the Helmholtz operator in C vectorized by batching 4 elements.

We highlight the change to the Loopy kernel after these transformations in Listing 5. Loopy supports code generation for different environments from the same kernel by choosing different targets. We introduced an OpenMP Target to Loopy which extends its existing C-language Target to support OpenMP pragmas, facilitating SIMD instruction generation.

Listing 6 shows the generated C code for the Helmholtz operator vectorized by grouping together 4 elements. Apart from the previously mentioned changes, we note the following details:

  • The n_simd loops are pushed to the innermost level. Moreover, this transformation vector-expands temporary arrays such as t2, t3, t4 by 4, with the expanded dimension labeled as varying the fastest when viewed from (linear) system memory. This ensures their accesses in the n_simd loops always have unit stride.

  • Loopy provides a mechanism to declare arrays to be aligned to specified memory boundaries (64 bytes in this example).

  • The n_simd loops are decorated by pragma omp simd to inform C compilers to generate SIMD instructions. The exception is the writing back to the global array (Line 36), which is sequentialized due to potential race conditions, as different mesh entities could share the same degrees of freedom.

  • The remainder loop which handles the cases where the number of elements is non-divisible by 4 is omitted here for simplicity.

    After cross-element vectorization, all local assembly instructions (Lines 24–36) are inside the n_simd loops, which always have trip counts of 4 and are stride 1. All loop-varying array accesses are stride 1 in the fastest moving dimension. There are no loop-carried dependencies in n_simd loops. As a result, the n_simd loops, and therefore all local assembly instructions, are vectorizable without further consideration of dependencies. This is verified by checking the x86 assembly code and running the program with the Intel Software Development Emulator.

3.2 Vector extensions

A more direct way to inform the compiler to emit SIMD instructions without depending on OpenMP implementation is to use vector extensions22endnote: 2, which support vector data types. These were first introduced in the GNU compiler (GCC), but are also supported in recent versions of the Intel C compiler (ICC) and Clang. Analogous mechanisms exist in various vector-type libraries, e.g. VCL (Fog, 2017). To evaluate and compare with the directive-based approach from Section 3.1, we created a new code generation target in Loopy to support vector data types. When inames and corresponding array axes are jointly tagged as vector loops, Loopy generates code to compute on data in vector registers directly, instead of scalar loops over the vector lanes. It is worth noting that the initial intermediate representation of the loop was identical in each case, and that the different specializations were achieved through code transformation.

1typedef double double4 __attribute__ ((vector_size (32)));
2typedef int int4 __attribute__ ((vector_size (16)));
4static double4 const _zeros_double4 __attribute__ ((aligned (64))) = { 0.0 };
6// ... Constant array declarations ... //
8void wrap_form0_cell_integral_otherwise(int const start, int const end, double *__restrict__ dat0, double const *__restrict__ dat1, double const *__restrict__ dat2, int const *__restrict__ map0)
10  double4 form_t1 __attribute__ ((aligned (64)));
11  // ... Temporary array declarations ... //
13  for (int n_outer = (start / 4); n_outer <= ((-4 + end) / 4); ++n_outer) {
14    for (int i5 = 0; i5 <= 2; ++i5) {
15      for (int i6 = 0; i6 <= 1; ++i6) {
16        #pragma omp simd
17        for (int n_simd = 0; n_simd <= 3; ++n_simd)
18          t3[(2 * i5 + i6)][n_simd] = dat1[2 * map0[12 * n_outer + 3 * n_simd + i5] + i6];
19      }
20      #pragma omp simd
21      for (int n_simd = 0; n_simd <= 3; ++n_simd)
22        t4[i5][n_simd] = dat2[map0[12 * n_outer + 3 * n_simd + i5]];
23    }
24    for (int i1 = 0; i1 <= 2; ++i1)
25      t2[i1] = _zeros_double4;
27    form_t11 = _zeros_double4;
28    form_t1 = -1.0 * t3[0];
29    // ... More similar instructions ... //
30    #pragma omp simd
31    for (int n_simd = 0; n_simd <= 3; ++n_simd)
32      form_t8[n_simd] = fabs(form_t7[n_simd]);
33    // ... More similar instructions ... //
34    for (int form_j_1 = 0; form_j_1 <= 2; ++form_j_1)
35      t2[form_j_1] += form_t25[form_j_1] * form_t24 + form_t13[form_j_1] + form_t27[form_j_1] * form_t26;
37    for (int i15 = 0; i15 <= 2; ++i15)
38      for (int n_batch = 0; n_batch <= 3; ++n_batch)
39        dat0[map0[12 * n_outer + 3 * n_batch + i15]] += t2[i15][n_batch];
40  }
Listing 7: Global assembly code for action of the Helmholtz operator in C vectorized by 4 elements (using vector extensions).

Listing 7 shows the C code generated for the Helmholtz operator vectorized by batching 4 elements using the vector extension target. Here all vectorized (innermost) loops for local assembly are replaced by operations on vector variables. For instructions which do not fit the vector computation model, most noticeably the indirect data gathering (Line 18), or instructions containing built-in mathematics functions which are not supported on vector data types (Line 32), Loopy defaults to generating scalar loops over vector lanes, decorated with pragma omp simd. In addition, because vector extensions do not automatically broadcast scalars, any vector instruction with scalar right-hand-side is modified by adding the zero vector to the expression, as shown in Lines 25 and 27.

4 Performance Evaluation

We follow the performance evaluation methodology of Luporini et al. (2017) by measuring the assembly time of a range of operators of increasing complexity and polynomial degrees. Due to the large number of combinations of experimental parameters (operators, meshes, polynomial degrees, vectorization strategies, compilers, hyperthreading), we only report an illustrative portion of the results here, with the entire suite of experiments made available on the interactive online repository CodeOcean (Sun, 2019a).

4.1 Experimental setup

We performed experiments on a single node of two Intel systems, based on the Haswell and Skylake microarchitectures, as detailed in Table 1. Because we observe that hyperthreading usually improves the performance by 5% to 10% for our applications, we set the number of MPI processes to the number of logical cores of the CPU to utilize all available computation resources. Experimental results with hyperthreading turned off are available on CodeOcean. The batch size, i.e., the number of elements grouped together for vectorization, is chosen to be consistent with the SIMD length. We use three C compilers: GCC 7.3, ICC 18.0 and Clang 5.0. The two vectorization strategies described in Section 3 are tested on all platforms. We use the listed Base Frequency to calculate the peak performance in Table 1. In reality, modern Intel CPUs dynamically reduce frequencies on heavy workloads with AVX2 and AVX512 instructions, which results in lower achievable performance. Running the optimized LINPACK benchmark binary provided by Intel gives a reasonable indication of peak performance for real applications.

For the benefit of reproducibility, we have archived the specific versions of Firedrake components used for experimental evaluation on Zenodo (Zenodo/Firedrake, 2019). An installation of Firedrake with components matching the ones used for evaluation in this paper can be obtained following the instruction at, with the following command:

python3 firedrake-install --doi 10.5281/zenodo.2595487

The evaluation framework is archived at (Sun, 2019b).

Haswell Xeon E5-2640 v3 Skylake Xeon Gold 6148
Base frequency 2.6 GHz 2.4 GHz
Physical cores 8 20
SIMD instruction set AVX2 AVX512
doubles per SIMD vector 4 8
Cross-element vectorization batch size 4 8
FMA33endnote: 3Fused multiply-add operations. units per core 2 2
FMA instruction issue per cycle 2 2
Peak performance (double-precision)44endnote: 4Calculated as 332.8 GFLOP/s 1536.0 GFLOP/s
LINPACK performance (double-precision)55endnote: 5Intel LINPACK Benchmark. 262.5 GFLOP/s 976.7 GFLOP/s
Memory bandwidth66endnote: 6STREAM triad benchmark, 2 threads per core. 38.5 GB/s 81.0 GB/s
GCC/Clang arch flag -march=native -march=native
ICC SIMD flag -xcore-avx2 -xcore-avx512 -qopt-zmm-usage=high
Other compiler flags -O3 -ffast-math -fopenmp -O3 -ffast-math -fopenmp
Intel Turbo Boost OFF OFF
Table 1: Hardware specification for experiments
tri quad tet hex
mass 1 1.2 3 3 0.7 1.0 2.5 2 2 1.2 1.6 2.7 4 4 0.9 0.9 6.6 2 2 1.1 2.0
2 1.7 6 6 1.3 1.1 2.4 3 3 1.0 1.4 5.8 10 14 2.4 2.4 5.4 3 3 1.8 2.4
3 3.0 10 12 2.7 2.2 2.7 4 4 0.7 1.0 8.6 20 24 0.9 1.6 4.9 4 4 1.4 1.4
4 5.6 15 25 3.4 3.9 2.9 5 5 2.1 2.0 38.8 35 125 1.1 1.4 4.7 5 5 2.3 2.3
5 7.5 21 36 1.2 1.9 3.0 6 6 2.4 2.5 54.9 56 216 0.8 0.9 4.8 6 6 2.4 2.5
6 9.7 28 49 0.9 1.7 3.3 7 7 2.6 2.2 79.8 84 343 1.0 1.0 5.0 7 7 2.4 2.8
helmholtz 1 1.8 3 3 1.0 1.3 5.6 2 2 2.1 2.8 3.8 4 4 1.3 1.3 16.1 2 2 2.1 3.2
2 5.7 6 6 2.6 3.1 6.4 3 3 1.7 2.6 27.1 10 14 2.9 5.5 16.0 3 3 2.8 3.8
3 9.6 10 12 2.8 4.5 7.1 4 4 0.9 1.3 37.2 20 24 1.6 3.2 15.3 4 4 1.9 2.7
4 17.8 15 25 3.0 5.1 7.6 5 5 3.0 4.8 162.7 35 125 2.2 2.6 15.0 5 5 3.3 4.1
5 23.3 21 36 2.0 3.1 7.9 6 6 3.0 4.7 226.0 56 216 1.4 1.8 15.2 6 6 3.3 4.2
6 29.8 28 49 1.4 2.5 8.6 7 7 3.1 4.8 325.7 84 343 1.4 1.9 15.8 7 7 3.3 4.2
laplacian 1 0.5 3 1 0.8 0.8 4.1 2 2 1.7 2.0 1.9 4 1 1.2 1.0 13.3 2 2 1.9 2.4
2 2.7 6 3 1.7 2.1 5.4 3 3 1.6 2.1 10.2 10 4 2.1 3.5 12.7 3 3 2.6 3.4
3 4.0 10 6 2.2 3.3 5.8 4 4 1.3 2.0 23.7 20 14 2.2 3.4 12.1 4 4 1.5 1.6
4 6.9 15 12 2.7 4.6 6.1 5 5 2.8 4.0 31.1 35 24 2.9 3.3 12.0 5 5 2.7 3.3
5 12.6 21 25 2.1 3.1 6.3 6 6 2.8 4.1 121.6 56 125 3.2 2.5 12.3 6 6 2.6 3.4
6 16.8 28 36 1.9 2.8 6.8 7 7 2.7 4.3 185.4 84 216 2.8 2.3 12.9 7 7 2.6 3.9
elasticity 1 0.5 3 1 0.8 1.0 5.2 2 2 1.8 2.2 1.9 4 1 1.2 1.0 16.3 2 2 2.0 2.3
2 3.0 6 3 1.7 2.4 6.2 3 3 1.8 2.3 11.5 10 4 2.1 3.5 14.6 3 3 2.6 3.2
3 4.4 10 6 2.3 3.4 6.5 4 4 1.3 2.1 25.3 20 14 2.1 3.4 13.5 4 4 1.5 1.6
4 7.3 15 12 2.7 4.5 6.7 5 5 2.8 4.2 32.4 35 24 2.8 2.9 13.2 5 5 2.8 3.4
5 13.1 21 25 2.1 3.2 6.9 6 6 2.8 4.2 124.7 56 125 3.1 2.5 13.4 6 6 2.6 3.5
6 17.3 28 36 1.8 2.8 7.4 7 7 2.7 4.2 188.7 84 216 2.8 2.3 13.9 7 7 2.6 3.9
1 0.5 3 1 1.1 1.5 18.4 2 3 2.3 3.2 1.6 4 1 1.9 1.7 76.6 2 3 2.7 4.3
2 9.8 6 6 2.8 4.5 22.2 3 5 1.6 2.3 62.4 10 14 3.4 6.1 79.1 3 5 2.2 3.7
3 26.8 10 25 3.6 6.7 23.8 4 7 2.3 4.0 310.0 20 125 3.4 6.3 78.7 4 7 2.4 3.7
4 40.7 15 49 3.7 7.2 25.1 5 9 3.6 6.2 593.2 35 343 3.5 4.2 79.5 5 9 3.7 5.9
5 55.9 21 81 2.7 5.1 25.7 6 11 3.6 6.1 895.2 56 729 2.9 2.8 82.1 6 11 3.3 5.5
6 74.3 28 121 2.4 4.3 27.4 7 13 3.7 6.1 1398.3 84 1331 2.6 2.7 85.8 7 13 3.4 5.7
Table 2: Operator characteristics and speed-up summary, using GCC with vector extensions. AI: arithmetic intensity (FLOP/byte). D: trip count of loops over degrees of freedom. Q: trip count of loops over quadrature points. H: speed-up over baseline on Haswell, 16 processes, with vector extensions. S: speed-up over baseline on Skylake, 40 processes, with vector extensions.

We measure the execution time of assembling the residual of five operators: the mass matrix (“mass”), the Helmholtz equation (“helmholtz”), the vector Laplacian (“laplacian”), an elastic model (“elasticity”), and a hyperelastic model (“hyperelasticity”). The mathematical description of the operators is detailed in the supplemental material. These operators stem from real-world applications and cover a wide range of complexity: the generated C code for the corresponding global assembly kernels exceeds hundreds of KB for the hyperelasticity operator on high polynomial degrees.

We performed experiments on both 2D and 3D domains, with two types of mesh used for each case: triangles (“tri”) and quadrilaterals (“quad”) for 2D problems, tetrahedra (“tet”) and hexahedra (“hex”) for 3D problems. The arithmetic intensity of the operators are listed in Table 2

. The memory footprint is calculated assuming perfect caching – it is thus a lower bound which results in an upper bound estimation for the arithmetic intensity. The triangular and tetrahedral meshes use an affine coordinate transformation (requiring only one Jacobian evaluation per element). The quadrilateral and hexahedral meshes use a bilinear (trilinear) coordinate transformation (requiring Jacobian evaluation at every quadrature point), which usually results in higher arithmetic intensities at low orders. In Firedrake, tensor-product elements 

(McRae et al., 2016) benefit from optimizations such as sum factorization to achieve lower asymptotic algorithmic complexity. They are therefore more competitive for higher order methods (Homolya et al., 2017).

We record the maximum execution time of the generated global assembly kernels on all MPI processes. This time does not includes the time in synchronization and MPI data exchange for halo updates. Each experiment is run five times, and the average execution time is reported. Exclusive access to the compute nodes is ensured and threads are pinned to individual logical cores. Startup costs such as code generation time and compilation time are excluded. We use automatic vectorization by GCC without batching, compiled with the same optimization flags listed earlier, as the baseline for comparison. Comparing with the cross-element strategy, the baseline represents the out-of-the-box performance of compiler auto-vectorization for the local element kernel. We note that cross-element vectorization does not alter the algorithm of local assembly except for the vector expansion, as illustrated by Listing 2 and Listing 6. Consequently, the total number of floating-point operations remains the same. The performance benefit from cross-element vectorization is therefore composable with the operation-reduction optimizations performed by the form compiler to the local assembly kernels.

4.2 Experimental results and discussion

Figure 2: The fraction of peak FLOP/s (as listed in Table 1) achieved by different compilers for operators {helmholtz, elasticity}, on meshes {tri, quad, tet, hex} on Haswell using vector extensions with 16 MPI processes. The dotted line indicates the fraction of peak performance achieved by LINPACK benchmark.
Figure 3: The fraction of peak FLOP/s (as listed in Table 1) achieved by different compilers for operators {helmholtz, elasticity}, on meshes {tri, quad, tet, hex} on Haswell using OpenMP pragma with 16 MPI processes. The dotted line indicates the fraction of peak performance achieved by LINPACK benchmark.
Figure 4: The fraction of peak FLOP/s (as listed in Table 1) achieved by different compilers for operators {helmholtz, elasticity}, on meshes {tri, quad, tet, hex} on Skylake using vector extensions with 40 MPI processes. The dotted line indicates the fraction of peak performance achieved by the LINPACK benchmark.

Figures 2 to 5 show the performance of the helmholtz and elasticity operators on Haswell and Skylake, vectorized with OpenMP pragma as described in Section 3.1, and with vector extensions as described in Section 3.2. We indicate the fraction of peak performance achieved on the left axis, and the fraction of the LINPACK benchmark performance on the right axis. Figure 6 and 7 compare the roofline models (Williams et al., 2009) of the baseline and our cross-element vectorization implementation using GCC and vector extensions on Haswell and Skylake. The speed-up achieved is also summarized in Table 2. We analyze the data in the following aspects:

Figure 5: The fraction of peak FLOP/s (as listed in Table 1) achieved by different compilers for operators {helmholtz, elasticity}, on meshes {tri, quad, tet, hex} on Skylake using OpenMP pragma with 40 MPI processes. The dotted line indicates the fraction of peak performance achieved by the LINPACK benchmark.
Figure 6: Roofline model of operators for baseline and cross-element vectorization using GCC on Haswell. The dotted lines indicate the performances of the LINPACK benchmark.
Figure 7: Roofline model of operators for baseline and cross-element vectorization using GCC on Skylake. The dotted lines indicate the performances of the LINPACK benchmark.

4.2.1 Compiler comparison and vector extensions

When vectorizing with OpenMP pragma, ICC gives the best performance for almost all test cases, followed by Clang, while GCC is significantly less competitive. The performance disparity is more pronounced on Skylake than on Haswell. However, when using vector extensions, Clang and GCC improve significantly and are able to match the performance of ICC on both Haswell and Skylake, whereas ICC performs similarly with OpenMP pragma and with vector extensions.

We use the Intel Software Development Emulator77endnote: 7 to count the number of instructions executed at runtime for code generated by different compilers. The data indicate that although floating-point operations are fully vectorized by all compilers, GCC and Clang generate more load and store instructions between vector registers and memory when using OpenMP pragma for vectorization. One possible reason is that GCC and Clang choose to allocate short arrays to the stack rather than the vector registers directly, causing more load on the memory subsystem.

In light of these results, we conclude that vectorization with vector extensions allows greater performance portability on different compilers and CPUs for our application. It is, therefore, our preferred strategy for implementing cross-element vectorization, and is the default option for the rest of our analysis.

4.2.2 Vectorization speed-up

Almost across the board, significant speed-up is achieved on the test cases under consideration. Slowdown occurs in two situations. On low polynomial degrees, the kernels tend to have low arithmetic intensity so that the increase in available floating point bandwidth through cross-element vectorization cannot compensate for the increase in the size of the working set of data. On simple operators such as mass on tri and tetra, the kernels have simple loop structures and the compilers can sometimes successfully apply other optimizations such as unrolling and loop interchange to achieve vectorization without batching elements in the outer loop. The pattern of speed-up is consistent across Haswell and Skylake. Higher speed-up is generally achieved on more complicated operators (e.g. hyperelasticity), and on tensor-product elements (quad and hex), which generally correspond to more complicated loop structure and higher arithmetic intensity due to the Jacobian recomputation at each quadrature point.

4.2.3 Achieved fraction of peak performance

We observe that the fraction of peak performance varies smoothly with polynomial degrees for cross-element vectorization in all test cases. This fulfils an important design requirement for Firedrake: small changes in problem setup by the users should not create unexpected performance degradation. This is also shown in Figures 6 and 7 where the results are more clustered on the roofline plots after cross-element vectorization. The baseline shows performance inconsistency, especially on low polynomial degrees. For instance, for the helmholtz operator with degree 3 on quad, the quadrature loops and the basis function loops all have trip counts of 4, which fits the vector length on Haswell and results in better performance.

On simplicial meshes (tri and tetra), higher order discretization leads to kernels with very high arithmetic intensity because of the quadratic and cubic increases in the number of basis functions, and thus the loop trip counts. This is due to the current limitation that simplicial elements in Firedrake are not sum factorized. In these test cases, we observe that the baseline approaches cross-element vectorization for sufficiently high polynomial degrees. This is not a serious concern for our optimization approach because the break-even degrees are very high except for simple operators such as mass, and ultimately tensor-product elements are more competitive for higher order methods in terms of algorithmic complexity.

We also observe that there exists a small number of test cases where the achieved peak performance is marginally higher than the LINPACK benchmark on Skylake, as shown in Figure 7. One possible reason for this observation is thermal throttling since our test cases typically run for a shorter period of time than LINPACK. We also note that these test cases correspond to high order hyperelasticity operator on tet which are not practically important use cases, since using tensor-product elements requires much less floating-point operations at the same polynomial order.

4.2.4 Tensor-product elements

We observe higher and more consistent speed-up for tensor-product elements (quad and hex) on both Haswell and Skylake. This is because, on these meshes, more computation can be moved outside the innermost loop due to sum factorization, which results in more challenging loop nests for the baseline strategy which attempts to vectorize within the element kernel. The same applies to the evaluation of the Jacobian of coordinate transformation, which is a nested loop over quadrature points after sum factorization for tensor-product elements.

The base elements of quad and hex are interval elements in 1D, thus the extents of loops over degrees of freedom increase only linearly with respect to polynomial degrees, as shown in Table 2. As a result, the baseline performance does not improve as quickly for higher polynomial degrees on quad and hex compared with tri and tet, resulting in stable speed-up for cross-element vectorization observed on tensor-product elements.

5 Conclusion and future work

We have presented a portable, general-purpose solution for delivering stable vectorization performance on modern CPUs for matrix-free finite element assembly for a very broad class of finite element operators on a large range of elements and polynomial degrees. We described the implementation of cross-element vectorization in Firedrake which is transparent to the end users. Although the technique of cross-element vectorization is conceptually simple and has been applied in hand-written kernels before, our implementation based on code generation is automatic, robust and composable with other optimization passes.

The write-back to global data structure is not vectorized in our approach due to possible race conditions. The newly introduced Conflict Detection instructions in the Intel AVX512 instruction set could potentially mitigate this limitation (Zhang, 2016, Section 2.3). This could be achieved by informing Loopy to use the relevant intrinsics when generating code for loops with specific tags.

We have focused on the matrix-free finite element method because it is compute-intensive and more likely to benefit from vectorization. However, our methods and implementation also support matrix assembly. Firedrake relies on PETSc (Balay et al., 2017) to handle distributed sparse matrices, and PETSc requires certain data layouts for the input array when updating the global matrices. When several elements are batched together for cross-element vectorization, we need to generate code to explicitly unpack/transpose the local assembly results into individual arrays before calling PETSc functions to update the global sparse matrices for each element. Future improvement could include eliminating this overhead, possibly by extending the PETSc API.

The newly introduced abstraction layer, together with Loopy integration in the code generation and optimization pipeline, opens up multiple possibilities for future research in Firedrake. These include code generation with intrinsics instructions, loop tiling, and GPU acceleration, all of which are already supported in Loopy.

The authors would like to thank Tobias Grosser, Richard Vera, J. Ramanujam and P. Sadayappan for their valuable insights during our discussions which started at Dagstuhl Seminar 18111 on Loop Optimization. The authors are grateful to James Cownie and Andrew Mollinson at Intel Corp. for providing access to the Skylake platform.

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

This work was supported by the Engineering and Physical Sciences Research Council [grant numbers EP/L016796/1, EP/R029423/1], and the Natural Environment Research Council [grant number NE/K008951/1]. It was further funded by the US Navy Office of Naval Research under grant number N00014-14-1-0117 and the US National Science Foundation under grant number CCF-1524433. AK gratefully acknowledges a hardware gift from Nvidia Corporation.

Here we describe the operators used as the test cases for experimental evaluation. They are defined as bilinear forms, and we take their action in UFL to obtain the corresponding linear forms.


Here and are scalar-valued trial and test functions.


Here and are scalar-valued trial and test functions.


Here and are vector-valued trial and test functions.


The linear elasticity model solves for a displacement vector field. Here and are vector-valued trial and test functions, is the symmetric strain rate tensor. The bilinear form is defined as:


In this simple hyperelastic model, we define the strain energy function over vector field :



is the identity matrix,

and are the Lamé parameters of the material, is the deformation gradient, is the right Cauchy-Green tensor, is the Euler-Lagrange strain tensor. We define the Piola-Kirchhoff stress tensors as:


Finally, we arrive at the residual form of this nonlinear problem:


where is the external forcing. To solve this nonlinear problem, we need to linearize the residual form at an approximate solution , this gives us the bilinear form :


where the trial function is , the test function is , and is a coefficient of the operator. We use the automatic differentiation of UFL to compute the operator symbolically.