Abstract
This work presents algorithms for the efficient implementation of discontinuous Galerkin methods with explicit time stepping for acoustic wave propagation on unstructured meshes of quadrilaterals or hexahedra. A crucial step towards efficiency is to evaluate operators in a matrixfree way with sumfactorization kernels. The method allows for general curved geometries and variable coefficients. Temporal discretization is carried out by lowstorage explicit Runge–Kutta schemes and the arbitrary derivative (ADER) method. For ADER, we propose a flexible basis change approach that combines cheap face integrals with cell evaluation using collocated nodes and quadrature points. Additionally, a degree reduction for the optimized cell evaluation is presented to decrease the computational cost when evaluating higher order spatial derivatives as required in ADER time stepping. We analyze and compare the performance of stateoftheart Runge–Kutta schemes and ADER time stepping with the proposed optimizations. ADER involves fewer operations and additionally reaches higher throughput by higher arithmetic intensities and hence decreases the required computational time significantly. Comparison of Runge–Kutta and ADER at their respective CFL stability limit renders ADER especially beneficial for higher orders when the Butcher barrier implies an overproportional amount of stages. Moreover, vector updates in explicit Runge–Kutta schemes are shown to take a substantial amount of the computational time due to their memory intensity.
Keywords: matrixfree methods, sum factorization, discontinuous Galerkin methods, arbitrary highorder, explicit time stepping, high performance computing
AMS: 65M60, 65Y20, 68Q25, 68W40
1 Introduction
Discontinuous Galerkin (DG) methods are highly attractive for solving partial differential equations of transport character because they combine the robustness of finite volume methods with the accuracy of finite element methods on unstructured meshes
[6, 14]. An essential component is the implementation of DG on modern hardware, which is the focus of this work. Modern processors favor arithmetically intense algorithms such as dense matrix multiplication over patterns typical for many solvers of partial differential equations that include loads of a streaming character with much lower arithmetics. Thus, algorithmic complexities alone are not enough to judge a method’s efficiency as the achievable performance of e.g. a sparse matrixvector product in terms of floating point operations per second can be almost two orders of magnitude below the advertised peak performance [30].This paper derives and analyzes two time stepping schemes for the discontinuous Galerkin (DG) method for the acoustic wave equation that are significantly more efficient than traditional matrixbased algorithms. The first scheme is based on explicit lowstorage Runge–Kutta methods, and the second one on arbitrary derivative (ADER) time stepping that relies on a temporal Taylor expansion of the solution and expresses temporal derivatives by spatial derivatives using the Cauchy–Kowalevski (also called Lax–Wendroff) procedure. ADER has been successfully applied in the context of finite volume and DG methods [9, 10, 26]. An advantage of ADER over explicit Runge–Kutta schemes is that ADER is not restricted by the Butcher barriers and convergence orders beyond four are not overproportionally expensive. Recently, ADER was combined with the hybridizable discontinuous Galerkin (HDG) method [25]. HDG had originally been introduced in the context of implicit time integration [5], but it was later found that it is highly attractive in combination with explicit time integration outperforming implicit time integration by orders of magnitude [22]. A characteristic property of HDG discretizations is that a superconvergent solution can be constructed at very low additional cost. In [25], a reconstruction procedure is presented that allows to combine ADER and HDG while maintaining the superconvergence property.
Previous work on ADERDG (as in [3]) relies on triangles or tetrahedra assuming constant coefficients and straight lined boundaries. The operator evaluation in [3]
is carried out based on an element matrix with a theoretical complexity per degree of freedom of
in the degree for all spatial dimensions . Here, we propose an ADERDG formulation for quadrilaterals and hexahedra for variable coefficients and curved geometries. We use a matrixfree operator evaluation relying on fast quadrature with sumfactorization kernels with a theoretical complexity per degree of freedom of . The techniques of sum factorization with fast quadrature have been established by the spectral element community [8, 16, 18, 24] but are also popular in the DG community for explicit time integration [15]. Advances in computer architecture have rendered the matrixfree evaluation, originally targeting high orders beyond around five, also highly competitive at moderate orders, outperforming the memory bandwidthlimited sparse matrixvector product for second and higher degree polynomials [4, 19]. In terms of algorithmic layout, the sole reduction of arithmetic operations is not advantageous if the memory bandwidth is the performance limiting factor. In this work, we show that an operator evaluation with sum factorization as in explicit Runge–Kutta schemes is memory bandwidth bound, despite its clear improvement over matrixbased operator evaluation. ADER replaces the global operator application in each Runge–Kutta stage by one global operator application and a completely elementlocal evaluation routine, the Taylor–Cauchy–Kowalevski procedure, which allows to perform more computations on data read from the global solution vectors. We show that ADER does not only employ fewer operations but also supplies a higher arithmetic intensity which is beneficial on modern cachebased hardware.Another aspect significant for performance is the choice of the shape function nodal points and the choice of the quadrature rule. In case nodal points and quadrature points coincide, interpolation of the solution to quadrature points is avoided and computational expense is decreased. This approach is well known in the context of spectral elements. Usage of the GaussLobatto points for the definition of the nodal points and for integration was shown to degrade the accuracy of the mass matrix and its inverse
[12, 27], though. Consistent Gaussian quadrature instead yields full accuracy. A drawback of nodes in the Gauss points, however, is that the flux evaluation on element faces requires an extrapolation accessing all degree of freedom values of both adjacent elements because there are no node points on the faces. For the flux evaluation, nodal points on the element faces ensure that only instead of values must be accessed. We propose a new algorithmic method that changes the polynomial basis and its nodes on the fly depending on the quantity to be evaluated. The standard DG global derivative operator including flux evaluations relies on a Lagrange basis with GaussLobatto points while the ADER specific element local Taylor–Cauchy–Kowalevski procedure relies on a Lagrange basis with nodes in collocated Gauss points. Thereby, we are able to combine cheap element evaluation and flux evaluation with minimal data access. Despite this optimization concerning node and quadrature choice, we propose a second optimization concerning the efficient evaluation of higher order spatial derivatives required in the Taylor–Cauchy–Kowalevski procedure. Calculation of first order spatial derivatives and successive projection to a lower order basis in combination with the collocated node and quadrature points minimize computational work.This article is structured as follows. The physical problem and its spatial and temporal discretization are given in Section 2. The basic algorithmic building blocks are described in the first part of Section 3. A quantitative study on the throughput for basis with or without nodes on element surfaces is given in 3.1, which motivates the basis switching approach presented in 3.2. The optimization relying on reduced polynomial spaces for higher spatial derivatives is shown in Section 3.3. A detailed performance analysis in terms of theoretically derived operation counts, throughput, the roofline model, computational timings, and scalability is given in Section 4. We conclude in Section 5.
2 The Problem and its Discretization
The acoustic wave equation in terms of the solution vector function summarizing the particle velocity and the pressure deviation is written as
(1) 
with the matrix containing parameters and first order spatial derivatives in the form
with speed of sound and mass density . Most algorithmic components developed in this work can also be applied to other hyperbolic operators . The solution on a dimensional domain is sought with respect to initial conditions and boundary conditions. To find an approximate solution to problem (1), first the spatial discretization and subsequently the temporal discretization are applied.
2.1 The Discontinuous Galerkin Method
The domain is tesselated into a triangulation of quadrilaterals or hexahedra. For discretization with DG, equation (1) is multiplied with test functions , integrated over all elements of and integrated by parts. The resulting problem reads: Find
in the approximation space spanned by elementwise polynomials of tensor degree
, constructed as the tensor product of onedimensional functions, such that(2) 
with and denoting the inner product over the and dimensional domains and , respectively. The test functions are from the same function space as the solution functions. The boundary of the tesselation comprises all element boundaries both at the interior faces and the boundary of the physical domain . The variable is the numerical flux for evaluation of the element surface integrals by relating quantities from both adjacent elements. Several possibilities for the flux definition exist, such as central fluxes, upwinding, HDG fluxes, and many more. The choice depends on the specific problem and intended numerical properties.
Expressing the solution as product of ansatz functions summarized in a matrix and values of the degrees of freedom , i.e.,
the time continuous space discrete system is derived
(3) 
with a mass matrix and a stiffness matrix . The stiffness matrix entries depend on the choice of the flux definition .
2.2 Explicit Runge–Kutta Time Integration
Time discretization of equation (3) using a general explicit Runge–Kutta method involves summation over the Runge–Kutta stages and repeated operator application
(4) 
with the coefficients and given in Butcher tableaus for specific Runge–Kutta schemes. A well known explicit Runge–Kutta scheme is the classical Runge–Kutta of order four with four stages. Optimized variants, e.g. lowstorage schemes, which were originally introduced to reduce the memory consumption but that we use primarily for their lower the memory transfer, are given in [17].
In this work, the lowstorage schemes with two registers of order four with five stages LSRK4(5) and of order five with nine stages LSRK5(9) from [17] will be used. According to the Butcher barriers, there is no five stage Runge–Kutta scheme of order five. The minimum number of stages is six for accuracy order five. With the additional conditions imposed by the requirement to obtain a lowstorage scheme with two registers, the minimal number of stages is nine. An eight stage solution is presumed to be impossible [17]. For an alternative variant of lowstorage schemes, which was explicitly developed for DG, see e.g. [28]; these schemes are not considered in this work because their efficiency is not significantly better than for the schemes from [17].
2.3 Arbitrary Derivative Time Integration
ADER explicit time integration has been presented in the context of DG spatial discretization and several hyperbolic problems, see [9, 11, 26]. Time discretization of the timecontinuous system (3) with ADER relies on the idea to express temporal derivatives with spatial derivatives using the partial differential equation (1) according to the Cauchy–Kowalevski procedure
The solution field is expanded into a truncated Taylor series in time around the last known state at and the above expression is inserted. A projection onto the degree of freedom values yields
(5) 
where the subscript means evaluation at . Last, equation (3) is integrated in time and equation (5) is used to replace the time integral. The final scheme reads
(6) 
Just as every explicit time integration scheme, ADER time stepping is subject to time step restrictions according to the Courant–Friedrichs–Lewy (CFL) condition [7] with the Courant number Cr
in terms of the signal speed , the time step size , the polynomial degree of the shape functions , and a characteristic element size . The power on the polynomial degree is used to correct for higher order approximations as suggested in [16]. The stability restriction for ADER is generally stricter than for explicit Runge–Kutta methods, as mentioned in [9] where of the Runge–Kutta Courant number is given as rule of thumb. Time step stability is investigated in Section 4.7.
2.4 The Hybridizable Discontinuous Galerkin Method
The hybridizable discontinuous Galerkin method (HDG) in combination with explicit time stepping can be understood as a special choice of the flux in equation (2). For the acoustic wave equation, the HDG flux is
with the superscripts and indicating evaluation of the two adjacent elements and being the outward pointing normal vector. HDG introduces a stabilization parameter typically chosen as . If a time integration scheme of sufficient accuracy is utilized, HDG offers the possibility to obtain superconvergent pressure results of order by means of a simple postprocessing step. To maintain the superconvergence property of HDG in combination with ADER time integration where spatial and temporal discretization are strongly interlinked, the method shown in equation (6) is enhanced to
(7) 
replacing one elementlocal derivative by a discrete DG derivative . This procedure recovers a superconvergent scheme, see [25] for details.
3 Algorithmic Developments
In the beginning of this section, we describe the basic underlying algorithmics on which our new developments build. Subsequently, we give a brief preliminary performance evaluation for different polynomial bases in Section 3.1. Thereby, we motivate a basis change as described in Section 3.2, which is optimized in terms of a degree reduction in Section 3.3.
The evaluation of the integrals in the weak forms in equations (6) or (7) is performed by fast integration relying on sum factorization utilizing the structure of tensor product shape functions that are integrated with a tensor product quadrature rule. Throughout this work, we choose a Gaussian quadrature with points per direction for polynomials of degree , which is enough to integrate bilinear forms with elementwise constant coefficients on affine element shapes exactly. On curved geometries, there is a possible integration error that is often subsumed in the errors from variational crimes [2]. In particular, our choice avoids the accuracy penalty of inexact Gauss–Lobatto quadrature on points as highlighted in [12]. Note that more general hyperbolic problems with nonlinear terms can easily be integrated with more points to avoid aliasing effects in this setup.
For a function described by a basis of degree and coefficients, the interpolation onto quadrature points takes additions and multiplications in a naive implementation without sum factorization. Also, the evaluation of each component of the gradient takes operations. With sum factorization however, the work for the interpolation is reduced to operations. The reduction of operations results from applications of onedimensional interpolations of cost that go through all lines of basis functions and quadrature points, respectively. In the remainder of this work, we refer to one application of a onedimensional interpolation over all points, involving operations, as one “tensor product kernel”. Note that these kernels can be cast as small matrixmatrix multiplications. For details, we refer to [20].
In case of quadrature over Lagrange polynomials with nodes in the points of the quadrature formula (a socalled collocated setup [18]) the interpolation from coefficients to values in quadrature points is the identity operation and can be skipped. Thus, the evaluation of all components of the gradient only involves tensor product kernels for each of the partial derivatives, as opposed to tensor product kernels for the basic evaluation scheme. As will be elaborated in this work, the optimization of using a collocated bases may be premature as other factors, such as the cost of face integrals, may control the decision about which basis to choose. Similar kernels are also used for the summation over quadrature points when multiplying with test functions or test function gradients, see e.g. [20] for details.
For our experiments, we use a stateoftheart implementation of sum factorization based on the deal.II finite element library with support for massively parallel computations and adaptively refined meshes with hanging nodes [1]. Since integration involves a series of heavy arithmetics, the use of vectorization (SIMD) is fundamental for getting optimal performance on current architectures. Following the concepts described in [20, 21], this work applies vectorization over several elements which was found to provide best performance on polynomial degrees up to at least 14 and reaches more than 50% of the arithmetic peak performance when considered in isolation, which is an extremely high value for a code compiled from generic C++ code that contains as a (template) parameter that lets the compiler decide on the loop unrolling and register allocation.
3.1 Efficient Face Integral Evaluation
In this section, the effect of the polynomial basis functions on the evaluation of the derivative operator and the inverse mass matrix are studied. Two contradictory factors are considered: The inverse mass matrix is efficiently evaluated if quadrature points and nodes of the polynomial basis coincide, which results in a diagonal mass matrix and hence simple inversion. On the other hand, the derivative operator includes the evaluation of element but also face integrals. For the face integrals, data from both adjacent elements are required. For nodal polynomials with nodes on the element faces, only the values associated to the nodes on the faces must be accessed. If the nodes are only in the interior or polynomials are not nodal, all vector entries of both adjacent elements are required and an extrapolation to the faces must be carried out.
Two variants are examined to demonstrate the effects of the aforementioned contradictory factors. Quadrature is always based on Gaussian points. In one case, a polynomial basis of Lagrange functions with nodes in the same Gauss points is used, in the other case a polynomial basis of Lagrange functions with nodes in Gauss–Lobatto points is used. The first case results in a diagonal mass matrix, the second case satisfies the requirement to have nodes on the element faces. Note that the usage of Gauss–Lobatto points as nodes and quadrature points is not considered because it degrades accuracy [12, 27]. Fig. 1 plots the results in terms of degrees of freedom processed per second for a threedimensional geometry consisting of Cartesian elements for polynomial degree and Cartesian elements for . Computations are run on the system specified in Table 1.
The evaluation of the inverse mass matrix reaches a considerably higher throughput compared to the derivative operator because it is completely elementlocal whereas requires the evaluation of face integrals and therefore additional data access from neighboring elements. In fact, the throughput of the former is mostly limited by the memory bandwidth of loading the vector and writing back into the same vector, which is degrees of freedom per second for the system’s memory throughput of 115 GB/s when counting byte per polynomial (read and write) and bytes for access to the precomputed inverse entry of the diagonal, which is the same for all components. Following [22], the action of the inverse can be cheaply evaluated by sumfactorization kernels that first transform into an orthogonal basis (i.e., the Lagrange polynomials in the points of Gauss quadrature), apply the inverse diagonal mass matrix, and transform back to the original basis. The comparison to the throughput for collocated nodes of shape functions and quadrature points shows that indeed the evaluation of the nondiagonal mass matrix operation is only up to 15% slower compared to the diagonal mass matrix. Thus, the substantial arithmetic operations can be almost completely hidden behind the memory transfer.
The face evaluation for the Lagrange basis in Gauss points does not only require to read the values for the degrees of freedom located on the face but all values of the adjacent elements to allow interpolation onto the face (or to write the face data into a separate global array with additional memory transfer as e.g. done in [14, 15]). Comparing the throughput for Lagrange polynomials in Gauss–Lobatto points and Gauss points reveals a significant drop in performance for the derivative operator due to the increased memory access, including substantial indirect addressing components as described in [20]. The direct comparison highlights that using polynomial bases with support points on the element faces for the global derivative operator is highly beneficial. This finding applies particularly to Runge–Kutta type time integrators, and it also holds for more general nonlinear operators .
We want to conclude this section by summarizing the main findings:

The evaluation of the inverse mass matrix is memory bandwidth bound (especially for moderate order) and a change of basis is for free.

The throughput for the derivative operator is much lower than for the mass matrix, because values from both adjacent elements must be read in contrast to complete elementlocal evaluations. Combined with the fact that also the Jacobian of the mapping must be accessed, the memory bandwidth is reached earlier.

This effect is counteracted by usage of a polynomial basis with nodes on the element boundary, which will be further investigated in the following section.

Usage of collocated node and quadrature points reduces the number of operations.
From these conclusions, ADER is selfevidently motivated: Applications of the global derivative operator are traded for element local evaluations in the Taylor–Cauchy–Kowalevski procedure.
3.2 Flexible Basis Change
In the previous section, the effect of the memory bandwidth bound became apparent. Despite the memory access, the number of operations is a central quantity of interest to judge code efficiency.
As shown in Fig. 1, a set of shape functions where only functions evaluate to nonzero on each of the element faces is beneficial to reduce the vector access for face evaluation. For element local evaluations, however, this involves additional tensor product kernels per component to interpolate from the solution values to quadrature points as compared to the collocated node and quadrature points commonly used in spectral element solvers [18]. Hence, different bases should be used for the different phases. For element local evaluations, a collocated basis with nodes in Gauss points (denoted by G) is the best choice while a basis with nodes in Gauss–Lobatto points (denoted by GL) is the best choice for the evaluation of face integrals. We propose to switch the basis for the different phases on the fly while looping over the elements for the Taylor–Cauchy–Kowaleski evaluation. The solution approximation is expressed either as
with the matrices containing the shape functions evaluated at the respective nodes and the vectors containing the degree of freedom values for Gauss and Gauss–Lobatto points, respectively. Theoretically, there are no restrictions objecting to change the basis from one evaluation to the other. Since both spaces are of the same degree, the equality holds in all cases.
For ADER, the collocated G basis is not only interesting for the application of the inverse mass matrix but especially for the Taylor–Cauchy–Kowalevski term
summing weighted spatial derivatives from zeroth to th order with different prefactors. Higher order derivatives are computed by iterative calculation of a first derivative and a subsequent projection applying the inverse mass matrix. The Taylor–Cauchy–Kowalevski term is completely element local and significantly more calculations are operated on the read data compared to the evaluation of the inverse mass matrix.
We propose a successive evaluation of the gradient of a field and projection instead of direct evaluation of high derivatives, which is significantly more efficient in the context of a matrixfree implementation on general meshes. Successively, problems of the form are solved, where the th spatial derivative is denoted . Algorithm 1 shows how the spatial derivatives are calculated by evaluation of the gradient field and projection onto the degrees of freedom for the noncollocated basis GL. The application of the inverse mass matrix must be understood in the matrixfree sumfactorization context according to [22].
The algorithm simplifies drastically if nodes and integration points are collocated because the weighting with test functions and the application of the inverse mass matrix cancel out and all the interpolation matrices are identity matrices. The simplified procedure is shown in Algorithm 2.
Note that the basis change is done on the fly when processing the data from one element, not on the global solution vector as that would incur additional memory transfer. If the current global solution vector contains the degree of freedom values in the GL basis description, the first evaluation in Algorithm 2 involves the interpolation from GL to G and then all remaining evaluation from to are computed purely in the collocated basis G.
3.3 Degree Reduction
A possibility to further reduce the work in the evaluation of the Taylor–Cauchy–Kowalevski sum as presented in the previous section is to reduce the polynomial degree in the representation of higher order spatial derivatives. This is a wellestablished method in the ADERDG community for affine element shapes where typically hierarchical bases are used [3, 9]. The higher order spatial derivatives naturally give a contribution only to the lower degree coefficients of the hierarchical basis in the Taylor–Cauchy–Kowalevski procedure. In the case of sum factorization evaluation on general nonCartesian meshes, however, the embedding to lower polynomial degrees involves additional operations as compared to the spectral evaluation routine from Algorithm 2. In a hierarchical basis, i.e., Legendre polynomials on quadrilaterals or hexahedra, the integrals with a nonaffine element geometry are evaluated by quadrature with sum factorization, which in turn must transform between the Legendre basis and the collocation basis. For one spatial derivative in Algorithm 2, the complexity rises from tensor product kernels per component in the spectral evaluation to tensor product kernels, where kernels each are needed for the basis change in interpolation and integration, respectively.
A more efficient algorithm can be devised as follows. First, we reduce the degree in terms of the Lagrange basis in the respective Gaussian integration points of degree and of step in the Taylor–Cauchy–Kowalevski sum. The degree reduction is performed by an operation where is the projection operator from degree to on the reference element. Like in the hierarchical case, this setup combined with the additional interpolation of the result into the points of the Taylor–Cauchy–Kowalevski sum involves tensor product kernels per component, as compared to only kernels for the spectral derivative. Thus, as a second ingredient we propose to apply the degree reduction only for every second spatial derivative. This step also has the advantage of limiting the extra amount of geometry information, i.e., the inverse Jacobian, that needs to be loaded in each quadrature point, as Gauss formulas of different degrees evaluate the integrands in different positions and our implementation uses precomputed inverse Jacobians. In other words, we reuse the data of the inverse Jacobians loaded into caches once again. The th spatial derivative is thus expressed in a basis of polynomial degree . Algorithm 3 details this procedure.
4 Performance Evaluation
In the following subsections, the performance of DG with ADER and DG with explicit Runge–Kutta time integration is analyzed in terms of operation counts, computation time, throughput, and scalability.
If not specified otherwise, the computational setup as shown in Table 1 is used in the numerical examples.
CPU  core Intel Xeon Broadwell E52690v4, 2.6 GHz 

memory  8 channels DDR 4 (2400 MHz) theoretic 
compiler  g++ version 6.2 
compiler optimization  march=haswell 03 funrollloops 
4.1 Operation Counts
For the ADERDG method as given in equation (6), there are three main contributions to the computational costs, namely the application of the inverse mass matrix cost , the application of the global derivative operator cost , and the evaluation of the Taylor–Cauchy–Kowalevski sum cost , resulting in an overall cost of
(8) 
For an stage Runge–Kutta scheme, the costs are
(9) 
In ADER, the highorder approximations contribute to a sum over all terms in the truncated Taylor series and involves an additional dependency on the polynomial degree compared to and . In contrast, highorder approximations with Runge–Kutta schemes use more stages , where the number of stages has to increase faster than the required order of accuracy (for orders larger than four) due to the Butcher barriers. ADERDG as well as Runge–Kutta methods both repeatedly call the application of the derivative operator and of the inverse mass matrices. The main difference is that ADERDG applies the derivative operator locally, i.e., elementwise, while Runge–Kutta integrators rely on the global derivative operator containing both element and face contributions.
The operation counts for and are derived from vector updates, matrixvector or matrixmatrix multiplications, which in turn rely on the matrixfree implementation of integral evaluation for tensorial shape functions explained in Section 3. The cost for the evaluation of one tensor product kernel on a dimensional domain (e.g. in an element or on element faces) calculates to
where the three summands in braces represent additions and subtractions (first term), multiplications (second term), and fused multiplyadd operations (last term, counted as two arithmetic instructions). Note that an evenodd decomposition of the local coefficients and matrices is used that cuts operation count into approximately one half compared to a usual 1D matrixvector product
[18, 20]. This cost is the main building block to derive the operation counts for and with the number of calls to the tensor product kernels as summarized in Table 2. For details on the derivation of operation counts, we refer to [20] and [22].mass  stiffness matrix  TCK  

cell eval.  
cell deriv.  
face eval.  
total 
In Fig. 2, operation counts for the full schemes are compared, i.e., ADER versus a Runge–Kutta scheme with five stages for all polynomial degrees. To allow for generalization and allow comparability, the number of stages for the Runge–Kutta scheme is kept constant. ADER involves fewer arithmetic operations for all considered polynomial degrees for both . The figure also shows operation counts in case no basis change to a collocated basis and no degree reduction for the higher order spatial derivatives are carried out, i.e., if the proposition of Sections 3.2 and 3.3 are not realized in the implementation of the Taylor–Cauchy–Kowalevski procedure. Without optimizations, it is apparent that ADER has a higher polynomial dependency on than Runge–Kutta. The optimizations however compensate this dependency. The basis change is not applied to the Runge–Kutta discretization because the elementlocal mass matrix inversion is dominated by the memory bandwidth in contrast to the ADER Taylor–Cauchy–Kowalevski term with higher operational cost, see also Fig. 1.
Fig. 3 visualizes the gains of the degree reduction approach. It plots the operation counts for the Taylor–Cauchy–Kowalevski procedure with degree reduction relative to an implementation without degree reduction and compares the reduction in every step, every second step, and every third step. The operational cost is halved. For higher orders, the reduction in every step is not preferable because the projection introduces overhead. For moderate polynomial degrees, the approach to reduce the basis in every second step of the Taylor–Cauchy–Kowalevski procedure appears most beneficial.
4.2 Computational Timings
A two and a three dimensional example are set up on a domain , solving the acoustic wave equation with vibrational modes as an analytic solution as in [25]. We test on a mesh with slightly deformed elements to prevent the builtin Cartesian mesh optimizations in the code of [20] for polynomial degrees . In 2D, and elements are used for and , respectively. In 3D, and elements are used for the respective polynomial degrees.
Fig. 4 shows measured run times per time step and per element for numerical experiments on cores with LSRK4(5). Comparison between Fig. 2 and Fig. 4 reveals that the run time follows the operation counts. ADER is significantly more efficient in terms of wall time per element, though, which is due to the fact that the Taylor–Cauchy–Kowalevski procedure requires one global vector access but Runge–Kutta requires one global vector access for each application of the global derivative operator . In other words, the operation counts that ignore the memory access fail to accurately predict the run time. This statement also holds for the ADER implementation without basis change with run times between LSRK4(5) and the optimized ADER.
4.3 Breakdown into Algorithmic Components
Let us examine the algorithmic components separately in terms of the computational time per million degrees of freedom. In Fig. 5, we compare two versions of LSRK4(5) with an optimized vector updater that only reads and writes two vectors per stage by merging the loop over vectors into a single loop and a standard vector updater reading five vectors and writing two vectors per stage, similar to the “merged operations” presented in [13]. Also, the results for the optimized ADER scheme implementing the basis change and degree reduction from Sections 3.2 and 3.3 and the ADER scheme without the degree reduction are shown in Fig. 5. The results are obtained from simulations on a threedimensional nonCartesian grid. The measured run time is the accumulated time spent in the respective functions, e.g., the contribution of the derivative operator appears five times as high for LSRK4(5) compared to ADER, because it is applied in each of the five stages while ADER applies the derivative operator only once per time step.
As can be seen from Fig. 5, the cost for vector updates in Runge–Kutta schemes cannot be neglected, neither in the standard implementation nor in the optimized variant, where they contribute with about a third or a quarter of the run time, respectively. Likewise, our results document the high level of performance reached in the computations of and , a distinctive feature of our implementation. For ADER, the vector updates have a comparably small contribution to the entire cost because only one single update at the end of the method is required.
The application of the derivative operator is most expensive for with its rather disadvantageous ratio between degrees of freedom located on the element faces as compared to the interior. The ratio improves for higher orders and an almost constant throughput per degree of freedom is obtained, despite the theoretical increase in arithmetic complexity. The constant throughput is mainly explained by the memory transfer that scales as per unknown. The Taylor–Cauchy–Kowalevski procedure shows slightly increasing costs for higher degrees as an additional summand contributes in the Taylor expansion according to equation (6). Nonetheless, the increase for higher degrees is moderate due to the proposed degree reduction approach as presented in Section 3.3 and efficient algorithms, less than doubling the run time between degree two and twelve. In the rightmost panel of Fig. 5, results are shown for a Taylor–Cauchy–Kowalevski procedure that does not reduce the polynomial degree, where the increase in computing time is much more significant. Obviously, the latter reaches higher arithmetic throughput with more than 300 GFLOPs/s, which is a secondary quantity, though.
4.4 Roofline Performance Model
Fig. 6 shows a roofline model according to [30] for LSRK4(5) and ADER in two and three dimensions on a nonCartesian mesh. The roofline plots contain the hardware specific limits in terms of the memory bandwidth limit (diagonal lines) measured with the STREAM triad benchmark and the peak arithmetic performance (horizontal lines). All numbers are based on measured data from hardware performance counters, extracted from monitoring our programs with the likwid performance measurement tool, version 4.3, as presented in [29].
Generally, the polynomial degree increases for the points from left to right. The left panel of Fig. 6 highlights that ADER comes with a higher arithmetic intensity, in particular for the higher degrees, as compared to LSRK4(5) that is clearly in the memory bandwidth bound regime. The computations are made for a nonCartesian mesh where not only vector entries and some index data must be loaded from main memory but also the inverse of the Jacobian transformation in each quadrature point.
The right panel of Fig. 6 shows the results for the individual components of the methods. The evaluation of the derivative operator that needs to load geometry data is most strongly limited by the memory. Note that loops over cells and faces are interleaved in our implementation to reuse the vector data loaded into caches in cell integrals also for face integrals. More detailed measurements similar to the ones presented in [20] show that the code of face integrals finds more than 90% of the vector data for face integrals already in caches. Nonetheless, the partial indirect addressing with gather/scatter type instructions to rearrange the face data for vectorization over several faces and the remaining cache misses reduce throughput by around 25% as compared to idealized code that only performs the integration, see the experiments in [20] for details. The Taylor–Cauchy–Kowalevski procedure comes along with considerably higher arithmetic intensities, but the high level of arithmetic optimizations in the proposed algorithm keeps the code still in the memorylimited region, in clear contrast to e.g. [3]. Given these detailed results, the behavior of the entire method can be characterized as follows: the Runge–Kutta scheme applies and five times in each time step, while ADER only applies and once and the rest is taken care of by the completely elementlocal Taylor–Cauchy–Kowalevski procedure, enabling a much higher reuse of cached data.
4.5 Throughput as Function of Polynomial Degree
Fig. 7 lists the computational throughput in terms of degrees of freedom processed per second and the actually realized GFLOPs per second rate, measured with the likwid tool [29], version 4.3. If not stated otherwise, the optimized implementation of ADER using the basis change and reduction as proposed in Sections 3.2 and 3.3 are considered.
The performance advantage of ADER compared to the low storage Runge–Kutta scheme is a factor of around for low polynomial degrees. The advantage decreases to a factor of for owing to the additional computations for the high order of accuracy of the ADER time integration, whereas the LSRK4(5) schemes uses a fixed temporal accuracy of four with 5 stages. In terms of GFLOPs rates, which lie between and GFLOPs per second, ADER operates times more GFLOPs per second due to better caching. For comparison, a matrix based evaluation of a sparse matrix vector product was reported to reach GFLOPs per second [23] on the same hardware.
4.6 Throughput as Function of Problem Size
The throughput as a function of the problem size between and degrees of freedom in 3D with shape functions of polynomial degree is shown in Fig. 8. For small discretizations of size , limited parallelism prevents the full exploitation of the 28 cores. Then, the parallel efficiency improves and a first peak is reached at around where all data fits into the 70 MB of level 3 cache of the two processors and no access to main memory is needed. Performance drops again once the data structures exceed the caches and the vector and geometry data need to be streamed from main memory. For , a slow increase of the throughput is noted, which is related to the decreased influence of the MPI communication for larger discretizations.
4.7 CPU Time Versus Accuracy
In this section, the time to solution is evaluated with respect to the accuracy on a twodimensional example. With the standard setup of a vibrating membrane with seven modes, simulations are run on different refinement levels of the mesh between and elements for polynomial degrees . For a fair comparison, the number of processors is chosen reasonably for the discretization sizes, e.g., the coarsest discretization is computed on one processor core only while the finest discretization is computed on cores. The final numbers report the accumulated CPU time over all utilized processors. For the accuracy, the pressure error at time is considered. The upper panel of Fig. 9 plots the results for a Courant number of . Generally, higher orders appear beneficial in case a strict accuracy criterion is applied. ADER yields the same solution quality as LSRK4(5) in less computational time.
Since ADER and Runge–Kutta schemes are subject to different CFL stability limits, we evaluate the critical Courant number by an iterative procedure. The results are summarized in Table 3 for the considered twodimensional as well as for a threedimensional setup.
2D  3D  

LSRK4(5)  LSRK5(9)  ADER  LSRK4(5)  LSRK5(9)  ADER  
The critical Courant number for ADER is approximately and times smaller compared to the low storage Runge–Kutta schemes LSRK4(5) and LSRK5(9) from [17]. Therefore, we also analyze the solution accuracy versus the CPU time at of the respective time integrator in the bottom panel of Fig. 9. For , ADER is faster for most error tolerances. For , all methods perform similarly. With polynomial degree for the spatial discretization, ADER outperforms LSRK4(5) clearly because the LSRK4(5) scheme is of order four while the spatial discretization is of order eight, and the temporal error dominates over the spatial error. This is also true for LSRK5(9) at high spatial resolution.
In Section 2.4, HDG was introduced as a special DG method with a superconvergence property. By a simple postprocessing step, a superconvergent pressure solution is constructed that converges with rate . The results for and for polynomial shape functions of degree are presented as green lines in Fig. 9. Again, for , ADER performs slightly better compared to LSRK4(5), while a significant performance advantage for ADER is noted for , which is due to better temporal accuracy. Comparing the top and bottom panel for shows that ADER is also faster than LSRK4(5) with smaller time steps. The discretization with LSRK5(9) is competitive to ADER for the superconvergent pressure result with . For the polynomial degree , a change in slope indicates the turnover from spatial error domination to temporal error domination. Where the temporal error dominates, the advantage for ADER is more distinct. This is due to the fact that ADERDG is automatically convergent in space and time while Runge–Kutta is limited by the Butcher barriers: either overproportionally more stages are required to match the temporal order of accuracy with the spatial discretization, or a small time step is required.
4.8 Scalability
In order to assess the strong scalability of the proposed methods, a two and a three dimensional geometry consisting of and elements of polynomial degree ( and , respectively) are used and the solution is computed on to processors on a parallel cluster of core Intel Xeon E52630 v3 (Haswell) processors at 2.4 GHz. Additionally, one computation with fewer elements (, ) in 2D is carried out such that the distribution yields only six elements per processor for the highest level of parallelism. The left panel of Fig. 10 summarizes the results. In accordance with the previous sections, ADER is consistently faster than LSRK4(5) for the same time step size . The scaling is almost ideal with a slight kink when going from to processors, where the code goes from being compute bound to being memory bound. For the small discretization the scaling deteriorates for high processor numbers due to communication overhead. In the right panel, results are shown for a strong scaling study on the SuperMUC Phase 2 Petascale system with nodes of Intel Xeon E52697 v3 (Haswell) processors at 2.6 GHz on to nodes. For the 2D LSRK4(5) simulation, a kink can be seen when going from to cores indicating that simulations are memory bandwidth bound on lower core counts but get computation bound at higher core counts. The scaling is close to ideal considering that the highest level of parallelism corresponds to and elements per processor in 2D and 3D, respectively.
Fig. 11 plots the strong scalability for the algorithmic components. It can be seen that the vector updates give the largest contribution to the reduced scalability between and processors which is due to the fact that the utilized cluster possesses cores but only memory channels which can be saturated already with processors. The kink is only due to shared memory effects because the communication is considered as part of the operator application. Since LSRK4(5) spends a larger fraction of time in vector updates, the reduced scalability in the overall method is more pronounced. The inverse mass matrix and the Taylor–Cauchy–Kowalevski procedure scale almost perfectly while the stiffness matrix shows a slight scaling decay due to a worse volumetosurface effect of the data that must be exchanged with MPI.
In order to illustrate the effect of memory bandwidth on the algorithmic components, we repeat the calculations on an Intel Xeon Phi 7210F (Knights Landing, KNL) system with 64 cores and 16 GB of highbandwidth memory delivering up to 420 GB/s. For KNL, the code is vectorized with AVX512, i.e., eightwide SIMD lanes, as opposed to AVX2 with fourwide SIMD lanes on Broadwell. In Table 4, the computing times per time step are compared to a calculation on Broadwell cores. The KNL system yields a speed up for the vector update of about for LSRK4(5), corresponding to the four times higher memory bandwidth. Also, the application of the inverse mass matrix is substantially faster. The derivative operator , on the other hand, runs slightly slower on KNL because of the more irregular code patterns in face integrals that favor the more sophisticated CPU cores of Broadwell.
VU  TCK  sum  

LSRK4(5) Broadwell  —  
LSRK4(5) KNL  —  
ADER Broadwell  
ADER KNL 
5 Conclusion
We presented a performance analysis for explicit Runge–Kutta and ADER DG implementations. ADER outperforms optimized low storage Runge–Kutta schemes over a range of test scenarios. The ingredients are fast integration techniques with sum factorization that combine optimalcomplexity mathematical algorithms utilizing the tensor product structure of the shape functions with a highly competitive implementation that vectorizes over several elements and faces. The methods have been devised to be applicable also for complex meshes with curved quadrilateral or hexahedral elements. Our experiments clearly show that it is most efficient to evaluate operators including face integrals with nodal basis functions which have nodes on the element boundaries, while the cell evaluations in the Taylor–Cauchy–Kowalevski procedure of ADER are best performed with Lagrange polynomials in the points of the quadrature formula. To combine these two, an onthefly change between the bases has been proposed in this work. This result is in contrast to the consensus belief in spectral elements that favor collocation of the polynomials nodes and quadrature points. While the theoretically derived operation counts already signify a slight benefit for ADER compared to Runge–Kutta, the actual timings show a distinct benefit, reducing the time to perform one time step by approximately a factor of four, because ADER suits modern hardware architecture better: while Runge–Kutta schemes are mostly limited by the memory bandwidth, ADER performs more operations on the data that is loaded from main memory and thus reaches a higher arithmetic intensity. A detailed analysis of Runge–Kutta versus ADER integration at the CFL stability limit has shown comparable performance where the Runge–Kutta time discretization order matches the spatial discretization order. For approximations with a high order of accuracy where the Butcher barriers set in, ADER exceeds the abilities of Runge–Kutta because its computational cost does not grow overproportionally. While the findings for ADER are limited to linear hyperbolic PDEs, the optimizations regarding the basis functions and reduced vector access for the Runge–Kutta time integrators regarding basis functions are also directly applicable to general nonlinear systems of hyperbolic PDEs.
Our work highlights the importance to develop modern DG solvers according to the trends and limits in modern hardware architectures. For common solvers, the memory bandwidth limit is more relevant also when performing the relatively expensive computations of high order DG methods, i.e., is met earlier than the arithmetic performance limit. Trading global for elementlocal operations counters this effect, rendering approaches as the Taylor–Cauchy–Kowalevski procedure favorable. Vector updates are inherently memory bandwidth limited and need to be optimized specifically. Our experiments and performance models highlight that going significantly beyond the throughput recorded in this work demands either hardware with higher memory bandwidth, such as GPUs or the Xeon Phi, or new software paradigms that reduce the memory access over several stages, such as wavefront blocking that is already commonly used in the finite difference community.
References
 [1] D. Arndt, W. Bangerth, D. Davydov, T. Heister, L. Heltai, M. Kronbichler, M. Maier, J.P. Pelteret, B. Turcksin, and D. Wells. The deal.II library, version 8.5. J. Numer. Math., 25(3):137–145, 2017.
 [2] S. C. Brenner and L. R. Scott. The mathematical theory of finite element methods. SpringerVerlag, New York, 3rd edition, 2008.
 [3] A. Breuer, A. Heinecke, S. Rettenberger, M. Bader, A.A. Gabriel, and C. Pelties. Sustained petascale performance of seismic simulations with SeisSol on SuperMUC. In J. M. Kunkel, T. Ludwig, and H. W. Meuer, editors, Supercomputing, volume 8488 of Lecture Notes in Comput. Sci., pages 1–18. Springer, 2014.
 [4] J. Brown. Efficient nonlinear solvers for nodal highorder finite elements in 3D. J. Sci. Comput., 45(13):48–63, 2010.
 [5] B. Cockburn, J. Gopalakrishnan, and R. Lazarov. Unified hybridization of discontinuous Galerkin, mixed, and continuous Galerkin methods for second order elliptic equations. SIAM J. Numer. Anal., 47(2):1139–1365, 2009.
 [6] B. Cockburn, G. Karniadakis, and C.W. Shu. Discontinuous Galerkin Methods  Therory, Computation and Applications. Springer Verlag Berlin Heidelberg, 1st edition, 2000.
 [7] R. Courant, K. Friedrichs, and H. Lewy. Über die partiellen Differenzengleichungen der mathematischen Physik. Math. Ann., 100(1):32–74, 1928.
 [8] M. O. Deville, P. F. Fischer, and E. H. Mund. Highorder methods for incompressible fluid flow, volume 9. Cambridge University Press, 2002.
 [9] M. Dumbser and M. Käser. An arbitrary highorder discontinuous Galerkin method for elastic waves on unstructured meshes – II. The threedimensional isotropic case. Geophys. J. Int., 167:319–336, 2006.
 [10] M. Dumbser, M. Käser, and E. F. Toro. An arbitrary highorder discontinuous Galerkin method for elastic waves on unstructured meshes – V. Local time stepping and adaptivity. Geophys. J. Int., 171:695–717, 2007.
 [11] M. Dumbser, I. Peshkov, E. Romenski, and O. Zanotti. High order ADER schemes for a unified first order hyperbolic formulation of continuum mechanics: Viscous heatconducting fluids and elastic solids. J. Comput. Phys., 314:824–862, 2016.
 [12] M. Durufle, P. Grob, and P. Joly. Influence of Gauss and Gauss–Lobatto quadrature rules on the accuracy of a quadrilateral finite element method in the time domain. Numer. Methods Partial Differential Equations, 25(3):526–551, 2009.
 [13] P. Eller and W. Gropp. Scalable nonblocking preconditioned conjugate gradient methods. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16, pages 18:1–18:12, Piscataway, NJ, USA, 2016. IEEE Press.
 [14] J. S. Hesthaven and T. Warburton. Nodal Discontinuous Galerkin Methods: Algorithms, Analysis, and Application, volume 54 of Texts in Applied Mathematics. Springer, 2008.
 [15] F. Hindenlang, G. Gassner, C. Altmann, A. Beck, M. Staudenmaier, and C.D. Munz. Explicit discontinuous Galerkin methods for unsteady problems. Comput. & Fluids, 61:86–93, 2012.
 [16] G. Karniadakis and S. Sherwin. Spectral/hp element methods for computational fluid dynamics. Oxford University Press, 3rd edition, 2013.
 [17] C. A. Kennedy, M. H. Carpenter, and R. M. Lewis. Lowstorage, explicit RungeKutta schemes for the compressible Navier–Stokes equations. Appl. Numer. Math., 35:177–219, 2000.
 [18] D. A. Kopriva. Implementing Spectral Methods for Partial Differential Equations: Algorithms for Scientists and Engineers. Springer, 2009.
 [19] M. Kronbichler and K. Kormann. A generic interface for parallel cellbased finite element operator application. Comput. & Fluids, 63:135–147, 2012.
 [20] M. Kronbichler and K. Kormann. Fast matrixfree evaluation of discontinuous Galerkin finite element operators. arXiv preprint http://arxiv.org/abs/1711.03590, v1, 2017.
 [21] M. Kronbichler, K. Kormann, I. Pasichnyk, and I. Allalen. Fast matrixfree discontinuous Galerkin kernels on modern computer architectures. In J. M. Kunkel, R. Yokota, P. Balaji, and D. E. Keyes, editors, ISC High Performance 2017, LNCS 10266, pages 237–255, 2017.
 [22] M. Kronbichler, S. Schoeder, C. Müller, and W. Wall. Comparison of implicit and explicit hybridizable discontinuous Galerkin methods for the acoustic wave equation. Internat. J. Numer. Methods Engrg., 106(9):712–739, 2016.
 [23] M. Kronbichler and W. A. Wall. A performance comparison of continuous and discontinuous Galerkin methods with fast multigrid solvers. arXiv preprint https://arxiv.org/abs/1611.03029, v1, 2016.
 [24] S. A. Orszag. Spectral methods for problems in complex geometries. J. Comput. Phys., 37:70–92, 1980.
 [25] S. Schoeder, M. Kronbichler, and W. Wall. Arbitrary highorder explicit hybridizable discontinuous Galerkin methods for the acoustic wave equation. J. Sci. Comput., online first:1–38, 2018.
 [26] T. Schwartzkopff, M. Dumbser, and C.D. Munz. Fast high order ADER schemes for linear hyperbolic equations. J. Comput. Phys., 197:532–539, 2004.
 [27] S. Teukolsky. Short note on the mass matrix for Gauss–Lobatto grid points. J. Comput. Phys., 283(C):408–413, 2015.
 [28] T. Toulorge and W. Desmet. Optimal RungeKutta schemes for discontinuous Galerkin space discretizations applied to wave propagation problems. J. Comput. Phys., 231(4):2067 – 2091, 2012.
 [29] J. Treibig, G. Hager, and G. Wellein. LIKWID: A lightweight performanceoriented tool suite for x86 multicore environments. In Proceedings of PSTI2010, the First International Workshop on Parallel Software Tools and Tool Infrastructures, San Diego CA, 2010.
 [30] S. Williams, A. Waterman, and D. Patterson. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM, 52(4):65–76, Apr. 2009.