A domain-specific language and matrix-free stencil code for investigating electronic properties of Dirac and topological materials

by   Andreas Pieper, et al.

We introduce PVSC-DTM (Parallel Vectorized Stencil Code for Dirac and Topological Materials), a library and code generator based on a domain-specific language tailored to implement the specific stencil-like algorithms that can describe Dirac and topological materials such as graphene and topological insulators in a matrix-free way. The generated hybrid-parallel (MPI+OpenMP) code is fully vectorized using Single Instruction Multiple Data (SIMD) extensions. It is significantly faster than matrix-based approaches on the node level and performs in accordance with the roofline model. We demonstrate the chip-level performance and distributed-memory scalability of basic building blocks such as sparse matrix-(multiple-) vector multiplication on modern multicore CPUs. As an application example, we use the PVSC-DTM scheme to (i) explore the scattering of a Dirac wave on an array of gate-defined quantum dots, to (ii) calculate a bunch of interior eigenvalues for strong topological insulators, and to (iii) discuss the photoemission spectra of a disordered Weyl semimetal.



There are no comments yet.


page 9

page 27

page 30

page 32


Unfitted Nitsche's method for computing wave modes in topological materials

In this paper, we propose an unfitted Nitsche's method for computing wav...

The Chunks and Tasks Matrix Library 2.0

We present a C++ header-only parallel sparse matrix library, based on sp...

Program Generation for Linear Algebra Using Multiple Layers of DSLs

Numerical software in computational science and engineering often relies...

dMath: Distributed Linear Algebra for DL

The paper presents a parallel math library, dMath, that demonstrates lea...

Distributed Memory Techniques for Classical Simulation of Quantum Circuits

In this paper we describe, implement, and test the performance of distri...

Parallel Matrix-Free Implementation of Frequency-Domain Finite Difference Methods for Cluster Computing

Full-wave 3D electromagnetic simulations of complex planar devices, mult...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Dirac-type semimetals and topological insulators are a new materials platform with an enormous application potential in fields ranging from nano-electronics, plasmonics and optics to quantum information and computation. Their striking electronic, spectroscopic, and transport properties result from spin-polarized (chiral), (semi)metallic surface states, which are located in the middle of the spectrum and show linear dispersion to a good approximation. The discovery of such massless Dirac fermions in graphene by Castro Neto et al. (2009), on the surface of topological insulators by Hasan and Kane (2010), and in Weyl semimetals by Xu et al. (2015) has triggered the investigation of Dirac physics in crystals, changing thereby, in a certain sense, the focus and perspective of current condensed matter research from strong-correlation to topological effects.

Whether a material may realize distinct topological phases is dictated by the dimension, the lattice structure and associated electronic band structure including the boundary states, and the relevant interactions, all reflected in the system’s Hamilton operator and its symmetries. Therefore it is of great interest to determine and analyze the ground-state and spectral properties of paradigmatic model Hamiltonians for topological matter. This can be achieved by means of unbiased numerical approaches.

PVSC-DTM is a highly parallel, vectorized (matrix-free) stencil code for investigating the properties of two-dimensional (2D) graphene and graphene-nanoribbons (GNRs), three-dimensional (3D) topological insulators as well as Weyl semimetals, including also disorder effects, by using modern numerical methods based on matrix polynomials. Due to the complexity of the problem, a considerable amount of computation is required. Thus, one of the design goals of PVSC-DTM was to build highly parallel software that supports the architectural features of modern computer systems, notably SIMD (Single Instruction Multiple Data) parallelism, shared-memory thread parallelism, and massively parallel, distributed-memory parallelism. On the compute node level, the development process was guided by performance models to ensure that the relevant bottleneck is saturated. The major methodological advantage compared to existing software packages for similar purposes is that all matrix operations are performed without an explicitly stored matrix, thereby greatly reducing the pressure on the memory interface and opening possibilities for advanced optimizations developed for stencil-type algorithms.

In order to ease the burden on users and still provide the flexibility to adapt the code to different physical setups, a domain-specific language (DSL) was developed that allows for a formulation of the problem without any reference to a specific implementation, let alone optimization. The actual code is generated automatically, including parallelization and blocking optimizations. Although several stencil DSLs have been developed (see, e.g., Tang et al. (2011); Zhang et al. (2017); Schmitt et al. (2014)), some even with specific application fields in mind such as in Ragan-Kelley et al. (2013), there is to date no domain-specific approach to generating efficient stencil code for algorithms describing the specific quantum systems mentioned above from a high-level representation. Since optimal blocking factors are calculated automatically from machine properties, performance tuning (automatically or manually) on the generated code or within the code generation phase is not required.

This report gives an overview of the physical motivation and describes in detail the implementation of the framework, including the DSL. Performance models are developed to confirm the optimal resource utilization on the chip level and assess the potential of code optimizations, such as spatial blocking and on-the-fly random number generation. Performance comparisons on the node and the highly parallel level with matrix-bound techniques (using the GHOST library) show the benefit of a matrix-free formulation. The code is freely available for download at http://tiny.cc/PVSC-DTM.

For the benchmark tests we used two different compute nodes: A dual-socket Intel Xeon E5-2660v2 “Ivy Bridge” (IVB) node with 10 cores per socket and 2.2 GHz of nominal clock speed, and an Intel Xeon E5-2697v4 “Broadwell” (BDW) node with 18 cores per socket and 2.3 GHz of nominal clock speed. In all cases the clock frequency was fixed to the nominal value (i.e., Turbo Boost was not used). The “cluster on die” (CoD) mode was switched off on BDW, so both systems ran with two ccNUMA domains. The maximum achievable per-socket memory bandwidth was 40 Gbyte/sec on IVB and 61 Gbyte/sec on BDW. The Intel C/C++ compiler in version 16.0 was used for compiling the source code.

For all distributed-memory benchmarks we employed the “Emmy” cluster at RRZE (Erlangen Regional Computing Center). This cluster comprises over 500 of the IVB nodes described above, each equipped with 64 GB of RAM and connected via a full nonblocking fat-tree InfiniBand network.

This paper is organized as follows. Section 2 provides typical lattice-model Hamiltonians for graphene nanoribbons with imprinted quantum dots, strong topological insulators, and disordered Weyl semimetals. A matrix-free method and code for the calculation of electronic properties of these topological systems is described in Sec. 3, with a focus on a domain-specific language that serves as an input to a code generator. To validate and benchmark the performance of the numerical approach, the proposed PVSC-DTM scheme is executed for several test cases. Section 4 describes the matrix-polynomial algorithms used for some physical ‘real-world’ applications. Finally, we conclude in Sec. 5.

2 Model Hamiltonians

In this section we specify the microscopic model Hamiltonians under consideration, in a form best suitable for the application of the PVSC-DTM stencil code. The emergence of Dirac-cone physics is demonstrated.

2.1 Graphene

Graphene consists of carbon atoms arranged in a 2D honeycomb lattice structure (see the review by Castro Neto et al. (2009)). The honeycomb lattice is not a Bravais lattice, but can be viewed as a triangular Bravais lattice with a two-atom basis. Taking into account only nearest-neighbor hopping processes, the effective tight-binding Hamiltonian for the (-) electrons in can be brought into the following form:


Here and in what follows we use units such that and measure the energy in terms of the carbon-carbon electron transfer integral ; is the number of lattice sites. The first term describes the particle transfer between between neighboring cells (containing four atoms each) in and direction, while the second term gives the transfer within the cells and the potentials , which can vary within the cells and from cell to cell in the case of disorder. The corresponding 44 matrices are


The band structure of pure graphene,


(in the following, we set the lattice constant ), exhibits two bands (the upper anti-bonding and lower bonding band) that touch each other at so-called Dirac points; next to any of those the dispersion becomes linear (see Castro Neto et al. (2009)). The graphene Hamiltionian respects time-inversion symmetry, which implies , and if is the solution for [which is the Fermi energy for intrinsic (undoped) graphene], so is , i.e., the Dirac points occur in pairs.

Compared to the band structure of an infinite 2D graphene sheet, the DOS of finite GNRs is characterized by a multitude of Van Hove singularities, as shown by Castro Neto et al. (2009) and Schubert et al. (2009). For zigzag GNRs, the strong signature at indicates the high degeneracy of edge states, as shown in Fig. 1 (a). By contrast, armchair GNRs are gapped around ; this finite-size gap vanishes when the width of the ribbon tends to infinity.

2.2 Topological insulators

The remarkable properties of 3D topological insulators (TIs) result from a particular topology of their band structure, which exhibits gapped (i.e., insulating) bulk and gapless (i.e., metallic) linearly dispersed Dirac surface states (see reviews by Hasan and Kane (2010) and Qi and Zhang (2011)). Bulk-surface correspondence implies, as shown by Fu et al. (2007), that so-called weak TIs (which are less robust against the influence of non-magnetic impurities) feature none or an even number of helical Dirac cones while strong (largely robust) TIs have a single Dirac cone.

As a minimal theoretical model for a 3D TI with cubic lattice structure we consider – inspired by the orbitals of strained 3D HgTe or the insulators of the family, as studied in Sitte et al. (2012) – the following four-band Hamiltonian:


where is a four-component spinor at site . The Hamiltonian is expressed in terms of the five Dirac matrices , ), and their ten commutators , which satisfy the Clifford algebra, , with being the identity and () the Pauli matrices referring to orbital (spin) space (see Schubert et al. (2012) and Pieper and Fehske (2016)). Hence, constitutes a complex, sparse, banded matrix with seven subdiagonals of small dense blocks of size . The corresponding band dispersion reads:


The parameter can be used to tune the band structure: For , a weak TI with two Dirac cones per surface arises, whereas for , a strong TI results, with a single Dirac cone per surface (see Fig. 1 (b)). In the case that we have a conventional band insulator. External magnetic fields cause finite and , which will break the inversion symmetry. , in addition, breaks the time-inversion symmetry.

We now describe for the TI problem how the density of states (DOS) and the single-particle spectral function depicted in Fig. 1 is obtained using state-of-the art exact diagonalization and kernel polynomial methods that were described in Weiße and Fehske (2008) and  Weiße et al. (2006), respectively. For a given sample geometry (and disorder realization), these quantities are given as




where is the energy (frequency), is the wave vector (crystal momentum) in Fourier space, and designates the single-particle eigenstate with energy . The four-component (ket-) spinor () can be used to construct a Bloch state, just by performing the scalar product with the canonical (bra-) basis vectors of position and band index spaces ( and , respectively). See Schubert et al. (2012) for details.

For the model (4) with (and , ), bulk states occur for energies . Moreover, subgap surface states develop, forming a Dirac cone located at the surface momentum , as shown in Fig. 1 (b). The latter states determine the striking electronic properties of TIs.

2.3 Weyl semimetals

The Weyl semimetallic phase, which can be observed, e.g., in TaAS (see Xu et al. (2015)), is characterized by a set of linear-dispersive band touching points of two adjacent bands, the so-called Weyl nodes. The real-space Weyl points are associated with chiral fermions, which behave in momentum space like magnetic monopoles. Unlike the 2D Dirac points in graphene, the 3D Weyl nodes are protected by the symmetry of the band structure and, as long as there is no translational-symmetry-breaking intervalley-mixing between different Weyl nodes, the Weyl semimetal is robust against perturbations, as was shown by Yang et al. (2011). In this way a Weyl semimetal hosts, like a TI, metallic topological surface states (arising from bulk topological invariants). However, while the topological surface states of TIs give rise to a closed Fermi surface (in momentum space), the surface-state band structure of Weyl semimetals is more exotic; it forms open curves, the so-called Fermi arcs, which terminate on the bulk Weyl points (see Wan et al. (2011)).

The minimal theoretical models for topological Weyl semimetals have been reviewed quite recently by McCormick et al. (2017). Here we consider the following 3D lattice Hamiltonian,


where is now a two-component spinor and are the Pauli matrices (again, the lattice constant is set to unity, just as the transfer element). In momentum space [], the two-band Bloch-Hamilton matrix takes the form ()


developing two Weyl nodes at momenta with , as seen in Fig. 1 (c) [Hasan et al. (2017)].

Figure 1: Band dispersion along deduced from the single-particle spectral function [left panels] and density of states (DOS) [right panels] for the model Hamiltonians (1), (4), and (8). (a) Zigzag GNR with having 16 ‘rows’ and open boundary conditions (BCs) in direction (periodic BCs in direction). (b) Strong TI with , on a cuboid with sites and periodic BCs. Here the Dirac cone (linear dispersion) near is due to the surface states. (c) Weyl semimetal on a cuboid with sites and open BCs in direction (periodic BCs in and directions).

3 Matrix-free code for topological systems with Dirac cones

In general, topological materials have a rather complex lattice structure, although it is so regular that a matrix-free formulation of sparse matrix-vector multiplication (spMVM) and similar kernels is feasible due to the stencil-like neighbor relations. The lattice is always periodic (apart from disorder effects), but particle transfer integrals or interactions vary widely among materials. In other words, the resulting stencil geometry depends strongly on the physics, and in a way that makes it impossible to set up optimal code for all possible situations in advance although the core algorithm is always a stencil-like update scheme. The required blocking strategies for optimal code performance also vary with the stencil shape. Consequently, it is worthwhile to generate the code for a particular physical setup. This allows to hard-code performance-relevant features and takes out a lot of uncertainty about compiler optimizations. In this section we describe some of the design goals and implementation details of our matrix-free code, including the DSL, representative benchmark cases, and performance results.

3.1 Preliminary considerations

Many numerical algorithms that can describe quantum systems, such as eigenvalue solvers or methods for computing spectral properties, such as the Kernel Polynomial Method (KPM) (reviewed in Weiße et al. (2006)), require the multiplication of large sparse matrices with one or more right-hand side vectors as a time-consuming component. If the matrix is stored explicitly in memory and special structures such as symmetry and dense subblocks are not exploited, The data transfer between the CPU and the main memory is the performance-limiting bottleneck. An upper limit for the performance of a typical linear algebra building block such as spMVM can thus be easily calculated by means of the naive roofline model, which was popularized by Williams et al. (2009):


This model assumes that the performance of a loop is either limited by the computational peak performance of the CPU () or by the maximum performance allowed by memory data transfers (), whichever is more stringent. In case of spMVM and similar algorithms on any modern multicore CPU, the former is much larger than the latter, so we can safely ignore it here. is the achievable main memory bandwidth in bytes/s; it can be determined by a suitable benchmark, such as STREAM by McCalpin (1991-2007). is the code balance, i.e., the ratio of the required data volume through the memory interface (in bytes) and the amount of work (usually floating-point operations, but any valid “work” metric will do). Clearly, is then an upper limit for the expected performance of the loop. In practice one can determine the code balance by code inspection and an analysis of data access locality. Whenever the data traffic cannot be calculated accurately, e.g., because some indirect and unpredictable access is involved, it is often possible to give at least a lower limit for and thus an absolute upper limit for the performance. A “good” code in the context of the roofline model is a code whose performance is near the limit. Once this has been achieved, any optimization that lowers will increase the performance accordingly. Many refinements of the model have been developed to make it more accurate in situations where the bottleneck is not so clearly identified, e.g., by Ilic et al. (2014) and Stengel et al. (2015).

It was first shown by Gropp et al. (2000) that the minimal code balance of spMVM for double precision, real matrices in CRS format and a 32-bit index is , leading to memory-bound execution if the matrix does not fit into a cache. A matrix-free formulation can greatly reduce the demand for data and leads, in case of many topological materials, to stencil-like update schemes. Although those are limited by memory bandwidth as well, the code balance can be very low depending on the particular stencil shape and on whether layer conditions (LCs) are satisfied. The concept of layer conditions was conceived by Rivera and Tseng (2000) and applied in the context of advanced analytic performance models by Stengel et al. (2015). In the following we briefly describe the optimizations that were taken into account, using a simple five-point stencil as an example.

double *x, *y; // RHS/LHS vector data
int imax,kmax; // grid size [0:imax]x[0:kmax]
for(int k=1; k<kmax; ++k)
 for(int i=1; i<imax; ++i)
  y[i+k*(imax+1)] =
    c1*x[(i+1)+k*(imax+1)] + c2*x[(i-1)+k*(imax+1)]
  + c3*x[i+(k+1)*(imax+1)] + c4*x[i+(k-1)*(imax+1)]
  + noise[i+k*(imax+1)];
Listing 1: Two-dimensional five-point stencil sweep with one RHS and one LHS vector. The highlighted elements must come from cache for optimal code balance.

Listing 1 shows one update sweep of this code, i.e., one complete update of one LHS vector. In a matrix-bound formulation the coefficients c1, …, c4 would be stored in memory as separate, explicit arrays. In addition to the RHS vector the array noise[] is read, implementing a random potential. As is customary with stencil algorithms we use the lattice site update (LUP) as the principal unit of work, which allows us to decouple the analysis from the actual number of flops executed in the loop body.

The minimum code balance of this loop nest for data in memory is , because each LHS element must be updated in memory (16 bytes), and each RHS and noise element must be loaded (eight bytes each). If nontemporal stores can be used for y[], the code balance reduces to because the write-allocate transfers do not occur and data is written to memory directly. The minimum code balance can only be attained, however, if the highlighted RHS elements do not have to be loaded from memory. This LC is satisfied if at least three successive rows of x[] fit into the cache. Assuming that the noise[] array and the LHS also require one row of cache space, the condition reads:


where is the available cache size in bytes per thread. In multi-threaded execution with outer loop parallelization via OpenMP, each thread must have its LC fulfilled. If the condition is broken, the inner loop can be blocked with a block size of ib and the condition will be satisfied if


If the blocking is done for a cache level that is shared among the threads in the team, the LC gets more and more stringent the more threads are used. For best single-threaded performance it is advisable to block for an inner cache, i.e., L1 or L2. See Stengel et al. (2015) for more details. Our production code determines the optimal block size according to (12).

The noise[] array is a significant contribution to the code balance. However, its contents are static and can be generated on the fly, trading arithmetic effort for memory traffic. The generation of random numbers should certainly be fast and vectorizable, so as to not cause too much overhead. See Section 3.3 below for details.

double *x, *y; // RHS/LHS vector data
int imax,kmax; // grid size [0:imax]x[0:kmax]
for(int k=1; k<kmax; ++k)
 for(int i=1; i<imax; ++i)
  for(int s=0; s<nb; ++s)
   y[s+nb*i+nb*k*(imax+1)] =
   + c2*x[s+nb*(i-1)+nb*k*(imax+1)]
   + c3*x[s+nb*i+nb*(k+1)*(imax+1)]
   + c4*x[s+nb*i+nb*(k-1)*(imax+1)]
   + noise[s+nb*i+nb*k*(imax+1)];
Listing 2: Two-dimensional five-point stencil sweep with r RHS and LHS vectors. SIMD vectorization across RHS and LHS vectors is possible and efficient if the vector storage order can be chosen as shown.

Some algorithmic variants require the concurrent, independent execution of stencil updates on multiple source and target vectors. Although SIMD vectorization is easily possible even with a single update by leveraging the data parallelism along the inner dimension, a more efficient option exists for multiple concurrent updates: If the vectors can be stored in an interleaved way, i.e., with the leading dimension going across vectors, vectorization along this dimension is straightforward if the number of vectors is large compared to the SIMD width. As opposed to the traditional scheme, perfect data alignment can be achieved (if this is required) and no shuffling of data in SIMD registers is necessary for optimal register reuse. See Listing 2 for an example using a simple five-point stencil. The considerations about LCs do not change apart from the fact that now each RHS vector needs to have its own LC fulfilled. Condition (12) is thus modified to


if n_b is the number of concurrent updates and the noise[] arrays have to be loaded from memory.

3.2 Domain-Specific Language (DSL)

In order to provide a maximum amount of flexibility to users and still guarantee optimal code, a DSL was constructed which is used to define the physical problem at hand. A precompiler written in Python then generates OpenMP-parallel C code for the sparse matrix-vector multiplication (a “lattice sweep”), which can be handled by a standard compiler. In the following we describe the DSL in detail by example.

The source code for the DSK program resides in a text file. The code begins with a specification of the problem dimensionality (2D/3D) and the basis size:

  dim 2
  size 4

The stencil coefficients can take various forms: constant, variable, or random. The number of coefficients of each kind is set by the keywords n_coeff_*, where “*” is one of the three options. E.g., for four variable coefficients:

  n_coeff_variable 4

The command nn with two or three arguments (depending on the dim parameter) and the following size lines define a sparse coefficient matrix to a neighbouring lattice block at the offset defined by the arguments of nn. Multiple entries in a line are separated by “;”. Optionally the first entry begins with “l” followed by the row index. A single block entry is written as “{colum index}|{value}” for a fixed value, or as “{colum index}| {type}|{type index or value}” for a different type. This is a simple example for a coefficient matrix one lattice position to the left of the current position (set by nn -1 0) and a fixed entry of value at position (0,1) and another entry of value at position (3,2):

nn -1 0
l0;        1|-1
l3;               2|-1

Note that all indexing is zero-based. The following coefficient types are allowed:

  • Fixed coefficient, hard-coded (default type). I.e., an entry of “1|-1” can also be written as “1|f|-1”. It will be hard-coded into the generated C code.

  • Constant coefficient per lattice site, read from an array of length n_coeff_const. E.g., “1|c|2” means that the coefficient used will be coeff_c[2]. The coefficient array can be changed at runtime if required, or preset in the DSL source code. For example, the line

      coeff_const_default 1 2 -0.5

    will initialize the array coeff_c[] with the specified values.

  • Variable coefficient per lattice site, read from an array of length n_coeff_variable per lattice site.

  • Random coefficient per lattice site, read from an array of length n_coeff_rand per lattice site.

In Listing 3 we show a complete example for a 2D graphene stencil with variable coefficients on the diagonal, while Listing 4 shows the generated C source code for the sparse matrix-vector multiplication (spMVM).

dim 2
size 4
n_coeff_variable 4
nn -1 0
l0;        1|-1
l3;               2|-1
nn 0 -1
l0;                      3|-1
nn 0 0
l0; 0|v|0; 1|-1
l1; 0|-1;  1|v|1; 2|-1
l2;        1|-1;  2|v|2; 3|-1
l3;               2|-1;  3|v|3
nn 0 1
l3; 0|-1
nn 1 0
l1; 0|-1
l2;                      3|-1
Listing 3: DSL source for a graphene stencil
#pragma omp for schedule(static,1) nowait
for(int i2=sys->b_y; i2<sys->n_y+sys->b_y; i2++){ // y loop
 for(int i1=i1_0; i1<i1_end; i1++){ // z loop
  int j = i1 +  ldz * ( i2 + ldy * i3);
  int i  = 4 * j;
  #pragma vector aligned
  for(int k=0; k<4; k++) {  // vector block loop
   y[(i+0)*4+k] =  scale_z[k] * y[(i+0)*4+k] + scale_h[k]
    * ( -shift_h[k] * x[ (i+0)*4+k ]
    -1. * x[(i+1+4*(0+ldz*(-1)))*4+k]
    -1. * x[(i+3+4*(-1+ldz*(0)))*4+k]
    -1. * x[(i+1+4*(0+ldz*(0))*4+k]
    +coeff_v[j*ldv+0] * x[(i+0+4*(0+ldz*(0)))*4+k]);
   y[(i+1)*4+k] =  scale_z[k] * y[(i+1)*4+k] + scale_h[k]
    * ( -shift_h[k] * x[ (i+1)*4 + k ]
    -1. * x[(i+0+4*(0+ldz*(0)))*4+k]
    -1. * x[(i+2+4*(0+ldz*(0)))*4+k]
    +coeff_v[j*ldv+1] * x[(i+1+4*(0+ldz*(0)))*4+k]
    -1. * x[(i+0+4*(0+ldz*(1)))*4+k]);
   y[(i+2)*4+k] =  scale_z[k] * y[(i+2)*4+k] + scale_h[k]
    * ( -shift_h[k] * x[(i+2)*4+k]
    -1. * x[(i+1+4*(0+ldz*(0)))*4+k]
    -1. * x[(i+3+4*(0+ldz*(0)))*4+k]
    +coeff_v[j*ldv+2] * x[(i+2+4*(0+ldz*(0)))*4+k]
    -1. * x[(i+3+4*(0+ldz*(1)))*4+k] );
   y[(i+3)*4+k] =  scale_z[k] * y[(i+3)*4+k] + scale_h[k]
    * ( -shift_h[k] * x[(i+3)*4+k]
    -1. * x[(i+2+4*(0+ldz*(-1)))*4+k]
    -1. * x[(i+2+4*(0+ldz*(0)))*4+k]
    +coeff_v[j*ldv+3] * x[(i+3+4*(0+ldz*(0)))*4+k]
    -1. * x[(i+0+4*(1+ldz*(0)))*4+k] );
  if( p->dot_xx || p->dot_xy || p->dot_yy ) {
  for(int i1=i1_0; i1<i1_end; i1++) {
   int j = i1 +  ldz * ( i2 + ldy * i3);
   int i  = 4 * j;
   #pragma  vector aligned
   for(int k=0; k<4; k++) {
Listing 4: Generated matrix-vector multiplication code for the graphene stencil

Coefficient arrays and execution parameters (such as, e.g., the grid size) can be configured and changed at runtime. The code repository111http://tiny.cc/PVSC-DTM contains numerous examples that demonstrate the DSL usage and how the generated source code can be embedded into algorithms.

3.3 Random number generator

for(int k=1; k<kmax; ++k) {
  for(int i=1; i<imax; ++i)
    random[i] = ...; // fast RNG
  for(int i=1; i<imax; ++i)
    y[i+k*(imax+1)] =
      c1*x[(i+1)+k*(imax+1)] + c2*x[(i-1)+k*(imax+1)]
    + c3*x[i+(k+1)*(imax+1)] + c4*x[i+(k-1)*(imax+1)]
    + random[i];
Listing 5: Using an “out-of-band” fast RNG to save memory data traffic

In some physically relevant cases, the Hamiltonian matrix has a diagonal, random component. These random numbers are usually stored in a constant array to be loaded during the matrix-vector multiplication step. At double precision this leads to an increase in code balance by 8 bytes/LUP, which can be entirely saved by generating the random numbers on the fly using a fast random number generator (RNG). Considering that the stencil update schemes studied here can run at several billions of lattice site updates per second on modern server processors, a suitable RNG must be able to produce random numbers at a comparable rate. This is not possible with standard library-based implementations such as, e.g., drand48(), but faster and (in terms of quality) better options do exist. The RNG code should be inlined with the actual spMVM or at least be available as a function that generates long sequences of random numbers in order to avoid hazardous call overhead.

The standard type of RNG used in scientific computing is the linear congruential generator (LCG), which calculates and casts the result to a floating-point number. The numbers , , and parametrize the generator; for efficiency reasons one can choose to be a power of two (e.g., in drand48()), but such simple methods fail the statistical tests of the popular TESTU01 suite devised by L’Ecuyer and Simard (2007). However, if there are no particular quality requirements (i.e., if only “some randomness” is asked for), they may still be of value. Despite the nonresolvable dependency of on , which appears to rule out SIMD vectorization, LCGs can be vectorized if a number of independent random number sequences is needed and if the SIMD instruction set of the hardware allows for the relevant operations (e.g., SIMD-parallel addition, multiplication, and modulo on unsigned integer operands with the required number of bits).

A more modern and similarly fast approach to RNGs are the xorshift generators by Marsaglia (2003). In the simplest case they work by a sequence of XOR mask operations of the seed with a bit-shifted version of itself: . Improved versions like the xorshift128+ by Vigna (2014) pass all statistical tests of the “SmallCrush” suite in TESTU01. Table 1 shows a performance comparison of different RNGs on one socket of the IVB and BDW systems, respectively.

RNG (loop unrolling) failed SmallCrush Perf. IVB [GRN/sec] Perf. BDW [GRN/sec]
lgc32 (32) 1,2,3,4,5,6,7,8,9,10 19.3 28.8
xorshift32 (32) 1,3,4,8,9,10 10.1 28.8
xorshift128 (16) 6 8.29 25.4
xorshift64_long (16) 8 8.26 31.7
xorshift128plus_long (16) 6.34 20.7
Intel MKL SFMT19937 7.72 22.0
Table 1:

Performance comparison of the LCG32 RNG, four xorshift generators of different sophistication, and the SFMT19937 generator available in the Intel MKL. Performance numbers are given in billions of random numbers per second (GRN/sec) one one socket (10 or 18 cores, respectively). The benchmark consisted in the (repeated) computation of 2000 double-precision random numbers with uniform distribution. For each generator, the second column lists the failed SmallCrush tests of the TESTU01 suite. These particular test results were taken with the Intel C compiler version 17.0 update 5. The benchmarks are available in the code repository.

The speedup between IVB and BDW is particularly large for the xorshift generators because the AVX2 instruction set on BDW supports SIMD-parallel bit shifting operations, which are not available in AVX. For comparison we have included a SIMD-vectorized Mersenne Twister RNG (SFMT19937), which is available in Intel’s Math Kernel Library (MKL).

Whether a particular RNG impacts the performance of the matrix-free spMVM step depends on the application. In the cases we investigate here, a whole row of random numbers can be generated in advance before each inner loop traversal (see Listing 5) without a significant performance degradation in the bandwidth-saturated case. Fusing the RNG with the inner loop is left for future work.

3.4 Geometry

The current implementation of PVSC-DTM supports cuboid domains of arbitrary size (only limited by memory capacity) with 3D domain decomposition using MPI, and OpenMP-based multithreading on the subdomain level. For each spatial dimension, periodic boundary conditions (BCs) can be configured separately as needed. Variable-coefficient arrays and initial vectors can be preset via a user-defined callback function with the following interface:

  double (* fnc)(long * r, int k, \
                 void * parm , void * seed);

This function must expect the following input parameters:

  { x=r[0], y=r[1], z=r[2], \
    basis_place=r[3], vector_block_index=k }

It returns the respective vector entry as a double-precision number. The pointer parm is handed down from the vector initial call and allows for configuring specific options. The pointer seed is a reference to a 128-bit process- and thread-local random seed.

Finally a vector block will be initialized by calling the function:

pvsc_vector_from_func( pvsc_vector * vec, \
      pvsc_vector_func_ptr * fnc, void * parm);

This mechanism lets the user define a generalized initial function with optionally free parameters. In addition, a thread-local random seed for optional random functions is available in the initialization function, which enables a fully parallelized initialization of vectors.

3.5 Benchmarks

In order to validate the performance claims of our matrix-free implementation and optimization of random number generation we ran several test cases on the benchmark systems described in the introduction. Performance is quantified in billions of lattice site updates per second (GLUPs/sec). For all stencil update kernels (spMVMs) with constant coefficients studied here, the minimal code balance is 24 bytes/LUP with on-the-fly RNGs (see Listing 5) and 32 bytes/LUP with random numbers stored as constant arrays. The roofline model thus predicts bandwidth-bound per-socket upper performance limits of 1.67 GLUPs/sec on IVB and 2.5 GLUPs/sec on BDW.

Figure 2: Performance scaling of a spMVM kernel using a constant-coefficient 3D 7-point stencil problem with (as in Listing 5) and without on-the-fly RNG on one IVB socket (10 cores) and with SMT (2 threads per core). The dashed line shows the performance without a random potential, whereas the filled black squares (fallback) show the result with random numbers read from memory. All other data sets were obtained with different on-the-fly RNGs. (, system size )
Figure 3: Performance scaling as in Fig. 2 but on one BDW socket (18 cores).

Figures 2 and 3 show the performance scaling of the spMVM with a 3D 7-point stencil on one socket of the benchmark systems. On IVB, the “fallback” kernel, which uses explicitly stored random numbers, saturates the memory bandwidth with eight cores at about 95% of the achievable bandwidth (black solid line). The kernel without random numbers (labeled “const. coeff.”) marks a practical upper performance limit. It also saturates at about the same bandwidth (and thus at 33% higher performance), with a very slight additional speedup from SMT (Hyper-Threading). As expected, the versions with on-the-fly RNGs are somewhat slower on the core level due to the increased amount of work, which, in case of the xorshift variants, leads to a lower performance than for the fallback variant up to seven cores, and non-saturation when only physical cores are used. SMT can close this gap by filling pipeline bubbles on the core level, and all RNG versions end up at exactly the same performance with 20 SMT threads. On BDW the full-socket situation is similar, but all versions come closer to the practical bandwidth limit than on IVB, and the fallback variant is slower than all RNG versions at all core counts.

The bottom line is that even the most “expensive” on-the-fly RNG allows memory bandwidth saturation on both architectures, that the roofline model predictions are quite accurate, that the automatic spatial blocking in PVSC-DTM works as intended and yields the optimal in-memory code balance, and that the elimination of the stored random number stream causes the expected speedup even with high-quality RNGs.

Figure 4: Performance scaling of spMVM with graphene stencil kernels on blocks of vectors of size , 2, and 4 with variable coefficients (left) and on-the-fly RNGs (right) on BDW. (system size )

Figure 4 shows a performance comparison of stored random numbers and on-the-fly RNG for a 2D graphene application with four subsites, a block vector size of 1, 2, and 4, and four variable coefficients. The code balance goes down from 32 bytes/LUP to 28 bytes/LUP and finally to 26 bytes/LUP when going from to 2 and 4, approaching the limit of 24 bytes/LUP at . With on-the-fly RNGs substituting the variable-coefficient arrays this balance is achieved for any , which is shown in the right panel of Fig. 4. It can also be observed in the data that the improved SIMD vectorization with speeds up the code measurably in the nonsaturated regime, but this advantage vanishes close to saturation because the data transfer becomes the only bottleneck.

Figure 5: Performance scaling of the stencil spMVM kernel of a 3D TI model without (left) and with (right) on-the-fly dot products for , 2, 4, and 8 on IVB. (system size )
Figure 6: Performance scaling of the stencil spMVM kernel of a 3D TI model without (left) and with (right) on-the-fly dot products for , 2, and 4 on BDW. (system size )

Figures 5 and 6 show the performance of the stencil kernels of a 3D TI model with different on IVB and BDW. Two versions are shown for each architecture: The standard one and an optimized version with dot products fused into the stencil kernel, increasing the number of flops per update by six. The code balance for TI stencils is lower than for graphene or the 7-point stencil, hence more cores are required for bandwidth saturation.

At larger the loop body becomes more complicated, and the benefit of SIMD vectorization may be compensated by a more inefficient in-core execution due to register shortage and less effective out-of-order execution. This can be seen on IVB at , where the available number of cores is too small to reach saturation, as opposed to , where the SIMD width matches the number of block vectors.

Calculating dot products on the fly has a rather small impact on performance (less that 15%), which on BDW vanishes at saturation because of its generally lower machine balance. Still, overall the roofline model provides a good estimate of the expected socket-level performance of our matrix-free codes even for topological insulators. Note, however, that the saturation properties depend on many factors, such as the number of cores per socket, the memory bandwidth, the clock speed, and the SIMD width. An accurate prediction of speedup vs. number of cores would require a more advanced performance model, such as the ECM model described in 

Stengel et al. (2015).

Figure 7: Runtime of spMVM () for a strong scaling test case of the TI model (system size ) with 200 iterations on the Emmy cluster, comparing PVSC-DTM with the GHOST library. The dashed lines show the communication time only. Inset: ratio of communication time vs. total runtime. All codes were run with one MPI process per socket (ten cores) and ten OpenMP threads per process.
Figure 8: Runtime of spMVM () for a strong scaling test case of the TI model (system size ) with 200 iterations on the Emmy cluster, comparing PVSC-DTM with the GHOST library. The dashed lines show the communication time only. Inset: ratio of communication time vs. total runtime. All codes were run with one MPI process per socket (ten cores) and ten OpenMP threads per process.

In Figures 7 and 8 we compare PVSC-DTM the with the GHOST library developed by Kreutzer et al. (2017) using a strong scaling TI test case on the Emmy cluster. One MPI process was bound to each socket, with OpenMP parallelization (ten threads) across the cores. Both PVSC-DTM and GHOST were used in “vector mode,” i.e., without overlap between communication and computation. GHOST always uses explicitly stored matrices, which is why PVSC-DTM not only has the expected performance advantage due to its matrix-free algorithms but also requires less hardware to handle a given problem size. The maximum number of nodes was chosen such that a maximum communication overhead of about 40-50% (see insets) can be observed for PVSC-DTM, which is a reasonable upper limit in production runs for resource efficiency reasons.

In the test case in Fig. 7, GHOST requires at least 16 nodes for storing the matrix and two vectors. With the same resources, PVSC-DTM is about 5 faster, and can outperform GHOST already with four nodes. The ratio of communication to computation time is naturally larger with PVSC-DTM due to the faster code execution. Although this particular test case cannot be run on a single node with GHOST, the performance comparison at 16 nodes also reflects quite accurately the per-node (or per-socket, i.e., pure OpenMP) performance ratio between PVSC-DTM and GHOST, since at this point the communication overhead is still only 10–20%.

For the memory traffic caused by the matrix becomes less significant and the speedup of PVSC-DTM vs. GHOST gets smaller. In the smaller test case shown in Fig. 8 GHOST requires at least four nodes and is still about 2.5 slower than PVSC-DTM at that point. Again, this is also the expected per-socket speedup if it were possible to run the test case on a single socket with GHOST.

4 Algorithms and application examples

In large-scale simulations of any kind, avoiding global synchronization points is crucial for scalability. This challenge can be met by modern matrix polynomial methods. The kernel polynomial method, the Chebyshev time propagation approach described in Weiße and Fehske (2008) and Alvermann and Fehske (2008), and the high-performance Chebyshev filter diagonalization technique (ChebFD) implementation introduced in Pieper et al. (2016) are already available in PVSC-DTM. These algorithms benefit from partial dot products, vector blocking, and loop fusion. The high-order commutator-free exponential time-propagation algorithm introduced by Alvermann and Fehske (2011) for driven quantum systems will be implemented in the near future. In the following sections we give some examples for typical applications in the field of Dirac and topological materials.

4.1 Time Propagation

The time evolution of a quantum state is described by the Schrödinger equation. If the Hamilton operator does not explicitly depend on the time we can formally integrate this equation and express the dynamics in terms of the time evolution operator as with (). Expanding the time evolution operator into a finite series of first-kind Chebyshev polynomials of order , , we obtain [Tal-Ezer and Kosloff (1984); Fehske et al. (2009)]


Prior to the expansion the Hamiltonian has to be shifted and rescaled such that the spectrum of is within the definition interval of the Chebyshev polynomials, , where and are calculated from the extreme eigenvalues of : and . The expansion coefficients are given by


( denotes the -th order Bessel function of the first kind).

Calculating the evolution of a state from one time grid point to the adjacent one, , we have to accumulate the -weighted vectors . Since the coefficients depend on the time step but not on time explicitly, we need to calculate them only once. The vectors can be computed iteratively, exploiting the recurrence relation of the Chebyshev polynomials, , with and . Evolving the wave function from one time step to the next then requires MVMs of a given complex vector with the (sparse) Hamilton matrix of dimension and the summation of the resulting vectors after an appropriate rescaling. Thus, for time-independent , arbitrary large time steps are in principle possible at the expense of increasing . We may choose such that for the modulus of all expansion coefficients is smaller than a desired accuracy cutoff. This is facilitated by the fast asymptotic decay of the Bessel functions, for . Thus, for large , the Chebyshev expansion can be considered as quasi-exact. Besides the high accuracy of the method, the linear scaling of computation time with both time step and Hilbert space dimension are promising in view of potential applications to more complex systems. In our cases almost all computation time is spent in sparse MVMs, which can be efficiently parallelized, allowing for a good speedup on highly parallel computers. This also means that any significant speedup that can be achieved for the MVM, such as by our matrix-free formulation, will have a corresponding effect on the runtime of the overall algorithm. The actual speedup is a function of the memory traffic reduction; for instance, a sparse matrix stored in CRS format that describes a stencil-like neighborhood relation with eight neighbors will (in double precision) cause a minimum data traffic of approximately 7.6 bytes/flop when acting on a right-hand side vector. In a matrix-free formulation this balance reduces to 1.3 bytes/flop, leading to a performance improvement of 5.7 if the memory bandwidth can be saturated in both cases.

As an example, we apply the Chebyshev time evolution scheme the propagation and scattering of a Dirac electron wave packet on a graphene sheet with an imprinted gate-defined quantum dot array [Pieper et al. (2013); Fehske et al. (2015)]. This is a timely issue of of high experimental relevance [Walls and Hadad (2015); Caridad et al. (2016); Tang et al. (2016)]. We mimic the quantum dot array by implementing the potential in (1) as


with varying amplitude in direction. In (16), () is the radius of a single quantum dot (the nearest-neighbor distance between dots) and gives the dot’s position [ count in () direction]. The quantum dot lattice can be created by applying spatially confined top gate voltages. The gradient change of the dot potentials mimics spatially varying effective refraction indices for the Dirac electron waves.

Figure 9: Time evolution of a Dirac electron wave impinging on a graphene quantum dot array (visible by the bright spots). We consider a GNR (periodic BCs in direction) with an imprinted quantum dot lattice (). The radii of the quantum dots  nm, their (midpoint) distance  nm, and the dot potentials (parametrized by , ) vary along the direction between a minimum and maximum value of  nm and  nm or  nm, respectively. A (Gaussian) wave packet with momentum in direction was created at with at time . The panels give the (color coded) squared amplitude of the wave function at times , , and with  sec (from top to bottom).

Figure 9 illustrates the scattering and temporary particle confinement by the quantum dot array. It has been demonstrated by by Heinisch et al. (2013) and Pieper et al. (2013) that the normal modes of an isolated quantum dot lead to sharp resonances in the scattering efficiency. Appearing for particular values of , , and , they allow the “trapping” even of Dirac electrons. Of course, for the scattering setup considered here, only quasi-bound states can appear, which may have an exceptionally long lifetime, however. Thereby the energy of the wave is fed into vortices inside the dot. For a periodic array of dots the normal modes at neighboring dots can couple, leading to coherence effects (such inter-dot particle transfer takes place on a reduced energy scale compared to pure graphene [Fehske et al. (2015)]). The situation becomes more complicated – but also more interesting – when the dot potentials are modulated spatially or energetically. In this case, a direction-dependent transmission (cascaded Mie scattering [Caridad et al. (2016)]) or even the focusing of the electron beam outside the dots can be observed [Tang et al. (2016)]. Similar phenomena are demonstrated by Fig. 9. For this simulation the electron is created by a Gaussian wave packet


where is Dirac electron with momentum in direction. When the wavefront hits the dot region, the wave is partly trapped by the quantum dots, whereby – for the parameters used – the resonance conditions are better fulfilled near the lower/upper end of the dot array (here the particle wave is best captured). The other way around, the transmission (and also the reflection, i.e., the backscattering) is strongest in the central region, leading to a curved wavefront. For larger time values a second pulse (wavefront) emerges (note that we have reflections and repeated scattering events in our finite GNR, but the largest time considered, , is much shorter than the pass-through time of an unperturbed Dirac wave). In any case, one observes a strongly time- and direction-dependent emission pattern of our graphene-based nanostructure, which can be exploited to manipulate electron beams. Particularly interesting in this respect would be focusing of the electron beam with large focal length, such that the focal spot lies outside the array. Then the structure can be used as a coupler to other electronic units. Achieving this by tuning the gradient of the gate potential appears to be a very efficient way, which is more easily realized in practice than modifying the geometrical properties of the array such as the lattice gradient or the layer number [Tang et al. (2016)].

4.2 Interior eigenvalues – TIs

Since the electronic properties of TIs are mainly determined by the (topologically non-trival) surface states located in the bulk-state gap, an efficient calculation of electron states at or close to the center of a spectrum is of vital importance. This can be done by Chebyshev filter diagonalization (ChebFD), a straightforward scheme for interior eigenvalue computation, which is based on polynomial filter functions and therefore has much in common with the KPM. ChebFD applies a matrix polynomial filter that is suitable for the target interval to a block of vectors. In each iteration, the search space is checked for convergence using a Rayleigh-Ritz procedure. ChebFD has already proven its practical suitability: Parallelized and implemented on the “SuperMUC” supercomputer at LRZ Garching, central eigenvalues of a -dimensional sparse matrix have been calculated at 40 Tflop/s sustained performance by Pieper et al. (2016).

Figure 10 shows the DOS of a strong TI. The focus is on the (pseudo-) gap region of the DOS. Implementing the effect of nonmagnetic impurities by uniformly distributed random on-site potentials , we see how disorder fills the gap that exists in the DOS of system with a finite number of sites (see the red curves in the upper panels). Introducing a finite , which mimics, e.g., the effect of an external magnetic field, the midgap Dirac cone formed by the surface states is broken up. Again, disorder will induce electronic states in the gap region generated by . This is demonstrated by the lower panel of Fig. 10, showing the DOS at the band center () in the - plane. As the disorder strength increases, more and more states pop up at until the DOS saturates when reaches the order of magnitude of the bulk band gap. For a more detailed investigation of disordered (weak and strong) TI we refer the reader to Kobayashi et al. (2013) where, besides the phase diagram, also the DOS was calculated using the KPM (see supplementary material in that paper). Compared to KPM, our ChebFD approach yields a better resolution at the same computational cost in the target interval (band center), which is important regarding the scientific applications.

Figure 10: Density of states for a strong TI described by the Hamiltonian (4) with , , and open (periodic) BCs in ( and ) direction. Top panel: DOS without (; black curve) and with [; red curve] disorder, where . Data obtained by KPM with stochastic trace evaluation for a cuboid with sites. Middle panel: Zoom-in of the central part of the spectrum with the target interval used for the ChebFD calculations [Pieper et al. (2016)]. Bottom panel: DOS at the band center () in dependence on the gap parameter and the disorder strength [Pieper and Fehske (2016)

]. Applying the KPM, 2048 Chebyshev moments were used for a system with

sites. Note that the finite DOS at

is a finite-size effect and due to the finite KPM resolution (variance


The ChebFD algorithm is robust and scalable, but algorithmically sub-optimal. In PVSC-DTM we have also implemented the trLanczosFD (thick-restart Lanczos with polynomial filters) algorithm by Li et al. (2016). This algorithm benefits particularly from a matrix-free formulation. A thorough description would exceed the scope of this paper; however, in Table 2 we show runtime data and the maximum residuum of the inner Ritz eigenvalues for trLanczosFD on two TI test cases in comparison with ChebFD. TrLanczosFD outperforms ChebFD by a factor of almost four (using PVSC-DTM for both).

test case 1 test case 2
size 4804806 2402406
eigenpairs 72 40
Emmy nodes 8 8
runtime [s] 2852 642
max res.
runtime [s] 760 142
max res.
Table 2: Test cases for the filter diagonalization method, using a matrix for TI with all eigenvalues in the range [-5.5:5.5]. Runtime and residuum data for runs with eight nodes on the Emmy cluster are shown for the ChebTB and the trLanczosFD algorithm, respectively.

4.3 Disorder effects – Weyl semimetals

The Weyl nodes in the gapless topological Weyl semimetals are believed to be robust against perturbations unless, e.g., the translation or charge conservation symmetry is broken. Showing the stability of a single or a pair of Weyl nodal points against disorder has been the subject of intense research [Liu et al. (2016); Chen et al. (2015); Pixley et al. (2015); McCormick et al. (2017); Shapourian and Hughes (2016); Zhao and Wang (2015)]. Due to the vanishing DOS at the nodal points, disorder effects can be expected to be particularly pronounced. Since analytic methods fail widely in their quantitative predictions, even in the case of weak disorder, we use a purely numerical, KPM-based approach to analyze the spectral properties of Weyl semimetals with real-space quenched potential disorder.

Figure 11 displays the momentum-resolved spectral function of a disordered Weyl metal along different paths in the bulk Brillouin zone. The photoemission spectra shown were calculated for the model (8) with random potentials drawn from a uniform box distribution of strength , i.e., . The presented data should be compared with the results for the clean case provided by Fig. 1 (c). Most notably, the electronic states at the Fermi arc (connecting the nodal points) and its immediate vicinity are hardly influenced by weak and even intermediate disorder. This does not apply for states further away from the Fermi surface. Here, the spectral signatures (band dispersion) are rapidly washed out, even for weak disorder. Of course, strong disorder will also affect the Fermi arc and the nodal points: Above a certain disorder strength they will be smeared out in both energy and momentum space and, as a result, the Weyl semimetal will transform into a diffusive metal with a finite DOS at the nodal points. A more detailed investigation of the spectral properties would be desirable in order to confirm the very recent evidence found by Su et al. (2017) for an intermediate Chern insulator state between the disordered Weyl semimetallic and diffusive metallic phases. At even stronger disorder, the distribution of the local density of states significantly broadens (just as in the case of strongly disordered strong TIs [Schubert et al. (2012)] or disordered GNR [Schubert et al. (2009); Schubert and Fehske (2012)]) and Anderson localization sets in [Pixley et al. (2015)].

Figure 11: Spectral function for a disordered Weyl semimetal with Fermi arcs, as obtained from the respective equation (7) for a 3D system with sites and periodic (open) BCs in () directions. The test wave function is initialized only on one surface. Left: along the direction for . Right: in the - plane () at . The disorder strength (top panels), (top panels), and (bottom panels).

5 Conclusion and outlook

The PVSC-DTM DSL and library have been demonstrated to be powerful tools for generating high-performance code to investigate ground-state, spectral, and dynamic properties of graphene, TIs, and other materials whose physics is governed by short-range interactions that lead to stencil-like numerical kernels. Due to is matrix-free design, PVSC-DTM outperforms matrix-based libraries such as GHOST. It also implements effective SIMD vectorization and fast on-the-fly random number generation and yields optimal memory-bound chip-level performance as shown by the roofline model. Spatial blocking of the iteration loops is fully automatic and based on layer conditions.

Several improvements to the library are left for future work: A better integration of the random number generator with the inner update loop would increase the non-saturated and sequential performance. Overlapping computation with communication would improve the distributed-memory parallel efficiency. Both optimizations are prerequisites for a possible integration with temporal blocking frameworks such as Girih by Malas et al. (2015), which would further boost the chip-level performance. The system geometry is currently limited to rectangular and cuboid domains, a restriction that may be lifted to support more interesting physical setups. Finally we plan to implement more algorithms in order to make the library more versatile beyond the showcases described here.


We thank Andreas Alvermann, Rafael L. Heinisch, and Gerhard Wellein for fruitful discussions. A.P. is grateful for financial support by KONWIHR, the Bavarian Competence Network for High Performance Computing (project PVSC-TM). We are also indebted to the LRZ Garching for providing access to HLRB-II.


  • Alvermann and Fehske (2008) Alvermann A and Fehske H (2008) Chebyshev approach to quantum systems coupled to a bath. Phys. Rev. B 77: 045125. DOI:10.1103/PhysRevB.77.045125.
  • Alvermann and Fehske (2011) Alvermann A and Fehske H (2011) High-order commutator-free exponential time-propagation of driven quantum systems. J. Comput. Phys. 230(15): 5930–5956. DOI:10.1016/j.jcp.2011.04.006.
  • Caridad et al. (2016) Caridad JM, Connaughton S, Ott C, Weber HB and Krstić V (2016) An electrical analogy to Mie scatterings. Nature Communications 7: 12894. DOI:10.1038/ncomms12894.
  • Castro Neto et al. (2009) Castro Neto AH, Guinea F, Peres NMR, Novoselov KS and Geim AK (2009) The electronic properties of graphene. Rev. Mod. Phys. 81: 109–162. DOI:10.1103/RevModPhys.81.109.
  • Chen et al. (2015) Chen CZ, Song J, Jiang H, Sun Qf, Wang Z and Xie XC (2015) Disorder and metal-insulator transitions in Weyl semimetals. Phys. Rev. Lett. 115: 246603. DOI:10.1103/PhysRevLett.115.246603.
  • Fehske et al. (2015) Fehske H, Hager G and Pieper A (2015) Electron confinement in graphene with gate-defined quantum dots. physica status solidi (b) 252(8): 1868–1871. DOI:10.1002/pssb.201552119.
  • Fehske et al. (2009) Fehske H, Schleede J, Schubert G, Wellein G, Filinov VS and Bishop AR (2009) Numerical approaches to time evolution of complex quantum systems. Physics Letters A 373: 2182 – 2188. DOI:10.1016/j.physleta.2009.04.022.
  • Fu et al. (2007) Fu L, Kane CL and Mele EJ (2007) Topological insulators in three dimensions. Phys. Rev. Lett. 98: 106803. DOI:10.1103/PhysRevLett.98.106803.
  • Gropp et al. (2000) Gropp WD, Kaushik DK, Keyes DE and Smith BF (2000) Towards realistic performance bounds for implicit CFD codes. In: Keyes D, Periaux J, Ecer A, Satofuka N and Fox P (eds.) Parallel Computational Fluid Dynamics 1999. Elsevier. ISBN 978-0-444-82851-4, pp. 241–248. DOI:10.1016/B978-044482851-4.50030-X.
  • Hasan and Kane (2010) Hasan MZ and Kane CL (2010) Colloquium: Topological insulators. Rev. Mod. Phys. 82: 3045–3067. DOI:10.1103/RevModPhys.82.3045.
  • Hasan et al. (2017) Hasan MZ, Xu SY, Belopolski I and Huang SM (2017) Discovery of Weyl fermion semimetals and topological Fermi arc states. Annual Review of Condensed Matter Physics 8(1): 289–309.
  • Heinisch et al. (2013) Heinisch RL, Bronold FX and Fehske H (2013) Mie scattering analog in graphene: Lensing, particle confinement, and depletion of Klein tunneling. Phys. Rev. B 87: 155409. DOI:10.1103/PhysRevB.87.155409.
  • Ilic et al. (2014) Ilic A, Pratas F and Sousa L (2014) Cache-aware roofline model: Upgrading the loft. IEEE Computer Architecture Letters 13(1): 21–24. DOI:10.1109/L-CA.2013.6.
  • Kobayashi et al. (2013) Kobayashi K, Ohtsuki T and Imura KI (2013) Disordered weak and strong topological insulators. Phys. Rev. Lett. 110: 236803. DOI:10.1103/PhysRevLett.110.236803.
  • Kreutzer et al. (2017) Kreutzer M, Thies J, Röhrig-Zöllner M, Pieper A, Shahzad F, Galgon M, Basermann A, Fehske H, Hager G and Wellein G (2017) GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems. International Journal of Parallel Programming 45(5): 1046–1072. DOI:10.1007/s10766-016-0464-z.
  • L’Ecuyer and Simard (2007) L’Ecuyer P and Simard R (2007) TestU01: A C library for empirical testing of random number generators. ACM Trans. Math. Softw. 33(4): 22:1–22:40. DOI:10.1145/1268776.1268777.
  • Li et al. (2016) Li R, Xi Y, Vecharynski E, Yang C and Saad Y (2016) A thick-restart Lanczos algorithm with polynomial filtering for Hermitian eigenvalue problems. SIAM Journal on Scientific Computing 38(4): A2512–A2534. DOI:10.1137/15M1054493.
  • Liu et al. (2016) Liu S, Ohtsuki T and Shindou R (2016) Effect of disorder in a three-dimensional layered chern insulator. Phys. Rev. Lett. 116: 066401. DOI:10.1103/PhysRevLett.116.066401.
  • Malas et al. (2015) Malas T, Hager G, Ltaief H and Keyes D (2015) Multi-dimensional intra-tile parallelization for memory-starved stencil computations. ArXiv e-prints 1510.04995. To be published in ACM Transactions on Parallel Computing (TOPC).
  • Marsaglia (2003) Marsaglia G (2003) Xorshift RNGs. Journal of Statistical Software 8(1): 1–6. DOI:10.18637/jss.v008.i14.
  • McCalpin (1991-2007) McCalpin JD (1991-2007) STREAM: Sustainable memory bandwidth in high performance computers. Technical report, University of Virginia, Charlottesville, VA. URL http://www.cs.virginia.edu/stream/. A continually updated technical report.
  • McCormick et al. (2017) McCormick TM, Kimchi I and Trivedi N (2017) Minimal models for topological Weyl semimetals. Phys. Rev. B 95: 075133. DOI:10.1103/PhysRevB.95.075133.
  • Pieper and Fehske (2016) Pieper A and Fehske H (2016) Topological insulators in random potentials. Phys. Rev. B 93: 035123. DOI:10.1103/PhysRevB.93.035123.
  • Pieper et al. (2013) Pieper A, Heinisch RL and Fehske H (2013) Electron dynamics in graphene with gate-defined quantum dots. EPL (Europhysics Letters) 104(4): 47010. URL http://stacks.iop.org/0295-5075/104/i=4/a=47010.
  • Pieper et al. (2016) Pieper A, Kreutzer M, Alvermann A, Galgon M, Fehske H, Hager G, Lang B and Wellein G (2016) High-performance implementation of Chebyshev filter diagonalization for interior eigenvalue computations. Journal of Computational Physics 325: 226–243. DOI:10.1016/j.jcp.2016.08.027.
  • Pixley et al. (2015) Pixley JH, Goswami P and Das Sarma S (2015) Anderson localization and the quantum phase diagram of three dimensional disordered Dirac semimetals. Phys. Rev. Lett. 115: 076601. DOI:10.1103/PhysRevLett.115.076601.
  • Qi and Zhang (2011) Qi XL and Zhang SC (2011) Topological insulators and superconductors. Rev. Mod. Phys. 83: 1057–1110. DOI:10.1103/RevModPhys.83.1057.
  • Ragan-Kelley et al. (2013) Ragan-Kelley J, Barnes C, Adams A, Paris S, Durand F and Amarasinghe S (2013) Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13. New York, NY, USA: ACM. ISBN 978-1-4503-2014-6, pp. 519–530. DOI:10.1145/2491956.2462176. URL http://doi.acm.org/10.1145/2491956.2462176.
  • Rivera and Tseng (2000) Rivera G and Tseng CW (2000) Tiling optimizations for 3D scientific computations. In: Supercomputing, ACM/IEEE 2000 Conference. pp. 32–32. DOI:10.1109/SC.2000.10015.
  • Schmitt et al. (2014) Schmitt C, Kuckuk S, Hannig F, Köstler H and Teich J (2014) Exaslang: A domain-specific language for highly scalable multigrid solvers. In: 2014 Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing. pp. 42–51. DOI:10.1109/WOLFHPC.2014.11.
  • Schubert and Fehske (2012) Schubert G and Fehske H (2012) Metal-to-insulator transition and electron-hole puddle formation in disordered graphene nanoribbons. Phys. Rev. Lett. 108: 066402. DOI:10.1103/PhysRevLett.108.066402.
  • Schubert et al. (2012) Schubert G, Fehske H, Fritz L and Vojta M (2012) Fate of topological-insulator surface states under strong disorder. Phys. Rev. B 85: 201105. DOI:10.1103/PhysRevB.85.201105.
  • Schubert et al. (2009) Schubert G, Schleede J and Fehske H (2009) Anderson disorder in graphene nanoribbons: A local distribution approach. Phys. Rev. B 79: 235116. DOI:10.1103/PhysRevB.79.235116.
  • Shapourian and Hughes (2016) Shapourian H and Hughes TL (2016) Phase diagrams of disordered Weyl semimetals. Phys. Rev. B 93: 075108. DOI:10.1103/PhysRevB.93.075108.
  • Sitte et al. (2012) Sitte M, Rosch A, Altman E and Fritz L (2012) Topological insulators in magnetic fields: Quantum hall effect and edge channels with a nonquantized term. Phys. Rev. Lett. 108: 126807. DOI:10.1103/PhysRevLett.108.126807.
  • Stengel et al. (2015) Stengel H, Treibig J, Hager G and Wellein G (2015) Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. In: Proceedings of the 29th ACM International Conference on Supercomputing, ICS ’15. New York, NY, USA: ACM. DOI:10.1145/2751205.2751240.
  • Su et al. (2017) Su Y, Wang XS and Wang SR (2017) A generic phase between disordered Weyl semimetal and diffusive metal. Scientific Reports 7: 14382. DOI:10.1038/s41598-017-14760-8.
  • Tal-Ezer and Kosloff (1984) Tal-Ezer H and Kosloff R (1984) An accurate and efficient scheme for propagating the time dependent schrödinger equation. The Journal of Chemical Physics 81(9): 3967–3971. DOI:10.1063/1.448136.
  • Tang et al. (2016) Tang Y, Cao X, Guo R, Zhang Y, Che Z, Yannick FT, Zhang W and Du J (2016) Flat-lens focusing of electron beams in graphene. Scientific Reports 6: 33522. DOI:10.1038/srep33522.
  • Tang et al. (2011) Tang Y, Chowdhury RA, Kuszmaul BC, Luk CK and Leiserson CE (2011) The Pochoir stencil compiler. In: Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’11. New York, NY, USA: ACM. ISBN 978-1-4503-0743-7, pp. 117–128. DOI:10.1145/1989493.1989508.
  • Vigna (2014) Vigna S (2014) Further scramblings of Marsaglia’s xorshift generators. ArXiv e-prints .
  • Walls and Hadad (2015) Walls J and Hadad D (2015) Suppressing Klein tunneling in graphene using a one-dimensional array of localized scatterers. Scientific Reports 5. DOI:10.1038/srep08435.
  • Wan et al. (2011) Wan X, Turner AM, Vishwanath A and Savrasov SY (2011) Topological semimetal and Fermi-arc surface states in the electronic structure of pyrochlore iridates. Phys. Rev. B 83: 205101. DOI:10.1103/PhysRevB.83.205101.
  • Weiße and Fehske (2008) Weiße A and Fehske H (2008) Lecture Notes in Physics 739: 529.
  • Weiße et al. (2006) Weiße A, Wellein G, Alvermann A and Fehske H (2006) The kernel polynomial method. Rev. Mod. Phys. 78: 275–306. DOI:10.1103/RevModPhys.78.275.
  • Williams et al. (2009) Williams S, Waterman A and Patterson D (2009) Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52(4): 65–76. DOI:10.1145/1498765.1498785.
  • Xu et al. (2015) Xu SY, Belopolski I, Alidoust N, Neupane M, Bian G, Zhang C, Sankar R, Chang G, Yuan Z, Lee CC, Huang SM, Zheng H, Ma J, Sanchez DS, Wang B, Bansil A, Chou F, Shibayev PP, Lin H, Jia S and Hasan MZ (2015) Discovery of a Weyl fermion semimetal and topological Fermi arcs. Science 349(6248): 613–617. DOI:10.1126/science.aaa9297.
  • Yang et al. (2011) Yang KY, Lu YM and Ran Y (2011) Quantum Hall effects in a Weyl semimetal: Possible application in pyrochlore iridates. Phys. Rev. B 84: 075129. DOI:10.1103/PhysRevB.84.075129.
  • Zhang et al. (2017) Zhang N, Driscoll M, Markley C, Williams S, Basu P and Fox A (2017) Snowflake: A lightweight portable stencil dsl. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). pp. 795–804. DOI:10.1109/IPDPSW.2017.89.
  • Zhao and Wang (2015) Zhao YX and Wang ZD (2015) Disordered Weyl semimetals and their topological family. Phys. Rev. Lett. 114: 206602. DOI:10.1103/PhysRevLett.114.206602.