1 Introduction
Dirac-type semimetals and topological insulators are a new materials platform with enormous application potential in fields ranging from nanoelectronics, plasmonics, and optics to quantum information and computation. Their striking electronic, spectroscopic, and transport properties result from spin-polarized (chiral), (semi)metallic surface states, which are located in the middle of the spectrum and show linear dispersion to a good approximation. The discovery of such massless Dirac fermions in graphene by Castro Neto et al. (2009), on the surface of topological insulators by Hasan and Kane (2010), and in Weyl semimetals by Xu et al. (2015) has triggered the investigation of Dirac physics in crystals, thereby shifting, in a certain sense, the focus of current condensed matter research from strong-correlation to topological effects.
Whether a material may realize distinct topological phases is dictated by the dimension, the lattice structure and associated electronic band structure including the boundary states, and the relevant interactions, all reflected in the system’s Hamilton operator and its symmetries. It is therefore of great interest to determine and analyze the ground-state and spectral properties of paradigmatic model Hamiltonians for topological matter. This can be achieved by means of unbiased numerical approaches.
PVSC-DTM is a highly parallel, vectorized (matrix-free) stencil code for investigating the properties of two-dimensional (2D) graphene and graphene nanoribbons (GNRs), three-dimensional (3D) topological insulators, and Weyl semimetals, including disorder effects, by means of modern numerical methods based on matrix polynomials. Due to the complexity of the problem, a considerable amount of computation is required. Thus, one design goal of PVSC-DTM was to build highly parallel software that supports the architectural features of modern computer systems, notably SIMD (Single Instruction Multiple Data) parallelism, shared-memory thread parallelism, and massively parallel, distributed-memory parallelism. On the compute-node level, the development process was guided by performance models to ensure that the relevant bottleneck is saturated. The major methodological advantage over existing software packages for similar purposes is that all matrix operations are performed without an explicitly stored matrix, which greatly reduces the pressure on the memory interface and opens possibilities for advanced optimizations developed for stencil-type algorithms.
In order to ease the burden on users and still provide the flexibility to adapt the code to different physical setups, a domain-specific language (DSL) was developed that allows for a formulation of the problem without any reference to a specific implementation, let alone optimization. The actual code is generated automatically, including parallelization and blocking optimizations. Although several stencil DSLs have been developed (see, e.g., Tang et al. (2011); Zhang et al. (2017); Schmitt et al. (2014)), some even with specific application fields in mind, such as Ragan-Kelley et al. (2013), there is to date no domain-specific approach to generating efficient stencil code from a high-level representation for algorithms describing the specific quantum systems mentioned above. Since optimal blocking factors are calculated automatically from machine properties, performance tuning (automatic or manual) on the generated code or within the code generation phase is not required.
This report gives an overview of the physical motivation and describes in detail the implementation of the framework, including the DSL. Performance models are developed to confirm optimal resource utilization on the chip level and to assess the potential of code optimizations such as spatial blocking and on-the-fly random number generation. Performance comparisons on the node level and at high parallelism with matrix-bound techniques (using the GHOST library) show the benefit of a matrix-free formulation. The code is freely available for download at http://tiny.cc/PVSCDTM.
For the benchmark tests we used two different compute nodes: a dual-socket Intel Xeon E5-2660 v2 “Ivy Bridge” (IVB) node with 10 cores per socket and a nominal clock speed of 2.2 GHz, and an Intel Xeon E5-2697 v4 “Broadwell” (BDW) node with 18 cores per socket and a nominal clock speed of 2.3 GHz. In all cases the clock frequency was fixed to the nominal value (i.e., Turbo Boost was not used). The “cluster on die” (CoD) mode was switched off on BDW, so both systems ran with two ccNUMA domains. The maximum achievable per-socket memory bandwidth was 40 Gbyte/sec on IVB and 61 Gbyte/sec on BDW. The Intel C/C++ compiler in version 16.0 was used for compiling the source code.
For all distributed-memory benchmarks we employed the “Emmy” cluster at RRZE (Erlangen Regional Computing Center). This cluster comprises over 500 of the IVB nodes described above, each equipped with 64 GB of RAM and connected via a fully non-blocking fat-tree InfiniBand network.
This paper is organized as follows. Section 2 introduces typical lattice-model Hamiltonians for graphene nanoribbons with imprinted quantum dots, strong topological insulators, and disordered Weyl semimetals. A matrix-free method and code for the calculation of electronic properties of these topological systems is described in Sec. 3, with a focus on the domain-specific language that serves as input to a code generator. To validate and benchmark the performance of the numerical approach, the proposed PVSC-DTM scheme is executed for several test cases. Section 4 describes the matrix-polynomial algorithms used for some physical ‘real-world’ applications. Finally, we conclude in Sec. 5.
2 Model Hamiltonians
In this section we specify the microscopic model Hamiltonians under consideration, in a form best suited for the application of the PVSC-DTM stencil code. The emergence of Dirac-cone physics is demonstrated.
2.1 Graphene
Graphene consists of carbon atoms arranged in a 2D honeycomb lattice structure (see the review by Castro Neto et al. (2009)). The honeycomb lattice is not a Bravais lattice, but can be viewed as a triangular Bravais lattice with a two-atom basis. Taking into account only nearest-neighbor hopping processes, the effective tight-binding Hamiltonian for the π electrons in graphene can be brought into the following form:
(1)  
Here and in what follows we use units such that ℏ = 1 and measure the energy in terms of the carbon–carbon electron transfer integral; N is the number of lattice sites. The first term describes the particle transfer between neighboring cells (containing four atoms each) in the two lattice directions, while the second term gives the transfer within the cells and the on-site potentials, which can vary within the cells and from cell to cell in the case of disorder. The corresponding 4×4 matrices are
(2)  
The band structure of pure graphene,
(3)  
(in the following, we set the lattice constant to unity), exhibits two bands (the upper antibonding and the lower bonding band) that touch each other at so-called Dirac points; close to any of those the dispersion becomes linear (see Castro Neto et al. (2009)). The graphene Hamiltonian respects time-inversion symmetry, which implies E(−k) = E(k), and if k_D is a solution of E(k) = 0 [which is the Fermi energy for intrinsic (undoped) graphene], so is −k_D, i.e., the Dirac points occur in pairs.
Compared to the band structure of an infinite 2D graphene sheet, the DOS of finite GNRs is characterized by a multitude of Van Hove singularities, as shown by Castro Neto et al. (2009) and Schubert et al. (2009). For zigzag GNRs, the strong signature at E = 0 indicates the high degeneracy of edge states, as shown in Fig. 1 (a). By contrast, armchair GNRs are gapped around E = 0; this finite-size gap vanishes when the width of the ribbon tends to infinity.
2.2 Topological insulators
The remarkable properties of 3D topological insulators (TIs) result from a particular topology of their band structure, which exhibits a gapped (i.e., insulating) bulk and gapless (i.e., metallic), linearly dispersed Dirac surface states (see reviews by Hasan and Kane (2010) and Qi and Zhang (2011)). Bulk–surface correspondence implies, as shown by Fu et al. (2007), that so-called weak TIs (which are less robust against the influence of nonmagnetic impurities) feature no or an even number of helical Dirac cones, while strong (largely robust) TIs have a single Dirac cone.
As a minimal theoretical model for a 3D TI with cubic lattice structure we consider – inspired by strained 3D HgTe and related insulator families, as studied in Sitte et al. (2012) – the following four-band Hamiltonian:
(4)  
where ψ_r is a four-component spinor at site r. The Hamiltonian is expressed in terms of the five Dirac matrices Γ^(1), …, Γ^(5) and their ten commutators Γ^(ab) = [Γ^(a), Γ^(b)]/(2i), which satisfy the Clifford algebra {Γ^(a), Γ^(b)} = 2δ_ab 𝟙, with 𝟙 being the identity and the two sets of Pauli matrices referring to orbital and spin space, respectively (see Schubert et al. (2012) and Pieper and Fehske (2016)). Hence, H constitutes a complex, sparse, banded matrix with seven subdiagonals of small dense 4×4 blocks. The corresponding band dispersion reads:
(5)  
The mass parameter of the model can be used to tune the band structure: depending on its value, either a weak TI with two Dirac cones per surface, a strong TI with a single Dirac cone per surface (see Fig. 1 (b)), or a conventional band insulator arises. External magnetic fields induce additional couplings that break the inversion symmetry and, in part, also the time-inversion symmetry.
We now describe for the TI problem how the density of states (DOS) and the single-particle spectral function depicted in Fig. 1 are obtained using state-of-the-art exact diagonalization and kernel polynomial methods, as described in Weiße and Fehske (2008) and Weiße et al. (2006), respectively. For a given sample geometry (and disorder realization), these quantities are given as
(6) 
and
(7) 
where ω denotes the energy (frequency), k the wave vector (crystal momentum) in Fourier space, and |n⟩ the single-particle eigenstate with energy E_n. The four-component (ket) spinor can be used to construct a Bloch state by performing the scalar product with the canonical (bra) basis vectors of position and band-index space. See Schubert et al. (2012) for details.
2.3 Weyl semimetals
The Weyl semimetallic phase, which can be observed, e.g., in TaAs (see Xu et al. (2015)), is characterized by a set of linearly dispersive band-touching points of two adjacent bands, the so-called Weyl nodes. The Weyl points are associated with chiral fermions, which behave in momentum space like magnetic monopoles. Unlike the 2D Dirac points in graphene, the 3D Weyl nodes are protected by the symmetry of the band structure and, as long as there is no translational-symmetry-breaking intervalley mixing between different Weyl nodes, the Weyl semimetal is robust against perturbations, as was shown by Yang et al. (2011). In this way a Weyl semimetal hosts, like a TI, metallic topological surface states (arising from bulk topological invariants). However, while the topological surface states of TIs give rise to a closed Fermi surface (in momentum space), the surface-state band structure of Weyl semimetals is more exotic; it forms open curves, the so-called Fermi arcs, which terminate on the bulk Weyl points (see Wan et al. (2011)).
The minimal theoretical models for topological Weyl semimetals have been reviewed quite recently by McCormick et al. (2017). Here we consider the following 3D lattice Hamiltonian,
(8)  
where ψ_r is now a two-component spinor and σ = (σ_x, σ_y, σ_z) are the Pauli matrices (again, the lattice constant is set to unity, just as the transfer element). In momentum space, the two-band Bloch–Hamilton matrix takes the form
(9)  
developing two Weyl nodes, as seen in Fig. 1 (c) [Hasan et al. (2017)].
3 Matrixfree code for topological systems with Dirac cones
In general, topological materials have a rather complex lattice structure, which is nevertheless so regular that a matrix-free formulation of sparse matrix–vector multiplication (spMVM) and similar kernels is feasible due to the stencil-like neighbor relations. The lattice is always periodic (apart from disorder effects), but particle transfer integrals and interactions vary widely among materials. In other words, the resulting stencil geometry depends strongly on the physics, and in a way that makes it impossible to set up optimal code for all possible situations in advance, although the core algorithm is always a stencil-like update scheme. The required blocking strategies for optimal code performance also vary with the stencil shape. Consequently, it is worthwhile to generate the code for a particular physical setup. This allows us to hard-code performance-relevant features and removes a lot of uncertainty about compiler optimizations. In this section we describe some of the design goals and implementation details of our matrix-free code, including the DSL, representative benchmark cases, and performance results.
3.1 Preliminary considerations
Many numerical algorithms for quantum systems, such as eigenvalue solvers or methods for computing spectral properties like the Kernel Polynomial Method (KPM) (reviewed in Weiße et al. (2006)), require the multiplication of large sparse matrices with one or more right-hand-side vectors as a time-consuming component. If the matrix is stored explicitly in memory and special structures such as symmetry and dense subblocks are not exploited, the data transfer between the CPU and the main memory is the performance-limiting bottleneck. An upper limit for the performance of a typical linear algebra building block such as spMVM can thus be easily calculated by means of the naive roofline model, which was popularized by Williams et al. (2009):
(10)  P = min(P_peak, b_S / B_c)
This model assumes that the performance of a loop is either limited by the computational peak performance of the CPU (P_peak) or by the maximum performance allowed by the memory data transfers (b_S / B_c), whichever is more stringent. In the case of spMVM and similar algorithms on any modern multicore CPU, the former is much larger than the latter, so we can safely ignore it here. b_S is the achievable main memory bandwidth in bytes/s; it can be determined by a suitable benchmark, such as STREAM by McCalpin (1991–2007). B_c is the code balance, i.e., the ratio of the required data volume through the memory interface (in bytes) and the amount of work (usually floating-point operations, but any valid “work” metric will do). Clearly, P = b_S / B_c is then an upper limit for the expected performance of the loop. In practice one can determine the code balance by code inspection and an analysis of data access locality. Whenever the data traffic cannot be calculated accurately, e.g., because some indirect and unpredictable access is involved, it is often possible to give at least a lower limit for B_c and thus an absolute upper limit for the performance. A “good” code in the context of the roofline model is a code whose performance is near the roofline limit. Once this has been achieved, any optimization that lowers B_c will increase the performance accordingly. Many refinements of the model have been developed to make it more accurate in situations where the bottleneck is not so clearly identified, e.g., by Ilic et al. (2014) and Stengel et al. (2015).
It was first shown by Gropp et al. (2000) that the minimal code balance of spMVM for double-precision, real matrices in CRS format with a 32-bit index is 6 bytes/flop, leading to memory-bound execution if the matrix does not fit into a cache. A matrix-free formulation can greatly reduce the demand for data and leads, in the case of many topological materials, to stencil-like update schemes. Although those are limited by memory bandwidth as well, the code balance can be very low depending on the particular stencil shape and on whether layer conditions (LCs) are satisfied. The concept of layer conditions was conceived by Rivera and Tseng (2000) and applied in the context of advanced analytic performance models by Stengel et al. (2015). In the following we briefly describe the optimizations that were taken into account, using a simple five-point stencil as an example.
Listing 1 shows one update sweep of this code, i.e., one complete update of one LHS vector. In a matrix-bound formulation the coefficients c1, …, c4 would be stored in memory as separate, explicit arrays. In addition to the RHS vector, the array noise[] is read, implementing a random potential. As is customary with stencil algorithms, we use the lattice site update (LUP) as the principal unit of work, which allows us to decouple the analysis from the actual number of flops executed in the loop body.
The minimum code balance of this loop nest for data in memory is 32 bytes/LUP, because each LHS element must be updated in memory (16 bytes, including the write-allocate transfer), and each RHS and noise element must be loaded (eight bytes each). If nontemporal stores can be used for y[], the code balance reduces to 24 bytes/LUP because the write-allocate transfers do not occur and data is written to memory directly. The minimum code balance can only be attained, however, if the highlighted RHS elements do not have to be loaded from memory. This LC is satisfied if at least three successive rows of x[] fit into the cache. Assuming that the noise[] array and the LHS also require one row of cache space each, the condition reads:

(11)  5 · 8 B · N_i < C_t,

where N_i is the length of a row (the inner grid dimension) and C_t is the available cache size in bytes per thread.
In multithreaded execution with outer loop parallelization via OpenMP, each thread must have its LC fulfilled. If the condition is broken, the inner loop can be blocked with a block size of ib, and the condition will be satisfied if

(12)  5 · 8 B · ib < C_t.
If the blocking is done for a cache level that is shared among the threads in the team, the LC gets more and more stringent the more threads are used. For best singlethreaded performance it is advisable to block for an inner cache, i.e., L1 or L2. See Stengel et al. (2015) for more details. Our production code determines the optimal block size according to (12).
The noise[] array is a significant contribution to the code balance. However, its contents are static and can be generated on the fly, trading arithmetic effort for memory traffic. The generation of random numbers should certainly be fast and vectorizable, so as not to cause too much overhead. See Section 3.3 below for details.
Some algorithmic variants require the concurrent, independent execution of stencil updates on multiple source and target vectors. Although SIMD vectorization is easily possible even with a single update by leveraging the data parallelism along the inner dimension, a more efficient option exists for multiple concurrent updates: If the vectors can be stored in an interleaved way, i.e., with the leading dimension going across vectors, vectorization along this dimension is straightforward if the number of vectors is large compared to the SIMD width. As opposed to the traditional scheme, perfect data alignment can be achieved (if this is required) and no shuffling of data in SIMD registers is necessary for optimal register reuse. See Listing 2 for an example using a simple five-point stencil. The considerations about LCs do not change, apart from the fact that now each RHS vector needs to have its own LC fulfilled. Condition (12) is thus modified to
(13)  5 · n_b · 8 B · ib < C_t

if n_b is the number of concurrent updates and the noise[] arrays have to be loaded from memory.
3.2 DomainSpecific Language (DSL)
In order to provide a maximum amount of flexibility to users and still guarantee optimal code, a DSL was constructed which is used to define the physical problem at hand. A precompiler written in Python then generates OpenMP-parallel C code for the sparse matrix–vector multiplication (a “lattice sweep”), which can be handled by a standard compiler. In the following we describe the DSL in detail by example.
The source code for the DSL program resides in a text file. It begins with a specification of the problem dimensionality (2D/3D) and the basis size:
The stencil coefficients can take various forms: constant, variable, or random. The number of coefficients of each kind is set by the keywords n_coeff_*, where “*” is one of the three options. E.g., for four variable coefficients:
The command nn with two or three arguments (depending on the dim parameter) and the following size lines define a sparse coefficient matrix to a neighbouring lattice block at the offset defined by the arguments of nn. Multiple entries in a line are separated by “;”. Optionally the first entry begins with “l” followed by the row index. A single block entry is written as “{column index}{value}” for a fixed value, or as “{column index}{type}{type index or value}” for a different type. This is a simple example for a coefficient matrix one lattice position to the left of the current position (set by nn 1 0), with a fixed entry at position (0,1) and another entry at position (3,2):
Note that all indexing is zero-based. The following coefficient types are allowed:

- Fixed coefficient, hard-coded (default type). I.e., an entry of “11” can also be written as “1f1”. It will be hard-coded into the generated C code.
- Constant coefficient per lattice site, read from an array of length n_coeff_const. E.g., “1c2” means that the coefficient used will be coeff_c[2]. The coefficient array can be changed at runtime if required, or preset in the DSL source code. For example, the line coeff_const_default 1 2 0.5 will initialize the array coeff_c[] with the specified values.
- Variable coefficient per lattice site, read from an array of length n_coeff_variable per lattice site.
- Random coefficient per lattice site, read from an array of length n_coeff_rand per lattice site.
In Listing 3 we show a complete example for a 2D graphene stencil with variable coefficients on the diagonal, while Listing 4 shows the generated C source code for the sparse matrix–vector multiplication (spMVM).
Coefficient arrays and execution parameters (such as, e.g., the grid size) can be configured and changed at runtime. The code repository (http://tiny.cc/PVSCDTM) contains numerous examples that demonstrate the DSL usage and show how the generated source code can be embedded into algorithms.
3.3 Random number generator
In some physically relevant cases, the Hamiltonian matrix has a diagonal, random component. These random numbers are usually stored in a constant array to be loaded during the matrix–vector multiplication step. At double precision this leads to an increase in code balance by 8 bytes/LUP, which can be entirely saved by generating the random numbers on the fly using a fast random number generator (RNG). Considering that the stencil update schemes studied here can run at several billions of lattice site updates per second on modern server processors, a suitable RNG must be able to produce random numbers at a comparable rate. This is not possible with standard library-based implementations such as, e.g., drand48(), but faster and (in terms of quality) better options do exist. The RNG code should be inlined with the actual spMVM or at least be available as a function that generates long sequences of random numbers, in order to avoid excessive call overhead.
The standard type of RNG used in scientific computing is the linear congruential generator (LCG), which calculates x_{k+1} = (a·x_k + c) mod m and casts the result to a floating-point number. The numbers a, c, and m parametrize the generator; for efficiency reasons one can choose m to be a power of two (e.g., m = 2^48 in drand48()), but such simple methods fail the statistical tests of the popular TESTU01 suite devised by L’Ecuyer and Simard (2007). However, if there are no particular quality requirements (i.e., if only “some randomness” is asked for), they may still be of value.
Despite the non-resolvable dependency of x_{k+1} on x_k, which appears to rule out SIMD vectorization, LCGs can be vectorized if a number of independent random number sequences is needed and if the SIMD instruction set of the hardware allows for the relevant operations (e.g., SIMD-parallel addition, multiplication, and modulo on unsigned integer operands with the required number of bits).
A more modern and similarly fast approach to RNGs are the xorshift generators by Marsaglia (2003). In the simplest case they work by a sequence of XOR operations of the seed with bit-shifted versions of itself, e.g., x ^= x << 13; x ^= x >> 17; x ^= x << 5. Improved versions like xorshift128+ by Vigna (2014) pass all statistical tests of the “SmallCrush” suite in TESTU01. Table 1 shows a performance comparison of different RNGs on one socket of the IVB and BDW systems, respectively.
RNG (loop unrolling)      | failed SmallCrush    | Perf. IVB [GRN/sec] | Perf. BDW [GRN/sec]
--------------------------|----------------------|---------------------|--------------------
lcg32 (32)                | 1,2,3,4,5,6,7,8,9,10 | 19.3                | 28.8
xorshift32 (32)           | 1,3,4,8,9,10         | 10.1                | 28.8
xorshift128 (16)          | 6                    | 8.29                | 25.4
xorshift64_long (16)      | 8                    | 8.26                | 31.7
xorshift128plus_long (16) | —                    | 6.34                | 20.7
Intel MKL SFMT19937       | —                    | 7.72                | 22.0

Table 1: Performance comparison of the LCG32 RNG, four xorshift generators of different sophistication, and the SFMT19937 generator available in the Intel MKL. Performance numbers are given in billions of random numbers per second (GRN/sec) on one socket (10 or 18 cores, respectively). The benchmark consisted of the (repeated) computation of 2000 double-precision random numbers with uniform distribution. For each generator, the second column lists the failed SmallCrush tests of the TESTU01 suite. These particular test results were taken with the Intel C compiler version 17.0 update 5. The benchmarks are available in the code repository.
The speedup between IVB and BDW is particularly large for the xorshift generators because the AVX2 instruction set on BDW supports SIMD-parallel bit-shift operations, which are not available in AVX. For comparison we have included a SIMD-vectorized Mersenne Twister RNG (SFMT19937), which is available in Intel’s Math Kernel Library (MKL).
Whether a particular RNG impacts the performance of the matrix-free spMVM step depends on the application. In the cases we investigate here, a whole row of random numbers can be generated in advance before each inner loop traversal (see Listing 5) without a significant performance degradation in the bandwidth-saturated case. Fusing the RNG with the inner loop is left for future work.
3.4 Geometry
The current implementation of PVSC-DTM supports cuboid domains of arbitrary size (only limited by memory capacity) with 3D domain decomposition using MPI, and OpenMP-based multithreading on the subdomain level. For each spatial dimension, periodic boundary conditions (BCs) can be configured separately as needed. Variable-coefficient arrays and initial vectors can be preset via a user-defined callback function with the following interface:
This function must expect the following input parameters:
It returns the respective vector entry as a double-precision number. The pointer parm is handed down from the vector initialization call and allows for configuring specific options. The pointer seed is a reference to a 128-bit process- and thread-local random seed.
Finally, a vector block is initialized by calling the function:

This mechanism lets the user define a generalized initialization function with optional free parameters. In addition, a thread-local random seed for optional random functions is available in the initialization function, which enables a fully parallelized initialization of vectors.
3.5 Benchmarks
In order to validate the performance claims of our matrix-free implementation and the optimization of random number generation, we ran several test cases on the benchmark systems described in the introduction. Performance is quantified in billions of lattice site updates per second (GLUPs/sec). For all stencil update kernels (spMVMs) with constant coefficients studied here, the minimal code balance is 24 bytes/LUP with on-the-fly RNGs (see Listing 5) and 32 bytes/LUP with random numbers stored in constant arrays. The roofline model thus predicts bandwidth-bound per-socket upper performance limits of 1.67 GLUPs/sec on IVB and 2.5 GLUPs/sec on BDW.
Figures 2 and 3 show the performance scaling of the spMVM with a 3D 7-point stencil on one socket of the benchmark systems. On IVB, the “fallback” kernel, which uses explicitly stored random numbers, saturates the memory bandwidth with eight cores at about 95% of the achievable bandwidth (black solid line). The kernel without random numbers (labeled “const. coeff.”) marks a practical upper performance limit. It also saturates at about the same bandwidth (and thus at 33% higher performance), with a very slight additional speedup from SMT (Hyper-Threading). As expected, the versions with on-the-fly RNGs are somewhat slower on the core level due to the increased amount of work, which, in the case of the xorshift variants, leads to a lower performance than the fallback variant up to seven cores, and to non-saturation when only physical cores are used. SMT can close this gap by filling pipeline bubbles on the core level, and all RNG versions end up at exactly the same performance with 20 SMT threads. On BDW the full-socket situation is similar, but all versions come closer to the practical bandwidth limit than on IVB, and the fallback variant is slower than all RNG versions at all core counts.
The bottom line is that even the most “expensive” on-the-fly RNG allows memory bandwidth saturation on both architectures, that the roofline model predictions are quite accurate, that the automatic spatial blocking in PVSC-DTM works as intended and yields the optimal in-memory code balance, and that eliminating the stored random number stream causes the expected speedup even with high-quality RNGs.
Figure 4 shows a performance comparison of stored random numbers and on-the-fly RNG for a 2D graphene application with four sub-sites, a block vector size n_b of 1, 2, and 4, and four variable coefficients. The code balance goes down from 32 bytes/LUP to 28 bytes/LUP and finally to 26 bytes/LUP when going from n_b = 1 to 2 and 4, approaching the limit of 24 bytes/LUP for large n_b. With on-the-fly RNGs substituting the variable-coefficient arrays this balance is achieved for any n_b, which is shown in the right panel of Fig. 4. It can also be observed in the data that the improved SIMD vectorization at larger n_b speeds up the code measurably in the non-saturated regime, but this advantage vanishes close to saturation because the data transfer becomes the only bottleneck.
Figures 5 and 6 show the performance of the stencil kernels of a 3D TI model with different numbers of block vectors n_b on IVB and BDW. Two versions are shown for each architecture: the standard one and an optimized version with dot products fused into the stencil kernel, which increases the number of flops per update by six. The code balance for TI stencils is lower than for graphene or the 7-point stencil, hence more cores are required for bandwidth saturation.
At larger n_b the loop body becomes more complicated, and the benefit of SIMD vectorization may be compensated by a more inefficient in-core execution due to register shortage and less effective out-of-order execution. This can be seen on IVB, where for some n_b the available number of cores is too small to reach saturation, as opposed to the case where the SIMD width matches the number of block vectors.
Calculating dot products on the fly has a rather small impact on performance (less than 15%), which on BDW vanishes at saturation because of its generally lower machine balance. Still, overall the roofline model provides a good estimate of the expected socket-level performance of our matrix-free codes even for topological insulators. Note, however, that the saturation properties depend on many factors, such as the number of cores per socket, the memory bandwidth, the clock speed, and the SIMD width. An accurate prediction of speedup vs. number of cores would require a more advanced performance model, such as the ECM model described in Stengel et al. (2015).

In Figures 7 and 8 we compare PVSC-DTM with the GHOST library developed by Kreutzer et al. (2017) using a strong-scaling TI test case on the Emmy cluster. One MPI process was bound to each socket, with OpenMP parallelization (ten threads) across the cores. Both PVSC-DTM and GHOST were used in “vector mode,” i.e., without overlap between communication and computation. GHOST always uses explicitly stored matrices, which is why PVSC-DTM not only has the expected performance advantage due to its matrix-free algorithms but also requires less hardware to handle a given problem size. The maximum number of nodes was chosen such that a maximum communication overhead of about 40–50% (see insets) is observed for PVSC-DTM, which is a reasonable upper limit in production runs for resource-efficiency reasons.
In the test case in Fig. 7, GHOST requires at least 16 nodes for storing the matrix and two vectors. With the same resources, PVSC-DTM is about 5× faster, and it can outperform GHOST already with four nodes. The ratio of communication to computation time is naturally larger with PVSC-DTM due to the faster code execution. Although this particular test case cannot be run on a single node with GHOST, the performance comparison at 16 nodes also reflects quite accurately the per-node (or per-socket, i.e., pure OpenMP) performance ratio between PVSC-DTM and GHOST, since at this point the communication overhead is still only 10–20%.
For larger $n_b$ the memory traffic caused by the matrix becomes less significant, and the speedup of PVSC-DTM vs. GHOST shrinks. In the smaller test case shown in Fig. 8, GHOST requires at least four nodes and is still about 2.5× slower than PVSC-DTM at that point. Again, this is also the expected per-socket speedup if it were possible to run the test case on a single socket with GHOST.
4 Algorithms and application examples
In large-scale simulations of any kind, avoiding global synchronization points is crucial for scalability. This challenge can be met by modern matrix polynomial methods. The kernel polynomial method (KPM), the Chebyshev time propagation approach described in Weiße and Fehske (2008) and Alvermann and Fehske (2008), and the high-performance Chebyshev filter diagonalization (ChebFD) implementation introduced in Pieper et al. (2016) are already available in PVSC-DTM. These algorithms benefit from partial dot products, vector blocking, and loop fusion. The high-order commutator-free exponential time-propagation algorithm introduced by Alvermann and Fehske (2011) for driven quantum systems will be implemented in the near future. In the following sections we give some examples of typical applications in the field of Dirac and topological materials.
4.1 Time propagation
The time evolution of a quantum state is described by the Schrödinger equation. If the Hamilton operator $H$ does not explicitly depend on the time we can formally integrate this equation and express the dynamics in terms of the time evolution operator as $|\psi(t)\rangle = U(t)|\psi(0)\rangle$ with $U(t) = e^{-iHt}$ ($\hbar = 1$). Expanding the time evolution operator into a finite series of first-kind Chebyshev polynomials $T_k$ of order $k$ up to $M$, we obtain [Tal-Ezer and Kosloff (1984); Fehske et al. (2009)]
$$U(\Delta t) \simeq e^{-ib\Delta t}\left[c_0(a\Delta t) + 2\sum_{k=1}^{M} c_k(a\Delta t)\, T_k(\widetilde{H})\right]. \qquad (14)$$
Prior to the expansion the Hamiltonian has to be shifted and rescaled such that the spectrum of $\widetilde{H} = (H - b)/a$ is within the definition interval of the Chebyshev polynomials, $[-1, 1]$, where $a$ and $b$ are calculated from the extreme eigenvalues of $H$: $a = (E_{\max} - E_{\min})/2$ and $b = (E_{\max} + E_{\min})/2$. The expansion coefficients are given by
$$c_k(a\Delta t) = (-i)^k J_k(a\Delta t) \qquad (15)$$
($J_k$ denotes the $k$th-order Bessel function of the first kind).
Calculating the evolution of a state from one time grid point to the adjacent one, $|\psi(t+\Delta t)\rangle = U(\Delta t)|\psi(t)\rangle$, we have to accumulate the weighted vectors $|v_k\rangle = T_k(\widetilde{H})|\psi(t)\rangle$. Since the coefficients $c_k$ depend on the time step $\Delta t$ but not on time explicitly, we need to calculate them only once. The vectors $|v_k\rangle$ can be computed iteratively, exploiting the recurrence relation of the Chebyshev polynomials, $|v_{k+1}\rangle = 2\widetilde{H}|v_k\rangle - |v_{k-1}\rangle$, with $|v_1\rangle = \widetilde{H}|v_0\rangle$ and $|v_0\rangle = |\psi(t)\rangle$. Evolving the wave function from one time step to the next then requires $M$ MVMs of a given complex vector with the (sparse) Hamilton matrix and the summation of the resulting vectors after an appropriate rescaling. Thus, for time-independent $H$, arbitrarily large time steps are in principle possible at the expense of increasing $M$. We may choose $M$ such that the modulus of all neglected expansion coefficients is smaller than a desired accuracy cutoff. This is facilitated by the fast asymptotic decay of the Bessel functions, $J_k(a\Delta t) \sim (a\Delta t/2)^k/k!$ for $k \to \infty$. Thus, for large $M$, the Chebyshev expansion can be considered as quasi-exact. Besides the high accuracy of the method, the linear scaling of computation time with both time step and Hilbert space dimension is promising in view of potential applications to more complex systems. In our cases almost all computation time is spent in sparse MVMs, which can be efficiently parallelized, allowing for a good speedup on highly parallel computers. This also means that any significant speedup that can be achieved for the MVM, such as by our matrix-free formulation, will have a corresponding effect on the runtime of the overall algorithm. The actual speedup is a function of the memory traffic reduction; for instance, a sparse matrix stored in CRS format that describes a stencil-like neighborhood relation with eight neighbors will (in double precision) cause a minimum data traffic of approximately 7.6 bytes/flop when acting on a right-hand side vector.
In a matrix-free formulation this balance reduces to 1.3 bytes/flop, leading to a performance improvement of 5.7× if the memory bandwidth can be saturated in both cases.
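The propagation scheme above can be sketched in a few lines of numpy (a simplified, dense-matrix illustration, not PVSC-DTM's matrix-free kernel: for brevity the expansion coefficients are obtained by Gauss–Chebyshev quadrature instead of from the Bessel functions, and the spectral bounds are computed exactly rather than estimated):

```python
import numpy as np

def chebyshev_propagate(H, psi, dt, M=64):
    """Evolve psi by exp(-i*H*dt) using an M-term Chebyshev expansion."""
    # Shift and rescale H so that the spectrum of Ht lies in [-1, 1]
    E = np.linalg.eigvalsh(H)           # illustration only; real codes estimate bounds
    a = 0.5 * (E[-1] - E[0]) * 1.01     # small safety margin
    b = 0.5 * (E[-1] + E[0])
    Ht = (H - b * np.eye(len(H))) / a

    # Chebyshev coefficients of f(x) = exp(-i*a*dt*x) via Gauss-Chebyshev quadrature
    N = 4 * M
    theta = np.pi * (np.arange(N) + 0.5) / N
    f = np.exp(-1j * a * dt * np.cos(theta))
    c = np.array([(2.0 / N) * np.sum(f * np.cos(k * theta)) for k in range(M)])

    # Accumulate the series with the recurrence v_{k+1} = 2*Ht*v_k - v_{k-1}
    v_prev, v = psi, Ht @ psi
    out = c[0] / 2 * v_prev + c[1] * v
    for k in range(2, M):
        v_prev, v = v, 2 * (Ht @ v) - v_prev
        out += c[k] * v
    return np.exp(-1j * b * dt) * out   # undo the spectral shift
```

In a matrix-free production code the dense products `Ht @ v` are replaced by the stencil kernel, which is exactly where the bandwidth savings discussed above take effect.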
As an example, we apply the Chebyshev time evolution scheme to the propagation and scattering of a Dirac electron wave packet on a graphene sheet with an imprinted gate-defined quantum dot array [Pieper et al. (2013); Fehske et al. (2015)]. This is a timely issue of high experimental relevance [Walls and Hadad (2015); Caridad et al. (2016); Tang et al. (2016)]. We mimic the quantum dot array by implementing the potential in (1) as
$$V_i = \sum_{n,m} V_m\, \Theta\!\left(R - |\mathbf{r}_i - \mathbf{r}_{nm}|\right) \qquad (16)$$
with amplitude $V_m$ varying in $y$ direction ($\Theta$ being the Heaviside step function). In (16), $R$ ($d$) is the radius of a single quantum dot (the nearest-neighbor distance between dots) and $\mathbf{r}_{nm}$ gives the dot's position [$n$ ($m$) count in $x$ ($y$) direction]. The quantum dot lattice can be created by applying spatially confined top gate voltages. The gradient change of the dot potentials mimics spatially varying effective refraction indices for the Dirac electron waves.
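A hard-wall variant of such a dot-array potential on a real-space grid could be set up as follows (a sketch under assumptions: a step profile for each dot, a square dot lattice, and an amplitude gradient across the rows; the function `dot_array_potential` and its parameter names are ours for illustration, not PVSC-DTM API):

```python
import numpy as np

def dot_array_potential(X, Y, V_m, R, d, n_dots_x):
    """Hypothetical gate-defined dot array: inside each disk of radius R,
    centered on a square lattice of spacing d, the potential takes the
    amplitude V_m[m] of its row m (amplitude varies across the array)."""
    V = np.zeros_like(X, dtype=float)
    for n in range(n_dots_x):
        for m, amp in enumerate(V_m):
            inside = (X - n * d) ** 2 + (Y - m * d) ** 2 <= R ** 2
            V[inside] = amp
    return V

# Example: 3x3 dot array with amplitude gradient along y
x = np.linspace(0.0, 3.0, 61)
X, Y = np.meshgrid(x, x, indexing="ij")
V = dot_array_potential(X, Y, V_m=[0.1, 0.2, 0.3], R=0.3, d=1.0, n_dots_x=3)
```

On the actual honeycomb lattice the same mask would simply be evaluated at the carbon-site coordinates and added to the diagonal of the Hamiltonian.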
Figure 9 illustrates the scattering and temporary particle confinement by the quantum dot array. It has been demonstrated by Heinisch et al. (2013) and Pieper et al. (2013) that the normal modes of an isolated quantum dot lead to sharp resonances in the scattering efficiency. Appearing only for particular combinations of dot radius, potential strength, and energy, they allow the “trapping” even of Dirac electrons. Of course, for the scattering setup considered here, only quasi-bound states can appear, which may nevertheless have an exceptionally long lifetime. Thereby the energy of the wave is fed into vortices inside the dot. For a periodic array of dots the normal modes at neighboring dots can couple, leading to coherence effects (such inter-dot particle transfer takes place on a reduced energy scale compared to pure graphene [Fehske et al. (2015)]). The situation becomes more complicated – but also more interesting – when the dot potentials are modulated spatially or energetically. In this case, a direction-dependent transmission (cascaded Mie scattering [Caridad et al. (2016)]) or even the focusing of the electron beam outside the dots can be observed [Tang et al. (2016)]. Similar phenomena are demonstrated by Fig. 9. For this simulation the electron is created by a Gaussian wave packet
$$\psi_i(0) \propto \exp\!\left[-\frac{(\mathbf{r}_i - \mathbf{r}_0)^2}{2\sigma^2}\right] e^{i\mathbf{k}\cdot\mathbf{r}_i}, \qquad (17)$$
where $\mathbf{r}_0$ and $\sigma$ denote the center and width of the packet and $\mathbf{k}$ is the momentum of the Dirac electron in $x$ direction. When the wavefront hits the dot region, the wave is partly trapped by the quantum dots, whereby – for the parameters used – the resonance conditions are better fulfilled near the lower/upper end of the dot array (here the particle wave is best captured). Conversely, the transmission (and also the reflection, i.e., the backscattering) is strongest in the central region, leading to a curved wavefront. For larger times a second pulse (wavefront) emerges (note that there are reflections and repeated scattering events in our finite GNR, but the largest time considered is much shorter than the pass-through time of an unperturbed Dirac wave). In any case, one observes a strongly time- and direction-dependent emission pattern of our graphene-based nanostructure, which can be exploited to manipulate electron beams. Particularly interesting in this respect would be a focusing of the electron beam with large focal length, such that the focal spot lies outside the array. Then the structure could be used as a coupler to other electronic units. Achieving this by tuning the gradient of the gate potential appears to be a very efficient approach, which is more easily realized in practice than modifying geometrical properties of the array such as the lattice gradient or the layer number [Tang et al. (2016)].
4.2 Interior eigenvalues – TIs
Since the electronic properties of TIs are mainly determined by the (topologically nontrivial) surface states located in the bulk-state gap, an efficient calculation of electron states at or close to the center of the spectrum is of vital importance. This can be done by Chebyshev filter diagonalization (ChebFD), a straightforward scheme for interior eigenvalue computation which is based on polynomial filter functions and therefore has much in common with the KPM. ChebFD applies a matrix polynomial filter that is suitable for the target interval to a block of vectors. In each iteration, the search space is checked for convergence using a Rayleigh–Ritz procedure. ChebFD has already proven its practical suitability: parallelized and implemented on the “SuperMUC” supercomputer at LRZ Garching, central eigenvalues of a sparse matrix of very large dimension have been calculated at 40 Tflop/s sustained performance by Pieper et al. (2016).
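The core of such a polynomial filter can be illustrated with a small dense example (a numpy sketch, not the PVSC-DTM implementation: it uses the Jackson-damped Chebyshev expansion of the indicator function of the target interval and assumes the Hamiltonian is already rescaled to $[-1, 1]$):

```python
import numpy as np

def window_filter_coeffs(al, ah, M):
    """Jackson-damped Chebyshev coefficients of the indicator of [al, ah] in [-1, 1]."""
    ta, tb = np.arccos(ah), np.arccos(al)      # arccos is decreasing
    k = np.arange(1, M)
    c = np.empty(M)
    c[0] = (tb - ta) / np.pi
    c[1:] = 2.0 * (np.sin(k * tb) - np.sin(k * ta)) / (k * np.pi)
    j = np.arange(M)                            # Jackson damping suppresses Gibbs ripples
    g = ((M - j + 1) * np.cos(np.pi * j / (M + 1))
         + np.sin(np.pi * j / (M + 1)) / np.tan(np.pi / (M + 1))) / (M + 1)
    return c * g

def apply_filter(Ht, V, coeffs):
    """Apply the filter polynomial p(Ht) to a block of vectors V (columns)."""
    t_prev, t_cur = V, Ht @ V
    out = coeffs[0] * t_prev + coeffs[1] * t_cur
    for ck in coeffs[2:]:
        t_prev, t_cur = t_cur, 2 * (Ht @ t_cur) - t_prev
        out = out + ck * t_cur
    return out
```

Repeatedly filtering a random block amplifies the eigenvector components inside the window and damps all others; a Rayleigh–Ritz step on the orthonormalized block then extracts the interior eigenpairs.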
Figure 10 shows the DOS of a strong TI. The focus is on the (pseudo)gap region of the DOS. Implementing the effect of nonmagnetic impurities by uniformly distributed random on-site potentials, we see how disorder fills the gap that exists in the DOS of a system with a finite number of sites (see the red curves in the upper panels). Introducing a finite gap parameter, which mimics, e.g., the effect of an external magnetic field, the midgap Dirac cone formed by the surface states is broken up. Again, disorder will induce electronic states in this gap region. This is demonstrated by the lower panel of Fig. 10, showing the DOS at the band center as a function of disorder strength and gap parameter. As the disorder strength increases, more and more states appear at the band center until the DOS saturates when the disorder strength reaches the order of magnitude of the bulk band gap. For a more detailed investigation of disordered (weak and strong) TIs we refer the reader to Kobayashi et al. (2013) where, besides the phase diagram, also the DOS was calculated using the KPM (see the supplementary material of that paper). Compared to KPM, our ChebFD approach yields a better resolution in the target interval (band center) at the same computational cost, which is important for the scientific applications.
(For the KPM data in Fig. 10, 2048 Chebyshev moments were used for a finite system; note that the finite DOS at the band center is a finite-size effect and due to the finite KPM resolution.)
The ChebFD algorithm is robust and scalable, but algorithmically suboptimal. In PVSC-DTM we have also implemented the trLanczosFD (thick-restart Lanczos with polynomial filters) algorithm by Li et al. (2016). This algorithm benefits particularly from a matrix-free formulation. A thorough description would exceed the scope of this paper; however, in Table 2 we show runtime data and the maximum residuum of the inner Ritz eigenvalues for trLanczosFD on two TI test cases in comparison with ChebFD. TrLanczosFD outperforms ChebFD by a factor of almost four (using PVSC-DTM for both).
Table 2:

                          test case 1    test case 2
  size                    480×480×6      240×240×6
  eigenpairs              72             40
  Emmy nodes              8              8
  ChebFD:
    runtime [s]           2852           642
    max. res.
  trLanczosFD:
    runtime [s]           760            142
    max. res.
4.3 Disorder effects – Weyl semimetals
The Weyl nodes in the gapless topological Weyl semimetals are believed to be robust against perturbations unless, e.g., the translation or charge conservation symmetry is broken. Showing the stability of a single or a pair of Weyl nodal points against disorder has been the subject of intense research [Liu et al. (2016); Chen et al. (2015); Pixley et al. (2015); McCormick et al. (2017); Shapourian and Hughes (2016); Zhao and Wang (2015)]. Due to the vanishing DOS at the nodal points, disorder effects can be expected to be particularly pronounced. Since analytic methods largely fail in their quantitative predictions, even in the case of weak disorder, we use a purely numerical, KPM-based approach to analyze the spectral properties of Weyl semimetals with real-space quenched potential disorder.
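The KPM machinery behind such calculations boils down to accumulating Chebyshev moments with a stochastic trace estimate; a minimal numpy sketch for the DOS moments (dense and serial, unlike the matrix-free PVSC-DTM kernels; the Hamiltonian must already be rescaled so its spectrum lies in $[-1, 1]$, and disorder enters simply as random on-site terms inside `apply_H`):

```python
import numpy as np

def kpm_dos_moments(apply_H, dim, M=128, R=8, seed=0):
    """Estimate mu_k = Tr[T_k(H)]/dim for k < M with R random probe vectors.
    apply_H(v) must return H @ v for the rescaled Hamiltonian."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(M)
    for _ in range(R):
        v = rng.choice([-1.0, 1.0], size=dim)   # Rademacher probe vector
        t_prev, t_cur = v, apply_H(v)
        mu[0] += v @ t_prev
        mu[1] += v @ t_cur
        for k in range(2, M):
            t_prev, t_cur = t_cur, 2 * apply_H(t_cur) - t_prev
            mu[k] += v @ t_cur
    return mu / (R * dim)
```

The DOS is then reconstructed as $\rho(x) = \frac{1}{\pi\sqrt{1-x^2}}\left[g_0\mu_0 + 2\sum_{k\ge 1} g_k \mu_k T_k(x)\right]$ with suitable damping factors $g_k$ [Weiße et al. (2006)]; momentum-resolved spectral functions follow analogously from moments of $\langle \mathbf{k}|T_k(H)|\mathbf{k}\rangle$.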
Figure 11 displays the momentum-resolved spectral function of a disordered Weyl metal along different paths in the bulk Brillouin zone. The photoemission spectra shown were calculated for the model (8) with random on-site potentials drawn from a uniform box distribution. The presented data should be compared with the results for the clean case provided by Fig. 1(c). Most notably, the electronic states at the Fermi arc (connecting the nodal points) and in its immediate vicinity are hardly influenced by weak and even intermediate disorder. This does not apply to states further away from the Fermi surface. Here, the spectral signatures (band dispersion) are rapidly washed out, even for weak disorder. Of course, strong disorder will also affect the Fermi arc and the nodal points: above a certain disorder strength they will be smeared out in both energy and momentum space and, as a result, the Weyl semimetal will transform into a diffusive metal with a finite DOS at the nodal points. A more detailed investigation of the spectral properties would be desirable in order to confirm the very recent evidence found by Su et al. (2017) for an intermediate Chern insulator state between the disordered Weyl semimetallic and diffusive metallic phases. At even stronger disorder, the distribution of the local density of states broadens significantly (just as in the case of strongly disordered strong TIs [Schubert et al. (2012)] or disordered GNRs [Schubert et al. (2009); Schubert and Fehske (2012)]) and Anderson localization sets in [Pixley et al. (2015)].
5 Conclusion and outlook
The PVSC-DTM DSL and library have been demonstrated to be powerful tools for generating high-performance code to investigate ground-state, spectral, and dynamic properties of graphene, TIs, and other materials whose physics is governed by short-range interactions that lead to stencil-like numerical kernels. Due to its matrix-free design, PVSC-DTM outperforms matrix-based libraries such as GHOST. It also implements effective SIMD vectorization and fast on-the-fly random number generation, and it yields optimal memory-bound chip-level performance as shown by the roofline model. Spatial blocking of the iteration loops is fully automatic and based on layer conditions.
Several improvements to the library are left for future work: A better integration of the random number generator with the inner update loop would increase the non-saturated and sequential performance. Overlapping computation with communication would improve the distributed-memory parallel efficiency. Both optimizations are prerequisites for a possible integration with temporal blocking frameworks such as Girih by Malas et al. (2015), which would further boost the chip-level performance. The system geometry is currently limited to rectangular and cuboid domains, a restriction that may be lifted to support more interesting physical setups. Finally, we plan to implement more algorithms in order to make the library more versatile beyond the showcases described here.
Acknowledgments
We thank Andreas Alvermann, Rafael L. Heinisch, and Gerhard Wellein for fruitful discussions. A.P. is grateful for financial support by KONWIHR, the Bavarian Competence Network for High Performance Computing (project PVSC-TM). We are also indebted to the LRZ Garching for providing access to HLRB-II.
References
 Alvermann and Fehske (2008) Alvermann A and Fehske H (2008) Chebyshev approach to quantum systems coupled to a bath. Phys. Rev. B 77: 045125. DOI:10.1103/PhysRevB.77.045125.
 Alvermann and Fehske (2011) Alvermann A and Fehske H (2011) High-order commutator-free exponential time-propagation of driven quantum systems. J. Comput. Phys. 230(15): 5930–5956. DOI:10.1016/j.jcp.2011.04.006.
 Caridad et al. (2016) Caridad JM, Connaughton S, Ott C, Weber HB and Krstić V (2016) An electrical analogy to Mie scatterings. Nature Communications 7: 12894. DOI:10.1038/ncomms12894.
 Castro Neto et al. (2009) Castro Neto AH, Guinea F, Peres NMR, Novoselov KS and Geim AK (2009) The electronic properties of graphene. Rev. Mod. Phys. 81: 109–162. DOI:10.1103/RevModPhys.81.109.
 Chen et al. (2015) Chen CZ, Song J, Jiang H, Sun Qf, Wang Z and Xie XC (2015) Disorder and metal-insulator transitions in Weyl semimetals. Phys. Rev. Lett. 115: 246603. DOI:10.1103/PhysRevLett.115.246603.
 Fehske et al. (2015) Fehske H, Hager G and Pieper A (2015) Electron confinement in graphene with gatedefined quantum dots. physica status solidi (b) 252(8): 1868–1871. DOI:10.1002/pssb.201552119.
 Fehske et al. (2009) Fehske H, Schleede J, Schubert G, Wellein G, Filinov VS and Bishop AR (2009) Numerical approaches to time evolution of complex quantum systems. Physics Letters A 373: 2182 – 2188. DOI:10.1016/j.physleta.2009.04.022.
 Fu et al. (2007) Fu L, Kane CL and Mele EJ (2007) Topological insulators in three dimensions. Phys. Rev. Lett. 98: 106803. DOI:10.1103/PhysRevLett.98.106803.
 Gropp et al. (2000) Gropp WD, Kaushik DK, Keyes DE and Smith BF (2000) Towards realistic performance bounds for implicit CFD codes. In: Keyes D, Periaux J, Ecer A, Satofuka N and Fox P (eds.) Parallel Computational Fluid Dynamics 1999. Elsevier. ISBN 9780444828514, pp. 241–248. DOI:10.1016/B9780444828514.50030X.
 Hasan and Kane (2010) Hasan MZ and Kane CL (2010) Colloquium: Topological insulators. Rev. Mod. Phys. 82: 3045–3067. DOI:10.1103/RevModPhys.82.3045.
 Hasan et al. (2017) Hasan MZ, Xu SY, Belopolski I and Huang SM (2017) Discovery of Weyl fermion semimetals and topological Fermi arc states. Annual Review of Condensed Matter Physics 8(1): 289–309.
 Heinisch et al. (2013) Heinisch RL, Bronold FX and Fehske H (2013) Mie scattering analog in graphene: Lensing, particle confinement, and depletion of Klein tunneling. Phys. Rev. B 87: 155409. DOI:10.1103/PhysRevB.87.155409.
 Ilic et al. (2014) Ilic A, Pratas F and Sousa L (2014) Cacheaware roofline model: Upgrading the loft. IEEE Computer Architecture Letters 13(1): 21–24. DOI:10.1109/LCA.2013.6.
 Kobayashi et al. (2013) Kobayashi K, Ohtsuki T and Imura KI (2013) Disordered weak and strong topological insulators. Phys. Rev. Lett. 110: 236803. DOI:10.1103/PhysRevLett.110.236803.
 Kreutzer et al. (2017) Kreutzer M, Thies J, Röhrig-Zöllner M, Pieper A, Shahzad F, Galgon M, Basermann A, Fehske H, Hager G and Wellein G (2017) GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems. International Journal of Parallel Programming 45(5): 1046–1072. DOI:10.1007/s10766-016-0464-z.
 L’Ecuyer and Simard (2007) L’Ecuyer P and Simard R (2007) TestU01: A C library for empirical testing of random number generators. ACM Trans. Math. Softw. 33(4): 22:1–22:40. DOI:10.1145/1268776.1268777.
 Li et al. (2016) Li R, Xi Y, Vecharynski E, Yang C and Saad Y (2016) A thickrestart Lanczos algorithm with polynomial filtering for Hermitian eigenvalue problems. SIAM Journal on Scientific Computing 38(4): A2512–A2534. DOI:10.1137/15M1054493.
 Liu et al. (2016) Liu S, Ohtsuki T and Shindou R (2016) Effect of disorder in a threedimensional layered chern insulator. Phys. Rev. Lett. 116: 066401. DOI:10.1103/PhysRevLett.116.066401.
 Malas et al. (2015) Malas T, Hager G, Ltaief H and Keyes D (2015) Multidimensional intratile parallelization for memorystarved stencil computations. ArXiv eprints 1510.04995. To be published in ACM Transactions on Parallel Computing (TOPC).
 Marsaglia (2003) Marsaglia G (2003) Xorshift RNGs. Journal of Statistical Software 8(1): 1–6. DOI:10.18637/jss.v008.i14.
 McCalpin (1991–2007) McCalpin JD (1991–2007) STREAM: Sustainable memory bandwidth in high performance computers. Technical report, University of Virginia, Charlottesville, VA. URL http://www.cs.virginia.edu/stream/. A continually updated technical report.
 McCormick et al. (2017) McCormick TM, Kimchi I and Trivedi N (2017) Minimal models for topological Weyl semimetals. Phys. Rev. B 95: 075133. DOI:10.1103/PhysRevB.95.075133.
 Pieper and Fehske (2016) Pieper A and Fehske H (2016) Topological insulators in random potentials. Phys. Rev. B 93: 035123. DOI:10.1103/PhysRevB.93.035123.
 Pieper et al. (2013) Pieper A, Heinisch RL and Fehske H (2013) Electron dynamics in graphene with gatedefined quantum dots. EPL (Europhysics Letters) 104(4): 47010. URL http://stacks.iop.org/02955075/104/i=4/a=47010.
 Pieper et al. (2016) Pieper A, Kreutzer M, Alvermann A, Galgon M, Fehske H, Hager G, Lang B and Wellein G (2016) Highperformance implementation of Chebyshev filter diagonalization for interior eigenvalue computations. Journal of Computational Physics 325: 226–243. DOI:10.1016/j.jcp.2016.08.027.
 Pixley et al. (2015) Pixley JH, Goswami P and Das Sarma S (2015) Anderson localization and the quantum phase diagram of three dimensional disordered Dirac semimetals. Phys. Rev. Lett. 115: 076601. DOI:10.1103/PhysRevLett.115.076601.
 Qi and Zhang (2011) Qi XL and Zhang SC (2011) Topological insulators and superconductors. Rev. Mod. Phys. 83: 1057–1110. DOI:10.1103/RevModPhys.83.1057.
 Ragan-Kelley et al. (2013) Ragan-Kelley J, Barnes C, Adams A, Paris S, Durand F and Amarasinghe S (2013) Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13. New York, NY, USA: ACM. ISBN 9781450320146, pp. 519–530. DOI:10.1145/2491956.2462176. URL http://doi.acm.org/10.1145/2491956.2462176.
 Rivera and Tseng (2000) Rivera G and Tseng CW (2000) Tiling optimizations for 3D scientific computations. In: Supercomputing, ACM/IEEE 2000 Conference. pp. 32–32. DOI:10.1109/SC.2000.10015.
 Schmitt et al. (2014) Schmitt C, Kuckuk S, Hannig F, Köstler H and Teich J (2014) Exaslang: A domainspecific language for highly scalable multigrid solvers. In: 2014 Fourth International Workshop on DomainSpecific Languages and HighLevel Frameworks for High Performance Computing. pp. 42–51. DOI:10.1109/WOLFHPC.2014.11.
 Schubert and Fehske (2012) Schubert G and Fehske H (2012) Metal-to-insulator transition and electron-hole puddle formation in disordered graphene nanoribbons. Phys. Rev. Lett. 108: 066402. DOI:10.1103/PhysRevLett.108.066402.
 Schubert et al. (2012) Schubert G, Fehske H, Fritz L and Vojta M (2012) Fate of topologicalinsulator surface states under strong disorder. Phys. Rev. B 85: 201105. DOI:10.1103/PhysRevB.85.201105.
 Schubert et al. (2009) Schubert G, Schleede J and Fehske H (2009) Anderson disorder in graphene nanoribbons: A local distribution approach. Phys. Rev. B 79: 235116. DOI:10.1103/PhysRevB.79.235116.
 Shapourian and Hughes (2016) Shapourian H and Hughes TL (2016) Phase diagrams of disordered Weyl semimetals. Phys. Rev. B 93: 075108. DOI:10.1103/PhysRevB.93.075108.
 Sitte et al. (2012) Sitte M, Rosch A, Altman E and Fritz L (2012) Topological insulators in magnetic fields: Quantum Hall effect and edge channels with a nonquantized θ term. Phys. Rev. Lett. 108: 126807. DOI:10.1103/PhysRevLett.108.126807.
 Stengel et al. (2015) Stengel H, Treibig J, Hager G and Wellein G (2015) Quantifying performance bottlenecks of stencil computations using the ExecutionCacheMemory model. In: Proceedings of the 29th ACM International Conference on Supercomputing, ICS ’15. New York, NY, USA: ACM. DOI:10.1145/2751205.2751240.
 Su et al. (2017) Su Y, Wang XS and Wang SR (2017) A generic phase between disordered Weyl semimetal and diffusive metal. Scientific Reports 7: 14382. DOI:10.1038/s41598-017-14760-8.
 Tal-Ezer and Kosloff (1984) Tal-Ezer H and Kosloff R (1984) An accurate and efficient scheme for propagating the time dependent Schrödinger equation. The Journal of Chemical Physics 81(9): 3967–3971. DOI:10.1063/1.448136.
 Tang et al. (2016) Tang Y, Cao X, Guo R, Zhang Y, Che Z, Yannick FT, Zhang W and Du J (2016) Flatlens focusing of electron beams in graphene. Scientific Reports 6: 33522. DOI:10.1038/srep33522.
 Tang et al. (2011) Tang Y, Chowdhury RA, Kuszmaul BC, Luk CK and Leiserson CE (2011) The Pochoir stencil compiler. In: Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’11. New York, NY, USA: ACM. ISBN 9781450307437, pp. 117–128. DOI:10.1145/1989493.1989508.
 Vigna (2014) Vigna S (2014) Further scramblings of Marsaglia’s xorshift generators. ArXiv eprints .
 Walls and Hadad (2015) Walls J and Hadad D (2015) Suppressing Klein tunneling in graphene using a onedimensional array of localized scatterers. Scientific Reports 5. DOI:10.1038/srep08435.
 Wan et al. (2011) Wan X, Turner AM, Vishwanath A and Savrasov SY (2011) Topological semimetal and Fermi-arc surface states in the electronic structure of pyrochlore iridates. Phys. Rev. B 83: 205101. DOI:10.1103/PhysRevB.83.205101.
 Weiße and Fehske (2008) Weiße A and Fehske H (2008) Chebyshev expansion techniques. Lecture Notes in Physics 739: 529.
 Weiße et al. (2006) Weiße A, Wellein G, Alvermann A and Fehske H (2006) The kernel polynomial method. Rev. Mod. Phys. 78: 275–306. DOI:10.1103/RevModPhys.78.275.
 Williams et al. (2009) Williams S, Waterman A and Patterson D (2009) Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52(4): 65–76. DOI:10.1145/1498765.1498785.
 Xu et al. (2015) Xu SY, Belopolski I, Alidoust N, Neupane M, Bian G, Zhang C, Sankar R, Chang G, Yuan Z, Lee CC, Huang SM, Zheng H, Ma J, Sanchez DS, Wang B, Bansil A, Chou F, Shibayev PP, Lin H, Jia S and Hasan MZ (2015) Discovery of a Weyl fermion semimetal and topological Fermi arcs. Science 349(6248): 613–617. DOI:10.1126/science.aaa9297.
 Yang et al. (2011) Yang KY, Lu YM and Ran Y (2011) Quantum Hall effects in a Weyl semimetal: Possible application in pyrochlore iridates. Phys. Rev. B 84: 075129. DOI:10.1103/PhysRevB.84.075129.
 Zhang et al. (2017) Zhang N, Driscoll M, Markley C, Williams S, Basu P and Fox A (2017) Snowflake: A lightweight portable stencil dsl. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). pp. 795–804. DOI:10.1109/IPDPSW.2017.89.
 Zhao and Wang (2015) Zhao YX and Wang ZD (2015) Disordered Weyl semimetals and their topological family. Phys. Rev. Lett. 114: 206602. DOI:10.1103/PhysRevLett.114.206602.