1. Introduction
Heat dissipation in microchips reached alarming peak values of 100 W/cm already in 2006 (wei; pop). This led to the end of Dennard scaling and the beginning of the “multicore crisis”, an era with energyefficient parallel, but sequentially slower multicore CPUs. Now, more than ten years later, average power densities of up to 30 W/cm, about four times more than hot plates, are commonplace in modern highperformance CPUs, putting thermal management at the center of attention of circuit designers (hotplate). By scaling the dimensions of transistors more rapidly than their supply voltage, the semiconductor industry has kept increasing heat dissipation from one generation of microprocessors to the other. In this context, largescale data and supercomputing centers are facing critical challenges regarding the design and cost of their cooling infrastructures. The price to pay for that has become exorbitant, as the cooling can be up to 40% of the total electicity consumed by data centers; a cumulative cost of many billion dollars per year.
Landauer’s theoretical limit of energy consumption for nonreversible computing offers a glimmer of hope: today’s processing units require orders of magnitude more energy than the Joule bound to (irreversibly) change one single bit. However, to approach this limit, it will be necessary to first properly understand the mechanisms behind nanoscale heat dissipation in semiconductor devices (pop). Fin fieldeffect transistors (FinFETs), as schematized in Fig. 1(ac), build the core of all recent integrated circuits (ICs). Their dimensions do not exceed 100 nanometers along all directions, even 10 nm along one of them, with an active region composed of fewer than 1 million atoms. This makes them subject to strong quantum mechanical and peculiar thermal effects.
When a voltage is applied across FinFETs, electrons start to flow from the source to the drain contact, giving rise to an electrical current whose magnitude depends on the gate bias. The potential difference between source and drain allows electrons to transfer part of their energy to the crystal lattice surrounding them. This energy is converted into atomic vibrations, called phonons, that can propagate throughout FinFETs. The more atoms vibrate, the “hotter” a device becomes. This phenomenon, known as self or Jouleheating, plays a detrimental role in today’s transistor technologies and has consequences up to the system level. It is illustrated in Fig. 1(d): a strong increase of the lattice temperature can be observed close to the drain contact of the simulated FinFET. The negative influence of selfheating on the CPU/GPU performance can be minimized by devising computerassisted strategies to efficiently evacuate the generated heat from the active region of transistors.
Electrothermal properties of nanodevices can be modeled and analyzed via Quantum Transport (QT) simulation
, where electron and phonon currents are evaluated by taking quantum mechanics into account. Due to the large height/width ratio of FinFETs, these effects can be physically captured in a twodimensional simulation domain comprising 10,000 to 15,000 thousand atoms. Such dissipative simulations involve solving the Schrödinger equation with open boundary conditions over several momentum and energy vectors that are coupled to each other through electronphonon interactions. A straightforward algorithm to address this numerical problem consists of defining two loops, one over the momentum points and another one over the electron energies, which results in potentially extraneous execution dependencies and a complex communication pattern. The latter scales suboptimally with the number of participating atoms and computational resources, thus limiting current simulations to the order of thousand atoms.
While the schedule (ordering) of the loops in the solver is natural from the physics perspective (§ 2), the data decomposition it imposes when parallelizing is not scalable from the computational perspective. To investigate larger device structures, it is crucial to reformulate the problem as a communicationavoiding algorithm — rescheduling computations across compute resources to minimize data movement.
Even when each node is operating at maximum efficiency, largescale QT simulations are both bound by communication volume and memory requirements. The former inhibits strong scaling, as simulation time includes nanostructuredependent pointtopoint communication patterns, which becomes infeasible when increasing node count. The memory bottleneck is a direct result of the former. It hinders large simulations due to the increased memory requirements w.r.t. atom count. Transforming the QT simulation algorithm to minimize communication is thus the key to simultaneously model larger devices and increase scalability on different supercomputers.
The current landscape of supercomputing resources is dominated by heterogeneous nodes, where no two clusters are the same. Each setup requires careful tuning of application performance, focused mostly around data locality (padal). As this kind of tuning demands indepth knowledge of the hardware, it is typically performed by a Performance Engineer, a developer who is versed in intricate system details, existing highperformance libraries, and capable of modeling performance and setting up optimized procedures independently. This role, which complements the Domain Scientist, has been increasingly important in scientific computing for the past three decades, but is now essential for any application beyond straightforward linear algebra to operate at extreme scales. Until recently, both Domain Scientists and Performance Engineers would work with one codebase. This creates a codependent situation (patmaccormickssostalk), where the original domain code is tuned to a point that making modifications to the algorithm or transforming its behavior is difficult to one without the presence of the other, even if data locality or computational semantics are not changed.
In this paper, we propose a paradigm change by rewriting the problem from a datacentric perspective. We use OMEN, the current stateoftheart quantum transport simulation application (omen), as our baseline, and show that the key to formulating a communicationavoiding variant is tightly coupled with recovering local and global data dependencies of the application. We start from a reference Python implementation, using DataCentric (DaCe) Parallel Programming (sdfg) to express the computations separately from data movement. DaCe automatically constructs a stateful dataflow view that can be used to optimize data movement without modifying the original computation. This enables rethinking the communication pattern of the simulation, and tuning the data movement for each target supercomputer.
In sum, the paper makes the following contributions:

Construction of the dissipative quantum transport simulation problem from a physical perspective;

Definition of the stateful dataflow of the algorithm, making data and control dependencies explicit on all levels;

Creation of a novel tensorfree communicationavoiding variant of the algorithm based on the datacentric view;

Optimal choice of the decomposition parameters based on the modeling of the performance and communication of our variant, nanodevice configuration, and cluster architecture;

Demonstration of the algorithm’s scalability on two vastly different supercomputers — Piz Daint and Summit — up to fullscale runs on 10k atoms, 21 momentum points and 1,000 energies per momentum;

A performance increase of 1–2 orders of magnitude over the previous state of the art, all from a datacentric Python implementation that reduces codelength by a factor of five.
2. Statement of the Problem
A technology computer aided design (TCAD) tool can shed light on the electrothermal properties of nanodevices, provided that it includes the proper physical models:

The dimensions of FinFETs calls for an atomistic quantum mechanical treatment of the device structures;

The electron and phonon bandstructures should be accurately and fully described;

The interactions between electrons and phonons, especially energy exchanges, should be accounted for.
The Nonequilibrium Green’s Function (NEGF) formalism (datta) combined with density functional theory (DFT) (kohn) fulfills all these requirements and lends itself optimally to the investigation of selfheating in arbitrary device geometries. With NEGF, both electron and phonon transport can be described, together with their respective interactions. With the help of DFT, an ab initio method, any material (combination) can be handled at the atomic level without the need for empirical parameters.
FinFETs are essentially threedimensional (3D) components that can be approximated as 2D slices in the  plane, whereas the height, aligned with the axis (see Fig. 1(ab)), can be treated as a periodic dimension and represented by a momentum vector or in the range . Hence, the DFT+NEGF equations have the following form:
(1) 
In Eq. (1), is the electron energy, and the overlap and Hamiltonian matrices, respectively. They typically exhibit a block tridiagonal structure and a size with as the total number of atoms in the considered structure and the number of orbitals (basis components) representing each atom. The and matrices must be produced by a DFT package relying on a localized basis set, e.g., SIESTA (siesta) or CP2K (cp2k). The ’s refer to the electron Green’s Functions (GF) at energy and momentum . They are of the same size as , , and
, the identity matrix. The GF can be either retarded (
), advanced (), lesser (), or greater () with =. The same conventions apply to the selfenergies that include a boundary and a scattering term. The former connects the simulation domain to external contacts, whereas the latter encompasses all possible interactions of electrons with their environment.To handle phonon transport, the following NEGFbased system of equations must be processed:
(2) 
where the ’s are the phonon Green’s functions at frequency and momentum and the ’s the selfenergies, while refers to the dynamical (Hessian) matrix of the studied domain, computed with density functional perturbation theory (DFPT) (dfpt). The phonon Green’s Function types are the same as for electrons (retarded, advanced, lesser, and greater). All matrices involved in Eq. (2) are of size , =3 corresponding to the number of directions along which the crystal can vibrate (, , ).
Equations (1) and (2) must be solved for all possible electron energy () and momentum () points as well as all phonon frequencies () and momentum (). This can be done with a socalled recursive Green’s Function (RGF) algorithm (rgf) that takes advantage of the block tridiagonal structure of the matrices , , and . All matrices can be divided into blocks with atoms each, if the structure is homogeneous, as here. RGF then performs a forward and backward pass over the blocks that compose the 2D slice. Both passes involve a number of multiplications between matrices of size for electrons (or for phonons) for each block.
The main computational bottleneck does not come from RGF, but from the fact that in the case of selfheating simulations the energymomentum (,) and frequencymomentum (,) pairs are not independent from each other, but tightly coupled through the scattering selfenergies (SSE) and . These matrices are made of blocks of size and , respectively, and are given by (stieger):
(3) 
(4)  
(5)  
In Eqs. (35), all Green’s Functions () are matrices of size (). They describe the coupling between all orbitals (vibrational directions) of two neighbor atoms and situated at position and . Each atom possesses neighbors. Furthermore, is the derivative of the Hamiltonian block coupling atoms and w.r.t variations along the =, , or coordinate of the bond . To obtain the retarded components of the scattering selfenergies, the following relationship can be used: , which is also valid for (lake). Due to computational reasons, only the diagonal blocks of are retained, while nondiagonal connections are kept for .
Variable  Description  Range 

Number of electron momentum points  
Number of phonon momentum points  
Number of energy points  [700, 1500]  
Number of phonon frequencies  [10, 100]  
Total number of atoms per device structure  See Table 2  
Neighbors considered for each atom  [4, 50]  
Number of orbitals per atom  [1, 30]  
Degrees of freedom for crystal vibrations  3 
Name  Maximum # of Computed Atoms  Scalability  

Tightbindinglike  DFT  Max. Cores  Using  
(Magnitude)  GPUs  
GOLLUM (ferrer2014gollum)  1k  1k  —  100  100  —  N/A  ✗ 
Kwant (groth2014kwant)  10k  —  —  —  —  —  N/A  ✗ 
NanoTCAD ViDES (vides)  10k  —  —  —  —  —  N/A  ✗ 
QuantumATK (atk)  10k  10k  —  1k  1k  —  1k  ✗ 
TB_sim (niquet)  100k  —  10k  1k  —  —  10k  ✓ 
NEMO5 (nemo5)  100k  100k  10k  —  —  —  100k  ✓ 
OMEN (omen)  100k (1.44 Pflop/s (sc11))  100k  10k  10k  10k (15 Pflop/s (sc15))  1k (0.16 Pflop/s)  100k  ✓ 
This work  N/A  N/A  N/A  10k  10k  10k (19.71 Pflop/s)  1M  ✓ 
: including MaximallyLocalized Wannier Functions (MLWF), : Ballistic, : Simplified.
The evaluation of Eqs. (35) does not require the knowledge of all entries of the and matrices, but of two (lesser and greater) 5D tensors of shape for electrons and two 6D tensors of shape for phonons. Each and combination is produced independently from the other by solving Eq. (1) and (2), respectively. The electron and phonon scattering selfenergies can also be reshaped into multidimensional tensors that have exactly the same dimensions as their Green’s functions counterparts. However, the selfenergies cannot be computed independently, one energymomentum or frequencymomentum pair depending on many others, as defined in Eqs. (35) and depicted in Fig. 2. Furthermore, is a function of , while is needed to calculate .
To obtain the electrical and energy currents that flow through a given device and the corresponding charge density, Eqs. (12) (GF) and Eqs. (35) (SSE) must be iteratively solved until convergence is reached, and all GF contributions must be accumulated (stieger). The algorithm starts by setting ==0 and continues by computing all GFs under this condition. The latter then serve as inputs to the next phase, where the SSE are evaluated for all (,) and (,) pairs. Subsequently, the SSE matrices are fed into the GF calculation and the process repeats itself until the GF variations do not exceed a predefined threshold. In terms of HPC, the main challenges reside in the distribution of these quantities over thousands of compute units, the resulting communicationintensive gathering of all data to handle the SSE phase, and the efficient solution of Eqs. (35) on hybrid nodes, as they involve many small matrix multiplications. Typical simulation parameters are listed in Table 1.
2.1. Current State of the Art
There exist several atomistic quantum transport simulators (ferrer2014gollum; groth2014kwant; nemo5; atk; vides; niquet; omen) that can model the characteristics of nanodevices. Their performance is summarized in Table 2
, where their estimated maximum number of atoms that can be simulated for a given physical model is provided. Only orders of magnitude are shown, as these quantities depend on the device geometries and bandstructure method. It should be mentioned that most tools are limited to tightbindinglike (TB) Hamiltonians, because they are computationally cheaper than DFT ones (
and ). This explains the larger systems that can be treated with TB. However, such approaches lack accuracy when it comes to the exploration of material stacks, amorphous layers, metallic contacts, or interfaces. In these cases, only DFT ensures reliable results, but at much higher computational cost.To the best of our knowledge, the only tool that can solve Eqs. (1) to (5) selfconsistently, in structures composed of thousands of atoms, at the DFT level is OMEN, a two times Gordon Bell Prize finalist (sc11; sc15).^{1}^{1}1Previous achievements: development of parallel algorithms to deal with ballistic transport (Eq. (1) alone) expressed in a tightbinding (SC11 (sc11)) or DFT (SC15 (sc15)) basis. The code is written in C++, contains 90,000 lines of code in total, and uses MPI as its communication protocol. Some parts of it have been ported to GPUs using the CUDA language and taking advantage of libraries such as cuBLAS, cuSPARSE, and MAGMA. The electronphonon scattering model was first implemented based on the tightbinding method and a threelevel MPI distribution of the workload (momentum, energy, and spatial domain decomposition). A first release of the model with equilibrium phonon (=0) was validated up to 95k cores for a device with =5,402, =4, =10, =21, and =1,130. These runs showed that the application can reach a parallel efficiency of 57%, when going from 3,276 up to 95,256 cores, with the SSE phase consuming from 25% to 50% of the total simulation times. The reason for the SSE increase could be attributed to the communication time required to gather all Green’s Function inputs for Eq. (3), which grew from 16 to 48% of the total simulation time (sc10) as the number of cores went from 3,276 to 95,256.
After extending the electronphonon scattering model to DFT and adding phonon transport to it, it has been observed that the time spent in the SSE phase (communication and computation) explodes. Even for a small structure with =2,112, =4, ==11, =650, =30, and =13, 95% of the total simulation time is dedicated to SSE, regardless of the number of used cores/nodes, among which 60% for the communication between the different MPI tasks. To simulate selfheating in realistic FinFETs (10,000), with a high accuracy (20, 1,000), and within reasonable times (a couple of minutes for one GFSSE iteration at machine scale), the algorithms involved in the solution of Eqs. (1) to (5) must be drastically improved: as compared to the state of the art, an improvement of at least one order of magnitude is needed in terms of the number of atoms that can be handled, and two orders of magnitude for what concerns the computational time.
3. DataCentric Parallel Programming
CommunicationAvoiding (CA) algorithms (demmel13ca; writeav) are defined as algorithm variants and schedules (orders of operations) that minimize the total number of performed memory loads and stores, achieving lower bounds in some cases. To achieve such bounds, a subset of those algorithms is matrixfree^{2}^{2}2The term is derived from solvers that do not need to store the entire matrix in memory., potentially reducing communication at the expense of recomputing parts of the data onthefly. A key requirement in modifying an algorithm to achieve communication avoidance is to explicitly formulate its data movement characteristics. The schedule can then be changed by reorganizing the data flow to minimize the sum of accesses in the algorithm. Recovering a DataCentric (DaCe) view of an algorithm, which makes movement explicit throughout all levels (from a single core to the entire cluster), is thus the path forward in scaling up the creation of CA variants to more complex algorithms and multilevel memory hierarchies as one.
DaCe defines a development workflow where the original algorithm is independent from its data movement representation, enabling symbolic analysis and transformation of the latter without modifying the scientific code. This way, a CA variant can be formulated and developed by a performance engineer, while the original algorithm retains readability and maintainability. At the core of the DaCe implementation is the Stateful DataFlow multiGraph (SDFG) (sdfg), an intermediate representation that encapsulates data movement and can be generated from highlevel code in Python. The syntax (node and edge types) of SDFGs is listed in Fig. 3. The workflow is as follows: The domain scientist designs an algorithm and implements it as linear algebra operations (imposing dataflow implicitly), or using Memlets and Tasklets (specifying dataflow explicitly). This implementation is then parsed into an SDFG, where performance engineers may apply graph transformations to improve data locality. After transformation, the optimized SDFG is compiled to machine code for performance evaluation. It may be further transformed interactively and tuned for different target platforms and memory hierarchy characteristics.
An example of a naïve matrix multiplication SDFG (C = A @ B in Python) is shown in Fig. 4. In the figure, we see that data flows from Data nodes A and B through a Map scope. This would theoretically expand to M*N*K multiplication Tasklets (mult), where the contribution of each Tasklet (i.e., a multiplied pair) will be summed in Data node C (due to conflicting writes that are resolved by CR: Sum). The Memlet edges define all data movement, which is seen in the input and output of each Tasklet, but also entering and leaving the Map with its overall requirements (in brackets) and number of accesses (in parentheses). The accesses and ranges are symbolic expressions, which can be summed to obtain the algorithm’s data movement characteristics. The SDFG representation allows the performance engineer to add transient (local) arrays, reshape and nest Maps (e.g., to impose a tiled schedule), fuse multiple scopes, map computations to accelerators (GPUs and FPGAs), and other transformations that may modify the overall number of accesses.
Comments
There are no comments yet.