1 Introduction
The US DOE Exascale Computing Project (ECP) Co-design Center for Particle Applications (CoPA) provides contributions to enable application readiness as we move toward exascale architectures for the “motif” of particle-based applications Alexander et al. (2020). CoPA focuses on co-design for the following “sub-motifs”: short-range particle-particle interactions (e.g., those which often dominate molecular dynamics (MD) and smoothed particle hydrodynamics (SPH) methods), long-range particle-particle interactions (e.g., electrostatic MD and gravitational N-body), particle-in-cell (PIC) methods, and O(N) complexity electronic structure and quantum molecular dynamics (QMD) algorithms.
Particle-based simulations start with a description of the problem in terms of particles and commonly use a single-program multiple-data domain decomposition paradigm for internode parallelism. Because particles in each domain interact with particles outside that domain, a list of these outside particles must be maintained and updated through internode communication. This list of outside particles is commonly kept in a set of ghost cells or a ghost region on each node. Intranode parallelism is commonly performed through work decomposition. From a description of the neighborhood (neighbor list), each particle’s forces are calculated to propagate the particles to new positions. The particles are then resorted to begin the next timestep. The main components of a timestep across the sub-motifs are shown in Figure 1. PIC is unique in that particles are used to solve continuum field problems on a grid. QMD solves the computationally intensive electronic structure for problems where details of interatomic bonding are particularly important. Shared and specific functionality are highlighted in Figure 1. The compute, memory, and/or communication challenges requiring optimization on modern computer architectures are identified, extracted, and assembled into libraries and proxy applications during the CoPA co-design process.
CoPA’s co-design process of using proxy applications (or apps) and libraries has grown out of a predecessor project, the Exascale Co-Design Center for Materials in Extreme Environments (ExMatEx) Germann et al. (2013). Two main library directions have emerged, the Cabana Particle Simulation Toolkit and the PROGRESS/BML QMD libraries, each described in later sections. Each strives for performance portability, flexibility, and scalability across architectures with and without GPU acceleration by providing optimized data structures, data layouts, and data movement in the context of the sub-motifs it addresses. Cabana is focused on short-range and long-range particle interactions for MD, PIC, and N-body applications, while PROGRESS/BML is focused on O(N) complexity algorithms for electronic structure and QMD applications. This split is primarily motivated by the difference in sub-motifs: QMD is computationally dominated by matrix operations, while the other sub-motifs share particle and particle-grid operations.
The locations for the open-source CoPA libraries and proxy apps are noted in the sections in which they are described.

The particle motif is used by many application codes to describe physical systems, including molecular dynamics simulations using empirical models or the underlying quantum mechanics for particle interactions, cosmological simulations in which a particle may represent an object (e.g. a star) or a cluster of objects and the particles interact through gravity, and plasma simulations on grids within a PIC framework to solve the interaction of particles with the electrodynamic field. The computational motifs associated with these application codes depend on the nature of the particle interactions. Short-range interactions rely heavily on the creation of a list of neighbors for direct interactions, while long-range interactions use particle-grid methods and FFTs to solve the long-range field problem. Details are described in the section on the Cabana Toolkit. The quantum mechanics in QMD problems is often expressed as a matrix problem. QMD based on localized orbitals in Density Functional Theory (DFT) and tight-binding models relies on sparse-matrix solvers. Details are described in the section on the PROGRESS/BML libraries.
Relevant particle applications are represented within CoPA and help drive the co-design process. ECP application projects such as EXAALT (LAMMPS-SNAP), WDMApp (XGC), ExaSky (HACC/SWFFT), and ExaAM (MPM) serve as application partners, as shown in Figure 2, alongside non-ECP applications. Details of these engagements are described in the section on Application Partners.
We present descriptions of the Cabana Particle Simulation Toolkit and the PROGRESS/BML QMD libraries, followed by PIC algorithm development and co-design examples with our application partners: XGC, HACC, and LAMMPS-SNAP. We conclude with a summary of our lessons learned and impact on the broader community.
2 Cabana Particle Simulation Toolkit
The Cabana toolkit is a collection of libraries and proxy applications that allows scientific software developers targeting exascale machines to develop scalable and portable particle-based algorithms and applications. The toolkit is an open-source implementation of numerous particle-based algorithms and data structures applicable to a range of application types including (but not limited to) PIC and its derivatives, MD, SPH, and N-body codes Hockney and Eastwood (1989); Liu and Liu (2003) and is usable by application codes written in C++, C, and Fortran. Notably, this covers the first three sub-motifs (see Figure 1). Cabana is designed as a library precisely because so many computational algorithms are shared across particle applications in these sub-motifs: neighbor list construction, particle sorting, multi-node particle redistribution and halo exchange, etc. This effectively separates shared capabilities from the specific application physics within individual steps of Fig. 1, e.g. the per-atom force computation in MD and the particle-grid interpolation for PIC.
Cabana is available at https://github.com/ECPCoPA/Cabana.

The toolkit provides both particle algorithm implementations and user-configurable particle data structures. Users of Cabana can leverage the algorithms and computational kernels provided by the toolkit, through memory-wrapping interfaces, independent of whether or not they also utilize the native data structures of the toolkit. The algorithms themselves span the space of particle operations necessary to support each relevant application type, across all sub-motifs. This includes intranode (local and threaded) operations on particles and internode (communication between nodes) operations to form a hybrid parallel capability. Cabana uses the Kokkos programming model for on-node parallelism Edwards et al. (2014), providing performance and portability on pre-exascale and anticipated exascale systems using current and future DOE-deployed architectures, including multi-core CPUs and GPUs. Within Cabana, Kokkos is used for abstractions to memory allocation, array-like data structures, and parallel loop concepts which allow a single code to be written for multiple architectures.
While the only required dependency for Cabana is Kokkos, the toolkit is also intended to be interoperable with other ECP scientific computing libraries which the user may leverage for functionality not within the scope of the toolkit. Use of the library in concert with other ECP libraries can greatly facilitate the composition of scalable particle-based application codes on new architectures. Current library dependencies are shown in Figure 3, with libraries developed by the ECP Software Technology (ST) projects, including hypre for preconditioners and linear solvers Falgout and Yang (2002), heFFTe for 3D-distributed FFTs Ayala et al. (2019), and ArborX Lebrun-Grandié et al. (2019) for threaded and distributed search algorithms.
Cabana includes both particle and particle-grid operations which are critical across the particle sub-motifs. We next review each of these capabilities.
2.1 Particle Abstractions
We first summarize the major particle-centric abstractions and functionality of the toolkit, including the underlying data structures and the common operations which can be applied to the particles.
2.1.1 Data Structures
Particles in Cabana are represented as tuples of multidimensional data. They are general and may be composed of fields of arbitrary types and dimensions (e.g. a mass scalar, a velocity vector, a stress tensor, and an integer id number) as defined by the user’s application. Considering the tuple to be the fundamental particle data structure (struct), several choices exist for composing groups of structs, as shown in Figure 4. A simple list of structs, called an Array-of-Structs (AoS), is a traditional choice for particle applications, especially those not targeting optimization for vector hardware. All of the data for a single particle is encapsulated in a contiguous chunk of memory, thereby improving memory locality for multiple fields within a particle. An AoS also offers simplicity for basic particle operations, such as sorting or communication, as the memory associated with a given particle may be manipulated in a single operation. An AoS, however, also has a downside: accessing the same data component in multiple particles concurrently requires strided (non-coalesced) memory accesses, which incurs a significant performance penalty on modern vector-based machines.
This penalty for strided access can be alleviated through the use of the Struct-of-Arrays (SoA) memory layout. In an SoA, each particle field is stored in a separate array, with the values of an individual field stored contiguously for all particles. This structure allows for a high-performance memory access pattern that maps well to modern vector-based architectures. The drawback of this approach, however, is twofold: first, the hardware has to track a memory stream for each particle property used within a kernel, and second, the programmer and hardware may have a harder time efficiently operating on all of the data together for a given particle. In light of these features, it is typically favorable to use SoA when effective use of vector-like hardware is vital (as is the case with GPUs), or when a subset of particle fields is used in a given kernel.
Cabana offers a zero-cost abstraction over these memory layouts, and further implements a hybrid scheme known as Array-of-Structs-of-Arrays (AoSoA). An AoSoA attempts to combine the benefits of both AoS and SoA by storing particle fields contiguously within blocks of user-configurable size. This approach means a single memory load can fetch a coalesced group of particle field data from memory, while retaining high memory locality for all fields of a given particle. The added AoSoA dimension is a key performance tuning parameter exposed by Cabana: it enables the user to configure the memory layout of a given particle array at compile time or to have it automatically selected based on the target hardware.
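To make the layout trade-offs concrete, the following self-contained C++ sketch (our own illustration, not Cabana’s data structures; all type and function names are hypothetical) contrasts AoS, SoA, and an AoSoA with a compile-time inner vector length V:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle with one scalar field (mass) and one 3-vector (velocity).

// AoS: all fields of a particle are contiguous in memory.
struct ParticleAoS { double mass; std::array<double, 3> vel; };
using AoS = std::vector<ParticleAoS>;

// SoA: each field is stored in its own array, contiguous across particles.
struct SoA {
    std::vector<double> mass;
    std::vector<std::array<double, 3>> vel;
};

// AoSoA: an array of fixed-size SoA blocks; V is the tunable inner vector
// length (Cabana exposes an analogous compile-time parameter).
template <std::size_t V>
struct SoABlock {
    double mass[V];
    double vel[3][V];  // component-major so vel[d][i] is coalesced over i
};
template <std::size_t V>
using AoSoA = std::vector<SoABlock<V>>;

// Index particle p in an AoSoA: outer struct index plus inner array offset.
template <std::size_t V>
double& aosoa_mass(AoSoA<V>& a, std::size_t p) {
    return a[p / V].mass[p % V];
}
```

With V = 1 the AoSoA degenerates to an AoS of single-element blocks, while V equal to the total particle count recovers an SoA; Cabana exposes the analogous vector length as a tunable parameter.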
2.1.2 Particle Sorting
Particle sorting is a functionality requirement of all currently identified user applications. In plasma PIC, fluid/solid PIC, MD, N-body, and SPH calculations, particle sorting serves as a means of improving memory access patterns based on some criterion, which can provide an improvement in on-node performance. This could be, for example, placing particles that access the same grid cell data near each other in memory, or grouping particles together by material type such that particles adjacent in memory will be operated on by the same computational kernel in the application. The frequency of sorting often depends on the application, as well as the given problem. Slowly evolving problems may need to sort particles less frequently, as local particle properties that affect memory access (e.g. grid location or nearest neighbors) will be relatively stable. Efficient memory access objectives should be based on the target computing platform (e.g. GPUs or multi-CPU devices). Particle sorting is applicable to locally owned particles or to both locally owned and ghost particles. Cabana provides the ability to sort particles by spatial location through geometric binning or by an arbitrary key value, which can be either a particle property or another user-provided value. Cabana uses the bin sort functionality in Kokkos, with plans for additional options through Kokkos in the future.
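A minimal sketch of geometric binning in one dimension (our own illustration, not Cabana’s bin sort, which defers to Kokkos) computes a permutation that groups particles by grid cell:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Sort particle indices by 1D spatial bin (grid cell) so particles that touch
// the same cell become adjacent in memory. A 3D version would use a flattened
// (i, j, k) cell index as the sort key.
std::vector<std::size_t> bin_sort_1d(const std::vector<double>& x,
                                     double cell_size) {
    std::vector<std::size_t> perm(x.size());
    std::iota(perm.begin(), perm.end(), 0);
    std::stable_sort(perm.begin(), perm.end(),
        [&](std::size_t a, std::size_t b) {
            return std::floor(x[a] / cell_size) < std::floor(x[b] / cell_size);
        });
    return perm;  // apply perm to reorder every particle field array
}
```

Applying the returned permutation to each particle field array realizes the improved locality described above.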
2.1.3 Neighbor List Creation
PIC, MD, N-body, and SPH can all benefit from the efficient construction of particle neighbor lists. These neighbor lists are typically generated based on a distance criterion defining a physical neighborhood, or instead based on some fixed number of nearest neighbors. In both MD and SPH simulations the neighbor list is a critical data structure and is computed more frequently than most other particle operations, up to the frequency of every time step. In fluid/solid PIC applications the neighbor list is an auxiliary data structure for computational convenience that, when used, is similarly computed up to the frequency of every time step. In the fluid/solid PIC case the background grid can be used to accelerate the neighbor search, whereas in MD, N-body, and SPH a grid may need to be created specifically for this search acceleration. Cabana provides multiple variations of neighbor lists, including traditional binning methods used in many MD applications (Verlet lists), as well as tree-based algorithms from the ECP ST ArborX library Lebrun-Grandié et al. (2019), using compressed and dense storage formats, and thread-parallel hierarchical list creation; all of these options can improve performance for different architectures and particle distributions.
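The stored structure of a distance-based Verlet list can be sketched as follows (a direct O(N^2) search for brevity; production codes, including Cabana’s, accelerate the search with grid binning, but the result, a list of neighbor indices per particle, is the same):

```cpp
#include <array>
#include <cassert>
#include <vector>

using Vec3 = std::array<double, 3>;

// Build a half neighbor list: each pair within the cutoff is stored once,
// on the lower-indexed particle.
std::vector<std::vector<int>> verlet_list(const std::vector<Vec3>& pos,
                                          double cutoff) {
    const double rc2 = cutoff * cutoff;
    std::vector<std::vector<int>> nbrs(pos.size());
    for (int i = 0; i < (int)pos.size(); ++i)
        for (int j = i + 1; j < (int)pos.size(); ++j) {
            double dx = pos[i][0] - pos[j][0];
            double dy = pos[i][1] - pos[j][1];
            double dz = pos[i][2] - pos[j][2];
            if (dx * dx + dy * dy + dz * dz < rc2)
                nbrs[i].push_back(j);
        }
    return nbrs;
}
```

A full list, storing each pair on both particles, trades memory and redundant work for conflict-free force accumulation in threaded kernels.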
2.1.4 Halo Exchange and Redistribution
Many user applications require parallel communication of particles between compute nodes when spatial domain decomposition is used and particle data must be shared between adjacent domains. Some applications require halo exchange operations on particles, and some, in particular grid-free methods (e.g. MD), additionally require ghost particle representations to complete local computations near domain boundaries. In many cases, the halo exchange is executed at every single time step of the simulation, with a communication pattern that may also be computed at every time step or at some larger interval and reused between constructions. In addition, in many algorithms particles need to be redistributed to new compute nodes, either as a result of a load balancing operation or because advection has moved particles to a new region of space owned by another compute node. The toolkit provides implementations for ghost particle generation and halo exchange, including both gather and scatter operations, as well as a migration operation to redistribute particles to new owning domains.
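As a minimal illustration of the halo concept (our own sketch, not Cabana’s API), the following selects the local particles within one halo width of the upper boundary of a 1D domain; these are the particles whose data would be gathered and sent to the neighboring rank as ghosts, with the MPI communication itself omitted:

```cpp
#include <cassert>
#include <vector>

// Identify export candidates for the upper neighbor: any particle closer to
// the boundary than the halo width (e.g. a force cutoff) must be ghosted.
std::vector<int> halo_export_ids(const std::vector<double>& x,
                                 double domain_hi, double halo_width) {
    std::vector<int> ids;
    for (int i = 0; i < (int)x.size(); ++i)
        if (x[i] > domain_hi - halo_width) ids.push_back(i);
    return ids;
}
```

A gather packs the fields of these particles into a send buffer; a scatter performs the reverse, accumulating ghost contributions (e.g. forces) back onto the owning rank.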
2.1.5 Parallel Loops
Cabana adds two main extensions to the Kokkos parallel constructs, which handle portable threaded parallelism and map hierarchical and nested parallelism to up to three levels on the hardware. First, SIMD parallel loops are directly connected to the AoSoA data structures and provide the user simple iteration over the added inner vector dimension, for threaded parallelism over both particle structs and vectors, with potential performance improvements by exposing coalesced memory operations. Second, neighbor parallel loops provide both convenience and flexibility for any particle codes which use a neighbor list (see above) to iterate over both particles and neighboring particles (including first- and/or second-level neighbors). Cabana handles all neighbor indexing, and the user kernel deals only with application physics. This also enables applications to easily change the parallel execution policy and use the appropriate threading over the neighbor list structures (or serial execution) for the kernel, problem, and hardware.
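The control flow of a neighbor loop can be sketched serially as below (our own illustration); Cabana’s neighbor parallel loops map this same pattern onto threads and vector lanes, so the user kernel only ever sees a particle index and a neighbor index:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Iterate over every (particle, neighbor) pair in a neighbor list and invoke
// the user kernel. The framework owns the indexing; the kernel owns the physics.
void for_each_neighbor(const std::vector<std::vector<int>>& nbrs,
                       const std::function<void(int, int)>& kernel) {
    for (int i = 0; i < (int)nbrs.size(); ++i)
        for (int j : nbrs[i]) kernel(i, j);
}
```

Swapping this serial driver for a threaded or team-based one changes the execution policy without touching the kernel, which is the flexibility described above.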
2.2 ParticleGrid Abstractions
In addition to pure particle abstractions, the toolkit also contains optional infrastructure for particle-grid concepts, which we present next.
2.2.1 Long Range Solvers
Among the user applications, long-range solvers encapsulate a wide variety of kernels and capabilities, but many include critical kernels that apply to many applications. For example, embedded within the long-range solve of a PIC operation are kernels for interpolating data from the particles to the grid and from the grid to the particles, as well as possibly grid-based linear solvers. Other simulations, such as MD and SPH calculations, compose the long-range solvers with particle-grid operations, but instead use other algorithms such as Fast Fourier Transforms (FFTs) to complete the long-range component of the solve. A variety of libraries provide FFT capabilities, including high-performance scalable FFT libraries being developed for large systems. Cabana currently implements direct use of the ECP ST heFFTe library for performance-portable FFTs Ayala et al. (2019), and Cabana’s flexibility will enable use of other ECP FFT libraries in the future, including FFTX Franchetti et al. (2020), SWFFT Pope et al. (2017), and fftMPI Plimpton et al. (2018). In addition, Cabana interfaces to hypre Falgout and Yang (2002) for access to linear solvers.

2.2.2 Particle-Grid Interpolation
PIC methods, as well as methods which require long-range solvers, usually need some type of interpolation between particle and grid representations of a field in order to populate the grid data needed by FFTs, linear solvers, or other field operations. The toolkit provides services for interpolation to logically structured grids based on multidimensional spline functions, which are available in multiple orders. By differentiating the spline functions, differential operators may be composed during the interpolation process, allowing users to interpolate gradients, divergences, and other operators of scalar, vector, and tensor fields. Other types of interpolants can also be added in the future for additional capabilities in PIC applications. As needed, we also envision generalizing the Cabana interpolation infrastructure to support user-defined interpolants on structured grids.
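As a minimal example of spline-based particle-grid transfer (our own 1D sketch, not the Cabana interpolation API), a first-order spline deposits each particle value onto the two surrounding grid nodes with linear weights; the same weights, applied in reverse, gather grid data back to particles, and higher-order splines simply widen the stencil:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// First-order (linear) spline deposition of particle quantities q at
// positions x onto a 1D uniform grid with spacing dx.
void p2g_linear(const std::vector<double>& x, const std::vector<double>& q,
                double dx, std::vector<double>& grid) {
    for (std::size_t p = 0; p < x.size(); ++p) {
        double s = x[p] / dx;
        int i = (int)std::floor(s);
        double w = s - i;                  // fractional offset within the cell
        grid[i]     += (1.0 - w) * q[p];   // weight to the left node
        grid[i + 1] += w * q[p];           // weight to the right node
    }
}
```

Differentiating the spline weights instead of evaluating them yields the gradient-composed interpolation mentioned above.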
2.3 Proxy apps
Cabana-based proxy apps have been developed in order to demonstrate and improve Cabana functionality, as well as to explore new algorithms and ideas when they are deployed in the context of a sub-motif. In addition to those presented here, more proxies are planned to cover additional variations of the algorithmic abstractions represented by application partners.
2.3.1 CabanaMD
CabanaMD is a LAMMPS Plimpton (1995) proxy app built on Cabana, developed directly from the ExaMiniMD proxy (“KokkosMD”) Edwards et al. (2014). CabanaMD represents both the short- and long-range MD sub-motifs; in fact, the MD timestep can easily be re-expressed as calls to the Cabana library, as shown in Figure 5. Similar figures could also be created for all sub-motifs in Figure 1 with all the CoPA proxy apps.
CabanaMD is available at https://github.com/ECPCoPA/CabanaMD.

MD uses Newton’s equations for the motion of atoms, with various models for interatomic forces and ignoring electrons. In contrast to ExaMiniMD and many applications, only the main physics (the force kernel evaluating the interatomic model and the position integration) is retained in the application, with everything else handled by Cabana. Main results demonstrating Cabana capabilities include the performance implications of combining or separating particle properties in memory, changing the particle property AoSoA memory layouts, and using the available options for each algorithm (e.g. data layout, hierarchical creation, and hierarchical traversal for neighbor lists). This has been done primarily with the Lennard-Jones short-range force benchmark kernel. In addition, CabanaMD enabled a speedup on the order of 3x (one Nvidia V100 GPU compared to a full IBM Power 9 CPU node) for new neural network interatomic models Behler and Parrinello (2007) by reimplementing with Kokkos and Cabana, rewriting the short-range force kernels, and exposing threaded parallelism Desai et al. (2020).

CabanaMD also includes long-range forces using Cabana data structures and particle-grid algorithms, covering the second sub-motif. The smooth particle mesh Ewald (SPME) Essmann et al. (1995) method is implemented with Cabana grid structures and spline kernels to spread particle charge onto a uniform grid, 3D FFTs using the Cabana interface to heFFTe Ayala et al. (2019) for reciprocal-space energies and forces, and a Cabana gradient kernel to gather force contributions back from the grid to atoms. Real-space energy and force calculations use the Cabana neighbor parallel iteration options, as with short-range interactions.
Continuing work on CabanaMD will include benchmarking long-range performance and using it as a vehicle to implement and improve performance-portable machine-learned interatomic models.
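For reference, the Lennard-Jones kernel benchmarked in CabanaMD reduces to a few flops per pair. The sketch below (our own, with sigma and eps the usual LJ length and energy parameters) returns F/r, the force magnitude divided by distance, which is convenient for scaling the displacement vector directly:

```cpp
#include <cassert>
#include <cmath>

// Lennard-Jones pair force over distance:
// F(r)/r = 24*eps*( 2*(sigma/r)^12 - (sigma/r)^6 ) / r^2,
// evaluated from the squared distance r2 to avoid a square root per pair.
double lj_force_over_r(double r2, double sigma, double eps) {
    double sr2 = sigma * sigma / r2;
    double sr6 = sr2 * sr2 * sr2;
    return 24.0 * eps * sr6 * (2.0 * sr6 - 1.0) / r2;
}
```

In a neighbor loop, each pair contributes (F/r) times the displacement components to both particles, which is the accumulation pattern the SIMD and neighbor parallel loops above thread over.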
2.3.2 CabanaPIC
CabanaPIC is a relativistic PIC proxy app using Cabana, capable of modeling kinetic plasmas and representing the PIC sub-motif. CabanaPIC is available at https://github.com/ECPCoPA/CabanaPIC. It has strong ties with the production code VPIC Bowers et al. (2009), but is able to act as a representative proxy for all traditional electromagnetic PIC codes which use a structured grid. It implements a typical Boris pusher, as well as a finite-difference time-domain (FDTD) field solver.
CabanaPIC focuses on short-range particle-grid interactions, and its performance is heavily dependent on techniques to manage conflicting writes to memory (such as atomics). It employs Cabana’s particle sorting techniques, and offers examples of how to use Cabana for simple MPI-based particle passing.
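For illustration, a non-relativistic Boris push for one particle is sketched below (our own sketch; CabanaPIC and VPIC use the relativistic variant, and the charge-to-mass ratio and timestep are folded into the single factor qm_dt here). It applies a half electric kick, a norm-preserving magnetic rotation, and a second half kick:

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]};
}

// One Boris velocity update: v_minus -> rotate about B -> v_plus.
Vec3 boris_push(Vec3 v, const Vec3& E, const Vec3& B, double qm_dt) {
    Vec3 vminus, t, s, vprime, vplus;
    for (int d = 0; d < 3; ++d) vminus[d] = v[d] + 0.5 * qm_dt * E[d];  // half E kick
    for (int d = 0; d < 3; ++d) t[d] = 0.5 * qm_dt * B[d];
    double t2 = t[0] * t[0] + t[1] * t[1] + t[2] * t[2];
    for (int d = 0; d < 3; ++d) s[d] = 2.0 * t[d] / (1.0 + t2);
    Vec3 c1 = cross(vminus, t);
    for (int d = 0; d < 3; ++d) vprime[d] = vminus[d] + c1[d];
    Vec3 c2 = cross(vprime, s);
    for (int d = 0; d < 3; ++d) vplus[d] = vminus[d] + c2[d];           // B rotation
    for (int d = 0; d < 3; ++d) vplus[d] += 0.5 * qm_dt * E[d];         // half E kick
    return vplus;
}
```

With E = 0 the rotation conserves kinetic energy exactly, which is the property that makes the Boris scheme the standard pusher in electromagnetic PIC codes.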
2.3.3 ExaMPM
ExaMPM is a Cabana-based proxy application for the Material Point Method (MPM), which is being used as part of high-fidelity simulations of additive manufacturing with metals in the ExaAM project in ECP. ExaMPM also represents the PIC sub-motif and is available at https://github.com/ECPCoPA/ExaMPM. MPM, a derivative of PIC, is used to solve the Cauchy stress form of the Navier-Stokes equations, including terms for mass, momentum, and energy transport, where particles track the full description of the material being modeled in a Lagrangian and continuum sense. The MPM simulations in ExaMPM model the interaction of a laser with metal powders, the melting of the powder due to heating from the laser, and the solidification of the melted substrate after the laser is turned off. When modeled at very high fidelity, using particles to track the free surface interface of the molten metal and the liquid-solid interface during phase change will allow for both empirical model generation in tandem with experiments, as well as reduced-order model generation to use with engineering-scale codes in the ExaAM project.
ExaMPM largely implements the base algorithmic components of the MPM model, including an explicit form of time integration, a free surface formulation with complex moving interfaces, and higher-order particle-grid transfer operators which reduce dissipation. As an example, Figure 6 shows a water column collapse modeled with ExaMPM. This problem has a dynamic moving surface structure resembling the dynamics of the molten metal in the high-fidelity ExaAM simulations, as well as particle populations which change rapidly with respect to the local domain. Scaling up this problem through larger particle counts and larger computational meshes will allow us to study better techniques for interface tracking, more scalable communication and load balancing algorithms to handle the moving particles, and sorting routines to improve locality in particle-grid operations.
2.4 Cabana Applications
A significant metric for the impact of a given software library is its adoption. First, Cabana is being used as the basis for a new production application closely related to the ExaMPM proxy described in the previous section. Another notable usage of Cabana is in the transition of an existing application; the section on XGC-Cabana below details the process of converting XGC, initially using Cabana through Fortran interfaces, eventually to full C++ with Cabana.
In addition, the Cabana library has influenced and will continue to influence production applications and related libraries. This includes LAMMPS and HACC (also described in later sections), where the performance of an algorithm, data layout, etc. can be demonstrated with Cabana and/or its proxy apps and migrated to the separate application; similarly, interactions between Cabana and the libraries it depends on can improve each for a given application.
3 PROGRESS/BML Quantum Molecular Dynamics Libraries
This section focuses on the solvers addressing the quantum part of Quantum Molecular Dynamics (QMD), the sub-motif listed in the fourth column of Fig. 1. QMD uses electronic structure (ES) based atomic forces to advance the positions of classical particles (atoms) in the Born-Oppenheimer approximation Marx and Hutter (2009). There are many benefits of this technique as compared to classical methods. These benefits include: independence of the results from the choice of a particular force field; enabling the formation and breaking of bonds (chemical reactions) as the simulation proceeds; and the possibility of extracting information from the electronic structure throughout the simulation. The counterpart to these benefits is the large computational cost associated with having to determine the ES of the system before advancing the particle coordinates. In order to perform practical MD simulations, the strong scaling limit becomes important, as time-to-solution needs to be as small as possible to enable long simulations with tens of thousands of timesteps and achieve a good sampling of the phase space. Determining the ES is the main bottleneck of QMD: the so-called single-particle density matrix (DM) needs to be computed from the Hamiltonian matrix. This step requires a significant number of arithmetic operations and typically scales with the cube of the number of particles. QMD is hence characterized by this unique critical step that sets it apart from all the particle simulation methods within the scope of the Cabana toolkit (described in the previous section); therefore, additional libraries are needed for increasing its performance and portability.
Another significant challenge surrounding the development of QMD codes is that the solvers used to compute the ES are strongly dependent on the chemical systems (atom types and bonds), which implies that it is necessary to develop and maintain several different algorithms that are suitable for each particular system. For instance, the single-particle DM for insulators (with a wide energy gap between the highest-occupied and lowest-unoccupied states) is essentially a projector onto the subspace spanned by the eigenfunctions associated with the lowest eigenvalues of the Hamiltonian matrix (occupied states). For metals, on the other hand, the DM is not strictly a projector, since a temperature-dependent weight between 0 and 1 is associated with each eigenstate (Fermi-Dirac distribution). Nevertheless, in both cases the DM can be computed from the knowledge of the eigenpairs of the Hamiltonian, which are computationally expensive to determine.
In general, we can say that the construction of the DM, ρ, can be expressed as a matrix function of the Hamiltonian H: ρ = f(H). Such a function f can be computed exactly by diagonalizing the matrix H. The function then becomes f(H) = U f(ε) U†, where ε is a diagonal matrix containing the eigenvalues of H, and U is a unitary transformation whose columns contain the eigenvectors of H. Here f is the Fermi distribution function for finite temperatures.

In order to increase productivity in the implementation and optimization of these algorithms, we have adopted a framework in which we clearly separate the matrix operations from the solver implementations. The framework relies on two main libraries: the “Parallel, Rapid O(N) and Graph-based Recursive Electronic Structure Solver” (PROGRESS), and the “Basic Matrix Library” (BML). The software stack can be seen in Figure 7. At the highest level, electronic structure codes call the solvers in the PROGRESS library which, in turn, rely on BML. The BML library provides basic matrix data structures and operations. These consist of linear algebra matrix operations which are optimized based on the format of the matrix and the architecture where the program will run. Applications can also directly implement specific algorithms based on BML when those are not available as routines in PROGRESS. Both libraries use Travis CI and codecov.io for continuous integration and code coverage analysis, respectively. Every commit is tested over a set of compilers and compiler options. Our overarching goal is to construct a flexible library ecosystem that helps to quickly adapt and optimize electronic structure applications on exascale architectures. Alternative libraries that overlap with the matrix formats and algorithms implemented in PROGRESS and BML include DBCSR The CP2K Developers Group (2020) and NTPoly Dawson and Nakajima (2018), both also focusing on electronic structure applications.
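As a small worked instance of computing the DM as a matrix function of the Hamiltonian, the sketch below (our own illustration, not a PROGRESS solver) builds the zero-temperature DM of a real symmetric 2x2 Hamiltonian with one occupied state: f is 1 for the lower eigenvalue and 0 for the upper, so the DM reduces to the projector onto the lowest eigenvector.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat2 = std::array<std::array<double, 2>, 2>;

// Zero-temperature DM of H = [[h11, h12], [h12, h22]] with one occupied
// state. Assumes h12 != 0 so the eigenvector formula below is well defined.
Mat2 density_matrix_2x2(double h11, double h22, double h12) {
    // Lower eigenvalue of the symmetric 2x2 matrix.
    double e_lo = 0.5 * (h11 + h22)
                - std::sqrt(0.25 * (h11 - h22) * (h11 - h22) + h12 * h12);
    // Unnormalized lower eigenvector from (H - e_lo I) u = 0.
    double u0 = h12, u1 = e_lo - h11;
    double n = std::sqrt(u0 * u0 + u1 * u1);
    u0 /= n; u1 /= n;
    // Projector rho = u u^T onto the occupied subspace.
    Mat2 rho;
    rho[0][0] = u0 * u0; rho[0][1] = u0 * u1;
    rho[1][0] = u1 * u0; rho[1][1] = u1 * u1;
    return rho;
}
```

At finite temperature each eigenvector would instead carry its Fermi-Dirac weight, and the DM becomes a weighted sum of such projectors.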
3.1 Basic Matrix Library
The increasing availability of heterogeneous computer platforms is the motivation behind the development of the BML software library. Multiple data storage formats (both sparse and dense) and programming models (distributed, threaded, and task-based) complicate the testing and optimization of electronic structure codes.
The basic matrix library package (BML) contains the essential linear algebra operations that are relevant to electronic structure problems. The library is written in C, which allows for straightforward implementation on exascale machines. It also exposes a Fortran interface, with Fortran wrappers written around the C functions; this facilitates its adoption, since many application codes in this community are written in Fortran. One of the main advantages of BML is that the APIs are the same for all matrix types, data types, and architectures, enabling users to build unified solvers that work for multiple matrix formats. Low-level implementations within the BML library are tailored to particular matrix formats and computer architectures. The formats handled so far are: dense, ELLPACK-R Mniszewski et al. (2015), Compressed Sparse Row (CSR) Saad (2003), and ELLBLOCK (see Figure 8). Here, dense refers to a typical two-dimensional array; it is the most suitable format for systems that have a high proportion of nonzero values in the DM. ELLPACK-R is a sparse matrix format constructed from three arrays: a one-dimensional array that tracks the number of nonzeros in each row; a two-dimensional array that holds the column indices of the nonzero values within each row; and a second two-dimensional array that stores the nonzero values themselves. The row data are zero-padded to constant length, so the data for successive rows are separated by a constant stride. ELLBLOCK is a block version of ELLPACK-R: the matrix is decomposed into blocks that are either considered full of zeroes and not stored, or treated as dense blocks of nonzeroes. Loops over nonzero matrix elements are replaced by loops over nonzero blocks, and dense linear algebra operations are performed on the blocks. The CSR format in its typical implementation uses a floating-point array to store the nonzero entries of the matrix in row-major order and an integer array to store the corresponding column indices; an additional array indexing the beginning of each row in the data arrays is needed for accessing the data. Unlike the ELLPACK-R format, which stores entries in a two-dimensional array with a fixed width for all rows, the CSR format keeps the variability in the number of nonzeros per row, thus avoiding zero-padding. The implementation of the CSR format in BML follows an Array-of-Structs-of-Arrays (AoSoA) approach: a matrix is represented as an array of compressed sparse rows, where each compressed row stores only the nonzero entries and associated column indices of the corresponding row of the matrix. In this approach, the additional array of indices to the beginning of each row is no longer needed, and the matrix is extensible: it can grow by simply adding new entries without the need to destroy and rebuild it.
BML dense matrix functions are typically wrappers on top of a vendor-optimized library, such as BLAS/LAPACK implementations. For NVIDIA GPUs, we use the MAGMA library for dense matrices Dongarra et al. (2014), relying on its memory allocation functions and many of its other capabilities. One exception is the dense eigensolver, for which we use NVIDIA cuSOLVER, which at the moment performs substantially better than the MAGMA solver. In the case of the sparse formats, each BML function is written specifically for that format. Performance portability is achieved by keeping one codebase with high flexibility for configuring and building. BML has been compiled and tested with multiple compilers (GNU, IBM, Intel, etc.) on several pre-exascale architectures.
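The ELLPACK-R layout described above can be sketched in a few lines. The following is a minimal illustration with NumPy, not the BML implementation; the function names are invented for this example.

```python
import numpy as np

def dense_to_ellpackr(A, max_nnz_per_row):
    """Convert a dense matrix into the three ELLPACK-R arrays:
    nnz[i]     -- number of nonzeros in row i,
    cols[i,k]  -- column index of the k-th nonzero in row i (zero-padded),
    vals[i,k]  -- value of the k-th nonzero in row i (zero-padded)."""
    n = A.shape[0]
    nnz = np.zeros(n, dtype=int)
    cols = np.zeros((n, max_nnz_per_row), dtype=int)
    vals = np.zeros((n, max_nnz_per_row))
    for i in range(n):
        k = 0
        for j in range(A.shape[1]):
            if A[i, j] != 0.0:
                cols[i, k] = j
                vals[i, k] = A[i, j]
                k += 1
        nnz[i] = k
    return nnz, cols, vals

def ellpackr_matvec(nnz, cols, vals, x):
    """y = A @ x touching only the stored nonzeros of each row."""
    n = len(nnz)
    y = np.zeros(n)
    for i in range(n):
        for k in range(nnz[i]):
            y[i] += vals[i, k] * x[cols[i, k]]
    return y
```

Because every row occupies a fixed stride, the inner loop over `k` is regular and amenable to SIMD or GPU-thread parallelism, at the cost of the zero-padding that CSR avoids.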
BML ELLPACK-R matrix functions are implemented with OpenMP, both on CPU and GPU, the latter using target offload. The current algorithm uses a per-row work array that is larger than the GPU cache for matrix sizes of interest, leading to poor performance on GPU. Previous work by Mohd-Yusof et al. (2015) demonstrated the performance of a novel merge-based implementation of sparse-matrix multiply on GPU, implemented in CUDA. Future implementations will mix OpenMP offload with native CUDA kernels to enable performance while retaining a consistent interface with the existing OpenMP implementations. Benchmarking indicates this should allow a speedup on an NVIDIA V100 compared to an IBM Power9 (using all 21 cores of one socket).
BML offers support for four data types: single precision, double precision, complex single precision, and complex double precision. The source code for these data types is shared for most functions, with C macros preprocessed at compile time to generate functions for the four different data types. All BML function names are prefixed with bml_.
The code listing in Figure 9 shows the use of the BML API in one of our O(N) complexity algorithms, the “second order spectral projection” (SP2) Niklasson et al. (2003). We show how BML matrices are allocated by passing the matrix type (the variable "ellpack" in this case), the element kind (a real kind, indicated with the predefined variable bml_element_real), and the precision (in this case double, passed with the variable dp). More information about how to use the BML API can be found at https://lanl.github.io/bml/API/index.html.
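As a rough illustration of the SP2 recursion shown in Figure 9, the sketch below builds a density matrix with dense NumPy arrays in place of BML calls. The function name, the branch-selection criterion, and the convergence test are illustrative choices, not the PROGRESS/BML implementation (which also applies numerical thresholding to retain sparsity).

```python
import numpy as np

def sp2_density_matrix(H, n_occ, eps_min, eps_max, max_iter=100, tol=1e-10):
    """Second-order spectral projection (SP2): purify an initial guess
    into the density matrix D (the projector onto the n_occ lowest
    eigenstates of H) using only matrix-matrix multiplications.
    eps_min and eps_max must bound the spectrum of H."""
    n = H.shape[0]
    # Map the spectrum of H linearly into [0, 1], with low (occupied)
    # energies mapped toward 1.
    X = (eps_max * np.eye(n) - H) / (eps_max - eps_min)
    for _ in range(max_iter):
        X2 = X @ X
        tr_sq = np.trace(X2)                 # trace if we take X <- X^2
        tr_lin = 2.0 * np.trace(X) - tr_sq   # trace if X <- 2X - X^2
        # Pick the projection that moves the trace toward n_occ.
        if abs(tr_sq - n_occ) <= abs(tr_lin - n_occ):
            X = X2
        else:
            X = 2.0 * X - X2
        # Converged once X is (nearly) idempotent with the right trace.
        if abs(np.trace(X) - n_occ) < tol and np.linalg.norm(X @ X - X) < tol:
            break
    return X
```

Each iteration sharpens the eigenvalues toward 0 (unoccupied) or 1 (occupied); in a sparse BML format, a threshold applied after each multiply keeps the cost O(N).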
Our implementations of the various matrix formats are quite mature for CPUs, including their threaded versions. GPU implementation efforts are ongoing. Future developments include a distributed version of BML for the various matrix types.
A recent effort Adedoyin et al. (2019) focused on optimizing at the multithreaded level, as opposed to modifying the data structures for performance at the Single Instruction Multiple Data (SIMD) scale. Several active and passive directives/pragmas were incorporated that help inform the compiler about the nature of the data structures and algorithms. Though these optimizations targeted multicore architectures, most are also applicable to the manycore architectures present on modern heterogeneous platforms. Herein, we refer to multicore systems as readily available HPC nodes customarily configured with approximately 10 to 24 cores per socket at high clock speed (2.4 to 3.9 GHz), and to manycore systems as accelerators or GPUs. Several optimization strategies were introduced, including 1) strength reduction, 2) memory alignment for large arrays, 3) Non-Uniform Memory Access (NUMA)-aware allocations to enforce data locality, and 4) appropriate thread affinity and bindings to enhance overall multithreaded performance.
A more indepth description of the BML library and its functionalities can be found in Bock et al. (2018) and the code is available at https://github.com/lanl/bml.
3.2 PROGRESS Library
The computational cost of solving the eigenvalue problem to compute the DM scales as O(N^3), where N is the number of atomic orbitals in the system. Recursive methods, however, such as SP2 Niklasson et al. (2003), can compute the density matrix in O(N) operations for a sparse Hamiltonian matrix.
PROGRESS is a Fortran library that can be used for general-purpose quantum chemistry calculations. It implements several O(N) solvers Niklasson et al. (2016); Negre et al. (2016); Mniszewski et al. (2019) and is publicly available at https://github.com/lanl/qmdprogress. As described above and shown in Figure 7, PROGRESS relies entirely on BML for algebraic operations; hence, while electronic structure algorithms and solvers are outlined in PROGRESS, the mathematical manipulations are all performed in BML. The library is currently used by LATTE, a tight-binding (TB) code specifically developed to perform QMD simulations Bock et al. (2008). It can also be used with DFTB+, a widely used density functional tight-binding code Hourahine et al. (2020). In TB methods, matrix elements are typically obtained empirically from fits to more accurate calculations or to experiments, rather than being explicitly computed from electronic wave functions. The BML library can, however, also be used by first-principles codes, in particular O(N) codes where matrix elements correspond to pairs of localized orbitals Fattebert et al. (2016).
As was mentioned previously, the appropriate solver to compute the electronic structure depends strongly on the type of chemical system. Metals, for example, are difficult to treat since their electronic structure is hard to converge given the delocalized nature of the electrons. The Hamiltonian and DM have different sparsity patterns, which will determine the matrix format to use. Hence, there is room for exploring different matrix formats and solvers depending on the type of system. The SP2 method, as implemented in the PROGRESS library with the possibility of using BML sparse matrix multiplications, computes the density matrix without diagonalizing the Hamiltonian matrix. An example SP2 algorithm as implemented in PROGRESS is shown in Figure 9. Its computational complexity becomes O(N) for sparse Hamiltonians, provided proper thresholding is applied at every iteration. Performance of the PROGRESS library is tested using model Hamiltonian matrices that mimic the actual Hamiltonians of different materials such as semiconductors, soft matter, and metals.
Figure 10 shows the performance of two typical PROGRESS routines for constructing the DM on GPU, applied to a soft-matter type of Hamiltonian. The standard algorithm for constructing the DM is based on matrix diagonalization (shown in black in the plot of Figure 10). The SP2 algorithm (see Figure 9), instead, is based on matrix multiplications (shown in red in the plot of Figure 10). The computational complexity is O(N^3) for both algorithms due to the nature of the matrix operations involved: both diagonalization and dense matrix multiplication scale as O(N^3) for dense (unthresholded) matrices. Furthermore, in these cases the density matrix is computed exactly, since no threshold is used. For systems where the DM becomes dense and a sparse format cannot be used, the GPU versions of these algorithms are significantly more performant than the corresponding CPU-threaded versions. We also notice that the DIAG algorithm is slower than SP2 for smaller systems (less than 6000 orbitals). This is because the dense diagonalization algorithm and its GPU implementation are not as efficient, in particular for small matrices, while the SP2 algorithm is dominated by matrix-matrix multiplications, which can be implemented very efficiently on GPUs. For large systems, however, the DIAG algorithm performs slightly better.
For systems leading to a sparse DM, O(N) complexity is achieved by using the SP2 algorithm in combination with a sparse format, as shown in Figure 11. This plot shows the performance of the SP2 algorithm on CPU using the different formats (ELLPACK-R, CSR, and ELLBLOCK) applied to a soft-matter-type Hamiltonian. The figure shows both a large gain in performance from the sparse formats and O(N) complexity over the range of system sizes analyzed. In these cases the DM is not exact; the error depends on the threshold parameter, regardless of which of the three formats is used, and can be made sufficiently small for many practical applications.
Depending on the size of system to be simulated, and the resulting structure of the matrix, optimal time to solution may be obtained from a particular combination of matrix format and hardware choice. The use of PROGRESS and BML allows these choices to be made at runtime, without changing the underlying code structure.
3.3 PROGRESS/BML Applications
Our work on PROGRESS/BML is now being used to develop a capability for all-atom QMD simulations of proteins, extending the impact of our ECP work to biomedical research, including studies of SARS-CoV-2 proteins. Current classical biomolecular MD simulations typically involve O(1010) atoms and exceed 100 ns in simulation time. Obtaining nanosecond-duration simulations for such systems is currently beyond the reach of QMD codes. The highly scalable NAMD MD code, a popular choice for biomolecular simulations, recently incorporated breakthrough capabilities in hybrid quantum mechanical/molecular mechanical (QM/MM) simulations Melo et al. (2018), enabling useful simulations with O(1010) of the atoms treated using QM. To extend the size of the QM region to a whole protein with O(10) atoms, we are now integrating PROGRESS/BML with NAMD, using the LATTE electronic structure code as the QM solver Bock et al. (2008). LATTE uses Density Functional Theory (DFT) in the tight-binding approximation, an established approach for ab initio studies of biomolecular systems Cui and Elstner (2014). It combines O(N) computational complexity with extended-Lagrangian Born-Oppenheimer MD, and has achieved a rate of 2.1 ps of simulation time per day of wall-clock time for a solvated Trp-cage miniprotein system consisting of 8349 atoms Mniszewski et al. (2015). The combination of LATTE with NAMD is therefore an excellent choice for pursuing nanosecond-duration whole-protein QM/MM simulations.
Our efforts in integrating LATTE with the PROGRESS and BML libraries have significantly benefited the EXAALT ECP project. Some materials studied in EXAALT, such as UO2, required a very involved modification of the LATTE code to increase performance. This was made possible by extensive use of the PROGRESS and BML routines.
4 PIC Algorithm Development
The longevity of any software framework is dictated, at least partially, by its ability to adapt to emerging algorithms that may not have existed, nor been foreseen, at the time of the framework’s development. In this regard, CoPA has been supporting the development of novel PIC algorithms that can improve and accelerate simulations, with an eye toward their implementation in Cabana. In addition to exercising Cabana, this also represents a unique opportunity to rethink and develop new algorithms at scale, potentially shortening the often lengthy gap between algorithmic innovation and subsequent scientific discovery. These algorithms may connect with the PIC codes in XGC and HACC, described in the next sections.
The algorithms being developed build mainly on two recent advances in PIC methodology. The first is the fully implicit PIC algorithm first introduced in Chen et al. (2011) and subsequently expanded upon in Chen et al. (2020); Chen and Chacon (2015); Stanier et al. (2019); Chen et al. (2012), among others. Compared to standard PIC algorithms, these implicit methods enforce exact discrete conservation laws, which are especially important for the long-term accuracy of simulations. Their ability to stably step over unimportant but often stiff physical timescales promises tremendous computational speedups in certain contexts. The second recent advance being leveraged is known as “sparse PIC” Ricketson and Cerfon (2016, 2018), in which the sparse grid combination technique (SGCT) Griebel et al. (1992) is used to reduce particle sampling noise in grid quantities. This is achieved by projecting particle data onto various different component grids, each of which is well resolved in at most one coordinate direction. A clever linear combination of these grid quantities recovers near-optimal resolution, but with reduced noise due to the increased size of the cells, and consequently more particles per cell, in the component grids. Quantitatively, use of the SGCT changes the scaling of grid sampling errors from O((N h^d)^(-1/2)) to O((N h)^(-1/2)), up to logarithmic factors, Ricketson and Cerfon (2016). Here, N is the total number of particles in the simulation, h the spatial cell size, and d the spatial dimension of the problem. For comparable sampling errors, sparse PIC thus reduces the required particle number by a factor of order h^-(d-1).
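To make the SGCT concrete, the sketch below enumerates the 2-D component grids and their combination coefficients at a given refinement level n. The function names and the cell-count bookkeeping are illustrative only, under the standard combination-technique conventions.

```python
def combination_grids(n):
    """Component grids of the 2-D sparse grid combination technique at
    level n. Grid (l1, l2) has 2**l1 x 2**l2 cells and is well resolved
    in at most one direction. Coefficient +1 for grids with l1 + l2 == n,
    and -1 for grids with l1 + l2 == n - 1."""
    grids = [(l1, n - l1, +1) for l1 in range(n + 1)]
    grids += [(l1, n - 1 - l1, -1) for l1 in range(n)]
    return grids

def total_cells(n):
    """Total cells summed over all component grids: O(n * 2**n),
    versus 4**n for the full tensor-product grid at the same level."""
    return sum(2 ** l1 * 2 ** l2 for l1, l2, _ in combination_grids(n))
```

Because each component grid has only O(2^n) cells rather than 4^n, each cell collects far more particles, which is the source of the noise reduction described above.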
Algorithm development efforts within CoPA have been threefold. First, a new asymptotic-preserving time integrator for the particle push (PIC item 4 in Figure 1) component of implicit PIC schemes has been developed. This new scheme allows implicit PIC methods to step over the gyroperiod, which often represents a stiff timescale in strongly magnetized plasmas (e.g., in magnetic confinement fusion devices). Second, implicit PIC schemes have been generalized to handle a broader class of electromagnetic field solvers (PIC item 2 in Figure 1). In particular, we show that exact energy conservation can be implemented using a spectral field solve, while previous studies have mostly focused on finite-difference schemes. Spectral solvers have much higher accuracy than finite-difference schemes for the same number of degrees of freedom, which can be a particular advantage in simulating electromagnetic waves (e.g., in laser-plasma interaction applications). Third, we show that the implicit and sparse PIC methods can be used in tandem, thereby achieving the stability and conservation properties of the former along with the noise-reduction properties of the latter.
Use of sparse PIC primarily impacts the particle deposition and force gather operations (PIC items 1 and 3 in Figure 1). Each of these advances is described in turn below.
In magnetized plasmas, charged particles gyrate around magnetic field lines with frequency ω = qB/m, where q is the magnitude of the particle’s charge, B the magnetic field strength, and m the particle mass. In many scientific applications, the timescale of this gyration (i.e., 1/ω) can be orders of magnitude smaller than the physical timescales of interest. Consequently, it is often too expensive for standard PIC to simulate those applications. Numerous works have circumvented this difficulty by using gyrokinetic models, in which the gyration scale is analytically removed from the governing equations by asymptotic expansion. However, this approach can become difficult, or break down, if the approximation is only valid in a portion of the problem domain; scientifically relevant examples include the edge of tokamak reactors, magnetic reconnection Lau and Finn (1990), and field-reversed configurations Tuszewski (1988). A more effective approach is to derive a time integrator that recovers the gyrokinetic limit when ωΔt ≫ 1 while recovering the exact dynamics in the limit ωΔt → 0.
Our work derives just such a scheme. We present a summary here; the interested reader should refer to Ricketson and Chacón (2020) for more detail. Note that this work represents an inter-project collaboration with the HBPS SciDAC, and may be viewed as part of XGC’s efforts toward a full-orbit capability.
Our new algorithm builds on prior efforts Brackbill and Forslund (1985); Genoni et al. (2010); Parker and Birdsall (1991); Vu and Brackbill (1995), which noticed that standard integrators such as Boris and Crank-Nicolson fail to capture the proper limit when ωΔt ≫ 1. In particular, the Boris algorithm Parker and Birdsall (1991) captures magnetic gradient drift motion but artificially enlarges the radius of gyration, while Crank-Nicolson captures the gyroradius but misses the magnetic drift. Some prior efforts cited above derive schemes that capture both, but at the cost of large errors in particle energy. These energy errors are particularly problematic in the long-time simulations enabled by exascale resources.
Our new scheme is the first to capture magnetic drift motion and the correct gyroradius while conserving energy exactly for arbitrary values of ωΔt. The scheme is built on Crank-Nicolson, but adds a fictitious force that produces the magnetic drift for large timesteps and tends to zero for small timesteps (thus preserving the scheme’s convergence to the exact dynamics). Energy conservation is preserved by ensuring that this fictitious force is orthogonal to the particle velocity, thereby guaranteeing that it can do no work (i.e., mimicking the Lorentz force). The scheme shows promising results on various test problems, and implementation in Cabana is expected to help guide the development of effective preconditioning strategies for the necessary implicit solves.
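The energy-conservation mechanism of the underlying Crank-Nicolson step can be seen in a few lines: because the time-centered magnetic force is orthogonal to the averaged velocity, the update is an exact rotation (a Cayley transform), so kinetic energy is preserved for any timestep. The sketch below shows only that core, not the fictitious-force scheme of Ricketson and Chacón (2020); names and normalizations are illustrative.

```python
import numpy as np

def crank_nicolson_push(v, B, q_over_m, dt):
    """One Crank-Nicolson step for dv/dt = (q/m) v x B.
    With t = (dt/2)(q/m)B, the update solves
        (I - C) v_new = (I + C) v_old,
    where C v = v x t. Since C is antisymmetric, (I - C)^-1 (I + C)
    is orthogonal: |v| is conserved exactly, for any dt."""
    t = 0.5 * dt * q_over_m * np.asarray(B, dtype=float)
    # Matrix form of the cross product v x t.
    C = np.array([[0.0, t[2], -t[1]],
                  [-t[2], 0.0, t[0]],
                  [t[1], -t[0], 0.0]])
    I = np.eye(3)
    return np.linalg.solve(I - C, (I + C) @ v)
```

Even with ωΔt = 10 (stepping far over the gyroperiod), the particle speed, and hence its kinetic energy, is unchanged to machine precision; what plain Crank-Nicolson misses at such timesteps is the magnetic drift, which the fictitious force restores.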
Our second thrust concerns the field solvers in implicit PIC schemes (these are examples of long-range solvers; see the section on Cabana above and SWFFT below for additional discussion). In the references above and all other extant implicit PIC work, these solvers are assumed to be based on second-order finite-difference approximations of the underlying partial differential equations. However, there are significant advantages to the use of spectral solvers when treating electromagnetic waves Vay et al. (2013, 2018).
With these considerations in mind, we have generalized the implicit PIC method to function with spectral solvers without sacrificing the important conservation properties the scheme enjoys. This is done by adapting the mathematical proofs of energy conservation to accommodate spectral solvers. The key necessary features are two integration-by-parts identities that must be satisfied by the solver. A spectral solver can be made to satisfy these identities if a binomial filter is applied in a preprocessing step. Such filters are commonly used in PIC schemes to mitigate particle noise, so this requirement is not considered onerous.
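For reference, a binomial filter of the kind mentioned above is simply a (1/4, 1/2, 1/4) weighted average of neighboring grid values. The periodic 1-D sketch below is illustrative only; the implicit-PIC requirement concerns where in the scheme the filter is applied, not its form.

```python
import numpy as np

def binomial_filter(rho):
    """One pass of the (1/4, 1/2, 1/4) binomial smoothing on a periodic
    1-D grid array. It damps high-k (noisy) modes -- the Nyquist mode is
    removed entirely -- while conserving the total deposited charge."""
    return 0.25 * np.roll(rho, 1) + 0.5 * rho + 0.25 * np.roll(rho, -1)
```

In Fourier space this multiplies mode k by cos²(kΔx/2), which is why repeated passes progressively suppress the shortest-wavelength sampling noise without affecting the mean.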
The third prong consists of combining sparse grid PIC schemes with implicit PIC. As above, the key here is retaining energy conservation. Because potential energy is computed on the grid and the SGCT introduces a multitude of distinct grids with different resolutions, care is needed even in the definition of a single potential energy quantity. Having taken this care, we have shown that it is indeed possible to conserve energy exactly in the sparse context. The resulting method also leverages the advances above by being compatible with spectral field solvers.
Initial tests have confirmed the theoretically predicted conservation properties, as depicted in Figure 12, which shows energy conservation up to numerical roundoff for the new implicit scheme applied to the diocotron instability Davidson (2001). In addition, Figure 13 illustrates the ability of the sparse grid scheme to dramatically reduce particle sampling noise in the solution. Future implementation of this method in Cabana poses unique challenges, as its structure is rather different from that of a typical PIC method, owing to the various grids required and the need to perform not only particle-grid and grid-particle interpolations, but also grid-grid interpolations for postprocessing. As a result, it offers a particularly valuable test case for the flexibility of the software infrastructure.
5 Application Partners
To enable a deep window into how particle applications use the computational motifs, the CoPA codesign center established partnerships with several ECP application development projects. Direct application engagement through deep dives and hackathons has resulted in XGC’s adoption of Cabana/Kokkos, LAMMPS-SNAP GPU algorithm optimization, and the open-source HACC/SWFFT code. Kernels can easily be extracted and explored through proxy apps, leading to performance improvements. CoPA’s interdependencies with ECP ST library projects, which provide common software capabilities, have also led to improvements and additions. Details of these engagements and their impact on exascale readiness are presented below.
6 XGC and WDMApp
In this section, we show how the Cabana library has been utilized to enable XGC, the fusion WDMApp PIC code, to be portable while preserving scalability and performance. In the anatomy of a time step shown in Figure 1, XGC fits into the PIC submotif. Particle movement in the particle resorting step has been minimized to avoid MPI communications; instead, XGC relies heavily on the particle remapping step, performed in each particle cell independently, which is an embarrassingly parallel operation.
The ECP Whole Device Model Application (WDMApp) project aims to develop a high-fidelity model of magnetically confined fusion plasma that can enable better understanding and prediction of ITER and other future fusion devices, validated on present tokamak (and stellarator) experiments. In particular, it aims to demonstrate and assess core-edge coupled gyrokinetic physics on sufficiently resolved timescales to study formation of the pedestal, a physical phenomenon essential to ITER’s success but whose mechanisms are still not well understood. The WDMApp project couples a less expensive code that models the tokamak’s core (the GENE continuum code or the GEM PIC code, which solve for the perturbed parts only) with a more expensive code, XGC (which obtains the total 5D solution), capable of modeling the edge of the device plasma, where the computational demands are highest. Performance of the coupled WDMApp code is expected to be predominantly determined by XGC, so performance optimization of XGC is essential to meet exascale demands.
XGC is a Fortran particle-in-cell code used to simulate plasma turbulence in magnetically confined fusion devices Ku et al. (2018). It is gyrokinetic, a common plasma modeling approach in which velocity is reduced to two dimensions (parallel and perpendicular to the magnetic field), reducing total model complexity from 6D to 5D. Markers containing information about the ion and electron particle distribution functions are distributed in this phase space. In a given time step, particle positions are used to map charge density onto an unstructured grid. The charge density is then used to solve for the global electric field, which in turn is used to update (“push”) particle positions for the next time step. Particle velocities are also mapped onto an unstructured grid to evaluate velocity-space Coulomb scattering in accordance with the Fokker-Planck operator.
Electron position must be updated with a much smaller time step than ions due to their high relative velocity. They are typically pushed 60 times for every ion step (and field solve), and as a result the electron push is by far the most expensive kernel in XGC.
In the past, several versions of the electron push kernel were developed that maximized performance on specific hardware. XGC maintained a CUDA Fortran version of the electron push kernel optimized for the previous Oak Ridge supercomputer Titan; an OpenMP version that vectorizes and performs well on CPUs; and an unvectorized OpenMP version for use as a cleaner reference.
In addition to the basic time step cycle described above, XGC also has source terms, including the Fokker-Planck collision solver on each grid node briefly mentioned above, which is the second most computationally expensive kernel after the electron push. This kernel offloads work to the GPU with OpenACC if available, or uses OpenMP on the CPU. Utilizing multiple offloading programming models in the same simulation poses additional challenges when adapting the code to new platforms and compilers; for example, on Summit only one available compiler (PGI) supported both OpenACC and CUDA Fortran.
To prepare for exascale architectures, XGC is in the process of significant restructuring. Instead of multiple code bases and offloading programming models, it is being rewritten to use Kokkos and Cabana and to strive toward a single maintainable, flexible codebase that performs well on all relevant architectures Scheinberg et al. (2019).
6.1 Kokkos/Cabana Implementation
Since XGC is written in Fortran, utilizing Kokkos and Cabana posed the additional challenge of Fortran-C++ interfacing. Our initial goal was to use these libraries without significant changes to the main code or to the individual kernels to be offloaded. In our initial implementation, the XGC main code called a C++ subroutine that wrapped a Kokkos parallel_for, whose body looped over particles and called the necessary Fortran kernel. Kokkos was therefore restricted to a thin interface that managed kernel launching. The Fortran kernel itself had to be modified with preprocessor macros directing the compiler to compile the code for CPU or GPU as specified; under the hood, CUDA Fortran was still used for GPU offloading.
There were several downsides to this approach. First, it restricted the Kokkos and Cabana features available for use, often necessitating custom features for memory management and host-device communication instead. Second, reliance on Fortran modules often made proper encapsulation difficult. Third, it was unclear whether the approach could be easily extended to platforms with AMD or Intel GPUs, where no robust equivalent to CUDA Fortran would be available. For these reasons, we instead opted to convert XGC to C++, beginning with the major kernels that require offloading and gradually converting the remaining code. With this new approach, we are better able to utilize the strengths of Kokkos and Cabana by relying on them for memory management, host-device communication, etc.
Due to the piecemeal approach to the code conversion, many data structures on the CPU are still allocated on the Fortran side. At each time step, the particles are rearranged into an array of structures and sent to the GPU as a Cabana AoSoA object. Other data residing in Fortran arrays are wrapped in unmanaged Kokkos Views and can then be copied to Views on the GPU. This method was found to be the least disruptive means of interfacing as the code is gradually converted to C++.
Within kernels that loop over particles, an inner loop is also present, with a range of 1 on GPU and a precompiled length (32 by default) on CPU. These inner loops, which traverse the particles within a single struct of arrays of the AoSoA, are mostly vectorized when compiled for CPU, while the outer loop (the parallel_for) loops over the structs with OpenMP. The result is a single code base that vectorizes when compiled for CPU and produces coalesced memory accesses on GPU.
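The AoSoA layout and its two-level indexing can be mimicked in a few lines. This is only a schematic of the data layout, not Cabana's API; the tile shape and vector length are illustrative.

```python
import numpy as np

def make_aosoa(positions, vector_length=32):
    """Pack a flat (n, 3) particle array into AoSoA tiles of shape
    (num_tiles, 3, vector_length). Within a tile, each coordinate
    component is contiguous across lanes -- the property that lets the
    inner (lane) loop vectorize on CPU and coalesce on GPU, while the
    outer (tile) loop is parallelized across threads or thread blocks."""
    n = positions.shape[0]
    num_tiles = -(-n // vector_length)           # ceiling division
    tiles = np.zeros((num_tiles, 3, vector_length))
    for i in range(n):
        tile, lane = divmod(i, vector_length)    # outer index, inner index
        tiles[tile, :, lane] = positions[i]
    return tiles
```

A flat particle index i thus maps to tile i // vector_length and lane i % vector_length, matching the outer/inner loop pair described above.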
6.2 Results
The most expensive operation, the electron push, is now in C++ and offloaded using Kokkos, as are the electron charge deposition and sorting. Since the ion push is independent of the electron push and is still CPU-only, it is performed asynchronously on the CPU while the electron push runs on the GPU.
A scaling study and comparisons between the different code bases were conducted on both the Summit and Cori (KNL) supercomputers. These tests used simulation parameters and sizes comparable to those used in scientific production. The new code was found to weak-scale well on both machines (Figure 14A-B). Performance on Cori was similar to that of the vectorized Fortran code, while performance on Summit was similar to that of the CUDA Fortran version (Figure 14C-D). In fact, the Kokkos/Cabana version outperformed the previous versions, though the improvement cannot be entirely attributed to the port, since minor algorithmic and structural changes occurred during the porting process.
We conclude that adopting Kokkos and Cabana enabled us to consolidate to a single codebase that is portable to diverse architectures without sacrificing performance.
6.3 Exascale Outlook
Conversion of the remaining XGC kernels into C++ is underway. In addition to offloading more XGC kernels, experimentation with more Cabana features (sorting, interGPU particle exchange, etc.) will be performed. This may prove useful particularly as more data will be resident on GPU on exascale architectures.
The collision kernel has been converted and offloaded with Kokkos, though performance results are not yet available. With the new collision kernel, OpenACC will no longer be needed and XGC will rely solely on Kokkos for GPU offloading.
7 ExaSky
The ExaSky ECP project focuses on extremescale cosmological simulations targeted at nextgeneration sky surveys that observe across multiple wavebands. The simulations follow the development and evolution of cosmic structure in an expanding universe, including not only the effects of gravity, but also gas dynamics and a number of astrophysically relevant processes such as radiative cooling, star formation, and various feedback mechanisms, several of which are treated via phenomenological subgrid models.
Cosmological simulations have a vast dynamic range in space, approximately six orders of magnitude, and the corresponding demands on time and density resolution are very severe. ExaSky uses two codes, HACC (Hardware/Hybrid Accelerated Cosmology Code) Habib et al. (2016) and Nyx Almgren et al. (2013); HACC uses tracer particles for both dark and ordinary matter (‘baryons’), whereas Nyx uses an Eulerian adaptive mesh refinement (AMR) method for the gas dynamics. Nyx is strongly coupled to methods being developed by the AMReX ECP codesign center, whereas HACC, being essentially a Lagrangian, particle-based code framework, has strong ties to CoPA. In Figure 1, HACC represents a combination of submotifs: PIC methods are used in a Poisson solver to calculate gravitational forces over large distances, and MD-like methods on nearby particles are used to evaluate local contributions to the gravitational force. More details about HACC’s gravitational force splitting are given in the next section.
7.1 HACC
HACC solves the 6D Vlasov-Poisson equation in an expanding universe Peebles (1980) and includes gas dynamics via a new SPH scheme, CRKSPH (Conservative Reproducing Kernel SPH), an effectively higher-order method that overcomes many of SPH’s known problems while maintaining its advantages Frontiere et al. (2017). HACC’s gravity solver splits the gravitational force computation into two parts: a long-range Poisson solver based on a high-order hybrid spectral method, and matched short-range solvers that are designed to be separately optimized for different architectures (direct particle-particle, tree, fast multipole). HACC’s long-range solver is essentially a PIC method that actively leverages a large, distributed FFT to minimize indirection, reduce particle noise, isotropize the force kernel, and compactly implement higher-order methods for particle deposition and force computation. Timestepping is performed via an adaptive split-operator, symplectic method that uses subcycling for increased temporal resolution of the dynamics associated with the short-range force. HACC’s Poisson solver is unusual in that it uses error compensation in the Fourier domain to effectively increase the order of the solver even though the particle-grid interaction is kept only to first nontrivial order (i.e., CIC deposition and interpolation). Details are given in Habib et al. (2016).
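The spectral core of such a long-range solve can be illustrated in one dimension: transform the deposited density to Fourier space, divide by k², and transform back. This is only a schematic of the approach (HACC's solver is 3-D, higher order, and error-compensated in the Fourier domain); all names and normalizations below are illustrative.

```python
import numpy as np

def poisson_fft(rho, L):
    """Solve -phi'' = rho on a periodic 1-D grid of length L via FFT:
    phi_k = rho_k / k^2 for k != 0 (the k = 0 mean mode is dropped,
    as usual for periodic self-gravity/electrostatics)."""
    n = rho.size
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=L / n)   # angular wavenumbers
    rho_k = np.fft.fft(rho)
    phi_k = np.zeros_like(rho_k)
    nz = k != 0
    phi_k[nz] = rho_k[nz] / k[nz] ** 2
    return np.fft.ifft(phi_k).real
```

The force then follows from spectral gradients of phi; error compensation of the kind HACC uses amounts to modifying the 1/k² Green's function to cancel the leading-order deposition and interpolation errors.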
HACC has its own dedicated, distributed 3D FFT, SWFFT (see below), which has been made publicly available under CoPA. The short-range gravity and hydro solvers comprise the most computationally intensive kernels within HACC and are heavily performance-optimized on a number of architectures. These kernels are highly compact and are excellent candidates for testing and exploiting the performance portability possibilities of the Cabana framework. As a deliberate result of HACC’s design, 95% of the code does not change as one runs on different platforms (e.g., CPU or CPU+GPU systems), a feature which greatly aids in implementing different performance-portable solutions. Because the computational work is isolated into a finite number of compact kernels, a very high level of targeted performance optimization is possible, which would not be the case with generic external libraries. Additionally, the algorithms used are themselves tied to the architecture as an instance of “software codesign,” so the dependencies are not static. Finally, HACC is often used as a benchmark code on emerging architectures, where performant libraries often do not yet exist.
Future work envisaged for HACC includes a proxy app based on Cabana and a general long-range solver implemented in Cabana that uses high-order spectral gradients. In addition, as a test of performance portability, we envisage building a short-range gravity kernel in Cabana that can interface with the rest of the HACC code. In this case we can run the full code with a compact, localized modification.
7.2 CosmoTools
CosmoTools is the analysis framework associated with HACC. In situ, co-scheduled, and offline analyses associated with HACC are complex and computationally demanding in their own right, and are as important as running the underlying simulations. Because the analysis methods are diverse, performance portability and, especially, the ability to use accelerators are key issues for CosmoTools.
In the ECP context, in situ analysis is of particular importance. Some algorithms in CosmoTools can be built on primitives used by the solver, whereas others, such as neighbor finding and other clustering-based measurements, are unique to CosmoTools; implementing the latter class of methods often requires efficient graph algorithms. Work is ongoing with the ArborX team Lebrun-Grandié et al. (2019) to implement new algorithms for clustering analyses (e.g., DBSCAN, N-point correlation functions) on GPUs, with promising initial results having been obtained.
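The kernel of such clustering measurements is a fixed-radius neighbor query. A minimal cell-grid sketch (a deliberately simplified stand-in for the tree-based search ArborX provides) shows the idea:

```python
from collections import defaultdict
import math

def fixed_radius_pairs(points, r):
    """Fixed-radius neighbor search via a uniform cell grid.

    Hashing points into cells of side r limits candidate pairs to the
    27 surrounding cells instead of all N^2 combinations -- the same
    pruning idea behind tree/grid-accelerated neighbor finding.
    """
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(int(math.floor(c / r)) for c in p)].append(idx)
    pairs = set()
    for (cx, cy, cz), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                        for i in members:
                            if i < j and sum((a - b) ** 2 for a, b in
                                             zip(points[i], points[j])) <= r * r:
                                pairs.add((i, j))
    return pairs

pts = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (3.0, 3.0, 3.0)]
assert fixed_radius_pairs(pts, 1.0) == {(0, 1)}
```

A friends-of-friends or DBSCAN-style analysis then amounts to a connected-components pass over the resulting pair graph.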
7.3 SWFFT
HACC’s performance and scaling requirements involve running very large 3D FFTs distributed across a potentially very large number of MPI ranks in order to reach the desired dynamic range of the long-range gravitational force via a Poisson solver. A common first approach to a distributed-memory 3D FFT is to divide the grid among ranks along one dimension at a time, creating a 1D “slab” decomposition, but this approach only works if the number of ranks does not exceed the number of grid vertices along one dimension. In order to scale to more ranks, HACC’s 3D FFT employs a 2D “pencil” decomposition, where ranks are distributed across one face of the grid at a time, relaxing the constraint on the number of ranks relative to grid size. HACC’s particle operations are sensitive to the ratio between surface area and volume on each rank, so the deposition of particle information onto a grid occurs in a 3D “brick” decomposition. HACC’s 3D FFT requires grid data to be redistributed between the 3D brick decomposition and each of three 2D pencil decompositions, where the actual 1D FFTs are performed one dimension at a time. HACC’s 3D FFT implements this process by returning to the 3D brick decomposition in between the pencil decompositions (see Figure 15), so the only communication routines are those that go back and forth between 3D and 2D decompositions. The HACC development team maintains an open-source version of this 3D FFT as the SouthWest Fast Fourier Transform (SWFFT, https://xgitlab.cels.anl.gov/hacc/SWFFT). SWFFT is implemented as an out-of-place transform on double-precision complex grid data. The low-level communication is implemented in C, the native high-level FFT interface is implemented as header-only C++, and a Fortran FFT interface is also supported. Currently there are a few minor differences between the SWFFT API and HACC’s internal 3D FFT, but the codes are functionally the same.
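The underlying computation can be sketched serially: a 3D FFT is just three sweeps of 1D FFTs, one axis at a time, with a data redistribution between sweeps. In the sketch below, NumPy's axis selection on a single shared-memory array stands in for the MPI redistribution between pencil decompositions.

```python
import numpy as np

def fft3d_by_pencils(grid):
    """3D FFT performed as three sweeps of 1D FFTs, one axis at a time.

    In SWFFT each sweep runs on a different 2D 'pencil' decomposition
    so that every rank holds complete lines along the transform axis;
    here the redistribution is implied by choosing the axis.
    """
    out = grid.astype(np.complex128)
    for axis in (2, 1, 0):  # z-pencils, then y-pencils, then x-pencils
        out = np.fft.fft(out, axis=axis)
    return out

rng = np.random.default_rng(0)
g = rng.standard_normal((8, 8, 8))
# three axis-at-a-time sweeps reproduce the full 3D transform
assert np.allclose(fft3d_by_pencils(g), np.fft.fftn(g))
```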
SWFFT’s implementation and performance characteristics are driven by HACC’s requirements, and the primary goal is excellent weak scaling in memory-limited regimes. An advantage of SWFFT’s communication pattern is that the number of rank pairs that must exchange data scales as the cube root of the total number of ranks, whereas for a communication pattern where data is exchanged directly between pencil decompositions, the number of rank pairs scales as the square root of the total number of ranks. HACC can therefore maintain a relatively small number of large messages as the number of MPI ranks becomes large, though there are several more communication stages than in a direct pencil-pencil pattern; the design emphasizes robust weak scaling over absolute minimum latency. Figure 16 shows the scaling of HACC’s Poisson solver, where each Poisson solve involves four 3D FFTs: one forward and three backward, for the force components computed with spectral gradients. The largest HACC simulation so far ran on 1,572,864 MPI ranks on LLNL’s Sequoia IBM Blue Gene/Q system, and each FFT took seconds to complete. In addition to the source and destination grid memory, SWFFT uses send and receive buffers to reorganize data into messages, but the fractional overhead of those buffers becomes smaller at larger scales.
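A back-of-the-envelope partner count, under the idealized assumption that the rank count factors cleanly into an n×n×n brick grid and a √R×√R pencil grid (boundary effects ignored), illustrates the cube-root versus square-root behavior:

```python
def partners_per_rank(total_ranks):
    """Idealized per-rank communication partners for the two FFT patterns.

    Assumes total_ranks factors both as a 3D brick grid (n x n x n) and
    as a 2D pencil grid (sqrt(R) x sqrt(R)), e.g. a perfect sixth power.
    """
    n = round(total_ranks ** (1.0 / 3.0))  # bricks per edge
    brick_to_pencil = n                    # a pencil crosses n bricks: ~R^(1/3)
    p = round(total_ranks ** 0.5)          # pencils per edge of the 2D grid
    pencil_to_pencil = p                   # a pencil overlaps a full row: ~R^(1/2)
    return brick_to_pencil, pencil_to_pencil

# at a million ranks: ~100 partners per brick-pencil exchange
# versus ~1000 for a direct pencil-pencil exchange
assert partners_per_rank(10**6) == (100, 1000)
assert partners_per_rank(64) == (4, 8)
```

The smaller partner count per exchange is what keeps messages few and large, at the price of the extra brick-pencil redistribution stages described above.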
The standalone SWFFT code was developed to serve as a mini-app representing the dominant communication workload in HACC, and also as a potential tool for use in solvers in other applications. Through CoPA and ExaSky, an experimental version of Nyx implemented a gravitational solver based on SWFFT. For Nyx, a branch of SWFFT was created with additional flexibility in mapping 3D subvolumes to MPI ranks; that branch will be reintegrated into the main branch and used to support a new memory-balancing mode in HACC. We are also exploring integrating SWFFT as a backend FFT for solvers written in CoPA’s Cabana framework. SWFFT has already demonstrated scaling to very large grids and MPI rank counts, and HACC plans to run even larger grids on exascale systems. By maintaining the standalone open-source SWFFT and participating in ECP, we hope to help other applications and science domains that could benefit from extremely large FFTs on exascale systems.
8 Improving GPU performance of a machine-learned potential for MD
The EXAALT project within ECP seeks to extend the accuracy, length, and time scales of materials science simulations to model plasma-facing metals used in future fusion reactors like ITER. One method to extend time scales is to run up to millions of small molecular dynamics (MD) simulations (1K to 1M atoms each) and use the parallel replica dynamics (PRD) algorithm as encoded in the ParSplice program Perez et al. (2016) to stitch them together into statistically accurate long-timescale trajectories. To accurately model defects in metal surfaces bombarded with plasma ions, each replica uses the SNAP machine-learned (ML) interatomic potential Thompson et al. (2015), available in the LAMMPS MD code Plimpton (1995) (https://lammps.sandia.gov). The ability to run the full-scale model on an exascale machine for long timescales thus depends on the performance of SNAP on one or a few GPUs when simulating a small system (one replica in the PRD ensemble).
A Kokkos version of the SNAP potential was originally implemented in the ExaMiniMD proxy app (https://github.com/ECPcopa/ExaMiniMD) and then ported to LAMMPS. At the time ECP began, the fraction-of-peak performance for SNAP for this baseline version was very low on GPUs. To address this concern, a collaboration between EXAALT, CoPA, NERSC/NESAP, Cray, and NVIDIA was formed. A new proxy-app version of the SNAP model, called TestSNAP, was created (https://github.com/FitSNAP/TestSNAP).
TestSNAP is a serial code derived from the parallel CPU version of SNAP in LAMMPS. It is a good proxy in terms of memory and computational costs. It computes step 3 of the MD submotif in Figure 1, which dominates all other parts of the timestep for a simulation using SNAP. Importantly, isolating the SNAP algorithm in the proxy code made it possible to rapidly experiment with different formulations of the high-level algorithm as well as low-level optimizations such as data-structure alterations or loop reordering. The proxy also includes a correctness check, which was very helpful to ensure changes did not alter the numerical results. The team used the proxy to explore a variety of GPU strategies, first using the OpenACC and CUDA programming models, and then Kokkos. Improvements made in TestSNAP were ported back to the Kokkos version of SNAP in the production LAMMPS code. Further improvements were also implemented directly in LAMMPS Gayatri et al. (2020).
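The correctness-check pattern is worth copying in any proxy app: keep a slow reference kernel alongside the optimized one and gate every change on agreement within a tolerance. A hypothetical, much-simplified analogue (1D inverse-square pair forces rather than the actual SNAP equations):

```python
import math

def force_reference(positions):
    # reference kernel: visit every (i, j) pair with straightforward loops
    n = len(positions)
    f = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                d = positions[j] - positions[i]
                f[i] += math.copysign(1.0 / (d * d), d)
    return f

def force_optimized(positions):
    # "optimized" kernel: exploit Newton's third law, half the pair visits
    n = len(positions)
    f = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            d = positions[j] - positions[i]
            fij = math.copysign(1.0 / (d * d), d)
            f[i] += fij
            f[j] -= fij
    return f

def check(ref, opt, tol=1e-12):
    # the regression gate: refuse any optimization whose results drift
    return all(abs(a - b) <= tol * max(1.0, abs(a)) for a, b in zip(ref, opt))

x = [0.0, 1.0, 2.5, 4.0]
assert check(force_reference(x), force_optimized(x))
```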
The following optimizations improved both CPU and GPU performance of the SNAP potential in LAMMPS:

Altered the structure of the SNAP equations, using an adjoint refactorization to avoid duplicate computations across terms and to reorder the summations. This enabled a dramatic reduction in the flop count, as well as a reduced memory footprint and memory access count.

Flattened jagged multidimensional arrays, which further reduced memory use.

Symmetrized data layouts of certain matrices, which reduced memory overhead and use of thread atomics on GPUs.
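The array-flattening item above can be illustrated with a CSR-style offsets-plus-values layout (a NumPy sketch with made-up neighbor lists, not the actual LAMMPS data structures): per-atom neighbor lists of varying length become one contiguous buffer plus an offsets array.

```python
import numpy as np

# jagged per-atom neighbor lists of varying length (made-up data)
neighbors = [[4, 7], [1], [0, 2, 5, 9], [3, 8, 6]]

# flatten into one contiguous buffer plus per-row offsets (CSR-style);
# this removes per-row allocations and pointer chasing
flat = np.concatenate([np.asarray(row, dtype=np.int32) for row in neighbors])
counts = np.array([len(row) for row in neighbors], dtype=np.int32)
offsets = np.zeros(len(neighbors) + 1, dtype=np.int32)
offsets[1:] = np.cumsum(counts)

# row i of the jagged structure is now the slice flat[offsets[i]:offsets[i+1]]
assert list(flat[offsets[2]:offsets[3]]) == [0, 2, 5, 9]
assert offsets.tolist() == [0, 2, 3, 7, 10]
```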
These optimizations were GPU specific:

Broke up one large kernel into multiple kernels. By itself this reduced register pressure but greatly increased memory use, since intermediate quantities had to be stored between kernel launches; combined with the other optimizations, the net effect was a large reduction in both memory use and register pressure.

Reversed the order of per-atom and per-neighbor loops.

Optimized the memory data layout for the chosen access patterns (e.g., column-major vs. row-major).

Changed the memory data layout of an array between kernels via transpose operations.

Refactored loop indices and data structures to use complex numbers and multidimensional arrays instead of arrays of structs.

Refactored some of the kernels to avoid thread atomics and use of global memory.

Judiciously used Kokkos hierarchical parallelism and GPU shared memory.

Fused a few selected kernels, which helped eliminate intermediate data structures and reduced memory use.

Added a new memory data layout inspired by Cabana, which enforced perfect coalescing and load balancing in one of the kernels.

Precomputed certain parameters.
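Several of the layout items above come down to choosing between array-of-structs and struct-of-arrays storage, the transformation at the heart of Cabana-inspired layouts. A minimal NumPy analogue (not Cabana's actual API):

```python
import numpy as np

n = 1000
# array-of-structs: one (x, y, z) record per atom; a GPU warp reading all
# x-coordinates touches every third element -> strided, poorly coalesced loads
aos = np.arange(3 * n, dtype=np.float64).reshape(n, 3)

# struct-of-arrays: all x's contiguous, then y's, then z's -> unit-stride loads
soa = np.ascontiguousarray(aos.T)

# same data, different layout: the x-coordinate of atom 5 in both
assert aos[5, 0] == soa[0, 5]
# in SoA, consecutive atoms' x values are adjacent in memory (8-byte stride)
assert soa[0].strides == (8,)
```

On a GPU, those unit-stride reads are what allow a warp's loads to coalesce into a few wide memory transactions.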
Figure 17 shows the effect of these optimizations on SNAP potential performance over time for the EXAALT benchmark problem running on a single NVIDIA V100 GPU on OLCF Summit. The original Kokkos version of SNAP in LAMMPS ran at 5.09 Katom-steps/s per GPU (Katom-steps = thousands of atom-steps). With the improvements listed above, performance is now 110.7 Katom-steps/s per GPU, a 22× speedup.
The algorithmic improvements have also been implemented in the CPU (non-Kokkos) version of SNAP in LAMMPS, with the exception of the third item in the CPU/GPU list. Running on 36 MPI ranks of a dual-socket Intel Broadwell CPU, these changes increased the performance of the CPU version of SNAP by a factor of 3 for the same benchmark.
9 Summary
Library efforts, algorithm development, and interactions with particle applications represented within CoPA all contribute to our codesign process and strategy. The anatomy of a timestep for particle applications (Figure 1) provides a window into the scope of the CoPA Codesign Center. The computational kernels requiring optimization for exascale computing are associated with the nature of particle interactions. Applications with short-ranged, long-ranged, and particle-grid interactions are addressed within the Cabana library, while applications requiring a quantum mechanical description of interactions are addressed within the PROGRESS/BML libraries. Inclusion of expertise and application partners representing all the submotifs has allowed us to understand and create these libraries, as well as proxy apps of interest for short-range MD, long-range MD, PIC, and QMD applications. Success is measured by the use of these products within both ECP and non-ECP projects. We close by highlighting some lessons learned, followed by impacts within ECP and the broader community.
Important lessons learned include:

Many times over, we have seen the benefits of proxy apps for rapidly prototyping ideas and accelerating the performance optimization process.

Improving performance on GPUs requires multiple approaches including optimizing data layout, coalescing memory accesses, increasing arithmetic intensity, and using profiling to guide optimizations. Gains can come from both improving the algorithm as well as improving the implementation.

Codesign teams of domain scientists, computational scientists, and expert programmers in hardware-specific languages and programming models, working together, proved beneficial to design and optimization efforts.

Focused hackathon sessions proved highly productive for small teams over short timeframes, collaborating on algorithms, implementations, and benchmarking.
Impacts as successes with our application partners across all submotifs include:

WDMApp/XGC is transitioning from Fortran to C++ using Kokkos/Cabana, replacing much of their code with Cabana kernels. The result will be a single flexible codebase with performance portability across relevant architectures.

EXAALT/LAMMPS, as part of a codesign team, was able to improve the performance of their SNAP ML model by 22×.

Integration of the PROGRESS/BML QMD capability, the LATTE electronic structure code, and the NAMD MD code has enabled hybrid QM/MM simulations of proteins. This capability will extend the impact of our ECP work to biomedical research, including studies of SARS-CoV-2 proteins.
Library efforts have influenced improvements in a number of the ECP ST libraries, such as Kokkos, heFFTe, ArborX, and others. CoPA’s library codesign capability allows for integration into existing particle applications, as well as creation of new applications as we continue on the road to exascale.
This work was performed as part of the Codesign Center for Particle Applications, supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. DOE Office of Science and the NNSA. Assigned: Los Alamos Unclassified Report LA-UR-20-26599.
This work was performed at Argonne National Laboratory under the U.S. Department of Energy contract DE-AC02-06CH11357, Lawrence Livermore National Laboratory under U.S. Government Contract DE-AC52-07NA27344, Oak Ridge National Laboratory under U.S. Government Contract DE-AC05-00OR22725, Princeton Plasma Physics Laboratory under U.S. Department of Energy contract DE-AC02-06CH11357 with Princeton University, Los Alamos National Laboratory, and at Sandia National Laboratories.
Los Alamos National Laboratory is operated by Triad National Security, LLC, for the National Nuclear Security Administration of the U.S. Department of Energy (Contract No. 89233218CNA000001).
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract number DE-NA0003525.
This research used resources of the Oak Ridge Leadership Computing Facility (OLCF), the Argonne Leadership Computing Facility (ALCF), and the National Energy Research Scientific Computing Center (NERSC), supported by DOE under the contract numbers DE-AC05-00OR22725, DE-AC02-06CH11357, and DE-AC02-05CH11231, respectively.
This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.
References
 Adedoyin et al. (2019) Adedoyin AA, Negre CFA, Bock N, Mohd-Yusof J, Osei-Kuffuor D, Fattebert JL, Wall ME, Niklasson AMN and Mniszewski SM (2019) Performance optimizations of recursive electronic structure solvers targeting multicore architectures. LA-UR-20-26665.
 Alexander et al. (2020) Alexander F, Almgren A, Bell J, Bhattacharjee A, Chen J, Colella P, Daniel D, DeSlippe J, Diachin L, Draeger E, Dubey A, Dunning T, Evans T, Foster I, Francois M, Germann T, Gordon M, Habib S, Halappanavar M, Hamilton S, Hart W, (Henry) Huang Z, Hungerford A, Kasen D, Kent PRC, Kolev T, Kothe DB, Kronfeld A, Luo Y, Mackenzie P, McCallen D, Messer B, Mniszewski S, Oehmen C, Perazzo A, Perez D, Richards D, Rider WJ, Rieben R, Roche K, Siegel A, Sprague M, Steefel C, Stevens R, Syamlal M, Taylor M, Turner J, Vay JL, Voter AF, Windus TL and Yelick K (2020) Exascale applications: skin in the game. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 378(2166): 20190056. DOI:10.1098/rsta.2019.0056.
 Almgren et al. (2013) Almgren AS, Bell JB, Lijewski MJ, Lukić Z and Van Andel E (2013) Nyx: A Massively Parallel AMR Code for Computational Cosmology. Astrophys. J. 765(1): 39. DOI:10.1088/0004637X/765/1/39.
 Ayala et al. (2019) Ayala A, Tomov S, Luo X, Shaiek H, Haidar A, Bosilca G and Dongarra J (2019) Impacts of multi-GPU MPI collective communications on large FFT computation. In: Workshop on Exascale MPI (ExaMPI) at SC19. Denver, CO. DOI:10.1109/ExaMPI49596.2019.00007.
 Behler and Parrinello (2007) Behler J and Parrinello M (2007) Generalized Neural-Network Representation of High-Dimensional Potential-Energy Surfaces. Physical Review Letters 98(14): 146401. DOI:10.1103/PhysRevLett.98.146401.
 Bock et al. (2008) Bock N, Cawkwell MJ, Coe JD, Krishnapriyan A, Kroonblawd MP, Lang A, Liu C, Saez EM, Mniszewski SM, Negre CFA, Niklasson AMN, Sanville E, Wood MA and Yang P (2008) LATTE. URL https://github.com/lanl/LATTE.
 Bock et al. (2018) Bock N, Negre CFA, Mniszewski SM, Mohd-Yusof J, Aradi B, Fattebert JL, Osei-Kuffuor D, Germann TC and Niklasson AMN (2018) The basic matrix library (BML) for quantum chemistry. J. Supercomput. 74(11): 6201–6219.
 Bowers et al. (2009) Bowers KJ, Albright BJ, Yin L, Daughton W, Roytershteyn V, Bergen B and Kwan T (2009) Advances in petascale kinetic plasma simulation with VPIC and Roadrunner. In: Journal of Physics: Conference Series, volume 180. IOP Publishing, p. 012055.
 Brackbill and Forslund (1985) Brackbill J and Forslund D (1985) Simulation of low-frequency, electromagnetic phenomena in plasmas. In: Multiple time scales. Elsevier, pp. 271–310.
 Chen and Chacon (2015) Chen G and Chacon L (2015) A multi-dimensional, energy- and charge-conserving, nonlinearly implicit, electromagnetic Vlasov–Darwin particle-in-cell algorithm. Computer Physics Communications 197: 73–87.
 Chen et al. (2011) Chen G, Chacón L and Barnes DC (2011) An energy- and charge-conserving, implicit, electrostatic particle-in-cell algorithm. Journal of Computational Physics 230(18): 7018–7036.
 Chen et al. (2012) Chen G, Chacón L and Barnes DC (2012) An efficient mixed-precision, hybrid CPU–GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm. Journal of Computational Physics 231(16): 5374–5388.
 Chen et al. (2020) Chen G, Chacón L, Yin L, Albright BJ, Stark DJ and Bird RF (2020) A semi-implicit, energy- and charge-conserving particle-in-cell algorithm for the relativistic Vlasov–Maxwell equations. Journal of Computational Physics 407: 109228.
 Cui and Elstner (2014) Cui Q and Elstner M (2014) Density functional tight binding: values of semi-empirical methods in an ab initio era. Phys Chem Chem Phys 16(28): 14368–14377. DOI:10.1039/c4cp00908h.
 Davidson (2001) Davidson RC (2001) Physics of nonneutral plasmas. Imperial College Press London.
 Dawson and Nakajima (2018) Dawson W and Nakajima T (2018) Massively parallel sparse matrix function calculations with NTPoly. Computer Physics Communications 225: 154–165. DOI:10.1016/j.cpc.2017.12.010.
 Desai et al. (2020) Desai S, Reeve ST and Belak JF (2020) Implementing a neural network interatomic model with performance portability for emerging exascale architectures. arXiv:2002.00054 [cond-mat, physics:physics].
 Dongarra et al. (2014) Dongarra J, Gates M, Haidar A, Kurzak J, Luszczek P, Tomov S and Yamazaki I (2014) Accelerating numerical dense linear algebra calculations with GPUs. Numerical Computations with GPUs : 1–26.
 Edwards et al. (2014) Edwards HC, Trott CR and Sunderland D (2014) Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing 74(12): 3202–3216. DOI:10.1016/j.jpdc.2014.07.003.
 Essmann et al. (1995) Essmann U, Perera L, Berkowitz ML, Darden T, Lee H and Pedersen LG (1995) A smooth particle mesh Ewald method. The Journal of Chemical Physics 103(19): 8577–8593. DOI:10.1063/1.470117.
 Falgout and Yang (2002) Falgout RD and Yang UM (2002) hypre: A Library of High Performance Preconditioners. In: Sloot PMA, Hoekstra AG, Tan CJK and Dongarra JJ (eds.) Computational Science — ICCS 2002, Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. ISBN 978-3-540-47789-1, pp. 632–641. DOI:10.1007/3-540-47789-6_66.
 Fattebert et al. (2016) Fattebert JL, Osei-Kuffuor D, Draeger EW, Ogitsu T and Krauss WD (2016) Modeling dilute solutions using first-principles molecular dynamics: Computing more than a million atoms with over a million cores. In: SC ’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 12–22.
 Franchetti et al. (2020) Franchetti F, Spampinato DG, Kulkarni A, Low TM, Franusich M, Popovici T, Canning A, McCorquodale P, Straalen BV and Colella P (2020) FFT and solver libraries for exascale: FFTx and spectralpack. Exascale Computing Project (ECP) Annual Meeting. Poster.
 Frontiere et al. (2017) Frontiere N, Raskin CD and Owen JM (2017) CRKSPH – a conservative reproducing kernel smoothed particle hydrodynamics scheme. Journal of Computational Physics 332: 160–209. DOI:10.1016/j.jcp.2016.12.004.
 Gayatri et al. (2020) Gayatri R, Moore S, Weinberg E, Lubbers N, Anderson S, Deslippe J, Perez D and Thompson AP (2020) Rapid exploration of optimization strategies on advanced architectures using TestSNAP and LAMMPS. arXiv e-prints: arXiv:2011.12875.
 Genoni et al. (2010) Genoni T, Clark R and Welch D (2010) A fast implicit algorithm for highly magnetized charged particle motion. The Open Plasma Physics Journal 3(1).
 Germann et al. (2013) Germann TC, McPherson AL, Belak JF and Richards DF (2013) Exascale codesign center for Materials in Extreme environments (ExMatEx) annual report  year 2. Technical report.
 Griebel et al. (1992) Griebel M, Schneider M and Zenger C (1992) A combination technique for the solution of sparse grid problems. In: Iterative Methods in Linear Algebra.
 Habib et al. (2012) Habib S, Morozov V, Finkel H, Pope A, Heitmann K, Kumaran K, Peterka T, Insley J, Daniel D, Fasel P, Frontiere N and Lukic Z (2012) The Universe at Extreme Scale: Multi-Petaflop Sky Simulation on the BG/Q. arXiv e-prints: arXiv:1211.4864.
 Habib et al. (2016) Habib S, Pope A, Finkel H, Frontiere N, Heitmann K, Daniel D, Fasel P, Morozov V, Zagaris G, Peterka T, Vishwanath V, Lukić Z, Sehrish S and Liao WK (2016) HACC: Simulating sky surveys on state-of-the-art supercomputing architectures. New Astron. 42: 49–65. DOI:10.1016/j.newast.2015.06.003.
 Hockney and Eastwood (1989) Hockney RW and Eastwood JW (1989) Computer Simulation Using Particles. 1st edition. Bristol, England; Philadelphia: CRC Press. ISBN 9780852743928.
 Hourahine et al. (2020) Hourahine B, Aradi B, Blum V, Bonafé F, Buccheri A, Camacho C, Cevallos C, Deshaye MY, Dumitrică T, Dominguez A, Ehlert S, Elstner M, van der Heide T, Hermann J, Irle S, Kranz JJ, Köhler C, Kowalczyk T, Kubař T, Lee IS, Lutsker V, Maurer RJ, Min SK, Mitchell I, Negre C, Niehaus TA, Niklasson AMN, Page AJ, Pecchia A, Penazzi G, Persson MP, Řezáč J, Sánchez CG, Sternberg M, Stöhr M, Stuckenberg F, Tkatchenko A, Yu VWZ and Frauenheim T (2020) DFTB+, a software package for efficient approximate density functional theory based atomistic simulations. J. Chem. Phys. 152(12): 124101.
 Ku et al. (2018) Ku S, Chang C, Hager R, Churchill R, Tynan G, Cziegler I, Greenwald M, Hughes J, Parker S, Adams M, D’Azevedo E and Worley P (2018) A fast low-to-high confinement mode bifurcation dynamics in the boundary-plasma gyrokinetic code XGC1. Physics of Plasmas 25: 056107. DOI:10.1063/1.5020792. URL https://doi.org/10.1063/1.5020792.
 Lau and Finn (1990) Lau YT and Finn JM (1990) Threedimensional kinematic reconnection in the presence of field nulls and closed field lines. The Astrophysical Journal 350: 672–691.
 Lebrun-Grandié et al. (2019) Lebrun-Grandié D, Prokopenko A, Turcksin B and Slattery SR (2019) ArborX: A Performance Portable Search Library. arXiv:1908.11807 [cs].
 Liu and Liu (2003) Liu GR and Liu MB (2003) Smoothed particle hydrodynamics: a meshfree particle method. World scientific.
 Marx and Hutter (2009) Marx D and Hutter J (2009) Ab Initio Molecular Dynamics: Basic Theory and Advanced Methods. Cambridge University Press.
 Melo et al. (2018) Melo MCR, Bernardi RC, Rudack T, Scheurer M, Riplinger C, Phillips JC, Maia JDC, Rocha GB, Ribeiro JV, Stone JE, Neese F, Schulten K and LutheySchulten Z (2018) NAMD goes quantum: An integrative suite for hybrid simulations. Nature Methods 15(5): 351–354. DOI:10.1038/nmeth.4638.
 Mniszewski et al. (2015) Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yusof J, Bock N, Germann TC and Niklasson AMN (2015) Efficient parallel linear scaling construction of the density matrix for Born–Oppenheimer molecular dynamics. J. Chem. Theory Comput. 11(10): 4644–4654.
 Mniszewski et al. (2019) Mniszewski SM, R P, Rubensson EH, Negre CFA, Cawkwell MJ and Niklasson AMN (2019) Linear scaling pseudo Fermi-operator expansion for fractional occupation. J. Chem. Theory Comput. 15(1): 190–300.
 Mohd-Yusof et al. (2015) Mohd-Yusof J, Sakharnykh N, Mniszewski SM, Cawkwell MJ, Bock N, Germann TC and Niklasson AMN (2015) Fast sparse matrix multiplication for QMD using parallel merge. In: GPU Technology Conference. San Jose, CA.
 Negre et al. (2016) Negre CFA, Mniszewski SM, Cawkwell MJ, Bock N, Wall ME and Niklasson AMN (2016) Recursive factorization of the inverse overlap matrix in linear scaling quantum molecular dynamics simulations. J. Chem. Theory Comput. 12(7): 3063–3073.
 Niklasson et al. (2016) Niklasson AMN, Mniszewski SM, Negre CFA, Cawkwell MJ, Swart PJ, Mohd-Yusof J, Germann TC, Wall ME, Bock N, Rubensson EH and Djidjev H (2016) Graph-based linear scaling electronic structure theory. J. Chem. Phys. 144: 234101.
 Niklasson et al. (2003) Niklasson AMN, Tymczak CJ and Challacombe M (2003) Trace resetting density matrix purification in O(N) selfconsistentfield theory. The Journal of Chemical Physics 118(19): 8611–8620. DOI:10.1063/1.1559913.
 Parker and Birdsall (1991) Parker S and Birdsall C (1991) Numerical error in electron orbits with large ωceΔt. Journal of Computational Physics 97(1): 91–102.
 Peebles (1980) Peebles PJE (1980) The large-scale structure of the universe.
 Perez et al. (2016) Perez D, Cubuk ED, Waterland A, Kaxiras E and Voter AF (2016) Long-time dynamics through parallel trajectory splicing. Journal of Chemical Theory and Computation 12(1): 18–28.
 Plimpton (1995) Plimpton S (1995) Fast parallel algorithms for short-range molecular dynamics. J. Comput. Phys. 117(1): 1–19. URL http://lammps.sandia.gov.
 Plimpton et al. (2018) Plimpton S, Kohlmeyer A, Coffman P and Blood P (2018) fftMPI, a library for performing 2d and 3d FFTs in parallel, version 00. URL https://www.osti.gov//servlets/purl/1457552.
 Pope et al. (2017) Pope A, Daniel D and Frontiere N (2017) SWFFT: A standalone version of HACC’s distributed-memory, pencil-decomposed, parallel 3D FFT. URL https://xgitlab.cels.anl.gov/hacc/SWFFT.
 Ricketson and Cerfon (2016) Ricketson LF and Cerfon AJ (2016) Sparse grid techniques for particle-in-cell schemes. Plasma Physics and Controlled Fusion 59(2): 024002.
 Ricketson and Cerfon (2018) Ricketson LF and Cerfon AJ (2018) Sparse grid particle-in-cell scheme for noise reduction in beam simulations. Proceedings of the 13th International Computational Accelerator Physics Conference.
 Ricketson and Chacón (2020) Ricketson LF and Chacón L (2020) An energy-conserving and asymptotic-preserving charged-particle orbit implicit time integrator for arbitrary electromagnetic fields. Journal of Computational Physics: 109639.
 Saad (2003) Saad Y (2003) Iterative Methods for Sparse Linear Systems. Second edition. Society for Industrial and Applied Mathematics. DOI:10.1137/1.9780898718003.
 Scheinberg et al. (2019) Scheinberg A, Chen G, Ethier S, Slattery S, Bird R, Worley P and Chang C (2019) Kokkos and Fortran in the exascale computing project plasma physics code XGC. Proceedings of SC19 Conference .
 Stanier et al. (2019) Stanier A, Chacón L and Chen G (2019) A fully implicit, conservative, nonlinear, electromagnetic hybrid particle-ion/fluid-electron algorithm. Journal of Computational Physics 376: 597–616.
 The CP2K Developers Group (2020) The CP2K Developers Group (2020) DBCSR: Distributed Block Compressed Sparse Row matrix library. URL https://github.com/cp2k/dbcsr.
 Thompson et al. (2015) Thompson A, Swiler L, Trott C, Foiles S and Tucker G (2015) Spectral neighbor analysis method for automated generation of quantumaccurate interatomic potentials. Journal of Computational Physics 285: 316 – 330.
 Tuszewski (1988) Tuszewski M (1988) Field reversed configurations. Nuclear Fusion 28(11): 2033.
 Vay et al. (2018) Vay JL, Almgren A, Bell J, Ge L, Grote D, Hogan M, Kononenko O, Lehe R, Myers A, Ng C et al. (2018) WarpX: A new exascale computing platform for beam–plasma simulations. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 909: 476–479.
 Vay et al. (2013) Vay JL, Haber I and Godfrey BB (2013) A domain decomposition method for pseudospectral electromagnetic simulations of plasmas. Journal of Computational Physics 243: 260–268.
 Vu and Brackbill (1995) Vu H and Brackbill J (1995) Accurate numerical solution of charged particle motion in a magnetic field. Journal of Computational Physics 116(2): 384–387.