I Problem Overview
In astrophysics, -body simulations are used to study the dynamics of globular clusters, galaxy evolutions, galaxy and intergalactic interactions and cosmology . Cosmological parameters are measured to the percent or even sub-percent levels. This precision requires equally accurate numerical modeling of the large scale structure (LSS) formation and might lead to cosmological -body simulations with a very large . For example, for studies of dark matter and dark energy, we need to resolve the structures and assembly histories of faint galaxies in a cosmological volume matching modern spectroscopic galaxy surveys. This requires a mass resolution of -body particles and physical scale of the simulation . Another example is the study of neutrino mass using cosmological effects. To model small but nonlinear cosmic neutrino background, we need a large to suppress the Poisson noise (e.g.,). These problems need a dynamical range of 4 to 5 orders of magnitude, corresponding to, at least, (trillion) particle -body simulations.
A direct-summation algorithm is unaffordable for a large because the computational complexity of such a pairwise force (PP force) calculation is . Many algorithms have been designed to reduce the complexity to , such as the particle-mesh (PM) or tree methods, or even , such as the fast multipole method. Fortunately, cosmological -body simulations rarely require accurate trajectories of individual particles, but rather just correct statistical distributions. In this scenario, the PM-based algorithms usually meet the accuracy requirements and have a potential to simulate with a very large .
However, achieving the largest with the PM-based algorithms is typically bound by memory capacity, instead of computing capacity. For example, TianNu , one of the world’s largest -body simulations, used the PM algorithm  to complete an cosmological -body simulation on the Tianhe-2 supercomputer. The simulation used all the memory of Tianhe-2, but only 30% of its computing resource111TianNu used only CPUs in Tianhe-2 which contribute to about 30% of the computing resource. The rest are contributed from the coprocessors (Intel Knights Corner, KNC) which associate with small, independent memory, and TianNu did not use.. The challenge to further increase the scale of -body simulations is to reduce its memory consumption.
To tackle this challenge, we designed a two-level PM (PMPM) algorithm CUBE  to obtain the lowest memory consumption possible. -body simulations typically require considerable amounts of memory to store the positions and velocities of -body particles. These six numbers occupy 24/48 bytes per particle (bpp) if they are stored as single/double precision floating numbers. By using an information-optimized algorithm, fixed-point format can be used to store the phase-space information of particles in memory. This fixed-point compression can significantly minimize the memory footprint toward only 6 bpp, thus break the memory capacity bound for scaling large -body simulations.
In this work, we implemented the CUBE algorithm in C and optimized the C code for performance and scalability. We then scaled the optimized code on the Intel Cascade Lake based supercomputer 2.0, to 512 nodes (20,480 cores) with 95% weak-scaling efficiency. This cosmological simulation, Cosmo-, evolved 4.4 trillion cold dark matter (CDM) particles from redshift222The cosmological redshift is an indicator of cosmic time in an expanding universe. The scale factor of the universe satisfies . 99 to 0 in a cubic comoving volume per side. To our best knowledge, Cosmo- is the largest completed cosmological -body simulation.
Ii Application Scenario
High-performance -body simulation codes must address many challenges – maintaining a low memory footprint given a large , minimizing the communication across computing nodes, reducing and accelerating the memory accesses to large arrays, and efficiently using high-performance libraries to speed up standard calculations such as FFTs. A series of cosmological -body codes, including “Moving-PM” , PMFAST , CUBEPM , and CUBE , are designed especially for weak-scaling on supercomputers333They introduced, for example, the 2-level particle-mesh (PMPM) algorithm, optimized on cubic decompositions  and fixed-point information optimization technique .. Our scaling test is based on our continuous development of CUBE, aiming for optimizations on all the above aspects.
CUBE solves the gravitational force using the PMPM algorithm, with optional extended-PP force modules for increased accuracy. The traditional PM-based algorithm is suboptimal in parallel computing as it requires a full resolution parallel FFT. The PMPM algorithm solves this problem by splitting the gravitational force into a short-range force and a long-range force , . The short-range force has a cutoff, i.e., , and the gravity beyond the cutoff is fully provided by the long-range force, . When , both force components contribute to . In particular, for smaller , is gradually dominated by . The force-matching optimizations under different criteria are discussed in Ref.[6, 3].
For either or , the force calculation is a convolution in real space, equivalent to multiplying the density field with component-wise force kernels in Fourier space,
is the density field obtained by interpolating particles onto a mesh (mass assignment),is the force kernel, a tilde indicates that the variable is in Fourier space, and corresponds to three spatial dimensions. The PMPM algorithm essentially computes the short-range () and long-range () components of Eq.(1) separately. The two PM algorithms are applied on two meshes with different resolutions.
For the long-range force, because it requires a lower resolution – is usually computed on a 4-times-coarser mesh (coarse-mesh).
In contrast, for , the short-range PM is computed on full-resolution fine-meshes. Because , we can divide the simulation volume into many cubic sub-volumes , called tiles, and on each is computed by a fine-mesh PM on a slightly enlarged volume . So in is done locally using local FFTs.
If the PP force is implemented, the gravity between particles within a preset number of adjacent neighboring fine-cells are computed explicitly, and the short-range force kernel is further corrected accordingly.
Thus, only the coarse-mesh FFTs require global MPI communications444Besides computing the gravitational force, MPI communications are also used to send/receive particles and to synchronize density/velocity fields.. The coarse-mesh PM scales as but with 1/4 of the full-resolution, so it typically consumes negligible CPU time. This makes the PMPM algorithm excellent for weak-scaling problems as computing time scales like , because an increase of the problem size leads to a proportional increase of when is fixed. It also dramatically reduces memory footprint requirements as the coarse grid has times fewer cells.
Computationally, the global simulation volume is first broken down into small cubical sub-volumes, each of which is assigned to one MPI process. A second level of cubical decomposition occurs inside each process, where the node sub-volumes are broken into a number of local volumes called tiles as mentioned above. Inside an MPI process, the tiles are calculated in turn, and using OpenMP parallelizes the calculations within the tile.
Iii-a Fixed-point Compression
The PMPM algorithm is intrinsically memory efficient, and the memory consumption is thus dominated by the phase-space coordinates of particles. CUBE is information-optimized and further reduce this memory footprint by using fix-point formats instead of floating formats.
This format is based on the structure of the coarse-mesh. For the -th dimension, instead of using a 4/8-byte floating number storing each particle’s global coordinate, , we store its relative position w.r.t. its parent-cell in the coarse-mesh. In particular, if is divided by coarse-cells per dimension, we further divide each coarse-cell into 256 bins, and for a particle in the -th cell, its coordinate in the -th dimension can be expressed by
where is the left boundary of the cell, and is the relative position. can be stored as an 1-byte integer. To have the correct , particles’ must be stored in memory in a cell-ordered way. A complementary number count of particle numbers in this mesh (density field) will give complete information on the particle distribution in the mesh.
The particle velocities in each dimension is decomposed as
where is the velocity field averaged on the -th cell, and is each particle’s relative velocity w.r.t. . Similarly, can be stored as an 1-byte integer, and maps to by a nonlinear function
. The nonlinearity is due to the fact that the probability distribution oflinearly predictable as a function of redshift and smoothing scale . In practice,
being the form similar to the inverse error function (the cumulative distribution function of Gaussian) minimizes the error in the velocity conversion, because it better solves the dominant slower moving particles.becomes nonlinear at lower redshifts, and can be directly measured from simulations.
For each particle, are stored by six 1-byte integers. and information are provided by the density and velocity fields on the coarse-mesh. Usually each coarse-cell contains large number (64 in Cosmo-) particles so the auxiliary density and velocity fields consumes negligible additional memory . and can either or both stored as 2-byte integers, corresponding to a more accurate position/velocity storage. More detailed discussions and results are in Ref..
Iii-B1 Precompute Expensive Functions
The fixed-point compression dramatically lowers the required memory footprint of -body simulations; however, the data compression and decompression introduce extra computing costs. The profiling results showed 21% of total elapsed time is spent on them. For example, calculating Eq.(3) is very time-consuming due to the expensive math functions in . To implement them efficiently, we precomputed their value and stored them into arrays. As the number of compression fixed-point representation is limited, 256 for 1-byte integer, the time and memory cost of the precomputing are negligible.
Iii-B2 Approximate Expensive Functions
As the fixed-point compression casts a float (32 bits) to just an 1- or 2-byte integer, the high accuracy of some expensive math functions is redundant. Therefore, we use approximate functions to replace them without losing the accuracy of final scientific results. For example, to accelerate the expensive function (which is used in Ref. to approximate velocity compression function ), we modified approximate function  and expanded its domain (Eq. 4). This fast approximate function has a maximum absolute error of 0.0038 radians, which has no effects on the accuracy of final results.
Iii-B3 Compute PP Force Kernel in Mixed-precision
The PP force kernel requires calculating Euclidean distances between each pair of particles within a certain range. As the particles positions are compressed in integers, we cannot use traditional AVX512 vectorization instructions. Therefore, we adopt the new AVX512 VNNI (Vector Neural Network Instructions) extension (shipped with Intel Cascade Lake processors and after) to compute the compression position for gravity calculation in mixed-precision. As shown in Fig. 1, the AVX512 VNNI VPDPBUSD instruction multiplies the individual bytes (8-bit) of the first source operand by the corresponding bytes (8-bit) of the second source operand, producing intermediate word (16-bit) results which are summed and accumulated in the double word (32-bit) of the destination operand. Theoretically, using the VPDPBUSD instruction can increase 3 times of operations throughput and use 3 times less memory footprint than scalar instructions.
Iii-B4 Distribute MPI Processes
The buffer communication requires the positions and velocities of particles to be transferred between adjacent grids. To reduce the overhead of buffer communications, as illustrated in Fig. 2, we use the 3D MPI rank distribution to ensure the processes with adjacent simulation volumes are nearby in the physical domain.
Iv Performance Metrics and Results
Iv-a Experimental Setup
Iv-A1 Hardware & Software
We evaluated all scaling tests on supercomputer 2.0, which has 650 computing nodes with 2PFlops peak performance. Each node has two Intel Cascade Lake 6248 processors (20 cores for each socket) and 192GB DDR4 memory. All nodes are interconnected with Intel 100Gbps Omni-Path Architecture (OPA).
We compiled CUBE with Intel C/C++ Compiler 18.0.5. The FFT and MPI are supported by Intel MKL 2018 (Update 4) and Intel MPI 2018 (Update 4), respectively.
Iv-A2 Simulation Parameters
We use 4,096 MPI processes on 512 nodes ( of the full system of 2.0) to evolve cold dark matter particles in a Gpc cosmological volume. We use the Zel’dovich Approximation  to determine the initial positions and velocities of particles at redshift and then use CUBE to evolve the particles to . The simulation models a CDM universe with Hubble parameter , CDM density , baryon density and initial conditions characterized by and 0.96. For fixed-point compression, we use the 1-byte fixed-point format to store the particle phase space.
The PM-PM force calculation is computed using a coarse mesh composed of cells per process ( in total). Each process volume is decomposed into 8 tiles resolved by ( for physical volume and for buffer region) fine cells per dimension. This geometry is equivalent to a regular PM calculation with a fine cell mesh (or fine cells per process), but utilizing 8 times less memory footprint and substantially faster due to the decreased size of the global FFT.
Iv-B Performance Metrics
We choose bytes per particle (bpp) and wall clock time as the two performance metrics for the scaling tests. We use the first metric to measure the memory consumption for each MPI process or each node by using /proc/self/statm and MPI_Allreduce; we use the second one to measure code speeds by using MPI_Wtime.
Iv-C Performance Results
Iv-C1 Memory Consumption
Tab. I lists the memory consumption for one MPI process of CUBE, which includes three parts: Particle, Coarse mesh, and Fine mesh. Particle includes three components: Physical region and Image buffer are used to store the particle phase space in the physical domain and buffered region whereas Tile buffer is the temporary buffer for each tile. Coarse mesh and Fine mesh are the arrays associated with the coarse mesh and fine mesh, respectively. In CUBE, each MPI process requires 13.75GB memory to simulate particles. The bpp of CUBE is thus .
Due to using fixed-point compression, CUBE has significantly smaller bpp than any other cosmological -body simulation codes. For example, TianNu  simulates 2.97 trillion particles on Tianhe-2. Each Tianhe-2 node holds an average of neutrino particles and CDM particles, and uses 40GB memory. Its bpp is thus , which is 14.5 times larger than CUBE’s bpp.
To study the weak-scaling of CUBE, we allow each process to evolve a volume using fine cells and gradually scale from 40 cores to 20,480 cores. Fig. 3(a) shows CUBE’s weak-scaling result both with and without the PP force (PM-PM-PP and PM-PM in the legend). We see an almost perfect linear speed achieving 95% parallel efficiency in both cases. For comparison, the TianNu simulation had 72% weak-scaling efficiency ; although we note that this scaling test was done at redshift where nonlinear structure substantially increases iterations of the PP force kernel.
While often less relevant for cosmological applications, we also report CBUE’s strong-scaling result. We evolved a fixed global simulation volume of with fine cells, but slowly increased the number of cores and distributed corresponding fine cells per process. Fig. 3(b) shows the strong-scaling where we can see around 81.5% and 84.2% parallel efficiency at 20,480 cores, respectively. The degradation in performance with increasing cores comes from two aspects: the decrease in computation to communication time, and the increase in the ratio of the buffer region to physical volume.
We validate the correctness of the Cosmo- simulation visually and statistically. Fig. 4 illustrates the LSS distribution of CDM in a thin slice of the simulation volume. The -body particles in Cosmo- at the final checkpoint are interpolated onto regular grids by the cloud-in-cell (CIC) method. The resulting 3D density field is then partly projected onto the plane of Fig. 4 where dark/light colors show high/low column densities. The size of the figure corresponds to the box size of the simulation, and the thickness of the slice is chosen to clearly visualize the structure of the cosmic web – nodes, filaments, and voids of certain scales. The hierarchically zoomed-in panels show more detailed CDM structures on smaller scales and the most zoomed-in panel shows the projected -body particles’ distribution directly.
A useful statistics of LSS is the power spectrum, corresponding to the Fourier transform of the two-point correlation function. From the CDM density fieldwe define the dimensionless density contrast , from which the power spectrum is given by , where is the Fourier wave vector with and is the 3D Dirac delta function. We usually plot the dimensionless power spectrum , defined by . Cosmo- store checkpoints at various of redshifts and in Fig. 5 we plot their dimensionless power spectra compared with their linear and nonlinear predictions. Note that at high redshifts (e.g., the initial condition of Cosmo-, ), the LSS statistics are well described by linear theory; while at low redshifts, the linear/nonlinear power spectra555According to LSS formation theories, at first order, linear approximations where , Fourier modes evolve independently and is scaled by a growth factor , independent of . At lower redshifts, nonlinearities cannot be neglected and we use various nonlinear predictions (e.g.,[11, 12]) for the power spectrum. The exact LSS evolution can only be accurately modeled by -body simulations. mismatch and we see follows the nonlinear predictions. The power spectrum suppression at high wavenumber corresponds to the limited sub-grid force resolution, which can be improved by using the extended PP force. The result here is consistent with Ref..
V Related Approaches
The gravitational -body simulation is a critical tool for understanding modern cosmology that relies on supercomputing. The past ten years has seen the rise of trillion particle simulations starting with the Gordon-Bell Prize winning simulation  and a finalist . Tab. II compares the large-scale cosmological -body simulations in recent years. There are now several different codes that have been used to exceed the trillion particle mark, which differ primarily in their implementation of the gravitational force computation. Our work evolved 4.39 trillion particles and thus, to our best knowledge, is the largest cosmological -body simulation.
|Dark Sky ||2014||2HOT||Titan||12,288g 1110||1.074||8||Tree||36.8|
|GC ||2015||GreeM||K Computer||131,072c||0.55||1.12||TreePM||4.27|
|Q Continuum ||2015||HACC||Titan||16,384g 1110||0.55||0.923||PM||2|
|Euclid Flagship||2017||PKDGRAV3||Piz Daint||>4,000g 1110||2.0||3||FMM||4.8|
|Outer Rim ||2019||HACC||MIRA||524,288c||1.074||3||TreePM||2.84|
|Cosmo- (This work)||2019||CUBE||2.0||20,480c||4.39||3.2||PM-PM||195|
1. These three simulations were carried out using NVIDIA Tesla K20X.
Vi Impact of the Solution
We summarize the impact of the Cosmo- simulation in three aspects. First, Cosmo- is, to the best of our knowledge, the largest completed cosmological -body simulation, evolving 4.39 trillion particles from redshift 99 to 0. Simulations such as Cosmo-, as well as higher resolution ones using the PP force, will allow for improved LSS statistics and better understanding of halo assembly and substructure.
Second, we believe CUBE has a huge potential for large-scale cosmological simulations. Cosmo- was able to evolve 4.39 trillion particles using just 20,480 cores, a substantial improvement in enabled by CUBE’s memory consumption optimization. In the next few years, exascale supercomputers will be available which will allow for simulations using more than ten million cores, increasing the problem scale by at least three orders of magnitude. This will have a profound impact on studies of LSS and other astronomical simulations.
Finally, we show fixed-point compression and mixed-precision calculation can be extremely valuable tools for scientific applications. We expect traditional HPC applications will benefit more from emerging AI computing technologies.
We thank HPC Center of Shanghai Jiao Tong University for providing computing resource and excellent technical support. Hao-Ran Yu acknowledges National Science Foundation of China No.11903021. Shenggan Cheng and Hao-Ran Yu contributed equally to this paper. James Lin is the corresponding author.
-  R. W. Hockney and J. W. Eastwood, Computer simulation using particles, 1988.
-  H.-R. Yu, J. D. Emberson, D. Inman, T.-J. Zhang, U.-L. Pen, J. Harnois-Déraps, S. Yuan, H.-Y. Teng, H.-M. Zhu, X. Chen, Z.-Z. Xing, Y. Du, L. Zhang, Y. Lu, and X. Liao, “Differential neutrino condensation onto cosmic structure,” Nature Astronomy, vol. 1, p. 0143, Jul. 2017.
-  J. Harnois-Déraps, U.-L. Pen, I. T. Iliev, H. Merz, J. D. Emberson, and V. Desjacques, “High-performance PM N-body code: CUBEPM,” MNRAS, vol. 436, pp. 540–559, Nov. 2013.
-  H.-R. Yu, U.-L. Pen, and X. Wang, “CUBE: An Information-optimized Parallel Cosmological N-body Algorithm,” The Astrophysical Journal Supplement Series, vol. 237, p. 24, Aug. 2018.
-  U.-L. Pen, “A Linear Moving Adaptive Particle-Mesh N-Body Algorithm,” ApJS, vol. 100, p. 269, Sep. 1995.
-  H. Merz, U.-L. Pen, and H. Trac, “Towards optimal parallel PM N-body codes: PMFAST,” New A, vol. 10, pp. 393–407, Apr. 2005.
-  S. Rajan, S. Wang, R. Inkol, and A. Joyal, “Efficient approximations for the arctangent function,” IEEE Signal Processing Magazine, vol. 23, no. 3, pp. 187–198, 2006.
A. Rodriguez, E. Sen, E. Meiri, E. Fomenko, Y. Jin, H. Shen, and Z. Barukh, “Lower numerical precision deep learning inference and training,” 2018.
-  Y. B. Zel’dovich, “Gravitational instability: An approximate theory for large density perturbations.” A&A, vol. 5, pp. 84–89, Mar. 1970.
-  J. D. Emberson, H.-R. Yu, D. Inman, T.-J. Zhang, U.-L. Pen, J. Harnois-Déraps, S. Yuan, H.-Y. Teng, H.-M. Zhu, X. Chen, and Z.-Z. Xing, “Cosmological neutrino simulations at extreme scale,” Research in Astronomy and Astrophysics, vol. 17, p. 085, Aug. 2017.
-  D. Blas, J. Lesgourgues, and T. Tram, “The Cosmic Linear Anisotropy Solving System (CLASS). Part II: Approximation schemes,” J. Cosmology Astropart. Phys., vol. 7, p. 034, Jul. 2011.
-  R. E. Smith, J. A. Peacock, A. Jenkins, S. D. M. White, C. S. Frenk, F. R. Pearce, P. A. Thomas, G. Efstathiou, and H. M. P. Couchman, “Stable clustering, the halo model and non-linear cosmological power spectra,” Monthly Notices of the Royal Astronomical Society, vol. 341, no. 4, p. 1311–1332, Jun 2003. [Online]. Available: http://dx.doi.org/10.1046/j.1365-8711.2003.06503.x
-  T. Ishiyama, K. Nitadori, and J. Makino, “4.45 pflops astrophysical n -body simulation on k computer: the gravitational trillion-body problem,” pp. 1–10, 2012.
-  S. Habib, V. Morozov, H. Finkel, A. C. Pope, K. Heitmann, K. Kumaran, T. Peterka, J. Insley, D. Daniel, P. Fasel et al., “The universe at extreme scale: Multi-petaflop sky simulation on the bg/q,” arXiv: Distributed, Parallel, and Cluster Computing, 2012.
-  S. W. Skillman, M. S. Warren, M. J. Turk, R. H. Wechsler, D. E. Holz, and P. M. Sutter, “Dark Sky Simulations: Early Data Release,” arXiv e-prints, p. arXiv:1407.2600, Jul 2014.
-  T. Ishiyama, M. Enoki, M. A. R. Kobayashi, R. Makiya, M. Nagashima, and T. Oogi, “The GC simulations: Quantifying the dark side of the universe in the Planck cosmology,” Publications of the ASJ, vol. 67, no. 4, p. 61, Aug 2015.
-  K. Heitmann, N. Frontiere, C. Sewell, S. Habib, A. Pope, H. Finkel, S. Rizzi, J. Insley, and S. Bhattacharya, “The q continuum simulation: Harnessing the power of gpu accelerated supercomputers,” Astrophysical Journal Supplement Series, vol. 219, no. 2, p. 34, 2015.
-  D. Potter, J. Stadel, and R. Teyssier, “PKDGRAV3: beyond trillion particle cosmological simulations for the next era of galaxy surveys,” Computational Astrophysics and Cosmology, vol. 4, no. 1, p. 2, May 2017.
-  K. Heitmann, H. Finkel, A. Pope, V. Morozov, N. Frontiere, S. Habib, E. Rangel, T. Uram, D. Korytov, H. Child, S. Flender, J. Insley, and S. Rizzi, “The Outer Rim Simulation: A Path to Many-core Supercomputers,” Astrophysical Journal Supplement Series, vol. 245, no. 1, p. 16, Nov 2019.