I Problem Overview
In astrophysics, body simulations are used to study the dynamics of globular clusters, galaxy evolutions, galaxy and intergalactic interactions and cosmology [1]. Cosmological parameters are measured to the percent or even subpercent levels. This precision requires equally accurate numerical modeling of the large scale structure (LSS) formation and might lead to cosmological body simulations with a very large . For example, for studies of dark matter and dark energy, we need to resolve the structures and assembly histories of faint galaxies in a cosmological volume matching modern spectroscopic galaxy surveys. This requires a mass resolution of body particles and physical scale of the simulation . Another example is the study of neutrino mass using cosmological effects. To model small but nonlinear cosmic neutrino background, we need a large to suppress the Poisson noise (e.g.,[2]). These problems need a dynamical range of 4 to 5 orders of magnitude, corresponding to, at least, (trillion) particle body simulations.
A directsummation algorithm is unaffordable for a large because the computational complexity of such a pairwise force (PP force) calculation is . Many algorithms have been designed to reduce the complexity to , such as the particlemesh (PM) or tree methods, or even , such as the fast multipole method. Fortunately, cosmological body simulations rarely require accurate trajectories of individual particles, but rather just correct statistical distributions. In this scenario, the PMbased algorithms usually meet the accuracy requirements and have a potential to simulate with a very large .
However, achieving the largest with the PMbased algorithms is typically bound by memory capacity, instead of computing capacity. For example, TianNu [2], one of the world’s largest body simulations, used the PM algorithm [3] to complete an cosmological body simulation on the Tianhe2 supercomputer. The simulation used all the memory of Tianhe2, but only 30% of its computing resource^{1}^{1}1TianNu used only CPUs in Tianhe2 which contribute to about 30% of the computing resource. The rest are contributed from the coprocessors (Intel Knights Corner, KNC) which associate with small, independent memory, and TianNu did not use.. The challenge to further increase the scale of body simulations is to reduce its memory consumption.
To tackle this challenge, we designed a twolevel PM (PMPM) algorithm CUBE [4] to obtain the lowest memory consumption possible. body simulations typically require considerable amounts of memory to store the positions and velocities of body particles. These six numbers occupy 24/48 bytes per particle (bpp) if they are stored as single/double precision floating numbers. By using an informationoptimized algorithm, fixedpoint format can be used to store the phasespace information of particles in memory. This fixedpoint compression can significantly minimize the memory footprint toward only 6 bpp, thus break the memory capacity bound for scaling large body simulations.
In this work, we implemented the CUBE algorithm in C and optimized the C code for performance and scalability. We then scaled the optimized code on the Intel Cascade Lake based supercomputer 2.0, to 512 nodes (20,480 cores) with 95% weakscaling efficiency. This cosmological simulation, Cosmo, evolved 4.4 trillion cold dark matter (CDM) particles from redshift^{2}^{2}2The cosmological redshift is an indicator of cosmic time in an expanding universe. The scale factor of the universe satisfies . 99 to 0 in a cubic comoving volume per side. To our best knowledge, Cosmo is the largest completed cosmological body simulation.
Ii Application Scenario
Highperformance body simulation codes must address many challenges – maintaining a low memory footprint given a large , minimizing the communication across computing nodes, reducing and accelerating the memory accesses to large arrays, and efficiently using highperformance libraries to speed up standard calculations such as FFTs. A series of cosmological body codes, including “MovingPM” [5], PMFAST [6], CUBEPM [3], and CUBE [4], are designed especially for weakscaling on supercomputers^{3}^{3}3They introduced, for example, the 2level particlemesh (PMPM) algorithm[6], optimized on cubic decompositions [3] and fixedpoint information optimization technique [4].. Our scaling test is based on our continuous development of CUBE, aiming for optimizations on all the above aspects.
CUBE solves the gravitational force using the PMPM algorithm, with optional extendedPP force modules for increased accuracy. The traditional PMbased algorithm is suboptimal in parallel computing as it requires a full resolution parallel FFT. The PMPM algorithm solves this problem by splitting the gravitational force into a shortrange force and a longrange force , . The shortrange force has a cutoff, i.e., , and the gravity beyond the cutoff is fully provided by the longrange force, . When , both force components contribute to . In particular, for smaller , is gradually dominated by . The forcematching optimizations under different criteria are discussed in Ref.[6, 3].
For either or , the force calculation is a convolution in real space, equivalent to multiplying the density field with componentwise force kernels in Fourier space,
(1) 
where
is the density field obtained by interpolating particles onto a mesh (mass assignment),
is the force kernel, a tilde indicates that the variable is in Fourier space, and corresponds to three spatial dimensions. The PMPM algorithm essentially computes the shortrange () and longrange () components of Eq.(1) separately. The two PM algorithms are applied on two meshes with different resolutions.For the longrange force, because it requires a lower resolution – is usually computed on a 4timescoarser mesh (coarsemesh).
In contrast, for , the shortrange PM is computed on fullresolution finemeshes. Because , we can divide the simulation volume into many cubic subvolumes , called tiles, and on each is computed by a finemesh PM on a slightly enlarged volume . So in is done locally using local FFTs.
If the PP force is implemented, the gravity between particles within a preset number of adjacent neighboring finecells are computed explicitly, and the shortrange force kernel is further corrected accordingly.
Thus, only the coarsemesh FFTs require global MPI communications^{4}^{4}4Besides computing the gravitational force, MPI communications are also used to send/receive particles and to synchronize density/velocity fields.. The coarsemesh PM scales as but with 1/4 of the fullresolution, so it typically consumes negligible CPU time. This makes the PMPM algorithm excellent for weakscaling problems as computing time scales like , because an increase of the problem size leads to a proportional increase of when is fixed. It also dramatically reduces memory footprint requirements as the coarse grid has times fewer cells.
Computationally, the global simulation volume is first broken down into small cubical subvolumes, each of which is assigned to one MPI process. A second level of cubical decomposition occurs inside each process, where the node subvolumes are broken into a number of local volumes called tiles as mentioned above. Inside an MPI process, the tiles are calculated in turn, and using OpenMP parallelizes the calculations within the tile.
Iii Solution
Iiia Fixedpoint Compression
The PMPM algorithm is intrinsically memory efficient, and the memory consumption is thus dominated by the phasespace coordinates of particles. CUBE is informationoptimized and further reduce this memory footprint by using fixpoint formats instead of floating formats.
This format is based on the structure of the coarsemesh. For the th dimension, instead of using a 4/8byte floating number storing each particle’s global coordinate, , we store its relative position w.r.t. its parentcell in the coarsemesh. In particular, if is divided by coarsecells per dimension, we further divide each coarsecell into 256 bins, and for a particle in the th cell, its coordinate in the th dimension can be expressed by
(2) 
where is the left boundary of the cell, and is the relative position. can be stored as an 1byte integer. To have the correct , particles’ must be stored in memory in a cellordered way. A complementary number count of particle numbers in this mesh (density field) will give complete information on the particle distribution in the mesh.
The particle velocities in each dimension is decomposed as
(3) 
where is the velocity field averaged on the th cell, and is each particle’s relative velocity w.r.t. . Similarly, can be stored as an 1byte integer, and maps to by a nonlinear function
. The nonlinearity is due to the fact that the probability distribution of
is not uniform but is approximated by a Gaussian distribution with its variance
linearly predictable as a function of redshift and smoothing scale . In practice,being the form similar to the inverse error function (the cumulative distribution function of Gaussian) minimizes the error in the velocity conversion, because it better solves the dominant slower moving particles.
becomes nonlinear at lower redshifts, and can be directly measured from simulations.For each particle, are stored by six 1byte integers. and information are provided by the density and velocity fields on the coarsemesh. Usually each coarsecell contains large number (64 in Cosmo) particles so the auxiliary density and velocity fields consumes negligible additional memory [4]. and can either or both stored as 2byte integers, corresponding to a more accurate position/velocity storage. More detailed discussions and results are in Ref.[4].
IiiB Optimizations
IiiB1 Precompute Expensive Functions
The fixedpoint compression dramatically lowers the required memory footprint of body simulations; however, the data compression and decompression introduce extra computing costs. The profiling results showed 21% of total elapsed time is spent on them. For example, calculating Eq.(3) is very timeconsuming due to the expensive math functions in . To implement them efficiently, we precomputed their value and stored them into arrays. As the number of compression fixedpoint representation is limited, 256 for 1byte integer, the time and memory cost of the precomputing are negligible.
IiiB2 Approximate Expensive Functions
As the fixedpoint compression casts a float (32 bits) to just an 1 or 2byte integer, the high accuracy of some expensive math functions is redundant. Therefore, we use approximate functions to replace them without losing the accuracy of final scientific results. For example, to accelerate the expensive function (which is used in Ref.[4] to approximate velocity compression function ), we modified approximate function [7] and expanded its domain (Eq. 4). This fast approximate function has a maximum absolute error of 0.0038 radians, which has no effects on the accuracy of final results.
(4) 
IiiB3 Compute PP Force Kernel in Mixedprecision
The PP force kernel requires calculating Euclidean distances between each pair of particles within a certain range. As the particles positions are compressed in integers, we cannot use traditional AVX512 vectorization instructions. Therefore, we adopt the new AVX512 VNNI (Vector Neural Network Instructions) extension
[8] (shipped with Intel Cascade Lake processors and after) to compute the compression position for gravity calculation in mixedprecision. As shown in Fig. 1, the AVX512 VNNI VPDPBUSD instruction multiplies the individual bytes (8bit) of the first source operand by the corresponding bytes (8bit) of the second source operand, producing intermediate word (16bit) results which are summed and accumulated in the double word (32bit) of the destination operand. Theoretically, using the VPDPBUSD instruction can increase 3 times of operations throughput and use 3 times less memory footprint than scalar instructions.IiiB4 Distribute MPI Processes
The buffer communication requires the positions and velocities of particles to be transferred between adjacent grids. To reduce the overhead of buffer communications, as illustrated in Fig. 2, we use the 3D MPI rank distribution to ensure the processes with adjacent simulation volumes are nearby in the physical domain.
Iv Performance Metrics and Results
Iva Experimental Setup
IvA1 Hardware & Software
We evaluated all scaling tests on supercomputer 2.0, which has 650 computing nodes with 2PFlops peak performance. Each node has two Intel Cascade Lake 6248 processors (20 cores for each socket) and 192GB DDR4 memory. All nodes are interconnected with Intel 100Gbps OmniPath Architecture (OPA).
We compiled CUBE with Intel C/C++ Compiler 18.0.5. The FFT and MPI are supported by Intel MKL 2018 (Update 4) and Intel MPI 2018 (Update 4), respectively.
IvA2 Simulation Parameters
We use 4,096 MPI processes on 512 nodes ( of the full system of 2.0) to evolve cold dark matter particles in a Gpc cosmological volume. We use the Zel’dovich Approximation [9] to determine the initial positions and velocities of particles at redshift and then use CUBE to evolve the particles to . The simulation models a CDM universe with Hubble parameter , CDM density , baryon density and initial conditions characterized by and 0.96. For fixedpoint compression, we use the 1byte fixedpoint format to store the particle phase space.
The PMPM force calculation is computed using a coarse mesh composed of cells per process ( in total). Each process volume is decomposed into 8 tiles resolved by ( for physical volume and for buffer region) fine cells per dimension. This geometry is equivalent to a regular PM calculation with a fine cell mesh (or fine cells per process), but utilizing 8 times less memory footprint and substantially faster due to the decreased size of the global FFT.
IvB Performance Metrics
We choose bytes per particle (bpp) and wall clock time as the two performance metrics for the scaling tests. We use the first metric to measure the memory consumption for each MPI process or each node by using /proc/self/statm and MPI_Allreduce; we use the second one to measure code speeds by using MPI_Wtime.
IvC Performance Results
IvC1 Memory Consumption
Tab. I lists the memory consumption for one MPI process of CUBE, which includes three parts: Particle, Coarse mesh, and Fine mesh. Particle includes three components: Physical region and Image buffer are used to store the particle phase space in the physical domain and buffered region whereas Tile buffer is the temporary buffer for each tile. Coarse mesh and Fine mesh are the arrays associated with the coarse mesh and fine mesh, respectively. In CUBE, each MPI process requires 13.75GB memory to simulate particles. The bpp of CUBE is thus .
Due to using fixedpoint compression, CUBE has significantly smaller bpp than any other cosmological body simulation codes. For example, TianNu [2] simulates 2.97 trillion particles on Tianhe2. Each Tianhe2 node holds an average of neutrino particles and CDM particles, and uses 40GB memory. Its bpp is thus , which is 14.5 times larger than CUBE’s bpp.
Parts  Memory Consumption  

/GB  /bpp  Percentage  
Particles  Physical region  7.3  6  53% 
Image buffer  1.55  2.24  11.36%  
Tile buffer  1.11  1.03  8.04%  
Coarse mesh  2.38  2.21  17.30%  
Fine mesh  1.32  1.22  9.56%  
Total  13.75  12.8  100.00% 
IvC2 Scalability
To study the weakscaling of CUBE, we allow each process to evolve a volume using fine cells and gradually scale from 40 cores to 20,480 cores. Fig. 3(a) shows CUBE’s weakscaling result both with and without the PP force (PMPMPP and PMPM in the legend). We see an almost perfect linear speed achieving 95% parallel efficiency in both cases. For comparison, the TianNu simulation had 72% weakscaling efficiency [10]; although we note that this scaling test was done at redshift where nonlinear structure substantially increases iterations of the PP force kernel.
While often less relevant for cosmological applications, we also report CBUE’s strongscaling result. We evolved a fixed global simulation volume of with fine cells, but slowly increased the number of cores and distributed corresponding fine cells per process. Fig. 3(b) shows the strongscaling where we can see around 81.5% and 84.2% parallel efficiency at 20,480 cores, respectively. The degradation in performance with increasing cores comes from two aspects: the decrease in computation to communication time, and the increase in the ratio of the buffer region to physical volume.
IvD Validation
We validate the correctness of the Cosmo simulation visually and statistically. Fig. 4 illustrates the LSS distribution of CDM in a thin slice of the simulation volume. The body particles in Cosmo at the final checkpoint are interpolated onto regular grids by the cloudincell (CIC) method. The resulting 3D density field is then partly projected onto the plane of Fig. 4 where dark/light colors show high/low column densities. The size of the figure corresponds to the box size of the simulation, and the thickness of the slice is chosen to clearly visualize the structure of the cosmic web – nodes, filaments, and voids of certain scales. The hierarchically zoomedin panels show more detailed CDM structures on smaller scales and the most zoomedin panel shows the projected body particles’ distribution directly.
A useful statistics of LSS is the power spectrum, corresponding to the Fourier transform of the twopoint correlation function. From the CDM density field
we define the dimensionless density contrast , from which the power spectrum is given by , where is the Fourier wave vector with and is the 3D Dirac delta function. We usually plot the dimensionless power spectrum , defined by . Cosmo store checkpoints at various of redshifts and in Fig. 5 we plot their dimensionless power spectra compared with their linear and nonlinear predictions. Note that at high redshifts (e.g., the initial condition of Cosmo, ), the LSS statistics are well described by linear theory; while at low redshifts, the linear/nonlinear power spectra^{5}^{5}5According to LSS formation theories, at first order, linear approximations where , Fourier modes evolve independently and is scaled by a growth factor , independent of . At lower redshifts, nonlinearities cannot be neglected and we use various nonlinear predictions (e.g.,[11, 12]) for the power spectrum. The exact LSS evolution can only be accurately modeled by body simulations. mismatch and we see follows the nonlinear predictions. The power spectrum suppression at high wavenumber corresponds to the limited subgrid force resolution, which can be improved by using the extended PP force. The result here is consistent with Ref.[4].V Related Approaches
The gravitational body simulation is a critical tool for understanding modern cosmology that relies on supercomputing. The past ten years has seen the rise of trillion particle simulations starting with the GordonBell Prize winning simulation [13] and a finalist [14]. Tab. II compares the largescale cosmological body simulations in recent years. There are now several different codes that have been used to exceed the trillion particle mark, which differ primarily in their implementation of the gravitational force computation. Our work evolved 4.39 trillion particles and thus, to our best knowledge, is the largest cosmological body simulation.
Simulations  Years  Codes  Supercomputers  Scale  ,  (Gpc/)  Force  (kpc/)  

Dark Sky [15]  2014  2HOT  Titan  12,288g ^{1}^{1}10  1.074  8  Tree  36.8  
GC [16]  2015  GreeM  K Computer  131,072c  0.55  1.12  TreePM  4.27  
Q Continuum [17]  2015  HACC  Titan  16,384g ^{1}^{1}10  0.55  0.923  PM  2  
TianNu [2]  2017  CUBEPM  Tianhe2  331,776c  1.2  PMPMPP  13  
Euclid Flagship[18]  2017  PKDGRAV3  Piz Daint  >4,000g ^{1}^{1}10  2.0  3  FMM  4.8  
Outer Rim [19]  2019  HACC  MIRA  524,288c  1.074  3  TreePM  2.84  
Cosmo (This work)  2019  CUBE  2.0  20,480c  4.39  3.2  PMPM  195 
1. These three simulations were carried out using NVIDIA Tesla K20X.
Vi Impact of the Solution
We summarize the impact of the Cosmo simulation in three aspects. First, Cosmo is, to the best of our knowledge, the largest completed cosmological body simulation, evolving 4.39 trillion particles from redshift 99 to 0. Simulations such as Cosmo, as well as higher resolution ones using the PP force, will allow for improved LSS statistics and better understanding of halo assembly and substructure.
Second, we believe CUBE has a huge potential for largescale cosmological simulations. Cosmo was able to evolve 4.39 trillion particles using just 20,480 cores, a substantial improvement in enabled by CUBE’s memory consumption optimization. In the next few years, exascale supercomputers will be available which will allow for simulations using more than ten million cores, increasing the problem scale by at least three orders of magnitude. This will have a profound impact on studies of LSS and other astronomical simulations.
Finally, we show fixedpoint compression and mixedprecision calculation can be extremely valuable tools for scientific applications. We expect traditional HPC applications will benefit more from emerging AI computing technologies.
Acknowledgments
We thank HPC Center of Shanghai Jiao Tong University for providing computing resource and excellent technical support. HaoRan Yu acknowledges National Science Foundation of China No.11903021. Shenggan Cheng and HaoRan Yu contributed equally to this paper. James Lin is the corresponding author.
References
 [1] R. W. Hockney and J. W. Eastwood, Computer simulation using particles, 1988.
 [2] H.R. Yu, J. D. Emberson, D. Inman, T.J. Zhang, U.L. Pen, J. HarnoisDéraps, S. Yuan, H.Y. Teng, H.M. Zhu, X. Chen, Z.Z. Xing, Y. Du, L. Zhang, Y. Lu, and X. Liao, “Differential neutrino condensation onto cosmic structure,” Nature Astronomy, vol. 1, p. 0143, Jul. 2017.
 [3] J. HarnoisDéraps, U.L. Pen, I. T. Iliev, H. Merz, J. D. Emberson, and V. Desjacques, “Highperformance PM Nbody code: CUBEPM,” MNRAS, vol. 436, pp. 540–559, Nov. 2013.
 [4] H.R. Yu, U.L. Pen, and X. Wang, “CUBE: An Informationoptimized Parallel Cosmological Nbody Algorithm,” The Astrophysical Journal Supplement Series, vol. 237, p. 24, Aug. 2018.
 [5] U.L. Pen, “A Linear Moving Adaptive ParticleMesh NBody Algorithm,” ApJS, vol. 100, p. 269, Sep. 1995.
 [6] H. Merz, U.L. Pen, and H. Trac, “Towards optimal parallel PM Nbody codes: PMFAST,” New A, vol. 10, pp. 393–407, Apr. 2005.
 [7] S. Rajan, S. Wang, R. Inkol, and A. Joyal, “Efficient approximations for the arctangent function,” IEEE Signal Processing Magazine, vol. 23, no. 3, pp. 187–198, 2006.

[8]
A. Rodriguez, E. Sen, E. Meiri, E. Fomenko, Y. Jin, H. Shen, and Z. Barukh, “Lower numerical precision deep learning inference and training,” 2018.
 [9] Y. B. Zel’dovich, “Gravitational instability: An approximate theory for large density perturbations.” A&A, vol. 5, pp. 84–89, Mar. 1970.
 [10] J. D. Emberson, H.R. Yu, D. Inman, T.J. Zhang, U.L. Pen, J. HarnoisDéraps, S. Yuan, H.Y. Teng, H.M. Zhu, X. Chen, and Z.Z. Xing, “Cosmological neutrino simulations at extreme scale,” Research in Astronomy and Astrophysics, vol. 17, p. 085, Aug. 2017.
 [11] D. Blas, J. Lesgourgues, and T. Tram, “The Cosmic Linear Anisotropy Solving System (CLASS). Part II: Approximation schemes,” J. Cosmology Astropart. Phys., vol. 7, p. 034, Jul. 2011.
 [12] R. E. Smith, J. A. Peacock, A. Jenkins, S. D. M. White, C. S. Frenk, F. R. Pearce, P. A. Thomas, G. Efstathiou, and H. M. P. Couchman, “Stable clustering, the halo model and nonlinear cosmological power spectra,” Monthly Notices of the Royal Astronomical Society, vol. 341, no. 4, p. 1311–1332, Jun 2003. [Online]. Available: http://dx.doi.org/10.1046/j.13658711.2003.06503.x
 [13] T. Ishiyama, K. Nitadori, and J. Makino, “4.45 pflops astrophysical n body simulation on k computer: the gravitational trillionbody problem,” pp. 1–10, 2012.
 [14] S. Habib, V. Morozov, H. Finkel, A. C. Pope, K. Heitmann, K. Kumaran, T. Peterka, J. Insley, D. Daniel, P. Fasel et al., “The universe at extreme scale: Multipetaflop sky simulation on the bg/q,” arXiv: Distributed, Parallel, and Cluster Computing, 2012.
 [15] S. W. Skillman, M. S. Warren, M. J. Turk, R. H. Wechsler, D. E. Holz, and P. M. Sutter, “Dark Sky Simulations: Early Data Release,” arXiv eprints, p. arXiv:1407.2600, Jul 2014.
 [16] T. Ishiyama, M. Enoki, M. A. R. Kobayashi, R. Makiya, M. Nagashima, and T. Oogi, “The GC simulations: Quantifying the dark side of the universe in the Planck cosmology,” Publications of the ASJ, vol. 67, no. 4, p. 61, Aug 2015.
 [17] K. Heitmann, N. Frontiere, C. Sewell, S. Habib, A. Pope, H. Finkel, S. Rizzi, J. Insley, and S. Bhattacharya, “The q continuum simulation: Harnessing the power of gpu accelerated supercomputers,” Astrophysical Journal Supplement Series, vol. 219, no. 2, p. 34, 2015.
 [18] D. Potter, J. Stadel, and R. Teyssier, “PKDGRAV3: beyond trillion particle cosmological simulations for the next era of galaxy surveys,” Computational Astrophysics and Cosmology, vol. 4, no. 1, p. 2, May 2017.
 [19] K. Heitmann, H. Finkel, A. Pope, V. Morozov, N. Frontiere, S. Habib, E. Rangel, T. Uram, D. Korytov, H. Child, S. Flender, J. Insley, and S. Rizzi, “The Outer Rim Simulation: A Path to Manycore Supercomputers,” Astrophysical Journal Supplement Series, vol. 245, no. 1, p. 16, Nov 2019.
Comments
There are no comments yet.