1 Introduction
Tesla V100 is the leading graphics processing unit (GPU) for high-performance computing by NVIDIA, based on the latest Volta architecture. (The Turing architecture is the latest generation of GPU at the time of writing the manuscript, but its main focus is graphics. Besides, the compute capability of the Turing architecture is 7.5, which means that the major number is identical to that of the Volta architecture.) The single-precision theoretical peak performance of Tesla V100 is 15.7 TFlop/s GV100Whitepaper , which is about 1.5 times higher than that of Tesla P100, the previous flagship GPU. The primary driver of the performance improvement is the increase in the number of streaming multiprocessors (SMs) from 56 to 80. One notable modification of the SM in the Volta architecture is the separation of the integer units from the CUDA cores, which were unified in the Pascal and earlier GPU architectures. The results of detailed microbenchmarks on Tesla V100 are reported by Jia2018 . For example, the measured load throughput of the L1 cache per SM is more than three times higher than that of Tesla P100.
Recently installed large-scale systems, such as Summit at Oak Ridge National Laboratory Summit and ABCI (AI Bridging Cloud Infrastructure) at the National Institute of Advanced Industrial Science and Technology ABCI , employ Tesla V100 as the accelerator device. They appear at the top of the TOP500 and Green500 lists TOP500 ; Green500 . Such systems demand porting software developed for the Pascal or earlier GPU architectures to the Volta architecture, including optimizing its performance. Summarizing the pitfalls, preparing recipes for porting applications to the Volta architecture, and unveiling their performance behavior would help users develop software for the Volta GPU architecture.
This study focuses on N-body simulations as a representative example of compute-intensive applications. In astrophysics, N-body simulations are the common tool for investigating the formation and evolution of gravitational many-body systems such as galaxies. N-body simulations solve the time evolution of the system by tracking the orbital evolution of the particles obeying the mutual gravitational force. The acceleration of the i-th particle out of N particles is given by Newton's equation of motion as
(1)   \frac{\mathrm{d}^2 \boldsymbol{x}_i}{\mathrm{d} t^2} = \sum_{j \neq i}^{N} \frac{G m_j \left( \boldsymbol{x}_j - \boldsymbol{x}_i \right)}{\left( \left| \boldsymbol{x}_j - \boldsymbol{x}_i \right|^2 + \epsilon^2 \right)^{3/2}},
where $m_j$, $\boldsymbol{x}_i$, and $G$ are the particle mass, the particle position, and the gravitational constant, respectively. The gravitational softening $\epsilon$, which is introduced to reduce the artificial effects caused by the small particle number compared to the actual system, acts to remove the divergence due to division by zero and the self-interaction.
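The softened direct summation of Equation (1) can be sketched in plain C++ (a host-side illustration rather than GOTHIC's CUDA kernel; $G$ is set to unity and all names are ours):

```cpp
#include <array>
#include <cmath>
#include <vector>

struct Particle { double m, x, y, z; };  // mass and position

// Direct-sum gravitational acceleration with Plummer softening (Eq. 1).
// G = 1 in N-body units; eps is the softening length (must be > 0 here).
std::vector<std::array<double, 3>>
directAccel(const std::vector<Particle>& p, double eps) {
    const double eps2 = eps * eps;
    std::vector<std::array<double, 3>> acc(p.size(), {0.0, 0.0, 0.0});
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = 0; j < p.size(); ++j) {
            const double dx = p[j].x - p[i].x;
            const double dy = p[j].y - p[i].y;
            const double dz = p[j].z - p[i].z;
            const double r2 = dx * dx + dy * dy + dz * dz + eps2;
            const double inv_r3 = 1.0 / (r2 * std::sqrt(r2));  // (r^2+eps^2)^{-3/2}
            // For j == i the displacement vanishes, so a positive softening
            // removes both the division by zero and the self-interaction.
            acc[i][0] += p[j].m * dx * inv_r3;
            acc[i][1] += p[j].m * dy * inv_r3;
            acc[i][2] += p[j].m * dz * inv_r3;
        }
    }
    return acc;
}
```

Note that the loop deliberately includes the $j = i$ term, relying on the softening exactly as the text describes.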
The direct method, which calculates the gravitational force between particles one by one according to Equation (1), is too slow to execute the simulations in a realistic time because its computational complexity is $O(N^2)$. The tree method proposed by BarnesHut1986 reduces the workload to $O(N \log N)$ by using a multipole expansion technique when calculating the gravitational force exerted by distant particles. Using accelerator devices is another way of accelerating N-body simulations. Earlier implementations have shown that GPUs can accelerate the tree method Gaburov2010 ; Nakasato2012 ; Ogiya2013 ; Bedorf2012 ; Bedorf2014 ; WatanabeNakasato2014 ; MikiUmemura2017 .
GOTHIC, the gravitational octree code developed earlier MikiUmemura2017 , adopts both the tree method and the block time step McMillan1986 . The code was initially developed for the Fermi GPU architecture using CUDA C/C++ and was optimized for the Fermi, Kepler, and Maxwell architectures (see MikiUmemura2017 for the implementation details). GOTHIC automatically adjusts the frequency of rebuilding the tree structure to minimize the time-to-solution by monitoring the execution times of the tree construction and the gravity calculation. GOTHIC exploits the acceleration MAC (multipole acceptance criterion) proposed by Springel2001 ; Springel2005 and defined as
(2)   \frac{G M l^2}{d^4} \leq \Delta_\mathrm{acc} \left| \boldsymbol{a}_\mathrm{old} \right|,
where $M$ and $l$ are the mass and the size of the group of distant particles, $d$ is the distance between the particle feeling the gravity and the distant particles causing the gravity, $\Delta_\mathrm{acc}$ is the accuracy controlling parameter, and $\boldsymbol{a}_\mathrm{old}$ is the particle acceleration in the previous time step. This MAC enables a faster computation at the same accuracy of the gravity calculation compared to other MACs, as reported by Nelson2009 ; MikiUmemura2017 . GOTHIC generates a small interaction list shared by concurrently working threads within a warp to achieve a high performance by increasing the arithmetic intensity. When the MAC judges that the gravity calculation by a group of distant particles is acceptable, the corresponding pseudo particle is added to the interaction list. The MAC evaluation continues until the size of the list reaches its capacity; the code then calculates the gravitational force caused by the listed particles and flushes the list. This procedure continues until the calculation of gravity from all N-body particles finishes.
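The MAC of Equation (2) and the capacity-limited interaction list can be caricatured in host C++ as follows (an illustrative sketch with hypothetical names; the real code walks a tree, splits rejected nodes into children, and shares the list among the threads of a warp in shared memory):

```cpp
#include <cstddef>
#include <vector>

// Eq. (2): a distant node of mass M and size l at distance d is acceptable as
// a pseudo particle when G*M*l^2/d^4 <= deltaAcc * |a_old| (G = 1 here).
bool macAccepts(double M, double l, double d, double aOldMag, double deltaAcc) {
    const double d2 = d * d;
    return M * l * l <= deltaAcc * aOldMag * d2 * d2;
}

struct Node { double M, l, d; };  // mass, size, distance to target particle

// Buffer accepted pseudo particles in a fixed-capacity interaction list and
// "flush" (process) the list whenever it fills up, as GOTHIC does.
template <typename Flush>
void walk(const std::vector<Node>& nodes, double aOldMag, double deltaAcc,
          std::size_t capacity, Flush flush) {
    std::vector<Node> list;
    list.reserve(capacity);
    for (const Node& n : nodes) {
        if (macAccepts(n.M, n.l, n.d, aOldMag, deltaAcc)) {
            list.push_back(n);
            if (list.size() == capacity) { flush(list); list.clear(); }
        }
        // In the real code a rejected node is split and its children examined.
    }
    if (!list.empty()) flush(list);  // process the remainder
}
```

The flush callback stands in for the gravity kernel that consumes the listed pseudo particles.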
The present study adjusts the gravitational octree code GOTHIC for the Volta GPU architecture, as well as measures and analyzes the performance of the code on Tesla V100 and Tesla P100. Section 2 summarizes the major revision in the Volta GPU architecture and introduces some recipes for the porting. Section 3 presents the results of the performance measurements, and Section 4 discusses the origin of the performance improvements in the Volta architecture compared to the Pascal architecture.
2 Porting software to Volta GPU
This section summarizes the major revision introduced in the Volta GPU architecture (§2.1) and the code tuning for Tesla V100 and Tesla P100 (§2.2).
2.1 Major revision in the Volta architecture
The independent thread scheduling among the threads in a warp is the most significant revision in the Volta architecture compared to the earlier architectures. On GPUs with the Pascal or earlier architectures, all operations within the threads in a warp are executed at the same time; therefore, there is no need for explicit synchronization within a warp. On the Volta architecture, in contrast, simultaneity of the operations is not guaranteed even when they are confined to the threads in a warp. Explicit synchronization using the __syncwarp() instruction for the 32 threads in a warp, or tiled synchronization using Cooperative Groups when the number of involved threads is a power of two not greater than 32, is necessary for the operations to proceed properly.
Removal of implicit synchronization within a warp introduces an adverse side effect on warp divergence: it potentially extends the duration of the divergence. On GPUs based on the Pascal or earlier architectures, warp divergence occurs only within conditional-branch sections, and all threads in a warp reconverge as soon as the branch ends (Figure 20 of GV100Whitepaper ). On GPUs with the Volta architecture, warp divergence lasts until any explicit synchronization is executed, even after the conditional branch has ended (Figures 22 and 23 of GV100Whitepaper ). Therefore, inserting explicit synchronization instructions at the appropriate locations is necessary to avoid slowing down the computation on the Volta GPU. Since this procedure requires a detailed review and careful editing of the source code, NVIDIA provides a workaround that avoids adding explicit synchronization. In particular, compilation with -gencode arch=compute_60,code=sm_70 enforces the implicit synchronization of threads within a warp, similar to the Pascal architecture. Hereafter, we call compilation with -gencode arch=compute_60,code=sm_70 the Pascal mode and compilation with -gencode arch=compute_70,code=sm_70 the Volta mode. The Volta mode is the default in CUDA, while the Pascal mode alleviates the need for significant edits of the source code at the potential risk of losing some performance optimization by CUDA. In Sections 3 and 4.1, the Volta and Pascal modes are compared based on the time-to-solution.
The absence of implicit synchronization within a warp also modifies the warp shuffle functions. NVIDIA deprecates functions like __shfl() and __shfl_xor(), and encourages the use of the warp-synchronous versions such as __shfl_sync() and __shfl_xor_sync() instead. The new functions require a new input argument named mask, which specifies the threads participating in the function. For example, 0xffffffff or 0xffff is the appropriate mask when 32 threads or 16 consecutive threads, respectively, call the common warp shuffle function. Careful implementation is necessary if the software execution has data dependencies, as in a gravitational octree code. Let us consider the case when 16 threads involve a warp shuffle function. There are two situations: (1) a single group of 16 threads calls the function, or (2) two groups of 16 threads call the function at the same time. The expected mask for the former case is 0xffff as usual; however, it is 0xffffffff for the latter case. Neither 0xffff in both groups, nor 0xffff for the first half-warp and 0xffff0000 for the second half-warp, returns the correct value. Therefore, appropriately switching between 0xffffffff and 0xffff depending on the case at runtime is necessary to yield correct results. The appropriate mask can be obtained at runtime by calling __activemask() just before calling the warp shuffle function. In the implementation of GOTHIC for the Volta GPU architecture, we set the mask to 0xffffffff in the Pascal mode, or in the Volta mode when the threads never diverge, and to the value returned by __activemask() when the threads may diverge under the Volta mode.
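The mask arithmetic can be illustrated on the host: a fixed compile-time mask describes one contiguous group of lanes, while __activemask() collects whichever lanes actually arrived at the call together. A small C++ sketch with our own helper names:

```cpp
#include <cstdint>
#include <initializer_list>

// Mask of a contiguous group of `width` lanes starting at lane `base`,
// as a fixed compile-time mask would express it (e.g., 0xffffu for lanes 0-15).
uint32_t groupMask(unsigned base, unsigned width) {
    const uint64_t bits = (width == 32) ? 0xffffffffull : ((1ull << width) - 1ull);
    return static_cast<uint32_t>(bits << base);
}

// Host-side stand-in for __activemask(): OR together the lane bits of all
// threads that reached the warp shuffle call at the same time.
uint32_t activeMask(std::initializer_list<unsigned> lanes) {
    uint32_t m = 0;
    for (unsigned lane : lanes) m |= 1u << lane;
    return m;
}
```

When one 16-thread group calls alone, the active mask equals groupMask(0, 16); when both half-warps call together, it is the union of the two group masks (0xffffffff), which neither fixed 16-bit mask equals — this is exactly the case that forces the runtime __activemask() query.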
The Volta architecture introduces flexibility in the configuration of the on-chip memory compared to the earlier architectures. CUDA automatically determines the capacity of the shared memory per SM from the possible candidates of 0, 8, 16, 32, 64, and 96 KiB. Alternatively, the user can manually specify the capacity by calling cudaFuncSetAttribute() with the name of the target function, cudaFuncAttributePreferredSharedMemoryCarveout, and the ratio of the expected capacity of the shared memory to the maximum value as an integer percentage. For example, inputting an integer value of 33 assigns 32 KiB, which is about 33% of the 96 KiB maximum, as the shared memory. There is a pitfall: a rounded-up ratio selects the next larger candidate capacity (e.g., 64 KiB instead of 32 KiB); therefore, the input integer should be the largest integer not greater than the expected ratio (i.e., the floor of the value).
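The floor rule can be encoded in a one-line helper (a sketch with our own function name; 96 KiB is the Tesla V100 maximum):

```cpp
// Convert a desired shared-memory capacity into the integer carveout ratio
// for cudaFuncSetAttribute(). Integer division floors for positive operands,
// so the result never rounds up past the intended candidate capacity.
int carveoutPercent(int desiredKiB, int maxKiB /* 96 on Tesla V100 */) {
    return (100 * desiredKiB) / maxKiB;
}
```

For instance, requesting 32 KiB of the 96 KiB maximum yields 33, not the rounded-up 34 that would select the 64 KiB candidate.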
CUDA 9 introduces, as a new feature, global synchronization among multiple thread-blocks by using Cooperative Groups. The function extends the flexibility of the implementation. In the case of GOTHIC, we use the GPU lock-free synchronization proposed by XiaoFeng2010 instead of the global synchronization by Cooperative Groups, because it had already been implemented and microbenchmarks show that the former is faster than the latter (see Appendix A for details).
2.2 Tuning for Tesla V100 and Tesla P100
The performance of GOTHIC strongly depends on the configuration of the thread-blocks (e.g., the number of threads per thread-block, Ttot, for each function and the minimum number of threads involved in common reduction or scan operations, Tsub). Therefore, we performed microbenchmarks to determine the optimal values of these fundamental parameters. The measurements were carried out on the POWER9 servers at The University of Tokyo and the TSUBAME3.0 supercomputer at Tokyo Institute of Technology. The POWER9 servers and TSUBAME3.0 are equipped with Tesla V100 and Tesla P100, respectively. Table 1 lists the details of both environments.
Since the performance of the tree code also depends on the particle distribution, we adopted a model of the Andromeda galaxy (M31) as a realistic distribution. The model is an updated version of the M31 model constructed by Geehan2006 ; Fardal2007 : a dark matter halo with the Navarro–Frenk–White model Navarro1995 (the mass is and the scale length is ), a stellar halo with the Sérsic model Sersic1963 (the mass is , the scale length is , and the Sérsic index is Gilbert2012 ; Ibata2014 ), a stellar bulge with the Hernquist model Hernquist1990 (the mass is and the scale length is ), and an exponential disk (the mass is , the scale length is , the scale height is , and the minimum of the Toomre's Q value is Tenjes2017 ). We utilized MAGI MikiUmemura2018 to generate the particle distribution in a dynamical equilibrium state under the constraint that the masses of all N-body particles are identical.
We set the configuration of GOTHIC to achieve the shortest time-to-solution of the benchmark integration with the accuracy controlling parameter $\Delta_\mathrm{acc}$. Microbenchmarks were carried out in the Volta mode on Tesla V100. Table 2 summarizes the optimal configuration of GOTHIC: walkTree calculates the gravitational acceleration using the tree traversal, calcNode computes the location of the center-of-mass and the total mass of the tree nodes, makeTree builds the tree structure, and predict and correct execute the orbit integration using the second-order Runge–Kutta method.
3 Results
We measured the execution time as a function of the accuracy controlling parameter on Tesla P100 and Tesla V100, and compared the Volta mode with the Pascal mode on Tesla V100. The other settings of the performance measurements are identical to those of the microbenchmarks in §2.2. The measured execution times with the typical accuracy on Tesla P100 and on Tesla V100 in the Volta and Pascal modes are compared in Figure 1, which shows the results of the measurements. The figure also compiles the results measured previously MikiUmemura2017 on earlier GPUs (GeForce GTX TITAN X, Tesla K20X, and Tesla M2090). The execution time on later GPUs is always shorter than that on earlier GPUs. The observed dependence of the execution time on the accuracy controlling parameter is similar in all architectures except for the Kepler architecture. Figure 1 confirms that Tesla V100 achieves a tenfold acceleration over Tesla M2090 with the same algorithm. On Tesla V100, the Pascal mode is always faster than the Volta mode. Hereafter, we regard the Pascal mode as the fiducial configuration on Tesla V100.
Figure 2 shows the speedup of Tesla V100 in the Pascal mode relative to Tesla P100 and to Tesla V100 in the Volta mode. The Pascal mode is consistently faster than the Volta mode irrespective of the accuracy of the gravity calculation (the red circles with the dotted curve). Tesla V100 achieves a clear acceleration compared to Tesla P100 (the blue squares with the dot-dashed curve). In particular, the observed speedup ratio exceeds the ratio of the theoretical peak performance of Tesla V100 to that of Tesla P100 (about 1.5). The origin of this speedup beyond the GPU performance improvement is discussed in §4.2.
We also measured the execution time of GOTHIC on Tesla V100 while varying the total number of N-body particles. The number of time steps in the measurements was increased to ensure the accuracy of the measurements with small particle numbers. Figure 3 exhibits the scaling of GOTHIC with N and the breakdown over the representative functions. As on all earlier GPU architectures, the execution time of the gravity calculation (the red circles with the dashed curve) is always the dominant contributor; however, the contribution from calcNode (the blue squares with the dotted curve) is not negligible, especially in the small-N regions.
On Tesla P100, the identical measurements reveal a larger maximum affordable particle number. The reason Tesla P100 can perform larger simulations than Tesla V100 is its higher capacity of the global memory per SM: both GPUs carry 16 GB of HBM2 memory, but Tesla V100 has about 1.4 times more SMs (80 versus 56). Since GOTHIC adopts the breadth-first traversal and every SM needs a buffer to store the tree cells to be evaluated, the buffer size per SM determines the maximum problem size. Therefore, the Tesla V100 model with 32 GB of HBM2 would be able to perform N-body simulations with more particles than Tesla P100.
4 Discussion
4.1 Origin of speedup from the Volta mode
The performance measurements presented in the previous section reveal that the Pascal mode is faster than the Volta mode. The Pascal mode disables the independent thread scheduling of the Volta GPU architecture and guarantees that the threads in a warp execute operations simultaneously, as on the Pascal or earlier GPU architectures. It removes the overheads of the __syncwarp() instruction and of the tiled synchronization by Cooperative Groups. On the other hand, it disables some features introduced in the Volta architecture and also carries the risk of reduced performance due to insufficient performance optimization for the Volta GPU architecture by the CUDA compiler.
To evaluate the pros and cons of using the Pascal mode, we measured the execution time of the representative functions in GOTHIC and compared them between the Pascal and the Volta modes. Figure 4 presents the execution time of the representative functions in GOTHIC on Tesla V100 in the Pascal mode. The figure reveals that the execution time of walkTree decreases with the decreasing accuracy of the gravity calculation (the red circles with the dashed curve), while those of calcNode (the blue squares with the dotted curve) and the orbit integration (the green triangles with the dotted curve) are independent of the accuracy of the gravity calculation. This is a natural consequence: the precise computation of gravity takes longer than the low-accuracy computation, while the computation of the tree nodes and the orbit integration are unrelated to the accuracy of the gravitational force.
The tree construction is essentially independent of the accuracy of the gravity calculation; however, the interval of tree rebuilds reflects the change of the execution time for the gravity calculation, since GOTHIC automatically adjusts the tree rebuild interval to minimize the sum of the execution times for the gravity calculation and the tree construction. Throughout the measurements, the rebuild interval is six steps at the highest accuracy and longer at the lowest accuracy.
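The auto-tuning policy can be caricatured as follows (a deliberately simplified sketch with our own names, not GOTHIC's actual controller, which is described in MikiUmemura2017): rebuild once the accumulated slowdown of the gravity kernel since the last rebuild exceeds the cost of one tree construction.

```cpp
// Minimal caricature of rebuild-interval auto-tuning. As particles drift, the
// tree degrades and the gravity kernel slows down relative to the fastest
// step observed since the last rebuild; once the accumulated excess time
// exceeds the cost of one tree construction, rebuilding pays for itself.
struct RebuildTuner {
    double bestStep = 1e300;  // fastest gravity time since the last rebuild
    double penalty = 0.0;     // accumulated excess gravity time

    // Returns true when the tree should be rebuilt on this step.
    bool step(double gravityTime, double buildTime) {
        if (gravityTime < bestStep) bestStep = gravityTime;
        penalty += gravityTime - bestStep;
        if (penalty > buildTime) {
            bestStep = 1e300;  // reset for the next rebuild interval
            penalty = 0.0;
            return true;
        }
        return false;
    }
};
```

With a cheaper gravity kernel (lower accuracy), the per-step penalty grows more slowly relative to the build cost, so the interval naturally lengthens, matching the trend described above.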
Figure 5 shows the speedup ratio of the Pascal mode compared to the Volta mode for the representative functions in GOTHIC. The figure indicates that the Pascal mode provides a faster calculation for all functions. In the case of the dominant contributor, walkTree, the Pascal mode is noticeably faster, and its speedup rate tends to increase in the low-accuracy regions. Furthermore, calcNode, which also shows a clear acceleration, contributes to the overall speedup in the low-accuracy regions.
The above two functions are the origin of the observed acceleration of the Pascal mode over the Volta mode. Both functions include reduction and scan operations within the threads constituting a warp and, therefore, call __syncwarp() many times in the Volta mode. Since the Pascal mode never calls __syncwarp(), the origin of the speedup is considered to be the execution cost of __syncwarp(). This implies that the Pascal mode will achieve better performance for functions having many synchronizations within a warp.
The orbit integration does not require intra-warp synchronization using __syncwarp(), and its operations should be identical in the Pascal and the Volta modes. The only possible difference between the two modes is the degree of performance optimization by the CUDA compiler. As the figure shows, there is no difference in performance between the two modes, which implies that the degree of optimization is similar in both.
The function for the tree construction, which executes the tiled synchronization by Cooperative Groups and __activemask(), does not show a higher speedup ratio compared to the functions using __syncwarp() only. The probable reasons are that the function also exploits intra-block synchronization and that most of its execution time is spent in the radix sort using the cub::DeviceRadixSort::SortPairs function in CUB CUB . Both reduce the contribution of the synchronization costs within a warp.

4.2 Origin of speedup from Tesla P100
This section elucidates what enables the observed speedup to exceed the ratio of the theoretical peak performance between Tesla V100 and Tesla P100. We concentrate on the function for the gravity calculation, since it is the dominant contributor to the execution time. The independence of the integer operation units from the floating-point operation units on GPUs with the Volta architecture introduces opportunities to hide the execution time of integer operations or floating-point operations through their overlapped execution. The acceleration due to the overlapping is only available when both integer and floating-point operations are executed independently in compute-intensive situations. N-body simulations are a representative example of compute-intensive applications, and the tree method includes integer operations, as opposed to the direct method, which executes floating-point operations only. Gravitational octree simulations are therefore an ideal testbed to confirm whether the acceleration by hiding integer operations is available in actual applications.
To examine the hypothesis that the overlapping of integer and floating-point operations achieves the observed speedup, we counted the number of instructions executed in walkTree using nvprof. The metrics specified in the measurements are inst_integer, flop_count_sp_fma, flop_count_sp_add, flop_count_sp_mul, and flop_count_sp_special. Measurements using nvprof with the above metrics serialize the execution of GOTHIC, and the computations become significantly slower. Hence, we ran the simulation with a reduced number of steps on Tesla V100. We also disabled the auto-tuning of the tree reconstruction interval and set a fixed value, taken from executions without nvprof, considering that nvprof extends the execution time of the functions that are the basis of the auto-tuning.
Figure 6 shows the measured number of instructions as a function of the accuracy controlling parameter. The number of instructions executed by the special function units (the green crosses with the solid curve), which compute the reciprocal square root of single-precision floating-point numbers, is nearly tenfold smaller than that of the fused multiply-add (FMA) operations (the red squares with the dashed curve). Even assuming the throughput of the reciprocal square root to be a quarter of the throughput of the operations executed by the CUDA cores, the execution time of the reciprocal square root is smaller than that of the other floating-point operations. For simplicity, we hereafter assume that the execution time of the reciprocal square root is entirely hidden by the other floating-point operations.
Figure 7 compares the number of instructions executed on the different operating units. The count of integer operations (the blue squares with the dot-dashed curve) is identical to that in Figure 6. The sum of all floating-point operations performed by the CUDA cores (FMA, multiplication, and addition) is plotted using the magenta triangles with the dotted curve. The instruction count of the floating-point operations is always higher than that of the integer operations; hence, the maximum of the two counts (the red circles with the solid curve) corresponds to that of the floating-point operations. This implies that the execution time of the integer operations can be hidden by the overlapped execution of the floating-point operations on GPUs with the Volta architecture, provided there are no obstacles such as dependencies between the two types of operations. The black squares with the dashed curve indicate the summed number of integer and floating-point operations. The sum determines the execution time when an identical operation unit executes both integer and floating-point operations, as on GPUs with the Pascal or earlier architectures.
Figure 8 shows the theoretical prediction of the speedup ratio of Tesla V100 over Tesla P100. The magenta dot-dashed and black dotted lines represent the ratios of the theoretical peak performance and of the measured memory bandwidth, respectively. The blue squares with the dashed curve represent the ratio of the instruction count in which the execution of the integer operations is never hidden (corresponding to Tesla P100; the black squares in Figure 7) to the count in which the execution of the integer operations is completely hidden (the ideal situation on Tesla V100; the red circles in Figure 7). The actual prediction is the product of the theoretical peak performance ratio and this instruction ratio (the red circles with the solid curve). The theoretical prediction suggests that the observed acceleration is possible in the high-accuracy regions. On the other hand, the model fails to explain the observed decline of the speedup in the low-accuracy regions.
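The prediction behind Figure 8 amounts to a one-line model (a sketch with illustrative operation counts and our own function name; the peak ratio of about 1.5 is the approximate V100/P100 single-precision peak-performance ratio):

```cpp
#include <algorithm>

// Predicted V100/P100 speedup for the gravity kernel. On a Pascal-like GPU
// one pipeline executes integer and floating-point work back to back (cost
// intOps + fpOps); on Volta the integer work can hide behind the
// floating-point work (ideal cost max(intOps, fpOps)). The overall prediction
// multiplies this instruction-ratio gain by the peak-performance ratio.
double predictedSpeedup(double intOps, double fpOps, double peakRatio) {
    const double serial = intOps + fpOps;               // Pascal-like execution
    const double overlapped = std::max(intOps, fpOps);  // ideal Volta execution
    return peakRatio * serial / overlapped;
}
```

For equal integer and floating-point counts the instruction ratio alone doubles the prediction; with no integer work at all, the prediction falls back to the bare peak-performance ratio.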
Possible explanations for the disagreement between the theoretical prediction and the observed speedup in the low-accuracy regions include the following: (1) there is no room for the overlapped execution of integer and floating-point operations; (2) the performance limiting factor transitions from compute-bound to memory-bound. The regions in question have a low accuracy of the gravity calculation; therefore, the MAC evaluations tend to judge the N-body particles causing the gravity to be sufficiently far away and to compute the gravitational force using tree nodes in levels near the root node. To enable the overlapped execution of the integer and floating-point operations, multiple iterations of the procedure that fills the interaction list in the shared memory by MAC evaluations (dominated by integer operations) and flushes the list just after the gravity calculations (dominated by floating-point operations) are necessary. Decreasing the accuracy of the gravity calculation might reduce the chance of the overlapped execution of the two types of operations, and therefore the observed speedup ratio falls below the ratio of the theoretical peak performance. The reduction of opportunities for overlapping also makes it difficult to hide the access latency of the global memory. This turns the compute-bound kernel into a latency-bound kernel, in which overlapped execution cannot shorten the total execution time. Furthermore, the decrease in the executed operations may increase the byte-per-flop ratio, converting the compute-bound kernel into a bandwidth-bound one.
A direct inspection of the reason for the disagreement between the theoretical model and the observed speedup is difficult. However, a performance drop should be observed when the force accuracy is decreased, provided that some of the above considerations are correct. The performance evaluation requires individual measurements of the execution time of the gravity calculation function and an assumption on the cost of the single-precision reciprocal square root. Here we count each reciprocal square root as four operations, since the ratio of the throughput of this operation to that of the single-precision addition or multiplication is four Miki2013 ; MikiUmemura2017 . Figure 9 reveals the estimated performance of the gravity calculation. The sustained performance reaches a significant fraction of the single-precision theoretical peak performance in the high-accuracy regions and decreases with the increase of the accuracy controlling parameter, as shown in Figure 9. This suggests that the reduced computational workload accompanying the decrease of the force accuracy deteriorates the performance and the speedup ratio. Dividing the obtained number of floating-point operations by the total elapsed time of the code yields the estimated single-precision performance of GOTHIC as a whole (Figure 10). The dependence of this performance on the accuracy controlling parameter is stronger than that in Figure 9, because the execution time of calcNode also contributes to the total execution time in the low-accuracy regions.
5 Summary
The gravitational octree code GOTHIC, which was originally optimized for GPUs with the Fermi, Kepler, and Maxwell architectures, was adapted to the Volta architecture. The Volta architecture introduces the independent thread scheduling and removes the implicit synchronization within the threads constituting a warp. This change of the programming model requires modifying the code, either by inserting explicit synchronizations at the appropriate locations or by enforcing execution with implicit synchronization, as on the Pascal or earlier architectures, by specifying -gencode arch=compute_60,code=sm_70 during compilation.
The performance on Tesla V100 and Tesla P100, the latest and the previous flagship GPUs developed by NVIDIA, respectively, was measured. The performance measurements on Tesla V100 compared the time-to-solution of the N-body simulations of the Andromeda galaxy model with explicit synchronizations and in the same execution mode as that of the Pascal or earlier architectures. The results showed that the latter is faster regardless of the accuracy of the gravity calculation; specifying -gencode arch=compute_60,code=sm_70 thus yields a clear acceleration.
Tesla V100 achieves a significant acceleration compared to Tesla P100, and the observed speedup is higher than the improvement of the theoretical peak performance. The independence of the integer operation units from the floating-point operation units enables the overlapped execution of integer and floating-point operations. Such overlapped execution hides the execution time of the integer operations and leads to a speedup rate above the theoretical peak performance ratio.
Tesla V100 can execute the N-body simulation up to its maximum affordable particle number, with a sustained performance corresponding to a substantial fraction of the single-precision theoretical peak performance.
Acknowledgment
This work is supported by the "Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures" and "High Performance Computing Infrastructure" in Japan (Project ID: jh180045NAH), and the "TSUBAME Encouragement Program for Young/Female Users" of the Global Scientific Information and Computing Center, Tokyo Institute of Technology. The performance optimization for Tesla P100 and the performance measurements using Tesla P100 were carried out on the TSUBAME3.0 supercomputer at Tokyo Institute of Technology.
Appendix A Global synchronization using Cooperative Groups
CUDA 9 introduces a global synchronization function among multiple thread-blocks using Cooperative Groups as a new feature. GOTHIC is already equipped with global synchronization using the GPU lock-free synchronization proposed by XiaoFeng2010 , since the code was developed before Cooperative Groups was introduced.
To exploit the global synchronization using Cooperative Groups when the source code is split into multiple files, it is necessary to append --device-c when generating the object files and --device-link --output-file tmp.o when invoking the device linker, and then to link all object files together with the intermediate file tmp.o. Here, we compare the execution time of the function computing the tree nodes in three cases: (1) the original implementation, (2) global synchronization using Cooperative Groups, and (3) compiling with the above procedure but executing the original implementation.
The measured execution times show that the original implementation is the fastest of the three cases. The number of registers used by each thread is smaller in the original implementation than in the other two cases, and in the latter cases the number of blocks per SM decreases accordingly. The reason the changed compilation method decelerates the application is the decrease of the occupancy caused by the drop of blocks per SM. The additional cost of the global synchronization using Cooperative Groups is estimated from the number of global synchronizations the function performs per step.
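The occupancy argument can be made concrete with a rough register-pressure bound (a sketch with our own function name; both Pascal and Volta SMs provide a 65,536-entry register file, though the real allocation is per warp with rounding, so this is only an upper bound):

```cpp
// Rough register-pressure limit on resident blocks per SM: the register file
// holds 65,536 32-bit registers on Pascal and Volta SMs. When the compiler
// assigns more registers per thread (as device-link compilation did here),
// fewer blocks fit per SM and the occupancy drops.
int blocksPerSmByRegisters(int regsPerThread, int threadsPerBlock,
                           int regsPerSm = 65536) {
    return regsPerSm / (regsPerThread * threadsPerBlock);
}
```

For example, 256-thread blocks fit 8 blocks per SM at 32 registers per thread but only 4 at 64 registers per thread, illustrating how a register increase alone halves the resident-block count.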
References
 (1) NVIDIA, NVIDIA Tesla V100 GPU Architecture (2017).
 (2) Z. Jia, M. Maggioni, B. Staiger, D. P. Scarpazza, Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking, arXiv e-prints, arXiv:1804.06826.
 (3) Summit. URL https://www.olcf.ornl.gov/summit/
 (4) ABCI. URL https://abci.ai/
 (5) TOP500 Lists. URL https://www.top500.org/lists/top500/
 (6) Green500. URL https://www.top500.org/green500/
 (7) J. Barnes, P. Hut, A hierarchical O(N log N) force-calculation algorithm, Nature 324 (1986) 446–449. doi:10.1038/324446a0.
 (8) E. Gaburov, J. Bédorf, S. Portegies Zwart, Gravitational tree-code on graphics processing units: implementation in CUDA, Procedia Computer Science 1 (2010) 1119–1127. arXiv:1005.5384, doi:10.1016/j.procs.2010.04.124.
 (9) N. Nakasato, Implementation of a parallel tree method on a GPU, Journal of Computational Science 3 (3) (2012) 132–141. doi:10.1016/j.jocs.2011.01.006.
 (10) G. Ogiya, M. Mori, Y. Miki, T. Boku, N. Nakasato, Studying the core-cusp problem in cold dark matter halos using N-body simulations on GPU clusters, Journal of Physics Conference Series 454 (1) (2013) 012014. doi:10.1088/1742-6596/454/1/012014.
 (11) J. Bédorf, E. Gaburov, S. Portegies Zwart, A sparse octree gravitational N-body code that runs entirely on the GPU processor, Journal of Computational Physics 231 (2012) 2825–2839. arXiv:1106.1900, doi:10.1016/j.jcp.2011.12.024.
 (12) J. Bédorf, E. Gaburov, M. S. Fujii, K. Nitadori, T. Ishiyama, S. Portegies Zwart, 24.77 Pflops on a Gravitational Tree-Code to Simulate the Milky Way Galaxy with 18600 GPUs, arXiv e-prints, arXiv:1412.0659.
 (13) T. Watanabe, N. Nakasato, GPU accelerated Hybrid Tree Algorithm for Collisionless N-body Simulations, arXiv e-prints, arXiv:1406.6158.
 (14) Y. Miki, M. Umemura, GOTHIC: Gravitational oct-tree code accelerated by hierarchical time step controlling, New Astronomy 52 (2017) 65–81. arXiv:1610.07279, doi:10.1016/j.newast.2016.10.007.
 (15) S. L. W. McMillan, The Vectorization of Small-N Integrators, in: P. Hut, S. L. W. McMillan (Eds.), The Use of Supercomputers in Stellar Dynamics, Vol. 267 of Lecture Notes in Physics, Berlin Springer Verlag, 1986, p. 156. doi:10.1007/BFb0116406.
 (16) V. Springel, N. Yoshida, S. D. M. White, GADGET: a code for collisionless and gasdynamical cosmological simulations, New Astronomy 6 (2001) 79–117. arXiv:astro-ph/0003162, doi:10.1016/S1384-1076(01)00042-2.
 (17) V. Springel, The cosmological simulation code GADGET-2, Monthly Notices of the Royal Astronomical Society 364 (2005) 1105–1134. arXiv:astro-ph/0505010, doi:10.1111/j.1365-2966.2005.09655.x.
 (18) A. F. Nelson, M. Wetzstein, T. Naab, Vine - A Numerical Code for Simulating Astrophysical Systems Using Particles. II. Implementation and Performance Characteristics, The Astrophysical Journal Supplement 184 (2009) 326–360. arXiv:0802.4253, doi:10.1088/0067-0049/184/2/326.
 (19) S. Xiao, W. Feng, Inter-block GPU communication via fast barrier synchronization, in: Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, 2010, pp. 1–12. doi:10.1109/IPDPS.2010.5470477.
 (20) J. J. Geehan, M. A. Fardal, A. Babul, P. Guhathakurta, Investigating the Andromeda stream - I. Simple analytic bulge-disc-halo model for M31, Monthly Notices of the Royal Astronomical Society 366 (2006) 996–1011. arXiv:astro-ph/0501240, doi:10.1111/j.1365-2966.2005.09863.x.
 (21) M. A. Fardal, P. Guhathakurta, A. Babul, A. W. McConnachie, Investigating the Andromeda stream - III. A young shell system in M31, Monthly Notices of the Royal Astronomical Society 380 (2007) 15–32. arXiv:astro-ph/0609050, doi:10.1111/j.1365-2966.2007.11929.x.
 (22) J. F. Navarro, C. S. Frenk, S. D. M. White, Simulations of X-ray clusters, Monthly Notices of the Royal Astronomical Society 275 (1995) 720–740. arXiv:astro-ph/9408069, doi:10.1093/mnras/275.3.720.
 (23) J. L. Sérsic, Influence of the atmospheric and instrumental dispersion on the brightness distribution in a galaxy, Boletin de la Asociacion Argentina de Astronomia La Plata Argentina 6 (1963) 41.
 (24) K. M. Gilbert, P. Guhathakurta, R. L. Beaton, J. Bullock, M. C. Geha, J. S. Kalirai, E. N. Kirby, S. R. Majewski, J. C. Ostheimer, R. J. Patterson, E. J. Tollerud, M. Tanaka, M. Chiba, Global Properties of M31’s Stellar Halo from the SPLASH Survey. I. Surface Brightness Profile, The Astrophysical Journal 760 (2012) 76. arXiv:1210.3362, doi:10.1088/0004-637X/760/1/76.
 (25) R. A. Ibata, G. F. Lewis, A. W. McConnachie, N. F. Martin, M. J. Irwin, A. M. N. Ferguson, A. Babul, E. J. Bernard, S. C. Chapman, M. Collins, M. Fardal, A. D. Mackey, J. Navarro, J. Peñarrubia, R. M. Rich, N. Tanvir, L. Widrow, The Large-scale Structure of the Halo of the Andromeda Galaxy. I. Global Stellar Density, Morphology and Metallicity Properties, The Astrophysical Journal 780 (2014) 128. arXiv:1311.5888, doi:10.1088/0004-637X/780/2/128.
 (26) L. Hernquist, An analytical model for spherical galaxies and bulges, The Astrophysical Journal 356 (1990) 359–364. doi:10.1086/168845.
 (27) P. Tenjes, T. Tuvikene, A. Tamm, R. Kipper, E. Tempel, Spiral arms and disc stability in the Andromeda galaxy, Astronomy and Astrophysics 600 (2017) A34. arXiv:1701.05815, doi:10.1051/0004-6361/201629991.
 (28) Y. Miki, M. Umemura, MAGI: many-component galaxy initializer, Monthly Notices of the Royal Astronomical Society 475 (2018) 2269–2281. doi:10.1093/mnras/stx3327.

 (29) CUB. URL https://nvlabs.github.io/cub/
 (30) Y. Miki, D. Takahashi, M. Mori, Highly scalable implementation of an N-body code on a GPU cluster, Computer Physics Communications 184 (2013) 2159–2168. doi:10.1016/j.cpc.2013.04.011.