A GPU-Accelerated Fast Summation Method Based on Barycentric Lagrange Interpolation and Dual Tree Traversal

12/13/2020
by Leighton Wilson, et al.

We present the barycentric Lagrange dual tree traversal (BLDTT) fast summation method for particle interactions. The scheme replaces well-separated particle-particle interactions by adaptively chosen particle-cluster, cluster-particle, and cluster-cluster approximations given by barycentric Lagrange interpolation at proxy particles on a Chebyshev grid in each cluster. The BLDTT is kernel-independent and the approximations can be efficiently mapped onto GPUs, where target particles provide an outer level of parallelism and source particles provide an inner level of parallelism. We present an OpenACC GPU implementation of the BLDTT with MPI remote memory access for distributed memory parallelization. The performance of the GPU-accelerated BLDTT is demonstrated for calculations with different problem sizes, particle distributions, geometric domains, and interaction kernels, as well as for unequal target and source particles. Comparison with our earlier particle-cluster barycentric Lagrange treecode (BLTC) demonstrates the superior performance of the BLDTT. In particular, on a single GPU for problem sizes ranging from N=1E5 to 1E8, the BLTC has O(N log N) scaling, while the BLDTT has O(N) scaling. In addition, MPI strong scaling results are presented for the BLTC and BLDTT using N=64E6 particles on up to 32 GPUs.
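
To make the particle-cluster approximation described above concrete, the following is a minimal one-dimensional sketch in Python, not the authors' OpenACC implementation. It assumes Chebyshev points of the second kind as the proxy grid, a single source cluster on [0, 1], and the kernel 1/|x - y|; all function names (`chebyshev_points`, `proxy_charges`, `particle_cluster`, and so on) are hypothetical and chosen only for illustration. Source charges are interpolated onto proxy charges at the Chebyshev points, and a well-separated target then interacts with the proxy particles instead of the sources.

```python
import math
import random


def chebyshev_points(a, b, n):
    """Return n+1 Chebyshev points of the second kind mapped to [a, b]."""
    return [0.5 * (a + b) + 0.5 * (b - a) * math.cos(math.pi * k / n)
            for k in range(n + 1)]


def barycentric_weights(n):
    """Barycentric weights for Chebyshev points of the second kind."""
    w = [(-1.0) ** k for k in range(n + 1)]
    w[0] *= 0.5
    w[n] *= 0.5
    return w


def lagrange_basis(y, nodes, weights):
    """Evaluate every Lagrange basis polynomial L_k(y) in barycentric form."""
    for k, s in enumerate(nodes):
        if y == s:  # y sits exactly on a node: the basis is a unit vector
            return [1.0 if i == k else 0.0 for i in range(len(nodes))]
    terms = [w / (y - s) for w, s in zip(weights, nodes)]
    total = sum(terms)
    return [t / total for t in terms]


def proxy_charges(sources, charges, nodes, weights):
    """Proxy charge k is the sum over sources of L_k(y_j) * q_j."""
    q_hat = [0.0] * len(nodes)
    for y, q in zip(sources, charges):
        for k, basis_value in enumerate(lagrange_basis(y, nodes, weights)):
            q_hat[k] += basis_value * q
    return q_hat


def kernel(x, y):
    """Illustrative kernel (an assumption); the method is kernel-independent."""
    return 1.0 / abs(x - y)


def direct_sum(x, sources, charges):
    """Exact potential at target x from all source particles."""
    return sum(kernel(x, y) * q for y, q in zip(sources, charges))


def particle_cluster(x, nodes, q_hat):
    """Approximate potential at x using proxy particles in place of sources."""
    return sum(kernel(x, s) * q for s, q in zip(nodes, q_hat))


random.seed(0)
sources = [random.uniform(0.0, 1.0) for _ in range(200)]
charges = [random.uniform(0.5, 1.0) for _ in range(200)]

degree = 8                                  # interpolation degree (assumed)
nodes = chebyshev_points(0.0, 1.0, degree)  # proxy particle locations
weights = barycentric_weights(degree)
q_hat = proxy_charges(sources, charges, nodes, weights)

target = 4.0                                # well separated from [0, 1]
exact = direct_sum(target, sources, charges)
approx = particle_cluster(target, nodes, q_hat)
rel_err = abs(approx - exact) / abs(exact)
print(f"relative error: {rel_err:.2e}")
```

In this sketch the cost per target drops from the number of sources (200) to the number of proxy particles (9). Because only kernel evaluations at proxy points are needed, swapping `kernel` for another smooth interaction requires no other changes, which is the sense in which the approach is kernel-independent.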
