Extreme Scale FMM-Accelerated Boundary Integral Equation Solver for Wave Scattering

03/27/2018
by Mustafa AbdulJabbar, et al.

Algorithmic and architecture-oriented optimizations are essential for achieving performance worthy of anticipated energy-austere exascale systems. In this paper, we present an extreme scale FMM-accelerated boundary integral equation solver for wave scattering, which uses FMM as the matrix-vector multiplication inside the GMRES iterative method. Our FMM Helmholtz kernels treat nontrivial singular and near-field integration points. We implement highly optimized kernels for both shared and distributed memory, targeting emerging Intel extreme performance HPC architectures. We extract the potential thread- and data-level parallelism of the key Helmholtz kernels of FMM. Our application code is well optimized to exploit the AVX-512 SIMD units of the Intel Skylake and Knights Landing (KNL) architectures. We provide different performance models for tuning the task-based tree traversal implementation of FMM, and develop optimal architecture-specific and algorithm-aware partitioning, load balancing, and communication-reducing mechanisms to scale up to 6,144 compute nodes of a Cray XC40 with 196,608 hardware cores. With shared memory optimizations, we achieve roughly 77% of peak single precision floating point performance of a 56-core Skylake processor, and on average 60% of peak single precision floating point performance of a 72-core KNL. These numbers represent nearly 5.4x and 10x speedup on Skylake and KNL, respectively, compared to the baseline scalar code. With distributed memory optimizations, on the other hand, we report near-optimal efficiency in the weak scalability study with respect to both the logarithmic communication complexity and the theoretical scaling complexity of FMM. In addition, we exhibit up to 85% efficiency in strong scaling. We compute in excess of 2 billion DoF on the full scale of the Cray XC40 supercomputer.
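
The abstract's central structural idea, GMRES calling a fast matrix-vector product instead of a stored matrix, can be illustrated with a small matrix-free sketch. Everything below is an assumption made for illustration and is not the paper's code. The function names (`fibonacci_sphere`, `make_direct_matvec`, `gmres`), the toy operator `I + w*G`, the wavenumber, and the point count are all hypothetical. A direct O(N^2) sum over the Helmholtz Green's function stands in for the FMM, and the singular self-term is simply skipped, whereas the paper treats singular and near-field points explicitly.

```python
import cmath
import math

def norm(v):
    return math.sqrt(sum(abs(vi) ** 2 for vi in v))

def fibonacci_sphere(n):
    """Deterministic, roughly uniform points on the unit sphere (toy geometry)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    pts = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        rho = math.sqrt(1.0 - z * z)
        pts.append((rho * math.cos(golden * i), rho * math.sin(golden * i), z))
    return pts

def make_direct_matvec(points, k, weight):
    """Return y = x + weight * G x with G_ij = exp(ikr)/(4 pi r), i != j.
    This O(N^2) loop is the stand-in for an FMM evaluation."""
    def matvec(x):
        y = []
        for i, p in enumerate(points):
            acc = x[i]
            for j, q in enumerate(points):
                if i != j:
                    r = math.dist(p, q)
                    acc += weight * cmath.exp(1j * k * r) / (4.0 * math.pi * r) * x[j]
            y.append(acc)
        return y
    return matvec

def gmres(matvec, b, tol=1e-10, max_iter=None):
    """Unrestarted GMRES that only ever touches the operator through matvec."""
    n = len(b)
    max_iter = max_iter or n
    beta = norm(b)
    if beta == 0.0:
        return [0j] * n, 0
    V = [[bi / beta for bi in b]]          # orthonormal Krylov basis
    H, cs, sn, g = [], [], [], [complex(beta)]
    for k in range(max_iter):
        w = matvec(V[k])
        h = []
        for v in V:                        # modified Gram-Schmidt
            hij = sum(vi.conjugate() * wi for vi, wi in zip(v, w))
            h.append(hij)
            w = [wi - hij * vi for wi, vi in zip(w, v)]
        hk1 = norm(w)
        h.append(hk1)
        for i in range(k):                 # apply earlier Givens rotations
            h[i], h[i + 1] = (cs[i] * h[i] + sn[i] * h[i + 1],
                              -sn[i].conjugate() * h[i] + cs[i] * h[i + 1])
        a = h[k]
        denom = math.hypot(abs(a), hk1)
        if a == 0:
            c, s = 0.0, 1 + 0j
        else:
            c, s = abs(a) / denom, (a / abs(a)) * hk1 / denom
        cs.append(c)
        sn.append(s)
        H.append(h[:k] + [c * a + s * hk1])   # column k of the triangular factor
        g.append(-s.conjugate() * g[k])
        g[k] = c * g[k]
        if abs(g[k + 1]) <= tol * beta or hk1 == 0.0:
            break
        V.append([wi / hk1 for wi in w])
    m = len(H)
    y = [0j] * m
    for i in range(m - 1, -1, -1):         # back substitution
        acc = g[i] - sum(H[j][i] * y[j] for j in range(i + 1, m))
        y[i] = acc / H[i][i]
    x = [sum(y[j] * V[j][i] for j in range(m)) for i in range(n)]
    return x, m

points = fibonacci_sphere(40)
wavenumber = 2.0                           # assumed value for the demo
weight = 4.0 * math.pi / len(points)       # crude equal-area quadrature weight
apply_A = make_direct_matvec(points, wavenumber, weight)
b = [cmath.exp(1j * wavenumber * p[2]) for p in points]   # plane wave along z
x, iters = gmres(apply_A, b)
rel_res = norm([ai - bi for ai, bi in zip(apply_A(x), b)]) / norm(b)
print(iters, rel_res)
```

Because `gmres` sees the operator only through the `matvec` callable, replacing the direct sum with a hierarchical FMM evaluation would change the cost per iteration without changing the solver loop, which is the separation the abstract describes.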
