Scalability of 3D-DFT by block tensor-matrix multiplication on the JUWELS Cluster

03/23/2023
by Nitin Malapally et al.

The 3D Discrete Fourier Transform (DFT) is a technique used to solve problems in disparate fields. Nowadays, the commonly adopted implementation of the 3D-DFT is derived from the Fast Fourier Transform (FFT) algorithm. However, evidence indicates that the distributed-memory 3D-FFT algorithm does not scale well because of its reliance on all-to-all communication. Here, building on the work of Sedukhin et al. [Proceedings of the 30th International Conference on Computers and Their Applications (CATA 2015), pp. 193–200], we revisit the possibility of improving the scaling of the 3D-DFT with an alternative approach that uses point-to-point communication, albeit at a higher arithmetic complexity. The new algorithm computes the 3D-DFT via tensor-matrix multiplications on a volumetrically decomposed domain, using three specially adapted variants of Cannon's algorithm. We implemented it as a C++ library called S3DFT and tested it on the JUWELS Cluster at the Jülich Supercomputing Centre. Our shared-memory tensor-matrix multiplication attains 88% of the theoretical single-node peak performance. One variant of the distributed-memory tensor-matrix multiplication shows excellent scaling, while the other two perform more poorly, which can be attributed to their intrinsic communication patterns. A comparison of S3DFT with the Intel MKL and FFTW3 libraries indicates that Intel MKL currently performs best overall, followed by FFTW3 and then S3DFT. This picture might change with further improvements to the algorithm and/or on clusters whose network connections have higher latency, e.g. on cloud platforms.
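The core idea of the tensor formulation can be illustrated in a few lines: the 3D-DFT of a tensor is equivalent to multiplying it along each of its three modes by a (dense) DFT matrix, which is what makes the matrix-multiplication machinery applicable at the cost of O(n⁴) arithmetic per mode instead of the FFT's O(n³ log n). The sketch below is a minimal serial illustration of this equivalence, not S3DFT's API; the function names are ours, and the distributed Cannon-style decomposition is omitted.

```python
import numpy as np

def dft_matrix(n):
    # Dense DFT matrix F[j, k] = exp(-2*pi*i*j*k / n), matching the
    # sign convention of numpy.fft.
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

def dft3d_by_mode_products(x):
    # 3D-DFT as three successive tensor-matrix (mode) products.
    # Each einsum contracts one axis of x with the DFT matrix;
    # this is the operation the paper distributes via Cannon's algorithm.
    n0, n1, n2 = x.shape
    y = np.einsum("ia,ajk->ijk", dft_matrix(n0), x)   # mode-0 product
    y = np.einsum("jb,ibk->ijk", dft_matrix(n1), y)   # mode-1 product
    y = np.einsum("kc,ijc->ijk", dft_matrix(n2), y)   # mode-2 product
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 8))
# The mode-product formulation agrees with the FFT-based 3D transform.
assert np.allclose(dft3d_by_mode_products(x), np.fft.fftn(x))
```

Because each mode product is an ordinary (batched) matrix multiplication, it maps directly onto block-distributed matrix-multiplication schemes with point-to-point communication, which is the trade-off the abstract describes.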

