Optimizing Distributed Tensor Contractions using Node-Aware Processor Grids

07/17/2023
by Andreas Irmler, et al.

We propose an algorithm that aims to minimize the inter-node communication volume of distributed, memory-efficient tensor contraction schemes on modern multi-core compute nodes. The key idea is to define processor grids that optimize the intra- and inter-node communication volume of the employed contraction algorithms. We present an implementation of the proposed node-aware communication algorithm in the Cyclops Tensor Framework (CTF). We demonstrate that this implementation significantly improves the performance of matrix-matrix multiplication and tensor contractions on up to several hundred modern compute nodes compared to conventional implementations that do not use node-aware processor grids. Our implementation performs well when compared with existing state-of-the-art parallel matrix multiplication libraries (COSMA and ScaLAPACK). Beyond matrix-matrix multiplication, we also investigate the performance of our node-aware communication algorithm for tensor contractions as they occur in quantum chemical coupled-cluster methods. To this end, we employ a modified version of CTF in combination with a coupled-cluster code (Cc4s). Our findings show that the node-aware communication algorithm also improves the performance of coupled-cluster theory calculations for real-world problems running on tens to hundreds of compute nodes.
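
To make the idea of a node-aware processor grid concrete, the following is a minimal Python sketch. It is not the algorithm from the paper and does not use the CTF API. It assumes a 2D processor grid, a default row-major rank placement, and consecutive ranks sharing a node. The cost measure (distinct nodes minus one, summed over every grid row and column) is a simplified proxy chosen for this illustration, and all function names are hypothetical.

```python
from itertools import product


def row_major_grid(pr, pc):
    """Default placement (assumed): rank r sits at grid position (r // pc, r % pc)."""
    return {(r // pc, r % pc): r for r in range(pr * pc)}


def node_aware_grid(pr, pc, br, bc):
    """Illustrative placement: each node's br*bc ranks fill a compact br x bc tile."""
    assert pr % br == 0 and pc % bc == 0, "tile shape must divide the grid shape"
    tiles_per_row = pc // bc
    grid = {}
    for i, j in product(range(pr), range(pc)):
        tile = (i // br) * tiles_per_row + (j // bc)  # index of the node owning (i, j)
        local = (i % br) * bc + (j % bc)              # rank offset within that node
        grid[(i, j)] = tile * (br * bc) + local
    return grid


def inter_node_links(grid, pr, pc, ranks_per_node):
    """Proxy cost: sum over all grid rows and columns of (distinct nodes - 1).

    Assumes ranks are assigned to nodes in consecutive blocks of ranks_per_node.
    """
    def distinct_nodes(ranks):
        return len({r // ranks_per_node for r in ranks})

    rows = sum(distinct_nodes([grid[(i, j)] for j in range(pc)]) - 1 for i in range(pr))
    cols = sum(distinct_nodes([grid[(i, j)] for i in range(pr)]) - 1 for j in range(pc))
    return rows + cols
```

For 64 ranks on an 8 x 8 grid with 16 ranks per node, `inter_node_links(row_major_grid(8, 8), 8, 8, 16)` gives 24, while `inter_node_links(node_aware_grid(8, 8, 4, 4), 8, 8, 16)` gives 16. Under this proxy, the tiled placement spreads each row and column over fewer nodes. Real communication volume also depends on message sizes and on the contraction algorithm, which this sketch ignores.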

research
05/27/2021

Efficient distributed algorithms for Convolutional Neural Networks

Several efficient distributed algorithms have been developed for matrix-...
research
03/23/2023

Scalability of 3D-DFT by block tensor-matrix multiplication on the JUWELS Cluster

The 3D Discrete Fourier Transform (DFT) is a technique used to solve pro...
research
04/11/2017

Strassen's Algorithm for Tensor Contraction

Tensor contraction (TC) is an important computational kernel widely used...
research
09/27/2021

Distributed Computing With the Cloud

We investigate the effect of omnipresent cloud storage on distributed co...
research
05/19/2020

Efficient Process-to-Node Mapping Algorithms for Stencil Computations

Good process-to-compute-node mappings can be decisive for well performin...
research
12/15/2018

Layer Based Partition for Matrix Multiplication on Heterogeneous Processor Platforms

While many approaches have been proposed to analyze the problem of matri...
research
08/26/2019

Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication

We propose COSMA: a parallel matrix-matrix multiplication algorithm that...
