Efficiently Parallelizable Strassen-Based Multiplication of a Matrix by its Transpose

10/25/2021
by Viviana Arrigoni, et al.

The multiplication of a matrix by its transpose, A^T A, appears as an intermediate operation in the solution of a wide range of problems. In this paper, we propose a new cache-oblivious algorithm (ATA) for computing this product, which uses the classical Strassen algorithm as a subroutine. In particular, we reduce the computational cost to 2/3 of that required by Strassen's algorithm, amounting to (14/3) n^{log_2 7} floating-point operations. ATA works for generic rectangular matrices and exploits the symmetry of the resulting product matrix to save memory. In addition, we provide an extensive implementation study of ATA on a shared-memory system and extend its applicability to a distributed environment. To support our findings, we compare our algorithm with state-of-the-art solutions specialized in the computation of A^T A. Our experiments show good scalability with respect to both the matrix size and the number of processes involved, as well as favorable performance for both parallel paradigms and for the sequential implementation when compared with other methods in the literature.
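To make the role of symmetry concrete, here is a minimal Python sketch of one 2x2 block recursion for A^T A. It is an illustration under stated assumptions, not the authors' implementation: the names `ata`, `matmul`, `transpose`, `add` and the `cutoff` parameter are hypothetical, a classical triple-loop product stands in for the Strassen subroutine, and the full result is assembled for readability rather than storing only one triangle.

```python
def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def matmul(X, Y):
    """Classical product X @ Y; a stand-in for the Strassen subroutine."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def add(X, Y):
    """Entrywise sum of two matrices of equal shape."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def ata(A, cutoff=2):
    """Compute A^T A for an m x n matrix A by 2x2 block recursion.

    The diagonal blocks of the result are themselves sums of products of the
    form B^T B, so they are handled by recursive calls. Only the lower-left
    block needs general products; the upper-right block is its transpose.
    """
    m, n = len(A), len(A[0])
    if m <= cutoff or n <= cutoff:
        return matmul(transpose(A), A)  # small case: direct product

    mh, nh = m // 2, n // 2
    A11 = [row[:nh] for row in A[:mh]]
    A12 = [row[nh:] for row in A[:mh]]
    A21 = [row[:nh] for row in A[mh:]]
    A22 = [row[nh:] for row in A[mh:]]

    # Diagonal blocks: four recursive A^T A computations.
    C11 = add(ata(A11, cutoff), ata(A21, cutoff))
    C22 = add(ata(A12, cutoff), ata(A22, cutoff))

    # Off-diagonal block: two general products (Strassen in the paper's setting).
    C21 = add(matmul(transpose(A12), A11), matmul(transpose(A22), A21))

    # Symmetry: the upper-right block is never computed, only mirrored.
    C12 = transpose(C21)

    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom
```

Under the assumption that each level performs four recursive calls and two Strassen products on half-size blocks, and taking Strassen's cost as 7 n^{log_2 7}, the recurrence T(n) = 4 T(n/2) + 2 * 7 (n/2)^{log_2 7} has leading term (14/3) n^{log_2 7}, which agrees with the figure quoted in the abstract.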


research
02/06/2019

Fast Strassen-based A^t A Parallel Multiplication

Matrix multiplication A^t A appears as intermediate operation during the...
research
10/01/2019

An efficient floating point multiplier design for high speed applications using Karatsuba algorithm and Urdhva-Tiryagbhyam algorithm

Floating point multiplication is a crucial operation in high power compu...
research
01/13/2020

On fast multiplication of a matrix by its transpose

We present a non-commutative algorithm for the multiplication of a 2x2-b...
research
06/21/2018

Generic and Universal Parallel Matrix Summation with a Flexible Compression Goal for Xilinx FPGAs

Bit matrix compression is a highly relevant operation in computer arithm...
research
08/28/2020

Distributed-memory ℋ-matrix Algebra I: Data Distribution and Matrix-vector Multiplication

We introduce a data distribution scheme for ℋ-matrices and a distributed...
research
12/22/2016

An efficient hybrid tridiagonal divide-and-conquer algorithm on distributed memory architectures

In this paper, an efficient divide-and-conquer (DC) algorithm is propose...
research
03/09/2020

Software-Level Accuracy Using Stochastic Computing With Charge-Trap-Flash Based Weight Matrix

The in-memory computing paradigm with emerging memory devices has been r...
