Accelerating BLAS and LAPACK via Efficient Floating Point Architecture Design

10/27/2016
by   Farhad Merchant, et al.

Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) form the basic building blocks of several High Performance Computing (HPC) applications and hence dictate the performance of those applications. Performance in such tuned packages is attained by tuning several algorithmic and architectural parameters, such as the number of parallel operations in the Directed Acyclic Graph of the BLAS/LAPACK routines, the sizes of the memories in the memory hierarchy of the underlying platform, the memory bandwidth, and the structure of the compute resources in the underlying platform. In this paper, we closely investigate the impact of the Floating Point Unit (FPU) micro-architecture on performance tuning of BLAS and LAPACK. We present a theoretical analysis of pipeline depth for different floating point operations such as multiplication, addition, square root, and division, followed by a characterization of BLAS and LAPACK to determine several parameters required in the theoretical framework for deciding the optimum pipeline depth of the floating point operations. We then present a simple design of a Processing Element (PE) and show that the PE outperforms the most recent custom realizations of BLAS and LAPACK by 1.1X to 1.5X in Gflops/W and 1.9X to 2.1X in Gflops/mm^2.


research
09/24/2018

No Multiplication? No Floating Point? No Problem! Training Networks for Efficient Inference

For successful deployment of deep neural networks on highly--resource-co...
research
07/13/2022

Reduction of the Random Access Memory Size in Adjoint Algorithmic Differentiation by Overloading

Adjoint algorithmic differentiation by operator and function overloading...
research
11/21/2022

The AMD Rome Memory Barrier

With the rapid growth of AMD as a competitor in the CPU industry, it is ...
research
07/12/2019

Posit NPB: Assessing the Precision Improvement in HPC Scientific Applications

Floating-point operations can significantly impact the accuracy and perf...
research
12/14/2016

Efficient Realization of Householder Transform through Algorithm-Architecture Co-design for Acceleration of QR Factorization

We present efficient realization of Householder Transform (HT) based QR ...
research
10/20/2016

Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design

Basic Linear Algebra Subprograms (BLAS) play key role in high performanc...
research
01/28/2018

BOPS, Not FLOPS! A New Metric, Measuring Tool, and Roofline Performance Model For Datacenter Computing

The past decades witness FLOPS (Floating-point Operations per Second), a...
