GEMMbench: a framework for reproducible and collaborative benchmarking of matrix multiplication

11/12/2015
by Anton Lokhmotov, et al.

The generic matrix-matrix multiplication (GEMM) is arguably the most popular computational kernel of the 20th century. Yet, surprisingly, no common methodology for evaluating GEMM performance has been established over the many decades of using GEMM to compare architectures, compilers and ninja-class programmers. We introduce GEMMbench, a framework and methodology for evaluating the performance of GEMM implementations. GEMMbench is implemented on top of Collective Knowledge (CK), a lightweight framework for reproducible and collaborative R&D in computer systems. Using CK allows the R&D community to crowdsource hand-written and compiler-generated GEMM implementations and to study their performance across multiple platforms, data sizes and data types. Our initial implementation supports hand-written OpenCL kernels operating on matrices of single- and double-precision floating-point values, producing single or multiple output elements per work-item (via thread coarsening and vectorization).
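For reference, the GEMM operation being benchmarked computes C ← αAB + βC. A minimal NumPy sketch of this definition (an illustration only, not part of GEMMbench; the framework's optimized OpenCL kernels compute the same result via tiling, thread coarsening and vectorization):

```python
import numpy as np

def gemm(alpha, A, B, beta, C):
    """Reference GEMM: returns alpha * (A @ B) + beta * C.

    Optimized kernels (such as the hand-written OpenCL kernels
    crowdsourced via GEMMbench) must produce this same result,
    regardless of how many output elements each work-item computes.
    """
    return alpha * (A @ B) + beta * C

# Small example with random single-precision inputs.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3)).astype(np.float32)
B = rng.standard_normal((3, 5)).astype(np.float32)
C = rng.standard_normal((4, 5)).astype(np.float32)
out = gemm(2.0, A, B, 0.5, C)
```

A reference implementation like this is also how a benchmarking harness typically validates an optimized kernel's output, by comparing against it within a floating-point tolerance.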

Related research

11/18/2019: General Matrix-Matrix Multiplication Using SIMD features of the PIII
  Generalised matrix-matrix multiplication forms the kernel of many mathem...

10/05/2017: Tuning Technique for Multiple Precision Dense Matrix Multiplication using Prediction of Computational Time
  Although reliable long precision floating-point arithmetic libraries suc...

01/17/2021: Acceleration of multiple precision matrix multiplication based on multi-component floating-point arithmetic using AVX2
  In this paper, we report the results obtained from the acceleration of m...

10/29/2015: Performance evaluation of multiple precision matrix multiplications using parallelized Strassen and Winograd algorithms
  It is well known that Strassen and Winograd algorithms can reduce the co...

05/12/2023: AMULET: Adaptive Matrix-Multiplication-Like Tasks
  Many useful tasks in data science and machine learning applications can ...

01/09/2023: Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU
  We introduce Stream-K, a work-centric parallelization of matrix multipli...

03/19/2017: CLTune: A Generic Auto-Tuner for OpenCL Kernels
  This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluate...
