Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments

04/02/2018
by Mehmet Deveci, et al.

Architectures with multiple classes of memory media are becoming a common part of mainstream supercomputer deployments. These so-called multi-level memories offer different characteristics for each memory component, including variation in bandwidth, latency, and capacity. This paper investigates the performance of sparse matrix multiplication kernels on two leading high-performance computing architectures: Intel's Knights Landing (KNL) processor and NVIDIA's Pascal GPU. We describe a data placement method and a chunking-based algorithm for our kernels that exploit the multiple memory spaces in each hardware platform. We evaluate the performance of these methods against standard algorithms that rely on the auto-caching mechanisms. Our results show that standard algorithms that exploit cache reuse perform as well as multi-memory-aware algorithms on architectures such as KNLs, where the memory subsystems have similar latencies. However, on architectures such as GPUs, where the memory subsystems differ significantly in both bandwidth and latency, multi-memory-aware methods are crucial for good performance. In addition, our new approaches permit the user to run problems that require more capacity than the fastest memory of each compute node provides, without depending on the software-managed cache mechanisms.
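To make the idea of chunking concrete, the following is a minimal sketch in Python, not the authors' algorithm or data layout. It assumes that only a fixed number of rows of B fit in fast memory at once; the function name `spgemm_chunked`, the dict-per-row matrix format, and the `chunk_rows` parameter are all hypothetical choices made for illustration. Each chunk of B is "staged" (here, just sliced) and every row of A is multiplied against it, accumulating partial results into C.

```python
def spgemm_chunked(a_rows, b_rows, chunk_rows):
    """Compute C = A * B for sparse matrices, streaming B in row chunks.

    a_rows, b_rows: lists with one dict per row, mapping column -> value.
    chunk_rows: number of B rows assumed to fit in fast memory
                (hypothetical parameter; must be >= 1).
    """
    c_rows = [dict() for _ in a_rows]
    for start in range(0, len(b_rows), chunk_rows):
        stop = min(start + chunk_rows, len(b_rows))
        # Stand-in for copying one chunk of B into the fast memory space.
        fast_b = b_rows[start:stop]
        for i, a_row in enumerate(a_rows):
            acc = c_rows[i]  # partial result row, accumulated across chunks
            for k, a_ik in a_row.items():
                # Only entries of A whose column falls in the staged chunk
                # can be multiplied in this pass.
                if start <= k < stop:
                    for j, b_kj in fast_b[k - start].items():
                        acc[j] = acc.get(j, 0.0) + a_ik * b_kj
    return c_rows


# A is 2x3 and B is 3x2, both stored row by row.
A = [{0: 1.0, 2: 2.0}, {1: 3.0}]
B = [{1: 1.0}, {0: 4.0}, {0: 5.0, 1: 6.0}]
C = spgemm_chunked(A, B, chunk_rows=2)
```

In this sketch the result is the same for any valid chunk size; only the number of passes over A changes. That trade-off (more passes in exchange for a smaller fast-memory footprint) is the general motivation for chunking, though a real implementation would also chunk A and C and overlap copies with computation.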

research
01/09/2018

Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures

Sparse Matrix-Matrix multiplication is a key kernel that has application...
research
02/26/2020

Bandwidth-Optimized Parallel Algorithms for Sparse Matrix-Matrix Multiplication using Propagation Blocking

Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in ...
research
09/29/2020

Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX

The A64FX CPU powers the current number one supercomputer on the Top500 ...
research
09/07/2018

A Microbenchmark Characterization of the Emu Chick

The Emu Chick is a prototype system designed around the concept of migra...
research
03/04/2021

ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX

The A64FX CPU is arguably the most powerful Arm-based processor design t...
research
07/06/2019

Optimizing Xeon Phi for Interactive Data Analysis

The Intel Xeon Phi manycore processor is designed to provide high perfor...
research
05/03/2023

FT-GEMM: A Fault Tolerant High Performance GEMM Implementation on x86 CPUs

General matrix/matrix multiplication (GEMM) is crucial for scientific co...