Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

07/16/2022
by Matheus Cavalcante, et al.

While parallel architectures based on clusters of Processing Elements (PEs) sharing L1 memory are widespread, there is no consensus on how lean their PEs should be. Architecting PEs as vector processors holds the promise to greatly reduce their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck (VNB). However, due to their historical association with supercomputers, classical vector machines include micro-architectural tricks to improve Instruction Level Parallelism (ILP), which increase their instruction fetch and decode energy overhead. In this paper, we explore for the first time vector processing as an option to build small and efficient PEs for large-scale shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector processing unit based on the integer embedded subset of the RISC-V Vector Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores. We analyzed Spatz's performance by integrating it within MemPool, a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system achieves up to 285 GOPS when running a 256x256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system. In terms of energy efficiency, the Spatz-based MemPool system achieves up to 266 GOPS/W when running the same kernel, more than twice the energy efficiency of the Snitch-based MemPool system, which reaches 128 GOPS/W. These results show the viability of lean vector processors as high-performance and energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory.
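
As a quick sanity check on the figures quoted in the abstract, the following minimal Python sketch recomputes the efficiency ratio behind the "more than twice" claim (266 GOPS/W against 128 GOPS/W) and counts the multiply-accumulate operations in a 256x256 matrix multiplication. The function names are hypothetical and chosen only for illustration. The energy estimate assumes that the 7.9 pJ per MAC figure, which the abstract reports for the four-MACU cluster, applies unchanged to every MAC of the kernel; the abstract does not state this, so treat that number as a rough illustration rather than a reported result.

```python
# Figures quoted in the abstract.
SPATZ_GOPS_PER_W = 266.0    # Spatz-based MemPool, 256x256 int32 matmul
SNITCH_GOPS_PER_W = 128.0   # Snitch-based MemPool, same kernel
ENERGY_PER_MAC_PJ = 7.9     # Spatz-based cluster with four MACUs


def efficiency_ratio(a_gops_per_w: float, b_gops_per_w: float) -> float:
    """Return how many times more energy-efficient system A is than system B."""
    return a_gops_per_w / b_gops_per_w


def matmul_macs(n: int) -> int:
    """Number of multiply-accumulates in a dense n x n by n x n matmul."""
    return n ** 3


def matmul_energy_uj(n: int, pj_per_mac: float) -> float:
    """Illustrative energy in microjoules.

    ASSUMPTION: every MAC costs exactly pj_per_mac, with no other overhead.
    """
    return matmul_macs(n) * pj_per_mac * 1e-6  # 1 pJ = 1e-6 uJ


if __name__ == "__main__":
    ratio = efficiency_ratio(SPATZ_GOPS_PER_W, SNITCH_GOPS_PER_W)
    print(f"Efficiency ratio: {ratio:.2f}x")
    print(f"MACs in a 256x256 matmul: {matmul_macs(256)}")
    print(f"Illustrative energy: {matmul_energy_uj(256, ENERGY_PER_MAC_PJ):.2f} uJ")
```

The ratio comes out slightly above 2, which is consistent with the abstract's wording.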

research
09/18/2023

Spatz: Clustering Compact RISC-V-Based Vector Units to Maximize Computing Efficiency

The ever-increasing computational and storage requirements of modern app...
research
09/02/2022

Soft Tiles: Capturing Physical Implementation Flexibility for Tightly-Coupled Parallel Processing Clusters

Modern high-performance computing architectures (Multicore, GPU, Manycor...
research
03/30/2023

MemPool: A Scalable Manycore Architecture with a Low-Latency Shared L1 Memory

Shared L1 memory clusters are a common architectural pattern (e.g., in G...
research
07/17/2023

Fast Shared-Memory Barrier Synchronization for a 1024-Cores RISC-V Many-Core Cluster

Synchronization is likely the most critical performance killer in shared...
research
12/02/2021

MemPool-3D: Boosting Performance and Efficiency of Shared-L1 Memory Many-Core Clusters with 3D Integration

Three-dimensional integrated circuits promise power, performance, and fo...
research
12/05/2020

MemPool: A Shared-L1 Memory Many-Core Cluster with a Low-Latency Interconnect

A key challenge in scaling shared-L1 multi-core clusters towards many-co...
research
11/29/2017

Energy-Efficient Time-Domain Vector-by-Matrix Multiplier for Neurocomputing and Beyond

We propose an extremely energy-efficient mixed-signal approach for perfo...
