Spatz: Clustering Compact RISC-V-Based Vector Units to Maximize Computing Efficiency

09/18/2023
by Matheus Cavalcante, et al.

The ever-increasing computational and storage requirements of modern applications and the slowdown of technology scaling pose major challenges to designing and implementing efficient computer architectures. In this paper, we leverage the architectural balance principle to alleviate the bandwidth bottleneck at the L1 data memory boundary of a tightly-coupled cluster of processing elements (PEs). We thus explore coupling each PE with an L0 memory, namely a private register file implemented as Standard Cell Memory (SCM). Architecturally, the SCM is the Vector Register File (VRF) of Spatz, a compact 64-bit floating-point-capable vector processor based on RISC-V's Vector Extension Zve64d. Unlike typical vector processors, whose VRFs are hundreds of KiB large, we prove that Spatz can achieve peak energy efficiency with a VRF of only 2 KiB. An implementation of the Spatz-based cluster in GlobalFoundries' 12LPP process with eight double-precision Floating Point Units (FPUs) achieves an FPU utilization just 3.4% [...] double-precision, floating-point matrix multiplication. The cluster reaches 7.7 FMA/cycle, corresponding to 15.7 GFLOPS-DP and 95.7 GFLOPS-DP/W at 1 GHz and nominal operating conditions (TT, 0.80 V, 25 °C) with more than 55% [...] spent on the FPUs. Furthermore, the optimally-balanced Spatz-based cluster reaches a 95.0% [...] GFLOPS-DP/W (61% [...] kernel, resulting in an outstanding area/energy efficiency of 171 GFLOPS-DP/W/mm^2. At equi-area, our computing cluster built upon compact vector processors reaches a 30% [...] FPU count built upon scalar cores specialized for stream-based floating-point computation.
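A few of the figures quoted in the abstract can be related to each other with simple arithmetic. The sketch below is only a back-of-the-envelope check, not the authors' methodology. It assumes 32 vector registers per VRF (the architectural register count of the RISC-V vector extension) and a peak of one FMA per FPU per cycle. Neither assumption is stated in the abstract, and the function names are hypothetical.

```python
# Back-of-the-envelope checks on figures quoted in the abstract.
# Assumptions (not stated in the abstract):
#   - the VRF holds 32 vector registers (RVV architectural register count)
#   - each FPU can retire at most one FMA per cycle

NUM_VECTOR_REGISTERS = 32  # assumed register count per VRF


def vrf_register_bits(vrf_bytes: int, num_registers: int = NUM_VECTOR_REGISTERS) -> int:
    """Bits per vector register if the VRF is split evenly across registers."""
    return vrf_bytes * 8 // num_registers


def fpu_utilization(fma_per_cycle: float, num_fpus: int) -> float:
    """Fraction of the assumed peak (one FMA per FPU per cycle) that is achieved."""
    return fma_per_cycle / num_fpus


def implied_power_watts(gflops: float, gflops_per_watt: float) -> float:
    """Power implied by a throughput figure and an energy-efficiency figure."""
    return gflops / gflops_per_watt


if __name__ == "__main__":
    bits = vrf_register_bits(2 * 1024)  # 2 KiB VRF
    print(f"bits per vector register: {bits}")
    print(f"64-bit elements per register: {bits // 64}")
    print(f"utilization at 7.7 FMA/cycle on 8 FPUs: {fpu_utilization(7.7, 8):.4f}")
    print(f"implied power: {implied_power_watts(15.7, 95.7):.3f} W")
```

Under these assumptions, a 2 KiB VRF leaves room for only a handful of double-precision elements per vector register, which illustrates how small it is next to the hundreds of KiB the abstract attributes to typical vector processors.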

research
06/02/2019

Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multi-Precision Floating Point Support in 22 nm FD-SOI

In this paper, we present Ara, a 64-bit vector processor based on the ve...
research
07/16/2022

Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

While parallel architectures based on clusters of Processing Elements (P...
research
02/24/2020

Snitch: A 10 kGE Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads

Data-parallel applications, such as data analytics, machine learning, an...
research
04/24/2022

RedMulE: A Compact FP16 Matrix-Multiplication Accelerator for Adaptive Deep Learning on RISC-V-Based Ultra-Low-Power SoCs

The fast proliferation of extreme-edge applications using Deep Learning ...
research
10/15/2020

FPRaker: A Processing Element For Accelerating Neural Network Training

We present FPRaker, a processing element for composing training accelera...
research
08/27/2020

A transprecision floating-point cluster for efficient near-sensor data analytics

Recent applications in the domain of near-sensor computing require the a...
research
01/09/2023

Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU

We introduce Stream-K, a work-centric parallelization of matrix multipli...
