Exploiting Fine-Grain Ordered Parallelism in Dense Matrix Algorithms

05/09/2019
by Jian Weng, et al.

Dense linear algebra kernels are critical for wireless applications, and the oncoming proliferation of 5G only amplifies their importance. Many such matrix algorithms are inductive and exhibit ample fine-grain ordered parallelism, which arises when multiple computations flow with fine-grain producer/consumer dependences and the iteration domain is not easily tileable. Synchronization overheads make multi-core parallelism ineffective, and the non-tileable iterations make the vector-VLIW approach less effective, especially for the typically modest-sized matrices. Because CPUs and DSPs lose an order of magnitude in performance and hardware utilization, costly and inflexible ASICs are often employed in signal processing pipelines. A programmable accelerator with similar performance/power/area would be highly desirable. We find that fine-grain ordered parallelism can be exploited by supporting: 1. fine-grain stream-based communication/synchronization; 2. inductive data-reuse and memory access patterns; 3. implicit vector-masking for partial vectors; 4. hardware specialization of dataflow criticality. In this work, we propose REVEL as a next-generation DSP architecture. It supports the above features in its ISA and microarchitecture, and further uses a novel vector-stream control paradigm to reduce control overheads. Across a suite of linear algebra kernels, REVEL outperforms equally provisioned DSPs by 4.6x-37x in latency and achieves 8.3x the performance per mm². It requires only 2.2x higher power to achieve the same performance as ideal ASICs, at about 55
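To make the terms "inductive" and "partial vectors" concrete, the following is a minimal software sketch using forward substitution (a triangular solve) as a stand-in kernel. It is not REVEL's ISA or the authors' implementation; the function name `forward_substitution`, the constant `VEC_WIDTH`, and the explicit `mask` list are hypothetical and chosen only for illustration. The sketch shows a producer/consumer dependence (each `x[j]` is produced and then immediately consumed by the remaining rows), an inner trip count that shrinks by one per outer iteration, and a final vector chunk that is only partially filled.

```python
VEC_WIDTH = 4  # hypothetical vector width, for illustration only


def forward_substitution(L, b):
    """Solve L x = b for a lower-triangular matrix L (list of lists)."""
    n = len(b)
    b = list(b)            # work on a copy so the caller's data is untouched
    x = [0.0] * n
    trip_counts = []       # inner-loop length at each outer iteration

    for j in range(n):
        # Producer: x[j] depends on every earlier update applied to b[j].
        x[j] = b[j] / L[j][j]

        # Consumer: the rows below j use x[j]. The region shrinks by one
        # element per outer iteration, which is the inductive pattern.
        trip_counts.append(n - (j + 1))

        for base in range(j + 1, n, VEC_WIDTH):
            # Lanes past the end of the row range are masked off, so the
            # last chunk is usually a partial vector.
            mask = [base + lane < n for lane in range(VEC_WIDTH)]
            for lane in range(VEC_WIDTH):
                if mask[lane]:
                    i = base + lane
                    b[i] -= L[i][j] * x[j]

    return x, trip_counts


if __name__ == "__main__":
    L = [[2.0, 0.0, 0.0],
         [1.0, 3.0, 0.0],
         [4.0, 5.0, 6.0]]
    b = [2.0, 7.0, 32.0]
    x, trips = forward_substitution(L, b)
    print(x, trips)
```

Because the inner trip count changes on every outer iteration, the work does not divide into equal-sized tiles, and the masked final chunk rarely fills a whole vector for modest matrix sizes. In this sketch the masking is done explicitly in software, whereas the abstract describes it as implicit in hardware.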

research
11/16/2020

Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra

Sparse-dense linear algebra is crucial in many domains, but challenging ...
research
04/09/2021

Stream Processing With Dependency-Guided Synchronization

Real-time data processing applications with low latency requirements hav...
research
04/27/2023

Co-Design of the Dense Linear Algebra Software Stack for Multicore Processors

This paper advocates for an intertwined design of the dense linear algeb...
research
03/30/2018

Scaling Ordered Stream Processing on Shared-Memory Multicores

Many modern applications require real-time processing of large volumes o...
research
01/23/2023

Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous Parallelism

The demise of Moore's Law and Dennard Scaling has revived interest in sp...
research
10/14/2017

High Throughput 2D Spatial Image Filters on FPGAs

FPGAs are well established in the signal processing domain, where their ...
research
12/14/2016

Efficient Realization of Householder Transform through Algorithm-Architecture Co-design for Acceleration of QR Factorization

We present efficient realization of Householder Transform (HT) based QR ...
