SLAP: A Split Latency Adaptive VLIW pipeline architecture which enables on-the-fly variable SIMD vector-length

02/26/2021
by Ashish Shrivastava et al.

Over the last decade, the relative latency of shared-memory access by multicore processors has increased as wire resistance came to dominate latency and low wire-density layouts pushed multiport memories farther from their ports. Various techniques have been deployed to improve average memory-access latency, such as speculative prefetching and branch prediction, but these often produce high variance in execution time, which is unacceptable in real-time systems. Smart DMAs can copy data directly into a layer-1 SRAM, but at a cost in overhead. The VLIW architecture, the de facto signal-processing engine, suffers badly from a breakdown in lockstep execution of scalar and vector instructions. We describe the Split Latency Adaptive Pipeline (SLAP) VLIW architecture, a cache-performance improvement technology that requires zero change to object code while removing smart DMAs and their overhead. SLAP builds on the decoupled access-execute concept by 1) breaking lockstep execution of functional units, 2) enabling variable vector length for variable data-level parallelism, and 3) adding a novel triangular load mechanism. We discuss the SLAP architecture and demonstrate its performance benefits on real traces from a wireless baseband system (where even the most compute-intensive functions suffer from an Amdahl's-law limitation due to a mixture of scalar and vector processing).


