Scalable Hierarchical Instruction Cache for Ultra-Low-Power Processors Clusters

09/04/2023
by Jie Chen, et al.

High performance and energy efficiency are critical requirements for Internet of Things (IoT) end-nodes. Exploiting tightly coupled clusters of programmable processors (CMPs) has recently emerged as a suitable solution to this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture, owing to its criticality in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultra-low-power tightly coupled processor clusters, in which a relatively large cache (L1.5) is shared by private L1 caches through a two-cycle-latency interconnect. To address the performance loss caused by L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of the performance and energy efficiency of instruction cache architectures for parallel ultra-low-power (ULP) clusters. Focusing on the implementation, our two-level instruction cache provides better scalability than existing shared caches, delivering up to 20% higher operating frequency. On average, the proposed two-level cache improves maximum performance by up to 17% compared to the state of the art while delivering similar energy efficiency for most relevant applications.
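
The interaction between private L1s, the shared L1.5, and next-line prefetching with cache probe filtering can be illustrated with a small functional model. This is a sketch under stated assumptions, not the authors' design: it assumes fully associative LRU caches, a hypothetical 16-byte line, and a next-line prefetch triggered on every L1 demand miss. It ignores timing entirely (including the two-cycle interconnect latency) and only counts requests that cross to the shared L1.5. With CPF enabled, the L1 tags are probed for the next line first and the prefetch is dropped if that line is already present. The names `LRUCache`, `HierarchicalICache`, `LINE_BYTES`, and `run` are hypothetical.

```python
from collections import OrderedDict

# Hypothetical functional model: sizes, replacement, and prefetch trigger are
# assumptions for illustration, not parameters taken from the paper.
LINE_BYTES = 16  # hypothetical instruction-cache line size


class LRUCache:
    """Fully associative cache with LRU replacement; tracks line addresses only."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()

    def probe(self, line):
        # Tag lookup without touching LRU state (used for probe filtering).
        return line in self.lines

    def access(self, line):
        """Return True on a hit; on a miss, allocate the line and return False."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        self.fill(line)
        return False

    def fill(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)
            return
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[line] = True


class HierarchicalICache:
    """Private L1 per core, one shared L1.5, next-line prefetch with optional CPF."""

    def __init__(self, num_cores, l1_lines, l15_lines, cpf=True):
        self.l1 = [LRUCache(l1_lines) for _ in range(num_cores)]
        self.l15 = LRUCache(l15_lines)
        self.cpf = cpf
        self.stats = {"l1_miss": 0, "l15_req": 0, "l15_miss": 0, "pf_filtered": 0}

    def _request_l15(self, line):
        # Every request here would cross the shared L1-to-L1.5 interconnect.
        self.stats["l15_req"] += 1
        if not self.l15.access(line):
            self.stats["l15_miss"] += 1  # would be refilled from the next level

    def fetch(self, core, pc):
        line = pc // LINE_BYTES
        l1 = self.l1[core]
        if l1.access(line):
            return
        self.stats["l1_miss"] += 1
        self._request_l15(line)  # demand request for the missing line
        nxt = line + 1
        if self.cpf and l1.probe(nxt):
            self.stats["pf_filtered"] += 1  # next line already in L1: drop prefetch
            return
        self._request_l15(nxt)  # next-line prefetch request
        l1.fill(nxt)


def run(trace, cpf, num_cores=2):
    """Replay a list of (core, pc) fetches and return the request counters."""
    cache = HierarchicalICache(num_cores, l1_lines=4, l15_lines=64, cpf=cpf)
    for core, pc in trace:
        cache.fetch(core, pc)
    return cache.stats


# Backward branch: fetch line 5, then line 4 (whose next line, 5, is already in L1).
trace = [(0, 80), (0, 64)]
print(run(trace, cpf=True))   # {'l1_miss': 2, 'l15_req': 3, 'l15_miss': 3, 'pf_filtered': 1}
print(run(trace, cpf=False))  # {'l1_miss': 2, 'l15_req': 4, 'l15_miss': 3, 'pf_filtered': 0}
```

In the backward-branch trace, the miss on line 4 finds line 5 already in the L1, so probe filtering saves one L1.5 request; without it, the redundant prefetch still crosses the interconnect even though it hits in the L1.5. Because the L1.5 is shared, a second core fetching the same code misses in its own private L1 but is served by lines the first core already brought into the L1.5.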

Related research

Energy-Efficient Hardware-Accelerated Synchronization for Shared-L1-Memory Multiprocessor Clusters (04/14/2020)
The steeply growing performance demands for highly power- and energy-con...

A near-threshold RISC-V core with DSP extensions for scalable IoT Endpoint Devices (08/30/2016)
Endpoint devices for Internet-of-Things not only need to work under extr...

Low-power Programmable Processor for Fast Fourier Transform Based on Transport Triggered Architecture (05/18/2019)
This paper describes a low-power processor tailored for fast Fourier tra...

System Performance with varying L1 Instruction and Data Cache Sizes: An Empirical Analysis (11/26/2019)
In this project, we investigate the fluctuations in performance caused b...

Galactos: Computing the Anisotropic 3-Point Correlation Function for 2 Billion Galaxies (08/31/2017)
The nature of dark energy and the complete theory of gravity are two cen...

TS Cache: A Fast Cache with Timing-speculation Mechanism Under Low Supply Voltages (04/25/2019)
To mitigate the ever-worsening Power Wall problem, more and more applica...

Performance-Aware Predictive-Model-Based On-Chip Body-Bias Regulation Strategy for an ULP Multi-Core Cluster in 28nm UTBB FD-SOI (07/27/2020)
The performance and reliability of Ultra-Low-Power (ULP) computing platf...
