MemPool: A Shared-L1 Memory Many-Core Cluster with a Low-Latency Interconnect

12/05/2020
by   Matheus Cavalcante, et al.
0

A key challenge in scaling shared-L1 multi-core clusters towards many-core (more than 16 cores) configurations is to ensure low-latency and efficient access to the L1 memory. In this work we demonstrate that it is possible to scale up the shared-L1 architecture: We present MemPool, a 32 bit many-core system with 256 fast RV32IMA "Snitch" cores featuring application-tunable execution units, running at 700 MHz in typical conditions (TT/0.80 V/25C). MemPool is easy to program, with all the cores sharing a global view of a large L1 scratchpad memory pool, accessible within at most 5 cycles. In MemPool's physical-aware design, we emphasized the exploration, design, and optimization of the low-latency processor-to-L1-memory interconnect. We compare three candidate topologies, analyzing them in terms of latency, throughput, and back-end feasibility. The chosen topology keeps the average latency at fewer than 6 cycles, even for a heavy injected load of 0.33 request/core/cycle. We also propose a lightweight addressing scheme that maps each core private data to a memory bank accessible within one cycle, which leads to performance gains of up to 20 is also highly efficient in terms of energy consumption since requests to local banks consume only half of the energy required to access remote banks. Our design achieves competitive performance with respect to an ideal, non-implementable full-crossbar baseline.

READ FULL TEXT

page 3

page 5

research
03/30/2023

MemPool: A Scalable Manycore Architecture with a Low-Latency Shared L1 Memory

Shared L1 memory clusters are a common architectural pattern (e.g., in G...
research
12/02/2021

MemPool-3D: Boosting Performance and Efficiency of Shared-L1 Memory Many-Core Clusters with 3D Integration

Three-dimensional integrated circuits promise power, performance, and fo...
research
07/26/2016

Uber: Utilizing Buffers to Simplify NoCs for Hundreds-Cores

Approaching ideal wire latency using a network-on-chip (NoC) is an impor...
research
07/16/2022

Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

While parallel architectures based on clusters of Processing Elements (P...
research
02/20/2019

JArena: Partitioned Shared Memory for NUMA-awareness in Multi-threaded Scientific Applications

The distributed shared memory (DSM) architecture is widely used in today...
research
10/17/2022

Efficient Parallelization of 5G-PUSCH on a Scalable RISC-V Many-core Processor

5G Radio access network disaggregation and softwarization pose challenge...

Please sign up or login with your details

Forgot password? Click here to reset