MemPool: A Scalable Manycore Architecture with a Low-Latency Shared L1 Memory

by   Samuel Riedel, et al.

Shared L1 memory clusters are a common architectural pattern (e.g., in GPGPUs) for building efficient and flexible multi-processing-element (PE) engines. However, it is a common belief that these tightly-coupled clusters would not scale beyond a few tens of PEs. In this work, we tackle scaling shared L1 clusters to hundreds of PEs while supporting a flexible and productive programming model and maintaining high efficiency. We present MemPool, a manycore system with 256 RV32IMAXpulpimg "Snitch" cores featuring application-tunable functional units. We designed and implemented an efficient low-latency PE to L1-memory interconnect, an optimized instruction path to ensure each PE's independent execution, and a powerful DMA engine and system interconnect to stream data in and out. MemPool is easy to program, with all the cores sharing a global view of a large, multi-banked, L1 scratchpad memory, accessible within at most five cycles in the absence of conflicts. We provide multiple runtimes to program MemPool at different abstraction levels and illustrate its versatility with a wide set of applications. MemPool runs at 600 MHz (60 gate delays) in typical conditions (TT/0.80V/25C) in 22 nm FDX technology and achieves a performance of up to 229 GOPS or 192 GOPS/W with less than 2


page 6

page 9

page 10

page 11

page 12

page 14


MemPool: A Shared-L1 Memory Many-Core Cluster with a Low-Latency Interconnect

A key challenge in scaling shared-L1 multi-core clusters towards many-co...

Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

While parallel architectures based on clusters of Processing Elements (P...

MemPool-3D: Boosting Performance and Efficiency of Shared-L1 Memory Many-Core Clusters with 3D Integration

Three-dimensional integrated circuits promise power, performance, and fo...

Soft Tiles: Capturing Physical Implementation Flexibility for Tightly-Coupled Parallel Processing Clusters

Modern high-performance computing architectures (Multicore, GPU, Manycor...

A near-threshold RISC-V core with DSP extensions for scalable IoT Endpoint Devices

Endpoint devices for Internet-of-Things not only need to work under extr...

JArena: Partitioned Shared Memory for NUMA-awareness in Multi-threaded Scientific Applications

The distributed shared memory (DSM) architecture is widely used in today...

SHIP: A Scalable High-performance IPv6 Lookup Algorithm that Exploits Prefix Characteristics

Due to the emergence of new network applications, current IP lookup engi...

Please sign up or login with your details

Forgot password? Click here to reset