Energy-Efficient Hardware-Accelerated Synchronization for Shared-L1-Memory Multiprocessor Clusters

04/14/2020
by Florian Glaser, et al.

The steeply growing performance demands on highly power- and energy-constrained processing systems, such as end-nodes of the internet-of-things (IoT), have led to parallel near-threshold computing (NTC), which joins the energy-efficiency benefits of low-voltage operation with the performance typical of parallel systems. Shared-L1-memory multiprocessor clusters are a promising architecture, delivering performance on the order of GOPS and an energy efficiency of over 100 GOPS/W. However, this level of computational efficiency can only be reached by maximizing the effective utilization of the processing elements (PEs) available in the clusters. As part of this effort, optimizing PE-to-PE synchronization and communication is critical for performance. In this work, we describe a lightweight hardware-accelerated synchronization and communication unit (SCU) for tightly-coupled clusters of processors. We detail the architecture, which enables fine-grain per-PE power management, and its integration into an eight-core cluster of RISC-V processors. To validate the effectiveness of the proposed solution, we implemented the eight-core cluster in advanced 22nm FDX technology and evaluated performance and energy efficiency with tunable microbenchmarks and a set of real-life applications and kernels. When the microbenchmarks are constrained to 10% synchronization overhead, the proposed solution allows synchronization-free regions as small as 42 cycles, over 41 times smaller than the baseline implementation based on fast test-and-set access to L1 memory. When evaluated on the real-life DSP applications, the proposed SCU improves performance by up to 92% and energy efficiency by up to 98%.
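The relation between synchronization cost, overhead budget, and the smallest usable synchronization-free region (SFR) can be illustrated with a short sketch. It is not taken from the paper: the overhead definition used here (synchronization cycles divided by total cycles) is an assumption, and the function names `overhead` and `min_sfr_cycles` are hypothetical. Only the figures 42 cycles and the factor of 41 come from the abstract.

```python
import math


def overhead(sync_cycles: float, sfr_cycles: float) -> float:
    """Fraction of total cycles spent synchronizing.

    ASSUMPTION: overhead = sync / (sync + sfr). The paper may define
    overhead differently (for example relative to the SFR alone).
    """
    return sync_cycles / (sync_cycles + sfr_cycles)


def min_sfr_cycles(sync_cycles: float, max_overhead: float) -> float:
    """Smallest SFR that keeps the assumed overhead at or below max_overhead."""
    if sync_cycles < 0:
        raise ValueError("sync_cycles must be non-negative")
    if not 0.0 < max_overhead < 1.0:
        raise ValueError("max_overhead must lie strictly between 0 and 1")
    # Solve sync / (sync + sfr) <= max_overhead for sfr.
    return sync_cycles * (1.0 - max_overhead) / max_overhead


# Figures quoted in the abstract.
SCU_MIN_SFR_CYCLES = 42   # smallest SFR with the SCU at 10% overhead
BASELINE_FACTOR = 41      # the baseline SFR is "over 41 times" larger

# Lower bound on the baseline's smallest SFR implied by those two figures.
baseline_sfr_lower_bound = SCU_MIN_SFR_CYCLES * BASELINE_FACTOR  # 1722 cycles

if __name__ == "__main__":
    # Hypothetical synchronization cost, chosen only for illustration.
    example_sync_cycles = 10
    print(min_sfr_cycles(example_sync_cycles, 0.10))
    print(baseline_sfr_lower_bound)
```

Under this assumed model, the minimum SFR scales linearly with the cost of one synchronization, which is why a cheaper synchronization primitive permits proportionally finer-grained parallel regions at the same overhead budget.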

research
07/17/2023

Fast Shared-Memory Barrier Synchronization for a 1024-Cores RISC-V Many-Core Cluster

Synchronization is likely the most critical performance killer in shared...
research
09/04/2023

Scalable Hierarchical Instruction Cache for Ultra-Low-Power Processors Clusters

High Performance and Energy Efficiency are critical requirements for Int...
research
04/05/2020

Efficient Task Mapping for Manycore Systems

System-on-chip (SoC) has migrated from single core to manycore architect...
research
12/02/2021

MemPool-3D: Boosting Performance and Efficiency of Shared-L1 Memory Many-Core Clusters with 3D Integration

Three-dimensional integrated circuits promise power, performance, and fo...
research
09/02/2022

Soft Tiles: Capturing Physical Implementation Flexibility for Tightly-Coupled Parallel Processing Clusters

Modern high-performance computing architectures (Multicore, GPU, Manycor...
research
01/19/2021

SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures

Near-Data-Processing (NDP) architectures present a promising way to alle...
research
06/10/2019

Transport Triggered Array Processor for Vision Applications

Low-level sensory data processing in many Internet-of-Things (IoT) devic...
