Fast Shared-Memory Barrier Synchronization for a 1024-Cores RISC-V Many-Core Cluster

07/17/2023
by Marco Bertuletti, et al.

Synchronization is likely the most critical performance killer in shared-memory parallel programs. With the rise of multi-core and many-core processors, the relative impact of synchronization on performance and energy overhead is bound to grow. This paper focuses on barrier synchronization for TeraPool, a cluster of 1024 RISC-V processors with non-uniform memory access to a tightly coupled 4 MB shared L1 data memory. We compare the synchronization strategies available in other multi-core and many-core clusters to identify the optimal native barrier kernel for TeraPool. We benchmark a set of optimized barrier implementations and evaluate their performance in the framework of the widespread fork-join OpenMP-style programming model. We test parallel kernels from the signal-processing and telecommunications domain, achieving less than 10% synchronization overhead over the total runtime for problems that fit TeraPool's L1 memory. By fine-tuning our tree barriers, we achieve a 1.6x speed-up with respect to a naive central counter barrier and just 6.2% overhead on a typical 5G application, including a challenging multistage synchronization kernel. To our knowledge, this is the first work where shared-memory barriers are used for the synchronization of a thousand processing elements tightly coupled to shared data memory.
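
To make the two barrier families named in the abstract concrete, the following is a minimal sketch in portable C11, not the paper's TeraPool implementation. It shows a textbook sense-reversing central counter barrier, where every core increments one shared counter, and a small helper that counts the levels of a radix-r arrival tree, which is the parameter a tree barrier would tune. The names `central_barrier_t`, `central_barrier_init`, `central_barrier_wait`, and `tree_levels` are hypothetical, and the use of `<stdatomic.h>` instead of RISC-V atomic instructions or hardware wake-up mechanisms is an assumption made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical type for illustration; not taken from the paper. */
typedef struct {
    atomic_uint count;   /* cores that have arrived in the current phase */
    atomic_bool sense;   /* flips each time the barrier completes */
    unsigned n_cores;    /* number of participating cores */
} central_barrier_t;

static void central_barrier_init(central_barrier_t *b, unsigned n_cores) {
    atomic_init(&b->count, 0);
    atomic_init(&b->sense, false);
    b->n_cores = n_cores;
}

/* Each core owns a private local_sense flag, initially false. */
static void central_barrier_wait(central_barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_add(&b->count, 1) == b->n_cores - 1) {
        /* Last arrival: reset the counter, then release the waiters. */
        atomic_store(&b->count, 0);
        atomic_store(&b->sense, *local_sense);
    } else {
        /* All other cores poll the shared flag until it flips. */
        while (atomic_load(&b->sense) != *local_sense) {
            /* spin */
        }
    }
}

/* Levels of a radix-r arrival tree over n cores, i.e. ceil(log_r(n)).
 * Assumes radix >= 2 and n small enough that span does not overflow. */
static unsigned tree_levels(unsigned n, unsigned radix) {
    unsigned levels = 0;
    for (unsigned span = 1; span < n; span *= radix) {
        levels++;
    }
    return levels;
}
```

In the central counter version all 1024 cores would contend on a single memory location. A tree barrier spreads arrivals over several counters, trading contention per counter against the number of levels: under this sketch, `tree_levels(1024, 2)` gives 10 levels and `tree_levels(1024, 32)` gives 2.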

