Compact NUMA-Aware Locks

10/12/2018
by Dave Dice, et al.

Modern multi-socket architectures exhibit non-uniform memory access (NUMA) behavior, where access by a core to data cached locally on a socket is much faster than access to data cached on a remote socket. Prior work offers several efficient NUMA-aware locks that exploit this behavior by keeping lock ownership on the same socket, thus reducing remote cache misses and inter-socket communication. Virtually all of those locks, however, are hierarchical in nature and therefore require space proportional to the number of sockets. The increased memory cost renders NUMA-aware locks unsuitable for systems that are sensitive to the space requirements of their synchronization constructs, with the Linux kernel being the chief example. In this work, we present a compact NUMA-aware lock that requires only one word of memory, regardless of the number of sockets in the underlying machine. The new lock is a variant of the efficient (NUMA-oblivious) MCS lock and inherits its performance-critical features, such as local spinning and a single atomic instruction in the acquisition path. Unlike MCS, the new lock organizes waiting threads in two queues, one composed of threads running on the same socket as the current lock holder, and another composed of threads running on different sockets. We integrated the new lock into the Linux kernel's qspinlock, one of the major synchronization constructs in the kernel. Our evaluation using both user-space and kernel benchmarks shows that the new lock matches the single-thread performance of MCS but significantly outperforms it under contention, achieving performance similar to that of other state-of-the-art NUMA-aware locks that require substantially more space.
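
The two-queue idea can be illustrated with a small single-threaded sketch in C. It is not the authors' implementation: the names `struct waiter`, `struct queues`, and `pick_successor` are hypothetical, the queues are plain linked lists with no atomics, and the fallback policy (serve the secondary queue once no same-socket waiter remains) is an assumption that the abstract does not spell out. It only shows how a lock holder might choose which waiter receives the lock next.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical waiter node; field names are illustrative only. */
struct waiter {
    int socket;            /* socket the waiting thread runs on */
    struct waiter *next;   /* next waiter in the same queue */
};

/* The two queues described in the abstract, kept here as plain lists.
 * A real lock would keep only one word (the main-queue tail) in the
 * lock itself; this struct is a simplification for illustration. */
struct queues {
    struct waiter *main_head;  /* waiters in arrival order */
    struct waiter *sec_head;   /* waiters set aside (other sockets) */
    struct waiter *sec_tail;
};

/* Choose the next lock holder, preferring a waiter on holder_socket.
 * Waiters skipped during the scan move, in order, to the secondary
 * queue. Assumed policy: if no same-socket waiter exists, the secondary
 * queue is spliced back in front of the main queue and its head wins. */
struct waiter *pick_successor(struct queues *q, int holder_socket)
{
    assert(q != NULL);
    struct waiter *prev = NULL, *cur = q->main_head;

    /* Scan the main queue for the first waiter on the holder's socket. */
    while (cur != NULL && cur->socket != holder_socket) {
        prev = cur;
        cur = cur->next;
    }

    if (cur != NULL) {
        if (prev != NULL) {
            /* Detach the skipped prefix and append it to the secondary queue. */
            prev->next = NULL;
            if (q->sec_tail != NULL)
                q->sec_tail->next = q->main_head;
            else
                q->sec_head = q->main_head;
            q->sec_tail = prev;
        }
        q->main_head = cur->next;
        cur->next = NULL;
        return cur;
    }

    /* No same-socket waiter: put the secondary queue back in front. */
    if (q->sec_head != NULL) {
        q->sec_tail->next = q->main_head;
        q->main_head = q->sec_head;
        q->sec_head = q->sec_tail = NULL;
    }
    cur = q->main_head;
    if (cur != NULL) {
        q->main_head = cur->next;
        cur->next = NULL;
    }
    return cur;  /* NULL when nobody is waiting */
}
```

With waiters queued in the order socket 1, socket 0, socket 1, socket 0 and a holder on socket 0, repeated calls hand the lock to both socket-0 waiters first and only then to the socket-1 waiters that were set aside. A production lock would also need a fairness bound so that remote waiters are not starved; this sketch omits one.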

