Asymmetry-aware Scalable Locking

08/07/2021
by   Nian Liu, et al.

The pursuit of power-efficiency is popularizing asymmetric multicore processors (AMP) such as ARM big.LITTLE, Apple M1 and the recent Intel Alder Lake, which combine big and little cores. However, we find that existing scalable locks fail to scale on AMP and cause collapses in throughput, latency, or both, because their implicit assumption of symmetric cores no longer holds. To address this issue, we propose the first asymmetry-aware scalable lock named LibASL. LibASL provides a new lock ordering guided by applications' latency requirements, which allows big cores to reorder with little cores for higher throughput under the condition of preserving applications' latency requirements. Using LibASL only requires linking the applications with it and, if latency-critical, inserting a few lines of code to annotate the coarse-grained latency requirement. We evaluate LibASL in various benchmarks including five popular databases on Apple M1. Evaluation results show that LibASL can improve the throughput by up to 5 times while precisely preserving the tail latency designated by applications.



1. Introduction

Single-ISA asymmetric multicore processors (AMP) combine cores of different computing capacities in one processor (Kumar et al., 2003, 2004) and have been widely used in mobile devices (e.g., ARM big.LITTLE (48)). By combining faster big cores and slower little cores, AMP is more flexible in accommodating both performance-oriented and energy-efficiency-oriented scenarios, such as leveraging all cores to achieve peak performance and using little cores only when saving energy is preferred. There is also a recent trend to embrace such an architecture in more general CPU processors, including the desktop and the edge server (46; 5; 31). As before, applications on AMP need to use locks to acquire exclusive access to shared data. However, we observe that existing locks, including those scalable on symmetric multicore processors (SMP) (Boyd-Wickizer et al., 2012; Dice, 2017) or non-uniform memory access (NUMA) systems (Dice and Kogan, 2019; Kashyap et al., 2019; Luchangco et al., 2006; Dice et al., 2011, 2012; Kashyap et al., 2017; Radovic and Hagersten, 2003; Chabbi et al., 2015), fail to scale on AMP and cause collapses in throughput, latency, or both.

(a) Throughput collapse.
(b) Latency collapse.
Figure 1. Performance collapses on Apple M1 when locks are heavily contended. M1 has 4 big and 4 little cores. The first 4 threads are bound to different big cores. Others are bound to different little cores.
Figure 2. LibASL overview.

After an in-depth analysis, we find the main reason is that those locks (implicitly) assume symmetric cores, which does not hold on AMP. On the one hand, locks that preserve lock acquisition fairness (i.e., give all cores an equal chance to lock), either short-term (e.g., the MCS lock (Mellor-Crummey and Scott, 1991) passes the lock in FIFO order) or long-term (e.g., NUMA-aware locks (Dice and Kogan, 2019; Kashyap et al., 2019; Luchangco et al., 2006; Dice et al., 2011, 2012; Kashyap et al., 2017; Radovic and Hagersten, 2003; Chabbi et al., 2015) ensure an equal chance over a period), assume symmetric computing capacity. Therefore, on AMP, they give the slower little cores the same chance as the big cores to hold the lock, which exposes the longer execution time of critical sections on little cores on the critical path and causes throughput collapse. On the other hand, locks that do not preserve acquisition fairness rely on atomic operations to decide the lock holder (e.g., the test-and-set spinlock). They assume a symmetric success rate for atomic operations executed simultaneously, which is also asymmetric on AMP. Thus, those locks are likely to be passed only among one type of core (i.e., either big cores or little cores), which causes latency collapse, or even starvation, for the other type. Moreover, when the slower little cores have a higher chance to lock, the throughput also collapses due to the longer execution time of the critical sections on them. Figure 1 shows the performance collapses on Apple M1. Both the fair MCS lock and the unfair TAS (test-and-set) spinlock face throughput collapse when scaling to little cores. Besides, the TAS spinlock's latency also collapses and is 3.7x longer than the MCS lock's.

Facing the asymmetry in AMP, it is non-trivial to decide the lock ordering (who locks first) for both high throughput and low latency. Binding threads only to big cores is an intuitive choice. However, using little cores can achieve higher throughput under lower contention. Besides, binding may violate the energy target, as the energy-aware scheduler (22) could schedule threads to little cores to save energy. Another intuitive approach is to give big cores a fixed higher chance to lock. However, the throughput and the latency are mutually competing on AMP. Thus, it is hard to find a static proportion that meets the application's latency requirement and improves the throughput at the same time.

In this paper, we propose an asymmetry-aware lock named LibASL, as shown in Figure 2. Rather than ensuring the lock acquisition fairness that causes collapses on AMP, LibASL provides a new lock ordering guided directly by applications' latency requirements to achieve better throughput. Atop a FIFO waiting queue, LibASL allows big cores to reorder with (lock before) little cores for higher throughput, under the condition that the victim (the competitor being reordered) will not miss the application's latency target. To achieve such an ordering, we first design a reorderable lock, which exposes the reorder capability as a configurable time window; big cores can reorder with little cores only during that window. Atop the reorderable lock, LibASL automatically chooses a suitable (fine-grained) reorder window on each lock acquisition according to the application's coarse-grained latency requirement through a feedback mechanism. LibASL provides intuitive interfaces for developers to specify coarse-grained latency requirements (e.g., for a request handling procedure) in the form of a latency SLO (service level objective, e.g., 99% of requests should complete within 50ms), which is widely adopted by both academia (Zhu et al., 2017; Lo et al., 2014; Hao et al., 2017; Wang et al., 2012) and industry (Dean and Barroso, 2013; DeCandia et al., 2007; 27). To use LibASL, annotating the SLO is the only required effort for latency-critical applications, and the SLO is already clearly defined by the application in most cases. Non-latency-critical applications can benefit from LibASL without modifications.

We evaluate LibASL in multiple benchmarks, including five popular databases, on Apple M1, currently the only off-the-shelf desktop AMP. Results show that LibASL improves the throughput of pthread_mutex_lock by up to 5x (3.8x over the MCS lock, 2.5x over the TAS spinlock) while precisely maintaining the tail latency even in highly variable workloads.

In summary, this paper makes the following contributions:

  • The first in-depth analysis of the performance collapses of existing locks on AMP.

  • An asymmetry-aware scalable lock LibASL, which provides a new latency-SLO-guided lock ordering to achieve the best throughput the SLO allows on AMP.

  • A thorough evaluation on the real desktop AMP (Apple M1) and real-world applications that confirms the effectiveness of LibASL.

2. Scalable Locking is Non-scalable on AMP

2.1. Asymmetric Multicore Processor

In this section, we introduce several major features of AMP.

First, the asymmetry in AMP is inherent. Performance can also be asymmetric in SMP when using dynamic voltage and frequency scaling (DVFS); however, an SMP can boost the frequency of the lagging core (Wamhoff et al., 2014; Cebrian et al., 2013; Akram et al., 2016), while AMP cannot.

Second, asymmetric cores in recent AMP (1; 6; 46) are placed in a single cluster and share the same Last Level Cache (LLC). Thus, communication among cores in AMP is similar to SMP rather than NUMA.

Third, the scheduler (e.g., the energy-aware scheduler in Linux (22)) can place different threads across asymmetric cores (Fan and Lee, 2016; Jeff, 2013). Multi-threaded applications achieve better performance by leveraging all asymmetric cores (11) and need to use locks for synchronization among them as before.

2.2. A Study of Existing Scalable Locks in AMP

Scalable locking in SMP and NUMA has been extensively studied. However, existing scalable locks are non-scalable on AMP and encounter performance collapses. The main reason is that existing locks (implicitly) assume symmetric cores, which does not hold in AMP. There are two major differences between AMP and SMP that cause the collapses.

First, the computing capacity is asymmetric. Little cores spend a longer time executing the same critical section. Thus, when a lock preserves acquisition fairness, it gives little cores an equal chance to hold the lock, which exposes the longer execution time of critical sections on them on the critical path and causes a throughput collapse.

Figure 3. An example timeline (from left to right) when the lock is highly contended. Cores 0/1 are big cores; cores 2/3 are little cores. More critical sections executed means higher throughput; longer waiting time means longer latency.

We explain this problem through an example in Figure 3(a). The system includes two big cores (cores 0/1) and two little cores (cores 2/3), all competing intensively for the same lock. We divide the program's execution into three parts: executing the critical section, executing the non-critical section, and waiting for the lock. As shown in Figure 3(a), when ensuring short-term (i.e., FIFO) lock acquisition fairness (e.g., the MCS lock (Mellor-Crummey and Scott, 1991)), threads hand over the lock in FIFO order. As a result, the longer execution time of the critical section on cores 2/3 is exposed on the critical path and hurts the throughput.

Besides the short-term acquisition fairness, previous work provides long-term acquisition fairness to improve the throughput in many-core processors or NUMA while keeping a relatively low latency. Malthusian lock (Dice, 2017) reduces the contention in the many-core processor for better throughput by blocking all competitors in the waiting queue except the head and the tail. It achieves long-term fairness by periodically shifting threads between passive blocking and active acquiring. NUMA-aware locks (Dice et al., 2012; Kashyap et al., 2017; Dice and Kogan, 2019; Kashyap et al., 2019; Dice et al., 2011; Radovic and Hagersten, 2003; Luchangco et al., 2006; Chabbi et al., 2015) batch the competitors from the same NUMA node to reduce the cross node remote memory references. They achieve long-term fairness by periodically allowing different NUMA nodes to hold the lock. However, the long-term fairness also gives the little cores an equal chance to hold the lock in a period and thus hurts the throughput.

Implication 1: Lock ordering that respects acquisition fairness, either short-term or long-term, is no longer suitable in AMP. In SMP or NUMA, preserving acquisition fairness can prevent starvation and achieve relatively low latency without degrading the throughput but causes collapses in AMP. Thus, a new lock ordering should be proposed for AMP to meet the latency goal while bringing higher throughput.

Second, the success rate of atomic operations (e.g., test-and-set, TAS) is asymmetric. On some AMP systems (including ARM Kirin970 and Intel L16G7), we observe that big cores have a stable advantage over little cores in winning the atomic TAS, while on other platforms (including Apple M1), the advantage shifts between asymmetric cores. (On Apple M1, when executing TAS back-to-back (higher contention), little cores show a stable advantage; as the distance between two TAS operations increases (lower contention), big cores show a stable advantage.) Thus, locks that do not preserve acquisition fairness and rely on the atomic operation to decide the lock holder (e.g., the TAS spinlock) also have scalability issues on AMP. Such locks are likely to be held only by one type of core (i.e., either big cores or little cores), which causes a latency collapse, or even starvation, for the other type. Moreover, when the little cores have a higher chance to hold the lock, the throughput also collapses due to the longer execution time of the critical sections on them.

As shown in Figure 3(b), when little cores have the advantage in winning the atomic TAS, they have a higher chance to lock (we name this little-core-affinity). Thus, big cores can barely lock (starvation). Besides, the longer execution time of the critical sections on little cores is exposed and hurts the throughput. Similarly, when big cores have the advantage (big-core-affinity in Figure 3(c)), little cores starve. Nevertheless, big-core-affinity allows big cores that arrive later to lock before (reorder with) earlier little cores; thus, more critical sections are executed on faster big cores, which brings higher throughput.

(a) Throughput
(b) Latency
Figure 4. When the TAS lock shows big-core-affinity, it can achieve higher throughput, but its latency still collapses.

We validate our observations on Apple M1. In Figure 1 (Section 1), we present the case where the TAS lock shows little-core-affinity. In Figure 4, we present another scenario where the TAS lock shows big-core-affinity. (We identify the affinity by comparing the number of executed critical sections on the two types of cores.) In both scenarios, the fair MCS lock faces throughput collapses (over 50% degradation from 4 big cores to all cores), while the unfair TAS lock faces latency collapses. When the TAS lock shows little-core-affinity in Figure 1, its throughput also collapses and is 43% worse than the MCS lock's when using all the cores. However, when the TAS lock shows big-core-affinity in Figure 4, more critical sections are executed on the faster big cores, which brings 53% higher throughput than the MCS lock. The latency of the MCS lock also increases when scaling to little cores due to the longer execution time of critical sections on them; however, it is much shorter than the TAS lock's. These observations also hold in real-world applications. When the TAS lock shows little-core-affinity in SQLite (detailed in Section 4.2), it has 49% worse throughput and 1.8x longer tail latency than the MCS lock. However, when the TAS lock shows big-core-affinity in UpscaleDB, it has 90% better throughput yet 2.5x longer tail latency than the MCS lock.

Implication 2: Reordering to prioritize faster cores is indispensable in AMP for higher throughput, but it must be bounded. When TAS lock shows big-core-affinity, it reorders big cores with little cores unlimitedly and achieves higher throughput. However, the unlimited reordering causes a latency collapse. Thus, the reordering must be bounded for preserving applications’ latency requirements.

2.3. Strawman Solutions

Figure 5. Latency and throughput when setting different proportions; the label of each point gives the proportion (e.g., 10 means big cores have a 10x higher chance to lock).

The straightforward approach to the AMP scalability issue is to use only big cores. However, little cores can help achieve higher throughput under lower contention, and finding the optimal number of cores to run an application is a long-existing problem (Guerraoui et al., 2019; Dice, 2017). Moreover, the energy-aware scheduler may schedule threads to little cores to save energy; binding threads only to big cores may violate the energy target.

Another intuitive solution is proportional execution, which gives big cores a fixed higher chance to lock. Figure 5 shows the performance when setting different proportions. As the figure shows, the throughput and the latency are mutually competing on AMP: a larger proportion leads to higher throughput but longer tail latency. However, there is no clear clue whether a specific application prefers throughput over latency (and to what extent) or the opposite. Moreover, since an application's load may change over time, the latency will be unstable and unpredictable under a fixed proportion. Therefore, it is almost impossible to choose a static proportion that meets applications' needs.

3. Design of LibASL

3.1. Overview

To address the lock scalability problem on AMP, we propose an asymmetry-aware scalable lock, LibASL. Rather than preserving the lock acquisition fairness, LibASL provides a new lock ordering guided directly by the application's latency SLO for better throughput and bounded latency (according to Implication 1). Atop a FIFO waiting queue, LibASL allows reordering under the condition that the victim (the competitor being reordered) will not miss the application's latency SLO. Thus, big cores can reorder with little cores as much as possible to achieve higher throughput, while little cores still just meet their latency SLOs (according to Implication 2).

To achieve such an ordering, bounded reorder capability is needed. Thus, we first design a reorderable lock, which exposes the bounded reorder capability as a configurable reorder time window atop a FIFO waiting queue. Only during the time window can big cores reorder with (lock before) little cores; once the window expires, no reordering happens (reordering is bounded). However, it is non-trivial to set a suitable fine-grained reorder window for each lock acquisition based on the application's coarse-grained latency requirement. To this end, LibASL uses a feedback mechanism to automatically choose a suitable reorder window on each lock acquisition according to the application's coarse-grained latency requirement.

Figure 6. Example of using LibASL.

Usage model. Using LibASL is simple and straightforward. By leveraging weak-symbol replacement, LibASL transparently redirects invocations of pthread_mutex_lock. Thus, applications that use pthread_mutex_lock only need to link with LibASL and, if latency-critical, add a few lines of code to annotate the latency requirement (non-latency-critical applications can benefit from LibASL without modifications). LibASL provides two intuitive interfaces, epoch_start and epoch_end, to annotate the latency SLO of a certain code block (named an epoch). Each epoch has a unique epoch id, which is passed as an argument (e.g., 5 in Figure 6). The epoch_id is statically given by programmers and can be further managed by LibASL simply through a global counter. epoch_end takes another argument, which specifies the latency SLO of the epoch in nanoseconds (e.g., 1000 means the epoch's latency SLO is 1us in Figure 6). LibASL restricts neither the number of locks nor how the locks are used in an epoch. Thus, programmers can mark a coarse-grained latency SLO (e.g., for the request handler in Figure 6).

Human efforts. For latency-critical applications, annotating the application's existing coarse-grained SLOs is the only effort required to use LibASL. Such SLOs are defined according to the actual latency target and are commonly available in both practice (e.g., an interactive app may have an SLO of 16.6ms to satisfy the 60Hz frame-rate requirement) and research (e.g., (Hao et al., 2017) attaches latency SLOs to syscalls; (Wang et al., 2012) marks the latency SLOs of storage-system operations). For applications without clear SLOs, LibASL provides a profiling tool that generates a latency-throughput graph (see the variant-SLO figures in Section 4) and helps developers choose suitable SLOs. For non-latency-critical applications, LibASL can be used transparently with no SLO.

Usage scenarios. Improving a system's throughput without violating the latency SLO is always preferred (Zhu et al., 2017; Lo et al., 2014; Wang et al., 2012; Yang et al., 2016). By using LibASL to achieve higher throughput, service providers can consolidate more requests onto fewer servers for cost and energy efficiency without breaking the service-level agreement; user-interactive applications can run faster without compromising user experience.

3.2. Reorderable Lock

Figure 7. Reorderable lock blocks standby competitors and allows other competitors to reorder with them.

The reorderable lock exposes the bounded reorder capability atop existing FIFO locks (e.g., the MCS lock). It provides three interfaces to acquire the lock: lock_immediately, lock_reorder and lock_eventually. Figure 7 presents the behavior of those interfaces. Competitors using lock_immediately are appended to the tail of the waiting queue immediately. Competitors using lock_reorder and lock_eventually are regarded as standby competitors. If the waiting queue is empty, a standby competitor can enqueue and then become the lock holder. Otherwise, standby competitors are blocked. Each standby competitor has its own reorder time window; other competitors can reorder with it and lock earlier during that window, so the reordering is bounded by the window. A standby competitor enqueues once its reorder window expires. The reorder window is an argument of lock_reorder, while lock_eventually sets the maximum reorder window.

 1  int lock_immediately(mutex_t *mutex) {
 2    return lock_fifo(mutex);
 3  }
 4
 5  int lock_reorder(mutex_t *mutex,
 6      uint64_t window) {
 7    uint64_t window_end;
 8    uint64_t cnt = 0, next_check = 1;
 9    if (window < THRESHOLD ||
10        is_lock_free(mutex))
11      return lock_fifo(mutex);
12    window_end = current() + window;
13    while (current() < window_end) {
14      if (cnt++ == next_check) {
15        if (is_lock_free(mutex))
16          break;
17        next_check <<= 1;
18      }
19    }
20    return lock_fifo(mutex);
21  }
22
23  int lock_eventually(mutex_t *mutex) {
24    return lock_reorder(mutex,
25      MAX_REORDER_WINDOW);
26  }

Algorithm 1: Reorderable lock implementation.

Algorithm 1 shows the implementation of the reorderable lock. When calling lock_immediately, the competitor directly enqueues (line 2) by using the lock interface (lock_fifo) of the underneath FIFO lock (e.g., MCS). lock_reorder takes an argument window that specifies the length of the reorder window in nanoseconds. When calling lock_reorder, the competitor first checks whether the window is too small to do any meaningful job (line 9) or the lock is free (line 10). If so, the competitor enqueues immediately (line 11). Otherwise, it becomes a standby competitor. During the reorder window, the standby competitor checks the lock status occasionally (line 15). We use a binary exponential back-off strategy to reduce the contention over the lock (line 17). It is an intuitive choice: the more times the lock is found busy when checking, the heavier the contention is likely to be, and the less likely the lock will be free in the same short period. When the reorder time window expires, the competitor finally enqueues (line 20). We do not use a secondary queue for the standby competitors because each competitor can have a different reorder window; thus, they may enqueue at different times once their reorder windows expire (not in FIFO order). As for lock_eventually, it sets the reorder window to the maximum and invokes lock_reorder. All three interfaces invoke lock_fifo within a bounded time. Thus, the reorderable lock is starvation-free.

 1  typedef struct epoch {
 2    uint64_t window; /* Reorder window */
 3    uint64_t start;  /* Timestamp */
 4    uint64_t unit;   /* Adjust unit */
 5  } epoch_t;
 6  __thread epoch_t epoch[MAX_EPOCH];
 7  __thread int cur_epoch_id = -1;
 8  #define PCT 99   /* 99th percentile latency */
 9
10  int epoch_start(int epoch_id) {
11    /* Checks are omitted */
12    cur_epoch_id = epoch_id;
13    epoch[epoch_id].start = current();
14    return 0;
15  }
16
17  int epoch_end(int epoch_id,
18      uint64_t required_latency) {
19    /* Checks are omitted */
20    uint64_t latency, window;
21    if (is_big_core())
22      goto out;
23    latency = current() - epoch[epoch_id].start;
24    window = epoch[epoch_id].window;
25    if (latency > required_latency) {
26      window >>= 1;
27      epoch[epoch_id].unit =
28        (window*(100-PCT))/100 > MIN_UNIT ?
29        (window*(100-PCT))/100 : MIN_UNIT;
30    } else {
31      window += epoch[epoch_id].unit;
32    }
33    epoch[epoch_id].window = window;
34  out:
35    cur_epoch_id = -1;
36    return 0;
37  }

Algorithm 2: LibASL epoch implementation.

Since the reorderable lock does not modify the underneath lock, for unlocking, the reorderable lock directly invokes the unmodified unlock procedure of the underneath lock.

3.3. LibASL

Atop the reorderable lock, LibASL collects the application's latency SLO and maps it to a suitable reorder window to maximize the reordering without violating the SLO. The mapping is achieved by tracing every epoch's latency and adjusting the reorder window each time an epoch ends. Since different epochs may have different latency SLOs, LibASL keeps an individual reorder window for each epoch. When pthread_mutex_lock is called in an epoch, LibASL redirects it to lock_immediately if the thread is running on a big core. Otherwise, it uses lock_reorder with that epoch's reorder window.

Algorithm 2 shows the implementation of LibASL's epoch interfaces. Each epoch has per-thread metadata, which keeps the length of the reorder window (window), the start timestamp (start) and the unit for adjusting the window's length (unit). When calling epoch_start, epoch_id specifies the unique id of the upcoming epoch, which is stored in the per-thread global variable cur_epoch_id (line 12). Then it records the start timestamp (line 13) using the light-weight clock_gettime.

When the epoch ends, epoch_end takes another argument, required_latency, which specifies the latency SLO of the current epoch in nanoseconds. It calculates the current latency (line 23), compares it with the requirement (line 25) and updates the reorder window accordingly. We take a conservative strategy to adjust the reorder window inspired by the TCP congestion control algorithm (Allman et al., 1999), which combines linear growth with exponential reduction when the latency exceeds the SLO. We set the granularity of growth (unit) to (100-PCT)% of the reduced window, where PCT represents the percentile the SLO specifies (defined in line 8), so that after another 100/(100-PCT) executions, the latency returns to the level that barely exceeded the SLO and triggered the exponential reduction; the probability of not exceeding the SLO is thus PCT%.

By leveraging weak-symbol replacement, LibASL transparently redirects pthread_mutex_lock in applications to asl_mutex_lock in Algorithm 3 with negligible overhead (20+ cycles, similar to litl (Guerraoui et al., 2019)). When calling asl_mutex_lock, competitors on big cores directly acquire the underneath lock using lock_immediately (line 3), while those on little cores use lock_reorder with the window length of the current epoch (line 7). If not in any epoch, it calls lock_eventually (line 5). Identifying the core type is achieved by getting the core id and looking up a pre-defined mapping. Since the reorderable lock is implemented atop existing locks, both trylock and nested locking are supported. Besides, condition variables are also supported by using the same technique as litl (Guerraoui et al., 2019).

 1  int asl_mutex_lock(mutex_t *mutex) {
 2    if (is_big_core())
 3      return lock_immediately(mutex);
 4    else if (cur_epoch_id < 0)
 5      return lock_eventually(mutex);
 6    else
 7      return lock_reorder(mutex,
 8        epoch[cur_epoch_id].window);
 9  }

Algorithm 3: LibASL internal interface.

3.4. Analyses

Throughput. LibASL provides good scalability in AMP. We analyze different situations applications may encounter and the corresponding behavior of LibASL as follows.

Big cores and little cores are not competing for the same lock. On big cores, LibASL behaves the same as the underneath FIFO lock (e.g., the MCS lock). On little cores, LibASL behaves similarly to a back-off spinlock. Both locks are scalable (Boyd-Wickizer et al., 2012) when competitors come from one type of core.

Big cores and little cores are competing for the same lock. When the lock is not heavily contended, competitors from both big and little cores can hold the lock immediately if it is free (no additional overhead). Little cores can help achieve higher throughput in such cases (e.g., Figure (c) shows the corresponding experiment). As the contention level increases, big cores reorder with little cores under the condition that the latency SLO is still met. When the big cores do not saturate the lock (i.e., the lock becomes free from time to time), little cores lock once the queue is empty. Thus, LibASL can find the sweet spot where some additional little cores help saturate the lock for better throughput (and block the rest of the little cores). Otherwise, allowing any extra little core to join the competition would degrade the throughput. In those cases, little cores can get the lock only when the reorder window expires. Thus, LibASL improves the throughput as much as the latency SLO allows. Theoretically, the throughput of LibASL and the latency SLO have a negative reciprocal relationship. (Suppose the big core is k times faster than the little core and each critical section takes 1 second on a big core, i.e., k seconds on a little core. When n big cores execute before 1 little core, the throughput is (n+1)/(n+k), i.e., the n+1 critical sections divided by the execution time n+k. Meanwhile, the SLO has a linear relationship with n: when the SLO increases by 1s, 1 more big core can reorder. So the latency SLO and the throughput have a negative reciprocal relationship.) Hence, the growth of throughput slows down with a larger SLO (see Figure (b)).

Latency. LibASL precisely maintains the latency under the SLO through a feedback mechanism. The size of the reorder window has a monotonic relationship with the epoch's latency (e.g., a smaller reorder window means a shorter waiting time and thus a shorter latency). This still holds when an epoch contains multiple locks, since they share the same window size. Thus, LibASL can find the window size at which the latency barely meets the SLO by adjusting the size according to the measured latency (if the latency is higher than the SLO, shrink the window, and vice versa).

Even if the epoch length (i.e., execution time) becomes heterogeneous (e.g., executing different code paths), LibASL can still maintain the tail latency, because the reorder window shrinks exponentially once a violation happens and grows back gradually (linearly) over the following executions. Thus, some short epochs may get a smaller reorder window than necessary (a larger window could still meet the SLO), but the SLO will not be violated.

Note that the latency SLO is not a strict deadline; LibASL uses it as a hint to maximize throughput without violating it. There are some cases where LibASL does not take effect, summarized as follows:

  1. Inappropriate latency requirement. When the given SLO is impossible to achieve even without reordering, LibASL falls back to a FIFO lock (best effort).

  2. Non-lock-sensitive workloads. In such workloads, locks barely affect either the latency or the throughput, since they are rarely used on the critical path. Naturally, LibASL does not influence their performance.

Energy. Energy efficiency is one of the major targets of AMP systems. To keep energy consumption low, Linux provides EAS (the energy-aware scheduler (22)), which chooses the most suitable core for each thread. LibASL does not require binding threads to cores, so threads can migrate between cores freely; the reorder window quickly adjusts itself once a migration happens. Therefore, LibASL does not interfere with the scheduling decision or the energy target. Threads run on big cores only when the scheduler decides so (not because of LibASL). Moreover, when running on big cores, LibASL makes threads do meaningful work rather than busy waiting, which helps save energy (Falsafi et al., 2016).

Space overhead. The space overhead of LibASL comes from the metadata of epochs, which is negligible: the per-thread metadata of an epoch only takes 24 bytes (see Algorithm 2) and is independent of the number of locks.

(a) Latency and throughput comparison. LibASL-X means the SLO is set to X us. LibASL-MAX means enabling maximum reordering (upper bound).
(b) Performance of LibASL when setting varying SLOs (x-axis).
(c) Throughput speedup of LibASL under varying contention levels.
(d) Self-adaptive reorder window: latency of each epoch during the first 350 ms.
(e) Performance of blocking locks and LibASL under core over-subscription.
(f) LibASL with varying SLOs under core over-subscription.
Figure 8. Micro-benchmarks. Big P99 and Little P99 present the 99th percentile latency on big and little cores, respectively.

4. Evaluation

We evaluate LibASL to answer the following questions:

  1. How much throughput can LibASL gain when setting different SLOs?

  2. Can LibASL precisely maintain epochs’ latency in various situations?

  3. How does LibASL perform under different contention levels?

  4. Does LibASL take effect in real-world applications?

Evaluation Setup. We evaluate LibASL on Apple M1, the only desktop AMP available so far. LibASL also works well on mobile AMP processors (e.g., ARM big.LITTLE), since its improvement comes from considering the asymmetry in computing capacity and is not restricted to specific AMP processors. Due to the space limit, however, we only present the results on M1. Apple M1 has 4 big cores and 4 little cores, where big cores deliver about 4.7x the performance of little cores (according to the result of Sysbench (53); the ratio may vary across workloads). In such a system, the theoretical throughput speedup upper bound of LibASL over a FIFO lock (e.g., MCS) obtained by allowing big cores to reorder is 1.8x (comparing the case where big cores always run against the case where big and little cores run one by one). We run an unmodified Linux 5.11 on M1 (15).

In the following experiments, unless otherwise stated, the reorderable lock is built atop the MCS lock, and the PCT is set to 99 to guarantee the P99 latency. We create 8 threads and bind them to different cores (only for evaluation; core binding is not required by LibASL). We compare LibASL with pthread_mutex_lock (in glibc-2.32), TAS, ticket, MCS and ShflLock (Kashyap et al., 2019). ShflLock provides a lock reordering framework but can only take a static reorder policy; the SLO-guided (non-static) ordering in LibASL is hard to integrate into it. Instead, we implement a proportional static policy, which gives the big cores a fixed higher (10x) chance to lock (batching at most 10 big cores before passing the lock to 1 little core). As shown in Figure 5, any proportion is a point on the latency-throughput curve. Thus, we choose the proportion (i.e., 10) that has an obvious throughput improvement without introducing extremely long latency to compare with. We also present the speedup upper bound of LibASL by using the maximum reorder window, which is set to 100 ms (LibASL-MAX).

4.1. Micro-Benchmarks

Bench-1: A heavily contended benchmark. In this benchmark, all threads repeatedly execute the same epoch, which contains critical sections of different lengths and acquires multiple locks in a nested manner. We measure the latency of the epoch and present the tail latency of little cores (Little P99), big cores (Big P99) and overall (Overall P99) separately. As shown in Figure 8(a), the TAS lock shows big-core affinity here (big cores have a much shorter tail latency). Thus, among existing locks, the TAS lock has the best throughput yet the worst P99 latency. When setting the latency SLO to 0 us (LibASL-0), LibASL has the same throughput and latency as the MCS lock, since the latency target is impossible to achieve (best-effort FIFO). When achieving throughput similar to the TAS lock (LibASL-35), LibASL reduces the overall tail latency by 37% (55% on little cores). When having an overall tail latency similar to the TAS lock (LibASL-55), LibASL achieves 22% better throughput. Although the TAS lock also (implicitly) prioritizes the big cores under such circumstances, its reordering depends on the hardware atomic operation and is unstable and uncontrollable. LibASL manages the reordering elaborately, which allows more critical sections to execute on the big cores and thus yields higher throughput. LibASL brings up to 86% throughput speedup over the TAS lock and 130% over the MCS lock when setting a larger SLO (LibASL-MAX). When giving big cores a 10x higher chance to lock (SHFL-PB10), ShflLock improves the throughput of MCS by 20% while having a 4x longer overall tail latency. LibASL outperforms SHFL-PB10 by 35% when having similar overall latency (LibASL-65). This is because LibASL has better cache locality: it batches as many big cores as the SLO allows before passing the lock to a little core, while the proportional approach has to pass the lock to little cores periodically. The pthread_mutex_lock has the worst throughput and the longest tail latency; LibASL outperforms it by up to 3.5x (Question 1).

Figure 8(b) shows the throughput and the latency of LibASL when setting varying latency SLOs (the epoch’s length does not change). As shown in the figure, with a larger SLO (x-axis from left to right), the throughput increases, and the tail latency of little cores sticks to the Y=X line (i.e., it barely meets the SLO). This validates that LibASL improves the lock’s throughput under the condition that little cores can barely meet their latency SLO. Moreover, the growth of throughput slows down as the SLO becomes larger, as discussed in Section 3.4. Meanwhile, big cores get more chances to lock; thus, both the big cores’ and the overall tail latencies are shorter than the little cores’ and are within the SLO. The only exception is when setting an SLO shorter than 15 us (the tail latency of MCS): the required latency is then impossible to achieve even when passing the lock in FIFO order, so LibASL falls back to the MCS lock (best effort).

Bench-2: A highly variable workload. To present the effectiveness of LibASL in managing the epoch’s latency in highly variable workloads, we record each epoch’s latency in the first 350 ms when executing the same benchmark as in Bench-1. Figure 8(d) shows the latency of each epoch executed on big cores and little cores individually. The latency SLO is set to 100 us throughout the experiment. Between 100 and 200 ms, we enlarge the epoch’s length (execution time) by 128 times and shrink it back to the original length between 200 and 250 ms. After that, we change the epoch’s length rapidly and randomly between 250 and 300 ms. Finally, the epoch’s length is set 1024 times larger than the original until the end. As shown in the figure, LibASL is fully capable of maintaining the latency in a highly variable workload (Question 2). Every time the latency exceeds the SLO, the reorder window shrinks to half its size and increases slowly over the next 100 executions (PCT is set to 99). When the epoch’s length enlarges at 100 ms and 200 ms, LibASL quickly adjusts the reorder window to a suitable size. Even when the epoch’s length becomes highly heterogeneous between 250 and 300 ms, LibASL can still keep the latency within the SLO. When the epoch’s length enlarges 1024 times at 300 ms, the target latency is impossible to achieve; thus, LibASL falls back to the underlying MCS lock, and both big and little cores have similar latencies.

Application | Type | Version | Benchmark | Lock Usage in each Epoch
Kyoto Cabinet (39) | In-memory KV Store (CacheDB) | 1.2.78 | 50% Put 50% Get | Slot-level Lock, Method Lock
upscaledb (55) | On-disk Persistent KV Store | 2.2.1 | 50% Put 50% Get | Global DB Lock, Async Worker Pool Lock
LMDB (H. Chu (2011); 41) | On-disk Persistent KV Store | 0.9.70 | 50% Put 50% Get | Global DB Lock, Metadata Lock
LevelDB (40) | On-disk Persistent KV Store | 1.22 | db_bench Random Read | Metadata Lock
SQLite (51) | On-disk Persistent Database | 3.33.0 | 1/3 Insert, 1/3 Simple Query, 1/3 Complex Query | State Machine Lock, Metadata Locks
Table 1. Databases Considered

Bench-3: A benchmark with varying contention levels. In this benchmark, we evaluate LibASL under varying contention levels by executing a different number of nop instructions between two lock acquisitions. Figure 8(c) shows the throughput speedup of LibASL over the locks in the legends (e.g., when x=0, LibASL outperforms MCS by 2x and the TAS lock by 45%). To allow as much reordering as possible, we do not set a latency SLO in LibASL. We also include the result of using only big cores (MCS-4). When competitors from big cores already saturate the lock (x ≤ 3), LibASL turns the competitors from little cores into standby competitors and achieves throughput similar to MCS-4 (significantly better than the others). As the contention level decreases, LibASL allows little cores to join the competition and achieves better throughput than using only big cores (up to 68%). Across all contention levels, LibASL achieves good throughput (Question 3).

Bench-4: A benchmark with CPU core over-subscription. In this benchmark, we examine the effectiveness of LibASL under core over-subscription by creating 2 threads on each core and executing Bench-1. We replace the non-blocking MCS lock in LibASL with pthread_mutex_lock and use nanosleep to block the standby competitors. Results are presented in Figures 8(e) and 8(f). Since the MCS lock is passed in FIFO order, the wake-up latency is exposed on the critical path and leads to a significant throughput degradation (spin-then-park MCS is 96% worse than pthread_mutex_lock); hence LibASL uses pthread_mutex_lock rather than the MCS lock. Although pthread_mutex_lock does not guarantee FIFO order and has unstable lock acquisition latency, LibASL can still preserve the SLO owing to its self-adaptive reorder window, and it improves the throughput of pthread_mutex_lock by up to 90%.

Figure 9. Performance when mixing epochs of significantly different lengths at varying ratios. Throughput is normalized to MCS. The SLO is set to 100 us (also the P99 tail latency of the MCS lock at the 100% ratio).

Bench-5: A benchmark mixing epochs of significantly different lengths. In this benchmark, we randomly generate short and long (100x longer) epochs at different ratios. We compare LibASL with the ideal case, which directly chooses (without window adjustment) a suitable window for each epoch length (impossible in the real world). As shown in Figure 9, LibASL brings significant and close-to-ideal throughput improvement over MCS (a maximum 20% gap to ideal at the 50% ratio) while keeping the latency under the SLO.

(a) Stack
(b) Varying SLOs.
(c) Linked List. 1e6 means 10^6.
(d) Varying SLOs.
Figure 10. Data structures. Legends are explained in Figure 8.

Bench-6: Data structure benchmarks. In the stack benchmark, threads randomly push or pop (fifty-fifty) 1 element in each epoch. Similarly, in the linked list benchmark, threads randomly append or remove (fifty-fifty) 1 element in each epoch. As shown in Figure 10(a), in the stack benchmark, LibASL only improves the throughput of the TAS lock by 8% (18% over ShflLock) when having similar overall P99 latency (LibASL-6). This is because, with the TAS lock, the lock mostly passes only among big cores, which results in extremely long tail latency on little cores (Little P99) but good throughput. When setting a larger SLO, LibASL provides up to 27% speedup over the TAS lock (LibASL-MAX). In the linked list benchmark (Figure 10(c)), the TAS lock shows little-core affinity (big cores have extremely long latency). Thus, the TAS lock faces a significant throughput degradation, since critical sections are mostly executed on the slower little cores. LibASL outperforms the TAS lock by 1.7x in throughput when having similar overall tail latency (LibASL-30; up to 3.2x with a larger SLO). Compared with MCS, ShflLock and pthread_mutex_lock, LibASL brings up to 2.5x, 1.5x and 2.3x throughput improvement. Besides enabling reordering, LibASL also reduces contention by putting competitors into standby mode, which helps it go beyond the theoretical speedup upper bound relative to the MCS lock (1.8x, see Evaluation Setup). In both benchmarks, LibASL prevents the latencies from violating the SLO (Y=X), as shown in Figures 10(b) and 10(d).

4.2. Application Benchmarks

Databases. To examine the effectiveness of LibASL in real-world applications (Question 4), we evaluate 5 popular databases (detailed in Table 1). Databases benefit from using little cores to handle more requests on fewer machines, which can improve cost and energy efficiency in edge computing (Zhu et al., 2017; Lo et al., 2014; Wang et al., 2012). However, when threads on asymmetric cores intensively compete for the same lock, existing locks cause performance collapses. Integrating LibASL only requires inserting 3 lines of code: wrapping the operations with epoch_start and epoch_end, and adding the header file. As prior work (Dice and Kogan, 2019; Kashyap et al., 2019) does, we run each benchmark for a fixed period and calculate the average throughput. Moreover, to present the effectiveness of LibASL in cases where epochs’ lengths are highly heterogeneous, we randomly choose to insert or find (fifty-fifty, following YCSB-A (59)) 1 item in an epoch. In most benchmarks, each epoch acquires multiple locks, as listed in the rightmost column of Table 1.

(a) Kyoto Cabinet. 1e6 means 10^6. The chosen SLOs are only for easier comparison; other settings are detailed in Figure 11(b).
(b) Varying SLOs
(c) CDF (SLO: 70us)
(d) upscaledb
(e) Varying SLOs
(f) CDF (SLO: 140us)
(g) LMDB. 5e5 means 5×10^5.
(h) Varying SLOs
(i) CDF (SLO: 1900us)
Figure 11. Databases. Legends are explained in Figure 8.

We first evaluate LibASL in several KV-stores. KV-stores play an important role as the storage service in CDN or IoT edge servers (2; 54). In Kyoto Cabinet, we Put or Get (fifty-fifty) 1 item in an in-memory CacheDB in an epoch and measure its latency. As shown in Figure 11(a), the TAS lock shows big-core affinity and thus has the best throughput yet the worst tail latency among existing locks. LibASL reduces the tail latency by 90% when achieving throughput similar to the TAS lock (LibASL-70) and improves the throughput by up to 23% (96% over MCS and 89% over pthread_mutex_lock). When having a tail latency similar to ShflLock (LibASL-40), LibASL improves the throughput by 38%. Figure 11(b) shows the performance of LibASL when setting varying latency SLOs. Although the execution times of Put and Get are heterogeneous, LibASL can still precisely maintain the tail latency while improving the throughput.

Figure 11(c) presents the latency Cumulative Distribution Function (CDF) of LibASL when setting the SLO to 70 us. In the figure, Overall and Little represent the overall and little cores’ latency. A clear boundary can be seen in the overall latency, since most operations finish on big cores. Due to the intensive contention, less than 20% of operations are executed on little cores and have longer latency. There is also a clear boundary in little cores’ latency. About half of the operations have shorter latency (<35 us), owing to the shorter execution time of the Get operation. As for the longer Put operations (the other half), since the reorder window shrinks by half once the latency exceeds the SLO and then grows linearly, the probability grows linearly after half the SLO (35 us).

(a) LevelDB. 6e5 means 6×10^5.
(b) Varying SLOs
(c) CDF (SLO: 100us)
(d) SQLite. LibASL-X means the SLO is set to X ms.
(e) Varying SLOs
(f) CDF (SLO: 4ms)
Figure 12. Database benchmarks. Legends are explained in Figure 8.

Similar results can be found in upscaledb and LMDB (Figures 11(d)-(i)). In upscaledb, among existing locks, the TAS lock has the highest throughput (90% higher than MCS) yet the longest tail latency (2.5x longer than MCS). When having a tail latency similar to the TAS lock (LibASL-140), LibASL improves the throughput by 46%, which further rises to 1.6x (3.8x over MCS, 5x over pthread_mutex_lock and 50% over ShflLock) when setting a larger SLO. In LMDB, LibASL has 40% higher throughput than the TAS lock when having similar latency (LibASL-600), which rises to 60% (86% over MCS, 126% over pthread_mutex_lock and 27% over ShflLock) with a larger SLO. In both benchmarks, the latency CDFs (Figures 11(f) and 11(i)) show a trend similar to Figure 11(c): the clear boundary in little cores’ latency distinguishes the shorter Get from the longer Put.

LevelDB is another widely used KV-store. However, LevelDB implements its own blocking strategy rather than directly using pthread_mutex_lock for the Put operation. Thus we use the randomread test in the built-in db_bench to test only its Get operation. In LevelDB, each Get operation acquires a global lock to take a snapshot of internal database structures. As shown in Figure 12(a), LibASL improves the throughput of the TAS lock by 50% when having similar latency (LibASL-15), which rises to 2.5x (1.6x over MCS, 1.8x over pthread_mutex_lock and 1.3x over ShflLock) when setting a larger SLO. Since we only test the Get operation, most requests have a latency longer than half the SLO, as shown in Figure 12(c).

Finally, we evaluate LibASL in SQLite, a relational database that has been used in the Azure IoT edge server (7). We place 1/3 Insert, 1/3 simple query (point query on an indexed column) and 1/3 complex query (range query on an indexed column with a filter on a non-indexed column) in a DEFERRED transaction enclosed in an epoch. Moreover, we add an extremely long full-table scan over a 100k-row table every 1000 executions in the same epoch to show that LibASL can survive occasional extremely long requests. SQLite is configured to use the multi-thread threading mode. Different from previous benchmarks, SQLite uses locks to protect its internal state machine, and a transaction can commit successfully only in a certain state. Thus the epoch’s latency fluctuates greatly, which results in a non-linear growth in little cores’ latency, as shown in Figure 12(f). It also amplifies the difference in transaction success rate between big and little cores when using ShflLock and causes latency collapses on little cores. Moreover, both the simple and the complex Select operations have much shorter execution times than the Insert operation; thus, 2/3 of the requests have extremely short tail latency (latency grows significantly after y=2/3 in Figure 12(f)). Nevertheless, LibASL is still able to precisely keep the tail latency under the SLO and improve the throughput, as shown in Figure 12(e). An occasional extremely long epoch only causes a window shrink; LibASL quickly adjusts the window back to a suitable size in the following executions, so it does not influence the performance. The TAS lock shows little-core affinity here (big cores have longer latency); thus, it has both lower throughput and longer tail latency than the MCS lock. LibASL brings up to 2.1x throughput speedup over the TAS lock (55% over MCS, 2.1x over pthread_mutex_lock and 35% over ShflLock) without violating the SLO.

PARSEC. We also evaluate LibASL in non-latency-critical applications, where LibASL uses the maximum reorder window. We present the results of three benchmarks from PARSEC (Bienia et al., 2008), which represent the three different behavior types observed in PARSEC. We also include the results when using only 4 big cores (Spin-4Big) and 2 big cores (Spin-2Big) to show the scalability of each application.

Dedup uses pipeline parallelism to compress data and is known to be lock-sensitive (Guerraoui et al., 2019). We configure the benchmark to use a single task queue to enable work balancing across asymmetric cores. As shown in Figure 13(a), LibASL improves the throughput of the MCS lock by 2x. However, LibASL and the TAS lock have similar throughput, which is close to using only big cores (Spin-4Big). This is because little cores can barely acquire the lock when using the TAS lock (big-core affinity). Nevertheless, the TAS lock may encounter throughput collapse when showing little-core affinity in other scenarios (e.g., in the linked list benchmark in Figure 10(c)), while LibASL does not. Dedup represents lock-sensitive applications, which may encounter scalability issues on AMP systems. LibASL provides stable and best-in-class performance for such applications without code modifications and avoids the throughput collapses seen with the MCS and TAS locks.

Raytrace uses a per-thread task queue and enables work stealing when the local task queue is empty. As shown in the figure, all the locks behave the same since the locks are not heavily contended. Raytrace scales well on AMP: little cores improve the overall throughput by 36% (from Spin-4Big to TAS). Raytrace represents a class of applications that scale well on AMP, including the map-reduce applications in Phoenix (Yoo et al., 2009). Their performance is not bounded by locks (given the limited number of cores in M1), and they either use work stealing or assign tasks at runtime to achieve a good work balance among asymmetric cores.

In contrast, Ocean_cp fails to scale on AMP. It contains multiple phases, and each core is assigned an even share of work. Ocean_cp scales well when adding 2 big cores to the system (from Spin-2Big to Spin-4Big) but fails to go further when little cores are included, because big cores have to wait for the slower little cores at the end of each phase. Besides Ocean_cp, streamcluster, lu_cb and barnes in PARSEC use a similar approach and therefore fail to scale on AMP. These applications make a false assumption (symmetric computing capacity) about the underlying cores. LibASL cannot take effect here since this is an orthogonal problem: such applications have to be modified, either by adding work balancing or by assigning tasks with the asymmetry in mind, to scale on AMP.

4.3. Evaluation Highlights

Results confirm the effectiveness of LibASL on AMP. First, LibASL can precisely maintain the latency SLO even in highly variable workloads. Second, LibASL shows promising performance advantages over existing locks. Compared with fair locks (lowest latency but low throughput), LibASL significantly improves the throughput (e.g., 3.8x over the MCS lock in upscaledb). Compared with unfair locks (highest latency but sometimes high throughput), LibASL has much lower tail latency when achieving similar throughput (e.g., 90% lower than the TAS lock in Kyoto Cabinet) and substantially higher throughput when ensuring similar tail latency (e.g., 46% higher than the TAS lock in upscaledb). Moreover, LibASL further outperforms the TAS lock by 2.5x when setting a larger SLO in LevelDB. Compared with the proportional approach, LibASL better meets applications’ needs by improving throughput as much as possible given the application’s latency SLO, and it achieves better throughput when having similar latency.

(a) Dedup
(b) Raytrace
(c) Ocean_cp
Figure 13. PARSEC benchmark.

5. Related Work

Scalable locking has been extensively studied for decades. However, previous work (Dice and Kogan, 2019; Kashyap et al., 2019; Dice et al., 2012; Dice, 2017; Hendler et al., 2010; Lozi et al., 2012; Roghanchi et al., 2017; Zhang et al., 2017; Dice et al., 2011; Luchangco et al., 2006; Fatourou and Kallimanis, 2012, 2011; Oyama et al., 1999; Chabbi and Mellor-Crummey, 2016; Chabbi et al., 2015; Radovic and Hagersten, 2003) mainly targets the scalability problem on SMP or NUMA rather than AMP. Some locks already reorder competitors to achieve better throughput (Dice and Kogan, 2019; Kashyap et al., 2019; Dice et al., 2012; Dice, 2017; Dice et al., 2011; Luchangco et al., 2006; Chabbi and Mellor-Crummey, 2016; Radovic and Hagersten, 2003) on NUMA or many-core processors. NUMA-aware locks (Dice and Kogan, 2019; Dice et al., 2012, 2011; Luchangco et al., 2006; Chabbi and Mellor-Crummey, 2016; Radovic and Hagersten, 2003) reorder competitors from the local NUMA node ahead of other nodes to reduce cross-node remote memory references, which do not exist on AMP. The Malthusian lock (Dice, 2017) reorders active competitors with passive blocking ones to reduce contention on many-core processors. However, as discussed in Section 2.2, these locks rely on long-term fairness to keep a relatively low latency, which brings throughput collapse on AMP. Besides, long-term fairness only forbids starvation, leaving the latency unpredictable. LibASL provides a new SLO-guided lock ordering for AMP and achieves higher throughput.

ShflLock (Kashyap et al., 2019) provides a lock reordering framework. It relies on a provided static policy to shuffle the queue internally. However, the scalability issue on AMP cannot be easily solved by a static policy: preserving fairness brings throughput collapse, while reordering without limit causes latency collapse. It is also hard to find a suitable static proportion to prioritize big cores, as discussed in Section 2.3. Thus, in LibASL, the reorderable lock exposes the reorder capability so that LibASL can adjust the extent of reordering on the fly and achieve the best throughput the SLO allows.

Delegation locks (Hendler et al., 2010; Lozi et al., 2012; Roghanchi et al., 2017; Zhang et al., 2017; Fatourou and Kallimanis, 2012, 2011; Oyama et al., 1999) reduce data movement by executing all critical sections on one core (the lock server), which significantly improves throughput on NUMA. Although placing the lock server on a big core can hide the weak computing capacity of little cores (similar to (Suleman et al., 2009; Joao et al., 2013, 2012), where big cores are dedicated to acceleration), it requires the big core to busy-poll, which wastes a big core and violates the energy target when contention is low; LibASL, in contrast, achieves good performance across contention levels. A more severe obstacle to using delegation locks is that they require non-trivial code modifications to convert all critical sections into closures, which entails enormous engineering work given the complexity of real-world applications. Instead, LibASL only requires linking and, if latency-critical, inserting a few lines of code to specify the latency SLO.

Computing capacity can also be asymmetric on SMP when using DVFS. Previous work (Wamhoff et al., 2014; Cebrian et al., 2013; Akram et al., 2016) boosts the frequency of the lock holder to gain better throughput. However, unlike with DVFS, the asymmetry in AMP is inherent in the hardware; thus, those techniques cannot be applied to AMP.

Improving the system’s throughput for better cost and energy efficiency without violating the latency SLO is a widely adopted technique (Zhu et al., 2017; Wang et al., 2012; Yang et al., 2016). WorkloadCompactor (Zhu et al., 2017) and Cake (Wang et al., 2012) reduce the datacenter’s cost by consolidating more loads into fewer servers without violating the latency SLO. Elfen Scheduling (Yang et al., 2016) leverages the SMT to run latency-critical and other requests simultaneously to improve the utilization without compromising the SLO. LibASL takes a similar approach to solve the lock’s scalability problem on AMP.

6. Conclusion

In this paper, we propose an asymmetry-aware scalable lock named LibASL. It provides a new SLO-guided lock ordering that maximizes big cores’ chances to lock for better throughput while carefully maintaining little cores’ latencies. Evaluations on real-world applications show that LibASL achieves better performance than its counterparts on AMP.

References

  • [1] A look at intel lakefield: a 3d-stacked single-isa heterogeneous penta-core soc. Note: https://fuse.wikichip.org/news/3417/a-look-at-intel-lakefield-a-3d-stacked-single-isa-heterogeneous-penta-core-soc/ Cited by: §2.1.
  • [2] Akamai: iot edge connect. Note: https://www.akamai.com/cn/zh/products/performance/iot-edge-connect.jsp Cited by: §4.2.
  • S. Akram, J. B. Sartor, and L. Eeckhout (2016) DVFS performance prediction for managed multithreaded applications. In 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Vol. , pp. 12–23. External Links: Document Cited by: §2.1, §5.
  • M. Allman, V. Paxson, W. Stevens, et al. (1999) TCP congestion control. Cited by: §3.3.
  • [5] Apple m1 chip. Note: https://www.apple.com/mac/m1/ Cited by: §1.
  • [6] ARM dynamiq shared unit technical reference manual. Note: https://developer.arm.com/documentation/100453/0002/functional-description/introduction/about-the-dsu Cited by: §2.1.
  • [7] Azure iot edge sqlite module. Note: https://github.com/Azure/iot-edge-sqlite Cited by: §4.2.
  • C. Bienia, S. Kumar, J. P. Singh, and K. Li (2008) The parsec benchmark suite: characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT ’08, New York, NY, USA, pp. 72–81. External Links: ISBN 9781605582825, Link, Document Cited by: §4.2.
  • S. Boyd-Wickizer, M. F. Kaashoek, R. Morris, and N. Zeldovich (2012) Non-scalable locks are dangerous. In Proceedings of the Linux Symposium, pp. 119–130. Cited by: §1, §3.4.
  • J. M. Cebrian, D. Sánchez, J. L. Aragón, and S. Kaxiras (2013) Efficient inter-core power and thermal balancing for multicore processors. Computing 95 (7), pp. 537–566. External Links: Link, Document Cited by: §2.1, §5.
  • [11] CFS wakeup path and arm big.little/dynamiq. Note: https://lwn.net/Articles/793379/ Cited by: §2.1.
  • M. Chabbi, M. Fagan, and J. Mellor-Crummey (2015) High performance locks for multi-level numa systems. 50 (8). External Links: ISSN 0362-1340, Link, Document Cited by: §1, §1, §2.2, §5.
  • M. Chabbi and J. Mellor-Crummey (2016) Contention-conscious, locality-preserving locks. SIGPLAN Not. 51 (8). External Links: ISSN 0362-1340, Link, Document Cited by: §5.
  • H. Chu (2011) MDB: a memory-mapped database and backend for openldap. In Proceedings of the 3rd International Conference on LDAP, Heidelberg, Germany, pp. 35. Cited by: Table 1.
  • [15] CORELLIUM: how we port linux to m1. Note: https://corellium.com/blog/linux-m1 Cited by: §4.
  • J. Dean and L. A. Barroso (2013) The tail at scale. Communications of the ACM 56, pp. 74–80. External Links: Link Cited by: §1.
  • G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels (2007) Dynamo: amazon’s highly available key-value store. In Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, SOSP ’07, New York, NY, USA, pp. 205–220. External Links: ISBN 9781595935915, Link, Document Cited by: §1.
  • D. Dice and A. Kogan (2019) Compact NUMA-aware locks. In Proceedings of the Fourteenth EuroSys Conference 2019, EuroSys ’19, New York, NY, USA. External Links: ISBN 9781450362818, Link, Document Cited by: §1, §1, §2.2, §4.2, §5.
  • D. Dice, V. J. Marathe, and N. Shavit (2011) Flat-combining NUMA locks. In Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’11, New York, NY, USA, pp. 65–74. External Links: ISBN 9781450307437, Link, Document Cited by: §1, §1, §2.2, §5.
  • D. Dice (2017) Malthusian locks. In Proceedings of the Twelfth European Conference on Computer Systems, EuroSys ’17, New York, NY, USA, pp. 314–327. External Links: ISBN 9781450349383, Link, Document Cited by: §1, §2.2, §2.3, §5.
  • D. Dice, V. J. Marathe, and N. Shavit (2012) Lock cohorting: a general technique for designing NUMA locks. SIGPLAN Not. 47 (8), pp. 247–256. External Links: ISSN 0362-1340, Link, Document Cited by: §1, §1, §2.2, §5.
  • [22] Energy aware scheduling. Note: https://www.kernel.org/doc/html/latest/scheduler/sched-energy.html Cited by: §1, §2.1, §3.4.
  • B. Falsafi, R. Guerraoui, J. Picorel, and V. Trigonakis (2016) Unlocking energy. In 2016 USENIX Annual Technical Conference, USENIX ATC 2016, Denver, CO, USA, June 22-24, 2016, A. Gulati and H. Weatherspoon (Eds.), pp. 393–406. External Links: Link Cited by: §3.4.
  • S. Fan and B. C. Lee (2016) Evaluating asymmetric multiprocessing for mobile applications. In 2016 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2016, Uppsala, Sweden, April 17-19, 2016, pp. 235–244. External Links: Link, Document Cited by: §2.1.
  • P. Fatourou and N. D. Kallimanis (2011) A highly-efficient wait-free universal construction. In Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’11, New York, NY, USA, pp. 325–334. External Links: ISBN 9781450307437, Link, Document Cited by: §5, §5.
  • P. Fatourou and N. D. Kallimanis (2012) Revisiting the combining synchronization technique. SIGPLAN Not. 47 (8), pp. 257–266. External Links: ISSN 0362-1340, Link, Document Cited by: §5, §5.
  • [27] Google Cloud: defining SLOs. Note: https://cloud.google.com/solutions/defining-SLOs Cited by: §1.
  • R. Guerraoui, H. Guiroux, R. Lachaize, V. Quéma, and V. Trigonakis (2019) Lock–unlock: is that all? a pragmatic analysis of locking in software systems. ACM Trans. Comput. Syst. 36 (1). External Links: ISSN 0734-2071, Link, Document Cited by: §2.3, §3.3, §4.2.
  • M. Hao, H. Li, M. H. Tong, C. Pakha, R. O. Suminto, C. A. Stuardo, A. A. Chien, and H. S. Gunawi (2017) MittOS: supporting millisecond tail tolerance with fast rejecting SLO-aware OS interface. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17, New York, NY, USA. External Links: ISBN 9781450350853, Link, Document Cited by: §1, §3.1.
  • D. Hendler, I. Incze, N. Shavit, and M. Tzafrir (2010) Flat combining and the synchronization-parallelism tradeoff. In Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’10, New York, NY, USA, pp. 355–364. External Links: ISBN 9781450300797, Link, Document Cited by: §5, §5.
  • [31] Intel Alder Lake: performance hybrid with Golden Cove and Gracemont for 2021, Intel Architecture Day 2020. Note: https://newsroom.intel.com/press-kits/architecture-day-2020/ Cited by: §1.
  • B. Jeff (2013) big.LITTLE technology moves towards fully heterogeneous global task scheduling. ARM white paper. Cited by: §2.1.
  • J. A. Joao, M. A. Suleman, O. Mutlu, and Y. N. Patt (2012) Bottleneck identification and scheduling in multithreaded applications. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2012, London, UK, March 3-7, 2012, T. Harris and M. L. Scott (Eds.), pp. 223–234. External Links: Link, Document Cited by: §5.
  • J. A. Joao, M. A. Suleman, O. Mutlu, and Y. N. Patt (2013) Utility-based acceleration of multithreaded applications on asymmetric CMPs. In The 40th Annual International Symposium on Computer Architecture, ISCA’13, Tel-Aviv, Israel, June 23-27, 2013, A. Mendelson (Ed.), pp. 154–165. External Links: Link, Document Cited by: §5.
  • S. Kashyap, I. Calciu, X. Cheng, C. Min, and T. Kim (2019) Scalable and practical locking with shuffling. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP ’19, pp. 586–599. External Links: ISBN 9781450368735, Link, Document Cited by: §1, §1, §2.2, §4.2, §4, §5, §5.
  • S. Kashyap, C. Min, and T. Kim (2017) Scalable NUMA-aware blocking synchronization primitives. In 2017 USENIX Annual Technical Conference, USENIX ATC 2017, Santa Clara, CA, USA, July 12-14, 2017, D. D. Silva and B. Ford (Eds.), pp. 603–615. External Links: Link Cited by: §1, §1, §2.2.
  • R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen (2003) Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction. In Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36., pp. 81–92. Cited by: §1.
  • R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas (2004) Single-ISA heterogeneous multi-core architectures for multithreaded workload performance. In Proceedings. 31st Annual International Symposium on Computer Architecture, 2004., pp. 64–75. Cited by: §1.
  • [39] Kyoto Cabinet: a straightforward implementation of DBM. Note: https://dbmx.net/kyotocabinet/ Cited by: Table 1.
  • [40] LevelDB. Note: https://github.com/google/leveldb Cited by: Table 1.
  • [41] LMDB technical information. Note: https://symas.com/lmdb/technical/ Cited by: Table 1.
  • D. Lo, L. Cheng, R. Govindaraju, L. A. Barroso, and C. Kozyrakis (2014) Towards energy proportionality for large-scale latency-critical workloads. In ACM/IEEE 41st International Symposium on Computer Architecture, ISCA 2014, Minneapolis, MN, USA, June 14-18, 2014, pp. 301–312. External Links: Link, Document Cited by: §1, §3.1, §4.2.
  • J. Lozi, F. David, G. Thomas, J. L. Lawall, and G. Muller (2012) Remote core locking: migrating critical-section execution to improve the performance of multithreaded applications. In 2012 USENIX Annual Technical Conference, Boston, MA, USA, June 13-15, 2012, G. Heiser and W. C. Hsieh (Eds.), pp. 65–76. External Links: Link Cited by: §5, §5.
  • V. Luchangco, D. Nussbaum, and N. Shavit (2006) A hierarchical CLH queue lock. In Euro-Par 2006, Parallel Processing, 12th International Euro-Par Conference, Dresden, Germany, August 28 - September 1, 2006, Proceedings, W. E. Nagel, W. V. Walter, and W. Lehner (Eds.), Lecture Notes in Computer Science, Vol. 4128, pp. 801–810. External Links: Link, Document Cited by: §1, §1, §2.2, §5.
  • J. M. Mellor-Crummey and M. L. Scott (1991) Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst. 9 (1), pp. 21–65. External Links: ISSN 0734-2071, Link, Document Cited by: §1, §2.2.
  • [46] New Intel Core processors with Intel Hybrid Technology. Note: https://www.intel.com/content/www/us/en/products/docs/processors/core/core-processors-with-hybrid-technology-brief.html Cited by: §1, §2.1.
  • Y. Oyama, K. Taura, and A. Yonezawa (1999) Executing parallel programs with synchronization bottlenecks efficiently. In Proceedings of the International Workshop on Parallel and Distributed Computing for Symbolic and Irregular Applications, Vol. 16, pp. 95. Cited by: §5, §5.
  • [48] Processing architecture for power efficiency and performance. Note: https://www.arm.com/why-arm/technologies/big-little Cited by: §1.
  • Z. Radovic and E. Hagersten (2003) Hierarchical backoff locks for nonuniform communication architectures. In Proceedings of the Ninth International Symposium on High-Performance Computer Architecture (HPCA’03), Anaheim, California, USA, February 8-12, 2003, pp. 241–252. External Links: Link, Document Cited by: §1, §1, §2.2, §5.
  • S. Roghanchi, J. Eriksson, and N. Basu (2017) Ffwd: delegation is (much) faster than you think. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17, New York, NY, USA. External Links: ISBN 9781450350853, Link, Document Cited by: §5, §5.
  • [51] SQLite. Note: https://www.sqlite.org/index.html Cited by: Table 1.
  • M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt (2009) Accelerating critical section execution with asymmetric multi-core architectures. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2009, Washington, DC, USA, March 7-11, 2009, M. L. Soffa and M. J. Irwin (Eds.), pp. 253–264. External Links: Link, Document Cited by: §5.
  • [53] sysbench: scriptable database and system performance benchmark. Note: https://github.com/akopytov/sysbench Cited by: footnote 5.
  • [54] The best IoT databases for the edge – an overview and compact guide. Note: https://objectbox.io/the-best-iot-databases-for-the-edge-an-overview-and-compact-guide/ Cited by: §4.2.
  • [55] Upscaledb: embedded database technology. Note: https://upscaledb.com/ Cited by: Table 1.
  • J. Wamhoff, S. Diestelhorst, C. Fetzer, P. Marlier, P. Felber, and D. Dice (2014) The TURBO diaries: application-controlled frequency scaling explained. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), Philadelphia, PA, pp. 193–204. External Links: ISBN 978-1-931971-10-2, Link Cited by: §2.1, §5.
  • A. Wang, S. Venkataraman, S. Alspaugh, R. H. Katz, and I. Stoica (2012) Cake: enabling high-level SLOs on shared storage systems. In ACM Symposium on Cloud Computing, SOCC ’12, San Jose, CA, USA, October 14-17, 2012, M. J. Carey and S. Hand (Eds.), pp. 14. External Links: Link, Document Cited by: §1, §3.1, §3.1, §4.2, §5.
  • X. Yang, S. M. Blackburn, and K. S. McKinley (2016) Elfen scheduling: fine-grain principled borrowing from latency-critical workloads using simultaneous multithreading. In 2016 USENIX Annual Technical Conference, USENIX ATC 2016, Denver, CO, USA, June 22-24, 2016, A. Gulati and H. Weatherspoon (Eds.), pp. 309–322. External Links: Link Cited by: §3.1, §5.
  • [59] YCSB core workloads. Note: https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads Cited by: §4.2.
  • R. M. Yoo, A. Romano, and C. Kozyrakis (2009) Phoenix rebirth: scalable MapReduce on a large-scale shared-memory system. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization, IISWC 2009, October 4-6, 2009, Austin, TX, USA, pp. 198–207. External Links: Link, Document Cited by: §4.2.
  • M. Zhang, H. Chen, L. Cheng, F. C. M. Lau, and C. Wang (2017) Scalable adaptive NUMA-aware lock. IEEE Trans. Parallel Distributed Syst. 28 (6), pp. 1754–1769. External Links: Link, Document Cited by: §5, §5.
  • T. Zhu, M. A. Kozuch, and M. Harchol-Balter (2017) WorkloadCompactor: reducing datacenter cost while providing tail latency SLO guarantees. In Proceedings of the 2017 Symposium on Cloud Computing, SoCC 2017, Santa Clara, CA, USA, September 24-27, 2017, pp. 598–610. External Links: Link, Document Cited by: §1, §3.1, §4.2, §5.