Stretching the capacity of Hardware Transactional Memory in IBM POWER architectures

03/06/2020
by Ricardo Filipe, et al.
University of Lisbon

The hardware transactional memory (HTM) implementations in commercially available processors are significantly hindered by their tight capacity constraints. In practice, this renders current HTMs unsuitable to many real-world workloads of in-memory databases. This paper proposes SI-HTM, which stretches the capacity bounds of the underlying HTM, thus opening HTM to a much broader class of applications. SI-HTM leverages the HTM implementation of the IBM POWER architecture with a software layer to offer a single-version implementation of Snapshot Isolation. When compared to HTM- and software-based concurrency control alternatives, SI-HTM exhibits improved scalability, achieving speedups of up to 300% on the TPC-C benchmark.


1. Introduction

In the quest for scalability, in-memory databases (IMDBs) offering weak consistency guarantees like Snapshot Isolation (SI) (Berenson et al., 1995) are increasingly prominent within the database landscape. On the one hand, the in-memory nature of IMDBs minimizes (or even eliminates) disk access latency to achieve faster and more predictable performance (Zhang et al., 2015; Tu et al., 2013). On the other hand, weak consistency models alleviate many concurrency bottlenecks that characterize serializable systems (Berenson et al., 1995). Today, popular databases like HyPer (Neumann et al., 2015), SAP HANA (Lee et al., 2013), solidDB (Lindström et al., 2013) and Hekaton (Diaconu et al., 2013) combine the virtues of both trends, relying on weakly consistent IMDB designs.

At first glance, the recent emergence of hardware transactional memory (HTM) support in commercially available processors such as Intel Core and IBM POWER might seem like a perfect match to push the current generation of IMDBs to new performance levels. However, the limited capacity of HTM implementations (Diegues et al., 2014; Goel et al., 2014; Nakaike et al., 2015) is incompatible with many real-world OLTP/OLAP workloads, whose access footprints are often much larger than the reduced capacity of existing HTM implementations (Leis et al., 2014).

To work around this crucial obstacle, recent works either propose modifications to HTM for SI support over a multi-versioned memory layout that eliminates the capacity limits (Litz et al., 2014; Chen et al., 2017), or exploit HTM as an auxiliary mechanism to accelerate software-based concurrency control schemes (Leis et al., 2014). However, while the former depends on hardware that is not yet available, the latter exploits HTM only to a limited extent. To the best of our knowledge, the vision of IMDBs that rely on HTM transactions as a first-class mechanism to run and synchronize each transaction has yet to be realized in practice.

This paper proposes SI-HTM, the first solution that achieves such a goal by relying on a commercially available HTM implementation – the HTM support originally introduced in the IBM POWER8 architecture and continued in the most recent IBM POWER9 (IBM, 2018). Hereafter, we will denote this HTM implementation as P8-HTM. The key novelty of SI-HTM is that, with no hardware modification to P8-HTM, SI-HTM is able to support SI-equivalent guarantees while relying on the hardware to detect conflicts and abort transactions.

Intuitively, SI-HTM constructs a restricted SI implementation by combining two building blocks: i) rollback-only transactions (ROTs), a complementary mode available in P8-HTM that is originally aimed at the speculative execution of code blocks that do not manipulate shared data (IBM, 2018); and ii) a software-regulated quiescence phase that is added before the hardware commit to ensure that a transaction only commits once it is certain that its execution is compliant with SI semantics.

As we describe later, this hybrid software-hardware mechanism is able to substantially stretch the capacity bounds of the hardware transactions that can run on P8-HTM, with no software instrumentation of memory reads and writes. SI-HTM eliminates capacity bounds on transactions' read sets; only their write sets remain bounded by the HTM capacity. Since many IMDB workloads are dominated by read-only and read-dominated transactions with few writes, SI-HTM is typically able to run the vast majority of transactions in the HTM fast path.

Breaking the tight capacity bounds of the original P8-HTM contributes to important improvements in the scalability that P8-HTM can attain, for two distinct reasons. First and foremost, with SI-HTM, fewer transactions abort due to exceeding the HTM capacity. This means fewer situations that require falling back to the sequential fall-back path. The second improvement is related to the notable power of simultaneous multi-threading (SMT) on the POWER8 and POWER9 architectures, which are able to run up to 8 hardware threads on each core. As acknowledged by previous studies on P8-HTM (Nakaike et al., 2015), this SMT feature is practically incompatible with HTM programs, since the already scarce HTM capacity becomes shared among the co-located SMT threads. By stretching the capacity of P8-HTM, SI-HTM enables SMT-friendly transactional workloads to achieve speed-ups at multi-SMT levels, thus enabling parallelism in scenarios that are typically strongly adverse to HTM-based applications.

This paper has three main contributions:

  • We propose SI-HTM, a restricted implementation of SI for P8-HTM.

  • We experimentally evaluate SI-HTM on a real IBM POWER8 server, both with a synthetic benchmark and the TPC-C benchmark (Council, 2010), which is serializable under SI. When compared to HTM-based concurrency control alternatives, SI-HTM exhibits speedups of up to 300% on TPC-C.

  • We prove that any execution history that SI-HTM allows is correct under SI. An important corollary is that any application that is serializable under SI is also serializable on SI-HTM.

The remainder of this paper is organized as follows. Section 2 provides the background on SI and on the features of P8-HTM. Section 3 describes SI-HTM. Section 4 evaluates SI-HTM, comparing its performance to relevant alternatives. Section 5 surveys related work. Finally, Section 6 concludes and describes future work.

2. Background

In this section we start by introducing the basic notions of SI. We then describe the HTM support in the IBM POWER architecture (which we call P8-HTM), emphasizing its limitations and showing why it is not trivial to obtain weak semantics like SI when using this HTM.

2.1. Snapshot Isolation

SI is a widely used correctness criterion in databases. Intuitively, SI allows each transaction to read from and write to its own private isolated snapshot of data (Berenson et al., 1995; Fekete et al., 2005; Cerone and Gotsman, 2016).

Each transaction’s snapshot is created when the transaction starts (or, alternatively, when it performs its first read). Hence the snapshot holds the committed values that were valid at that moment.

Each transaction’s snapshot is isolated from the writes of concurrent transactions. More precisely, each write that an active transaction performs on its snapshot is only visible to that transaction; other concurrent transactions will not observe such write when reading from their own snapshots. This means that SI is typically implemented in a multi-versioned approach, since different transactions reading from the same location at the same time may observe different versions. It is only when a transaction commits that its writes become (atomically) visible to any new transaction whose snapshot is created afterwards.

SI aborts transactions in the presence of write-write conflicts. More precisely, a transaction is only allowed to commit if its write set does not overlap with the write set of any other (concurrent) transaction that has committed after it started. In contrast, SI tolerates read-write conflicts. Therefore, two transactions may commit even if one transaction's read set overlaps with the other transaction's write set.
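Stated compactly (in our own notation, not the paper's), a transaction $T$ may commit only if its write set $WS(T)$ is disjoint from the write set of every other transaction whose Commit-Timestamp falls within $T$'s execution interval:

$$\forall T' \neq T:\ \mathit{commit}(T') \in [\,\mathit{start}(T),\ \mathit{commit}(T)\,] \;\Rightarrow\; WS(T) \cap WS(T') = \emptyset$$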

Figure 1. Example of SI semantics.

To illustrate SI semantics, consider the example in Figure 1. Since the concurrent transactions read from their own snapshots, they are isolated from the writes that the others perform. The transactions whose write sets do not overlap incur no write-write conflict and can safely commit under SI. Only the transaction whose write set overlaps with that of an already committed concurrent transaction has to abort. Note, however, that SI allows the remaining transactions to commit, although their execution is not serializable.

The weaker (than serializable) guarantees of SI enable efficient concurrency control implementations and improved concurrency, especially for read-only or read-dominated transactions, while avoiding most isolation anomalies (Berenson et al., 1995). These advantages have quickly rendered SI a mainstream consistency guarantee in the database domain (Ports and Grittner, 2012) and, more recently, in IMDBs (Neumann et al., 2015; Lee et al., 2013; Lindström et al., 2013; Diaconu et al., 2013) and distributed transactional systems (Salomie et al., 2011; Zamanian et al., 2017).

SI can still yield a few anomalies, most notably the write skew anomaly (Fekete et al., 2005). An example of a write skew is when two transactions start from a common snapshot, each one writes to a different location, and later each transaction reads from the location written by the other transaction. While some application semantics naturally tolerate the write skew anomaly, other applications may suffer from unexpected behavior in the presence of write skews. Fortunately, recent tools and methodologies have been proposed to detect and remove write skews (Fekete et al., 2005), with the goal of ensuring serializable executions even when the program runs under SI. One common fix is read promotion (Fekete et al., 2005; Litz et al., 2014): the problematic reads are also inserted into the transaction's write set, which ensures that a write skew triggers an abort.
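To make read promotion concrete, the following is a minimal sketch in C, using a hypothetical tx_* API (not the paper's interface). Two concurrent withdrawals that each read both accounts but write only their own may both commit under plain SI (a write skew); writing back the unchanged value read from the other account places that read in the write set, so SI detects a write-write conflict and aborts one of the two transactions.

/* Hypothetical transactional API (declarations only; any real TM exposes
 * its own equivalents). */
typedef struct { long balance; } account_t;

long tx_read(account_t *a);           /* assumed transactional read        */
void tx_write(account_t *a, long v);  /* assumed transactional write       */
void tx_begin(void);
void tx_end(void);

void withdraw(account_t *mine, account_t *other, long amount) {
    tx_begin();
    long m = tx_read(mine);
    long o = tx_read(other);
    if (m + o >= amount) {
        tx_write(other, o);           /* read promotion: write back the     */
                                      /* unchanged value that was read      */
        tx_write(mine, m - amount);   /* the actual withdrawal              */
    }
    tx_end();
}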

2.2. P8-HTM

P8-HTM detects conflicts by adopting a 2-Phase Locking (2PL) scheme at the granularity of a cache line. In P8-HTM's 2PL scheme, the last transaction to read from some shared variable kills the execution of any previous writer transaction on that same variable. In the case of write-write conflicts, the last writer is killed.

P8-HTM can only handle transactions whose read and write sets fit into each core's transactional buffer. In the IBM POWER8 and POWER9 processors, this buffer is called the TMCAM and consists of a content-addressable memory linked with the L2 cache, shared by eight hardware threads (IBM, 2018). Since the TMCAM is 8 KB in size and tracks accesses at the granularity of 128-byte cache lines, the available capacity for the transactions running on the core(s) sharing a TMCAM is up to 64 cache lines.

One of the main features of the POWER architecture is its extensive use of Simultaneous Multi-Threading (SMT), which supports the execution of up to 8 threads per core (SMT-1, 2, 4 or 8). When multiple threads run in SMT mode on the same core, they share the hardware resources available to that core. It is worth noting that SMT and P8-HTM are, in practice, conflicting features, since the TMCAM is shared among co-located SMT threads. Therefore, a transactional program that takes advantage of SMT inherently reduces the capacity available to hardware transactions, which degrades the effectiveness of P8-HTM. More recently, POWER9 introduced an additional 512 KB read-tracking structure, called the L2 LVDIR, shared among two cores and intended to support transactions with larger read sets (IBM, 2018); however, the L2 LVDIR can only be used by up to two threads at any given time, which essentially makes it incompatible with workloads with large transactions that wish to use SMT. Not surprisingly, we are not aware of any paper that has proposed a transactional system that exploits P8-HTM and achieves consistent speed-ups when running in SMT scenarios.

When running transactions in P8-HTM, the program can resort to an advanced suspend-resume mechanism. When a transaction is suspended, all its subsequent operations are executed non-transactionally and are thus not tracked in the TMCAM. When the program eventually resumes the transaction, any transactional conflicts that were detected during the suspend-resume interval take effect (and the transaction aborts). This is a useful mechanism for programs that need to access control variables within a transaction's lifetime without aborting due to conflicts on those control variables.
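As an illustration, the sketch below shows the suspend-resume pattern in C. HTMSuspend/HTMResume are assumed wrapper names around the POWER tsuspend./tresume. instructions (the paper does not prescribe this API), and the shared counter is purely illustrative.

extern void HTMSuspend(void);   /* assumed wrapper around tsuspend. */
extern void HTMResume(void);    /* assumed wrapper around tresume.  */

extern volatile long tx_started_count;     /* shared control variable */

static void bump_tx_counter(void) {
    HTMSuspend();               /* subsequent accesses run non-transactionally */
    __atomic_fetch_add(&tx_started_count, 1, __ATOMIC_RELAXED);
    HTMResume();                /* back to transactional mode; any conflicts  */
                                /* detected meanwhile take effect on resume   */
}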

Another advanced feature of P8-HTM is the support for a special kind of transaction called a rollback-only transaction (ROT). In this mode, the TMCAM only tracks writes (in fact, due to implementation-specific reasons, the TMCAM can also track a small fraction of reads in a ROT (IBM, 2016)), while reads are performed as if they were not inside a transaction. This difference has two key consequences when we compare the behavior of ROTs and regular HTM transactions. First, reads no longer consume HTM capacity. Since reads are usually the most prevalent operation inside a transaction, the capacity of ROTs is massively improved relative to regular HTM transactions.

Figure 2. Example A: write-after-read conflict is tolerated by ROTs. Example B: read-after-write conflict causes the writer ROT to abort.

The second consequence is that, while concurrent ROTs can still abort due to write-write conflicts, some read-write conflicts are not guaranteed to be (and in general are not) detected, hence will not lead to aborts. Example A in Figure 2 shows a write-after-read conflict between two ROTs that is tolerated. However, it should be noted that ROTs can still abort due to read-after-write conflicts. As illustrated in example B of Figure 2, if a ROT writes to a location and a concurrent ROT later reads from the same location, the writer's entry in the TMCAM is invalidated and the writer thus aborts. Evidently, the weak semantics of ROTs do not guarantee serializability. For that reason, the official documentation of P8-HTM clearly states that ROTs should only be used with code blocks that exclusively access thread-local data and may benefit from the ability to roll back their updates (IBM, 2018) – hence the term “rollback-only transactions”.

As we show next, ROTs can actually be used for a different purpose than intended and constitute a fundamental building block for implementing a restricted form of hardware-supported SI in SI-HTM.

3. SI-HTM

The goal of SI-HTM is to build an SI implementation directly supported by HTM. The gains of the SI-HTM design are two-fold: first, to benefit from the fast transactional execution that an HTM delivers when directly handling memory accesses and conflicts; second, to take advantage of SI to avoid the tight capacity restrictions of HTM. Recent proposals have shown how to achieve this goal through modified hardware (Litz et al., 2014; Chen et al., 2017). The key novelty of SI-HTM is that it relies on the commercially available IBM POWER8 and 9 architectures, hence SI-HTM is ready to run on off-the-shelf hardware.

Accomplishing this design requires overcoming important challenges. First, the available plain HTM transactions impose rigid strong semantics and capacity limits. SI-HTM relies on ROTs as the main building block to execute transactions. This inherently enables SI-HTM transactions to be capacity-bounded only by their write-sets, which represents a decisive advantage with read-dominated and read-only transactions. However, executing each transaction as a ROT that accesses shared memory is unsafe, as it may yield serious anomalies that SI disallows.

To prevent the ROT-induced anomalies, the hardware ROT support needs to be complemented with software-level instrumentation that enforces the hardware-supported execution to circumvent those anomalies. The second challenge is, then, to ensure that such software instrumentation has a reduced impact on the runtime performance of HTM. Ideally, memory accesses should be handled directly by the HTM; any code instrumentation should only be allowed (and minimized) on the begin and commit stages – especially for read-only transactions, which dominate many workloads.

Non-instrumented transactional accesses imply that each transaction will directly access the cache-coherent memory, which is single-version (i.e., two transactions that read from the same location simultaneously observe the same value). This is incompatible with the original definition of SI, which relies on a multi-versioned scheme to allow concurrent transactions to access distinct isolated snapshots. An example is given in Figure 1, where, after a transaction has written to a location in its local snapshot, that transaction and a concurrent transaction observe different values when they read from that location. The third and last challenge is, then, how to implement SI on a single-version memory system.

Since building a multi-versioned memory would require significant software instrumentation on memory accesses, SI-HTM follows a different approach: SI-HTM relies on the single-version memory system, which keeps transactional accesses non-instrumented, and restricts the allowed executions to those that, under SI rules, would not require keeping older data versions. For instance, recalling the above example from Figure 1, a correct execution (under SI) implies that the writer and a concurrent reader observe different values when accessing the same memory location. Since a single-version memory cannot supply both values, SI-HTM enforces that one of the two contending transactions aborts.

The next sections describe SI-HTM in detail. Section 3.1 starts by discussing the semantics of encapsulating transactions in ROTs, pointing out possible anomalies that are not accepted under SI. Section 3.2 then complements ROTs with the necessary software instrumentation to ensure that allowed executions are correct under SI. Section 3.3 then describes how read-only transactions may be optimized.

We describe SI-HTM as a support for general-purpose transactional memory programs that, within each transaction, may read and write pre-allocated memory locations, indexed by their virtual address. Among other uses, SI-HTM can be integrated as a concurrency control mechanism in IMDBs, including IMDBs that store named records accessed through a set-oriented language (like SQL), making use of efficient indexes (Tu et al., 2013).

3.1. ROTs as the building block of SI-HTM

SI-HTM encapsulates each transaction in a ROT by preceding the transaction's code with an HTMBeginROT instruction and committing the transaction with HTMEnd. Yet, as Section 2 discussed, the semantics of ROTs are unsuitable for transactional programs, as they may yield serious correctness anomalies. Still, it is also true that ROTs implicitly ensure some key consistency properties that are shared with SI. Namely:

  • Since P8-HTM keeps track of each ROT’s write set, the underlying hardware 2PL implementation detects write-write conflicts and resolves them by aborting (at least) one of the contending ROTs. This implicitly satisfies SI’s restriction that, when two concurrent transactions have overlapping write sets, one of them should not be allowed to commit.

  • Executions where a ROT writes to a location that has previously been read (and not written) by an ongoing concurrent ROT are not treated as conflicts. While this was not allowed by the serializable 2PL implementation of plain HTM, it is allowed under SI. Of course, as Section 2 discusses, ROTs still treat some read-write situations as conflicts. This reflects the fact that SI-HTM is a single-version implementation of SI.

Figure 3. Example of two concurrent ROTs contending on shared data with an anomaly that is not allowed under SI

Therefore, by running transactions as ROTs, SI-HTM gets the above SI properties for free from the hardware. However, using ROTs to encapsulate transactions that concurrently access shared data may yield dirty read anomalies (American National Standards Institute, 1992), which are not allowed under SI (Berenson et al., 1995).

To illustrate these anomalies, consider the example in Figure 3. In this example, two concurrent ROT-encapsulated transactions access a shared variable. Since the writer updates the variable after the other transaction has read it (a write-after-read case), no conflict is detected, hence both ROTs are allowed to continue running. However, since the write is performed on the actual shared location, it is not isolated in the writer's conceptual snapshot, as SI mandates. Instead, the write becomes visible to the reader when it later reads the variable again. Recall that, in an execution that is correct under SI, the reader's second read should return the value that was committed when its (isolated) snapshot was initially created – clearly, the above execution with ROTs results in a dirty read, which violates the requirement of isolated snapshots in SI.

The next section shows how SI-HTM prevents dirty reads on ROT-encapsulated transactions.

3.2. Base algorithm

Figure 4. Examples of the safety wait as a means to prevent dirty read anomalies. Example A: a dirty read is prevented by having the writer wait until the reader performs the problematic read. Example B: after a safety wait, the writer commits without causing dirty reads.

Recall the example in Figure 3 illustrating the dirty read anomaly induced by encapsulating transactions in ROTs. The key insight behind SI-HTM is that, if the writer ROT had waited for a sufficiently long time before issuing the HTMEnd instruction, the anomaly would not have occurred.

Suppose that the writer had waited until the reader had issued its last read, as Figure 4 illustrates in example A. Since this read targets a location that is currently in the writer's write set, the read invalidates the writer's entry in the TMCAM and the writer aborts. Consequently, the write is rolled back, and the reader reads the original (and correct) value. Therefore, this waiting prevents the dirty read anomaly by aborting the writer transaction.

It should be noted that the writer can only be sure that it can safely commit without incurring dirty read anomalies after the concurrent reader has performed its last access (and the writer has survived each of the reader's accesses). Example B in Figure 4 illustrates an alternative execution where the reader does not read from locations written by the writer and, thus, the writer can safely commit after waiting for the reader's last read.

We should remark that, in both examples in Figure 4, there is a cost to pay for correctness: in example A, the writer transaction aborts; in example B, the writer transaction spends significant time spinning. As we describe next, SI-HTM trades these costs for important improvements in capacity. As Section 4 evaluates, the gains clearly outweigh these costs.

More generically, when a ROT-encapsulated transaction that wrote to a given location completes (i.e., performs its last memory access before entering the commit phase), it should wait until any active transaction that read that location before the update is guaranteed to have performed its last read of it. If the writer only issues HTMEnd after that condition is guaranteed, then none of its writes will induce dirty reads on any concurrent transaction.

However, precisely determining the earliest instant when such a guarantee is met is impractical. First, it is usually not possible to predict the future reads of a transaction. Second, determining which ROTs have previously read a given location would imply keeping track (at the software level) of each read, requiring prohibitive read instrumentation. Hence, SI-HTM adopts a conservative approach: the completing writer assumes that every active ROT may have read from at least one location of its write set; moreover, it waits until each such concurrent transaction has completed and will thus not issue any more memory requests. When this condition is satisfied, we say that the writer is safe to commit.

1: int state[1..N] ← {inactive, …, inactive}
2:
3: function TxBegin()
4:       state[myTid] ← current timestamp          ▷ announce as active
5:       sync                                       ▷ full memory barrier
6:       if HTMBeginROT() = success then
7:             return
8:       end if
9: end function
10:
11: function TxEnd()
12:       HTMSuspend()                               ▷ update state non-transactionally
13:       state[myTid] ← completed
14:       sync                                       ▷ full memory barrier
15:       HTMResume()
16:       snapshot ← copy of state[1..N]
17:       for each thread i ≠ myTid do
18:             if snapshot[i] > completed then      ▷ i was running a transaction
19:                   wait while state[i] = snapshot[i]
20:             end if
21:       end for
22:       HTMEnd()                                   ▷ hardware commit
23:       state[myTid] ← inactive
24: end function
Algorithm 1 Transaction begin and end

Algorithm 1 shows the software-level instrumentation that SI-HTM adds to implement the safety wait. SI-HTM maintains a shared array where each thread publishes its current state, which can be: not running any transaction (inactive = 0), running a transaction (any value greater than 1), or completed and waiting for a safe commit (completed = 1).

Before starting the ROT on P8-HTM, a transaction announces that it will become active (line 4) by setting its state to the current system timestamp (in clock cycles). Conversely, when a transaction has completed and wishes to commit, it sets its state to completed (line 13) and then waits until every other active transaction leaves the active state. Once such a transaction leaves that state, the corresponding thread's state changes to inactive; eventually the thread starts another transaction and becomes active again with a higher timestamp, later switching to completed, and so forth.

The waiting condition in lines 17-19 spins until one of those options is observed for every other thread. Once that happens, the waiting transaction can finally commit its ROT in P8-HTM and announce itself as inactive (line 23).

Whenever a transaction's state is set, at the beginning of a new transaction (line 4) and after suspending the ROT to mark completion (line 13), we need to ensure that the change propagates to all concurrent threads. To do so, we enforce a full memory barrier (sync) after the state change.

One relevant implementation detail is that all updates to the thread’s entry on the shared state array are performed in non-transactional mode. If these updates happened inside an active ROT, the transactional buffer would be occupied with one unnecessary write and, most importantly, the ROT would abort whenever other transactions read this transaction’s state in the shared array.
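For concreteness, the following C-like sketch mirrors Algorithm 1 under assumed names and state encoding (it is not the artifact's actual code, which lives in POWER8TM/backends/p8tm-si/tm.h). HTMBeginROT, HTMEnd, HTMSuspend and HTMResume stand for the corresponding POWER HTM operations, and read_timebase for a monotonic timestamp source.

#define MAX_THREADS 80
#define INACTIVE    0UL
#define COMPLETED   1UL

extern int  HTMBeginROT(void);            /* returns non-zero on success   */
extern void HTMEnd(void);
extern void HTMSuspend(void);
extern void HTMResume(void);
extern unsigned long read_timebase(void); /* assumed timestamp source       */

/* state[i]: 0 = inactive, 1 = completed, >1 = active (timestamp at begin). */
static volatile unsigned long state[MAX_THREADS];
static __thread int my_tid;

void tx_begin(void) {
    state[my_tid] = read_timebase();   /* announce as active (Alg. 1, line 4) */
    __sync_synchronize();              /* full barrier (sync)                 */
    while (!HTMBeginROT())             /* retry/fall-back handling omitted    */
        ;
}

void tx_end(void) {
    HTMSuspend();                      /* update the state non-transactionally */
    state[my_tid] = COMPLETED;         /* announce completion (line 13)        */
    __sync_synchronize();
    HTMResume();

    unsigned long snap[MAX_THREADS];
    for (int i = 0; i < MAX_THREADS; i++)     /* snapshot of states (line 16)  */
        snap[i] = state[i];

    for (int i = 0; i < MAX_THREADS; i++) {   /* safety wait (lines 17-19)     */
        if (i == my_tid || snap[i] <= COMPLETED)
            continue;                  /* thread was not running a transaction */
        while (state[i] == snap[i])
            ;                          /* spin until that transaction moves on */
    }
    HTMEnd();                          /* hardware commit (line 22)            */
    state[my_tid] = INACTIVE;          /* (line 23)                            */
}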

3.3. Read-only fast path and the fall-back path

SI-HTM has alternative paths besides the algorithm described so far, which Algorithm 2 presents.

1: function SyncWithGL()
2:       state[myTid] ← current timestamp          ▷ announce as active
3:       sync                                       ▷ full memory barrier
4:       if SGL is locked then
5:             state[myTid] ← inactive
6:             wait while SGL is locked
7:             go to 2
8:       end if
9: end function
10:
11: function TxBeginExt(boolean isRO)
12:       if isRO then
13:             SyncWithGL()                         ▷ read-only fast path: no ROT
14:             return
15:       else
16:             while retries−− > 0 do
17:                   SyncWithGL()
18:                   if HTMBeginROT() = success then
19:                         return
20:                   end if
21:             end while
22:             state[myTid] ← inactive              ▷ fall-back path
23:             lock(SGL)
24:             for each thread i ≠ myTid do
25:                   wait while state[i] ≠ inactive
26:             end for
27:       end if
28: end function
29:
30: function TxEndExt()
31:       if SGL is held by myTid then
32:             unlock(SGL)
33:       else
34:             if isRO then
35:                   lwsync                         ▷ order reads before state change
36:                   state[myTid] ← inactive
37:             else
38:                   TxEnd()                        ▷ safety wait and HTMEnd (Alg. 1)
39:             end if
40:       end if
41: end function
Algorithm 2 Extension with SGL and RO paths

One important path is the fast path for read-only transactions. In the context of SI-HTM, we define a read-only transaction as one that performs no writes on shared data locations. Note that a read-only transaction is allowed to update thread-private memory locations (such as its local stack). When a transaction is launched in SI-HTM, an argument specifies whether the transaction is read-only or not. We assume this parameter is set by the programmer or by some automatic tool (e.g. a compiler).

Since read-only transactions perform no shared updates, they are not prone to cause dirty reads. Therefore, they may safely skip the safety wait and immediately commit upon completion (line 34). Of course, read-only transactions still need to announce their state changes, so that other read-write transactions know how to coordinate their safety waits. Finally, when a read-only transaction ends, we must ensure that all shared memory reads were performed before the state is set. This is accomplished using a lightweight barrier (lwsync) issued at line 35.
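A minimal C sketch of this read-only fast path, reusing the assumed names from the sketch in Section 3.2, could look as follows. Read-only transactions never enter a hardware transaction: they only announce themselves in the state array, and at the end they order their reads before the state change with a lightweight barrier.

void tx_begin_ro(void) {
    state[my_tid] = read_timebase();             /* announce as active       */
    __sync_synchronize();                        /* full barrier (sync)      */
    /* ... reads proceed non-transactionally ... */
}

void tx_end_ro(void) {
    __asm__ __volatile__("lwsync" ::: "memory"); /* reads before state set   */
    state[my_tid] = INACTIVE;                    /* no safety wait needed    */
}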

Another alternative path is the fall-back path, taken when a read-write transaction is not able to commit after a number of retries. This can happen for a number of reasons, ranging from transactions whose write sets exceed the available capacity, to frequent aborts under high contention, to transactions issuing instructions that are illegal inside P8-HTM transactions, among others. To ensure progress under such situations, transactions in SI-HTM resort to a traditional single global lock (SGL) fall-back path after having aborted too many consecutive times (line 16). We should note that, upon acquiring the SGL, the transaction cannot proceed immediately, since other concurrent transactions may still be actively running. Hence, the SGL holder first waits until no other active transaction exists (lines 24-26). While the SGL is locked, no other transaction is allowed to proceed. (It should be noted that the early subscription scheme that is usually used with regular HTM, which precludes the initial wait after acquiring the SGL, is not possible in SI-HTM, since read-only transactions run non-transactionally and ROTs do not detect write-after-read conflicts.)

3.4. Correctness

In this section, we show that any execution history that SI-HTM allows is correct under the original definition of SI (Berenson et al., 1995). This implies that any existing application that is serializable under SI may be directly executed under SI-HTM and will retain its correctness. Moreover, it means that the techniques and tools that have been proposed to analyze and fix programs to run under SI without anomalies (e.g., (Fekete et al., 2005; Litz et al., 2014)) are also applicable to SI-HTM.

We note, however, that SI-HTM assumes that any access to shared memory locations is included within an SI-HTM transaction (i.e., between TxBeginExt and TxEndExt). This is similar to the weak atomicity model (Martin et al., 2006), albeit employed in the context of SI. We also remark that SI-HTM prevents inconsistent reads, in the sense that any transaction in SI-HTM (even one that eventually aborts) must see (all and only) the effects of transactions that committed before it started.

We highlight that SI-HTM prevents inconsistent reads even for transactions that eventually abort, which is a stronger guarantee than SI requires. It is straightforward to prove this property by recalling the semantics of P8-HTM. Let us suppose, by contradiction, that some transaction performs an inconsistent read, i.e., reads a value written by some transaction that has aborted or will abort. This would mean that it was able to read a value written by another transaction that had not yet committed, which contradicts P8-HTM's semantics. A consequence of this property is that, for applications that are serializable under SI (i.e., guaranteed to run under SI without incurring SI-related anomalies), SI-HTM will not yield the undesirable side-effects that may arise in TM implementations that allow inconsistent reads (Guerraoui and Kapalka, 2008). Therefore, SI-HTM can safely run such applications in non-sandboxed environments.

To show that any execution history that SI-HTM allows is correct under SI, the following sketch of proof addresses each restriction from the operational definition of SI (Berenson et al., 1995), explaining why SI-HTM satisfies all of them.

R1: Each transaction reads data from a snapshot of the (committed) data as of the time the transaction started, called its Start-Timestamp.

Proof.

Rephrasing the above restriction, any transaction reads from a snapshot that reflects the writes of the most recently committed transactions whose Commit-Timestamps precede its Start-Timestamp. The above restriction is guaranteed if, for any transaction that successfully commits in SI-HTM, we define its Commit-Timestamp as the instant where it completes taking a snapshot of each thread's state (line 16 in Alg. 1), i.e., just before performing its safety wait.

Consider a pair of transactions, a writer Tw and a reader Tr, where: Tw writes to a given location, x, and eventually commits; Tr starts after Tw's commit and reads from x; and Tw is the last transaction to write to x before the Start-Timestamp of Tr. Assume, by contradiction, that Tr observes a version, v, different from the one produced by Tw. This would be possible only in the following three cases:

Case a: v is produced by a not yet committed transaction. This is impossible, since it would imply that Tr read uncommitted values, which is prevented by P8-HTM. Note that this would be a violation of restriction R4, which we address later.

Case b: v is produced by a transaction, T', that committed after Tr started; i.e., the Start-Timestamp of Tr precedes the Commit-Timestamp of T'. Clearly, Tr can only read v after T' has issued HTMEnd, which implies that T' completed its safety wait before Tr read v. However, by hypothesis, Tr's Start-Timestamp is earlier than the Commit-Timestamp of T'; consequently, T' observed that Tr's state was active before T' initiated its safety wait. Therefore, T' could only conclude its safety wait after Tr had completed, which contradicts the hypothesis that Tr read v after T' executed HTMEnd.

Case c: v is produced by a committed transaction, T', whose Commit-Timestamp is earlier than Tw's Commit-Timestamp. By restriction R5 (described later), the only case where both T' and Tw are able to commit is if the Commit-Timestamp of T' precedes the Start-Timestamp of Tw. Since, by hypothesis, both transactions commit, it is easy to prove that Tw's write to x occurred after T' executed HTMEnd (otherwise, a write-write conflict would arise and P8-HTM would abort one of the writers) and, consequently, after the write of v by T'. Since, by hypothesis, Tr starts after the Commit-Timestamp of Tw, Tr will necessarily observe the most recent write (Tw's write), which contradicts case c's hypothesis. ∎

Figure 5. Example illustrating why a transaction's Commit-Timestamp is defined as the time at which it completes reading each thread's state, rather than the time at which HTMEnd occurs.

As a side note, we explain the rationale for defining the Commit-Timestamp as above, instead of as the instant at which HTMEnd is performed (line 22 of Alg. 1). To illustrate why the alternative definition is not appropriate, consider the example in Figure 5: a transaction reads the value that another transaction wrote (and committed) to a location; however, the reader began before the writer executed HTMEnd. Hence, considering the moment at which a transaction performs HTMEnd as its Commit-Timestamp would contradict the previous SI restriction.

R2: A transaction running in SI is never blocked attempting a read as long as the snapshot data from its Start-Timestamp can be maintained.

Proof.

Trivially ensured by P8-HTM’s semantics, which never block upon memory accesses.∎

R3: A transaction’s writes are also reflected in its local snapshot.

Proof.

Since update transactions are executed in ROTs, the semantics of P8-HTM trivially ensure this restriction. ∎

R4: Updates by other transactions active after the transaction Start-Timestamp are invisible to the transaction.

Proof.

Let us assume that a transaction Tw writes to a given location and another transaction, Tr, reads the same location. Further, let us assume that Tw is active (i.e., has not yet reached its Commit-Timestamp, as defined in R1) after Tr started. If Tw still has not executed HTMEnd when the read occurs, then P8-HTM invalidates Tw's write (thus aborting Tw) and, thus, Tr reads the previous value, which satisfies R4.

Alternatively, let us suppose, by contradiction, that Tw had already committed in hardware when Tr reads; then Tr would see Tw's update. However, since Tw had already performed HTMEnd, we know that Tw had previously completed its safety wait. This implies that every transaction that had started before Tw's Commit-Timestamp has already completed; since Tr subsequently performs a read, Tr must have started after Tw's Commit-Timestamp, which contradicts the initial hypothesis. ∎

R5: A transaction T1 can only commit if no other transaction, T2, with a Commit-Timestamp in T1's execution interval [Start-Timestamp, Commit-Timestamp] wrote data that T1 also wrote.

Proof.

Suppose, by contradiction, that transactions T1 and T2 write to a common location, x. Further, suppose, without loss of generality, that T2 commits before T1 does. Since, by hypothesis, T1's Start-Timestamp precedes T2's Commit-Timestamp, T2 must have observed that T1's state was not inactive before T2 entered its safety wait. Therefore, T2 had to wait until T1 issued all its memory accesses and completed (since, by hypothesis, T1 did not abort). This implies that T1 wrote to x before the write to the same location by T2 was committed in hardware, which is a write-write conflict that P8-HTM resolved by aborting either T1 or T2, contradicting the hypothesis that both T1 and T2 commit. ∎

To conclude, we complement the above sketch of proof with correctness arguments for the SGL fall-back path. As Section 3.3 briefly explains, after a transaction acquires the SGL, it waits until ongoing transactions finish (lines 24-26, Alg. 2) before it starts executing. Conversely, any other transaction that tries to start will observe that the SGL is taken and wait (lines 4-7, Alg. 2) until that condition changes. Therefore, when the thread holding the SGL runs its transaction, no other thread has an active transaction and any previous transaction that had issued writes must have already committed or aborted. Consequently, it is easy to show that the previous restrictions also hold in the scenario where one transaction runs in the SGL fall-back path.

4. Evaluation

Figure 6. Hash-map 90% large read-only txs, low (left) and high (right) contention
Figure 7. Hash-map 50% large read-only txs, low (left) and high (right) contention
Figure 8. Hash-map 90% small txs, low (left) and high (right) contention

SI-HTM has distinct features that have the potential to contribute to effective performance improvements. More precisely, when compared with plain HTM, SI-HTM potentially offers the following benefits: i) update transactions with much larger memory footprints may run and commit without exhausting the HTM capacity; ii) read-only transactions run non-transactionally, hence exhibit lower begin/commit overheads and have unlimited capacity; iii) as a corollary of the previous outcomes, it becomes feasible to co-locate more parallel transactions on a common SMT core; iv) since SI-HTM provides weaker correctness guarantees than plain HTM, SI-HTM allows higher concurrency.

The main goal of this evaluation is to understand, for a wide range of scenarios and workloads, the performance and scalability gains of SI-HTM when compared to relevant HTM- and software-based concurrency control mechanisms. In this study, we also aim to evaluate the effective benefit that each factor above (i to iv) contributes to the global outcome of SI-HTM, as well as the real performance cost of SI-HTM's quiescence phase.

In order to answer these questions, we deploy SI-HTM on an IBM POWER8 system with one 8284-22A processor with 10 cores and SMT-8 (i.e., up to 8 hardware threads per core). We use a hash-map micro-benchmark to compare the behavior of SI-HTM with the pure HTM baseline; and TPC-C (Council, 2010) as a real-world application benchmark, which we use to compare SI-HTM with a relevant set of state-of-the-art concurrency control systems.

4.1. Hash-map benchmark

The hash-map benchmark consists of a simple transactional hash-map implementation, where clients can perform lookup, insert and remove operations. A read-only transaction performs a lookup operation and a read-write transaction performs an insert, or a remove operation if the last transaction on that thread was an insert. This synthetic benchmark allows us to study different workload scenarios that cover distinct combinations between the orthogonal dimensions of transaction footprint and contention.
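As an illustration of how these operations might be wrapped in SI-HTM transactions, consider the sketch below (assumed names and C signatures for the Alg. 2 entry points; the actual benchmark code ships with the artifact). Lookups take the read-only fast path, while inserts run as ROT-encapsulated update transactions.

typedef struct node    { unsigned long key; long value; struct node *next; } node_t;
typedef struct hashmap { node_t **buckets; unsigned long nbuckets; } hashmap_t;

extern void TxBeginExt(int isRO);   /* assumed C signatures for Alg. 2 */
extern void TxEndExt(void);

int hashmap_contains(hashmap_t *map, unsigned long key) {
    int found = 0;
    TxBeginExt(1);                                 /* read-only fast path      */
    for (node_t *n = map->buckets[key % map->nbuckets]; n != NULL; n = n->next)
        if (n->key == key) { found = 1; break; }
    TxEndExt();
    return found;
}

/* 'fresh' is a node allocated outside the transaction, so that allocation
 * does not run (and possibly abort) inside the ROT. */
void hashmap_insert(hashmap_t *map, unsigned long key, long value, node_t *fresh) {
    TxBeginExt(0);                                 /* update transaction (ROT) */
    fresh->key = key;                              /* these writes also count  */
    fresh->value = value;                          /* against the ROT's write  */
    fresh->next = map->buckets[key % map->nbuckets];  /* capacity              */
    map->buckets[key % map->nbuckets] = fresh;
    TxEndExt();
}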

Regarding the transaction footprint dimension, the number of elements that initially populate the hash-map can be in one of two modes: a large transaction footprint mode, where the hash-map size is such that each bucket holds, on average, a list of 200 elements (hence, an operation on a key in that bucket may need to read up to 200 cache lines to find the target element, which easily leads transactions to exceed P8-HTM's capacity); and a short transaction footprint mode, where each bucket holds, on average, 50 elements (thus most operations find the target element without exceeding P8-HTM's capacity). Since the available capacity, both in HTM and in SI-HTM, depends on the read/write ratio, we further combine the (large vs. short) transaction footprint dimension with the read/write ratio. In total, we consider 3 scenarios: a large-footprint scenario dominated by read-only transactions (90% read-only vs. 10% update transactions); a large-footprint scenario with 50% read-only vs. 50% update transactions; and a short-footprint scenario that mixes 90% read-only and 10% update transactions. (We omit the short/50%:50% case for space limitations, as it adds no relevant findings.)

Concerning the orthogonal dimension of contention, it can be tuned by choosing different numbers of buckets for the hash-map. We consider two scenarios along this dimension: low contention, where the hash-map has 1000 buckets (hence, concurrent operations on the same bucket are rare); and high contention, where the hash-map has only 10 buckets (frequent operations contend for a common bucket).

We experiment with each possible combination of transaction footprint and contention (six scenarios in total), running each combination with up to 80 threads (10 cores, from no-SMT up to SMT-8 mode).

Figures 6 to 8 present the throughput and the abort rate (discriminated by type) of each experiment, averaged over five runs. Regarding the types of aborts, we distinguish transactional aborts, essentially caused by conflicting accesses to shared memory locations; non-transactional aborts, mostly caused by a locked SGL that kills ongoing transactions (only possible in HTM); and, of course, capacity aborts.

As expected, the largest gains are observed in the large read-only transaction scenarios, with an impressive 576% improvement in peak throughput on the low contention workload. This is the best-case scenario for SI-HTM, where most transactions are read-only and hence run with no capacity bounds. This is in clear contrast with the prohibitive capacity constraints of HTM; such capacity issues quickly escalate into non-transactional aborts due to falling back to the SGL.

In the scenario with 50% update transactions, where half of SI-HTM's transactions run as ROTs with a limited write-set capacity, SI-HTM still proves to be the best approach on the low contention workload, with gains of up to 10% in peak throughput. Again, the main reason is that update transactions rarely abort for capacity reasons. However, on the high contention workload, SI-HTM is not able to surpass regular HTM: because of the quiescence phase, SI-HTM transactions take longer to abort than in HTM, delaying the fall-back to the SGL.

In the scenarios with small transactions, which mostly fall within P8-HTM's capacity bounds, SI-HTM is not able to surpass HTM. The added safety wait delays the execution of update transactions in SI-HTM, on both low and high contention workloads, a cost that is not compensated by relevant reductions in capacity aborts.

Figure 9. TPC-C standard mix with low (left) and high (right) contention
Figure 10. TPC-C read-dominated mix with low (left) and high (right) contention

HTM has historically performed poorly under SMT execution (Diegues et al., 2014; Goel et al., 2014; Nakaike et al., 2015), mostly due to the sharing of scarce hardware resources among SMT threads. Since a transaction's footprint in SI-HTM is limited only by its write set, we expect that, in some workloads, multiple SMT transactions on the same core will finally fit in a shared TMCAM. In fact, we can observe in all low contention scenarios (even with short transactions) that SI-HTM behaves very well with SMT threads, scaling up to 32 threads and only showing signs of exhaustion at 40 threads, when the resources of each core start to be shared by four or more SMT threads. To the best of our knowledge, SI-HTM is the first HTM-based algorithm to consistently exploit the power of SMT in low contention workloads.

4.2. TPC-C benchmark

To evaluate SI-HTM on a real-world application, we use the TPC-C benchmark, with a standard mix of transactions (i.e., -s 4 -d 4 -o 4 -p 43 -r 45) and a read-dominated mix (i.e., -s 4 -d 4 -o 80 -p 4 -r 8). The standard mix is primarily composed of update transactions, roughly half of which have large transactional footprints. We also tested high- and low-contention workloads of both mixes. Figures 9 and 10 present the results for TPC-C under these configurations. We compare SI-HTM not only to HTM but also to P8TM (Issa et al., 2017), an HTM-based design for larger-capacity transactions on P8-HTM (discussed in Section 5), and to Silo (Tu et al., 2013), a software-level optimistic concurrency control scheme for in-memory databases. For a fair comparison, we disable record indexing in Silo and the on-line adaptation of P8TM; this way, our analysis focuses exclusively on the core concurrency control performance of all solutions.

Overall, SI-HTM improves the peak performance of TPC-C's standard mix by 48% on 8 threads, relative to the best alternative (HTM). With 16 threads (SMT-2), the low contention workload still delivers very good results, although SI-HTM starts to show capacity issues, explained by TMCAM sharing and exacerbated at higher thread counts. The improved resource usage of SI-HTM is especially evident on the read-dominated mix, where SI-HTM gracefully scales up to SMT-2 levels, improving peak throughput by 27% over the best alternative (P8TM) and by 300% over base HTM. Still, the performance degrades in SMT-4 and SMT-8 modes, since much of a core's hardware is shared among the multiple SMT threads of that core.

5. Related Work

Hybrid transactional memory (TM) designs (Damron et al., 2006; Kumar et al., 2006; Dalessandro et al., 2011; Matveev and Shavit, 2015) are the most visible effort to address the capacity issue in HTM-based programs. Hybrid TMs fall back to software-based TM (STM) implementations when transactions cannot successfully execute in hardware. In contrast to SI-HTM, hybrid TMs do not change the capacity limits of the HTM; rather, they aim at providing scalable STM fall-back paths.

Postponing a transaction's commit to a moment where the system state ensures that committing will not result in correctness anomalies is not a concept new to SI-HTM. In another context, a notable example of a solution that relies on a similar technique is the read-copy-update (RCU) synchronization mechanism (Mckenney and Slingwine, 1998). Like SI-HTM, RCU allows multiple read-only threads to read directly from shared memory by having writer threads update a snapshot that is only committed at a later time, when safety is ensured. The RCU mechanism is implemented entirely in software and has been deployed at user and kernel level (Guniguntala et al., 2008). It requires the programmer to explicitly provide dedicated code to create snapshots of the objects to update and to ensure consistent pointers to the right snapshot. Read-Log-Update (RLU) extended RCU to allow a simpler programming model and higher concurrency between readers and writers (Matveev et al., 2015), by relying on techniques borrowed from the world of STM.

The works that are technically closest to SI-HTM are HERWL (Felber et al., 2016), SpRWL (Issa et al., 2018) and P8TM (Issa et al., 2017). These works exploit the principle of making writers commit only when it is safe to do so. However, their aim is clearly distinct from SI-HTM's, as they all offer strong consistency guarantees to programs based on read-write locks (HERWL and SpRWL) and in-memory transactions (P8TM). One direct consequence of this key distinction is that none of them relieves update transactions from the cost of having their read sets tracked by the HTM (in HERWL and SpRWL) or by costly software instrumentation of each read (in P8TM). In contrast, since SI-HTM aims at weaker consistency guarantees, it is able to completely free update transactions from read tracking. This fundamental advantage allows SI-HTM to clearly outperform P8TM for applications that are correct under SI, as Section 4 shows.

In order to mitigate the worst-case scenarios where the quiescence phase yields prohibitive latencies, P8TM proposes self-tuning techniques that revert to the baseline HTM support in unfavourable scenarios. With minor adaptations, such self-tuning techniques may be incorporated in SI-HTM to improve its performance in some of the unfavourable scenarios that Section 4 identified.

An alternative direction to mitigate the constraints of HTM capacity is to provide the weaker correctness guarantee of SI. Two recent efforts have proposed new (or modified) hardware support to implement SI on HTM. Litz et al. (Litz et al., 2014) propose a multi-versioned memory architecture that implements SI for transactional programs running on parallel CPUs, while Chen et al. (Chen et al., 2017) pursue the same goal on GPUs, proposing a multi-versioned memory subsystem together with an online method for eliminating the write skew anomaly on the fly. When compared to SI-HTM, these systems allow higher degrees of concurrency, as they constitute full SI implementations. Further, the fact that they rely on a multi-versioned memory system obviates the need for (and the cost of) SI-HTM's quiescence phase. Nevertheless, SI-HTM is ready to use on commercially available systems, which distinguishes it from these recent attempts to combine SI and HTM.

On the software side, Litz et al. (Litz et al., 2016) implement SI by manipulating virtual memory mappings and using a copy-on-read mechanism with a customized page cache. Riegel et al. (Riegel et al., 2006) created an STM approach to SI using a lazy multi-version mechanism. These approaches require instrumenting the read and write operations of transactions, thus encumbering them relative to our HTM-based implementation. Litz et al. (Litz et al., 2015) present a technique to automatically correct SI anomalies; our work can also benefit from such a technique when deploying a new workload that has not been tested on SI systems.

IMDBs are among the domains where HTM capacity limits constitute a major obstacle to adopting HTM as a replacement for current software-based concurrency control schemes (Leis et al., 2014). Several previous proposals have leveraged HTM for concurrency control in IMDBs. Leis et al. (Leis et al., 2014) use HTM transactions to run individual portions of a large transaction, with substantial code instrumentation. Wang et al. (Wang et al., 2014) leverage a software-based optimistic concurrency control mechanism with an optimized HTM-based commit stage. Wu et al. (Wu and Tan, 2016) adopt a similar HTM-assisted strategy, in this case using HTM transactions to perform pre-commit validation and writes to individual database records. All these proposals use HTM as an auxiliary hardware mechanism to assist a software-based concurrency control scheme. In contrast, SI-HTM relies on HTM transactions – more precisely, on ROTs – as a first-class construct that runs full individual transactions, with non-instrumented memory accesses.

6. Conclusions

SI-HTM leverages the HTM features of the IBM POWER architectures with a software-based safety wait before commit to offer a restricted implementation of Snapshot Isolation. As a main outcome, SI-HTM dramatically stretches the capacity bounds of the underlying HTM with no hardware modifications, thus boosting the scalability of P8-HTM and opening it to a much broader class of applications, such as large-footprint transactions from the IMDB domain.

Our work emphasizes how important it is for commercially available HTM implementations to expose advanced HTM-related mechanisms like ROTs and suspend-resume to the software layers. SI-HTM shows that such mechanisms can serve as building blocks for sophisticated software-hardware designs that enrich the baseline features.

As future work, we plan to study advanced mechanisms to mitigate the idle waiting time that read-write transactions spend in SI-HTM. Among possible approaches, we envision a killing alternative, where the group of already-completed transactions decides, based on efficient heuristics, to kill the transactions that are taking too long to complete; and a batching alternative, where a completed transaction that predicts a long safety wait uses such idle time to execute one (or more) subsequent transactions. We could also study the feasibility of a software-based SI fall-back path.

7. Acknowledgements

Our thanks go to Pascal Felber for shepherding our paper, to the anonymous reviewers who gave us valuable feedback, and to the Brno University of Technology and the University of Neuchatel for providing us access to their IBM POWER8 machines. This work is partially funded by FCT via projects UID/CEC/50021/2019 and PTDC/EEISCR/1743/2014.

8. Artifact Appendix

8.1. Abstract

Our artifact includes the algorithm described in the SI-HTM paper, the benchmarks used, and scripts to reproduce the paper's results. There are no software dependencies to run our algorithm. The test machine should include an IBM POWER8 processor with at least 10 cores. Plotting scripts are included that produce the graphs presented in the paper; these graphs can be used to validate our results.

8.2. Artifact check-list (meta-information)

  • Algorithm: yes, the main file of our algorithm is:
    POWER8TM/backends/p8tm-si/tm.h

  • Program: hashmap and TPCC, included

  • Compilation: GCC 5+

  • Transformations: no

  • Binary: no

  • Data set: none

  • Run-time environment: Linux

  • Hardware: IBM POWER8 processor

  • Run-time state: no other processes sharing the same cores as SI-HTM

  • Execution: specific thread pinning, two hours

  • Metrics: Execution time, number of operations, detailed specific abort counters

  • Output: graphs with throughput and abort rate

  • Experiments: compile the TinySTM back-end and run the given scripts

  • How much disk space required (approximately)?: 10 MBytes

  • How much time is needed to prepare workflow (approximately)?: 20 minutes

  • How much time is needed to complete experiments (approximately)?: 2 hours

  • Publicly available?: yes

  • Code/data licenses (if publicly available)?: MIT

  • Workflow frameworks used?: None

8.3. Description

8.3.1. How delivered

10 MB

8.3.2. Hardware dependencies

IBM POWER8 processor with 10+ cores

8.3.3. Software dependencies

Gnuplot

8.3.4. Data sets

None

8.4. Installation

tar -xvf artifact.tgz

cd POWER8TM/stms/tinystm; make

8.5. Experiment workflow

cd si-htm

- In each sub-folder (hashmap; tpcc) you will find scripts to run and plot the results of the respective benchmark

- Create a results folder for each of the benchmarks:

mkdir <benchmark>/results

- Run each benchmark with the absolute path to the POWER8TM directory included in the artifact and the absolute path to the corresponding results directory:

bash run_<benchmark>.sh <POWER8TM-dir> <results-dir>

This creates a <results-dir>/date sub-directory for that run

- Edit each benchmark’s plot.sh script with the absolute path to gnuplot and eps driver

- Edit each benchmark’s default-plots.sh script with the absolute path to your Gnuplot PostScript directory and to the corresponding <results-dir>/date directory

- Plot each benchmark: bash default-plots.sh

8.6. Evaluation and expected result

The results of our artifact should be reproducible, with small variations in the single-digit percentage range. The scripts output graphs into the results/plots folder, which can be compared to those in the paper. The raw data for the plots can be found in the results/summary folder, which includes the run time, the number of transactions executed, and the number of aborts by type.
