The Transactional Memory (TM) abstraction is a synchronization mechanism that allows the programmer to optimistically execute sequences of shared-memory operations as atomic transactions. Several software TM designs [9, 26, 14, 12] have been introduced subsequent to the original TM proposal, which was based in hardware. The original dynamic STM implementation DSTM ensures that a transaction aborts only if there is a read-write data conflict with a concurrent transaction (à la progressiveness). However, to satisfy opacity, read operations in DSTM must incrementally validate the responses of all previous read operations to avoid inconsistent executions. This results in quadratic (in the size of the transaction's read set) step complexity for transactions. Subsequent STM implementations like NOrec and TL2 minimize the impact of incremental validation on performance. NOrec uses a global sequence lock that is read at the start of a transaction and performs value-based validation during read operations only if the value of the global lock has been changed (by an updating transaction) since reading it. TL2, on the other hand, eliminates incremental validation completely. Like NOrec, it uses a global sequence lock, but each data item also has an associated sequence lock value that is updated alongside the data item. When a data item is read, if its associated sequence lock value differs from the value that was read from the global sequence lock at the start of the transaction, then the transaction aborts.
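The contrast between the two validation strategies can be made concrete. The following is a minimal single-threaded sketch of TL2-style O(1) read validation; the names (`committed_write`, `t_read`, `begin_read_only`) are our own illustration, not the TL2 authors' code, and the locking protocol is elided.

```python
# Sketch of TL2-style O(1) read validation (illustrative, not TL2's code).
# Each t-object carries the global-version value it was last written under;
# a reader aborts if that value exceeds its start-time snapshot, avoiding
# incremental re-validation of the whole read set.

class AbortException(Exception):
    pass

global_version = 0   # global sequence lock (committing writers bump it)
versions = {}        # t-object -> global version at its last committed write
values = {}          # t-object -> current value

def begin_read_only():
    return global_version          # snapshot the global version at start

def t_read(obj, start_version):
    if versions.get(obj, 0) > start_version:
        # object written after this transaction started: O(1) check, abort
        raise AbortException(obj)
    return values.get(obj)

def committed_write(obj, value):
    # a committing writer bumps the global version and stamps the object
    global global_version
    global_version += 1
    values[obj] = value
    versions[obj] = global_version
```

A reader that snapshots the global version before a concurrent write sees the old version stamp and aborts on its next read of that object, with no per-entry re-validation.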
In fact, STMs like TL2 and NOrec ensure progress in the absence of data conflicts with O(1) step complexity read operations and invisible reads (read operations which do not modify shared memory). Nonetheless, TM designs that are implemented entirely in software still incur significant performance overhead. Thus, current CPUs have included instructions to mark a block of memory accesses as transactional [24, 1, 18], allowing them to be executed atomically in hardware. Hardware transactions promise better performance than STMs, but they offer no progress guarantees since they may experience spurious aborts. This motivates the need for hybrid TMs in which the fast hardware transactions are complemented with slower software transactions that do not have spurious aborts.
To allow hardware transactions in a HyTM to detect conflicts with software transactions, hardware transactions must be instrumented to perform additional metadata accesses, which introduces overhead. Hardware transactions typically provide automatic conflict detection at cacheline granularity, thus ensuring that a transaction will be aborted if it experiences memory contention on a cacheline. This is at least the case with Intel's Transactional Synchronization Extensions. The IBM POWER8 architecture additionally allows hardware transactions to access metadata non-speculatively, thus bypassing automatic conflict detection. While this has the advantage of potentially reducing contention aborts in hardware, it also makes HyTM implementations harder to prove correct.
Prior work showed that hardware transactions in opaque progressive HyTMs must perform at least one metadata access per transactional read and write. In this paper, we show that in opaque progressive HyTMs with invisible reads, software transactions cannot avoid incremental validation. Specifically, we prove that each read operation of a software transaction in a progressive HyTM must necessarily incur a validation cost that is linear in the size of the transaction's read set. This is in contrast to TL2, which is progressive and has constant-complexity read operations. Thus, in addition to the linear instrumentation cost in hardware transactions, there is a quadratic step complexity cost in software transactions.
We then present opaque HyTM algorithms providing progressiveness for a subset of transactions that are optimal in terms of hardware instrumentation. Algorithm 1 is progressive for all transactions, but it incurs high instrumentation overhead in practice. Algorithm 2 avoids all instrumentation in fast-path read operations, but is progressive only for slow-path reading transactions. We also sketch how some hardware instrumentation can be performed non-speculatively without violating opacity.
Extensive experiments were performed to characterize the cost of concurrency in practice. We studied the instrumentation-optimal algorithms, as well as TL2, Transactional Lock Elision (TLE)  and Hybrid NOrec  on both Intel and IBM POWER architectures. Each of the algorithms we studied contributes to an improved understanding of the concurrency vs. hardware instrumentation vs. software validation trade-offs for HyTMs. Comparing results between the very different Intel and IBM POWER architectures also led to new insights. Collectively, our results suggest the following. (i) The cost of concurrency is significant in practice; high hardware instrumentation impacts performance negatively on Intel and much more so on POWER8 due to its limited transactional cache capacity. (ii) It is important to implement HyTMs that provide progressiveness for a maximal set of transactions without incurring high hardware instrumentation overhead or using global contending bottlenecks. (iii) There is no easy way to derive more efficient HyTMs by taking advantage of non-speculative accesses supported within the fast-path in POWER8 processors.
Roadmap. The rest of the paper is organized as follows. Section 2 presents details of the HyTM model, which extends the model introduced in prior work. Section 3 presents our main lower bound result on the step complexity of slow-path transactions in progressive HyTMs, while Section 4 presents opaque HyTMs that are progressive for a subset of transactions. Section 5 presents results from experiments on Intel Haswell and IBM POWER8 architectures, which provide a clear characterization of the cost of concurrency in HyTMs and study the impact on performance of non-speculative (or direct) accesses within hardware transactions. Section 6 presents related work along with concluding remarks. Formal proofs appear in the Appendix.
2 Hybrid transactional memory (HyTM)
Transactional memory (TM). A transaction is a sequence of transactional operations (or t-operations), reads and writes, performed on a set of transactional objects (t-objects). A TM implementation provides a set of concurrent processes with deterministic algorithms that implement reads and writes on t-objects using a set of base objects.
Configurations and executions. A configuration of a TM implementation specifies the state of each base object and each process. In the initial configuration, each base object has its initial value and each process is in its initial state. An event (or step) of a transaction invoked by some process is an invocation of a t-operation, a response of a t-operation, or an atomic primitive operation applied to a base object along with its response. An execution fragment is a (finite or infinite) sequence of events. An execution of a TM implementation M is an execution fragment where, informally, each event respects the specification of the base objects and the algorithms specified by M.
For any finite execution E and execution fragment E', E·E' denotes the concatenation of E and E', and we say that E·E' is an extension of E. For every transaction identifier k, E|k denotes the subsequence of E restricted to events of transaction T_k. If E|k is non-empty, we say that T_k participates in E. Let txns(E) denote the set of transactions that participate in E. Two executions E and E' are indistinguishable to a set T of transactions if, for each transaction T_k ∈ T, E|k = E'|k. A transaction T_k ∈ txns(E) is complete in E if E|k ends with a response event. The execution E is complete if all transactions in txns(E) are complete in E. A transaction T_k is t-complete if E|k ends with A_k (abort) or C_k (commit); otherwise, T_k is t-incomplete. We consider the dynamic programming model: the read set (resp., the write set) of a transaction T_k in an execution E, denoted Rset_E(T_k) (resp., Wset_E(T_k)), is the set of t-objects that T_k attempts to read (resp., write) by issuing a t-read (resp., t-write) invocation in E (for brevity, we sometimes omit the subscript E from the notation).
We assume that base objects are accessed with read-modify-write (rmw) primitives. A rmw primitive event on a base object is trivial if, in any configuration, its application does not change the state of the object; otherwise, it is called nontrivial. Events e and e' of an execution E contend on a base object b if they are both primitives applied to b in E and at least one of them is nontrivial.
Hybrid transactional memory executions. We now describe the execution model of a Hybrid transactional memory (HyTM) implementation. In our HyTM model, shared memory configurations may be modified by accessing base objects via two kinds of primitives: direct and cached. (i) In a direct (also called non-speculative) access, the rmw primitive operates on the memory state: the direct-access event atomically reads the value of the object in the shared memory and, if necessary, modifies it. (ii) In a cached access performed by a process p, the rmw primitive operates on the cached state recorded in p's tracking set τ_p.
More precisely, τ_p is a set of triples (b, v, m), where b is a base object identifier, v is a value, and m ∈ {shared, exclusive} is an access mode. The triple (b, v, m) is added to the tracking set when p performs a cached rmw access of b, where m is set to exclusive if the access is nontrivial, and to shared otherwise. We assume that there exists some constant C such that the condition |τ_p| ≤ C must always hold; this condition will be enforced by our model. A base object b is present in τ_p with mode m if (b, v, m) ∈ τ_p for some value v.
Hardware aborts. A tracking set can be invalidated by a concurrent process: if, in a configuration where (b, v, exclusive) ∈ τ_p (resp., (b, v, shared) ∈ τ_p), a process p' ≠ p applies any primitive (resp., any nontrivial primitive) to b, then τ_p becomes invalid and any subsequent event invoked by p sets τ_p to ∅ and returns A_k. We refer to this event as a tracking set abort.
Any transaction executed by a correct process that performs at least one cached access must necessarily perform a cache-commit primitive that determines the terminal response of the transaction. A cache-commit primitive issued by process p with a valid τ_p does the following: for each base object b such that (b, v, exclusive) ∈ τ_p, the value of b in shared memory is updated to v. Finally, τ_p is set to ∅ and the operation returns commit. We assume that a fast-path transaction T_k returns A_k as soon as a cached primitive or cache-commit returns A_k.
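Under the definitions above, the cached-access semantics can be sketched as follows. This is our own simplified, sequential model of a tracking set, not an HTM implementation; the names `TrackingSet`, `cached_access`, and `invalidate_on` are ours.

```python
# Sequential model of a per-process tracking set (our simplification):
# entries are (base object -> (value, mode)), invalidated by conflicting
# accesses from other processes, published atomically by cache-commit.

SHARED, EXCLUSIVE = "shared", "exclusive"
CAPACITY = 4                       # the constant C bounding |tau_p|

memory = {}                        # shared memory: base object -> value

class TrackingSet:
    def __init__(self):
        self.entries = {}          # base object -> (value, mode)
        self.valid = True

    def cached_access(self, obj, new_value=None):
        """Cached rmw: trivial if new_value is None, nontrivial otherwise."""
        if not self.valid:
            self.entries.clear()   # tracking set abort
            return "abort"
        if obj not in self.entries and len(self.entries) == CAPACITY:
            self.entries.clear()   # capacity abort (Remark 2)
            return "abort"
        if new_value is None:
            value, mode = self.entries.get(obj, (memory.get(obj), SHARED))
        else:
            value, mode = new_value, EXCLUSIVE
        self.entries[obj] = (value, mode)
        return value

    def invalidate_on(self, obj, nontrivial):
        # A concurrent access to obj invalidates the set if we hold obj
        # in exclusive mode, or hold it at all and the access is nontrivial.
        if obj in self.entries:
            mode = self.entries[obj][1]
            if mode == EXCLUSIVE or nontrivial:
                self.valid = False

    def cache_commit(self):
        if not self.valid:
            return "abort"
        for obj, (value, mode) in self.entries.items():
            if mode == EXCLUSIVE:
                memory[obj] = value    # publish exclusive entries atomically
        self.entries.clear()
        return "commit"
```

Note how a trivial (read-only) access by another process invalidates the set only for exclusively held entries, mirroring the shared/exclusive distinction in the model.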
Slow-path and fast-path transactions. We partition HyTM transactions into fast-path transactions and slow-path transactions. A slow-path transaction models a regular software transaction. An event of a slow-path transaction is either an invocation or response of a t-operation, or a direct rmw primitive on a base object. A fast-path transaction essentially encapsulates a hardware transaction. Specifically, in any execution E, we say that a transaction T_k is a fast-path transaction if E|k contains at least one cached event. An event of a fast-path (hardware) transaction is either an invocation or response of a t-operation, a direct trivial access, a cached access, or a cache-commit primitive.
Remark 1 (Tracking set aborts).
Let T_k be any t-incomplete fast-path transaction executed by process p, where (b, v, exclusive) ∈ τ_p (resp., (b, v, shared) ∈ τ_p) after execution E, and let e be any event (resp., any nontrivial event) applied to b that some process p' ≠ p is poised to apply after E. Then the next event of T_k in any extension of E·e is A_k.
Remark 2 (Capacity aborts).
Any cached access performed by a process p executing a fast-path transaction T_k first checks the condition |τ_p| = C, where C is a pre-defined constant, and if so, it sets τ_p to ∅ and immediately returns A_k.
Direct reads within fast-path. Note that we specifically allow hardware transactions to perform reads without adding the corresponding base object to the process' tracking set, thus modeling the suspend/resume instructions supported by the IBM POWER8 architecture. Note that Intel's HTM does not support this feature: there, an event of a fast-path transaction does not include any direct access to base objects.
HyTM properties. We consider the TM-correctness property of opacity: an execution E is opaque if there exists a legal (every t-read of a t-object returns the value of its latest committed t-write) sequential execution equivalent to some t-completion of E that respects the real-time ordering of transactions in E. We also assume a weak TM-liveness property for t-operations: every t-operation returns a matching response within a finite number of its own steps if it runs step-contention free from a configuration in which every other transaction is t-complete. Moreover, we focus on HyTMs that provide invisible reads: t-read operations do not perform nontrivial primitives in any execution.
3 Progressive HyTM must perform incremental validation
In this section, we show that it is impossible to implement opaque progressive HyTMs with invisible reads and O(1) step-complexity read operations for slow-path transactions. This result holds even if fast-path transactions may perform direct trivial accesses.
Formally, we say that a HyTM implementation M is progressive for a set T of transactions if, in any execution E of M, whenever a transaction T_k ∈ T returns A_k in E, there exists another concurrent transaction that conflicts (both access the same t-object and at least one writes) with T_k in E.
We construct an execution of a progressive opaque HyTM in which every t-read performed by a read-only slow-path transaction must access a linear (in the size of its read set) number of distinct base objects.
Theorem 3. Let M be any progressive opaque HyTM implementation providing invisible reads. There exists an execution E of M in which some slow-path read-only transaction T incurs a time complexity of Ω(m²), where m = |Rset(T)|.
Proof sketch. We construct an execution of a read-only slow-path transaction T that performs m distinct t-reads of t-objects X_1, ..., X_m. We show inductively that for each i ∈ {1, ..., m}, the i-th t-read, read(X_i), must access i−1 distinct base objects during its execution. The (partial) steps in our execution are depicted in Figure 1.
For each i ≤ m, M has an execution of the form depicted in Figure 1(b). Start with the complete step contention-free execution of the slow-path read-only transaction T that performs i−1 t-reads: read(X_1), ..., read(X_{i−1}); this is followed by the t-complete step contention-free execution of a fast-path transaction T_f that writes a new value nv to X_i and commits, and then the complete step contention-free execution fragment of T that performs its i-th t-read: read(X_i). Indeed, by progressiveness, T_f cannot incur tracking set aborts, and since it accesses only a single t-object, it cannot incur capacity aborts. Moreover, in this execution, the t-read of X_i by the slow-path transaction T must return the value nv written by the fast-path transaction T_f, since this execution is indistinguishable to T from the execution in Figure 1(a).
We now construct i−1 different executions of the form depicted in Figure 1(c): for each j < i, a fast-path transaction T_j (preceding T_f in the real-time ordering, but invoked following the t-reads by T) writes a new value nv_j to X_j and commits, followed by the t-read of X_i by T. Observe that T_j and T_f, which access mutually disjoint data sets, cannot contend on each other: if they did, they would concurrently contend on some base object and incur a tracking set abort, thus violating progressiveness. Indeed, by the TM-liveness property we assumed (cf. Section 2) and invisible reads for T, each of these executions exists.
In each of these executions, the final t-read of X_i by T cannot return the new value nv: the only possible serialization for the transactions is T_j, T_f, T; but then the read(X_j) performed by T that returns the initial value of X_j is not legal in this serialization, a contradiction to the assumption of opacity.
In other words, the slow-path transaction T is forced to verify the validity of the t-objects in Rset(T).
Finally, we note that, for all j ≠ j', the fast-path transactions T_j and T_{j'} access mutually disjoint sets of base objects, thus forcing the t-read of X_i to access at least i−1 different base objects in the worst case. Consequently, for all i ≤ m, the slow-path transaction T must perform at least i−1 steps while executing the i-th t-read in such an execution, for a total of Σ_{i=1}^{m} (i−1) = m(m−1)/2 = Ω(m²) steps.
How STM implementations mitigate the quadratic lower bound step complexity. NOrec is a progressive opaque STM that minimizes the average step complexity resulting from incremental validation of t-reads. Transactions read a global versioned lock at the start, and perform value-based validation during t-read operations iff the global version has changed. TL2 improves over NOrec by circumventing the lower bound of Theorem 3. Concretely, TL2 associates a global version with each t-object updated during a transaction and performs validation with O(1) complexity during t-reads, simply by verifying whether the version of the t-object is greater than the global version read at the start of the transaction. Technically, NOrec and the algorithms in this paper provide a stronger definition of progressiveness: a transaction may abort only if there is a prefix in which it conflicts with another transaction and both are t-incomplete. TL2, on the other hand, allows a transaction to abort due to a concurrent conflicting transaction.
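For illustration, the NOrec-style strategy of validating only when the global version has changed can be sketched as follows. This is a single-threaded simplification with names of our choosing; the real algorithm additionally retries validation until the global sequence lock is stable.

```python
# Sketch of NOrec-style value-based validation (illustrative names): a
# reader re-validates its read set only when the global sequence lock has
# changed since it last validated, so uncontended reads cost O(1).

class Abort(Exception):
    pass

global_seq = 0          # incremented by every committing writer
mem = {}                # t-object -> current value

def committed_write(obj, value):
    global global_seq
    global_seq += 1
    mem[obj] = value

class SlowTxn:
    def __init__(self):
        self.snapshot = global_seq      # global version at start
        self.read_log = []              # (obj, observed value) pairs

    def validate(self):
        # value-based validation: every logged read must still hold
        for obj, seen in self.read_log:
            if mem.get(obj) != seen:
                raise Abort(obj)
        self.snapshot = global_seq      # re-validated up to here

    def t_read(self, obj):
        if global_seq != self.snapshot: # memory changed since last check?
            self.validate()             # ...then incrementally re-validate
        value = mem.get(obj)
        self.read_log.append((obj, value))
        return value
```

A disjoint concurrent write still forces one full validation pass (which succeeds, since the logged values are unchanged), while a conflicting write makes validation fail.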
Implications for disjoint-access parallelism in HyTM. The property of disjoint-access parallelism (DAP), in its weakest form, ensures that two transactions concurrently contend on the same base object only if their data sets are connected in the conflict graph, capturing data-set overlaps among all concurrent transactions. It is well known that weak DAP STMs with invisible reads must perform incremental validation, even if the required TM-progress condition requires transactions to commit only in the absence of any concurrent transaction [13, 17]. For example, DSTM is a weak DAP STM that is progressive and consequently incurs this validation complexity. On the other hand, TL2 and NOrec are not weak DAP, since they employ a global versioned lock that mitigates the cost of incremental validation but allows two transactions accessing disjoint data sets to concurrently contend on the same memory location. Indeed, this observation inspires the proof of Theorem 3.
4 Hybrid transactional memory algorithms
| | Algorithm 1 | Algorithm 2 | TLE | Hybrid NOrec |
|---|---|---|---|---|
| Instrumentation in fast-path reads | per-read | constant | constant | constant |
| Instrumentation in fast-path writes | per-write | per-write | constant | constant |
| Validation in slow-path reads | Θ(read set size) | Θ(read set size) | none | Θ(read set size), but validation only if concurrency |
| h/w–s/w concurrency | prog. | prog. for slow-path readers | zero | not prog., but small contention window |
| Direct accesses inside fast-path | yes | no | no | yes |
Instrumentation-optimal progressive HyTM. We describe a HyTM algorithm that matches the lower bound of Theorem 3 as well as the lower bound on the instrumentation cost of fast-path transactions established in prior work. Pseudocode appears in Algorithm 1. For every t-object X, our implementation maintains a base object v_X that stores the value of X and a sequence lock r_X.
Fast-path transactions: For a fast-path transaction T_k executed by process p, a t-read read(X) first reads r_X (directly) and returns A_k if some other process holds a lock on X; otherwise, it returns the value of v_X. As with read(X), a t-write write(X, v) returns A_k if some other process holds a lock on X; otherwise, process p increments the sequence lock r_X. If the cache has not been invalidated, T_k updates the shared memory during tryC_k by invoking the cache-commit primitive.
Slow-path read-only transactions: Any read(X) invoked by a slow-path transaction first reads the value of the t-object from v_X, adds r_X to its read set if it is not held by a concurrent transaction, and then performs validation on its entire read set to check whether any of its entries have been modified. If either of these checks fails, the transaction returns A_k; otherwise, it returns the value of v_X. Validation of the read set is performed by re-reading the values of the sequence lock entries stored in Rset(T_k).
Slow-path updating transactions: An updating slow-path transaction T_k attempts to obtain exclusive write access to its entire write set during tryC_k. If all the locks on the write set were acquired successfully, it performs validation of the read set and, if successful, updates the values of the t-objects in shared memory, releases the locks, and returns C_k; otherwise, it aborts the transaction.
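The slow path just described can be sketched as follows, with per-object sequence locks where an odd value marks the lock as held. This is our illustrative rendering, not the paper's pseudocode; identifiers like `SlowPathTxn` are ours, and the sketch is sequential (the real algorithm acquires locks with atomic rmw primitives).

```python
# Sequential sketch of Algorithm-1-style slow-path reads and commits
# (our naming, not the paper's pseudocode). seqs[x] is the sequence lock
# r_X: an odd value means a writer holds the lock on X.

class Abort(Exception):
    pass

vals = {}   # v_X: current value of each t-object X
seqs = {}   # r_X: per-object sequence lock; odd value means locked

class SlowPathTxn:
    def __init__(self):
        self.rset = {}   # X -> sequence-lock value observed at first read
        self.wset = {}   # X -> value to be written at commit

    def t_read(self, x):
        if x in self.wset:
            return self.wset[x]          # read-own-write
        s = seqs.get(x, 0)
        if s % 2 == 1:                   # locked by a concurrent writer
            raise Abort(x)
        self.rset.setdefault(x, s)
        self.validate()                  # incremental validation (Theorem 3)
        return vals.get(x)

    def t_write(self, x, v):
        self.wset[x] = v

    def validate(self):
        # re-read every sequence lock recorded in the read set
        for x, s in self.rset.items():
            if x in self.wset:
                continue                 # we hold this lock ourselves
            if seqs.get(x, 0) != s:
                raise Abort(x)

    def try_commit(self):
        for x in self.wset:              # acquire write locks (make odd)
            seqs[x] = seqs.get(x, 0) + 1
        try:
            self.validate()
        except Abort:
            for x in self.wset:
                seqs[x] -= 1             # release on failed validation
            raise
        for x, v in self.wset.items():
            vals[x] = v                  # write back the new values
        for x in self.wset:
            seqs[x] += 1                 # release: version advances by 2
        return "commit"
```

Note that every t-read re-validates the whole read set, matching the Ω(read set size) per-read cost that Theorem 3 shows is unavoidable for progressive HyTMs.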
Direct accesses inside fast-path: Note that opacity is not violated even if the fast-path accesses of the sequence lock r_X during t-reads are performed directly, without incurring tracking set aborts.
Instrumentation-optimal HyTM that is progressive only for slow-path reading transactions. Algorithm 2 does not incur the linear instrumentation cost on fast-path reading transactions (inherent to Algorithm 1), but provides progressiveness only for slow-path reading transactions. The instrumentation cost on fast-path t-reads is avoided by using a global lock that serializes all updating slow-path transactions during the tryC_k procedure. Fast-path transactions simply check whether this lock is held, without acquiring it (similar to TLE). While the per-read instrumentation overhead is avoided, Algorithm 2 still incurs the per-write instrumentation cost.
Sacrificing progressiveness and minimizing the contention window. Observe that the lower bound in Theorem 3 assumes progressiveness for both slow-path and fast-path transactions, along with opacity and invisible reads. Note that Algorithm 2 retains the validation step complexity cost, since it still provides progressiveness for slow-path readers.
Hybrid NOrec is a HyTM implementation that does not satisfy progressiveness (unlike its STM counterpart NOrec), but mitigates the step complexity cost on slow-path transactions by performing incremental validation during a transactional read iff the shared memory has changed since the start of the transaction. Conceptually, Hybrid NOrec uses a global sequence lock gsl that is incremented at the start and end of each transaction's commit procedure. Readers can use the value of gsl to determine whether shared memory has changed between two configurations. Unfortunately, with this approach, two fast-path transactions will always conflict on gsl if their commit procedures are concurrent. To reduce the contention window for fast-path transactions, gsl is actually implemented as two separate locks (the second one called esl). A slow-path transaction locks both esl and gsl while it is committing. Instead of incrementing gsl, a fast-path transaction checks whether esl is locked and aborts if it is. Then, at the end of the fast-path transaction's commit procedure, it increments gsl twice (quickly locking and releasing it) and immediately commits in hardware. Although the window for fast-path transactions to contend on gsl is small, our experiments have shown that contention on gsl has a significant impact on performance.