Processing Transactions in a Predefined Order

12/13/2018 ∙ by Mohamed M. Saad, et al. ∙ Alexandria University, Lehigh University, IBM

In this paper we provide a high performance solution to the problem of committing transactions while enforcing a predefined order. We provide the design and implementation of three algorithms, which deploy a specialized cooperative transaction execution model. This model permits the propagation of written values along the chain of ordered transactions. We show that, even in the presence of data conflicts, the proposed algorithms are able to outperform single-threaded execution, and other baseline and specialized state-of-the-art competitors (e.g., STMLite). The maximum speedup achieved in micro benchmarks, STAMP, PARSEC and SPEC2000 applications is in the range of 4.3x to 16.5x.

1. Introduction

Transaction ordering intuitively means considering not just the set of transactions as input of the problem, but also the specific commit order that must be enforced for them. Such a formulation inherently includes a fundamental trade-off between the level of parallelism achievable, given the need of committing in-order, and the performance of the single-threaded execution without any software instrumentation.

Ordering tasks before their execution is a problem mostly relevant to contexts in which executions must be equivalent to a predefined order to satisfy certain properties (e.g., a program semantics equivalent to serial execution). Examples of these deployments include: speculative loop parallelization (Streit et al., 2013; Gonzalez-Mesa et al., 2014; von Praun et al., 2008; Saad et al., 2012, 2018), and distributed computation using the state machine approach (Schneider, 1990; Kapritsos et al., 2012; Hirve et al., 2014).

In the former, loops designed to run sequentially are parallelized by executing their iterations concurrently and guarding memory transactions (e.g., by using Transactional Memory (Herlihy and Moss, 1993) as done in (Gonzalez-Mesa et al., 2014; Saad et al., 2012, 2018)). In that case, providing an order matching the sequential one is fundamental to enforce equivalent semantics for both the parallel and sequential code. Regarding the latter, many distributed systems order tasks before executing them to guarantee that a single state machine abstraction always evolves consistently on distinct nodes. A common example of this methodology is when consensus (e.g., Paxos (Lamport, 1998)) is employed to establish a common order among commands (or transactions) manipulating a single replicated state.

In this paper we focus on Transactional Memory (TM) as a technology to support speculative execution of tasks, and we present three algorithms to process transactions in parallel while enforcing a predefined order: Ordered Write Back (OWB), Ordered Undo Logging (OUL), and a lock-steal variant of OUL (OUL-Steal). These algorithms are based on two widely used techniques to merge transaction modifications into the shared state: write-back (in OWB) and write-through (in OUL and OUL-Steal).

All our implementations deploy a common design that uses a cooperative model, where transactions exchange both data and locks to increase concurrency while preserving the predefined commit order. OWB uses data forwarding for transactions that finish their execution successfully, but are not committed yet, and OUL leverages encounter time locking with the ability to pass the lock ownership to other transactions. Our cooperative model is similar to the dependency aware transactions model (DATM) (Ramadan et al., 2008, 2009). However, DATM is not designed (and thus cannot be optimized) for committing transactions in a predefined order since it tracks all dependencies among transactions and analyzes them at run-time seeking for some correct serialization order instead of the predefined one.

We implement OWB, OUL, OUL-Steal, the ordered version of four existing well-known TM designs (i.e., TL2 (Dice et al., 2006), NOrec (Dalessandro et al., 2010), and UndoLog (Felber et al., 2008) with and without visible readers), and STMLite (Mehrara et al., 2009), a specialized solution that commits transactions in a predefined order. We conduct a performance evaluation using a set of micro-benchmarks, STAMP (Minh et al., 2008), and some applications from the PARSEC and SPEC2000 benchmarks. For determining the transaction order, we use either the index of the application main for-loop that generates the parallel code or an artificial atomic integer that we inserted as transaction order. Results have been compared against the sequential execution of the benchmarks, as well as against their parallel execution, as provided by the original version of the applications.

Results show interesting trends: our OUL outperforms other ordered competitors consistently. In particular, the maximum speedup achieved is 4x over Ordered TL2, 4.3x over Ordered NOrec, 8x over Ordered UndoLog visible, 10x over Ordered UndoLog invisible, and 5.7x over STMLite. Interestingly, the peak gain over the sequential non-instrumented execution is 10x in micro benchmarks, 16.5x in STAMP, more than 10x in PARSEC, and 30% in SPEC2000. Also, there are configurations where the sequential execution of the applications outperforms all ordered competitors, except ours.

OWB ensures TMS1 (Doherty et al., 2013), a weaker consistency condition than opacity (Guerraoui and Kapalka, 2008, 2011), the most popular correctness level for TM. TMS1 has been proved to be sufficient to guarantee safety in our model (Attiya et al., 2014), as is the case with opacity. OUL achieves higher concurrency and therefore higher performance, at the cost of weakening the correctness level by ensuring Strict Serializability (Bernstein et al., 1987).

Finally, it is worth mentioning that OWB, OUL, and OUL-Steal are TM implementations meant to be integrated into runtime systems to support their speculative execution. An example of such a system is Lerna (Saad et al., 2018). In those systems, a sandboxing (Dalessandro and Scott, 2012) mechanism prevents computation exceptions from propagating outside the concurrency control engine.

2. Related Work

Transactional Memory (TM) has emerged as a technique for protecting speculative code (Herlihy and Moss, 1993; Mehrara et al., 2009), and extracting parallelization from sequential code (Streit et al., 2013; Gonzalez-Mesa et al., 2014; von Praun et al., 2008; Saad et al., 2012, 2018). In these cases, when a predefined order is necessary, conflicts are handled by aborting (and re-executing) whichever transaction ran code with the latest chronological ordering. The key idea is that code blocks run as transactions and commit in the program’s original chronological order. The techniques for supporting the aforementioned ordering are classified as: blocking (Gonzalez-Mesa et al., 2014; Mehrara et al., 2009; Von Praun et al., 2007) or freezing (Zhang et al., 2010); a comparison between them is in (Saad et al., 2016).

In the blocking approach, Mehrara et al. (Mehrara et al., 2009) proposed STMLite, a TM with a separate thread, Transaction Commit Manager (TCM), that detects conflicts among transactions waiting to commit. TCM orchestrates the in-order commit process with the ability to have concurrent commits. Worker threads poll and stall to wait for the TCM’s permission. Gonzalez et al. (Gonzalez-Mesa et al., 2014) use a distributed approach for handling the commitment order. Each thread employs a bounded circular buffer to store its completed transactions. If all buffer slots are exhausted, the thread stalls until one of the pending executed transactions reaches the correct commit order.

Another existing solution lets threads freeze completed transactions and proceed to execute the transactions with later chronological order, with the disadvantage of increasing the transaction lifetime (hence, a higher conflict probability). Zhang et al. (Zhang et al., 2010) introduced this technique to support a predetermined total order of transactions. A next-to-commit shared variable is used to preserve this order. Overall, in both the blocking and freezing approaches, ordering transactions’ commits negatively affects the overall resource utilization and may nullify any potential gain due to threads’ parallelism. To overcome this restriction, OWB and OUL limit the stalling periods to only the latency of the commit.

The level of atomicity could be an orthogonal classification for the aforementioned techniques. The classical TM model mandates transactions to see only committed values. However, concurrent transactions can construct a dependency graph of uncommitted values. Based on this graph, the transactions commit in the constructed order. Ramadan et al. (Ramadan et al., 2009) proposed a dependency aware transactions model (DASTM), in which every object keeps track of all transactional reads and writes, and transactions forward their uncommitted changes to other conflicting transactions. Based on these relations, the commit order is defined at run-time by verifying that the constructed conflict graph is acyclic. This check is very expensive, especially when executed during the transaction execution, and leads to performance gain only in the presence of very high conflicts. As opposed to DASTM, OWB, OUL, and OUL-Steal do not maintain the conflict graph because data forwarding is optimized to enforce only the predefined commit order.

Deterministic execution of TM may be seen as a distantly related topic. Recently, Ravichandran et al. (Ravichandran et al., 2014) proposed an STM implementation that improves performance in case of deterministic execution. Deterministic execution is meant for reducing the possible parallelism in the system, whereas our approaches aim at introducing parallelism when a specific commit order is enforced.

Since code parallelization is the main application of our STM implementations, Thread-level Speculation (TLS) (Prabhu and Olukotun, 2003; Oancea et al., 2009) is an immediate related topic. Loop parallelization using TLS has been proposed in both hardware (Prabhu and Olukotun, 2003) and software (Rauchwerger and Padua, 1995; Bhowmik and Franklin, 2002). TLS and TM have been merged through a unified model in (Barreto et al., 2012; Ramaseshan and Mueller, 2008; Raman et al., 2010; Oancea et al., 2009) to get the best of the two techniques. Generally, TLS is a less flexible way to parallelize code than using STM. For example, with STM only some instructions can be instrumented while the others still execute without instrumentation; on the contrary, leveraging TLS means speculating over the entire loop body. To overcome some of the well-known TLS limitations, the work in (Oancea et al., 2009) proposes a software TLS implementation where write operations directly update the non-speculative memory and read-races are tracked using metadata. Some of these intuitions have been ported to STM by OWB, OUL, and OUL-Steal.

3. Execution and Memory Model

Our model assumes a set of transactions $T = \{T_1, T_2, \ldots, T_n\}$. Transactions access shared objects using read and write operations, with their usual meaning (Herlihy and Moss, 1993). We denote the sets of shared objects accessed by transaction $T_i$ for read and write as read-set($T_i$) and write-set($T_i$), respectively.

A transaction execution is defined as a sequence of operations, where each operation is represented by a pair of invoke and return events. Besides the read and write operations, whose semantics is the usual one, it also includes a commit operation that starts by invoking the try-commit event, whose return value is either commit or abort. Note that a transaction can also be aborted before invoking the try-commit event. A transaction that has begun its execution and has not yet invoked the try-commit event is called live. A transaction that has invoked the try-commit event but has not yet committed or aborted is called commit-pending. When a transaction is categorized as committed, it means that all its write operations have been executed permanently on the shared state; and when it is categorized as aborted, its operations have no permanent effect. In both cases, all metadata is cleaned before proceeding or re-executing. Figure 1 summarizes the transaction states.

Figure 1. Transaction execution states in OWB and OUL.
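The transaction states described above can be written down as a small transition table. The following Python sketch is only an illustration derived from the prose: the names `TxState`, `ALLOWED`, and `can_move` are hypothetical, and the edge from aborted back to live is our assumption for modeling re-execution after an abort.

```python
from enum import Enum


class TxState(Enum):
    LIVE = "live"                      # started, try-commit not yet invoked
    COMMIT_PENDING = "commit-pending"  # try-commit invoked, outcome unknown
    COMMITTED = "committed"            # writes are permanent
    ABORTED = "aborted"                # no permanent effect


# Hypothetical transition table; ABORTED -> LIVE models a restart (assumption).
ALLOWED = {
    TxState.LIVE: {TxState.COMMIT_PENDING, TxState.ABORTED},
    TxState.COMMIT_PENDING: {TxState.COMMITTED, TxState.ABORTED},
    TxState.COMMITTED: set(),
    TxState.ABORTED: {TxState.LIVE},
}


def can_move(src: TxState, dst: TxState) -> bool:
    """Return True if the sketch allows a transition from src to dst."""
    return dst in ALLOWED[src]
```

Note that a live transaction may move directly to aborted, which matches the remark that a transaction can be aborted before invoking the try-commit event.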

A shared object has a value and a (versioned) lock associated with it. We say that a shared object is exposed if it is locked by some live or commit-pending transaction. Intuitively, a shared object is exposed if some transaction can already read it, although the transaction that wrote to that object is still executing. A transaction is exposed when it is commit-pending and has all its written objects exposed.

Two transactions are said to be conflicting if both are concurrent and access an object $o$, and at least one of them writes on $o$. Note that two transactions are conflicting even if both write the same object without reading it. Including such a dependency is fundamental, as motivated in the next paragraph. A conflict is handled by aborting one of the transactions, or postponing the access responsible for the conflict (if possible), until the other transaction commits.
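As a minimal sketch of this definition, the conflict test reduces to set intersections over read-sets and write-sets. The class `Tx` and the function `conflicting` below are hypothetical names used only for illustration, and the two transactions are assumed to be concurrent.

```python
from dataclasses import dataclass, field


@dataclass
class Tx:
    age: int
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)


def conflicting(a: Tx, b: Tx) -> bool:
    """True if a and b (assumed concurrent) share an object that one of them writes."""
    a_writes_shared = a.write_set & (b.read_set | b.write_set)
    b_writes_shared = b.write_set & a.read_set
    return bool(a_writes_shared or b_writes_shared)
```

Two blind writers of the same object are reported as conflicting, while two readers of the same object are not.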

Transaction Ordering. We focus on TM implementations providing a specific order of transaction commits, which is assumed to be known prior to the transaction execution. We denote such an order as the transaction age. The age is assumed to be defined before activating any transaction (e.g., by an ordering layer deployed on top of the TM implementation), and must match the transactions’ commit order. The age is unique, meaning no two transactions can have the same age, and once assigned to a transaction, the age does not change even if the transaction is aborted multiple times.

The age of a transaction should not be confused with the transaction timestamp taken from a global timestamp, as used by many existing TM implementations (e.g., TL2 (Dice et al., 2006), LSA (Riegel et al., 2006)). The global timestamp is advanced by the concurrency control every time a transaction commits. The age of a transaction is externally determined (e.g., by the application) and does not depend on the execution of concurrent transactions.

Let $<$ be the total order relation on transaction ages, and let those ages be denoted as subscripts (e.g., $T_i$). If $i < j$ we say that $T_i$ has a lower age than $T_j$; otherwise higher. A concurrency control that enforces an order of commits ensures that when two operations $o_i$ and $o_j$, issued by transactions $T_i$ and $T_j$, respectively, are conflicting, then $o_i$ must happen before $o_j$ if and only if $i < j$. We deploy this idea into our execution model by introducing an Age-based Commit Order (ACO). ACO mandates a customization of the classical TM model. As an example, the transaction conflict detection should guarantee that when $i < j$, $T_i$ must not read a value written by $T_j$ (intuitively, $T_i$ should commit before $T_j$).

We define a transaction $T_i$ as reachable if all of $T_i$’s lower age transactions are committed, which means that $T_i$ has been reached by a serial execution where all transactions {$T_1$, …, $T_{i-1}$} committed in the order $T_1, \ldots, T_{i-1}$. In practice, ACO constrains the serialization orders.
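The reachability condition and the commit-order constraint can be expressed as two small predicates. This is a sketch under the assumption that ages are plain integers; `is_reachable` and `respects_aco` are hypothetical helper names, not part of the algorithms presented later.

```python
def is_reachable(age, committed_ages, all_ages):
    """A transaction is reachable once every lower-age transaction has committed."""
    return all(a in committed_ages for a in all_ages if a < age)


def respects_aco(commit_sequence):
    """True if commits happened in strictly increasing age order."""
    return all(x < y for x, y in zip(commit_sequence, commit_sequence[1:]))
```

For example, with ages 0 to 3, the transaction with age 2 is reachable only after the transactions with ages 0 and 1 have both committed.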

In our model, when a transaction aborts, it is restarted by the TM library with the same age.

4. General Design

In this section, we present our co-operative model for supporting ACO. The core idea is to relax the common practice of letting transactions access values written by only committed or commit-pending transactions that will surely commit. In our proposed solutions, we weaken this assumption while still preserving the consistency according to ACO. Depending on the desired correctness and performance level, we permit a transaction to expose its changes either:

  • after it invokes the try-commit event and performs a validation to verify execution’s consistency, but still allowing it to abort later due to ACO violation (in OWB); or

  • right after the write operation takes place during the execution and aborting any dependent transaction as soon as a further modification on the same early exposed value happens (in OUL).

The above idea allows transactions with higher age to use such visible changes. Although it speeds up the flow of values from lower to higher age transactions, it also creates a possible dependency chain with other live and commit-pending transactions that accessed those values. Therefore, when an abort occurs, the abort event should be immediately propagated to all the dependent transactions (cascading abort).
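A cascading abort amounts to a traversal of the dependency chain. The sketch below is illustrative only: it assumes the dependencies are available as a dictionary `dependents` mapping each transaction to the transactions that read its exposed values, and the function name `cascade_abort` is hypothetical. It is single-threaded and ignores the synchronization a real implementation needs.

```python
def cascade_abort(start, dependents):
    """Return the set of transactions aborted when `start` aborts."""
    aborted, stack = set(), [start]
    while stack:
        tx = stack.pop()
        if tx in aborted:
            continue  # already handled through another path
        aborted.add(tx)
        stack.extend(dependents.get(tx, ()))
    return aborted
```

Aborting a transaction with no dependents aborts only that transaction, while aborting the head of a chain aborts everything downstream of it.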

We can classify existing transactional models according to the way they handle concurrent read and write operations as: conservative model, conflict serialization model (e.g., DASTM (Ramadan et al., 2009)), and cooperative ordered model (our model). The first prohibits the co-existence of read and write operations on the same object issued by concurrent transactions (this model is deployed by most concurrency controls). The conflict serialization model permits all combinations and selects the commit order based on the transaction dependency. Our cooperative ordered model restricts the memory snapshot seen by transactions to only the values exposed by transactions with lower age.
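The snapshot restriction of the cooperative ordered model can be illustrated with a small Python sketch. It is an idealization and not part of OWB or OUL: the helper `visible_value` and the per-object map `exposed_writes` (writer age to exposed value) are hypothetical, and the algorithms presented later keep a single lock owner per object rather than such a map.

```python
def visible_value(committed_value, exposed_writes, reader_age):
    """Value a reader may observe for one object.

    exposed_writes maps the age of each exposing writer to the value it
    exposed. Only writers with a lower age than the reader are visible.
    """
    lower = [age for age in exposed_writes if age < reader_age]
    if not lower:
        return committed_value
    return exposed_writes[max(lower)]
```

A reader never observes a value exposed by a higher-age transaction, which is the property ACO requires.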

Interestingly, under the conflict serialization model transactions are aborted only when a mutual dependency (i.e., a cycle in the transaction dependency graph) exists; in our model, the graph is always acyclic. Not having to detect cycles in the transaction dependency chain increases the chance to achieve high performance.

4.1. Cooperative Ordered Transactional Execution

To construct our cooperative model, we start by highlighting the following two events of a transaction execution:

  • a transaction is exposed when all its written objects are exposed, and it is in the commit-pending state by having all its read operations consistent according to the ACO, therefore no conflict with lower age transactions occurred;

  • a transaction becomes reachable, when all the lower age transactions have been committed.

Supporting this new model requires that: i) aborted transactions should be able to abort other transactions that accessed their exposed updates; and ii) lower age transactions should enforce the abort of exposed higher age transactions. Accomplishing the above goals requires maintaining some transactional metadata (e.g., read and write sets, including acquired locks) even after a transaction is exposed. Those metadata help in identifying conflicts with (or aborting) exposed transactions, and they should be kept accessible until a transaction becomes reachable. Additionally, we need to support the cascading abort of multiple live or exposed transactions that share elements in their read-sets and write-sets.

Exposing written objects before being sure that a transaction eventually commits may violate the ACO if all lower age transactions are not committed yet. Similarly, ACO might be violated when the transaction conflicts with a lower age transaction that accesses a common object that is exposed by the first. For this reason, we postpone releasing the transaction metadata until the transaction becomes reachable, thus providing a safe point to decide whether a commit or abort should be triggered. The main difference between an exposed transaction and a reachable transaction is that the former, although it has already published its modifications, can still be aborted (and trigger the cascading abort of other transactions), whereas the latter cannot be aborted anymore. Therefore, it is safe to release all the metadata of a reachable transaction without violating the ACO.

5. The Ordered Write Back (OWB)

1:procedure Read(SharedObject so, Transaction tx)
2:       readVersion = so.lock.version
3:       currentWriter = so.lock.writer
4:       if tx.writeSet.contains(so) then
5:             return tx.writeSet.get(so).value ▷ Read written value
6:       else if currentWriter ≠ NULL then ▷ Check speculative write
7:             if currentWriter.age > tx.age then
8:                    ABORT(currentWriter) ▷ Read after speculative write
9:                    go to 2
10:             else ▷ Add tx to its dependencies
11:                    currentWriter.dependencies.add(tx)
12:                    if currentWriter.status ≠ ACTIVE then ▷ Double-check writer
13:                          ABORT(tx) ▷ Writer got aborted during registration
14:                    end if
15:             end if
16:       end if
17:       Validate_Reads(tx)
18:       tx.readSet.add(so, readVersion)
19:       return so.value
20:end procedure
21:procedure Write(SharedObject so, Object value, Transaction tx)
22:       tx.writeSet.add(so, value) ▷ Save new value
23:end procedure
24:procedure Abort(Transaction tx)
25:       if tx.status = ABORTED then return false; end if ▷ Already aborted
26:       if tx.status = INACTIVE then return false; end if ▷ Already completed
27:       while  ! CAS(tx.status, ACTIVE, TRANSIENT)  do ▷ Try abort
28:             repeat until  tx.status ≠ TRANSIENT ▷ Spin wait
29:             go to 25
30:       end while
31:       for each Transaction dependency in tx.dependencies do
32:             ABORT(dependency) ▷ Abort dependent transactions
33:       end for
34:       for each Entry entry in tx.writeSet do
35:             SharedObject so = entry.so
36:             if so.lock.writer = tx then ▷ Acquired lock
37:                    so.value = entry.newValue ▷ Revert value (entry holds the old value after the swap in TryCommit)
38:                    so.lock.writer = NULL ▷ Release the lock
39:             end if
40:       end for
41:       tx.status = ABORTED
42:       return true
43:end procedure
44:procedure Validate_Reads(Transaction tx)
45:       for each Entry in tx.readSet do Validate Read Set
46:             SharedObject so = entry.so
47:             if so.lock.version entry.readVersion so.lock.writer NULL then
48:                    return ABORT(tx) Read a wrong version
49:             end if
50:       end for
51:       return VALID
52:end procedure 
53:procedure Validate_Locked_Reads(Transaction tx)
54:       for each Entry entry in tx.readSet do ▷ Validate reads of locked (written) objects
55:             SharedObject so = entry.so
56:             if so.lock.writer = tx ∧ so.lock.version ≠ 1 + entry.readVersion then
57:                    return ABORT(tx) ▷ Concurrent expose/commit occurred
58:             end if
59:       end for
60:       return VALID
61:end procedure
62:procedure TryCommit(Transaction tx)
63:       if tx.status = ABORTED then return false; end if ▷ Already aborted
64:       while  ! CAS(tx.status, ACTIVE, TRANSIENT)  do ▷ Try commit
65:             repeat until  tx.status ≠ TRANSIENT ▷ Spin wait
66:             return false
67:       end while
68:       if VALIDATE_READS(tx) ≠ VALID then return false; end if
69:       for each Entry entry in tx.writeSet do ▷ Lock write set
70:             SharedObject so = entry.so
71:             currentWriter = so.lock.writer
72:             if currentWriter ≠ NULL then
73:                    if tx.age < currentWriter.age then
74:                          ABORT(currentWriter) ▷ Write after speculative write
75:                    else
76:                          ABORT(tx) ▷ Write after write
77:                          return false
78:                    end if
79:             end if
80:             if  ! CAS(so.lock.writer, NULL, tx)  then ▷ Acquire lock
81:                    go to 71
82:             end if
83:       end for
84:       for each Entry entry in tx.writeSet do
85:             SharedObject so = entry.so
86:             so.lock.version++
87:             temp = so.value ▷ Save old value
88:             so.value = entry.newValue ▷ Expose written value
89:             entry.newValue = temp
90:       end for
91:       if Validate_Locked_Reads(tx) ≠ VALID then return false; end if
92:       tx.status = ACTIVE ▷ Transaction exposed
93:       return true
94:end procedure
95:procedure Commit(Transaction tx)
96:       if tx.status = ABORTED then return false; end if ▷ Already aborted
97:       while  ! CAS(tx.status, ACTIVE, TRANSIENT)  do ▷ Try complete
98:             repeat until  tx.status ≠ TRANSIENT ▷ Spin wait
99:             return false
100:       end while
101:       if VALIDATE_READS(tx) ≠ VALID then return false; end if
102:       for each Entry entry in tx.writeSet do
103:             SharedObject so = entry.so
104:             so.lock.writer = NULL ▷ Unlock
105:       end for
106:       tx.status = INACTIVE ▷ Transaction committed
107:       return true
108:end procedure
Algorithm 1 OWB - Pseudocode

The Ordered Write Back Algorithm (OWB) employs a write-buffer approach; a transaction writes its modifications into a local buffer. While entering the try-commit phase, the transaction acquires a versioned-lock for each object in its write-set, writes its changes to the shared memory, and becomes exposed. To avoid concurrent writers, the locks are not released until the transaction becomes committed or is aborted. However, to allow an early propagation of the modifications, higher age transactions can read these locked objects. In case an abort is triggered, the exposed transaction is responsible for aborting any dependent transaction that has read the exposed values. We use versioning to detect conflicts between concurrent transactions. The transaction performs a validation before exposing its values, and again before releasing its locks at the final commit.

In practice, for OWB a transaction is exposed if: it is executed until the end without any conflict with other concurrent transactions; it acquired the locks on its modified objects successfully; it exposed their new values to the shared memory; and it is waiting to be reachable. A transaction can commit only if it is reachable and passes the validation of its read operations. The transaction also releases its acquired locks at this stage. As stated earlier, an exposed transaction can still be aborted.

A transaction keeps these metadata: 1) read-set, which stores read objects and their read version; 2) write-set, which stores the modified objects and their new values; and 3) dependencies list: a list of transactions that read the changes done by this transaction after it becomes exposed. Shared objects are associated with a versioned lock. The lock stores the version number and a reference to the writer transaction (if it exists) that currently owns it. The version is incremented when a new value for the object is exposed. The pseudocode of OWB is in Algorithm 1.
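The metadata listed above maps naturally onto three record types. The following Python dataclasses are a minimal sketch and not the actual implementation: the field names follow the pseudocode loosely, the string status values are an assumption, and keying the read-set and write-set by an object identifier is our choice for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Transaction:
    age: int
    status: str = "ACTIVE"                            # assumed status encoding
    read_set: dict = field(default_factory=dict)      # object id -> version read
    write_set: dict = field(default_factory=dict)     # object id -> new value
    dependencies: list = field(default_factory=list)  # readers of exposed values


@dataclass
class VersionedLock:
    version: int = 0                                  # bumped when a value is exposed
    writer: Optional[Transaction] = None              # current owner, if any


@dataclass
class SharedObject:
    value: Any = None
    lock: VersionedLock = field(default_factory=VersionedLock)
```

Each shared object gets its own lock instance, so acquiring the lock of one object does not affect any other object.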

The Write operation simply adds the object and its new value to the write-set. The Read operation first checks if the object has been earlier modified by the transaction itself. If so, the new value from the write-set is returned; otherwise the object, along with its version, is fetched from the shared memory. If the object is currently exposed, then the writer is aborted only if its age is higher (), and the read operation is retried. If the transaction that holds the lock has a lower age than the reading transaction, we let the latter read the written value () and add itself to the writer’s dependencies list. That way, if the writer is aborted in the future, it can cascade its abort to the affected transactions that read its modified objects. It is worth noting that, to avoid inconsistencies while reading from an exposed writer, we let the reader double check the writer state (if it is aborted) after it registers itself in the writer dependencies list; also, the dependencies list must provide thread-safe insertion. Before a read operation returns, the read-set is validated by invoking Validate_Reads (see below). Without this validation, the read-set of an OWB transaction could become inconsistent during execution.

To enter the exposed state, a validation of the read-set is executed to make sure the transaction reads a consistent view of the memory before exposing the locally buffered written objects. Validate_Reads compares the current versions of the read objects with the value of the corresponding versions stored in the read-set. If the current version is different, then this means that the object was modified after the read, i.e., a Write after Read (WAR) conflict.

Upon passing a read-set validation, the Try-Commit procedure acquires the locks and then writes the write-set to the memory. If the locks are already acquired by another concurrent exposed writer ( or ), we handle that by favoring the lower age transaction, and aborting the other. Since exposed transactions can still be aborted by other transactions, we need to store the old value of modified objects. This is done by swapping the values stored in the write-set with the objects’ old values when the new values are exposed.
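The value swap performed when exposing, and its use on abort, can be sketched as follows. This is a single-threaded illustration that omits lock acquisition entirely; `expose`, `revert`, and the dictionaries `memory`, `versions`, and `write_set` are hypothetical names.

```python
def expose(memory, versions, write_set):
    """Publish buffered values and keep the old ones in write_set for undo."""
    for key in write_set:
        versions[key] += 1
        # Swap: shared memory gets the new value, the write-set keeps the old one.
        memory[key], write_set[key] = write_set[key], memory[key]


def revert(memory, write_set):
    """Restore the old values saved by expose (used when an exposed tx aborts)."""
    for key, old_value in write_set.items():
        memory[key] = old_value
```

As in the pseudocode, the version is not decremented on revert, so readers that recorded the pre-expose version will observe a version mismatch.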

Finally, at commit time we call Validate_Reads again to prevent the WAR anomaly. However, since write-set elements are already locked, we can leverage that to reduce the validation overhead. Consider a transaction $T_i$ executing the commit operation. Let $o \in$ read-set($T_i$) $\cap$ write-set($T_i$). As $T_i$ still holds the locks over its write-set (including $o$), $T_i$ is sure that the value of $o$ is unchanged since its lock acquisition, and thus $o$ can be excluded from the commit-time read-set validation. Doing so requires checking that read-set objects have not been changed while acquiring locks.

Keeping the commit execution time short is fundamental; therefore, the optimization just described shrinks the commit execution at the price of adding an extra check in the Try-Commit procedure. However, having an object read and written in a transaction is a common pattern, which makes this optimization fruitful.
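The extra check added to Try-Commit can be sketched as follows, under our reading of the pseudocode: for an object that the transaction both read and locked, the only legal version after exposing is the version it read plus one. The function `validate_locked_reads` and the `locks` layout (object id mapped to a pair of owner id and version) are assumptions made for this illustration.

```python
def validate_locked_reads(tx_id, read_versions, locks):
    """Check objects that tx_id both read and currently owns.

    read_versions: {object id: version seen at read time}
    locks:         {object id: (owner id or None, current version)}
    """
    for key, read_version in read_versions.items():
        owner, version = locks[key]
        # Exactly one increment (our own expose) is expected since the read.
        if owner == tx_id and version != read_version + 1:
            return False
    return True
```

Objects not owned by the transaction are skipped here; they remain subject to the regular read-set validation.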

Finally, when a transaction becomes reachable and the re-validation succeeds, the commit operation releases its acquired locks and reclaims metadata.

OWB guarantees TMS1 (Doherty et al., 2013). The intuition is that: if for a history generated by OWB, every exposed transaction is committed, then the history is opaque (Guerraoui and Kapalka, 2008). First of all, transactions can commit only in the ACO order, serializing all the committed transactions, making OWB strict serializable. Moreover, OWB allows transactions to read only from commit-pending (exposed) and committed transactions, and any time a transaction enters the exposed state, it aborts all concurrent transactions that have read a value that violates the ACO. However, exposed transactions can abort after some live transaction already read those values. This is not allowed by Opacity, but TMS1 allows that as long as the live transactions do not perform any operation after the exposed transaction is aborted. OWB implements that through an atomic cascading abort. We give more details about correctness in Section 7.

6. The Ordered Undo Log (OUL)

The Ordered Undo Log (OUL) Algorithm is an undo-log algorithm that preserves the ACO. Here, transactional updates affect the shared memory at encounter time, while the old value is kept in a local undo-log. Such a scheme implies that the transactions’ order is guaranteed while operations are invoked, and not at commit time as in OWB. In order to deploy the above idea, each object is associated with a read-write lock. The transaction acquires a read or write lock according to its need, as explained later. Also, each lock stores the reference to the (single) writer transaction, which can be either the current transaction holding the lock or the one that committed that version, and a list of concurrent readers, namely those transactions that read the version and are still live or commit-pending. Note that the size of the readers list impacts the efficiency of the protocol, thus it should be bounded.

As in OWB, every transaction in OUL maintains a write-set, but here the write-set stores the old values of the written objects (undo-log). Regarding the transaction read-set, it is implicitly represented by the object lock’s readers list.
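The undo-log idea can be illustrated with a minimal single-threaded sketch. It is not the OUL implementation: locking and reader tracking are omitted, and the class `UndoLogTx` with its `write` and `rollback` methods is hypothetical.

```python
class UndoLogTx:
    def __init__(self, age, memory):
        self.age = age
        self.memory = memory      # shared state, modified in place
        self.undo_log = {}        # object id -> value before the first write

    def write(self, key, value):
        # Keep only the oldest value so rollback restores the pre-tx state.
        self.undo_log.setdefault(key, self.memory[key])
        self.memory[key] = value  # write-through at encounter time

    def rollback(self):
        for key, old_value in self.undo_log.items():
            self.memory[key] = old_value
        self.undo_log.clear()
```

Writing the same object twice records the original value only once, so a rollback restores the state observed before the transaction started.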

The pseudocode of OUL’s core operations is included in Algorithm 2 and Algorithm 3. In the Read procedure, we allow Read after Write (RAW) conflicts only if the writer transaction has a lower age; otherwise the speculative writer is aborted. The Write procedure enforces that only a single transaction can hold the write lock on the object at a time. A Write-Write conflict is solved by aborting the transaction with the highest age. As readers are visible, the writer transaction can check if there is any (wrong) speculative reader, and abort it accordingly.

One of the major benefits of a write-through protocol is that the Try-Commit procedure is simple because the values are already in the shared memory. However, in OUL exposing a transaction only means that it did not conflict with other transactions so far; it could still be aborted to preserve the ACO. In the Commit procedure, the transaction is marked as Inactive and locks are released. As we said before, since a lock is maintained with a back-reference to the transaction that holds it, setting the transaction status is sufficient to release all the locks held by that transaction in a single step. On the other hand, in the Abort procedure the transaction restores old values from the undo-log (Rollback), and releases all the locks (switches to Inactive).
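The commit and abort paths described above can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: the names `SharedObject`, `Transaction`, and `is_locked` are hypothetical, there is no concurrency (plain assignments stand in for the CAS operations), and readers are not modeled. It shows the two properties the text relies on: writes reach shared memory at encounter time, and a status change alone releases every lock that points back to the transaction.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Status(Enum):
    ACTIVE = auto()
    PENDING = auto()    # exposed, waiting for its turn to commit
    TRANSIENT = auto()  # abort in progress
    INACTIVE = auto()   # committed, or fully rolled back


@dataclass(eq=False)
class SharedObject:
    value: object
    writer: "Transaction | None" = None  # back-reference stored in the lock


@dataclass(eq=False)
class Transaction:
    age: int
    status: Status = Status.ACTIVE
    undo_log: list = field(default_factory=list)  # (object, old value) pairs

    def write(self, so: SharedObject, value: object) -> None:
        if so.writer is not self:
            self.undo_log.append((so, so.value))  # save the old value once
            so.writer = self
        so.value = value  # write-through: visible at encounter time

    def try_commit(self) -> None:
        self.status = Status.PENDING

    def commit(self) -> None:
        # One status flip releases all locks that reference this transaction.
        self.status = Status.INACTIVE

    def abort(self) -> None:
        self.status = Status.TRANSIENT
        for so, old in reversed(self.undo_log):
            so.value = old  # restore old values from the undo-log
        self.undo_log.clear()
        self.status = Status.INACTIVE


def is_locked(so: SharedObject) -> bool:
    """A lock is held only while its writer is not INACTIVE."""
    return so.writer is not None and so.writer.status is not Status.INACTIVE
```

In this model no per-object unlock loop is needed at commit, which mirrors the single-step release the text attributes to the back-reference design.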

1: procedure Read(SharedObject so, Transaction tx)
2:       if tx.status = TRANSIENT then return ABORT(tx) end if
3:       Transaction currentWriter = so.lock.writer
4:       if currentWriter = BUSY then go to 2 end if
5:       if currentWriter ≠ NULL ∧ currentWriter.status ≠ INACTIVE ∧ currentWriter.age > tx.age then
6:             ABORT(currentWriter) ▷ Read after speculative write
7:             go to 2
8:       end if
9:       registered = false
10:       repeat
11:             for each index i of so.lock.reader do
12:                    Transaction readerSlot = so.lock.reader[i]
13:                    if readerSlot ≠ ACTIVE ∧ readerSlot ≠ PENDING ∧ CAS(so.lock.reader[i], readerSlot, tx) then
14:                          registered = true ▷ Found empty reader slot
15:                    end if
16:             end for
17:       until registered
18:       if currentWriter ≠ so.lock.writer then ▷ Writer was changed
19:             go to 2
20:       end if
21:       return so.value
22: end procedure
23: procedure Write(SharedObject so, Object value, Transaction tx)
24:       if tx.status = TRANSIENT then ABORT(tx) end if
25:       Transaction currentWriter = so.lock.writer
26:       if currentWriter = BUSY then go to 24 end if
27:       if currentWriter ≠ tx then ▷ Skipped when already in write-set
28:             if currentWriter ≠ NULL ∧ currentWriter.status ≠ INACTIVE then
29:                    if currentWriter.age > tx.age then
30:                          ABORT(currentWriter) ▷ Write after speculative write
31:                          go to 24
32:                    else
33:                          ABORT(tx) ▷ Write after Write
34:                    end if
35:             end if
36:             if !CAS(so.lock.writer, currentWriter, BUSY) then
37:                    go to 24 ▷ Failed to acquire the lock
38:             end if
39:             tx.writeSet.add(so, so.value) ▷ Save old value
40:       end if
41:       for each index i of so.lock.reader do
42:             Transaction readerSlot = so.lock.reader[i]
43:             if (readerSlot ≠ INACTIVE ∧ readerSlot.age > tx.age) then
44:                    ABORT(readerSlot) ▷ Abort speculative reader
45:             end if
46:       end for
47:       so.value = value ▷ Write new value
48:       so.lock.writer = tx ▷ Save me as the new writer
49: end procedure
Algorithm 2 OUL - Pseudocode 1
50: procedure TryCommit(Transaction tx)
51:       if !CAS(tx.status, ACTIVE, PENDING) then ABORT(tx) end if
52: end procedure
53: procedure Commit(Transaction tx)
54:       if CAS(tx.status, PENDING, INACTIVE) then return true end if
55:       repeat until tx.status ≠ TRANSIENT ▷ Wait until the abort completes
56: end procedure
57: procedure Abort(Transaction tx)
58:       if tx.status = INACTIVE then return true end if ▷ Check if already aborted
59:       if CAS(tx.status, PENDING, TRANSIENT) then ▷ Rollback
60:             for each entry in tx.writeSet do
61:                    SharedObject so = entry.so
62:                    Object value = entry.value
63:                    so.value = value ▷ Restore old value
64:                    for each index i of so.lock.reader do
65:                          Transaction readerSlot = so.lock.reader[i]
66:                          if (readerSlot ≠ INACTIVE ∧ readerSlot.age > tx.age) then
67:                                 ABORT(readerSlot) ▷ Abort speculative reader
68:                          end if
69:                    end for
70:             end for
71:             tx.status = INACTIVE
72:       else ▷ Set aborted
73:             return CAS(tx.status, ACTIVE, TRANSIENT)
74:       end if
75: end procedure
Algorithm 3 OUL - Pseudocode 2
1: procedure Read(SharedObject so, Transaction tx)
2:       if tx.status = TRANSIENT then return ABORT(tx) end if
3:       Transaction currentWriter = so.lock.writer
4:       if currentWriter = BUSY then go to 2 end if
5:       if currentWriter ≠ NULL ∧ currentWriter.status ≠ INACTIVE ∧ currentWriter.age > tx.age then
6:             ABORT(currentWriter) ▷ Read after speculative write
7:             go to 2
8:       end if
9:       registered = false
10:       repeat
11:             for each index i of so.lock.reader do
12:                    Transaction readerSlot = so.lock.reader[i]
13:                    if readerSlot ≠ ACTIVE ∧ readerSlot ≠ PENDING ∧ CAS(so.lock.reader[i], readerSlot, tx) then
14:                          registered = true ▷ Found empty reader slot
15:                    end if
16:             end for
17:       until registered
18:       if currentWriter ≠ so.lock.writer then ▷ Writer got changed meanwhile
19:             go to 2
20:       end if
21:       return so.value
22: end procedure
23: procedure Write(SharedObject so, Object value, Transaction tx)
24:       if tx.status = TRANSIENT then ABORT(tx) end if
25:       Transaction currentWriter = so.lock.writer
26:       if currentWriter = BUSY then go to 24 end if
27:       if currentWriter ≠ tx then ▷ Skipped when already in write-set
28:             steal = false
29:             if currentWriter ≠ NULL ∧ currentWriter.status ≠ INACTIVE then
30:                    if currentWriter.age > tx.age then
31:                          ABORT(currentWriter) ▷ Write after speculative write
32:                          go to 24
33:                    else
34:                          steal = true ▷ Lock steal, Write after Write
35:                    end if
36:             end if
37:             if !CAS(so.lock.writer, currentWriter, BUSY) then ▷ Acquire the lock
38:                    go to 24
39:             end if
40:             tx.writeSet.add(so, so.value, steal ? currentWriter : NULL) ▷ Save old value, and old writer when stealing the lock
41:       end if
42:       for each index i of so.lock.reader do
43:             Transaction readerSlot = so.lock.reader[i]
44:             if (readerSlot ≠ INACTIVE ∧ readerSlot.age > tx.age) then
45:                    ABORT(readerSlot) ▷ Abort speculative readers
46:             end if
47:       end for
48:       so.value = value ▷ Write new value
49:       so.lock.writer = tx ▷ Save me as the new writer
50: end procedure
51: procedure TryCommit(Transaction tx)
52:       if !CAS(tx.status, ACTIVE, PENDING) then ABORT(tx) end if
53: end procedure
54: procedure Commit(Transaction tx)
55:       if CAS(tx.status, PENDING, INACTIVE) then return true end if
56:       repeat until tx.status ≠ TRANSIENT ▷ Wait until the abort completes
57: end procedure
58: procedure Rollback(Transaction tx)
59:       tx.aborted = true
60:       for each entry in tx.writeSet do
61:             SharedObject so = entry.so
62:             if CAS(so.lock.writer, tx, BUSY) then
63:                    Object value = entry.value
64:                    so.value = value ▷ Restore old value
65:                    so.lock.writer = entry.originalOwner ▷ Release lock, or return it to the original owner
66:                    if entry.originalOwner ≠ NULL ∧ entry.originalOwner.aborted then
67:                          ROLLBACK(entry.originalOwner)
68:                    end if
69:             end if
70:             for each index i of so.lock.reader do
71:                    Transaction readerSlot = so.lock.reader[i]
72:                    if (readerSlot ≠ INACTIVE ∧ readerSlot.age > tx.age) then
73:                          ABORT(readerSlot) ▷ Abort speculative readers
74:                    end if
75:             end for
76:       end for
77: end procedure
78: procedure Abort(Transaction tx)
79:       if tx.status = INACTIVE then return true end if ▷ Already aborted
80:       if CAS(tx.status, PENDING, TRANSIENT) then
81:             Rollback(tx) ▷ Rollback
82:             tx.status = INACTIVE
83:       else
84:             if CAS(tx.status, ACTIVE, TRANSIENT) then ▷ Set aborted
85:                    return true
86:             end if
87:       end if
88:       return false ▷ Failed to abort
89: end procedure
Algorithm 4 OUL-Steal - Pseudocode

6.1. The OUL-Steal Algorithm

In this section, we introduce OUL-Steal, a variant of the OUL algorithm where we relax the aforementioned single-writer restriction and allow write-write conflicts while guaranteeing ACO. In both OWB and OUL, conflicting transactions cooperate to commit, as they are allowed to proceed without aborts even in the presence of some read-write conflicts, as long as ACO is still preserved. However, a writer transaction holds the locks until reaching the commit state, which sometimes limits the overall concurrency.

Let $T_i$ and $T_j$ be two conflicting writers on object $X$, with $i < j$. In OUL, if $T_j$ finds $X$ locked by $T_i$, $T_j$ should abort. However, ACO could still be preserved if $T_j$ overwrites the value written by $T_i$, as long as there is no other transaction $T_k$, with $i < k < j$, that will read $X$ in the future.

OUL-Steal allows a transaction with higher age to overwrite the value written by a concurrent transaction with lower age, and steal its lock. The higher age transaction stores the stolen lock in a local list so that the lock can be returned to the original writer (the lower age transaction) in case of abort. That way, if a mid-age reader needs the value of a lower age transaction, then it can abort the higher age transaction(s) which stole the lock(s); otherwise (i.e., without such a reader), the value written by the higher age transaction will be used by higher age readers. This operation could be repeated until the reader reaches the correct writer transaction.

In Write, the lock is passed to the higher age writer and is saved in its write-set. As a consequence, the written address exists in the undo-log of both the writers (the original and the one which stole the lock). During the Abort, the transaction uses Rollback to revert its changes using its undo-log. An undo-log entry can be:

  • stolen by another writer: the transaction does not have the ownership record at abort time. In this case, the transaction takes no action, although it keeps the undo-log entry, which contains the value of the address before the current transaction’s modifications.

  • exclusively modified by the current transaction: the transaction restores the old value from the undo-log and aborts the speculative readers.

  • stolen from another writer: in addition to the steps done in the exclusively modified case, the lock ownership is passed back to the old writer, and the current transaction checks the state of the old writer. If it was not aborted, then no further action is needed. Otherwise, the transaction calls the Rollback of the old owner. At this stage, the old writer will treat the entry as either exclusively modified or stolen from, accordingly.

The complete pseudocode of the OUL-Steal algorithm is in Algorithm 4.
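The three undo-log cases can be traced with a small single-threaded model. This is a sketch under stated assumptions, not the authors' implementation: `Obj`, `Entry`, `Tx`, `write`, and `rollback` are hypothetical names, every non-empty owner is treated as a live transaction, and readers and CAS-based locking are left out.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Obj:
    value: object
    writer: "Tx | None" = None


@dataclass(eq=False)
class Entry:
    obj: Obj
    old_value: object
    original_owner: "Tx | None"  # set only when the lock was stolen


@dataclass(eq=False)
class Tx:
    age: int
    aborted: bool = False
    write_set: list = field(default_factory=list)


def write(tx: Tx, obj: Obj, value: object) -> None:
    owner = obj.writer
    if owner is not tx:
        if owner is not None and owner.age > tx.age:
            # In the algorithm the lower-age writer aborts the owner instead;
            # that path is outside this sketch.
            raise RuntimeError("owner has a higher age")
        # Steal (owner is older) or plain acquire (no owner).
        tx.write_set.append(Entry(obj, obj.value, owner))
        obj.writer = tx
    obj.value = value


def rollback(tx: Tx) -> None:
    tx.aborted = True
    for entry in reversed(tx.write_set):
        if entry.obj.writer is not tx:
            continue  # stolen by another writer: nothing to do now
        entry.obj.value = entry.old_value        # restore the old value
        entry.obj.writer = entry.original_owner  # release, or hand back
        owner = entry.original_owner
        if owner is not None and owner.aborted:
            rollback(owner)  # stolen from a writer that has since aborted
```

For example, if a transaction of age 1 writes an object, a transaction of age 2 steals it, and both abort in that order, the second rollback hands the lock back and then recursively undoes the first writer's update.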

As a correctness guarantee, OUL ensures Strict Serializability (Papadimitriou, 1979). Unlike OWB, OUL allows reading from live transactions, which is not allowed by TMS1 (and hence opacity). However, similar to OWB, OUL restricts transactions to commit only in the ACO order, making OUL strict serializable. More details about correctness are in Section 7.

7. Correctness

Here we discuss the correctness of the given algorithms. We do not include the case where a transaction triggers an error (e.g., division by zero) because it speculatively processes a computation that in a non-parallel execution would not happen. Such an execution might be a for-loop iteration executed speculatively and preceded by an iteration that breaks the for-loop. We assume a sandboxing mechanism to handle such exceptions.

First, we show how our protocols preserve the ACO. Suppose by contradiction that the ACO is violated. Let $T_i$ and $T_j$ be two transactions such that $i < j$, that is, $T_i$ has the lower age. The interesting case is if $T_i$ successfully reads a value of an object $X$ written by $T_j$. This implies that the read happened after $T_j$ exposes $X$’s value in OWB or writes it in OUL. In both OWB and OUL, $T_i$ acquires a shared lock on $X$ at the time of the read operation, either by visible reads (OUL, OUL-Steal) or by checking that there is no writer (OWB). For a successful read, the shared lock must be acquired, thus the write lock should not be already granted. This implies that $T_j$ has released all its locks. As a transaction does not release its acquired locks until it commits, $T_j$ must necessarily be committed. Therefore the read of $T_i$ must occur after the commit of $T_j$. Since a transaction cannot perform any step after it commits, the commit of $T_i$ follows its read. This means that $T_j$ commits before $T_i$, which cannot be the case since they must commit in order, according to their ages.

Now we prove that both OWB and OUL are serializable. In order to prove that, we define $T_i \rightarrow T_j$ as a predicate defining a dependency between $T_i$ and $T_j$, when $T_j$ reads a value written by $T_i$, or $T_j$ overwrites a value written by $T_i$. Using this definition, we can construct a directed dependency graph $G = (V, E)$, where $V$ is the set of all committed transactions, and $E$ is the set of dependency relations. It is easy to see that $G \subseteq SG$, where SG is the conflict serialization graph (Bernstein et al., 1987). A history is serializable if and only if its SG is acyclic. Note that serializability is not guaranteed if only $G$ is acyclic.

Assume by contradiction that an execution of our algorithms produces a cyclic $G$, which implies having an edge $T_j \rightarrow T_i$ where $i < j$. By definition of dependency, this means that either $T_i$ reads a value written by $T_j$, or $T_i$ overwrites a value written by $T_j$. In all the proposed algorithms, exclusive locks must be acquired when we expose the written values (at commit in OWB or encounter time in OUL and OUL-Steal) and released only at commit, or passed to a higher age transaction (which is not the case here). In both situations, the operation of $T_i$ therefore follows the commit of $T_j$. Since a transaction cannot perform any step after it commits, $T_j$ commits before $T_i$, which cannot be the case as mentioned earlier; therefore, $G$ is acyclic.

Consider now an edge of SG that is not in $G$ and goes from $T_j$ to $T_i$ with $i < j$. This edge represents the case where $T_j$ reads an object that $T_i$ later overwrites, which means that $T_j$ read a stale value. In OWB, the procedure that validates read operations captures this by comparing the read version with the current version of the accessed object; while in OUL and OUL-Steal, the readers’ visibility enables the writer to detect the speculative reader and abort it. So SG is acyclic, making the algorithms serializable.
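The graph argument above can be checked mechanically on small histories. The sketch below assumes transactions are identified by their integer ages and that a history is given as a list of (source age, destination age) dependency edges; `has_cycle` and `respects_age_order` are hypothetical helper names. If every edge goes from a lower age to a higher age, the graph cannot contain a cycle.

```python
def has_cycle(edges):
    """Detect a cycle in a directed graph given as (src, dst) pairs, using DFS."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GREY, BLACK = 0, 1, 2
    color = dict.fromkeys(graph, WHITE)

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: a cycle exists
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


def respects_age_order(edges):
    """True when every dependency goes from a lower age to a higher age."""
    return all(src < dst for src, dst in edges)
```

A history whose edges all respect the age order is acyclic by construction, while an edge from a higher age back to a lower age is exactly what the locking and validation rules are meant to rule out.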

The serialization point of both OUL and OWB is inside the commit procedure: for OUL it is when the transaction’s status is atomically set to Inactive; for OWB it is when locks on written objects are released. As the serialization point is inside the transaction execution, all the algorithms preserve the real-time order, and are strict serializable.

In addition to being strict serializable, OWB is TMS1 (Doherty et al., 2013), a stronger condition than strict serializability. Being TMS1, OWB ensures that the response of every object operation, even by aborted and live transactions, is consistent with a serial execution. Informally, for a history to be TMS1, it must be strict serializable, and for every successful response of an object operation by a transaction $T$, there must exist a serialization of a subset of the transactions, justifying the response. This subset must contain $T$ (until the response) and all the committed transactions that completed before $T$ started. In addition, the serialization can also contain some commit-pending transactions, and some committed and even aborted transactions, that are concurrent with $T$. We have already shown earlier that OWB is strict serializable. Since OWB allows a read operation to return a value written by an exposed transaction, which may get aborted later, it justifies including concurrent aborted transactions for the response. Recall that OWB allows reading values written by committed and exposed transactions only, but not from aborted transactions. The intuition is that if a transaction reads from an exposed transaction, which gets aborted later, the reading transaction is also aborted without executing any further operations. This is done using the cascading abort mechanism in OWB.

8. Implementation and Evaluation

In our implementation locks are implemented using 32 bits. The mapping between addresses and locks is made by leveraging the least significant bits, so a single lock might be responsible for multiple addresses. The lock is divided into two parts: the most significant bits represent the reference to the writer, and the remaining bits represent either the header address of the readers list (for OUL), or the version number (for OWB). In OUL, we use a bounded list of readers to limit the number of concurrent readers, which is set to 40.
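The lock layout can be sketched with plain integer arithmetic. The text fixes the lock width at 32 bits but does not give the split between the two fields or the size of the lock table, so `WRITER_BITS` and `NUM_LOCKS` below are assumptions chosen only for illustration, as are the function names.

```python
LOCK_BITS = 32
WRITER_BITS = 16                      # assumed split, not given in the text
INFO_BITS = LOCK_BITS - WRITER_BITS   # version (OWB) or readers-list header (OUL)
NUM_LOCKS = 1 << 20                   # assumed lock-table size (a power of two)


def lock_index(address: int) -> int:
    """Map an address to a lock using its least significant bits."""
    return address & (NUM_LOCKS - 1)


def pack_lock(writer_id: int, info: int) -> int:
    """Place the writer reference in the most significant bits."""
    if not 0 <= writer_id < (1 << WRITER_BITS):
        raise ValueError("writer_id does not fit")
    if not 0 <= info < (1 << INFO_BITS):
        raise ValueError("info does not fit")
    return (writer_id << INFO_BITS) | info


def unpack_lock(word: int) -> tuple:
    """Return (writer_id, info) from a 32-bit lock word."""
    return word >> INFO_BITS, word & ((1 << INFO_BITS) - 1)
```

Because only the low bits of the address select the lock, two addresses that differ by a multiple of `NUM_LOCKS` share one lock, which is the source of the false sharing mentioned later for lock-based competitors.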

A thread plays multiple roles in our implementations: worker, validator, or cleaner. A worker executes transactions and performs the try-commit. A cleaner takes care of reclaiming metadata. Once the transactional operations are all executed, any thread can take the lead of finalizing any transaction; however, there is a single thread at a time in the validator role. This role is responsible for moving commit-pending transactions to the committed state, and it also re-executes invalid transactions. We adopt the flat combining (Hendler et al., 2010) technique to let threads take ownership of the validator role. The pseudocode in Algorithm 5 shows the steps done by a thread to carry out the execution of a transaction.

1: for each Transaction in WorkQueue do
2:       if validator = IDLE ∧ CAS(validator, IDLE, BUSY) then ▷ Try to be the validator
3:             tx = ExposedList[last_committed]
4:             if tx = NULL then ▷ Tx is not exposed yet
5:                    go to 16 ▷ Stop validation
6:             end if
7:             if tx.commit() = FAIL then ▷ Perform Tx commit
8:                    tx.start()
9:                    tx.execute() ▷ Re-execute failed transaction
10:                    tx.tryCommit() ▷ Commit without validation
11:                    tx.commit()
12:             end if
13:             last_committed++
14:             CommittedQueue.enqueue(tx)
15:             go to 3 ▷ Validate next exposed Tx
16:             validator = IDLE ▷ Release the validator role
17:       end if
18:       if aborts > LIMIT ∨ tx.age − last_committed > MAX then
19:             while tx.age − last_completed > MIN do
20:                    for each Transaction in CommittedQueue do
21:                          tx.clean() ▷ Do housekeeping
22:                    end for
23:             end while
24:       end if
25:       tx.start()
26:       tx.execute() ▷ Execute transaction
27:       if tx.tryCommit() = FAIL then ▷ Try to expose transaction
28:             go to 25 ▷ Retry
29:       end if
30:       ExposedList[tx.age] = tx ▷ Add to pending transactions
31: end for
Algorithm 5 Thread Execution
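The validator part of the thread loop can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: a non-blocking `threading.Lock` stands in for the CAS on the validator flag, a dictionary keyed by age stands in for the exposed list, and `Validator` and `DemoTx` are hypothetical names (the latter is a stub whose first commit can be made to fail).

```python
import threading


class Validator:
    def __init__(self):
        self._role = threading.Lock()  # held by whoever acts as validator
        self.exposed = {}              # age -> commit-pending transaction
        self.last_committed = 0
        self.committed = []            # ages, in commit order

    def expose(self, tx):
        self.exposed[tx.age] = tx

    def try_validate(self):
        """Commit exposed transactions in age order; return False if busy."""
        if not self._role.acquire(blocking=False):
            return False               # another thread is the validator
        try:
            while True:
                tx = self.exposed.pop(self.last_committed, None)
                if tx is None:
                    break              # next transaction is not exposed yet
                if not tx.commit():
                    tx.start()         # re-execute the failed transaction
                    tx.execute()
                    tx.try_commit()
                    tx.commit()
                self.committed.append(tx.age)
                self.last_committed += 1
        finally:
            self._role.release()
        return True


class DemoTx:
    """Stub transaction used only to exercise the loop."""

    def __init__(self, age, valid=True):
        self.age, self.valid, self.runs = age, valid, 0

    def start(self):
        pass

    def execute(self):
        self.runs += 1
        self.valid = True

    def try_commit(self):
        return True

    def commit(self):
        return self.valid
```

The loop stops at the first gap in the age sequence, so a transaction exposed out of order simply waits until all lower ages have committed.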
Figure 2. Peak performance of all competitors (including unordered) using all micro benchmarks (Y-axis is log scale). Panels: (a) Long Transaction, (b) Short Transaction, (c) Heavy Transaction.

We compare our algorithms with STMLite (Mehrara et al., 2009), a lightweight STM with ACO used to support code parallelization, and with the unordered and ordered versions of three state-of-the-art TM algorithms: TL2 (Dice et al., 2006), NOrec (Dalessandro et al., 2010) and UndoLog (Felber et al., 2008) (with and without visible readers).

Both TL2 and NOrec follow the write-back design strategy and validate transactions at commit time. To enforce ACO in these implementations, transactions are allowed to enter the commit phase only when all transactions with lower age have been committed. In order to aid the ordering for UndoLog, we exploit an age-based contention policy (i.e., always favor transactions with the lower age) to handle write-write conflicts. In the visible readers variant, the writer transaction aborts all active readers, while when readers are invisible the writer retries multiple times if the object is locked, then it backs off. STMLite uses a write-back implementation and replaces the need for constructing a read-set by leveraging signatures (Bloom Filters). There is a tradeoff in determining the effective size of signatures, but the authors recommended a range of 32 to 1024. We used a signature of size 64 with the STL hashing function because it provided the best performance. The number of threads in STMLite also includes its commit manager.
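The signature mechanism attributed to STMLite can be illustrated with a 64-bit Bloom-style bitmap. This is a minimal sketch assuming a single hash per address, with Python's built-in `hash` standing in for the STL hashing function; `signature` and `may_conflict` are hypothetical names.

```python
SIGNATURE_BITS = 64  # the size used in the experiments described above


def signature(addresses, bits=SIGNATURE_BITS):
    """Fold a set of addresses into a fixed-size bitmap."""
    sig = 0
    for address in addresses:
        sig |= 1 << (hash(address) % bits)
    return sig


def may_conflict(read_sig, write_sig):
    """Conservative test: a shared bit may or may not be a real conflict."""
    return (read_sig & write_sig) != 0
```

With integer addresses, 1 and 65 fall on the same bit, so `may_conflict(signature([1]), signature([65]))` reports a conflict even though the addresses differ. This is the kind of false conflict that a small signature trades for cheap validation.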

All competitors, including STMLite whose source code, to the best of our knowledge, is not publicly available, have been re-implemented atop the same baseline software framework so that all take advantage of the same low-level optimizations. It is worth noting that competitors may provide different correctness guarantees (e.g., OWB provides TMS1 while NOrec/TL2 give opacity).

In our experiments, the ACO is defined in two ways. Unless otherwise specified, the index of the dominant for-loop that each benchmark uses to generate parallel code (e.g., transactions in STAMP) is used as the transaction age. In some applications with more complex patterns, such as nested loops, we inserted an atomic integer to define and assign ages.
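Both pieces can be sketched together: ages drawn from a shared counter, and a gate that lets a transaction enter its commit phase only after all lower ages have committed, as described for the ordered baselines. This is a sketch under stated assumptions; `AgeCounter` and `CommitGate` are hypothetical names, and a lock-protected counter stands in for the atomic integer.

```python
import itertools
import threading


class AgeCounter:
    """Hands out consecutive ages; a lock stands in for an atomic integer."""

    def __init__(self):
        self._next = itertools.count()
        self._lock = threading.Lock()

    def next_age(self):
        with self._lock:
            return next(self._next)


class CommitGate:
    """Admits commit actions strictly in age order."""

    def __init__(self):
        self._cond = threading.Condition()
        self._turn = 0

    def commit_in_order(self, age, action):
        with self._cond:
            # Block until every lower-age transaction has committed.
            self._cond.wait_for(lambda: self._turn == age)
            action()
            self._turn += 1
            self._cond.notify_all()
```

Threads may reach the gate in any order, but the commit actions run in increasing age. A gate of this kind serializes the whole commit phase, which is the cost the cooperative algorithms aim to reduce.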

Threads are pinned to cores. The policy is to use up all cores of one socket before moving to the other one.

We report the throughput for micro benchmarks and the application execution time for STAMP and some applications of the PARSEC and SPEC2000 benchmarks by varying the number of serving threads in the thread-pool (the datapoint at 1 thread shows the performance of the single-threaded transactional execution). We also compare our performance against the unordered algorithms, which do not use ACO. In this case, applications directly activate transactions in parallel because no ACO needs to be defined. Performance of the non-transactional single-threaded execution (green line) is also included.

We used two different machines for our experiments: micro benchmarks and STAMP have been evaluated on an AMD machine equipped with 2 Opteron 6168 CPUs, each with 12 cores running at 1.9 GHz. The total memory available is 12 GB. Evaluation of PARSEC applications and SPEC2000 has been done using an Intel server hosting 4 Intel Xeon Platinum 8160 CPUs. Results are the average of five runs.

Micro Benchmark. In our first set of experiments we consider the RSTM micro-benchmarks (SOF, [n. d.]) to evaluate the effect of different workload characteristics, such as the number of operations per transaction, the transaction length, and the read/write ratio, on the performance. Each experiment included running half a million transactions. For each micro benchmark, we configured three types of transactions: short, long, and heavy. Both short and heavy have the same number of accesses (i.e., a random number between 10 and 20), but the latter adds more local computation in between them (i.e., 100 CPU-ops). Long transactions simply produce more transactional accesses (i.e., a random number between 30 and 60).
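The three transaction types can be written down as a small configuration table. The numbers come from the description above; the dictionary layout, the field names, and the `make_transaction` helper are assumptions made for illustration.

```python
import random

# kind -> (min accesses, max accesses, local CPU-ops between accesses)
PROFILES = {
    "short": (10, 20, 0),
    "long": (30, 60, 0),
    "heavy": (10, 20, 100),
}


def make_transaction(kind, rng):
    """Draw the shape of one transaction of the given kind."""
    low, high, local_ops = PROFILES[kind]
    return {"accesses": rng.randint(low, high), "local_ops": local_ops}
```

Passing a seeded `random.Random` instance keeps a generated workload reproducible across runs.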

Figure 2 summarizes the peak performance of all competitors. From that we can see the gap in performance between the ordered and unordered versions of the same algorithm: 26-56% for TL2, 13-41% for NOrec, 12-88% for UL-vis, and 28-74% for UL-invis.

As a general comment on these results, OUL and OUL-Steal outperform all other ordered versions of the algorithms. OUL-Steal excels for write loads and performs equally to OUL in read loads; OWB outperforms all write-back based implementations in most benchmarks. At high thread counts, STMLite suffers from false conflicts due to the use of signatures. However, at low thread counts (fewer than 8) and with Long transactions it achieves a higher peak throughput than Ordered TL2 and Ordered NOrec, because it benefits from the quick validation using signatures. For the UL-inv algorithm, we found that the readers’ visibility was crucial; without this information, the algorithm may abort a lower age transaction (using a timeout) while some higher age transaction holds the read shared lock. On the other hand, these higher age transactions cannot commit before their turn comes, hence they time out.

Figure 3. Disjoint (a)-(c) and ReadNWrite1 (d)-(i). Panels: (a) Disjoint-Long, (b) Disjoint-Short, (c) Disjoint-Heavy, (d) RNW1-Long, (e) RNW1-Long Aborts, (f) RNW1-Short, (g) RNW1-Short Aborts, (h) RNW1-Heavy, (i) RNW1-Heavy Aborts.

In configurations where the sequential (non-transactional) execution is faster than many ordered algorithms, our solutions outperform it, letting parallelism pay off. However, there are two benchmarks with long transactions where the sequential execution is faster. These workloads represent unfavorable scenarios for processing ordered transactions because of the high cost of aborting transactions (possibly repeatedly) due to ACO violation.

The DisjointBench (Figures 3(a)-3(c)) produces a workload with no conflict between concurrent transactions. Every transaction accesses a different set of addresses with read and write operations. In all configurations, OUL achieves the best throughput, while OUL-Steal suffers from the overhead of its lock management scheme without actually gaining from that, as the disjoint transactions do not have any shared accesses. UL-vis achieves a throughput close to OUL-Steal, thanks to the simplicity of its immediate write strategy. All three plots show a performance peak around 6 threads. This shape is the consequence of NUMA latency (Brown et al., 2016; Daly et al., 2018), which can be appreciated in this configuration more than in others due to the absence of data contention.

Figure 4. RWN (a)-(f); MCAS (g)-(l). Panels: (a) RWN-Long, (b) RWN-Long Aborts, (c) RWN-Short, (d) RWN-Short Aborts, (e) RWN-Heavy, (f) RWN-Heavy Aborts, (g) MCAS-Long, (h) MCAS-Long Aborts, (i) MCAS-Short, (j) MCAS-Short Aborts, (k) MCAS-Heavy, (l) MCAS-Heavy Aborts.

In addition, the absence of aborts exposes the transactional access overhead of each competitor. It is intuitive that UndoLog algorithms (UL-vis, UL-inv, OUL, OUL-Steal) benefit from having the values already in memory, thus they outperform others. In fact, the UndoLog’s main drawback is the costly abort, which never happens in this benchmark. With Long transactions (Figure 3(a)), STMLite benefits from eliminating lock usage at the write-back phase and it has minimal overhead at low numbers of threads. On the other hand, for Short and Heavy transactions (Figures 3(b) and 3(c)) the Ordered TL2 algorithm performs better. OWB has a moderate overhead relative to the other write-back algorithms.

In ReadNWrite1Bench (Figures 3(d)-3(i)), the transaction reads N locations and writes one. Since the transaction write-set is very small, the number of aborts is low. Similarly, UndoLog algorithms excel here as well. With long and heavy transactions (Figures 3(d) and 3(h)), the processing done by workers outweighs the overhead due to the single-threaded transaction validator, so both OUL and OUL-Steal scale well when increasing the number of workers. On the other hand, the validator represents a performance bottleneck for short transactions (Figure 3(f)), resulting in a slightly lower scalability.

In ReadWriteN (Figures 4(a)-4(f)), each transaction reads N locations, and then writes to N other locations. The large transaction write-set introduces a challenge for both undo-log (it increases the number of aborts) and write-buffer algorithms (it adds delay at commit time). The cooperative execution enables OUL, OUL-Steal and OWB to outperform all other algorithms at all workloads. OUL-Steal outperforms OUL by 10% because it significantly reduces the number of aborts (Figures 4(b), 4(d), and 4(f)).

Figure 5. Aborts Breakdown. Panels: (a) OWB Algorithm, (b) OUL Algorithm, (c) OUL-Steal Algorithm, (d) Aborts Percentage.

MCASBench performs a multi-word compare-and-swap, by reading and writing N consecutive locations. Similar to ReadWriteN, the write-set is large but the abort probability is lower than before because each pair of read/write acts on the same location. Figures 4(g)-4(l) illustrate the impact of increasing workers with the different workloads. We noticed a similar trend to ReadWriteN.
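The access patterns of the three conflicting micro-benchmarks can be summarized as read-set and write-set generators. This is a sketch of the patterns as described in the text, not the RSTM code: the shared array of `size` locations, the uniform random choice, and the assumption that ReadNWrite1 writes a location it did not read are all illustrative choices, and the function names are hypothetical.

```python
import random


def read_n_write_1(n, size, rng):
    """Read n locations and write one (assumed distinct from the reads)."""
    picks = rng.sample(range(size), n + 1)
    return picks[:n], picks[n:]


def read_write_n(n, size, rng):
    """Read n locations, then write n other locations."""
    picks = rng.sample(range(size), 2 * n)
    return picks[:n], picks[n:]


def mcas(n, size, rng):
    """Read and write the same n consecutive locations."""
    start = rng.randrange(size - n + 1)
    block = list(range(start, start + n))
    return block, block
```

In `mcas` the read-set and write-set coincide, which matches the observation that each read/write pair acts on the same location and therefore conflicts less often than in ReadWriteN.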

The breakdown of the abort reasons for OWB, OUL, and OUL-Steal is shown in Figure 5. Aborts are measured for the number of workers that achieved the maximum throughput.

In OWB (Figure 5(a)), with RNW1bench, aborts due to validation failure represent the main reason, while in write-intensive benchmarks, such as RWNbench and MCASbench, aborts are mainly due to concurrent commits (Locked Write). However, only a small fraction of these cases falls in WAW, which means that OWB can benefit from the lock-steal optimization and save a considerable number of aborts. On the other hand, applying lock-steal to OWB would complicate the design and the validation procedure. The reason is that transactions use commit-time locking and rely on the version number to validate their read-set. With lock-steal, multiple writers would increment the version number, so readers would not be able to validate simply.

For OUL and write-intensive benchmarks, concurrent writes generate between 70% and 85% of total aborts; a WAW represents at most 10% of them. In OUL-Steal, stealing the lock eliminates the problem of concurrent writes, and narrows write-write conflicts to only the WAW anomaly. However, it introduces several changes to the abort characteristics: 1) a writer transaction that steals the lock becomes able to abort any invalid speculative readers earlier than before, which is reflected in an increased number of Read After Write aborts; 2) the probability of triggering cascading aborts is increased compared to OUL (Figures 5(b) and 5(c)); and 3) the total number of aborts of OUL is reduced by one order of magnitude (Figures 3(e), 3(g), 4(b), 4(d), 4(h), 4(j), and 5(d)).

Although OUL-Steal substantially reduces the number of aborts, the speed-up is on average 20%. The reasons for that are: the abort procedure for OUL-Steal is longer than OUL (2-4 in our experiments) because it involves recursive rollback for stolen locks. This outweighs the reduction of the number of aborts; and OUL uses encounter time locking, thus aborts are detected at an early stage. This reduces the impact of aborting. In contrast, lazy algorithms (e.g., OWB) are greatly affected by aborts because the whole transaction needs to be re-executed since the invalidation is detected at commit time. It is worth noting that, in OUL algorithms the abort cost differs according to the transaction type. In fact, aborting a write transaction requires restoring its original value, thus forcing the other transaction involved in the conflict to wait for the restoration of old written values; whereas aborting the readers is cheaper.

Figure 5(d) shows the number of aborts in the maximum throughput scenario. OUL experiences more aborts than OWB because of the eager accesses; OUL-Steal avoids this drawback and experiences fewer, yet longer, aborts.

Figure 6. Execution time of STAMP (Y-axis log scale). Panels: (a) Kmeans Low, (b) Kmeans High, (c) Genome, (d) SSCA2, (e) Vacation Low, (f) Vacation High, (g) Labyrinth, (h) Intruder.
Figure 7. Execution time using PARSEC and SPEC2000 benchmarks. Panels: (a) PARSEC/Blackscholes, (b) PARSEC/Swaptions, (c) PARSEC/Fluidanimate, (d) SPEC2000/Equake.

STAMP Benchmark. STAMP (Minh et al., 2008) is a benchmark suite with applications covering a variety of domains. Figure 6 shows the execution time of the aforementioned algorithms (lower is better). Two applications (Yada and Bayes) have been excluded because they expose non-deterministic behaviors, thus their evolution is unpredictable. The datapoints for competitors that do not scale in some configuration are omitted to preserve the scale and readability of the plot. We also included performance of the unordered STM algorithm (among those in Figure 2) that behaves best in each plot.

Kmeans, a clustering algorithm, iterates over a set of points and associates them with clusters. The main computation is in finding the nearest point, while shared data updates occur when updating the cluster centers. Both OUL and OUL-Steal scale when increasing the number of workers, and under high contention OUL-Steal performs better (Figure 6(b)). OWB and Ordered NOrec have similar performance, but OWB does not degrade at high thread counts.

Genome reconstructs the gene sequence from segments of a larger gene. It uses a shared hash table to organize segments, which requires synchronizing accesses to it. Genome exhibits little contention, which makes OUL and OUL-Steal perform similarly (Figure 6(c)).

SSCA2 is a multi-graph kernel that is commonly used in domains such as biology and security. The core of the kernel uses a shared graph structure that is updated at each iteration. The amount of contention is low because the large number of graph nodes leads to infrequent concurrent updates. Figure 6(d) shows that all algorithms perform almost equally and benefit from optimistic concurrency.

Vacation is a travel reservation system using an in-memory database. Each client uses a coarse-grained transaction to execute its session; consequently, aborts are costly. Again, our cooperative model boosts the performance of the proposed algorithms, and they scale well when increasing the number of workers (clients) (Figures 6(e) and 6(f)).

Labyrinth is a multi-path maze solver. The maze is represented as a three-dimensional uniform grid, and each thread tries to connect input pairs by a path of adjacent maze points. Upon finding a path, it is highlighted on a shared output grid. Transactions conflict when their paths overlap. In Figure 6(g), NOrec outperforms the other algorithms for two reasons: 1) because Labyrinth updates adjacent addresses for the path, it is prone to false sharing in all other algorithms, which use locks; and 2) NOrec employs value-based validation, so when two conflicting transactions update a maze point with the same value, they both commit successfully.
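The effect of value-based validation can be shown with a small, simplified sketch. It is not NOrec itself: the two validation functions, the per-address version counters used for contrast, and the single-cell scenario are assumptions made for illustration. The sketch shows only that a concurrent commit which rewrites a location with the value the reader already observed does not invalidate the reader under value-based validation, whereas a check based on versions would.

```python
def value_based_validate(read_values, memory):
    """Valid if every location still holds the value that was read."""
    return all(memory[addr] == seen for addr, seen in read_values.items())


def version_based_validate(read_versions, versions):
    """Valid only if no location was written since it was read (hypothetical contrast)."""
    return all(versions[addr] == seen for addr, seen in read_versions.items())


# Hypothetical shared state: one maze cell plus a version counter for it.
memory = {"cell": 7}
versions = {"cell": 0}

# T1 reads the cell and remembers both the value and the version it saw.
t1_values = {"cell": memory["cell"]}
t1_versions = {"cell": versions["cell"]}

# T2 commits a write that stores the same value into the cell.
memory["cell"] = 7
versions["cell"] += 1

value_ok = value_based_validate(t1_values, memory)
version_ok = version_based_validate(t1_versions, versions)
```

Here `value_ok` is true because the cell still holds the value T1 read, while `version_ok` is false because the version counter changed even though the content did not.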

Intruder is a signature-based network intrusion detection system. It compares captured packets against a dictionary of intrusion signatures. Packets are processed in parallel, grouped in sessions, and stored in a self-balancing (red-black) tree. Transactions guard the tree operations; contention is high and depends on the frequency of the rebalance operation. Figure 6(h) shows that none of the algorithms scales well; moreover, the sequential execution outperforms all of them (except the unordered one).

PARSEC Benchmark. PARSEC is a benchmark suite for shared-memory chip-multiprocessor architectures.

The Black-Scholes application evaluates the Black-Scholes equation for input values. Since each iteration performs few calculations, each transaction groups multiple calculations to reduce the overhead of parallelization. Swaptions employs Monte Carlo simulation to compute prices. Fluidanimate is an application performing physics simulations. Most of its time is spent computing particle densities and forces, which involves six levels of loop nesting that update a shared array structure. Since it is not straightforward to assign ages based on the number of iterations, a global atomic integer variable is used to assign ages to transactions.
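Assigning ages from a global counter, as described for Fluidanimate, can be sketched as follows. This is an illustrative sketch, not the paper's code: the class `AgeCounter` is a hypothetical name, and a lock-protected integer stands in for the atomic fetch-and-add that a native implementation would use.

```python
import threading


class AgeCounter:
    """Hands out unique, increasing transaction ages (illustrative only)."""

    def __init__(self):
        self._next = 0
        self._lock = threading.Lock()  # stands in for an atomic fetch-and-add

    def next_age(self):
        with self._lock:
            age = self._next
            self._next += 1
            return age


counter = AgeCounter()
ages = []
ages_lock = threading.Lock()


def worker(num_transactions):
    # Each transaction takes its age when it starts, regardless of loop depth.
    for _ in range(num_transactions):
        age = counter.next_age()
        with ages_lock:
            ages.append(age)


threads = [threading.Thread(target=worker, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every transaction receives a distinct age and the ages form a gap-free sequence, so they define a total commit order without having to derive it from the indices of the nested loops.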

OUL, OUL-Steal and OWB scale in these three applications; a significant speedup over sequential is achieved in Swaptions. In both Black-Scholes and Fluidanimate, all other algorithms outperform sequential when contention is low, but their performance drops quickly when contention increases because of a high abort rate.

SPEC CPU2000 Benchmark. Equake is an application included in the SPEC CPU2000 benchmark; it simulates the propagation of elastic waves. The computation iterates over a number of time steps and, in each time step, over a number of nodes, where each calculation relies on the previous one. These loop-carried dependencies force the transactions to be committed in a specific order. Each thread is assigned a consecutive region of nodes, so only the transactions at the joints between regions may abort.
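The region assignment can be illustrated with a short sketch. It assumes, purely for illustration, that each node depends only on its immediate predecessor; Equake's real access pattern is more involved. The helper names `partition` and `joint_nodes`, the node count of 500, and the thread count of 4 are hypothetical choices.

```python
def partition(num_nodes, num_threads):
    """Split node indices 0..num_nodes-1 into consecutive regions, one per thread."""
    base, extra = divmod(num_nodes, num_threads)
    regions, start = [], 0
    for t in range(num_threads):
        size = base + (1 if t < extra else 0)
        regions.append(range(start, start + size))
        start += size
    return regions


def joint_nodes(regions):
    """Nodes whose predecessor (index - 1) is owned by a different thread."""
    return [r.start for r in regions[1:] if len(r) > 0]


regions = partition(500, 4)
joints = joint_nodes(regions)
total_nodes = sum(len(r) for r in regions)
```

Under this assumption only the first node of each region after the first reads data produced by another thread, so the number of nodes that can trigger a conflict grows with the number of threads rather than with the number of nodes.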

When testing this benchmark, we set the input size to 500 nodes. The results show that OUL, OUL-Steal and OTL2 scale when increasing the number of threads, up to 32 threads; the achieved peak speedup is 30%. Beyond that point, the performance of all algorithms drops because of high contention and an increasing number of aborts.

9. Conclusion

In this paper, we presented three algorithms that address the problem of committing transactions in an order defined prior to execution. Our results show that even if a system requires a specific commit order, it is possible to achieve high performance by exploiting parallelism, even in the presence of data conflicts.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments, Binoy Ravindran for his feedback at the very early stage of this paper, and Jacob Nelson for the insightful discussion. This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-17-1-0367 and by the National Science Foundation under Grant No. CNS-1814974.

References

  • SOF ([n. d.]) [n. d.]. RSTM: The University of Rochester STM. http://www.cs.rochester.edu/research/synchronization/rstm/.
  • Attiya et al. (2014) Hagit Attiya, Alexey Gotsman, Sandeep Hans, and Noam Rinetzky. 2014. Safety of Live Transactions in Transactional Memory: TMS is Necessary and Sufficient. In DISC. 376–390.
  • Barreto et al. (2012) Joao Barreto, Aleksandar Dragojevic, Paulo Ferreira, Ricardo Filipe, and Rachid Guerraoui. 2012. Unifying thread-level speculation and transactional memory. In Proceedings of the 13th International Middleware Conference. Springer-Verlag New York, Inc., 187–207.
  • Bernstein et al. (1987) Philip A Bernstein, Vassos Hadzilacos, and Nathan Goodman. 1987. Concurrency control and recovery in database systems. Addison-Wesley.
  • Bhowmik and Franklin (2002) Anasua Bhowmik and Manoj Franklin. 2002. A general compiler framework for speculative multithreading. In Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures (SPAA ’02). ACM, New York, NY, USA, 99–108. https://doi.org/10.1145/564870.564885
  • Brown et al. (2016) Trevor Brown, Alex Kogan, Yossi Lev, and Victor Luchangco. 2016. Investigating the Performance of Hardware Transactions on a Multi-Socket Machine. In Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2016, Asilomar State Beach/Pacific Grove, CA, USA, July 11-13, 2016, Christian Scheideler and Seth Gilbert (Eds.). ACM, 121–132.
  • Dalessandro and Scott (2012) Luke Dalessandro and Michael L. Scott. 2012. Sandboxing Transactional Memory. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT ’12). ACM, New York, NY, USA, 171–180. https://doi.org/10.1145/2370816.2370843
  • Dalessandro et al. (2010) Luke Dalessandro, Michael F. Spear, and Michael L. Scott. 2010. NOrec: Streamlining STM by Abolishing Ownership Records. In PPoPP. 67–78.
  • Daly et al. (2018) Henry Daly, Ahmed Hassan, Michael F. Spear, and Roberto Palmieri. 2018. NUMASK: High Performance Scalable Skip List for NUMA. In 32nd International Symposium on Distributed Computing, DISC 2018, New Orleans, LA, USA, October 15-19, 2018 (LIPIcs), Ulrich Schmid and Josef Widder (Eds.), Vol. 121. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 18:1–18:19.
  • Dice et al. (2006) Dave Dice, Ori Shalev, and Nir Shavit. 2006. Transactional Locking II. In DISC. 194–208.
  • Doherty et al. (2013) Simon Doherty, Lindsay Groves, Victor Luchangco, and Mark Moir. 2013. Towards formally specifying and verifying Transactional Memory. Formal Aspects of Computing 25, 5 (2013), 769–799. https://doi.org/10.1007/s00165-012-0225-8
  • Felber et al. (2008) Pascal Felber, Christof Fetzer, and Torvald Riegel. 2008. Dynamic performance tuning of word-based software transactional memory. In PPoPP. 237–246.
  • Gonzalez-Mesa et al. (2014) MA Gonzalez-Mesa, Eladio Gutierrez, Emilio L Zapata, and Oscar Plata. 2014. Effective Transactional Memory Execution Management for Improved Concurrency. ACM TACO 11, 3 (2014), 24.
  • Guerraoui and Kapalka (2008) Rachid Guerraoui and Michal Kapalka. 2008. On the correctness of transactional memory. In PPoPP. 175–184.
  • Guerraoui and Kapalka (2011) R. Guerraoui and M. Kapalka. 2011. Principles of Transactional Memory. Morgan & Claypool.
  • Hendler et al. (2010) Danny Hendler, Itai Incze, Nir Shavit, and Moran Tzafrir. 2010. Flat combining and the synchronization-parallelism tradeoff. In SPAA. 355–364.
  • Herlihy and Moss (1993) Maurice Herlihy and J. Eliot B. Moss. 1993. Transactional Memory: Architectural Support for Lock-Free Data Structures. In ISCA. 289–300.
  • Hirve et al. (2014) Sachin Hirve, Roberto Palmieri, and Binoy Ravindran. 2014. Archie: a speculative replicated transactional system. In Proceedings of the 15th International Middleware Conference, Bordeaux, France, December 8-12, 2014. 265–276.
  • Kapritsos et al. (2012) Manos Kapritsos, Yang Wang, Vivien Quéma, Allen Clement, Lorenzo Alvisi, and Mike Dahlin. 2012. All about Eve: Execute-Verify Replication for Multi-Core Servers. In 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012, Hollywood, CA, USA, October 8-10, 2012. 237–250.
  • Lamport (1998) Leslie Lamport. 1998. The Part-Time Parliament. ACM Trans. Comput. Syst. 16, 2 (1998), 133–169.
  • Mehrara et al. (2009) Mojtaba Mehrara, Jeff Hao, Po-Chun Hsu, and Scott Mahlke. 2009. Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory. In PLDI. 166–176.
  • Minh et al. (2008) Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, and Kunle Olukotun. 2008. STAMP: Stanford Transactional Applications for Multi-Processing. In IISWC. 35–46.
  • Oancea et al. (2009) Cosmin E. Oancea, Alan Mycroft, and Tim Harris. 2009. A Lightweight In-place Implementation for Software Thread-level Speculation. In Proceedings of the Twenty-first Annual Symposium on Parallelism in Algorithms and Architectures (SPAA ’09). ACM, New York, NY, USA, 223–232. https://doi.org/10.1145/1583991.1584050
  • Papadimitriou (1979) Christos H. Papadimitriou. 1979. The serializability of concurrent database updates. J. ACM 26 (October 1979), 631–653. Issue 4.
  • Prabhu and Olukotun (2003) Manohar K. Prabhu and Kunle Olukotun. 2003. Using thread-level speculation to simplify manual parallelization. In Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming (PPoPP ’03). ACM, New York, NY, USA, 1–12. https://doi.org/10.1145/781498.781500
  • Ramadan et al. (2008) Hany E. Ramadan, Christopher J. Rossbach, and Emmett Witchel. 2008. Dependence-aware transactional memory for increased concurrency. In 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41 2008), November 8-12, 2008, Lake Como, Italy. 246–257.
  • Ramadan et al. (2009) Hany E Ramadan, Indrajit Roy, Maurice Herlihy, and Emmett Witchel. 2009. Committing conflicting transactions in an STM. In ACM Sigplan Notices, Vol. 44. ACM, 163–172.
  • Raman et al. (2010) Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, and David I. August. 2010. Speculative parallelization using software multi-threaded transactions. In Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems (ASPLOS ’10). ACM, New York, NY, USA, 65–76. https://doi.org/10.1145/1736020.1736030
  • Ramaseshan and Mueller (2008) Ravi Ramaseshan and Frank Mueller. 2008. Toward thread-level speculation for coarse-grained parallelism of regular access patterns. In Workshop on Programmability Issues for Multi-Core Computers. 12.
  • Rauchwerger and Padua (1995) Lawrence Rauchwerger and David Padua. 1995. The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization. SIGPLAN Not. 30 (June 1995), 218–232. Issue 6. https://doi.org/10.1145/223428.207148
  • Ravichandran et al. (2014) Kaushik Ravichandran, Ada Gavrilovska, and Santosh Pande. 2014. DeSTM: Harnessing Determinism in STMs for Application Development. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT ’14). 213–224.
  • Riegel et al. (2006) Torvald Riegel, Pascal Felber, and Christof Fetzer. 2006. A Lazy Snapshot Algorithm with Eager Validation. In DISC 2006. Lecture Notes in Computer Science, Vol. 4167. Springer, 284–298.
  • Saad et al. (2012) Mohamed M. Saad, Mohamed Mohamedin, and Binoy Ravindran. 2012. HydraVM: Extracting Parallelism from Legacy Sequential Code Using STM. In HotPar.
  • Saad et al. (2016) Mohamed M. Saad, Roberto Palmieri, and Binoy Ravindran. 2016. On ordering transaction commit. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2016, Barcelona, Spain, March 12-16, 2016, Rafael Asenjo and Tim Harris (Eds.). ACM, 46:1–46:2. https://doi.org/10.1145/2851141.2851191
  • Saad et al. (2018) Mohamed M. Saad, Roberto Palmieri, and Binoy Ravindran. 2018. Lerna: Parallelizing Dependent Loops Using Speculation. In Proceedings of the 11th ACM International Systems and Storage Conference, SYSTOR 2018, HAIFA, Israel, June 04-07, 2018. ACM, 37–48.
  • Schneider (1990) Fred B. Schneider. 1990. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv. 22, 4 (1990), 299–319.
  • Streit et al. (2013) Kevin Streit, Clemens Hammacher, Andreas Zeller, and Sebastian Hack. 2013. Sambamba: runtime adaptive parallel execution. In ADAPT. 7.
  • von Praun et al. (2008) Christoph von Praun, Rajesh Bordawekar, and Calin Cascaval. 2008. Modeling optimistic concurrency using quantitative dependence analysis. In PPoPP. 185–196.
  • Von Praun et al. (2007) Christoph Von Praun, Luis Ceze, and Calin Caşcaval. 2007. Implicit parallelism with ordered transactions. In PPoPP. 79–89.
  • Zhang et al. (2010) Lingli Zhang, Vinod K Grover, Michael M Magruder, David Detlefs, John Joseph Duffy, and Goetz Graefe. 2010. Software transaction commit order and conflict management. US Patent 7,711,678.