Concurrent programming is difficult because it requires developers to consider interactions between multiple threads of execution and mediate access to shared resources and data. Programming languages can offer higher-level abstractions to reduce this complexity by making concurrent programming more declarative. One such abstraction is the monitor (hansen:osp; hoare:monitors), which is an object that encapsulates shared state and allows threads access to it only through a set of operations, between which the monitor enforces mutual exclusion.
Ideally, developers would implement monitors using implicit synchronization, wherein the only synchronization primitive is a waituntil(P) operation that blocks threads until condition P is satisfied. The compiler or runtime can then automatically generate the necessary explicit synchronization operations (locks, condition variables, etc.) to implement the monitor in a way that respects the semantics of the implicit monitor. However, automatically deriving an efficient explicit monitor from its implicit specification is a challenging problem, and there have been several recent research efforts, including both run-time techniques like AutoSynch (hung:autosynch) and compile-time tools like Expresso (expresso), to support implicit-synchronization monitors.
While these state-of-the-art approaches make it possible to program using implicit monitors, they still achieve sub-optimal performance because they adhere closely to the monitor’s mutual exclusion requirement. They generally use a single lock for the entire monitor and allow access by at most one thread at a time across all monitor operations. In practice, however, many monitors can admit additional concurrency while still preserving the appearance of mutual exclusion. For example, consider a FIFO queue monitor that provides take and put operations. These two operations can safely run concurrently unless the queue is empty or full, as they will not access the same slot in the queue. Today, realizing this fine-grained concurrency requires expert developers to fall back to hand-written explicit synchronization. These implementations are subtle and error-prone, and there is no easy way for developers to determine when they have extracted the maximum possible concurrency from such an implementation.
This paper presents a new technique which automatically synthesizes fine-grained explicit-synchronization monitors. Our technique takes as input an implicit monitor that specifies the desired operations and automatically generates an implementation that allows as much concurrency as possible between those operations while still preserving the appearance of mutual exclusion. The key idea is to decompose each monitor operation into a set of fragments and allocate a set of locks to each fragment to enforce the mutual exclusion requirement while allowing as many fragments as possible to run concurrently. The resulting implementation selectively acquires and releases locks at fragment boundaries within each operation and signals condition variables as needed.
At a high level, our approach operates in three phases to generate a high-performance explicit synchronization monitor from its implicit version:
Signal placement: First, we use an off-the-shelf technique (expresso) to infer a signaling regime which determines where to insert signaling operations on condition variables. While the output of this tool is sufficient to synthesize a single-lock implementation, it does not admit any additional concurrency wherein different threads can perform monitor operations simultaneously.
Static analysis: Second, we perform static analysis to infer sufficient conditions for correctness. That is, the output of the static analysis is a set of conditions such that if the synthesized monitor obeys them, it is guaranteed to be correct-by-construction. A key challenge for this static analysis is to determine which fragments can safely execute concurrently without creating a potential violation of the monitor semantics. The analysis simulates interleaving each fragment between the fragments of other operations and determines which possible interleavings are safe.
Synchronization protocol synthesis via MaxSAT: Finally, we reduce the synthesis problem to a maximum satisfiability (MaxSAT) instance from whose solution an explicit synchronization protocol can be extracted. The hard constraints in the MaxSAT problem enforce the correctness requirements extracted by the static analysis, while the soft constraints encode two competing objectives: minimizing the total number of locks used and maximizing the number of pairs of fragments that can run concurrently.
We have implemented our proposed approach in a tool called Cortado that operates on Java monitors and evaluated it on a collection of monitor implementations that are (1) drawn from popular open-source projects and (2) contain parallelization opportunities that can be achieved via fine-grained locking. Given only the implicit monitor as input, Cortado synthesizes explicit-synchronization monitors that perform as well as, or better than, hand-written explicit implementations by expert developers. Compared to state-of-the-art automated tools for synthesizing explicit monitors (expresso), Cortado-synthesized monitors extract more concurrency and therefore perform much better (up to ) on heavily contended workloads.
In summary, this paper makes four main contributions:
A new technique for automatically synthesizing fine-grained monitor implementations that admit the maximum possible concurrency.
A novel static analysis for inferring safe interleaving opportunities between threads.
A MaxSAT encoding to automate reasoning about both the correctness and performance of the synthesized explicit-synchronization monitor.
An implementation of our technique, Cortado, that outperforms both state-of-the-art automated tools and expert-written code on benchmarks that can be parallelized via fine-grained locking.
In this section, we give an overview of our approach through a motivating example. Given the implicit-synchronization monitor shown in Figure 0(a), our goal is to automatically synthesize an efficient and semantically equivalent explicit-synchronization monitor like the one presented in Figure 0(b). In what follows, we walk through this example and describe how our technique is able to automatically generate the code in Figure 0(b).
2.1. Implicit-synchronization monitor
Our technique takes as input an implicit-synchronization monitor that specifies which operations should execute atomically and when certain operations are allowed to proceed but does not fix a specific synchronization protocol for realizing that behavior. For example, Figure 0(a) shows an implicit monitor that implements a limited-capacity blocking queue via a bounded circular array buffer. This monitor defines two operations, put and take, that execute atomically (i.e., the body of each method must appear to execute as one indivisible unit). The put operation adds an object if the queue is not full, and take removes an object if the queue is not empty. If one of these method calls cannot proceed (i.e., the queue is full or empty), the monitor blocks the calling thread’s execution using a waituntil statement until the operation can be executed. For example, the waituntil statement in take blocks execution until there is at least one object in the queue.
As Figure 0(a) illustrates, implicit-synchronization monitors make concurrent programming simpler because they are declarative: they merely state which operations are atomic and when operations can proceed, but they do not specify a particular synchronization protocol for realizing that desired behavior. However, most programming languages do not offer implicit synchronization facilities; so, concurrent programs must instead be implemented in terms of explicit synchronization constructs such as locks and condition variables, as we discuss next.
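To make the contrast concrete, the following is a minimal sketch of how the implicit bounded queue could be encoded by hand with a single lock in Java, which has no waituntil construct. The class name CoarseBoundedQueue and the fields first and last are hypothetical; only count and queue follow the names used in the text. Each waituntil(P) is rendered as a loop that waits while P is false, and every update conservatively wakes all waiters.

```java
// Hypothetical single-lock encoding of the implicit bounded-queue monitor.
// Assumption: `first` and `last` are illustrative index fields, not taken from the paper's figure.
final class CoarseBoundedQueue {
    private final Object[] queue;
    private int first = 0;   // index of the next element to take
    private int last = 0;    // index of the next free slot
    private int count = 0;   // number of stored elements

    CoarseBoundedQueue(int capacity) {
        queue = new Object[capacity];
    }

    // The intrinsic lock of `this` enforces mutual exclusion across all operations.
    synchronized void put(Object o) throws InterruptedException {
        while (!(count < queue.length)) {
            wait();          // stands in for waituntil(count < queue.length)
        }
        queue[last] = o;
        last = (last + 1) % queue.length;
        count++;
        notifyAll();         // conservative signaling: wake every waiter
    }

    synchronized Object take() throws InterruptedException {
        while (!(count > 0)) {
            wait();          // stands in for waituntil(count > 0)
        }
        Object result = queue[first];
        queue[first] = null;
        first = (first + 1) % queue.length;
        count--;
        notifyAll();
        return result;
    }
}
```

In this encoding a put and a take can never overlap, even when they touch different slots, which is the concurrency that a fine-grained protocol tries to recover.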
2.2. Explicit-synchronization monitor
Figure 0(b) shows an explicit-synchronization implementation of the bounded queue from Figure 0(a), written by an expert. This implementation uses two distinct locks, putLock and takeLock, to protect the put and take methods, respectively. The explicit-synchronization monitor also uses an atomic integer for the count field, transforming reads into get() calls and writes into the appropriate atomic method (e.g., count.getAndIncrement()). The expert-written monitor performs explicit signaling via condition variables notFull and notEmpty, which are associated with putLock and takeLock, respectively. When a thread cannot execute one of these operations, it calls await on the appropriate condition variable to block its execution. A thread blocked in put can only be unblocked by a corresponding take that frees up space in the queue. To do so, the take must acquire putLock and perform a signal operation on condition variable notFull. The logic for unblocking a thread blocked in take is symmetric.
Although the expert-written version has more locks than a single global-lock implementation, its performance will often be better: Introducing two locks allows put and take to execute concurrently, although multiple concurrent puts are still serialized, as are multiple takes. A single global lock would admit no concurrency in this case and would still incur the same synchronization overhead of acquiring and releasing a lock on every method call. The expert implementation mitigates the overhead of having two locks by acquiring locks selectively: take only acquires the putLock if it is possible for there to be a put operation currently blocked waiting for space in the queue, which happens only if the queue was full when take ran (the put/takeLock case is symmetric). This example demonstrates the intricacy of synthesizing fine-grained locking protocols: instead of only minimizing the total number of locks, we must also try to maximize the available concurrency.
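The two-lock design described above can be sketched in Java as follows. This is an illustrative reconstruction under stated assumptions, not the code of Figure 0(b): the names putLock, takeLock, notFull, notEmpty, count, and queue come from the text, while the class name, the fields first and last, the helper signal, and the cascading signals inside put and take (a pattern borrowed from java.util.concurrent.LinkedBlockingQueue) are assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical two-lock bounded queue; a sketch, not the paper's Figure 0(b).
final class TwoLockBoundedQueue {
    private final Object[] queue;
    private int first = 0;                                  // accessed only under takeLock
    private int last = 0;                                   // accessed only under putLock
    private final AtomicInteger count = new AtomicInteger();

    private final ReentrantLock putLock = new ReentrantLock();
    private final Condition notFull = putLock.newCondition();
    private final ReentrantLock takeLock = new ReentrantLock();
    private final Condition notEmpty = takeLock.newCondition();

    TwoLockBoundedQueue(int capacity) {
        queue = new Object[capacity];
    }

    void put(Object o) throws InterruptedException {
        int before;
        putLock.lock();
        try {
            while (count.get() == queue.length) notFull.await();
            queue[last] = o;
            last = (last + 1) % queue.length;
            before = count.getAndIncrement();               // value of count before this put
            if (before + 1 < queue.length) notFull.signal(); // assumed cascade to another waiting put
        } finally {
            putLock.unlock();
        }
        // Acquire takeLock only if the queue was empty, i.e., a take may be blocked.
        if (before == 0) signal(takeLock, notEmpty);
    }

    Object take() throws InterruptedException {
        Object result;
        int before;
        takeLock.lock();
        try {
            while (count.get() == 0) notEmpty.await();
            result = queue[first];
            queue[first] = null;
            first = (first + 1) % queue.length;
            before = count.getAndDecrement();               // value of count before this take
            if (before > 1) notEmpty.signal();              // assumed cascade to another waiting take
        } finally {
            takeLock.unlock();
        }
        // Acquire putLock only if the queue was full, i.e., a put may be blocked.
        if (before == queue.length) signal(putLock, notFull);
        return result;
    }

    // A condition may only be signaled while holding its associated lock.
    private static void signal(ReentrantLock lock, Condition cond) {
        lock.lock();
        try {
            cond.signal();
        } finally {
            lock.unlock();
        }
    }
}
```

A put and a take that run concurrently operate on different slots whenever the queue is neither empty nor full, and the atomic count orders the slot write in put before the corresponding slot read in take.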
2.3. Our Approach
Our tool Cortado automatically synthesizes the efficient explicit-synchronization monitor in Figure 0(b) given the implicit version from Figure 0(a). It does so in three phases: First, it infers when and how signaling operations should take place. Second, it performs static analysis to infer sufficient conditions for the synthesized monitor to be correct. Third, it encodes the synchronization protocol synthesis problem as a MaxSAT instance and uses a model of the MaxSAT problem to generate an explicit-synchronization monitor. Since prior work can already handle the first phase, we focus only on the latter two phases in the following discussion.
The granularity of our synthesized locking protocol is at the level of code fragments, where each fragment is a single-entry region of code within a single method. For example, the fragments chosen for the blocking queue example are indicated by comments in Figure 0(a). Fragments are the indivisible unit of concurrency in our approach: we aim to maximize the number of fragments that can run concurrently, but we do not modify the code within a fragment to introduce extra concurrency (e.g., by removing data races). Hence, the explicit monitor synthesized by our approach acquires and releases locks only at fragment boundaries.
To ensure correctness of the synthesized monitor, our technique needs to enforce the following three key requirements:
Data-race freedom: Fragments that involve a data race must not be able to run concurrently.
Deadlock freedom: Locks must be acquired and released in an order that prevents deadlocks.
Atomicity: Each monitor operation should appear to take place as one indivisible unit. That is, even though the implementation can allow thread interleavings inside monitor operations, the resulting state should be equivalent to one where each method executes truly atomically.
Here, the second requirement (i.e., deadlock freedom) does not necessitate any static analysis, as we can prevent deadlocks by imposing a static total order on locks (birrell1989introduction) and ensuring that locks are acquired and released in a manner that is consistent with this order. However, in order to ensure data-race freedom and atomicity, we need to perform static analysis of the source code to identify (1) code fragments that have a data race, and (2) interleaving opportunities between code fragments. Since detection of data races is a well-studied problem, the novelty of our static analysis lies in identifying safe interleaving opportunities. Hence, the key question addressed by our analysis is the following: Given a code fragment executed by one thread and two consecutive code fragments executed by a different thread, is it safe to interleave the execution of the first fragment in between the two consecutive fragments while ensuring that monitor operations appear to take place atomically?
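The deadlock-freedom requirement lends itself to a short illustration. The sketch below assumes that each lock carries an integer rank encoding its position in the static total order; the names OrderedLocks, RankedLock, acquireAll, and releaseAll are hypothetical. Locks are always acquired in ascending rank, so no two threads can wait on each other in a cycle.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper illustrating lock acquisition under a static total order.
final class OrderedLocks {
    /** A lock tagged with its (assumed) position in the total order. */
    static final class RankedLock {
        final int rank;
        final ReentrantLock lock = new ReentrantLock();

        RankedLock(int rank) {
            this.rank = rank;
        }
    }

    /** Acquires all requested locks in ascending rank and returns them in that order. */
    static List<RankedLock> acquireAll(List<RankedLock> needed) {
        List<RankedLock> sorted = new ArrayList<>(needed);
        sorted.sort(Comparator.comparingInt((RankedLock l) -> l.rank));
        for (RankedLock l : sorted) {
            l.lock.lock();
        }
        return sorted;
    }

    /** Releases the locks in the reverse of their acquisition order. */
    static void releaseAll(List<RankedLock> held) {
        for (int i = held.size() - 1; i >= 0; i--) {
            held.get(i).lock.unlock();
        }
    }
}
```

Because every thread sorts by the same ranks, the order in which callers list the locks does not matter.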
To answer this question, our method performs a novel static analysis to identify a set of such safe interleavings. For instance, going back to the running example, our analysis determines that it is safe to interleave the execution of fragment 4 in Figure 0(a) in between fragments 5 and 6 by checking a number of commutativity relations between code fragments. In this instance, since our analysis proves that fragment 4 left-commutes (reduction-lip) with fragment 5 and right-commutes (reduction-lip) with 6 and all of its successors, we identify this as a safe interleaving opportunity. On the other hand, our analysis concludes that interleaving fragment 4 in between 1 and 2 is not safe because fragment 4 does not left-commute with fragment 1 — intuitively, this is because fragment 4 can falsify predicate count < queue.length that appears in the waituntil statement of fragment 1.
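The intuition behind the unsafe case can be checked with a small bounded enumeration. The sketch below is not the static analysis itself; it only tests, over all count values up to an assumed capacity, whether applying a fragment's effect can turn a true guard into a false one. The class and method names are hypothetical.

```java
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

// Hypothetical bounded check: can `effect` falsify `guard` for some count in [0, capacity]?
final class GuardFalsification {
    static boolean canFalsify(IntPredicate guard, IntUnaryOperator effect, int capacity) {
        for (int count = 0; count <= capacity; count++) {
            // The guard holds before the effect but no longer holds afterwards.
            if (guard.test(count) && !guard.test(effect.applyAsInt(count))) {
                return true;
            }
        }
        return false;
    }
}
```

With a capacity of 4, the effect count++ falsifies the guard count < 4 (starting from count = 3), whereas count-- never does, which matches the intuition that an increment must not slip in between the guard of put and the code that relies on it.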
Once we identify possible data races and safe interleavings via static analysis, we use this information to generate a MaxSAT instance whose solution corresponds to a fine-grained synchronization protocol. Specifically, our MaxSAT encoding uses a boolean variable for each pair of a code fragment and a lock, indicating that the fragment must hold the lock, and generates both hard constraints (for correctness) and soft constraints (for efficiency) over these variables. Thus, if the MaxSAT solver returns a model in which such a variable is assigned true, the synthesized code must acquire the corresponding lock prior to executing the corresponding fragment. Similarly, our MaxSAT encoding introduces a variable for each field fld, indicating that fld should be implemented using an atomic type.
The hard constraints in our MaxSAT encoding correspond to the three correctness requirements mentioned earlier, namely (1) data race prevention, (2) deadlock freedom, and (3) atomicity. The soft constraints, on the other hand, encode our optimization objective. In what follows, we give a brief overview of the different types of constraints in our encoding, focusing only on constraints that involve lock acquisition variables. However, it is worth noting that our technique also generates constraints on atomic variables and can automatically convert fields to atomic types whenever doing so is safe and more efficient than introducing a lock.
Given a pair of code fragments that have a potential data race according to the static analysis, our MaxSAT encoding introduces hard constraints stating that the two fragments must share at least one common lock. For example, in Figure 0(a), our analysis determines that fragments 4 and 8 cannot run in parallel since they both write to the same memory location count. Thus, the MaxSAT instance contains boolean constraints to make sure that two different threads cannot execute count-- and count++ at the same time.
Our approach precludes deadlocks by imposing a total order on locks. In particular, it enforces that a thread can only acquire lock if does not already hold any lock where . For example, in Figure 0(a), suppose the locking protocol determines that fragments 1 and 2 must hold all locks in sets and respectively. Between executing the two fragments, the code will need to acquire all locks in . Hence, we add constraints for every pair of locks and so that those locks can be acquired while respecting the order .
Our MaxSAT encoding also includes constraints to ensure that monitor operations appear to execute atomically. Suppose that our static analysis determines that a thread cannot safely execute some code fragment in between another thread’s execution of two consecutive code fragments. To prevent such an unsafe interleaving, we add hard constraints to ensure that all three fragments share at least one common lock. For example, since our analysis determines that fragment 4 (count++) cannot be interleaved with any other pair of fragments in the same method put (running concurrently on a different thread), our MaxSAT encoding includes a hard constraint asserting that fragment 4 must share a lock with all other fragments in the put method.
Because the efficiency of the synthesized code depends on both the allowed parallelization opportunities and the number of locks, our optimization objective tries to minimize the number of locks and maximize the number of fragments that can run in parallel. To encode the latter objective, our MaxSAT encoding includes soft constraints asserting that any two parallelizable fragments must not share a lock. To encode the former objective, we add, for each lock, a soft constraint stating that no fragment in the monitor holds that lock.
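The shape of this encoding can be illustrated on a toy instance. The sketch below replaces the MaxSAT solver with exhaustive enumeration and assumes unit weights on all soft constraints, three fragments, and two candidate locks; it covers only the data-race hard constraints and the two kinds of soft constraints described above. All names are hypothetical.

```java
// Toy brute-force stand-in for the MaxSAT step (assumptions: 3 fragments, 2 locks, unit weights).
final class ToyLockMaxSat {
    static final int FRAGMENTS = 3;
    static final int LOCKS = 2;

    // Bit (f * LOCKS + l) of mask is the variable "fragment f holds lock l".
    static boolean holds(int mask, int f, int l) {
        return ((mask >> (f * LOCKS + l)) & 1) == 1;
    }

    static boolean shareLock(int mask, int f, int g) {
        for (int l = 0; l < LOCKS; l++) {
            if (holds(mask, f, l) && holds(mask, g, l)) return true;
        }
        return false;
    }

    static boolean lockUnused(int mask, int l) {
        for (int f = 0; f < FRAGMENTS; f++) {
            if (holds(mask, f, l)) return false;
        }
        return true;
    }

    // Soft constraints: parallelizable pairs share no lock; each lock is unused.
    static int softScore(int mask, int[][] parallel) {
        int score = 0;
        for (int[] p : parallel) {
            if (!shareLock(mask, p[0], p[1])) score++;
        }
        for (int l = 0; l < LOCKS; l++) {
            if (lockUnused(mask, l)) score++;
        }
        return score;
    }

    /** Returns an assignment that satisfies all hard constraints with maximal soft score, or -1. */
    static int solve(int[][] racy, int[][] parallel) {
        int best = -1;
        int bestScore = -1;
        for (int mask = 0; mask < (1 << (FRAGMENTS * LOCKS)); mask++) {
            boolean hardOk = true;
            for (int[] r : racy) {
                hardOk &= shareLock(mask, r[0], r[1]);   // hard: racy fragments share a lock
            }
            if (!hardOk) continue;
            int score = softScore(mask, parallel);
            if (score > bestScore) {
                bestScore = score;
                best = mask;
            }
        }
        return best;
    }
}
```

For an instance in which fragments 0 and 1 race while fragment 2 can run in parallel with both, the optimum uses a single lock held by fragments 0 and 1 and leaves fragment 2 lock-free, satisfying three of the four soft constraints.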
A solution of the generated MaxSAT instance determines (a) which fragments should hold which locks, (b) which fields should be implemented using atomic types, and (c) which locks should be associated with which condition variables. Thus, together with the output of the signal placement technique (expresso), a model of the MaxSAT problem can be automatically translated into the target monitor implementation. For our running example, Cortado synthesizes precisely the implementation in Figure 0(b) given the implicit monitor of Figure 0(a).
In this section, we describe our source and target languages and define what it means for an explicit synchronization monitor to correctly implement an implicit one.
3.1. Background on Monitors
In this work, we assume that all shared resources between threads are handled by a monitor class, which consists of a set of fields and a set of operations (methods). The fields constitute the only shared state between threads, and threads can only access this shared state by performing one of the monitor operations. These operations can be performed by an arbitrary, yet fixed, number of threads, and locations reachable through arguments are assumed to be thread-local. We represent each thread by a unique identifier, and we model memory locations using access paths (aps), each consisting of a base variable optionally followed by a finite sequence of field accesses. We also assume that a special this variable stores the memory location of the monitor object.
Definition 3.1 (Monitor State). A monitor state is a mapping from pairs, each consisting of a thread identifier and an access path, to values.
3.2. Source Language
Our source language, presented in Figure 1(a), corresponds to implicit synchronization monitors without explicit locking or signaling. The body of each monitor operation consists of a sequence of so-called Conditional Critical Regions (CCRs) (hoare:ccr), which in turn consist of a waituntil statement followed by one or more regular non-blocking statements. We refer to the predicate of the waituntil statement of a CCR as its guard and to the rest of the statements as its body. A thread executes the body of the CCR atomically if its guard evaluates to true; otherwise it suspends its execution and exits the monitor until the predicate becomes true. More formally, the semantics of our source language are defined via the notion of an implicit monitor history:
Definition 3.2 (Implicit monitor history). Given a set of threads interacting with each other through monitor , an implicit monitor history is a sequence where each is a CCR of and is a thread identifier.
Given history , we define an argument mapping to be a list whose ’th element maps formal parameters of to their actual value for each event in .
Definition 3.3 (Implicit monitor semantics). Given a monitor , initial state , and monitor history with argument mapping , the operational semantics of is defined using a judgment indicating that the new monitor state is after executing on state .
Because our source language is very similar to the one used in expresso, we omit a formal definition of the operational semantics. Following that work, we also consider an implicit history to be valid only if it respects the program order of the input monitor.
3.3. Target Language
Figure 1(b) presents the language of explicit-synchronization monitors. The overall structure of this target language is similar to the source language but with a few important differences. First, an explicit monitor contains locks, condition variables, and atomic fields, collectively referred to as synchronization variables. Second, CCRs in the target language do not contain waituntil statements; instead, the logic of a waituntil statement is implemented by calling methods on the appropriate condition variable. We assume that synchronization variables support all the standard synchronization operations present in modern concurrent languages (e.g., await, signal, signalAll, etc.). Finally, our target language contains a special update statement for performing updates on atomic fields: it takes as arguments an atomic field and a unary function, and atomically replaces the value of the field with the result of applying the function to it. For instance, applying update to an atomic field c with a function that adds one atomically increments c by one and yields the value that c held before the update.
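In Java, the closest standard counterpart to such an update statement is probably AtomicInteger.getAndUpdate, which atomically applies a unary function and returns the previous value. The sketch below assumes this correspondence; the class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of an atomic "update" using the standard AtomicInteger API.
final class AtomicUpdateDemo {
    /** Atomically applies x -> x + 1 to c and returns the value c held before the update. */
    static int incrementAndReturnOld(AtomicInteger c) {
        return c.getAndUpdate(x -> x + 1);
    }

    public static void main(String[] args) {
        AtomicInteger c = new AtomicInteger(41);
        int old = incrementAndReturnOld(c);
        System.out.println(old + " -> " + c.get()); // prints "41 -> 42"
    }
}
```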
Definition 3.4 (Explicit monitor history). Given a set of threads executing in monitor , an explicit monitor history is a sequence where each is a (non-composite) statement of a monitor operation and is a thread identifier.
Leveraging the same notion of argument mappings defined in Section 3.2, we define explicit monitor semantics as follows:
Definition 3.5 (Explicit monitor semantics). Given a monitor , initial state , and monitor history with argument mapping , the operational semantics of is defined using a judgment indicating that the new state is after executing on initial state .
The full operational semantics of our target language is given in the appendix.
3.4. Relating Implicit and Explicit Histories
In order to formalize the correctness of our approach, we need to relate an implicit history of a source monitor with an explicit history of its corresponding target version. Because every history of an implicit monitor induces a corresponding history of its explicit version, we define an operation that “translates” an implicit history to an explicit one. That is, given an implicit history together with its argument mapping and an initial state, this operation returns a pair consisting of (1) a history of the explicit monitor containing all statements executed by the implicit history under that initial state and (2) the argument mapping for that explicit history.
Example 3.6.
Using this operation, we can classify explicit histories as being sequential or interleaved:
Definition 3.7 (Sequential history). Let be an explicit monitor implementation of . We say that an explicit history of monitor with argument mapping is sequential iff there exist a history of , argument mapping , and initial state such that .
In other words, a sequential history corresponds to an execution in which statements of the explicit monitor are not interleaved between threads.
Example 3.8. Going back to Figure 2(c), history is sequential but is not.
Next, we introduce the notion of well-formed histories, which, intuitively, respect the program order of the original implicit monitor:
Definition 3.9 (Well-formed history). Let be the projection of onto thread (i.e., it filters out all elements of not involving thread ). We say that a history of is well-formed iff, for every thread , there exist sequential histories such that .
Intuitively, well-formed histories respect program dependence in the original monitor for every thread. By definition, every sequential history is also well-formed. In the remainder of this paper, we implicitly mean well-formed explicit history whenever we refer to an explicit history.
Example 3.10. Histories , from Figure 2(c) are both well-formed. However, the following history is not well-formed because it does not respect program order:
Definition 3.11 (Interleaved history). We say that a history of is interleaved iff it is (1) well-formed and (2) not sequential.
Example 3.12. History from Figure 2(c) is interleaved.
Next, we define what it means for an explicit history to simulate an implicit history.
Definition 3.13 (Simulation relation). Let be an explicit version of implicit monitor . We say that an explicit history of with argument mapping simulates of on input , denoted , if there exist sequential history and such that:
In other words, simulates a history of the original monitor if it is a (well-formed) permutation of some sequential history of the explicit monitor .
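One plausible reading of "well-formed permutation" is that the explicit history contains the same events as the sequential one and preserves each thread's own order. The sketch below checks exactly that simplified condition by comparing per-thread projections; it is an assumption about the intended relation rather than the formal definition, and the names Histories, Event, project, and isOrderPreservingPermutation are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of explicit histories as lists of (thread, statement) events.
final class Histories {
    static final class Event {
        final int thread;
        final String stmt;

        Event(int thread, String stmt) {
            this.thread = thread;
            this.stmt = stmt;
        }
    }

    /** Projection of h onto thread t: the statements of h executed by t, in order. */
    static List<String> project(List<Event> h, int t) {
        List<String> out = new ArrayList<>();
        for (Event e : h) {
            if (e.thread == t) out.add(e.stmt);
        }
        return out;
    }

    /** True iff h and seq have equal length and identical projections for every thread. */
    static boolean isOrderPreservingPermutation(List<Event> h, List<Event> seq) {
        if (h.size() != seq.size()) return false;
        Set<Integer> threads = new HashSet<>();
        for (Event e : h) threads.add(e.thread);
        for (Event e : seq) threads.add(e.thread);
        for (int t : threads) {
            if (!project(h, t).equals(project(seq, t))) return false;
        }
        return true;
    }
}
```

Under this reading, interleaving the statements of two threads is allowed, while swapping two statements of the same thread is not.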
Example 3.14. Going back to Figure 2(c), we have for some , .
3.5. Correctness of Explicit-Synchronization Monitors
Using the concepts introduced in the previous section, we now formalize what it means for an explicit monitor to correctly implement an implicit one.
Definition 3.15 (State equivalence). Let be a program state of an implicit monitor and that of an explicit monitor . We say that and are equivalent modulo , denoted , iff for all in the domain of , we have
Intuitively, this notion of equivalence between two monitor states ignores any additional synchronization fields and local variables introduced by translating to an explicit-synchronization monitor. Finally, we can define the correctness of an explicit monitor as follows:
Definition 3.16 (Correctness). We say that an explicit monitor correctly implements an implicit monitor , denoted as , iff for all input states s.t. , we have:
The first correctness condition simply states that does not eliminate any feasible behaviors of . The second condition, on the other hand, states that every feasible history of simulates some implicit history that results in the same state. Intuitively, this means that all statement interleavings allowed by provide the illusion that all operations of are executed atomically.
4. Main Algorithm
In this section, we present our main synthesis algorithm. Specifically, Section 4.1 introduces some preliminary definitions and proves an NP-completeness result to justify the reduction to MaxSAT. The subsequent sections then present the high-level algorithm, the static analysis for inferring safe interleavings, and the details of the MaxSAT encoding.
4.1. Fragment Dependency Graphs and NP-Completeness
Our main synthesis algorithm is parametrized over a partitioning of the input monitor into code fragments, where each code fragment defines a unit of computation that we need to assign locks to. In this section, we clarify our assumptions about these code fragments and prove the NP-completeness of the problem for a given choice of partition.
First, to define what we mean by a valid partition, we represent each method of the monitor as a standard control-flow graph (CFG), where each atomic statement belongs to its own basic block. Given a control-flow graph and node , we write to indicate the predecessor nodes of in and to indicate its successors. Then, a valid partition of a method into code fragments is defined as follows: