Concurrent software written for modern computer architectures, though ubiquitous, remains challenging for static program analysis. Although abstract interpretation (Cousot77, ) is a powerful static analysis technique and prior thread-modular methods (Ferrara08, ; Mine11, ; Mine12, ; Mine14, ; KusanoW16, ) mitigated interleaving explosion, none was specifically designed for software running on weakly consistent memory. This is a serious deficiency since weakly consistent memory may exhibit behaviors not permitted by uniprocessors. For example, slow memory accesses may be delayed, increasing performance, but also introducing additional inter-thread non-determinism. Thus, multithreaded software running on such processors may exhibit erroneous behaviors not manifesting on sequentially consistent (SC) memory.
Consider x86-TSO (total store order) as an example. Under TSO, each processor has a store buffer caching memory write operations so they do not block the execution of subsequent instructions (AdveG96, ). Conceptually, each processor has a queue of pending writes to be flushed to memory at a later time. The flush occurs non-deterministically at any time during the program’s execution. This delay between the time a write instruction executes and the time it takes effect may cause the write to appear reordered with subsequent instructions within the same thread. Figure 1 shows an example where the assertion holds under SC but not TSO. Since x and y are initialized to 0 and they are not defined as atomic variables, the write operations (x=1 and y=1) may be stored in buffers, one for each thread, and thus delayed after the read operations.
SPARC-PSO (partial store order) permits even more non-SC behaviors: it uses a separate store buffer for each memory address. That is, x=1 and y=1 within the same thread may be cached in different store buffers and flushed to memory independently. This permits the reordering of a write to x not only with a subsequent read from y, but also with a subsequent write (e.g., to variable y) in the same thread. The situation is similar under SPARC-RMO (relaxed-memory order). We detail how such relaxation leads to errors in Section 2.
Broadly speaking, existing thread-modular abstract interpreters fall into two categories, neither modeling weak-memory related behaviors. The first are SC-specific (KusanoW16, ; FarzanK12, ; FarzanK13, ): they are designed to be flow-sensitive in terms of modeling thread interactions but consider only behaviors compatible with the SC memory. The second (Mine11, ; Mine12, ; Mine14, ) are oblivious to memory models (MM-oblivious): they permit all orderings of memory-writes across threads. Therefore, MM-oblivious methods may report spurious errors (bogus alarms) whereas SC-specific methods, although more accurate for SC memory, may miss real errors on weaker memory (bogus proofs). This flaw is not easy to fix using conventional approaches (Mine14, ). For example, maintaining relational invariants at all program points makes the analysis prohibitively expensive. In Section 2, we use examples to illustrate issues related to these techniques.
We propose the first thread-modular abstract interpreter for analyzing concurrent programs under weakly consistent memory. Our method models thread interactions with flow-sensitivity, and is memory-model specific: it models memory operations assuming a processor-level memory model, as shown in Figure 2. In this figure, the boxes with bold text highlight our main contributions.
Our method builds on a unified framework for modeling the memory consistency semantics. Specifically, the feasibility of thread interactions is formulated as a constraint problem via Datalog: it is efficiently solvable in polynomial time, and adaptable to various hardware-level memory models. Additionally, our method handles thread interactions in a flow-sensitive fashion while being thread-modular. Analyzing one thread at a time, as opposed to the entire program, increases efficiency, especially for large programs. However, unlike prior MM-oblivious methods, we do not join all the effects of remote stores before propagating them to a thread, thus preserving accuracy. Overall, our method differs from state-of-the-art techniques, which are either non-thread-modular (KupersteinVY11, ; FarzanK12, ; Meshman2014, ; Dan15, ) or do not specifically target weak memory (Mine11, ; Mine12, ; Mine14, ; KusanoW16, ).
Our method also differs significantly from techniques designed for bug hunting as opposed to obtaining correctness proofs. For example, in concurrency testing, stateless dynamic model checking (Godefr97, ; Flanag05, ) was extended from SC to weaker memory models (NorrisD13, ; ZhangKW15, ; AbdullaAAJLS15, ; DemskyL15, ; Huang16a, ; AbdullaAJL16, ; OuD17, ). In bounded model checking, Alglave et al. (Alglave13T, ) modeled weak memory through code transformation or direct symbolic encoding (Alglave13P, ; Alglave14, ). However, these methods cannot be used to verify properties: if they do not find bugs, it does not mean the program is correct. In contrast, our method, like other abstract interpreters, is geared toward obtaining correctness proofs.
We implemented our new method in a tool named FruitTree, using Clang/LLVM (AdveLBSG03, ) as the C front-end, Apron (Jeannet09, ) for abstract domains, and the μZ (Hoder11, ) Datalog engine in Z3 (DeMoura08, ). We evaluated FruitTree on 209 litmus tests, and 52 larger multithreaded programs totaling 61,981 lines of C code. Reachability properties were expressed as embedded assertions. Our results show that FruitTree is significantly more accurate than state-of-the-art techniques with moderate runtime overhead.
Specifically, we compared FruitTree against the MM-oblivious analyzer of Miné (Mine14, ), the SC-specific thread-modular analyzer Watts (KusanoW16, ), and a non-thread-modular analyzer named Duet (FarzanK12, ; FarzanK13, ). On the litmus tests, FruitTree is more accurate than the other three methods. On the larger benchmarks, including Linux device drivers, FruitTree proved 4,577 properties, compared to 1,752 proved by Miné’s method.
To summarize, we make the following contributions:
We propose a memory-model aware static analysis method based on thread-modular abstract interpretation.
We introduce a declarative analysis framework for deducing the feasibility of thread-interferences on weak memory.
We implement and evaluate our method on a set of benchmarks to demonstrate its high accuracy and moderate runtime overhead.
The remainder of this paper is organized as follows. First, we motivate our technique via examples in Section 2. Then, we provide background on memory models and abstract interpretation in Section 3. We present our new declarative analysis for checking the feasibility of thread interferences in Section 4, followed by the main algorithm for thread-modular abstract interpretation in Section 5. We present our experimental results in Section 6, review related work in Section 7, and conclude in Section 8.
Consider the program in Figure 3. The assertion holds under SC, TSO, PSO, and RMO. But, removing the fence causes it to fail under PSO and RMO. In this section, we show why MM-oblivious methods may generate bogus errors, why SC-specific ones may generate bogus proofs, and how our new method fixes both issues.
2.1. Behaviors under SC, TSO, PSO, and RMO
First, note that the assertion in Figure 3 holds under SC since each thread executes its instructions in program order, i.e., x = 5 takes effect before y = 10. So, thread two observing y to be 10 implies x must have been set to 5.
Next, we explain why the assertion holds under TSO (AdveG96, ). TSO permits the delay of a store after a subsequent load to a disjoint memory address (as in Figure 1). This program-order relaxation is a performance optimization, e.g., buffering slow stores to speed up subsequent loads. However, since all stores in a thread go into the same buffer, TSO does not allow the reordering of two stores (thread 1, Figure 3). Thus, even without the fence, x = 5 always takes effect before y = 10, meaning the assertion holds.
Next, we show why removing the fence causes the assertion to fail under PSO and RMO. Both permit store–store reordering by allowing each processor to have a separate store buffer for each memory address. Thus x and y are in separate buffers. Since buffers are flushed to memory independently, with the fence removed, y = 10 may take effect before x = 5, as if the two instructions were reordered in this thread. Consequently, the second thread may read 10 from y before 5 is written to x in global memory, causing the assertion to fail.
The fence is important because it forces all stores issued before the fence to be visible to all loads issued after the fence, i.e., x = 5 takes effect before y = 10, even under PSO and RMO. Thus, the assertion holds again.
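The store-buffer behavior described above can be explored with a small operational model. The following is a minimal sketch, not the analysis proposed in this paper: it assumes the program in Figure 3 has the shape "x = 5; fence; y = 10" in thread 1 and two loads (of y, then of x) in thread 2, with the assertion "y == 10 implies x == 5". The function names `explore`, `mp`, and `assertion_can_fail` are hypothetical. The model covers SC, TSO (one FIFO buffer per thread), and PSO (one FIFO buffer per thread and address); it does not model the load reordering of RMO.

```python
def explore(threads, model):
    """Enumerate final register valuations under "SC", "TSO", or "PSO"."""
    start = (tuple(0 for _ in threads), frozenset(),
             frozenset({("x", 0), ("y", 0)}), frozenset())
    stack, seen, outcomes = [start], set(), set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, bufs_f, mem_f, regs_f = state
        bufs, mem, regs = dict(bufs_f), dict(mem_f), dict(regs_f)
        if all(pc == len(prog) for pc, prog in zip(pcs, threads)):
            outcomes.add(regs_f)  # registers no longer change after this point
        # Flush: the oldest entry of any store buffer may reach memory at any time.
        for key, queue in bufs.items():
            (var, val), rest = queue[0], queue[1:]
            new_bufs = {k: q for k, q in bufs.items() if k != key}
            if rest:
                new_bufs[key] = rest
            new_mem = {**mem, var: val}
            stack.append((pcs, frozenset(new_bufs.items()),
                          frozenset(new_mem.items()), regs_f))
        # Execute the next instruction of any thread.
        for tid, prog in enumerate(threads):
            pc = pcs[tid]
            if pc == len(prog):
                continue
            op = prog[pc]
            next_pcs = pcs[:tid] + (pc + 1,) + pcs[tid + 1:]
            own = [k for k in bufs if k[0] == tid]  # this thread's non-empty buffers
            if op[0] == "store":
                _, var, val = op
                if model == "SC":  # no buffering: the store hits memory at once
                    new_mem = {**mem, var: val}
                    stack.append((next_pcs, bufs_f,
                                  frozenset(new_mem.items()), regs_f))
                else:  # TSO: one buffer per thread; PSO: one per (thread, address)
                    key = (tid, var if model == "PSO" else None)
                    new_bufs = {**bufs, key: bufs.get(key, ()) + ((var, val),)}
                    stack.append((next_pcs, frozenset(new_bufs.items()),
                                  mem_f, regs_f))
            elif op[0] == "load":
                _, reg, var = op
                # Buffer forwarding: prefer the newest pending own store to var.
                pending = [val for k in own for (v, val) in bufs[k] if v == var]
                val = pending[-1] if pending else mem[var]
                new_regs = {**regs, reg: val}
                stack.append((next_pcs, bufs_f, mem_f,
                              frozenset(new_regs.items())))
            elif op[0] == "fence" and not own:
                # A fence may only execute once the thread's buffers are drained.
                stack.append((next_pcs, bufs_f, mem_f, regs_f))
    return {tuple(sorted(o)) for o in outcomes}


def mp(with_fence):
    """Assumed shape of the Figure 3 program (message passing)."""
    writer = [("store", "x", 5)] + ([("fence",)] if with_fence else []) \
        + [("store", "y", 10)]
    reader = [("load", "a", "y"), ("load", "b", "x")]
    return [writer, reader]


def assertion_can_fail(model, with_fence):
    return any(dict(o)["a"] == 10 and dict(o)["b"] != 5
               for o in explore(mp(with_fence), model))
```

Under these assumptions, `assertion_can_fail("PSO", False)` finds the failing execution, while the TSO run without the fence and the PSO run with the fence do not, matching the discussion in this subsection.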
2.2. Ineffectiveness of Existing Methods
MM-oblivious methods (Mine11, ; Mine12, ; Mine14, ) report bogus alarms because they were not designed for weak memory, and they ignore the causality of inter-thread data flows. Thus, they tend to drastically over-approximate the interferences between threads.
|Method|Figure 1, with fences|Figure 1, without fences|Figure 3, with fences|Figure 3, without fences|
|MM-oblivious (e.g., (Mine11, ; Mine12, ; Mine14, ))|alarm, alarm, alarm|alarm, alarm, alarm|alarm, alarm, alarm|alarm, alarm, alarm|
|SC-specific (e.g., (KusanoW16, ; FarzanK12, ; FarzanK13, ))|proof, proof, proof|proof, proof, proof|proof, proof, proof|proof, proof, proof|
For example, an MM-oblivious static analysis may work as follows. First, it analyzes each thread as if it were a sequential program. Then, it joins the effects of all stores on global memory—known as the thread interferences. Next, it individually analyzes each thread again, this time in the presence of the thread interferences computed from the previous iteration: when a thread performs a memory read, the value may come from any one of these thread interferences. This iterative process repeats until a fixed point is reached.
Next, we demonstrate how the MM-oblivious analyzer works on Figure 3. Consider the thread interferences to be a map from variables to abstract values in the interval domain (Cousot77, ). Thread 1 generates the interferences $x \mapsto [5, 5]$ and $y \mapsto [10, 10]$. Within thread 2, the load of y may read from local memory, $[0, 0]$, or the interference $[10, 10]$. Thus, $y = [0, 0] \sqcup [10, 10] = [0, 10]$, where $\sqcup$ is the join operator in the interval domain. Similarly, the load of x may read from local memory, $[0, 0]$, or the interference $[5, 5]$, i.e., $x = [0, 0] \sqcup [5, 5] = [0, 5]$. Thus, the assertion is incorrectly reported as violated.
While our previous example used the non-relational interval domain, the bogus alarm remains when using a relational abstract domain: the propagation of interferences in MM-oblivious methods is inherently non-relational. Interferences map each variable to a single value, causing all relations to be forgotten. Conventional approaches cannot easily fix this since maintaining relational invariants at all global program points is prohibitively expensive.
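The eager join in this walk-through can be reproduced in a few lines. This is a minimal sketch under the assumption that the assertion in Figure 3 has the form "y == 10 implies x == 5"; the helper names `join` and `may_violate` are hypothetical and intervals are modeled as (low, high) pairs.

```python
def join(a, b):
    """Interval join: the smallest interval containing both arguments."""
    return (min(a[0], b[0]), max(a[1], b[1]))


# Thread 2's local view (initial values) and the interferences from thread 1.
local = {"x": (0, 0), "y": (0, 0)}
interference = {"x": (5, 5), "y": (10, 10)}

# MM-oblivious propagation: join local and remote values per variable.
seen_by_thread2 = {v: join(local[v], interference[v]) for v in local}


def may_violate(env):
    """Assumed assertion: y == 10 implies x == 5."""
    y_may_be_10 = env["y"][0] <= 10 <= env["y"][1]
    x_may_differ = env["x"] != (5, 5)
    return y_may_be_10 and x_may_differ


alarm = may_violate(seen_by_thread2)
```

Because the two joins are computed independently, the abstract state admits x = 0 together with y = 10, so `alarm` is raised even when the program is correct.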
In contrast, prior SC-specific methods (KusanoW16, ; FarzanK12, ; FarzanK13, ) do not report bogus alarms: they assume x = 5 takes effect before y = 10. This leads to more accurate analysis results for SC, but is unsound under weak memory, e.g., they miss the assertion failure in Figure 3 under PSO or RMO when the fence is removed.
Figure 4 summarizes the ineffectiveness of prior techniques on the programs in Figures 1 and 3 with and without fences. Note that in Figure 1 the fence instruction may be added between the write and read instructions of both threads. The table in Figure 4 shows how prior MM-oblivious methods report bogus alarms, prior SC-specific methods report bogus proofs, while our new method eliminates both.
2.3. How Our Method Handles Memory Models
Some prior techniques lead to bogus alarms because they over-approximate thread interferences, i.e., they allow a load to read from any remote store regardless of whether such a data flow, or combination of flows, is feasible, while others lead to missed bugs because they under-approximate thread interferences, i.e., they do not allow any non-SC data flow. Consider Figure 3: the load of x may read 0 or 5, and the load of y may read 0 or 10, but the combination of x reading 0 and y reading 10 is infeasible. Realizing this, our method checks the feasibility of interference combinations under weak-memory semantics before propagating them.
Toward this end, we propose two new techniques. The first is the flow-sensitive propagation of thread interferences. Instead of eagerly joining all interfering stores, we handle each combination separately. The second is a declarative modeling of the memory consistency semantics general enough to capture SC, TSO, PSO, and RMO (AdveG96, ; Weaver94, ; Sites92, ). Together, these techniques prune infeasible combinations of thread interferences such as x and y reading 0 and 10, respectively, in Figure 3.
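Our new method analyzes thread 2 in Figure 3 by considering four different interference combinations separately: the load of x reads either the initial value 0 or the value 5 stored by thread 1, and the load of y reads either the initial value 0 or the value 10 stored by thread 1.
We gain accuracy in two ways. First, we remove spurious values caused by an eager join (e.g., the load of x no longer yields the interval $[0, 5]$). Second, we query a lightweight constraint system to quickly deduce infeasibility of an interference combination on demand. The three combinations other than x reading 0 with y reading 10 are all feasible but they do not cause assertion failures.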
Our check for infeasibility of an interference combination is implemented using Datalog (Horn clauses within finite domains), solvable in polynomial time. We provide details of this constraint system in Section 4. For now, consider the combination in Figure 3 where y reads 10 and x reads 0: it is infeasible (unless we assume the program runs under PSO or RMO with the fence removed). We deduce infeasibility as follows:
y = 10 has executed (it is being read from),
thus x = 5 has executed (due to the program-order requirement on SC and TSO, and the fence on PSO and RMO),
so the load of x must not read from its initial value 0.
This deduction leads to a formal proof that this combination cannot exist in any concrete execution. Since the other three combinations do not violate the assertion, and this one is proved to be infeasible, the property is verified.
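The case split over interference combinations can be sketched as follows. This is an illustration only: it assumes the assertion "y == 10 implies x == 5", hard-codes the three-step deduction above as the hypothetical predicate `infeasible`, and uses a boolean flag `stores_ordered` to stand for "x = 5 must take effect before y = 10" (true on SC and TSO, and on PSO and RMO with the fence).

```python
from itertools import product

# Possible sources for each load in thread 2: the initial value or thread 1's store.
X_SOURCES = {"init_x": 0, "x=5": 5}
Y_SOURCES = {"init_y": 0, "y=10": 10}


def infeasible(x_src, y_src, stores_ordered):
    """Mirror of the deduction in the text: if y = 10 was read and x = 5 must
    precede y = 10, then the load of x cannot see the initial value."""
    return stores_ordered and y_src == "y=10" and x_src == "init_x"


def verify(stores_ordered):
    """Check the assumed assertion on every feasible combination."""
    for x_src, y_src in product(X_SOURCES, Y_SOURCES):
        if infeasible(x_src, y_src, stores_ordered):
            continue  # pruned before the (expensive) thread-level analysis
        x, y = X_SOURCES[x_src], Y_SOURCES[y_src]
        if y == 10 and x != 5:
            return False  # a feasible combination violates the assertion
    return True
```

With `stores_ordered=True` only three combinations survive and the property is verified; with `stores_ordered=False` (PSO or RMO without the fence) the fourth combination remains feasible and the alarm is genuine.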
In this section, we review weak memory models at the processor level (as opposed to the programming language level) and static program analysis based on abstract interpretation.
3.1. Concurrent Programs
We are concerned with a program consisting of a finite set of threads. Each thread accesses a set of local variables. All threads access a set of global variables via load and store instructions. A thread creates a child thread with ThreadCreate, and waits for it to terminate with ThreadJoin.
We represent a program using a set of flow graphs. Each flow graph $G = \langle N, n_0, \delta \rangle$, where $n_0 \in N$, is a thread: $N$ is the set of program locations of the thread, $n_0$ is the entry point, and $\delta \subseteq N \times N$ is the transition relation. That is, $(n, n') \in \delta$ iff there exists an edge from $n$ to $n'$.
Columns 2 to 9: which program-order relaxation is allowed? Column 10: write-atomicity relaxation (read own write early).
|Memory Model|R($v_1$)R($v_1$)|R($v_1$)W($v_1$)|W($v_1$)R($v_1$)|W($v_1$)W($v_1$)|R($v_1$)R($v_2$)|R($v_1$)W($v_2$)|W($v_1$)R($v_2$)|W($v_1$)W($v_2$)|Read own write early|
|SC (Lamport79, )|no|no|no|no|no|no|no|no|no|
|TSO (Weaver94, ; Sewell10, )|no|no|no*|no|no|no|yes|no|yes|
|PSO (Weaver94, )|no|no|no*|no|no|no|yes|yes|yes|
|RMO (Weaver94, ; Sites92, )|no|no|no*|no|yes|yes|yes|yes|yes|
Each program location is associated with an atomic instruction that may be a load, store, or fence. Non-atomic statements such as y = x+1, where both x and y are global variables, can be transformed to a sequence of atomic instructions, e.g., the load a = x followed by the store y = a+1, where a is a local variable in both cases. When accessing variables on the global memory, threads may use a special fence instruction to impose a strict program order between memory operations issued before and after the fence.
3.2. Memory Consistency Models
The simplest memory model is sequential consistency (SC) (Lamport79, ). SC corresponds to a system running on a single coherent memory time-shared by operations executed from different threads. There are two important characteristics of SC: the program-order requirement and the write-atomicity requirement. The program-order requirement says that the processor must ensure that instructions within a thread take effect in the order they appear in the program. The write-atomicity requirement says that the processor must maintain the illusion of a single sequential order among operations from all threads. That is, the effect of any store operation must take effect and become visible either to all threads or to none of the threads.
SC is an ideal memory model: in real CPUs, the hardware-level memory models are often weaker than SC, and can be characterized by their corresponding relaxations of the program-order and write-atomicity requirements as shown in Figure 5. Here, R($v_1$)W($v_2$) is a read of $v_1$ followed by a write of $v_2$ in the same thread.
Specifically, TSO allows x=1;a=y to be reordered as a=y;x=1, according to W($v_1$)R($v_2$) in Column 8, where $v_1$ is x and $v_2$ is y. PSO further allows x=1;y=2 to be reordered as y=2;x=1, according to W($v_1$)W($v_2$) in Column 9. As shown in Section 2, these program-order relaxations, conceptually, are the effect of store buffering, which delays the stores past subsequent stores/loads within a thread. Neither TSO nor PSO permits the delay of a load. Weaker still is RMO, which permits the relaxations of R($v_1$)R($v_2$) and R($v_1$)W($v_2$), as shown in Columns 6 and 7 of the table in Figure 5.
By relaxing the write-atomicity requirement, all three weaker memory models allow a thread to read its own write early. That is, the thread can read a value it has written before the value reaches the global memory and hence becomes visible to other threads.
3.3. Abstract Interpretation
Abstract interpretation is a popular technique for conducting static program analysis (Cousot77, ). In this context, a numerical abstract domain defines, for every location $n$ of the program, a memory environment $s_n$. It is a map from each program variable to its abstract value. (For ease of presentation we assume a variable maps to a single value; our analysis can trivially use relational domains.) Consider intervals, which map each variable to a region defined by lower and upper bounds. For a program with two integer variables x and y where both may have any value initially, the memory environment associated with the entry point is $s_0 = \{x \mapsto \top, y \mapsto \top\}$, where $\top = [-\infty, +\infty]$. After executing x = 1, the memory environment becomes $s_1 = \{x \mapsto [1, 1], y \mapsto \top\}$.
The process of computing $s_1$ based on $s_0$ is represented by the transfer function of x = 1. Additionally, the join of two intervals is defined as $[l_1, u_1] \sqcup [l_2, u_2] = [\min(l_1, l_2), \max(u_1, u_2)]$. The partial-order relation is defined as $[l_1, u_1] \sqsubseteq [l_2, u_2]$ if and only if $l_2 \le l_1$ and $u_1 \le u_2$. For example, $[1, 1] \sqcup [3, 3] = [1, 3]$ and $[1, 1] \sqsubseteq [1, 3]$.
We use $S$ to denote the set of all memory environments. $S$ is a lattice with properly defined top ($\top$) and bottom ($\bot$) elements, join ($\sqcup$), partial-order ($\sqsubseteq$), and a widening operator ($\nabla$) (Cousot77, ). Each node $n$ has a transfer function $f_n : S \to S$, taking an environment as input (before executing the atomic operation in $n$) and returning a new environment as output.
Let $F$ be a map from each node to its transfer function, i.e., $F(n) = f_n$. For example, given a node $n$ whose operation is x = a+1, if the input environment maps a to $[1, 2]$, the new environment maps x to $[2, 3]$ and leaves a unchanged.
The goal of an abstract interpreter is to compute an environment map $M$ over-approximating the memory state at every program location. Typically, $M$ initially maps all variables in the entry node to $\top$, and all variables in other nodes to $\bot$. Then, the interpreter iteratively applies the transfer function $f_n$ and joins the resulting environments for all nodes $n$, until they reach a fixed point.
Without getting into more details (refer to the literature (Cousot77, )), we define the sequential analyzer as a fixed-point computation with respect to the function AnalyzeSeq:
$$\mathit{AnalyzeSeq}(M)(n) = f_n(e), \quad \text{where } e = \bigsqcup_{(p, n) \in \delta} M(p)$$
Here, $\delta$ is the transition relation of the flow graph and $f_n$ is the transfer function of node $n$; $M(p)$ is the environment produced by a predecessor node $p$ of $n$, and $e$ is the join of these environments. $f_n(e)$ is the new memory environment produced by executing the operation in $n$. Applying this function to all nodes of a sequential program until a fixed point leads to an over-approximated memory state for each program location.
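The fixed-point computation described above can be sketched with a simple round-robin iteration over the interval domain. This is an illustrative sketch, not the implementation used in the paper: the function `analyze_seq`, the node names, and the example flow graph are hypothetical, `None` stands for the bottom environment, and no widening is applied, so the loop is only guaranteed to terminate on loop-free graphs like the one shown.

```python
import math

TOP = (-math.inf, math.inf)  # interval carrying no information


def join_iv(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))


def join_env(e1, e2):
    # None plays the role of the bottom environment (location not reached yet).
    if e1 is None:
        return e2
    if e2 is None:
        return e1
    return {v: join_iv(e1[v], e2[v]) for v in e1}


def analyze_seq(nodes, edges, entry, transfer, variables):
    """Map each node to the environment holding after its operation."""
    env = {n: None for n in nodes}
    env[entry] = transfer[entry]({v: TOP for v in variables})
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == entry:
                continue
            inp = None  # join of the environments of all predecessors
            for p, q in edges:
                if q == n:
                    inp = join_env(inp, env[p])
            out = None if inp is None else transfer[n](inp)
            new = join_env(env[n], out)
            if new != env[n]:
                env[n], changed = new, True
    return env


# Hypothetical diamond-shaped flow graph: n1 and n2 are two branches merging at n3.
nodes = ["n0", "n1", "n2", "n3"]
edges = [("n0", "n1"), ("n0", "n2"), ("n1", "n3"), ("n2", "n3")]
transfer = {
    "n0": lambda e: dict(e),                                     # entry: no-op
    "n1": lambda e: {**e, "x": (1, 1)},                          # x = 1
    "n2": lambda e: {**e, "x": (3, 3)},                          # x = 3
    "n3": lambda e: {**e, "y": (e["x"][0] + 1, e["x"][1] + 1)},  # y = x + 1
}
result = analyze_seq(nodes, edges, "n0", transfer, ["x", "y"])
```

At the merge node the two branch environments are joined before the transfer function of `n3` is applied, which is where the over-approximation enters.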
However, directly applying the sequential analyzer to each execution of a multithreaded program is not practical because it leads to an exponential complexity. Instead, thread-modular techniques (Mine11, ; Mine12, ; Mine14, ; KusanoW16, ) iteratively apply AnalyzeSeq to each thread, as if it were a sequential program, and then merge/propagate the global memory effects across threads. The iterative process continues until memory environments in all threads stabilize.
Since each thread is analyzed in isolation, this approach is more scalable than non-thread-modular techniques. However, it may result in accuracy loss because the analyzer for each thread relies on a coarse-grained abstraction of interferences from other threads. When analyzing a thread $t$ in the presence of a set of threads $T$, for example, the interferences are the effects of global memory stores from all $t' \in T$ with $t' \neq t$. The interferences are a map $I : V \to \mathcal{P}(S)$ from each variable read by thread $t$ to the set of memory environments produced by interfering stores, where $V$ is the set of all program variables, $S$ is the set of all memory environments, and $\mathcal{P}(S)$ is the power set of $S$.
Prior thread-modular techniques (Mine11, ; Mine12, ; Mine14, ) eagerly join all interfering memory states from the other threads in $T$ before propagating them to the current thread $t$. As such, they often introduce bogus store-to-load data flows into the static analysis or miss valid store-to-load data flows. In the remainder of this paper, we present our method for mitigating this problem.
4. Deciding Interference Feasibility
In this section, we describe our new method for quickly deciding the feasibility of a combination of store-to-load data-flows under a given memory model. An interference combination is a set $\rho = \{(l_1, s_1), \ldots, (l_k, s_k)\}$ where each pair $(l_i, s_i)$ consists of a load $l_i$ and an interfering store $s_i$.
Checking the feasibility of $\rho$ is formulated as a deductive analysis with three inputs: (1) the flow graph of the current thread, (2) the flow graphs of all interfering threads, and (3) the existing set of store-to-load data flows represented by $\rho$. The output of this deductive analysis is the relation MustNotReadFrom. MustNotReadFrom($l$, $s$) means the load $l$ must not read from the store $s$ since our analysis proved the data flow from $s$ to $l$ is infeasible given the input ReadsFrom relation in $\rho$.
Consider the program in Figure 3 as an example. One thread interference combination we want to check is the load of y from y=10 and the load of x from the initial value 0. Let these load and store instructions be denoted $l_y$, $s_y$, $l_x$, and $s_x$, respectively. Then, the feasibility problem is stated as follows: given ReadsFrom($l_y$, $s_y$), check if MustNotReadFrom($l_x$, $s_x$) holds.
Before presenting the details of our feasibility checking procedure, we define a set of unary and binary relations over instructions and program variables. Specifically, IsLoad($l$, $v$) denotes that $l$ is a load of variable $v$, and IsStore($s$, $v$) denotes that $s$ is a store to variable $v$. We use IsLoad($l$, \_) and IsStore($s$, \_) if we do not care about the variable. Similarly, we use IsFence($f$) to denote that $f$ is a fence. We also use IsLLMembar, IsLSMembar, IsSLMembar, IsSSMembar to denote load–load, load–store, store–load, and store–store memory barriers as defined in the SPARC architecture (Weaver94, ); for example, a load–store membar prevents loads before the barrier from being reordered with subsequent stores.
We define binary relations over instructions $a$ and $b$: the first four relations (Dominates, NotReachableFrom, ThreadCreates, ThreadJoins) are determined by the program’s flow graphs. Based on them, we deduce the MHB relation, which must be satisfied by all program executions. The ReadsFrom relation comes from the given $\rho$, from which we deduce the MustNotReadFrom relation.
Dominates($a$, $b$) means that $a$ dominates $b$ in the control flow graph of a thread.
NotReachableFrom($a$, $b$) means that $a$ cannot be reached from $b$ in the control flow graph of a thread.
ThreadCreates($a$, $b$) means $a$ is the thread creation and $b$ is the first operation of the child thread.
ThreadJoins($a$, $b$) means $a$ is the thread join and $b$ is the last operation of the child thread.
MHB($a$, $b$) means that $a$ must happen before $b$ in all executions of the program.
ReadsFrom($l$, $s$) means that $l$ is a load that reads the value written by the store $s$.
MustNotReadFrom($l$, $s$) means that $l$ must not read from the value written by $s$.
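Consider Figure 3 again, where we want to check the following: if the load of y in the second thread reads from y=10, is it possible for the load of x to read from the initial value 0? In this case, we encode the assumption as ReadsFrom($l_y$, $s_y$), where $l_y$ is the load of y and $s_y$ is the store y=10. Next, we deduce the MustNotReadFrom relation. Finally, we check if MustNotReadFrom($l_x$, $s_x$) holds, where $l_x$ is the load of x and $s_x$ is the store of the initial value 0.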
4.2. Relaxing the Program-Order Requirement
To model the program order imposed by different memory models, we define a new relation NoReorder such that NoReorder($a$, $b$) holds if the reordering of $a$ and $b$ within the same thread is not allowed.
We define the rules for NoReorder based on the allowed program-order relaxations for different memory models (Figure 5).
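In the rules below, $a$ precedes $b$ in the same thread, IsLoad($a$, $v$) denotes a load of variable $v$, IsStore($a$, $v$) denotes a store to $v$, and \_ stands for any variable.
For SC, NoReorder is defined as:
$$\mathit{NoReorder}(a, b) \leftarrow (\mathit{IsLoad}(a, \_) \lor \mathit{IsStore}(a, \_)) \land (\mathit{IsLoad}(b, \_) \lor \mathit{IsStore}(b, \_))$$
That is, no reordering is ever allowed under SC (row SC, Figure 5).
For TSO, NoReorder is defined as:
$$\mathit{NoReorder}(a, b) \leftarrow \mathit{IsLoad}(a, \_) \land (\mathit{IsLoad}(b, \_) \lor \mathit{IsStore}(b, \_))$$
$$\mathit{NoReorder}(a, b) \leftarrow (\mathit{IsLoad}(a, \_) \lor \mathit{IsStore}(a, \_)) \land \mathit{IsStore}(b, \_)$$
Under TSO, two operations cannot reorder in six of the eight cases. The first rule above disallows Columns 2, 3, 6, and 7 (Figure 5), while the second disallows Columns 3, 5, 7, and 9. Thus, reordering is permitted in two cases: Columns 4 and 8.
Although this is counter-intuitive, note that a store followed by a load of the same variable (Column 4) may be reordered in our analysis under TSO (and PSO and RMO) for soundness: it permits read-own-write-early behaviors. We detail this shortly in Section 4.6.
For PSO, NoReorder is defined as:
$$\mathit{NoReorder}(a, b) \leftarrow \mathit{IsLoad}(a, \_) \land (\mathit{IsLoad}(b, \_) \lor \mathit{IsStore}(b, \_))$$
$$\mathit{NoReorder}(a, b) \leftarrow \mathit{IsStore}(a, v) \land \mathit{IsStore}(b, v)$$
Under PSO, two operations cannot reorder in five of the eight cases. The first rule above disallows Columns 2, 3, 6, and 7, while the second disallows Column 5. Thus, reordering is permitted only in the remaining three cases (Columns 4, 8, and 9).
For RMO, the inference rules are defined as:
$$\mathit{NoReorder}(a, b) \leftarrow \mathit{IsLoad}(a, v) \land (\mathit{IsLoad}(b, v) \lor \mathit{IsStore}(b, v))$$
$$\mathit{NoReorder}(a, b) \leftarrow \mathit{IsStore}(a, v) \land \mathit{IsStore}(b, v)$$
Similarly, the above inference rules can be directly translated from Columns 2, 3, and 5 of the table in Figure 5.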
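The column counts quoted above can be checked mechanically. The sketch below encodes NoReorder as a hypothetical Python predicate `no_reorder`, derived from which columns of Figure 5 the text says are disallowed for each memory model; the representation of an instruction as a (kind, variable) pair is an assumption made for illustration.

```python
from itertools import product


def no_reorder(model, first, second):
    """True if `first` may not be reordered with the later `second`.
    Each instruction is a (kind, variable) pair, kind in {"load", "store"}."""
    (k1, v1), (k2, v2) = first, second
    if model == "SC":
        return True
    if model == "TSO":
        # A load may not be delayed, and nothing may be delayed past a store.
        return k1 == "load" or k2 == "store"
    if model == "PSO":
        # A load may not be delayed; stores to the same address stay ordered.
        return k1 == "load" or (k2 == "store" and v1 == v2)
    if model == "RMO":
        # Only same-address pairs stay ordered, except store-then-load,
        # which is left unordered to permit read-own-write-early.
        return v1 == v2 and not (k1 == "store" and k2 == "load")
    raise ValueError(f"unknown memory model: {model}")


def disallowed_cases(model):
    """Count how many of the eight columns (2 to 9) forbid reordering."""
    count = 0
    for k1, k2, same_var in product(("load", "store"), ("load", "store"),
                                    (True, False)):
        second_var = "x" if same_var else "y"
        count += no_reorder(model, (k1, "x"), (k2, second_var))
    return count
```

Counting the disallowed cases per model gives eight for SC, six for TSO, five for PSO, and three for RMO, in line with the discussion of the rules.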
4.3. Handling Fences and Memory Barriers
Next, we present the ordering constraints imposed by fences and memory barriers. We consider four variants of the membar instruction, which prevents loads and/or stores before the membar from being reordered with subsequent loads and/or stores (Weaver94, ).
We also model fences in terms of membars since they prevent loads and stores from being reordered with subsequent loads and stores as well.
In addition to fences explicitly added to the program, there are fences implicitly added to thread routines such as lock/unlock and signal/wait. For example, in the code snippet x = 1; lock(lk); a = y; unlock(lk), there is a fence inside lock(lk), to ensure x = 1 always takes effect before a = y. This is how most modern programming systems guarantee data-race-freedom (AdveB10, ) to application-level code (i.e., programs without data races have only SC behaviors). Thus, we model every call to a POSIX thread routine as a fence.
4.4. Rules for Deducing MustNotReadFrom
We divide our inference rules into two groups. The first group (Figure 7) uses the relations ThreadCreates, ThreadJoins, Dominates, and NoReorder to generate the must-happen-before (MHB) relation.
Rule (1) states that if the instruction $a$ creates a thread with entry instruction $b$, then $a$ must happen before $b$. Similarly, if instruction $a$ joins a thread with exit instruction $b$, then $b$ must happen before $a$.
Rule (2) states that if $a$ dominates $b$ within a thread’s CFG, and $a$ is not reachable from $b$ (i.e., no loop encompasses both $a$ and $b$), then, if permitted by the memory model, $a$ must happen before $b$. Figure 8 exemplifies this rule: the loop in the left CFG is outside the Dominates edge, thus MHB($a$, $b$) is deduced. The loop in the right CFG encompasses the Dominates edge, thus MHB($a$, $b$) is not deduced.
Rule (3) states that the MHB relation is transitive: if $a$ must happen before $b$, and $b$ must happen before $c$, then $a$ must happen before $c$. Correctness follows from the definition of MHB.
Rule (4) states that if a load $l$ reads from the value written by the store $s_1$, then $l$ must happen before any second store $s_2$ to the same variable that takes effect after $s_1$. This is intuitive because, if $s_2$ takes effect before $l$ (but after the first store $s_1$), then $l$ can no longer read from $s_1$. Figure 9 exemplifies this rule. Its correctness is obvious.
The second group of inference rules (Figure 7) takes the relations MHB and ReadsFrom and generates the MustNotReadFrom relation. Recall that if a load-store pair $(l, s)$ is in MustNotReadFrom, the value stored by $s$ can never flow to $l$. Thus, MustNotReadFrom may be used to eliminate infeasible data flows.
Rule (5) states that if a load $l$ must happen before a store $s$, then $l$ cannot read from $s$. This follows from the definition of MHB. Note that a store “happens” when it propagates to main memory.
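To make the deduction concrete, the sketch below evaluates a few of these rules with a naive fixed-point loop in place of a Datalog engine. It is not the paper's rule set: Rules (3), (4), and (5) follow the descriptions above, while two further rules are assumptions introduced here to reproduce the deduction of Section 2.3, namely that a store read by a load from another thread happens before that load, and that a store overwritten before a load cannot be read by it. The function `deduce`, the instruction names, and the base MHB facts for the assumed Figure 3 program are hypothetical.

```python
def deduce(loads, stores, mhb_base, reads_from):
    """loads/stores map instruction names to variables; mhb_base and reads_from
    are sets of pairs. Returns the deduced MustNotReadFrom pairs (load, store)."""
    # Assumption: a remote store that is read from has taken effect before the load.
    mhb = set(mhb_base) | {(s, l) for (l, s) in reads_from}
    changed = True
    while changed:
        new = set()
        # Rule (3): MHB is transitive.
        for a, b in mhb:
            for c, d in mhb:
                if b == c:
                    new.add((a, d))
        # Rule (4): a load reading from s1 happens before any later
        # store s2 to the same variable.
        for l, s1 in reads_from:
            for a, s2 in mhb:
                if a == s1 and s2 in stores and stores[s2] == stores[s1]:
                    new.add((l, s2))
        changed = not new <= mhb
        mhb |= new
    must_not = set()
    # Rule (5): a load that happens before a store cannot read from it.
    for a, b in mhb:
        if a in loads and b in stores and loads[a] == stores[b]:
            must_not.add((a, b))
    # Assumed rule: s1 is overwritten by s2 before the load l executes.
    for s1, s2 in mhb:
        if s1 in stores and s2 in stores and stores[s1] == stores[s2]:
            for l, var in loads.items():
                if var == stores[s1] and (s2, l) in mhb:
                    must_not.add((l, s1))
    return must_not


# Assumed encoding of Figure 3: initial stores, x = 5 and y = 10 in thread 1,
# and the loads of y and x (in that order) in thread 2.
loads = {"l_y": "y", "l_x": "x"}
stores = {"init_x": "x", "init_y": "y", "s_x": "x", "s_y": "y"}
base = {("init_x", "s_x"), ("init_y", "s_y"), ("l_y", "l_x")}
reads_from = {("l_y", "s_y")}

with_order = deduce(loads, stores, base | {("s_x", "s_y")}, reads_from)
without_order = deduce(loads, stores, base, reads_from)
```

When the order between x = 5 and y = 10 is part of the base facts, the pair ("l_x", "init_x") is deduced and the combination is pruned; when that fact is dropped, as for PSO or RMO without the fence, nothing rules the combination out.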
4.5. Soundness and Incompleteness
When deciding the feasibility of an interference combination our analysis is designed to be sound but incomplete. By sound we mean it permits all possible program behaviors allowed by a memory model. Therefore, if it says a certain interference combination is infeasible it must be infeasible. However, there is no guarantee every infeasible interference combination will be found.
Incompleteness is expected: the intent is a quick pruning of infeasible combinations before the computationally expensive thread-level analysis. The overhead of insisting on completeness would outweigh its benefit: the feasibility checking problem, in the worst case, is as hard as program verification itself, which is undecidable.
Now, we formally state the soundness of our deductive procedure. First, our deduction of the NoReorder relation relaxing the program order requirement, from Figure 5, is sound.
Theorem 4.1.
Let $a$ and $b$ be two instructions in the same thread. If our rules deduce NoReorder($a$, $b$), then the reordering of $a$ and $b$ is not allowed by the corresponding memory model.
Next, we note that, given the ReadsFrom relation, the deduction of the MustNotReadFrom relation is also sound.
Theorem 4.2.
Let $l$ and $s$ be two instructions. If our rules deduce MustNotReadFrom($l$, $s$), then $l$ cannot read from $s$.
The proof of this theorem is straightforward: it amounts to proofs of Rules (1)–(6). During the previous presentation, we have argued why each rule is correct. More formal proofs can be obtained via proof-by-contradiction, which is straightforward. We omit the details for brevity.
4.6. Relaxing the Write-Atomicity Requirement
Our method soundly models buffer forwarding, which corresponds to the write-atomicity requirement (Column 10 Figure 5). This allows a thread to read its own write before the written value is flushed to the memory, thus becoming visible to other threads. This is modeled in both the thread-level analyzer (AnalyzeSeq) and the deduction rules.
AnalyzeSeq captures the relaxation for free. During this analysis each thread is treated as a sequential program: all loads read their values from the preceding writes within the same thread.
The deduction rules for NoReorder (Section 4.2) always permit the reordering of a store with a subsequent load of the same variable (Column 4, Figure 5). That is, if $a$ is a store to $v$ and $b$ is a load of $v$, we do not deduce NoReorder($a$, $b$) due to buffer forwarding (even though it is counter-intuitive). Within a thread $t$, it may appear to be the case that the store and load are reordered from the perspective of all threads $t' \neq t$. Forbidding this reordering would be equivalent to forcing a full flush of the store-buffers before every load, thus prohibiting any thread from reading its own store earlier than other threads.
Figure 11 exemplifies the requirement of this relaxation. First, the assertion may be violated under TSO. An error trace is: x = 1; a = 1; b = 0; y = 1; flush y; c = 0; flush x. To permit this trace, we must allow the following interference combination: the load into a reads from x = 1, the load into b reads from the initial value 0, and the load into c reads from the initial value 0. This combination is feasible only when we avoid enforcing the program order between the store x = 1 and the subsequent load of x into a. Specifically, the statements in thread 2 follow program order because of the fence. In thread 1, the two loads (into a and b) are ordered since they are added to NoReorder under TSO. But the store x = 1 and the load of x into a are not added to NoReorder, thus preventing the assertion from being (incorrectly) verified.
5. The Thread-Modular Analysis
Next, we present the integration of our interference analysis (Section 4) with a thread-modular analyzer. The thread-modular analyzer itself is standard, whose full details may be found in several prior works including (Mine12, ; Mine14, ) and (KusanoW16, ). Thus, our presentation of the analyzer itself will be terse. Instead, we shall focus on our main contribution, which is adding the capability of deducing infeasible interference combinations for weak-memory models: our method is sound for not only SC but also TSO, PSO, and RMO. Prior techniques were either MM-oblivious, or sound only for SC.
Given a load of , the interferences on , within the thread-modular analysis, are the environments after all stores to from other threads. The function takes a graph as input and returns the nodes of the graph. The interferences on the loads in a thread is the least fixed point of the function .
We use as shorthand for , where is a partially-applied function, and use as the initial map from loads to interfering-environments, i.e., one mapping all nodes to . lfp computes the least fixed point. depends on the existence of , a map from each program location, in all threads, to an environment. We show shortly that the computation of and the interferences is done in a nested fixed point.
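One possible reading of the interference computation is sketched below. The helper names (`lfp`, `interfs_step`), the node labels, and the tuple encoding of environments are assumptions made for illustration; the map `env_after` plays the role of the map from program locations to environments, and is held fixed here even though the full analysis recomputes it in the outer fixed point.

```python
def lfp(f, bottom):
    """Kleene iteration: apply f starting from bottom until nothing changes."""
    current = bottom
    while True:
        nxt = f(current)
        if nxt == current:
            return current
        current = nxt


def interfs_step(thread, loads, stores, env_after):
    """Build the function whose least fixed point gives the interferences on
    the loads of `thread`.

    loads, stores: node -> (owning thread, variable)
    env_after:     node -> environment after that node (any hashable value)
    """
    def step(current):
        result = {}
        for load, (load_thread, var) in loads.items():
            if load_thread != thread:
                continue
            # Environments after stores to the same variable in other threads.
            remote = frozenset(
                env_after[store]
                for store, (store_thread, store_var) in stores.items()
                if store_thread != thread and store_var == var
                and store in env_after
            )
            result[load] = current.get(load, frozenset()) | remote
        return result
    return step


loads = {"l1": ("T1", "x")}
stores = {"s1": ("T2", "x"), "s2": ("T1", "x"), "s3": ("T2", "y")}
env_after = {"s1": (("x", 1),), "s2": (("x", 5),), "s3": (("y", 7),)}

interfs = lfp(interfs_step("T1", loads, stores, env_after), {})
print(interfs)
```

Only the environment after `s1` interferes with `l1`: `s2` belongs to the same thread and `s3` writes a different variable.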
We refer to an interference combination as a map from each load to the memory environment after the store instruction from which the load reads. This differs slightly from the definition in Section 4, where it is defined as a set of load-store pairs. The two can be easily converted since the analysis keeps track of the environments associated with each store. Given the set of interferences from Interfs, the set of all interference combinations consists of every way of selecting a single interfering environment for each load. The iterative thread-modular analysis considers each interference combination separately, thus increasing accuracy.
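The conversion between the two representations can be sketched as follows. This is a minimal illustration, assuming each environment is tagged with the store that produced it; `to_pairs`, `to_env_map`, and the string labels are hypothetical names, not part of the tool.

```python
def to_pairs(combo, store_of):
    """Map form (load -> environment) to pair form {(load, store)}."""
    return frozenset((load, store_of[env]) for load, env in combo.items())


def to_env_map(pairs, env_after):
    """Pair form {(load, store)} back to map form (load -> environment)."""
    return {load: env_after[store] for load, store in pairs}


# Assumed bookkeeping: the environment recorded after each store.
env_after = {"s1": "env_after_s1", "s2": "env_after_s2"}
store_of = {env: store for store, env in env_after.items()}

combo = {"l1": "env_after_s1", "l2": "env_after_s2"}
pairs = to_pairs(combo, store_of)
print(sorted(pairs))
print(to_env_map(pairs, env_after) == combo)
```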
The thread analyzer adapts the sequential analyzer (AnalyzeSeq, Section 3) to use interference-combinations. takes a thread and an interference combination and computes the input environment for some node in by joining the environment after the predecessors of with ’s environment in , denoted . Then, is passed to ’s transfer function to update .
AnalyzeTM is the least fixed point of . returns the transition-relation of a graph. is the initial memory map mapping the entry nodes of each thread to and all others to . Given a set of threads and a set of interference-combinations , applying AnalyzeTM to each and each computes the analysis over all threads.
What remains is to show how the thread analyzer and the calculation of interferences can be done simultaneously since they are dependent: the interference computation depends on the analysis result, , and the analysis result depends on the set of interferences, . The solution is a nested fixed point: the outer computation produces , and the inner computation produces . The process iterates until (and thus ) reach a fixed point.
Analyze operates as follows: first, it takes , the current analysis results over all threads, and computes the interferences, , wrt the thread under test, . The function FilterFeasible integrates the thread-level analyzer with the feasibility analysis of Section 4. It expands the interferences into a set of interference combinations , and filters any infeasible combination.
Specifically, given the interferences on a thread, , FilterFeasible creates all combinations of pairing each load to a single interfering environment, e.g., }. Then, it maps each environment in to the associated store generating the environment, e.g., . Each set of pairs of load and store statements in is then passed to the deduction analysis of Section 4. If it is infeasible, it is discarded, otherwise it is added to the set returned by FilterFeasible.
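The expand-and-filter step can be sketched as below. The function `filter_feasible` and the predicate `toy_check` are hypothetical; in particular, `toy_check` is only a stand-in for the deduction analysis of Section 4, hard-coding one combination that we assume to be infeasible (the earlier load `l1` reading from the later store `s2` while the later load `l2` reads from the earlier store `s1`).

```python
from itertools import product


def filter_feasible(interfs, store_of, is_feasible):
    """Expand interferences into combinations and keep the feasible ones.

    interfs:     load -> set of interfering environments
    store_of:    environment -> store that generated it
    is_feasible: predicate over a set of (load, store) pairs
    """
    loads = sorted(interfs)
    feasible = []
    # One environment per load: the Cartesian product of the candidate sets.
    for envs in product(*(sorted(interfs[load]) for load in loads)):
        pairs = frozenset((load, store_of[env]) for load, env in zip(loads, envs))
        if is_feasible(pairs):
            feasible.append(dict(zip(loads, envs)))
    return feasible


interfs = {"l1": {"env_s1", "env_s2"}, "l2": {"env_s1", "env_s2"}}
store_of = {"env_s1": "s1", "env_s2": "s2"}


def toy_check(pairs):
    # Stand-in for the feasibility analysis: reject one fixed combination.
    return pairs != frozenset({("l1", "s2"), ("l2", "s1")})


combos = filter_feasible(interfs, store_of, toy_check)
print(len(combos))
```

Of the four combinations, three survive, so the thread analyzer would be invoked three times instead of four.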
is the set of the results of applying for each . Specifically, takes a function and a set , and returns a set containing the application of on each element of .
takes the join of memory environments on matching nodes across a set of maps to join them into a single map. Similarly, joins the results of applying Analyze to the set of threads. AnalyzeAll computes the fixed point of starting with the initial map .
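The node-wise join of analysis results can be illustrated with a simple interval domain. The sketch below is an assumption-laden illustration: it uses `None` as the bottom element, represents an environment as a dictionary from variables to `(low, high)` intervals, and assumes all environments range over the same variables; the actual tool relies on Apron's abstract domains instead.

```python
BOTTOM = None  # assumed encoding of the unreachable environment


def join_interval(a, b):
    """Smallest interval containing both a and b."""
    if a is BOTTOM:
        return b
    if b is BOTTOM:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))


def join_env(e1, e2):
    """Variable-wise join of two environments."""
    if e1 is BOTTOM:
        return e2
    if e2 is BOTTOM:
        return e1
    return {v: join_interval(e1.get(v), e2.get(v)) for v in e1.keys() | e2.keys()}


def join_maps(maps):
    """Join environments on matching nodes across several maps."""
    result = {}
    for m in maps:
        for node, env in m.items():
            result[node] = join_env(result.get(node, BOTTOM), env)
    return result


m1 = {"n1": {"x": (0, 0)}, "n2": BOTTOM}
m2 = {"n1": {"x": (1, 3)}, "n2": {"x": (2, 2)}}
merged = join_maps([m1, m2])
print(merged)
```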
The following is a high-level example. Initially, each thread is analyzed in the presence of resulting in the set of interferences, , being empty (all stores map to ). The results of analyzing each thread are merged into a new map . Each thread is then analyzed using , resulting in the sets and to be (potentially) non-empty, causing AnalyzeTM to be called once per-combination. Within a thread, the results of AnalyzeTM are joined, then, across threads, the results of Analyze are joined, creating . The procedure repeats, thus growing the size of , , and until .
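The iteration described above can be mimicked on a deliberately small model. The sketch below is not the implementation in the tool: it assumes straight-line threads, a value-set domain in place of Apron's numerical domains, one interfering candidate per remotely stored value, and no feasibility filtering or memory-model reasoning. All names (`analyze_thread`, `analyze_all`, the statement encoding) are hypothetical.

```python
from itertools import product


def eval_src(src, env):
    """Value set of an integer constant or of a variable/register in env."""
    return frozenset({src}) if isinstance(src, int) else env[src]


def analyze_thread(stmts, combo, variables):
    """Analyze one thread under one interference combination.

    combo maps a load's index to the value set of the single remote store it
    may read from (a missing or empty entry means no interference).
    """
    env = {v: frozenset({0}) for v in variables}  # shared variables start at 0
    written = {}                                  # var -> values this thread stores
    for i, (op, dst, src) in enumerate(stmts):
        if op == "load":
            # Join the thread-local value with the interfering value.
            env[dst] = env[src] | combo.get(i, frozenset())
        else:  # "store"
            env[dst] = eval_src(src, env)
            written[dst] = written.get(dst, frozenset()) | env[dst]
    return env, written


def analyze_all(threads, variables):
    """Outer fixed point: recompute interferences until the stores stabilize."""
    stores = {t: {} for t in threads}
    while True:
        new_stores, final = {}, {}
        for t, stmts in threads.items():
            loads = [i for i, s in enumerate(stmts) if s[0] == "load"]
            choices = []
            for i in loads:
                var = stmts[i][2]
                remote = [frozenset({v}) for u in threads if u != t
                          for v in stores[u].get(var, ())]
                choices.append([frozenset()] + remote)
            env_join, written_join = {}, {}
            # One sequential analysis per interference combination, then join.
            for picked in product(*choices):
                env, written = analyze_thread(stmts, dict(zip(loads, picked)), variables)
                for k, vs in env.items():
                    env_join[k] = env_join.get(k, frozenset()) | vs
                for k, vs in written.items():
                    written_join[k] = written_join.get(k, frozenset()) | vs
            final[t], new_stores[t] = env_join, written_join
        if new_stores == stores:
            return final
        stores = new_stores


threads = {
    "T1": [("store", "x", 1)],
    "T2": [("load", "r", "x"), ("store", "y", "r")],
    "T3": [("load", "s", "y")],
}
final = analyze_all(threads, ["x", "y"])
print(sorted(final["T3"]["s"]))
```

In the first round no interference is known, so every register holds only the initial value. The value 1 then propagates from `T1` to `T2` in the second round and from `T2` to `T3` in the third, after which the set of stored values no longer changes.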
We handle loops the same way as in prior techniques (e.g., (KusanoW16, )). Given a load within a loop, the previously described analysis can generate an infinite number of interference combinations for that load, e.g., when it is within an infinite loop. Loops are unrolled when possible; when not, we join all the feasible interfering memory environments into a single value. An interfering environment is infeasible for a load if the store generating the environment must happen after the load; otherwise, it is feasible. This is sound for verifying assertions embedded in a concurrent program (KusanoW16, ).
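The fallback for loads inside loops can be sketched as follows, again over a value-set domain chosen for brevity. The function name `join_feasible` and the explicit `must_happen_after` relation are assumptions for illustration; in the analysis itself that relation would come from the deduction rules.

```python
def join_feasible(load, candidates, must_happen_after):
    """Join every interfering value set whose store may precede `load`.

    candidates:        store -> value set in the environment after that store
    must_happen_after: set of (store, load) pairs where the store is known to
                       happen after the load
    """
    joined = frozenset()
    for store, values in candidates.items():
        if (store, load) in must_happen_after:
            continue  # infeasible: the load cannot read from a later store
        joined |= values
    return joined


candidates = {"s1": frozenset({1}), "s2": frozenset({2}), "s3": frozenset({3})}
must_happen_after = {("s3", "l_loop")}
summary = join_feasible("l_loop", candidates, must_happen_after)
print(sorted(summary))
```

The load is then analyzed once against the single joined value instead of once per combination.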
We implemented our weak-memory-aware abstract interpreter in a tool named FruitTree, building upon open-source platforms such as LLVM (AdveLBSG03, ), Apron (Jeannet09, ), and μZ (Hoder11, ). Specifically, we use LLVM to translate C programs into LLVM bitcode, on which we perform static analysis. We use the Apron library to manipulate abstract domains in the thread analyzer. We use the μZ fixed-point engine in Z3 (DeMoura08, ) to solve the Datalog constraints that encode the feasibility of interference combinations.
We implemented the state-of-the-art MM-oblivious abstract interpretation method of Miné (Mine14, ) and the SC-specific method, Watts (KusanoW16, ), on the same platform to facilitate experimental evaluation. We also compared against a previously implemented version of Duet (FarzanK12, ; FarzanK13, ). Although Duet may be unsound and Watts is unsound under weak memory, we include their results because both are closely related to our new technique.
All methods implemented in FruitTree use the clustering and property-directed optimizations (KusanoW16, ), where clustering considers interferences only within sets of loads, similar to the packing of relational domains, and property-direction filters interference combinations unrelated to properties under test. These optimizations reduce the number of interference combinations, which is crucial since that number grows exponentially with the program’s size.
We evaluated FruitTree on a large set of programs written using POSIX threads. These benchmarks fall into two categories. The first consists of 209 litmus tests exposing non-SC behaviors under various processor-level memory models (Alglave13T, ). The second consists of 52 larger applications (svcomp15, ; LinuxISR, ; FarzanK12, ), including several Linux device drivers. The benchmarks total 61,981 lines of code. The properties under verification are assertions embedded in the program’s source code: a property is valid if and only if the assertion holds over all executions under a given memory model.
Our experiments were designed to answer the following research questions: (1) Is our new method more effective than prior techniques in obtaining correctness proofs on relaxed memory? (2) Is our new method more accurate than prior techniques in detecting potential violations on relaxed memory? (3) Is our new method reasonably efficient when used as a static program analysis technique? We conducted all experiments on a Linux computer with 8 GB RAM, and a 2.60 GHz CPU.
6.1. Litmus Test Results
First, we present the litmus test results. Since these programs are small in terms of code size, all methods under evaluation (Miné, Watts, Duet, and FruitTree) finished quickly. Thus, our focus is not on comparing the runtime performance but comparing the accuracy of their results. Specifically, we compare our method to these state-of-the-art techniques in terms of the number of true proofs, bogus proofs, true alarms, and bogus alarms.
Here, a bogus alarm is a valid property which cannot be proved. A bogus proof is a property which may be violated yet is unsoundly and incorrectly proved. The litmus tests are particularly useful not only because they cover corner cases, but also because we know a priori if a property holds or not.
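The four outcome categories follow mechanically from the ground truth and the tool's verdict. The sketch below restates the definitions; the function `classify` and the sample verdicts are hypothetical and do not reproduce any benchmark data.

```python
from collections import Counter


def classify(property_holds, tool_proved):
    """Name the outcome for one property.

    property_holds: ground truth under the memory model of interest
    tool_proved:    True if the tool reported a proof, False if an alarm
    """
    if tool_proved:
        return "true proof" if property_holds else "bogus proof"
    return "bogus alarm" if property_holds else "true alarm"


# Made-up (ground truth, verdict) pairs, one per category.
verdicts = [(True, True), (True, False), (False, False), (False, True)]
tally = Counter(classify(holds, proved) for holds, proved in verdicts)
print(dict(tally))
```

A sound analyzer never produces an outcome in the "bogus proof" category, while a precise one keeps the "bogus alarm" count low.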
Table 1. Litmus test results under TSO.
| Method | True Alarm | Bogus Alarm | True Proof | Bogus Proof | Time (s) |
| Miné (Mine14, ) | 77 | 207 | 8 | 0 | 12.9 |
| Duet (FarzanK12, ) | 77 | 181 | 34* | 0 | 473.1 |
| Watts (KusanoW16, ) | 63 | 13 | 0 | 216 | 71.0 |
Table 1 summarizes the litmus test results under TSO. The first column shows the name of each method, and the next four show the number of true alarms, bogus alarms, true proofs, and bogus proofs generated by each method, respectively. Since Watts (KusanoW16, ) was designed to be SC-specific, it ignores non-SC behaviors, meaning its proofs are unsound under weaker memory (marked by ). The last column is the total analysis time over all tests.
Overall, the results show the prior thread-modular technique of Miné admits many infeasible executions thus leading to 207 bogus alarms. Duet reported 181 bogus alarms. In contrast, our method (FruitTree) reported only 72 bogus alarms, together with 143 true proofs. Therefore, it is more accurate than these prior techniques.
Although Watts reported only 13 bogus alarms, it is unsound for TSO: it only considers SC behaviors and cannot be trusted. Furthermore, the soundness of Duet under TSO or any other non-SC memory model was not clear (since Duet was only designed for SC). Thus, in the result table, its 34 proofs are marked with *.
Table 2. Litmus test results under PSO.
| Method | True Alarm | Bogus Alarm | True Proof | Bogus Proof | Time (s) |
| Miné (Mine14, ) | 81 | 203 | 8 | 0 | 12.9 |
| Duet (FarzanK12, ) | 81 | 177 | 34* | 0 | 473.1 |
| Watts (KusanoW16, ) | 64 | 12 | 0 | 216 | 71.0 |
Table 2 summarizes the results under PSO. Again, Watts may be unsound for weak memory. The same litmus programs were used under PSO as in TSO but the properties changed, i.e., whether an alarm is true or bogus. Note that Miné only verified 8 properties, Duet verified 34, whereas our method verified 143.
Table 3. Litmus test results under RMO.
| Method | True Alarm | Bogus Alarm | True Proof | Bogus Proof | Time (s) |
| Miné (Mine14, ) | 28 | 67 | 8 | 0 | 4.9 |
| Duet (FarzanK12, ) | 11 | 58 | 34 | 0 | 187.8 |
| Watts (KusanoW16, ) | 0 | 0 | 75 | 28 | 33.9 |
Table 3 summarizes the results under RMO. Under RMO, a different set of litmus programs was used since the instruction set for processors using RMO differs from TSO and PSO. Nevertheless, we observed similar results: FruitTree obtained significantly more true proofs and fewer bogus alarms than the other methods.
In general, our method was more accurate than prior techniques. However, since the analysis is over-approximated, it does not eliminate all bogus alarms. Currently, most bogus alarms reported by FruitTree require reasoning across more than two threads, e.g., the correctness of a property may require reasoning that thread reading from thread implies in thread . Since our method is thread-modular—threads are analyzed individually by abstracting all other threads into a set of interferences—it cannot capture ordering constraints involving more than two threads. In principle, this limitation can be lifted by extending our interference feasibility analysis: we leave this as future work.
6.2. Results on Larger Applications
Next, we present our results on the larger benchmark programs. Since execution time is no longer negligible, we compare both the run time and the accuracy across methods. However, since the programs are larger (60K lines of code) and there are far too many properties to manually inspect each case, we do not report the number of bogus alarms and bogus proofs, for lack of ground truth. Instead, we compare the total number of proofs reported by each method to show that our method is more accurate even though all methods are approximate.
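Table 4. Results on the larger applications under TSO.
| Name | Miné (Mine14, ) Time | Miné Verified | Duet (FarzanK12, ) Time | Duet Verified | Watts (KusanoW16, ) Time | Watts Verified | FruitTree Time | FruitTree Verified |
| Total | 415 s | 1752 | 106 s | 2432* | 9830 s | 4583 | 5387 s | 4577 |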
Table 4 shows our results under TSO, where and * mark the unsoundly verified properties. Since the results for PSO and RMO are similar to Table 4, we omit them for brevity. Column 1 of this table shows the name of the benchmark program. Columns 2–3, 4–5, 6–7, and 8–9 show the run time and number of properties verified by Miné, Duet, Watts, and FruitTree, respectively.
Again, while the proofs reported by FruitTree and Miné’s method are sound, the proofs reported by Watts are not, and the soundness of Duet on weak memory is unclear.
Overall, FruitTree proved 4,577 properties compared to only 1,712 proved by Miné, that is, 2.7x as many properties as the prior state of the art. Additionally, though Duet may be unsound, it proved only 2,432 properties. The definitely-unsound Watts “proved” 4,583 properties, possibly including bogus proofs.
In terms of run time, FruitTree took 5,387 seconds, which is less than Watts (9,830 seconds) but more than Duet and Miné. However, the additional time is well justified by the significant increase in the number of proofs. Furthermore, the runtime performance, roughly one property proved per second (4,577 properties in 5,387 seconds), remains competitive for a static analysis technique.
To summarize, our new method has modest runtime overhead compared to prior techniques, but vastly improved accuracy in terms of the analysis results, and is provably sound in handling not only SC but also three other processor-level memory models.
7. Related Work
We reviewed prior work on thread-modular abstract interpretation, which is either MM-oblivious (Mine11, ; Mine12, ; Mine14, ) or SC-specific (KusanoW16, ) with respect to processor memory models. There are also techniques (FarzanK12, ; FarzanK13, ; RoychoudhuryM02, ; HuynhR06, ) that are not thread-modular.
There are code-transformation techniques (KupersteinVY11, ; Meshman2014, ; Dan15, ) that transform a non-SC program into an SC program and then apply abstract interpretation. They generally follow the sequentialization approach pioneered by Lal and Reps (LalR09, ), with a focus on code transformation as opposed to abstract interpretation. To ensure termination, they make various assumptions to bound the program’s behavior. Furthermore, they are not thread-modular, and often do not directly handle C code. Instead, they admit only models of concurrent programs written in artificial languages; because of this, we were not able to perform a direct experimental comparison.
In the context of bounded model checking, Alglave et al. proposed several methods for concurrent software on relaxed memory. They are based on either sequentializing concurrent programs (Alglave13T, ) or encoding weak memory semantics using SAT/SMT solvers (Alglave13P, ; Alglave14, ). Alglave et al. also developed techniques for modeling and testing weak-memory semantics of real processors (AlglaveMT14, ), and characterized the memory models of some GPUs (AlglaveBDGKPSW15, ). However, these techniques are primarily for detecting buggy behaviors as opposed to proving that such behaviors do not exist.
In the context of systematic testing, often based on stateless model checking (Godefr97, ; Flanag05, ; WangSG11, ) or predictive analysis (WangKGG09, ; SaidWYS11, ; SinhaMWG11, ; SinhaMWG11hvc, ; WangG11, ; HuangMR14, ), a number of methods have been proposed to handle weak memory such as TSO/PSO (ZhangKW15, ; AbdullaAAJLS15, ; DemskyL15, ; Huang016, ), PowerPC (AbdullaAJL16, ), and C++11 (NorrisD13, ; OuD17, ). However, since they rely on concretely executing the program, and require the user to provide test inputs, they can only be used to detect bugs. That is, since testing does not cover all program behaviors, if no bug is detected, these methods cannot obtain a correctness proof. In contrast, our method is based on abstract interpretation, which covers all possible program behaviors and therefore is geared toward obtaining correctness proofs.
Thread-modular analysis was also used in model checking (FlanaganFQ02, ; FlanaganQ03, ), where it was combined with predicate abstraction (HenzingerJMQ03, ) to help mitigate state explosion and thus increase the scalability. However, model checking is significantly different from abstract interpretation in that each thread must be first abstracted into a finite-state model. Thread-modular analysis was also used to conduct shape analysis (GotsmanBCS07, ) and prove thread termination (CookPR07, ). Hoenicke et al. (HoenickeMP17, ) introduced a hierarchy of proof systems that leverage thread modularity in compositional verification on SC memory.
Similar to the interference analysis in Watts (KusanoW16, ), we check the feasibility of thread interactions using Datalog. Datalog-based declarative program analysis was a framework introduced by Whaley and Lam (WhaleyL04, ). Previously, it has been used to implement points-to (LamWLMACU05, ; BravenboerS09, ), dependency (GuoKWYG15, ; SungKSW16, ) and change-impact analyses (GuoKW16, ), uncover security bugs (LivshitsL05, ) and detect data races (NaikAW06, ).
In abstract interpretation of sequential programs, Miné (Mine06, ) proposed a technique for abstracting the global memory into a set of byte-level cells to support a variety of casts and union types. Ferrara et al. (Ferrara14, ; Ferrara15, ) integrated heap abstraction and numerical abstraction during static analysis, where the heap is represented as disjunctions of points-to constraints based on values. Jeannet and Serwe (Jeannet04, ) also proposed a method for abstracting the data and control portions of a call-stack for analyzing sequential programs with potentially infinite recursion. Subsequently, Jeannet (Jeannet12, ) extended the work to handle concurrent programs as well. However, none of these methods was designed specifically for handling weak memory models.
We have presented a thread-modular static analysis method for concurrent programs under weak memory models, building upon a lightweight constraint system for quickly identifying the infeasibility of thread interference combinations, so they are skipped during the expensive abstract-interpretation based analysis. The constraint system is also general enough to handle a range of processor-level memory models. We have implemented the method and conducted experiments on a large number of benchmark programs. We showed the new method significantly outperformed three state-of-the-art techniques in terms of accuracy while maintaining only a moderate runtime overhead.
This material is based upon research supported in part by the U.S. National Science Foundation under grants CNS-1405697 and CCF-1722710 and the U.S. Office of Naval Research under award number N00014-13-1-0527.
-  Parosh Aziz Abdulla, Stavros Aronis, Mohamed Faouzi Atig, Bengt Jonsson, Carl Leonardsson, and Konstantinos F. Sagonas. Stateless model checking for TSO and PSO. In International Conference on Tools and Algorithms for Construction and Analysis of Systems, pages 353–367, 2015.
-  Parosh Aziz Abdulla, Mohamed Faouzi Atig, Bengt Jonsson, and Carl Leonardsson. Stateless model checking for POWER. In International Conference on Computer Aided Verification, pages 134–156, 2016.
-  Sarita V. Adve and Hans-Juergen Boehm. Memory models: a case for rethinking parallel languages and hardware. Commun. ACM, 53(8):90–101, 2010.
-  Sarita V. Adve and Kourosh Gharachorloo. Shared memory consistency models: A tutorial. Computer, 29(12):66–76, 1996.
-  Vikram Adve, Chris Lattner, Michael Brukman, Anand Shukla, and Brian Gaeke. LLVM: A low-level virtual instruction set architecture. In ACM/IEEE international symposium on Microarchitecture, Dec 2003.
-  Jade Alglave, Mark Batty, Alastair F. Donaldson, Ganesh Gopalakrishnan, Jeroen Ketema, Daniel Poetzl, Tyler Sorensen, and John Wickerson. GPU concurrency: Weak behaviours and programming assumptions. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 577–591, 2015.
-  Jade Alglave, Daniel Kroening, Vincent Nimal, and Daniel Poetzl. Don’t sit on the fence. In International Conference on Computer Aided Verification, pages 508–524, 2014.
-  Jade Alglave, Daniel Kroening, Vincent Nimal, and Michael Tautschnig. Software verification for weak memory via program transformation. In European Symposium on Programming, pages 512–532, 2013.
-  Jade Alglave, Daniel Kroening, and Michael Tautschnig. Partial orders for efficient bounded model checking of concurrent software. In International Conference on Computer Aided Verification, pages 141–157, 2013.
-  Jade Alglave, Luc Maranget, and Michael Tautschnig. Herding cats: modelling, simulation, testing, and data-mining for weak memory. In ACM SIGPLAN Conference on Programming Language Design and Implementation, page 7, 2014.
-  Martin Bravenboer and Yannis Smaragdakis. Strictly declarative specification of sophisticated points-to analyses. In ACM SIGPLAN Conference on Object Oriented Programming, Systems, Languages, and Applications, pages 243–262, 2009.
-  Byron Cook, Andreas Podelski, and Andrey Rybalchenko. Proving thread termination. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 320–330, 2007.
-  P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 238–252, 1977.
-  Andrei Dan, Yuri Meshman, Martin Vechev, and Eran Yahav. Effective abstractions for verification under relaxed memory models. In International Conference on Verification, Model Checking, and Abstract Interpretation, pages 449–466, 2015.
-  Leonardo De Moura and Nikolaj Bjørner. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340, 2008.
-  Brian Demsky and Patrick Lam. SATCheck: SAT-directed stateless model checking for SC and TSO. In ACM SIGPLAN Conference on Object Oriented Programming, Systems, Languages, and Applications, pages 20–36, 2015.
-  Azadeh Farzan and Zachary Kincaid. Verification of parameterized concurrent programs by modular reasoning about data and control. In ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 297–308, 2012.
-  Azadeh Farzan and Zachary Kincaid. Duet: Static analysis for unbounded parallelism. In International Conference on Computer Aided Verification, pages 191–196, 2013.
-  Pietro Ferrara. Static analysis via abstract interpretation of the happens-before memory model. In International Conference on Tests and Proofs, pages 116–133, 2008.
-  Pietro Ferrara. Generic combination of heap and value analyses in abstract interpretation. In International Conference on Verification, Model Checking, and Abstract Interpretation, pages 302–321, 2014.
-  Pietro Ferrara, Peter Müller, and Milos Novacek. Automatic inference of heap properties exploiting value domains. In International Conference on Verification, Model Checking, and Abstract Interpretation, pages 393–411, 2015.
-  C. Flanagan and P. Godefroid. Dynamic partial-order reduction for model checking software. In ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 110–121, 2005.
-  Cormac Flanagan, Stephen N. Freund, and Shaz Qadeer. Thread-modular verification for shared-memory programs. In European Symposium on Programming, pages 262–277, 2002.
-  Cormac Flanagan and Shaz Qadeer. Thread-modular model checking. In International SPIN Workshop on Model Checking Software, pages 213–224, 2003.
-  Patrice Godefroid. VeriSoft: A tool for the automatic analysis of concurrent reactive software. In International Conference on Computer Aided Verification, pages 476–479, 1997.
-  Alexey Gotsman, Josh Berdine, Byron Cook, and Mooly Sagiv. Thread-modular shape analysis. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 266–277, 2007.
-  Shengjian Guo, Markus Kusano, and Chao Wang. Conc-iSE: Incremental symbolic execution of concurrent software. In IEEE/ACM International Conference On Automated Software Engineering, 2016.
-  Shengjian Guo, Markus Kusano, Chao Wang, Zijiang Yang, and Aarti Gupta. Assertion guided symbolic execution of multithreaded programs. In ACM SIGSOFT Symposium on Foundations of Software Engineering, pages 854–865, 2015.
-  Thomas A. Henzinger, Ranjit Jhala, Rupak Majumdar, and Shaz Qadeer. Thread-modular abstraction refinement. In International Conference on Computer Aided Verification, pages 262–274, 2003.
-  Krystof Hoder, Nikolaj Bjørner, and Leonardo de Moura. muZ - an efficient engine for fixed points with constraints. In International Conference on Computer Aided Verification, pages 457–462, 2011.
-  Jochen Hoenicke, Rupak Majumdar, and Andreas Podelski. Thread modularity at many levels: A pearl in compositional verification. In ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 473–485, 2017.
-  Alan Huang. Maximally stateless model checking for concurrent bugs under relaxed memory models. In International Conference on Software Engineering, pages 686–688, 2016.
-  Jeff Huang, Patrick O’Neil Meredith, and Grigore Rosu. Maximal sound predictive race detection with control flow abstraction. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 337–348, 2014.
-  Shiyou Huang and Jeff Huang. Maximal causality reduction for TSO and PSO. In ACM SIGPLAN Conference on Object Oriented Programming, Systems, Languages, and Applications, pages 447–461, 2016.
-  Thuan Quang Huynh and Abhik Roychoudhury. A memory model sensitive checker for C#. In International Symposium on Formal Methods, pages 476–491, 2006.
-  Bertrand Jeannet. Relational interprocedural verification of concurrent programs. Software & Systems Modeling, 12(2):285–306, 2012.
-  Bertrand Jeannet and Antoine Miné. Apron: A library of numerical abstract domains for static analysis. In Ahmed Bouajjani and Oded Maler, editors, International Conference on Computer Aided Verification, pages 661–667, 2009.
-  Bertrand Jeannet and Wendelin Serwe. Abstracting call-stacks for interprocedural verification of imperative programs. In International Conference on Algebraic Methodology and Software Technology, pages 258–273, 2004.
-  Michael Kuperstein, Martin T. Vechev, and Eran Yahav. Partial-coherence abstractions for relaxed memory models. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 187–198, 2011.
-  Markus Kusano and Chao Wang. Flow-sensitive composition of thread-modular abstract interpretation. In ACM SIGSOFT Symposium on Foundations of Software Engineering, 2016.
-  Akash Lal and Thomas W. Reps. Reducing concurrent analysis under a context bound to sequential analysis. Formal Methods in System Design, 35(1):73–97, 2009.
-  Monica S. Lam, John Whaley, V. Benjamin Livshits, Michael C. Martin, Dzintars Avots, Michael Carbin, and Christopher Unkel. Context-sensitive program analysis as database queries. In ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 1–12, 2005.
-  Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, 28(9), 1979.
-  V. Benjamin Livshits and Monica S. Lam. Finding security vulnerabilities in Java applications with static analysis. In USENIX Security Symposium, 2005.
-  Yuri Meshman, Andrei Dan, Martin Vechev, and Eran Yahav. Synthesis of memory fences via refinement propagation. In International Symposium on Static Analysis, pages 237–252, 2014.
-  Antoine Miné. Field-sensitive value analysis of embedded C programs with union types and pointer arithmetics. In ACM SIGPLAN/SIGBED Conference on Language, Compilers, and Tool Support for Embedded Systems, pages 54–63, 2006.
-  Antoine Miné. Static analysis of run-time errors in embedded critical parallel C programs. In Programming Languages and Systems, pages 398–418, 2011.
-  Antoine Miné. Static analysis by abstract interpretation of sequential and multi-thread programs. In Proc. of the 10th School of Modelling and Verifying Parallel Processes, pages 35–48, 2012.
-  Antoine Miné. Relational thread-modular static value analysis by abstract interpretation. In International Conference on Verification, Model Checking, and Abstract Interpretation, pages 39–58, 2014.
-  Mayur Naik, Alex Aiken, and John Whaley. Effective static race detection for Java. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 308–319, 2006.
-  Brian Norris and Brian Demsky. CDSchecker: checking concurrent data structures written with C/C++ atomics. In ACM SIGPLAN Conference on Object Oriented Programming, Systems, Languages, and Applications, pages 131–150, 2013.
-  Peizhao Ou and Brian Demsky. Checking concurrent data structures under the C/C++11 memory model. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 45–59, 2017.
-  Abhik Roychoudhury and Tulika Mitra. Specifying multithreaded Java semantics for program verification. In International Conference on Software Engineering, pages 489–499, 2002.
-  Mahmoud Said, Chao Wang, Zijiang Yang, and Karem Sakallah. Generating data race witnesses by an SMT-based analysis. In NASA Formal Methods, pages 313–327, 2011.
-  Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. X86-TSO: A rigorous and usable programmer’s model for x86 multiprocessors. Commun. ACM, 53(7):89–97, July 2010.
-  Arnab Sinha, Sharad Malik, Chao Wang, and Aarti Gupta. Predicting serializability violations: SMT-based search vs. DPOR-based search. In Haifa Verification Conference, pages 95–114, 2011.
-  Arnab Sinha, Sharad Malik, Chao Wang, and Aarti Gupta. Predictive analysis for detecting serializability violations through trace segmentation. In International Conference on Formal Methods and Models for Co-Design, pages 99–108, 2011.
-  Richard Sites. Alpha Architecture Reference Manual. Digital Press, 1992.
-  Chungha Sung, Markus Kusano, Nishant Sinha, and Chao Wang. Static DOM event dependency analysis for testing web applications. In ACM SIGSOFT Symposium on Foundations of Software Engineering, 2016.
-  SVCOMP. International competition on software verification. http://sv-comp.sosy-lab.org/2015/benchmarks.php, Accessed: 2015-05-06.
-  TLDP. Interrupt handlers: Linux kernel module programming guide. http://www.tldp.org/LDP/lkmpg/2.6/html/x1256.html, Accessed: 2015-05-06.
-  Chao Wang and Malay Ganai. Predicting concurrency failures in generalized traces of x86 executables. In International Conference on Runtime Verification, pages 4–18, September 2011.
-  Chao Wang, Sudipta Kundu, Malay Ganai, and Aarti Gupta. Symbolic predictive analysis for concurrent programs. In International Symposium on Formal Methods, pages 256–272, 2009.
-  Chao Wang, Mahmoud Said, and Aarti Gupta. Coverage guided systematic concurrency testing. In International Conference on Software Engineering, pages 221–230, 2011.
-  David L Weaver and Tom Gremond. The SPARC architecture manual. PTR Prentice Hall Englewood Cliffs, NJ 07632, 1994.
-  John Whaley and Monica S. Lam. Cloning-based context-sensitive pointer alias analysis using binary decision diagrams. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 131–144, 2004.
-  Naling Zhang, Markus Kusano, and Chao Wang. Dynamic partial order reduction for relaxed memory models. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 250–259, 2015.