A True Positives Theorem for a Static Race Detector - Extended Version

11/08/2018 ∙ by Nikos Gorogiannis, et al. ∙ Facebook ∙ Yale-NUS College

RacerD is a static race detector that has been proven to be effective in engineering practice: it has seen thousands of data races fixed by developers before reaching production, and has supported the migration of Facebook's Android app rendering infrastructure from a single-threaded to a multi-threaded architecture. We prove a True Positives Theorem stating that, under certain assumptions, an idealized theoretical version of the analysis never reports a false positive. We also provide an empirical evaluation of an implementation of this analysis, versus the original RacerD. The theorem was motivated in the first instance by the desire to understand the observation from production that RacerD was providing remarkably accurate signal to developers, and then the theorem guided further analyzer design decisions. Technically, our result can be seen as saying that the analysis computes an under-approximation of an over-approximation, which is the reverse of the more usual (over of under) situation in static analysis. Until now, static analyzers that are effective in practice but unsound have often been regarded as ad hoc; in contrast, we suggest that, in the future, theorems of this variety might be generally useful in understanding, justifying and designing effective static analyses for bug catching.







1. Context for the True Positives Theorem

The purpose of this paper is to state and prove a theorem that has come about by reacting to surprising properties we observed of a static program analysis that has been in production at Facebook for over a year.

The RacerD program analyzer searches for data races in Java programs, and it has had significantly more reported industrial impact than any other concurrency analysis that we are aware of. It was released as open source in October of 2017, and the OOPSLA’18 paper by Blackshear et al. (2018) describes its design, and gives more details about its deployment. They report, for example, that over 2,500 concurrent data races found by RacerD have been fixed by Facebook developers, and that it has been used to support the conversion of Facebook’s Android app rendering infrastructure from a single-threaded to a multi-threaded architecture.

RacerD’s designers did not establish the formal properties of the analyzer, but it has been shown to be effective in practice. We wanted to understand this point from a theoretical point of view. RacerD is not sound in the sense of computing an over-approximation of some abstraction of executions. Over-approximations support a theorem: if the analyzer says there are no bugs, then there are none, i.e., there are no false negatives; this is often considered a “soundness” theorem. RacerD favours reducing false positives over false negatives. A design goal was to “detect actionable races that developers find useful and respond to” but with “no need to (provably) find them all”. In fact, it is very easy to generate artificial false negatives, but Blackshear et al. (2018) say that few have been reported in over a year of RacerD in production.

One can react to this by saying that RacerD is simply an ad hoc, if effective, tool. But the tool does not heuristically filter out bug reports in order to reduce false positives. It first does computations with an abstract domain, and then issues data race reports if any potential races are found according to the abstract domain. Its architecture is like that of a principled analyzer based on abstract interpretation, even though it does not satisfy the usual soundness theorem. This suggests that saying RacerD is ad hoc because it does not satisfy a standard soundness theorem is somehow missing something: it would be better if a demonstrably-effective analyzer with a principled design came with a theoretical explanation, even if partial, for its effectiveness. That is the research problem we set ourselves to solve in this work.

A natural question is whether it is possible to actually modify RacerD so as to make it sound, without losing its effectiveness in generating signal—actionable and useful data race reports that developers are keen to fix. RacerD’s initial design elided standard analysis techniques such as alias and escape analysis, on the grounds that this was consistent with the goal of reducing false positives. To try to get closer to soundness our colleague Sam Blackshear, one of RacerD’s authors, implemented an escape analysis to find race bugs due to locally declared references escaping their defining scope, i.e., to reduce the false negatives. The escape analysis led to too many false positives; it contradicted the goal of high signal, and was not put into production. One of the current authors, Gorogiannis, tried another tack to reduce false negatives: a simple alias analysis, to find races between distinct syntactic expressions which may denote the same lvalue. Again the attempt caused too many false positives to make it to production.

Next we wondered: might there be a different theorem getting to the heart of why RacerD works? Because it is attempting to reduce false positives, a natural thing to try would be an under-approximation theorem. This would imply that every report is a true positive. This theorem is false for the analyzer, because it is possible to artificially generate false positives. One of the main reasons for this is that conditionals and loops are treated in a path-insensitive manner (i.e., join corresponds to taking the union of potential racy accesses across the different branches).

It seems plausible to modify RacerD to be under-approximate in two ways, but each of these has practical problems. In the first way, one considers sets (disjunctions) of abstract states, and uses conjunction for interpreting if statements and disjunction at join points: this is like in symbolic execution for testing (Cadar and Sen, 2013). This would cause scaling challenges because of the path explosion problem. RacerD runs quickly, in minutes, on (modifications to) millions of lines of code, and its speed is important for delivering timely signal. To make a much slower analysis would not be in the spirit of RacerD, and would not explain why the existing analyzer is effective. The other way would be to use a meet (like intersection) while sticking with one abstract state per program point. The problem here is that this prunes very many reports, so many that it would miss a great many bugs and (we reasoned) would not be worth deploying in Continuous Integration.

These considerations led us to the following hypothesis, which we would like to validate:

Conjecture 0 (True Positives (TP) Theorem).

Under certain assumptions, the analyzer reports no false positives.

An initial version of the assumptions was described as follows. Consider an idealized language, IL, in which programs have only non-deterministic choice in conditionals and loops, and where there is no recursion. Then the analyzer should only report true positives for that language.

The absence of booleans in conditionals in IL reflects an analyzer assumption that the code it applies (well) to will use coarse-grained locking and not (much) fine-grained synchronization, where one would (say) race or not depending on boolean conditions. In Facebook’s Android code we do find fine-grained concurrency. The Litho concurrent UI library (available at https://fblitho.com) has implementations of ownership transfer and double-checked locking, both of which rely on boolean conditions for concurrency control. This kind of code can lead to false positives for RacerD, but we did not regard that as a mistake in RacerD, because the Android engineers advised us to concentrate on coarse-grained locking, which is used in the vast majority of product code at Facebook. For instance, we rarely observed code calling into Litho which selects a lock conditionally based on the value of a mutable field. Thus, the TP theorem based on assumptions reflected in IL could, if true, be a way of explaining why the analyzer reports few false positives, even though it uses join for if-statements.
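To make the point concrete, here is a minimal sketch of double-checked locking (hypothetical class and field names, not Litho code): correctness of the unsynchronized read hinges on the null test and the volatile field, i.e., exactly the kind of boolean condition that a path-insensitive treatment of conditionals ignores, so an analyzer like RacerD may flag the read/write pair as a race.

```java
// Sketch of double-checked locking (hypothetical example, not Litho code).
// The first, unsynchronized read of `instance` is made safe by the null
// check and the volatile qualifier; a path-insensitive analysis cannot
// track the null test and may report the read as racing with the write.
class Holder {
    private volatile Object instance;

    Object get() {
        Object local = instance;          // unsynchronized read
        if (local == null) {
            synchronized (this) {
                local = instance;         // re-check under the lock
                if (local == null) {
                    instance = local = new Object();
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        Holder h = new Holder();
        System.out.println(h.get() == h.get()); // prints: true
    }
}
```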

The no-recursion condition is there because we want to say that the analyzer gets the races right except for when divergence makes a piece of code impossible to reach. If a data race detector reports a bug on a memory access that comes after a divergent statement (diverge; acc) then this would be a false positive, but we would not blame data race reporting, but rather failure to recognize the divergence which makes the memory access unreachable. The no-recursion requirement is just a form of separation of concerns to help theory focus on explaining the data race reporting aspect. Note that if non-deterministic choices are the only booleans in while-loops, then such loops do not necessarily loop forever; that is why we forbid recursion and not loops here. We give a full description of the assumptions for the TP theorem in Section 3.

One can see our True Positives Theorem as establishing an under-approximation of an over-approximation in a suitably positioned context. Start with a program without recursion (context). Replace boolean values in conditionals and loop statements by non-determinism (over-approximation). Finally, report only true positives in that over-approximation (i.e., under-approximate the potential bugs in that over-approximation). The more usual position in program analysis is to go for the reverse decomposition, an over-approximation of an under-approximation, to account for soundness relative to assumptions. The “under” comes about from ignored behaviors (e.g., if an analyzer does not deal well with a particular feature, such as reflection, prune the behaviors involving reflection when stating the soundness property), and the “over” comes from a usual sound-for-bug-prevention analysis, but only wrt. this under-approximate model. In contrast, our notion of under-of-over seems like a good way to think about a static analysis for bug catching.

We explained above how we arrived at the statement of the TP Theorem by considering properties of a particular analyzer, paired with considerations of its effectiveness. We don’t claim that to get an effective race detector you must end up in the position of the TP theorem. It might be possible to obtain a demonstrably useful-in-practice data race detector that satisfies a standard soundness (over-approximation) theorem. It might also be possible to find one satisfying a standard (unconditional) under-approximation theorem. In fact, there exist a number of dynamic data race detectors that are very promising, and at least one (Serebryany and Iskhodzhanov, 2009) that is widely deployed in practice. Both of these directions are deserving of further research. The work here is not in conflict with these valuable research directions, but would simply complement them by providing new insights on designing static race analyzers.

Now, the True Positives Theorem was not actually true of RacerD when we formulated it, and apparently it is still not. But, subsequent to our discovery of the theorem, the authors of RacerD took it as a guiding principle. For example, they implemented a “deep ownership” assumption (Clarke and Drossopoulou, 2002): if an access path, say x.f, is owned (i.e., accessible through just one object instance), then so are all extensions, e.g. x.f.g. The analyzer would never report a race for an owned access. Somewhat surprisingly, even though the deep ownership assumption goes against soundness for bug prevention (over-approximation), it is compatible with the goal of reducing false positives and commonly holds in practice.
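A minimal sketch of the deep-ownership assumption (hypothetical class names, ours, not from RacerD's code base): once a field has been assigned a freshly allocated object, the access path through it is treated as owned, and under deep ownership every extension of that path is treated as owned as well, so no race is reported on those accesses.

```java
// Hypothetical illustration of deep ownership: `f` receives a freshly
// allocated object, so the path this.f is owned; deep ownership extends
// this to this.f.g and this.f.g.v, none of which would be reported.
class Node {
    Node g;
    int v;
}

class OwnershipDemo {
    Node f;

    int freshen() {
        this.f = new Node();    // this.f is owned: freshly allocated here
        this.f.g = new Node();  // deep ownership: the extension this.f.g is owned too
        this.f.g.v = 7;         // hence this unsynchronized write is not reported
        return this.f.g.v;
    }

    public static void main(String[] args) {
        System.out.println(new OwnershipDemo().freshen()); // prints: 7
    }
}
```

Note that, as the text observes, this assumption can miss real races (an owned path may still escape), but it commonly holds in practice and reduces false positives.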

We have proven a version of the TP Theorem for RacerDX, a modified version of the RacerD analyzer, which is based on IL and is not too far removed from the original. In this paper we formulate and prove the TP theorem for an idealized theoretical language, and we carry out an empirical evaluation of an implementation of RacerDX in relation to the in-production analyzer (RacerD), wrt. the change in the number of reports produced. The distance between RacerD and RacerDX is such that the latter, which has a precise theorem attached to it, makes on the order of 10-57% fewer data race reports on our evaluation suite. To switch RacerD for RacerDX in production would require confidence concerning the number of true bugs in their difference, or other factors (e.g., simplicity of maintenance) which require engineering judgement; in general, replacing well-performing in-production software requires strong reasons. But our experiments show that the basic design of RacerD is not far from an analyzer satisfying a precise theorem and, as we now explain, the extent of their difference is not so important for the broader significance of our results.

Broader Significance

Our starting point was a desire to understand theoretically why the specific analyzer RacerD is effective in practice, and we believe that our results go some way towards achieving this, but it appears that they have broader significance. They exemplify a way of studying program analysis tools designed for catching bugs rather than ensuring their absence.

The first point we emphasize is the following:

  • Unsound (and incomplete) static analyses can be principled, satisfying meaningful theorems that help to understand their behaviour and guide their design.

The concept ‘sound’ seems to have been taken as almost a synonym for ‘principled’ in some branches of the research community, and colleagues we have presented our results to have often reacted with surprise that such a theorem is possible. Representative are the remarks of one of the POPL referees: ‘It is remarkable that the authors manage to prove any theorem at all about an unsound analysis’ and the referees collectively who said ‘proving a theorem for an unsound static analysis is a first’. Note that our analysis is neither sound for showing the absence of bugs nor for bug finding (where every report is a true positive: the static analysis literature often refers to this as ‘complete’, whereas in symbolic as well as concrete testing it is sometimes what is meant by ‘sound’). The potential for theoretical insights on unsound and incomplete analyses is perhaps less widely appreciated than it could be. Such analyses are not necessarily ad hoc (although they can be).

Just to avoid misunderstanding, note that we are not talking about the common situation where an analyzer is unsound generally but designed to be sound under assumptions. As described above, this situation can be understood as soundness (over-approximation) wrt. a different model than the usual concrete semantics, typically an under-approximation of the concrete semantics. By striving to minimize false positives instead of false negatives, our analyzer is purposely designed to be unsound, and this difference is reflected in the shape of our theorem being under-of-over rather than the converse.

Much more significant than saying that there exists an unsound (and incomplete) analysis with a theorem, would be if we could make the above claim for an analyzer that was useful in the sense of helping people.

Here, the fact that RacerD (the analyzer deployed in production, without a theorem) and RacerDX (its close cousin, with a theorem) are not the same at first glance seems problematic. However, our experiments suggest strongly that, if RacerDX rather than RacerD had been put into production originally, it would have found thousands of bugs, far outstripping the (publicly reported) impact of all previous static race detectors. We infer:

  • One can have an unsound (and incomplete) but effective static analysis, which has significant industrial impact, and which is supported by a meaningful theorem; in our case the TP theorem.

The discussion of this second point is admittedly based on counter-factual reasoning (what if we ‘had’ deployed RacerDX instead of RacerD?), but the possibility of it being a false argument seems to be vanishingly small. The point is so powerful, and accepted by the Infer static analysis team members even outside the authors (RacerD and RacerDX are implemented using the Infer.AI abstract interpretation framework; see https://fbinfer.com), that the TP theorem is guiding the construction of new analysers at Facebook.

Thus, it seems that our results could be more broadly significant than the initial but perhaps worthwhile goal of ‘understand why RacerD is effective’. The fact that RacerD and RacerDX don’t coincide is not so important for the larger point, though it would have made for a neat story if they had. Such neat stories have so far been rather rare when developing analysers under industrial rather than scientific constraints. Future analysers, including ones being developed at Facebook, will possibly adhere more to the ‘neat’ story.

Paper Outline

In the remainder of the paper we give an overview of the intuition and reasoning principles for identifying and reporting concurrent data races in RacerD (Section 2). We then describe an idealized language (IL), whose set of features matches the common conventions followed in production Java code, as well as IL’s over-approximating semantics (Section 3). We then provide a formal definition of RacerDX, a modified RacerD analysis, tailored for reporting no false positives, in the framework of abstract interpretation (Section 4). Our formal development culminates with a proof of the True Positives Theorem, coming in two parts: in Section 5 we prove completeness of RacerDX wrt. its abstraction and in Section 6 we show how to reconstruct provably racy executions from the analysis results. In Section 7, we discuss the implementation of the theoretical analyzer described in Section 4. We then present an evaluation of RacerDX which compares it with RacerD on a set of real-world Java projects. We discuss related work in Section 8 and elaborate on the common formal guarantees considered for static concurrency analyses, positioning RacerDX amongst existing tools and approaches.

2. Overview

A textbook definition of a data race is somewhat low-level: a race is caused by potentially concurrent operations on a shared memory location, of which at least one is a write (Herlihy and Shavit, 2008).

Data races in object-based languages with a language-provided synchronization mechanism (e.g., Objective-C or Java) can be described more conveniently, for the sake of being understood and fixed by the programmer, in terms of the program’s syntax (rather than memory), via access paths, which serve as runtime race “witnesses” by referring to the dynamic semantics of concurrent executions. (We will provide a more rigorous definition in Section 3.) Given two programs P1 and P2 (e.g., calls to methods of the same instance of a Java class), there is a data race between P1 and P2 if one can construct their concurrent execution trace τ, and identify two access paths, π1 and π2 (represented as field-dereferencing chains x.f1.⋯.fn, where x is a variable or this), in P1 and P2, respectively, such that:


  • at some point of τ, π1 and π2 are involved in two concurrent operations, at least one of which is a write, while both π1 and π2 point to the same shared memory location;

  • the sets of locks held by P1 and P2 at that execution point, respectively, are disjoint.

This “definition” of a data race provides a lot of freedom for substantiating its components. Of paramount importance are (a) the considered set of pairs of programs that can race, (b) the assumptions about the initial state of P1 and P2, and (c) the notion of the dynamic semantics employed for constructing concurrent execution traces. A choice of those determines what is considered to be a race and reflects some assumptions about program executions. For instance, accounting for the “worst possible configuration” (e.g., arbitrary initial state and uncontrolled aliasing) would make the problem of sound (i.e., over-approximating) reasoning about data races intractable in practice, or render its results too imprecise to be useful.

2.1. Race Detection in RacerD

public class Dodo {
  private Dodo dee;

  public void zap(Dodo d) {
    synchronized (this) {
      System.out.println(d.dee);
    }
  }

  public void zup(Dodo d) {
    d.dee = new Dodo();
  }
}
Figure 1. A racy Java class.

As a specific set of design choices wrt. identifying data races in terms of access paths, let us consider RacerD (Blackshear et al., 2018)—a tool for static compositional race detection by Facebook—and its take on a toy example in Figure 1. The Java class Dodo has one private field dee and two public methods, both manipulating dee’s state. To instantiate the definition of a data race above, RacerD considers all pairs of public method calls on the same class instance as the concurrently running programs P1 and P2 from the definition above (a). To instantiate the initial state, it assumes that method parameters of the same type (e.g., Dodo) may alias, thus maximizing the possibility of a race (b). Finally, it assumes there is only one reentrant lock in the entire program (e.g., Dodo’s this), and over-approximates the dynamic semantics by taking all branches of the conditionals and entering every loop (c).

With this setup, RacerD reports the class Dodo (annotated with javax.annotation.concurrent’s @ThreadSafe) as racy, due to concurrent unsynchronized accesses to the field dee via two paths: a read from d.dee in the method zap() and a write to d.dee in zup()—both detected based on the assumption that the two ds can be run-time aliases, since they share the same type. This assumption makes it easy to reconstruct a concurrent execution of the two culprit methods, invoked with the same object, which exhibits a race, thus making this report a true positive of the analysis.
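Such a witness execution can be sketched with a small harness (ours; the class body reproduces Figure 1, with a helper added only so the result can be checked): the two methods run in parallel on the same receiver with the same argument, so the locked read in zap and the unlocked write in zup may touch d.dee concurrently.

```java
// Harness reconstructing the racy execution for Figure 1's Dodo
// (the static helper `deeSet` is ours, added only for checking).
public class Dodo {
    private Dodo dee;

    public void zap(Dodo d) {
        synchronized (this) {
            System.out.println(d.dee);  // locked read of d.dee
        }
    }

    public void zup(Dodo d) {
        d.dee = new Dodo();             // unlocked write to d.dee
    }

    static boolean deeSet(Dodo d) { return d.dee != null; }

    public static void main(String[] args) throws InterruptedException {
        Dodo receiver = new Dodo();
        Dodo d = new Dodo();            // the two formals alias this object
        Thread t1 = new Thread(() -> receiver.zap(d));
        Thread t2 = new Thread(() -> receiver.zup(d));
        t1.start(); t2.start();
        t1.join();  t2.join();
        // After both joins, the write in zup is guaranteed visible.
        System.out.println(Dodo.deeSet(d));
    }
}
```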

2.2. Fighting False Positives

By its nature, RacerD is a bug detector, hence it sacrifices the traditional concept of soundness (the property, customary for static analyses, stating that all behaviors of interest are detected (Cousot and Cousot, 1979)) and, thus, may suffer false negatives, i.e., miss races (although there were very few reports of races missed in Facebook’s production code (Blackshear et al., 2018)). Instead, for the purposes of correctly detecting bugs, and minimizing the time programmers spend investigating the reports, RacerD emphasizes completeness (Ranzato, 2013) over soundness, targeting what we are going to refer to as the True Positives Conjecture—that is, if it reports a data race, one should be able to find a witness execution and access paths exhibiting the concurrency bug, given the assumptions made.

class Bloop {
  public int f = 1;
}

class Burble {
  public void meps(Bloop b) {
    synchronized (this) {
      System.out.println(b.f);
    }
  }

  public void reps(Bloop b) {
    b.f = 42;
  }

  public void beps(Bloop b) {
    b = new Bloop();
    b.f = 239;
  }
}

class Wurble {
  Wurble x = new Wurble();
  Bloop g = new Bloop();

  public void qwop(Wurble w) {
    zwup(w.x);
  }

  public void gwap(Wurble w) {
    synchronized (this) {
      System.out.println(w.x.g);
    }
  }

  private void zwup(Wurble w) {
    synchronized (this) {
      System.out.println(w.x.g);
    }
    w = new Wurble();
    w.g.f = 21;
  }
}
Figure 2. A Java class with a false race.
Figure 3. A class with a false interprocedural race.

As a next example, consider the Java class Burble in Figure 2. For the same reason as with Dodo, the methods meps and reps race with each other when run with aliased arguments. It is less obvious, however, whether meps races with beps—and in fact they do not! The reason is that beps reassigns a freshly allocated instance of Bloop to the formal b before assigning to the field of the latter, thus effectively avoiding a race with the concurrent access to b.f in meps.

This phenomenon of “destabilizing” an access path in a potentially racy program can be manifested both intra-procedurally (as in Burble) and inter-procedurally. To wit, in another example in Figure 3, the class Wurble demonstrates a similar instance of a false race, with the private method zwup() “destabilizing” the path w.g by assigning a newly allocated Wurble instance to w, thus, ensuring that qwop() and gwap() avoid a race with each other.

A sound static analyzer would typically be expected to report races in both of these examples, corresponding to a loss of precision. However, having a non-negligible number of false positives is not something a practical bug detector can afford.

To avoid this loss of signal effectiveness, RacerD employs an ownership tracking domain (Naik et al., 2006; Flanagan and Freund, 2009), used to record the variables and paths that have been assigned a newly allocated object, thus remedying the situation shown above.

Upon closer examination, we found that RacerD’s abstract domain, including that of ownership, was not enough to allow us to prove that an access path resolves to the same address before and after execution. To wit, knowing that an access path x.f.g is not owned does not guarantee that the lvalue it corresponds to stays the same during execution. The reason we wanted this latter property is that it is one of the simplest ways to exhibit a race: once we have set up an initial state where a path resolves to a certain address, and have shown that execution of a program does not modify that address (i.e., the path to the address is stable), we are in a position to unconditionally say that if both programs access that address, they will race. The answer we came up with is the notion of stability; its negation, instability (or wobbliness), over-approximates ownership.
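The distinction can be seen concretely in beps from Figure 2: reassigning the prefix b “wobbles” the path b.f, which afterwards resolves to a different address, so the subsequent write cannot touch the caller’s object. A small self-contained sketch (our own harness around the figure’s code):

```java
// Sketch of path instability ("wobbliness"): after `b` is reassigned,
// the path b.f resolves to a fresh address, so the write to b.f cannot
// race with a concurrent access through the caller's original object.
class Bloop {
    public int f = 1;
}

class StabilityDemo {
    static int beps(Bloop b) {
        Bloop original = b;   // remember the caller's object
        b = new Bloop();      // wobbles the path b: it now denotes a fresh object
        b.f = 239;            // writes only to the fresh object
        return original.f;    // the caller's object is untouched
    }

    public static void main(String[] args) {
        System.out.println(beps(new Bloop())); // prints: 1
    }
}
```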

Thus we pose the question: can we state reasonable (i.e., non-trivial) conditions under which we can state with confidence (i.e., formally prove) that all of RacerD’s reported races are true positives?

We refer to this desirable result as the True Positives Theorem (TPT) for a static race detector, and in this paper we deliver such a theorem for a version of RacerD (called RacerDX, using stability), formulating a set of assumptions under which it holds, and assessing their practical implications and impact on signal.

2.3. A True Positives Theorem for RacerDX

Our main result enabling the TPT proof is defining an over-approximating concrete semantics and stating the sufficient conditions, both reflecting the behavior of production code, under which RacerDX, a modified version of RacerD, reports no false data races. The RacerDX True Positives Theorem, which substantiates this statement, builds on the following three pieces of the formal development that together form the central theoretical contribution of this work.

2.3.1. An Over-Approximating Concrete Semantics and Practical Assumptions

RacerDX is a flow- and path-insensitive static analysis, which goes into all branches of conditional expressions and loops. To account for this design choice while stating a formal completeness result, we adopt a novel non-deterministic (single-threaded) trace-collecting semantics that treats branch and loop guards as non-deterministically valued variables and explores all execution sequences (Section 3). We use this semantics to give meaning to single-threaded program executions, thus building a concrete domain for the main RacerDX analysis procedure. Amongst other things, we assume just one global lock and a language with a single class (although we restrict neither the number of its fields and methods nor their signatures). We also restrict the reasoning to programs with well-balanced locking (e.g., in Java terminology this would mean that only synchronized-style locking is allowed), and forbid recursion.

2.3.2. RacerDX Abstraction and Sequential Completeness of its Analysis

We formulate the abstract domain for RacerDX analysis (Section 4) along with the abstraction function to it from the concrete domain of the previously defined multi-threaded trace-collecting semantics, à la Brookes (2007), which we show to form a Galois connection (Cousot and Cousot, 1979). We then prove, in Section 5, a tower of lemmas, establishing the “sequential” completeness of RacerDX’s static analysis, which analyzes each sequential sub-program compositionally in isolation, without considering concurrent interleavings, with respect to this abstraction (Theorem 5.15). The novelty of our formal proof is in employing standard Abstract Interpretation, while taking a different perspective by establishing the completeness rather than soundness of an analysis in the spirit of the work by Ranzato (2013).

2.3.3. Syntactic Criteria for Ensuring True Positives

We introduce the notions of path stability and its counterpart, wobbliness (i.e., instability)—simple syntactic properties which are at the heart of the sufficient conditions for RacerDX’s TPT—and connect them to the previously developed abstract domain and single-threaded RacerDX abstraction. A similar tower of lemmas is built, showing that a stable path resolves to the same address before and after execution, thus providing us with the tools for validating the reported race. Our formal development culminates in Section 6, where we leverage the sequential completeness result of RacerDX to reconstruct concrete concurrent execution traces exhibiting provably true data races, thus delivering the final statement and the proof of TPT (Theorem 6.10).

2.4. Measuring the Impact of the True Positives Theorem on Signal Effectiveness

The practical contribution of our work, described in Section 7, is an implementation of RacerDX, a revised version of RacerD incorporating the analysis machinery enabling the result of TPT, and its evaluation. We aimed to measure the impact of employing stability as a sufficient condition for detecting true races with respect to the reduction in the overall number of reported bugs (it is easy to have a vacuous TPT by reporting no bugs whatsoever!). We ran experiments contrasting RacerD and RacerDX on a number of open-source Java projects, ranging from 25k to 273k LOC. We evaluated the runtime of each analyzer, and looked at the reports the two analyzers produced in several ways, in order to assess the loss of signal induced by RacerDX.

3. Concrete Execution Model

To formally define our execution model and the notion of a race, we first describe an idealized programming language (IL) that accurately captures the essence of RacerDX’s intermediate representation and faithfully represents the race-relevant aspects of Java and similar languages.

3.1. Language and Assumptions about Programs

We start by defining a programming language with a semantics suitable for RacerDX’s goals. Our language is simplified compared to Java, with assumptions made to help our study of the question of whether the data race reports are effective.

  1. The language has only one class, that of record-like objects with an arbitrary, but finite, set of fields (with names from a fixed set) which are themselves pointers.

  2. The concurrency model is restricted to exactly two threads, which use a single, reentrant lock. The commands lock() and unlock() are only allowed to appear in balanced pairs within block statements and method bodies.

  3. Local variables need no declaration, but must be assigned to before first use. Formal parameters are always taken from a fixed set. There are no global variables.

  4. There is no destruction of objects, only allocation.

  5. All methods are non-recursive, have call-by-value parameters and no return value. The assumption of non-recursion is standard, as we don’t want to reason here about termination.

  6. Control for conditional statements and while loops is non-deterministic; there are no booleans.

Assumptions 1 and 5 are about ignoring potential sources of false positives which have nothing in particular to do with races. For example, if you enter an infinite loop before a potential data race then you would have a false positive if you reported the race, but we wouldn’t expect a (static) race detector to detect infinite loops. Similarly, if you have a class that can’t be inhabited you can’t get races on it, but we view this as separate from the question of the effectiveness of the race reports themselves.

Assumption 5 about recursion is perhaps not as practically restrictive as might first appear. Its impact is smaller than that of, for example, bounded symbolic model checking, which can be seen as performing a finite unwinding of a program to produce a non-recursive under-approximation of it (and without loops as well) before doing analysis. As we explained in Section 1, our main reason for making the assumption is a conceptual rather than a practical one, to do with separation of concerns, but we state the point about bounded model checking for additional context.

The reasons for simplifications 1 and 2 are both related to RacerDX’s focus on detecting races in one class at a time, with races manifested by parallel execution of methods on a single instance. Assumption 2 is a potential source of real false negatives, as methods of the same class that use distinct locks can race. Furthermore, the well-balanced assumption in 2 corresponds directly to scoped synchronisation mechanisms like Java’s synchronized(m){ ... } construction (Goetz et al., 2006), where m is a static global mutex object. Assumption 3 corresponds to RacerDX’s intermediate language representation of parameters and variables. The lack of global variables is a genuine restriction, but one that is easy to address; we discuss this from a practical point of view in Section 7. Assumption 4 is due to the fact that Java is a garbage-collected language.
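The balanced-pairs discipline of assumption 2 can be checked purely syntactically. Below is a minimal Python sketch over a toy command representation (a block is a list of command strings or nested blocks; the encoding is ours, not RacerDX's actual IL):

```python
# Syntactic balanced-locking check for assumption 2, on a toy command
# representation: a block is a list whose elements are the strings
# "lock"/"unlock", other command names, or nested lists (block statements).

def balanced(block):
    """True iff lock/unlock appear in balanced, well-scoped pairs."""
    depth = 0
    for cmd in block:
        if isinstance(cmd, list):
            if not balanced(cmd):      # each nested block must balance on its own
                return False
        elif cmd == "lock":
            depth += 1
        elif cmd == "unlock":
            depth -= 1
            if depth < 0:              # unlock without a matching lock
                return False
    return depth == 0                  # every lock released by block end
```

Requiring each nested block to balance on its own mirrors the scoped style of Java's synchronized blocks.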

Finally, assumption 6 is the most significant one, both from the perspective of formalising a language semantics and from the point of view of specifying completeness. By assuming that every execution branch may be taken, we do not have to reason statically about branching and looping conditions. This is effectively an over-approximation of the actual semantics of the concurrent programs. The assumption is motivated by two considerations: (i) RacerD purposely avoids tackling fine-grained concurrency, and is applied at Facebook to code where developers do not often avoid races by choosing which branch to take; (ii) RacerD uses join for if statements in order not to have too many false negatives.

Figure 4. Syntax grammar of the concurrent programming language.

The grammar defining the language of interest is given in Figure 4. We represent expressions as program variables, method formals, and access paths. The language contains no constants such as, e.g., null. We partition statements into simple and composite. Simple statements include assignments to variables and paths, reading from paths, allocating a new object, (blocking) lock acquisition via lock(), lock release via unlock(), and popping an execution stack frame. The latter command is not to be used directly in programs; it only occurs in our semantics of method calls and is needed for defining the trace-collecting semantics described below. Composite statements provide support for sequential composition, as well as conditionals, loops and method calls, with the latter resulting in pushing a new stack frame on the call stack, as well as emitting a command to remove it at runtime after the method body is fully executed.

Following the tradition of Featherweight Calculi for object-oriented languages (Igarashi et al., 2001), we introduce a function mapping method names to bodies, and another giving the number of formal parameters. We impose the convention that the formal parameters of every method are drawn from a fixed set, and that no other variables of that form appear in method bodies. We also allow local variables (i.e., variables not drawn from that fixed set), but forbid their use without definite prior initialisation.

3.2. Concrete Semantics

In this work, we do not tackle fine-grained concurrency (Turon et al., 2013), and we only consider lock-based synchronisation, as per assumption 2. Therefore, our concurrent semantics adopts the model of sequential consistency (Lamport, 1979). To define it, we start by giving semantics to sequential program runs, which we will later combine to construct concurrent executions.

Our definition of runtime executions works over the following semantic categories. We assume a countably infinite set of object locations. A stack (we overload the term stack to mean store here, borrowing from Separation Logic, the style of which informs most of our development; we disambiguate by explicitly using call-stack for a list of stacks) is a mapping from variables to locations. Addresses are defined in a field-splitting style, as pairs of a location and a field name. A heap is a partial, finite map from addresses to locations. There is a distinguished constant location, corresponding to Java’s null, such that no heap is ever defined on an address whose first component is that location; it is simply an unallocated address, introduced in order to avoid deviating too much from standard developments of shape analyses. The initial stack maps every variable to this constant. The projection of an address to its first component is lifted pointwise to sets of addresses and to heaps. A lock context is a natural number.

A (single-thread) program state is a tuple consisting of a command, a call-stack, a heap, and a lock context; the command is either a simple statement (Figure 4) or a runtime call-statement. The stack at head position in the call-stack list is the current stack frame. The lock context of the thread signifies how many times a lock() instruction has been executed without a corresponding unlock(). We remark that the stack, heap and lock context components are those produced by execution starting at some previous program state.

A two-threaded program state is a tuple , where is either or , denoting one of the two threads executing command . The pairs and are the thread-local call-stacks and lock contexts for each thread, and is the shared heap.

For the next series of definitions, we overload the term state to describe stack-heap pairs for both single-threaded and multi-threaded executions when the context is unambiguous.

Definition 3.1 ().

The address, , of a path in a state is recursively defined as follows:

The value, , of an expression in a state is defined as follows:

The address of a path is the address read or written when a load or a store accesses that path.
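The recursive resolution of a path's address can be sketched concretely. Below is a small Python model (our own encoding, not the paper's notation): a stack maps variables to locations, a field-splitting heap maps (location, field) pairs to locations, and a path is a tuple whose head is a variable:

```python
# A sketch of Definition 3.1 under the field-splitting model: the address
# of x.f1...fn is obtained by looking up the root variable in the stack
# and dereferencing all but the last field through the heap.

def addr(path, stack, heap):
    """Address (location, field) read/written when accessing `path`."""
    root, *fields = path
    loc = stack[root]                  # base variable lookup
    for f in fields[:-1]:              # dereference all but the last field
        loc = heap[(loc, f)]
    return (loc, fields[-1])

def value(path, stack, heap):
    """Value of an expression: a bare variable or a full path."""
    root, *fields = path
    return stack[root] if not fields else heap[addr(path, stack, heap)]
```
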

Definition 3.2 (Execution trace).

A (single-threaded) execution trace is a possibly empty list of program states. The set of all traces is .

We are now equipped with all the necessary formal components to give meaning to both single- and multi-threaded executions.


Figure 5. Single-threaded trace-collecting semantics of simple (top) and compound statements (bottom).
Definition 3.3 (Sequential trace-collecting semantics).

A trace-collecting semantics of a single-threaded (possibly compound) program is a map to a set of traces starting from an initial configuration. The trace-collecting semantics is defined in Figure 5 (for the sake of uniformity, we assume that every program has a fixed canonical form). The auxiliary function (defined only on non-empty traces) returns the stack, heap and lock context of the last element of a non-empty trace, discarding its command component.

We give the semantics for concurrent programs in the style of Brookes (2007).

Definition 3.4 (Concurrent execution trace).

A (two-threaded) concurrent execution trace is a possibly empty list of two-threaded program states. The set of all concurrent traces is .

Definition 3.5 (Concurrent trace-collecting semantics).

The trace-collecting semantics of a parallel composition of two programs is defined in Figure 6. It is a map from two-threaded program states to sets of concurrent traces. Intuitively, it interleaves all single-threaded executions of the two threads, taking care to guarantee that only one thread can hold the lock at a given step of the trace.

Figure 6. Concurrent trace-collecting semantics.

We conclude this section of definitions by formally specifying concurrent data races.

Definition 3.6 (Data Race).

The program races if there exists a state and a non-empty concurrent trace such that (throughout the paper, we use a wildcard notation for values we do not care about, effectively existentially quantifying over them):

  • there exist paths such that ;

  • there exist states and such that ;

  • at least one of the two accesses is a store (i.e., they are not both loads).

How can this definition capture races without mentioning locks? For a thread to be blocked on a lock acquisition, the successor instruction to the current state must be a lock() statement, which is excluded by the syntactic condition on the successor instructions (cf. restrictions wrt. locking contexts in Figure 6). Other sources of getting stuck are excluded by assumptions 2–6: there are no deadlocks due to a single reentrant lock, no deterministic infinite loops and no recursion.
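As a sanity check, the conflict condition on two simultaneously enabled accesses can be phrased as a tiny predicate. This Python sketch uses our own encoding of an access as a (thread, kind, address) triple:

```python
# A sketch of the race condition of Definition 3.6: a race needs two
# accesses by distinct threads to the same address, at least one of
# which is a store.

def races(acc1, acc2):
    t1, kind1, a1 = acc1
    t2, kind2, a2 = acc2
    return t1 != t2 and a1 == a2 and "store" in (kind1, kind2)
```
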

4. RacerDX Analysis and its Abstract Domain

The core RacerDX analysis statically collects information about path accesses occurring during an abstract program execution, as well as their locking contexts. In addition to those fairly standard bits, it tracks an additional program property, wobbly paths, which makes it possible to filter out potential false positives at the reporting phase.

Definition 4.1 ().

We call a path non-stable (or unstable, or wobbly) in a program if it appears as either LHS or RHS in a read or an assignment command during some execution of . We elaborate this concept in the presence of method calls below.

Formally, the analysis operates on an abstract domain, which is a product of the three components: domain of wobbly paths, an access path domain and a lock domain.

Definition 4.2 (RacerDX abstract states).

The abstract domain is , where

  • is the set of wobbly accesses;

  • is the current lock context;

  • is a domain of sets of recorded read/write accesses from/to paths, each recorded together with the lock context under which it occurs.

We will use as identifiers of elements of the corresponding domains above.

The lattice structure on the domain (whose elements we will often refer to as summaries) is ordered by pointwise lifting of the order relations on the three components of the domain (Cousot and Cousot, 1979). For an element of the domain, we refer to its three projections componentwise. For computing the join of branching control-flow and loops, the analysis employs a standard monotone version of a lub-like operator.

Definition 4.3 (Least-upper bound on ).

For and ,

Figure 7. Definition of the abstract analysis semantics . We define , , and . We define a substitution (applied recursively to all syntactic elements) as applying only to the path component of (eg, ). The function acts as a filter that only selects expressions rooted at formals, i.e., variables of the form , and is extended straightforwardly to sets of elements of , depending on the path component of the command.

The intuition of what might go wrong with a join defined this way, when aiming for a “no-information-loss” analysis, is easier to see on an example. Consider a conditional in which the lock (which is reentrant, as agreed above) is taken strictly more times in one branch than in the other; then, by taking an over-approximation of the total number of times the lock is taken across both branches, we have a chance to miss a race in the remainder of the program, as some access path can be recorded by the analyzer as having a larger lock context than it would have in a concrete execution. What comes to the rescue is assumption 2, which we will exploit in our proofs of the TP Theorem. It turns out that, in practice, the implementations of all classes we ran the analysis on (cf. Section 7) have well-scoped locking, mostly relying on Java’s synchronized primitive.
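For concreteness, the pointwise join on abstract triples can be sketched as follows. This Python fragment is our own rendering, assuming that wobbly and access sets join by union and lock contexts by max; under balanced locking (assumption 2) both branches reach a join point with equal lock contexts, so the max loses nothing:

```python
# A sketch of the pointwise join on (W, L, A) triples: wobbly sets and
# access sets join by union, lock contexts by max. With balanced locking
# the two lock contexts coincide at every join point, making max exact.

def join(q1, q2):
    (w1, l1, a1), (w2, l2, a2) = q1, q2
    return (w1 | w2, max(l1, l2), a1 | a2)
```
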

The definition of the RacerDX analysis for arbitrary compound programs is given in Figure 7. The abstract transition function relies on three primitives that account for the three corresponding components of the abstract domain. In the case of method calls, the analysis also takes advantage of its own compositionality, adapting a summary of a method to its caller context and the actual arguments.

What does the analysis achieve?

First, the results of the analysis from Figure 7 are used to construct a set of race candidates in RacerDX. Following the original RacerD algorithm described by Blackshear et al. (2018), if, for a pair of method calls (possibly of the same method) on the same class instance, the same syntactic access path appears in both methods’ summaries, it might be deemed a race, depending on the nature of each access (read or write) and the locking context at which it was captured.
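A minimal sketch of this candidate construction follows. In our encoding a summary is a set of (path, kind, lock_context) triples; the reporting rule shown (at least one write, and at least one access outside the lock) is a simplification, not RacerD's exact policy:

```python
# A sketch of candidate-race construction from two method summaries.
# A pair of accesses to the same path is a candidate when at least one
# is a store and at least one was captured with lock context 0.

def candidates(sum1, sum2):
    out = set()
    for (p1, k1, l1) in sum1:
        for (p2, k2, l2) in sum2:
            if p1 == p2 and "store" in (k1, k2) and 0 in (l1, l2):
                out.add(p1)
    return out
```
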

Second, the gathered wobbliness information comes into play. If a path that is a proper prefix of a reported path is identified as wobbly by the analysis, a race on the longer path might be a false positive. That is, wobbly paths “destabilise” the future results of the analysis, allowing the values underlying the path to “escape” the race, thus rendering it a false positive, which is precisely what we are aiming to avoid. Therefore, in order to report only true races, RacerDX removes from the final reports all paths that were affected by wobbly prefixes.
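The filtering step itself is simple; here is a Python sketch over paths encoded as tuples (our own representation):

```python
# A sketch of the final filtering step: drop any reported path that has
# a wobbly proper prefix, since its underlying value may "escape" the race.

def is_proper_prefix(p, q):
    return len(p) < len(q) and q[:len(p)] == p

def filter_reports(reports, wobbly):
    return {q for q in reports
            if not any(is_proper_prefix(w, q) for w in wobbly)}
```
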

In Sections 5–6 we establish this completeness guarantee formally, showing that races reported on non-destabilised accesses are indeed true positives, in the sense that there exists a pair of execution traces for a pair of method calls on the same class instance that exhibits the behaviour described by Definition 3.6. Furthermore, the evaluation of RacerDX in Section 7 provides practical evidence that the notion of wobbliness does not remove too many reports, i.e., that the analysis remains effective while being precise, under assumptions 1–6.

5. Towards TP Theorem, Part I: Analysis Completeness

In this section, we deliver on the first part of the agenda towards the True Positives Theorem: the definition of the RacerDX abstraction (wrt. the concrete semantics), and the proof of completeness of the analysis with respect to it. That is, informally, if for a program the analysis reports a certain path accessed under a locking context, then there exist an initial configuration and an execution trace such that the trace contains a command accessing that path with that locking context. To establish this, we follow the general approach of the Abstract Interpretation framework (Cousot, 1978; Cousot and Cousot, 1977), aiming for completeness, i.e., no information loss wrt. the chosen abstraction (Ranzato, 2013), stated in terms of a traditional abstract transition function and an abstraction from a concrete domain of sets of traces.

We first formulate the abstraction and prove its desirable properties. The remainder of this formal development takes a “spiral” pattern, coming in two turns: we establish the completeness of the analysis for straight-line programs (turn one), and then lift this proof to the inter-procedural case (turn two). In most cases, we provide only statements of the theorems, referring the reader to Appendix A for auxiliary definitions and proofs.

5.1. RacerDX Abstraction for Trace-Collecting Semantics

The abstraction connects the results of the analyzer to the elements of the concrete semantics, i.e., traces. It is natural to restrict the considered traces and states to such that could indeed be produced by executions. We will refer to them as well-formed (WF) and well-behaved (WB).

Definition 5.1 ().

An execution trace is well-formed if for each two subsequent states and , the stack and heap of can be obtained by executing the simple command in ’s first component with respect to the stack/heap of .

It is easy to show that any trace produced by the trace-collecting semantics of a program is well-formed.

Definition 5.2 ().

A program state is well-behaved iff (a) for any appearing in and any variable , , and, (b) for any address , .

We remark that dangling pointers do not occur in Java, and we reproduce a similar result for our language below.

Lemma 5.3 (Preservation of well-behavedness).

Let be a non-empty, well-formed trace, whose starting state is well-behaved. Then, every state in is well-behaved.

Abstract domain and abstraction function

The analysis’s domain (Definition 4.2) is rather coarse-grained: it does not feature any information about runtime heaps or stacks. To bring the concrete traces closer to abstract summaries, for an execution trace , we define syntactic trace as a list, obtained by taking only the first components of each of ’s elements. That is, . We now define the abstraction function , from the lattice of sets of execution traces to , which corresponds to “folding” a syntactic trace, encoding a run of straight-line program (equivalent to the original program for this particular execution), left-to-right, to an abstract state recursively.

where and

Figure 8. An auxiliary function for executing syntactic traces to compute RacerDX abstraction.
Definition 5.4 (Abstraction Function).

For a set of well-formed traces ,

and the auxiliary folding function is defined in Figure 8 (we overload the list notation to denote appending an element to a list).

The component in the state carried forward by is a stack of substitutions (from formals, , to paths) which mirrors the call stack in a concrete execution. Whenever an access occurs, the substitutions  are immediately applied, and the result is a path that is rooted at a top-level variable . Accesses rooted at local variables (any variable which is not in the syntactic form or not in the domain of the top-most substitution) are discarded (the substitution function returns an empty set). Similarly, the set is populated with expressions that have the substitutions already applied, and similarly discarded if rooted at a local variable.
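The substitution discipline can be sketched in a few lines of Python (our own encoding: paths are tuples, and a substitution maps formal names to caller paths):

```python
# A sketch of applying the top-of-stack substitution when recording an
# access inside a method body: accesses rooted at a formal are rewritten
# to caller paths; accesses rooted at locals are discarded (None).

def apply_subst(path, subst):
    root, *rest = path
    if root in subst:                      # rooted at a formal
        return subst[root] + tuple(rest)   # splice the caller's path in
    return None                            # local variable: discard
```
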

The call and return commands accordingly manipulate the substitution stack, while the call command also adds certain extra paths to the wobbly set; this is to avoid the effects of aliasing of paths rooted at different formals inside the method body. That is, paths can become wobbly because parts of the same path have been provided as parameters in a method call. The full reasoning behind this definition will become clearer in Section 6, where it is used in the construction of a memory state in which there is exactly one parameter pointing to the heap-image of the racy path.

We next establish a number of facts about necessary for the proof of our analysis completeness.

Lemma 5.5 (Additivity of ).

is additive (i.e., preserves lubs) with respect to and .

Thanks to Lemma 5.5, we can define the (monotone) Galois connection between the two complete lattices (Cousot and Cousot, 1992). Having a Galois connection between the concrete and abstract domains, in conjunction with completeness of the analysis (Theorem 5.15), allows us to argue for the presence of certain accesses in some concrete trace of a program if the analysis reports them. This is due to the following fact, establishing that if the abstraction records an access in a certain trace, then such a path was indeed present in its argument:

Lemma 5.6 (Path access existence).

Let be a set of traces, , and (where or ) is a query about the access path in the locking context . If then there exist a trace and a non-empty, shortest prefix such that

  • the last state of is and , are both stores or loads;

  • where ;

  • a path such that or , and .


By the definition of the abstraction function and the properties of the folding function, it follows that there must exist a trace with the required access. The other elements follow directly from the definitions. ∎

5.2. Proving that the Analysis Loses No Information

We structure the proof in two stages: first considering only straight-line programs with no method calls (Section 5.2.1) and then lifting it to programs with finite method call hierarchies (Section 5.2.2). We do not consider the cases with recursive calls (cf. Assumption 5).

5.2.1. Intra-procedural case

The abstract transfer function of the analysis for simple commands is defined in Figure 7 as . We first prove the completeness for simple commands for a singleton-trace concrete domain.

Lemma 5.7 (Analysis is complete for simple commands (per-trace)).

For any non-empty WF trace , sets , , number , and a simple command , which is not , such that (a) , (b) , where is the configuration of the last element of , the following holds:

The following result lifts the reasoning of Lemma 5.7 from singletons to sets of arbitrary traces.

Lemma 5.8 (Analysis of simple commands is complete for sets of traces).

For any set of non-empty well-formed traces, , , , and simple command , which is not , such that (a) , (b) for any , , then

The preservation of equality of abstract results in Lemmas 5.7 and 5.8 is quite noteworthy: for straight-line programs (and, in fact, for any program in our IL) the analysis is precise, i.e., we do not lose information wrt. locking context, wobbliness, or access paths, and hence can include elements into the wobbly component as part of the abstraction machinery without loss of precision.

We now lift these facts to the analysis for compound programs without method calls. Recall that we only consider programs with balanced locking (Assumption 2), i.e., such that within them

  • the commands and are only allowed to appear in balanced pairs (including their appearances in nested method calls) within conditional branches and looping statements, and,

  • every command has a matching .

The following two lemmas are going to exploit this fact for proving the analysis completeness.

Lemma 5.9 (RacerDX analysis and balanced locking).

If  is a compound program with balanced locking and , then .

Lemma 5.10 (Balanced locking and syntactic traces).

For any program with balanced locking and well-behaved states and and ,

As a corollary, .

The proof of Lemma 5.10 hinges on the following observations:

  • Balanced locking ensures that , which is the case for any intermediate state of the traces of the programs in consideration.

  • Well-behavedness ensures that the set of traces for loads and stores is non-empty.

  • Well-behavedness is preserved along traces (cf. the semantics of ).

That is, for compound programs with balanced locking, the lock component of the abstraction remains unchanged by the end of execution, for both the analysis and the abstraction over the concrete semantics. The following lemma delivers completeness in the intra-procedural case.

Lemma 5.11 ().

For any compound program with balanced locking and no method calls, the starting components , , , and a set of well-formed non-empty traces , such that ,

where is well-defined, as consists of non-empty traces.


By induction on the program structure and Lemma 5.8. Details are in Appendix A. ∎

5.2.2. Inter-procedural case

The main technical hurdle in extending the statement of Lemma 5.8 to the inter-procedural case (i.e., allowing for method calls) is the gap between the semantic implementation of method calls via a stack-management discipline (Figure 5) and the treatment of method summaries by the analysis, via explicit substitutions (Figure 7). To address this, we introduce the following convention, enforced by RacerDX’s intermediate language representation:

Definition 5.12 (ANF).

We say that a method is in Argument-Normal Form (ANF) if no simple command in its body has a formal parameter as its left-hand side.
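On a toy command representation (each simple command as a (lhs, rhs) pair of strings, with formals assumed here to be named p1, p2, …, which is our convention, not the paper's elided one), the ANF condition is a one-liner:

```python
# A sketch of the ANF condition of Definition 5.12: a body is in ANF
# when no command's left-hand side is a bare formal parameter.
import re

def in_anf(body):
    return all(not re.fullmatch(r"p\d+", lhs) for lhs, _ in body)
```
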

Intuitively, this requirement enforces a “sanitisation” of used arguments from the set , so one could substitute them with some (non-variable) paths without disrupting the syntactic structure of a compound program. Let us denote as the set of paths