Causality-Guided Adaptive Interventional Debugging

03/21/2020 ∙ by Anna Fariha, et al. ∙ University of Massachusetts Amherst ∙ Microsoft

Runtime nondeterminism is a fact of life in modern database applications. Previous research has shown that nondeterminism can cause applications to intermittently crash, become unresponsive, or experience data corruption. We propose Adaptive Interventional Debugging (AID) for debugging such intermittent failures. AID combines existing statistical debugging, causal analysis, fault injection, and group testing techniques in a novel way to (1) pinpoint the root cause of an application's intermittent failure and (2) generate an explanation of how the root cause triggers the failure. AID works by first identifying a set of runtime behaviors (called predicates) that are strongly correlated to the failure. It then utilizes temporal properties of the predicates to (over)-approximate their causal relationships. Finally, it uses fault injection to execute a sequence of interventions on the predicates and discover their true causal relationships. This enables AID to identify the true root cause and its causal relationship to the failure. We theoretically analyze how fast AID can converge to the identification. We evaluate AID with six real-world applications that intermittently fail under specific inputs. In each case, AID was able to identify the root cause and explain how the root cause triggered the failure, much faster than group testing and more precisely than statistical debugging. We also evaluate AID with many synthetically generated applications with known root causes and confirm that the benefits also hold for them.

1 Introduction

Modern data management systems and database-backed applications run on commodity hardware and heavily rely on asynchronous and concurrent processing [DBLP:journals/cacm/DeanG08, DBLP:journals/sigops/Herlihy92, DBLP:conf/mss/ShvachkoKRC10, mysql]. As a result, runtime nondeterminism, such as transient faults and variability in timing and thread scheduling, is a fact of life. As many bug reports show, these systems often contain software bugs related to handling nondeterminism. Previous studies reported such bugs in MySQL [DBLP:conf/asplos/LuPSZ08, bovenzi2012aging], PostgreSQL [lu2007muvi], NoSQL systems [leesatapornwongsa2016taxdc, yuan2014simple], and database-backed applications [bailis2015feral], and showed that the bugs can cause crashes, unresponsiveness, and data corruption. It is, therefore, crucial to identify and fix these bugs as early as possible.

Unfortunately, localizing root causes of intermittent failures is extremely challenging [luo2014empirical, DBLP:conf/icml/ZhengJLNA06, DBLP:conf/icsm/LiuQWM14]. For example, concurrency bugs such as deadlocks, order and atomicity violations, and race conditions may appear only under very specific thread interleavings. Even when an application executes with the same input in the same environment, these bugs may appear only rarely. When a concurrency bug is confirmed to exist, the debugging process is further complicated by the fact that the bug cannot be consistently reproduced. Heavy-weight techniques based on record-replay [DBLP:conf/osdi/AttariyanCF12] and fine-grained tracing with lineage [DBLP:conf/sigmod/AlvaroRH15] can provide insights on root causes after a bug manifests, but their runtime overheads often interfere with thread timing and scheduling, making it even harder for the intermittent bugs to manifest in the first place [lam2019root].

Statistical Debugging (SD) [DBLP:conf/kbse/JonesH05, statisticalDebuggingLiblit, Sober, crugLiblit] is a data-driven technique that partly addresses the above challenge. SD uses lightweight logging to capture an application’s runtime (mis)behaviors, called predicates. An example predicate indicates whether a method returns null in a particular execution or not. Given an application that intermittently fails, SD logs predicates from many successful and failed executions. SD then uses statistical analyses of the logs to identify discriminative predicates that are highly correlated with the failure.

SD has two key limitations. First, SD can produce many discriminative predicates that are correlated to, but not a true cause of, a failure. Second, SD does not provide enough insights that can explain how a predicate may eventually lead to the failure. Lack of such insights and the presence of many non-causal predicates make it hard for a developer to identify the true root cause of a failure. SD expects that a developer has sufficient domain knowledge about if/how a predicate can eventually cause a failure, even when the predicate is examined in isolation without additional context. This is often hard in practice, as is reported by real-world surveys [DBLP:conf/issta/ParninO11].

Example 1.

To motivate our work, we consider a recently reported issue in Npgsql [npgsql], an open-source ADO.NET data provider for PostgreSQL. On its GitHub repository, a user reported that a database application intermittently crashes when it tries to create a new PostgreSQL connection (GitHub issue #2485 [npgsqlBug2485]). The underlying root cause is a data race on an array index variable. The data race, which happens only when racing threads interleave in a specific way, causes one of the threads to access beyond the size of the array. This causes an exception that crashes the application.

We used SD to localize the root cause of this nondeterministic bug (more details are in Section 7). SD identified 14 predicates, only three of which were causally related to the error. Other predicates were just symptoms of the root cause or happened to co-occur with the root cause.

In Section 7, we describe five other case studies that show the same general problem: SD produces too many predicates, only a small subset of which are causally related to the failure. Thus, SD is not specific enough, and it leaves the developer with the task of identifying the root causes from a large number of candidates. This task is particularly challenging, since SD does not provide explanations of how a potential predicate can eventually lead to the failure.

In this paper, we address these limitations with a new data-driven technique called Adaptive Interventional Debugging (AID). Given predicate logs from successful and failed executions of an application, AID can pinpoint why the application failed, by identifying one predicate (or a small number of predicates) that indicates the real root cause, instead of producing a large number of potentially unrelated predicates. Moreover, AID can explain how the root cause leads to the failure, by automatically generating a causal chain of predicates linking the root cause, subsequent effects, and the failure. By doing so, AID enables a developer to quickly localize (and fix) the bug, even without deep knowledge about the application.

AID achieves the above by combining SD with causal analysis [DBLP:journals/pvldb/MeliouRS14, DBLP:conf/sigmod/MeliouGNS11, DBLP:conf/mud/MeliouGMS10], fault injection [DBLP:conf/sigmod/AlvaroRH15, han1995doctor, DBLP:journals/tc/KanawatiKA95], and group testing [hwang-group-testing] in a novel way. Like SD, it starts by identifying discriminative predicates from successful and failed executions. In addition, AID uses temporal properties of the predicates to build an approximate causal DAG (Directed Acyclic Graph), which contains a superset of all true causal relationships among predicates. AID then starts a sequence of rounds to progressively refine the approximate causal DAG. In each round, AID uses ideas from adaptive group testing to carefully select a subset of predicates. Then, AID re-executes the application during which it intervenes (i.e., injects faults) to forcefully alter values of the selected predicates. Depending on whether the intervention still causes the application to fail or not, AID confirms or discards causal relationships in the approximate causal DAG, assuming counterfactual causality ($C$ is a counterfactual cause of $E$ iff $E$ would not occur unless $C$ occurs) and a single root cause. A sequence of interventions enables AID to identify the root cause and generate a causal explanation path, a sequence of causally-related predicates that connect the root cause to the failure.

A key benefit of AID is its efficiency—it can identify root-cause and explanation predicates with significantly fewer rounds of interventions than adaptive group testing. In group testing, predicates are considered independent and hence each round can select a random subset of predicates to intervene on and make causality decisions about only those intervened predicates. In contrast, AID uses potential causality among predicates (in the approximate causal DAG). This enables AID to (1) make decisions not only about the intervened predicates, but also about other predicates; and (2) carefully select predicates whose intervention would maximize the effect of (1). Through theoretical and empirical analyses we show that this can significantly reduce the number of required interventions. This is an important benefit in practice since each round of intervention involves executing the application with fault injection and hence is time-consuming.

We evaluated AID on three open-source applications: Npgsql, Apache Kafka, and Microsoft Azure Cosmos DB, and three proprietary applications. We used known issues that cause these applications to intermittently fail even for the same inputs. In each case, AID was able to successfully identify the root cause of the failure and generate an explanation that is consistent with the explanation provided by the respective developers. Moreover, AID achieved this with significantly fewer interventions than a traditional adaptive group testing technique. We also evaluated AID with a set of synthetic workloads. The results show that AID requires fewer interventions than traditional adaptive group testing, and has significantly better worst-case performance than the other variants.

In summary, we make the following contributions:

  • We propose Adaptive Interventional Debugging (AID), a diagnostic technique that localizes the root cause of an intermittent failure through a novel combination of statistical debugging, causal analysis, fault injection, and group testing (Section 2). AID provides significant benefits over the state-of-the-art Statistical Debugging (SD) techniques by (1) pinpointing the root cause of an application’s failure and (2) generating an explanation of how the root cause triggers the failure (Sections 3-5). In contrast, SD techniques generate a large number of potential causes without explaining how a potential cause may trigger the failure.

  • We use information theoretic analysis to show that AID, by utilizing causal relationship among predicates, can converge to the true root cause and explanation significantly faster than traditional adaptive group testing (Section 6).

  • We evaluate AID with six real-world applications that intermittently fail under specific inputs (Section 7). AID was able to identify the root causes and explain how the root causes triggered the failure, much faster than adaptive group testing and more precisely than SD. We also evaluate AID with many synthetically generated applications with known root causes and confirm that the benefits hold for them as well.

2 Background and Preliminaries

AID combines several existing techniques in a novel way. We now briefly review the techniques.

Statistical Debugging

Statistical debugging (SD) aims to automatically pinpoint likely causes for an application’s failure by statistically analyzing its execution logs from many successful and failed executions. It works by instrumenting an application to capture runtime predicates about the application’s behavior. Examples of predicates include “the program takes the false branch at line 31”, “the method foo() returns null”, etc. Executing the instrumented application generates a sequence of predicate values, which we refer to as predicate logs. Without loss of generality, we assume that all predicates are Boolean.

Intuitively, the true root cause of the failure will cause certain predicates to be true only in the failed logs (or, only in the successful logs). Given logs from many successful executions and many failed executions of an application, SD aims to identify those discriminative predicates. Discriminative predicates encode program behaviors of failed executions that deviate from the ideal behaviors of the successful executions. Without loss of generality, we assume that discriminative predicates are true during failed executions. The predicates can further be ranked based on their precision and recall, two well-known metrics that capture their discriminatory power.
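As a rough illustration of the ranking step described above, the following Python sketch computes precision and recall for each predicate and keeps the fully-discriminative ones. The log format (a set of observed predicates plus a failure flag) and the predicate names are assumptions made for this example; they are not taken from any SD tool.

```python
from typing import Dict, List, Set, Tuple

# One predicate log: (predicates observed as true, whether the run failed).
# This representation is an assumption made for the sketch.
Log = Tuple[Set[str], bool]


def precision_recall(logs: List[Log]) -> Dict[str, Tuple[float, float]]:
    """Return {predicate: (precision, recall)} with respect to failure."""
    failed_total = sum(1 for _, failed in logs if failed)
    all_preds = set().union(*(preds for preds, _ in logs))
    scores = {}
    for pred in sorted(all_preds):
        seen = sum(1 for preds, _ in logs if pred in preds)
        seen_failed = sum(1 for preds, failed in logs if failed and pred in preds)
        precision = seen_failed / seen  # seen >= 1 for every collected predicate
        recall = seen_failed / failed_total if failed_total else 0.0
        scores[pred] = (precision, recall)
    return scores


def fully_discriminative(logs: List[Log]) -> List[str]:
    """Predicates that are true in every failed run and in no successful run."""
    return [p for p, (prec, rec) in precision_recall(logs).items()
            if prec == 1.0 and rec == 1.0]


# Hypothetical logs: two failed runs and two successful runs.
logs: List[Log] = [
    ({"race_on_index", "throws_exception", "slow_gc"}, True),
    ({"race_on_index", "throws_exception"}, True),
    ({"slow_gc"}, False),
    (set(), False),
]
scores = precision_recall(logs)
discriminative = fully_discriminative(logs)
```

Here `slow_gc` appears in one failed and one successful run, so it is correlated with neither outcome and is filtered out, while the other two predicates survive.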

Causality

Informally, causality characterizes the relationship between an event and an outcome: the event is a cause if the outcome is a consequence of the event. There are several definitions of causality [HP01, DBLP:journals/amai/Pearl11]. In this work, we focus on counterfactual causes. According to counterfactual causality, C causes E iff E would not occur unless C occurs. Reasoning about causality frequently relies on a mechanism for interventions [Pearl2000, Hitchcock2015, woodward2003making, Spirtes2000], where one or more variables are forced to particular values, while the mechanisms controlling other variables remain unperturbed. Such interventions uncover counterfactual dependencies between variables.

Trivially, executing a program is a cause of its failure: if the program had not been executed in the first place, the failure would not have occurred. However, our analysis targets fully-discriminative predicates (with 100% precision and 100% recall), thereby eliminating such trivial predicates that are program invariants.

Fault Injection

In software testing, fault injection [DBLP:conf/sigmod/AlvaroRH15, han1995doctor, DBLP:journals/tc/KanawatiKA95, marinescu2009lfi] is a technique to force an application, by instrumenting it or by manipulating the runtime environment, to execute a different code path than usual. We use the technique to intervene on (i.e., repair) discriminative predicates. Consider a method ExecQuery() that returns a result object in all successful executions and null in all failed executions. Then, the predicate “ExecQuery() returns null” is discriminative. The predicate can be intervened by forcing ExecQuery() to return the correct result object. Similarly, the predicate “there is a data race on X” can be intervened by delaying one access to X or by putting a lock around the code segments that access X to avoid simultaneous accesses to X.
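The return-value intervention described above can be sketched in a few lines of Python. Everything here is a toy stand-in: `exec_query`, `application`, the 30% failure rate, and `CORRECT_RESULT` are hypothetical and only mimic the ExecQuery() example; a real fault injector would instrument the application or its runtime instead of wrapping a function.

```python
import random
from typing import Callable, List, Optional

# Hypothetical value that the query returns in all successful executions.
CORRECT_RESULT = {"rows": 3}


def exec_query(rng: random.Random) -> Optional[dict]:
    """Toy stand-in for ExecQuery(): intermittently returns None."""
    return None if rng.random() < 0.3 else dict(CORRECT_RESULT)


def application(query: Callable[[], Optional[dict]]) -> bool:
    """Return True if the run succeeds and False if it fails."""
    result = query()
    try:
        return result["rows"] == 3
    except TypeError:  # None is not subscriptable: this is the failure
        return False


def intervene_return_value(query: Callable[[], Optional[dict]],
                           correct: dict) -> Callable[[], dict]:
    """Force the predicate 'query returns null' to false."""
    def repaired() -> dict:
        result = query()
        return correct if result is None else result
    return repaired


def run_many(n: int, seed: int, repair: bool) -> List[bool]:
    """Execute the toy application n times, optionally under intervention."""
    rng = random.Random(seed)
    query: Callable[[], Optional[dict]] = lambda: exec_query(rng)
    if repair:
        query = intervene_return_value(query, CORRECT_RESULT)
    return [application(query) for _ in range(n)]


plain_runs = run_many(200, seed=0, repair=False)
repaired_runs = run_many(200, seed=0, repair=True)
```

In this toy setting the failure disappears under the intervention, which is the counterfactual signal that the predicate causes the failure.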

Group Testing

Given a set of discriminative predicates, a naïve approach to identify which predicates cause the failure is to intervene on one predicate at a time and observe if the intervention causes an execution to succeed. However, the number of required interventions is then linear in the number of predicates. Group testing reduces the number of interventions.

Group testing refers to the procedure that identifies certain items (e.g., defective) among a set of items while minimizing the number of group tests required. Formally, given a set $\mathcal{P}$ of $N$ elements where $D$ of them are defective, group testing performs $k$ group tests, each on a group $\mathcal{P}_i \subseteq \mathcal{P}$. The result of a test on group $\mathcal{P}_i$ is positive if $\exists P \in \mathcal{P}_i$ such that $P$ is defective, and negative otherwise. The objective is to minimize $k$, i.e., the number of group tests required. In our context, a group test is a simultaneous intervention on a group of predicates, and the goal is to identify the predicates that cause the failure.

Two variations of group testing are studied in the literature: adaptive and non-adaptive. Our approach is based on adaptive group testing where the $i$-th group to test is decided after we observe the results of all previous group tests. A trivial upper bound for adaptive group testing [hwang-group-testing] is $O(D \log N)$. A simple binary search algorithm can find each of the $D$ defective items in at most $\log N$ group tests, and hence a total of $D \log N$ group tests are sufficient to identify all defective items. Note that if $D \ge N / \log N$, then a linear strategy is preferable over any group testing scheme. Hence, we assume that $D < N / \log N$.
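The adaptive idea can be illustrated with a small recursive-halving sketch in Python. This is a simplified variant written for this example, not the exact binary search procedure from the group testing literature: it tests a group, discards it if the test is negative, and otherwise splits it in half. The predicate names and the `intervention_stops_failure` oracle are hypothetical; in AID's setting the oracle would correspond to re-executing the application under a group intervention.

```python
from typing import Callable, List, Sequence, Set, Tuple


def find_defectives(items: Sequence[str],
                    test: Callable[[Sequence[str]], bool]) -> Set[str]:
    """Adaptive group testing by recursive halving.

    test(group) must return True iff the group contains a defective item.
    """
    found: Set[str] = set()

    def search(group: List[str]) -> None:
        if not group or not test(group):
            return  # a negative test clears the whole group at once
        if len(group) == 1:
            found.add(group[0])
            return
        mid = len(group) // 2
        search(group[:mid])
        search(group[mid:])

    search(list(items))
    return found


# Hypothetical setting: 16 candidate predicates, one of which is causal.
predicates = [f"p{i}" for i in range(16)]
root_causes = {"p5"}
tested_groups: List[Tuple[str, ...]] = []


def intervention_stops_failure(group: Sequence[str]) -> bool:
    """Stand-in for one re-execution with the group intervened on."""
    tested_groups.append(tuple(group))
    return any(p in root_causes for p in group)


found = find_defectives(predicates, intervention_stops_failure)
```

Each negative test eliminates a whole group in one execution, which is why the number of tests stays below the one-at-a-time strategy when defective items are rare.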

3 Adaptive Interventional Debugging

Adaptive Interventional Debugging (AID) targets applications (e.g., flaky tests [luo2014empirical]) that, even with the same inputs, intermittently fail due to various runtime nondeterminism such as thread scheduling and timing. Given predicate logs of successful and failed executions of an application, the goals of AID are to (1) identify what predicate actually causes the failure, and (2) generate an explanation of how the root cause leads to the failure (via a sequence of intermediate predicates). This is in contrast with traditional statistical debugging, which generates a set of potential root-cause predicates (often a large number), without any explanation of how each potential root cause may lead to the failure.

3.1 AID Overview

Figure 1 shows an overview of AID. First, the framework employs standard SD techniques on predicate logs to identify a set of fully-discriminative predicates, i.e., predicates that always appear in the failed executions and never appear in the successful executions. Then, AID uses the temporal relationships of predicates to infer approximate causality: if $P_1$ temporally precedes $P_2$ in all logs where they both appear, then $P_1$ may cause $P_2$. AID represents this approximate causality in a DAG called the Approximate Causal DAG (AC-DAG), where predicates are nodes and edges indicate these possible causal relationships. We describe the AC-DAG in Section 4.

Figure 1: Adaptive Interventional Debugging workflow.

Based on its construction, the AC-DAG is guaranteed to contain all the true root-cause predicates and causal relationships among predicates. However, it may also contain additional predicates and edges that are not truly causal. The key insight of AID is that we can refine the AC-DAG and prune the non-causal nodes and edges through a sequence of interventions. To intervene on a predicate, AID changes the application’s behavior through fault injection so that the predicate’s value matches its value in successful executions. If the failure does not occur under the intervention, then, based on counterfactual causality, the predicate is guaranteed to be a root cause of the failure. Over several iterations, AID intervenes on a set of carefully chosen predicates, refines the set of discriminative predicates, and prunes the AC-DAG, until it discovers the true root cause and the path that leads to the failure. We describe the intervention mechanism of AID in Section 5.

We now describe how AID adapts existing approaches in SD and fault injection for two of its core ideas: predicates and interventions. We refer to the Appendix for additional details and discussion.

3.2 AID Predicates

Predicate design: Similar to traditional SD techniques, AID is effective only if the initial set of predicates (in the predicate logs) contains a root-cause predicate that causes the failure. Predicate design is orthogonal to AID. We use predicates used by existing SD techniques, especially the ones used for finding root causes of concurrency bugs [crugLiblit], a key reason behind intermittent failures [luo2014empirical]. Figure 2 shows examples of predicates in AID (column 1).

Predicate extraction: AID automatically instruments a target application to generate its execution trace (see Appendix). The trace contains each executed method’s start and end time, its thread id, ids of objects it accesses, return values, whether it throws an exception or not, and so on. This trace is then analyzed offline to evaluate a set of predicates at each execution point. This results in a sequence of predicates, called a predicate log. The instrumented application is executed multiple times with the same input, to generate a set of predicate logs, each labeled as a successful or failed execution. Figure 2 shows the runtime conditions used to extract predicates (column 2).
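To make the offline analysis step concrete, here is a minimal Python sketch that evaluates three of the predicate types from Figure 2 ("fails", "runs too fast", "runs too slow") over a trace. The `Call` record, the method names, and the timings are assumptions made for illustration; AID's actual trace format is richer (thread ids, object ids, return values).

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Call:
    """One traced method call (hypothetical, simplified trace record)."""
    method: str
    start: int
    end: int
    threw: bool = False


def duration_bounds(successful: List[List[Call]]) -> Dict[str, Tuple[int, int]]:
    """Min and max duration of each method over all successful executions."""
    bounds: Dict[str, Tuple[int, int]] = {}
    for trace in successful:
        for call in trace:
            d = call.end - call.start
            lo, hi = bounds.get(call.method, (d, d))
            bounds[call.method] = (min(lo, d), max(hi, d))
    return bounds


def extract_predicates(trace: List[Call],
                       bounds: Dict[str, Tuple[int, int]]) -> Set[str]:
    """Evaluate the extraction conditions on one execution trace."""
    preds: Set[str] = set()
    for call in trace:
        d = call.end - call.start
        if call.threw:
            preds.add(f"{call.method} fails")
        if call.method in bounds:
            lo, hi = bounds[call.method]
            if d < lo:
                preds.add(f"{call.method} runs too fast")
            if d > hi:
                preds.add(f"{call.method} runs too slow")
    return preds


# Hypothetical traces: two successful executions and one failed execution.
ok_1 = [Call("Open", 0, 2), Call("Read", 2, 3)]
ok_2 = [Call("Open", 0, 3), Call("Read", 3, 4)]
bad = [Call("Open", 0, 5), Call("Read", 5, 6, threw=True)]

bounds = duration_bounds([ok_1, ok_2])
failed_preds = extract_predicates(bad, bounds)
ok_preds = extract_predicates(ok_1, bounds)
```

Running the same extraction on every labeled execution would yield the set of predicate logs that the rest of the pipeline consumes.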

Modeling nondeterminism: In practice, some predicates may cause a failure nondeterministically: two predicates A and B in conjunction cause a failure. AID does not consider such predicates since they are not fully discriminative (recall < 100%). However, AID can still model these cases with compound predicates, adapted from state-of-the-art SD techniques [crugLiblit], which model conjunctions. These compound predicates (“A and B”) would deterministically cause the failure and hence be fully discriminative. Note that AID focuses on counterfactual causality and thus does not support disjunctive root causes (as they are not counterfactual). In Section 5, we discuss AID’s assumptions and their impact in practice.

| (1) Predicate | (2) Extraction condition | (3) Intervention mechanism |
| --- | --- | --- |
| There is a data race involving methods $M_1$ and $M_2$ | $M_1$ and $M_2$ temporally overlap accessing some object $X$ while one of them is a write | Put locks around the code segments within $M_1$ and $M_2$ that access $X$ |
| Method $M$ fails | $M$ throws an exception | Put $M$ in a try-catch block |
| Method $M$ runs too fast | $M$’s duration is less than the minimum duration for $M$ among all successful executions | Insert delay before $M$’s return statement |
| Method $M$ runs too slow | $M$’s duration is greater than the maximum duration for $M$ among all successful executions | Prematurely return from $M$ the correct value that $M$ returns in all successful executions |
| Method $M$ returns incorrect value | $M$’s return value $\neq x$, where $x$ is the correct value returned by $M$ in all successful executions | Alter $M$’s return statement to force it to return the correct value $x$ |

Figure 2: A few example predicates, the conditions used to extract them, and the corresponding interventions using fault injection.

3.3 AID Interventions

Intervention mechanism: AID uses an existing fault injection tool (similar to LFI [marinescu2009lfi]) to intervene on fully-discriminative predicates; interventions change a predicate to match its value in a successful execution. In a way, AID’s interventions try to locally “repair” a failed execution. Figure 2 shows examples of AID’s interventions (column 3). Most of the interventions rely on changing timing and thread scheduling that can occur naturally by the underlying execution environment and runtime. More specifically, AID can slow down the execution of a method (by injecting delays), force or prevent concurrent execution of methods in different threads (by using synchronization primitives such as locks), change the execution order of concurrent threads (by injecting delays), etc. Such interventions can repair many concurrency bugs.

Validity of intervention: AID supports two additional intervention types, return-value alteration and exception-handling, which, in theory, can have undesirable runtime side-effects. Consider two predicates: (1) method QueryAvgSalary fails returning null and (2) method UpdateSalary fails returning error. AID can intervene to match their return values in successful executions, e.g., 50 and OK, respectively. The intervention on the first predicate does not modify any program state and, as the successful execution shows, the return value 50 can be safely used by the application. However, altering the return value of UpdateSalary, but not updating the salary, may not be sufficient intervention: other parts of the application that rely on the updated salary may fail. Inferring such side-effects is hard, if not impossible.

AID is restricted to safe interventions. It relies on developers to indicate which methods do not change (internal or external) application states and limits return-value interventions to only those methods (e.g., to QueryAvgSalary, but not to UpdateSalary). The same holds for exception-handling interventions. AID removes from predicate logs any predicates that cannot be safely intervened without undesirable side-effects. This ensures that the rest of the AID pipeline can safely intervene on any subset of predicates. Excluding some interventions may limit AID’s precision, as it may eliminate a root-cause predicate. In such cases, AID may find another intervenable predicate that is causally related to the root cause, and is still useful for debugging. In our experiments (Section 7) we did not observe this issue, since the root-cause predicates were safe to intervene.

4 Approximating Causality

AID relies on traditional SD to derive a set of fully-discriminative predicates. Using the logs of successful and failed executions, AID extracts temporal relationships among these predicates, and uses temporal precedence to approximate causality. It is clear that in the absence of feedback loops, a cause temporally precedes an effect [DBLP:conf/kr/PearlV91]. To handle loops, AID considers multiple executions of the same program statement (e.g., within a loop, recursion, or multiple method calls) as separate instances, identified by their relative order of appearances during program execution, and maps them to separate predicates (see Appendix). This ensures that temporal precedence among predicates correctly over-approximates causality.

Approximate causal DAG. AID represents the approximation of causality in a DAG: each node represents a predicate, and an edge $P_1 \to P_2$ indicates that $P_1$ temporally precedes $P_2$ in all logs where both predicates appear. Figure 4(a) shows an example of the approximate causal DAG (AC-DAG). We use circles to explicitly depict junctions in the AC-DAG; junctions are not themselves predicates, but denote splits or merges in the precedence ordering of predicates. Therefore, each predicate has in- and out-degrees of at most 1, while junctions have in- or out-degrees greater than 1. Note that, for clarity of visuals, in our depictions of the AC-DAG, we omit edges implied by transitive closure. For example, an edge $P_1 \to P_3$ that is implied by $P_1 \to P_2$ and $P_2 \to P_3$ exists in the AC-DAG, but it is not depicted. AID enforces an assumption of counterfactual causality by excluding from the AC-DAG any predicates that were not observed in all failed executions: if some executions failed without manifesting $P$, then $P$ cannot be a cause of the failure.

Completeness of AC-DAG. The AC-DAG is complete with respect to the available, and safely-intervenable, predicates: it contains all fully-discriminative predicates that are safe to intervene, and if $P_1$ causes $P_2$, it includes the edge $P_1 \to P_2$. However, it may not be complete with respect to all possible true root causes, as a root cause may not always be represented by the available predicates (e.g., if the true root cause is a data race and no predicate is used to capture it). In such cases, AID will identify the (intervenable) predicate that is closest to the root cause and is causally related to the failure.

Since temporal precedence among predicates is a necessary condition for causality, the AC-DAG is guaranteed to contain the true causal relationships. However, temporal precedence is not sufficient for causality, and thus some edges in the AC-DAG may not be truly causal.
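The construction described in this section can be sketched as follows in Python. The sketch assumes each failed execution is summarized as a map from predicate name to a single timestamp, which is a simplification; the predicate names are hypothetical. It adds an edge whenever one predicate precedes another in every log where both appear, then drops edges implied by a two-hop path, mirroring how the depictions omit transitive edges.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def build_ac_dag(logs: List[Dict[str, int]]) -> Set[Edge]:
    """Edge (a, b) iff a precedes b in every log where both appear."""
    preds = sorted(set().union(*logs))
    edges: Set[Edge] = set()
    for a, b in permutations(preds, 2):
        together = [log for log in logs if a in log and b in log]
        if together and all(log[a] < log[b] for log in together):
            edges.add((a, b))
    return edges


def drop_transitive_edges(edges: Set[Edge]) -> Set[Edge]:
    """Remove edges (a, b) for which some m has both (a, m) and (m, b)."""
    nodes = {n for edge in edges for n in edge}
    return {(a, b) for (a, b) in edges
            if not any((a, m) in edges and (m, b) in edges for m in nodes)}


# Hypothetical timestamps from two failed executions. The relative order of
# bad_index and cache_miss differs, so no edge is inferred between them.
failed_logs = [
    {"race": 1, "bad_index": 2, "cache_miss": 3, "exception": 4},
    {"race": 1, "cache_miss": 2, "bad_index": 3, "exception": 4},
]
ac_dag = build_ac_dag(failed_logs)
drawn = drop_transitive_edges(ac_dag)
```

The resulting graph still contains `cache_miss`, even if it has nothing to do with the failure; temporal precedence alone cannot rule it out, which is why interventions are needed.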

Temporal precedence. Capturing temporal precedence is not always straightforward. For simplicity of implementation, AID relies on computer clocks, which works reasonably well in practice. Relying on computer clocks is not always precise as the time gap between two events may be too small for the granularity of the clock; moreover, events may occur on different cores or machines whose clocks are not perfectly synchronized. These issues can be addressed with the use of logical clocks such as Lamport’s Clock [DBLP:journals/cacm/Lamport78].
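For readers unfamiliar with logical clocks, the following Python sketch shows the standard Lamport clock update rules. It is a generic illustration, not part of AID's implementation, which, as stated above, relies on computer clocks; the class and method names are chosen for this example.

```python
class LamportClock:
    """Logical clock: a counter that respects the happened-before relation."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance the clock for a local event and return its timestamp."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Timestamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, message_time: int) -> int:
        """Merge the sender's timestamp, then advance past it."""
        self.time = max(self.time, message_time) + 1
        return self.time


# Two threads (or machines) whose wall clocks need not agree.
a, b = LamportClock(), LamportClock()
a.tick()                         # local event on a
sent_at = a.send()               # a sends a message
b.tick()                         # unrelated local event on b
received_at = b.receive(sent_at)
```

Because a receive is always stamped later than the matching send, ordering events by these timestamps never contradicts a causal dependency carried by a message.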

Another challenge is that some predicates are associated with time windows, rather than time points. The correct policy to resolve temporal precedence of two temporally overlapping predicates often depends on their semantics. However, the predicate types give important clues regarding the correct policy. In AID, predicate design involves specifying a set of rules that dictates the temporal precedence of two predicates. In constructing the AC-DAG, AID uses those rules.

For example, consider a scenario where foo() calls bar() and waits for bar() to end—so, foo() starts before but ends after bar().

  • (Case 1): Consider two predicates $P_1$: “foo() is running slow” and $P_2$: “bar() is running slow”. Here, $P_2$ can cause $P_1$ but not the other way around. In this case, AID uses the policy that end-time implies temporal precedence.

  • (Case 2): Now consider $P_1$: “foo() starts later than expected” and $P_2$: “bar() starts later than expected”. Here, $P_1$ can cause $P_2$ but not the other way around. Therefore, in this case, start-time implies temporal precedence.

AID works with any policy of deciding precedence, as long as it does not create cycles in the AC-DAG. Since temporal precedence is a necessary condition for causality, any conservative heuristic for deriving temporal precedence would work. A conservative heuristic may introduce more false positives (edges that are not truly causal), but those will be pruned by interventions (Section 5).
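The two cases above can be expressed as a small rule table in Python. This is only a sketch of the idea: the `WindowedPredicate` record, the `POLICY` table, and the predicate kinds are hypothetical, and the sketch covers only pairs of predicates of the same kind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowedPredicate:
    """A predicate tied to a time window (hypothetical record)."""
    name: str
    kind: str   # e.g. "runs_slow" or "starts_late"
    start: int
    end: int


# Hypothetical rule table: which endpoint of the window decides precedence.
POLICY = {"runs_slow": "end", "starts_late": "start"}


def precedes(p: WindowedPredicate, q: WindowedPredicate) -> bool:
    """True if p temporally precedes q under the rule for their kind."""
    if p.kind != q.kind:
        raise ValueError("this sketch only compares predicates of one kind")
    endpoint = POLICY[p.kind]
    return getattr(p, endpoint) < getattr(q, endpoint)


# foo() calls bar() and waits for it: foo starts first and ends last.
foo_slow = WindowedPredicate("foo() is running slow", "runs_slow", 0, 10)
bar_slow = WindowedPredicate("bar() is running slow", "runs_slow", 2, 8)
foo_late = WindowedPredicate("foo() starts later than expected", "starts_late", 0, 10)
bar_late = WindowedPredicate("bar() starts later than expected", "starts_late", 2, 8)

case_1 = precedes(bar_slow, foo_slow)   # end-time rule
case_2 = precedes(foo_late, bar_late)   # start-time rule
```

Because each rule compares a single numeric endpoint with a strict inequality, predicates of the same kind can never precede each other mutually, so the rule cannot introduce a cycle among them.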

Notation Description
Approximate causal DAG (AC-DAG)
Causal path
Failure indicating predicate
A predicate
Set of predicates
Predicate is observed in execution
Predicate is not observed in execution
There is a path from to in
Figure 3: Summary of notations used in Section 5.

5 Causal Intervention

In this section, we describe AID’s core component, which refines the AC-DAG through a series of causal interventions. An intervention on a predicate forces the predicate to a particular state; the execution of the application under the intervention asserts or contradicts the causal connection of the predicate with the failure, and AID prunes the AC-DAG accordingly. Interventions can be costly, as they require the application to be re-executed. AID minimizes this cost by (1) smartly selecting the proper predicates to intervene, (2) grouping interventions that can be applied in a single application execution, and (3) aggressively pruning predicates even without direct intervention, but based on outcomes of other interventions. Figure 3 summarizes the notations used in this section.

We start by formalizing the problem of causal path discovery and state our assumptions (Section 5.1). Then we provide an illustrative example to show how AID works (Section 5.2). We proceed to describe interventional pruning that AID applies to aggressively prune predicates during group intervention rounds (Section 5.3). Then we present AID’s causality-guided group intervention algorithm (Section 5.4) which administers group interventions to derive the causal path.

5.1 Problem Definition and Assumptions

Given an application that intermittently fails, our goal is to provide an informative explanation for the failure. To that end, given a set of fully-discriminative predicates $\mathcal{P}$, we want to find an ordered subset of $\mathcal{P}$ that defines the causal path from the root-cause predicate to the predicate indicating the failure. Informally, AID finds a chain of predicates that starts from the root-cause predicate, ends at the failure predicate, and contains the maximal number of explanation predicates such that each is caused by the previous one in the chain. We address the problem in a similar setting as SD, and make the following two assumptions:

Assumption 1 (Single Root-cause Predicate). The root cause of a failure is the predicate whose absence (i.e., a value of false) certainly avoids the failure, and there is no other predicate that causes the root cause. We assume that in all the failed executions, there is exactly one root-cause predicate.

This assumption is prevalent in the SD literature [statisticalDebuggingLiblit, Sober, crugLiblit], and is supported by several studies on real-world concurrency bug characteristics [DBLP:conf/asplos/LuPSZ08, wong2009survey, DBLP:conf/icsm/VahabzadehF015], which show that a vast majority of root causes can be captured with reasonably simple single predicates (see Appendix). In practice, even with specific inputs, a program may fail in multiple ways. However, failures by the same root cause generate a unique failure signature and hence can be grouped together using metadata (e.g., stack trace of the failure, location of the failure in the program binary, etc.) collected by failure trackers [DBLP:conf/sosp/GlerumKGAONGLH09]. AID can then treat each group separately, targeting a single root cause for a specific failure. Moreover, the single-root-cause assumption is reasonable in many simpler settings such as unit tests that exercise small parts of an application.

Note that this assumption does not imply that the root cause consists of a single event; a predicate can be arbitrarily complex to capture multiple events. For example, the predicate “there is a data race on X” is true when two threads access the same shared memory X at the same time, the accesses are not lock-protected, and one of the accesses is a write operation. Whether a single predicate is sufficient to capture the root cause depends on predicate design, which is orthogonal to AID. AID adapts the state-of-the-art predicate design, tailored to capture root causes of concurrency bugs [crugLiblit], which is sophisticated enough to capture all common root causes using single predicates. If no single predicate captures the true root cause, AID still finds the predicate closest to the true root cause in the true causal path.

Assumption 2 (Deterministic Effect). A root-cause predicate, if triggered, causes a fixed sequence of intermediate predicates (i.e., effects) before eventually causing the failure. We call this sequence the causal path, and we assume that there is a unique one for each root-cause-failure pair.

Prior work has considered, and shown evidence of, a unique causal path between a root cause and the failure in sequential applications [sumner2009algorithms, DBLP:conf/sp/JohnsonCCMPRS11]. The unique causal path assumption is likely to hold in concurrent applications as well for two key reasons. First, the predicates in AID’s causal path may remain unchanged, despite nondeterminism in the underlying instruction sequence. For example, the predicate “there is a data race between methods X and Y” is not affected by which method starts first, as long as they temporally overlap. Second, AID only considers fully-discriminative predicates. If such predicates exist to capture the root cause and its effects, by the definition of being fully discriminative, there will be a unique causal path (of predicates) from the root cause to the failure. In all six of our real-world case studies (Section 7), such predicates existed and there were unique causal paths from the root causes to the failures.

Note that it is possible to observe some degree of disjointness within the true causal paths. For example, consider a case where the root cause triggers the failure in two ways: in some failed executions, the causal path is and, for others, . This implies that neither nor is fully discriminative. Since AID only considers fully-discriminative predicates, both of them are excluded from the AC-DAG. In this case, AID reports as the causal path; this is the shared part of the two causal paths, which includes all counterfactual predicates and omits any disjunctive predicates. One could potentially relax this assumption by encoding the interaction of such predicates through a fully-discriminative predicate (e.g., encodes disjunction and is fully discriminative).

Based on these assumptions, we define the causal path discovery problem formally as follows.

Definition 1 (Causal Path Discovery).

Given an approximate causal DAG and a predicate indicating a specific failure, the causal path discovery problem seeks a path such that the following conditions hold:

  • is the root cause of the failure and .

  • , and , , .

  • , is a counterfactual cause of .

  • is maximized.

Figure 4: (a) AC-DAG as constructed by AID. The DAG includes all edges implied by transitive closure, but we omit them for visual clarity. We indicate the predicates in the causal path with the dashed red outline. (b) The actual causal DAG is a subgraph of the AC-DAG. (c) Step-by-step illustration of causal path discovery (the path is shown at bottom right). Steps 1 and 2 perform branch pruning, steps 3 to 8 perform group intervention with pruning on the predicate chain, and steps 6 and 7 apply interventional pruning.

5.2 Illustrative Example

AID performs causal path discovery through an intervention algorithm (Section 5.4). Here, we illustrate the main steps and intuitions through an example.

Figure 4(a) shows an AC-DAG derived by AID (Section 4). The AC-DAG contains all edges implied by transitive closure, but we do not depict them to have clearer visuals. The true causal path for the failure is , depicted with dashed red outline. The AC-DAG is a superset of the actual causal graph, which is shown in Figure 4(b).

AID follows an intervention-centric approach for discovering the causal path. Intervening on a predicate forces it to behave the way it does in the successful executions, which is, by definition, the opposite of its behavior in the failed executions. (Recall that, without loss of generality, we assume that all predicates are boolean.) Following the adaptive group testing paradigm, AID performs group intervention, which is simultaneous intervention on a set of predicates, to reduce the total number of interventions. Figure 4(c) shows the steps of the intervention algorithm, numbered 1 through 8.

AID first aims to reduce the AC-DAG by pruning entire chains that are not associated with the failure, through a process called branch pruning (Section 5.4). Starting from the root of the AC-DAG, AID discovers the first junction, after predicate . For each child of a junction, AID creates a compound predicate, called an independent branch, or simply branch, that is a disjunction over the child and all its descendants that are not descendants of the other children. So, for the junction after , we get branches and . AID intervenes on one of the branches chosen at random (in this case ) at step 1; this requires an intervention on all of its disjunctive predicates (, , and ) in order to make the branch predicate False. Despite the intervention, the program continues to fail, and AID prunes the entire branch of , resolving the junction after . For a junction of branches, AID would need interventions to resolve it using a divide-and-conquer approach. At step 2, AID similarly prunes a branch at the junction after . At this point, AID is done with branch pruning since it is left with just a chain of predicates (step 3).

What is left for AID is to prune any non-causal predicate from the remaining chain. AID achieves that through a divide-and-conquer strategy that intervenes on groups of predicates at a time (Algorithm 1). It intervenes on the top half of the chain , which stops the failure and confirms that the root cause is in this group (step 3). With two more steps that narrow down the interventions (steps 4 and 5), AID discovers that is the root cause. Note that we cannot simply assume that the root of the AC-DAG is a cause, because the edges are not all necessarily causal.

After the discovery of the root cause, AID needs to derive the causal path. Continuing the divide-and-conquer steps, it intervenes on (step 6). This stops the failure, confirming that is in the causal path. In addition, since is not causally dependent on , the intervention on does not stop from occurring. This observation allows AID to prune without intervening on it directly. At step 7, AID intervenes on . The effect of this intervention is that the failure is still observed, but no longer occurs, indicating that is causally connected to , but not to the failure; this allows AID to prune both and . Finally, at step 8, AID intervenes on and confirms that it is causal, completing the causal path derivation. AID discovered the causal path in 8 interventions, while naïvely we would have needed 11, one for each predicate.

5.3 Predicate Pruning

In the initial construction of the AC-DAG, AID excludes predicates based on a simple rule: a predicate is excluded if there exists a program execution , where occurs and the failure does not (), or does not occur and the failure does (). Intervening executions follow the same basic intuition for pruning the intervened predicate : By definition does not occur in an execution that intervenes on predicate (); thus, if the failure still occurs on (), then is pruned from the AC-DAG.

As we saw in the illustrative example, intervention on a predicate may also lead to the pruning of additional predicates. However, the same basic pruning logic needs to be applied more carefully in this case. In particular, we can never prune predicates that precede in the AC-DAG, as their potential causal effect on the failure may be muted by the intervention on . Thus, we can only apply the pruning rule to any predicate that is not an ancestor of in the AC-DAG (). We formalize the predicate pruning strategy over in the following definition.

Definition 2 (Interventional Pruning).

Let be a set of program executions intervening on a group of predicates . Every is pruned from iff such that . Any other predicate is pruned from iff such that and such that .

Note: Because of nondeterminism issues in concurrent applications, we execute a program multiple times with the same intervention. However, it is sufficient to identify a single counter-example execution to invoke the pruning rule.
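The pruning rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not AID's implementation: each execution is represented as a dictionary from predicate name to its observed boolean value, the failure predicate is named "F", and the function name and parameters are hypothetical.

```python
def interventional_prune(executions, intervened, ancestors_of_intervened, failure="F"):
    """Return the predicates to prune after one group intervention.

    executions: list of dicts {predicate name: observed bool}, failure included.
    intervened: predicates forced to False in these executions.
    ancestors_of_intervened: predicates preceding any intervened predicate in
        the AC-DAG; their effect may be muted, so they are never pruned here.
    """
    if not executions:
        return set()
    pruned = set()
    # Intervened predicates are pruned if the failure survives in any execution.
    if any(run[failure] for run in executions):
        pruned |= set(intervened)
    # Any other non-ancestor predicate is pruned on a counterfactual violation:
    # it occurs without the failure, or the failure occurs without it.
    candidates = (set(executions[0]) - set(intervened)
                  - set(ancestors_of_intervened) - {failure})
    for p in candidates:
        if any(run[p] != run[failure] for run in executions):
            pruned.add(p)
    return pruned


# Hypothetical logs: intervening on "A" stops the failure while "B" still occurs.
step_a = [{"root": True, "A": False, "B": True, "F": False}]
# Hypothetical logs: intervening on "X" leaves the failure, but "Y" disappears.
step_x = [{"root": True, "X": False, "Y": False, "F": True}]
print(interventional_prune(step_a, {"A"}, {"root"}))
print(interventional_prune(step_x, {"X"}, {"root"}))
```

The two toy logs mirror the two situations in the illustrative example: a bystander predicate is pruned without being intervened on, and an intervened predicate is pruned together with its effect.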

Input :  A set of candidate predicates, ,
AC-DAG,
Failure indicating predicate,
Output : The set of counterfactual causes of ,
The set of spurious predicates,
1 /* causal predicate set */
2 /* spurious predicate set */
3 while  do
4        = first half of in topological order
5        Intervene ()
6        if  s.t.  then /* failure stopped */
7               if  contains a single predicate then
8                      /* a cause is confirmed */
9              else  /* need to confirm causes */
10                      GIWP()
11                      /* confirmed causes */
12                      /* spurious predicates */
13              
       /* interventional pruning */
14        if  s.t.  then /* failure didn’t stop */
15               /* pruning */
16       foreach  s.t.  do
17               if  s.t.  then
18                      /* pruning */
19              
20        /* remove confirmed and spurious predicates from candidate predicate pool */
return ,
Algorithm 1 GIWP ()

5.4 Causality-guided Intervention

AID’s core intervention method is described in Algorithm 1: Group Intervention With Pruning (GIWP). GIWP applies adaptive group testing to derive causal and spurious (non-causal) nodes in the AC-DAG. The algorithm applies a divide-and-conquer approach that groups predicates based on their topological order (a linear ordering of its nodes such that for every directed edge , comes before in the ordering). In every iteration, GIWP selects the predicates in the lowest half of the topological order, resolving ties randomly, and intervenes by setting all of them to False (lines 4-5). The intervention returns a set of predicate logs.

If the failure is not observed in any of the intervening executions (line 6), based on counterfactual causality, GIWP concludes that the intervened group contains at least one predicate that causes the failure. If the group contains a single predicate, it is marked as causal (line 8). Otherwise, GIWP recurses to trace the causal predicates within the group (line 10).

During each intervention round, GIWP applies Definition 2 to prune predicates that are determined to be non-causal (lines 14-18). First, if the algorithm discovers an intervening execution that still exhibits the failure, then it labels all intervened predicates as spurious and marks them for removal (line 15). Second, GIWP examines each other predicate that does not precede any intervened predicate and observes if any of the intervened executions demonstrate a counterfactual violation between the predicate and the failure. If a violation is found, that predicate is pruned (line 18).

At completion of each intervention round, GIWP refines the predicate pool by eliminating all confirmed causes and spurious predicates (line 20) and enters the next intervention round. It continues the interventions until all predicates are either marked as causal or spurious and the remaining predicate pool is empty. Finally, GIWP returns two disjoint predicate sets: the causal predicates and the spurious predicates (the final return statement).
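The control flow described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: `make_intervene` is a toy stand-in for fault injection driven by a hypothetical ground-truth map `true_cause`, the pool is assumed to be already listed in topological order, and `dag[p]` is assumed to hold all descendants of `p` (the AC-DAG is transitively closed). All names are assumptions of this sketch.

```python
def make_intervene(true_cause):
    """Toy fault injector: true_cause[p] is the predicate that truly causes p,
    or None if p has no cause among the predicates."""
    def intervene(group):
        def value(p):
            if p in group:
                return False          # intervened predicates are forced to False
            parent = true_cause[p]
            return True if parent is None else value(parent)
        return [{p: value(p) for p in true_cause}]   # one simulated execution log
    return intervene


def giwp(pool, dag, intervene, failure="F"):
    """Group Intervention With Pruning; returns (causal, spurious) lists."""
    pool, causal, spurious = list(pool), [], []
    while pool:
        group = pool[: (len(pool) + 1) // 2]       # first half, never empty
        runs = intervene(set(group))
        if not any(r[failure] for r in runs):      # failure stopped
            if len(group) == 1:
                causal += group                    # a cause is confirmed
            else:
                sub_causal, sub_spurious = giwp(group, dag, intervene, failure)
                causal += sub_causal
                spurious += sub_spurious
        else:                                      # failure did not stop
            spurious += group
        for p in pool:
            if p in group or any(q in dag[p] for q in group):
                continue                           # intervened, or an ancestor of one
            if any(r[p] != r[failure] for r in runs):
                spurious.append(p)                 # counterfactual violation
        resolved = set(causal) | set(spurious)
        pool = [p for p in pool if p not in resolved]
    return causal, spurious


# Hypothetical chain in which root -> mid -> F is the true causal path.
true_cause = {"s0": None, "root": None, "s1": "root",
              "mid": "root", "s2": "s1", "F": "mid"}
order = ["s0", "root", "s1", "mid", "s2"]
dag = {p: set(order[i + 1:]) | {"F"} for i, p in enumerate(order)}
causal, spurious = giwp(order, dag, make_intervene(true_cause))
print(causal, spurious)
```

In this toy run, `s2` is pruned without ever being intervened on: it still occurs when the intervention on `mid` stops the failure, which is the interventional pruning of Definition 2.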

Input : AC-DAG,
Failure indicating predicate,
Output : Reduces to an approximate causal chain
1 /* potential causal predicate set */
2 /* spurious predicate set */
3 while  do
4        = predicates at the lowest topological level in
5        if  contains a single predicate then
6               /* add to potential causal set */
7       else  /* this is a junction */
8              
9               foreach  do
10                     
11                     
12                      /* set of branches */
13               GIWP ()
14               /* add to potential causal set */
15               /* add to spurious set */
16              
       /* refining */
17        /* unreachable */
18        /* remove spurious predicates */
19        /* remove unreachable predicates */
Algorithm 2 Branch-Prune ()
Input : AC-DAG,
Failure indicating predicate,
, whether to apply branch pruning or not
Output : A causal path
1 if  then
2        Branch-Prune ()
3       
4 GIWP ()
return
Algorithm 3 Causal-Path-Discovery ()

Branch Pruning

GIWP is sufficient for most practical applications and can work directly on the AC-DAG. However, when the AC-DAG satisfies certain conditions (analyzed in Section 6.3.1), we can reduce the number of required interventions through a process called branch pruning. The intuition is that since there is a single causal path that explains the failure, junctions (where multiple paths exist) can be used to quickly identify independent branches to be pruned or confirmed as causal as a group. The branches can be used to more effectively identify groups for intervention, reducing the overall number of required interventions.

Branch pruning iteratively prunes branches at junctions (steps 1 and 2 in the illustrative example) to reduce the AC-DAG to a chain of predicates. The process is detailed in Algorithm 2. The algorithm traverses the DAG based on its topological order, and does not intervene while it encounters a single node at a time, which means it is still in a chain (line 5). When it encounters multiple nodes at the same topological level, it means it encountered a junction (line 7). A junction means that the true causal path can only continue in one direction, and AID can perform group intervention to discover it. The algorithm invokes GIWP to perform this intervention over a set of special predicates constructed from the branches at the encountered junction (lines 8-12). A branch at predicate is defined as a disjunctive predicate over and all descendants of that are not descendants of any other predicate at the same topological level as . An example branch from our illustrative example is . To intervene on a branch, one has to intervene on all of its disjunctive predicates. The algorithm defines as the union of all branches, which corresponds to a completely disconnected graph (no edges between the nodes), thus all branch predicates are at the same topological level. GIWP is then invoked (line 13) to identify the causal branch. The algorithm removes any predicate that is not causally connected to the failure (line 18) or is no longer reachable from the correct causal chain (line 19), and updates the AC-DAG accordingly. At the completion of branch pruning, AID reduces the AC-DAG to a simple chain of predicates.
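The branch definition above (a predicate plus the descendants it does not share with its siblings) can be illustrated with a small Python sketch. The graph, the node names, and the function names are hypothetical and do not correspond to Figure 4; `dag` is assumed to map each node to its direct children.

```python
def descendants(dag, node):
    """All nodes reachable from `node` in a DAG given as {node: children}."""
    seen, stack = set(), list(dag.get(node, ()))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(dag.get(n, ()))
    return seen


def branches_at_junction(dag, siblings):
    """Map each sibling at a junction to its branch: the sibling itself plus
    the descendants that no other sibling can reach."""
    desc = {s: descendants(dag, s) for s in siblings}
    branches = {}
    for s in siblings:
        shared = set().union(*(desc[o] for o in siblings if o != s))
        branches[s] = {s} | (desc[s] - shared)
    return branches


# Hypothetical AC-DAG with a junction after "a" that rejoins at "f".
toy_dag = {"a": ["b", "c"], "b": ["d"], "c": ["e"],
           "d": ["f"], "e": ["f"], "f": ["F"]}
toy_branches = branches_at_junction(toy_dag, ["b", "c"])
print(toy_branches)
```

Intervening on a branch then means forcing every predicate in the corresponding set to False, since the branch predicate is their disjunction.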

Finally, Algorithm 3 presents the overall method that AID uses to perform causal path discovery, which optionally invokes branch pruning before the divide-and-conquer group intervention through GIWP.

6 Theoretical Analysis

In this section, we theoretically analyze the performance of AID in terms of the number of interventions required to identify all causal predicates, which are the predicates causally related to the failure. (Causal predicates correspond to faulty predicates in group testing; the terminology differs because group testing does not meaningfully reason about causality.) Similar to the analysis of group testing algorithms, we study the information-theoretic lower bound, which specifies the minimum number of bits of information that an algorithm must extract to identify all causal predicates for any instance of a problem. We also study the lower and the upper bounds that quantify the minimum and the maximum number of group interventions required to identify all causal predicates, respectively, for AID versus a Traditional Adaptive Group Testing (TAGT) algorithm.

Any group testing algorithm takes items (predicates), of which are faulty (causal), and aims to identify all faulty items using as few group interventions as possible. Since there are possible outcomes, the information-theoretic lower bound for this problem is . The upper bound on the number of interventions using TAGT is , since group interventions are sufficient to reveal each causal predicate. Here, we assume ; otherwise, a linear approach that intervenes on one predicate at a time is preferable.
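As a numeric illustration of these two bounds, the sketch below assumes the standard group-testing counts: with n items of which d are faulty there are C(n, d) possible outcomes, so at least log2 C(n, d) bits are needed, and binary splitting needs about log2 n group tests per faulty item. The symbols n and d and the function names are assumptions of this sketch.

```python
import math


def gt_information_lower_bound(n, d):
    """Bits needed to single out one of the C(n, d) possible faulty sets."""
    return math.ceil(math.log2(math.comb(n, d)))


def tagt_upper_bound(n, d):
    """About log2(n) group tests per faulty item via binary splitting."""
    return d * math.ceil(math.log2(n))


# Hypothetical instance: 64 predicates, 7 of them causal.
print(gt_information_lower_bound(64, 7), tagt_upper_bound(64, 7))
```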

We now show that the Causal Path Discovery (CPD) problem (Definition 1) can reduce the lower bound on the number of required interventions compared to Group Testing (GT). We also show that the upper bound on the number of interventions is lower for AID than TAGT, because of the two assumptions of CPD (Section 5.1). In TAGT, predicates are assumed to be independent of each other, and hence, after each intervention, decisions (about whether predicates are causal) can be made only about the intervened predicates. In contrast, AID uses the precedence relationships among predicates in the AC-DAG to (1) aggressively prune, by making decisions not only about the intervened predicates but also about other predicates, and to (2) select predicates based on the topological order, which enables effective pruning during each intervention.

Example 2.

Consider the AC-DAG of Figure 5(a), consisting of predicates and the failure predicate . If AID intervenes on all predicates in one branch (e.g., ) and finds causal connection to the failure, it can avoid intervening on predicates in the other branch according to the deterministic effect assumption. AID can also use the structure of the AC-DAG to intervene on (or ) before other predicates since the intervention can prune a large set of predicates. Since GT algorithms do not assume relationships among predicates, they can only intervene on predicates in random order and can make decisions about only the intervened predicates.

6.1 Search Space

The temporal precedence and potential causality encoded in the AC-DAG restrict the possible causal paths and significantly reduce the search space of CPD compared to GT.

Example 3.

In the example of Figure 5(a), GT considers all subsets of the 6 predicates as possible solutions, and thus its search space includes $2^6 = 64$ candidates. CPD leverages the AC-DAG and the deterministic effect assumption (Section 5.1) to identify invalid candidates and reduce the search space considerably. For example, the candidate solution is not possible under CPD, because it involves predicates in separate paths on the AC-DAG. In fact, based on the AC-DAG, CPD does not need to explore any solutions with more than 3 predicates. The complete search space of CPD includes all subsets of predicates along each branch of length 3, thus a total of $2^3 + 2^3 - 1 = 15$ possible solutions (the empty set is shared by the two branches).
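The counts in this example can be checked by brute-force enumeration. The sketch below assumes two branches of three predicates each, with hypothetical predicate names; a CPD candidate is any subset that lies entirely within one branch.

```python
from itertools import combinations


def gt_search_space(predicates):
    """Group testing considers every subset of the predicates."""
    return 2 ** len(predicates)


def cpd_search_space(branches):
    """Count distinct subsets that stay within a single branch."""
    solutions = set()
    for branch in branches:
        for k in range(len(branch) + 1):
            for subset in combinations(branch, k):
                solutions.add(frozenset(subset))   # the empty set is counted once
    return len(solutions)


# Hypothetical names for the two three-predicate branches.
left, right = ("A1", "A2", "A3"), ("B1", "B2", "B3")
print(gt_search_space(left + right), cpd_search_space([left, right]))
```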

Figure 5: (a) An AC-DAG with failure predicate . (b) Horizontal and vertical expansion. (c) A symmetric AC-DAG with junctions where each junction has branches and each branch has predicates.

We proceed to characterize the search space of CPD compared to GT more generally. We use to denote the number of predicates in an AC-DAG represented by , and and to denote the size of the search space for GT and CPD, respectively. We start from the simplest case of DAG, a chain, and then using the notions of horizontal and vertical expansion, we can derive the search space for any DAG.

If is a simple chain of predicates, then GT and CPD have the same search space: . CPD reduces the search space drastically when junctions split the predicates into separate branches, like in Example 3. We call this case a horizontal expansion: a DAG is a horizontal expansion of two subgraphs and if it connects them in parallel through two junctions, at the roots (lowest topological level) and leaves (highest topological level). In contrast, is a vertical expansion, if it connects them sequentially via a junction. Horizontal and vertical expansion are depicted in Figure 5(b). In horizontal expansion, the search space of CPD is additive over the combined DAGs, while in vertical expansion it is multiplicative.

Lemma 1 (DAG expansion).

Let and be the numbers of valid solutions for CPD over DAGs and , respectively. Let and represent their horizontal and vertical expansion, respectively. Then:

In contrast, in both cases, the search space of GT is .

Intuitively, in horizontal expansion, the valid solutions for are those of and those from , but no combinations between the two are possible. Note that both and have the empty set as a common solution, so in the computation of , one solution is subtracted from each search space () and then added to the overall result.
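The relations described in the text can be written out as follows. This is a sketch in assumed notation: $W_1$ and $W_2$ denote the CPD search-space sizes of the two subgraphs, $N_1$ and $N_2$ their numbers of predicates, and $W_H$, $W_V$, $W_{GT}$ the sizes for the horizontal expansion, the vertical expansion, and group testing; none of these symbol names is taken from the original.

```latex
\begin{align*}
W_H    &= (W_1 - 1) + (W_2 - 1) + 1 = W_1 + W_2 - 1 && \text{(additive, empty set counted once)} \\
W_V    &= W_1 \cdot W_2                              && \text{(multiplicative)} \\
W_{GT} &= 2^{N_1 + N_2}                              && \text{(all subsets of all predicates)}
\end{align*}
```

For two chains of three predicates each, $W_1 = W_2 = 2^3$, so the horizontal expansion gives $8 + 8 - 1 = 15$ candidates against $2^6 = 64$ for group testing.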

Symmetric AC-DAG. Lemma 1 allows us to derive the size of the search space for CPD over any AC-DAG. To further highlight the difference between GT and CPD, we analyze their search space over a special type of AC-DAG, a symmetric AC-DAG, depicted in Figure 5(c). A symmetric AC-DAG has junctions, and branches at each junction, where each branch is a simple chain of predicates. Therefore, the total number of predicates in the symmetric AC-DAG is , and the search space of GT is . For CPD, based on horizontal expansion, the subgraph in-between two subsequent junctions has a total of candidate solutions. Then, based on vertical expansion, the overall search space of CPD is:

6.2 Lower Bound of Number of Interventions

We now show that, due to the predicate pruning mechanisms, and the strategy of picking predicates according to topological order, the lower bound on the required number of interventions in CPD is significantly reduced. (A lower bound is a theoretical bound stating that it might be possible to design an algorithm that solves the problem in a number of steps equal to the bound; it does not imply that such an algorithm exists.) For the sake of simplicity, we drop the deterministic effect assumption in this analysis. In GT, after each group test, we get at least bit of information. Since after retrieving all information, the remaining information should be , therefore, the number of required interventions in GT is bounded below by . In contrast, for CPD, we have the following theorem. (Proofs are in the Appendix.)

Theorem 2.

The number of required group interventions in CPD is bounded below by , where at least predicates are discarded (either pruned using the pruning rule or marked as causal) during each group intervention.

Since , we obtain a lower bound on the number of required interventions for CPD that is smaller than that for GT. In general, as increases, the lower bound in CPD decreases. Note that we are not claiming that AID achieves this lower bound for CPD; rather, the bound suggests that better algorithms can be designed in the CPD setting than in the GT setting.

Symmetric AC-DAG. Figure 6 shows the lower bound on the number of required interventions in CPD and GT for the symmetric AC-DAG of Figure 5(c), assuming that each intervention discards at least predicates in CPD.

6.3 Upper Bound of Number of Interventions

We now analyze the upper bound on the number of interventions for AID under (1) branch pruning, which exploits the deterministic effect assumption, and (2) predicate pruning.

Search #Interventions
space Lower bound Upper bound (AID/TAGT)
CPD
GT
Figure 6: Theoretical comparison between CPD and GT for the symmetric AC-DAG of Figure 5(c).

6.3.1 Branch Pruning

Whenever AID encounters a junction, it has the option to apply branch pruning. In CPD, at most one branch can be causal at each junction; hence, we can find the causal branch using interventions at each junction, where is the number of branches at that junction. Also, is upper-bounded by the number of threads in the program. This holds since we assume that the program inputs are fixed and there is no different conditional branching due to input variation in different failed executions within the same thread. If there are junctions and at most branches at each junction, the number of interventions required to reduce the AC-DAG to a chain is at most . Now let us assume that the maximum number of predicates in any path in the AC-DAG is . Therefore, the chain found after branch pruning can contain at most predicates. If of them are causal predicates, we need at most interventions to find them. Therefore, the total number of required interventions for AID is . In contrast, the number of required interventions for TAGT, which does not prune branches, is . Therefore, whenever , the upper bound on the number of interventions for AID is smaller than the upper bound for TAGT.

6.3.2 Predicate Pruning

For an AC-DAG with predicates, of which are causal, we now focus on the upper bound on the number of interventions in AID using only predicate pruning. In the worst case, when no pruning is possible, the number of required interventions would be the same as that of TAGT without pruning, i.e., .

Theorem 3.

If at least predicates are discarded (pruned or marked as causal) from the candidate predicate pool during each causal predicate discovery, then the number of required interventions for AID is .

Hence, the reduction depends on . When , we are referring to TAGT, in absence of pruning, because once TAGT finds a causal predicate, it removes that predicate from the candidate predicate pool.

Symmetric AC-DAG. Figure 6 shows the upper bound on the number of required interventions using AID and TAGT for the symmetric AC-DAG of Figure 5(c), assuming that at least predicates are discarded during each causal predicate discovery by AID.

7 Experimental Evaluation

We now empirically evaluate AID. We first use AID on six real-world applications to demonstrate its effectiveness in identifying the root cause and generating an explanation of how the root cause triggers the failure. Then we use a synthetic benchmark to compare AID and its variants against a traditional adaptive group testing approach and to perform a sensitivity analysis of AID on various parameters of the benchmark.

7.1 Case Studies of Real-world Applications

We now use three real-world open-source applications and three proprietary applications to demonstrate AID’s effectiveness in identifying root causes of transient failures. Figure 7 summarizes the results and highlights the key benefits of AID:

  • AID is able to identify the true root cause and generate an explanation that is consistent with the explanation provided by the developers in corresponding GitHub issues.

  • AID requires significantly fewer interventions than traditional adaptive group testing (TAGT), which does not utilize causality among predicates (columns 5 and 6).

  • In contrast, SD generates a large number of discriminative predicates (column 3), only a small number of which is actually causally related to the failures (column 4).

7.1.1 Data race in Npgsql

As a case study, we first consider a recently discovered concurrency bug in Npgsql [npgsql], an open-source ADO.NET Data Provider for PostgreSQL. The bug (GitHub issue #2485) causes an Npgsql-backed application to intermittently crash when it tries to create a new PostgreSQL connection. We use AID to check if it can identify the root cause and generate an explanation of how the root cause triggers the failure.

We used one of the existing unit tests in Npgsql that causes the issue, and generated logs from 50 successful executions and 50 failed executions of the test. By applying SD, we found a total of 14 discriminative predicates. However, SD did not pinpoint the root cause or generate any explanation.

We then applied AID on the discriminative predicates. In the branch pruning step, it used 3 rounds of interventions to prune 8 of the 14 predicates. In the next step, it required 2 more rounds of interventions. Overall, AID required a total of 5 intervention rounds; in contrast, TAGT would require 11 interventions in the worst case.

After all the interventions, AID identified a data race as the root cause of the failure and produced the following explanation: (1) two threads race on an index variable: one increments it while the other reads it; (2) the second thread accesses an array at the incremented index location, which is outside the array bounds; (3) this access throws an IndexOutOfRange exception; (4) the application fails to handle the exception and crashes. This explanation matches the root cause provided by the developer who reported the bug to the Npgsql GitHub repository.

(1) Application             (2) GitHub Issue #      (3) #Discrim.    (4) #Preds in    #Interventions
                                                    preds (SD)       causal path      (5) AID   (6) TAGT
Npgsql [npgsql]             2485 [npgsqlBug2485]    14               3                5         11
Kafka [kafka]               279 [kafkaBug279]       72               5                17        33
Azure Cosmos DB [cosmosdb]  713 [cosmosBug713]      64               7                15        42
Network                     N/A                     24               1                2         5
BuildAndTest                N/A                     25               3                10        15
HealthTelemetry             N/A                     93               10               40        70
Figure 7: Results from case studies of real-world applications. SD produces far more spurious predicates than correct causal predicates (columns 3 & 4). SD actually produces even more predicates, but here we only report the number of fully-discriminative predicates. AID and traditional adaptive group testing (TAGT) both pinpoint the correct causal predicates using interventions, but AID does so with significantly fewer interventions (columns 5 & 6).

7.1.2 Use-after-free in Kafka

Next, we use AID on an application built on Kafka [kafka], a distributed message queue. On Kafka’s GitHub repository, a user reported an issue [kafkaBug279] that causes a Kafka application to intermittently crash or hang. The user also provided sample code to reproduce the issue; we use similar code for this case study.

As before, we collected predicate logs from 50 successful and 50 failed executions. Using SD, we identified 72 discriminative predicates. The AC-DAG identified 30 predicates with no causal path to the failure-indicating predicate, and these were discarded. AID then used the intervention algorithm on the remaining 42 predicates. After a sequence of 7 interventions, AID could identify the root-cause predicate. It took an additional 10 rounds (total 17) of interventions to discover a causal path of 5 predicates that connects the root cause and the failure. The causal path gives the following explanation: (1) the main thread that creates a Kafka consumer starts a child thread; (2) the child thread runs too slowly before calling a method on the consumer; (3) the main thread disposes the consumer; (4) the child thread calls a commit method on the consumer; (5) since the consumer has already been disposed by the main thread, the previous step causes an exception, causing the failure. The explanation matches well with the description provided in GitHub.

Overall, AID required 17 interventions to discover the root cause and explanation. In contrast, SD generates 72 predicates, without pinpointing the true root cause or explanation. TAGT could identify all predicates in the explanation, but it takes 33 interventions in the worst case.

7.1.3 Timing bug in Azure Cosmos DB application

Next, we use AID on an application built on Azure Cosmos DB [cosmosdb], Microsoft’s globally distributed database service for operational and analytics workloads. The application has an intermittent timing bug similar to the one mentioned in a Cosmos DB pull request on GitHub [cosmosBug713]. In summary, the application populates a cache with several entries that would expire after 1 second, performs a few tasks, and then accesses one of the cached entries. During successful executions, the tasks run fast and end before the cached entries expire. However, a transient fault triggers expensive fault handling code that makes a task run longer than the cache expiry time. This makes the application fail as it cannot find the entry in the cache (i.e., it has already expired).

Using SD, we identified 64 discriminative predicates from successful and failed executions of the application. Applying AID on them required 15 interventions and it generated an explanation consisting of 7 predicates that are consistent with the aforementioned informal explanation. In contrast, SD would generate 64 predicates and TAGT would take 42 interventions in the worst case.

Figure 8: Number of interventions required in the average and worst case by traditional adaptive group testing (TAGT) and different variations of AID with a varying maximum number of threads. For the average-case analysis, the total number of predicates is shown using a grey dotted line. The total number of predicates is not shown for the worst-case analysis, because the worst cases vary across approaches.

7.1.4 Bugs in Microsoft applications

We demonstrate the effectiveness of AID in finding non-trivial bugs through an evaluation on three proprietary applications inside Microsoft (Figure 7): (1) Network: the control panel of a data center network, (2) BuildAndTest: a centralized build and test platform, and (3) HealthTelemetry: a module used by various services to report their runtime health. These applications had been intermittently failing for many months and their developers could not identify the exact root causes. AID identified the root causes as a random number collision in Network, an order violation in BuildAndTest, and a race condition in HealthTelemetry. The application developers confirmed that AID identified the correct root causes and that the generated explanations correctly show how these lead to the (intermittent) failures.

Figure 7 also shows the performance of AID with these applications. As before, SD produces many discriminative predicates, only a subset of which are causally related to the failures. Moreover, for all applications, AID requires significantly fewer interventions than TAGT would require in the worst case.

7.2 Experiments with Synthetic Applications

We further evaluate AID on a benchmark of synthetically generated applications, designed to fail intermittently and with known root causes. We generate multi-threaded applications, varying the maximum number of threads from 2 to 40. For each parameter setting, we generate 500 applications. In these applications, the total number of predicates ranges from 4 to 284, and we randomly choose the number of causal predicates in the range .

For this experiment, we compare four approaches: TAGT, AID, AID without predicate pruning (AID-P), and AID without predicate or branch pruning (AID-P-B). All four approaches derive the correct causal paths but differ in the number of required interventions. Figure 8 shows the average (left) and the maximum (right) number of interventions required by each approach. The grey dotted line in the average case shows the average number of predicates over the 500 instances for that setting. This experiment provides two key observations:

Interventions in topological order converge faster. Causally-related predicates are likely to be topologically close to each other in the AC-DAG. AID discards all predicates in an intervened group only when none are causal. This is unlikely to occur when predicates are grouped randomly. For this reason, AID-P-B, which uses topological ordering, requires fewer interventions than TAGT.
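The effect of grouping can be sketched with a simplified model. The code below is not AID's intervention algorithm or TAGT; it assumes plain binary-splitting group testing in which one intervention on a group reveals whether the group contains any causal predicate, and a group is discarded only when it contains none. The function name and the two example causal sets are hypothetical.

```python
def count_interventions(group, causal):
    """Interventions used by binary splitting on `group`, an ordered list of predicates."""
    if not any(p in causal for p in group):
        return 1  # one intervention, and the whole group is discarded
    if len(group) == 1:
        return 1  # one intervention confirms a single causal predicate
    mid = len(group) // 2
    return (1
            + count_interventions(group[:mid], causal)
            + count_interventions(group[mid:], causal))


predicates = list(range(16))

# Causal predicates adjacent in the ordering (as a topological order tends to place them).
adjacent = count_interventions(predicates, causal={4, 5, 6})

# The same number of causal predicates spread across the ordering (as random grouping might).
scattered = count_interventions(predicates, causal={1, 7, 13})
```

Under these assumptions, the adjacent placement needs 11 interventions and the scattered one needs 19, because large groups without any causal predicate can be discarded early only when the causal predicates are clustered.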

Pruning reduces the required number of interventions. We observe that both predicate and branch pruning reduce the number of interventions. Pruning is a key differentiating factor of AID from TAGT. In the worst-case setting in particular, the margin between AID and TAGT is significant: TAGT requires up to 217 interventions in one case, while the highest number of interventions for AID is 52.

8 Related Work

Causal inference has long been applied for root-cause analysis of program failures. Attariyan et al. [DBLP:conf/osdi/AttariyanCF12, DBLP:conf/osdi/AttariyanF10] observe causality within application components through runtime control and data flow, but only report a list of root causes ordered by the likelihood of being faulty, without providing further causal connection between root causes and performance anomalies. Beyond statistical association (e.g., correlation) between root cause and failure, a few techniques [Baah:2010:CIS:1831708.1831717, DBLP:conf/sigsoft/BaahPH11, DBLP:conf/icst/ShuSPC13, DBLP:journals/corr/abs-1712-03361] apply statistical causal inference on observational data towards software fault localization. However, observational data collected from program execution logs is often limited in capturing certain scenarios, and hence, an observational study is ill-equipped to identify the intermediate explanation predicates. This is because observational data is not generated by randomized controlled experiments, and therefore may not satisfy conditional exchangeability (data can be treated as if they came from a randomized experiment [DBLP:conf/uai/JensenBR19]) and positivity (all possible combinations of values for the variables are observed in the data), which are two key requirements for applying causal inference on observational data [DBLP:conf/icst/ShuSPC13]. While observational studies are extremely useful in many settings, AID’s problem setting permits interventional studies, which offer increased reliability and accuracy.

Explanation-centric approaches are relevant to AID as they also aim at generating informative, yet minimal, explanations of certain incidents, such as data errors [DBLP:conf/sigmod/WangDM15] and binary outcomes [DBLP:journals/pvldb/GebalyAGKS14]; however, these do not focus on interventions. Viska [DBLP:conf/sigmod/GudmundsdottirS17] allows users to intervene on system parameters to understand the underlying causes of performance differences across different systems. None of these systems is applicable for finding causally connected paths that explain intermittent failures due to concurrency bugs.

Statistical debugging approaches [DBLP:conf/icse/ChilimbiLMNV09, crugLiblit, statisticalDebuggingLiblit, Sober, DBLP:conf/issta/ThakurSLL09, DBLP:conf/icml/ZhengJLNA06, DBLP:conf/sosp/KasikciCGN17] employ statistical diagnosis to rank program predicates based on their likelihood of being the root causes of program failures. However, all statistical debugging approaches suffer from the issue of not separating correlated predicates from the causal ones, and fail to provide contextual information regarding how the root causes lead to program failures.

Predicates in AID are extracted from execution traces of the application. Ball et al. [DBLP:journals/toplas/BallL94] provide algorithms for efficiently tracing execution with minimal instrumentation. While the authors had a different goal (i.e., path profiling) than ours, the traces can be used to extract AID predicates.

Fault injection techniques [DBLP:conf/sigmod/AlvaroRH15, han1995doctor, DBLP:journals/tc/KanawatiKA95, marinescu2009lfi] intervene on application runtime behavior with the goal of testing whether an application can handle the injected faults. In fault injection techniques, the faults to be injected are chosen based on whether they can occur in practice. In contrast, AID intervenes with the goal of verifying the presence or absence of causal relationships among runtime predicates, and faults are chosen based on whether they can alter selected predicates.

Group testing [DBLP:conf/isit/AgarwalJM18, DBLP:journals/sigpro/BaiWLLLZ19, hwang-group-testing, DBLP:conf/isit/BaldassiniJA13, DBLP:conf/isit/LiCHKJ14, dingzhu1993combinatorial, DBLP:conf/itw/KarbasiZ12] has been applied for fault diagnosis in prior literature [DBLP:conf/uai/ZhengRB05]. Specifically, adaptive group testing is related to AID’s intervention algorithm. However, none of the existing works considers the scenario where a group test might reveal additional information, and thus they offer an inefficient solution for causal path discovery.

Control flow graph-based techniques [ChengLZWY09, contextAwareControlFlowPath] aim at identifying bug signatures for sequential programs using discriminative subgraphs within the program’s control flow graph, or at generating faulty control flow paths that link many bug predictors. But these approaches do not consider the causal connection among these bug predictors and the program failure.

Differential slicing [DBLP:conf/sp/JohnsonCCMPRS11] aims at discovering the causal path of execution differences but requires a complete program execution trace generated by execution indexing [DBLP:conf/pldi/XinSZ08]. Dual slicing [DBLP:conf/issta/WeeratungeZSJ10] is another program slicing-based technique to discover statement-level causal paths for concurrent program failures. However, this approach does not consider compound predicates that capture certain runtime conditions observed in concurrent programs. Moreover, program slicing-based approaches cannot deal with a set of executions; instead, they only consider two executions, one successful and one failed.

9 Conclusions

In this work, we defined the problem of causal path discovery for explaining failures of concurrent programs. Our key contribution is the novel adaptive interventional debugging (AID) framework, which combines existing statistical debugging, causal analysis, fault injection, and group testing techniques in a novel way to discover the root cause of a program failure and generate the causal path that explains how the root cause triggers the failure. Such an explanation provides better interpretability for understanding and analyzing the root causes of program failures. We showed both theoretically and empirically that AID solves the causal path discovery problem efficiently and effectively. As a future direction, we plan to incorporate additional information regarding the program behavior to better approximate the causal relationships among predicates, and to address the cases of multiple root causes and multiple causal paths. Furthermore, we plan to address the challenge of explaining multiple types of failures.

Appendix A Program Instrumentation

AID separates program instrumentation from predicate extraction, unlike prior SD techniques [statisticalDebuggingLiblit, crugLiblit, Sober]. One advantage of this separation is that it enables us to design predicates after collecting the application’s execution traces. In contrast, prior works in SD instrument applications to directly extract the predicates. For example, to assess if two methods return the same value, prior work would instrument the program using a hard-coded conditional statement “pred = (foo() == bar())”. In contrast, our instrumentation simply collects the return values of the two methods and stores them in the execution trace. AID later evaluates the predicates based on the execution traces. This gives us the flexibility to design predicates post-execution, often based on the knowledge of a domain expert. For example, in this case, we can design multiple predicates such as whether two values are equal, unequal, or satisfy any custom relation.
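The separation described above can be sketched as follows. This is a minimal illustration under assumptions, not AID's trace format: the trace layout, the method names `foo` and `bar`, and the predicate names are hypothetical. The traces store only raw return values, and the predicates are defined afterwards.

```python
# Each trace records raw return values (hypothetical methods foo and bar) and the outcome.
traces = [
    {"failed": False, "returns": {"foo": 3, "bar": 3}},
    {"failed": False, "returns": {"foo": 5, "bar": 5}},
    {"failed": True, "returns": {"foo": 4, "bar": 7}},
]

# Predicates are designed after the traces were collected; adding one needs no re-execution.
predicates = {
    "foo == bar": lambda r: r["foo"] == r["bar"],
    "foo != bar": lambda r: r["foo"] != r["bar"],
    "foo < bar": lambda r: r["foo"] < r["bar"],  # an example of a custom relation
}


def evaluate(traces, predicates):
    """Return, per predicate, its truth value in each trace."""
    return {
        name: [pred(trace["returns"]) for trace in traces]
        for name, pred in predicates.items()
    }


observed = evaluate(traces, predicates)
```

Adding a new predicate here only means adding an entry to `predicates`; the recorded traces are reused unchanged.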

Instrumentation granularity. Instrumentation granularity is orthogonal to AID. Like prior SD work, we could have instrumented at a finer granularity, such as at each conditional branch, but instrumenting method calls was sufficient for our purpose. Since our instrumentation is much coarser than that of existing SD work [statisticalDebuggingLiblit, crugLiblit, Sober], which employs sampling-based, finer-granularity instrumentation, we do not use any sampling.

Appendix B Predicate Extraction and Fault Injection

Figure 9 shows the complete pipeline of predicate extraction and fault injection for the Npgsql bug of Example 1, whose simplified source code is shown in Figure 9(a). Executions of the instrumented application generate a list of runtime method signatures per execution, called execution traces. Two partial execution traces, one for a successful and the other for a failed execution, are shown in Figure 9(b). Then we extract predicates and compute their precision and recall as shown in Figure 9(c).
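As a rough illustration of the precision and recall computation, the sketch below assumes the usual definitions: precision is the fraction of executions in which the predicate is observed that fail, and recall is the fraction of failed executions in which the predicate is observed. The function name and the toy log are hypothetical and are not taken from Figure 9.

```python
def precision_recall(observations):
    """observations: list of (predicate_observed, execution_failed) pairs."""
    observed = sum(1 for seen, _ in observations if seen)
    failed = sum(1 for _, bad in observations if bad)
    both = sum(1 for seen, bad in observations if seen and bad)
    precision = both / observed if observed else 0.0
    recall = both / failed if failed else 0.0
    return precision, recall


# Toy log with 4 failed and 4 successful executions (made-up values).
log = (
    [(True, True)] * 3      # predicate observed, execution failed
    + [(False, True)]       # predicate not observed, execution failed
    + [(True, False)]       # predicate observed, execution succeeded
    + [(False, False)] * 3  # predicate not observed, execution succeeded
)
precision, recall = precision_recall(log)
```

For this toy log both values are 0.75: the predicate is observed in 4 executions, 3 of which fail, and it is observed in 3 of the 4 failed executions.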

In AID, we use existing fault injection techniques to intervene on discriminative predicates. These techniques can change a method’s input and return value, cause a method to throw an exception, and cause a method to run slower or to run before, after, or concurrently with another method in another thread. For example, to allow for return-value alteration, AID modifies the entire application by adding (1) an optional parameter to each function, and (2) a conditional statement at the end of each function that specifies that “if a value is passed to the optional parameter, the function should return that specific value, and the actually computed value otherwise”. As another example, the predicate “there is a data race on X” can be intervened on by delaying one access to X or by putting a lock around the code segments that access X to prevent simultaneous access to X. Figure 9(d) shows how a fault is injected by adding a lock to intervene on the data race predicate of Figure 9(c).
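The return-value alteration described above can be sketched with a decorator. This is an illustration under assumptions rather than AID's actual rewriting mechanism: the decorator `overridable`, the keyword `forced_return`, and the method `lookup_slot` are hypothetical names. The decorator plays the role of the optional parameter plus the final conditional statement.

```python
import functools


def overridable(func):
    """Add an optional `forced_return` keyword that replaces the computed result."""
    _unset = object()  # sentinel, so that None can also be injected as a return value

    @functools.wraps(func)
    def wrapper(*args, forced_return=_unset, **kwargs):
        result = func(*args, **kwargs)  # the original computation still runs
        if forced_return is not _unset:
            return forced_return        # intervention: return the injected value
        return result                   # no intervention: return the computed value

    return wrapper


@overridable
def lookup_slot(index):  # hypothetical application method
    return index * 2


normal = lookup_slot(3)                        # computed value
intervened = lookup_slot(3, forced_return=-1)  # injected value
```

Running the original body even during an intervention keeps its side effects intact, so only the returned value differs between the two calls.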

Appendix C Real-world Concurrency Bug Characteristics

Studies on real-world concurrency bug characteristics [DBLP:conf/asplos/LuPSZ08, wong2009survey, DBLP:conf/icsm/VahabzadehF015] show that a vast majority of root causes can be captured with reasonably simple single predicates, and hence this assumption is very common in the SD literature [statisticalDebuggingLiblit, Sober, crugLiblit]. Some notable findings include: (1) “97% of the non-deadlock concurrency bugs are covered by two simple patterns: atomicity violation and order violation” [DBLP:conf/asplos/LuPSZ08], (2) “66% of the non-deadlock concurrency bugs involve only one variable” [DBLP:conf/asplos/LuPSZ08], (3) “The manifestation of 96% of the concurrency bugs involves no more than two threads.” [DBLP:conf/asplos/LuPSZ08], (4) “most fault localization approaches assume that each buggy source file has exactly one line of faulty code” [wong2009survey], and (5) “The majority of flaky test bugs occur when the test does not wait properly for asynchronous calls during the exercise phase of testing.” [DBLP:conf/icsm/VahabzadehF015].

Figure 9: (a) Simplified code for the Npgsql bug of Example 1. (b) Partial execution traces of one successful and one failed execution. The start-time and end-time of the events reflect concurrent read/write access to the shared variable _nextSlot. (c) The race predicate is one of the discriminative predicates. (d) Fault injection to intervene (disable) the race predicate: putting a lock around the instructions within TryGetValue().

Appendix D Proof of Theorem 2

Proof.

After the first intervention, we get at least bits of information. Suppose that there are interventions. Since after retrieving all information, the remaining information should be :

Since for small ; we assume to be small:

Appendix E Proof of Theorem 3

Proof.

Since at least predicates are discarded during each causal predicate discovery, and there are causal predicates, we compute the upper bound of the number of required interventions: