Causal Testing: Finding Defects' Root Causes

09/19/2018 ∙ by Brittany Johnson, et al.

Isolating and repairing unexpected or buggy software behavior typically involves identifying and understanding the root cause of that behavior. We develop Causal Testing, a new method of root-cause analysis that relies on the theory of statistical causal inference to identify a set of executions that likely hold the key causal information necessary to understand and repair buggy behavior. Given one or more faulty executions, Causal Testing finds a small set of minimally different executions that do not exhibit the faulty behavior. Reducing the differences any further causes the faulty behavior to reappear, and so the differences between these minimally different executions and the faulty executions capture causal execution information, which can aid system understanding and debugging tasks. Evaluating Causal Testing on a subset of the Defects4J benchmark, we find that Causal Testing could be applied to 71% of real-world defects, and for 77% of those, it can help developers identify the root cause of the defect. We implement and make public Holmes, a Causal Testing Eclipse plug-in that automatically computes and presents causal information. A controlled experiment with 16 developers showed that Holmes improves the subjects' ability to identify the cause of the defect: users with standard testing tools identified the cause 81% of the time, while users with Holmes did so 92% of the time.

I Introduction

Identifying the root cause of a behavior software exhibits, for example a bug or an anomalously slow execution, lies at the heart of understanding and debugging that behavior. Many existing approaches aim to automatically localize a bug to a specific location in the code to help reduce the developer effort necessary to fix that bug. For example, spectrum-based fault localization [28] uses information about test coverage (e.g., which lines execute more often during failing test cases as compared to passing ones) to identify the likelihood that a line of code contains a specific test-failing bug [13]. Delta debugging allows one to minimize a failing test input [66] and a set of test-breaking changes [65]. Even without clear evidence of a defect, techniques attempt to find static code elements that correlate with (if not cause) defects [39] to improve software quality.

This paper presents Causal Testing, a novel method for identifying root causes of failing executions relying on the principles of statistical causal inference theory [46]. Given one or more failing executions, Causal Testing produces (when possible) a small set of executions that differ minimally from the failing ones but do not exhibit the failing behavior. The differences in these executions capture the root cause of the failure. Reducing the differences between executions any further causes the faulty behavior to reappear, thus indicating that all the presented differences are necessary for the failure.

Consider a developer working on a web mapping service (such as Google Maps or MapQuest) receiving a bug report that getting directions between “New York, NY, USA” and “900 René Lévesque Blvd. W Montreal, QC, Canada” resulted in incorrect directions, despite the web service correctly identifying each address individually. The developer replicates the faulty behavior and hypothesizes potential causes. Maybe the special characters in “René Lévesque” caused a problem. Maybe the first address being a city and the second a specific building caused a mismatch in internal data types. Maybe the route is too long and the service’s precomputing of some routes is causing the problem. Maybe construction on the Tappan Zee Bridge along the route has created flawed route information in the database. There are many possible causes to consider. The developer decides instead to step through the execution, but the shortest path algorithm, coupled with precomputed-route caching and many optimizations, is complex, and it is not clear how the wrong route is produced. The developer gets lost inside the many libraries and cache calls, and the stack trace quickly becomes unmanageable, without the developer even knowing the point at which the erroneous internal representation first manifests itself.

Suppose, instead, a tool had analyzed the bug report and presented the developer with the following information:

1  Failing: New York, NY, USA to 900 René Lévesque Blvd. W Montreal, QC, Canada
2  Failing: Boston, MA, USA to 900 René Lévesque Blvd. W Montreal, QC, Canada
3  Failing: New York, NY, USA to 1 Harbour Square, Toronto, ON, Canada
4  Passing: New York, NY, USA to 39 Dalton St, Boston, MA, USA
5  Passing: Toronto, ON, Canada to 900 René Lévesque Blvd. W Montreal, QC, Canada
6  Passing: Vancouver, BC, Canada to 900 René Lévesque Blvd. W Montreal, QC, Canada

The developer would quickly see that all the failing test cases have one address in the United States and the other in Canada, whereas all the passing test cases have both the starting and ending addresses in the same country. It would become clear that the special characters, the first address being a city, the length of the route, and the construction are not the root cause of the problem.

Existing debugging methods cannot produce this information for the developer. Delta debugging minimizes existing inputs or changed code [65, 66], but does not produce new test cases. Causal Testing does not minimize an input or a set of changes, but rather produces other inputs (not necessarily smaller) that differ minimally but cause relevant behavioral changes. The same is true for spectral fault localization [28, 13], machine-learning-based defect prediction [39], traditional step-through debugging, time-travel debugging [10], and other debugging approaches: they do not attempt to produce new tests to help debug. Test fuzzing [18, 19, 24, 29, 58], and, more generally, automated test generation tools such as afl [1], EvoSuite [16], and Randoop [43], do generate new tests, but they do not aim to help developers by identifying the most similar tests that differ in behavior.

Causal Testing produces precisely these kinds of similar but behaviorally different executions to help developers understand behavior. Building on the principles of statistical causal inference theory [46], Causal Testing observes that software allows one to conduct causal experiments. If the software executes on an input and produces a behavior, purposefully changing the input in a particular way and observing the changes in the software behavior infers causal relationships. The change in the input causes the behavioral change. Thus, finding minimal such changes to the inputs can identify tests that exhibit the root causes of the failures.

We implement Causal Testing in Holmes, our Eclipse plug-in that works on Java programs and interfaces with JUnit. Holmes uses speculative analysis [9] to automatically produce small, minimally different test sets in the background, and presents them to the developer when the developer is ready to debug a failing test. Holmes is open-source and available to the public.

We evaluate Causal Testing in two ways. First, we evaluate Holmes in a controlled experiment with 16 developers. The experiment asks developers to identify the root causes of real-world defects, giving them access to either (1) Eclipse and all its built-in tools, or (2) Holmes, Eclipse, and all its built-in tools. We find that using Holmes leads the developers to identify the root cause 92% of the time, while the control group only does so 81% of the time. While the study’s sample size is too small to show statistical significance (Fisher’s exact test, p = 0.11), this experiment provides promising evidence that Holmes and Causal Testing can be helpful for developers.

Second, we evaluate Causal Testing’s applicability to real-world defects by considering defects from real-world programs made as part of the development process, collected in the Defects4J benchmark [32]. We find that Causal Testing could be applied to 71% of real-world defects, and for 77% of those, it can help developers identify the root cause. For 47%, Causal Testing would require significant computing resources, so the developer would potentially need to leave the plug-in running overnight, or a cloud-based version would need to be developed.

Fig. 1: Screenshots of Amaya’s Eclipse IDE while debugging.

The main contributions of this paper are:

  • Causal Testing, a new technique for identifying root causes of software defects.

  • An enumeration of the different types of static and dynamic causal execution information and an evaluation of which types of information are helpful to developers.

  • Holmes, an open-source, publicly available implementation of Causal Testing as an Eclipse plug-in.

  • An empirical evaluation of Causal Testing’s applicability to real-world defects and effectiveness at helping developers identify the root causes of defects.

  • A replication package of all the artifacts and experiments described in this paper.

The rest of this paper is structured as follows. Section II illustrates how Causal Testing can help developers on a real-world defect. Section III details Causal Testing and Section IV describes Holmes, our prototype implementation. Section V evaluates how useful Holmes is in identifying root causes and Section VI evaluates how applicable Causal Testing is to real-world defects. Finally, Section VII identifies threats to the validity of our studies, Section VIII discusses our findings, Section IX places our work in the context of related research, and Section X summarizes our contributions.

II Motivating Example

Consider Amaya, a developer who regularly contributes to open source projects. Amaya codes primarily in Java and regularly uses the Eclipse IDE and JUnit. Amaya is working on a bug report in the Apache Commons Lang project. The report comes with a failing test (see 1 in Figure 1).

Figure 1 shows Amaya’s IDE as she works on this bug. Amaya runs the test to reproduce the error and JUnit reports that an exception occurred while trying to create the number 0Xfade (see 2 in Figure 1).

Amaya looks through the JUnit failure trace, looking for the place the code threw the exception. The code throw new NumberFormatException(str + "is not a valid number."); (see 3 in Figure 1) threw the exception, but it is not clear why.

Amaya observes that the exception is thrown from within a switch statement, and that there is no case for the e at the end of 0Xfade. She wants to try adding such a case, but realizes she doesn’t know what should happen when the case is executed.

Amaya examines the other switch cases and realizes that each case is making a different kind of number, e.g., the case for l attempts to create a long or BigInteger. Unsure of what number she should be making, she turns to the Internet to see what the code might be expecting. She finds that the decimal value of 0Xfade is 64222. She conjectures that perhaps this number would fit in an int as opposed to a long or BigInteger. She creates a new method call to createInteger() inside of the case for e. Unfortunately, the test still fails.

Amaya next uses the debugger to trace what happens during the test’s execution. She walks through the execution to the NumberFormatException that was thrown at line 545 (see 3 in Figure 1). As the debugger walks through the execution, she sees that there are two other locations the input touches (see 4 and 5 in Figure 1) during execution that could be affecting the outcome. She now realizes that the code on lines 497–545, despite being where the exception was thrown, may not be the location of the defect that caused the test to fail. She is feeling stuck.

But then, Amaya remembers a friend telling her about Holmes, a Causal Testing Eclipse plug-in that helps developers debug. She installs the plug-in and minutes later, Holmes tells her that the code fails on the input 0Xfade, but passes on input 0xfade. The key difference is the lower case x. Amaya looks for where the 0x prefix is handled: line 458 (see 4 in Figure 1). The if statement

458    if (str.startsWith("0x") || str.startsWith("-0x")) {
459        return createInteger(str);
460    }

fails to check for the 0X prefix. Now, armed with the cause of the defect, Amaya turns to the internet to find out the hexadecimal specification and learns that the test is right, 0X and 0x are both valid prefixes for hexadecimal numbers. She augments the if statement and the bug is resolved!
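Concretely, the augmented check might look like the following sketch, which simply accepts the upper-case prefix as well (the exact patch applied in Apache Commons Lang may differ in its details):

if (str.startsWith("0x") || str.startsWith("-0x")
        || str.startsWith("0X") || str.startsWith("-0X")) {  // also accept the upper-case hex prefix
    return createInteger(str);
}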

Impressed by the help she got from Holmes, she contacts her friend who had recommended it to learn more about how it works. Her friend tells her that Holmes implements Causal Testing. Holmes takes the failing test case (or test cases), identifies the inputs, and perturbs the inputs to generate a large number of other possible inputs. For example, Holmes may perturb 0Xfade to 0XFADE, 0xfade, edafX, 0Xfad, Xfade, fade, and many more. It then runs all these tests to find those that pass, and, next, selects from the passing test cases a small number whose inputs are the most similar to the original, failing test cases. It then reports those most-similar passing test cases to help the developer understand the key input difference that makes the test pass. Sometimes, Holmes may find other failing test cases whose inputs are even more similar to the passing ones than the original input, and it would report those too. The idea is to show the smallest difference that causes the behavior to change. In this case, Holmes finds the 0xfade input is the closest that results in a passing test:

assertTrue("createNumber(String) 9b failed", 0xFADE == NumberUtils.createNumber("0xfade").intValue());

Holmes has multiple definitions of what it means for test inputs to be similar. Here, since the inputs are Strings, Holmes uses the Levenshtein distance (the minimum number of single-character edits required to change one String into the other) to measure similarity. Holmes can also use other measures, including dynamic ones, such as differences in the tests’ execution traces. Holmes presents both the static (test input) and dynamic (execution trace) information for the developer to compare the minimally-different passing and failing executions to better understand the root cause of the bug. For example, for this bug, Holmes would show the inputs, 0XFADE and 0xFADE, and the traces of the two executions, showing that the failing test enters a method from within createInteger that the passing test case does not, dictating to Amaya the expected code behavior, leading her to fix the bug.

III Causal Testing

Amaya’s debugging experience is based on what an actual developer did while debugging a real defect in a real-world version of Apache Commons Lang (taken from the Defects4J benchmark [32]). As the example illustrates, real-world software is complex and identifying root causes of program failures is challenging. This section describes our Causal Testing approach to computing and presenting the developer with information that can help identify failures’ root causes.

Figure 2 describes the Causal Testing approach. Given a failing test, causal testing provides developers with causal execution information in the form of minimally different passing and failing tests and traces of their executions.

Fig. 2: Causal Testing computes minimally different tests that, nevertheless, produce different behavior.

III-A Perturbing Test Inputs

Causal Testing produces new tests that are similar to the failing test but exhibit different behavior. To find these new tests, Causal Testing perturbs the original failing tests’ inputs by using a form of fuzzing.

Note that it is also possible to perturb a test oracle. For example, we might change the assertTrue in Figure 2 to assertFalse. However, perturbing test oracles is unlikely to produce meaningful information to guide the developer to the root cause of the bug because a passing test with a perturbed oracle would be passing because the expectation of the behavior changed, not because of a causal experiment showing a change in the input causing a change in the behavior. Instead, Causal Testing focuses on perturbing test inputs. For example, the input 0Xfade can be perturbed as shown in Figure 2.

Perturbing test inputs can be done in three different ways. First, Causal Testing could simply rely on the tests already in the test suite, finding the most similar passing test case. Second, Causal Testing could use automated test generation to generate a large number of test inputs [1, 16, 43] and then filter them to select only those that are close to the original input. Third, we could use test fuzzing to change the original input in small ways to generate inputs only similar to the original input. Fuzz testing is an active research area [18, 19, 24, 29, 58] (although the term fuzz testing is also used to mean simply generating tests [1]). Fuzz testing has been applied in the security domain to stress-test an application and automatically discover vulnerabilities, e.g., [19, 24, 58].

Unsurprisingly, using the existing tests alone is often unlikely to produce a similar-enough test to articulate the root cause. Still, it is worthwhile to consider these tests first, before trying to generate more. Meanwhile, our research led us to the realization that existing fuzz testing techniques, for the most part, generate random inputs. As such, our solution to the challenge of generating similar inputs is to (1) use multiple fuzzers, (2) generate many tests, and (3) filter those tests to select those similar to the original test. As we observed with Holmes, our proof-of-concept Causal Testing tool (described in Section IV), using multiple input fuzzers provided different sets of perturbations, increasing the chances that Holmes found a set of minimally different inputs at least one of which would lead to a non-failing execution.
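To make the perturbation step concrete, the following sketch generates simple single-character perturbations of a String input. It is illustrative only: the class and method names are ours, and it does not reproduce the logic of the Peach and Fuzzer tools Holmes actually uses.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: generate single-character perturbations of a String test input.
// For example, perturb("0Xfade") includes "0xfade" (case change) and "0Xfad" (deletion).
public class InputPerturber {
    private static final String ALPHABET = "0123456789abcdefABCDEFxX";

    public static List<String> perturb(String input) {
        List<String> candidates = new ArrayList<>();
        for (int i = 0; i < input.length(); i++) {
            // Deletion: drop the character at position i.
            candidates.add(input.substring(0, i) + input.substring(i + 1));
            // Substitution: replace the character at position i with each alphabet symbol.
            for (char c : ALPHABET.toCharArray()) {
                if (c != input.charAt(i)) {
                    candidates.add(input.substring(0, i) + c + input.substring(i + 1));
                }
            }
        }
        return candidates;
    }
}

In practice, running several fuzzers, collecting many such candidates, and filtering them by similarity (Section III-B) approximates the process described above.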

III-B Input Similarity

Given two test inputs (with the same oracle), Causal Testing needs to be able to tell how similar they are. Conceptually, to apply the theory of statistical causal inference, the idea behind test input similarity is that two minimally-different test inputs should differ in only one factor. For example, imagine a software system that processes apartment rental applications. If two input applications are identical in every way except one entry, and the software crashes on one application but does not crash on the other, this pair of inputs provides one piece of evidence that the one entry in which the applications differ causes the software to crash. (Other pairs that also differ in only that entry would provide more such evidence.) If the applications differed in multiple entries, it would be harder to know which entry is responsible. Thus, to help developers understand root causes, Causal Testing needs to be able to precisely measure input similarity.

Input similarity can be viewed at different scopes. First, inputs can consist of multiple arguments and thus can agree in some and differ in others of those arguments. Agreeing in more arguments makes inputs more similar. Second, each argument whose values in the two tests differ can differ to varying degrees. A measure of that difference depends on the type of the argument. For arguments of type String, the Levenshtein distance (the minimum number of single-character edits required to change one String into the other) is a reasonable measure, though there are others as well. For numerical arguments, their numerical difference or ratio is a reasonable measure.
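For concreteness, the String measure described above can be computed with the standard dynamic-programming formulation of Levenshtein distance, sketched below (this is the textbook algorithm, not necessarily the exact code Holmes uses).

// Levenshtein distance: the minimum number of single-character insertions,
// deletions, and substitutions needed to turn String a into String b.
public static int levenshtein(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
            d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,     // deletion
                                        d[i][j - 1] + 1),    // insertion
                               d[i - 1][j - 1] + cost);      // substitution
        }
    }
    return d[a.length()][b.length()];
}

For example, levenshtein("0Xfade", "0xfade") is 1 (the single X-to-x substitution), while levenshtein("0Xfade", "fade") is 2.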

The specific semantics of the similarity measure are domain specific. For example, in apartment rental applications, a difference in the address may play a much smaller role than a difference in salary or credit history. How the similarity of each argument is measured, and how the similarities of the different arguments are weighed, are specific to the domain and may require fine tuning by the developer, especially for custom data types (e.g., project-specific Object types). However, we found that relatively simple measures suffice for general debugging and likely work well in many domains. Using Levenshtein distance for Strings, differences for numerical values, and sums of element distances for Arrays works reasonably well in practice for the 330 defects from four different real-world systems we examined from the Defects4J benchmark [32]. In the end, while our measures worked well for us, there is no single right answer for measuring input similarity, and further empirical research is needed to evaluate different measures.
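A sketch of how per-argument distances might be combined into a single input distance follows; the weighting scheme and method name are hypothetical, included only to illustrate the kind of domain-specific tuning described above.

// Hypothetical sketch: combine per-argument distances into one weighted input distance.
// weights[i] lets a developer emphasize domain-important arguments (e.g., salary over
// address in the rental-application example); it reuses the levenshtein helper above.
public static double inputDistance(Object[] a, Object[] b, double[] weights) {
    double total = 0.0;
    for (int i = 0; i < a.length; i++) {
        double d;
        if (a[i] instanceof String && b[i] instanceof String) {
            d = levenshtein((String) a[i], (String) b[i]);      // String arguments
        } else if (a[i] instanceof Number && b[i] instanceof Number) {
            d = Math.abs(((Number) a[i]).doubleValue()
                       - ((Number) b[i]).doubleValue());        // numerical arguments
        } else {
            d = a[i].equals(b[i]) ? 0.0 : 1.0;                  // fallback: equal or not
        }
        total += weights[i] * d;
    }
    return total;
}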

III-C Gathering & Displaying Testing Results

After generating test inputs, Causal Testing ranks them by similarity and executes the tests in that order. This process continues until either a fixed target number of passing tests is found (in our experience, 3 was a good target) or a time-out is reached. This approach enables Causal Testing to produce results for the developer as quickly as possible, while it performs more computation, looking for potentially more results.
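A sketch of this rank-and-execute loop follows. The method, parameters, and time budget are hypothetical (they are not Holmes's actual API), and the levenshtein helper is the one sketched in Section III-B.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: order candidate inputs by similarity to the failing input, run
// them in that order, and stop once `target` passing tests are found or time runs out.
public static List<String> findSimilarPassingInputs(String failingInput,
        List<String> candidates, Predicate<String> testPasses,
        int target, long budgetMillis) {
    candidates.sort(Comparator.comparingInt(
            (String c) -> levenshtein(failingInput, c)));   // most similar first
    List<String> passing = new ArrayList<>();
    long deadline = System.currentTimeMillis() + budgetMillis;
    for (String candidate : candidates) {
        if (passing.size() >= target || System.currentTimeMillis() > deadline) {
            break;
        }
        if (testPasses.test(candidate)) {   // re-run the failing test with the perturbed input
            passing.add(candidate);
        }
    }
    return passing;
}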

With each test execution, Causal Testing collects the inputs used and the execution trace information. This trace includes the methods that execute during the test’s run, the arguments each method executed with, the call sites for those methods, and the methods’ return values.

Execution traces can get large, making parsing them difficult. However, the trace differences between the two similar tests should be small, and thus Causal Testing displays the difference in the traces by default, allowing the developer to expand the traces as necessary.
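As a naive illustration of this default view, the entries unique to each trace could be computed with a simple set difference (a real trace diff would need to handle repeated and reordered calls more carefully):

import java.util.ArrayList;
import java.util.List;

// Naive sketch: keep only the trace entries (e.g., "method(args) -> return value")
// that appear in one execution's trace but not in the other's.
public static List<String> uniqueTo(List<String> trace, List<String> other) {
    List<String> diff = new ArrayList<>(trace);
    diff.removeAll(other);
    return diff;
}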

IV Holmes: A Causal Testing Implementation

We have implemented Holmes, an Eclipse plug-in Causal Testing prototype. Holmes consists of input fuzzers, an edit-distance comparator, a test executor and comparator, and an output view.

Holmes runs in the background, following the speculative analysis approach [9], precomputing potentially useful information silently, until the developer elects to debug a bug. When the developer has an Eclipse project open with a failing test case, Holmes begins fuzzing that test’s input. Holmes uses 2 publicly available fuzzers: Peach [45] and Fuzzer [17].

Next, Holmes orders the generated test inputs by similarity to the original test and executes them, looking for passing tests. As Holmes executes the tests, it uses InTrace [25], an open-source Java tracer to collect the execution trace.

The developer can right-click on the test and select “Run Holmes” from the menu to launch Holmes’ view. Holmes then displays the minimally different executions it has found, with test input and trace information.

V Identifying Root Causes with Causal Testing

We designed a controlled experiment with 16 developers to answer the following research questions:

  • Does causal execution information improve the developers’ ability to identify the root causes of defects?

  • Does causal execution information improve the developers’ ability to repair defects?

  • Do developers find causal execution information useful, and, if so, what causal execution information is most useful?

V-A User Study Design

The goal of our research is to help developers determine the cause of a test failure, thereby helping developers better understand and eliminate defects from their code. We designed our user study to provide evidence of a need for causal testing while also providing a foundation for determining the kind of information that might be useful for causal testing.

We randomly selected 7 defects from Defects4J, from the Apache Commons Lang project. We chose Apache Commons Lang because it (1) is the most widely known project in Defects4J, (2) had defects that required only limited domain knowledge, and (3) can be developed in Eclipse.

Our user study consisted of a training task, and 6 experimental tasks. Each task mapped to 1 of the 7 defects selected. To reduce the effects of ordering bias, we randomized defect order across participants. Each participant completed 3 tasks in the control group and 3 tasks in the experimental group.

For the training task, we provided an Eclipse project with a defective code version and single failing test. We explained how to execute the test suite via JUnit, and how to invoke Holmes. We allowed participants to explore the code and ask questions, telling them that the goal is to make all tests pass.

For each task following the training task, we provided participants with an Eclipse project with a defective code version and a single failing test. The control group tasks did not have access to Holmes, the experimental group tasks did.

We recorded audio and the screen for later analysis. We asked participants to complete a causality questionnaire after each task consisting of two questions: “What caused Test XX to fail?” and “What changes did you make to fix it?”

At the end, the participants completed a post-evaluation survey, asking them to rate JUnit and various aspects of Holmes’ causal execution information on a 4-point Likert scale: “Very useful”, “Somewhat useful”, “Not useful”, and “Misleading or harmful”. We also allowed participants to provide additional open-ended feedback.

V-B Evaluation Pilot

We initially conducted a pilot study with 23 students from an advanced, graduate software engineering course. We now briefly describe the lessons learned that informed our final study design. Our pilot study consisted of 5 tasks and a mock-up version of Holmes. We learned the following lessons:

Experience with outside technology. Many of the participants lacked experience with Java (most preferred Python), Eclipse, and JUnit, significantly hindering the study and resulting in noisy data.

Difficulty separating defect cause from resolution. The participants were often not able to separate understanding the root cause of the defect from what they did to repair it.

User interface issues. The Holmes mock-up inserted Causal Testing information into the code as comments. The participants reported forgetting to look at the comments, unintentionally ignoring Holmes’ output.

Our final study built on these lessons: we required participants to have minimum Java, Eclipse, and JUnit experience, did not require participants to separately find the cause and produce a repair, developed a full Holmes Eclipse plug-in with a separate view in the IDE, and reminded the participants what tools were available to them at the start of each task.

V-C Participants

Participant Industry Exp. Prog. Exp. Java Exp.
P1 none 3 years 3 years
P2 8 months 12 years 8 years
P3 4 months 3.5 years 3 years
P4 none 2 years 2 years
P5 none 3 years 1.5 years
P6 none 3 years 1.5 years
P7 4 months 6 years 6 years
P8 none 3 years 2 years
P9 none 6 years 6 years
P10 1 year 8 years 4 years
P11 30 years 30 years 1 year
P12 3.5 years 9 years 9 years
P13 none 1 year 1 year
P14 3 months 15 years 8 years
P15 15 years 26 years 15 years
P16 none 2 years 2 years
Fig. 3: User study participants and demographics.

We recruited 16 Java developers (Figure 3). Of them, 11 were undergraduate students, 2 were PhD students, 2 were industry developers, and 1 was a research scientist. Participants’ programming experience ranged from 1 to 30 years, with a median of 3.5 years. Experience with Java ranged from 1 to 15 years, with a median of 2.5 years. All participants had prior experience using Eclipse and JUnit.

V-D User Study Findings

RQ1: Root Cause Identification

The primary goal of Holmes is to provide causal execution information to help developers identify the root cause of test failures. To answer RQ1, we analyzed the responses participants gave to the question “What caused Test XX to fail?” We marked responses as either correct or incorrect. We considered a response correct if it captured the full and true cause of the defect, and incorrect if it was missing at least part of the true root cause.

Figure 4 shows the root cause identification correctness results. When using Holmes, developers correctly identified the cause 92% of the time (44 out of 48), but the control group only identified the cause 81% of the time (39 out of 48). Fisher’s exact test reports the probability that the control and experimental groups’ distributions of identifying the causes come from the same underlying distribution as only 11%.
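The reported figure can be reproduced, approximately, with a one-sided Fisher's exact test on this 2×2 table; the sketch below uses the hypergeometric distribution from Apache Commons Math and is our reconstruction under that assumption, not necessarily the authors' exact computation.

import org.apache.commons.math3.distribution.HypergeometricDistribution;

// One-sided Fisher's exact test for the table (Holmes: 44 correct / 4 incorrect;
// Control: 39 correct / 9 incorrect), computed as the hypergeometric probability of
// observing 44 or more correct responses in the Holmes group, given the totals.
public class FisherSketch {
    public static void main(String[] args) {
        HypergeometricDistribution h = new HypergeometricDistribution(
                96,   // population: all 96 responses
                83,   // total correct responses (44 + 39)
                48);  // responses in the Holmes group
        double p = 0.0;
        for (int k = 44; k <= 48; k++) {  // outcomes at least as extreme as the observed 44
            p += h.probability(k);
        }
        System.out.printf("one-sided p = %.2f%n", p);  // approximately 0.11
    }
}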

For four of the six defects (defects 1, 2, 3, and 6), developers using Holmes were more accurate when identifying root causes than the control group. One defect (defect 5) showed 100% accuracy in root cause identification by both groups. For one defect (defect 4), the control group was better at identifying root causes (but only one Holmes user identified the root cause incorrectly).

Defect  Group    Correct    Incorrect  Total
1       Control  10 (91%)   1 (9%)     11
        Holmes    5 (100%)  0 (0%)      5
2       Control   6 (86%)   1 (14%)     7
        Holmes    9 (100%)  0 (0%)      9
3       Control   3 (50%)   3 (50%)     6
        Holmes    8 (80%)   2 (20%)    10
4       Control   7 (100%)  0 (0%)      7
        Holmes    8 (89%)   1 (11%)     9
5       Control   9 (100%)  0 (0%)      9
        Holmes    7 (100%)  0 (0%)      7
6       Control   4 (50%)   4 (50%)     8
        Holmes    7 (88%)   1 (12%)     8
Total   Control  39 (81%)   9 (19%)    48
        Holmes   44 (92%)   4 (8%)     48
Fig. 4: Distribution of correct and incorrect cause descriptions, per defect.

Our findings suggest that causal execution information improves developers’ ability to understand defects’ root causes.

RQ2: Defect Repair

Resolution Time (in minutes)
Defect: 1 2 3 4 5 6
Control 6.9 15.6 10.3 16.2 3.6 14.3
Holmes 7.3 17.6 13.5 14.9 5.5 13.6
Fig. 5: The time developers took to resolve the defects, in minutes.

The secondary goal of Holmes is to improve the developers’ ability to repair defects (though we note that Holmes provides information on the root cause of the defect, which, while likely useful for repairing the defect, is a step removed from that task). To answer RQ2, we analyzed participants’ responses to the question “What changes did you make to fix it?” We used the same evaluation criteria and labeling as for RQ1. To determine if causal execution information improves developers’ ability to debug and repair defects, we observed the time it took participants to complete each task and the correctness of their repairs.

Figure 5 shows the time it took developers to repair defects. We omitted times for flawed repair attempts that do not address the defect. While for Defects 4 and 6, developers resolved defects faster with Holmes than without, for the other defects, the control group was faster. The differences were not statistically significant. One explanation for this observation is that while Holmes helps developers understand the root cause, this does not lead to faster repair. Repairing the defect depends on factors other than understanding the root cause, which can introduce noise in the time measurements.

Correctness
Group Correct Incorrect Total by Group
Control 36 (92%) 3 (8%) 39
Holmes 35 (100%) 0 (0%) 35
Fig. 6: Distribution of correct and incorrect repairs implemented by participants.

Figure 6 shows repair correctness results (omitting several repairs that were either incomplete or made the tests pass without fixing the underlying defect). When using Holmes, developers correctly repaired the defect 100% of the time, while the control group correctly repaired the defect only 92% of the time (36 out of 39).

For four of the six defects, Defects 1, 2, 3, and 5, developers using Holmes repaired the defect correctly more often (Defect 1: 100% vs. 91%; Defect 2: 100% vs. 71%; Defect 3: 60% vs. 50%; Defect 5: 100% vs. 89%). For the other two defects, the control group was more successful (Defect 4: 56% vs. 86%; Defect 6: 38% vs. 50%). Over all the defects, none of the Holmes users ever repaired the defect completely incorrectly (although several did partially incorrectly), while 3 (6%) of the control group did produce incorrect repairs. Still, there was no meaningful trend. Again, as repairing the defect depends on factors other than understanding the root cause, Holmes did not demonstrate an observable advantage in repairing defects.

Our findings suggest that causal execution information sometimes helped developers repair defects, but neither consistently nor statistically significantly.

RQ3: Usefulness of Causal Testing

To answer RQ3, we analyzed post-evaluation survey responses on information most useful when understanding and debugging the defects. We extracted and aggregated qualitative responses regarding information most helpful when determining the cause of and fixing defects. We also analyzed the Likert-scale ratings regarding the usefulness of JUnit and the various components of causal execution information.

Overall, when compared to the state-of-the-art testing tool JUnit, participants considered Holmes more useful. Out of 16 participants, 13 found causal execution information either more useful than JUnit (8 participants) or at least as useful (5 participants), while 3 participants found JUnit information more useful. JUnit and interactive debuggers are an important part of debugging, so our expectation is that Causal Testing would augment those tools, not replace them.

Of the information provided by Holmes, our findings suggest participants found the most value in the minimally different passing tests, followed by the execution traces. The test inputs for passing and failing tests that Holmes provided received “Very useful” or “Somewhat useful” rankings more often than the test execution traces. For 10 participants, the passing tests were at least as useful as, and sometimes more useful than, the additional failing tests provided. For 6 participants, the passing and failing execution traces were at least as useful as the passing and failing tests. For 2 of those participants, the execution traces were more helpful than the test inputs.

To gain a better understanding of what parts of causal execution information are most useful, we also analyzed participants’ qualitative responses regarding what they found most helpful for understanding and repairing the defects in the study.

For cause identification, 9 participants explicitly mentioned some aspect of Holmes as being most helpful. For 1 of these participants, all the causal execution information provided by Holmes was most helpful for cause identification. Of these participants, 3 noted that the similar passing and failing tests were most helpful of the causal execution information provided. The other 5 participants stated the execution traces were most helpful.

For defect resolution, 8 participants mentioned some aspect of Holmes as being most helpful. Of those participants, 4 stated that the similar passing tests specifically were most helpful of the information provided by Holmes. For the other 4, the execution trace was most helpful for resolving the defect. A participant in this group specifically mentioned that the return values provided in Holmes’ execution traces were most helpful.

In the free-form comments, participants found the passing and failing tests and traces useful for their tasks; 4 participants explicitly mentioned during their session that having the additional passing and failing tests was “super useful” and saved them time and effort in understanding and debugging the defect. However, one suggestion for improvement was to make the differences between the passing and failing tests more explicit. Two participants explicitly suggested, rather than bolding the entire fuzzed input, only bolding the character(s) that differ from the original.

Interestingly, participants more often found the execution traces useful for understanding and debugging than the passing and failing tests, which is reflected in the post-evaluation responses. Despite this fact, participants still mentioned ways the traces provided by Holmes could be improved. One issue participants brought up regarding the traces was that, when an exception was thrown, the trace did not explicitly say so. Since JUnit provides this information, many participants went to JUnit’s failure trace when they wanted to locate where exactly an exception was thrown.

Research has shown that prior experience and knowledge can affect how developers internalize and use the information provided by tools [27]. For 5 participants, their ability to succeed at understanding and resolving the defects was a product of their prior experience, or lack thereof, with the concepts relevant to the defect or the tooling used. For 4 of those participants, their familiarity with JUnit and lack thereof with Holmes led to them using JUnit more often; one participant noted they always forgot to go back to the output provided by Holmes, though in hindsight, it would have been useful. The other participant cited her extensive experience with Java and number theory as what helped her the most, followed by the information provided by Holmes.

Our findings suggest that the causal execution information provided by Holmes is useful for both cause identification and defect resolution, and while it cannot replace JUnit and an interactive debugger, it effectively complements those tools.

VI Causal Testing Applicability to Real-World Defects

To evaluate the usefulness and applicability of causal testing to real-world defects, we conducted an evaluation on the Defects4J benchmark [32]. Defects4J is a collection of reproducible defects found in real-world, open-source Java software projects: Apache Commons Lang, Apache Commons Math, Closure compiler, JFreeChart, and Joda-Time. For each defect, Defects4J provides a defective and a repaired version of the source code, along with the developer-written test suites that exhibit the defect.

We manually went through 330 defects in four of the five projects in the Defects4J benchmark and categorized them based on whether causal testing would work and whether it would be useful in identifying the root cause of the defect. We excluded Joda-Time from our analysis because of difficulty reproducing the defects. (Some of these difficulties have been documented in the Joda-Time issue tracker: https://github.com/dlew/joda-time-android/issues/37.)

VI-A Evaluation Process

When possible, we executed Holmes on the defect to aid our analysis; however, we also accounted for limitations in Holmes that prevent it from working on some defects on which an improved version would work. Our goal was to classify when Causal Testing could work to identify the root cause, not when a particular prototype implementation could.

To categorize the defects, we: (1) Imported the defective and fixed versions of the code into Eclipse. (2) Executed the test suite on the defective version to identify the failing test(s) and the methods it tested. (3) Determined the input(s) to the test(s). (4) Either using Holmes, or manually, worked to perturb the test to produce a reasonably similar passing test likely to capture the root cause of the defect. (5) Examined the developer-written patch to help understand the root cause and to ensure the produced tests captured it. And, finally, (6) categorized the defect into one of five defect categories.

VI-B Defect Applicability Categories

We categorized each defect into the following categories:

  I. Works, useful, and fast. For these defects, Causal Testing can produce at least one minimally different passing test that captures the defect's root cause, and we reason Causal Testing would be helpful to developers. In our estimate, the difference between the failing and the minimally different passing tests is small enough that it can be found on a reasonable personal computer, reasonably fast.

  II. Works, useful, but slow. For these defects, Causal Testing again can produce at least one minimally different passing test that captures the root cause, and this would be helpful to developers. However, the difference between the tests is large, and, in our estimation, Causal Testing would need additional computational resources, such as running overnight or access to cloud computing.

  III. Produces a minimally different passing test, but is not useful. For these defects, Causal Testing again can produce at least one minimally different passing test, but in our estimation, this test would not be useful for understanding the root cause of the defect.

  IV. Will not work. For these defects, Causal Testing would not be able to perturb the tests, and would tell the developer right away that it cannot help.

  V. We could not make a determination. Understanding the root cause of these defects required project domain knowledge that we lacked, so we opted not to estimate whether Causal Testing would work.

VI-C Results

           Category
Project    I   II  III   IV    V  Total
Math      14   15   11   20   46    106
Lang      11    6    3   14   31     65
Chart      2    4    1    1   18     26
Closure    2   22    8    5   96    133
Total     29   47   23   40  191    330
Fig. 7: Distribution of defects across five applicability categories:
 I: Works, useful, and fast.
 II: Works, useful, but slow.
 III: Produces minimally different passing test, but is not useful.
 IV: Will not work.
 V: We could not make a determination.

Figure 7 shows our defect classification results. Of the 330 defects, we could make a determination for 139 of them. Of these, Causal Testing could promise the developer to try to help for 99 (71%). For the remaining 40 (29%), Causal Testing would simply say it cannot help and would not waste the developer’s time. Of these 99 defects, for 76 (77%), Causal Testing can produce information helpful for identifying the root cause: for 29 (29%), a simple local IDE-based tool would work, and for 47 (47%), a tool would need more substantial resources, such as running overnight or on the cloud. The remaining 23 (23%) of the defects would produce information not helpful to the developer.

Our findings suggest that Causal Testing can try to apply to 71% of real-world defects, and for 77% of those, it can help developers identify the root cause of the defect.

VII Threats to Validity

External Validity. Our studies used Defects4J defects, which may not generalize. We mitigated this threat by using a well-known and widely-used benchmark of real-world defects. We selected defects for the user study randomly from those that worked with our current implementation of Holmes and that required little or no prior project or domain knowledge, with varying levels of difficulty. The applicability evaluation considered defects across four projects.

The user study used 16 participants, which is within the range shown to provide reasonable data confidence and on par with similar user studies [15, 53, 5, 41]. We further mitigated this threat by recruiting participants with different backgrounds and experience.

Internal Validity. Our user study participants were volunteers. This leads to the potential for self-selection bias. However, we were able to recruit a diverse set of participants, somewhat mitigating this threat.

Construct Validity. We performed a manual analysis of whether causal testing would apply and be useful. This leads to the potential for researcher bias. We minimized the potential for this threat by developing and following concrete, reproducible methodology and criteria for usefulness.

The user study asked participants to understand and debug code they had not written. We mitigate this threat by selecting defects for the study that required the least project and domain knowledge. Additionally, we did not disclose the true purpose of the user study to the subjects until the end of each participant’s session.

VIII Discussion

Our findings suggest that Causal Testing can be useful for understanding root causes of defects. Our user study found that having passing and failing tests that exemplify expected and unexpected behavior is especially useful for understanding, and even debugging, software defects. This suggests that an important aspect of debugging is being able to identify expected behavior when software is behaving unexpectedly.

Causal Testing supplements existing debugging tools, such as JUnit. Understandably, participants sometimes found themselves needing information that Holmes does not provide, especially once they understood the root cause and needed to repair the defect. Our findings suggest that Causal Testing is most useful for root cause identification. Still, more than half of the participants in our study found Holmes useful for both cause identification and defect repair, despite sometimes taking longer to resolve defects with Holmes. We speculate that increased familiarity with Holmes may further help developers use the right tool at the right time, improving debugging efficiency with Holmes, as supported by prior studies [27].

Execution traces can be useful for finding the location of a defect [12], and our study has shown that such traces can also be useful for understanding and repairing defects. Participants in our study found comparing execution traces was useful for understanding why the test was failing and how the test should behave differently. For some participants, the execution trace information was the most useful of all.

IX Related Work

Like Causal Testing, Delta debugging [65, 66] aims to help developers understand the cause of a set of failing tests. Given a failing test, the underlying ddmin algorithm minimizes that test’s input such that removing any remaining piece of the input makes the test pass [23]. Delta debugging can also be applied to a set of test-breaking code changes to minimize that set, although in that scenario, multiple subsets that cannot be reduced further are possible because of interactions between code changes [55, 66]. By contrast, Causal Testing does not minimize an input or a set of changes, but rather produces other inputs (not necessarily smaller) that differ minimally but cause relevant behavioral changes. The two techniques are likely complementary in helping developers debug.

When applied to code changes, delta debugging requires a correct code version and a set of changes that make it fail. Iterative delta debugging lifts the need for the correct version, using the version history and traditional delta debugging to produce a correct version [3]. Again, Causal Testing is complementary, though future work could extend Causal Testing to consider the development history to guide fuzzing.

Fault localization (also known as automated debugging) is concerned with locating the line or lines of code responsible for a failing test case [2, 28, 63]. Spectral fault localization uses the frequency with which each code line executes on failing and passing test cases to identify the suspicious lines [28, 13]. Accounting for redundancy in test suites can improve fault localization precision [21, 22]. MIMIC can also improve fault localization precision by synthesizing additional passing and failing executions [67], and Apollo can do so by generating tests to maximize path constraint similarity [4]. Unfortunately, research has shown that showing developers even the ground-truth fault localization (as well as the output of state-of-the-art fault localization techniques) does not improve their ability to repair defects [44], likely because understanding defect causes requires understanding more code than just the lines that need to be edited to repair it. By contrast, Causal Testing attempts to illustrate the changes to software inputs that cause the behavioral differences, and a controlled experiment has shown promise that Causal Testing positively affects the developers’ ability to understand defect causes.

Fault localization is used heavily in automated program repair, e.g., [62, 61, 60, 47, 48, 34, 57, 42, 50, 37, 38, 33, 51, 35, 36], where helping humans understand behavior is not required. Causal Testing does not attempt automated repair; however, an interesting future research direction is to explore whether Causal Testing information can help generate patches, or improve patch quality [52] or repair applicability [40].

Mutation testing targets a different problem than Causal Testing, and the approaches differ significantly. Mutation testing mutates the source code to evaluate the quality of a test suite [31, 30]. Causal Testing does not mutate source code (it perturbs test inputs) and helps developers identify root causes of defects, rather than improving test suites (although it does generate new tests.) In a special case of Causal Testing, when the defect being analyzed is in software whose input is a program (e.g., a compiler), Causal Testing may rely on code mutation operators to perturb the inputs.

Reproducing field failures [26] is an important part of debugging complementary to most of the above-described techniques, including Causal Testing, as these techniques typically start with a failing test case. Field failures often tell more about software behavior than in-house testing [59].

Fuzz testing is the process of changing existing tests to generate more tests [18, 19] (though, in industry, fuzz testing is often synonymous with automated test input generation). Fuzz testing has been used most often to identify security vulnerabilities [19, 58]. Fuzzing can be white-box, relying on the source code [19], or black-box, relying only on the specification or input schema [29, 58]. Causal Testing uses fuzz testing, and improvements in fuzz testing research can directly benefit Causal Testing by helping it find similar test inputs that lead to different behavior. Fuzzing can be used on complex inputs, such as programs [24], which is necessary to apply Causal Testing to software with such inputs (as is the case for, for example, Closure, one of the subject programs we have studied). Fuzz testing by itself does not provide the developer with information to help understand defects’ root causes, though the failing test cases it generates can certainly serve as a starting point.

The central goal of automated test generation (e.g., EvoSuite [16], and Randoop [43]) and test fuzzing is finding new failing test cases. For example, combining fuzz testing, delta debugging, and traditional testing can identify new defects, e.g., in SMT solvers [8]. Automated test generation and fuzzing typically generate test inputs, which can serve as regression tests [16] or require humans to write test oracles. Without such oracles, one cannot know if the tests pass or fail. Recent work on automatically extracting test oracles from code comments can help [6, 20, 56]. Differential testing can also produce oracles by comparing the executions of the same inputs on multiple implementations of the same specification [7, 11, 14, 49, 54, 64]. Identifying defects by producing failing tests is the precursor to Causal Testing, which uses a failing test to help developers understand the defects’ root cause.

X Contributions

We have presented Causal Testing, a novel method for identifying root causes of software defects. Causal Testing is applicable to 71% of real-world defects, and for 77% of those, it can help developers identify the root cause of the defect. Developers using Holmes were 11% (92% vs. 81%) more likely to correctly identify root causes than without Holmes. Overall, Causal Testing shows promise for improving the debugging process.

Acknowledgment

This work is supported by the National Science Foundation under grants no. CCF-1453474, IIS-1453543, and CCF-1763423.

References

  • [1] American fuzzy lop. http://lcamtuf.coredump.cx/afl/, 2018.
  • [2] Hiralal Agrawal, Joseph R. Horgan, Saul London, and W. Eric Wong. Fault localization using execution slices and dataflow tests. In International Symposium on Software Reliability Engineering (ISSRE), pages 143–151, 1995.
  • [3] Cyrille Artho. Iterative delta debugging. International Journal on Software Tools for Technology Transfer, 13(3):223–246, 2011.
  • [4] Shay Artzi, Julian Dolby, Frank Tip, and Marco Pistoia. Directed test generation for effective fault localization. In International Symposium on Software Testing and Analysis (ISSTA), pages 49–60, 2010.
  • [5] Titus Barik, Yoonki Song, Brittany Johnson, and Emerson Murphy-Hill. From quick fixes to slow fixes: Reimagining static analysis resolutions to enable design space exploration. In Proceedings of the International Conference on Software Maintenance and Evolution (ICSME), pages 211–221, Raleigh, NC, USA, 2016.
  • [6] Arianna Blasi, Alberto Goffi, Konstantin Kuznetsov, Alessandra Gorla, Michael D. Ernst, Mauro Pezzè, and Sergio Delgado Castellanos. Translating code comments to procedure specifications. In International Symposium on Software Testing and Analysis (ISSTA), pages 242–253, Amsterdam, Netherlands, 2018.
  • [7] Chad Brubaker, Suman Jana, Baishakhi Ray, Sarfraz Khurshid, and Vitaly Shmatikov. Using frankencerts for automated adversarial testing of certificate validation in SSL/TLS implementations. In IEEE Symposium on Security and Privacy (S&P), pages 114–129, 2014.
  • [8] Robert Brummayer and Armin Biere. Fuzzing and delta-debugging SMT solvers. In International Workshop on Satisfiability Modulo Theories (SMT), pages 1–5, 2009.
  • [9] Yuriy Brun, Reid Holmes, Michael D. Ernst, and David Notkin. Speculative analysis: Exploring future states of software. In Future of Software Engineering Research (FoSER), pages 59–63, November 2010.
  • [10] Brian Burg, Richard Bailey, Andrew J. Ko, and Michael D. Ernst. Interactive record/replay for web application debugging. In ACM Symposium on User Interface Software and Technology (UIST), pages 473–484, St. Andrews, UK, October 2013.
  • [11] Yuting Chen and Zhendong Su. Guided differential testing of certificate validation in SSL/TLS implementations. In European Software Engineering Conference and ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE), pages 793–804, Bergamo, Italy, 2015.
  • [12] Valentin Dallmeier, Christian Lindig, and Andreas Zeller. Lightweight defect localization for Java. In European Conference on Object Oriented Programming (ECOOP), pages 528–550, Glasgow, UK, 2005.
  • [13] Higor Amario de Souza, Marcos Lordello Chaim, and Fabio Kon. Spectrum-based software fault localization: A survey of techniques, advances, and challenges. CoRR, abs/1607.04347, 2016.
  • [14] Robert B. Evans and Alberto Savoia. Differential testing: A new approach to change detection. In European Software Engineering Conference and ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE) Poster track, pages 549–552, Dubrovnik, Croatia, 2007.
  • [15] Laura Faulkner. Beyond the five-user assumption: Benefits of increased sample sizes in usability testing. Behavior Research Methods, Instruments, & Computers, 35(3):379–383, 2003.
  • [16] Gordon Fraser and Andrea Arcuri. Whole test suite generation. IEEE Transactions on Software Engineering (TSE), 39(2):276–291, February 2013.
  • [17] Fuzzer. https://github.com/mapbox/fuzzer, 2018.
  • [18] Patrice Godefroid. Random testing for security: Blackbox vs. whitebox fuzzing. In International Workshop on Random testing (RT), page 1, 2007.
  • [19] Patrice Godefroid, Michael Y. Levin, and David A. Molnar. Automated whitebox fuzz testing. In Network and Distributed System Security Symposium (NDSS), pages 151–166, 2008.
  • [20] Alberto Goffi, Alessandra Gorla, Michael D. Ernst, and Mauro Pezzè. Automatic generation of oracles for exceptional behaviors. In International Symposium on Software Testing and Analysis (ISSTA), pages 213–224, Saarbrücken, Genmany, July 2016.
  • [21] Dan Hao, Ying Pan, Lu Zhang, Wei Zhao, Hong Mei, and Jiasu Sun. A similarity-aware approach to testing based fault localization. In IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 291–294, Long Beach, CA, USA, 2005.
  • [22] Dan Hao, Lu Zhang, Hao Zhong, Hong Mei, and Jiasu Sun. Eliminating harmful redundancy for testing-based fault localization using test suite reduction: An experimental study. In IEEE International Conference on Software Maintenance (ICSM), pages 683–686, 2005.
  • [23] Ralf Hildebrandt and Andreas Zeller. Simplifying failure-inducing input. In International Symposium on Software Testing and Analysis (ISSTA), pages 135–145, 2000.
  • [24] Christian Holler, Kim Herzig, and Andreas Zeller. Fuzzing with code fragments. In USENIX Security Symposium, pages 445–458, Bellevue, WA, USA, 2012.
  • [25] Intrace. https://mchr3k.github.io/org.intrace, 2018.
  • [26] Wei Jin and Alessandro Orso. BugRedux: Reproducing field failures for in-house debugging. In ACM/IEEE International Conference on Software Engineering (ICSE), ICSE ’12, pages 474–484, Zurich, Switzerland, 2012.
  • [27] Brittany Johnson, Rahul Pandita, Justin Smith, Denae Ford, Sarah Elder, Emerson Murphy-Hill, Sarah Heckman, and Caitlin Sadowski. A cross-tool communication study on program analysis tool notifications. In ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE), pages 73–84, 2016.
  • [28] James A. Jones, Mary Jean Harrold, and John Stasko. Visualization of test information to assist fault localization. In International Conference on Software Engineering (ICSE), pages 467–477, Orlando, FL, USA, 2002.
  • [29] Jaeyeon Jung, Anmol Sheth, Ben Greenstein, David Wetherall, Gabriel Maganis, and Tadayoshi Kohno. Privacy oracle: A system for finding application leaks with black box differential testing. In ACM Conference on Computer and Communications Security (CCS), pages 279–288, Alexandria, VA, USA, 2008.
  • [30] René Just. The Major mutation framework: Efficient and scalable mutation analysis for Java. In International Symposium on Software Testing and Analysis (ISSTA), pages 433–436, San Jose, CA, USA, July 2014.
  • [31] René Just, Michael D. Ernst, and Gordon Fraser. Efficient mutation analysis by propagating and partitioning infected execution states. In International Symposium on Software Testing and Analysis (ISSTA), pages 315–326, San Jose, CA, USA, July 2014.
  • [32] René Just, Darioush Jalali, and Michael D. Ernst. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In International Symposium on Software Testing and Analysis (ISSTA), pages 437–440, San Jose, CA, USA, July 2014.
  • [33] Yalin Ke, Kathryn T. Stolee, Claire Le Goues, and Yuriy Brun. Repairing programs with semantic code search. In IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 295–306, Lincoln, NE, USA, November 2015.
  • [34] Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. Automatic patch generation learned from human-written patches. In ACM/IEEE International Conference on Software Engineering (ICSE), pages 802–811, San Francisco, CA, USA, 2013.
  • [35] Fan Long and Martin Rinard. Staged program repair with condition synthesis. In European Software Engineering Conference and ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE), pages 166–178, Bergamo, Italy, 2015.
  • [36] Fan Long and Martin Rinard. Automatic patch generation by learning correct code. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), pages 298–312, St. Petersburg, FL, USA, 2016.
  • [37] Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. DirectFix: Looking for simple program repairs. In International Conference on Software Engineering (ICSE), Florence, Italy, May 2015.
  • [38] Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. Angelix: Scalable multiline program patch synthesis via symbolic analysis. In International Conference on Software Engineering (ICSE), Austin, TX, USA, May 2016.
  • [39] Tim Menzies, Jeremy Greenwald, and Art Frank. Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 33(1):2–13, January 2007.
  • [40] Manish Motwani, Sandhya Sankaranarayanan, René Just, and Yuriy Brun. Do automated program repair techniques repair hard and important bugs? Empirical Software Engineering (EMSE), 23(5):2901–2947, October 2018.
  • [41] Kıvanç Muşlu, Yuriy Brun, Michael D. Ernst, and David Notkin. Reducing feedback delay of software development tools via continuous analyses. IEEE Transactions on Software Engineering (TSE), 41(8):745–763, August 2015.
  • [42] Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. SemFix: Program repair via semantic analysis. In ACM/IEEE International Conference on Software Engineering (ICSE), pages 772–781, San Francisco, CA, USA, 2013.
  • [43] Carlos Pacheco and Michael D. Ernst. Randoop: Feedback-directed random testing for Java. In Conference on Object-oriented Programming Systems and Applications (OOPSLA), pages 815–816, Montreal, QC, Canada, 2007.
  • [44] Chris Parnin and Alessandro Orso. Are automated debugging techniques actually helping programmers? In International Symposium on Software Testing and Analysis (ISSTA), pages 199–209, Toronto, ON, Canada, 2011.
  • [45] Peach. https://github.com/MozillaSecurity/peach, 2018.
  • [46] Judea Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.
  • [47] Yu Pei, Carlo A. Furia, Martin Nordio, Yi Wei, Bertrand Meyer, and Andreas Zeller. Automated fixing of programs with contracts. IEEE Transactions on Software Engineering (TSE), 40(5):427–449, 2014.
  • [48] Zichao Qi, Fan Long, Sara Achour, and Martin Rinard. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In International Symposium on Software Testing and Analysis (ISSTA), pages 24–36, Baltimore, MD, USA, 2015.
  • [49] Vipin Samar and Sangeeta Patni. Differential testing for variational analyses: Experience from developing KConfigReader. CoRR, abs/1706.09357, 2017.
  • [50] Stelios Sidiroglou and Angelos D. Keromytis. Countering network worms through automatic patch generation. IEEE Security and Privacy, 3(6):41–49, November 2005.
  • [51] Stelios Sidiroglou-Douskos, Eric Lahtinen, Fan Long, and Martin Rinard. Automatic error elimination by horizontal code transfer across multiple applications. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 43–54, Portland, OR, USA, 2015.
  • [52] Edward K. Smith, Earl Barr, Claire Le Goues, and Yuriy Brun. Is the cure worse than the disease? Overfitting in automated program repair. In European Software Engineering Conference and ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE), pages 532–543, Bergamo, Italy, September 2015.
  • [53] Justin Smith, Brittany Johnson, Emerson Murphy-Hill, Bill Chu, and Heather Richter Lipford. Questions developers ask while diagnosing potential security vulnerabilities with static analysis. In European Software Engineering Conference and ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE), pages 248–259, Bergamo, Italy, 2015.
  • [54] Varun Srivastava, Michael D. Bond, Kathryn S. McKinley, and Vitaly Shmatikov. A security policy oracle: Detecting security holes using multiple API implementations. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 343–354, San Jose, CA, USA, 2011.
  • [55] Roykrong Sukkerd, Ivan Beschastnikh, Jochen Wuttke, Sai Zhang, and Yuriy Brun. Understanding regression failures through test-passing and test-failing code changes. In International Conference on Software Engineering New Ideas and Emerging Results Track (ICSE NIER), pages 1177–1180, San Francisco, CA, USA, May 2013.
  • [56] Shin Hwei Tan, Darko Marinov, Lin Tan, and Gary T. Leavens. @tComment: Testing Javadoc comments to detect comment-code inconsistencies. In International Conference on Software Testing, Verification, and Validation (ICST), pages 260–269, Montreal, QC, Canada, 2012.
  • [57] Shin Hwei Tan and Abhik Roychoudhury. relifix: Automated repair of software regressions. In International Conference on Software Engineering (ICSE), Florence, Italy, 2015.
  • [58] Robert J. Walls, Yuriy Brun, Marc Liberatore, and Brian Neil Levine. Discovering specification violations in networked software systems. In International Symposium on Software Reliability Engineering (ISSRE), pages 496–506, Gaithersburg, MD, USA, November 2015.
  • [59] Qianqian Wang, Yuriy Brun, and Alessandro Orso. Behavioral execution comparison: Are tests representative of field behavior? In International Conference on Software Testing, Verification, and Validation (ICST), pages 321–332, Tokyo, Japan, March 2017.
  • [60] Yi Wei, Yu Pei, Carlo A. Furia, Lucas S. Silva, Stefan Buchholz, Bertrand Meyer, and Andreas Zeller. Automated fixing of programs with contracts. In International Symposium on Software Testing and Analysis (ISSTA), pages 61–72, Trento, Italy, 2010.
  • [61] Westley Weimer, Zachary P. Fry, and Stephanie Forrest. Leveraging program equivalence for adaptive program repair: Models and first results. In IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 356–366, Palo Alto, CA, USA, 2013.
  • [62] Westley Weimer, ThanhVu Nguyen, Claire Le Goues, and Stephanie Forrest. Automatically finding patches using genetic programming. In ACM/IEEE International Conference on Software Engineering (ICSE), pages 364–374, Vancouver, BC, Canada, 2009.
  • [63] W. Eric Wong, Vidroha Debroy, and Byoungju Choi. A family of code coverage-based heuristics for effective fault localization. Journal of Systems and Software (JSS), 83(2):188–208, 2010.
  • [64] Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. Finding and understanding bugs in C compilers. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 283–294, San Jose, CA, USA, 2011.
  • [65] Andreas Zeller. Yesterday, my program worked. Today, it does not. Why? In European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), pages 253–267, Toulouse, France, 1999.
  • [66] Andreas Zeller and Ralf Hildebrandt. Simplifying and isolating failure-inducing input. IEEE Transactions on Software Engineering, 28(2):183–200, February 2002.
  • [67] Daniele Zuddas, Wei Jin, Fabrizio Pastore, Leonardo Mariani, and Alessandro Orso. MIMIC: Locating and understanding bugs by analyzing mimicked executions. In ACM/IEEE International Conference on Software Engineering (ICSE), pages 815–826, 2014.