Automated Fix Detection Given Flaky Tests

by David Landsberg et al.

Research Proposal in Automated Fix Detection



1. Introduction

Developers ignore tools that they believe waste their time, which hampers the adoption of verification and validation (V&V) tools in general. Automatic V&V will not become ubiquitous until we can measure its value by answering the question "How many of the bugs it reports do developers fix?" The underlying problem is determining whether a fix has actually occurred: the automated fix detection problem (FDP). Any solution is expected to be a function of a failure's symptoms, such as stack traces and user/test reports. At Facebook, which develops software using continuous integration and deployment in conjunction with automatic V&V, the need to solve this "largely overlooked" problem is especially acute (Alshahwan et al., 2018). Alshahwan et al. decompose FDP into two subproblems: failure grouping, which associates groups of failures with the methods that generate them, and proving a negative, which determines when we can be confident that failures will not recur (i.e. that a fix has succeeded).

We take up this challenge: To group failures, we use methods of causal inference to assign each failure a root cause (Section 2). To prove a negative, we apply statistical changepoint detection methods to detect when a fix has succeeded in the presence of flaky tests (Section 3). Combined, these offer a novel solution to the fix detection problem that is at once scalable and integrable into Facebook's development process (Section 4).

2. Grouping Failures

The failure grouping problem (FGP) is that of grouping failures to their likely causes (here assumed to be methods). Being able to tell which failures a method causes is key to being able to tell whether it is fixed. Thus far, Alshahwan et al. use method identifiers (located at the top of stack traces) as the heuristic for grouping. However, they propose that this solution could be improved by applying techniques of causal inference. They write: "there has been much recent progress on causal inference (Pearl, 2000) … Therefore, the opportunity seems ripe for the further development and exploitation of causal analysis as one technique for informing and understanding fix detection" (Alshahwan et al., 2018).

We take up Alshahwan et al.'s challenge. We begin our development with the probabilistic measure of causality due to Pearl (Pearl, 2009; Pearl et al., 2016). We pick this particular theory because (as we shall see) there are simple and low-cost ways to estimate the value of the formula, and it opens the window to a number of different (potentially better) theories of causality. Here, C is a cause of the event E when the following obtains:

P(E | do(C)) > P(E | do(¬C))     (1)

The intuition is that causes raise the probability of their effects. Applied to FGP, we parse Equation 1 as follows: P(E | do(C)) reads "the probability of E given do(C)", E is the event of a failure, and C is the introduction of a given patch into the given codebase. The do(·) operation represents an external intervention that compels C to obtain, whilst holding certain background factors fixed (in our case this is the rest of the codebase; see Pearl for technical details (Pearl, 2009)). Intuitively then, P(E | do(C)) measures the probability that a failure occurs upon the introduction of a given patch. Accordingly, Equation 1 says that a patch is a cause of the failure if the likelihood of the failure would have decreased had the patch not been introduced into the program.

A major question for our research is how to estimate P(E | do(C)) and P(E | do(¬C)). As a starting point, we envisage conducting a controlled experiment. Here, we assume i) we have a program together with its updated version, ii) that the updated version differs from the original only by a patch, iii) that there is only one bug in the codebase, iv) that a fix for the bug repairs the method, and v) that there is a test available which can be run on both versions a given number of times (in real-world testing scenarios we will not have to make all of these assumptions; see Section 4). Here, we propose that P(E | do(C)) is estimated by the proportion of times the test results in failure in the updated version, and P(E | do(¬C)) by the proportion of times the test results in failure in the non-updated version. Note that the estimated probabilities might assume values anywhere in the interval [0, 1], depending on the presence of noise, indeterminism, flaky tests, and the degree of unspecified behaviour. Accordingly, if Equation 1 holds, we say the method causes the given failure in that update for that test, thereby grouping the failure to the associated method as its cause.
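As a minimal sketch of this estimation procedure (with hypothetical flaky tests standing in for real test runs), the two interventional probabilities can be estimated by rerunning the test on each version and then checking Equation 1:

```python
import random

def estimate_failure_prob(run_test, n_runs=1000):
    """Estimate a failure probability as the proportion of failing runs."""
    failures = sum(1 for _ in range(n_runs) if run_test() == "fail")
    return failures / n_runs

def patch_causes_failure(test_on_patched, test_on_original, n_runs=1000):
    """Equation 1: the patch counts as a cause of the failure iff
    P(E | do(C)) > P(E | do(not C)), estimated by rerunning the test
    on the patched and original versions respectively."""
    p_do = estimate_failure_prob(test_on_patched, n_runs)
    p_do_not = estimate_failure_prob(test_on_original, n_runs)
    return p_do > p_do_not, p_do, p_do_not

# Hypothetical flaky test: the patched version fails ~30% of the time,
# the original ~5% (background flakiness only).
random.seed(0)
test_on_patched = lambda: "fail" if random.random() < 0.30 else "pass"
test_on_original = lambda: "fail" if random.random() < 0.05 else "pass"
caused, p_do, p_do_not = patch_causes_failure(test_on_patched, test_on_original)
```

With enough reruns the sample proportions concentrate around the true failure rates, so flakiness is absorbed into the estimates rather than misread as a regression.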

Pearl's theory is not enough on its own. It is not guaranteed to handle what Alshahwan et al. call false grouping (Alshahwan et al., 2018). Accordingly, Equation 1 may include too many (or too few) causes in practice. To investigate this, we propose experimenting with different measures for the degree of causality (which in our context may be said to measure error-causing degree), such as the probability of necessity (PN) and the probability of sufficiency (PS) (Pearl, 2009), and saying causality obtains when the value given by the measure is over a given bound. Previous research has confirmed that different measures of causality perform very differently (Landsberg et al., 2015), suggesting a requirement to experiment with many different measures from the literature on A.I., fault localisation, and philosophy of science, of which there are hundreds (Landsberg et al., 2015).
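Two simple candidate measures of error-causing degree, with a threshold rule for grouping, could be sketched as follows (the difference and ratio measures are illustrative choices, not the specific measures to be evaluated):

```python
def prob_difference(p_do, p_do_not):
    # Difference measure of error-causing degree.
    return p_do - p_do_not

def prob_ratio(p_do, p_do_not, eps=1e-9):
    # Relative-risk-style ratio measure (eps avoids division by zero).
    return p_do / (p_do_not + eps)

def groups_failure(measure, p_do, p_do_not, bound):
    """Group the failure to the patch only when the chosen measure of
    error-causing degree exceeds the given bound."""
    return measure(p_do, p_do_not) > bound

strong = groups_failure(prob_difference, 0.30, 0.05, bound=0.2)  # clear raise
weak = groups_failure(prob_difference, 0.08, 0.05, bound=0.2)    # within noise
```

Different measures and bounds will group differently on the same data, which is exactly why an empirical comparison is needed.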

3. Proving a Negative

Alshahwan et al. ask the following: "how long should we wait, while continually observing no re-occurrence of a failure (in testing or production) before we claim that the root cause(s) have been fixed?" (Alshahwan et al., 2018) Here, we assume the root cause(s) of a failure have been estimated by the work of Section 2. The famous proving a negative problem rears its head here: how can we prove a negative (no more failures) in the absence of direct evidence to the contrary? Alshahwan et al. state that identifying the correct fix detection protocol (Alshahwan et al., 2018) provides the solution, and experiment with their own protocol within the Sapienz team at Facebook. Their protocol uses heuristics and a finite state machine, but they emphasise that they "do not claim it is the only possible protocol, nor that it is best among alternatives". Accordingly, in this section we propose an alternative.

We begin our development by answering Alshahwan et al.'s question above directly: we wait until we can claim a fix has occurred, i.e. when the error-causing behaviour of the method has diminished. Our answer is made precise as follows. We let the error-causing behaviour of a given method be a time series T = ⟨x_1, …, x_n⟩, where each datapoint x_i is an error-causing degree for a given failure group (as per Section 2) over a given period. Let T1 = ⟨x_1, …, x_k⟩ and T2 = ⟨x_{k+1}, …, x_n⟩ be two adjacent time series splitting T. Following the standard definition of changepoint detection, a changepoint is detected for T1 and T2 if T1 and T2 are shown to be drawn from different distributions according to a given hypothesis testing method (Aminikhanghahi and Cook, 2017; James et al., [n. d.]). We detect that some fix/bug has been introduced into T2 since T1 if i) a changepoint is detected for T1 and T2 and ii) the average error-causing degree in T2 is smaller/larger than the average error-causing degree in T1. Finally, we say the error-causing behaviour of the method has diminished when a fix is detected.
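The detection rule above can be sketched with a two-sample permutation test standing in for the hypothesis testing method (a minimal standard-library illustration; the error-causing degrees are hypothetical):

```python
import random
from statistics import mean

def permutation_pvalue(t1, t2, n_perm=2000, seed=0):
    """Two-sample permutation test on the absolute difference of means:
    the p-value is the fraction of reshuffled splits whose statistic is
    at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(t1) - mean(t2))
    pooled = list(t1) + list(t2)
    k = len(t1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:k]) - mean(pooled[k:])) >= observed:
            hits += 1
    return hits / n_perm

def fix_detected(t1, t2, alpha=0.05):
    """A fix is detected iff (i) a changepoint is found between T1 and T2
    and (ii) the average error-causing degree has dropped."""
    return permutation_pvalue(t1, t2) < alpha and mean(t2) < mean(t1)

# Hypothetical error-causing degrees before and after a candidate fix.
before = [0.62, 0.58, 0.65, 0.60, 0.55, 0.63, 0.59, 0.61]
after = [0.08, 0.12, 0.05, 0.10, 0.07, 0.09, 0.11, 0.06]
```

Here `fix_detected(before, after)` holds, while swapping the windows would instead signal an introduced bug (condition ii with the inequality reversed).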

To illustrate the setup, consider Figure 1, which represents a time series of real-valued datapoints. Let T1 be the series before the vertical green line and T2 the series after. Already, our setup could be used to say that some fix has been introduced into T2 since T1. It then remains to find the precise point where the fix was introduced. This is done by applying a changepoint detection method (CDM). In general, CDMs try to identify exact times (changepoints) when the probability distribution of a stochastic process or time series can be confidently said to change. Ideally, we would apply a CDM which identifies the changepoint with the datapoint indicated by the green line in Figure 1. Research into CDMs is a large and well-developed area (Aminikhanghahi and Cook, 2017; James et al., [n. d.]), and CDMs have been applied successfully to solve problems similar to FDP in continuous code deployment (James et al., [n. d.]). Key differences between CDMs include where they locate changepoints and how scalable the technique is.
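A minimal illustration of changepoint localisation (a CUSUM-style sketch over hypothetical data, not one of the surveyed CDMs) scans every admissible split point and keeps the one that maximises the difference of window means:

```python
from statistics import mean

def locate_changepoint(series, min_size=3):
    """Return the split index maximising the absolute difference of the
    two window means (real CDMs use likelihood-ratio or energy
    statistics instead of this crude score)."""
    best_idx, best_stat = None, -1.0
    for k in range(min_size, len(series) - min_size + 1):
        stat = abs(mean(series[:k]) - mean(series[k:]))
        if stat > best_stat:
            best_idx, best_stat = k, stat
    return best_idx, best_stat

# Hypothetical series: the error-causing degree drops after index 6.
series = [0.6, 0.7, 0.65, 0.6, 0.7, 0.68, 0.1, 0.05, 0.1, 0.08, 0.1, 0.07]
cp, stat = locate_changepoint(series)
```

This single-changepoint scan is the building block of binary segmentation, which recurses on each side of the detected point to find multiple changepoints.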

Figure 1. Time Series with Change Point.

4. Deployment

We first discuss three integration scenarios: with the Sapienz tool, with FBLearner, and with canary testing. We then discuss the development of our techniques.

The first area of deployment is alongside the Sapienz tool, which has been integrated into Facebook's production development process, Phabricator (Alshahwan et al., 2018), to help identify faults. Accordingly, our methods could be integrated alongside Sapienz to help detect fixes made as a consequence of testing. The second area of deployment is alongside FBLearner, a Machine Learning (ML) platform through which most of Facebook's ML work is conducted. In FBLearner there is an existing fix detection workflow stage (Alshahwan et al., 2018), which involves using reinforcement learning to learn to classify faults and fixes. Accordingly, our methods could be integrated in the fix classification stage. The third area of deployment is alongside Facebook's canary testing/rolling deployment process for mobile devices. Canary releasing slowly rolls out changes to a small subset of users before rolling them out to the entire infrastructure. Facebook uses a strategy with multiple canaries (versions) (Savor et al., 2016; can, [n. d.]). In practice, data about different canaries could be used to form part of the dataset used for our fix detection methods. Namely, if an update is deployed in one cluster but not another, we will have important data about which failures are caused by which updates and for which methods.
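The canary scenario can be sketched as follows (hypothetical cluster names and counts): comparing a cluster that received the update with one that did not gives a crude estimate of the two interventional probabilities of Section 2.

```python
def canary_failure_rates(cluster_stats):
    """cluster_stats maps cluster name -> (failures, total sessions);
    returns the per-cluster failure rate."""
    return {name: f / n for name, (f, n) in cluster_stats.items()}

def update_raises_failures(rate_with_update, rate_without_update):
    """Treat deployment of the update as the intervention: the update is
    suspect if the canary cluster fails more often than the control."""
    return rate_with_update > rate_without_update

# Hypothetical canary data: the update is live in cluster A only.
rates = canary_failure_rates({"A": (30, 1000), "B": (5, 1000)})
suspect = update_raises_failures(rates["A"], rates["B"])
```

In practice the comparison would need the same hypothesis-testing machinery as Section 3, since small rate differences between clusters can be noise.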

We now discuss development issues. To develop the work of Section 2, we will need an experimental framework in which we can evaluate the performance of different causal measures on given benchmarks using standard IR measures (such as accuracy, precision, recall, and F-scores). We will evaluate the measures on different testing scenarios which do not make many of the restrictive assumptions outlined in Section 2. For instance, if assumption i) does not hold, we will need to perform fault localisation using a causal measure on the updated program alone (using a given fault localisation setup (Landsberg et al., 2015)). If assumption ii) or iii) does not hold, we will need to employ measures empirically demonstrated to perform well in the presence of noise (Pearl et al., 2016).

The development of the work of Section 3 will include an experimental comparison of different CDMs, testing for effectiveness and scalability when employed at the fix detection task. To measure effectiveness, we will use standard IR methods (Aminikhanghahi and Cook, 2017; James et al., [n. d.]). To measure scalability, we will measure practical runtime on representative benchmarks. This work is made feasible insofar as many CDMs are already implemented, known to scale well, and can be used in "online" contexts involving continuous real-time streams of datapoints.
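The effectiveness evaluation might be sketched as follows (a hypothetical precision/recall/F1 computation in which a detected changepoint counts as a true positive when it falls within a tolerance window of an unmatched true changepoint):

```python
def changepoint_prf(true_cps, detected_cps, tolerance=2):
    """Precision, recall, and F1 for a CDM's detections against ground
    truth, matching each detection to at most one true changepoint."""
    unmatched = list(true_cps)
    tp = 0
    for d in detected_cps:
        for t in unmatched:
            if abs(d - t) <= tolerance:
                tp += 1
                unmatched.remove(t)  # each true changepoint matches once
                break
    precision = tp / len(detected_cps) if detected_cps else 0.0
    recall = tp / len(true_cps) if true_cps else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = changepoint_prf(true_cps=[10, 40], detected_cps=[11, 25, 39])
```

The tolerance parameter reflects that CDMs differ in where exactly they place a changepoint, so near-misses should not be scored as outright failures.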


  • can ([n. d.]) [n. d.]. Canary Release.
  • Aminikhanghahi and Cook (2017) Samaneh Aminikhanghahi and Diane J. Cook. 2017. A Survey of Methods for Time Series Change Point Detection. Knowl. Inf. Syst. 51, 2 (May 2017), 339–367.
  • Landsberg et al. (2015) David Landsberg et al. 2015. Evaluation of Measures for Statistical Fault Localisation and an Optimising Scheme. FASE (2015), 115–129.
  • Alshahwan et al. (2018) Nadia Alshahwan, Xinbo Gao, Mark Harman, Yue Jia, Ke Mao, Alexander Mols, Taijin Tei, and Ilya Zorin. 2018. Deploying Search Based Software Engineering with Sapienz at Facebook. Facebook, UK (2018).
  • James et al. ([n. d.]) Nicholas A. James, Arun Kejariwal, and David S. Matteson. [n. d.]. Leveraging Cloud Data to Mitigate User Experience from Breaking Bad.
  • Pearl (2000) Judea Pearl. 2000. Causality: Models, Reasoning and Inference (1st ed.). Cambridge University Press, New York, NY, USA.
  • Pearl (2009) Judea Pearl. 2009. Causal inference in statistics: An overview. Statist. Surv. (2009), 96–146.
  • Pearl et al. (2016) J. Pearl, M. Glymour, and N.P. Jewell. 2016. Causal Inference in Statistics: A Primer. Wiley.
  • Savor et al. (2016) T. Savor, M. Douglas, M. Gentili, L. Williams, K. Beck, and M. Stumm. 2016. Continuous Deployment at Facebook and OANDA. In 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C). 21–30.