How good does a Defect Predictor need to be to guide Search-Based Software Testing?

Defect predictors, static bug detectors and humans inspecting the code can locate the parts of a program that are buggy before they are discovered through testing. Automated test generators such as search-based software testing (SBST) techniques can use this information to direct their search for test cases towards likely buggy code, thus speeding up the detection of existing bugs. However, the predictions given by these tools or humans are often imprecise, which can misguide the SBST technique and may deteriorate its performance. In this paper, we study the impact of imprecision in defect prediction on the bug detection effectiveness of SBST. Our study finds that the recall of the defect predictor, i.e., the probability of correctly identifying buggy code, has a significant impact on the bug detection effectiveness of SBST, with a large effect size. On the other hand, the effect of precision, a measure of false alarms, is not of meaningful practical significance, as indicated by a very small effect size. In particular, the SBST technique finds 7.5 fewer bugs on average (out of 420 bugs) for every 5% decrease in recall. In the context of combining defect prediction and SBST, our recommendation for practice is to increase the recall of defect predictors at the expense of precision, while maintaining a precision of at least 75%. To cope with the imprecision of defect predictors, in particular low recall values, SBST techniques should be designed to search for test cases that also cover the predicted non-buggy parts of the program, while prioritising the parts that have been predicted as buggy.


1. Introduction

Search-based software testing (SBST) techniques search for test cases to optimise a given coverage criterion such as branch coverage, method coverage, or a combination of the two. Coverage-related heuristics, such as branch distance (Korel, 1990; McMinn, 2011) and approach level (Panichella et al., 2017), are used to guide the search for test cases to cover the uncovered areas of the program. SBST techniques are known to be effective at achieving high code coverage (Panichella et al., 2017, 2018). While it is necessary for a test case to cover the buggy code to find a bug, covering the buggy code alone may not be sufficient to discover the bug (Shamshiri et al., 2015; Perera et al., 2020). In fact, SBST techniques guided only by coverage have been shown to struggle in terms of bug detection (Shamshiri et al., 2015; Perera et al., 2020; Almasi et al., 2017; Salahirad et al., 2019). This is because SBST techniques have no guidance regarding where the buggy code is likely to be located, and hence spend most of the search effort on non-buggy code, which constitutes the greater portion of the code base.

Defect predictors (Hall et al., 2011) and static bug detectors (Ayewah et al., 2008) can estimate the locations of bugs effectively. Most defect predictors use classifiers trained on an existing dataset with features based on various metrics, such as code size, code complexity and change history, and labels indicating whether the components (e.g., file/class or method) are buggy or not (Giger et al., 2012). Static bug detectors statically check the code against pre-defined bug patterns and label the buggy code (e.g., a line) with a warning (Habib and Pradel, 2018). Both defect predictors and static bug detectors are used in industry to assist developers in manual code reviews (Lewis et al., 2013; Lewis and Ou, 2011; Aftandilian et al., 2012; Sadowski et al., 2015). Defect predictors have also been used to inform automated testing techniques: G-clef (Paterson et al., 2019) is a test case prioritisation strategy that prioritises test cases covering classes that are highly likely to be defective, while SBST (Perera et al., 2020) and BTG (Hershkovich et al., 2019) are time budget allocation techniques that allocate a higher time budget to classes that are highly likely to be defective.

Often, the predictions produced by defect predictors are not perfectly accurate. The resulting false positives and false negatives can significantly hinder the potential benefits of these tools. For example, false positives (i.e., wrongly labelling a program element as buggy) result in SBST techniques looking for bugs in non-buggy areas of the code, thus spending valuable search resources in vain. On the other hand, false negatives (i.e., labelling a buggy program element as non-buggy) can result in SBST techniques not generating tests for buggy areas of the code. Previous work that uses defect predictors to guide SBST techniques reports improved bug detection performance of SBST (Perera et al., 2020; Hershkovich et al., 2019). The defect predictors used in these approaches have relatively high performance, e.g., the defect predictor used by Perera et al. (Perera et al., 2020) had a recall of 85%, and Hershkovich et al. (Hershkovich et al., 2019) employed a defect predictor with an area under the curve (AUC) of 0.95.

The performance of defect predictors, however, can vary widely, e.g., from as low as 5% and 25% to as high as 95% and 85% for precision and recall, respectively (Hall et al., 2011). Given such wavering performance, the question we address in this paper is: “What is the impact of imprecise predictions on the bug detection performance of SBST?”.

To answer this question, we simulate defect predictors for different value combinations of recall and precision in the range of 75% to 100% (Section 2.1). Defect predictors with recall and precision above 75% are considered acceptable defect predictors (Zimmermann et al., 2009). We employ the state-of-the-art DynaMOSA (Panichella et al., 2017) as the SBST technique that is guided by the defect predictions (DP) (see Section 2.2), which we refer to as SBST guided by DP throughout the paper. We evaluate how the bug detection effectiveness of SBST guided by DP changes with the different levels of imprecision when applied to 420 bugs from the Defects4J dataset (Just, 2019) (Section 3.1).

The results of our experimental evaluation reveal that the recall of the defect predictor has a significant impact on the bug detection effectiveness of SBST, with a large effect size. More specifically, SBST guided by DP finds 7.5 fewer bugs on average (out of 420 bugs) for every 5% decrease in recall. On the other hand, the impact of precision is not of practical significance, as indicated by a very small effect size; hence we conclude that the precision of defect predictors has a negligible impact on the bug detection effectiveness of SBST, as long as one uses a defect predictor with acceptable performance, i.e., with precision and recall greater than 75%. Further analysis of the results reveals that the impact of recall is greater for bugs that are isolated in one method than for bugs that are spread across multiple methods.

In summary, the contribution of this work is a comprehensive experimental analysis of the impact of imprecision in defect predictions on the bug detection effectiveness of SBST. The experimental evaluation, involving 420 bugs from 6 open source Java projects, took roughly 180,750 CPU-hours in total. Based on the results of our study, we make the following recommendations:

  1. SBST techniques must take potential errors in the predictions into account, in particular the false negatives. One possible solution is to prioritise predicted buggy parts of the program, while guiding the search with a certain probability towards locations that are predicted as not buggy.

  2. In the context of combining defect prediction and SBST, it is beneficial to increase the recall of the defect predictor by sacrificing precision, while maintaining the precision above 75%. One potential solution is to lower the cut-off point of the classifier such that more components will be labelled as buggy at the expense of more false positives.

The source code of SBST guided by DP, the defect predictor simulator, the post-processing scripts and the data are publicly available at the following link: https://figshare.com/s/a8d75f161b8cfa11d297

2. Methodology

Our aim is to understand how the defect prediction imprecision impacts the bug finding performance of SBST. To this end, we design a study that addresses the following research question:

RQ: What is the impact of the imprecision of defect prediction on bug detection performance of SBST?

To address this research question, we measure the effectiveness of SBST in terms of finding bugs when using defect predictors with different levels of imprecision. We use DynaMOSA (Panichella et al., 2017), a state-of-the-art SBST technique, and incorporate predictions about buggy methods in order to guide the search for test cases towards likely buggy methods (see Section 2.2), which we refer to as SBST guided by DP throughout the paper. Fine-grained defect prediction at the method level is chosen because it narrows down the location of the bug better than coarse-grained prediction at, e.g., the class level. Hence, defect predictors at the method level provide additional information to the SBST technique, allowing it to further narrow down the search for test cases to likely buggy methods.

We measure defect predictor imprecision using recall and precision. Recall and precision have been widely used in previous work to report the performance of defect predictors (Hall et al., 2011; Hosseini et al., 2017). We consider a defect predictor with either recall or precision less than 75% to be unacceptable, as recommended by Zimmermann et al. (Zimmermann et al., 2009). Hence, we simulate defect predictors for varying levels of recall and precision in the range of 75% to 100% (see Section 2.1) and measure the impact of the prediction imprecision on the bug detection performance of SBST.

2.1. Defect Prediction Simulation

To measure the bug detection performance of SBST against the imprecision of defect predictions, we simulate defect predictor outcomes at various levels of performance in the range of 75% to 100% for both precision and recall. Recall is the rate at which the defect predictor identifies buggy methods. It is calculated as in Equation (1), where TP is the number of true positives, i.e., the number of buggy methods that are correctly classified, and FN is the number of false negatives, i.e., the number of buggy methods that are incorrectly classified as non-buggy.

Recall = TP / (TP + FN)    (1)

Precision is the rate at which the methods labelled as buggy by the defect predictor are actually buggy. It is calculated as in Equation (2), where FP is the number of false positives, i.e., the number of non-buggy methods that are incorrectly classified as buggy.

Precision = TP / (TP + FP)    (2)
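As a concrete illustration with made-up counts (not taken from our experiments), a predictor that correctly flags 15 of a project's 20 buggy methods and additionally flags 5 non-buggy methods sits exactly at the acceptability threshold:

```latex
% Hypothetical confusion counts: TP = 15, FN = 5, FP = 5
\text{Recall} = \frac{TP}{TP + FN} = \frac{15}{15 + 5} = 75\%, \qquad
\text{Precision} = \frac{TP}{TP + FP} = \frac{15}{15 + 5} = 75\%
```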

We simulate defect predictions from 75% to 100% recall in 5% steps, with 75% and 100% precision. Thus, there are altogether 12 defect predictor configurations, with the following values of (precision, recall): (75%, 75%), (75%, 80%), (75%, 85%), (75%, 90%), (75%, 95%), (75%, 100%), (100%, 75%), (100%, 80%), (100%, 85%), (100%, 90%), (100%, 95%), (100%, 100%). Our preliminary experiments suggest that the bug detection performance of SBST guided by DP changes by a small margin when the precision is changed from 100% to 75% while keeping the recall unchanged. On the other hand, the bug detection performance of SBST guided by DP changes by a large margin when only the recall is changed from 100% to 75%. Hence, we decided to consider only the values of 75% and 100% for precision, while recall is sampled in 5% steps.

The output of the simulated defect predictor is binary, i.e., a method is either buggy or not buggy, similar to most existing defect predictors. Some existing defect predictors output the likelihood of components being buggy, or a ranking of the components according to their likelihood of being buggy. Since we employ a theoretical defect predictor and not a specific one, we resort to the generic form, i.e., one that gives a binary classification.

Input: r, p — the target recall and precision
      Y = {y_1, ..., y_n} — the ground truth labels of the n methods (y_i = 1 if method i is buggy, 0 otherwise)

1: procedure SimulateDefectPredictor(r, p, Y)
2:     B ← Count(y_i) for all i s.t. y_i = 1          ▷ number of buggy methods
3:     NB ← n − B                                     ▷ number of non-buggy methods
4:     I_B ← {i | y_i = 1}                            ▷ indices of buggy methods
5:     I_NB ← {i | y_i = 0}                           ▷ indices of non-buggy methods
6:     TP ← round(r × B)                              ▷ true positives implied by recall r
7:     FP ← round(TP × (1 − p) / p)                   ▷ false positives implied by precision p
8:     P ← RandomChoice(I_B, TP) ∪ RandomChoice(I_NB, FP)
9:     Y′ ← {y′_i}, where y′_i = 1 if i ∈ P, and y′_i = 0 otherwise
10:     Return(Y′)
Algorithm 1 Defect Predictor Simulation

Algorithm 1 illustrates the steps of simulating the defect predictor output for a given recall and precision combination. The procedure SimulateDefectPredictor receives the set of methods in the project with the ground truth labels for their defectiveness, Y = {y_1, ..., y_n}, where y_i = 1 if method i is buggy and y_i = 0 otherwise, and outputs a set of labels for the methods in the project, Y′ = {y′_1, ..., y′_n}, where y′_i = 1 if method i is labelled as buggy and y′_i = 0 otherwise.

First, it calculates the number of buggy (B) and non-buggy (NB) methods in the project (lines 2-3 in Algorithm 1). Next, it finds the sets of indices of all the buggy (I_B) and non-buggy (I_NB) methods in the project (lines 4-5). The numbers of true positives (TP) and false positives (FP) are then calculated for the given recall (r) and precision (p) (lines 6-7). The RandomChoice(S, k) procedure returns k randomly selected elements from the set S, where k ≤ |S|. P is assigned a set of randomly picked TP buggy and FP non-buggy method indices (line 8); P is the set of method indices classified as buggy by the simulated defect predictor. The output is the set Y′, where y′_i = 1 if the method with index i is labelled as buggy and y′_i = 0 if it is labelled as not buggy (line 9).
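For concreteness, the following Python sketch mirrors Algorithm 1 under the assumptions stated above; the function name, the rounding of TP and FP, and the fixed random seed are illustrative choices rather than details of our implementation.

```python
import random

def simulate_defect_predictor(recall, precision, ground_truth, rng=random.Random(0)):
    """Simulate binary defect predictor labels for a given recall and precision.

    ground_truth: list of 0/1 labels, where 1 marks a buggy method.
    Returns a list of 0/1 predicted labels of the same length.
    """
    buggy_idx = [i for i, y in enumerate(ground_truth) if y == 1]
    clean_idx = [i for i, y in enumerate(ground_truth) if y == 0]

    # Recall = TP / (TP + FN) fixes the number of true positives.
    tp = round(recall * len(buggy_idx))
    # Precision = TP / (TP + FP) then fixes the number of false positives.
    fp = round(tp * (1 - precision) / precision)

    predicted_buggy = set(rng.sample(buggy_idx, tp))
    predicted_buggy |= set(rng.sample(clean_idx, min(fp, len(clean_idx))))
    return [1 if i in predicted_buggy else 0 for i in range(len(ground_truth))]

# Example: a (precision, recall) = (100%, 75%) predictor on 4 buggy and 6 non-buggy methods.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(simulate_defect_predictor(0.75, 1.0, labels))
```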

2.2. Search-Based Software Testing Guided By Defect Prediction

We incorporate buggy method predictions into DynaMOSA (Panichella et al., 2017), the state-of-the-art SBST technique, to guide the search for test cases towards likely buggy methods. DynaMOSA tackles the test generation problem as a many objective optimisation problem, where each coverage target in the program, e.g., a branch or statement, is an objective to optimise. It is more effective at achieving high branch, statement and strong mutation coverage than previously proposed SBST techniques (Fraser and Arcuri, 2012; Rojas et al., 2017; Panichella et al., 2015) (Panichella et al., 2017). In the following sections, we refer to the DynaMOSA approach guided by the defect predictor as SBST guided by DP. SBST guided by DP is presented in Algorithm 2. It shares the same search steps and genetic operators as DynaMOSA, except for the updated steps shown in blue colour in Algorithm 2.

SBST guided by DP receives as input a class with methods labelled as buggy or non-buggy, which are labels that can be obtained using existing defect predictors (Giger et al., 2012; Hata et al., 2012). In our study, SBST guided by DP receives these labels from defect predictor simulations (Section 2.1).

SBST guided by DP devotes all the search resources to finding tests that cover likely buggy methods, thereby increasing the chances of finding bugs. Initially, SBST guided by DP filters out the coverage targets that belong to methods deemed non-buggy by the defect prediction information, and keeps only the targets that belong to likely buggy methods (as shown in line 2 of Algorithm 2 and described in Section 2.2.1).

It also generates more than one test case for each of the selected buggy targets, which further increases the chances of finding bugs (lines 6, 7, 10 and 11; described in Section 2.2.2) (Perera et al., 2020).

To generate more than one test case for the likely buggy targets, SBST guided by DP does not remove a target once it is covered during the search. This is likely to cause SBST guided by DP to miss nontrivial targets in the search and keep generating tests that cover more trivial targets (Rojas et al., 2017). To address this, we use a method that dynamically disables targets in the search based on their current test coverage and number of independent paths (line 13; described in Section 2.2.3). We refer to this as balanced test coverage. It ensures that nontrivial targets have an equal chance of being covered compared to targets that are easier to cover.

SBST guided by DP randomly generates a set of test cases that forms the initial population (line 5). It then evolves this initial population by creating new test cases via crossover and mutation (line 9) and selecting test cases for the next generation (line 14), until a termination criterion, such as the maximum time budget, is met.

Input:
      U — the set of coverage targets of the CUT
      G = ⟨N, E⟩ — the control dependency graph of the CUT
      φ : E → U — a partial map between edges and targets
      D = {d_1, ..., d_k} — the set of defectiveness classifications for the methods in the CUT

1: procedure SBST(U, G, φ, D)
2:     U_B ← FilterTargets(U, D)                ▷ keep only targets in likely buggy methods
3:     IP ← IndependentPaths(G)                 ▷ IP is a vector of the number of independent paths for each edge
4:     U* ← targets in U_B with no control dependencies
5:     P_0 ← RandomPopulation(M)                ▷ M is the population size
6:     A ← UpdateArchive(P_0, A)                ▷ A is the archive (initially empty)
7:     U* ← UpdateTargets(U*, A)
8:     for t ← 0; ¬terminationCriteria; t++ do
9:         Q_t ← GenerateOffspring(P_t)
10:         A ← UpdateArchive(Q_t, A)
11:         U* ← UpdateTargets(U*, A)
12:         R_t ← P_t ∪ Q_t
13:         U* ← SwitchOffTargets(U*, A, G, IP)
14:         P_{t+1} ← SelectPopulation(R_t, U*)
15:     T ← A                                    ▷ update the final test suite
16:     Return(T)
Algorithm 2 SBST Guided By Defect Prediction

2.2.1. Filtering Targets with Defect Prediction

A defect predictor classifies the methods of the class under test (CUT) as buggy or non-buggy, denoted as D = {d_1, ..., d_k}, where d_j = 1 if method j is classified as buggy and d_j = 0 otherwise.

This information is used to filter out the likely non-buggy targets from the set of all targets U using the given classifications (line 2). Spending the limited search resources on covering non-buggy targets is likely to be ineffective when it comes to finding bugs. Filtering out targets that are unlikely to be buggy allows the search to focus on test cases that cover the likely buggy targets (i.e., U_B), hence generating more effective test cases faster than other approaches that search for tests over all the targets in the CUT.
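The filtering step can be sketched as follows in Python; the (method, target) pair representation and the function name are illustrative assumptions, since in our prototype the targets are EvoSuite coverage goals tied to methods of the CUT.

```python
def filter_targets(targets, method_labels):
    """Keep only coverage targets that belong to methods predicted as buggy.

    targets: iterable of (method_name, target_id) pairs.
    method_labels: dict mapping method_name -> 1 (predicted buggy) or 0 (predicted non-buggy).
    """
    return [(m, t) for (m, t) in targets if method_labels.get(m, 0) == 1]

# Example: only the targets inside the predicted buggy method are kept.
targets = [("parse", "b1"), ("parse", "b2"), ("format", "b3")]
labels = {"parse": 1, "format": 0}
assert filter_targets(targets, labels) == [("parse", "b1"), ("parse", "b2")]
```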

2.2.2. Dynamic Selection of Targets and Archiving Tests

There are structural dependencies among targets that should be considered when selecting objectives, i.e., targets, to optimise. For instance, some targets can be covered only if the targets they are control dependent on are covered. To better understand this, consider the example control dependency graph in Figure 1, and assume the test generation scenario is to optimise branch coverage, with the branches of the graph as the targets to be covered. A branch near the root holds a control dependency link to two other branches, which means that those two can be covered only if the former is covered by a test case. If an SBST technique optimises test cases to cover the two dependent branches while the branch they depend on is still uncovered, this unnecessarily increases the computational complexity of the algorithm because of the added objectives in the search without any added benefit. To address this, DynaMOSA dynamically adds targets to the search only when the targets they are control dependent on are covered (Panichella et al., 2017). In our example, the two dependent branches are added to the search only when the branch they depend on is covered.

Figure 1. Control Dependency Graph

At the start of the search, SBST guided by DP selects the set of targets U* that do not have control dependencies (line 4). These are the targets SBST guided by DP can cover without needing to cover any other targets in the program. At any given time in the search, it searches for test cases covering only the targets in U*.

Once a new population of test cases is generated (lines 5 and 9), the procedure UpdateTargets is executed to update U* by adding new targets to the search. The procedure UpdateTargets adds a target to U* only if the targets it is control dependent on are covered, as explained in the example above.

SBST guided by DP maintains an archive of test cases found during the search which cover the selected targets. Once the search finishes, this archive forms the final test suite. Unlike in DynaMOSA, we configure the UpdateTargets procedure to not remove a covered target from U*, and the UpdateArchive procedure (lines 6 and 10) to archive all the test cases that cover the selected targets U*. This way, SBST guided by DP can generate more than one test case for each target, hence increasing the bug detection capability of the generated test suites (Perera et al., 2020). Perera et al. (Perera et al., 2020) showed that DynaMOSA finds up to 79% more bugs when it is configured to not remove covered targets from the search and to retain all the generated tests.
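The difference from DynaMOSA's standard archiving can be illustrated with the following sketch; the class and method names are illustrative assumptions, not EvoSuite's actual API.

```python
from collections import defaultdict

class CoverageArchive:
    """Archive that keeps every test covering a target, instead of only one test per target."""

    def __init__(self):
        self.tests_per_target = defaultdict(list)

    def update(self, population, selected_targets, covers):
        """covers(test, target) -> bool is assumed to come from the fitness evaluation."""
        for test in population:
            for target in selected_targets:
                if covers(test, target):
                    # Unlike DynaMOSA, the covering test is always archived and the
                    # covered target is not removed from the set of search targets.
                    self.tests_per_target[target].append(test)

    def coverage(self, target):
        """Number of archived tests covering the given target."""
        return len(self.tests_per_target[target])

    def final_test_suite(self):
        return [t for tests in self.tests_per_target.values() for t in tests]
```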

2.2.3. Balanced Test Coverage of Targets

As we discussed in Section 2.2.2, SBST guided by DP does not remove covered targets from to allow the search to find more than one test case for each target. While this benefits SBST guided by DP in terms of bug detection, one downside of this approach is that trivial targets are covered more often than they need to be. Hence, we propose a method to balance the test coverage among the targets in the CUT.

We use the control dependency graph (CDG) in Figure 1 to explain how we balance the test coverage. Assume the goal of the test generation scenario is to maximise branch coverage. There are always more test cases that cover the branches closer to the root node of the CDG than the branches closer to its leaf nodes. This is because the execution of a test case can take many paths in the program once it reaches a branch near the root; for instance, there are 3 independent paths that lead from one such branch in the example. On the other hand, the execution of a test case can take only one path from a leaf branch, i.e., the exit path. Therefore, to balance the test coverage of targets in the CUT, we want to ensure that all targets have an equal number of tests per independent path that leads from the respective target. For example, if we assume the number of tests that cover the branch with 3 independent paths to be 90, then there should be 60 and 30 test cases that cover its two control-dependent branches, with 2 and 1 independent paths, respectively, and the 60 tests should in turn be equally distributed among the two branches that depend on the former.

SBST guided by DP calculates the number of independent paths of each edge in the control dependency graph of the program (line 3). The CDG, G = ⟨N, E⟩, consists of nodes N and edges E. The nodes represent statements in the program, and the edges represent control dependencies between the statements. The procedure IndependentPaths calculates the number of independent paths for each edge e ∈ E. When calculating the number of independent paths of an edge e, the IndependentPaths procedure assumes the paths start at e; however, the actual execution of these paths starts at the root node of the CDG. In our example in Figure 1, the independent paths starting from a branch are the distinct paths through the CDG that begin with that branch. Finally, all the targets that are directly control dependent on an edge e have the same number of independent paths as e.

1: procedure SwitchOffTargets(U*, A, G, IP)
2:     N_p ← NodesWithPredicates(G)
3:     for n ∈ N_p do
4:         e_1, e_2 ← outgoing edges in G from node n
5:         ip_1 ← GetIndependentPaths(e_1, IP)
6:         ip_2 ← GetIndependentPaths(e_2, IP)
7:         u_1 ← RandomChoice(targets control dependent on e_1)
8:         u_2 ← RandomChoice(targets control dependent on e_2)
9:         t_1 ← GetTests(u_1, A)                  ▷ number of archived tests covering u_1
10:         t_2 ← GetTests(u_2, A)                  ▷ number of archived tests covering u_2
11:         if t_1 / ip_1 > t_2 / ip_2 then
12:             U* ← U* \ {targets control dependent on e_1}
13:         else if t_2 / ip_2 > t_1 / ip_1 then
14:             U* ← U* \ {targets control dependent on e_2}
15:     Return(U*)
Algorithm 3 Temporary Removal of Targets to Balance Test Coverage

SBST guided by DP dynamically switches off targets with higher test coverage from U* in every iteration, to focus on increasing the test coverage of targets which currently have lower coverage (line 13). The procedure SelectPopulation selects test cases for the next generation considering only the targets with low test coverage. This paves the way for the search to find more test cases in the next generation that cover these targets, and eventually makes all targets reach an equitable test coverage. This ensures that nontrivial targets also receive good coverage in the presence of more trivial targets.

The procedure SwitchOffTargets starts by finding the set of nodes with predicates in the CDG (line 2 in Algorithm 3). Next, it fetches the number of independent paths of the outgoing edges of each such node (lines 5-6). In our running example, the two outgoing edges of the node under consideration have 3 and 1 independent paths, respectively. We consider the test coverage to be equal among all the control-dependent targets of an edge, including the edge itself. Therefore, the procedure randomly selects a control-dependent target from each outgoing edge of the node (lines 7-8) and finds the test coverage of each edge (lines 9-10). In the case of a test generation scenario maximising branch coverage, the edges of the CDG are themselves targets of the CUT. Then, it finds the edge which has the largest test coverage per independent path, and removes all the control-dependent targets of that edge from the set of selected targets (lines 11-14). If we assume there are 30 and 20 tests in the archive covering the targets selected from the edges with 3 and 1 independent paths, respectively, then it removes the control-dependent targets of the latter edge, since that edge has 20 (= 20/1) tests per independent path, while the former has only 10 (= 30/3).
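The core decision in Algorithm 3, comparing tests per independent path across the two outgoing edges of a predicate node, can be sketched as follows; the data structures and names are illustrative assumptions rather than our actual implementation.

```python
def targets_to_switch_off(predicate_nodes, independent_paths, archived_tests, dependents):
    """Return the targets to temporarily remove from the search in this iteration.

    predicate_nodes: list of (edge_a, edge_b) pairs, the two outgoing edges of each predicate node.
    independent_paths: dict edge -> number of independent paths leading from that edge.
    archived_tests: dict target -> number of archived tests covering it.
    dependents: dict edge -> list of targets control dependent on that edge (including the edge itself).
    """
    switched_off = set()
    for edge_a, edge_b in predicate_nodes:
        # A representative control-dependent target stands in for each edge's coverage.
        density_a = archived_tests[dependents[edge_a][0]] / independent_paths[edge_a]
        density_b = archived_tests[dependents[edge_b][0]] / independent_paths[edge_b]
        if density_a > density_b:
            switched_off.update(dependents[edge_a])
        elif density_b > density_a:
            switched_off.update(dependents[edge_b])
    return switched_off

# Worked example from the text: 30 tests over 3 paths (10 per path) vs 20 tests over 1 path
# (20 per path) -- the dependents of the second edge are switched off.
print(targets_to_switch_off([("e1", "e2")],
                            {"e1": 3, "e2": 1},
                            {"t1": 30, "t2": 20},
                            {"e1": ["t1"], "e2": ["t2"]}))
```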

3. Design of Experiments

We design a set of experiments to evaluate the effectiveness of SBST guided by DP in terms of finding bugs when using defect predictors with 12 different levels of imprecision as described in Section 2.1 (RQ). We use the bugs from the Defects4J dataset as the experimental subjects (Just, 2019) (see Section 3.1).

To account for the randomness of the defect prediction simulation algorithm (Algorithm 1), we repeat the simulation runs 5 times for each defect predictor configuration (i.e., recall and precision pair). For each of these simulation runs, we repeat the test generation runs 5 times, to account for the randomness in SBST guided by DP.

Once tests are generated and evaluated for bug detection, we conduct a two-way ANOVA test to statistically analyse the effects of the recall and precision of the defect predictor on the bug detection effectiveness of SBST guided by DP.

3.1. Experimental Subjects

We use the Defects4J dataset (version 1.5.0) (Just, 2019; Just et al., 2014) as our benchmark. It contains 438 real bugs from 6 real-world open source Java projects. In our experiments, we remove 18 bugs from the dataset: 4 deprecated bugs, 12 bugs that do not have buggy methods, and 2 bugs for which SBST guided by DP generated uncompilable tests (e.g., because a method signature changed in the bug fix). Thus, we evaluate SBST guided by DP on a total of 420 bugs. The bugs are drawn from the following projects: JFreeChart (25 bugs), Closure Compiler (170 bugs), Apache commons-lang (59 bugs), Apache commons-math (104 bugs), Mockito (37 bugs), and Joda-Time (25 bugs).

The Defects4J benchmark provides a buggy version and a fixed version of the program for each bug in the dataset. The fixed version differs from the buggy version only by the patch applied to fix the bug, which indicates the location of the bug. We label all the methods that are either modified or removed in the bug fix as buggy methods (Sohn and Yoo, 2019).

Defects4J is widely used for research on automated unit test generation (Shamshiri et al., 2015; Perera et al., 2020; Gay, 2017b), automated program repair (Aleti and Martinez, 2020), fault localisation (Pearson et al., 2017), test case prioritisation (Paterson et al., 2019), etc. This makes Defects4J a suitable benchmark for evaluating SBST guided by DP, as it allows us to compare our results to existing work.

3.2. Prototype

DynaMOSA is implemented in the state-of-the-art SBST tool EvoSuite (Fraser and Arcuri, 2011). EvoSuite is an automated test generation framework that generates JUnit test suites for Java programs (EvoSuite, 2019; Fraser, 2018). EvoSuite is actively maintained and has been evaluated for its bug finding effectiveness on both industrial and open source projects (Shamshiri et al., 2015; Perera et al., 2020; Almasi et al., 2017; Gay, 2017b). For the experimental evaluation, we implement the changes described in Section 2.2 for SBST guided by DP. The changes are implemented within EvoSuite version 1.0.7, forked from the GitHub repository (EvoSuite, 2019) on June 18, 2019. We also implement the defect predictor simulator as described in Section 2.1. The prototypes are available to download from: https://figshare.com/s/a8d75f161b8cfa11d297

3.3. Parameter Settings

We use the default parameter settings of EvoSuite (Fraser and Arcuri, 2012) and DynaMOSA (Panichella et al., 2017), except for the parameters mentioned in the following paragraphs. Parameter tuning of SBST techniques is a long and expensive process (Arcuri and Fraser, 2013). According to Arcuri and Fraser (Arcuri and Fraser, 2013), EvoSuite with default parameter values performs on par with EvoSuite with tuned parameters.

Time Budget: We set 2 minutes as the time budget per CUT for test generation. In practice, the time budget allocated for SBST tools depends on the size of the project, the frequency of test generation runs and the availability of computational resources in the organisation.

Real-world projects are usually very large and can have thousands of classes (Broy et al., 2007). If an SBST tool runs test generation for 2 minutes per class, then for a project with 1,000 classes it will take at least 33 hours to finish the task for the whole project.

To address this issue, practitioners can adopt SBST tools in their continuous integration (CI) systems (Fowler and Foemmel, 2006). However, the introduction of new SBST tools to the CI system should not make the existing processes in the system idle (Perera et al., 2020).

Thus, the limited computational resources available in practice (Campos et al., 2014) and the expectation of fast feedback cycles from testing in agile development prompt the need for frequent test generation runs with a limited testing budget. Therefore, we consider 2 minutes per class a reasonable time budget in a typical resource-constrained environment.

Coverage criteria: We use branch coverage as the coverage criterion in SBST guided by DP. EvoSuite is more effective in terms of finding bugs when using branch coverage as the coverage criterion than other single criteria (Gay, 2017a). According to Gay (Gay, 2017a), some criteria combinations perform better than branch coverage alone, but others perform worse, and hence no strategy for combining criteria was recommended. Therefore, we decide to use the most effective single criterion, i.e., branch coverage, in our experimental evaluation.

Termination criteria: We use only the maximum time budget as the termination criterion. Stopping the search after it covers all the targets is detrimental to bug detection (Perera et al., 2020). The search needs to utilise the full time budget to generate as many tests as possible for each target in the CUT in order to increase the chances of detecting bugs. Therefore, we terminate the search for test cases only when the allocated time budget runs out.

Test suite minimisation: We disable test suite minimisation since all the test cases in the archive form the final test suite (see Section 2.2.2).

Assertion strategy: We choose all possible assertions as the assertion strategy because the mutation-based assertion filtering can be computationally expensive and can lead to timeouts (Shamshiri et al., 2015; Perera et al., 2020).

3.4. Experimental Protocol

We run experiments with SBST guided by DP using defect predictors with 12 different levels of imprecision as described in Section 2.1. For each bug in the Defects4J dataset, we checkout the buggy version of the project and collect the ground truth labels for the buggy and non-buggy methods. If a method is either modified or removed in the bug fix, we label that method as a buggy method, and non-buggy otherwise (Sohn and Yoo, 2019). Then, for each of the six projects in the dataset, we combine the ground truth labels from all the bugs respective to each project. For example, we combine the labels from all the 104 bugs from Apache commons-math project. Then, we simulate defect prediction outcomes for each project using the defect prediction algorithm described in Section 2.1.

We assume an application scenario of generating tests to find bugs not limited to regressions, but also including bugs introduced to the code at various times during development. Therefore, we run test generation on the buggy versions of the projects. We measure the bug finding effectiveness of SBST guided by DP only on the Defects4J bugs. Thus, we only run test generation for the buggy classes, i.e., classes that are modified in the bug fixes.

For each level of defect predictor imprecision, we run test generation with SBST guided by DP 25 times for each bug in the dataset. Consequently, we run a total of 12 (levels of defect prediction imprecision) × 25 (repetitions) × 482 (buggy classes) = 144,600 test generations.

Defects4J (Just, 2019) allows us to evaluate whether the 144,600 test suites generated in the experiments find the bugs. First, we remove flaky test cases from the test suites using the ‘fix test suite’ interface (Just, 2019) of Defects4J, as described in (Shamshiri et al., 2015). We then use the ‘run bug detection’ interface (Just, 2019), which executes a test suite against the buggy and fixed versions of a program and determines whether the test suite finds the bug by checking if the test execution results differ between the two versions. EvoSuite generates assertions assuming the program under test is correct; therefore, the generated tests should always pass when run against the buggy version. A test suite is considered broken if it is not compilable or fails when run against the buggy version of the program. A test suite is considered to have missed the bug if it produces the same execution results when run against the buggy and fixed versions of the program, and to have detected the bug if the test results differ.

4. Results

We present the results for our research question following the method described in Section 3. Our aim is to evaluate the bug finding performance of SBST guided by DP when using imprecise defect predictors.

RQ. What is the impact of the imprecision of defect prediction on bug detection performance of SBST?

Figure 2 shows the distributions of the number of bugs found by SBST guided by DP as violin plots, together with the profile plot of the mean number of bugs found by SBST guided by DP for each combination of the six recall levels and two precision levels. The two lines in the profile plot run almost parallel to each other, i.e., they do not cross at any point. This means that there is no observable interaction effect between recall and precision.

The two lines descend steeply from recall 100% to 75%. This shows that recall has an effect on the number of bugs found by SBST guided by DP; in particular, bug detection effectiveness decreases as recall decreases.

The precision=75% line closely follows the precision=100% line while staying slightly above the latter, except at recall=85%, where there is a considerable gap between the two. The two-way ANOVA test results below show whether this difference is significant.

Figure 2. Distributions of the number of bugs found by SBST guided by DP as violin plots together with the profile plot of mean number of bugs found by SBST guided by DP for each combination of the groups of recall and precision.

To statistically test the effect of each of the two metrics, recall and precision, and of their interaction on the number of bugs found by SBST guided by DP, we conduct a two-way ANOVA test. Prior to conducting the test, we have to make sure that our data satisfies the following assumptions.

  1. The dependent variable should approximately follow a normal distribution for all the combinations of groups of the two independent variables.

  2. Homogeneity of variances exists for all the combinations of groups of the two independent variables.

To check the first assumption, we conduct the Kolmogorov-Smirnov test (Massey Jr, 1951) for normality of the distributions of the number of bugs found for each combination of the groups of recall and precision. Based on the results of the tests, we cannot reject the null hypothesis H0 that the number of bugs found is normally distributed (all p-values are above the significance threshold); hence we assume all the samples come from a normal distribution.

To check the second assumption, we conduct Bartlett’s test for homogeneity of variances in each combination of the groups of recall and precision. Based on the results of the test, we cannot reject the null hypothesis H0 that the variances of the number of bugs found are equal across all combinations of the groups (the p-value is above the significance threshold); hence we assume the variances are equal across all samples.
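As a minimal sketch of how these checks and the subsequent two-way ANOVA can be run, the Python snippet below assumes the per-run bug counts are stored in a table with columns named recall, precision and bugs; the file name and column layout are assumptions, not our actual post-processing scripts.

```python
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per test generation run: recall (%), precision (%), bugs (number of bugs found).
df = pd.read_csv("bug_detection_results.csv")  # hypothetical file name

# Assumption 1: normality of each (recall, precision) group (Kolmogorov-Smirnov test).
for (r, p), group in df.groupby(["recall", "precision"]):
    _, p_value = stats.kstest(group["bugs"], "norm",
                              args=(group["bugs"].mean(), group["bugs"].std(ddof=1)))
    print(f"recall={r} precision={p}: KS p-value={p_value:.3f}")

# Assumption 2: homogeneity of variances across all groups (Bartlett's test).
groups = [g["bugs"].values for _, g in df.groupby(["recall", "precision"])]
print("Bartlett p-value:", stats.bartlett(*groups).pvalue)

# Two-way ANOVA with interaction, treating recall and precision as categorical factors.
model = ols("bugs ~ C(recall) * C(precision)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```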

Df Sum Sq Mean Sq F value p-value
Recall 5 51341 10268 497.42 <0.001
Precision 1 273 273 13.21 <0.001
Recall:Precision 5 190 38 1.84 0.105
Residuals 288 5945 21
Table 1. Summary of the two-way ANOVA test results. Df = degrees of freedom, Sum Sq = sum of squares and Mean Sq = mean sum of squares.

Table 1 shows the summary of the two-way ANOVA test results. According to the test, recall and precision each explain a significant amount of the variation in the number of bugs found by SBST guided by DP (p-values < 0.001). The test also indicates that we cannot reject the null hypothesis that there is no interaction effect between recall and precision on the number of bugs found (p-value = 0.105). That means we can assume the effect of recall on the number of bugs found does not depend on the effect of precision, and vice versa.

To check whether the observed differences among the groups are of practical significance, we measure the epsilon squared effect size (Yigit and Mendes, 2018) of the variations in the number of bugs found with respect to recall and precision. We find that the effect of recall on bug detection effectiveness is large, with an effect size of 0.89, while the effect of precision is very small (Cohen, 1992), which can also be seen from the overlapping distributions in the violin plots in Figure 2.

To further analyse which groups are significantly different from each other, we conduct Tukey’s Honestly-Significant-Difference test (Tukey, 1949). The Tukey post-hoc test shows that the number of bugs found by SBST guided by DP is significantly different between each of the six levels of recall (all adjusted p-values are below the significance threshold). The Cohen’s d effect sizes of the differences between the groups of recall range from medium (recall 95% vs. 100%) to large (all other pairs of groups).

In summary, the imprecision of the defect predictor has a significant impact on the bug finding performance of SBST. In particular, when the recall of the defect predictor decreases, the bug detection effectiveness significantly decreases with a large effect size. On the other hand, we conclude that there is no meaningful practical effect of precision on the bug detection performance of SBST, as indicated by a very small effect size.

4.1. Sensitivity to the Recall of the Defect Predictor

As shown in Figure 2, SBST guided by DP finds fewer bugs when using a defect predictor with a lower recall compared to one with a higher recall. In particular, SBST guided by DP finds 7.5 fewer bugs and misses test generation for 15 bugs on average (out of 420) for every 5% decrease in recall in our experiments. SBST guided by DP completely trusts the defect predictor and only generates tests for classes having at least one method predicted as buggy (i.e., a true positive). The number of true positives decreases when the recall decreases. This results in SBST guided by DP generating tests for fewer classes as the recall decreases, hence finding fewer bugs as recall drops from 100% to 75%.

We identify this as a weakness of SBST when using defect predictions. To mitigate it, SBST techniques have to take potential errors in the predictions into account. One way to do this is to always generate tests for methods that are predicted buggy, while also generating tests for predicted non-buggy methods with at least a minimum probability. This way, the SBST technique gets a chance to search for tests in incorrectly classified buggy methods (when recall < 100%), while still giving higher priority to the methods that are predicted buggy by the defect predictor.
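As an illustration of this recommendation, the filtering step from Section 2.2.1 could keep predicted non-buggy targets with a small probability instead of discarding them outright; the probability value and function below are hypothetical and were not evaluated in this study.

```python
import random

def filter_targets_with_fallback(targets, method_labels, keep_prob=0.1, rng=random.Random(0)):
    """Keep all targets in predicted-buggy methods; keep targets in predicted non-buggy
    methods with a small probability, so that false negatives still get a chance."""
    kept = []
    for method, target in targets:
        if method_labels.get(method, 0) == 1 or rng.random() < keep_prob:
            kept.append((method, target))
    return kept
```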

4.2. Number of Buggy Methods

As discussed previously, when the recall of the defect predictor decreases, SBST guided by DP completely misses test generation for certain bugs, which leads to poorer bug detection. In our experiments, SBST guided by DP misses test generation for 18.2% of the bugs on average when recall decreases from 100% to 75%. Further analysis of the results indicates that SBST guided by DP misses test generation for only 4.5% of the bugs on average for bugs that are spread across multiple methods, whereas it misses 24.7% of the bugs on average for bugs that are concentrated in only one method. This suggests that bugs found within only one method are more prone to the impact of recall than bugs that are spread across multiple methods.

To understand the effect of recall on finding bugs that are found within only one method versus bugs that are spread across multiple methods, we conduct the Welch ANOVA test (Liu, 2015) separately for the two subsets of our dataset, i.e., bugs having only one buggy method and bugs having more than one buggy method. The reason for carrying out the Welch ANOVA test is that our data fails the assumption of homogeneity of variances across the groups of recall for the bugs having only one buggy method.

Num Df Denom Df F value p-value
# buggy methods > 1 5.00 137.06 67.24 <0.001
# buggy methods = 1 5.00 136.68 395.91 <0.001
Table 2. Summary of the Welch ANOVA test results. Num Df = degrees of freedom of the numerator and Denom Df = degrees of freedom of the denominator.

The results of the Welch ANOVA test are shown in Table 2. There are 135 bugs which have more than one buggy method. The results for these bugs show that recall has a significant overall effect on the number of bugs found by SBST guided by DP (p-value < 0.001) with a large effect size (Carroll and Nordholm, 1975). However, the Games-Howell post-hoc test reveals that the bug detection effectiveness is not significantly different between the recall levels 80%, 85% and 90%, nor between 95% and 100%. This can also be seen in the violin plots in Figure 3.

Figure 3. Distributions of the number of bugs found by SBST guided by DP as violin plots together with the means plot of number of bugs found by SBST guided by DP for the groups of recall. Only for the bugs that have more than one buggy method. Total number of bugs = 135.

There are 285 bugs which have only one buggy method. The results of the Welch ANOVA test for these bugs show that recall has a significant effect on the number of bugs found by SBST guided by DP (p-value < 0.001) with a large effect size. The Games-Howell post-hoc test confirms that the number of bugs found by SBST guided by DP is significantly different between each group of recall (p-values < 0.001) with large effect sizes, as can be seen in Figure 4.

Figure 4. Distributions of the number of bugs found by SBST guided by DP as violin plots together with the means plot of number of bugs found by SBST guided by DP for the groups of recall. Only for the bugs that have one buggy method. Total number of bugs = 285.

In summary, we find that recall has a significant effect on the bug detection effectiveness of SBST guided by DP regardless of whether the bugs are found within one method or spread across multiple methods. However, for the bugs that are spread across multiple methods, the effect size of recall is smaller than for the bugs that are found within one method. In contrast to the bugs found within one method, the effect of recall is not significant between the recall levels 80%, 85% and 90%, nor between 95% and 100%, for the bugs that are spread across multiple methods.

4.3. Sensitivity to the Precision of the Defect Predictor

According to the two-way ANOVA test, the precision of the defect predictor has a statistically significant effect, although with a very small effect size, which suggests the effect is not of meaningful practical significance. Precision is associated with false positives (Equation (2)), i.e., non-buggy methods predicted as buggy by the defect predictor. Change of precision from 100% to 75% means that there are false positives in the defect prediction results. We investigate the buggy method labels produced by the defect predictor and the bug finding results of SBST guided by DP in our experiments to find out if false positives have actually helped SBST guided by DP to find more bugs. We find that the false positives have not contributed to the bug finding performance of SBST guided by DP. We conclude that the impact of precision is not of practical significance to the bug finding performance of SBST.

5. Discussion

Defect predictors have mainly been used to provide a list of likely defective parts of a program (e.g., classes and methods) to programmers, who then manually inspect or test the likely defective parts to find the bugs (Lewis et al., 2013; Dam et al., 2019). In this context, the precision of the defect predictor is very important (Wan et al., 2018). Poor precision means more false positives, which can waste developers’ time and lead to them losing trust in the prediction results (Lewis et al., 2013). However, when the defect predictions are consumed by another automated testing technique such as SBST, this may not be the case. In the context of SBST, our study reveals contrasting findings: the effect of precision on the bug detection performance of SBST is negligible, while the recall of the predictor has a significant impact with a large effect size.

We recommend improving the recall of the defect predictor at the cost of precision to achieve good performance in SBST guided by defect prediction. There is a trade-off between the recall and precision of a defect predictor (Koru and Liu, 2005). Defect predictors are usually good at detecting bugs (i.e., high recall) at the expense of false positives (i.e., low precision). Our study shows that the bug detection effectiveness of SBST guided by DP is highly sensitive to recall, while the effect of precision is negligible. This means that most defect predictors proposed in the literature would be suitable for guiding SBST. As the scope of our study is to analyse the impact of precision in the range of an acceptable defect predictor, i.e., precision of at least 75%, we cannot make any conclusions about defect predictors with precision below 75%. Therefore, we conclude that it is beneficial to increase the recall of the defect predictor by sacrificing precision, as long as precision stays above 75%.

6. Threats to Validity

Construct Validity.

To systematically investigate the impact of defect prediction imprecision, we simulate the predictions by assuming a uniform distribution. This means in our simulations, every method has an equal chance of being labelled as buggy or non-buggy. However, real defect predictors may have different distributions of their predictions depending on the underlying characteristics and nature of the prediction problem, which may impact the realism of a simulated defect predictor. Nevertheless, in the absence of prior knowledge about defect prediction distributions, it is reasonable to assume a uniform distribution of predictions in the defect prediction simulation.

SBST guided by DP generates more than one test case for each target in the CUT. This increases the chances of finding bugs at the cost of larger test suites. Larger test suites are associated with a higher number of assertions in the tests generated by EvoSuite, which need to be manually adapted by developers in practice. We design our study to investigate the impact of imprecision in defect prediction on SBST along one dimension, that is the number of bugs found by the generated test suites. We identify investigating the impact of defect prediction imprecision on SBST in terms of the cost of manual adaptation of the generated assertions as future work, which will complement the findings of our study.

Internal Validity. To account for the randomness in the defect prediction simulation, we repeat the simulations 5 times for each combination of the groups of recall and precision. For each simulation, we repeat the test generation 5 times to account for the non-deterministic behaviour of SBST guided by DP. In total, we conduct 25 test generation runs for each bug and for each level of defect prediction imprecision.

Conclusion Validity. To mitigate threats to conclusion validity, we derive conclusions from the experimental results using sound statistical tests: the two-way ANOVA test, epsilon squared effect size, Tukey’s Honestly-Significant-Difference test, Cohen’s d effect size, the Welch ANOVA test and the Games-Howell post-hoc test.

External Validity. We use 420 real bugs from the Defects4J dataset as the experimental subjects. They are drawn from 6 open source projects. At the time of writing this paper, another 401 bugs from 11 projects had been added to the Defects4J dataset. We acknowledge that these projects do not represent all program characteristics, especially those of industrial projects. Nevertheless, the Defects4J dataset has been widely used as a benchmark in previous work (Shamshiri et al., 2015; Perera et al., 2020; Paterson et al., 2019; Aleti and Martinez, 2020). Future work is needed to investigate the impact of imprecision of defect prediction on SBST with respect to other bug datasets.

SBST guided by DP uses defect prediction information at method level. Our findings may not be generalised to previous work (Perera et al., 2020; Hershkovich et al., 2019) which use defect prediction at a different level of granularity (class level). Nevertheless, the findings from our study will help to further explore the opportunities of combining defect predictions and SBST.

We investigate the impact of defect prediction imprecision only in the range of 75% to 100% for recall and precision. Therefore, our findings may not generalise to defect predictors which have recall or precision less than 75%. While this choice of performance sampling in our simulation is a threat to external validity, it is also a threat to construct validity, since it does not characterise all possible defect predictors. However, we opted to use this range with the justification that it is the range of acceptable performance for a defect predictor as recommended by Zimmermann et al. (Zimmermann et al., 2009).

7. Related Work

7.1. Defect Prediction in Software Testing

Defect prediction was originally proposed to provide a list of likely defective parts of a program to assist developers in code reviews (Lewis et al., 2013; Lewis and Ou, 2011), manual testing (Dam et al., 2019), etc. More recently, defect predictors have been used to inform automated testing techniques as well. G-clef (Paterson et al., 2019) is a test prioritisation strategy that uses the likelihood of the defectiveness of classes to prioritise test cases, and it was shown to be effective at reducing the number of test cases required to find bugs. FLUCCS (Sohn and Yoo, 2019) is a fault localisation approach that leverages the likelihood of methods being defective, and it was shown to significantly outperform state-of-the-art spectrum based fault localisation (SBFL) techniques. Perera et al. (Perera et al., 2020) and Hershkovich et al. (Hershkovich et al., 2019) used defect predictions at the class level to determine the time budget allocated to classes in a project when running test generation with SBST techniques. A class that is highly likely to be defective according to the defect predictor has a higher chance of being selected for test generation (Hershkovich et al., 2019) or is allocated a higher time budget (Perera et al., 2020). While these two works show improved bug detection performance of the proposed SBST techniques, the defect predictors they use have relatively high performance, e.g., 85% recall in (Perera et al., 2020) and 0.95 AUC in (Hershkovich et al., 2019), which can be difficult to achieve for a defect predictor. For example, Zimmermann et al. (Zimmermann et al., 2009) found that only 21 out of 622 cross-project defect predictor combinations have recall, precision and accuracy greater than 75%. In their systematic literature review, Hall et al. (Hall et al., 2011) reported defect predictor performances in the ranges of 5%-95% and 25%-85% for precision and recall, respectively. This leads to the question of how the variation in defect prediction performance affects the bug detection effectiveness of SBST techniques that incorporate defect prediction information. To address this gap, we study the impact of imprecision in defect predictions on the bug detection performance of SBST.

7.2. Search-Based Software Testing

Search-based software testing techniques use search algorithms, such as genetic algorithms, to search for test cases that meet a given criterion, such as branch coverage (Fraser and Arcuri, 2011). The test generation problem can be formulated in two ways: i) as a single objective formulation (Rojas et al., 2017; Fraser and Arcuri, 2011) and ii) as a many objective formulation (Panichella et al., 2017, 2015). In many objective optimisation, such as MOSA (Panichella et al., 2015) and DynaMOSA (Panichella et al., 2017), SBST techniques aim to find a set of non-dominated test cases that minimise the fitness functions for all the test targets, e.g., branches. In single objective optimisation, SBST techniques optimise whole test suites to minimise a single fitness function created by aggregating all the individual test target distances, where a target distance measures how far away the test suite is from covering that target (Fraser and Arcuri, 2011). Whole test suite generation (WS) (Fraser and Arcuri, 2011) and archive-based WS (WSA) (Rojas et al., 2017) are two examples of techniques that use single objective optimisation. Previous work showed that DynaMOSA, a state-of-the-art many objective optimisation technique, is better than single objective optimisation techniques at achieving high code coverage (Panichella et al., 2017). In this paper, we study the effect of defect prediction imprecision on the bug detection performance of an SBST technique that uses many objective optimisation.

7.3. Imprecision in Defect Predictors

A plethora of defect predictors has been proposed over the past 40 years (Wan et al., 2018). Measures such as recall, precision, F-measure, AUC and the Matthews correlation coefficient (MCC) (Yao and Shepperd, 2020) have been used to measure the predictive power of defect predictors (Hall et al., 2011). Of these measures, recall and precision have been widely used in previous work (Hall et al., 2011; Hosseini et al., 2017) and are often preferred by practitioners (Wan et al., 2018). Existing defect predictors have wavering performance. For example, Hall et al. (Hall et al., 2011) reported defect predictor performances from as low as 5% and 25% to as high as 95% and 85% for precision and recall, respectively. Hosseini et al. (Hosseini et al., 2017) reported similar findings in their systematic literature review of cross-project defect predictors. It is thus important to study the impact of this wavering defect prediction performance on the bug detection performance of SBST. In our study, we consider a defect predictor acceptable only if both recall and precision are greater than 75%, as recommended by Zimmermann et al. (Zimmermann et al., 2009), and simulate defect predictions in the range of 75% to 100% for recall and precision.

Previous work reports developers’ opinions about defect predictor performance (Dam et al., 2019; Lewis et al., 2013; Wan et al., 2018), showing that false positives cause developers to waste their valuable time inspecting non-buggy code, which eventually leads to losing trust in the defect predictor (Dam et al., 2019; Lewis et al., 2013). In the eyes of developers, higher precision is more important than higher recall in a defect predictor, because higher precision means fewer false positives (Wan et al., 2018). In the context of using defect prediction to guide SBST, our study reveals contrasting findings: precision has a negligible impact on the bug detection performance of SBST, while the effect of recall is significant.

8. Conclusion

We study the impact of imprecision in defect prediction on the bug detection performance of SBST. We use simulated defect predictors to systematically sample defect prediction performance in the range of 75% to 100% for recall and precision. We use the state-of-the-art SBST technique, DynaMOSA, and incorporate the predictions about buggy methods given by the simulated defect predictor to guide the search for test cases towards likely buggy methods. Through a comprehensive experimental evaluation on 420 bugs from the Defects4J dataset, we find that the recall of the defect predictor has a significant impact on the bug detection effectiveness of SBST, with a large effect size. On the other hand, the impact of precision is not of meaningful practical significance, as indicated by a very small effect size. Further analysis of the results shows that the impact of recall is smaller for bugs that are spread across multiple methods than for bugs contained within a single method.

Based on the results of our study, we make the following recommendations:

  1. SBST techniques must take into account potential errors in the predictions, especially false negatives. One way to do this is to prioritise the likely buggy parts of the program while still guiding the search towards the likely non-buggy parts with at least a minimum probability, as illustrated in the sketch following this list.

  2. In the context of SBST, it is beneficial to increase the recall of the defect predictor at the expense of precision, as long as precision stays above 75%. One straightforward way to do this is to lower the classification threshold (cut-off point) of the predictor so that more components are labelled as buggy, at the cost of more false positives.
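
As a minimal sketch of the first recommendation, the following biases the selection of coverage goals towards methods the predictor labels as buggy while keeping a non-zero selection probability for all other goals. The weights, names and data structures are illustrative assumptions and not the exact mechanism evaluated in this paper.

```python
import random

# Illustrative sketch of recommendation 1: bias the choice of the next
# coverage goal towards methods predicted buggy, while keeping a minimum
# selection probability for goals in predicted non-buggy methods so that
# false negatives of the defect predictor can still be covered.
# Goals are (method, branch_id) pairs; the 0.8/0.2 weights are assumptions.

def pick_goal(goals, predicted_buggy_methods, rng=random.Random(0),
              buggy_weight=0.8, other_weight=0.2):
    weights = [buggy_weight if method in predicted_buggy_methods else other_weight
               for method, _branch in goals]
    return rng.choices(goals, weights=weights, k=1)[0]

# Example: two goals in a predicted-buggy method, one goal elsewhere.
goal = pick_goal([("parse", 1), ("parse", 2), ("render", 7)],
                 predicted_buggy_methods={"parse"})
```

The second recommendation requires no change to the SBST technique itself: lowering the predictor's classification threshold trades precision for recall directly.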

We identify the following directions for future work to extend this study: i) investigate the impact of defect prediction imprecision on SBST in terms of the cost of manually adapting the generated assertions, ii) validate the findings against other bug datasets (Herbold et al., 2020; Madeiral et al., 2019; Saha et al., 2018), and iii) explore options for using the likelihood of defectiveness of methods to guide SBST techniques.

References

  • E. Aftandilian, R. Sauciuc, S. Priya, and S. Krishnan (2012) Building useful program analysis tools using an extensible java compiler. In 2012 IEEE 12th International Working Conference on Source Code Analysis and Manipulation, pp. 14–23. Cited by: §1.
  • A. Aleti and M. Martinez (2020) E-apr: mapping the effectiveness of automated program repair. arXiv preprint arXiv:2002.03968. Cited by: §3.1, §6.
  • M. M. Almasi, H. Hemmati, G. Fraser, A. Arcuri, and J. Benefelds (2017) An industrial evaluation of unit test generation: finding real faults in a financial application. In Proceedings of the 39th International Conference on Software Engineering: Software Engineering in Practice Track, pp. 263–272. Cited by: §1, §3.2.
  • A. Arcuri and G. Fraser (2013) Parameter tuning or default values? an empirical investigation in search-based software engineering. Empirical Software Engineering 18 (3), pp. 594–623. Cited by: §3.3.
  • N. Ayewah, W. Pugh, D. Hovemeyer, J. D. Morgenthaler, and J. Penix (2008) Using static analysis to find bugs. IEEE software 25 (5), pp. 22–29. Cited by: §1.
  • M. Broy, I. H. Kruger, A. Pretschner, and C. Salzmann (2007) Engineering automotive software. Proceedings of the IEEE 95 (2), pp. 356–373. Cited by: §3.3.
  • J. Campos, A. Arcuri, G. Fraser, and R. Abreu (2014) Continuous test generation: enhancing continuous integration with automated test generation. In Proceedings of the 29th ACM/IEEE international conference on Automated software engineering, pp. 55–66. Cited by: §3.3.
  • R. M. Carroll and L. A. Nordholm (1975) Sampling characteristics of Kelley’s ε² and Hays’ ω². Educational and Psychological Measurement 35 (3), pp. 541–554. Cited by: §4.2.
  • J. Cohen (1992) A power primer.. Psychological bulletin 112 (1), pp. 155. Cited by: §4.
  • H. K. Dam, T. Pham, S. W. Ng, T. Tran, J. Grundy, A. Ghose, T. Kim, and C. Kim (2019) Lessons learned from using a deep tree-based model for software defect prediction in practice. In Proceedings of the 16th International Conference on Mining Software Repositories, pp. 46–57. Cited by: §5, §7.1, §7.3.
  • EvoSuite (2019) EvoSuite - automated generation of junit test suites for java classes. Note: Last accessed on: 29/11/2019 External Links: Link Cited by: §3.2.
  • M. Fowler and M. Foemmel (2006) Continuous integration. Cited by: §3.3.
  • G. Fraser and A. Arcuri (2011) Evolutionary generation of whole test suites. In 2011 11th International Conference on Quality Software, pp. 31–40. Cited by: §3.2, §7.2.
  • G. Fraser and A. Arcuri (2012) Whole test suite generation. IEEE Transactions on Software Engineering 39 (2), pp. 276–291. Cited by: §2.2, §3.3.
  • G. Fraser (2018) EvoSuite - automatic test suite generation for java. Note: Last accessed on: 19/09/2019 External Links: Link Cited by: §3.2.
  • G. Gay (2017a) Generating effective test suites by combining coverage criteria. In International Symposium on Search Based Software Engineering, pp. 65–82. Cited by: §3.3.
  • G. Gay (2017b) The fitness function for the job: search-based generation of test suites that detect real faults. In 2017 IEEE International Conference on Software Testing, Verification and Validation (ICST), pp. 345–355. Cited by: §3.1, §3.2.
  • E. Giger, M. D’Ambros, M. Pinzger, and H. C. Gall (2012) Method-level bug prediction. In Proceedings of the 2012 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 171–180. Cited by: §1, §2.2.
  • A. Habib and M. Pradel (2018) How many of all bugs do we find? a study of static bug detectors. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pp. 317–328. Cited by: §1.
  • T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell (2011) A systematic literature review on fault prediction performance in software engineering. IEEE Transactions on Software Engineering 38 (6), pp. 1276–1304. Cited by: §1, §1, §2, §7.1, §7.3.
  • H. Hata, O. Mizuno, and T. Kikuno (2012) Bug prediction based on fine-grained module histories. In 2012 34th international conference on software engineering (ICSE), pp. 200–210. Cited by: §2.2.
  • S. Herbold, A. Trautsch, B. Ledel, A. Aghamohammadi, T. A. Ghaleb, K. K. Chahal, T. Bossenmaier, B. Nagaria, P. Makedonski, M. N. Ahmadabadi, K. Szabados, H. Spieker, M. Madeja, N. Hoy, V. Lenarduzzi, S. Wang, G. Rodríguez-Pérez, R. C. Palacios, R. Verdecchia, P. Singh, Y. Qin, D. Chakroborti, W. Davis, V. Walunj, H. Wu, D. Marcilio, O. Alam, A. Aldaeej, I. Amit, B. Turhan, S. Eismann, A. Wickert, I. Malavolta, M. Sulír, F. Fard, A. Z. Henley, S. Kourtzanidis, E. Tuzun, C. Treude, S. M. Shamasbi, I. Pashchenko, M. Wyrich, J. Davis, A. Serebrenik, E. Albrecht, E. U. Aktas, D. Strüber, and J. Erbel (2020) Large-scale manual validation of bug fixing commits: A fine-grained analysis of tangling. CoRR abs/2011.06244. External Links: Link, 2011.06244 Cited by: §8.
  • E. Hershkovich, R. Stern, R. Abreu, and A. Elmishali (2019) Prediction-guided software test generation. In Proceedings of the 30th International Workshop on Principles of Diagnosis DX’19, pp. . Cited by: §1, §1, §6, §7.1.
  • S. Hosseini, B. Turhan, and D. Gunarathna (2017) A systematic literature review and meta-analysis on cross project defect prediction. IEEE Transactions on Software Engineering 45 (2), pp. 111–147. Cited by: §2, §7.3.
  • R. Just, D. Jalali, and M. D. Ernst (2014) Defects4J: a database of existing faults to enable controlled testing studies for java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, pp. 437–440. Cited by: §3.1.
  • R. Just (2019) Defects4J - a database of real faults and an experimental infrastructure to enable controlled experiments in software engineering research. Note: Last accessed on: 02/10/2019 External Links: Link Cited by: §1, §3.1, §3.4, §3.
  • B. Korel (1990) Automated software test data generation. IEEE Transactions on software engineering 16 (8), pp. 870–879. Cited by: §1.
  • A. G. Koru and H. Liu (2005) Building effective defect-prediction models in practice. IEEE software 22 (6), pp. 23–29. Cited by: §5.
  • C. Lewis, Z. Lin, C. Sadowski, X. Zhu, R. Ou, and E. J. Whitehead Jr (2013) Does bug prediction support human developers? findings from a google case study. In Proceedings of the 2013 International Conference on Software Engineering, pp. 372–381. Cited by: §1, §5, §7.1, §7.3.
  • C. Lewis and R. Ou (2011) Bug prediction at google. Note: Last accessed on: 16/09/2019 External Links: Link Cited by: §1, §7.1.
  • H. Liu (2015) Comparing welch anova, a kruskal-wallis test, and traditional anova in case of heterogeneity of variance. Virginia Commonwealth University. Cited by: §4.2.
  • F. Madeiral, S. Urli, M. Maia, and M. Monperrus (2019) Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies. In Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER ’19), External Links: Link Cited by: §8.
  • F. J. Massey Jr (1951) The kolmogorov-smirnov test for goodness of fit. Journal of the American statistical Association 46 (253), pp. 68–78. Cited by: §4.
  • P. McMinn (2011) Search-based software testing: past, present and future. In 2011 IEEE Fourth International Conference on Software Testing, Verification and Validation Workshops, pp. 153–163. Cited by: §1.
  • A. Panichella, F. M. Kifetew, and P. Tonella (2015) Reformulating branch coverage as a many-objective optimization problem. In 2015 IEEE 8th international conference on software testing, verification and validation (ICST), pp. 1–10. Cited by: §2.2, §7.2.
  • A. Panichella, F. M. Kifetew, and P. Tonella (2017) Automated test case generation as a many-objective optimisation problem with dynamic selection of the targets. IEEE Transactions on Software Engineering 44 (2), pp. 122–158. Cited by: §1, §1, §2.2.2, §2.2, §2, §3.3, §7.2.
  • A. Panichella, F. M. Kifetew, and P. Tonella (2018) A large scale empirical comparison of state-of-the-art search-based test case generators. Information and Software Technology 104, pp. 236–256. Cited by: §1.
  • D. Paterson, J. Campos, R. Abreu, G. M. Kapfhammer, G. Fraser, and P. McMinn (2019) An empirical study on the use of defect prediction for test case prioritization. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), pp. 346–357. Cited by: §1, §3.1, §6, §7.1.
  • S. Pearson, J. Campos, R. Just, G. Fraser, R. Abreu, M. D. Ernst, D. Pang, and B. Keller (2017) Evaluating and improving fault localization. In Proceedings of the 39th International Conference on Software Engineering, pp. 609–620. Cited by: §3.1.
  • A. Perera, A. Aleti, M. Böhme, and B. Turhan (2020) Defect prediction guided search-based software testing. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, pp. . External Links: Document Cited by: §1, §1, §1, §2.2.2, §2.2, §3.1, §3.2, §3.3, §3.3, §3.3, §6, §6, §7.1.
  • J. M. Rojas, M. Vivanti, A. Arcuri, and G. Fraser (2017) A detailed investigation of the effectiveness of whole test suite generation. Empirical Software Engineering 22 (2), pp. 852–893. Cited by: §2.2, §2.2, §7.2.
  • C. Sadowski, J. Van Gogh, C. Jaspan, E. Soderberg, and C. Winter (2015) Tricorder: building a program analysis ecosystem. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1, pp. 598–608. Cited by: §1.
  • R. K. Saha, Y. Lyu, W. Lam, H. Yoshida, and M. R. Prasad (2018) Bugs. jar: a large-scale, diverse dataset of real-world java bugs. In Proceedings of the 15th International Conference on Mining Software Repositories, pp. 10–13. Cited by: §8.
  • A. Salahirad, H. Almulla, and G. Gay (2019) Choosing the fitness function for the job: automated generation of test suites that detect real faults. Software Testing, Verification and Reliability 29 (4-5), pp. e1701. Cited by: §1.
  • S. Shamshiri, R. Just, J. M. Rojas, G. Fraser, P. McMinn, and A. Arcuri (2015) Do automatically generated unit tests find real faults? an empirical study of effectiveness and challenges (t). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 201–211. Cited by: §1, §3.1, §3.2, §3.3, §3.4, §6.
  • J. Sohn and S. Yoo (2019) Empirical evaluation of fault localisation using code and change metrics. IEEE Transactions on Software Engineering. Cited by: §3.1, §3.4, §7.1.
  • J. W. Tukey (1949) Comparing individual means in the analysis of variance. Biometrics, pp. 99–114. Cited by: §4.
  • Z. Wan, X. Xia, A. E. Hassan, D. Lo, J. Yin, and X. Yang (2018) Perceptions, expectations, and challenges in defect prediction. IEEE Transactions on Software Engineering 46 (11), pp. 1241–1266. Cited by: §5, §7.3, §7.3.
  • J. Yao and M. Shepperd (2020) Assessing software defection prediction performance: why using the matthews correlation coefficient matters. In Proceedings of the Evaluation and Assessment in Software Engineering, pp. 120–129. Cited by: §7.3.
  • S. Yigit and M. Mendes (2018) Which effect size measure is appropriate for one-way and two-way anova models? a monte carlo simulation study. Revstat Statistical Journal 16 (3), pp. 295–313. Cited by: §4.
  • T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, pp. 91–100. Cited by: §1, §2, §6, §7.1, §7.3.