Killing Stubborn Mutants with Symbolic Execution

01/09/2020 ∙ by Thierry Titcheu Chekam, et al. ∙ University of Luxembourg

We introduce SEMu, a Dynamic Symbolic Execution technique that generates test inputs capable of killing stubborn mutants (killable mutants that remain undetected after a reasonable amount of testing). SEMu aims at mutant propagation (making erroneous states propagate to the program output) by incrementally searching for divergent program behaviours between the original and the mutant versions. We model the mutant killing problem as a symbolic execution search within a specific area in the programs' symbolic tree. In this framework, the search area is defined and controlled by parameters that allow scalable and cost-effective mutant killing. We integrate SEMu in KLEE and experimented with Coreutils (a benchmark frequently used in symbolic execution studies). Our results show that our modelling plays an important role in mutant killing. Perhaps more importantly, our results also show that, within a two-hour time limit, SEMu kills 37% of the stubborn mutants, where KLEE kills none and where the mutant infection strategy (a strategy suggested by previous research) kills 17%.


1. Introduction

Deep testing is often required in order to assess the core logic and the ‘critical’ parts of the programs under analysis. Unfortunately, performing thorough testing is hard, tedious and time consuming. As a result, testing the most important program parts requires substantial effort, skill and experience.

To deal with this issue, mutation testing aims at guiding the design of strong (likely fault revealing) test cases. The key idea of mutation is to use artificially introduced defects, called mutants, to identify the weaknesses of test suites (undetected mutants indicate test suite deficiencies) and to guide test generation (undetected mutants form the test objectives). Thus, testers can improve their test suites by designing test cases that take the mutation feedback into account.

Experience with mutation testing has shown that it is relatively easy to detect a large number of mutants by simply covering them (Ammann et al., 2014; Papadakis et al., 2016; Petrovic and Ivankovic, 2018). Such trivial mutants are not useful as they fail to provide any particular guidance towards test case design (Schuler and Zeller, 2013). However, experience has also shown that there are a few mutants that are relatively hard to detect (a.k.a. stubborn mutants (Yao et al., 2014)) and can provide significant advantages when used as test objectives (Petrovic and Ivankovic, 2018; Yao et al., 2014). Interestingly, these mutants form special corner cases and are linked with fault revelation (Titcheu Chekam et al., 2017). The importance of using the stubborn mutants as test objectives has also been underlined by several industrial studies (Delgado-Pérez et al., 2018; Baker and Habli, 2013) including a large study with Google developers (Petrovic and Ivankovic, 2018).

Stubborn mutants are hard to detect mainly due to a) the difficulty of infecting the program state (causing an erroneous program state when executing the mutation/defective point) and b) the masking effects that prohibit the propagation of erroneous states to the program output (aka failed error propagation (Androutsopoulos et al., 2014) or coincidental correctness (Abou Assi et al., 2019)). In either case, the issues linked with these mutants form corner cases that are most likely to escape testing (since stubborn mutants form small semantic deviations) (Petrovic and Ivankovic, 2018).
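As a small, hypothetical C illustration of failed error propagation (not an example taken from the paper): the mutant below infects the program state on every execution, yet the erroneous state reaches the output only for two corner-case inputs; all other inputs mask the infection.

/* Illustrative only: the mutated '+' infects the state for every input
 * (t differs by 2), but the final comparison masks the difference for
 * all inputs except a == 0 and a == 1. */
#include <stdio.h>

int original(int a) { int t = a + 1; return t > 0; }
int mutant(int a)   { int t = a - 1; return t > 0; } /* '+' mutated to '-' */

int main(void) {
    for (int a = -3; a <= 3; a++)
        printf("a=%2d original=%d mutant=%d\n", a, original(a), mutant(a));
    return 0; /* the outputs differ only for a = 0 and a = 1 */
}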

Killing stubborn mutants (designing test cases that reveal undetected mutants) is challenging due to the variety of the code paths, constraints and data states of the program versions (original and mutant versions) that need to be differentially analysed. The key challenge here regards the handling of failed error propagation (masking effects), which is prevalent in stubborn mutants. Effective error propagation analysis is still an open problem (Papadakis et al., 2019; Pizzoleto et al., 2019) as it involves state comparisons between the mutant and the original program executions that grow exponentially with the number of the involved paths (from the mutation point to the program output).

We present SEMu, an approach based on dynamic symbolic execution that generates test inputs capable of killing stubborn mutants. The particular focus of SEMu is on the effective and scalable handling of mutant propagation. Our technique executes both the original and the mutant program versions within a single symbolic execution instance, where the mutant executions are “forked” when reaching the mutation points. Each forked execution follows the original one and is compared with it. The comparisons are performed based on the involved symbolic states and related (propagation) constraints that ensure divergences.

A key issue with both symbolic execution and mutation testing regards their scalability. To account for this problem, we develop a framework that allows defining the mutant killing problem as a search problem within a specific area around the mutation points. This area is defined by a number of parameters that control the symbolic exploration. We thus perform a constrained symbolic exploration, starting from a pre-mutation point (a point in the symbolic tree that is before the mutation point) and ending at a post-mutation checkpoint (a point after the mutant) where we differentially compare the symbolic states of the two executions (forked and original) and generate test inputs.

We assume the existence of program inputs that can reach the areas we are targeting. Based on these inputs, we infer preconditions (a set of consistent and simplified path conditions), which we use to constrain the symbolic exploration to only a subset of program paths that are relevant to the targeted mutants. To further restrict the exploration to a relevant area, we systematically analyse the symbolic tree up to a relatively small distance from the mutation point (performing a shallow propagation analysis).

To improve the chances for propagation we also perform a deep exploration of some subtrees. Overall, by controlling the above parameters we can define strategies with trade-offs between space (cost) and depth (effectiveness). Such strategies allow the differential exploration of promising code areas, while keeping their execution time low.

Many techniques targeting mutation-based test generation have been proposed (Anand et al., 2013; Papadakis et al., 2019). However, most of these techniques focus on generating unit-level test suites from scratch, mainly by either covering the mutants or by causing an erroneous program state at the mutation point. Yet, there is no work leveraging the value of existing tests to perform deep testing by targeting stubborn mutants, which are mostly hard to propagate. Moreover, none of the available symbolic execution tools aims at generating test inputs for killing mutants.

We integrated SEMu (publicly available at https://github.com/thierry-tct/KLEE-SEMu) into KLEE (Cadar et al., 2008). We evaluated SEMu on 47 programs from Coreutils, real-world utility programs written in C, and compared it with the mutant infection strategy, denoted as infection-only, that was proposed by previous work (Zhang et al., 2010; Harman et al., 2011). Our results show that SEMu achieves significantly higher stubborn-mutant killing rates than both KLEE (alone) and the infection-only strategy (approximately +37% and +20%, respectively) on the majority of the studied subjects.

In summary, our paper makes the following contributions:

  1. We introduce and implement a symbolic execution technique for generating tests that kill stubborn mutants. Our technique leverages existing tests in order to perform a deep and targeted test of specific code areas.

  2. We model the mutant killing as a search problem within a specific area (around the mutation point). Such a modelling allows controlling the symbolic execution cost, while at the same time forming cost-effective heuristics.

  3. We report empirical results demonstrating that SEMu has a strong mutant killing ability, which is significantly superior to KLEE and other mutation-based approaches.

2. Context

Our work aims at the automatic test input generation for specific methods/components of the systems under test. Our working scenario assumes that testers have performed some basic testing and want to dig into some specific parts of the program. This is a frequent scenario, used to increase confidence in critical code parts (those encoding the core program logic) or in parts about which testers feel uncertain. To do so, it is reasonable to use mutation testing by adding tests that detect the surviving mutants (mutants undetected by the existing test suite) (Schuler and Zeller, 2013; Yao et al., 2014).

We consider a mutant as detected (killed) by a test when its execution leads to an observable output that differs from that of the original program. According to our scenario, the targeted mutants are those (killable ones) that survive a reasonable amount of testing. This definition depends on the amount of testing performed; strong test suites kill more mutants than weak ones, while ‘adequate’ test suites kill them all (Yao et al., 2014; Ammann and Offutt, 2008).

To adopt a baseline for basic or ‘reasonable amount of testing’ we augment the developer test suites with KLEE. This means that the stubborn mutants are those that are killable and survive the developer and automatically generated test suites. The surviving mutants form the objectives for our test generation technique.

2.1. Symbolic Encoding of Programs

Independently of its language, we define a program as follows.

Definition 2.1 (Program).

A program is a Labeled Transition System (LTS) $P = (L, l_0, L_T, V, I, \Delta)$ where:

  • $L$ is a finite set of control locations;

  • $l_0 \in L$ is the unique entry point (start) of the program;

  • $L_T \subseteq L$ is the set of terminal locations of the program;

  • $V$ is a finite set of variables;

  • $I$ is a predicate capturing the set of possible initial valuations of $V$;

  • $\Delta : L \times G \rightarrow L$ is a deterministic transition function where each transition is labeled with a guarded command of the form $g \Rightarrow u$, where $g$ is a guard condition and $u$ is a function updating the valuation of the variables in $V$ ($G$ denotes the set of such labels).

The LTS modelling a given program $P$ defines the set of control paths from $l_0$ to any terminal location in $L_T$. A path $\pi = \langle t_1, \dots, t_n \rangle$ is a sequence of connected transitions, i.e. such that the target location of $t_i$ is the source location of $t_{i+1}$ for all $i$. Any well-terminating execution of the program goes through one such path. Since we consider deterministic programs, this path is unique and determined by the initial valuation (i.e. the test input) of the variables $V$. More precisely, each path $\pi$ defines a path condition $\phi_\pi$ which symbolically encodes all executions going through $\pi$. This path condition consists of a Boolean formula over the program inputs such that a test $t$ with input $v_0$ executes through $\pi$ iff $v_0 \models \phi_\pi$. By solving $\phi_\pi$ (e.g. with a constraint solver like Z3 (De Moura and Bjørner, 2008)), one can obtain an initial valuation satisfying the path condition, thereby obtaining a test input that goes through the corresponding program path.

The execution of the resulting test input $t$ is a sequence of couples of variable valuations and locations, noted $(v_0, l_0), (v_1, l_1), \dots, (v_n, l_n)$, such that $v_0$ satisfies $I$, $l_n \in L_T$, and, for all $i$, the transition from $l_i$ to $l_{i+1}$ is labeled with a guarded command $g \Rightarrow u$ such that $v_i \models g$ and $v_{i+1} = u(v_i)$. While $v_n$ is the valuation of all variables when $t$ terminates, the observable result of $t$ (its output), noted $O(t)$, is the subset of $v_n$ restricted to the observable variables only. Since a path $\pi$ encompasses a set of executions, we can also represent the set of outputs of those executions as a symbolic formula $O_\pi$.

2.2. Symbolic Encoding of Mutants

A mutation alters or deletes a statement of the original program $P$. Thus, a mutant $M$ is defined as a change in the transitions of $P$ that correspond to that statement (i.e. two transitions for branching statements; one for the others).

Definition 2.2 (Mutant).

Let $P = (L, l_0, L_T, V, I, \Delta)$ be an original program. A mutant $M$ of $P$ is a program $M = (L, l_0, L_T, V, I, \Delta_M)$, with $\Delta_M \neq \Delta$, such that $\Delta_M$ differs from $\Delta$ only on the transitions outgoing from the mutated statement.

It may happen that a program mutation leads to an equivalent mutant (i.e. a mutant semantically equivalent to the original program), that is, for any test input $t$, the outputs of $P$ and $M$ are identical. All non-equivalent mutants, however, should be discriminated (i.e. killed) by at least one test input. Thus, there must exist a test input $t$ that satisfies the following three conditions (referred to as RIP (Ammann and Offutt, 2008; DeMillo and Offutt, 1991; Morell, 1990)): the execution of $t$ on $P$ must (i) reach a mutated transition, (ii) infect (cause a difference in) the internal program state (i.e. change the variable valuations or the reached control locations), and (iii) propagate this difference up to the program outputs. One can encode those conditions as a symbolic formula over a pair of paths of $P$ and $M$; any valuation satisfying this formula forms a test input killing $M$. For given paths $\pi$ (of $P$) and $\pi_M$ (of $M$), $KC(\pi, \pi_M)$ denotes the formula encoding the test inputs that kill $M$ and go through $\pi$ and $\pi_M$ in $P$ and $M$, respectively.
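Using the notation above, and as a sketch only (the paper's exact formulation may differ), the kill condition for a pair of paths $\pi$ (of $P$) and $\pi_M$ (of $M$) can be written as

$KC(\pi, \pi_M) \equiv \phi_\pi \wedge \phi_{\pi_M} \wedge (O_\pi \neq O_{\pi_M})$,

i.e. the same input must follow $\pi$ in $P$ and $\pi_M$ in $M$, and the symbolic outputs of the two paths must differ.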

Definition 2.3 (Mutant killing problem).

Let $P$ be an original program and $\mathcal{M}$ be a set of mutants of $P$. Then the mutant killing problem is the problem of finding, for each mutant $M \in \mathcal{M}$:

  1. two paths $\pi$ of $P$ and $\pi_M$ of $M$ such that $KC(\pi, \pi_M)$ is satisfiable;

  2. a test input $t$ satisfying $KC(\pi, \pi_M)$.

Figure 1. Example. The rounded control locations represent conditionals (at least 2 possible transitions from them).

2.3. Example

Figure 1 shows a simple C program. The corresponding C code and transition system are shown in the left and middle of Figure 1, respectively. The transition system does not show the guarded commands, for readability. The right side of Figure 1 shows two test inputs, $t_1$ and $t_2$, and their corresponding traces (as sequences of control locations of the transition system). The transition system contains one control location per numbered line of the program (12 in total). The squared nodes of the transition system represent the non-branching control locations and the circular nodes represent the branching control locations. For simplicity, we assume that each line is atomic. The initial condition allows the input variables to take any integer value. Two mutants, $M_1$ and $M_2$, are generated by mutating two statements of the program (the concrete changes are shown in Figure 1); they result from the mutation of the guarded commands of the two corresponding transitions.

The test execution of $t_1$ reaches $M_1$ but not $M_2$, while $t_2$ reaches $M_2$ but not $M_1$. Test $t_1$ infects $M_1$ and $t_2$ infects $M_2$. The test execution of $t_1$ on the original program and on mutant $M_1$ returns two different outputs; the mutant $M_1$ is therefore killed by $t_1$. Conversely, the test execution of $t_2$ on the original program and on mutant $M_2$ returns the same output; test $t_2$ does not kill mutant $M_2$.

3. Symbolic Execution

One can apply symbolic execution to explore the different paths, using a symbolic representation of the input domain (as opposed to concrete values) and progressively building the path conditions of the explored paths. The symbolic execution starts by setting an initial path condition $\phi$ to $\mathit{true}$. At each location, it evaluates (by calling a dedicated solver) the guarded command of every outgoing transition. If the conjunction of the guard condition and $\phi$ is satisfiable, then there exists at least one concrete execution that can go through the current path and the considered transition. In this case, the symbolic execution reaches the target location and $\phi$ is updated by injecting into it the guarded command of the transition. When multiple transitions are available, the symbolic execution successively chooses one and pursues the exploration, e.g. in a breadth-first manner.
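As a small illustration (not taken from the paper): suppose the current path condition is $\phi = (a \geq 0)$ and the next location branches on the guard $b > a$. The two candidate successor states carry the path conditions

$\phi_1 = (a \geq 0) \wedge (b > a)$ and $\phi_2 = (a \geq 0) \wedge \neg(b > a)$,

and each branch is explored further only if a solver finds its path condition satisfiable.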

As the symbolic execution progresses, it explores additional paths. The explored paths can together be concisely represented as a tree (King, 1976) where each node is an execution state made of its path condition and symbolic program state (itself constituted by the current control location – program counter value – and the current symbolic valuation of variables).

Still, the tree remains too large to be explored exhaustively. Thus, one typically guides the symbolic execution to restrict the paths to explore, effectively cutting branches of the tree. Preconditioned symbolic execution attempts to reduce the path exploration space by setting the initial path condition (at the beginning of the symbolic execution) to a specific condition. This precondition restricts the symbolic execution to the subset of paths that are feasible given the precondition. The idea is to derive the preconditions from pre-existing tests (aka seeds in the KLEE platform) that reach the particular points of interest. This allows us to provide vital guidance towards reaching the areas that should be explored symbolically, while drastically reducing the search space. In the rest of the paper, we refer to a preconditioned symbolic execution that explores the paths followed by some concrete executions as “seeded symbolic execution”.

Overall, one can make the following steps to generate test inputs for a program via symbolic execution:

  1. Precondition: specify a logical formula over the program inputs (computed as the disjunction of the path conditions of the paths followed by the executions of the seeds) to prune out the paths that are irrelevant to the analysis.

  2. Path exploration: explore a subset of the paths of $P$, effectively discarding infeasible paths.

  3. Test input generation: for each feasible path $\pi$, solve $\phi_\pi$ to generate a test input whose execution follows $\pi$.

4. Killing Mutants

4.1. Exhaustive Exploration

A direct way to generate test inputs killing some given mutants (of program $P$) is to apply symbolic execution on both $P$ and the mutants, thereby obtaining their respective sets of (symbolic) paths. Then, we can solve $KC(\pi, \pi_M)$ to generate a test input that kills mutant $M$ and goes through $\pi$ in $P$ and through $\pi_M$ in $M$.

Figure 2 illustrates the use of symbolic execution to kill one of the mutants of Figure 1 (call it $M$). We skip the symbolic execution subtree whose paths do not reach the mutant, since those paths can easily be pruned using static analysis. Also, we do not represent the symbolic variables that are not updated in this example. The symbolic execution on the original program leads to two paths, $\pi_1$ and $\pi_2$, and the symbolic execution on the mutant leads to two paths, $\pi_1^M$ and $\pi_2^M$; their path conditions and symbolic outputs are given in Figure 2.

The test generation that targets mutant $M$ solves the following formulae:

  1. $KC(\pi_1, \pi_1^M)$. Satisfiable: the solver returns an example solution (a killing test input).

  2. $KC(\pi_2, \pi_2^M)$. Unsatisfiable: no possible output difference.

  3. $KC(\pi_1, \pi_2^M)$. Unsatisfiable: infeasible path combination.

  4. $KC(\pi_2, \pi_1^M)$. Unsatisfiable: infeasible path combination.

This method effectively generates tests to kill killable mutants. However, it requires a complete symbolic execution on $P$ and on each mutant $M$. This implies that (i) all the path conditions and symbolic outputs have to be stored and analysed, and (ii) $KC(\pi, \pi_M)$ possibly has to be solved for each pair of paths $(\pi, \pi_M)$. This leads to a large computational cost that makes the approach impractical.

Figure 2. Example of symbolic execution for mutant test generation. After the mutated control location, the symbolic execution on the original program and that of the mutant contain different transitions (with different guarded commands).

4.2. Conservative Pruning of the Search Space

To reduce the computational costs induced by the exhaustive exploration, we apply two safe optimizations (they preserve all opportunities to kill the mutants) that prune the space of program paths. We take advantage of the fact that mutants are simple syntactic alterations that share a large portion of their code with the original program.

4.2.1. Meta-mutation

Our first optimization stems from the observation that all paths and path prefixes of the original program that do not include a mutated statement (i.e. a location whose outgoing transitions are changed in some mutant) also belong to the mutants. Thus, the symbolic execution of $P$ and that of the mutants may explore a significant number of identical path prefixes. As seen in Figure 2, the symbolic execution is identical for the original and the mutant up to the mutated control location. Instead of making two separate symbolic executions, SEMu performs a shared symbolic execution based on a meta-mutant program. A meta-mutant (Untch et al., 1993; Papadakis and Malevris, 2010, 2011) represents all mutants in a single program. A branching statement (named mutant choice statement) is inserted at each mutation point and controls, based on the value of a special global variable (the mutant ID), the execution of the original and mutant programs.

The symbolic execution on the meta-mutant program initialises the mutant ID to an unknown value and explores a path normally until it encounters a mutant choice statement. Then, the path is duplicated once for the original program and once for each mutant, with the mutant ID set to the corresponding value, and each duplicated path is further explored normally. While the effect of this optimization is limited to the prefixes common to the program and all mutants, it reduces the overall cost of exploration at an insignificant computational cost and without compromising the results.
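For intuition, the following is a minimal, hypothetical C sketch of a meta-mutant (names such as SEMU_MUTANT_ID are illustrative and not taken from the SEMu or Mart implementations): a mutant choice statement inserted at the mutation point branches on a global mutant ID, so a single (symbolic) execution can represent the original program and all of its mutants.

/* Hypothetical meta-mutant: the original statement `x = a + b;` and two
 * mutants of it, selected by the global SEMU_MUTANT_ID (0 = original). */
#include <stdio.h>
#include <stdlib.h>

int SEMU_MUTANT_ID = 0; /* made symbolic by the symbolic executor */

int compute(int a, int b) {
    int x;
    switch (SEMU_MUTANT_ID) {   /* mutant choice statement */
        case 1:  x = a - b; break;  /* mutant 1: '+' replaced by '-' */
        case 2:  x = a * b; break;  /* mutant 2: '+' replaced by '*' */
        default: x = a + b; break;  /* original statement */
    }
    return x;
}

int main(int argc, char **argv) {
    if (argc > 1) SEMU_MUTANT_ID = atoi(argv[1]); /* concretely pick a version */
    printf("%d\n", compute(3, 4));
    return 0;
}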

4.2.2. Discarding non-infected mutant paths

In practice, many execution paths reach (cover) a mutant but fail to infect the program state (i.e. to introduce an erroneous program state). Extending the execution along such paths is a waste of effort, as the mutant will not be killed along those paths. Thus, SEMu terminates early the exploration of any path that reaches the mutant but fails to infect the program state.

4.3. Heuristic Search

Even with the aforementioned optimizations, the exhaustive exploration procedure remains too costly, due to two factors: the size of the tree to explore and the number of couples of paths $(\pi, \pi_M)$ to consider. To speed up the analysis, one can further prune the search space, at the risk of generating useless test inputs (that kill no mutant) or missing opportunities to kill mutants (by ignoring relevant paths).

A first family of heuristics reduces the number of paths to explore by selecting and prioritizing them, at the risk of discarding paths that would lead to killing mutants. A second family stops exploring a path after $k$ transitions and solves, instead of $KC(\pi, \pi_M)$, the formula

$KC_k(\pi, \pi_M) \equiv \phi_{\pi[k]} \wedge \phi_{\pi_M[k]} \wedge (s_{\pi[k]} \neq s_{\pi_M[k]})$

where, for any path $\pi$, $\pi[k]$ denotes the prefix of $\pi$ of length $k$ and $s_{\pi[k]}$ is the symbolic state reached after executing $\pi[k]$. It holds that any test input killing $M$ through $(\pi, \pi_M)$ also satisfies $KC_k(\pi, \pi_M)$, since a mutation cannot propagate to the output of the program if it does not infect the program state in the first place. The converse does not hold, though: statements after the mutation can cancel the effects of an infection, rendering the output unchanged at the end of the execution. The problem then boils down to selecting an appropriate length $k$ where to stop the exploration, so as to maximize the chances of finding an infection that propagates up to the observable outputs.

As illustrated in Figure 2, generating a test early (at a control location shortly after the mutation point) requires solving a constraint whose solutions may not propagate the infection to the output. However, generating a test at a later control location, using the original path and the mutant path, requires solving a constraint such that any value returned by the constraint solver kills the mutant.

An ideal method to kill a mutant $M$ would explore only one path $\pi$ of $P$ and one path $\pi_M$ of $M$, and only up to the smallest prefix length $k$ at which the constraint solver can generate a test that kills $M$. However, identifying the right paths and the optimal $k$ is hard, as it requires precisely capturing the program semantics. To overcome this difficulty, SEMu defines heuristics to prune non-promising paths on the fly and to control at what point (what prefix length $k$) to call the constraint solver. Once candidate path prefixes are identified, SEMu invokes the solver to solve $KC_k(\pi, \pi_M)$.

5. SEMu Cost-Control Heuristics

SEMu consists of parametric heuristics to control the symbolic exploration of promising code regions. Any configuration of SEMu sets the parameters of the heuristics, which together define which paths to explore and the test generation process. SEMu also takes as inputs the original program, the mutants to kill and a set of pre-existing test inputs to drive the seeded symbolic execution. During the symbolic exploration, SEMu selects which paths to explore and when to stop the exploration to generate test inputs based on the obtained path prefix.

Figure 3. Illustration of SEMu cost-control parameters. Subfigure (a) illustrates the Precondition Length where the green subtree represents the candidate paths constrained by the precondition (the thick green path prefix is explored using seeded symbolic execution). Subfigure (b) illustrates the Checkpoint Window (here CW is 2). Subfigure (c) illustrates the Propagation Proportion (here PP is 0.5) and the Minimum Propagation Depth (here if MPD is 1 the first test is generated, for unterminated paths, from Checkpoint 1).

5.1. Pre Mutation Point: Controlling for Reachability

To improve the efficiency of the path exploration, it is important to quickly prune paths that are infeasible (cannot be executed) or irrelevant (cannot reach the mutants). To achieve this, we leverage seeded symbolic execution (as implemented in KLEE), where the seeds are pre-existing tests. We proceed in two steps. First, we explore the paths in seeded mode up to a given length (the precondition length). Then, we stop following the seeds’ executions and proceed with a non-seeded symbolic execution. The location of the switching point thus determines where the exploration stops using the precondition. In particular, if it is set to the entry point of the program then the execution is equivalent to a full non-seeded symbolic execution; if it is set beyond the program output then it is equivalent to a fully seeded symbolic execution. Formally, let $\Pi(P)$ denote the complete set of paths of a program $P$, $S$ be the set of seeds, and $l$ be the chosen precondition length. Then the set of explored paths resulting from the seeded symbolic execution of length $l$ with seeds $S$ is the largest set $\Pi_{S,l} \subseteq \Pi(P)$ satisfying

$\forall \pi \in \Pi_{S,l},\ \exists s \in S : \pi[l] = \pi_s[l]$

where $\pi_s$ denotes the path followed by the execution of seed $s$ and $\pi[l]$ is the prefix of $\pi$ of length $l$ (as in Section 4.3).

This heuristic is illustrated in Figure 3a, where the thick (green) segments represent the portion of the tree explored by seeded symbolic execution and the subtree below (light green) represents the portion explored by non-seeded symbolic execution. The precondition leads to pruning the leftmost subtree.

Accordingly, the first parameter of SEMu controls the precondition length (PL) at which to stop the seeded symbolic execution. Instead of demanding a specific length $l$, the parameter can take two values reflecting two strategies to define $l$ dynamically: GMD2MS (Global Minimum Distance to Mutant Statement) and SMD2MS (Specific Minimum Distance to Mutant Statement). When set to GMD2MS, the precondition length is defined, for all explored paths, as the length of the smallest path prefix that reaches a mutated statement. When set to SMD2MS, the precondition length is defined, individually for each path $\pi$, as the length of the smallest prefix of this path that reaches a mutated statement.

5.2. Post Mutation Point: Controlling for Propagation

From the mutation point, all paths of the original program are explored. When it comes to a mutant, however, it happens that path prefixes that cover the mutant and infect the program state fail to propagate the infection to the outputs. These prefixes should be discarded to reduce the search space. Accordingly, our next set of parameters controls where to check that the propagation is still ongoing, the number of paths to continue exploring from those checkpoints, and when to stop the exploration and generate test inputs. Overall, these parameters contribute to reducing the number of paths explored by the symbolic execution as well as the length of the path prefixes from which tests are generated.

5.2.1. Checkpoint Location

The first parameter is an integer named the Checkpoint Window (CW) which determines the location of the checkpoints. Any checkpoint is a branching program location (i.e. a location with more than one outgoing guarded transition) that is found after the mutation point. The checkpoint window then defines the number of branching statements (that are not checkpoints) between the mutation point and the first checkpoint, and between any two consecutive checkpoints. The effect of this parameter is illustrated in Figure 3b. The marked horizontal lines represent the checkpoints. In this case, the checkpoint window is set to 2, meaning that there are two branching statements between two checkpoints. At each checkpoint, SEMu can perform two actions: (1) discard some branches (path suffixes) of the current path prefix (by ignoring some of the branches) and (2) generate tests based on the current prefix. Whether and how those two actions are performed is determined according to the following parameters.
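As a hypothetical C fragment (not from the paper) showing where checkpoints would fall for a Checkpoint Window of 2, counting the branching statements that follow the mutation point:

/* With CW = 2, two non-checkpoint branching statements separate the
 * mutation point from the first checkpoint, and any two consecutive
 * checkpoints from each other. */
int f(int a, int b) {
    int x = a + b;          /* mutation point (e.g. the '+' is mutated) */
    if (x > 0)  { b++; }    /* branching statement 1 (skipped)          */
    if (b > a)  { a--; }    /* branching statement 2 (skipped)          */
    if (a != x) { x = 0; }  /* branching statement 3: checkpoint 0      */
    if (x % 2)  { x++; }    /* branching statement 4 (skipped)          */
    if (b % 2)  { b--; }    /* branching statement 5 (skipped)          */
    if (a > 0)  { x += b; } /* branching statement 6: checkpoint 1      */
    return x;
}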

5.2.2. Path Selection

The parameter Propagating Proportion (PP) specifies the percentage of the branches that are kept to pursue the exploration, whereas the parameter Propagation Selection Strategy (PSS) determines the strategy used to select these branches. We implemented two strategies: random (RND) and Minimum Distance to Output (MDO). The first one simply selects the branches randomly with a uniform probability. The second one assigns a higher priority to the branches that can lead to the program output more rapidly (i.e. by executing fewer statements). This distance is estimated statically based on the control flow and call graphs of the program. The two parameters are illustrated in Figure 3c, where the crossed subtrees represent branches pruned at Checkpoint 0.

5.2.3. Early Test Generation

Generating test inputs before the end of the symbolic execution (on the path prefixes) allows us to reduce its computation cost. Being placed after the mutation point, all checkpoints are potential places where to trigger the test generation. However, generating sooner reduces the chances of seeing the infection propagate to the program output. To alleviate this risk, we introduce the parameter Minimum Propagation Depth (MPD), which specifies the number of checkpoints that the execution must pass through before starting to generate tests. In Figure 3c, if MPD is set to 1 then tests are generated from Checkpoint 1 (for the two remaining path prefixes). Note that in case MPD is set to 0, tests are generated for the crossed (pruned) path prefixes at Checkpoint 0.

5.3. Controlling the Cost of Constraint Solving

Remember that $KC_k$ requires the states of the original program and the mutant to be different. The subformulae representing the symbolic program states can be large and/or complex, which may hinder the performance of the invoked constraint solver. To reduce this cost, we devise a parameter No State Difference (NSD) that determines whether to consider the program state differences when generating tests. When NSD is set to True, $KC_k(\pi, \pi_M)$ is reduced to $\phi_{\pi[k]} \wedge \phi_{\pi_M[k]}$ (the state-difference constraint is dropped); however, its solution has lower chances of killing the mutant $M$.

5.4. Controlling the Number of Attempts

Generating a single test that covers a mutant is usually sufficient to kill it. However, the stubborn mutants that we target may not be killed by the early attempts (applied closer to the mutation point) and require deeper analysis. Furthermore, a test generated to kill a mutant may collaterally kill another mutant. For those reasons, generating more than one test for a given mutant can be beneficial. Doing this, however, comes at higher test generation and test execution costs. To control this, we devise a parameter Number of Tests Per Mutant (NTPM) that specifies the number of tests generated for each mutant (i.e. the number of formulas solved for each mutant).

6. Empirical Evaluation

6.1. Research Questions

We first empirically evaluate the ability of SEMu to kill stubborn mutants. This is an essential question, since there is no point in evaluating SEMu if it cannot kill some of the targeted mutants.

RQ1:

What is the ability of SEMu to kill stubborn mutants?

Since the results of RQ1 indicate a strong killing ability of SEMu, we turn our attention to the question of whether this killing ability is due to the extended symbolic exploration that is anyway performed by KLEE. We thus compare SEMu with KLEE by running KLEE in seeded mode (using the initial test suite as seeds for KLEE test generation) to generate additional tests. Such a comparison also forms a first natural baseline. This motivates RQ2:

RQ2:

How does SEMu compare with KLEE in terms of killed stubborn mutants?

Perhaps not surprisingly, we found that SEMu outperforms KLEE. This provides evidence that our dedicated approach is indeed suitable for mutation-based test generation. At the same time though, our results raise further questions on whether the superior killing ability of SEMu is due to mutant infection (suggested by previous research) or due to mutant propagation (the specific target of SEMu). In case mutant infection is sufficient for killing stubborn mutants, mutant propagation should be skipped in order to save effort and resources. To investigate this, we ask:

RQ3:

How does SEMu compare with the infection-only strategy in terms of killed stubborn mutants?

6.2. Test Subjects

To answer our research questions, we experimented with the C programs of GNU Coreutils (https://www.gnu.org/software/coreutils/, version 8.22). GNU Coreutils is a collection of text, file, and shell utility programs widely used in Unix systems. The whole codebase of Coreutils is made of more than 60,000 lines of C code (measured with cloc, http://cloc.sourceforge.net/).

The repository of Coreutils contains developer tests for the utility programs; these are system tests, written in shell or Perl scripts, that involve more than 20,000 lines of code.

Applying mutation analysis on all Coreutils programs requires an excessive amount of effort. Therefore, we randomly sampled 60 programs, on which we performed our analysis. Unfortunately, for 13 of them mutation analysis took excessive computational time (due to costly test execution), so we terminated their analysis. Therefore, we analysed 47 programs. These are: base64, basename, chcon, chgrp, chmod, chown, chroot, cksum, date, df, dirname, echo, expr, factor, false, groups, join, link, logname, ls, md5sum, mkdir, mkfifo, mknod, mktemp, nproc, numfmt, pathchk, printf, pwd, realpath, rmdir, sha256sum, sha512sum, sleep, stdbuf, sum, sync, tee, touch, truncate, tty, uname, uptime, users, wc, whoami. A separate figure (not shown) presents the sizes of these subjects.




For each subject we selected the 3 functions that were covered by the largest number of developer tests (from the initial test suite).

6.3. Employed Tools

We implemented our approach on top of LLVM (https://llvm.org/) using the symbolic virtual machine KLEE (Cadar et al., 2008). Our tool is based on KLEE revision 74c6155 and LLVM 3.4.2. Our implementation modified (or added) more than 8,000 lines of code in KLEE and is publicly available at https://github.com/thierry-tct/KLEE-SEMu. We are planning to add support for newer versions of LLVM. To convert system tests into the seed format required by KLEE for the seeded symbolic execution, we use Shadow (Palikareva et al., 2016).

Our tool requires the targeted mutants to be represented in a meta-mutant program (presented in Section 4.2.1), which we produced using the Mart (Titcheu Chekam et al., 2019) mutant generation tool. Mart mutates a program by applying a set of mutation operators (code transformations) to the original LLVM bitcode program.

6.4. Experimental Setup

6.4.1. Selected Mutants

To perform our experiment we need to form our target mutant set. To do so, we employed Mart with its default configuration and generated 172,919 mutants. This configuration generates a comprehensive set of mutants based on a large set of mutation operators, consisting of 816 code transformations. It is noted that the operator set includes the 5 classical operators (Offutt et al., 1996) that are used by most of today's studies and mutation testing tools. Unfortunately, space constraints prohibit us from detailing the operator set. The interested reader is referred to Mart's paper for further details (Titcheu Chekam et al., 2019).

To identify the stubborn mutant set we started by eliminating trivially equivalent and duplicated mutants to form our initial mutant set. To do so, we applied Trivial Compiler Equivalence (TCE) (Papadakis et al., 2015), a technique that statically removes a large number of mutant equivalences. In our experiment, TCE removed a total of 102,612 mutants as being equivalent or duplicated. This gave us 70,307 mutants to be used as our initial mutant set.
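For intuition (an illustration, not an example from the paper): TCE compiles the original program and each mutant with optimizations and declares a mutant trivially equivalent (or a pair of mutants duplicated) when the resulting binaries are identical. For instance, the two C functions below typically compile to the same optimized code, so such a mutant would be discarded.

/* Illustrative only: with optimizations enabled, both functions usually
 * reduce to `return x;`, so TCE would flag the mutant as equivalent. */
int orig_fn(int x)   { return x + 0; }  /* original            */
int mutant_fn(int x) { return x - 0; }  /* mutant: '+' -> '-'  */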

Then, we constructed our initial test suites (composed of the developer test suite augmented with a simple test generation run of KLEE). To generate these tests with KLEE, we set a test generation timeout of 24 hours, while using the same configurations presented by the authors of KLEE (Cadar et al., 2008) (except for a larger memory limit and max-instruction-time, set to 9 GB and 30 s respectively). This run resulted in 5,161 tests (2,693 developer tests and 2,468 tests generated by the initial run of KLEE).

We then executed the initial test suites against the initial mutant set and identified the live and killed mutants. The killed mutants were discarded, while the live ones formed our target mutant set, i.e., the target of SEMu. In our experiment we found that the target set included 26,278 mutants, which is approximately 37% of the initial mutant set. It is noted that the target set is a superset of the stubborn mutants as it includes both stubborn and equivalent mutants. Unfortunately, judging mutant equivalence is undecidable and thus we cannot remove such mutants before test generation. Therefore, to preserve realistic settings we are forced to run SEMu on all surviving mutants.

To evaluate SEMu's effectiveness we need to measure the extent to which it can kill stubborn mutants. Unfortunately, the target mutant set contains a large proportion of equivalent mutants (Schuler and Zeller, 2013), which may result in significant underestimations of test effectiveness (Kurtz et al., 2016). Additionally, it may contain a large portion of subsumed mutants (mutants killed collaterally by tests designed to kill other mutants), which may inflate (overestimate) test effectiveness (Papadakis et al., 2016). Although we discarded easy-to-kill mutants, it is likely that a significant amount of ‘noise’ still remains.

To reduce such biases (both under- and over-estimations) (Papadakis et al., 2016; Kurtz et al., 2016), there is a need to filter out the subsumed mutants by forming the subsuming mutant set (Papadakis et al., 2019; Kintis et al., 2010). The subsuming mutants are mainly distinct (in the sense that killing one of them does not alter, i.e. increase or decrease, the chances of killing the others), providing objective estimations of test effectiveness. Unfortunately, identifying subsuming mutants is undecidable and thus several mutation testers, e.g., Ammann et al. (Ammann et al., 2014), Papadakis et al. (Papadakis et al., 2016) and Kurtz et al. (Kurtz et al., 2016), suggested approximating them through strong test suites. Therefore, to approximate them, we used the combined test suite that merges all the tests generated by KLEE and by SEMu across its 128 different configurations (refer to Section 6.4.2 for details). This process was applied on the target mutant set and resulted in a set of 529 subsuming mutants. In the rest of the paper we call these the reference mutants, and we use them for our effectiveness evaluation.

Overall, through our experiments we used two distinct mutant sets: the target (surviving) mutants and the reference (subsuming) mutants. To preserve realistic settings, the former is used for test generation, while the latter is used for test evaluation (to reduce bias).

6.4.2. SEMu Configuration

To specify relevant values for our modelling parameters we performed an ad-hoc exploratory analysis on some small program functions. Based on this analysis we specified 2 relevant values for each of the 7 parameters (defined in Section 5). These values provided us with the basis for constructing a set of configurations (parameter combinations) to experiment with. In particular, the values we used are the following:

  • Precondition Length: GMD2MS and SMD2MS;
  • Checkpoint Window: 0 and 3;
  • Propagating Proportion: 0 and 0.25;
  • Propagating Selection Strategy: RND and MDO;
  • Minimum Propagation Depth: 0 and 2;
  • No State Difference: True and False;
  • Number of Tests Per Mutant: 1 and 5.

We then experimented with these configurations in order to select the SEMu configuration that forms our approach. It is noted that different values and combinations form different strategies. Examining them all is a non-trivial task since the number of configurations grows exponentially with the number of parameters, i.e., $2^7 = 128$ configurations, and mutant execution takes a considerable amount of time. In our study, the total test generation of the various configurations and KLEE took roughly 276 CPU days, while the execution of the mutants took approximately 1,400 CPU days.

To identify and select the most prominent configuration, we executed our framework on all test subjects under all 128 configurations, restricting the symbolic execution time to 2 hours. We then randomly split the set of test subjects into 5 buckets of equal size (each one containing 20% of the test subjects). Then, we picked 4 buckets (80% of the test subjects) and selected the best configuration by computing the ratio of killed reference mutants. We assessed the generalization of this configuration on the left-out bucket (the 5th bucket, containing 20% of the test subjects). To reduce the influence of random effects, we repeated this process 5 times, leaving every bucket out for evaluation once. At the end we selected the median performing configuration (based on its performance on the bucket that had been left out). It is noted that such a cross-validation process is commonly used in order to select stable and potentially generalizable configurations.

Based on the above procedure we selected the SEMu configuration used in the remainder of the experiments.

6.5. Experimental Settings and Procedure

To perform our experiment we set, in KLEE, the following (main) settings (which are similar to the default parameters of KLEE): a) we set a memory usage threshold of 8 GB (a threshold never reached by any of the studied methods), b) we set the search strategy to Breadth-First Search (BFS), which is commonly used in patch testing studies (Palikareva et al., 2016), and c) we set a 2-hour time limit for each subject.

It is noted that our current implementation supports only BFS. We believe that such a strategy fits our purpose well, as it is important that the mutant and original program paths are explored in lockstep in order to enable state comparison at the same depth. The time budget of 2 hours was adopted because it is frequently used in test generation studies, e.g., (Palikareva et al., 2016), and forms a time budget that is neither too big nor too small. It is noted that, since SEMu performs a deeper analysis than the other methods, adopting a higher time limit would probably lead to an improved performance compared to the other methods. Of course, reducing this limit could lead to reduced performance.

We then evaluated the generated test suites by computing the ratio of reference mutants that they kill. Unfortunately, for 11 of the 47 test subjects we considered, none of the evaluated techniques managed to kill any mutant. This means that for these 11 subjects we approximate the number of stubborn mutants as 0, and thus we discarded those programs. Therefore, the following results regard the 36 programs for which we could kill at least one stubborn mutant.

To answer RQ1 we compute and report the ratio of the reference mutants killed by SEMu when it targets the 26,278 surviving mutants.

To answer RQs 2 and 3 we compute and contrast the ratio of the reference mutants killed by KLEE (executed in seeded mode), the infection-only strategy (a strategy suggested by previous research (Zhang et al., 2010; Harman et al., 2011)) and SEMu (for a fair comparison, we used the initial test suite as seeds for the three approaches). We also report and contrast the number of mutant-killing tests that were generated. Since the generated tests may include large numbers of redundant tests (a test is redundant with respect to a set of tests when it does not kill any mutant that is not already killed by the other tests in the set (Papadakis et al., 2019)), we compare the sizes of non-redundant test sets, which we call mutant-killing test sets. The size of these sets represents the raw number of end objectives that were successfully met by the techniques (Ammann and Offutt, 2008; Papadakis et al., 2019).

To compute the mutant-killing test sets we used a greedy heuristic. This heuristic incrementally selects the test that kills the maximum number of mutants not killed by the previously selected tests.
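A minimal sketch of such a greedy reduction, assuming a precomputed boolean kill matrix over tests and mutants (the matrix contents and sizes below are illustrative, not data from the study):

/* Greedy selection of a mutant-killing test set: repeatedly pick the test
 * that kills the most not-yet-killed mutants, until no test adds new kills. */
#include <stdio.h>
#include <stdbool.h>

#define NTESTS 4
#define NMUTS  6

int main(void) {
    /* kills[t][m] is true iff test t kills mutant m (illustrative data) */
    bool kills[NTESTS][NMUTS] = {
        {1, 1, 0, 0, 0, 0},
        {0, 1, 1, 1, 0, 0},
        {0, 0, 0, 1, 1, 0},
        {0, 0, 0, 0, 0, 0},
    };
    bool killed[NMUTS] = {false};
    bool selected[NTESTS] = {false};

    for (;;) {
        int best = -1, best_gain = 0;
        for (int t = 0; t < NTESTS; t++) {
            if (selected[t]) continue;
            int gain = 0;
            for (int m = 0; m < NMUTS; m++)
                if (kills[t][m] && !killed[m]) gain++;
            if (gain > best_gain) { best_gain = gain; best = t; }
        }
        if (best < 0) break; /* no remaining test kills a new mutant */
        selected[best] = true;
        for (int m = 0; m < NMUTS; m++)
            if (kills[best][m]) killed[m] = true;
        printf("selected test %d (+%d newly killed mutants)\n", best, best_gain);
    }
    return 0;
}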

6.6. Threats to Validity

All in all we targeted 133 functions from 47 programs from Coreutils. This level of evidence sufficiently demonstrates the potential of our approach, but should not be considered as a general assertion of its test effectiveness.

We generated tests at the system level, relying on the developers’ test suites. We believe that this is the major advantage of our approach because this way we focus on stubborn mutants that encode system-level corner cases that are hard to reveal. Another benefit of doing so is that at this level we can reduce false alarms, experienced at the unit level (behaviours feasible at unit level but infeasible at system level) (Gross et al., 2012). Unfortunately though, this could mean that our results do not necessarily extend to the unit level.

Another issue may be due to the tools and frameworks we used. Potential defects and limitations of these tools could influence our observations. To reduce this threat we used established tools, i.e., KLEE and Mart, that have been used by many empirical studies. To reduce this threat further we also performed manual checks and made our tool publicly available.

In our evaluation we used the subsuming stubborn mutants in order to cater for any bias caused by trivial mutants (Papadakis et al., 2016). While this practice follows the recommendations made by the mutation testing literature (Papadakis et al., 2019), the subsuming set of mutants is subject to the combined reference test suite, which might not be representative of the input domain. Nevertheless, any issue caused by the above approximations could only reduce the mutant kill ratios, not the relative superiority of our method. Additional (future) experimentation will increase the generalizability of our conclusions.

The comparison between the studied methods (including infection-only) was based on a time limit that did not include any actual mutant test execution time. This means that when reaching the time limit, we cannot know how successful (at mutant killing) the generated tests were. Additionally, we cannot perform test selection (eliminate ineffective tests) as this would require expensive mutant executions. While it is likely that a tester would like to execute the mutants in order to perform test selection, leaving mutant execution out allows a fair comparison basis between the studied methods since mutant execution varies between the methods and heavily depends on the test execution optimizations used (Papadakis et al., 2019). Nevertheless, it is unlikely that including the mutant execution would change our results since SEMu generates fewer tests than the baselines (because it performs a deeper analysis).

7. Empirical Results

7.1. Killing ability of SEMu

To evaluate the effectiveness of SEMu we run it for 2 hours per subject program and collect the generated test inputs. We then execute these inputs against the reference mutants and determine the killed ones. Interestingly, SEMu kills a large portion of the reference mutants. The median percentage of killed mutants is 37.3%, indicating a strong killing ability. To kill these mutants SEMu generated 153 mutant-killing test inputs (each test kills at least one mutant that is not killed by any other test).

7.2. Comparing SEMu with KLEE

Figure 4. Comparing the stubborn mutant killing ability of SEMu, KLEE and the infection-only strategy.

Figure 4 shows the proportion of the reference mutants killed by SEMu, by KLEE in seeded mode and by infection-only (investigated in RQ3). It is noted that the boxes show the distribution of the proportions of killed mutants across the different test subjects we use. From these results we can observe that SEMu has a median value of 37.3% while KLEE has a median of 0.0%.

To further validate the difference we use the Wilcoxon statistical test (paired version) to check whether the differences are significant. The statistical test gives a p-value of 0.006, suggesting that the two samples’ values are indeed significantly different. As statistical significance does not provide any information related to the size of the difference, we also compute the Vargha-Delaney effect size ($\hat{A}_{12}$), which quantifies how frequently one sample's values exceed the other's. The results give an $\hat{A}_{12}$ of 0.736, which indicates that SEMu is superior to KLEE in 73.6% of the cases.
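For reference, the Vargha-Delaney statistic has the standard probabilistic reading (the textbook definition, not something specific to this study):

$\hat{A}_{12} = P(X > Y) + 0.5 \cdot P(X = Y)$,

i.e. the probability that a randomly chosen value from the first sample (here, a subject's kill ratio under SEMu) exceeds a randomly chosen value from the second sample (the same subjects under KLEE), with ties counted as one half. A value of 0.5 indicates no difference; 0.736 means SEMu scores higher in roughly 74% of the pairwise comparisons.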

Figure 5 depicts the differences and overlap between the reference mutants killed by SEMu and KLEE, per studied subject. From this figure, we can observe that the number of programs with overlapping killed mutants is very small, indicating that the two methods differ significantly. We also observe that SEMu performs best in the majority of the cases. Interestingly, a non-negligible number of mutants are killed by KLEE only. These cases fall within a small number of test subjects. We investigated these cases and found that the differences were big either because there was only one reference mutant, which was killed by KLEE alone, or because a large number of surviving mutants forced SEMu to perform a shallow search. Unfortunately, SEMu spends much time trying to kill every targeted mutant and thus, when a large number of them is involved, the 2-hour time limit we set is not sufficient to effectively kill them.

To better demonstrate the effectiveness differences of the methods we also record the number of mutant-killing test inputs (each test kills at least one mutant that is not killed by any other test). We found that SEMu generated 153 mutant-killing test inputs, while KLEE generated only 62.

Figure 5. Comparing the mutant killing ability of SEMu and KLEE on a per-program basis.

7.3. Comparing SEMu with infection-only

A first comparison between SEMu and infection-only can be made based on the data from Figure 4. According to these data, SEMu has a median value of 37.3% while infection-only has a median of 17.2%. Interestingly, this shows a big difference in favour of our approach. To further validate this finding, we performed a Wilcoxon statistical test and got a p-value of 0.04, suggesting that the difference between the two samples is statistically significant (at the commonly adopted 5% significance level). As in RQ2, we also computed the Vargha-Delaney effect size and found that SEMu yields higher killing rates than infection-only in 61% of the cases.

To demonstrate the differences we also present our results on a per-subject basis. Figure 6 shows the differences and overlap between the killed reference mutants. From these results we observe a large overlap between the mutants killed by both approaches, with SEMu being able to kill more mutants in most of the cases. We also observe that in 5 of the cases infection-only performed better than SEMu, while SEMu performed better in 13.

Similarly to the previous RQs, we compare the strategies by counting the number of mutant-killing test inputs that they generated. Interestingly, we found that SEMu generated 87% more mutant-killing test inputs than infection-only (153 vs. 82 inputs), indicating the usefulness of our framework.

Figure 6. Comparing the mutant killing ability of SEMu and infection-only on a per-program basis.

8. Related Work

Many techniques targeting mutation-based test generation have been proposed (Anand et al., 2013; Papadakis et al., 2019). However, most of these focus on generating test suites from scratch, maximizing the number of mutants killed, mainly by either covering the mutants or by targeting mutant infection. In contrast, we aim at deep testing of specific program areas by taking advantage of existing tests and by targeting stubborn mutants, which are hard to propagate.

The studies of Papadakis and Malevris (Papadakis and Malevris, 2011, 2012) and Zhang et al. (Zhang et al., 2010) proposed embedding mutant related constraints, called infection conditions, within the meta-programs that inject and control the mutants in order to force symbolic execution to cover them. As a result, symbolic execution modules can produce test cases that satisfy the infection conditions and have good chances to kill the mutants. Although effective, these approaches only target mutant infection, which makes them relatively weak (Titcheu Chekam et al., 2017).

To bypass the abovementioned problem, the studies of Papadakis and Malevris (Papadakis and Malevris, 2010) and Harman et al. (Harman et al., 2011) aimed at indirectly handling mutant propagation. The former technique symbolically searches the path space of the mutant programs (after the mutation point), while the latter searches the program input space defined by the path conditions in order to bypass constraints not handled by the solver and to indirectly make the mutants propagate. In contrast, SEMu incrementally and differentially explores the path space by considering the symbolic states, thereby making a relevant (targeted) exploration.

Fraser and Zeller (Fraser and Zeller, 2012) and Fraser and Arcuri (Fraser and Arcuri, 2015) applied search-based testing in order to generate mutation-based tests. Their key advancement was to guide the search by measuring the differences between the test traces of the original and mutant programs. While powerful, such an attempt still fails to provide the guidance needed to trigger such differences.

Moreover, search techniques rely on the ability to execute test cases fast (applied at the unit level), making them less effective in cases of slow test execution (such as system level testing). Nevertheless, a comparison between search-based test generation and symbolic execution falls out of the scope of the present paper.

Much work on testing software patches has also been performed in recent years (Taneja et al., 2011; Marinescu and Cadar, 2012, 2013). However, most of these methods aim at covering the patches rather than the program semantics (behavioural changes). Moreover, these techniques target the general patch testing problem, which in a sense assumes very few patches, each with many changes. The mutation case, though, involves many mutants that are small syntactic deviations, a fact that our method takes advantage of in order to optimize mutant killing.

Differential symbolic execution (Person et al., 2008) aims at reasoning about semantic differences of program versions, but since it performs a whole program analysis it can experience significant scalability issues when considering large programs and multiple mutants. Directed incremental symbolic execution (Person et al., 2011) guides the symbolic exploration through static program slicing. Unfortunately, such a method can be expensive when used with many mutants. Nevertheless, program slicing could be used to further guide SEMu towards the relevant mutant exploration space.

Shadow symbolic execution (Palikareva et al., 2016) applies a combined execution on both program versions under analysis. It relies on the analysis of a meta-program that is similar to the mutants' meta-program in order to take advantage of the common program parts. The major difference with our method is that we specifically target multiple mutants at the same time and limit the program exploration through data-state comparisons in order to optimize performance. Since Shadow targets single patches and exhaustively searches the path space (after the mutation point), it can experience scalability issues.

Overall, while many related techniques have been proposed, they have not been investigated in the context of mutation testing and particularly to target stubborn mutants. Stubborn mutants are hard to kill and their killing results in test inputs that are linked with corner cases and increase fault revelation (Titcheu Chekam et al., 2017).

9. Conclusion

This paper introduced SEMu, a method that generates test inputs for killing stubborn mutants. SEMu relies on a form of shared differential symbolic execution that incrementally searches a small but ‘promising’ code region around the mutation point in order to reveal divergent behaviours. This allows the fast and effective generation of test inputs that thoroughly exercise the targeted program corner cases. We have empirically evaluated SEMu on Coreutils and demonstrated that it can kill approximately 37% of the involved stubborn mutants within a two-hour time budget.

References

  • R. Abou Assi, C. Trad, M. Maalouf, and W. Masri (2019) Coincidental correctness in the defects4j benchmark. Software Testing, Verification and Reliability 29 (3), pp. e1696.
  • P. Ammann, M. E. Delamaro, and J. Offutt (2014) Establishing theoretical minimal sets of mutants. In Seventh IEEE International Conference on Software Testing, Verification and Validation, ICST 2014, Cleveland, Ohio, USA, pp. 21–30.
  • P. Ammann and J. Offutt (2008) Introduction to software testing. Cambridge University Press.
  • S. Anand, E. K. Burke, T. Y. Chen, J. A. Clark, M. B. Cohen, W. Grieskamp, M. Harman, M. J. Harrold, and P. McMinn (2013) An orchestrated survey of methodologies for automated software test case generation. Journal of Systems and Software 86 (8), pp. 1978–2001.
  • K. Androutsopoulos, D. Clark, H. Dan, R. M. Hierons, and M. Harman (2014) An analysis of the relationship between conditional entropy and failed error propagation in software testing. In 36th International Conference on Software Engineering, ICSE 2014, Hyderabad, India, pp. 573–583.
  • R. Baker and I. Habli (2013) An empirical evaluation of mutation testing for improving the test quality of safety-critical software. IEEE Trans. Software Eng. 39 (6), pp. 787–805.
  • C. Cadar, D. Dunbar, and D. R. Engler (2008) KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In 8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008, San Diego, California, USA, pp. 209–224.
  • L. De Moura and N. Bjørner (2008) Z3: an efficient SMT solver. In 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2008, Berlin, Heidelberg, pp. 337–340.
  • P. Delgado-Pérez, I. Habli, S. Gregory, R. Alexander, J. A. Clark, and I. Medina-Bulo (2018) Evaluation of mutation testing in a nuclear industry case study. IEEE Trans. Reliability 67 (4), pp. 1406–1419.
  • R. A. DeMillo and A. J. Offutt (1991) Constraint-based automatic test data generation. IEEE Trans. Software Eng. 17 (9), pp. 900–910.
  • G. Fraser and A. Arcuri (2015) Achieving scalable mutation-based generation of whole test suites. Empirical Software Engineering 20 (3), pp. 783–812.
  • G. Fraser and A. Zeller (2012) Mutation-driven generation of unit tests and oracles. IEEE Trans. Software Eng. 38 (2), pp. 278–292.
  • F. Gross, G. Fraser, and A. Zeller (2012) Search-based system testing: high coverage, no false alarms. In International Symposium on Software Testing and Analysis, ISSTA 2012, Minneapolis, MN, USA, pp. 67–77.
  • M. Harman, Y. Jia, and W. B. Langdon (2011) Strong higher order mutation-based test data generation. In 19th ACM SIGSOFT Symposium on the Foundations of Software Engineering and 13th European Software Engineering Conference, ESEC/FSE 2011, Szeged, Hungary, pp. 212–222.
  • J. C. King (1976) Symbolic execution and program testing. Commun. ACM 19 (7), pp. 385–394.
  • M. Kintis, M. Papadakis, and N. Malevris (2010) Evaluating mutation testing alternatives: a collateral experiment. In 17th Asia Pacific Software Engineering Conference, APSEC 2010, Sydney, Australia, pp. 300–309.
  • B. Kurtz, P. Ammann, J. Offutt, and M. Kurtz (2016) Are we there yet? How redundant and equivalent mutants affect determination of test completeness. In Ninth IEEE International Conference on Software Testing, Verification and Validation Workshops, ICST Workshops 2016, Chicago, IL, USA, pp. 142–151.
  • P. D. Marinescu and C. Cadar (2012) make test-zesti: a symbolic execution solution for improving regression testing. In 34th International Conference on Software Engineering, ICSE 2012, Zurich, Switzerland, pp. 716–726.
  • P. D. Marinescu and C. Cadar (2013) KATCH: high-coverage testing of software patches. In Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE 2013, Saint Petersburg, Russian Federation, pp. 235–245.
  • L. J. Morell (1990) A theory of fault-based testing. IEEE Trans. Softw. Eng. 16 (8), pp. 844–857.
  • A. J. Offutt, A. Lee, G. Rothermel, R. H. Untch, and C. Zapf (1996) An experimental determination of sufficient mutant operators. ACM Trans. Softw. Eng. Methodol. 5 (2), pp. 99–118.
  • H. Palikareva, T. Kuchta, and C. Cadar (2016) Shadow of a doubt: testing for divergences between software versions. In 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, pp. 1181–1192.
  • M. Papadakis, C. Henard, M. Harman, Y. Jia, and Y. L. Traon (2016) Threats to the validity of mutation-based test assessment. In 25th International Symposium on Software Testing and Analysis, ISSTA 2016, Saarbrücken, Germany, pp. 354–365.
  • M. Papadakis, Y. Jia, M. Harman, and Y. L. Traon (2015) Trivial compiler equivalence: a large scale empirical study of a simple, fast and effective equivalent mutant detection technique. In 37th IEEE/ACM International Conference on Software Engineering, ICSE 2015, Florence, Italy, pp. 936–946.
  • M. Papadakis, M. Kintis, J. Zhang, Y. Jia, Y. L. Traon, and M. Harman (2019) Mutation testing advances: an analysis and survey. Advances in Computers, Vol. 112, pp. 275–378.
  • M. Papadakis and N. Malevris (2010) Automatic mutation test case generation via dynamic symbolic execution. In IEEE 21st International Symposium on Software Reliability Engineering, ISSRE 2010, San Jose, CA, USA, pp. 121–130.
  • M. Papadakis and N. Malevris (2011) Automatically performing weak mutation with the aid of symbolic execution, concolic testing and search-based testing. Software Quality Journal 19 (4), pp. 691–723.
  • M. Papadakis and N. Malevris (2012) Mutation based test case generation via a path selection strategy. Information & Software Technology 54 (9), pp. 915–932.
  • S. Person, M. B. Dwyer, S. G. Elbaum, and C. S. Pasareanu (2008) Differential symbolic execution. In 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2008, Atlanta, Georgia, USA, pp. 226–237.
  • S. Person, G. Yang, N. Rungta, and S. Khurshid (2011) Directed incremental symbolic execution. In 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2011, San Jose, CA, USA, pp. 504–515.
  • G. Petrovic and M. Ivankovic (2018) State of mutation testing at Google. In 40th International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP 2018, Gothenburg, Sweden, pp. 163–171.
  • A. V. Pizzoleto, F. C. Ferrari, J. Offutt, L. Fernandes, and M. Ribeiro (2019) A systematic literature review of techniques and metrics to reduce the cost of mutation testing. Journal of Systems and Software.
  • D. Schuler and A. Zeller (2013) Covering and uncovering equivalent mutants. Softw. Test., Verif. Reliab. 23 (5), pp. 353–374.
  • K. Taneja, T. Xie, N. Tillmann, and J. de Halleux (2011) EXpress: guided path exploration for efficient regression test generation. In 20th International Symposium on Software Testing and Analysis, ISSTA 2011, Toronto, ON, Canada, pp. 1–11.
  • T. Titcheu Chekam, M. Papadakis, Y. Le Traon, and M. Harman (2017) An empirical study on mutation, statement and branch coverage fault revelation that avoids the unreliable clean program assumption. In 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, pp. 597–608.
  • T. Titcheu Chekam, M. Papadakis, and Y. Le Traon (2019) Mart: a mutant generation tool for LLVM. In 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2019.
  • R. H. Untch, A. J. Offutt, and M. J. Harrold (1993) Mutation analysis using mutant schemata. In International Symposium on Software Testing and Analysis, ISSTA 1993, Cambridge, MA, USA, pp. 139–148.
  • X. Yao, M. Harman, and Y. Jia (2014) A study of equivalent and stubborn mutation operators using human analysis of equivalence. In 36th International Conference on Software Engineering, ICSE 2014, Hyderabad, India, pp. 919–930.
  • L. Zhang, T. Xie, L. Zhang, N. Tillmann, J. de Halleux, and H. Mei (2010) Test generation via dynamic symbolic execution for mutation testing. pp. 1–10.