1 Introduction
Test case prioritization aims at ordering regression test suites so that testing meets its goals as early as possible. This means that, if the testing process is stopped at any arbitrary point in time, test effectiveness is optimal with respect to the time budget used [1]. To achieve this goal, test prioritization needs to ‘predict’ which test cases will detect faults. This prediction is usually performed based on surrogates, like test coverage [2] or test case diversity [3, 4, 5].
Although test prioritization techniques have been extensively studied in the literature [4], most of them rely on the use of various types of structural coverage [1]. Little attention has been paid to advanced test elements like mutants (i.e., artificial faults). We believe that mutation-based criteria call for further attention, given that mutants have been shown to be effective at revealing faults [6] and that mutant killing (i.e., detecting the difference between a mutant and its original program) ratios are similar to fault detection ratios [2, 7]. Yet, very few approaches study mutation-based test case prioritization, and none evaluates it with real-world applications and faults.
Recent advances in test case prioritization focus on identifying and promoting the diversity of the selected test cases [3, 4, 5], rather than maximizing coverage. This trend can provide several advantages, especially in cases where the source code is not available [4]. Therefore, the combination of mutation-based and diversity-based approaches could provide substantial benefits by increasing early fault detection. Investigating such a combination is the primary aim of this study.
In this paper we propose and empirically investigate a new diversity-aware mutation-based test case prioritization technique. The technique relies on the diversity-aware mutation adequacy criterion, which was recently proposed by Shin et al. [8, 9]. The aim of the diversity-aware criterion is to distinguish the behaviour of every mutant from the behaviour of all other mutants (and the original program version). In contrast, the traditional mutation adequacy criterion aims at distinguishing only the behaviour of the mutants from that of the original program. According to Shin et al., distinguishing mutants improves the fault detection capabilities of mutation testing. Therefore, our diversity-aware mutation-based prioritization gives higher priority to those test cases that help distinguish all mutants as early as possible.
Our study investigates the relative cost and effectiveness of two mutation-based prioritization techniques, i.e., one using traditional mutant kill and another using mutant distinguishment, with real-world applications and faults. For this, we use 352 real faults and 553,477 developer-written test cases from the Defects4J data set [10]. The empirical evaluation considers both the traditional kill-only and the proposed diversity-aware mutation-based prioritization criteria in various settings: single-objective greedy, single-objective hybrid, as well as multi-objective prioritization that seeks to prioritize using both criteria simultaneously. We find that there is no single technique that we could characterize as the dominant one. To understand why, we provide a graphical model called the Mutant Distinguishment Graph (MDG) that helps us understand how a set of test cases that kills and distinguishes mutants relates to fault detection. This visualisation scheme demonstrates why and when each of the mutation-based prioritization criteria performs poorly.
Overall, the technical contributions of this paper can be summarized as follows:
-
We present a large empirical study that investigates the relative cost and effectiveness of mutation-based prioritization techniques with real faults.
-
We investigate two different mutation-based test prioritization techniques under both single (greedy and hybrid) and multi-objective prioritization schemes.
-
We investigate and identify the reasons behind the differences between the traditional kill-only mutation and distinguish mutation prioritization schemes, using intuitive graphical models.
The rest of this paper is organized as follows. Section 2 provides background on mutation adequacy criteria and test case prioritization. Section 3 explains the mutation-based test case prioritization techniques that are studied in this paper. Section 4 explains our experimental settings, including research questions, measures and variables, subject faults, tests, and mutants. The results of the empirical evaluation are given in Section 5, together with the threats to validity. Section 6 presents the related work, and Section 7 concludes this paper.
2 Background
2.1 Mutation Adequacy Criteria
In the late 1970s, DeMillo et al. [11] proposed the mutation adequacy criterion as a way to assess the quality of test suites. The criterion focuses on the differences in program outputs between the original program version and its mutant versions (i.e., artificially mutated programs) to measure mutant kills. This technique relies on the idea that tests capable of distinguishing the behavior of mutants from that of the original program are also capable of revealing faults. This idea was recently extended by Shin et al. [9], who proposed distinguishing the behavior of mutants among themselves (in addition to the original program). This forms a diversity-aware mutation adequacy criterion that caters for the diversity of behaviors introduced by the mutants.
To be precise, we formally represent and discuss the mutation adequacy criteria using the essential elements of an existing formal framework for mutation-based testing methods [12]. Let P be a set of programs which includes the program under test; P contains an original program p and a mutant m generated from p. For a test case t in a set of test cases T, if the behaviors of p and m are different for t, it is said that t kills m. Note that the notion of behavioral difference is an abstract concept. It is formalized by a testing factor, called a test differentiator, which is defined as follows:
Definition 1 (Test Differentiator).
A test differentiator d is a function d: T × P × P → {0, 1} such that d(t, p_x, p_y) = 1 if the behaviors of p_x and p_y are different for t, and d(t, p_x, p_y) = 0 otherwise, for all test cases t ∈ T and programs p_x, p_y ∈ P. (This function-style definition is replaceable by an equivalent predicate-style definition.)
By definition, a test differentiator concisely represents whether the behaviors of p_x and p_y are different for t. In addition to a differentiator, which formalizes the difference of two programs for a single test, it is helpful to consider whether two programs are different for a set of tests. A d-vector captures this notion:
Definition 2 (d-vector).
A d-vector d_T(p_x, p_y) is an n-dimensional vector whose i-th element is d(t_i, p_x, p_y), for all test cases t_i ∈ T, programs p_x, p_y ∈ P, and i = 1, …, n, where n = |T|.
In other words, a differentiator returns a Boolean value (i.e., 0 or 1) for a single test, whereas a d-vector returns an n-dimensional Boolean vector for n test cases. Note that a test suite σ is used to denote an ordering of test cases, while T denotes a set of tests without any particular order of test cases.
Using the test differentiator and d-vector, we define the notion of mutant kill as follows:
Definition 3 (Mutant Kill).
A mutant m generated from an original program p is killed by a test case t when the following condition holds: d(t, p, m) = 1. Similarly, m generated from p is killed by a set of test cases T when the following condition holds: d_T(p, m) ≠ 0, where 0 denotes the n-dimensional zero vector.
Based on the notion of mutant kill, the traditional mutation adequacy criterion is defined as follows:
Definition 4 (Traditional Mutation Adequacy Criterion).
For a set of mutants generated from an original program , a test suite is mutation-adequate when the following condition holds:
This definition means that a test suite T is mutation-adequate if every mutant in M is killed by at least one test case in T. Figure 1 shows the working example for the mutation adequacy criteria: there are four different mutants and three different test cases, and each value indicates whether a test case kills a mutant. In the working example, the test suite is adequate to the traditional mutation adequacy criterion because all four mutants are killed by it. This working example is an abstraction of what commonly happens in practice. For example, it is very common that some test cases “thinly” inspect a large part of a program, while others “deeply” inspect a specific part of the program.

Note that the traditional mutation adequacy focuses on the difference between a mutant and its original program. To formalize the diversity of mutants in terms of test cases, the notion of mutant distinguishment is defined as follows:
Definition 5 (Mutant Distinguishment).
Two mutants m_x and m_y generated from an original program p are distinguished by a test case t when the following condition holds: d(t, m_x, m_y) = 1. Similarly, m_x and m_y generated from p are distinguished by a set of test cases T when the following condition holds: d_T(m_x, m_y) ≠ 0.
We now introduce the diversity-aware mutation adequacy criterion, called the distinguishing mutation adequacy criterion, based on the mutant distinguishment as follows:
Definition 6 (Distinguishing Mutation Adequacy Criterion).
For a set of mutants M generated from an original program p, a test suite T is distinguishing mutation-adequate when the following condition holds: ∀m_x, m_y ∈ M ∪ {p} with m_x ≠ m_y: d_T(m_x, m_y) ≠ 0.
In other words, a test suite T is distinguishing mutation-adequate if every pair of different mutants m_x and m_y in M ∪ {p} is distinguished by T. In the working example, T distinguishes all mutants in M because all five programs (i.e., the four mutants and the original program) have unique d-vectors for T.
For the sake of simplicity, let the d-criterion hereafter refer to the distinguishing mutation adequacy criterion (i.e., diversity-aware) and, similarly, the k-criterion to the traditional mutation adequacy criterion (i.e., kill-only).
By definition, the d-criterion subsumes the k-criterion: for a set of mutants M generated from an original program p, if a test suite T is adequate to the d-criterion, it is guaranteed that T is adequate to the k-criterion as well. In other words, the d-criterion is stronger than the k-criterion. For more information, please refer to the recent study of Shin et al. [9].
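To make the two criteria concrete, the following Python sketch (on a hypothetical kill matrix, not the exact data of Figure 1) checks both criteria; as in the working example, distinguishment is judged by the uniqueness of each mutant's d-vector against the original program:

```python
from itertools import combinations

# kill[i][j] == 1 iff test case t_i kills mutant m_j (hypothetical data).
kill = [
    [1, 1, 1, 0],  # a test that "thinly" touches many mutants
    [0, 0, 0, 1],  # a test that "deeply" inspects one part
    [0, 1, 0, 1],
]

def d_vector(j, kill):
    """d-vector of mutant m_j against the original program over all tests."""
    return tuple(row[j] for row in kill)

def kill_adequate(kill):
    """k-criterion: every mutant is killed by at least one test case."""
    return all(any(d_vector(j, kill)) for j in range(len(kill[0])))

def distinguish_adequate(kill):
    """d-criterion: all programs in M u {p} have pairwise distinct d-vectors.
    The original program p has the all-zero d-vector by definition."""
    vectors = [d_vector(j, kill) for j in range(len(kill[0]))]
    vectors.append(tuple(0 for _ in kill))  # the original program
    return all(u != v for u, v in combinations(vectors, 2))

print(kill_adequate(kill))         # True: every mutant is killed by some test
print(distinguish_adequate(kill))  # False: m_1 and m_3 share the d-vector (1, 0, 0)
```

The example also illustrates the subsumption relation: this kill matrix is already adequate to the k-criterion, but making it adequate to the d-criterion would require an additional test case that separates m_1 from m_3.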
2.2 Test Case Prioritization
Regression testing is performed when changes are made to existing software; its purpose is to provide confidence that the newly introduced changes do not obstruct the behaviors of the existing, unchanged part of the software [1]. In regression testing, test case prioritization finds an ordering of test cases that maximizes a desirable property, such as the rate of fault detection. To see the benefit of using test case prioritization, consider the test suite with fault detection information in Figure 2. If a tester wants to detect faults as early as possible, it is clearly beneficial to execute the fault-detecting test cases earlier in the ordering.

Rothermel et al. [2] formally define the test case prioritization problem as follows:
Definition 7 (Test Case Prioritization Problem).
Given: a test suite T, the set PT of permutations of T, and an objective function f from PT to the real numbers.
Problem: Find σ′ ∈ PT such that f(σ′) ≥ f(σ) for all σ ∈ PT.
In this definition, PT represents all possible orderings of the given test cases in T, and f is an objective function that assigns an award value to each ordering. For the example in Figure 2, an ordering that executes the fault-detecting test cases earlier is better than one that executes them later, since it detects the faults earlier.
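As a toy illustration of Definition 7 (with hypothetical data, not the exact contents of Figure 2), the following sketch enumerates all permutations of a small test suite and picks one that maximizes a simple award function rewarding early fault detection:

```python
from itertools import permutations

# detects[t] is the set of faults detected by test case t (hypothetical data).
detects = {"t1": {"f1"}, "t2": set(), "t3": {"f2", "f3"}}

def award(ordering):
    """Simple objective f: faults detected at earlier positions earn more."""
    score, found = 0.0, set()
    for position, t in enumerate(ordering, start=1):
        newly_found = detects[t] - found
        score += len(newly_found) / position
        found |= newly_found
    return score

best = max(permutations(detects), key=award)
print(best)  # ('t3', 't1', 't2'): the test detecting two faults comes first
```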
The main usage scenario for prioritization techniques is testing the program changes made in subsequent program versions. Recent research [4] has shown that the effectiveness degradation of prioritization techniques over subsequent program versions is small and that taking into account the code changes performed on a subsequent version does not provide any important information [13]. Therefore, testers need to obtain the required information at a specific point in time (prioritization time) and then use it to prioritize the relevant test suites in the subsequent program versions.
At prioritization time, we need to consider a surrogate for fault detection based on the historical information of the test cases, instead of re-executing them, hoping that early maximization of the surrogate will result in early maximization of fault detection. Therefore, while the goal of test case prioritization remains the early maximization of fault detection, it actually aims at the early maximization of the chosen surrogate. Naturally, test case prioritization techniques vary depending on the chosen surrogate.
Structural coverage information, such as statement coverage, is one of the most widely used surrogates in test case prioritization [2, 14, 15]. For example, the statement-total approach prioritizes test cases according to the number of statements covered by individual test cases; in other words, a test case covering more statements has higher priority. Similarly, the statement-additional approach prioritizes test cases according to the number of statements additionally covered by individual test cases.
Mutants are also used as another surrogate for test case prioritization [2, 16, 17]. Instead of using the structural coverage of individual test cases, the mutants killed by individual test cases are utilized. For example, Rothermel et al. [2] consider the Fault Exposing Potential (FEP)-total approach, which prioritizes test cases according to the number of mutants killed by individual test cases. Similarly, the FEP-additional approach prioritizes test cases according to the number of mutants additionally killed by individual test cases. Note that, to kill a mutant, a test case not only needs to cover the mutated location but also to make the effect of the mutation observable [1]. This means that mutation-based approaches can be constructed to be at least as strong as coverage-based approaches.
In this paper, we focus on the mutation-based test case prioritization, using the two mutation-based adequacy criteria (i.e., kill and distinguish), while we use the coverage-based and random approaches as baselines.
2.3 Multi-Objective Test Case Prioritization
The essence of multi-objective optimization is the notion of Pareto optimality. Given multiple objectives, an ordering of test cases is said to be non-dominated if none of its objective values can be improved without degrading another objective value. Otherwise, an ordering of test cases is said to be dominated by another ordering that has at least one higher objective value and no lower objective value. Formally, let k be the number of different objectives and f_i (for i = 1, …, k) the i-th objective function. An ordering σ_1 is said to dominate another ordering σ_2 if and only if f_i(σ_1) ≥ f_i(σ_2) for all i ∈ {1, …, k} and f_j(σ_1) > f_j(σ_2) for at least one j ∈ {1, …, k}.
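A minimal sketch of this dominance check and of filtering a Pareto front, assuming two objectives that are both maximized (the objective-value pairs below are hypothetical):

```python
def dominates(obj_a, obj_b):
    """True iff A dominates B: no worse on every objective, better on at least one."""
    return (all(a >= b for a, b in zip(obj_a, obj_b))
            and any(a > b for a, b in zip(obj_a, obj_b)))

def pareto_front(solutions):
    """Keep the solutions that are not dominated by any other solution."""
    return [s for i, s in enumerate(solutions)
            if not any(dominates(o, s) for j, o in enumerate(solutions) if j != i)]

# Hypothetical objective-value pairs of four candidate orderings.
candidates = [(0.90, 0.60), (0.85, 0.70), (0.80, 0.65), (0.95, 0.55)]
print(pareto_front(candidates))  # (0.80, 0.65) is dominated by (0.85, 0.70)
```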
When evolutionary algorithms are applied to multi-objective optimization, they produce a set of orderings that are not dominated by each other. Such a set is called a Pareto front. The number of orderings in a Pareto front is determined by the population size of the evolutionary algorithm. For example, the Non-Dominated Sorting Genetic Algorithm II (NSGA-II) [18], one of the most widely studied multi-objective evolutionary algorithms, generates up to N Pareto optimal solutions in a Pareto front, where N is the predefined population size.

3 Mutation-based Test Case Prioritization Techniques
In this paper, we consider the different test case prioritization techniques described in Table 1. The first column gives the mnemonic for each technique that will be used throughout this paper. The second column gives the prioritization objective of each technique, in which the letters forming the mnemonic are capitalized. The third column gives the tie-breaking rule used when multiple candidate test cases (for greedy and hybrid) or orderings (for multi-objective optimization) satisfy the same level of the objective(s). The last column summarizes each technique. Additional details regarding the techniques listed in Table 1 can be found in the following subsections.
Mnemonic | Objective | Tiebreaker | Description
---|---|---|---
GRK | GReedy Kill | random | iteratively select a test case that maximizes the number of additionally killed mutants
GRD | GReedy Distinguish | random | iteratively select a test case that maximizes the number of additionally distinguished mutants
HYB-w | HYBrid of kill and distinguish (weight w) | random | iteratively select a test case that maximizes the weighted sum of the number of additionally killed mutants and additionally distinguished mutants
MOK | Multi-Objective, Kill | kill | optimize an ordering of test cases to both kill and distinguish mutants as early as possible, and select one of the Pareto optimal orderings that kills mutants as early as possible
MOD | Multi-Objective, Distinguish | distinguish | optimize an ordering of test cases to both kill and distinguish mutants as early as possible, and select one of the Pareto optimal orderings that distinguishes mutants as early as possible
RND | RaNDom | random | randomized ordering
SCV | Statement CoVerage | random | iteratively select a test case that maximizes the number of additionally covered statements
3.1 Greedy and Hybrid Techniques
We first describe the single-objective greedy techniques: GRK, GRD, and HYB. Algorithmically, these techniques are in essence instances of the additional greedy algorithm [19]. The additional greedy test case prioritization technique iteratively selects, at each step, the test case that maximizes the additional achievement of the objective. Note that the hybrid technique is also an instance of the single-objective additional greedy approach, because its single objective is a weighted sum of the GRK and GRD objectives.
GRK and GRD:
Based on the k-criterion, GRK iteratively selects a test case that maximizes the number of additionally killed mutants. If there are multiple test cases that additionally kill the same number of mutants, one of them is randomly selected. Formally, let aK(t) be the number of mutants additionally killed by a test case t; GRK iteratively selects the test case t in a test suite T that maximizes aK(t). Similarly, based on the d-criterion, GRD iteratively selects a test case that maximizes the number of additionally distinguished mutants. If there are multiple test cases that additionally distinguish the same number of mutants, one of them is randomly selected. Formally, let aD(t) be the number of mutants additionally distinguished by a test case t; GRD iteratively selects the test case t in a test suite T that maximizes aD(t).
Semantically, GRK distinguishes mutants from the original program as early as possible, whereas GRD distinguishes all mutants from each other as early as possible. In other words, GRK is essentially based on the concept of intensification, whereas GRD is essentially based on the concept of diversification. Such a difference may lead to a difference in prioritization effectiveness between GRK and GRD. Section 5.6 discusses this issue in more detail.
HYB-w:
This hybrid prioritization technique is a weighted sum of GRK and GRD. It iteratively selects a test case that maximizes the weighted sum of the number of additionally killed mutants and the number of additionally distinguished mutants. Formally, for a weight factor w ∈ [0, 1], HYB-w iteratively selects the test case t in a test suite T that maximizes w · aK(t) + (1 − w) · aD(t). By definition, w = 1 corresponds to the GRK technique and w = 0 corresponds to the GRD technique.
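The following sketch shows the additional greedy scheme under the conventions above; w is the weight on the kill objective (so w = 1 behaves like GRK and w = 0 like GRD), distinguishment is approximated through the kill-based d-vectors, and ties are broken randomly. The helper names are ours, not from the original implementation:

```python
import random

def additionally_killed(test, killed, kill):
    """Mutants newly killed by `test`, given the set of already killed mutants."""
    return {m for m, bit in enumerate(kill[test]) if bit and m not in killed}

def additionally_distinguished(test, selected, kill):
    """Mutants whose d-vector becomes unique (and non-zero) once `test` is added."""
    def unique_mutants(tests):
        vectors = [tuple(kill[t][m] for t in tests) for m in range(len(kill[0]))]
        return {m for m, v in enumerate(vectors) if vectors.count(v) == 1 and any(v)}
    return unique_mutants(selected + [test]) - unique_mutants(selected)

def hybrid_prioritize(kill, w):
    """HYB-w additional greedy; w = 1.0 acts like GRK, w = 0.0 like GRD."""
    remaining, selected, killed = set(range(len(kill))), [], set()
    while remaining:
        def gain(t):
            return (w * len(additionally_killed(t, killed, kill))
                    + (1 - w) * len(additionally_distinguished(t, selected, kill)))
        best = max(gain(t) for t in remaining)
        choice = random.choice([t for t in remaining if gain(t) == best])  # tie-break
        killed |= additionally_killed(choice, killed, kill)
        selected.append(choice)
        remaining.remove(choice)
    return selected

# kill[t][m] == 1 iff test t kills mutant m (hypothetical matrix).
kill = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
print(hybrid_prioritize(kill, w=1.0))  # GRK-like ordering
print(hybrid_prioritize(kill, w=0.0))  # GRD-like ordering
```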
3.2 Multi-Objective Optimization Techniques
Unlike the greedy techniques, which iteratively select a test case that suits its objective in a given situation, a multi-objective prioritization technique optimizes an ordering of test cases as a whole to both kill and distinguish mutants as early as possible.
MOK and MOD:
To represent the two mutation-based objectives (i.e., kill mutants as early as possible and distinguish mutants as early as possible) as two measurable functions (i.e., fitness functions in an evolutionary algorithm), we define metrics called APMK (Average Percentage of Mutants Killed) and APMD (Average Percentage of Mutants Distinguished), respectively. The core of these metrics lies in APFD (Average Percentage of Faults Detected) [14], the most commonly used test case prioritization evaluation metric. The APFD indicates how quickly faults are detected by a given ordering of test cases. It is defined as the area under the curve connecting the points (x, y) = (test suite fraction, percentage of faults detected) for a given ordering of test cases. The APFD value ranges from 0 to 1; a higher APFD means more effective test case prioritization. We extract the core concept of APFD as a template and call it APXX (Average Percentage of XX), which indicates how quickly XX is satisfied by a given ordering of test cases. Figure 3 visualizes the APXX. To be precise, for an ordering σ of n test cases, let x_i = i/n be the ordering fraction of the first i test cases, and let y_i be the percentage of XX achieved by that fraction, with x_0 = y_0 = 0 by definition. The APXX value, i.e., the area under the curve, is calculated as follows:

APXX(σ) = Σ_{i=1}^{n} (x_i − x_{i−1}) · (y_{i−1} + y_i) / 2
Using the APXX template, we define APMK and APMD as follows:
Definition 8 (APMK and APMD).
For an ordering σ of a test suite T with n = |T| test cases, the APMK and APMD values are calculated as follows:

APMK(σ) = Σ_{i=1}^{n} (x_i − x_{i−1}) · (k_{i−1} + k_i) / 2
APMD(σ) = Σ_{i=1}^{n} (x_i − x_{i−1}) · (d_{i−1} + d_i) / 2

where x_i = i/n is the ordering fraction that contains the first i test cases, and k_i and d_i are the percentages of mutants killed and distinguished, respectively, by that fraction.
In other words, APMK and APMD indicate how quickly mutants are killed and distinguished, respectively, by a given ordering of test cases. As a result, the multi-objective prioritization technique optimizes an ordering of test cases to maximize both the APMK and APMD values using a multi-objective optimization algorithm such as NSGA-II. As described in Section 2.3, NSGA-II returns a set of Pareto optimal orderings, and an additional rule is necessary to choose one of these orderings. MOK selects one of the Pareto optimal orderings that has the highest APMK value. Similarly, MOD selects one of the Pareto optimal orderings that has the highest APMD value.
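A sketch of the APXX computation (in its trapezoidal form) and of the MOK/MOD selection rules applied to a given Pareto front; the data and helper names below are hypothetical:

```python
def apxx(curve):
    """Area under the curve of points (i/n, curve[i]); curve[0] must be 0."""
    n = len(curve) - 1
    return sum((curve[i - 1] + curve[i]) / 2 * (1 / n) for i in range(1, n + 1))

def apmk(ordering, kill):
    """Average Percentage of Mutants Killed for an ordering of test indices."""
    n_mutants = len(kill[0])
    killed, curve = set(), [0.0]
    for t in ordering:
        killed |= {m for m in range(n_mutants) if kill[t][m]}
        curve.append(len(killed) / n_mutants)
    return apxx(curve)

# Hypothetical Pareto front: (ordering, APMK, APMD) triples returned by NSGA-II.
front = [((0, 1, 2), 0.90, 0.60), ((2, 1, 0), 0.80, 0.75)]
mok_choice = max(front, key=lambda s: s[1])[0]  # MOK: highest APMK
mod_choice = max(front, key=lambda s: s[2])[0]  # MOD: highest APMD
print(mok_choice, mod_choice)  # (0, 1, 2) (2, 1, 0)
```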
3.3 Techniques for Comparison
To facilitate our empirical studies, we introduce two simple but widely studied techniques as baselines.
Rnd:
We consider random prioritization that randomly prioritizes test cases as a minimum prioritization baseline.
Scv:
As an additional control in our studies, we apply statement-coverage-based test case prioritization. As explained in Section 2.2, structural coverage information is a widely used surrogate in test case prioritization. We implement the statement-additional approach, which iteratively selects a test case that maximizes the number of additionally covered statements and is among the most effective coverage-based prioritization schemes [4]. If there are multiple test cases that additionally cover the same number of statements, one of them is randomly selected.
4 Experimental Design
4.1 Research Questions
In the experiments, we investigate the following five research questions:
-
RQ1: How do the mutation-based prioritization techniques compare with the random and coverage-based prioritization in terms of early fault detection?
-
RQ2: What is the superior mutation-based prioritization technique in terms of early fault detection?
-
RQ3: What is the effect of using different weight values in the hybrid (single-objective) test prioritization scheme?
-
RQ4: How effective are the Pareto front solutions of the multi-objective prioritization scheme?
-
RQ5: How much time does it take to perform each of the examined techniques?
RQ1 compares the effectiveness of the mutation-based prioritization techniques with that of the random and coverage-based prioritization. Specifically, we count the number of faults for which each of the prioritization techniques is statistically significantly superior, equal, or inferior to the random ordering and to the coverage-based ordering, respectively. We also measure the effect size of the effectiveness differences between the techniques and the controls.
RQ2 compares the effectiveness of the studied techniques against each other, with the aim of identifying the best performing technique. Similar to RQ1, we count the number of faults for which a technique A is statistically significantly superior, equal, or inferior to another technique B, as well as their exact effectiveness difference.
RQ3 focuses on the hybrid prioritization techniques, which use both the k-criterion (i.e., kill) and the d-criterion (i.e., distinguish). We examine different weight factors between kill and distinguish and see how they impact the prioritization effectiveness.
RQ4 considers the effectiveness of the orderings of test cases in a Pareto front given by the multi-objective test case prioritization techniques. For multi-objective prioritization, all orderings of test cases in a Pareto front are equally good in terms of their objectives. However, since the objectives are proxies, the important question is how these orderings perform in terms of the prioritization effectiveness. Thus, for the Pareto front orderings, we investigate the relationship between the prioritization objectives and the prioritization effectiveness.
RQ5 addresses the cost of the mutation-based prioritization techniques. One obvious cost of a prioritization technique is its execution time. We compare the execution times of all the mutation-based prioritization techniques, including greedy, hybrid, and multi-objective.
4.2 Test Subjects and Faults
For the purposes of the present study, we consider the Java applications in the Defects4J database [10]. These are all open source software systems and are accompanied by 357 developer-fixed and manually verified real faults. In total, we use the following five applications: JFreeChart (Chart), Closure compiler (Closure), Commons Lang (Lang), Commons Math (Math), and Joda-Time (Time). In Defects4J, each fault is given as an independent fault-fix pair of the program versions.
Out of the 357 faults, five are excluded because mutation analysis could not produce results within a practical time limit (i.e., one hour per test case). As a result, we consider the remaining 352 faults, which are summarized in Table 2. Detailed information about each subject fault is available from our webpage at http://se.kaist.ac.kr/donghwan/downloads.
Program | Faults | Test Cases (sum) | dT (sum) | aM (sum) | kM (sum) | dM (sum) |
---|---|---|---|---|---|---|
Chart | 25 | 5,806 | 91 | 21,611 | 8,614 | 1,462 |
Closure | 133 | 443,596 | 347 | 109,727 | 82,676 | 34,685 |
Lang | 65 | 11,409 | 124 | 81,524 | 63,551 | 5,467 |
Math | 106 | 20,661 | 172 | 101,978 | 73,931 | 14,591 |
Time | 27 | 72,005 | 76 | 19,996 | 13,665 | 3,838 |
Total | 352 | 553,477 | 810 | 334,836 | 242,437 | 60,043 |
4.3 Test Suites
For each fault, Defects4J provides “relevant” JUnit test cases that touch the classes modified between the faulty version and the fixed version. Test prioritization is performed when testing newly introduced changes. Thus, it is reasonable to use the information about the modified classes and focus on them instead of the whole program. This is common practice in industry and is performed by retrieving the test cases that have a dependency on the files that were changed [20]. To account for this in our experiments, we compose a test suite of relevant test cases for each fault we consider.
JUnit test cases are Java classes that contain one or more test methods. This leads to two different test suite granularities, depending on whether JUnit test cases are considered at the test-class level or the test-method level [16]. We use the test-method level because it is finer and more informative than the test-class level. In Table 2, the column Test Cases (sum) shows the sum of the numbers of test cases for each fault; for example, there are 5,806 test cases in total for the 25 Chart faults. The column dT (sum) shows the sum of the numbers of fault-detecting test cases for each fault; for example, there are 91 fault-detecting test cases in total for the 25 Chart faults. In total, 553,477 test cases, including 810 fault-detecting test cases, are considered for the 352 subject faults.
4.4 Mutants
We use the Major [21] mutation analysis tool to generate all mutants and execute them against the test cases for each fault. It provides a commonly used set of mutation operators [7, 22], including AOR (Arithmetic Operator Replacement), LOR (Logical Operator Replacement), COR (Conditional Operator Replacement), ROR (Relational Operator Replacement), ORU (Operator Replacement Unary), STD (STatement Deletion), and LVR (Literal Value Replacement). We applied all of these mutation operators. Since the use of a sufficient set of mutation operators may affect the experimental results, we discuss this issue in Section 5.7.
We generate mutants out of the fixed (i.e., clean) version of each fault. To perform a controlled experiment, we assume the fixed version is the norm, and perform mutation analysis on it: subsequently, we “reverse” the fix patch to recreate the fault, and evaluate our prioritization. We will discuss this in Section 5.7 as well.
We generate mutants only from the classes modified between the fixed version and the faulty version, as we consider only the relevant test cases. In Table 2, the columns aM (sum), kM (sum), and dM (sum) show the sums of the numbers of all generated mutants, mutants killed by the test cases, and mutants distinguished by the test suite for each fault, respectively. For example, for the 25 faults in the Chart program, 8,614 and 1,462 of the 21,611 generated mutants are killed and distinguished by the test cases, respectively.
4.5 Multi-Objective Algorithm Configuration
For NSGA-II, we set the population size to 100. The chosen genetic operators are those widely used for permutation-type representations: partially matched crossover, swap mutation, and binary tournament selection [23, 24]. The crossover rate is set to 0.9, and the mutation rate is set to 0.2. The maximum number of fitness evaluations is set to 100,000. Since finding the best configuration for mutation-based test case prioritization falls outside the scope of our work, we simply follow the default configuration and parameter values, which are commonly used and tuned. Using default parameter values is common practice and has been found to be suitable for our context [25], i.e., search-based testing.
4.6 Variables and Measures
For the independent variables, RQ1, RQ2, and RQ5 manipulate all the prioritization techniques listed in Table 1, whereas RQ3 and RQ4 focus on the hybrid techniques and the multi-objective techniques, respectively.
For the dependent variables, we mainly measure the quality and the cost of the test case prioritization techniques. For the quality of the prioritization, we measure the APFD value of each ordering of test cases. For the cost of the prioritization, we measure the execution time required to produce each ordering of test cases. To enable statistical analysis, we independently generate 100 orderings of test cases for each of the greedy, hybrid, and control techniques. For each of the multi-objective techniques, we independently generate 30 orderings of test cases, because generating them takes much longer (hours for a single ordering in the longest case). All our experiments were performed on the Microsoft Azure Cloud Platform using the Ubuntu 16.04 operating system on 8 DS3v2 (4 vcpus, 14 GB memory) virtual machines.
To compare the effectiveness of two prioritization techniques, we perform statistical hypothesis tests following the guidelines provided by Arcuri and Briand [26]. We perform the Mann-Whitney U-test to assess the difference in stochastic order, that is, whether the APFD values of one technique are likely to be larger than the APFD values of the other technique. Note that the Mann-Whitney U-test is a non-parametric test that makes no assumptions about the distribution of the data. To reduce Type I errors, we use a conservative significance level. We also measure the Vargha and Delaney A12 statistic [27] to represent the effect size of the effectiveness difference between the compared prioritization techniques; it measures the probability that one technique yields higher APFD values than the other. For example, A12 = 0.7 means that one technique outperforms the other in 70% of the runs. For the calculation of APFD values, we use the following equation:
APFD(σ) = Σ_{i=1}^{n} (F(σ, i−1) + F(σ, i)) / (2n)    (1)

where F(σ, i) is the fraction of faults detected by the first i test cases of the ordering σ (i.e., by the ordering fraction i/n) and n is the number of test cases. We should note that there is another commonly used equation, provided by Elbaum et al. [14]:

APFD(σ) = 1 − (TF_1 + TF_2 + ⋯ + TF_m) / (n · m) + 1 / (2n)    (2)

where TF_j is the position, among the n test cases of σ, of the first test case that detects the j-th of the m faults. Both (1) and (2) give the same APFD value; (1) uses the percentages of faults detected by each test suite fraction, whereas (2) uses the positions of the first test cases that detect each of the m faults.
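A small numerical check of this equivalence on hypothetical data (the detection matrix below is illustrative only):

```python
def apfd_by_fraction(detect):
    """Eq. (1): trapezoidal area over the fraction of faults detected so far."""
    n, m = len(detect), len(detect[0])
    found, curve = set(), [0.0]
    for row in detect:
        found |= {j for j, hit in enumerate(row) if hit}
        curve.append(len(found) / m)
    return sum((curve[i - 1] + curve[i]) / (2 * n) for i in range(1, n + 1))

def apfd_by_position(detect):
    """Eq. (2): 1 - sum(TF_j) / (n * m) + 1 / (2n)."""
    n, m = len(detect), len(detect[0])
    tf = [min(i + 1 for i in range(n) if detect[i][j]) for j in range(m)]
    return 1 - sum(tf) / (n * m) + 1 / (2 * n)

# detect[i][j] == 1 iff the (i+1)-th test of the ordering detects fault j.
detect = [[0, 0], [1, 0], [0, 0], [1, 1]]
print(apfd_by_fraction(detect), apfd_by_position(detect))  # both print 0.375
```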
To investigate the relationship between the prioritization effectiveness (i.e., APFD) and the objectives (i.e., APMK; we do not need to additionally investigate the relationship between APMD and APFD because there is a clear inverse relationship between APMK and APMD in Pareto fronts), we measure the Pearson linear correlation and the Spearman rank correlation between APFD and APMK for the orderings in Pareto fronts. When the Pearson (or Spearman) correlation is +1, APFD perfectly linearly (or monotonically) increases as APMK increases for the Pareto optimal orderings. When it is −1, APFD perfectly linearly (or monotonically) decreases as APMK increases. Consequently, the closer the correlation is to +1, the more effective MOK is, and the closer it is to −1, the more effective MOD is.
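For reference, the following sketch shows this statistical machinery with SciPy on hypothetical APFD samples; deriving A12 from the U statistic is a standard identity, although the exact return convention of mannwhitneyu depends on the SciPy version:

```python
from scipy import stats

apfd_a = [0.82, 0.85, 0.91, 0.78, 0.88]  # APFD values of technique A (hypothetical)
apfd_b = [0.75, 0.80, 0.79, 0.83, 0.77]  # APFD values of technique B (hypothetical)

# Mann-Whitney U-test: non-parametric comparison of the two APFD samples.
u_stat, p_value = stats.mannwhitneyu(apfd_a, apfd_b, alternative="two-sided")

# Vargha-Delaney A12 effect size, i.e., P(A > B) + 0.5 * P(A == B).
a12 = u_stat / (len(apfd_a) * len(apfd_b))

# Pearson and Spearman correlations between APMK and APFD (hypothetical values).
apmk = [0.70, 0.72, 0.75, 0.80]
apfd = [0.60, 0.64, 0.63, 0.71]
pearson, _ = stats.pearsonr(apmk, apfd)
spearman, _ = stats.spearmanr(apmk, apfd)
print(p_value, a12, pearson, spearman)
```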
5 Results and Analysis
5.1 RQ1: Comparison with Controls
Table 3 records the results of comparing the prioritization techniques with the random orderings. For every compared pair (A, B), the column Superiority provides the number of subject faults where the effectiveness of A is statistically superior (+), equal (=), or inferior (-) to that of B, based on the Mann-Whitney U-tests. The column Effect size provides the average A12 statistic, representing how much one technique outperforms the other on average. In terms of superior cases, HYB-010 is the best: for 86.4% (304/352) of the subject faults, the effectiveness of HYB-010 is statistically superior to that of random. In terms of inferior cases, GRD is the best: for only 2.27% (8/352) of the subject faults is the effectiveness of GRD statistically inferior to that of random. In terms of effect size, HYB-015 is the best, with an average A12 of 0.8520. Overall, the mutation-based test case prioritization techniques are statistically superior or equal to random for 88.9% of the subject faults.
A | B | + | = | - | Effect size | A | B | + | = | - | Effect size
---|---|---|---|---|---|---|---|---|---|---|---
GRD | RND | 294 | 50 | 8 | 0.8269 | HYB-060 | RND | 298 | 37 | 17 | 0.8480 |
HYB-005 | RND | 298 | 38 | 16 | 0.8447 | HYB-065 | RND | 298 | 36 | 18 | 0.8482 |
HYB-010 | RND | 304 | 34 | 14 | 0.8519 | HYB-070 | RND | 299 | 35 | 18 | 0.8474 |
HYB-015 | RND | 300 | 39 | 13 | 0.8520 | HYB-075 | RND | 301 | 33 | 18 | 0.8475 |
HYB-020 | RND | 298 | 40 | 14 | 0.8498 | HYB-080 | RND | 299 | 36 | 17 | 0.8473 |
HYB-025 | RND | 296 | 41 | 15 | 0.8476 | HYB-085 | RND | 300 | 34 | 18 | 0.8484 |
HYB-030 | RND | 296 | 41 | 15 | 0.8468 | HYB-090 | RND | 297 | 38 | 17 | 0.8488 |
HYB-035 | RND | 296 | 39 | 39 | 0.8475 | HYB-095 | RND | 300 | 35 | 17 | 0.8478 |
HYB-040 | RND | 295 | 39 | 18 | 0.8476 | GRK | RND | 275 | 56 | 21 | 0.8188 |
HYB-045 | RND | 297 | 38 | 17 | 0.8471 | MOK | RND | 296 | 43 | 13 | 0.8396 |
HYB-050 | RND | 298 | 36 | 18 | 0.8479 | MOD | RND | 294 | 47 | 11 | 0.8401 |
HYB-055 | RND | 298 | 36 | 18 | 0.8476 | Average | RND | 296.8 | 39.2 | 17.0 | 0.8452 |
Table 3 also shows that the hybrid techniques are at least as effective as the simple greedy techniques GRD and GRK. Specifically, HYB-095 is more effective than GRK (i.e., HYB-100), even though the weight for the GRD objective is only 0.05. This signifies that it is more effective to consider the k-criterion and the d-criterion together than to consider the k-criterion only.
Interestingly, the multi-objective techniques are less effective than the hybrid techniques. This means that, in comparison with random, multi-objective optimization using the kill and distinguish criteria is less beneficial than merely merging the two greedy techniques.
A | B | + | = | - | Effect size | A | B | + | = | - | Effect size
---|---|---|---|---|---|---|---|---|---|---|---
GRD | SCV | 238 | 36 | 78 | 0.7012 | HYB-060 | SCV | 266 | 30 | 56 | 0.7813 |
HYB-005 | SCV | 257 | 29 | 66 | 0.7514 | HYB-065 | SCV | 267 | 29 | 56 | 0.7820 |
HYB-010 | SCV | 267 | 25 | 60 | 0.7687 | HYB-070 | SCV | 267 | 30 | 55 | 0.7822 |
HYB-015 | SCV | 268 | 28 | 56 | 0.7791 | HYB-075 | SCV | 267 | 29 | 56 | 0.7820 |
HYB-020 | SCV | 264 | 32 | 56 | 0.7756 | HYB-080 | SCV | 266 | 31 | 55 | 0.7822 |
HYB-025 | SCV | 262 | 31 | 59 | 0.7736 | HYB-085 | SCV | 266 | 30 | 56 | 0.7836 |
HYB-030 | SCV | 263 | 31 | 58 | 0.7738 | HYB-090 | SCV | 265 | 32 | 55 | 0.7838 |
HYB-035 | SCV | 263 | 30 | 59 | 0.7739 | HYB-095 | SCV | 266 | 30 | 56 | 0.7830 |
HYB-040 | SCV | 264 | 32 | 56 | 0.7772 | GRK | SCV | 236 | 58 | 58 | 0.7501 |
HYB-045 | SCV | 267 | 27 | 58 | 0.7793 | MOK | SCV | 253 | 39 | 60 | 0.7444 |
HYB-050 | SCV | 267 | 28 | 57 | 0.7806 | MOD | SCV | 250 | 38 | 64 | 0.7367 |
HYB-055 | SCV | 267 | 29 | 56 | 0.7809 | Average | SCV | 261.6 | 31.9 | 58.5 | 0.7699 |
Table 4 records the results of comparing the prioritization techniques with SCV. The structure of the table is the same as that of Table 3. In terms of superior cases, HYB-015 is the best: for 76.1% (268/352) of the subject faults, the effectiveness of HYB-015 is statistically superior to that of SCV. In terms of inferior cases, HYB-070, HYB-080, and HYB-090 are the best: for only 15.6% (55/352) of the subject faults is their effectiveness statistically inferior to that of SCV. In terms of effect size, HYB-090 is the best, with an average A12 of 0.7838. Overall, the mutation-based test case prioritization techniques are statistically superior or equal to the coverage-based prioritization technique for at least 77.8% of the subject faults.
The mutation-based prioritization is superior to or equal to the random prioritization for 88.9% of the faults, and is superior to or equal to the coverage-based prioritization for 77.8% of the faults.
5.2 RQ2: Comparison between the Techniques
This section investigates whether there is a superior technique among the mutation-based test case prioritization techniques. We only consider GRD, HYB-010, HYB-050, HYB-090, GRK, MOK, and MOD, because considering all weights for the hybrid techniques would yield too many pairs. Table 5 contains the comparison results for each pair of techniques. The structure of the table is the same as that of Table 3.
A | B | + | = | - | Effect size | A | B | + | = | - | Effect size
---|---|---|---|---|---|---|---|---|---|---|---
HYB-010 | GRD | 218 | 74 | 60 | 0.6780 | HYB-090 | HYB-050 | 81 | 226 | 45 | 0.5415 |
HYB-050 | GRD | 212 | 62 | 78 | 0.6633 | GRK | HYB-050 | 83 | 188 | 81 | 0.5078 |
HYB-090 | GRD | 215 | 55 | 82 | 0.6637 | MOK | HYB-050 | 40 | 170 | 142 | 0.3913 |
GRK | GRD | 214 | 50 | 88 | 0.6441 | MOD | HYB-050 | 75 | 113 | 164 | 0.4029 |
MOK | GRD | 122 | 116 | 114 | 0.5049 | GRK | HYB-090 | 47 | 225 | 80 | 0.4562 |
MOD | GRD | 138 | 127 | 87 | 0.5409 | MOK | HYB-090 | 38 | 163 | 151 | 0.3830 |
HYB-050 | HYB-010 | 102 | 169 | 81 | 0.5285 | MOD | HYB-090 | 79 | 100 | 173 | 0.3951 |
HYB-090 | HYB-010 | 109 | 163 | 80 | 0.5411 | MOK | GRK | 71 | 137 | 144 | 0.4274 |
GRK | HYB-010 | 108 | 147 | 97 | 0.5200 | MOD | GRK | 99 | 84 | 169 | 0.4230 |
MOK | HYB-010 | 62 | 139 | 151 | 0.4003 | MOD | MOK | 47 | 259 | 46 | 0.5077 |
MOD | HYB-010 | 70 | 144 | 138 | 0.4277 |
Comparing GRK and GRD in Table 5, GRK is more effective for 60.8% (214/352) of the faults, whereas GRD is more effective for 25% (88/352) of the faults. There is no statistically significant difference for the remaining 14.2% (50/352) of the faults. Across all subject faults, the average effect size is 0.6441, which means that GRK outperforms GRD in 64.41% of the runs on average. While GRK is more effective than GRD in general, GRD outperforms GRK for some faults. Section 5.6 discusses the effectiveness difference between GRK and GRD in more detail.
Interestingly, in comparison to GRD, the average A12 values of the hybrid techniques are higher than 0.66, while those of the multi-objective techniques are around 0.52. In other words, in comparison to GRD, the hybrid techniques are more effective, while the multi-objective optimization techniques are not. When considering mutant kill and distinguishment together, a simple hybrid is more effective than multi-objective optimization in mutation-based test case prioritization.
Comparing MOK and MOD, they are equally effective for 73.6% (259/352) of the faults, and for the remaining faults the number of cases where MOK is more effective is almost the same as the number where MOD is more effective. This implies that the orderings of test cases in a Pareto front may have similar prioritization effectiveness. This issue is investigated further in Section 5.4.
Overall, Table 5 shows that all pairs have both superior and inferior cases that cannot be ignored. Also, the average effect sizes of all pairs are not dramatic. This means that, in terms of mutation-based test case prioritization using the k-criterion and the d-criterion, there is no single superior technique among greedy, hybrid, and multi-objective.
Among greedy, hybrid, and multi-objective strategies using the traditional kill-only mutation adequacy and the diversity-aware mutation adequacy, there is no single superior test case prioritization technique.
5.3 RQ3: Effect of Changing Weight between Kill and Distinguish
To investigate the effect of the weight on APFD for the HYB-w prioritization techniques, the average APFD is obtained by changing w from 0 to 1 in steps of 0.05. Figure 4 shows the results; the x-axis is w and the y-axis is the average APFD.

In Figure 4, all the subject programs show the same result: the highest APFD is achieved when w is strictly between 0 and 1 (i.e., neither 0 nor 1). This means that the combination of GRK and GRD has a positive effect on the test case prioritization effectiveness.
There is no single value of w that yields the highest APFD for all programs; the best-performing weight differs among Chart, Lang, Closure, Math, and Time. This means that the best weight between GRK and GRD depends on the program characteristics.
The test case prioritization effectiveness is maximized when the weight is strictly between 0 and 1, not at the extreme values 0 and 1. This means that the combination of GRK and GRD increases the effectiveness. The optimal weight depends on the subject program.
5.4 RQ4: Effectiveness of Orderings in Pareto Fronts
This section investigates the effectiveness of orderings in Pareto fronts. To do that, we measure the Pearson and Spearman correlation coefficients between APMK (i.e., how quickly mutants are killed) and APFD (i.e., how quickly faults are detected) for the orderings in Pareto fronts.
For 207 of the 352 subject faults (i.e., 58.8%), the correlation coefficients are undefined because the variance of APFD is zero. This means that, for these 207 faults, all the orderings in a Pareto front are equally good in terms of APFD. This partially explains the fact that MOK and MOD are statistically equally effective for 74.1% of the faults, as noted in Section 5.2. The remaining 145 faults have correlation coefficients ranging from -1 to +1.
Figure 7: Each point represents the (a) Pearson or (b) Spearman correlation coefficient of a fault. The bottom and top of each box are the first and third quartiles, and the band inside the box is the median. The whiskers extend to the highest and lowest datum still within 1.5 IQR of the upper and lower quartiles. The distribution of the correlation coefficients varies widely, except for the faults in the Time program.
Figure 7 summarizes the distribution of correlation coefficients of the 145 (= 352 - 207) faults. Each point represents the Pearson (Figure 7(a)) or Spearman (Figure 7(b)) correlation coefficient of a fault. The bottom and top of each box are the first and third quartiles, and the band inside the box is the median. The bottom and top of the whiskers are the lowest and highest datum still within 1.5 IQR of the lower and upper quartiles. For example, Figure 7(a) shows that all faults in the Time program have very similar Pearson correlation coefficients, which are nearly zero. A zero Pearson coefficient means that there is no linear correlation between APMK and APFD for the orderings in Pareto fronts. Except for the Time program, Figure 7 shows that both correlation coefficients are widely distributed from -1 to +1. This implies that there is no superiority between MOK and MOD on average across all programs. The faults in Chart and Lang tend to have correlation coefficients close to -1, whereas the faults in Closure tend to have correlation coefficients close to +1. This implies that MOD is often more effective than MOK for Chart and Lang, whereas MOK is often more effective than MOD for Closure. The faults in Math and Time tend to have zero correlation, which implies that the effectiveness of MOK and MOD is similar for these programs.
For 58.8% of the subject faults, the orderings of test cases in Pareto fronts are equally effective in terms of APFD. For the remaining faults, the correlation coefficient between APMK (or APMD) and APFD varies from -1 to +1, depending on the studied fault.
5.5 RQ5: Execution Time of the Techniques
Table 6 shows the average test case prioritization time for each technique. For example, the GRK prioritization technique takes 2253.9 ms to prioritize a test suite on average. There is no time difference between MOK and MOD, since both simply select one of the orderings of test cases in a Pareto front, and the information needed for the selection (i.e., APMK and APMD) is calculated beforehand.
Technique | Time (ms) |
---|---|
RND | 21.2 |
SCV | 150.3 |
GRK | 2253.9 |
HYB-050 | 7245.2 |
GRD | 7651.7 |
MOK (or MOD) | 2198981.8 |
In Table 6, GRD takes around 3.4 times more time than GRK. This is because the computation for mutant distinguishment (i.e., whether a mutant’s d-vector is unique) is more expensive than the computation for mutant kill (i.e., whether a mutant’s d-vector is non-zero). HYB is similar to GRD, because it also requires the computation for mutant distinguishment. While the GRD and HYB techniques take more time than GRK, each prioritization stays within 8 seconds on average. On the other hand, MOK (or MOD) takes far more time: approximately 37 minutes per prioritization. This is mainly because the numbers of test cases and mutants are too large to optimize permutations as a whole.
We also investigate the effect of the total number of test cases (i.e., the size of a test suite) and the total number of mutants on the execution time. It turns out that the execution time is strongly linearly correlated with the product of the total number of test cases and the total number of mutants for all the studied prioritization techniques. The average Pearson correlation coefficient between this product and the execution time across all techniques is 0.930.
Note that Table 6 only reports the execution time of the prioritization, not the time for mutation analysis. On average, mutation analysis takes 651.8 seconds per fault. However, there are several test cases that do not give mutation analysis results within the one-hour time limit. While such test cases are excluded from our controlled experiment, they can be problematic in practice. Fortunately, mutation analysis for each test case can be easily parallelized. Further, it is possible to prepare the mutation analysis results in advance, independently of future changes and regression testing.
On average, the multi-objective techniques require approximately 37 minutes, whereas the greedy and hybrid techniques require less than 8 seconds. The prioritization execution time of all techniques is strongly linearly related to the product of the numbers of test cases and mutants.
5.6 Discussion
As described in Section 5.2, there is no clear winner between the GRK (i.e., kill mutants as early as possible) and GRD (i.e., distinguish mutants as early as possible) test prioritization schemes; 60.8% of the subject faults show that GRK is statistically superior to GRD, whereas 25% of the subject faults show the opposite result. This is interesting because the d-criterion subsumes the k-criterion, as explained in Section 2.1. Taken together, the d-criterion is stronger than the k-criterion, yet prioritization based on the d-criterion is not superior to prioritization based on the k-criterion. To further understand why this happens, we investigate the relationship between kill and distinguishment in test case prioritization.
By definition, mutant kill concerns the difference between the original program and its mutants, whereas mutant distinguishment concerns the difference among all programs, including the original program and the mutants. To see how a set of test cases kills and distinguishes mutants and where the fault-detecting test cases are, we propose a graphical representation called the Mutant Distinguishment Graph (MDG). In an MDG, each node represents a set of mutually undistinguished mutants (i.e., mutants with identical d-vectors), and a directed edge from a node N_x to a node N_y represents the set of test cases that distinguish N_y from N_x by killing the mutants in N_y but not the mutants in N_x. We call N_y a child of N_x when there is a directed edge from N_x to N_y. To represent where the fault-detecting test cases are, we draw each edge with a thickness proportional to the percentage of fault-detecting test cases among the test cases represented by the edge. To avoid zero thickness, we give a small default thickness even if the percentage is zero. There is a special “root” node that has no incoming edge; all the remaining nodes are descendants of the root. The root node represents the original program together with the mutants that are not killed by any of the test cases in the MDG. The structure of an MDG varies depending on the given set of test cases and mutants.

For example, using the four mutants and three test cases given in Figure 1, we obtain the MDG shown in Figure 8. There are five nodes connected by five directed edges. The node at the top is the root, which contains only the original program because all the mutants are killed by the set of test cases. Each edge from the root to a node is labeled with the test cases that distinguish that node from the root; in other words, these test cases kill the mutants in that node. Likewise, an edge between two non-root nodes is labeled with the test cases that distinguish the child node from its parent. As an example, we assume that one of the test cases is the fault-detecting test case, and the edges labeled with it are drawn thicker than the others.
Note that some edges are omitted from Figure 8 because of the transitivity of mutant distinguishment [12]: if a mutant m_y is distinguished from a mutant m_x by a test case t_1, and a mutant m_z is distinguished from m_y by a test case t_2, then m_z is always distinguished from m_x by both t_1 and t_2. For an MDG, this transitivity implies that the edges from a node to its indirect descendants can be omitted without any information loss. As a result, we can still tell which test cases kill the mutants of a node from the edges on the path from the root to that node, even when there is no direct edge from the root to it.
The transitivity of an MDG provides an interesting property for the test cases on the edges leaving the root. Among all test cases in an MDG, the test cases on the edges from the root are sufficient to kill all the mutants except the mutants in the root. Furthermore, a test case on an edge from the root kills all the mutants in the node that the edge points to, as well as the mutants in that node’s descendants. For the example in Figure 8, the set of test cases on the edges from the root is sufficient to kill all mutants, and each of these test cases kills all the mutants in the node it distinguishes and in that node’s descendants.
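The following sketch shows one possible MDG construction consistent with the description above: mutants with identical d-vectors are grouped into nodes, edges connect a node to the nearest nodes whose killing test sets strictly contain its own, and the fraction of fault-detecting tests on each edge stands in for the edge thickness. All names and data are hypothetical:

```python
from collections import defaultdict

def build_mdg(kill, fault_detecting):
    """Group mutants by d-vector, add edges to the nearest strictly-larger
    killing sets, and label each edge with its extra tests and the fraction
    of fault-detecting tests among them (the 'thickness')."""
    n_tests, n_mutants = len(kill), len(kill[0])
    groups = defaultdict(list)                   # d-vector -> mutants in the node
    for m in range(n_mutants):
        groups[tuple(kill[t][m] for t in range(n_tests))].append(m)
    groups.setdefault((0,) * n_tests, [])        # root: original + unkilled mutants
    killsets = {v: frozenset(t for t, bit in enumerate(v) if bit) for v in groups}

    edges = []
    for x, kx in killsets.items():
        for y, ky in killsets.items():
            if kx < ky and not any(kx < kz < ky for kz in killsets.values()):
                label = ky - kx                  # tests killing y's mutants, not x's
                thickness = len(label & fault_detecting) / len(label)
                edges.append((groups[x], groups[y], sorted(label), thickness))
    return edges

# kill[t][m] == 1 iff test t kills mutant m; test 2 is the fault-detecting one.
kill = [[1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 1]]
for src, dst, tests, thickness in build_mdg(kill, fault_detecting={2}):
    print(src, "->", dst, "via tests", tests, "fault-detecting ratio", thickness)
```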
The aforementioned property is key to understanding the effectiveness difference between GRK and GRD in test case prioritization. For GRK, giving high priority to the test cases on the edges from the root is clearly beneficial for killing all mutants as early as possible. In other words, GRK gives low priority to the test cases that are not on the edges from the root. If there is no thick edge from the root, the fault-detecting test cases are not on the edges from the root, and GRK gives the fault-detecting test cases low priority. In summary, GRK becomes ineffective when there is no thick edge from the root.
GRD tries to distinguish all mutants as early as possible. However, since GRD has to choose among all test cases that distinguish mutants (i.e., it may select any edge, not only the edges from the root), it is likely to give lower priority to the test cases on the edges from the root. In other words, it is likely to give higher priority to test cases that distinguish mutants than to test cases that merely kill them. Consequently, when the fault-detecting test cases are those involved in mutant distinguishment (i.e., the thick edges are not the edges from the root), GRD becomes effective. In these cases, GRD is more likely to outperform GRK.
Figure 13: Four representative MDGs, panels (a) to (d).
Figure 13 shows four representative MDGs from the 352 subject faults: Figures 13(a) and 13(b) show the MDGs of faults for which GRD is much more effective than GRK, whereas Figures 13(c) and 13(d) show faults for which GRK is much more effective than GRD. In each graph, the big node near the center is the root. It is clear that there is no thick edge from the root in Figures 13(a) and 13(b), whereas there is a thick edge from the root in Figures 13(c) and 13(d).
Unfortunately, an MDG cannot be drawn (i.e., its edge thicknesses cannot be determined) without the fault detection information. Since we do not know which test cases detect faults at prioritization time, we cannot use an MDG to predict which of GRK and GRD will be more effective. Thus, MDGs are useful for explaining the strengths and weaknesses of the mutation-based test case prioritization schemes. To improve test case prioritization, we need some form of prediction of the properties of the fault-detecting test cases; if such a prediction were available, MDGs could also support test case prioritization.
We should note that an MDG is similar to the Mutant Subsumption Graph (MSG) suggested by Kurtz et al. [28], since mutant distinguishment is closely related to mutant subsumption, as discussed by Shin and Bae [12]. The main difference is that an MDG additionally encodes information about the fault-detecting test cases in the thickness of its edges.
5.7 Threats to Validity
There are several threats to validity for our experimental results. One threat is due to the subject programs and faults that we use. These might not be representative of other programs and faults. While this threat is common to any empirical study and can only be addressed by making multiple and context-related studies, we tried to mitigate it by using a large set of real faults. Thus, we used all the faults of Defects4J, which is an independently constructed dataset built to support controlled experimental results.
Our results are also to some extent dependent on the configuration of NSGA-II. Parameter tuning is an important and challenging problem in evolutionary algorithms [29]. However, we believe that such tuning would not greatly impact our results, as the default configurations perform well in our context [25].
The mutation analysis tool Major [21] and the mutation operators we use form another source of threats to our study, because different types of mutants may result in different behaviors and influence our results. Therefore, the use of another mutation testing tool employing a different set of operators, like PIT [30], may result in different findings. However, we do not consider this threat as vital, since our main contribution lies in the relative comparison of the mutation-based test prioritization techniques and not in their optimal performance. Moreover, we expect that using different mutant sets would have a similar influence on all the prioritization techniques we study, since all of them rely on the same set of mutants. Nonetheless, according to a recent study by Kintis et al. [22], the fault detection capabilities of PIT are considerably lower than those of Major. The same study reports that another version of the tool, named PIT_RV (i.e., the research version of PIT), is 5% more effective at detecting faults than Major. Therefore, we believe that this 5% difference from PIT_RV cannot make a major difference in the results we report.
Other threats come from the order of the fixed and faulty versions. While the fixed version comes after the faulty version in the repository timeline, we treat the fixed version as the clean version that previously passed all regression test cases, and the faulty version as the change-introducing version that should be tested by the regression test cases. This reversed order is used to perform our controlled experiment. Still, more studies are needed to investigate the differences between the results of such controlled experiments and actual practice.
The APFD metric used to represent the effectiveness of test case prioritization has some limitations: it accounts for neither the severity of faults nor the execution cost of test cases. Since Defects4J provides many single-fault program versions rather than one multi-fault program version, fault severity is not a concern here. To address the limitation related to test case execution cost, we additionally measured the cost-cognizant APFDc values [31], which account for the execution times of individual test cases. The results show that the average difference between the APFD and APFDc values for each test case ordering is almost negligible (i.e., 3.169e-04), because the execution times of the test cases within a suite are nearly identical for all the subject test suites. As a result, we keep using the APFD metric to report the results.
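For clarity, the sketch below computes both the classic APFD and the cost-cognizant APFDc of Elbaum et al. [31] for a given test ordering; the formulas are the standard ones, while the test names, fault sets, and costs are hypothetical. With unit execution costs, the two metrics coincide, which mirrors the near-zero difference reported above.

```python
def apfd(order, detects):
    """Classic APFD for a test ordering.

    order   : list of test ids in execution order
    detects : dict fault_id -> set of test ids that detect it
    """
    n, m = len(order), len(detects)
    pos = {t: i + 1 for i, t in enumerate(order)}              # 1-based positions
    tf = [min(pos[t] for t in tests if t in pos) for tests in detects.values()]
    return 1.0 - sum(tf) / (n * m) + 1.0 / (2 * n)

def apfd_c(order, detects, cost, severity=None):
    """Cost-cognizant APFDc (Elbaum et al. [31]); unit severities by default."""
    n = len(order)
    pos = {t: i for i, t in enumerate(order)}                   # 0-based positions
    severity = severity or {f: 1.0 for f in detects}
    total_cost = sum(cost[t] for t in order)
    num = 0.0
    for f, tests in detects.items():
        first = min(pos[t] for t in tests if t in pos)          # first detecting test
        remaining = sum(cost[order[j]] for j in range(first, n))
        num += severity[f] * (remaining - 0.5 * cost[order[first]])
    return num / (total_cost * sum(severity.values()))

# Example: three tests, one fault detected only by t3.
order = ["t1", "t2", "t3"]
detects = {"f1": {"t3"}}
cost = {"t1": 1.0, "t2": 1.0, "t3": 1.0}
print(apfd(order, detects), apfd_c(order, detects, cost))       # both ~0.167
```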
To allow reproducibility of the results presented in this paper, all the prioritization results and the implementation of the mutation-based test case prioritization techniques are available from our web page at http://se.kaist.ac.kr/donghwan/downloads.
6 Related Work
Since the diversity-aware mutation adequacy criterion has only recently been proposed by Shin et al. [8], there is no previous study of diversity-aware mutation adequacy in test case prioritization. However, the criterion has been experimentally evaluated in test suite selection, in comparison with the traditional kill-only mutation adequacy criterion. The results on 45 real faults in Defects4J show that the diversity-aware criterion increases the fault detection effectiveness of adequate test suites compared with the kill-only criterion, whereas it requires more test cases to reach adequacy.
In test case prioritization, the traditional mutation adequacy criterion has already been investigated by Rothermel et al. [2]. They investigate the effectiveness of several greedy prioritization techniques based on branches, statements, and mutants, respectively; the branch and statement techniques prioritize test cases according to the number of branches and statements covered by each test case. They find that there is no single best technique, but that, on average across the programs, the mutant-based technique is the most effective. Later, Elbaum et al. [14] extended the study of Rothermel et al. by comparing function-level (coarse-granularity) techniques with statement-level (fine-granularity) techniques. Their empirical results on eight C programs from the Siemens benchmarks [32] show that the coarser granularity generally decreases the effectiveness of test case prioritization.
For the total and additional greedy approaches to test case prioritization, Li et al. [19] report that the additional approach significantly outperforms the total approach. They also study meta-heuristic algorithms for test case prioritization, but the difference in prioritization effectiveness between the meta-heuristic algorithms and the additional greedy approach is not significant. Zhang et al. [15] also focus on the total and additional approaches. They develop a unified approach with the total and additional strategies as its two extreme instances; depending on a control parameter, the unified model yields a spectrum of approaches ranging between the two. Their empirical results on four Java programs show that selecting a proper parameter value increases the prioritization effectiveness compared to the simple total and additional approaches, although the additional approach is almost as effective as the parametrized approach in all programs. A sketch contrasting the total and additional strategies is given after the next paragraph.

In multi-objective test case prioritization, Epitropakis et al. [24] present an empirical study of the effectiveness of multi-objective test case prioritization. They mainly investigate two multi-objective evolutionary algorithms, NSGA-II and the Two Archive Evolutionary Algorithm (TAEA) [33], with objectives including statement coverage and fault detection history. The results show that the multi-objective prioritization techniques are superior to greedy techniques targeting each of the objectives individually.
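As an illustration of the total versus additional greedy strategies discussed above, the following sketch orders tests using a hypothetical kill (or coverage) matrix. The reset-and-continue behaviour of the additional strategy once everything is covered is one common convention, not necessarily the exact variant used by the cited studies.

```python
def total_greedy(matrix):
    """Order tests by the total number of elements (e.g., mutants) each kills/covers."""
    return sorted(matrix, key=lambda t: len(matrix[t]), reverse=True)

def additional_greedy(matrix):
    """Repeatedly pick the test with the most not-yet-covered elements;
    when nothing new can be gained, reset coverage and continue."""
    remaining = dict(matrix)
    covered, order = set(), []
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        gain = remaining[best] - covered
        if not gain and covered:          # all reachable elements covered: reset
            covered = set()
            continue
        order.append(best)
        covered |= remaining.pop(best)
    return order

# matrix[t] = set of mutants killed (or statements covered) by test t (hypothetical data)
matrix = {"t1": {1, 2, 3, 4}, "t2": {5, 6, 7}, "t3": {1, 2, 3, 4, 5}}
print(total_greedy(matrix))       # ['t3', 't1', 't2']
print(additional_greedy(matrix))  # ['t3', 't2', 't1'] -- the two strategies can disagree
```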
Perhaps the work closest to ours is that of Lou et al. [17], which studies mutation-based prioritization in the context of software evolution. Their study considers two prioritization schemes: one based on the number of mutants killed and one based on the distribution of the killed mutants. Their results show that ordering tests by the number of mutants killed performs best; this approach is similar to our greedy one. While there are similarities between our study and that of Lou et al., ours is based on real faults (whereas theirs is based on mutant faults) and we additionally consider mutant distinguishment with multiple heuristics, which they do not.
Regarding diversity-based test prioritization, several studies work mainly in a black-box manner. Henard et al. [3] suggest a diversity-aware metric based on the concept of Combinatorial Interaction Testing; this method prioritizes tests according to the dissimilarity of the combinations of their input parameters. Feldt et al. [5] suggest using compression utilities to support test prioritization; their technique measures the dissimilarity of test suites using the Normalized Compression Distance. More recently, Henard et al. [4] compare these techniques with coverage-based test prioritization and find that they are of similar power, despite not using any dynamic information from the tested systems.
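As a rough illustration of the compression-based distance underlying the technique of Feldt et al. [5], the sketch below computes the pairwise Normalized Compression Distance with zlib. Their actual metric (the test set diameter) generalizes this idea to whole sets of test cases, and the example test strings here are hypothetical.

```python
import zlib

def c(data: bytes) -> int:
    """Compressed length, used as a practical stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: close to 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical textual test cases: the more dissimilar a candidate is to the
# already-selected tests, the earlier a diversity-based prioritizer would run it.
t1 = b"assertEquals(parse('2024-01-01'), Date(2024, 1, 1))"
t2 = b"assertEquals(parse('2024-02-29'), Date(2024, 2, 29))"
t3 = b"assertThrows(IOException, lambda: read('/missing/file'))"
print(ncd(t1, t2))  # relatively small: similar tests
print(ncd(t1, t3))  # larger: dissimilar tests
```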
7 Conclusion
In this paper, we investigate test case prioritization guided by mutants. Based on the recently defined diversity-aware mutation adequacy criterion, we present a new prioritization objective: distinguishing all mutants as early as possible. We evaluate the effectiveness of mutation-based prioritization techniques under this new objective as well as the existing one, namely killing all mutants as early as possible. Based on these two objectives, we investigate greedy, hybrid, and multi-objective prioritization strategies using 352 real faults and 553,477 developer-written test cases.
Our results show that mutation-based prioritization is at least as effective as random prioritization for at least 88.9% of the faults and as coverage-based prioritization for at least 77.8% of the faults. Among the greedy, hybrid, and multi-objective optimization strategies using the kill-only and diversity-aware mutation adequacy criteria, there is no single superior test case prioritization technique. Interestingly, while neither the kill-only nor the diversity-aware criterion dominates the other, their combined use improves the effectiveness of the prioritization. For the multi-objective optimization, the effectiveness of the orderings in the Pareto fronts does not correlate consistently with the prioritization objectives. The multi-objective techniques require approximately 37 minutes of prioritization time, whereas the greedy and hybrid techniques require less than 8 seconds.
There are several implications of these results. For example, both distinguishing and killing mutants as early as possible are more effective than covering statements as early as possible; to detect faults early, mutation-based prioritization is more beneficial than coverage-based prioritization. Interestingly, considering the two mutation-based objectives together in one greedy hybrid approach performs best. The same combination can be achieved with multi-objective optimization, but unfortunately it does not provide any important benefit and requires far more time to prioritize than the hybrid approach.
More research is needed to develop a single test case prioritization technique that is clearly superior to both the kill-based and the distinguishment-based approaches. To support such attempts, we provide a graphical model called the Mutant Distinguishment Graph (MDG), which visualizes how mutants are killed and distinguished by a test suite with respect to the fault detecting test cases. In this way, we demonstrate why simply killing mutants as early as possible is not always effective.
References
- [1] Yoo S, Harman M. Regression testing minimization, selection and prioritization: a survey. Software Testing, Verification and Reliability 2012; 22(2):67–120.
- [2] Rothermel G, Untch RH, Chu C, Harrold MJ. Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering (TSE) 2001; 27(10):929–948.
- [3] Henard C, Papadakis M, Perrouin G, Klein J, Heymans P, Le Traon Y. Bypassing the combinatorial explosion: Using similarity to generate and prioritize t-wise test configurations for software product lines. IEEE Transactions on Software Engineering 2014; 40(7):650–670.
- [4] Henard C, Papadakis M, Harman M, Jia Y, Le Traon Y. Comparing white-box and black-box test prioritization. Proceedings of the 38th International Conference on Software Engineering, ACM, 2016; 523–534.
- [5] Feldt R, Poulding S, Clark D, Yoo S. Test set diameter: Quantifying the diversity of sets of test cases. Software Testing, Verification and Validation (ICST), 2016 IEEE International Conference on, IEEE, 2016; 223–233.
- [6] Chekam TT, Papadakis M, Traon YL, Harman M. Empirical study on mutation, statement and branch coverage fault revelation that avoids the unreliable clean program assumption. Proceedings of the International Conference on Software Engineering (ICSE), 2017.
- [7] Andrews JH, Briand LC, Labiche Y, Namin AS. Using mutation analysis for assessing and comparing testing coverage criteria. IEEE Transactions on Software Engineering (TSE) 2006; 32(8):608–624.
- [8] Shin D, Yoo S, Bae DH. Diversity-aware mutation adequacy criterion for improving fault detection capability. Proceedings of the International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2016; 122–131.
- [9] Shin D, Yoo S, Bae DH. A theoretical and empirical study of diversity-aware mutation adequacy criterion. IEEE Transactions on Software Engineering 2017; .
- [10] Just R, Jalali D, Ernst MD. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), 2014; 437–440.
- [11] DeMillo RA, Lipton RJ, Sayward FG. Hints on test data selection: Help for the practicing programmer. Computer 1978; 11(4):34–41.
- [12] Shin D, Bae DH. A theoretical framework for understanding mutation-based testing methods. Proceedings of the International Conference on Software Testing, Verification and Validation (ICST), 2016; 299–308.
- [13] Lu Y, Lou Y, Cheng S, Zhang L, Hao D, Zhou Y, Zhang L. How does regression test prioritization perform in real-world software evolution? Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016, 2016; 535–546, doi:10.1145/2884781.2884874. URL http://doi.acm.org/10.1145/2884781.2884874.
- [14] Elbaum S, Malishevsky AG, Rothermel G. Test case prioritization: A family of empirical studies. IEEE Transactions on Software Engineering 2002; 28(2):159–182.
- [15] Zhang L, Hao D, Zhang L, Rothermel G, Mei H. Bridging the gap between the total and additional test-case prioritization strategies. Proceedings of the 2013 International Conference on Software Engineering, IEEE Press, 2013; 192–201.
- [16] Do H, Rothermel G. On the use of mutation faults in empirical assessments of test case prioritization techniques. IEEE Transactions on Software Engineering 2006; 32(9):733–752.
- [17] Lou Y, Hao D, Zhang L. Mutation-based test-case prioritization in software evolution. Proceedings of the 26th International Symposium on Software Reliability Engineering (ISSRE), IEEE, 2015; 46–57.
- [18] Deb K, Pratap A, Agarwal S, Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 2002; 6(2):182–197.
- [19] Li Z, Harman M, Hierons RM. Search algorithms for regression test case prioritization. IEEE Transactions on Software Engineering 2007; 33(4).
- [20] Memon AM, Gao Z, Nguyen BN, Dhanda S, Nickell E, Siemborski R, Micco J. Taming google-scale continuous testing. 39th IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice Track, ICSE-SEIP 2017, Buenos Aires, Argentina, May 20-28, 2017, 2017; 233–242, doi:10.1109/ICSE-SEIP.2017.16. URL https://doi.org/10.1109/ICSE-SEIP.2017.16.
- [21] Just R. The Major mutation framework: Efficient and scalable mutation analysis for Java. Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), 2014; 433–436.
- [22] Kintis M, Papadakis M, Papadopoulos A, Valvis E, Malevris N, Le Traon Y. How effective mutation testing tools are? an empirical analysis of java mutation testing tools with manual analysis and real faults. Technical Report, Luxembourg University: http://pages.cs.aueb.gr/ kintism/papers/preprint1.pdf, 2017.
- [23] Goldberg DE, Holland JH. Genetic algorithms and machine learning. Machine Learning 1988; 3(2):95–99.
- [24] Epitropakis MG, Yoo S, Harman M, Burke EK. Empirical evaluation of pareto efficient multi-objective regression test case prioritisation. Proceedings of the 2015 International Symposium on Software Testing and Analysis, ACM, 2015; 234–245.
- [25] Kotelyanskii A, Kapfhammer GM. Parameter tuning for search-based test-data generation revisited: Support for previous results. Quality Software (QSIC), 2014 14th International Conference on, IEEE, 2014; 79–84.
- [26] Arcuri A, Briand L. A hitchhiker’s guide to statistical tests for assessing randomized algorithms in software engineering. Software Testing, Verification and Reliability (STVR) 2014; 24(3):219–250.
- [27] Vargha A, Delaney HD. A critique and improvement of the cl common language effect size statistics of mcgraw and wong. Journal of Educational and Behavioral Statistics 2000; 25(2):101–132.
- [28] Kurtz B, Ammann P, Delamaro ME, Offutt J, Deng L. Mutant subsumption graphs. Proceedings of the International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2014; 176–185.
- [29] Eiben ÁE, Hinterding R, Michalewicz Z. Parameter control in evolutionary algorithms. IEEE Transactions on Evolutionary Computation 1999; 3(2):124–141.
- [30] Coles H, Laurent T, Henard C, Papadakis M, Ventresque A. Pit: a practical mutation testing tool for java. Proceedings of the 25th International Symposium on Software Testing and Analysis, ACM, 2016; 449–452.
- [31] Elbaum S, Malishevsky A, Rothermel G. Incorporating varying test costs and fault severities into test case prioritization. Proceedings of the 23rd International Conference on Software Engineering, IEEE Computer Society, 2001; 329–338.
- [32] Hutchins M, Foster H, Goradia T, Ostrand T. Experiments of the effectiveness of dataflow-and controlflow-based test adequacy criteria. Proceedings of the 16th international conference on Software engineering, IEEE Computer Society Press, 1994; 191–200.
- [33] Deb K, Agrawal S, Pratap A, Meyarivan T. A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. International Conference on Parallel Problem Solving From Nature, Springer, 2000; 849–858.