Regression testing is a testing activity meant to ensure that updates to the software have not changed its existing behavior. Regression test suites normally grow as the software is developed or as more quality assurance activity is performed. Although adding test cases helps make the test suite more effective, a large test suite is costly to execute, and its execution might take hours or even days to finish.
Test case prioritization (TCP) seeks to help testers by ordering test cases so that testers gain the maximum benefit from early execution. For this purpose, test cases are executed in an order that optimizes a desired goal function. The most common goal in TCP is to minimize the time needed to find failing test cases. Finding failing test cases earlier lets the development team start working on issues sooner and resolve them faster in the development process.
The vast majority of TCP methods use structural coverage as a metric to prioritize test cases Yoo and Harman (2012); Hao et al. (2014). Coverage-based TCP methods usually aim to order test cases so that high coverage is reached as soon as possible. There are two major strategies in coverage-based TCP methods, namely the total and additional strategies Rothermel et al. (1999), which we introduce in Section 2.2. Recent studies have shown that even an optimal coverage-based TCP method does not perform much better than the additional method in terms of fault detection rate Hao et al. (2016). This suggests that, in order to improve on the additional method, data sources other than structural coverage should be used.
Another source of information that can be used for TCP is fault-proneness information. A high fault-proneness for a code unit indicates that the unit is relatively likely to contain a fault. There are various approaches for leveraging fault-proneness information to improve TCP; hence, choosing an appropriate approach is itself a challenge. In this paper, we propose a novel approach to incorporate fault-proneness information into coverage-based TCP methods. Furthermore, in order to estimate the fault-proneness of a code unit, we designed a novel neural-network-based defect prediction method customized for the current problem. Our approach is generally applicable to many coverage-based TCP methods. In this study, we specifically apply it to the total and additional TCP methods to obtain the modified total and modified additional methods. Our experiment on 357 versions of five real-world projects, included in the Defects4J dataset Just et al. (2014), shows that this modification improves the fault detection rate of these methods.
This paper makes the following contributions:
We propose a novel method to incorporate fault-proneness estimations into coverage-based TCP methods. Our approach is based on a modification of the concept of coverage in coverage-based TCP.
We design a customized neural network based defect prediction method to estimate the fault-proneness of a code unit. This method is customized to work when only a small set of bugs is available and utilizes the information from all versions of the source code history.
We present an extension of the Defects4J dataset. For each program version included in Defects4J, our extended dataset contains the test coverage data and computed values of some source code metrics of that version. The source code metric data is used by the developed defect prediction method.
We present an empirical evaluation using five open-source projects containing 357 versions in total. Results show that our proposed modification can improve existing coverage-based TCP techniques.
The rest of the paper is organized as follows: Section 2 presents the background material. Section 3 presents our approach of solving the problem and our proposed method. Section 4 presents the setup of our empirical evaluation and Section 5 shows the results of our experiments. In Section 6 the empirical results and threats to the validity of this study are discussed. Section 7 summarizes the most related work to this paper. Finally, Section 8 contains the conclusions and future work of this paper.
In this section, we present the definition of test case prioritization and briefly introduce coverage-based test case prioritization methods. We continue by providing background information on defect prediction, which is related to our study.
2.1 Test case prioritization
Consider a test suite $T$ containing the set of test cases $\{t_1, t_2, \ldots, t_n\}$. The TCP problem is formally defined as follows Elbaum et al. (2002):
Given: $T$, a test suite; $PT$, the set of permutations of $T$; $f$, a function from $PT$ to the real numbers.
Problem: Find $T' \in PT$ such that (this relation is expressed using Z notation's first-order logic Woodcock and Davies (1996)):
$$(\forall T'')(T'' \in PT)(T'' \neq T')\,[f(T') \geq f(T'')] \tag{1}$$
In other words, the TCP problem is to find a permutation $T'$ such that $f(T')$ is maximum. Here $f$ is a scoring function that, given a permutation from $PT$, assigns a score to that permutation.
Users of TCP methods can have different goals, such as testing high-risk components as soon as possible, maximizing code coverage, and detecting faults at a faster rate. The function $f$ represents the goal of a TCP activity.
The APFD (Average Percentage of Faults Detected) goal function, which measures how quickly a test suite can detect faults, is frequently used in the literature when the goal of TCP is maximizing the fault detection rate Rothermel et al. (1999); Yoo and Harman (2012); Engstrom et al. (2011); Catal and Mishra (2013). APFD is defined as follows: Denote the set of all failed test cases in $T$ as $\{f_1, f_2, \ldots, f_k\}$, where $k$ is the number of failed test cases and the index of test case $f_i$ in $T'$ is $TF_i$. The APFD target function is formulated as Rothermel et al. (2001):
$$APFD(T') = 1 - \frac{TF_1 + TF_2 + \cdots + TF_k}{nk} + \frac{1}{2n} \tag{2}$$
For a permutation in which the failed test cases are executed earlier, the values of $TF_1$ to $TF_k$ are smaller, so the APFD value is larger.
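As an illustration of the APFD definition above, the following minimal sketch computes the score of a given order, treating each failed test case as one detected fault. The function name `apfd` and the list-based representation of the order are assumptions made for this example only.

```python
def apfd(order, failed):
    """APFD of a prioritized order (hypothetical helper, not the authors' code).

    order  -- list of test identifiers in execution order
    failed -- collection of test identifiers that fail
    """
    n = len(order)
    # 1-based position of every test case in the prioritized order
    position = {test: idx for idx, test in enumerate(order, start=1)}
    ranks = [position[test] for test in failed]
    k = len(ranks)
    if n == 0 or k == 0:
        raise ValueError("APFD needs at least one test and one failed test")
    return 1 - sum(ranks) / (n * k) + 1 / (2 * n)
```

With four test cases and a single failure, running the failing test first gives the highest attainable score, and running it last gives the lowest.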
2.2 Coverage-based test case prioritization
In order to measure coverage, the source code is partitioned into hierarchical units such as packages, files, methods, and statements. Coverage-based TCP methods choose one level of partitioning (usually statements or methods) and define coverage over those units. Assuming a chosen level of partitioning, consider the source partitioned into units $\{u_1, u_2, \ldots, u_m\}$.
For each test case $t_i$ and unit $u_j$ of the code, $Cover(i,j)$ denotes whether test case $t_i$ covers unit $u_j$. The amount of coverage is either 0 or 1 if the units of code are statements; however, it can also be a real number in the range $[0,1]$ if the units are methods, files, classes, or packages.
The total coverage of a test case is usually defined as follows:
$$Cover(t_i) = \sum_{j=1}^{m} Cover(i,j) \tag{3}$$
The basic idea of the total prioritization strategy is that test cases with more coverage have a higher chance of uncovering bugs. The total strategy therefore sorts test cases by the amount of source code that each test case covers. This strategy ignores the fact that some test cases might cover the same area of the code. Therefore, when test cases are sorted using this strategy, some units of code are usually run multiple times before all units are covered Elbaum et al. (2002).
On the other hand, the additional prioritization strategy assumes that running an uncovered unit of the code has higher priority than running an already covered unit. The intuition behind the additional strategy is that covering all units of the code early reveals faults sooner Elbaum et al. (2002).
Coverage information can be collected in two ways. Dynamic coverage information is collected by executing the program and tracking every unit that is executed. In contrast, static coverage is derived by static analysis on the source code Mei et al. (2012). We apply dynamic coverage in our study, as it is generally more accurate than static coverage and usually leads to more effective prioritization results.
2.3 Defect Prediction
It is frequently observed that some areas of the code are relatively more fault-prone (i.e., more defects occur in those areas throughout the development of the software). This happens due to various characteristics of these code areas, such as relative complexity of the implementation, higher code churn, and faulty designs.
Issue trackers and bug databases contain important information about software failures. This wealth of information can be analyzed by defect prediction methods to identify fault-prone areas of the code. Defect prediction applies machine learning to analyze the bug history of software and produce a prediction model for fault-proneness.
Defect prediction methods often include the following conceptual steps Nam (2014):
Feature extraction: In this step, metrics from the code and development process and other sources of information are extracted as a feature vector for each unit of the code (package, file, class or method). Moreover, the number of previous bugs related to each unit of the code is extracted and stored.
Model learning: Data extracted from the previous step is fed to a machine learning or data mining algorithm to learn a prediction model. While creating this prediction model, the metrics extracted in the previous step are used as the feature vector and the number of previous bugs related to each code unit is used as the target function.
Validation/Prediction: The model learned from the previous step can now be used to assign each unit of code a fault-proneness score. Some part of the data (namely, the validation set) is usually withdrawn from the learning procedure. The validation set is used to validate the prediction strength of the model.
There are many studies using static code metrics for defect prediction Menzies et al. (2010, 2007); Zimmermann et al. (2007). Other metrics, such as historical and process related metrics (e.g., number of past bugs Kläs et al. (2010); Ostrand et al. (2005) or number of changes Hassan and Holt (2005); Pinzger et al. (2008); Meneely et al. (2008); Moser et al. (2008)) and organizational metrics (e.g., number of developers Weyuker et al. (2008); Graves et al. (2000)), have also been used.
In this section, we introduce our proposed approach for TCP. Since our approach is a modification of existing TCP methods, we first review the existing methods whose modified versions we evaluate in our empirical study.
3.1 Review of traditional random, total, and additional TCP methods
In this subsection, we review three traditional TCP strategies, which we compare with their modified counterparts in our empirical study.
3.1.1 Random strategy
3.1.2 Total strategy for TCP
The total strategy for TCP begins by computing the total coverage of all test cases according to Equation 3. In the next step, test cases are sorted by their total coverage so that the first test case has the highest total coverage. Compared with other existing non-random strategies, this strategy is simple and efficient. The time complexity of the total algorithm is the time complexity of computing the total coverage for all test cases plus the time complexity of the sorting algorithm used. Summing these terms gives a time complexity of $O(nm + n \log n)$, where $n$ is the number of test cases and $m$ is the number of code units.
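The total strategy can be sketched in a few lines. The sketch below assumes coverage is given as a dictionary mapping each test case to its row of per-unit coverage values; the function name `total_prioritize` and this data layout are illustrative assumptions.

```python
def total_prioritize(coverage):
    """Order test cases by descending total coverage (Equation 3).

    coverage -- dict mapping a test name to a list of per-unit coverage values
    """
    # Total coverage of a test case is the sum of its coverage over all units.
    totals = {test: sum(row) for test, row in coverage.items()}
    # sorted() is stable, so tied test cases keep their original relative order.
    return sorted(coverage, key=lambda test: totals[test], reverse=True)
```

Computing the totals touches every entry of the coverage matrix once, and the sort dominates the remaining cost, which matches the complexity discussed above.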
3.1.3 Additional strategy for TCP
Like the total strategy, the additional strategy begins by computing the total coverage of all test cases. Afterwards, a greedy algorithm is used to prioritize the test cases. In each step of this algorithm, the test case that has the highest coverage over the uncovered code area is chosen as the next test case. The selected test case is then appended to the end of the ordered list of test cases and marked so that it is not chosen in later steps. Moreover, the area of the code covered by the chosen test case is marked as covered.
This strategy works in $n$ steps, where $n$ is the number of test cases. In each step, selecting the next test case and updating the coverage of the remaining test cases is done in $O(nm)$, where $m$ is the number of code units and computing the updated coverage of a single test case takes $O(m)$. Therefore, the total time complexity of this algorithm is $O(n^2 m)$.
The additional strategy can be implemented in different variations. In two situations, this strategy faces different options:
When more than one unselected test case has the highest coverage over the uncovered code area: in this case, one of these test cases should be selected by some criterion. For example, one might select the test case randomly or select the test case with the higher coverage over the whole code area (i.e., covered and uncovered).
When all areas of the code are covered by the test cases that have already been selected: in this case, the coverage of all the remaining test cases over the uncovered area would be 0. Again, the remaining test cases can be ordered by different criteria. For instance, one might order them randomly or by their total coverage. Another option is to consider all the code uncovered and repeat the algorithm with the remaining test cases Elbaum et al. (2002).
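The greedy procedure and the two design choices above can be sketched as follows. This is one possible variation, assuming binary coverage given as sets of unit identifiers: ties are broken by total coverage, and once no remaining test case adds new coverage, all code is considered uncovered again. The function name `additional_prioritize` is hypothetical.

```python
def additional_prioritize(coverage):
    """Greedy additional strategy over binary coverage.

    coverage -- dict mapping a test name to the set of unit ids it covers
    """
    remaining = dict(coverage)
    all_units = set().union(*coverage.values())
    uncovered = set(all_units)
    order = []
    while remaining:
        # Highest coverage over the uncovered area; ties go to higher total coverage.
        best = max(remaining,
                   key=lambda t: (len(remaining[t] & uncovered), len(remaining[t])))
        if not (remaining[best] & uncovered) and uncovered != all_units:
            # No remaining test adds coverage: treat all code as uncovered and retry.
            uncovered = set(all_units)
            continue
        order.append(best)
        uncovered -= remaining.pop(best)
    return order
```

In the test data used to check this sketch, two leftover test cases add nothing once everything is covered, so the reset rule decides their relative order.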
3.2 Proposed approach
In this subsection, we describe our proposed approach and the rationale behind it.
We have two main motivations for proposing our approach. First, previous studies on defect prediction have suggested that defect prediction methods be leveraged for automated tasks, such as test case prioritization Lewis et al. (2013). Second, developers usually tend to test first those parts of the program that are more likely to be faulty; however, existing TCP methods generally do not consider this tendency. By incorporating the fault-proneness score, estimated by learned defect prediction models, into TCP methods, we address both motivations.
3.2.2 Modified coverage
Assuming that we can extract prior knowledge on the fault probability of the code units, we propose a modified coverage formula that incorporates this prior knowledge. Given that the probability of a fault existing in unit $u_j$ ($1 \leq j \leq m$) is $P(F_j)$ (where $F_j$ denotes the event that the $j$th code unit is faulty and $P(F_j)$ is the probability of this event), we propose the following modified coverage formula to compute the coverage for the test case $t_i$:
$$FaultBasedCover(t_i) = \sum_{j=1}^{m} Cover(i,j) \times P(F_j) \tag{4}$$
This formula assigns more weight to units with a higher fault probability, which gives these units higher priority.
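A minimal sketch of this weighted coverage, assuming one row of the coverage matrix and one fault probability per code unit are available as plain lists (the function name mirrors FaultBasedCover but is otherwise an assumption):

```python
def fault_based_cover(cover_row, fault_prob):
    """Coverage of one test case, weighted by per-unit fault probability.

    cover_row  -- coverage values of the test case, one per code unit
    fault_prob -- fault probabilities, one per code unit
    """
    if len(cover_row) != len(fault_prob):
        raise ValueError("one probability is needed per code unit")
    return sum(c * p for c, p in zip(cover_row, fault_prob))
```

When every unit has probability 1, the result reduces to the traditional total coverage of the test case.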
To utilize this modified coverage formula, we need a probability function that estimates the probability of a defect existing in each unit of the source code. To this end, we designed an appropriate defect prediction method, which we present in Section 3.2.3. The defect prediction method assigns a fault-proneness score $DP_j$ to every unit $u_j$ in the code. In order to incorporate the fault-proneness score into the traditional TCP methods, one option is to define the probability function as follows:
$$P(F_j) = DP_j \tag{5}$$
Using Equation 5 as the definition of the probability function implies that coverage of code units that are not predicted to be fault-prone is completely ignored. This is a downside of this definition, because our prior knowledge indicates that even parts of the code that are not predicted to be fault-prone might contain faults.
In contrast with Equation 5, we can also define the probability function as in the following equation, which ignores the fault-proneness of the code units:
$$P(F_j) = 1 \tag{6}$$
In order to avoid ignoring either the fault-proneness of the code units or the test coverage over the code units that are not predicted to be fault-prone, we introduce the following definition for the probability function, where the parameter $c_{dp} \in [0,1]$ controls the weight given to the fault-proneness score:
$$P(F_j) = (1 - c_{dp}) + c_{dp} \times DP_j \tag{7}$$
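The relation between the three definitions can be made concrete with a small sketch. It assumes the combined definition is a convex combination of the constant 1 and the fault-proneness score, controlled by a single weight in $[0,1]$; the names `fault_probability` and `c_dp` are assumptions introduced for this example.

```python
def fault_probability(dp_scores, c_dp):
    """Blend the constant 1 with fault-proneness scores using weight c_dp.

    dp_scores -- fault-proneness scores in [0, 1], one per code unit
    c_dp      -- assumed weight in [0, 1] given to the defect prediction scores
    """
    if not 0.0 <= c_dp <= 1.0:
        raise ValueError("c_dp must lie in [0, 1]")
    return [(1.0 - c_dp) + c_dp * score for score in dp_scores]
```

Under this assumption, a weight of 0 reproduces the definition that ignores fault-proneness (Equation 6), and a weight of 1 reproduces the definition that relies on the score alone (Equation 5).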
3.2.3 Proposed defect prediction method
Defect prediction methods work at various granularity levels Zimmermann et al. (2007); Hata et al. (2012); Kamei et al. (2013). Nevertheless, mainstream research on defect prediction has been mainly focused on file level defect prediction. Therefore, we designed a file/class level defect prediction method for regression test prioritization. We use the fault-proneness score of the classes as an estimation for the fault proneness score of the methods.
Figure 1 illustrates our defect prediction method. In this method, each file of the source code that had a bug in a past version was marked as buggy. A feature extractor was designed to extract source code and historical features related to bug prediction. The feature extractor is then executed on each class of the source code, resulting in a feature vector for each class. The details about the extracted features will be discussed further in Section 4.
The next step for bug prediction is to learn a prediction model relating the source code feature vectors to the fault-proneness of the class. For this purpose, a neural network with two layers was trained. For each source code class, the trained neural network yields a real number in the range $[0,1]$. This number can be interpreted as the fault-proneness score of the class.
Various classification methods have been proposed for defect prediction Lessmann et al. (2008). Our defect prediction problem differs from the usual one in specific ways that raise certain challenges. These differences can be summarized as follows:
Traditional defect prediction methods work with the assumption that source code classes do not change too much over time Menzies et al. (2010). Hence, they only consider the feature vector extracted from the last source code version and count the total number of bugs in previous versions for each class. The mentioned assumption leads to errors in the learning process and because of the role that defect prediction plays in our approach, we need to minimize such errors in our study.
The size of our set of positive samples (the number of classes marked as buggy) is small (the size and other properties of the dataset used in our study will be further discussed in Section 4.2).
To address the first difference between our defect prediction problem and a usual one, we design our method to take into account the data from all versions of the source code for each class. To this end, we put together the feature vectors of all versions of the source code, resulting in a larger training set whose size is approximately the number of classes multiplied by the number of versions.
This causes the dataset to be more unbalanced than when a single feature vector is extracted for each class, because the number of positive samples stays nearly the same while the number of negative samples is multiplied by roughly the number of versions. Moreover, the dataset includes some similar feature vectors with inconsistent labels. This happens because a non-buggy class can become buggy with a few changes; therefore, the dataset will include two very similar feature vectors for this class with different labels. We apply a neural network learning method that is appropriate for this situation.
To address the second difference between our defect prediction problem and a usual one, we propose training the neural network with the $F_1$ score as the loss function. The $F_1$ score is the harmonic mean of the precision and recall of the classification results. This score is often used whenever the number of positive samples is small. It is frequently used in the field of information retrieval, where the number of relevant documents is much smaller than the total number of documents. In our scenario, where the number of positive (buggy) class samples is much smaller than the number of negative samples, using this score as the loss function improves the learning precision.
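For reference, the sketch below computes the $F_1$ score from binary labels and predictions. Because the score is not differentiable with respect to hard predictions, it also shows one common surrogate in which predicted probabilities replace hard decisions; this "soft" variant is an assumption for illustration, and the text does not state which formulation was used for training.

```python
def f1_score(y_true, y_pred):
    """F1 from binary labels and binary predictions (0 if undefined)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # 2TP / (2TP + FP + FN) equals the harmonic mean of precision and recall.
    return 2 * tp / denom if denom else 0.0


def soft_f1_loss(y_true, probs):
    """1 - soft F1, where probabilities stand in for hard predictions (assumed surrogate)."""
    tp = sum(y * p for y, p in zip(y_true, probs))
    fp = sum((1 - y) * p for y, p in zip(y_true, probs))
    fn = sum(y * (1 - p) for y, p in zip(y_true, probs))
    denom = 2 * tp + fp + fn
    return 1.0 - (2 * tp / denom if denom else 0.0)
```

The surrogate reaches zero loss only when every buggy sample receives probability 1 and every non-buggy sample receives probability 0.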
Furthermore, we applied negative sub-sampling in the training phase of the neural network. The neural network is trained in 20 iterations. In each iteration, all the positive samples and a subset of the negative samples are used for training. The negative subset is a random subset of all negative samples whose size is proportional to the size of the positive sample set.
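The sub-sampling scheme can be sketched as a generator that yields one training set per iteration. The proportionality constant `ratio`, the fixed `seed`, and the function name are assumptions for this example; the text only states that the negative subset size is proportional to the number of positive samples.

```python
import random


def subsampled_batches(positives, negatives, iterations=20, ratio=1.0, seed=0):
    """Yield one training set per iteration: all positives plus a random negative subset.

    ratio -- assumed proportionality constant between negative and positive counts
    """
    rng = random.Random(seed)
    positives, negatives = list(positives), list(negatives)
    # Never request more negatives than exist.
    k = min(len(negatives), max(1, round(ratio * len(positives))))
    for _ in range(iterations):
        batch = positives + rng.sample(negatives, k)
        rng.shuffle(batch)
        yield batch
```

Each iteration draws a fresh negative subset, so over 20 iterations the model still sees a broad range of negative samples while every individual training set stays balanced.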
3.2.4 Proposed algorithm
Figure 2 shows an overview of our proposed TCP algorithm. As the figure shows, this algorithm starts with a training phase in which a defect prediction classifier is learned. In the next step, for each program version in the testing set, source code and historical metrics are computed and then used by the defect prediction classifier to assign a fault-proneness score to each code unit. Moreover, the code coverage of the test cases recorded in previous executions of the program is fetched. Next, the test coverage and the assigned fault-proneness scores are used to prioritize the test cases and obtain a recommended priority order. Note that different strategies could be used in this phase. For the evaluation of the algorithm, the actual test results and the recommended priority order are used to compute the APFD score.
3.2.5 Modified strategies
Once the test coverage is computed, various strategies can be used to prioritize the test cases. For example, the traditional strategies reviewed in Section 3.1 could be used for this purpose. As explained in Section 3.2.4, another option is to use a modified version of these strategies which, in addition to the test coverage data, also takes into account the fault-proneness scores of the code units.
In this paper, we propose a method to obtain such modified strategies. Specifically, we substitute the traditional coverage (as defined in Equation 3) with FaultBasedCover (introduced in Equation 4) and then perform the traditional strategies using this definition of test coverage. We call the resulting strategies the modified strategies. For example, Algorithm 1 shows the algorithm used by the modified additional strategy.
Before executing Algorithm 1, a neural network must be trained using the set of recorded bugs and previous versions of the source code (see Section 3.2.3). The resulting model is used as the input DPModel in the algorithm. Algorithm 1 consists of three major steps. The first step (line 1) runs the defect prediction method and produces an array of fault-proneness scores. The $j$th element of this array represents the fault-proneness score assigned to the $j$th code unit. In the second step (lines 2-9), the value of FaultBasedCover is calculated for each test case. In the third step (lines 10-32), test cases are ordered using a greedy algorithm. In each step of this greedy algorithm, the test case with the highest fault-based coverage over the uncovered code area is chosen as the next test case. The rest of the algorithm is similar to the traditional additional strategy explained in Section 3.1.3.
In the implementation of the additional method, the following details were considered:
Whenever there is a tie between multiple test cases with the same additional coverage, the test case with the larger total coverage is chosen to break the tie.
The additional coverage of all test cases is computed at the beginning, and whenever a test case is chosen, the coverage it contributes is subtracted from the additional coverage of the remaining test cases.
The first, second, and third steps of this algorithm run in $O(mf)$, $O(nm)$, and $O(n^2 m)$, respectively, where $n$ is the number of test cases, $m$ is the number of code units, and $f$ is the number of features used for defect prediction. Therefore, the whole algorithm runs in $O(mf + n^2 m)$ time.
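The greedy ordering step of the modified additional strategy, together with the two implementation details above, can be sketched as follows. The sketch assumes binary coverage given as sets of unit indices and takes the per-unit fault probabilities as an already computed list; it is not a transcription of Algorithm 1, and the function name is hypothetical.

```python
def modified_additional(coverage, fault_prob):
    """Greedy ordering by fault-weighted additional coverage.

    coverage   -- dict mapping a test name to the set of unit indices it covers
    fault_prob -- list with the fault probability of each unit index
    """
    # Fault-based total coverage of each test case; it also serves as the tie-breaker.
    total = {t: sum(fault_prob[j] for j in units) for t, units in coverage.items()}
    gain = dict(total)  # additional coverage, computed once and updated incrementally
    covered, order, remaining = set(), [], list(coverage)
    while remaining:
        best = max(remaining, key=lambda t: (gain[t], total[t]))
        remaining.remove(best)
        order.append(best)
        newly = coverage[best] - covered
        covered |= newly
        # Subtract the newly covered weight from every remaining test case.
        for t in remaining:
            gain[t] -= sum(fault_prob[j] for j in coverage[t] & newly)
    return order
```

With uniform probabilities the function behaves like the traditional additional strategy, whereas a high probability on a single unit can move a small test case covering that unit to the front.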
4 Empirical study
In this section, we explain our empirical study and discuss the results of our experiments.
4.1 Research Questions
In our empirical study, we answer the following research questions:
RQ1: How does the modified additional TCP strategy compare to the traditional additional TCP strategy in terms of APFD?
RQ2: How does the modified total TCP strategy compare to the traditional total TCP strategy in terms of APFD?
RQ3: How does the value of the parameter $c_{dp}$, which weights the fault-proneness score in the modified coverage, affect the effectiveness of the modified strategies?
4.2 Subjects of study
In previous TCP research, some studies have used datasets with real bugs, and others have used mutation analysis to artificially create buggy versions of the code.
Artificial bugs are created by a random process that injects bugs into the source code; hence, these bugs do not represent the behavior of a software development team. In this study, we want to propose a TCP method that finds the bugs made by the development team as soon as possible. Therefore, we limited our study to a dataset with real software bugs.
Consequently, the proposed method must be evaluated on projects with the following properties:
The project must contain a test suite that is large enough to be used for TCP.
The bugs of the project must be recorded over a long period of its development, leading to a bug database with a sufficient number of bugs.
The version control history of the project must contain the faulty versions of the software that result in failing test cases. Moreover, the failing test cases must be identifiable.
Just et al. have collected a dataset, namely Defects4J Just et al. (2014), which has the mentioned properties. Considering these criteria, Defects4J is one of the few datasets that can be used for this purpose. In its initial published version, Defects4J provided a recorded bug history of five well-known open-source Java projects that contain a considerable number of test cases, summarized in Table 1. Since the dataset represents real project development histories, we expect the results to be practically relevant.
For each recorded bug, Defects4J provides two versions of the project: first, a faulty version that contains the bug and one or more failing test cases identifying the bug; second, a version with the bug fixed and no failing test cases. Defects4J localizes and isolates each bug fix, such that the difference between the buggy version and the fixed version is a single git commit containing only bug-fix changes. This helps us locate buggy classes in each version of the source code.
| Identifier | Project name | Bugs | Test classes |
4.3 Defects4J+M: The created dataset
Although Defects4J provides an appropriate dataset of real bugs, we still need to perform some calculations on it in order to use the data to evaluate our proposed approach. The result of these calculations is an extension of the Defects4J dataset that contains computed test coverages and source code metrics for each version included in Defects4J. The new dataset can be used by researchers in various software engineering fields, such as software repair, bug prediction, software testing, and fault localization. We made this dataset, called Defects4J+M, publicly available on GitHub (https://github.com/khesoem/Defects4J-Plus-M).
The tests in the Defects4J projects are mainly written using the JUnit framework Gamma and Beck (1999). In order to obtain test coverages, we executed all test methods and measured the dynamic statement coverage. The method level was chosen as the level for partitioning the code into units, and the JaCoCo library Hoffmann (2011) was used for code coverage analysis. This library is intended for easy integration with various development tools.
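One way to turn measured statement coverage into method-level coverage values in $[0,1]$ is to take, for each method, the fraction of its statements executed by a test case. The sketch below illustrates this aggregation; the fraction-based rule, the function name, and the input layout are assumptions, since the text does not spell out how statement coverage is aggregated per method.

```python
def method_coverage(method_statements, executed):
    """Fraction of each method's statements executed by one test case.

    method_statements -- dict mapping a method name to the set of its statement ids
    executed          -- set of statement ids executed by the test case
    """
    return {
        method: (len(stmts & executed) / len(stmts) if stmts else 0.0)
        for method, stmts in method_statements.items()
    }
```

Applying this function to every test case yields one row of the coverage matrix per test, with one column per method.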
In this paper, we use a combination of static and process metrics for feature extraction. To extract static metrics, we used the free version of SourceMeter (a tool for static code analysis). SourceMeter supports five major groups of metrics (a detailed description of the SourceMeter metrics is published in its user guide Ltd. (2011)). All the metrics were computed at class level. In addition to the metrics computed using SourceMeter, we also computed the number of developers and changes per file and the number of previous bugs for each class. Table 2 contains the details of each feature group.
In order to use the computed metrics for defect prediction, we stored them in a vector that is used as the input feature vector by the defect prediction algorithm.
The source code metrics and test coverages were computed using a machine with two 10-core Intel CPUs (Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz) with 256GB RAM. It took more than 100 hours for this computer to finish the calculations. In addition to the large amount of resources and time that had to be dedicated to these calculations, we also faced some difficulties while creating the dataset that make it reasonable for us to release the dataset publicly so that other researchers can use it without facing the same difficulties. A short list of these difficulties is as follows:
In some projects, different program versions had to be built using different building tools, such as Ant, Maven, or Gradle.
In some projects, different program versions used different Java Development Kit (JDK) versions.
Computing metrics and test coverages took very long for some versions. Therefore, we had to do the calculations in parallel in order to finish them in a reasonable length of time.
A few versions of some projects could not be compiled. Moreover, the process of calculating metrics or test coverages for some projects took too long. We had to recognize such cases and ignore them if they could not be fixed.
4.4 Experimental procedure
The experiment consists of running the proposed method and the total and additional prioritization methods on the projects of the Defects4J+M dataset. In order to create the defect prediction model for the $i$th version of a project, the procedure explained in Section 3.2.3 is performed using the data from the 1st to $(i-1)$th versions of the same project. The resulting model is then used for TCP. Since creating the defect prediction model requires data from a minimum number of buggy versions, we created the model only for the more recent versions of each project. Accordingly, the evaluation is done over the last 13, 33, 50, 14, and 50 versions of the Chart, Lang, Math, Time, and Closure projects, respectively.
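The version-wise procedure can be sketched as a generator of training and target splits, where each evaluated version is paired with all versions that precede it. The function name `rolling_splits` and the list-based representation of versions are assumptions for illustration.

```python
def rolling_splits(versions, eval_last):
    """Yield (training_versions, target_version) for the last eval_last versions.

    versions  -- versions of one project in chronological order
    eval_last -- number of most recent versions to evaluate on
    """
    start = len(versions) - eval_last
    if start < 1:
        raise ValueError("each evaluated version needs at least one earlier version")
    for i in range(start, len(versions)):
        # The model for a version only sees strictly earlier versions.
        yield versions[:i], versions[i]
```

For a project with 26 versions evaluated on its last 13, the first model is trained on versions 1 to 13 and the last on versions 1 to 25.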
The defect prediction neural network is implemented in Python using the Keras and scikit-learn machine learning libraries. The TCP algorithm is also implemented in Python, using the NumPy and pandas libraries.
In this section, we present the results of our empirical study. In this regard, we provide the experimental results to answer the research questions raised in Section 4.1.
5.1 RQ1 and RQ2: Comparing modified strategies with traditional strategies
In order to answer RQ1 and RQ2, we computed the APFD score of the traditional total and additional strategies as well as that of their modified versions. For each project, the APFD value is computed over all the versions considered for evaluation in Section 4.4. Our preliminary experiments identified the value of the weighting parameter $c_{dp}$ (Section 3.2.2) at which our approach works best; we therefore compare the modified strategies with the traditional ones using this setting. The effect of changing the value of $c_{dp}$ is further discussed in Section 5.2.
The box plots in Figure 3 and Figure 4 represent the APFD score of the modified and traditional versions of additional and total strategies on each subject of study, respectively. Table 3 also shows the mean APFD score of each strategy on each project.
As Table 3 shows, the mean APFD scores of the modified strategies are higher than those of the traditional strategies in most cases, and in particular in the overall case.
Moreover, we performed a Wilcoxon signed-rank test Wilcoxon (1992) to check whether our results are statistically significant. The null hypothesis is that there is no significant difference in the performance of the modified strategies with respect to their traditional counterparts. The results of this test demonstrate that:
Overall, there is a statistically significant difference between the modified and traditional additional strategies. This means that the modified additional strategy performs significantly better than the traditional one (RQ1).
There is no significant difference between the modified and traditional total strategies (RQ2). Furthermore, Figure 4 confirms this result, since it shows that the modified total strategy does not outperform the traditional total strategy in terms of the median APFD value (note that the mean value of APFD is not shown in the box plot).
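To make the paired comparison concrete, the sketch below computes the two rank sums that underlie the Wilcoxon signed-rank test for paired APFD values. It stops at the test statistic; turning it into a p-value requires tables or a library routine such as SciPy's `scipy.stats.wilcoxon`. The function name and the example inputs are illustrative assumptions, not the data of this study.

```python
def signed_rank_sums(xs, ys):
    """Return (W+, W-): rank sums of positive and negative paired differences.

    Zero differences are dropped; tied absolute differences share their average rank.
    """
    diffs = sorted((x - y for x, y in zip(xs, ys) if x != y), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        # Positions i..j-1 hold ranks i+1..j; assign their average to each.
        for idx in range(i, j):
            ranks[idx] = (i + 1 + j) / 2
        i = j
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus
```

A large gap between the two rank sums, with most of the weight on the positive side, indicates that the first method tends to outperform the second across paired versions.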
Table 3: Mean APFD score of each strategy on each project (columns: Subject, Additional Strategy, Total Strategy).
5.2 RQ3: Investigating the effect of changing on the effectiveness of modified strategies
As mentioned in Section 3.2.2, Equation 7 estimates the fault-proneness probability as a linear combination of a constant value (1) and the fault-proneness score assigned by the defect prediction model (), using the parameter or equivalently denoted as .
The relation between the mean APFD value and is plotted in Figure 5. Each curve in this figure shows the performance (APFD) of a modified method on a specific project. To observe the behavior of the modified strategies, the value of is varied in the range . This shows how the APFD value changes in response to changing the value of .
The value of can be set by the practitioner to an appropriate value according to the project conditions. From one point of view, tunes the amount of confidence placed in the predicted fault-proneness values. In the extreme cases of the interval , setting to zero gives full confidence to the defect prediction method and setting to zero ignores the defect prediction and only takes the coverage into account. When the project provides more prior knowledge (e.g., a larger bug history), can be set to a higher value to increase the impact of this knowledge; otherwise, it should be set to a lower value.
Figure 5 shows the relation between the APFD and in modified total and modified additional strategies. Note that when the value of is set to , the modified strategy works the same as the traditional strategy. Therefore, on each curve, at the points that are higher than the leftmost point of the curve, the modified strategy performs better than the traditional one.
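The role of this parameter can be illustrated with a small sketch. It assumes, hypothetically, that each code unit receives the weight (1 - c) * 1 + c * s, where s is the predicted fault-proneness score and c in [0, 1] is a confidence parameter, and that the fault-based coverage of a test case is the sum of the weights of the units it covers. The symbol c, the function names, and this exact form are assumptions made for illustration; they are not taken from Equation 7 or Equation 4.

```python
from typing import Dict, Set


def unit_weight(score: float, c: float) -> float:
    """Blend the constant 1 with a fault-proneness score in [0, 1].

    c is a hypothetical confidence parameter: c = 0 ignores the defect
    prediction, c = 1 relies on it completely.
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie in [0, 1]")
    return (1.0 - c) * 1.0 + c * score


def fault_based_coverage(covered: Set[str], scores: Dict[str, float], c: float) -> float:
    """Sum of blended weights over the units covered by one test case.

    Units without a predicted score are assumed to have score 0.
    """
    return sum(unit_weight(scores.get(unit, 0.0), c) for unit in covered)


scores = {"a": 0.9, "b": 0.1, "c": 0.0}  # hypothetical predictions
covered = {"a", "b", "c"}
plain = fault_based_coverage(covered, scores, 0.0)    # equals the unit count
blended = fault_based_coverage(covered, scores, 0.5)
```

Under these assumptions, c = 0 gives every unit the weight 1, so the weighted coverage reduces to the traditional unit count and a strategy built on it behaves like its traditional counterpart.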
As can be observed in Figure 5, the curves show an increasing trend in most cases; therefore, it can be claimed that the modified strategy performs better when the fault-proneness score is given more weight. An exception to this claim is Figure 5(b), which shows the effect of changing on the APFD value in the modified total strategy for the Chart project. We believe that this exception occurs because of the low number of bugs (26) recorded for the Chart project, which, in turn, leads to an inaccurate defect prediction model.
6.1 Practical Considerations
The proposed modification method relies on the success of the defect prediction phase. Therefore, in practice, the method will fail when the requirements for successful defect prediction, such as a sufficiently large bug history, are not met. Moreover, the bug history must contain real bugs related to a single software development team, not artificial bugs or bugs created by mutants.
The time required for executing the defect prediction phase is of the same order of magnitude as the time needed for running the additional TCP strategy. However, the defect prediction phase need not be executed for each prioritization task, because a defect prediction model created for a recent project version can also be used for the latest version.
6.2 Threats to validity
Internal Validity. As mentioned in Section 3.2.3, our proposed method estimates the fault-proneness of methods by performing defect prediction on classes and extrapolating the fault-proneness values assigned to classes to their methods. This approach results in more false positives in the prediction of bugs (i.e., falsely reporting a method as buggy). In this regard, we measured the number of classes reported as buggy, and our observations show that only a small fraction of classes are reported as buggy, which means the ratio of false positives would be low as well. We therefore expect that this small number of false positives does not harm our results.
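The extrapolation from classes to methods can be pictured with the following sketch. The identifier format "Class#method", the function name, and the default score of 0 for unknown classes are hypothetical choices made for illustration only.

```python
from typing import Dict, List


def method_scores(class_scores: Dict[str, float], methods: List[str]) -> Dict[str, float]:
    """Assign to each method the fault-proneness score of its enclosing class.

    Method identifiers are assumed to look like "pkg.Class#method";
    classes without a prediction are assumed to have score 0.
    """
    return {m: class_scores.get(m.split("#", 1)[0], 0.0) for m in methods}


scores = method_scores(
    {"org.example.Parser": 0.8},  # hypothetical class-level prediction
    ["org.example.Parser#parse", "org.example.Parser#reset", "org.example.Util#trim"],
)
```

Every method of a class flagged as fault-prone inherits the same high score, including methods that contain no fault, which is the source of the additional false positives discussed above.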
External Validity. Our subject projects are all implemented in the Java language; therefore, the results might differ in projects developed using other languages. Furthermore, all the projects used in our empirical study are popular open-source projects and contain large test suites while projects that use our approach in the future might not have these features. We conducted the statistical test in order to make sure that our results can be generalized with confidence; however, we still have to evaluate our approach using projects with different languages and characteristics in the future to ensure the results are generalizable.
7 Related work
Many methods have been proposed for the problem of test case prioritization. Among them, the largest category is that of coverage-based methods. These methods rely on the assumption that choosing test cases with larger coverage leads to more effective fault detection. Therefore, they attempt to order the test cases so that the highest coverage is reached with the fewest executed test cases. Coverage-based TCP methods inherently solve an optimization problem, which is related to the set cover problem offutt1995 and is proved to be NP-hard Li et al. (2007). Therefore, no polynomial-time algorithm is known for computing an optimal solution to the coverage-based TCP problem (and none exists unless P = NP), and the algorithms proposed for this problem are heuristics.
Coverage is measured in two main ways: dynamic coverage and static coverage. Dynamic coverage is measured by executing the test cases and inspecting the execution trace of each test case. Static coverage is an approximation of dynamic coverage, obtained by analyzing the static structure of the source code. Dynamic code coverage is widely used in existing TCP studies; however, static code coverage has also been studied for TCP Mei et al. (2012); Zhou and Hao (2017); Zhang et al. (2009a).
Each coverage-based TCP method consists of two main building blocks: first, the coverage criterion used to measure coverage and, second, the strategy used to take the measured test case coverage into account for TCP. Regarding the first building block, various coverage criteria have been applied for coverage-based TCP. The early studies in TCP used statement coverage, branch coverage Rothermel et al. (1999), and method coverage Elbaum et al. (2000). After that, multiple other criteria have been proposed Jones and Harrold (2003); Kovács et al. (2009); Fang et al. (2014). Fang et al. compare major existing logic and fault-based coverage criteria Fang et al. (2012). Their main conclusion is that criteria with fine-grained coverage information, namely MC/DC and fault-based logic coverage criteria, have better fault detection capability. Elbaum et al. incorporate the fault index, a metric calculated using a combination of multiple measurable attributes of the source code Elbaum and Munson (1999). Their proposed method tracks variations of the fault index over regressions and prioritizes test cases that run code with higher fault indexes.
As for the second building block of TCP methods, the strategy/algorithm can be thought of as a method to solve the optimization problem underlying coverage-based TCP. The aim of this optimization problem is to maximize the coverage with the hope that a high test coverage results in a high fault detection rate. For instance, the total and additional techniques are simple greedy algorithms providing approximate solutions for this optimization problem Rothermel et al. (1999). Being simple, efficient, and effective has popularized the usage of these basic methods Hao et al. (2014). The random strategy, which simply orders the test cases randomly, is used as a baseline for comparison with techniques in this research area. Li et al. Li et al. (2007, 2010) have applied some well-known meta-heuristic optimization algorithms, including hill climbing, k-optimal greedy algorithms (refer to Li et al. (2007) for their definition), and genetic algorithms, to solve the coverage-based TCP problem. Their comparison with the total and additional prioritization algorithms under different metrics indicates that the additional prioritization algorithm and the 2-optimal greedy algorithm, despite their simplicity, are the most efficient techniques in the majority of cases, and that the differences between the additional and 2-optimal algorithms are insignificant.
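The two greedy strategies can be sketched as follows. This is a minimal illustration that assumes coverage is given as a map from each test case to the set of code units it covers; the function names, the alphabetical tie-breaking, and the reset of the uncovered set once no remaining test case adds coverage are choices made for this sketch rather than details taken from the cited studies.

```python
from typing import Dict, List, Set


def total_strategy(coverage: Dict[str, Set[str]]) -> List[str]:
    """Order test cases by the total number of units each one covers."""
    return sorted(coverage, key=lambda t: (-len(coverage[t]), t))


def additional_strategy(coverage: Dict[str, Set[str]]) -> List[str]:
    """Repeatedly pick the test case covering the most not-yet-covered units."""
    remaining = dict(coverage)
    all_units: Set[str] = set().union(*coverage.values())
    uncovered = set(all_units)
    order: List[str] = []
    while remaining:
        best = min(remaining, key=lambda t: (-len(remaining[t] & uncovered), t))
        gain = remaining[best] & uncovered
        if not gain and len(uncovered) < len(all_units):
            # No remaining test case adds coverage: start a new round.
            uncovered = set(all_units)
            continue
        order.append(best)
        uncovered -= remaining.pop(best)
    return order


coverage = {
    "t1": {"a", "b", "c"},
    "t2": {"a", "b"},
    "t3": {"d"},
}
total_order = total_strategy(coverage)            # ['t1', 't2', 't3']
additional_order = additional_strategy(coverage)  # ['t1', 't3', 't2']
```

In this example the additional strategy schedules t3 before t2 because, after t1 has run, t3 is the only test case that still covers a new unit.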
Zhang et al. Zhang et al. (2013); Hao et al. (2014) introduce strategies that mix the additional strategy and the total strategy, resulting in a spectrum of algorithms between them. Their results show that the mixed strategy may outperform both the additional strategy and the total strategy in terms of APFD. Jiang et al. Jiang et al. (2009) propose the random adaptive strategy, a variation of the greedy additional method with a technique for breaking ties. Their tie-breaking technique chooses the test case farthest away from the currently selected test cases. They report that the random adaptive strategy does not improve effectiveness in terms of APFD but has better computational performance. Hao et al. Hao et al. (2013) propose utilizing the intermediate output of the execution process to improve test-case prioritization.
Hao et al. Hao et al. (2016) focus on the coverage-based TCP problem to evaluate how much improvement is available in this problem. To this end, they formulate this problem as an integer linear programming (ILP) problem, to produce the optimal solution and study its empirical properties. This empirical study demonstrates that the optimal technique can only slightly outperform the additional coverage-based technique with no statistically significant difference in terms of coverage. However, the additional technique significantly outperforms the optimal solution in terms of either fault detection rate or execution time. Note that their algorithm is not computationally practical and is only designed to compare the optimal solution with the output of other algorithms.
Some authors have leveraged similarity and information retrieval metrics between test cases and source code to prioritize test cases. Saha et al. Saha et al. (2015) introduced a new approach to address the problem of TCP by reducing it to a standard information retrieval problem such that the differences between two program versions form the query and the test cases constitute the document collection. Noor et al. Noor and Hemmati (2015) proposed a TCP approach that uses historical failure data of test cases. Their method uses similarity between test cases considering a test case as effective if it is similar to any failed test cases in the previous versions of the source code.
Some researchers have proposed using development process information to help rank test cases. Arafeen et al. Arafeen and Do (2013) used software requirements to cluster test cases and rank them. Ledru et al. Ledru et al. (2012) proposed a method that does not assume the existence of code or specification and is based only on the text description of test cases, which may be useful in cases where the code coverage information is not available. Korel et al. Korel et al. (2005, 2008) proposed a model-based method for regression TCP, which assumes the system has been modeled using a state-based modeling language. In this method, when modifications to the source code are made, developers identify model elements (i.e., model transitions) that are related to these modifications. Then the test suite is prioritized according to the relevance of test cases to these modifications. Some studies focus on practical constraints in TCP, such as time constraints Zhang et al. (2009b); Suri and Singhal (2011); Do et al. (2010); Marijan et al. (2013); Zhang et al. (2007) and fault severity Walcott et al. (2006); Huang et al. (2012).
Engström et al. Engström et al. (2010) proposed utilizing previously fixed faults to choose a small set of test cases for regression test selection. Laali et al. Laali et al. (2016) propose an online TCP method that utilizes the locations of faults revealed by executed test cases in order to prioritize the non-executed test cases. Some studies, such as Anderson et al. (2014) and Engström et al. (2011), use the history of failed regression tests to improve future regression testing phases. Kim et al. employed methods from fault localization to improve test case prioritization. Using the observation that defects are fixed after being detected, they propose that test cases covering previous faults will have lower fault detection possibility Kim and Baik (2010). Wang et al. (2017) proposed a quality-aware test case prioritization method (QTEP), which focuses on potentially unrevealed faults. QTEP is based on CLAMI Nam and Kim (2015), an unsupervised defect prediction method. Their evaluation shows improvement of on average with respect to the basic prioritization methods on 7 open source Java projects. Recently, Paterson et al. Paterson et al. (2019) proposed a ranked-based technique to prioritize test cases based on an estimated likelihood of Java classes having bugs. This likelihood is estimated using a mixture of software development history features, such as number of revisions, number of authors, and number of fixes. Their experiments show that using their TCP method reduces the number of test cases required to find a fault on average by compared with existing coverage-based strategies.
As mentioned at the beginning of this section, coverage-based methods consist of two main building blocks: the coverage criterion and the strategy. The method proposed in this paper can be categorized as an approach that modifies the existing coverage criteria proposed for TCP by utilizing the fault-proneness score assigned by defect prediction methods. Among the mentioned studies, Wang et al. (2017) and Paterson et al. (2019) are the closest to our proposed approach. However, these two studies utilize sources of information different from those used in the current paper. Therefore, they should not be directly compared with our approach. More specifically, QTEP (Wang et al. (2017)) is an unsupervised method, while our method is supervised and uses the bug history data. Also, the method presented by Paterson et al. Paterson et al. (2019) incorporates different sources of information, such as test execution history, which is not used in this paper.
8 Conclusions and future work
In this research, we introduced a novel approach to incorporate estimates of the code units' fault-proneness into coverage-based TCP methods. For this purpose, we proposed using the fault-based coverage (introduced in Equation 4) instead of the traditional coverage (introduced in Equation 3). In order to investigate our proposed approach, we conducted an empirical study on 357 versions of five real-world projects included in the Defects4J dataset. Our evaluations show that the traditional total and additional TCP strategies are improved when they are modified according to our proposal, and that the improvement of the modified additional strategy is statistically significant.
In the future, the test case execution results recorded in the history of the development process could also be taken into account in order to improve the proposed TCP methods. Moreover, cross-project defect prediction techniques Zimmermann et al. (2009) could be used to achieve a better estimation of the code units' fault-proneness. The effect of modifying other existing coverage-based TCP strategies should also be investigated. Finally, the proposed approach can also be evaluated using larger datasets with various programming languages.
9 Conflict of interest
To the best of the authors' knowledge, there is no conflict of interest to report.
References
- Improving the effectiveness of test suite through mining historical data. In Proceedings of the 11th Working Conference on Mining Software Repositories, pp. 142–151.
- Test case prioritization using requirements-based clustering. In 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, pp. 312–321.
- Value based regression test case prioritization. In Proceedings of the world congress on engineering and computer science, Vol. 1, pp. 24–26.
- Test case prioritization: a systematic mapping study. Software Quality Journal 21 (3), pp. 445–478.
- The effects of time constraints on test case prioritization: a series of controlled experiments. IEEE Transactions on Software Engineering 36 (5), pp. 593–617.
- Software evolution and the code fault introduction process. Empirical Software Engineering 4 (3), pp. 241–262.
- Test case prioritization: a family of empirical studies. IEEE Transactions on Software Engineering 28 (2), pp. 159–182.
- Prioritizing test cases for regression testing. In Proceedings of the 2000 ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA ’00, New York, NY, USA, pp. 102–112.
- Selecting a cost-effective test case prioritization technique. Software Quality Journal 12 (3), pp. 185–210.
- Improving regression testing transparency and efficiency with history-based prioritization–an industrial case study. In Software Testing, Verification and Validation (ICST), 2011 IEEE Fourth International Conference on, pp. 367–376.
- An empirical evaluation of regression testing based on fix-cache recommendations. In 2010 Third International Conference on Software Testing, Verification and Validation, pp. 75–78.
- Similarity-based test case prioritization using ordered sequences of program entities. Software Quality Journal 22 (2), pp. 335–361.
- Comparing logic coverage criteria on test case prioritization. Science China Information Sciences 55 (12), pp. 2826–2840.
- JUnit: a cook’s tour. Java Report 4 (5), pp. 27–38.
- Predicting fault incidence using software change history. IEEE Transactions on Software Engineering 26 (7), pp. 653–661.
- A unified test case prioritization approach. ACM Transactions on Software Engineering and Methodology (TOSEM) 24 (2), pp. 10.
- To be optimal or not in test-case prioritization. IEEE Transactions on Software Engineering 42 (5), pp. 490–505.
- Adaptive test-case prioritization guided by output inspection. In 2013 IEEE 37th Annual Computer Software and Applications Conference, pp. 169–179.
- The top ten list: dynamic fault prediction. In Software Maintenance, 2005. ICSM’05. Proceedings of the 21st IEEE International Conference on, pp. 263–272.
- Bug prediction based on fine-grained module histories. In Proceedings of the 34th International Conference on Software Engineering, pp. 200–210.
- EclEmma-jacoco java code coverage library. http://eclemma.org/jacoco/index.html
- A history-based cost-cognizant test case prioritization technique in regression testing. Journal of Systems and Software 85 (3), pp. 626–637.
- Adaptive random test case prioritization. In Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, pp. 233–244.
- Test-suite reduction and prioritization for modified condition/decision coverage. IEEE Transactions on Software Engineering 29 (3), pp. 195–209.
- Defects4J: a database of existing faults to enable controlled testing studies for java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, pp. 437–440.
- A large-scale empirical study of just-in-time quality assurance. IEEE Transactions on Software Engineering 39 (6), pp. 757–773.
- An effective fault aware test case prioritization by incorporating a fault localization technique. In Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 5.
- Transparent combination of expert and measurement data for defect prediction: an industrial case study. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 2, pp. 119–128.
- Application of system models in regression test suite prioritization. In Software Maintenance, 2008. ICSM 2008. IEEE International Conference on, pp. 247–256.
- Test prioritization using system models. In Software Maintenance, 2005. ICSM’05. Proceedings of the 21st IEEE International Conference on, pp. 559–568.
- Optimal string edit distance based test suite reduction for sdl specifications. In International SDL Forum, pp. 82–97.
- Test case prioritization using online fault detection information. In Ada-Europe International Conference on Reliable Software Technologies, pp. 78–93.
- Prioritizing test cases with string distances. Automated Software Engineering 19 (1), pp. 65–95.
- Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Transactions on Software Engineering 34 (4), pp. 485–496.
- Does bug prediction support human developers? findings from a google case study. In Proceedings of the 2013 International Conference on Software Engineering, pp. 372–381.
- A simulation study on some search algorithms for regression test case prioritization. In Quality Software (QSIC), 2010 10th International Conference on, pp. 72–81.
- Search Algorithms for Regression Test Case Prioritization. IEEE Transactions on Software Engineering 33 (4), pp. 225–237.
- SourceMeter - free-to-use, advanced source code analysis tool. https://www.sourcemeter.com/resources/java
- Test case prioritization for continuous regression testing: an industrial case study. In 2013 IEEE International Conference on Software Maintenance, pp. 540–543.
- A static approach to prioritizing junit test cases. IEEE Transactions on Software Engineering 38 (6), pp. 1258–1275.
- Predicting failures with developer networks and social network analysis. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering, pp. 13–23.
- Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering 33 (1), pp. 2–13.
- Defect prediction from static code features: current results, limitations, new approaches. Automated Software Engineering 17 (4), pp. 375–407.
- A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In Software Engineering, 2008. ICSE’08. ACM/IEEE 30th International Conference on, pp. 181–190.
- Clami: defect prediction on unlabeled datasets (t). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 452–463.
- Survey on software defect prediction. Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Tech. Rep.
- A similarity-based approach for test case prioritization using historical failure data. In 2015 IEEE 26th International Symposium on Software Reliability Engineering (ISSRE), pp. 58–68.
- Predicting the location and number of faults in large software systems. IEEE Transactions on Software Engineering 31 (4), pp. 340–355.
- An empirical study on the use of defect prediction for test case prioritization. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), pp. 346–357.
- Can developer-module networks predict failures?. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering, pp. 2–12.
- Test case prioritization: an empirical study. In Software Maintenance, 1999.(ICSM’99) Proceedings. IEEE International Conference on, pp. 179–188.
- Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering 27 (10), pp. 929–948.
- An information retrieval approach for regression test prioritization based on program changes. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1, pp. 268–279.
- Analyzing test case selection & prioritization using aco. ACM SIGSOFT Software Engineering Notes 36 (6), pp. 1–5.
- Timeaware test suite prioritization. In Proceedings of the 2006 international symposium on Software testing and analysis, pp. 1–12.
- QTEP: quality-aware test case prioritization. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 523–534.
- Do too many cooks spoil the broth? using the number of developers to enhance defect prediction models. Empirical Software Engineering 13 (5), pp. 539–559.
- Individual comparisons by ranking methods. In Breakthroughs in statistics, pp. 196–202.
- Using z: specification, refinement, and proof. Vol. 39, Prentice Hall Englewood Cliffs.
- Regression testing minimization, selection and prioritization: a survey. Software Testing, Verification and Reliability 22 (2), pp. 67–120.
- Bridging the gap between the total and additional test-case prioritization strategies. In Proceedings of the 2013 International Conference on Software Engineering, pp. 192–201.
- Prioritizing junit test cases in absence of coverage information. In 2009 IEEE International Conference on Software Maintenance, pp. 19–28.
- Time-aware test-case prioritization using integer linear programming. In Proceedings of the eighteenth international symposium on Software testing and analysis, pp. 213–224.
- Test case prioritization based on varying testing requirement priorities and test case costs. In Seventh International Conference on Quality Software (QSIC 2007), pp. 15–24.
- Impact of static and dynamic coverage on test-case prioritization: an empirical study. In 2017 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pp. 392–394.
- Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, pp. 91–100.
- Predicting defects for eclipse. In Predictor Models in Software Engineering, 2007. PROMISE’07: ICSE Workshops 2007. International Workshop on, pp. 9–9.