How Different is Test Case Prioritization for Open and Closed Source Projects?

08/03/2020 ∙ by Xiao Ling, et al. ∙ NC State University

Improved test case prioritization means that software developers can detect and fix more software faults sooner than usual. But is there one "best" prioritization algorithm? Or do different kinds of projects deserve special kinds of prioritization? To answer these questions, this paper applies nine prioritization schemes to 31 projects that range from (a) highly rated open-source Github projects to (b) computational science software to (c) a closed-source project. We find that prioritization approaches that work best for open-source projects can work worst for the closed-source project (and vice versa). From these experiments, we conclude that (a) it is ill-advised to always apply one prioritization scheme to all projects since (b) prioritization requires tuning to different project types.


1 Introduction

Regression testing is widely applied in both open-source and closed-source projects [13, 28, 30]. When software comes with a large regression suite, developers can check whether their new changes damage old functionality.

Excessive use of regression testing can be expensive and time consuming, especially if it is run after each modification to the software. Such high-frequency regression testing can consume as much as 80 percent of the testing budget and require half of the software maintenance effort [6].

To reduce the cost of performing regression testing, test case prioritization (TCP) is widely studied in software testing. In this approach, some features are extracted from prior test suites and test results and then applied to prioritize the current round of tests. Google reports that test case prioritization can reduce the time for programmers to find 50% of the failing tests from two weeks to one hour [12].

Prioritization schemes that work on some projects may fail on others. As shown later in this paper, not all projects track the information required for all the different prioritization algorithms. For example, suppose closed-source projects are prioritized using the textual descriptions of the test cases. That approach may not always work for open-source projects where such textual descriptions may be absent. Previously, Yu et al. [49] reported that the TERMINATOR test case prioritization algorithm was better than dozens of alternatives. However, TERMINATOR was developed for closed-source proprietary software. This raises the question: does TERMINATOR work for other kinds of projects (e.g. open-source projects)?

To explore this issue, this paper applies test case prioritization schemes to data from a closed-source proprietary project and 30 open-source projects. To the best of our knowledge, this study explores more prioritization algorithms applied to more kinds of data than prior work. Using that data, we answer the following research questions.
RQ1: What is the best algorithm for the closed-source project? We find that we can reproduce prior results:

As seen before, the TERMINATOR prioritization scheme works best for that closed-source project.

RQ2: What is the best algorithm for open-source projects? While our RQ1 results concurred with past work, RQ2 shows that closed-source prioritization methods should not be applied to open-source projects:

For open-source projects, the best approach is not TERMINATOR, but rather to prioritize using either passing times since last failure or another exponential metric (defined in §3.4).

RQ3: Do different prioritization algorithms perform differently on the open-source projects and the closed-source project? Combining RQ1 and RQ2, we can assert:

Test case prioritization schemes that work best for the industrial closed-source project can work worse for open-source projects (and vice versa).

The rest of this paper is structured as follows. Section 2 describes related work and Section 3 explains our experimental methods. Section 4 shows answers to the above questions. This is followed by some discussion in Section 5 and a review of threats to validity in Section 6. Section 7 offers our conclusion, which is:

It is ill-advised to always apply one prioritization scheme to all projects since prioritization requires tuning to different project types.

To say that another way, prioritization schemes should always be re-assessed using local data. To simplify that process, we have made available on-line all the scripts and data used in this study (https://github.com/ai-se/TCP2020). Note that those scripts include all the major prioritization schemes seen in the current literature.

2 Background

2.1 Definitions

This paper shows that the “best” prioritization differs between closed-source proprietary projects and open-source projects. These projects can be distinguished as follows:

  • Open-source projects are developed and distributed under terms that permit free redistribution, modification, and full access to the source code [24, 39].

  • Closed-source projects are proprietary software, where modification and republishing are restricted to authorized users under a permission agreement [42].

As to the sites where we collect data:

  • GitHub is a hosting service for software development and version control. Free GitHub accounts are commonly used to host open-source projects. As of January 2020, GitHub reports having over 40 million users and more than 100 million repositories (including at least 28 million public repositories), making it the largest host of source code in the world.

  • TravisTorrent is a public data set containing vanilla API data (build information), the build log analysis (tests information), plus repository and commit data [3].

# Scheme # Closed # Open Year Venue Citations
Prioritizing Test Cases For Regression Testing [41] 9 0 8 2001 TSE 1345
Test case prioritization: A family of empirical studies [9] 18 0 8 2002 TSE 994
Search algorithms for regression test case prioritization [27] 4 0 6 2007 TSE 739
A history-based test prioritization technique … [22] 1 0 8 2002 ICSE 461
Adaptive random test case prioritization [20] 9 0 11 2009 ASE 222
System Test Case Prioritization of New and Regression Test Cases [44] 1 0 0 2005 ESEM 223
Techniques for improving regression testing in continuous integration… [12] 1 1 0 2014 FSE 187
A clustering approach to improving test case prioritization… [4] 1 1 0 2011 ICSM 97
Test case prioritization for black box testing [38] 2 2 0 2007 COMPSAC 94
Test case prioritization for continuous regression testing… [33] 1 1 0 2013 ICSM 87
Prioritizing test cases for resource constraint environments… [13] 2 0 7 2009 ICCSIT 32
Prioritizing manual test cases in traditional & rapid release environments [16] 3 0 1 2015 ICST 30
History-based test case prioritization for failure information [7] 1 0 2 2016 APSEC 11
Test re-prioritization in continuous testing environments [52] 1 2 0 2018 ICSME 10
  This paper 9 1 30 2020 TSE
TABLE I: Summary of literature. “# Scheme” shows the number of prioritization methods studied. “# Closed” and “# Open” show how much data was used (measured in number of projects).
Fig. 1: Framework for our System

2.2 Why Study Test Case Prioritization?

In software development, regression testing is very important in detecting software faults. However, it is also widely recognized as an expensive process. The most helpful approach to reduce computational cost and reveal faults earlier is called test case prioritization [27, 40, 9, 10, 8]. Better test case prioritization is useful since:

  • When developers extend a code base, they can check that their new work does not harm old functionality.

  • This, in turn, enables a rapid release process where developers can safely send new versions of old software to users each week (or even each day).

  • Faults can be revealed earlier than in normal execution, which significantly increases the efficiency and reduces the cost of regression testing. Moreover, within a time limit, more faults can be detected by performing test case prioritization [13, 28, 15, 33].

  • Test managers can locate and fix faults earlier than under normal execution by applying test case prioritization [32].

There are many scenarios where the test case prioritization results of this paper can be applied. According to Zemlin [50], 80 percent of current software projects are open-source projects. Some of these projects have very large test suites. To maintain the stability of a project, developers want to detect more faults in a limited time after each modification. For that purpose, test case prioritization is widely applied in regression testing. Therefore, a well-performing prioritization algorithm for open-source projects is in high demand, one that lets project developers:

  • Detect more faults within a period of time.

  • Start to fix software bugs earlier than usual.

Test case prioritization is popular not only in open-source projects but also in industrial closed-source projects. For example, LexisNexis is an industrial company that provides legal research, risk management, and business research services [46]. The Lexis Advance platform is maintained by a set of automated UI tests, which is a case of regression testing. Such testing tasks are very expensive in execution time: Yu et al. state that the automated UI test suite that LexisNexis uses takes approximately 30 hours to execute [49]. Therefore, LexisNexis seeks a prioritization algorithm that can help developers to

  1. Test software more often, then ship more updates to customers, at a faster rate;

  2. Save time when waiting for feedback on the last change [49].

2.3 Who studies Test Case Prioritization?

For all the above reasons, many researchers explore test case prioritization approaches. For example:

  • Yu et al. introduced an active learning based framework, TERMINATOR, which uses a Support Vector Machine classifier to achieve higher fault detection rates on automated UI testing [49].

  • Hemmati et al. propose a risk-driven clustering method that assigns the highest risk to the tests that failed in the version immediately before the current version. Tests that failed two versions before the current version are assigned the next level of risk, and so on [16].

  • Fazlalizadeh et al. propose a test case fault detection performance approach, which calculates the ratio of the number of times a test case failed to the total number of times it was executed [13].

  • Kim et al. claim that the selection probability of each test case at each test run is useful in prioritization. They propose an “Exponential Decay Metric” (defined later in this paper) which calculates selection probabilities from weighted individual history observations [22].

  • Zhu et al. and Cho et al. study the correlations between pairs of test cases. They introduce different test case prioritization approaches based on different correlation information: Zhu et al. propose co-failure distributions, while Cho et al. use the flipping history of two test cases [52, 7].

  • Li et al. study five search techniques (Hill Climbing, Genetic Algorithm, Greedy, Additional Greedy, and 2-Optimal Greedy) for code-coverage-based prioritization [27].

  • Elbaum et al. use four approaches based on the function coverage of test cases. They point out that different testing scenarios should apply an appropriate prioritization approach [11].

  • For more examples, see [20, 10, 31, 32, 8, 51, 19].

2.4 How to Study Test Case Prioritization?

In order to base this work on current methods in the literature, we build on two literature reviews of test case prioritization. In March 2019, Yu et al. explored 1033 papers using incremental text mining tools and found a list of prioritization algorithms that covered the most important methods in this area [49]. To confirm and extend that finding for different types of projects, in May 2020 we conducted our own review. Beginning with papers from senior SE venues (as defined by Google Scholar Metrics “software systems”), we searched for highly cited or recent papers studying test case prioritization. For our purposes, “highly cited” means at least 10 citations per year since publication. This search found a dozen high-profile test prioritization papers from the last 10 years. To that list, we used our domain knowledge to add two papers that we believe are the most influential early contributions to this work. The final list of 14 papers is shown in Table I.

Based on the papers in Table I and the study of Yu et al., we assert that the following information is usually used by test case prioritization methods. Note that any term in italics is defined later in this paper (see §3.4).

  • Time since last failure: Prioritize test cases using the number of consecutive non-failing builds [16, 12].

  • Failure rate: Prioritize test cases by the ratio of total failure times over total execution times [13].

  • Exponential Decay Metrics: Prioritize test cases by applying Exponential Decay Metrics, which adds weights in execution history [22].

  • ROCKET Metrics: Prioritize test cases by applying ROCKET Metrics [33].

  • Co-failure: Prioritize test cases by Co-failure distribution information [52].

  • Flipping History: Prioritize test cases by the correlations of flipping history [7].

  • TERMINATOR: An active learning method [49].

For our study, we implement the above algorithms to discover the best approach for the closed-source project and the open-source projects.

3 Methods

Our overall system framework is described in Figure 1. This section offers details on that framework.

3.1 Data Collection

Test Criteria
Developers 7
Pull Requests 0
Commits 20
Releases 1
Issues 10
Duration 1 year
Has Travis CI True
Total Builds 500
Useful Builds 100
Failed Test Cases 50
TABLE II: Sanity Check. From [45]
(a) Summary of 10 CS projects
Feature Min Median IQR Max
Developers 8 39 57 188
Commits 2658 6189 10067 43627
Releases 1 15 21 167
Issues 310 827 667 3047
Duration (week) 137 292 197 529
Total Builds 758 5094 5017 24692
Useful Builds 193 719 991 7579
Failed Test Cases 74 530 261 3554

(b) Summary of 20 SE projects
Feature Min Median IQR Max
Developers 24 124 258 4020
Commits 969 14446 20939 77152
Releases 23 95 179 426
Issues 192 2369 2882 13848
Duration (week) 342 470 81 636
Total Builds 206 2703 2597 19447
Useful Builds 117 262 406 8794
Failed Test Cases 50 93 111 5517
TABLE III: Summary of projects used in this study. (IQR = 75th−25th percentile.)

For the closed-source project, we use the data set from Yu et al. [49]. For open-source projects, we searched GitHub. Many projects in GitHub are very small or no longer maintained, and may not have enough information for our experiments. To avoid these traps, we apply the GitHub “sanity check” introduced in the literature [21, 37, 45]. Our selection criteria are shown in Table II. Most of the GitHub conditions in Table II are straightforward, but the last four conditions need explanation (a small filtering sketch follows this list):

  • Has Travis CI: We use the Travis CI API for collecting repository and build log information. Travis CI can let project developers test their applications and record testing information. Therefore, our ideal projects must implement Travis CI for the testing purpose.

  • Total Builds: In GitHub project development, some builds may not trigger regression testing. In our experiments, these builds are discarded since they are not needed for testing purposes.

  • Useful Builds: Among the total builds, some builds pass all test cases. Since we aim to prioritize failing test cases, we ignore builds in which every test passes. Moreover, we discard broken builds that may occur in Travis CI. The remaining builds are defined as useful builds.

  • Failed Test Cases: We count all failed test cases in the entire project. If a project has a very small number of failed test cases, then it is not suitable for our experiments.
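To illustrate how the sanity check is applied, the sketch below filters candidate repositories against the Table II values. The field names, and the treatment of each value as an inclusive minimum, are assumptions made for this illustration; they are not the exact GitHub or Travis CI API fields.

```python
# Illustrative sketch of the Table II sanity check. Whether each bound is
# strict or inclusive is not specified in the text; this sketch uses >=.
TABLE_II_MINIMUMS = {
    "developers": 7, "pull_requests": 0, "commits": 20, "releases": 1,
    "issues": 10, "duration_years": 1, "total_builds": 500,
    "useful_builds": 100, "failed_test_cases": 50,
}

def passes_sanity_check(repo_summary):
    """repo_summary: dict of per-repository counts plus a 'has_travis_ci' flag."""
    if not repo_summary.get("has_travis_ci", False):
        return False
    return all(repo_summary.get(field, 0) >= minimum
               for field, minimum in TABLE_II_MINIMUMS.items())
```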

In order to ensure a diversity of open-source projects, we divided the projects found in this way into different populations:

  • We explored the “usual suspects”; i.e. projects that satisfy the sanity checks of Table II. Note that many of these projects have been used before in other publications. We call this first group the general software engineering group (hereafter, SE).

  • We also explored software from the computational science community. The computational science (hereafter, CS) field studies and develops software to explore astronomy, astrophysics, chemistry, economics, genomics, molecular biology, oceanography, physics, political science, and many engineering fields [45].

After the above analysis, we found ten projects from computational science and twenty projects from software engineering that are suitable for our analysis: see Table III.

Note that Table III does not include data from the software used in the Yu et al. case study [49]. Since that code is proprietary, we cannot offer extensive details on that project.

3.2 Data Preprocessing

We used the Travis CI API to extract GitHub repository information and build log information. In most cases, the Travis CI API returns test builds in consecutive order, so we did not need to re-rank the test builds after data collection. After obtaining information on failed test cases and test builds, we used a Python script to convert the repository data and the build log data into a build-to-test record table for each project.
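A minimal sketch of that conversion step is shown below. The input format (a list of builds, each with per-test outcomes) is an assumed simplification of the parsed build logs, not the exact Travis CI API schema.

```python
# Illustrative sketch: convert parsed build logs into a build-to-test record
# table (one row per useful build, one column per test case).
def build_to_test_table(builds):
    """builds: list of (build_id, {test_name: 'Pass' or 'Fail'}) in build order."""
    all_tests = sorted({name for _, results in builds for name in results})
    table = []
    for build_id, results in builds:
        row = {"build_id": build_id}
        for name in all_tests:
            # assumption: tests absent from a build are recorded as passing
            row[name] = results.get(name, "Pass")
        table.append(row)
    return table
```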

Group ID Information Utilized Algorithm Algorithm Description
A None A1 Prioritize test cases randomly.
A2 Prioritize test cases optimally.
B Execution History B1 Prioritize test cases by the information of time since last failure.
B2 Prioritize test cases by the failure rate.
B3 Prioritize test cases by Exponential Decay Metrics.
B4 Prioritize test cases by ROCKET Metrics.
C Execution History, Feedback Information C1 Prioritize test cases by co-failure information.
C2 Prioritize test cases by flipping history.
D Execution History, Feedback Information D1 Prioritize test cases by TERMINATOR with execution history feature.
TABLE IV: Information of Test Case Prioritization Algorithms

3.3 Performance Metric

For the evaluation of prioritization algorithms, we measure fault detection rates. Rothermel et al. [41] state that an improved fault detection rate provides feedback faster than usual, which allows developers to correct faults earlier than they otherwise would. Their preferred measurement is the weighted average of the percentage of faults detected (APFD). APFD calculates the area inside the curve that interpolates the gain in the percentage of detected faults [41]. It is calculated as follows:

APFD = 1 - \frac{TF_1 + TF_2 + \cdots + TF_m}{n \cdot m} + \frac{1}{2n}     (1)

where:

  • TF_i: The rank, after prioritization, of the test case that reveals fault i.

  • m: The total number of faults revealed in the current test run.

  • n: The total number of test cases in the current test run.

APFD ranges from 0 percent to 100 percent. A higher APFD value represents a larger area under the curve, which means a higher fault detection rate, i.e., better test case prioritization.

In APFD, all test cases are presumed to have the same execution time. Since the cost of test cases in GitHub projects is hard to collect, APFD is the most suitable performance metric for our experiments.
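As a concrete illustration, the sketch below computes Eq. (1) for one test run. It assumes each failing test case reveals one distinct fault and that all test cases have the same cost; the function and variable names are ours, chosen for the example.

```python
# Sketch of Eq. (1): `order` is the prioritized list of test case names for
# one test run, `failing` is the set of test cases that reveal a fault.
def apfd(order, failing):
    n, m = len(order), len(failing)
    if n == 0 or m == 0:
        return 0.0
    fault_ranks = [i + 1 for i, name in enumerate(order) if name in failing]
    return 1.0 - sum(fault_ranks) / (n * m) + 1.0 / (2 * n)

# Example: apfd(["t3", "t1", "t2", "t4"], {"t1", "t3"}) == 0.75
```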

3.4 Test Case Prioritization Algorithms

Our study implements the nine prioritization algorithms found in the literature review of §2.4. While all these rely on execution history, they prioritize test cases in different ways. We group these algorithms into Group A, B, C, and D according to the kinds of information that they use.

  • Group A: Group A contains 2 approaches that prioritize test cases with no information gain. These two algorithms are baseline methods that are used for comparison.

  • Group B: Group B includes 4 approaches that prioritize test cases only by their own execution history. They sort metrics to reorder the test cases before each test run.

  • Group C: Group C has 2 approaches that prioritize test cases by the correlations between pairs of test cases. Two test cases have a high probability of having the same outcome if they are highly correlated.

  • Group D: Group D contains the proposed active learning based framework TERMINATOR [49]. TERMINATOR trains the SVM model with execution history when the first fault is detected.

Table IV shows the detailed group division and a brief description of each algorithm. The “Information Utilized” column shows which history details are used by each algorithm.

In the rest of this section, we explain each algorithm with a detailed example of how it orders test cases in each test build. To make the illustration clearer, we construct small test case tables with four test cases, four previously executed test builds, and one current test build. In these tables, ✘ indicates a failed test result and ✔ indicates a passed test result. We assume all test cases have the same cost in our study.

A1: This algorithm uses no information. It prioritizes test cases in random order. This is the baseline method, since all prioritization algorithms should perform better than A1.

A2: This algorithm uses the actual record of which tests fail in the current build to sort the tests. For example, in Table V, A2 executes the two failing test cases (in random order) before the two passing ones, since only the failing tests reveal faults in the current test run. We call A2 the omniscient algorithm since it uses information that is unavailable before prioritizing new tests. Note that if A1 represents the dumbest prioritization, then the omniscient A2 algorithm represents the best possible decisions. In the rest of this paper, we compare all results against A1 and A2, since that lets us baseline prioritization against the worst and the omniscient decisions.

Test Case Metric Order
1 1
0 3
1 2
0 4
TABLE V: Example of A2

B1: B1 uses the time since the last failure. A test case with fewer consecutive non-failing builds is assigned a higher priority [12, 16]. In Table VI, the two test cases with 0 consecutive non-failing builds (they failed in the most recent build) are assigned to the first and second positions in random order; the test case with 1 consecutive non-failing build comes next, and the test case with 2 comes last.

Test Case Metric Order
2 4
0 1
1 3
0 2
TABLE VI: Example of B1
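The example above can be written in a few lines of Python. The sketch below assumes a per-test execution history recorded as 'Pass'/'Fail' strings (a simplification of our build-to-test tables) and breaks ties randomly, as in the example.

```python
# Sketch of B1: rank test cases by the number of consecutive non-failing
# builds since their last failure (fewer is better).
import random

def builds_since_last_failure(results):
    count = 0
    for outcome in reversed(results):        # newest result first
        if outcome == "Fail":
            return count
        count += 1
    return len(results)                       # never failed: lowest priority

def prioritize_b1(history):
    """history: {test_name: [result, ...]} ordered oldest to newest."""
    names = list(history)
    random.shuffle(names)                     # random tie-breaking, as in the example
    return sorted(names, key=lambda name: builds_since_last_failure(history[name]))
```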

B2: B2 uses the failure rate of each test case to prioritize [13]. Failure rate is defined as:

(total number of failed builds) / (total test builds)

A test case with a higher failure rate will be executed earlier. In Table VII, one test case failed in all previous builds, so its failure rate is 1 and it is executed first; the remaining test cases follow in descending order of their failure rates (0.75, 0.5, and 0.25), giving the order shown in the table.

Test Case Metric Order
0.25 4
0.75 2
0.5 3
1 1
TABLE VII: Example of B2

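A matching sketch for B2, under the same assumed history format, simply ranks test cases by their failure rate.

```python
# Sketch of B2: rank test cases by failure rate
# (number of failed builds / number of builds), highest first.
def failure_rate(results):
    return sum(r == "Fail" for r in results) / len(results) if results else 0.0

def prioritize_b2(history):
    return sorted(history, key=lambda name: failure_rate(history[name]), reverse=True)
```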

B3: B3 implements the “Exponential Decay Metric” (mentioned earlier in this paper) to calculate a ranking value for each test case [22, 13]:

P_{tc,0} = h_{tc,0}, \qquad P_{tc,k} = \alpha \cdot h_{tc,k} + (1 - \alpha) \cdot P_{tc,k-1}, \quad 0 \le \alpha \le 1

where the variables in these equations are defined as:

  • h_{tc,k}: The test result of test case tc in build k; h_{tc,k} = 0 if the test passed and h_{tc,k} = 1 if it failed.

  • \alpha: The learning rate, tuned empirically in our experiments.

  • P_{tc,k}: The Exponential Decay value of test case tc after build k.

A test case with a higher Exponential Decay value will be executed earlier. For example, in Table VIII, the Exponential Decay values of the four test cases are 0.901, 0.9, 0.01, and 0.1, so B3 ranks them in descending order of those values.

Test Case Metric Order
0.901 1
0.9 2
0.01 4
0.1 3
TABLE VIII: Example of B3
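As a concrete illustration, the sketch below folds the execution history through the recursion given above. The encoding of a failure as h = 1 and the particular α value are assumptions for the example, not necessarily the settings used in our experiments.

```python
# Sketch of B3: fold the execution history (oldest to newest) through
# P_k = alpha*h_k + (1-alpha)*P_{k-1}.
ALPHA = 0.9   # illustrative learning rate, not necessarily the tuned value

def exponential_decay(results, alpha=ALPHA):
    if not results:
        return 0.0
    value = 1.0 if results[0] == "Fail" else 0.0       # P_0 = h_0
    for outcome in results[1:]:
        h = 1.0 if outcome == "Fail" else 0.0
        value = alpha * h + (1.0 - alpha) * value      # exponential-decay update
    return value

def prioritize_b3(history, alpha=ALPHA):
    """history: {test_name: [result, ...]} ordered oldest to newest."""
    return sorted(history, key=lambda name: exponential_decay(history[name], alpha),
                  reverse=True)
```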

B4: B4 prioritizes test cases by implementing the ROCKET metric [33, 7]. In the ROCKET metric, the prioritization value of each test case is calculated as follows:

P_{tc} = \sum_{k \ge 1} \omega_k \cdot h_{tc,k}, \qquad \omega_1 = 0.7, \; \omega_2 = 0.2, \; \omega_k = 0.1 \text{ for } k \ge 3

where k = 1 denotes the most recent build and the variables in this system are defined as:

  • h_{tc,k}: The test result of test case tc in build k; h_{tc,k} = 0 if the test passed and h_{tc,k} = 1 if it failed.

  • P_{tc}: The ROCKET value of test case tc.

The prioritization values are ranked in descending order and the test cases are executed in that order. For example, in Table IX, the ROCKET values of the four test cases are 0.1, 0.2, 0.9, and 0.4, so B4 executes them in descending order of those values.

Test Case Metric Order
0.1 4
0.2 3
0.9 1
0.4 2
TABLE IX: Example of B4
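A similar sketch for B4 follows, assuming the weighting stated above (0.7 for the most recent build, 0.2 for the second most recent, 0.1 for older builds) and the same pass/fail encoding.

```python
# Sketch of B4: weight each past failure by how recent it is and rank test
# cases by the resulting ROCKET value, highest first.
def rocket_value(results):
    value = 0.0
    for age, outcome in enumerate(reversed(results)):   # age 0 = most recent build
        weight = 0.7 if age == 0 else (0.2 if age == 1 else 0.1)
        value += weight * (1.0 if outcome == "Fail" else 0.0)
    return value

def prioritize_b4(history):
    return sorted(history, key=lambda name: rocket_value(history[name]), reverse=True)
```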

C1: The C1 algorithm was introduced by Zhu et al. in 2018. It considers the past co-failure distributions of test cases during prioritization [52]. Treating two test cases as a pair, a co-failure score is updated from the pair's joint execution history using the following quantities:

  • The test cases that have not yet been executed in the current run.

  • The test case that was executed just now.

  • The score of a test case in the current test run.

  • The score of a test case in the previous test run.

A higher score means a test case is more highly correlated with the test cases executed so far. For example, in Table X, after the first test case fails, the scores of the remaining test cases are updated; the test case with the highest score is considered most highly correlated with the failing one and is executed next, and the process repeats until all test cases are ordered. However, C1 is a very complex algorithm. In some large projects, C1 takes a very long time (over 72 hours) to prioritize the test cases. Therefore, for these large projects, we treat C1 as performing worse than the other algorithms no matter how high an APFD it could reach.

Test Case Metric Order
- , - 1
0.5, - 2
0, -0.17 3
-0.5, -0.67 4
TABLE X: Example of C1

C2: The C2 algorithm was proposed by Cho et al. in 2016. They define two test cases as highly correlated if their testing results change to the opposite status (flip) in two consecutive test runs [7]. Moreover, they use the ROCKET method to find the first failed test case. In Table XI, the ROCKET approach puts one test case in the first position; the flipping-history counts of the remaining test cases are then updated after each execution, and the test case most correlated with the already-executed ones is run next, which produces the final order shown in the table.

Test Case Metric Order
-,- 1
1,1 4
2,- 2
1,2 3
TABLE XI: Example of C2

D1: The last algorithm in our experiments, TERMINATOR, was proposed by Yu et al. in 2019. TERMINATOR implements an active learning based framework [49]. This approach uses execution history as features to incrementally train a support vector machine classifier. Uncertainty sampling (i.e., execute the test case with the most uncertain predicted probability) is applied until the number of detected faults exceeds some threshold; after that, certainty sampling (i.e., execute the test case with the highest predicted probability) is used until all test cases are prioritized. In the following example, we assume the threshold is set to 2. In Table XII, one test case is randomly executed first and labeled as a failing test case, and we randomly presume another test case to be non-relevant so that an SVM model can be trained with one failed and one passed example. The fitted model then predicts failure probabilities for the remaining test cases. Uncertainty sampling selects the test case whose predicted probability is closest to 0.5; it is executed and labeled as a failing test case. With this extra evidence, the model is retrained, and since the number of failed test cases now reaches the threshold, certainty sampling takes over: the test case with the highest predicted probability of revealing a fault is executed next, producing the final ranking shown in the table.

Test Case Metric Order
0.3,0.2 4
0.6,- 2
-,- 1
0.8,0.9 3
TABLE XII: Example of D1
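To make that loop concrete, the sketch below outlines a TERMINATOR-style prioritizer. The feature matrix, the `run_test` oracle, the bootstrap step, and the SVM settings are simplifying assumptions for illustration, not the authors' exact implementation.

```python
# Simplified sketch of a D1-style active learning loop: execute tests one at
# a time, refit an SVM on execution-history features, use uncertainty
# sampling until `threshold` failures are found, then certainty sampling.
import numpy as np
from sklearn.svm import SVC

def prioritize_d1(features, run_test, n_tests, threshold=2):
    """features: (n_tests, n_history_features) array; run_test(i) -> True if test i fails."""
    remaining = list(range(n_tests))
    order, labels = [], {}
    first = remaining.pop(0)               # bootstrap with one executed test
    order.append(first)
    labels[first] = run_test(first)
    while remaining:
        failed = [i for i, is_fail in labels.items() if is_fail]
        passed = [i for i, is_fail in labels.items() if not is_fail]
        if not failed or not passed:
            nxt = remaining.pop(0)         # not enough labels to train the SVM yet
        else:
            clf = SVC(kernel="linear", probability=True)
            clf.fit(features[failed + passed], [1] * len(failed) + [0] * len(passed))
            prob_fail = clf.predict_proba(features[remaining])[:, 1]
            if len(failed) < threshold:
                pick = int(np.argmin(np.abs(prob_fail - 0.5)))  # uncertainty sampling
            else:
                pick = int(np.argmax(prob_fail))                # certainty sampling
            nxt = remaining.pop(pick)
        order.append(nxt)
        labels[nxt] = run_test(nxt)
    return order
```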

3.5 Statistical Methods

In our study, we report the median and interquartile range (the 50th percentile and the 75th−25th percentile difference) of the APFD results over all test runs. We collect median and interquartile range values for each of the projects.

To make comparisons among all algorithms on a single project, we apply the Scott-Knott analysis [36]. In summary, Scott-Knott sorts the algorithms by their performance and assigns two algorithms to different ranks only if their performances differ significantly.

To be more precise, Scott-Knott sorts the list of treatments (in this paper, the prioritization algorithms) by their median score. After the sorting, it splits the list into two sub-lists. The goal of such a split is to maximize the expected value of the differences in the observed performances before and after the division [48]. For example, in our study the list l holds the nine prioritization approaches, and the candidate splits divide l into a leading sub-list l_1 and a trailing sub-list l_2 (|l_1| = 1 and |l_2| = 8, |l_1| = 2 and |l_2| = 7, and so on). Scott-Knott declares one of these divisions to be the best split; the best split maximizes the difference in the expected mean value before and after the split:

E(\Delta) = \frac{|l_1|}{|l|} \left( E[l_1] - E[l] \right)^2 + \frac{|l_2|}{|l|} \left( E[l_2] - E[l] \right)^2     (2)

where:

  • |l|, |l_1|, |l_2|: The sizes of lists l, l_1, and l_2.

  • E[l], E[l_1], E[l_2]: The mean values of lists l, l_1, and l_2.

After the best split is found by the formula above, Scott-Knott applies a statistical hypothesis test to check whether the division is useful. Here, “useful” means that l_1 and l_2 differ significantly according to the hypothesis test. If the division is a useful split, the Scott-Knott analysis runs recursively on each half of the best split until no further division can be made. In our study, the hypothesis test is the Cliff's delta non-parametric effect size measure. Cliff's delta quantifies the amount of difference between two lists of observations beyond the interpretation of p-values [29]. A division passes the hypothesis test if its effect is not “small” (|δ| ≥ 0.147). The Cliff's delta non-parametric effect size test compares two lists A and B of sizes |A| and |B|:

\delta = \frac{\sum_{x \in A} \sum_{y \in B} \left( [x > y] - [x < y] \right)}{|A| \cdot |B|}     (3)

In this expression, Cliff's delta estimates the probability that a value in list A is greater than a value in list B, minus the reverse probability [29]. This hypothesis test and its effect size are endorsed by Hess and Kromrey [17].
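For reference, a direct sketch of Eq. (3) is shown below; it is a naive O(|A|·|B|) computation, which is sufficient for the list sizes compared in this study.

```python
# Sketch of Eq. (3): Cliff's delta for two lists of APFD observations.
def cliffs_delta(a, b):
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))
```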

4 Results

In this section, we show our experimental results and use them to answer the research questions. Note that RQ1 only reports results for the same closed-source project studied in the TERMINATOR paper [49], while RQ2 reports the experimental results for the 30 open-source projects.

4.1 What is the best algorithm for the closed-source project? (RQ1)

rank what med IQR
1 A1 0.50 0.02
2 C2 0.69 0.06
3 B1 0.70 0.08
3 B3 0.72 0.08
4 B2 0.74 0.08
4 B4 0.75 0.08
5 C1 0.79 0.09
5 D1 0.80 0.14
6 A2 0.96 0.08

TABLE XIII: Scott-Knott analysis for the proprietary data from our industrial partner. In this table, “med” denotes median; the blue row denotes the performance of the D1 algorithm, while the red row denotes the performance of the B1/B3 approaches.


(a). Project Name: parsl/parsl
rank what med IQR
1 A1 0.51 0.45
1 D1 0.54 0.38
2 B2 0.79 0.46
2 C1 0.81 0.48
2 B4 0.88 0.48
2 C2 0.89 0.39
3 B1 0.97 0.35
3 B3 0.97 0.32
4 A2 0.99 0.00

(b). Project Name: radical-sybertools/radical
rank what med IQR
1 A1 0.51 0.26
1 D1 0.60 0.36
2 B2 0.81 0.34
2 B4 0.82 0.34
2 C2 0.84 0.40
3 C1 0.98 0.20
4 B1 0.99 0.20
4 B3 0.99 0.19
4 A2 0.99 0.15

(c). Project Name: yt-project/yt
rank what med IQR
1 A1 0.50 0.28
2 D1 0.66 0.48
3 B2 0.92 0.33
3 C2 0.94 0.28
3 B4 0.95 0.27
3 C1 0.96 0.24
4 B1 0.99 0.16
4 B3 0.99 0.16
4 A2 1.00 0.01
(d). Project Name: mdanalysis/mdanalysis
rank what med IQR
1 A1 0.50 0.14
2 B2 0.80 0.44
2 D1 0.83 0.34
2 C2 0.83 0.32
2 B4 0.85 0.37
2 C1 0.89 0.39
3 B1 0.98 0.16
3 B3 0.98 0.15
4 A2 0.99 0.01

(e). Project Name: unidata/metpy
rank what med IQR
1 A1 0.50 0.20
2 D1 0.79 0.40
3 C2 0.95 0.13
3 B2 0.95 0.13
3 B4 0.96 0.10
4 B1 0.99 0.02
4 B3 0.99 0.01
4 A2 1.00 0.01

(f). Project Name: materialsproject/pymatgen
rank what med IQR
1 A1 0.50 0.41
1 D1 0.53 0.48
2 C2 0.99 0.14
2 B2 1.00 0.08
2 B4 1.00 0.05
2 B1 1.00 0.01
2 B3 1.00 0.01
2 A2 1.00 0.00
(g). Project Name: reactionMechanism../RMG-Py
rank what med IQR
1 A1 0.50 0.24
2 D1 0.74 0.39
3 B2 0.91 0.24
3 B4 0.92 0.23
3 C2 0.93 0.20
4 B1 0.99 0.12
4 B3 0.99 0.11
4 A2 1.00 0.01

(h). Project Name: openforcefield/openforcefield
rank what med IQR
1 A1 0.50 0.14
2 D1 0.83 0.38
2 B2 0.84 0.38
3 B4 0.92 0.30
3 C2 0.94 0.24
4 B1 1.00 0.10
4 B3 1.00 0.11
4 A2 1.00 0.00

(i). Project Name: spotify/luigi
rank what med IQR
1 A1 0.50 0.29
2 D1 0.69 0.38
3 C2 0.99 0.06
3 B4 0.99 0.03
3 B2 0.99 0.03
3 B1 1.00 0.01
3 B3 1.00 0.01
3 A2 1.00 0.00
(j). Project Name: galaxyProject/galaxy
rank what med IQR
1 A1 0.50 0.26
2 D1 0.74 0.35
3 C2 1.00 0.01
3 B4 1.00 0.01
3 B2 1.00 0.01
3 B1 1.00 0.00
3 B3 1.00 0.00
3 A2 1.00 0.00
TABLE XIV: Scott-Knott analysis results for 10 open-source computational science projects. In these tables, the blue row denotes the performance of the D1 algorithm, while the red row denotes the performance of the B1/B3 approaches. Note that in 8/10 of these results, B1/B3 is ranked the same as the omniscient A2 method: see figures b,c,e,f,g,h,i,j.


(a). Project Name: loomio/loomio
rank what med IQR
1 A1 0.52 0.28
1 D1 0.59 0.42
2 C2 0.89 0.32
2 C1 0.91 0.18
2 B2 0.91 0.23
2 B4 0.95 0.22
3 B1 0.97 0.17
3 B3 0.97 0.14
4 A2 0.99 0.02

(b). Project Name: languagetool-org/languagetool
rank what med IQR
1 A1 0.48 0.31
2 D1 0.65 0.32
3 C2 0.75 0.25
4 C1 0.94 0.14
4 B2 0.95 0.20
5 B4 0.97 0.16
5 B1 0.98 0.09
5 B3 0.98 0.06
5 A2 0.98 0.00

(c). Project Name: deeplearning4j/deeplearning4j
rank what med IQR
1 A1 0.49 0.23
2 D1 0.67 0.37
3 C2 0.78 0.28
4 B2 0.92 0.19
4 C1 0.93 0.14
4 B4 0.94 0.14
5 B1 0.98 0.04
5 B3 0.98 0.03
5 A2 0.99 0.01
(d). Project Name: Unidata/thredds
rank what med IQR
1 A1 0.51 0.23
2 D1 0.67 0.26
3 C2 0.78 0.22
4 C1 0.92 0.10
4 B2 0.93 0.10
4 B4 0.94 0.09
5 B1 0.96 0.04
5 B3 0.96 0.03
5 A2 0.97 0.01

(e). Project Name: nutzam/nutz
rank what med IQR
1 A1 0.49 0.24
2 D1 0.68 0.37
2 C2 0.68 0.36
3 B2 0.96 0.16
3 C1 0.97 0.09
3 B4 0.98 0.12
4 B1 0.99 0.01
4 B3 0.99 0.02
4 A2 1.00 0.01

(f). Project Name: structr/structr
rank what med IQR
1 A1 0.50 0.22
2 D1 0.73 0.35
3 C2 0.78 0.24
4 C1 0.97 0.04
4 B2 0.97 0.09
4 B4 0.97 0.07
5 B1 0.99 0.02
5 B3 0.99 0.02
5 A2 0.99 0.00
(g). Project Name: ocpsoft/rewrite
rank what med IQR
1 A1 0.47 0.24
2 D1 0.63 0.34
3 C2 0.72 0.24
4 B2 0.94 0.24
4 C1 0.95 0.17
4 B4 0.96 0.24
5 B1 0.98 0.13
5 B3 0.98 0.10
5 A2 0.99 0.01

(h). Project Name: eclipse/jetty.project
rank what med IQR
1 A1 0.51 0.23
2 D1 0.74 0.29
3 C2 0.80 0.20
4 C1 0.98 0.02
4 B4 0.98 0.03
4 B2 0.98 0.04
4 B1 0.98 0.07
4 B3 0.98 0.06
4 A2 0.99 0.00

(i). Project Name: square/okhttp
rank what med IQR
1 A1 0.51 0.28
2 D1 0.63 0.36
3 C2 0.72 0.23
4 C1 0.96 0.03
4 B2 0.97 0.04
4 B4 0.98 0.03
4 B1 0.98 0.01
4 B3 0.98 0.00
4 A2 0.98 0.00
(j). Project Name: openSUSE/open-build-service
rank what med IQR
1 A1 0.53 0.34
1 D1 0.59 0.46
2 B2 0.88 0.22
2 C2 0.88 0.36
2 B4 0.89 0.28
2 C1 0.90 0.15
3 B1 0.96 0.15
3 B3 0.96 0.15
4 A2 0.99 0.00

(k). Project Name: thinkaurelius/titan
rank what med IQR
1 A1 0.49 0.19
2 D1 0.63 0.32
3 B2 0.71 0.31
3 B4 0.72 0.30
3 C1 0.72 0.34
3 C2 0.73 0.19
4 B1 0.93 0.26
4 B3 0.94 0.27
5 A2 0.99 0.2

(l). Project Name: Graylog2/graylog2-server
rank what med IQR
1 A1 0.50 0.24
2 D1 0.69 0.37
3 C2 0.70 0.26
4 C1 0.91 0.21
4 B2 0.91 0.24
4 B4 0.92 0.23
5 B1 0.96 0.12
5 B3 0.96 0.12
6 A2 0.99 0.03
(m). Project Name: puppetlabs/puppet
rank what med IQR
1 A1 0.48 0.32
2 D1 0.66 0.36
3 C2 0.95 0.28
4 B4 0.98 0.03
4 B1 0.98 0.03
4 B2 0.98 0.03
4 B3 0.98 0.03
4 C1 0.98 0.03
4 A2 0.98 0.01

(n). Project Name: middleman/middleman
rank what med IQR
1 A1 0.51 0.31
2 D1 0.64 0.41
3 B2 0.83 0.28
3 C2 0.84 0.24
3 B4 0.85 0.24
3 C1 0.88 0.25
4 B1 0.98 0.09
4 B3 0.98 0.09
5 A2 1.00 0.02

(o). Project Name: locomotivecms/engine
rank what med IQR
1 A1 0.48 0.45
1 D1 0.51 0.38
2 B2 0.96 0.33
2 C1 0.96 0.15
2 B4 0.97 0.29
2 C2 0.97 0.30
3 B1 0.99 0.03
3 B3 0.99 0.03
3 A2 0.99 0.00
(p). Project Name: diaspora/diaspora
rank what med IQR
1 A1 0.50 0.24
2 D1 0.72 0.39
3 B2 0.78 0.36
3 B4 0.79 0.37
3 C2 0.81 0.33
3 C1 0.83 0.32
4 B1 0.90 0.17
4 B3 0.94 0.15
5 A2 0.99 0.08

(q). Project Name: facebook/presto
rank what med IQR
1 A1 0.50 0.25
2 D1 0.67 0.41
3 C2 0.75 0.24
4 B2 0.97 0.10
4 B4 0.97 0.10
5 C1 1.00 0.03
5 B1 1.00 0.01
5 B3 1.00 0.01
5 A2 1.00 0.00

(r). Project Name: rspec/rspec-core
rank what med IQR
1 A1 0.52 0.23
2 D1 0.70 0.28
3 C2 0.73 0.32
4 C1 0.82 0.16
4 B2 0.83 0.17
4 B4 0.83 0.17
5 B1 0.86 0.19
5 B3 0.86 0.17
6 A2 0.98 0.15
(s). Project Name: rails/rails
rank what med IQR
1 A1 0.50 0.12
1 C1 n/a n/a
2 C2 0.86 0.11
2 D1 0.88 0.17
2 B2 0.88 0.14
2 B4 0.88 0.15
3 B1 0.98 0.02
3 B3 0.98 0.02
3 A2 0.99 0.02

(t). Project Name: jruby/jruby
rank what med IQR
1 A1 0.50 0.04
1 D1 n/a n/a
1 C1 n/a n/a
1 C2 n/a n/a
2 B2 0.97 0.07
2 B4 0.97 0.07
3 B1 0.99 0.04
3 B3 0.99 0.04
3 A2 0.99 0.01
TABLE XV: Scott-Knott analysis results from 20 open-source software engineering projects. In these tables, the blue row marks the performance of the D1 algorithm, while the red row denotes the performance of the B1/B3 approaches. Note that in 13/20 of these results, B1/B3 is ranked the same as the omniscient A2 method: see figures b,c,d,e,f,g,h,i,m,o,q,s,t. Algorithms marked n/a were too expensive to finish and are therefore placed in the lowest rank.
build_id sp_api sp_replace sl_api simple_l nws_l
189049303 Pass Pass Pass Pass Pass
189277968 Fail Fail Fail Fail Fail
189305565 Fail Fail Fail Fail Fail
189333173 Fail Fail Fail Fail Fail
189337798 Fail Fail Fail Fail Fail
189352555 Fail Fail Fail Fail Fail
189355678 Pass Pass Pass Pass Pass
build_id tc_301 tc_302 tc_303 tc_304 tc_305
18 Pass Pass Fail Fail Pass
19 Pass Pass Fail Fail Pass
20 Fail Fail Fail Fail Fail
21 Fail Fail Pass Fail Fail
22 Pass Pass Fail Fail Pass
23 Pass Pass Pass Fail Pass
24 Pass Fail Fail Fail Pass
(a) Unidata (b) Proprietary Closed-source Data
TABLE XVI: Part of the testing data set from (a) the GitHub open-source project Unidata and (b) the proprietary closed-source data from our industrial partner

To answer RQ1, we reproduce the Yu et al. study by applying all of our selected prioritization approaches to their data set [49]. Note that, for this data, Yu et al. recommended TERMINATOR (which we call the D1 prioritization algorithm).

Table XIII shows our simulation results for the 9 prioritization algorithms. We record the APFD of each test run and calculate the median and interquartile range of APFD over all test runs. An algorithm with a higher APFD value has better performance. As described in §3.5, algorithms differ significantly if they are separated into different ranks by the Scott-Knott analysis.

As seen in Table XIII, and as might be expected, the performance of all the algorithms is bounded by the dumbest randomized A1 prioritization algorithm (which performed worst) and the omniscient A2 algorithm (which performed best).

After the omniscient A2, we see that D1 and C1 are tied for best place (rank 5). That said, for two reasons, we recommend D1 over C1:

  • D1 runs more than four times faster than C1 (328 seconds versus 1457 seconds).

  • D1 converges faster to a higher plateau of performance (see Figure 2).

Hence we answer RQ1 as follows: As seen before, the TERMINATOR prioritization scheme works best for that closed-source project.

Fig. 2: Mean fault detection rates. X-axis = number of tests executed, Y-axis = ”recall” (percentage of failing test suites).

4.2 What is the best algorithm for open-source projects? (RQ2)

In order to answer RQ2, we use 10 computational science (CS) projects and 20 software engineering (SE) projects from GitHub. Table XIV shows the Scott-Knott analysis for the 10 CS projects and Table XV shows the results for the 20 SE projects. From comparisons across all 30 projects, we observe that:

  • For all these open-source projects, B1 and B3 always perform better than any other algorithm.

  • Interestingly, algorithms B1 and B3 are ranked the same as the omniscient A2 algorithm in 8/10 of the Table XIV results and 13/20 of the Table XV results. That is, in the majority of cases, B1 and B3 perform at such a high level that they cannot be beaten.

Moreover, in our experiments, we find that C1 takes a very long time to prioritize projects which have over 800 test builds or 1500 failed test cases. For example, in the Reaction Mechanism Generator project, which has 850 test builds and 617 failed test cases, C1 takes around 48 hours to simulate 70% of the test builds. Therefore, we conclude that C1 is a very computationally costly algorithm which has issues scaling up to projects with a huge number of test builds or failed test cases. C1 performs so slowly that we do not use it in our analysis of projects with more than 800 test builds or more than 1500 failed test cases.

In summary, based on the above results, we answer RQ2 as follows: for open-source projects, the best approach is not TERMINATOR, but rather to prioritize using either passing times since last failure or another exponential metric.

4.3 Do different prioritization algorithms perform differently on the open-source projects and the closed-source project? (RQ3)

To answer RQ3, we look at the B1/B3 and D1 results in Table XIII, Table XIV, and Table XV. We highlight the D1 results in blue and the B1/B3 results in red. Note that the ranking of these algorithms is reversed between our closed-source and open-source examples:

  • As shown in Table XIII, for our closed-source case study, D1 performs much better than B1/B3.

  • However, as shown in Table XIV and Table XV, for open-source projects, that ranking is completely reversed.

Based on these points, we answer RQ3 as follows: test case prioritization schemes that work best for the industrial closed-source project can work worse for open-source projects (and vice versa).

5 Discussion

Project Name B1 B2 B3 B4 C1 C2 D1
Unidata/thredds 0.2 0.3 0.3 0.4 8.4 1.2 1.7
OpenSUSE 0.2 0.3 0.3 0.3 11.4 1.1 1.9
Thinkaurelius 0.2 0.3 0.3 0.4 16.8 2.7 2.9
Loomio 0.2 0.4 0.5 0.6 21.9 3.0 3.1
Languagetool… 0.3 0.4 0.5 0.6 17.3 1.1 2.2
ocpsoft 0.2 0.3 0.3 0.4 19.3 1.2 3.3
Locomotivems 0.2 0.3 0.3 0.4 8.4 1.2 2.2
Parsl 0.3 0.5 0.7 0.8 27.4 1.2 3.7
Graylog2 0.4 0.4 0.5 0.6 22.3 1.7 3.2
Eclipse 0.4 0.6 0.8 1.0 86.1 4.0 8.3
Rspec 0.5 0.7 0.8 1.0 29.8 7.2 4.0
Radical-syber.. 0.9 1.3 1.7 1.9 69.1 9.7 6.2
Deeplearning4j 1.0 1.3 1.6 1.9 99.0 7.4 8.8
Puppetlabs 1.3 1.4 1.6 1.8 54.1 5.7 5.3
Nutzam 2.2 3.4 4.5 5.4 792.7 17.9 57.4
Square 2.7 3.1 3.9 4.7 168.5 11.2 13.1
Middleman 3.4 6.2 8.0 9.4 598.8 34.5 33.0
Diaspora 9.1 20.6 26.4 31.3 1738.0 255.8 78.3
Structr 14.4 19.4 25.1 31.5 3309.9 129.9 147.9
Yt-project 14.7 27.8 34.6 39.9 12494.4 315.8 563.4
Mdanalysis 24.5 38.1 47.8 55.5 34336.1 1271.9 3542.5
Facebook 43.0 45.7 56.4 62.7 67635.8 281.7 4646.1
Reaction.. 45.0 70.4 86.3 105.5 n/a 924.2 2551.4
Openforcefield 49.2 81.8 102.4 117.6 n/a 4685.4 33962.7
Unidata 105.7 158.5 196.3 244.9 n/a 1793.8 4271.5
Materials.. 167.3 183.1 221.2 277.0 n/a 873.5 6439.4
Spotify 564.5 959.8 1203.1 1541.1 n/a 5525.3 15961.1
Rails 2950.1 5376.4 6716.6 8339.4 n/a 68648.1 82601.2
Galaxy.. 9721.8 9169.9 11297.2 14466.7 n/a 56929.1 205651.6
Jruby 1081.7 1797.7 2393.1 2946.3 n/a n/a n/a
Proprietary data 1.7 2.3 2.2 2.4 1457.5 285.9 327.7

TABLE XVII: Run time for all algorithms (unit: seconds). Dark gray marks the run time of D1 on the proprietary closed-source project from our industrial partner, which is acceptable. Light gray marks the run times of B1 and B3 on the open-source projects, which are much shorter than those of D1.

5.1 Performance of the Prioritization Algorithms on Projects from Different Sources

In our study, we find that D1 performs “best” on the industrial closed-source project but “worst” on open-source projects. Conversely, B1 and B3 have the best performance on open-source projects but perform worse on the industrial closed-source project. This finding leads us to consider the reasons for this phenomenon.

First and foremost, by examining our build-to-test tables, we find that the industrial closed-source project and the GitHub open-source projects use different testing scenarios in regression testing. As defined in §1, regression testing re-runs test cases after each modification to check whether previously fixed issues have re-appeared [2]. The differences are:

  • In GitHub open-source projects, developers often try to fix a single fault over several consecutive builds. This testing scenario makes the same failures persist for several consecutive builds (see Table XVI(a)).

  • By contrast, the industrial closed-source project is often tested for integration: most test builds are triggered to check whether software components work properly with each other. Therefore, test cases show less of a pattern across consecutive builds (see Table XVI(b)).

Table XVI shows matrices in which each cell records the result of running one test case in one build. In Table XVI(a), we see that the listed test cases failed simultaneously from build 189277968 to build 189352555. We see this pattern often in the test builds of GitHub open-source projects.

However, in Table XVI(b), such a pattern is less common. Hence we conclude that consecutive test builds are often triggered for a single component in open-source projects, but are only triggered for the entire code base in the closed-source project. In this testing scenario, B1 and B3 have outstanding performance on open-source projects because more weight is assigned to recent test results.

Secondly, as mentioned above:

  • Developers on the industrial closed-source project mostly trigger regression testing to test an entire project. As a result, many more faults are revealed in a single test run of the closed-source project.

  • On the other hand, open-source projects have much fewer failures in each test build, since most builds are triggered to test an individual unit.

  • In such different testing scenarios, more feedback information can be obtained from the execution history of the industrial closed-source project than from the open-source projects.

  • Hence the D1 algorithm can exploit this advantage to find more association rules among the test cases.

For this reason, D1 dominates on the industrial closed-source project rather than on the GitHub open-source projects.

5.2 Efficiency of Prioritization Algorithms

To conclude this study, we offer a brief note on the efficiency of the different prioritization algorithms. Efficiency can be an important component in judging whether a prioritization scheme is useful or not. An algorithm can be regarded as worse than others if it takes a long time to prioritize test cases.

In Table XVII, we list the simulation time for each of the selected algorithms. In this table, n/a means the algorithm takes a very long time (over 48 hours) in simulation, so we do not consider it even if its APFD is very high.

From Table XVII, we find that B1 and B3, the algorithms we recommend for open-source projects, have very short execution times (marked in light gray). The reason they are fast is that they only need a single pass over the execution history of each test case. Since most of the open-source projects have a very large number of builds and test cases, this finding reinforces our conclusion that B1 and B3 are the best prioritization algorithms for open-source projects since:

  • B1 and B3 have the best performance in prioritizing open-source projects.

  • B1 and B3 have fast simulation speed in prioritizing large open-source projects.

That said, despite their efficiency, B1 and B3 are not the preferred choice for our industrial closed-source project:

  • In Table XVII, we can observe that D1 finishes test case prioritization of the proprietary data in roughly five minutes (328 seconds) with outstanding performance, which is acceptable (marked in dark gray). Although B1 and B3 only take a few seconds to finish the same task, we still prefer D1 since it increases fault detection rates significantly.

  • Also, from Table XIII, we find that B1 and B3 are only in rank 3; D1 has much better performance than B1 and B3.

Taking the above reasons together, we conclude that B1 and B3 are not the right choice for the closed-source project, even though they have the shortest simulation times.

6 Threats to Validity

This section discusses validity issues, using the framework of Feldt et al. [14].

Conclusion validity: Different treatments of the simulation results may lead to different conclusions. In our experiments, we apply the Scott-Knott analysis to the APFD results of the test runs. Prioritization algorithms differ significantly if they fall into different ranks.

Metric validity: We use the weighted average of the percentage of faults detected (APFD) to evaluate the performance of prioritization approaches. This evaluation metric assumes all test cases have the same cost and the same severity. However, some test cases may take longer to execute than others. This may be a threat to evaluation validity. In our future work, we plan to collect the cost of test cases so that we can use a better evaluation metric, the average percentage of faults detected with cost (APFDc) [