Testing with Fewer Resources: An Adaptive Approach to Performance-Aware Test Case Generation

by Giovanni Grano et al.
Delft University of Technology

Automated test case generation is an effective technique to yield high-coverage test suites. While the majority of research effort has been devoted to satisfying coverage criteria, a recent trend emerged towards optimizing other non-coverage aspects. In this regard, runtime and memory usage are two essential dimensions: less expensive tests reduce the resource demands for the generation process and for later regression testing phases. This study shows that performance-aware test case generation requires solving two main challenges: providing accurate measurements of resource usage with minimal overhead and avoiding detrimental effects on both final coverage and fault detection effectiveness. To tackle these challenges we conceived a set of performance proxies (inspired by previous work on performance testing) that provide an approximation of the test execution costs (i.e., runtime and memory usage). Thus, we propose an adaptive strategy, called pDynaMOSA, which leverages these proxies by extending DynaMOSA, a state-of-the-art evolutionary algorithm in unit testing. Our empirical study --involving 110 non-trivial Java classes--reveals that our adaptive approach has comparable results to DynaMOSA over seven different coverage criteria (including branch, line, and weak mutation coverage) and similar fault detection effectiveness (measured via strong mutation coverage). Additionally, we observe statistically significant improvements regarding runtime and memory usage for test suites with a similar level of target coverage. Our quantitative and qualitative analyses highlight that our adaptive approach facilitates selecting better test inputs, which is an essential factor to test production code with fewer resources.





1 Introduction

From Waterfall to Agile, software testing has always played an essential role in delivering high-quality software [1]. Integrating automated test case generation tools [2, 3] into software development pipelines (e.g., in continuous software development (CD) [4]) could potentially reduce the time developers spend writing test cases by hand [5]. Hence, research and industry have heavily focused on automated test generation in the last decade [6], mainly employing evolutionary search (e.g., genetic algorithms (GA)) to produce minimal test suites that satisfy some testing criteria.


While most of the research effort has been devoted to maximizing various code coverage criteria [6, 8, 9, 2], recent work showed that further factors need to be considered for the generation of test cases [10, 11, 12, 13]. Specifically, recent research investigated additional factors such as data input readability [11], test readability [14, 13], test code quality [15], test diversity [16], execution time [12], and memory usage [10]. An early attempt to reduce the resource demand of generated tests is the work by Lakhotia et al. [10]. The authors recast test data generation as a bi-objective problem where branch coverage and the number of bytes allocated in the memory are two contrasting objectives to optimize with Pareto-efficient approaches. Their results show that multi-objective evolutionary algorithms are suitable for this problem. Following this line of research, other works also used multi-objective search to minimize test execution time [17] or the number of generated tests, used as a proxy for the oracle cost [18, 19].

While the aforementioned works showed the feasibility of lowering the cost (e.g., runtime) of the generated tests, they all pose two significant challenges with respect to reaching full code coverage [19]. First, empirical results showed that combining coverage with non-coverage criteria is harmful to the final coverage compared to traditional strategies that target coverage only [10, 18, 19, 15]. These approaches implement the classic one-branch-at-a-time (or single-target) approach, which consists of running bi-objective meta-heuristics (e.g., GA) multiple times, once for every branch in the code, while performance aspects are additional dimensions to optimize for each branch separately. However, recent studies [20, 21, 22] empirically and theoretically showed that single-target approaches are less effective and efficient than multi-target approaches (e.g., whole-suite approaches and many-objective search) in maximizing coverage. Therefore, the second challenge regards how to inject test performance analysis into the main loop of multi-target strategies.

Generated tests with lower resource demand might decrease the cost of introducing test case generation into continuous integration (CI) pipelines. Hilton et al. [23] showed that acquiring hardware resources for self-hosted CI infrastructure is one of the main barriers for small companies when implementing CI policies: more performant tests would require fewer hardware resources, and therefore testing in CI would be more cost-effective. Despite the theoretical benefits, the precise measurement of memory and runtime costs adds considerable overhead since it requires running each test case multiple times [24]. Consequently, there is a need for approaches that neither penalize the final coverage nor the fault detection capability of generated tests, while minimizing the test resource demand [19, 15].

We extend the current state-of-the-art by proposing a novel adaptive approach, called pDynaMOSA (Adaptive Performance-Aware DynaMOSA), to address the two challenges described above. In designing our approach, we focus on (i) test execution time (runtime from now on), (ii) memory usage, (iii) target coverage, and (iv) fault detection capability as four relevant testing criteria in white-box test case generation. To tackle the first challenge, we explored recent studies in performance testing [25] and symbolic execution [26] that investigate suitable approaches to estimate the computational/resource demands of test cases. In particular, we adopted three performance proxies —computable with low overhead— introduced by Albert et al. [26] for symbolic execution. Besides, we developed four additional performance proxies that provide an indirect approximation of the test execution costs (i.e., runtime and memory usage). These proxies, obtained through instrumentation, measure static and dynamic aspects related to resource usage through a single test execution: the number of objects instantiated (for heap memory), triggered method calls, and executed loop cycles and statements (for runtime).

Recent work in the field explored alternative ways to integrate orthogonal objectives into the fitness function, based on the idea of using non-coverage aspects as second-tier objectives [15]. To address our second challenge, pDynaMOSA extends DynaMOSA [27] —the most recent and effective many-objective genetic algorithm for test case generation— by using the performance proxies as second-tier objectives, while code branches are the first-tier objectives. pDynaMOSA uses an adaptive strategy in which the secondary objective can be temporarily disabled in favor of achieving higher coverage values (the primary goal). We integrated an adaptive strategy into pDynaMOSA because our initial results show that when the secondary objective strongly competes with the primary one (i.e., coverage), which is the case for performance, an adaptive strategy is preferable to a non-adaptive approach [15].

To evaluate pDynaMOSA, we conduct an empirical study involving 110 non-trivial classes from 27 open-source Java libraries, comparing pDynaMOSA to the baseline DynaMOSA in terms of branch coverage, runtime, memory consumption, and fault effectiveness (i.e., mutation score). Our study shows that pDynaMOSA achieves results similar to DynaMOSA over seven different coverage criteria. However, the test suites produced with pDynaMOSA are significantly less expensive to run for 65% (runtime) and 68% (heap memory consumption) of the subjects. We demonstrate that the devised approach does not reduce the fault effectiveness of the generated tests: pDynaMOSA achieves a similar or higher mutation score for ~85% of the subjects under test.

Contributions. In this work, we devise a performance-aware test case generation technique, where runtime and memory usage of the resulting tests are optimized as secondary objectives besides branch coverage. The main contributions of the paper are:

  • We demonstrate the need for an adaptive strategy for handling the problem of reducing the resource demand of generated test suites while maintaining high test coverage and fault detection capability.

  • We propose a performance-score aggregating a set of performance proxies with low overhead as an indirect approximation of the computational demand for a generated test case.

Replication Package. To enable full replicability of this study, we publish all the data used to compute the results and a runnable version of the implemented approach in a replication package [28].

2 Background & Related Work

Test data/case generation has been intensively investigated over the last decade [7, 6]. Several tools have been proposed with the main goal of automatically generating tests with high code coverage, measured according to various code-coverage criteria such as branch [29], statement [7], line, and method [8] coverage. Search-based algorithms —in particular GAs [30]— have strongly driven the automation of this task [6].

Proposed approaches can be categorized into two formulations: single-target and multi-target. In the former, evolutionary algorithms (or, more generally, meta-heuristics) aim to optimize one single coverage target (e.g., a branch) at a time. The single target is converted into a single-objective function (fitness function) measuring how close a test case (or a test suite) is to covering that target [7, 31]. The “closeness” to a given branch is measured using two well-established white-box heuristics [7]: the approach level and the normalized branch distance. Fraser and Arcuri were the first to propose a multi-target approach, which optimizes all coverage targets simultaneously in order to overcome the disadvantages of targeting one branch at a time [2]. In their approach, called whole test suite generation (WS), GAs evolve entire test suites rather than single test cases. The search is then guided by a suite-level fitness function that sums up the coverage heuristics (i.e., branch distances) for all the branches of the class under test (CUT). A later improvement over WS, called the archive-based whole suite approach (WSA), focuses the search on uncovered branches only and uses an archive to store test cases reaching previously uncovered branches [20].
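For illustration, the two heuristics can be combined into a single objective score. The sketch below follows common search-based testing practice (function names and the x/(x+1) normalization are illustrative assumptions, not necessarily the exact implementation in the cited work):

```python
def normalize(d):
    """Map an unbounded, non-negative branch distance into [0, 1)."""
    return d / (d + 1.0)

def single_target_fitness(approach_level, branch_distance):
    """Objective score of a test for one branch: lower is better, 0 = covered.

    approach_level: number of control-dependent nodes between the point
    where execution diverged and the target branch.
    branch_distance: how far the predicate at the divergence point was
    from evaluating towards the target (e.g., |a - b| for 'a == b').
    """
    return approach_level + normalize(branch_distance)

# Reached the branch's predicate (approach level 0) but missed it by 4:
print(single_target_fitness(0, 4))   # 0.8
# A covered branch yields a score of 0:
print(single_target_fitness(0, 0))   # 0.0
```

Normalizing the branch distance keeps it strictly below 1, so the integer approach level always dominates: getting one control node closer to the target outweighs any distance improvement at the current node.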

Many-objective Search. Following the idea of targeting all branches at once, Panichella et al. [3] addressed the test case generation problem in a many-objective fashion, proposing a many-objective genetic algorithm called MOSA. Different from WS (or WSA), MOSA evolves test cases that are evaluated using the branch distance and approach level for every single branch in the CUT. Consequently, the overall fitness of a test case is measured based on a vector of objectives, one for each branch of the production code. Thus, the goal is finding test cases that separately satisfy the target branches [3], i.e., tests minimizing the objective score for at least one uncovered branch. To focus the selection on tests reaching uncovered branches, MOSA introduces a new way to rank candidate test cases [32], called the preference criterion. Formally, a test case x is preferred over another test case y for a given branch b iff the objective score of x for b is lower (main criterion). In addition, if the two test cases x and y are equally good in terms of branch distance and approach level for the branch b, the shorter test is preferred (secondary criterion). In other words, the preference criterion promotes test cases that are closer to covering some branches and, among equally close ones, have minimal length.
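As a sketch, the preference criterion amounts to a comparator over per-branch objective scores (the `Test` class and its fields are hypothetical stand-ins for the tool's internal representation):

```python
class Test:
    def __init__(self, objectives, length):
        self.objectives = objectives  # branch id -> objective score
        self.length = length          # number of statements in the test

def preferred(t1, t2, branch):
    """Return the test preferred for `branch` by MOSA's preference criterion."""
    f1, f2 = t1.objectives[branch], t2.objectives[branch]
    if f1 != f2:                                  # main criterion: lower score
        return t1 if f1 < f2 else t2
    return t1 if t1.length <= t2.length else t2   # secondary: shorter test

t_long = Test({"b0": 0.5}, length=12)
t_short = Test({"b0": 0.5}, length=4)
print(preferred(t_long, t_short, "b0") is t_short)  # True: tie broken by length
```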

MOSA works as follows: a starting population is randomly generated and evolved through subsequent generations. For each generation, new offspring are created through crossover and mutation. Then, the new population for the next generation is created by selecting tests among parents and offspring as follows: a first front of test cases is built using the preference criterion. Subsequently, the remaining tests are grouped into further fronts using the traditional non-dominated sorting algorithm [33]. The new population is then obtained by picking tests starting from the first front until reaching a fixed population size M. To enable diversity and avoid premature convergence [34, 16], MOSA also relies on the crowding distance, a secondary heuristic that increases the chances of surviving into the next generation for the test cases that are the most diverse within the same front. The final test suite is the archive, an additional data structure used to store test cases that reach previously uncovered branches. If a new test t hits an already covered branch b, t is stored in the archive if and only if it is shorter (secondary criterion) than the test case currently stored in the archive for the same branch b.
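The archive bookkeeping described above can be sketched as follows (a minimal illustration; the routine name and data layout are assumptions, not EvoSuite's actual API):

```python
def update_archive(archive, test, covered_branches):
    """Keep, for every branch, the shortest known test covering it.

    archive: dict mapping branch id -> (test_id, length)
    test: (test_id, length) pair for the newly executed test
    """
    for branch in covered_branches:
        incumbent = archive.get(branch)
        if incumbent is None or test[1] < incumbent[1]:
            archive[branch] = test   # new branch, or shorter than incumbent
    return archive

archive = {}
update_archive(archive, ("t1", 10), {"b1", "b2"})
update_archive(archive, ("t2", 3), {"b2"})   # shorter -> replaces t1 for b2
print(archive["b2"])  # ('t2', 3)
```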

Panichella et al. [27] improved the MOSA algorithm by presenting DynaMOSA, a variant that focuses the search on a subset of uncovered targets based on the control dependence hierarchy. While MOSA considers all targets as independent objectives, DynaMOSA relies on the control dependency graph (CDG) to distinguish the targets free of control dependencies from those that can be covered only after their dominators are satisfied. In particular, the difference between DynaMOSA and MOSA is the following: at the beginning of the search, DynaMOSA includes only the targets that are free of control dependencies in the vector of objectives. Then, at every iteration, the current set of targets is updated considering the results of the newly created tests, including any uncovered target that is control dependent on the newly covered ones. This approach does not change the way MOSA ranks the generated solutions, but rather speeds up the convergence of the algorithm while keeping the current set of objectives small. Empirical results show that DynaMOSA performs better than both WSA and MOSA in terms of branch [27, 8], statement [27], and strong mutation coverage [27].
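DynaMOSA's dynamic target selection can be sketched as follows (the names and the dictionary-based CDG encoding are illustrative; the real implementation operates on the control dependency graph built by the tool):

```python
def update_targets(current, newly_covered, control_dependents):
    """Drop covered targets and unlock the targets that are control
    dependent on them, as DynaMOSA does after every generation.

    control_dependents: target -> set of targets dominated by it in the CDG
    """
    unlocked = set()
    for t in newly_covered:
        unlocked |= set(control_dependents.get(t, ()))
    return (set(current) - set(newly_covered)) | unlocked

# b1 dominates b2 and b3: they become objectives only once b1 is covered.
cdg = {"b1": {"b2", "b3"}}
targets = update_targets({"b0", "b1"}, {"b1"}, cdg)
print(sorted(targets))  # ['b0', 'b2', 'b3']
```

Keeping the objective vector limited to the current frontier is what makes the many-objective ranking cheaper per generation than in plain MOSA.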

Recently, Panichella et al. [35] proposed a multi-criteria variant of DynaMOSA, which considers multiple, heterogeneous coverage targets simultaneously (e.g., branch, line, and weak mutation coverage). The algorithm relies on the enhanced control dependency graph (ECDG), enriched with structural dependencies among the coverage targets. The problem is then formulated as a many-objective optimization aimed at finding a set of test cases that minimizes the fitness functions for all targets, where the set of targets spans the different coverage criteria. These targets are dynamically selected using the ECDG while incrementally exploring the covered control dependency frontier. Empirical results show that even though the multi-criteria variant may, in a few cases, result in lower branch coverage than DynaMOSA, it reaches higher coverage on all the other criteria as well as showing better fault detection capability [35]. We use this multi-criteria version of DynaMOSA both as a baseline and to implement the proposed adaptive approach.

Large-scale studies. Campos et al. [22] and Panichella et al. [21] conducted two large-scale empirical studies comparing different approaches and meta-heuristics for test case generation. Their results showed that: (1) multi-target approaches are superior to the single-target approaches, and (2) many-objective search helps to reach higher coverage than other alternative multi-target approaches in a large number of classes. Besides, no search algorithm is the best for all classes under test [22]. These recent advances motivate our choice of focusing on many-objective search.

Non-coverage objectives. In recent years, several works investigated non-coverage aspects in addition to reaching high coverage. Lakhotia et al. considered dynamic memory consumption as a further objective to optimize together with branch coverage [10]. Ferrer et al. proposed a multi-objective approach considering at the same time code coverage (to maximize) and oracle cost (to minimize) [19]. Afshan et al. looked at code readability as a secondary objective to optimize [36] and used natural language models to generate tests having readable string inputs. In these studies, coverage and non-coverage test properties were considered as equally important objectives. However, empirical studies showed the difficulty of effectively balancing the two contrasting objectives without penalizing the final branch coverage [19]. Furthermore, these studies used a single-target approach rather than multi-target ones.

Palomba et al. [15] incorporated test cohesion and coupling metrics as secondary objectives within the preference criterion of MOSA to produce test cases that are more maintainable from a developer's point of view. Their approach produces more cohesive and less coupled test cases without reducing coverage. More recently, Albunian [16] investigated test case diversity as a further objective to optimize together with coverage in WSA.

Our work. Following the idea of considering non-coverage criteria as second-tier objectives with respect to coverage [15], we focus on performance-related objectives (i.e., memory consumption and runtime) for the generation of tests. This required us first to define reliable metrics/proxies that approximate test performance without incurring too expensive an overhead. However, a preliminary evaluation of this strategy showed that the overall coverage tends to decrease compared to optimizing coverage alone. For this reason, we introduced an adaptive strategy that enables/disables the secondary objectives when adverse effects on coverage are detected during the generation.

3 Approach

This section introduces the utilized performance proxies, their rationale, and their integration in DynaMOSA.

3.1 Performance Proxies

The accurate measurement of software system performance is known to be challenging: it requires measurements over multiple runs to account for run-to-run variation [24]. This means that we would need to re-run each generated test case hundreds of times to obtain rigorous runtime and memory-usage figures. This type of direct measurement is unfeasible for test case generation, where each search iteration generates several new tests that are typically executed only once for coverage analysis.

While a direct measurement is unfeasible in our context, various test case characteristics can be used to indirectly estimate the cost (runtime and memory) of the generated tests. According to Jin et al. [37], about 40% of real-world performance bugs stem from inefficient loops, while uncoordinated method calls and skippable functions account for respectively one third and a quarter of performance bugs. Object instantiations impact the heap memory usage [38], and the number of executed statements has been used in previous regression testing studies as a proxy for runtime [39]. Multiple studies investigate the performance impact prediction in the context of software performance analysis [40, 25, 41], but to the best of our knowledge, no prior work combined it with evolutionary unit test generation.

The closest studies are the ones by De Oliveira et al. [25] and Albert et al. [26], which fit the context of this study. However, both studies leveraged only a subset of the proxies investigated in this paper and focused on different testing problems and techniques. De Oliveira et al. [25] investigated performance proxies in the context of regression testing. Albert et al. [26] proposed three performance proxies for symbolic execution and showed their benefits on example programs. Symbolic execution can be used as an alternative technique to GAs for generating test cases; however, it has various limitations widely discussed in the literature [42, 43], such as the path explosion problem and the inability to handle external environmental dependencies and complex objects.

In this paper, we extend the set of performance proxies proposed in previous studies [26] and incorporate them within evolutionary test case generators in an adaptive fashion. We designed the performance proxies with the idea of estimating a test case’s performance (i.e., runtime and/or memory consumption) unobtrusively. We implemented two types of proxies: (i) static proxies, which utilize static analysis techniques such as AST parsing, and (ii) dynamic proxies, which rely on the instrumentation facilities available in EvoSuite [9]. We extract the control flow graph (CFG) and the number of times each branch in the CFG is covered by a given test (frequency). All production-code proxies are dynamic, while the proxies related to the test code are static.

In the following paragraphs, we describe each performance proxy separately and discuss for which dimension (memory or runtime) it is relevant.

Number of executed loops (I1). This proxy counts the number of loop cycles in the production code executed/covered by a given test case t. Higher loop cycle counts increase the runtime of the test case. To this aim, at instrumentation time, we use a depth-first traversal algorithm to detect loops in the CFG. When a test case is executed, we collect the number of times each branch involved in a loop is executed (execution frequency). Thus, the proxy value for t corresponds to the sum of the execution frequencies of all branches involved in loops. To avoid a negative impact on coverage, we require each loop to be covered at least once; therefore, this proxy only considers loop branches with a frequency higher than one.
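A minimal sketch of how this proxy can be computed from per-branch execution frequencies (the data layout and function name are assumptions for illustration):

```python
def loop_cycle_proxy(branch_frequencies, loop_branches):
    """Sum the execution frequencies of branches inside loops, counting
    only branches executed more than once, so that covering each loop a
    single time is never penalized.

    branch_frequencies: branch id -> times the branch was executed
    loop_branches: set of branch ids detected inside loops in the CFG
    """
    return sum(freq for branch, freq in branch_frequencies.items()
               if branch in loop_branches and freq > 1)

freqs = {"b_loop_a": 5, "b_loop_b": 1, "b_plain": 7}
# b_loop_b ran only once (not penalized), b_plain is not in a loop:
print(loop_cycle_proxy(freqs, {"b_loop_a", "b_loop_b"}))  # 5
```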

Number of method calls [26]. We implement two types of method call proxies: covered method calls (I2), which counts method calls along the paths of the CFG covered by a test t; and test case method calls (I3), which counts method calls in t itself. Notice that the former proxy considers the number of calls to each production method (i.e., the frequency) rather than a single boolean value denoting whether a method has been called, as in method coverage [8]. This is because a method can be invoked multiple times, either through indirect calls or within loops. Method calls directly impact memory usage [44]: every time a method is invoked, a new frame is allocated on top of the runtime stack. Further, method calls are dynamically dispatched to the right class, which might influence the runtime. Thus, fewer method calls should result in shorter runtimes and lower heap memory usage due to potentially fewer object instantiations.

Number of object instantiations (I4). Objects instantiated during test execution are stored on the heap. Thus, reducing the number of instantiated objects may decrease heap memory usage. This proxy counts the number of object instantiations triggered by a test case t. It analyzes the basic blocks of the CFG that t covers and increments a counter for every constructor call and local array definition statement. Notice that we consider the frequency (e.g., the number of constructor calls) rather than a binary value (i.e., called or not called). Moreover, the counter excludes constructor calls and local array definitions with a frequency of one, as we want to cover them at least once. We do not consider the size of instantiated objects, as it would require more complex and heavier instrumentation. We also do not consider primitive data types, which use memory as well, because their influence is negligible compared to objects [38].

Number of Statements [26]. Statement execution frequency is a well-known proxy for runtime [39]. Similarly to the method call proxies, we implement two types of statement-related proxies: covered statements (I5), which counts the statements in the production code covered by a test case, utilizing the dynamically computed CFG; and test case statements (I6), which corresponds to the number of non-method-call statements in a generated test case, determined statically by inspecting its abstract syntax tree.

Test Case Length (I7). This proxy counts the LOC (size) of a test case and therefore represents a superset of test case method calls (I3) and test case statements (I6). We include it for two reasons. First, it is a good performance proxy: longer tests make more method calls and execute more statements. Second, DynaMOSA uses test case length as a secondary objective to reduce the oracle cost [45]; relying on the same metric thus allows a fair comparison.

3.2 Performance-aware Test Case Generation

To successfully generate test suites with high target coverage and, at the same time, low computational requirements, we incorporate the performance proxies described in section 3.1 into the main loop of DynaMOSA [27]. We opt for DynaMOSA since it has been shown to outperform other search algorithms (e.g., WS, WSA, and MOSA) in branch and mutation coverage, positively affecting the test generation performance [27]. Additionally, its basic algorithm (i.e., MOSA) was used in prior studies to combine multiple testing criteria [8, 15]. Multiple approaches could be followed to this aim. One theoretical strategy consists of adding the performance proxies as further search objectives in addition to the coverage-based ones, merely following the many-objective paradigm of MOSA. However, this leads to a trade-off search between coverage and non-coverage objectives that is not meaningful in testing [15]: test cases that reduce memory usage but at the same time reduce the final coverage are useless in practice. Indeed, considering coverage and non-coverage criteria as equally important objectives results in tests with decreased coverage [19, 10, 17, 18, 16].

For these reasons, we investigate an alternative strategy where performance proxies are considered secondary objectives while coverage is the primary objective. At first, we experimented with the most straightforward approach, i.e., using the performance proxies as secondary criteria, as proposed in a prior study [15]. However, due to the detrimental effect on coverage caused by the optimization of such proxies, we refined this strategy with an adaptive mechanism that enables and disables the proxy usage depending on whether search stagnation is detected. We refer to this adaptive strategy as pDynaMOSA (Adaptive Performance-Aware DynaMOSA) (section 3.2.2).

3.2.1 Performance-Score as Secondary Objective

We first explore the integration of the performance proxies relying on the same methodology used in the previous study by Palomba et al. [15]. That approach replaces the original secondary criterion of MOSA (test case length) with a quality score based on test method coupling and cohesion. The new secondary criterion is used in two places: (i) in the preference criterion to build the first front, selecting the test case with the lowest quality score in case many of them have the same minimum objective value for an uncovered branch; and (ii) in the routine used to update the archive.

In this first formulation, we adopt the same methodology, replacing the quality score with the performance-score q(t), computed for each test case t as follows:

    q(t) = φ(I_1(t)) + φ(I_2(t)) + … + φ(I_7(t))        (1)

where I_1, …, I_7 denote the seven proxies described in section 3.1. To deal with different magnitudes, each proxy value is normalized in Equation 1 using the normalization function φ(x) = x / (x + 1) [7, 46].
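A sketch of this aggregation, assuming the widely used normalization x/(x+1) (the exact normalization function used by the tool is an assumption here):

```python
def normalize(x):
    """Squash a non-negative proxy value into [0, 1)."""
    return x / (x + 1.0)

def performance_score(proxy_values):
    """Aggregate a test's proxy values (e.g., the seven proxies of
    section 3.1) into a single score; lower means cheaper to run."""
    return sum(normalize(v) for v in proxy_values)

cheap = performance_score([0, 0, 1, 0, 2, 1, 3])
expensive = performance_score([40, 10, 9, 12, 300, 9, 25])
print(cheap < expensive)  # True: the cheaper test gets the lower score
```

Because each term is bounded by 1, no single proxy (e.g., a huge statement count) can dominate the aggregated score.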

A preliminary evaluation of this strategy highlighted that introducing the performance proxies in the described approach —i.e., even only as a secondary criterion— is strongly detrimental to branch coverage. We observed that the performance proxies strongly compete with coverage, e.g., test cases that trigger fewer method calls likely reach lower code coverage. For this reason, we devised a second approach, called pDynaMOSA, that is able to overcome this limitation. We include the preliminary approach’s results in the replication package [28].

3.2.2 Adaptive Performance-Aware DynaMOSA (pDynaMOSA)

pDynaMOSA uses an adaptive mechanism to decide whether or not to apply the performance proxies, depending on the search improvements observed across generations. We devised this strategy because continuously selecting the test cases with the lowest performance proxy values leads to reduced code coverage.

The pseudo-code of pDynaMOSA is outlined in Algorithm 1. The algorithm starts by building the ECDG (line 2 of Algorithm 1), which is then used to compute the initial set of coverage targets by selecting the ones that are not under control dependencies (line 4 of Algorithm 1). Subsequently, an initial population of test cases is randomly generated, and the archive is updated by collecting the individuals covering previously uncovered targets (lines 5 and 6 of Algorithm 1, respectively). Thus, the current set of targets is updated according to the execution results of the initial population (line 7 of Algorithm 1); the UPDATE-TARGETS function implements this functionality. The same call to UPDATE-TARGETS is repeated at every iteration of the algorithm, after offspring generation and execution (line 12 of Algorithm 1). The while loop in lines 8-25 evolves the population until all the targets are covered or the search budget is over. In each generation, pDynaMOSA creates the offspring (line 9 of Algorithm 1), i.e., new test cases synthesized by (i) selecting parents with tournament selection, (ii) combining parents with single-point crossover, and (iii) mutating the generated offspring with uniform mutation. Newly generated tests are executed against the CUT, and the corresponding objective scores and performance proxy values are computed.

Next, parents and offspring are ranked into non-dominance fronts (line 13) using the original preference sorting algorithm of MOSA [3]. The first front (i.e., tests with rank 0) is built with preference sorting, while the subsequent fronts are built using the non-dominated sorting algorithm of NSGA-II [3, 47]. Then, the population for the next generation is built by selecting test cases from parents and offspring considering both their ranking and a secondary heuristic (lines 16-25 of Algorithm 1). In MOSA, this secondary heuristic is the crowding distance, which promotes the most diverse test cases within the same front (i.e., among tests with the same rank). The crowding distance is in charge of ensuring diversity among the selected tests [33], which is a critical aspect of evolutionary algorithms [48]: a lack of diversity leads to stagnation in a local optimum [48, 27], which could reduce the probability of covering feasible branches.

In pDynaMOSA, we use both the crowding distance and the performance proxies as secondary heuristics. pDynaMOSA uses the routine GET-SECONDARY-HEURISTIC (lines 11 and 17-20 of Algorithm 1) to decide which of the two alternative secondary heuristics to apply, depending on whether search stagnation is detected. Algorithm 2 depicts the pseudo-code of the routine GET-SECONDARY-HEURISTIC. In the first generation, the default secondary heuristic is the one based on the performance proxies (lines 2-5 of Algorithm 2). In later generations, the secondary heuristic is chosen by (i) analyzing the current objective scores to detect stagnation and (ii) taking into account which heuristics were used in the previous generations. Stagnation is detected when no improvement is observed for any uncovered branch (lines 7-9), i.e., the fitness functions for all coverage criteria are unchanged over the last two generations. Two counters keep track of how often (i.e., in how many iterations) stagnation was detected while applying the crowding distance or the performance proxies, respectively. In case of stagnation, the algorithm selects the secondary heuristic with the lowest stagnation counter (lines 11-18 of Algorithm 2). Otherwise, the secondary heuristic for the current generation remains the same as in the previous iteration (lines 20-23).
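The switching policy just described can be sketched in a few lines of Java. The class and field names below are illustrative assumptions (not EvoSuite code), and the tie-breaking rule when both counters are equal is our own assumption:

```java
// Illustrative sketch of the adaptive heuristic switch: a stagnating
// generation is charged to the heuristic currently in use, and on
// stagnation the search continues with the heuristic that stagnated
// less often so far; on progress, the active heuristic's counter resets.
public class HeuristicSelector {
    enum Heuristic { PERFORMANCE_PROXIES, CROWDING_DISTANCE }

    private int perfStagnations = 0;   // stagnating generations under proxies
    private int crowdStagnations = 0;  // stagnating generations under crowding
    private Heuristic current = Heuristic.PERFORMANCE_PROXIES; // initial default

    Heuristic next(boolean stagnationDetected) {
        if (stagnationDetected) {
            if (current == Heuristic.PERFORMANCE_PROXIES) perfStagnations++;
            else crowdStagnations++;
            // pick the heuristic with the lowest stagnation counter
            // (tie-breaking toward the proxies is an assumption)
            current = perfStagnations <= crowdStagnations
                    ? Heuristic.PERFORMANCE_PROXIES
                    : Heuristic.CROWDING_DISTANCE;
        } else if (current == Heuristic.PERFORMANCE_PROXIES) {
            perfStagnations = 0;       // progress: reset the active counter
        } else {
            crowdStagnations = 0;
        }
        return current;
    }
}
```

Calling `next(true)` repeatedly alternates between the two heuristics, which matches the intent of escaping stagnation by trying the other secondary objective.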

Input : U: set of coverage targets of a program
        N: population size
        G: control dependency graph of a program
Result: A test suite (the archive)
 1 begin
 2    E ← EXTEND-CDG(G)
 3    t ← 0
 4    U* ← ENTRY-POINTS(U, E)                      // targets not under control dependencies
 5    P_t ← RANDOM-POPULATION(N)
 6    archive ← PERFORMANCE-UPDATE-ARCHIVE(∅, P_t)
 7    U* ← UPDATE-TARGETS(U*, E)
 8    while not (search budget consumed) and U* ≠ ∅ do
 9       Q_t ← GENERATE-OFFSPRING(P_t)             // selection, crossover, mutation
10       EXECUTE(Q_t)                              // objective scores and proxy values
11       h ← GET-SECONDARY-HEURISTIC(Q_t, t)
12       U* ← UPDATE-TARGETS(U*, E)
13       F ← PREFERENCE-SORTING(P_t ∪ Q_t, U*)
14       archive ← PERFORMANCE-UPDATE-ARCHIVE(archive, Q_t)
15       P_{t+1} ← ∅; d ← 0
16       while |P_{t+1}| + |F_d| ≤ N do
17          if h is crowding-distance then
18             CROWDING-DISTANCE-ASSIGNMENT(F_d)
19          else
20             PERFORMANCE-SCORE-ASSIGNMENT(F_d)
21          P_{t+1} ← P_{t+1} ∪ F_d
22          d ← d + 1
23       Sort(F_d)                                 /* according to h */
24       P_{t+1} ← P_{t+1} ∪ F_d[1 : N − |P_{t+1}|]
25       t ← t + 1
26    return archive
Algorithm 1 pDynaMOSA Pseudo-Algorithm

Once the secondary heuristic for the current iteration is selected, pDynaMOSA assigns a secondary score to every test case in each dominance front (lines 18 and 20 of Algorithm 1), based on either the crowding distance or the performance proxies. If the employed secondary heuristic is the crowding distance, the secondary score of a test corresponds to its crowding distance computed with the subvector dominance assignment by Köppen et al. [49, 27]. Otherwise, if the performance proxies are selected as the secondary heuristic, the secondary score of each test case τ in a front F_d is computed as follows:

    perf-score(τ, F_d) = Σ_{k=1}^{7} ( max_k(F_d) − p_k(τ) ) / ( max_k(F_d) − min_k(F_d) )        (1)

where p_k(τ) is the value of the k-th proxy for the test τ, and max_k(F_d) and min_k(F_d) are the maximum and the minimum values of the k-th proxy among all tests in the front F_d. The performance-heuristic takes a value in [0, 7]: a zero value is obtained when the test case τ has the largest (worst) proxy values among all tests within the same front F_d, i.e., p_k(τ) = max_k(F_d) for every k; the maximum value of seven (the total number of proxies) is obtained when τ has the lowest (best) proxy values within the front, i.e., p_k(τ) = min_k(F_d) for every k. Therefore, higher values of the performance-heuristic are preferable.
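Equation 1 can be sketched directly in Java. This is a minimal illustration, not the EvoSuite implementation: the front is modeled as a list of proxy vectors, and we assume (our own guard) that a proxy with a constant value across the front contributes nothing to the sum:

```java
import java.util.List;

public class PerformanceScore {
    // Performance-heuristic of Equation 1 for the test at index `test`.
    // proxies.get(i)[k] holds the value of the k-th proxy for the i-th test
    // of the front; lower proxy values mean lower predicted resource usage.
    static double score(int test, List<double[]> proxies) {
        int k = proxies.get(0).length;           // number of proxies (7 in the paper)
        double sum = 0.0;
        for (int j = 0; j < k; j++) {
            double max = Double.NEGATIVE_INFINITY, min = Double.POSITIVE_INFINITY;
            for (double[] p : proxies) {         // range of the j-th proxy in the front
                max = Math.max(max, p[j]);
                min = Math.min(min, p[j]);
            }
            if (max > min)                       // assumption: skip constant proxies
                sum += (max - proxies.get(test)[j]) / (max - min);
        }
        return sum;                              // in [0, k]; higher is better
    }
}
```

For a front of two tests, the test with the smallest value for every proxy scores k (here 2), while the one with the largest values scores 0, mirroring the bounds discussed above.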

The crowding distance and the performance-heuristic are then used in lines 21 and 23 of Algorithm 1 to select test cases from the fronts F_0, F_1, ... until the next population reaches the maximum size N. When the crowding distance is used, more diverse tests within each front have a higher probability of being selected for the next population. When the performance-heuristic is used, the tests with lower predicted resource demands are favored instead. Notice that pDynaMOSA adopts a performance-based version of the archive update, which works as follows: when a test case τ covers a previously uncovered branch b, τ is automatically added to the archive. Otherwise, if a new test τ hits an already covered branch b, τ replaces the test case currently stored in the archive for b if and only if it has a better performance score (i.e., a lower predicted resource usage). DynaMOSA, on the contrary, employs the preference criterion for this decision.
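The performance-based archive update can be sketched as follows. This is a simplified illustration, not EvoSuite code: tests are identified by strings and their predicted resource usage is collapsed into a single scalar cost, where lower means cheaper:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of PERFORMANCE-UPDATE-ARCHIVE: a branch covered for the first
// time always stores its covering test; for an already covered branch,
// the stored test is replaced only when the new test is cheaper.
public class PerformanceArchive {
    static final class Entry {
        final String test;
        final double cost;
        Entry(String test, double cost) { this.test = test; this.cost = cost; }
    }

    private final Map<String, Entry> byBranch = new HashMap<>();

    void update(String branch, String test, double cost) {
        Entry cur = byBranch.get(branch);
        if (cur == null || cost < cur.cost)      // keep the cheaper test
            byBranch.put(branch, new Entry(test, cost));
    }

    String testFor(String branch) {
        Entry e = byBranch.get(branch);
        return e == null ? null : e.test;
    }
}
```

Over a run, the archive thus converges toward the cheapest known test for each covered branch, which is what ultimately yields less expensive suites.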

Input : Q_t: new offspring; t: the current iteration
Result: h_t: the secondary heuristic for the current generation
 1 begin
 2    if t == 0 then
         // Counters for generations with stagnation
 3       PerformanceCounter ← 0
 4       CrowdingCounter ← 0
 5       return PerformanceHeuristic              /* Initial heuristic */
 6    stagnation ← TRUE
 7    for each u ∈ U* such that u is not covered do
 8       if the best objective value for u in generation t is better than in generation t−1 then
 9          stagnation ← FALSE
10
11    if stagnation == TRUE then
         // Select the heuristic with the lowest stagnation counter
12       if h_{t−1} == PerformanceHeuristic then
13          PerformanceCounter ← PerformanceCounter + 1
14       else
15          CrowdingCounter ← CrowdingCounter + 1
16       if PerformanceCounter ≤ CrowdingCounter then
17          return PerformanceHeuristic
18       else return CrowdingDistance
19
20    else
         // Keep the heuristic used in the previous iteration
21       if h_{t−1} == PerformanceHeuristic then PerformanceCounter ← 0
22       else CrowdingCounter ← 0
23       return h_{t−1}
Algorithm 2 GET-SECONDARY-HEURISTIC Pseudo-Algorithm

4 Empirical Study

The goal of the empirical evaluation is to assess the effectiveness of pDynaMOSA in comparison with DynaMOSA. The results are of interest to practitioners aiming to employ more efficient test suites that support frequent builds in modern CI/CD pipelines. We conduct an empirical study evaluating three dimensions: (i) we consider seven different coverage criteria (the default ones of EvoSuite); (ii) we compare fault detection effectiveness, measured by strong mutation; and (iii) we compare performance, measured by runtime and heap memory consumption. Therefore, we investigate the following research questions:

RQ1. (Effectiveness) What is the target coverage achieved by pDynaMOSA compared to DynaMOSA?

With this first research question, we evaluate the seven default criteria provided by EvoSuite, optimized by pDynaMOSA via many-objective optimization. The criteria are the following: branch, line, weak mutation, method, input, output, and exception coverage. In particular, we investigate whether and how the introduction of the performance proxies affects the target coverage of each criterion.

RQ2. (Fault Detection) What is the mutation score achieved by pDynaMOSA compared to DynaMOSA?

The second research question extends the comparison between pDynaMOSA and DynaMOSA to fault detection effectiveness. Tests generated with the proposed performance-aware approach might have a different structure (e.g., contain fewer statements and method calls). Therefore, we conduct a mutation-based analysis assessing whether the optimization performed by the performance proxies is detrimental to fault detection effectiveness.

RQ3. (Performance) Does the adoption of performance proxies lead to shorter runtime and lower heap memory consumption?

The last research question investigates to what extent pDynaMOSA is able to reduce the performance impact of the generated tests. We investigate whether the approach is able to generate tests with better performance while maintaining stable code coverage, i.e., branch coverage. In particular, we investigate two dimensions: time, measured as runtime, and memory, measured as the heap memory consumption of the generated tests.

For both RQ1 and RQ2 we also compare our approach to Random Search.

Prototype Tool. We implemented pDynaMOSA in a prototype tool extending the EvoSuite test data generation framework, as explained in Section 3.2. The source code of the prototype tool is available on GitHub (https://github.com/giograno/evosuite). All experimental results reported in this paper are obtained using this prototype tool. Moreover, a runnable version of the tool is available for download in the replication package [28].

4.1 Subjects

The context of our study is a random selection of classes from different test benchmarks widely used in the SBST (Search-Based Software Testing) community: (i) the SF110 corpus [50], (ii) the 5th edition of the Java Unit Testing Tool Competition at SBST 2017 [51], and (iii) benchmarks used in previous papers on test data generation [27, 3]. The SF110 benchmark (http://www.evosuite.org/experimental-data/sf110/) is a set of Java classes, extracted from 100 projects in the SourceForge repository, widely exploited in the literature [20, 52]. We randomly sampled the aforementioned benchmarks, discarding trivial classes [27], i.e., the ones with a cyclomatic complexity below 5. In total, we selected 110 Java classes from 27 different projects, containing 29,842 branches and 139,519 mutants that constitute the coverage targets in our experiment. Table I reports the characteristics of the classes grouped by project.

Project  #Classes  Branches (Min / Max / Mean)  Mutants (Min / Max / Mean)
a4j 2 30 124 77 15 911 463
bcel 4 52 890 475 408 1,523 1,043
byuic 1 722 722 722 2,173 2,173 2,173
fastjson 10 20 2,880 564 36 13,152 2,078
firebird 3 90 194 131 347 441 392
fixsuite 1 32 32 32 110 110 110
freehep 6 48 160 92 112 807 297
freemind 1 170 170 170 2,427 2,427 2,427
gson 4 60 660 285 126 2,870 1,212
image 7 34 274 140 214 1,676 589
javathena 1 230 230 230 752 752 752
javaviewcontrol 2 212 2,360 1,286 2,058 4,972 3,515
jdbacl 2 170 174 172 595 700 648
jiprof 1 816 816 816 6,420 6,420 6,420
jmca 2 198 1,696 947 2,436 9,669 6,052
jsecurity 1 52 52 52 165 165 165
jxpath 3 98 102 100 204 449 312
la4j 7 20 280 135 196 3,217 1,122
math 4 14 238 92 135 1,274 443
okhttp 5 64 542 194 200 2,571 846
okio 9 24 562 126 34 4,271 1,009
re2j 8 68 646 178 148 2,096 1,129
saxpath 1 458 458 458 659 659 659
shop 4 38 182 102 175 465 302
webmagic 4 10 142 84 29 337 201
weka 10 212 778 359 255 13,263 2,220
wheelwebtool 7 24 804 331 75 3,898 1,637
Total 110
TABLE I: Java Projects and Classes in Our Study

4.2 Experimental Protocol

We run each strategy for each class in our dataset, collecting the resulting branch coverage and mutation score. For this, the generated test cases/suites are post-processed in EvoSuite: input data values and test cases are minimized after the search process terminates. During this minimization, statements that do not contribute to satisfying the covered targets are removed from the individual test cases. These post-processing steps are applied to both search algorithms under study. We set the maximum search time to 180 seconds [8]; hence, the search stops either when full coverage is reached or when the time budget runs out. We set an additional timeout of 10 minutes after the search for the mutation analysis, to account for the overhead of re-executing each test case against the target mutants. To deal with the non-deterministic nature of the employed algorithms, each run is repeated 50 times [8]. We adopt the default GA parameters used by EvoSuite [2], since a previous study showed that these default values provide good results [53].

Mutation-Based Analysis. To evaluate the fault detection effectiveness of pDynaMOSA, we rely on strong mutation analysis. Several reasons drive this choice: multiple studies showed a significant correlation between mutant detection and real-fault detection [54, 55]. Moreover, mutation testing is widely accepted as a high-end coverage criterion [56] and was shown to be a superior measure of test case effectiveness compared to other criteria [57, 58]. The underlying idea of mutation testing is the creation of artificially modified versions of the original source code, called mutants [59]. These changes are introduced in the production code by mutation operators that aim to mimic real faults [54]. Finally, each test suite is run against the generated mutants and evaluated based on its mutation score, i.e., the ratio of killed (detected) mutants to the total number of generated ones. Despite being powerful, mutation testing has the disadvantage of being extremely expensive, since it requires (i) the generation and compilation of the mutants and (ii) the execution of the test suite against them.

To perform our analysis, we rely on EvoSuite's built-in mutation engine [60], which implements eight different mutation operators, i.e., Delete Call, Delete Field, Insert Unary Operator, Replace Arithmetic Operator, Replace Bitwise Operator, Replace Comparison Operator, Replace Constant, and Replace Variable. We opt for EvoSuite's engine for two reasons: first, it makes strong mutation analysis straightforward; second, it was shown that the mutation scores computed by EvoSuite are close to results on real-world software [60], which motivated recent works to rely on it [27, 35].
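To illustrate one of the operators above, consider a hypothetical Replace Comparison Operator mutant (the class below is an invented example, not taken from the benchmark). Only a test exercising the boundary input can kill it:

```java
// Hypothetical class under test: the mutant flips >= into >, so the
// original and the mutant behave identically except when amount == balance.
public class Account {
    private final int balance;

    public Account(int balance) { this.balance = balance; }

    public boolean canWithdraw(int amount) {
        // original:  return balance >= amount;
        return balance > amount;   // Replace Comparison Operator mutant
    }
}
```

A test asserting `new Account(10).canWithdraw(10)` expects `true` on the original code but observes `false` on the mutant, and therefore kills it; a test that only uses, say, `amount = 5` cannot distinguish the two versions and leaves the mutant alive.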

Performance Measurement. To evaluate the performance, we compare the runtimes and heap memory usage of the test suites generated by DynaMOSA and pDynaMOSA. A perfect measurement would require, for each approach, two identical test suites in terms of branch coverage and executed statements. However, due to the randomness of the employed algorithms, such an evaluation is not doable in practice. Therefore, for each class we select and evaluate the test suite with the median coverage among the 50 generated versions. We built a custom toolchain that transforms the source code files for performance measurement, compiles the augmented versions, and runs the test suites with the EvoSuite standalone runtime. The transformer employs JavaParser (https://github.com/javaparser/javaparser) for the AST transformations. It adds, for every test case, a method executed before (@Before) and after (@After) it, which reports the current performance counters. These counters, as reported by Java's MXBeans (RuntimeMXBean, MemoryMXBean, GarbageCollectorMXBean, and OperatingSystemMXBean), are: the current time stamp (in nanoseconds), the heap size (in bytes), the gc count (number of garbage collections since the virtual machine started), and the gc time (in milliseconds). We executed the performance measurements on a bare-metal machine reserved exclusively for the measurements, i.e., without user-level background processes (except ssh) running. The machine has a 12-core Intel Xeon X5670@2.93GHz CPU with 70 GiB memory and runs Arch Linux with kernel version 4.15.12-1-ARCH. Its disk is a Hitachi HUA722020ALA330 with 7,200 rpm and a 32 MB cache.
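Sampling these counters around a test case can be sketched with the standard `java.lang.management` API; the JUnit @Before/@After wiring and the reporting are omitted, and the class name is ours:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// One snapshot of the counters described above; diffing two snapshots
// taken before and after a test case approximates its runtime, heap
// growth, and gc activity.
public class PerfSample {
    long timeNs, heapBytes, gcCount, gcTimeMs;

    static PerfSample take() {
        PerfSample s = new PerfSample();
        s.timeNs = System.nanoTime();
        s.heapBytes = ManagementFactory.getMemoryMXBean()
                                       .getHeapMemoryUsage().getUsed();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            s.gcCount += Math.max(0, gc.getCollectionCount());  // -1 if undefined
            s.gcTimeMs += Math.max(0, gc.getCollectionTime());
        }
        return s;
    }
}
```

The gc count in the "after" snapshot also tells the post-processing step whether a garbage collection happened during the test, which is what invalidates a heap diff.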

We execute and measure each test suite 1,000 times (forks), each in a fresh JVM, resembling the methodology proposed by Georges et al. [61]. In a post-processing step, we compute the diffs for each test case and sum over all test cases to retrieve the overall performance (i.e., runtime and heap size) of each test suite. As heap memory diffs might be influenced by gc activity and therefore be invalid, we replace the heap memory diff of affected runs with the median of the valid results (i.e., those not affected by gc activity) obtained for the same test case in the other forks.
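The median-replacement step can be sketched as follows; this is a simplified illustration (method and class names are ours) that takes, for one test case, the heap diffs of all forks together with a flag marking gc-affected forks:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class HeapDiffRepair {
    // Returns the median of the heap diffs from forks not perturbed by gc;
    // this value replaces the diff of any gc-affected fork for this test.
    static double medianOfValid(List<Double> diffs, List<Boolean> gcAffected) {
        List<Double> valid = new ArrayList<>();
        for (int i = 0; i < diffs.size(); i++)
            if (!gcAffected.get(i)) valid.add(diffs.get(i));
        Collections.sort(valid);
        int n = valid.size();
        return n % 2 == 1 ? valid.get(n / 2)
                          : (valid.get(n / 2 - 1) + valid.get(n / 2)) / 2.0;
    }
}
```

The median (rather than the mean) keeps a single remaining outlier among the valid forks from distorting the replacement value.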

We rely on the non-parametric Wilcoxon Rank-Sum Test [62] with significance level α = 0.05. Significant p-values allow us to reject the null hypothesis, i.e., that the two algorithms achieve the same branch coverage (RQ1), yield the same mutation score (RQ2), and have the same runtime and heap memory consumption (RQ3). We estimate the effect size, i.e., the magnitude of the difference between the measured metrics, with the Vargha-Delaney (Â12) statistic [63]. It has the following interpretation: for the target coverage and the mutation score, Â12 > 0.5 indicates that DynaMOSA (or random search) achieves a higher coverage compared to pDynaMOSA, while Â12 < 0.5 describes the opposite; for runtime and memory consumption, Â12 < 0.5 indicates that the suites generated by pDynaMOSA respectively run faster or use less memory than the ones generated by DynaMOSA. The Â12 values are further classified into four effect-size levels (negligible, small, medium, and large) [63].
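The Vargha-Delaney statistic admits a direct computation from two observation samples: it is the probability that a value drawn from the first sample exceeds one drawn from the second, with ties counting one half. A self-contained sketch (class name ours):

```java
public class VarghaDelaney {
    // A12 effect size between two samples; 0.5 means no difference,
    // 1.0 means every value in `first` exceeds every value in `second`.
    static double a12(double[] first, double[] second) {
        double wins = 0.0;
        for (double a : first)
            for (double b : second) {
                if (a > b) wins += 1.0;
                else if (a == b) wins += 0.5;
            }
        return wins / ((double) first.length * second.length);
    }
}
```

Because A12 only compares ranks, it is robust to the skewed runtime distributions typical of performance measurements.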

5 Results & Discussion

This section discusses the results of the study answering the research questions formulated at the beginning of section 4.

5.1 RQ1 - Effectiveness

Fig. 1: Comparison of coverage achieved by Random Search, DynaMOSA, and pDynaMOSA over 50 independent runs for the 110 studied subjects

Table II summarizes the coverage results achieved by random search, DynaMOSA, and pDynaMOSA over seven different coverage criteria. For each approach, we report (i) the mean coverage for each criterion over the 110 subjects and (ii) the number of classes for which pDynaMOSA is statistically better than, worse than, or equivalent to random search and DynaMOSA. For the latter statistics, we report the results of the Wilcoxon test and discuss the effect size. Full results at class level are reported in the replication package [28].

Table II shows the comparison results between DynaMOSA and pDynaMOSA. For branch coverage, they are almost equivalent: the former achieves on average 1 percentage point (pp) more over the CUTs, being statistically significantly better for 18 out of 110 classes. However, for the majority of these cases, i.e., 10 out of 18, the effect size of the difference is small. Vice versa, pDynaMOSA outperforms DynaMOSA significantly in three out of 110 classes, with an average difference of 2pp and a medium effect size. For 89 classes out of 110 (81%) there is no statistical difference between the two approaches. The average branch coverage over the entire set of subjects is about 72% and 71%, respectively for DynaMOSA and pDynaMOSA. Similar results can be observed for line (88/110) and weak mutation (82/110) coverage, where the two approaches do not show a statistically significant difference. Both approaches have the same average coverage for these two criteria over the entire set of classes (i.e., 76% and 77% for line and weak mutation coverage respectively). For the remaining criteria, i.e., method, input, output, and exception, the number of subjects with no statistically significant difference increases, ranging from 85% to 93% of the CUTs. For only one subject DynaMOSA is able to cover more exceptions than pDynaMOSA. Overall, for none of the investigated coverage criteria do we observe large differences between DynaMOSA and pDynaMOSA.

Let us now consider the comparison between pDynaMOSA and random search (Table II). For branch coverage, pDynaMOSA achieves on average +4pp over all the subjects. 66 out of 110 classes are statistically significantly better, while only 9 out of 110 classes are worse. For 52 of these 66 subjects, the magnitude of the difference is large. The largest improvement is obtained for MnMinos (freehep project) where pDynaMOSA covers 28% more branches on average. We observe similar results for all other criteria but the exception coverage where random search is not statistically significantly different for 104 out of 110 subjects. pDynaMOSA achieves +5pp and +6pp for line and weak mutation coverage. While pDynaMOSA reaches +4pp for the method coverage criterion, it achieves +28pp and +12pp for input and output coverage respectively.

Figure 1 compares the coverage scores achieved by the three approaches over the distinct criteria. It highlights that DynaMOSA and pDynaMOSA have similar distributions for the different target coverages. On the other hand, except for exception coverage, pDynaMOSA achieves higher coverage scores than random search.

RQ1 Summary. Across seven criteria, pDynaMOSA achieves similar levels of coverage compared to DynaMOSA, while it outperforms random search.

Criterion Average Coverage Random vs pDynaMOSA DynaMOSA vs pDynaMOSA
Random DynaMOSA pDynaMOSA #Better #Worse #No Diff. #Better #Worse #No Diff.
Branch 0.67 0.72 0.71 66 9 35 3 18 89
Line 0.71 0.76 0.76 76 6 28 2 20 88
Weak Mutation 0.71 0.77 0.77 87 4 19 4 20 86
Method 0.93 0.97 0.97 61 0 49 4 4 102
Input 0.55 0.83 0.83 106 0 4 12 4 94
Output 0.41 0.53 0.53 94 0 16 8 5 97
Exception 0.99 0.98 0.99 3 3 104 16 1 93
TABLE II: Comparison between Random Search, DynaMOSA, and pDynaMOSA on the considered criteria

5.2 RQ2 - Fault Detection

Figure 1 shows the distributions of the mutation scores in box plot notation (on the extreme right, along with the other criteria) achieved by the three approaches for the 110 subjects over 50 runs. We notice that the distributions of pDynaMOSA and DynaMOSA are similar: the median is 30% for both approaches. Both approaches considerably outperform random search (median of 24%).

Table III reports the results of the strong mutation coverage achieved by random search, DynaMOSA, and pDynaMOSA. We report (i) the mutation scores averaged over the different projects and (ii) the number of cases in each project where pDynaMOSA is better, worse, or equivalent (according to the Wilcoxon test) compared to random search and DynaMOSA. We share the full results at class level in the replication package [28].

From Table III we observe that pDynaMOSA significantly outperforms random search in 88 out of 110 cases, corresponding to 97% of all the CUTs. The improvement of pDynaMOSA for those classes ranges between 0.7% and 34%, with an average improvement of 7%. In 79 out of these 88 cases, the magnitude of the difference is large. On the other hand, random search is significantly better than pDynaMOSA for only one class.

Let us consider the comparison between pDynaMOSA and DynaMOSA. Similar to what we observed for RQ1, in the majority of the cases, i.e., 88 out of 110, there is no statistical difference between the mutation scores of the two approaches. In total, for 16 classes out of 110, DynaMOSA achieves a significantly higher mutation score. However, for half of these cases the magnitude of the difference is small. The improvement of DynaMOSA ranges from 0.7% to 5% (for the Class2HTML class), with an average increase of 2%. Conversely, pDynaMOSA is better than DynaMOSA in 5 cases out of 110: in these cases, pDynaMOSA achieves +2pp mutation score, with the most substantial difference of +10pp for the class Product from a4j.

In the few cases where pDynaMOSA performs worse than DynaMOSA, the difference is due to a slight difference in branch coverage. There is a direct relation between code coverage and fault detection effectiveness: if a mutant is not covered, it cannot be killed. For example, let us consider the class Parser from re2j, which has 667 branches and 501 mutants. pDynaMOSA achieves 62.26% average branch coverage compared to 63.11% achieved by DynaMOSA. However, neither of the two sets of killed mutants is a subset of the other: pDynaMOSA kills nine mutants not killed by DynaMOSA, while DynaMOSA kills 18 mutants not killed by pDynaMOSA. Listing 1 shows an example of a mutant killed by DynaMOSA only. The mutant is injected into the first if statement of the private method removeLeadingString. The statement is covered by both approaches through indirect method calls; however, only the test cases produced with DynaMOSA are able to kill the mutant. The reason is that killing it requires instantiating an object of class Regexp with proper attributes op and subs, which can only be done by invoking additional methods of Regexp. pDynaMOSA is designed to reduce the number of method calls (to reduce heap memory consumption); therefore, in some runs, it generates tests without properly setting the input object re. This example suggests that there is room for further improvement of pDynaMOSA by handling method calls differently, depending on whether they are needed for proper input instantiation or for testing the CUT.

public class Parser {
    private Regexp removeLeadingString(Regexp re, int n) {
        // original code:
        // if ((re.op == Regexp.Op.CONCAT) && (re.subs.length == 0))
        // mutant:
        if ((re.op == Regexp.Op.CONCAT) && (re.subs.length > 0))
            { ... }
    }
}
Listing 1: Example of a mutant killed by DynaMOSA but alive with pDynaMOSA

RQ2 Summary. pDynaMOSA achieves similar levels of mutation score compared to DynaMOSA, while both outperform random search.

Project Classes Mutation Score Statistics
Random DynaMOSA pDynaMOSA Random vs pDynaMOSA DynaMOSA vs pDynaMOSA
#Better #Worse #No Diff. #Better #Worse #No Diff.
fixsuite 1 0.06 0.04 0.07 0 0 1 0 0 1
a4j 2 0.24 0.20 0.20 1 1 0 0 0 2
wheelwebtool 7 0.24 0.31 0.31 7 0 0 0 1 6
weka 10 0.21 0.24 0.24 8 0 2 2 0 8
math 4 0.31 0.34 0.34 3 0 1 0 1 3
firebird 3 0.48 0.51 0.51 3 0 0 0 1 2
image 7 0.29 0.36 0.36 6 0 1 0 1 6
bcel 4 0.33 0.40 0.38 3 0 1 0 2 2
saxpath 1 0.57 0.60 0.59 1 0 0 0 0 1
okhttp 5 0.28 0.34 0.34 5 0 0 0 0 5
shop 4 0.34 0.40 0.40 4 0 0 0 0 4
javaviewcontrol 2 0.12 0.17 0.17 2 0 0 0 0 2
fastjson 10 0.28 0.36 0.35 10 0 0 1 3 6
jxpath 3 0.51 0.54 0.55 3 0 0 0 0 3
la4j 7 0.25 0.32 0.30 4 0 3 0 0 7
freehep 6 0.25 0.37 0.35 4 0 2 0 2 4
okio 9 0.24 0.33 0.32 7 0 2 0 1 8
freemind 1 0.19 0.21 0.21 1 0 0 0 0 1
re2j 8 0.29 0.31 0.31 3 0 5 1 1 6
webmagic 4 0.40 0.42 0.42 2 0 2 0 0 4
gson 4 0.16 0.20 0.19 3 0 1 0 1 3
jdbacl 2 0.37 0.44 0.43 2 0 0 1 1 0
javathena 1 0.22 0.24 0.24 1 0 0 0 0 1
byuic 1 0.08 0.11 0.11 1 0 0 0 0 1
jiprof 1 0.06 0.13 0.12 1 0 0 0 1 0
jsecurity 1 0.29 0.34 0.35 1 0 0 0 0 1
jmca 2 0.19 0.29 0.29 2 0 0 0 0 2
Mean over projects 0.27 0.32 0.31
No. cases pDynaMOSA is better than Random 88 (96.8%)
No. cases pDynaMOSA is worse than Random 1 (1.1%)
No. cases pDynaMOSA is better than DynaMOSA 5 (5.5%)
No. cases pDynaMOSA is worse than DynaMOSA 16 (17.6%)
TABLE III: Mean mutation score achieved for each project

5.3 RQ3 - Performance

In this section, we compare the runtime and heap memory consumption of the test suites generated by pDynaMOSA and DynaMOSA. To have a fair comparison (i.e., between suites with the same coverage), we first select the classes with no statistical difference in branch coverage (from RQ1). Then, for every subject and approach, we evaluate the test suite with the median coverage achieved over the 50 runs.

Table IV summarizes the performance results of the suites generated by pDynaMOSA and DynaMOSA. We observe that pDynaMOSA is faster in 41 out of 63 cases (~65%), with a large effect size in 33 out of 41 cases. For these subjects, runtime improves on average by ~36% (with a median of 12%), ranging from 1.65% to 265% (for the class SimplexBuilder). Despite the faster runtime, we do not observe lower values on any target coverage compared to DynaMOSA: the differences between the two approaches are below 1% for branch, line, and weak mutation coverage.

Fig. 4: Comparison of runtime (a) and heap memory consumption (b) for the suites generated by DynaMOSA and pDynaMOSA for the JSONLexerBase class over 1,000 independent runs

On the other hand, in ~31% of the cases (20 out of 63), the runtime of the test suites generated by DynaMOSA is significantly faster than that of the suites generated by pDynaMOSA, with differences ranging from 1.43% to 40.77% (for the class ForwardBackSubstitutionSolver), an average of 16%, and a median of 19%. While the target coverages over the different criteria do not differ, the test suites generated by pDynaMOSA reach on average +1pp mutation coverage. In particular, for the class ForwardBackSubstitutionSolver, where DynaMOSA obtains its maximum runtime improvement over pDynaMOSA, the latter achieves +2pp mutation coverage.

Let us now consider the heap memory consumption comparison between DynaMOSA and pDynaMOSA, summarized in Table IV. Similarly to what we observed for the runtime analysis, the heap memory consumption of pDynaMOSA is statistically significantly lower in 43 out of 63 cases (~68%), 40 of which with a large effect size. We observe a 22% average reduction in heap memory consumption (median = 15%), up to a maximum of -117pp for the class JSONLexerBase. At the same time, we do not observe any variation in target coverages, with all differences below 1pp.

Vice versa, the test suites generated by DynaMOSA show lower heap memory consumption in 19 cases out of 63 (~30%). In 12 out of 19 cases, the effect size of the difference is large. In these cases, the suites generated by pDynaMOSA show +7pp heap memory consumption on average (median = 5%), up to +39pp for the class Product. We do not observe any significant difference in target coverage among the different criteria (all differences are below 1pp). However, while scoring higher heap memory consumption, pDynaMOSA also reaches +1.5pp on average for mutation score. For the Product class, where DynaMOSA maximizes the heap memory reduction compared to pDynaMOSA, the latter approach achieves +27pp mutation score.

As a concrete example, Figure 4 shows the distributions of runtime and heap memory consumption for the suites generated by DynaMOSA and pDynaMOSA, over 1,000 independent runs each, for the JSONLexerBase class. These two profiled suites achieve similar levels of coverage over the seven different criteria. However, the median runtime is ~194 milliseconds (ms) for pDynaMOSA versus ~344 ms for DynaMOSA, while the median heap memory consumption is ~900 MB for pDynaMOSA versus ~1,765 MB for DynaMOSA.

Project Classes Runtime (in ms) Memory Consumption (in MB) Statistics
DynaMOSA pDynaMOSA DynaMOSA pDynaMOSA Runtime Memory Consumption
#Better #Worse #No Diff. #Better #Worse #No Diff.
jmca 1 143.47 190.40 590.09 603.65 0 1 0 0 1 0
jdbacl 2 3.14 3.48 5.68 4.26 0 0 2 0 0 2
javaviewcontrol 3 138.79 168.86 257.26 231.33 1 2 0 3 0 0
jsecurity 1 564.60 566.64 387.30 386.56 0 0 1 1 0 0
byuic 1 119.10 103.43 313.25 290.71 1 0 0 1 0 0
shop 3 40.00 35.75 212.23 193.38 3 0 0 3 0 0
bcel 2 243.75 286.41 1,910.24 2,205.21 1 1 0 0 2 0
a4j 1 48.35 48.75 17.32 28.47 0 0 1 0 1 0
firebird 3 92.48 75.62 295.70 273.52 2 1 0 2 1 0
fastjson 6 179.39 145.30 635.24 449.17 5 1 0 3 2 1
webmagic 3 53.96 53.45 195.35 200.11 2 1 0 2 1 0
okio 6 62.47 62.35 238.40 227.20 5 1 0 5 1 0
math 4 323.59 201.01 190.47 158.09 3 1 0 1 3 0
image 5 69.38 66.95 255.70 238.85 3 2 0 3 2 0
jxpath 2 131.98 118.92 216.47 186.21 2 0 0 2 0 0
gson 2 49.62 58.02 189.41 180.44 0 2 0 1 1 0
freehep 4 223.89 172.72 147.88 133.86 2 2 0 3 1 0
la4j 6 227.20 164.14 284.32 199.07 4 2 0 6 0 0
re2j 4 32.58 30.41 199.05 195.39 2 2 0 1 2 1
okhttp 4 39.68 36.55 90.62 85.30 3 0 1 4 0 0
weka 3 71.04 73.26 243.94 230.78 2 1 0 2 1 0
Mean over projects 136.12 126.78 327.43 319.12 41 (65.08%) 20 (31.75%) 5 (7.94%) 43 (68.25%) 19 (30.16%) 4 (6.35%)
TABLE IV: Mean runtime and memory consumption achieved for each project

RQ3 Summary. When achieving the same target coverage, the suites generated by pDynaMOSA have lower runtime and lower heap memory consumption for ~65% and ~68% of the subjects, respectively.

5.4 Discussion

Listing 2 depicts two test cases for the class GaussianSolver (from la4j), generated by DynaMOSA and pDynaMOSA respectively. The former is the test with the highest runtime in the entire suite, at about 160 ms on average over 1,000 runs. First, it creates a diagonal SparseMatrix of size 795 (line 3 of Listing 2). Second, it instantiates a GaussianSolver object from the created matrix (lines 4-5 of Listing 2). Then, it creates a SparseVector object of size and capacity 795 (line 6 of Listing 2). Finally, it executes the method solve, which solves the corresponding linear system (line 7 of Listing 2). Let us now consider the test generated by pDynaMOSA for the same class. Here, the GaussianSolver is built using a smaller matrix, i.e., 37x37 (line 13 of Listing 2). Similarly, a smaller SparseVector is instantiated in line 14 of Listing 2. At the end, the solve method is again called to solve the linear system.

Despite implementing similar behavior, the test generated by pDynaMOSA runs, on average over the 1,000 runs, about eight times faster than the one generated by DynaMOSA. This improvement is due to a better selection of the input values for the methods directly and indirectly invoked by the generated tests. While the algorithm has no direct control over this selection, the selective pressure exerted by the performance proxies favors the individuals with better inputs, from a performance perspective, that randomly appear in the population. For this reason, we expect pDynaMOSA to be less effective in scenarios where the input space is trivial (i.e., most inputs are primitive values and the CUT does not handle large arrays or objects).

1// test case generated by DynaMOSA
2public void test10() throws Throwable {
3  SparseMatrix sparseMatrix0 = SparseMatrix.diagonal(795, 795);
4  GaussianSolver gaussianSolver0 =
5          new GaussianSolver(sparseMatrix0);
6  SparseVector sparseVector0 = SparseVector.zero(795, 795);
7  gaussianSolver0.solve(sparseVector0);
8}
9
10// test case generated by pDynaMOSA
11public void test10() throws Throwable {
12  Matrix matrix0 = Matrix.unit(37, 37);
13  GaussianSolver gaussianSolver0 = new GaussianSolver(matrix0);
14  SparseVector sparseVector0 = SparseVector.zero(37);
15  try {
16    gaussianSolver0.solve(sparseVector0);
17    ...
Listing 2: Test cases for the GaussianSolver class
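The runtime gap between the two tests is mostly a function of input size: the cost of Gaussian elimination grows cubically with the matrix dimension. A minimal, self-contained sketch illustrates the scaling (plain arrays instead of la4j types, and operation counting instead of wall-clock timing, which is noisy; all names here are illustrative, not part of the paper's tooling):

```java
// Counts the multiply-subtract operations performed by forward elimination
// on an n x n system. The count grows as O(n^3), which is why solving a
// 795x795 system is far more expensive than a 37x37 one.
public class GaussCost {

    static long eliminationOps(int n) {
        double[][] a = new double[n][n];
        for (int i = 0; i < n; i++) a[i][i] = 1.0; // unit matrix, as in the pDynaMOSA test
        long ops = 0;
        for (int k = 0; k < n - 1; k++) {          // pivot column
            for (int i = k + 1; i < n; i++) {      // rows below the pivot
                double factor = a[i][k] / a[k][k];
                for (int j = k; j < n; j++) {      // update one row
                    a[i][j] -= factor * a[k][j];
                    ops++;
                }
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        System.out.println("ops(37)  = " + eliminationOps(37));
        System.out.println("ops(795) = " + eliminationOps(795));
    }
}
```

Doubling the matrix dimension multiplies the operation count by roughly eight, consistent with the runtime difference observed between the two generated tests.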

Need for an Adaptive Approach. As explained in section 3.2, we devised an adaptive approach that enables/disables the performance heuristic depending on whether the search stagnates, i.e., whether the objective values fail to improve over subsequent generations. To provide empirical evidence for the need for such an approach, we conducted an additional study in which we ran pDynaMOSA with the GET-SECONDARY-HEURISTIC procedure disabled (see section 3.2), i.e., the heuristic based on the performance proxies is always used. Our results show the expected detrimental effect on coverage: the non-adaptive version of pDynaMOSA achieves lower branch coverage, by 12pp on average, in 48 out of 110 cases (~44%), while the opposite happens in only 3 out of 110 cases (~2.7%). We observe a similar situation for weak mutation coverage: the non-adaptive version achieves lower coverage, by 19pp on average, in 38 out of 110 cases (~35%), while the opposite happens in only 4 out of 110 cases.
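The adaptive switch can be sketched as follows. This is an illustrative reimplementation under assumed names and a hypothetical stagnation window, not the actual GET-SECONDARY-HEURISTIC procedure, which is defined in section 3.2:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of adaptive secondary-heuristic selection: the performance proxies
// replace the crowding distance only while the search stagnates, i.e., the
// best objective value (lower is better) has not improved for WINDOW
// consecutive generations.
public class AdaptiveHeuristic {

    enum Secondary { CROWDING_DISTANCE, PERFORMANCE_PROXIES }

    static final int WINDOW = 3;                  // hypothetical stagnation window
    private final Deque<Double> recentBest = new ArrayDeque<>();

    Secondary nextHeuristic(double bestFitnessThisGeneration) {
        recentBest.addLast(bestFitnessThisGeneration);
        if (recentBest.size() > WINDOW + 1) recentBest.removeFirst();
        // Stagnation: no improvement over the whole window.
        if (recentBest.size() == WINDOW + 1
                && recentBest.peekLast() >= recentBest.peekFirst()) {
            return Secondary.PERFORMANCE_PROXIES; // push toward cheaper tests
        }
        return Secondary.CROWDING_DISTANCE;       // default: preserve diversity
    }

    public static void main(String[] args) {
        AdaptiveHeuristic h = new AdaptiveHeuristic();
        for (double best : new double[] {10.0, 9.0, 9.0, 9.0, 9.0}) {
            System.out.println(best + " -> " + h.nextHeuristic(best));
        }
    }
}
```

As soon as coverage starts improving again, the sliding window registers a decrease and the sketch falls back to the crowding distance, which is what prevents the detrimental coverage effect observed for the non-adaptive variant.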

Oracle Cost. DynaMOSA [27] relies on test case length (i.e., the number of statements) as a secondary criterion in the preference criterion. Test case length is often used in the literature as a proxy for oracle cost, since generated tests require human effort to check the candidate assertions (the oracle problem [45]). Minimizing test size partially mitigates the problem: the shorter the tests, the fewer the covered paths to validate manually [27]. To analyze the oracle cost for DynaMOSA and pDynaMOSA, we compare the lengths of the generated suites with the Wilcoxon test [62]. We observe that the test suites generated by pDynaMOSA are significantly shorter in 64 out of 110 cases (~58%), while the opposite happens in only 7 cases. The average test suite length is 514 statements for DynaMOSA and 381 for pDynaMOSA. We conclude that our approach, as a side effect, reduces the human oracle cost to a greater extent than DynaMOSA. Note that the test suites generated by both DynaMOSA and pDynaMOSA are post-processed for test minimization; therefore, the observed differences in test suite size are due to the performance proxies and the adaptive strategy implemented in pDynaMOSA. We report the full results at class level in the replication package [28].

Trade-off between Coverage and Performance. Our results show that pDynaMOSA achieves a similar level of coverage while optimizing runtime and memory consumption. Although pDynaMOSA finds a good compromise between primary and secondary objectives, in a few cases the performance optimization results in slightly lower coverage. The acceptable balance between performance and coverage depends on the system domain. For instance, in the development context of cyber-physical systems (CPS), tests can be particularly expensive to run, especially when they involve hardware or simulations [64]. Thus, the resource demands for testing systems in this domain are dramatically higher than for non-CPS applications. Adaptive approaches that optimize performance while keeping high levels of coverage might therefore improve the testability of CPS [65, 64].

6 Threats to Validity

Construct validity. We evaluate pDynaMOSA relying on metrics that are widely adopted in the literature. In RQ1 we use the seven default criteria of EvoSuite, i.e., branch, line, weak mutation, method, input, output, and exception coverage, while in RQ2 we rely on strong mutation coverage; all these metrics have been widely used in the context of test data generation [27]. For RQ3 we use runtime and heap memory usage, since they give a reasonable estimate of the performance of the generated test suites. Since rigorous performance benchmarking would be too expensive during the search, we devised seven proxies related to runtime and memory consumption to approximate it. We give the full rationale for their choice in section 3.1. Improving the proxies (e.g., by considering the sizes of instantiated objects) and experimenting with different proxies are subjects of future work.

Internal validity. To deal with the intrinsic randomness of the employed algorithms, we repeated each execution 50 times [8], reporting the average results along with rigorous statistical analysis. Different factors might have influenced the performance measurements of the generated tests. In particular, the order in which the tests are executed is random; due to dynamic compiler optimizations, different execution orders might change the runtime results of individual runs. We tackle this threat by repeating each measurement 1,000 times. Another threat concerns the memory measurements, where garbage collector activity invalidates the heap diff computed for a test method. We address this threat by replacing the measurements of methods that trigger the GC with the average heap utilization of the remaining valid forks. To lower the resource demands of the generated tests, we aggregate seven different proxies into a performance score that is optimized as a secondary objective. To investigate their impact in isolation, we ran pDynaMOSA with a single proxy enabled at a time and measured the runtime and achieved branch coverage of the generated tests, averaged over 5 different runs (measured in EvoSuite). While the average runtime varies across the different proxies, using them in isolation always results in lower branch coverage than using them in aggregation.
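The aggregation of the proxies into a single score can be sketched as a sum of normalized values. The proxy values and the x/(x+1) normalization below are assumptions for illustration; the actual proxies and their combination are defined in section 3.1:

```java
// Sketch: combine per-test proxy values (e.g., statements executed, methods
// invoked, objects instantiated -- hypothetical examples) into one score to
// be minimized as a secondary objective. Each raw value is mapped into
// [0, 1) with the normalization x / (x + 1), so no single unbounded proxy
// dominates the sum.
public class PerformanceScore {

    static double normalize(double x) {
        return x / (x + 1.0);
    }

    static double score(double[] proxyValues) {
        double sum = 0.0;
        for (double v : proxyValues) sum += normalize(v);
        return sum; // lower means a cheaper test
    }

    public static void main(String[] args) {
        // Hypothetical seven-proxy vectors for a cheap and an expensive test.
        double[] cheap = {5, 2, 1, 0, 0, 3, 1};
        double[] expensive = {40, 12, 9, 4, 2, 25, 8};
        System.out.println("cheap     = " + score(cheap));
        System.out.println("expensive = " + score(expensive));
    }
}
```

Because each normalized term is bounded, a test that is extreme on one proxy but cheap on the others does not automatically dominate the score, which is the point of optimizing the proxies in aggregation rather than in isolation.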

Conclusion validity. To analyze the results of our experiments, we used appropriate statistical tests coupled with sufficient repetitions [8]. We relied on the Wilcoxon Rank-Sum Test [62] for determining significant differences and on the Vargha-Delaney effect size statistic [63] to estimate the magnitude of the observed differences. We only drew conclusions when these tests were statistically significant.
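The Vargha-Delaney statistic used above is straightforward to compute: A12 is the probability that a value drawn from the first sample is larger than a value drawn from the second, with ties counting half. A minimal sketch:

```java
// Vargha-Delaney A12 effect size: the fraction of pairs (x, y) with x > y,
// counting ties as 0.5. A12 = 0.5 means no difference; values toward 0 or 1
// indicate larger effects.
public class VarghaDelaney {

    static double a12(double[] xs, double[] ys) {
        double wins = 0.0;
        for (double x : xs) {
            for (double y : ys) {
                if (x > y) wins += 1.0;
                else if (x == y) wins += 0.5;
            }
        }
        return wins / (xs.length * (double) ys.length);
    }

    public static void main(String[] args) {
        double[] runtimeDynaMOSA = {92, 95, 90};   // hypothetical runtimes
        double[] runtimePDynaMOSA = {75, 78, 74};
        System.out.println("A12 = " + a12(runtimeDynaMOSA, runtimePDynaMOSA));
    }
}
```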

7 Conclusions

This paper considers, besides seven different coverage criteria, performance in the form of runtime and memory usage as a testing criterion in white-box test case generation. To avoid the overhead of precise performance measurements, we introduce a set of low-overhead performance proxies that estimate the computational demands of the generated tests. We devise a novel adaptive strategy that incorporates these proxies into the main loop of DynaMOSA, enabling or disabling them as a substitute for the crowding distance depending on whether search stagnation is detected.

Our empirical study on 110 Java classes shows that pDynaMOSA achieves results comparable to DynaMOSA over seven different coverage criteria. When reaching the same branch coverage, the test suites produced by pDynaMOSA are significantly less expensive to run for 65% (runtime) and 68% (heap memory consumption) of the CUTs. In the former case, the mean runtime decreases by ~16%, while in the latter we observe a reduction in heap memory consumption of ~22%. Moreover, we evaluated the fault detection effectiveness of the generated test suites to rule out adverse effects of the performance optimization: pDynaMOSA achieves a similar mutation score for ~85% of the subjects under test.

Based on these promising results, we plan to investigate several directions for future work: (i) devising new proxies and evaluating their individual impact, and (ii) enlarging our study by including further Java classes and different projects.


G. Grano and S. Panichella acknowledge the support of the Swiss National Science Foundation (SNSF) through project no. 200021_166275 and of CHOOSE. C. Laaber acknowledges the support of the SNSF through project MINCA (no. 165546).


  • Fowler and Foemmel [2006] M. Fowler and M. Foemmel, “Continuous integration,” ThoughtWorks, http://www.thoughtworks.com/ContinuousIntegration.pdf, vol. 122, p. 14, 2006.
  • Fraser and Arcuri [2013] G. Fraser and A. Arcuri, “Whole test suite generation,” IEEE Trans. Softw. Eng., vol. 39, no. 2, pp. 276–291, Feb. 2013. [Online]. Available: http://dx.doi.org/10.1109/TSE.2012.14
  • Panichella et al. [2015] A. Panichella, F. M. Kifetew, and P. Tonella, “Reformulating branch coverage as a many-objective optimization problem,” in ICST.    IEEE Computer Society, 2015, pp. 1–10.
  • Vassallo et al. [2016] C. Vassallo, F. Zampetti, D. Romano, M. Beller, A. Panichella, M. D. Penta, and A. Zaidman, “Continuous delivery practices in a large financial organization,” in 2016 IEEE International Conference on Software Maintenance and Evolution, ICSME 2016, Raleigh, NC, USA, October 2-7, 2016, 2016, pp. 519–528. [Online]. Available: http://dx.doi.org/10.1109/ICSME.2016.72
  • Campos et al. [2014] J. Campos, A. Arcuri, G. Fraser, and R. Abreu, “Continuous test generation: enhancing continuous integration with automated test generation,” in Proceedings of the 29th ACM/IEEE international conference on Automated software engineering, 2014, pp. 55–66.
  • McMinn [2011] P. McMinn, “Search-based software testing: Past, present and future,” in 2011 IEEE Fourth International Conference on Software Testing, Verification and Validation Workshops (ICSTW).    IEEE, 2011, pp. 153–163.
  • McMinn [2004] P. McMinn, “Search-based software test data generation: A survey,” Softw. Test. Verif. Reliab., vol. 14, no. 2, pp. 105–156, Jun. 2004. [Online]. Available: http://dx.doi.org/10.1002/stvr.v14:2
  • Campos et al. [2017] J. Campos, Y. Ge, G. Fraser, M. Eler, and A. Arcuri, “An empirical evaluation of evolutionary algorithms for test suite generation,” in Proceedings of the 9th International Symposium on Search Based Software Engineering SSBSE 2017, 2017, pp. 33–48.
  • Fraser and Arcuri [2011] G. Fraser and A. Arcuri, “Evosuite: automatic test suite generation for object-oriented software,” in Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering.    ACM, 2011, pp. 416–419.
  • Lakhotia et al. [2007] K. Lakhotia, M. Harman, and P. McMinn, “A multi-objective approach to search-based test data generation,” in Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, ser. GECCO ’07.    New York, NY, USA: ACM, 2007, pp. 1098–1105. [Online]. Available: http://doi.acm.org/10.1145/1276958.1277175
  • Afshan et al. [2013a] S. Afshan, P. McMinn, and M. Stevenson, “Evolving readable string test inputs using a natural language model to reduce human oracle cost,” in Proceedings International Conference on Software Testing, Verification and Validation (ICST).    IEEE, 2013, pp. 352–361.
  • Xuan and Monperrus [2014] J. Xuan and M. Monperrus, “Test case purification for improving fault localization,” in Proceedings of the International Symposium on Foundations of Software Engineering (FSE).    ACM, 2014, pp. 52–63.
  • Panichella et al. [2016] S. Panichella, A. Panichella, M. Beller, A. Zaidman, and H. C. Gall, “The impact of test case summaries on bug fixing performance: an empirical investigation,” in Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016, 2016, pp. 547–558. [Online]. Available: http://doi.acm.org/10.1145/2884781.2884847
  • Daka et al. [2015] E. Daka, J. Campos, G. Fraser, J. Dorn, and W. Weimer, “Modeling readability to improve unit tests,” in Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE).    ACM, 2015, to appear.
  • Palomba et al. [2016] F. Palomba, A. Panichella, A. Zaidman, R. Oliveto, and A. De Lucia, “Automatic test case generation: What if test code quality matters?” in Proceedings of the 25th International Symposium on Software Testing and Analysis, ser. ISSTA 2016.    New York, NY, USA: ACM, 2016, pp. 130–141. [Online]. Available: http://doi.acm.org/10.1145/2931037.2931057
  • Albunian [2017] N. M. Albunian, “Diversity in search-based unit test suite generation,” in Proceedings of the 9th International Symposium Search Based Software Engineering, ser. SSBSE ’17.    Springer, 2017, pp. 183–189.
  • Pinto and Vergilio [2010] G. H. Pinto and S. R. Vergilio, “A multi-objective genetic algorithm to test data generation,” in Tools with Artificial Intelligence (ICTAI), 2010 22nd IEEE International Conference on, vol. 1.    IEEE, 2010, pp. 129–134.
  • Oster and Saglietti [2006] N. Oster and F. Saglietti, “Automatic test data generation by multi-objective optimisation,” in International Conference on Computer Safety, Reliability, and Security.    Springer, 2006, pp. 426–438.
  • Ferrer et al. [2012] J. Ferrer, F. Chicano, and E. Alba, “Evolutionary algorithms for the multi-objective test data generation problem,” Softw. Pract. Exper., vol. 42, no. 11, pp. 1331–1362, Nov. 2012. [Online]. Available: http://dx.doi.org/10.1002/spe.1135
  • Rojas et al. [2017] J. M. Rojas, M. Vivanti, A. Arcuri, and G. Fraser, “A detailed investigation of the effectiveness of whole test suite generation,” Empirical Software Engineering, vol. 22, no. 2, pp. 852–893, 2017.
  • Panichella et al. [2018a] A. Panichella, F. Kifetew, and P. Tonella, “A large scale empirical comparison of state-of-the-art search-based test case generators,” Information and Software Technology, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0950584917304950
  • Campos et al. [2018] J. Campos, Y. Ge, N. Albunian, G. Fraser, M. Eler, and A. Arcuri, “An empirical evaluation of evolutionary algorithms for unit test suite generation,” Information and Software Technology, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0950584917304858
  • Hilton et al. [2017] M. Hilton, N. Nelson, T. Tunnell, D. Marinov, and D. Dig, “Trade-offs in continuous integration: Assurance, security, and flexibility,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2017.    New York, NY, USA: ACM, 2017, pp. 197–207. [Online]. Available: http://doi.acm.org/10.1145/3106237.3106270
  • Bukh [1992] P. N. D. Bukh, “The art of computer systems performance analysis, techniques for experimental design, measurement, simulation and modeling,” 1992.
  • de Oliveira et al. [2017] A. B. de Oliveira, S. Fischmeister, A. Diwan, M. Hauswirth, and P. Sweeney, “Perphecy: Performance regression test selection made simple but effective,” in Proceedings of the 10th IEEE International Conference on Software Testing, Verification and Validation (ICST), Tokyo, Japan, 2017.
  • Albert et al. [2011] E. Albert, M. Gómez-Zamalloa, and J. M. Rojas, “Resource-driven clp-based test case generation,” in International Symposium on Logic-Based Program Synthesis and Transformation.    Springer, 2011, pp. 25–41.
  • Panichella et al. [2018b] A. Panichella, F. M. Kifetew, and P. Tonella, “Automated test case generation as a many-objective optimisation problem with dynamic selection of the targets,” IEEE Transactions on Software Engineering, vol. 44, no. 2, pp. 122–158, 2018.
  • [28] Replication package. GitHub. https://github.com/sealuzh/dynamic-performance-replication.
  • Tonella [2004] P. Tonella, “Evolutionary testing of classes,” in Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA ’04.    New York, NY, USA: ACM, 2004, pp. 119–128.
  • Goldberg [2006] D. E. Goldberg, Genetic algorithms.    Pearson Education India, 2006.
  • Scalabrino et al. [2016] S. Scalabrino, G. Grano, D. Di Nucci, R. Oliveto, and A. De Lucia, “Search-based testing of procedural programs: Iterative single-target or multi-target approach?” in Search Based Software Engineering, F. Sarro and K. Deb, Eds.    Cham: Springer International Publishing, 2016, pp. 64–79.
  • von Lücken et al. [2014] C. von Lücken, B. Barán, and C. Brizuela, “A survey on multi-objective evolutionary algorithms for many-objective problems,” Computational optimization and applications, vol. 58, no. 3, pp. 707–756, 2014.
  • Deb et al. [2002] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: Nsga-ii,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
  • Kifetew et al. [2013] F. M. Kifetew, A. Panichella, A. De Lucia, R. Oliveto, and P. Tonella, “Orthogonal exploration of the search space in evolutionary test case generation,” in Proceedings of the 2013 International Symposium on Software Testing and Analysis.    ACM, 2013, pp. 257–267.
  • Panichella et al. [2018c] A. Panichella, F. M. Kifetew, and P. Tonella, “Incremental control dependency frontier exploration for many-criteria test case generation,” in International Symposium on Search Based Software Engineering.    Springer, 2018, pp. 309–324.
  • Afshan et al. [2013b] S. Afshan, P. McMinn, and M. Stevenson, “Evolving readable string test inputs using a natural language model to reduce human oracle cost,” in Proceedings of the 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, ser. ICST ’13.    Washington, DC, USA: IEEE Computer Society, 2013, pp. 352–361. [Online]. Available: http://dx.doi.org/10.1109/ICST.2013.11
  • Jin et al. [2012] G. Jin, L. Song, X. Shi, J. Scherpelz, and S. Lu, “Understanding and detecting real-world performance bugs,” in Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’12.    New York, NY, USA: ACM, 2012, pp. 77–88. [Online]. Available: http://doi.acm.org/10.1145/2254064.2254075
  • Shirazi [2003] J. Shirazi, Java performance tuning.    O’Reilly Media, Inc., 2003.
  • Yoo and Harman [2007] S. Yoo and M. Harman, “Pareto efficient multi-objective test case selection,” in Proceedings of the 2007 International Symposium on Software Testing and Analysis, ser. ISSTA ’07, 2007, pp. 140–150.
  • Huang et al. [2014] P. Huang, X. Ma, D. Shen, and Y. Zhou, “Performance Regression Testing Target Prioritization via Performance Risk Analysis,” in Proceedings of the 36th International Conference on Software Engineering, ser. ICSE 2014.    New York, NY, USA: ACM, 2014, pp. 60–71. [Online]. Available: http://doi.acm.org/10.1145/2568225.2568232
  • Mostafa et al. [2017] S. Mostafa, X. Wang, and T. Xie, “PerfRanker: Prioritization of Performance Regression Tests for Collection-Intensive Software,” in Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA ’17.    New York, NY, USA: ACM, 2017, pp. 23–34. [Online]. Available: http://doi.acm.org/10.1145/3092703.3092725
  • Chen and Kim [2015] N. Chen and S. Kim, “Star: stack trace based automatic crash reproduction via symbolic execution,” IEEE transactions on software engineering, vol. 41, no. 2, pp. 198–220, 2015.
  • Soltani et al. [2018] M. Soltani, A. Panichella, and A. Van Deursen, “Search-based crash reproduction and its impact on debugging,” IEEE Transactions on Software Engineering, 2018.
  • McAllister [2008] W. McAllister, Data Structures And Algorithms Using Java, 1st ed.    USA: Jones and Bartlett Publishers, Inc., 2008.
  • Barr et al. [2015] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The oracle problem in software testing: A survey,” IEEE Transactions on Software Engineering, vol. 41, no. 5, pp. 507–525, May 2015.
  • Arcuri [2010] A. Arcuri, “It does matter how you normalise the branch distance in search based software testing,” in Third International Conference on Software Testing, Verification and Validation, ICST, April 2010, pp. 205–214.
  • Deb [2014] K. Deb, “Multi-objective optimization,” in Search methodologies.    Springer, 2014, pp. 403–449.
  • Črepinšek et al. [2013] M. Črepinšek, S.-H. Liu, and M. Mernik, “Exploration and exploitation in evolutionary algorithms: A survey,” ACM Computing Surveys (CSUR), vol. 45, no. 3, p. 35, 2013.
  • Köppen and Yoshida [2007] M. Köppen and K. Yoshida, “Substitute distance assignments in NSGA-II for handling many-objective optimization problems,” in Proceedings of the 4th International Conference on Evolutionary Multi-criterion Optimization, ser. EMO’07.    Berlin, Heidelberg: Springer-Verlag, 2007, pp. 727–741. [Online]. Available: http://dl.acm.org/citation.cfm?id=1762545.1762607
  • Fraser and Arcuri [2014] G. Fraser and A. Arcuri, “A large-scale evaluation of automated unit test generation using evosuite,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 24, no. 2, p. 8, 2014.
  • Panichella and Molina [2017] A. Panichella and U. R. Molina, “Java unit testing tool competition-fifth round,” in Search-Based Software Testing (SBST), 2017 IEEE/ACM 10th International Workshop on.    IEEE, 2017, pp. 32–38.
  • Shamshiri et al. [2015] S. Shamshiri, J. M. Rojas, G. Fraser, and P. McMinn, “Random or genetic algorithm search for object-oriented test suite generation?” in Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation.    ACM, 2015, pp. 1367–1374.
  • Arcuri and Fraser [2013] A. Arcuri and G. Fraser, “Parameter tuning or default values? an empirical investigation in search-based software engineering,” Empirical Software Engineering, vol. 18, no. 3, pp. 594–623, 2013.
  • Just et al. [2014] R. Just, D. Jalali, L. Inozemtseva, M. D. Ernst, R. Holmes, and G. Fraser, “Are mutants a valid substitute for real faults in software testing?” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering.    ACM, 2014, pp. 654–665.
  • Andrews et al. [2005] J. H. Andrews, L. C. Briand, and Y. Labiche, “Is mutation an appropriate tool for testing experiments?” in Proceedings of the 27th international conference on Software engineering.    ACM, 2005, pp. 402–411.
  • Jia and Harman [2011] Y. Jia and M. Harman, “An analysis and survey of the development of mutation testing,” IEEE transactions on software engineering, vol. 37, no. 5, pp. 649–678, 2011.
  • Wei et al. [2012] Y. Wei, B. Meyer, and M. Oriol, “Is branch coverage a good measure of testing effectiveness?” in Empirical Software Engineering and Verification.    Springer, 2012, pp. 194–212.
  • Inozemtseva and Holmes [2014] L. Inozemtseva and R. Holmes, “Coverage is not strongly correlated with test suite effectiveness,” in Proceedings of the 36th International Conference on Software Engineering.    ACM, 2014, pp. 435–445.
  • Offutt [2011] J. Offutt, “A mutation carol: Past, present and future,” Information and Software Technology, vol. 53, no. 10, pp. 1098–1107, 2011.
  • Fraser and Arcuri [2015] G. Fraser and A. Arcuri, “Achieving scalable mutation-based generation of whole test suites,” Empirical Software Engineering, vol. 20, no. 3, pp. 783–812, 2015.
  • Georges et al. [2007] A. Georges, D. Buytaert, and L. Eeckhout, “Statistically rigorous java performance evaluation,” in Proceedings of the 22Nd Annual ACM SIGPLAN Conference on Object-oriented Programming Systems and Applications, ser. OOPSLA ’07.    New York, NY, USA: ACM, 2007, pp. 57–76. [Online]. Available: http://doi.acm.org/10.1145/1297027.1297033
  • Conover [1999] W. Conover, Practical nonparametric statistics, ser. Wiley series in probability and statistics: Applied probability and statistics.    Wiley, 1999. [Online]. Available: https://books.google.nl/books?id=dYEpAQAAMAAJ
  • Vargha and Delaney [2000] A. Vargha and H. D. Delaney, “A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong,” Journal on Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101–132, 2000.
  • Törngren and Sellgren [2018] M. Törngren and U. Sellgren, Complexity Challenges in Development of Cyber-Physical Systems.    Cham: Springer International Publishing, 2018, pp. 478–503. [Online]. Available: https://doi.org/10.1007/978-3-319-95246-8_27
  • Abbaspour Asadollah et al. [2015] S. Abbaspour Asadollah, R. Inam, and H. Hansson, “A survey on testing for cyber physical system,” in Proceedings of the 27th IFIP WG 6.1 International Conference on Testing Software and Systems - Volume 9447, ser. ICTSS 2015.    New York, NY, USA: Springer-Verlag New York, Inc., 2015, pp. 194–207. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-25945-1_12