JUGE: An Infrastructure for Benchmarking Java Unit Test Generators

06/14/2021 ∙ by Xavier Devroey, et al. ∙ Delft University of Technology

Researchers and practitioners have designed and implemented various automated test case generators to support effective software testing. Such generators exist for various languages (e.g., Java, C#, or Python) and for various platforms (e.g., desktop, web, or mobile applications). Such generators exhibit varying effectiveness and efficiency, depending on the testing goals they aim to satisfy (e.g., unit-testing of libraries vs. system-testing of entire applications) and the underlying techniques they implement. In this context, practitioners need to be able to compare different generators to identify the most suited one for their requirements, while researchers seek to identify future research directions. This can be achieved through the systematic execution of large-scale evaluations of different generators. However, the execution of such empirical evaluations is not trivial and requires a substantial effort to collect benchmarks, set up the evaluation infrastructure, and collect and analyse the results. In this paper, we present our JUnit Generation benchmarking infrastructure (JUGE) supporting generators (e.g., search-based, random-based, symbolic execution, etc.) seeking to automate the production of unit tests for various purposes (e.g., validation, regression testing, fault localization, etc.). The primary goal is to reduce the overall effort, ease the comparison of several generators, and enhance the knowledge transfer between academia and industry by standardizing the evaluation and comparison process. Since 2013, eight editions of a unit testing tool competition, co-located with the Search-Based Software Testing Workshop, have taken place and used and updated JUGE. As a result, an increasing number of tools (over ten) from both academia and industry have been evaluated on JUGE, matured over the years, and allowed the identification of future research directions.


1 Introduction

Over the last decades, researchers have come up with various techniques to automate the generation of test cases. In particular, unit test generators seek to automate the production of tests for various purposes (e.g., validation, regression testing, fault localization, etc.) using different techniques, including random search (e.g.,  [43, 38]), search-based software testing (e.g.,  [19, 3, 15]), and symbolic (e.g.,  [51, 8]) and concolic execution (e.g.,  [70, 66]).

Juristo et al. [32] identified three essential features that each empirical evaluation should have to contribute to the software testing empirical body of knowledge. First, the evaluation should be fully defined, and the data should be analysed with appropriate techniques to allow a clear interpretation of the results. Second, the programs used for the evaluation and the setup and variables considered should be representative of the reality of the practice. Third, an evaluation should be replicable and come with a replication package to confirm previous results and reach an acceptable level of confidence in the hypothesis.

Similarly, to bridge the gap with industry, automated test case generators have to come with strong evidence that the approach can also be applied in practice. For instance, evidence-based software engineering [18] can help practitioners make informed decisions about the choice of a generator based on the current best evidence from research. Those pieces of evidence come in the form of empirical evaluations identifying the strengths and weaknesses of various generators.

In the case of unit test generators, conducting an empirical evaluation is not trivial. It requires a large manual effort to collect benchmarks (i.e., Java classes for which to generate test cases), set up the evaluation and the evaluation infrastructure, collect and analyze the produced unit tests, and compare the results with the state-of-the-art. The primary goal of our JUnit Generation benchmarking infrastructure (JUGE) [16] is to reduce the overall effort and ease the comparison of several generators by standardizing the evaluation process. This standardization allows researchers to meet the requirements for an effective contribution to the empirical body of knowledge in software testing.

JUGE was initially developed in the context of the tool competition co-located with the Search-Based Software Testing (SBST) workshop. Since 2013, eight editions of the tool competition have taken place and used the JUGE infrastructure to evaluate and compare automated unit test generators [6, 7, 61, 60, 48, 41, 35, 17, 50]. Consequently, JUGE has been improved and has evolved over the years to integrate the latest advances from academia to enhance the comparison, and best practices from industry to achieve high automation. Several tools have entered the competition [22, 52, 23, 57, 25, 39, 63, 46, 53, 26, 64, 54, 28, 65, 27, 55, 11, 10, 56, 44, 68, 36, 1, 31] and matured over the years by fixing bugs evidenced by the evaluations using the JUGE infrastructure, but also by confronting the various approaches with different benchmarks to discover areas for improvement and future research directions. The current implementation is openly available at https://github.com/JUnitContest/JUGE and on Zenodo for long-term storage [16].

JUGE is suited for evaluating and comparing fully automated black-, white-, and grey-box unit test generators. For instance, in previous editions of the tool competition, JUGE has been applied to evaluate various types of tools relying on a variety of approaches, including search-based [19, 47, 62], random-based [38, 43, 58], and symbolic execution [8, 9]. In a nutshell, the generator takes as input the source code or the binaries of a Java project and generates unit tests for a given class or set of classes. A time budget limits the generation, and the generated tests are compared w.r.t. their structural coverage and mutation score. JUGE computes a score based on those metrics to rank the different generators using sound statistical analysis. The benchmarks, the tests, and the intermediate results can be saved and archived to be added to a replication package and enable future comparisons without requiring the re-execution of all the generators, thanks to the standardized evaluation process implemented in JUGE.

2 Background and related work

When designing a new test case generation technique, conducting empirical evaluations is of paramount importance to position this new technique in the current software testing body of knowledge [32, 69]. When the technique gains in maturity, developers will also rely on those empirical evaluations to make informed decisions about choosing a tool relevant for their industrial context [18, 3]. For instance, Melo et al. [40] designed a recommender for concurrent software testing techniques based on the characteristics of the software under test and the current body of knowledge in concurrent software testing.

2.1 Empirical evaluation guidelines

Over the years, several guidelines, benchmarks, and infrastructures have been developed to design, execute, and assess test case generators. For instance, Arcuri and Briand [5] defined guidelines for the usage of statistical tools when evaluating and comparing randomized algorithms, which is the case for a large number of automated test case generators. In their systematic review of the empirical evaluation of search-based test case generation, Ali et al. [2] identify the elements that should be reported in study designs. They found that search-based software testing has focused on structural coverage and unit testing and that empirical studies should adopt a more rigorous and standardized execution and reporting approach. In particular, studies should account for random variation in the results by using appropriate statistical hypothesis testing and compare the techniques with other baselines to conclude that they bring any advantage.

More recently, in a significant effort to improve the review process in software engineering, Ralph et al. [59] defined Empirical Standards listing specific attributes expected when conducting an empirical evaluation following a given research methodology. The empirical evaluation of automated test case generation is classified under the umbrella of Optimization Studies in Software Engineering, i.e., research studies that focus on the formulation of software engineering problems as search problems, and apply optimization techniques to solve such problems [59]. Among the essential characteristics of such studies, the standards require the comparison of the approach under study to an appropriate baseline and the distribution of the dataset (i.e., the benchmarks used for the evaluation, if possible, and the results).

JUGE contributes to the general effort of improving the quality and reproducibility of empirical evaluations for unit test generators by (i) standardizing the evaluation process, using appropriate data analysis techniques, and (ii) enabling easy distribution of the benchmarks (i.e., classes under test used for the evaluation) and results, including the test cases, coverage and mutation analysis, and statistical analysis for future comparisons and reproductions. Section 4 discusses the guidelines to design, execute, and report the results of an empirical evaluation using our infrastructure.

2.2 Comparison of test case generators

Besides structural coverage, like line or branch coverage, empirical evaluations also rely on mutation analysis to compare different test case generators [4]. Mutation analysis [14] applies mutation operators, e.g., replacing an arithmetic operator, to a program under test to produce faulty variants (i.e., mutants), and executes a test suite on those variants. If a test fails on a particular mutant, this mutant is considered killed. The mutation score, i.e., the ratio of killed mutants to the total number of mutants, is used to measure the fault detection capabilities of the test suite [34].
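As a toy illustration (the class, mutant, and test below are hypothetical and not taken from any benchmark), consider a method whose addition is mutated into a subtraction; a test asserting on the original behaviour kills this mutant:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

// Hypothetical class under test.
class Account {
    private final int balance;

    Account(int balance) {
        this.balance = balance;
    }

    int deposit(int amount) {
        return balance + amount; // a mutant would replace '+' with '-'
    }
}

// This test kills the arithmetic-operator mutant: it passes on the
// original code but fails on the mutated variant.
public class AccountTest {

    @Test
    public void depositAddsAmount() {
        assertEquals(15, new Account(10).deposit(5));
    }
}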

For now, JUGE supports both structural coverage and mutation analysis of the generated tests. Other kinds of automated analysis can be plugged into the extendable architecture of the infrastructure. Additionally, all the generated test suites are saved using a unique identifier and can be collected for additional manual inspection.

2.3 Benchmarks for software testing

Empirical evaluations can be performed on various kinds of benchmarks (i.e., classes under test). For instance, Fraser and Arcuri built SF110 [24], a corpus of 23,886 classes from 110 open-source projects used to evaluate and compare unit test generators. Other benchmarks follow a different approach by using actual bugs extracted from Java software systems. For instance, Defects4J [33] is a collection of reproducible bugs and a supporting infrastructure that has been widely used for evaluating software testing and debugging approaches. In its latest version (v2.0.0), Defects4J contains 835 bugs from 17 Java software systems [29]. Similarly, BugSwarm [67] is a toolkit designed to mine reproducible failures and corresponding fixes to evaluate fault-detection, localization, and repair approaches.

JUGE supports the definition of customized benchmarks. For instance, previous editions of the tool competition have used classes from Defects4J’s projects and classes collected from open-source projects. Section 4.2 details the guidelines, based on our experience in the tool competition, to select classes under test for an empirical evaluation.

3 JUGE Infrastructure

JUGE is suited for evaluating and comparing fully automated black-, white-, and grey-box unit test generators. The generator expects as input the source code or the binaries of a Java project and generates unit tests for a given class or set of classes. The generation is limited by a time budget, provided as input to the generator, and the execution is limited by a global timeout (equal to twice the time budget) to take the pre- and post-processing of the generator into account. For each benchmark (i.e., set of classes under test), JUGE runs the test case generator with the given time budget. JUGE can be parametrized to repeat the executions a given number of times. Once the generation is completed, JUGE can measure the structural coverage and perform mutation analysis of the generated tests, compute a score based on those metrics, and rank the different generators using sound statistical analysis.

Figure 1: JUGE architecture overview

JUGE is openly available at https://github.com/JUnitContest/JUGE and packaged as a Docker image. It contains scripts and tools supporting (i) the generation of unit tests for a given set of classes under test and time budget; (ii) the coverage and mutation analysis of the generated tests; and (iii) the statistical analysis and ranking of different unit test generators.

As illustrated in Figure 1, JUGE relies on an adapter, called runtool, to wrap specific calls to a unit test generator (Randoop in Figure 1). This adapter offers an interface to the benchmarktool, in charge of orchestrating the evaluation of the unit test generator. The communication between the host and the JUnitcontest container (B in Figure 1) is done via a common folder (A in Figure 1), mounted in the file tree structure of the image. This folder contains the executable binaries of the unit test generator and its runtool adapter. The generated tests, the metrics, and the results of the statistical analysis are saved in a subfolder (results/) to be made available to the host. The classes under test and the corresponding configuration file are saved in the Docker container (benchmarks/). Hence, to evaluate multiple tools, one can reuse the same container and only has to mount different folders, each one containing the unit test generator and its runtool adapter.

3.1 Unit test generation

One of the main challenges when building JUGE was to define a generic protocol for the generation of unit tests able to handle various unit test generators. For that, we rely on a set of conventions and a generic communication protocol between the benchmarktool and the runtool adapter.

Conventions.

By convention, the common folder (A in Figure 1) has to be named after the generator (randoop/ in our example) and mounted in the /home/ directory of the Docker container. For any generator, unit tests have to be generated in /home/randoop/temp/testcases/. For each class under test, unit tests have to be stored as one or more Java test files containing JUnit tests. Each Java test file has to declare a public class with a zero-argument public constructor, annotate test methods with @Test, and declare test methods public. Additional files may be saved to /home/randoop/temp/data/ for later offline analysis (e.g., for debugging of the generator).
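For illustration, a minimal generated test file satisfying these conventions could look as follows (the package, class, and method names are hypothetical); it declares a public class, an explicit zero-argument public constructor, and public methods annotated with @Test:

// Hypothetical file saved under /home/randoop/temp/testcases/org/example/
package org.example;

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class RegressionTest0 {

    // Zero-argument public constructor required by the conventions
    // (the implicit default constructor would also satisfy them).
    public RegressionTest0() {
    }

    // Test methods must be public and annotated with @Test.
    @Test
    public void test01() {
        java.util.ArrayList<String> list = new java.util.ArrayList<>();
        list.add("juge");
        assertEquals(1, list.size());
        assertEquals("juge", list.get(0));
    }
}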

Figure 2: Communication protocol of the runtool adapter.

Additionally, the /home/randoop/ folder must contain a runtool executable script or binary that will be called by the benchmarktool to start the generation of unit tests. For instance, for Randoop, the runtool script in Listing 1 contains a single command launching the Randoop-specific implementation of the generic runtool module provided in the source code repository of our infrastructure (available at https://github.com/JUnitContest/JUGE/tree/master/runtool).

#!/bin/bash
APACHE_EXECS_LIB=lib/org/apache/commons/commons-exec/1.2/commons-exec-1.2.jar
TOOL=lib/runtool-1.0.0-SNAPSHOT.jar
java -cp $TOOL:$APACHE_EXECS_LIB  sbst.runtool.Main
Listing 1: Randoop runtool script

Communication protocol.

The adapter has to support the protocol described in Figure 2. In the first part (A), the benchmarktool signals the start of a new evaluation by sending the ’BENCHMARK’ message, followed by the paths to the source code and binaries of the software under test, the CLASSPATH, and the number of classes under test in the evaluation. Based on that information, the runtool adapter initializes the generator (in this case, Randoop).

After the initialisation, the generator can signal that it will use additional CLASSPATH entries for its execution. The adapter notifies the benchmarktool of those additional entries (B in Figure 2). In the third part (C), the adapter notifies the benchmarktool that the generator is ready to start the evaluation by sending the ’READY’ message. The benchmarktool then sends the time budget allocated for the generation and the class under test of the first run to the adapter, which, in turn, calls the generator. After the generation, the adapter notifies the benchmarktool that the generator is ready for the next class under test.
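The exact message layout is defined by the generic runtool module in the repository; as a rough sketch only, assuming a line-based exchange over standard input and output (the helper methods and the order of the fields are illustrative assumptions, not the actual protocol specification), an adapter could be structured as follows:

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Illustrative skeleton of a runtool adapter: it reads the benchmark
// description sent by the benchmarktool, signals readiness, and generates
// tests for each class under test within the given time budget.
public class RuntoolSketch {

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));

        // (A) Benchmark announcement (assumed line-based layout).
        expect(in.readLine(), "BENCHMARK");
        String sourceDirectory = in.readLine();   // path to the source code
        String binariesDirectory = in.readLine(); // path to the compiled classes
        String classpath = in.readLine();         // CLASSPATH of the project
        int numberOfClasses = Integer.parseInt(in.readLine().trim());

        // (B) The adapter would report here any additional CLASSPATH entries
        // required by the generator (format defined by the actual protocol).

        // (C) Signal readiness, then handle one class under test per run.
        System.out.println("READY");
        for (int i = 0; i < numberOfClasses; i++) {
            long budgetInSeconds = Long.parseLong(in.readLine().trim());
            String classUnderTest = in.readLine().trim();
            generateTests(classUnderTest, binariesDirectory, classpath, budgetInSeconds);
            System.out.println("READY"); // ready for the next class under test
        }
    }

    private static void expect(String line, String expected) {
        if (line == null || !line.startsWith(expected)) {
            throw new IllegalStateException("Expected " + expected + " but received " + line);
        }
    }

    private static void generateTests(String classUnderTest, String binariesDirectory,
                                      String classpath, long budgetInSeconds) {
        // Invoke the actual generator (e.g., Randoop) here and write the
        // resulting JUnit files to temp/testcases/.
    }
}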

3.2 Coverage and mutation analysis

Once the test cases have been generated, JUGE can compute the different metrics for each test suite (A in Figure 1). Those metrics include (i) the number of flaky and non-compiling tests, (ii) the line and branch coverage, and (iii) the mutation score of the generated tests.

Flaky and non-compiling tests.

First, if the test suite (one per Java file) does not compile, it is tagged and ignored in the subsequent steps of the analysis. Once compiled, the test suite is executed 5 times. Test methods (identified using the @Test annotation) producing different results between different executions are marked as flaky and ignored for the remainder of the analysis.
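A minimal sketch of such a flakiness check, assuming plain JUnit 4 and not reflecting JUGE’s actual implementation, is to run the compiled test class several times and flag the tests whose outcome is not stable across runs:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.junit.runner.JUnitCore;
import org.junit.runner.Result;
import org.junit.runner.notification.Failure;

// Illustrative flakiness check: execute a test class several times and
// report the test methods whose failures are not stable across executions.
public class FlakinessCheck {

    public static Set<String> flakyTests(Class<?> testClass, int runs) {
        Map<String, Integer> failureCounts = new HashMap<>();
        for (int i = 0; i < runs; i++) {
            Result result = JUnitCore.runClasses(testClass);
            for (Failure failure : result.getFailures()) {
                String name = failure.getDescription().getDisplayName();
                failureCounts.merge(name, 1, Integer::sum);
            }
        }
        // A test that fails in some runs but not in all of them is flaky.
        Set<String> flaky = new HashSet<>();
        for (Map.Entry<String, Integer> entry : failureCounts.entrySet()) {
            if (entry.getValue() > 0 && entry.getValue() < runs) {
                flaky.add(entry.getKey());
            }
        }
        return flaky;
    }

    public static void main(String[] args) throws Exception {
        Class<?> testClass = Class.forName(args[0]);
        System.out.println("Flaky tests: " + flakyTests(testClass, 5));
    }
}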

Line and branch coverage.

JUGE relies on JaCoCo [30] for the line and branch coverage of the generated tests. Coverage information is furthermore used to reduce the subsequent mutation analysis time by restricting the execution of the tests against a given mutant to the tests effectively covering the lines modified by the mutant.

Mutation analysis.

In the early versions of JUGE, we relied on PITest [12] to both generate and execute the mutants. This, however, raised several issues for unit test generators relying on a dedicated test execution environment. For instance, test cases generated using EvoSuite require a dedicated runner to avoid flakiness, handle inputs and outputs, etc., preventing the use of the PITest environment for test execution. To solve this issue, we refined the mutation analysis to use the default test execution environment, supporting ad-hoc test runners. We use PITest to generate the different mutants, and the results of the line coverage to reduce the analysis time by executing against each mutant only the tests reaching the mutated lines. Additionally, we set a hard deadline (5 minutes by default) for the mutation analysis to avoid infinite executions.
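The selection of the tests to execute against a mutant can be sketched as follows (a simplified illustration, not JUGE’s actual code): given the per-test line coverage collected previously and the lines modified by a mutant, only the tests reaching at least one of those lines are executed:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative selection of the tests to run against a mutant, based on
// previously collected per-test line coverage.
public class MutantTestSelection {

    // coveragePerTest maps a test identifier to the set of covered line numbers;
    // mutatedLines contains the line numbers modified by the mutant.
    public static List<String> testsToRun(Map<String, Set<Integer>> coveragePerTest,
                                          Set<Integer> mutatedLines) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> entry : coveragePerTest.entrySet()) {
            if (!Collections.disjoint(entry.getValue(), mutatedLines)) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }
}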

3.3 Data analysis and ranking

To answer the different research questions, the generators can be compared based on the different metrics collected during the analysis of the generated tests. For that, JUGE uses Friedman’s test and the post-hoc Conover’s test for multiple pairwise comparisons [13]. The former is a non-parametric test for significance, widely used for multiple-problem analysis, where the problems correspond to the classes under test (CUTs) in our case. A significant p-value for this test indicates that the evaluated tools statistically differ w.r.t. the overall performance score (alternative hypothesis). While Friedman’s test indicates whether the tools in the comparison are statistically different or not, it does not indicate for which pairs of tools such significance actually holds. Hence, the statistical analysis is complemented by the post-hoc Conover’s test for pairwise comparison. Notice that the p-values produced by the post-hoc test are further adjusted with the Holm-Bonferroni procedure. This procedure corrects the statistical significance level (α = 0.05) in case of multiple comparisons [48].

Based on the results of Friedman’s test, JUGE produces a final ranking with the average measured value and standard deviation, together with the results of the post-hoc Conover’s test.

Tool Score Std.dev Ranking
EvoSuite 1457 192.72 1.55
JTExpert 849 102.03 2.71
T3 526 82.43 2.81
Randoop 448 34.75 2.92
Table 1: Example of scores obtained through Friedman’s test for the 5th edition of the tool competition [48].
EvoSuite JTExpert T3 Randoop
EvoSuite - - - -
JTExpert - - -
T3 - -
Randoop -
Table 2: Example of results of the post-hoc Conover’s test for the 5th edition of the tool competition [48].

Example of comparison.

JUGE allows combining different metrics to ease the overall comparison of different generators. For that, it relies on a scoring formula [48]. This formula has been developed and refined during the different editions of the tool competition; it takes into account the line and branch coverage, the mutation score, and the time budget used by the generator, and applies a penalty for flaky and non-compiling tests. For example, Table 1 provides the ranking obtained through Friedman’s test for the fifth edition of the tool competition [48]. EvoSuite is ranked first with an average score of 1457, followed by JTExpert, T3, and Randoop. Table 2 gives the results of the post-hoc Conover’s test for the same edition of the competition and indicates that the various comparisons are statistically significant, except for T3 and Randoop, for which the p-value is above the significance level of 0.05.

Definition 1 (Score per execution [48]).

For each execution e(g, c, b, i) of a unit test generator g on a class under test c, with a time budget b, the score equals

score(e(g, c, b, i)) = ω1 · covl + ω2 · covb + ω3 · covm − p

where covl (resp. covb) is the line (resp. branch) coverage of the generated tests (between 0 and 1); covm is the mutation score (i.e., the ratio between the number of mutants killed by the test suite and the total number of mutants generated); ω1, ω2, and ω3 are the weights, set to 1, 2, and 4 by default; t is the total amount of time used by the generator (a penalty is applied to the score if t exceeds the time budget, to take pre- and post-processing into account); and p is a penalty applied if the generator produced flaky or non-compiling tests, computed using the following formula

p = sNC / sTOT + uF / uTOT

where sNC and sTOT are the number of non-compiling and the total number of test suites, and uF and uTOT are the number of flaky and the total number of unit tests.
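As a purely hypothetical example, with covl = 0.8, covb = 0.6, covm = 0.5, the default weights, a generation time within the budget, and no flaky or non-compiling tests (p = 0), the score of the execution would be 1 · 0.8 + 2 · 0.6 + 4 · 0.5 = 4.0.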

The score of a generator is computed by summing the scores for the different classes under test and time budgets.

Definition 2 (Final score [48]).

For a generator g, a set of classes under test C, and a set of time budgets B, the final score equals

score(g) = Σ(c ∈ C) Σ(b ∈ B) score(g, c, b)

where score(g, c, b) corresponds to the average score of the different executions in case of multiple executions of the generator on the same class under test (which is recommended if the generator involves randomness [5]):

score(g, c, b) = (1/n) · Σ(i = 1..n) score(e(g, c, b, i))

4 Evaluating unit test generators with JUGE

This section provides general guidelines regarding how to evaluate and compare unit test generation tools with JUGE.

4.1 Research questions and evaluation setup

JUGE can be used to evaluate automated unit test generators that do not require human intervention during the generation process. It relies on the source code or binaries of a set of benchmark projects. JUGE comes with support for structural coverage and mutation analysis. Therefore, it is well suited for quantitative analysis, yet it still allows qualitative analysis. All the tests generated during the evaluation are saved and can be inspected or reused for other analyses. Additionally, as explained in subsection 3.1, JUGE allows the unit test generators to save any additional data for later analysis. For instance, a search-based unit test generator can save intermediate fitness values to analyse the evolution of the fitness landscape.

Generator meta-parameters.

Many unit test generators can be configured through a set of meta-parameters (e.g., mutation and crossover probabilities for search-based approaches). To ease the evaluation and processing of the results, we recommend considering each configuration as an individual generator with its own adapter in a dedicated folder (A in Figure 1) and an explicit name reflecting the configuration. Configuring the generators with the right parameters to answer the research questions and reporting those configurations in the empirical study is of paramount importance to reduce the threats to the validity and enable the replicability of the results.

JUGE meta-parameters.

The infrastructure has two meta-parameters: the time budget and the number of repetitions. The time budget corresponds to the budget allocated to generate a set of test cases for a given benchmark (i.e., a class under test). JUGE also uses the time budget to set a global timeout for each execution, equal to twice the time budget. The time budget depends mainly on the type of approach used by the test case generator. For instance, previous research indicates that a time budget of three minutes is suited for a search-based generator like EvoSuite [19, 47] but is not enough for a symbolic execution approach like Tardis or Sushi [10].
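For illustration only (this is not JUGE’s internal implementation), such a global timeout can be enforced by running the generator in a separate process and killing it when twice the time budget has elapsed; the command line below is hypothetical:

import java.util.concurrent.TimeUnit;

// Illustrative enforcement of a global timeout around a test generator
// started as an external process.
public class GlobalTimeout {

    public static int runWithTimeout(ProcessBuilder generatorCommand,
                                     long budgetInSeconds) throws Exception {
        Process process = generatorCommand.inheritIO().start();
        long globalTimeout = 2 * budgetInSeconds; // budget plus pre- and post-processing
        if (!process.waitFor(globalTimeout, TimeUnit.SECONDS)) {
            process.destroyForcibly(); // kill the generator on timeout
            return -1;
        }
        return process.exitValue();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical invocation; the actual command depends on the generator.
        ProcessBuilder command = new ProcessBuilder("java", "-jar", "generator.jar");
        System.out.println("Exit code: " + runWithTimeout(command, 60));
    }
}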

Similarly, the number of repetitions varies if the generator relies on an exact approach or uses randomness. For exact approaches, one execution is enough (unless one of the research questions considers the execution time, in which case several repetitions are necessary). For randomized approaches (e.g., search-based and random approaches), several repetitions are necessary to ensure the statistical power of the results. Arcuri and Briand [5] estimated that the number of repetitions is a compromise between the number of benchmarks used in the evaluation, the execution time of the generators, and the overall budget available to perform the evaluation. They concluded that each randomized generator should be executed 1,000 times and, if this is not possible, that the reasons and the total execution time of the entire evaluation should be reported. However, the number of repetitions (for a larger number of benchmarks) should not be less than 10.

4.2 Benchmarks selection

The selection of the benchmarks (i.e., sets of classes under test) should follow a systematic approach and ensure that the benchmarks are diverse enough to reduce the threats to the validity of the research questions [42], for instance, by considering projects from different application domains. Those projects (and classes under test) can come from existing benchmarks, e.g., Defects4J [33] or the previous editions of the tool competition relying on JUGE [60, 48, 41, 35, 17] (benchmarks of previous tool competitions are available at https://github.com/JUnitContest/JUGE/tree/master/infrastructure).

Alternatively, there exist several ways to define a new set of benchmarks for a given set of projects. Based on our experience from the past tool competitions [17], we suggest the following two-step procedure. In the first step, (i) identify the packages in the project that contain classes relevant for the evaluation (e.g., packages containing classes with the business logic); and (ii) compute McCabe’s cyclomatic complexity for the different classes of those packages and remove classes with a complexity lower than five. This reduces the risk of sampling classes with few branches, easily covered by randomly generated tests [47].
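As an illustration of the complexity-based filtering in step (ii), the sketch below approximates McCabe’s cyclomatic complexity by counting decision points with the JavaParser library (an assumption; JUGE does not prescribe a specific tool), deliberately ignoring constructs such as switch cases and boolean operators:

import java.io.File;

import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.body.MethodDeclaration;
import com.github.javaparser.ast.expr.ConditionalExpr;
import com.github.javaparser.ast.stmt.DoStmt;
import com.github.javaparser.ast.stmt.ForStmt;
import com.github.javaparser.ast.stmt.IfStmt;
import com.github.javaparser.ast.stmt.WhileStmt;

// Rough approximation of the cyclomatic complexity of a source file:
// one per method plus one per decision point found in the file.
public class ComplexityFilter {

    public static int approximateComplexity(File javaFile) throws Exception {
        CompilationUnit unit = StaticJavaParser.parse(javaFile);
        int methods = unit.findAll(MethodDeclaration.class).size();
        int decisions = unit.findAll(IfStmt.class).size()
                + unit.findAll(ForStmt.class).size()
                + unit.findAll(WhileStmt.class).size()
                + unit.findAll(DoStmt.class).size()
                + unit.findAll(ConditionalExpr.class).size();
        return methods + decisions;
    }

    public static void main(String[] args) throws Exception {
        File javaFile = new File(args[0]);
        // Classes below the threshold (five in the guideline above) would be discarded.
        System.out.println(javaFile + " -> approximate complexity " + approximateComplexity(javaFile));
    }
}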

Figure 3: Example of reporting of the line coverage, branch coverage and mutation score for the candidate (382 classes) and selected benchmarks (60 classes, 20 per project) from the 2020 tool competition [17].

In the second step, we suggest executing a random generator (e.g., Randoop) with a low time budget (e.g., ten seconds) on the remaining classes and filtering out classes for which the generator could not generate any tests. This reduces the chances of running into technical difficulties during the evaluation of the different tools. If the number of remaining classes is still too high, one can randomly sample a subset of classes per project. In addition to the previous steps, one can also use JUGE to perform a coverage and mutation analysis of the tests produced by the random generator and report the results for the candidate and sampled classes (e.g., Figure 3). Finally, the different classes can be grouped into one or more benchmarks, depending on the study goals. For instance, all the classes from the same project can be grouped into one benchmark, or (as is the case for the tool competition) each class can be an individual benchmark.

1{
2  BCEL-1= { 
3    src=/var/benchmarks/projects/bcel-6.0-src/src/main/java 
4    bin=/var/benchmarks/projects/bcel-6.0-src/target/classes 
5    classes=(org.apache.bcel.classfile.Utility) 
6    classpath=(/var/benchmarks/projects/bcel-6.0-src/target/classes) 
7  }
8  BCEL-2= {
9    [...]
10  }
11}
Listing 2: Excerpt of a benchmarks.list configuration file

The benchmarks are described in a dedicated configuration file (benchmarks.list). Listing 2 provides an excerpt of a benchmarks.list configuration file from the JUGE example benchmarks. Each benchmark has a unique identifier (line 2), the path to the root folder of the source files of the project (line 3), the path to the root folder containing the compiled classes (line 4), the list of classes under test (line 5), and the classpath with all the dependencies to use for the generation and the coverage and mutation analysis (line 6). Once the benchmarks are defined, JUGE allows building a new Docker image (B in Figure 1) that can be instantiated multiple times in different containers to run the different tools.

4.3 Evaluation execution and results processing

Once the benchmarks and meta-parameters are defined, JUGE can start the evaluation by running different commands from the home directory of the tool in the Docker image (e.g., /home/randoop in the example of Figure 1). We summarize hereafter the main steps and commands to use during the evaluation (details on how to start the Docker container and on the different commands available in JUGE are provided in the documentation at https://github.com/JUnitContest/JUGE/blob/master/docs/). If the available hardware allows it, it is possible to run several Docker containers in parallel (instantiated from the same JUGE Docker image), each responsible for executing a different generator. One should, however, be cautious not to overload the machine, as it could impact the execution of the generators and provoke timeouts. Ideally, different Docker containers should be run on independent machines with the same hardware configuration. Practically, if this is not possible, we strongly recommend doing some initial tests to determine the adequate number of parallel Docker containers to avoid undesirable side effects.

Unit test generation.

To start the unit test generation, one can run the following commands:

contest_generate_tests.sh <tool> <repetitions> <starts-from> <budget>

For instance, to execute Randoop once with a time budget of 10 seconds, one can run contest_generate_tests.sh randoop 1 1 10. The results are placed in a folder named after the tool and the time budget (/home/randoop/results_randoop_10 in our example). For each benchmark and each repetition (starting from the given index), the results folder contains a subfolder with the tests generated by the tool (e.g., BCEL-1_1, BCEL-2_1, etc.). Those subfolders also contain text files with the logs and additional data produced by the generator.

Coverage and mutation analysis.

Similarly, the computation of the different coverage and mutation metrics can be started with the following command:

contest_compute_metrics.sh results_<tool>_<budget>

For instance, contest_compute_metrics.sh results_randoop_10 will run the coverage analysis using JaCoCo and the mutation analysis using PITest and store the results in a CSV file (transcript.csv) placed in the different sub-folders.

Score and statistical analysis.

Once the different metrics have been computed for the different tools and budgets, the different results folders can be regrouped in a single home directory (e.g., /home/all/) attached to a JUGE Docker container. The results can be collected in a single results.csv file using the command contest_transcript_single.sh. Finally, the score and statistical analysis of the results can be run using the following command, which will produce the different reports in the specified output folder: score.sh <results.csv> <output-folder>

4.4 Reporting, archiving and reproducibility

One of the goals of the JUGE infrastructure is to enhance the repeatability and reproducibility of both the results and the statistical and qualitative analyses. For that, we strongly recommend submitting an artifact containing the following elements:

  • the benchmarks, the generated tests, and additional data if any,

  • the files produced by the coverage and mutation analysis, as well as any additional analyses,

  • the results of the statistical analysis, together with any other data analysis scripts used for the evaluation.

Suppose some of the benchmarks are under a non-disclosure agreement. In that case, we strongly recommend adding benchmarks coming from open-source systems to the analysis and releasing those in the artifact. The design of such an artifact must be planned early on in the study. We recommend, for instance, forking the JUGE repository and updating the benchmarks configuration and files to generate the Docker image used to perform the evaluation. The fork can then be easily saved in a data repository (like Zenodo, https://zenodo.org, which has a GitHub integration) for long-term storage with a dedicated DOI.

In addition to the artifact, the reporting of the evaluation set-up should mention the following elements:

  • the randomized (or not) nature of the generators used in the evaluation;

  • the meta-parameter configuration(s) of each generator;

  • the meta-parameter configuration of JUGE (including the number of repetitions in the case of randomized generators) with a justification for those values;

  • the total number of independent executions and the total execution time taken by the evaluation;

  • the specifications of the hardware and the number of Docker containers running in parallel;

  • the benchmarks selection procedure and the characteristics of the selected benchmarks relevant to the goals of the evaluation (e.g., the number of lines of code of the projects and classes under test, the average McCabe’s cyclomatic complexity of the benchmarks, etc.);

  • any additional data collected and statistical analysis performed on the results of the evaluation with a proper justification (e.g., see Arcuri and Briand [5] for a discussion on statistical analysis for randomized algorithms).

5 Impact of JUGE

The JUGE infrastructure played a significant role in the replication of previous results regarding the structural coverage and mutation score achieved by automated unit test generators. The configurability of the infrastructure through the meta-parameters and the benchmarks considered for the various editions of the tool competition allowed us to assess the generated tests under various conditions. It independently confirmed that (i) search-based unit test generation (as implemented in EvoSuite) achieves better coverage and mutation scores [60, 48, 41, 17]; and (ii) automatically generated tests can compete with manually written ones w.r.t. coverage and mutation score [41, 35].

The JUGE infrastructure and the tool competition also helped to push the boundaries of unit test generation by confronting industrial generators with academic ones and showcasing how research can contribute to industrial practices [61]. Moreover, selecting various benchmarks from open-source systems helped to improve the academic generators by confronting them with new classes under test, thereby increasing the generalisability of the underlying approaches. For instance, EvoSuite has entered the competition multiple times with several algorithms (whole-suite approach [20], MOSA [45], DynaMOSA [47], etc.), and in 2019, the results of the competition led to the fix of a major bug [11].

Edition Generators Budgets (in sec.) #C Projects
2013 [6] Randoop, EvoSuite [22], T3 [52] - 77 Apache Commons Lang, Apache Lucene, Barbecue, Joda Time, sqlsheet
2014 [7] Randoop, EvoSuite [23], T3 [57] - 63 Async Http Client, eclipse-cs, GData Java Client, Guava, Hibernate, JMLL, JWPL, Scribe, Twitter4j
2015 [61] Randoop, EvoSuite (whole-suite) [25], EvoSuite (MOSA) [46], GRT [39], JTExpert [63], T3 [53], undisclosed Commercial Tool (CT) - 63 Async Http Client, eclipse-cs, GData Java Client, Guava, Hibernate, JMLL, JWPL, Scribe, Twitter4j
2016 [60] Randoop, EvoSuite (whole-suite) [26], JTExpert [64], T3 [54] 60, 120, 240, 480 68 Defects4J
2017 [48] Randoop, EvoSuite (whole-suite) [28], JTExpert [65] 10, 30, 60, 120, 240, 300, 480 69 Apache Commons BCEL, Imaging, and Jxpath, Freehep, Gson, Re2J, LA4J, Okhttp
2018 [41] Randoop, EvoSuite (whole-suite) [27], T3 [55] 10, 60, 120, 240 59 Dubbo, FastJason, JSoup, Okio, Redisson, Webmagic, Zxing
2019 [35] Randoop, EvoSuite (DynaMOSA) [11], Sushi [10], Tardis [10], T3 [56] 10, 60, 120, 240 38 Antlr4, AuthzForce, Dubbo, Fescar, FastJason, Imixs-Workflow, Okio, Spoon, Webmagic, Zxing
2020 [17] Randoop, EvoSuite (DynaMOSA) [44] 60, 180 70 Fescar/Seata, Guava, PdfBox, Spoon
2021 [50] Randoop, EvoSuite [68], EvosuiteDSE [36], Kex [1], UtBot [31] 30, 120 98 Seata, Guava, FastJSON, Spoon, Weka, Okio
Table 3: Editions of the tool competitions relying on the JUGE infrastructure with the generators, the time budgets (in seconds), the number of classes under test (#C), and the projects considered for the edition.

Table 3 describes the main characteristics of the different editions of the tool competition. Over the years, various tools have entered the competition and evolved. Among the different tools, Randoop is used as a baseline, and EvoSuite has entered every edition since the first one.

The different editions have also tried different configurations w.r.t. the execution of the tools and the time budget allocated for the generation. Before 2016, the time budget was left to the participants to decide (marked as - in Table 3). Since 2016, the organizers have tried various time budgets, including 10 seconds in 2017 and 2018 and 30 seconds in 2017 and 2021, to assess how the different tools react under a minimal budget.

Similarly, the different editions have used classes under test from various open-source projects to allow the distribution of the benchmarks after the competition. This allows one to replicate the results and allows the participants of the next edition to try their implementation of the runtool adapter before submitting their tool to the competition. In 2016, the organizers decided to use Defects4J to generate regression tests and assess the tools’ capability to expose real-world faults. Also, in 2019, 78 classes under test were initially selected. However, due to issues faced in the infrastructure during metrics computation (and fixed since), the number of classes considered for the final ranking was reduced to 38.

Running the tool competition every year is not trivial. One of the main challenges the different organizers face is the hardware infrastructure it requires, due to the limited time between the submission of the different tools and the deadline for providing the results (which is around two weeks). Both the generation of the tests and their evaluation using coverage and mutation analysis are time-consuming, requiring a powerful server or a cluster.

6 Discussion and lessons learned

Any empirical evaluation of automated unit test generation faces several technical and methodological challenges. JUGE seeks to alleviate those challenges by providing a standardized way of designing, running, and reporting such evaluations. Both the JUGE infrastructure and the evaluation method reported in Section 4 took several years to develop. We discuss hereafter the main lessons learned, as well as potential new applications of JUGE.

6.1 Lessons learned

Diversity of the generators.

The main technical challenges for such an infrastructure come from the diversity of the generators that can be considered for an empirical evaluation (i.e., random-based, search-based, concolic/symbolic-based, etc.). This diversity requires isolating the executions to avoid problems in case of a bug in the generator (e.g., erasing files from the host file system [21]) while still having a standard communication interface. This is achieved through the usage of an adapter with a shared common set of commands used by JUGE to interact with the generator. Although the isolation of the generator during test case generation is not handled by JUGE, the whole infrastructure runs in a Docker container to add an extra layer of security.

Balancing threats to external validity and statistical power.

As for any empirical evaluation with a random-based generator, researchers have to balance the number of classes under test (to reduce threats to external validity) with the number of executions (to ensure enough statistical power), given the external constraints on the overall execution time [5]. For instance, in the tool competition, the entire evaluation must be carried out in around two weeks. To cope with this limitation, organizers use sampling to select a subset of classes under test and limit the time budget (not more than 8 minutes) and the number of repetitions of the executions (between 6 and 10, depending on the year). As explained in Section 4.1, the time budget allocated to the generator depends on the type of approach and the research questions being answered.

Configuration of the meta-parameters.

In addition to the time budget and the number of repetitions, which can be configured in JUGE, the generators themselves usually come with various meta-parameters that directly influence the generation process. As explained in Section 4.1, such parameters should be carefully considered and reported to reduce the threats to validity and enable the replicability of the results. For instance, many test generators like EvoSuite and Randoop include post-processing mechanisms that can be activated to minimize the generated tests [19, 43]. Such mechanisms are time-consuming and can be deactivated to reduce the overall execution time when evaluating properties such as coverage or the mutation score. However, deactivating test case minimization has a significant impact on other properties, such as the structure, readability, and execution time of the tests. Researchers should be aware of such impacts and carefully consider them when designing their studies. In JUGE, we consider each generator configuration (e.g., EvoSuite using a different generation algorithm) as a generator in itself, which requires its own runtool adapter and corresponding common folder.

Analysis of the generated tests.

Automated test case generation itself is a challenging task and requires considering several mechanisms (e.g., code instrumentation, handling I/O operations on the system under test, etc.) to be effective. Among the possible mechanisms is using a specific scaffolding for the generated tests: for instance, EvoSuite controls elements that could be non-deterministic to avoid test flakiness. However, such mechanisms might cause undesired interactions with the infrastructure, and more specifically, with the tools used to analyze the generated tests. This has been the case for EvoSuite and the mutation analysis: the test runner (EvoRunner) used in the generated tests was not compatible with PITest and required using the mutated .class files directly instead of relying on the optimized PITest infrastructure. In the latest version, JUGE includes options to parallelize the execution of the mutation analysis and reduce the overall execution time of the evaluation.

6.2 Future applications

The method described in Section 4 constitutes a standard that can be applied to unit test generation for other languages using an infrastructure similar to JUGE. For instance, Lukasczyk et al. [37] recently defined an approach to generate unit tests for Python. Of course, dynamically typed languages such as Python face different challenges than statically typed languages like Java. Those challenges have to be taken into account in the design of the infrastructure (e.g., running type inference engines during pre-processing) and the selection of the benchmarks (e.g., considering only classes with type annotations), and reported in the description of the empirical evaluation.

Besides comparing unit test generators, the JUGE infrastructure can be used to generate large amounts of tests for various kinds of classes using different tools and configurations. This enables the continuous creation of an openly available corpus of automatically generated unit tests. Such a corpus would (i) directly contribute to the body of empirical evidence on which decision-makers can rely to assess the usage of a unit test generator in their industrial context [18]; and (ii) enable further empirical evaluations on automatically generated tests without requiring researchers to configure and run the generators, which requires a certain level of expertise. For instance, in a recent study, Panichella et al. [49] revisited previous studies on the presence of test smells in automatically generated tests and found that previous results vastly overestimated their presence. Among the different problems, they pointed out a misconfiguration of EvoSuite and its minimization process, resulting in larger test cases that are more likely to contain certain smells. Building openly available corpora using the appropriate configuration for the generators, together with a description of their characteristics and the kinds of evaluations they can be used for, would avoid such issues.

7 Conclusion

JUGE sets a standard for the proper assessment of automated test case generators. It provides an infrastructure and a method to design, set up, and execute an empirical evaluation, collect and analyze the results, and produce a replication package to meet the level of requirements enabling an effective contribution to the software testing empirical body of knowledge. It includes recommendations for selecting benchmarks and for the parametrization of the generator and the infrastructure, depending on the considered research questions. JUGE was originally introduced and developed in the context of the tool competition and has been used with several generators and dozens of classes under test coming from various projects.

As future directions for researchers, we envision several possibilities: (i) include additional analyses besides coverage and mutation analysis (e.g., performance or readability); (ii) experiment with generators supporting different levels of testing (e.g., integration and system testing) for other types of systems (e.g., cloud-based systems); and (iii) investigate how JUGE can be extended to support other languages (e.g., dynamically typed languages such as Python).

Finally, the availability of the JUGE infrastructure opens several directions for practitioners, who can rely on a large body of empirical evidence to assess the usage of automated test case generation in their context, and for researchers, who can benefit from corpora of automatically generated tests for further empirical evaluations. JUGE also provides guidelines for evaluating unit test generation in other programming languages and for the definition of similar infrastructures in other domains.

We would like to thank (in alphabetical order) Arthur Baars, Sebastian Bauersfeld, Matteo Biagiola, Ignacio Lebrero, Urko Rueda Molina, and Fiorella Zampetti for their contribution to the implementation of the JUGE infrastructure. We would also like to thank (in alphabetical order) Azat Abdullin, Marat Akhin, Giuliano Antoniol, Andrea Arcuri, Cyrille Artho, Mikhail Belyaev, Pietro Braione, Nikolay Bukharev, José Campos, Nelly Condori, Christoph Csallner, Giovanni Denaro, Gordon Fraser, Yann-Gaël Guéhéneuc, Masami Hagiya, Mainul Islam, Dmitry Ivanov, Kiran Lakhotia, Ignacio Manuel Lebrero Rial, Lei Ma, Alexey Menshutin, Arsen Nagdalian, Gilles Pesant, Simon Poulding, Wishnu Prasetya, José Miguel Rojas, Abdelilah Sakti, Hiroyuki Sato, Sebastian Schweikl, Gleb Stromov, Yoshinori Tanabe, Paolo Tonella, Artem Ustinov, Sebastian Vogl, Tanja Vos, Mitsuharu Yamamoto, and Cheng Zhang for their participation in previous editions of the competition and the feedback they provided on the infrastructure. Xavier Devroey was partially funded by the EU Horizon 2020 ICT-10-2016-RIA “STAMP” project (No.731529) and the Vici “TestShift” project (No. VI.C.182.032) from the Dutch Science Foundation NWO. Sebastiano Panichella and Annibale Panichella gratefully acknowledge the Horizon 2020 (EU Commission) support for the project COSMOS (DevOps for Complex Cyber-physical Systems), Project No. 957254-COSMOS. René Just’s work is partially supported by the National Science Foundation under grant CNS-1823172.

References

  • [1] A. Abdullin, M. Akhin, and M. Belyaev (2021) Kex at the 2021 SBST Tool Competition. In 2021 IEEE/ACM 14th International Workshop on Search-Based Software Testing (SBST), External Links: Document Cited by: §1, Table 3.
  • [2] S. Ali, L. C. Briand, H. Hemmati, and R. K. Panesar-Walawege (2010-11) A Systematic Review of the Application and Empirical Investigation of Search-Based Test Case Generation. IEEE Transactions on Software Engineering 36 (6), pp. 742–762. External Links: Document, ISSN 0098-5589 Cited by: §2.1.
  • [3] N. Alshahwan, X. Gao, M. Harman, Y. Jia, K. Mao, A. Mols, T. Tei, and I. Zorin (2018) Deploying Search Based Software Engineering with Sapienz at Facebook. In Search-Based Software Engineering. SSBSE 2018., LNCS, Vol. 11036. External Links: Document Cited by: §1, §2.
  • [4] J.H. Andrews, L.C. Briand, Y. Labiche, and A.S. Namin (2006-08) Using Mutation Analysis for Assessing and Comparing Testing Coverage Criteria. IEEE Transactions on Software Engineering 32 (8), pp. 608–624. External Links: Document, ISSN 0098-5589 Cited by: §2.2.
  • [5] A. Arcuri and L. Briand (2014) A Hitchhiker’s guide to statistical tests for assessing randomized algorithms in software engineering. Software Testing, Verification and Reliability 24 (3), pp. 219–250. External Links: Document, ISSN 1099-1689 Cited by: §2.1, 7th item, §4.1, §6.1, Definition 2.
  • [6] S. Bauersfeld, T. E. J. Vos, K. Lakhotia, S. M. Poulding, and N. Condori-Fernández (2013) Unit testing tool competition. In Sixth IEEE International Conference on Software Testing, Verification and Validation, ICST 2013 Workshops Proceedings, Luxembourg, Luxembourg, March 18-22, 2013, pp. 414–420. External Links: Document Cited by: §1, Table 3.
  • [7] S. Bauersfeld, T. E. J. Vos, and K. Lakhotia (2013) Unit testing tool competitions - lessons learned. In Future Internet Testing - First International Workshop, FITTEST 2013, Istanbul, Turkey, November 12, 2013, Revised Selected Papers, T. E. J. Vos, K. Lakhotia, and S. Bauersfeld (Eds.), Lecture Notes in Computer Science, Vol. 8432, pp. 75–94. External Links: Document Cited by: §1, Table 3.
  • [8] P. Braione, G. Denaro, A. Mattavelli, and M. Pezzè (2017) Combining symbolic execution and search-based testing for programs with complex heap inputs. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis - ISSTA 2017, New York, New York, USA, pp. 90–101. External Links: Document, ISBN 9781450350761 Cited by: §1, §1.
  • [9] P. Braione, G. Denaro, A. Mattavelli, and M. Pezzè (2018-05) SUSHI: A Test Generator for Programs with Complex Structured Inputs. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceeedings, New York, NY, USA, pp. 21–24. External Links: Document, ISBN 9781450356633 Cited by: §1.
  • [10] P. Braione and G. Denaro (2019) SUSHI and TARDIS at the SBST2019 tool competition. In Proceedings of the 12th International Workshop on Search-Based Software Testing, SBST@ICSE 2019, Montreal, QC, Canada, May 27, 2019, A. Gorla and J. M. Rojas (Eds.), pp. 25–28. External Links: Document Cited by: §1, §4.1, Table 3.
  • [11] J. Campos, A. Panichella, and G. Fraser (2019) EvoSuite at the SBST 2019 tool competition. In Proceedings of the 12th International Workshop on Search-Based Software Testing, SBST@ICSE 2019, Montreal, QC, Canada, May 27, 2019, A. Gorla and J. M. Rojas (Eds.), pp. 29–32. External Links: Document Cited by: §1, Table 3, §5.
  • [12] H. Coles, T. Laurent, C. Henard, M. Papadakis, and A. Ventresque (2016) PIT: a practical mutation testing tool for Java. In Proceedings of the 25th International Symposium on Software Testing and Analysis - ISSTA 2016, New York, New York, USA, pp. 449–452. External Links: Document, ISBN 9781450343909 Cited by: §3.2.
  • [13] W. J. Conover and R. L. Iman (1981-08) Rank Transformations as a Bridge Between Parametric and Nonparametric Statistics. The American Statistician 35 (3), pp. 124. External Links: Document, ISSN 00031305 Cited by: §3.3.
  • [14] R.A. DeMillo, R.J. Lipton, and F.G. Sayward (1978-04) Hints on Test Data Selection: Help for the Practicing Programmer. Computer 11 (4), pp. 34–41. External Links: Document, ISSN 0018-9162 Cited by: §2.2.
  • [15] P. Derakhshanfar, X. Devroey, A. Panichella, A. Zaidman, and A. Van Deursen (2020-08) Botsing, a Search-based Crash Reproduction Framework for Java. In 35th IEEE/ACM International Conference on Automated Software Engineering (ASE ’20), pp. 1278–1282. External Links: Document Cited by: §1.
  • [16] X. Devroey, A. Gambi, J. P. Galeotti, R. Just, F. Kifetew, A. Panichella, and S. Panichella (2021-06) JUGE: junit generation benchmarking infrastructure. Zenodo. External Links: Document, Link Cited by: §1, §1.
  • [17] X. Devroey, S. Panichella, and A. Gambi (2020) Java Unit Testing Tool Competition - Eighth Round. In 2020 IEEE/ACM 13th International Workshop on Search-Based Software Testing (SBST), External Links: Document Cited by: §1, Figure 3, §4.2, §4.2, item i, Table 3.
  • [18] T. Dyba, B.A. Kitchenham, and M. Jorgensen (2005-01) Evidence-based software engineering for practitioners. IEEE Software 22 (1), pp. 58–65. External Links: Document, ISSN 0740-7459 Cited by: §1, §2, item i.
  • [19] G. Fraser and A. Arcuri (2011) EvoSuite: Automatic Test Suite Generation for Object-Oriented Software. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering - SIGSOFT/FSE ’11, ESEC/FSE ’11, New York, New York, USA, pp. 416. External Links: Document, ISBN 9781450304436 Cited by: §1, §1, §4.1, §6.1.
  • [20] G. Fraser and A. Arcuri (2013-02) Whole Test Suite Generation. IEEE Transactions on Software Engineering 39 (2), pp. 276–291. External Links: Document Cited by: §5.
  • [21] G. Fraser and A. Arcuri (2013-03) EvoSuite: On the Challenges of Test Case Generation in the Real World. In 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, pp. 362–369. External Links: Document, ISBN 978-0-7695-4968-2 Cited by: §6.1.
  • [22] G. Fraser and A. Arcuri (2013) EvoSuite at the SBST 2013 tool competition. In Sixth IEEE International Conference on Software Testing, Verification and Validation, ICST 2013 Workshops Proceedings, Luxembourg, Luxembourg, March 18-22, 2013, pp. 406–409. External Links: Document Cited by: §1, Table 3.
  • [23] G. Fraser and A. Arcuri (2013) EvoSuite at the second unit testing tool competition. In Future Internet Testing - First International Workshop, FITTEST 2013, Istanbul, Turkey, November 12, 2013, Revised Selected Papers, T. E. J. Vos, K. Lakhotia, and S. Bauersfeld (Eds.), Lecture Notes in Computer Science, Vol. 8432, pp. 95–100. External Links: Document Cited by: §1, Table 3.
  • [24] G. Fraser and A. Arcuri (2014-12) A Large-Scale Evaluation of Automated Unit Test Generation Using EvoSuite. ACM Transactions on Software Engineering and Methodology 24 (2), pp. 1–42. External Links: Document, ISSN 1049331X Cited by: §2.3.
  • [25] G. Fraser and A. Arcuri (2015) EvoSuite at the SBST 2015 tool competition. In 8th IEEE/ACM International Workshop on Search-Based Software Testing, SBST 2015, Florence, Italy, May 18-19, 2015, G. Gay and G. Antoniol (Eds.), pp. 25–27. External Links: Document Cited by: §1, Table 3.
  • [26] G. Fraser and A. Arcuri (2016) EvoSuite at the SBST 2016 tool competition. In Proceedings of the 9th International Workshop on Search-Based Software Testing, SBST@ICSE 2016, Austin, Texas, USA, May 14-22, 2016, pp. 33–36. External Links: Document Cited by: §1, Table 3.
  • [27] G. Fraser, J. M. Rojas, and A. Arcuri (2018) Evosuite at the SBST 2018 tool competition. In Proceedings of the 11th International Workshop on Search-Based Software Testing, ICSE 2018, Gothenburg, Sweden, May 28-29, 2018, J. P. Galeotti and A. Gorla (Eds.), pp. 34–37. External Links: Document Cited by: §1, Table 3.
  • [28] G. Fraser, J. M. Rojas, J. Campos, and A. Arcuri (2017) EvoSuite at the SBST 2017 tool competition. In 10th IEEE/ACM International Workshop on Search-Based Software Testing, SBST@ICSE 2017, Buenos Aires, Argentina, May 22-23, 2017, pp. 39–42.
  • [29] G. Gay and R. Just (2020) Defects4J as a Challenge Case for the Search-Based Software Engineering Community. In Symposium on Search-Based Software Engineering (SSBSE 2020), LNCS, Vol. 12420, pp. 255–261.
  • [30] M. R. Hoffmann, E. Mandrikov, et al. (2014) JaCoCo Java code coverage library.
  • [31] D. Ivanov, N. Bukharev, A. Menshutin, A. Nagdalian, G. Stromov, and A. Ustinov (2021) UtBot at the SBST2021 Tool Competition. In 2021 IEEE/ACM 14th International Workshop on Search-Based Software Testing (SBST).
  • [32] N. Juristo, A. M. Moreno, and S. Vegas (2004) Towards building a solid empirical body of knowledge in testing techniques. ACM SIGSOFT Software Engineering Notes 29 (5), pp. 1–4.
  • [33] R. Just, D. Jalali, and M. D. Ernst (2014) Defects4J: a database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA 2014), pp. 437–440.
  • [34] R. Just, D. Jalali, L. Inozemtseva, M. D. Ernst, R. Holmes, and G. Fraser (2014) Are mutants a valid substitute for real faults in software testing? In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014), pp. 654–665.
  • [35] F. M. Kifetew, X. Devroey, and U. Rueda (2019) Java unit testing tool competition: seventh round. In Proceedings of the 12th International Workshop on Search-Based Software Testing, SBST@ICSE 2019, Montreal, QC, Canada, May 27, 2019, A. Gorla and J. M. Rojas (Eds.), pp. 15–20.
  • [36] I. M. Lebrero Rial and J. P. Galeotti (2021) EvoSuiteDSE at the SBST 2021 Tool Competition. In 2021 IEEE/ACM 14th International Workshop on Search-Based Software Testing (SBST).
  • [37] S. Lukasczyk, F. Kroiß, and G. Fraser (2020) Automated Unit Test Generation for Python. In Search-Based Software Engineering (SSBSE 2020), A. Aleti and A. Panichella (Eds.), LNCS, Vol. 12420, pp. 9–24.
  • [38] L. Ma, C. Artho, C. Zhang, H. Sato, J. Gmeiner, and R. Ramler (2015) GRT: Program-Analysis-Guided Random Testing (T). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 212–223.
  • [39] L. Ma, C. Artho, C. Zhang, H. Sato, M. Hagiya, Y. Tanabe, and M. Yamamoto (2015) GRT at the SBST 2015 tool competition. In 8th IEEE/ACM International Workshop on Search-Based Software Testing, SBST 2015, Florence, Italy, May 18-19, 2015, G. Gay and G. Antoniol (Eds.), pp. 48–51.
  • [40] S. M. Melo, F. M. Moura, P. S. L. Souza, and S. R. S. Souza (2019) SeleCTT: An Infrastructure for Selection of Concurrent Software Testing Techniques. In Proceedings of the IV Brazilian Symposium on Systematic and Automated Software Testing (SAST 2019), pp. 62–71.
  • [41] U. R. Molina, F. M. Kifetew, and A. Panichella (2018) Java unit testing tool competition: sixth round. In Proceedings of the 11th International Workshop on Search-Based Software Testing, SBST@ICSE 2018, Gothenburg, Sweden, May 28-29, 2018, J. P. Galeotti and A. Gorla (Eds.), pp. 22–29.
  • [42] M. Nagappan, T. Zimmermann, and C. Bird (2013) Diversity in software engineering research. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2013), pp. 466.
  • [43] C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball (2007) Feedback-Directed Random Test Generation. In 29th International Conference on Software Engineering (ICSE’07), pp. 75–84.
  • [44] A. Panichella, J. Campos, and G. Fraser (2020) EvoSuite at the SBST 2020 Tool Competition. In 2020 IEEE/ACM 13th International Workshop on Search-Based Software Testing (SBST).
  • [45] A. Panichella, F. M. Kifetew, and P. Tonella (2015) Reformulating Branch Coverage as a Many-Objective Optimization Problem. In 2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST), pp. 1–10.
  • [46] A. Panichella, F. M. Kifetew, and P. Tonella (2015) Results for EvoSuite - MOSA at the third unit testing tool competition. In 8th IEEE/ACM International Workshop on Search-Based Software Testing, SBST 2015, Florence, Italy, May 18-19, 2015, G. Gay and G. Antoniol (Eds.), pp. 28–31.
  • [47] A. Panichella, F. M. Kifetew, and P. Tonella (2018) Automated Test Case Generation as a Many-Objective Optimisation Problem with Dynamic Selection of the Targets. IEEE Transactions on Software Engineering 44 (2), pp. 122–158.
  • [48] A. Panichella and U. R. Molina (2017) Java unit testing tool competition - fifth round. In 10th IEEE/ACM International Workshop on Search-Based Software Testing, SBST@ICSE 2017, Buenos Aires, Argentina, May 22-23, 2017, pp. 32–38.
  • [49] A. Panichella, S. Panichella, G. Fraser, A. A. Sawant, and V. J. Hellendoorn (2020) Revisiting Test Smells in Automatically Generated Tests: Limitations, Pitfalls, and Opportunities. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 523–533.
  • [50] S. Panichella, A. Gambi, F. Zampetti, and V. Riccio (2021) SBST tool competition 2021. In 2021 IEEE/ACM 14th International Workshop on Search-Based Software Testing (SBST).
  • [51] M. Papadakis and N. Malevris (2010) Automatic Mutation Test Case Generation via Dynamic Symbolic Execution. In 2010 IEEE 21st International Symposium on Software Reliability Engineering, pp. 121–130.
  • [52] I. S. W. B. Prasetya (2013) Measuring T2 against SBST 2013 benchmark suite. In Sixth IEEE International Conference on Software Testing, Verification and Validation, ICST 2013 Workshops Proceedings, Luxembourg, Luxembourg, March 18-22, 2013, pp. 410–413.
  • [53] I. S. W. B. Prasetya (2015) T3: benchmarking at third unit testing tool contest. In 8th IEEE/ACM International Workshop on Search-Based Software Testing, SBST 2015, Florence, Italy, May 18-19, 2015, G. Gay and G. Antoniol (Eds.), pp. 44–47.
  • [54] I. S. W. B. Prasetya (2016) Budget-aware random testing with T3: benchmarking at the SBST2016 testing tool contest. In Proceedings of the 9th International Workshop on Search-Based Software Testing, SBST@ICSE 2016, Austin, Texas, USA, May 14-22, 2016, pp. 29–32.
  • [55] I. S. W. B. Prasetya (2018) T3 @sbst2018 benchmark, and how much we can get from asemantical testing. In Proceedings of the 11th International Workshop on Search-Based Software Testing, SBST@ICSE 2018, Gothenburg, Sweden, May 28-29, 2018, J. P. Galeotti and A. Gorla (Eds.), pp. 30–33.
  • [56] I. S. W. B. Prasetya (2019) Random testing with austere budgeting in T3: benchmarking at SBST2019 testing tool contest. In Proceedings of the 12th International Workshop on Search-Based Software Testing, SBST@ICSE 2019, Montreal, QC, Canada, May 27, 2019, A. Gorla and J. M. Rojas (Eds.), pp. 21–24.
  • [57] I. S. W. B. Prasetya (2013) T3, a combinator-based random testing tool for Java: benchmarking. In Future Internet Testing - First International Workshop, FITTEST 2013, Istanbul, Turkey, November 12, 2013, Revised Selected Papers, T. E. J. Vos, K. Lakhotia, and S. Bauersfeld (Eds.), Lecture Notes in Computer Science, Vol. 8432, pp. 101–110.
  • [58] I. S. W. B. Prasetya (2014) T3, a Combinator-Based Random Testing Tool for Java: Benchmarking. In Future Internet Testing (FITTEST 2013), T. Vos, K. Lakhotia, and S. Bauersfeld (Eds.), LNCS, Vol. 8432, pp. 101–110.
  • [59] P. Ralph, N. bin Ali, S. Baltes, D. Bianculli, J. Diaz, Y. Dittrich, N. Ernst, M. Felderer, R. Feldt, A. Filieri, B. B. N. de França, C. A. Furia, G. Gay, N. Gold, D. Graziotin, P. He, R. Hoda, N. Juristo, B. Kitchenham, V. Lenarduzzi, J. Martínez, J. Melegati, D. Mendez, T. Menzies, J. Molleri, D. Pfahl, R. Robbes, D. Russo, N. Saarimäki, F. Sarro, D. Taibi, J. Siegmund, D. Spinellis, M. Staron, K. Stol, M. Storey, D. Taibi, D. Tamburri, M. Torchiano, C. Treude, B. Turhan, X. Wang, and S. Vegas (2021) Empirical standards for software engineering research. arXiv:2010.03525.
  • [60] U. Rueda, R. Just, J. P. Galeotti, and T. E. J. Vos (2016) Unit testing tool competition: round four. In Proceedings of the 9th International Workshop on Search-Based Software Testing, SBST@ICSE 2016, Austin, Texas, USA, May 14-22, 2016, pp. 19–28.
  • [61] U. Rueda, T. E. J. Vos, and I. S. W. B. Prasetya (2015) Unit testing tool competition - round three. In 8th IEEE/ACM International Workshop on Search-Based Software Testing, SBST 2015, Florence, Italy, May 18-19, 2015, G. Gay and G. Antoniol (Eds.), pp. 19–24.
  • [62] A. Sakti, G. Pesant, and Y. Guéhéneuc (2015) Instance Generator and Problem Representation to Improve Object Oriented Code Coverage. IEEE Transactions on Software Engineering 41 (3), pp. 294–313.
  • [63] A. Sakti, G. Pesant, and Y. Guéhéneuc (2015) JTExpert at the third unit testing tool competition. In 8th IEEE/ACM International Workshop on Search-Based Software Testing, SBST 2015, Florence, Italy, May 18-19, 2015, G. Gay and G. Antoniol (Eds.), pp. 52–55.
  • [64] A. Sakti, G. Pesant, and Y. Guéhéneuc (2016) JTExpert at the fourth unit testing tool competition. In Proceedings of the 9th International Workshop on Search-Based Software Testing, SBST@ICSE 2016, Austin, Texas, USA, May 14-22, 2016, pp. 37–40.
  • [65] A. Sakti, G. Pesant, and Y. Guéhéneuc (2017) JTeXpert at the SBST 2017 tool competition. In 10th IEEE/ACM International Workshop on Search-Based Software Testing, SBST@ICSE 2017, Buenos Aires, Argentina, May 22-23, 2017, pp. 43–46.
  • [66] K. Sen (2007) Concolic testing. In Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE ’07), pp. 571–572.
  • [67] D. A. Tomassi, N. Dmeiri, Y. Wang, A. Bhowmick, Y. Liu, P. T. Devanbu, B. Vasilescu, and C. Rubio-González (2019) BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 339–349.
  • [68] S. Vogl, S. Schweikl, G. Fraser, A. Arcuri, J. Campos, and A. Panichella (2021) EvoSuite at the SBST 2021 Tool Competition. In 2021 IEEE/ACM 14th International Workshop on Search-Based Software Testing (SBST).
  • [69] T. E. J. Vos, B. Marín, M. J. Escalona, and A. Marchetto (2012) A Methodological Framework for Evaluating Software Testing Techniques and Tools. In 2012 12th International Conference on Quality Software, pp. 230–239.
  • [70] I. Yun, S. Lee, M. Xu, Y. Jang, and T. Kim (2018) QSYM: a practical concolic execution engine tailored for hybrid fuzzing. In 27th USENIX Security Symposium (USENIX Security 18), Baltimore, MD, pp. 745–761.