A Note About: Critical Review of BugSwarm for Fault Localization and Program Repair

10/29/2019
by David A. Tomassi, et al.

Datasets play an important role in the advancement of software tools and facilitate their evaluation. BugSwarm is an infrastructure to automatically create a large dataset of real-world reproducible failures and fixes. In this paper, we respond to Durieux and Abreu's critical review of the BugSwarm dataset, referred to in this paper as CriticalReview. We replicate CriticalReview's study and find several incorrect claims and assumptions about the BugSwarm dataset. We discuss these incorrect claims and other contributions listed by CriticalReview. Finally, we discuss general misconceptions about BugSwarm, and our vision for the use of the infrastructure and dataset.

1 Introduction

Datasets are imperative to the development and progression of software tools, not only to facilitate a fair and unbiased evaluation of their effectiveness, but also to inspire and enable the community to advance the state of the art. There have been various influential datasets developed in the Software Engineering community (e.g., [6, 9, 3, 11, 1, 2, 10]). Unfortunately, these datasets have required a substantial amount of manual effort to be created, which makes it difficult to grow them.

Recently we developed BugSwarm [12], an infrastructure that leverages continuous integration (CI) to automatically create a dataset of reproducible failures and fixes. BugSwarm comprises an infrastructure, dataset, REST API, and website. The initial dataset (version 1.0.0, reported in [12]) consists of 3,091 pairs of failures and fixes (referred to as artifacts) mined from Java and Python projects. Because artifacts mined from open-source software are bound to have different characteristics (number of failing tests, failure reason, fix location(s), patch size, etc.), we provide a REST API and website for users to navigate and select the artifacts that fit the needs of their tools.

BugSwarm is under active development, currently allowing the mining of failures and fixes that satisfy specific characteristics.

Parallel to the development of BugSwarm, Durieux and Abreu [5] conducted a review of the BugSwarm dataset (version 1.0.1) with respect to Automated Program Repair (APR) and Fault Localization (FL). The authors stated characteristics they consider necessary for artifacts to be used in studies that evaluate the state of the art in APR and FL. Additionally, the authors presented a high-level classification of failures, and discussed the cost of using BugSwarm artifacts. In the rest of this paper we refer to [5] as CriticalReview.

One of the purposes of datasets is to facilitate the evaluation of software tools. Instead, CriticalReview uses the general requirements and current limitations of state-of-the-art APR tools to evaluate the BugSwarm dataset. While it is important that datasets possess key characteristics (e.g., failures that are relevant to the tools under evaluation), the existence of artifacts that do not have desired characteristics does not hinder a study if users can navigate and select the artifacts relevant to it. Limiting a dataset to only include problems that certain tools can handle would be of no benefit to our community. Furthermore, the goal of the BugSwarm dataset is to identify the kinds of problems found in real software and the environment in which these problems occur, and thus inspire the community to advance the state of the art.

In addition to general misconceptions about datasets, CriticalReview discredits the use of the BugSwarm dataset based on multiple incorrect observations. Specifically, CriticalReview makes a false allegation about the data reported in the BugSwarm paper [12], and presents incorrect results and conclusions that stem from misunderstandings of Travis-CI terminology and Docker's architecture.

This paper discusses each of CriticalReview’s incorrect claims, which had already been communicated to the authors of CriticalReview upon their request for feedback prior to the archival of their study. We also discuss the two other contributions of CriticalReview: a GitHub repository to store the code and build logs of the BugSwarm artifacts, and CriticalReview’s own website to browse BugSwarm artifacts, both of which duplicate information already available in BugSwarm.

The rest of this paper presents a brief overview of BugSwarm in Section 2, and describes the methodology used by CriticalReview in Section 3. We discuss the incorrect findings reported by CriticalReview in Section 4, and the rest of the contributions of CriticalReview in Section 5. Finally, we clarify some misconceptions about BugSwarm, and re-affirm its goals and intended use in Section 6.

2 Overview of BugSwarm

Figure 1: Workflow for the BugSwarm toolkit

BugSwarm is comprised of three main components: (1) an infrastructure (https://github.com/BugSwarm, in the process of being open sourced) to automatically mine and reproduce failures and fixes from open-source projects that use continuous integration (Travis-CI), (2) a continuously growing dataset of real-world failures and fixes packaged in publicly available Docker images (https://hub.docker.com/r/bugswarm/images/tags) to facilitate reproducibility, and (3) a website (http://www.bugswarm.org/dataset/) and a REST API (https://github.com/BugSwarm/common) for dataset users to navigate and select artifacts based on a number of characteristics.

2.1 BugSwarm Infrastructure

BugSwarm’s methodology to create a continuously growing dataset of real-world failures and fixes is shown in Fig. 1. We briefly describe each component below. For more details please refer to the BugSwarm paper [12].

PairMiner.

PairMiner represents the first stage of the process. The role of PairMiner is to mine fail-pass job pairs from the Travis-CI build history of open-source projects hosted on GitHub. A project's build history refers to all Travis-CI builds previously triggered. A build may include many jobs; for example, a build for a Python project might include separate jobs to test with Python versions 2.6, 2.7, 3.0, etc. The input to PairMiner is the repository slug (e.g., google/auto) of the project of interest. PairMiner analyzes the project's build history to identify fail-pass build pairs, where a build fails and the next consecutive build passes. From these fail-pass build pairs, PairMiner extracts fail-pass job pairs. The output of PairMiner is the set of fail-pass job pairs found for the given project.
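
To illustrate the mining step, here is a minimal sketch (not the actual PairMiner implementation) of how fail-pass job pairs could be extracted from an ordered build history; the Job and Build types and the matching of jobs by configuration are simplifying assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Job:
    job_id: int
    config: str   # e.g., "python=3.6"
    state: str    # "passed" or "failed"

@dataclass
class Build:
    build_id: int
    state: str    # overall build state reported by Travis-CI
    jobs: List[Job]

def mine_fail_pass_pairs(history: List[Build]) -> List[Tuple[Job, Job]]:
    """Scan a chronologically ordered build history for a failing build
    immediately followed by a passing build, then pair up jobs that ran
    under the same configuration (failed job -> passed job)."""
    pairs = []
    for prev, curr in zip(history, history[1:]):
        if prev.state == "failed" and curr.state == "passed":
            passed_by_config = {j.config: j for j in curr.jobs if j.state == "passed"}
            for job in prev.jobs:
                if job.state == "failed" and job.config in passed_by_config:
                    pairs.append((job, passed_by_config[job.config]))
    return pairs
```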

PairFilter.

PairFilter takes as input the Travis-CI fail-pass job pairs from PairMiner and ensures that the data essential for reproduction is available: (1) the state of the project at the time the job was executed, and (2) the environment in which the job was executed. If these essentials are not available, PairFilter discards the fail-pass job pair. PairFilter also determines the Docker image that provided the exact build environment for the fail-pass job pair and the specific commits that triggered each job. The output of PairFilter is the subset of fail-pass job pairs for which (1) and (2) are available.

Reproducer.

The goal of Reproducer is to reproduce each job in the fail-pass job pair in the same build environment in which it was originally run. The input to Reproducer is a fail-pass job pair, the commits for each version, and the Docker image for the build environment. Reproducer conducts the following steps: (1) generates a job script, i.e., a shell script to build the project and run regression tests, (2) matches the build environment in which the job originally ran via the Docker image identified by PairFilter, (3) reverts the project to the specific version, and (4) runs the code for the job in the Docker image via the job script. Reproducer can be run in parallel via multiple processes, processing job pairs concurrently, as shown in Fig. 1. The output of Reproducer is a build log, which is a transcript of everything that occurs at the command line during the build and testing process.
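
As a rough sketch of the replay step, a generated job script could be run inside the original build environment as shown below; the function, image tag, and script path are hypothetical and do not reflect the exact commands used by Reproducer.

```python
import os
import subprocess

def reproduce_job(image_tag: str, job_script: str, log_path: str) -> int:
    """Run a generated job script inside the original Travis-CI build
    environment (a Docker image) and capture the build log."""
    script = os.path.abspath(job_script)
    with open(log_path, "w") as log:
        result = subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{script}:/usr/local/bin/run_job.sh:ro",
             image_tag, "bash", "/usr/local/bin/run_job.sh"],
            stdout=log, stderr=subprocess.STDOUT)
    return result.returncode

# Example with a hypothetical artifact tag:
# reproduce_job("bugswarm/images:some-artifact-tag", "failed.sh", "reproduced-failed.log")
```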

Analyzer.

The Analyzer parses the original (historical) and reproduced build logs, extracts key attributes, and compares the extracted attributes to ensure they match. The key attributes that are parsed are the status of the build (passed, failed, or errored), and the result of the test suite (number of tests ran, number tests failed, and names of failed tests). If the results match between the original and reproduced build logs, then metadata about the pair will be added to the BugSwarm database.
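
The following sketch illustrates the idea of extracting and comparing log attributes; the regular expressions are illustrative Maven/JUnit-style patterns, not the Analyzer's actual parsers.

```python
import re
from typing import NamedTuple, Optional

class LogAttributes(NamedTuple):
    status: Optional[str]        # "passed", "failed", or "errored"
    tests_run: Optional[int]
    tests_failed: Optional[int]

def parse_maven_log(log_text: str) -> LogAttributes:
    """Extract key attributes from a Maven/JUnit-style build log."""
    m = re.search(r"Tests run: (\d+), Failures: (\d+)", log_text)
    tests_run = int(m.group(1)) if m else None
    tests_failed = int(m.group(2)) if m else None
    if "BUILD FAILURE" in log_text:
        status = "failed"
    elif "BUILD SUCCESS" in log_text:
        status = "passed"
    else:
        status = "errored"
    return LogAttributes(status, tests_run, tests_failed)

def attributes_match(original: LogAttributes, reproduced: LogAttributes) -> bool:
    """A pair is accepted only if the reproduced log matches the original."""
    return original == reproduced
```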

Artifact Creation.

The Reproducer and Analyzer are run five times. If a fail-pass job pair is reproducible all five times, we mark it as “reproducible”. If the pair is reproducible fewer than five times but more than zero, it is marked as “flaky”. A pair can be flaky for a variety of reasons, but primarily because of test flakiness, which can be caused by non-deterministic tests due to concurrency or environmental changes. Lastly, if a pair is reproducible zero times, it is marked as “unreproducible”. A reproducible or flaky job pair is referred to as a BugSwarm artifact.
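
The labeling rule can be summarized as follows; the field names mirror the reproduce_successes and reproduce_attempts attributes exposed by the REST API, but the function itself is only illustrative.

```python
def classify_pair(reproduce_successes: int, reproduce_attempts: int = 5) -> str:
    """Label a fail-pass job pair after the five reproduction attempts."""
    if reproduce_successes == reproduce_attempts:
        return "reproducible"
    if reproduce_successes > 0:
        return "flaky"
    return "unreproducible"
```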

For each BugSwarm artifact, a Docker image is created that contains both versions of the code and the job scripts to build and test each version. This Docker image is then stored on our DockerHub repository (https://hub.docker.com/r/bugswarm/images). We chose to package each BugSwarm artifact in a Docker image because Docker facilitates reproducibility. Docker is also a good choice because it is lightweight and uses layering: Docker images are composed of multiple layers, which can be shared across images to save space, and Docker does not re-download or store a layer that is already on a system [4].

2.2 BugSwarm Dataset

The BugSwarm dataset is the first continuously growing dataset of reproducible real-world failures and fixes. The dataset was automatically created using the BugSwarm infrastructure without controlling for any specific attributes. Currently, the BugSwarm dataset (version 1.1.0) consists of 3,140 artifacts written in Java and Python. The artifacts are diverse: they use different build systems (Maven, Gradle, and Ant), span builds from 2015 to 2019, and use different testing frameworks such as JUnit and unittest. We expect steady growth of the dataset in the coming months as the BugSwarm infrastructure is set to run on dedicated servers.

2.3 BugSwarm Website and REST API

BugSwarm offers many characteristics by which to filter artifacts and create a subset that is useful in the evaluation of a given tool. Examples of such characteristics are: language, size of diff, build system, number of tests run, number of failed tests, patch location (e.g., source code, test code, or build files), exceptions thrown at runtime (e.g., NullPointerException), etc. The BugSwarm website and REST API allow the selection of artifacts based on the above attributes.
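
For example, the REST API could be queried from Python roughly as follows. The field names are taken from the API requests discussed in Section 4.1; the authentication header and the shape of the response are assumptions, so treat this as a sketch rather than the API's documented usage.

```python
import json
import requests

API = "http://www.api.bugswarm.org/v1/artifacts/"

def query_artifacts(where: dict, token: str = None):
    """Query the BugSwarm REST API with a MongoDB-style filter.
    The Authorization header format and response envelope are assumptions."""
    headers = {"Authorization": f"token {token}"} if token else {}
    resp = requests.get(API, params={"where": json.dumps(where)}, headers=headers)
    resp.raise_for_status()
    return resp.json()

# Example: Java artifacts reproduced in all five attempts.
java_reproducible = query_artifacts(
    {"reproduce_successes": {"$gt": 4}, "lang": "Java"})
```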

3 Methodology by CriticalReview [5]

The goal of CriticalReview’s study is to answer the following questions:

What are the main characteristics of BugSwarm’s pairs of builds regarding the requirements for APR and FL?

What is the execution and storage cost of BugSwarm?

Which pairs of builds meet the requirements of APR and FL?

Characteristics of BugSwarm’s Pairs of Builds.

CriticalReview characterizes the BugSwarm dataset with respect to the requirements of current APR and FL tools: (1) behavioral bugs, (2) a test suite with passing tests defining correct behavior and failing tests defining incorrect behavior, (3) a known execution setup (e.g., paths to source and test files), (4) uniqueness of bugs, and (5) availability of the human patch. The above requires, for each artifact, the source code for the buggy and fixed versions, the diff between the two versions, and the Travis-CI build log for the failing job.

CriticalReview queries for fully reproducible Java and Python artifacts (see Section 4.1 for further details) using the BugSwarm REST API. The resulting artifacts are then filtered for unique commits (note that multiple Travis-CI jobs may originate from a single Travis-CI build).

The diff of each artifact is calculated by retrieving the buggy and fixed versions of the artifact from its corresponding Docker image, pushing the code into a branch of a new GitHub repository, and then invoking the GitHub API to retrieve the diff between the two code versions. Unique diffs are identified based on md5 hash values, and artifacts are classified based on whether the extensions of the changed files are .java or .py. Lastly, a high-level classification of the reason for failure is conducted by using regular expressions to match certain patterns (test failures, style checkers, compilation errors, etc.) in the Travis-CI build logs.
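
The following sketch approximates this methodology: the hash-based deduplication mirrors CriticalReview's use of md5, while the failure patterns below are invented placeholders rather than CriticalReview's actual regular expressions.

```python
import hashlib
import re

def diff_md5(diff_text: str) -> str:
    """Identify duplicate diffs by hashing their content."""
    return hashlib.md5(diff_text.encode("utf-8")).hexdigest()

# Illustrative high-level failure classification over a failing build log.
FAILURE_PATTERNS = {
    "test failure":      re.compile(r"Tests run: \d+, Failures: [1-9]"),
    "compilation error": re.compile(r"COMPILATION ERROR|SyntaxError"),
    "style checker":     re.compile(r"checkstyle|flake8"),
}

def classify_failure(log_text: str) -> str:
    for label, pattern in FAILURE_PATTERNS.items():
        if pattern.search(log_text):
            return label
    return "other"
```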

Execution and Storage of BugSwarm.

CriticalReview estimates the size of the BugSwarm dataset for download and storage, as well as its usage cost. The size of the dataset is calculated using two metrics: counting every Docker layer, and counting every unique Docker layer. Note that Docker does not download or store a layer that is already on the system (see [4] and Section 4.2). CriticalReview gives a time estimate for download assuming a stable 80 Mbit/s connection. Finally, the cost of using the full dataset is estimated assuming a 20-minute experiment per artifact on Amazon cloud instances.

Pairs for APR and FL.

CriticalReview lists what the paper considers the requirements to use state-of-the-art APR and FL tools: (1) artifacts that have been reproduced five times, (2) artifacts whose Docker images are available, (3) a non-empty diff, (4) a unique commit, (5) a unique diff, (6) a test case failure, and (7) only source files changed. CriticalReview then reports the number of BugSwarm artifacts that satisfy those requirements.

4 Incorrect Claims by CriticalReview [5]

After replicating the study presented by CriticalReview and inspecting its scripts, we identified incorrect claims made by CriticalReview related to inconsistencies in the number of artifacts reported in the BugSwarm paper [12], a misleading duplication of commits in the dataset, and calculations of the storage required by the dataset. Below we discuss each incorrect claim, organized per research question as presented in [5].

4.1 RQ1: Characteristics of BugSwarm’s Pairs of Builds

Incorrect Number of Artifacts.

CriticalReview reports the number of “builds” reproduced five times given the BugSwarm API request listed in [5, Section III-B]: http://www.api.bugswarm.org/v1/artifacts/?where={"reproduce_successes": {"$gt":4,"lang":{"$in":["Java","Python"]}}}. The API request returns 2,949 artifacts, while the BugSwarm paper [12] reports 3,091 artifacts. Thus, CriticalReview reports a contradiction by the BugSwarm authors, who, according to CriticalReview, had stated that each “build” in the dataset was successfully reproduced five times.

CriticalReview states in [5, Section III-C]:

Indeed, we considered all pairs of builds that are reproduced successfully five times like it is described in BugSwarm’s paper (see Section 4-B). Surprisingly, BugSwarm authors did not consider their criteria in their final selection of the pairs of builds and consequently the reported number is in contradiction with the paper.

BugSwarm original paper states in [12, Section IV-B]:

We repeated the reproduction process 5 times for each pair to determine its stability. If the pair is reproducible all 5 times, then it is marked as ’reproducible’. If the pair is reproduced only sometimes, then it is marked as ’flaky’. Otherwise, the pair is said to be ’unreproducible’.

First, as discussed in the BugSwarm paper [12, Section III-C] and in Section 2 of this paper, BugSwarm is comprised of artifacts (Travis-CI job pairs), thus a request from the BugSwarm API will return the number of artifacts, not the number of builds.

Second, the BugSwarm API request used by CriticalReview returns the number of artifacts successfully reproduced five times. In other words, the query returns the number of fully reproducible artifacts. However, the BugSwarm dataset [12, Table III] includes both fully reproducible and flaky artifacts, which together account for a total of 3,091 artifacts. The correct BugSwarm REST API request (http://www.api.bugswarm.org/v1/artifacts/?where={"reproduce_successes":{"$gt":0},"reproduce_attempts":5,"lang":{"$in":["Java","Python"]}}) needs to filter based on a number of reproduce successes greater than zero and a number of reproduce attempts equal to five. All 3,091 artifacts included in the dataset were attempted five times.
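
Written as Python dictionaries (to be JSON-encoded into the where parameter of the requests above), the difference between the two filters is the following; the field names come from the quoted API requests, and the counts refer to dataset version 1.0.1.

```python
# CriticalReview's filter: only the artifacts reproduced in all five
# attempts (2,949 at the time of dataset version 1.0.1).
fully_reproducible = {
    "reproduce_successes": {"$gt": 4},
    "lang": {"$in": ["Java", "Python"]},
}

# Filter matching the dataset reported in [12]: every artifact that was
# attempted five times and reproduced at least once, i.e., both
# "reproducible" and "flaky" artifacts (3,091 in total).
reproducible_and_flaky = {
    "reproduce_successes": {"$gt": 0},
    "reproduce_attempts": 5,
    "lang": {"$in": ["Java", "Python"]},
}
```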

At the time CriticalReview was written (BugSwarm dataset 1.0.1 from May 2019; see http://www.bugswarm.org/releases/), the number of fully reproducible artifacts was indeed 2,949 and the number of flaky artifacts was 142. There is no contradiction with the selection criteria described in the BugSwarm paper: both reproducible and flaky artifacts are included in the dataset.

Duplicate Failing Commits.

CriticalReview reports a “new” finding regarding a high number of duplicate failing commits in the BugSwarm dataset that would introduce misleading results.

CriticalReview states in [5, Section II-C]:

Our second observation is that 40.08% ((2,949-1,767)/2,949) of the builds have a duplicate failing commit. It means that those 40.08% should not be considered by the approaches that only consider the source code of the application otherwise it introduces misleading results.

BugSwarm paper states in [12, Section IV-B]:

Recall from Section III-C that PairMiner mines job pairs. The corresponding number of reproducible unique build pairs is 1,837. The rest of the paper describes the results in terms of number of job pairs.

As stated in the BugSwarm paper [12, Section III-C] and in Section 2 of this paper, a BugSwarm artifact corresponds to a pair of jobs, not a pair of builds (as incorrectly interpreted throughout CriticalReview). A Travis-CI build can be composed of multiple jobs that test the same commit under different configurations. Early feedback from researchers in our community indicated that such artifacts can also be of interest to researchers.

As also described in the BugSwarm paper [12, Section III-B], a given experiment may require artifacts that meet specific criteria. If such criteria require uniqueness of job pairs, as CriticalReview reports is the case for APR tools, then the BugSwarm REST API and website allow users to take uniqueness into account when selecting artifacts of interest. Thus, having the dataset include multiple jobs from a build does not represent a problem that would introduce misleading results (the difference between 1,767 and 1,837 is again due to CriticalReview omitting flaky artifacts).

4.2 RQ2: BugSwarm Execution and Storage Cost

Metric                                   Java        Python      All
Docker layer size (GB)                   5,107       3,813       8,921
Unique Docker layer size (GB)            1,327       919         2,246
Avg. artifact size (GB)                  3.01        3.05        3.03
Download all layers (80 Mbit/s)          6d 7.8h     4d 17.13h   11d 1.16h
Download unique layers (80 Mbit/s)       1d 15.4h    1d 3.3h     2d 18.8h

Table 1: Metrics of BugSwarm download and storage cost, from [5].

CriticalReview calculates the size of the BugSwarm dataset and provides an estimated download time and cost for using the full dataset on Amazon Web Services instances [5, Section 3-D]. The paper reports that the full dataset is 8,921 GB, which takes about 11 days and 1.16 hours to download over an 80 Mbit/s internet connection. Subsequently, the cost of using the BugSwarm dataset, assuming a 20-minute experiment per artifact, is $711.30 USD.

Download Size Calculation.

CriticalReview calculates the size of the BugSwarm dataset using two metrics: counting every Docker layer, and counting every unique Docker layer. The size of the dataset is reported (see Table 1 from CriticalReview) as 8,921 GB and 2,246 GB, respectively. However, counting every Docker layer is incorrect, because layers shared across images are counted once per image even though Docker does not re-download or store a layer that is already on a system [4]. The average size (row 3) and download time (row 4) given in Table 1 are calculated based on all Docker layers (row 1), thus these table entries are also incorrect.
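
The effect of layer sharing can be checked directly on any set of locally downloaded images, for example with a sketch like the one below, which compares the total number of layer references against the number of distinct layers Docker actually stores (the image tags in the example are hypothetical).

```python
import json
import subprocess

def layer_digests(image: str) -> list:
    """Return the layer digests of a local Docker image (RootFS.Layers
    as reported by `docker image inspect`)."""
    out = subprocess.check_output(
        ["docker", "image", "inspect", image,
         "--format", "{{json .RootFS.Layers}}"])
    return json.loads(out)

def layer_sharing(images: list) -> tuple:
    """Compare the total layer count (every image counted separately)
    with the number of distinct layers actually stored on disk."""
    all_layers = [d for img in images for d in layer_digests(img)]
    return len(all_layers), len(set(all_layers))

# Example with hypothetical artifact tags:
# total, unique = layer_sharing(["bugswarm/images:tag-a", "bugswarm/images:tag-b"])
```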

Compression Ratio.

CriticalReview estimates a compression ratio that is then used to incorrectly calculate disk space. A compression ratio is unnecessary in the first place: disk space is determined by the size of the unique Docker layers, already given in row 2 of Table 1.

CriticalReview states in [5, Section III-D]:

According to our observations, the ratio between download size and disk storage is 2.48x and drops to 0.41x when considering the duplicate layers. […] Based on this observation, we estimate the total disk space required to 3,680.45 GB.

CriticalReview fails to mention that the above observations are based on 464 artifacts [7], not the full dataset. The script [8] used to calculate disk space lists 598.98 GB of storage used by the 464 artifacts. When we downloaded the same 464 artifacts, the disk space reported by the command docker system df is 353 GB, not 598.98 GB.

The compression ratio is then calculated by dividing the disk space by the size of the 464 artifacts when counting all Docker layers: 598.98 GB / 1,452.02 GB = 0.41. However, when using this compression ratio, the estimated disk space reported for the full dataset is 3,680.45 GB, which is 63% higher than the actual size of 2,246 GB given in row 2 of Table 1.
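
A quick back-of-envelope check, using only the figures quoted above, shows how the estimate was obtained and why it overshoots the space actually needed.

```python
# Back-of-envelope check of the estimate discussed above.
disk_464   = 598.98    # GB on disk for the 464 downloaded artifacts, per [8]
layers_464 = 1452.02   # GB when counting every Docker layer of those artifacts
all_layers = 8921      # GB, every layer of the full dataset (Table 1, row 1)

ratio = disk_464 / layers_464   # ~0.41, CriticalReview's compression ratio
estimate = all_layers * ratio   # ~3,680 GB, the disk-space estimate quoted above

# The unique-layer size of the full dataset (Table 1, row 2) is 2,246 GB,
# so the estimate overshoots the space actually needed by well over 1,400 GB.
print(f"ratio = {ratio:.2f}, estimated disk space = {estimate:,.0f} GB")
```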

Cost Calculation.

Because the cost of using the BugSwarm dataset is based on incorrect estimated download and storage sizes, the cost calculations are also incorrect. Additionally, as mentioned earlier and corroborated by CriticalReview, we expect that BugSwarm users will be interested in subsets of the dataset, as opposed to the full dataset. This must be taken into account when making such cost calculations.

5 Other Contributions & Findings by CriticalReview [5]

In addition to answering the questions described in Section 3, CriticalReview also provides a GitHub repository for the BugSwarm artifacts and a website to navigate and select artifacts. This section discusses these contributions as well as a finding regarding duplicate diffs.

GitHub Repository.

One of the contributions listed by CriticalReview is a new GitHub repository (https://github.com/TQRG/BugSwarm) to store BugSwarm artifacts. Specifically, there is a branch for each artifact that contains the buggy version of the code, the fixed version of the code, the diff between both versions, and the failing and passing Travis-CI build logs. The only artifact information not stored in the repository is the job scripts used to build the code and run regression tests.

However, the CriticalReview repository is not necessary. The buggy and fixed versions of the code can be accessed directly via the original repositories (the BugSwarm REST API and website provide the commit information) or by downloading the BugSwarm Docker image for the artifact, which includes a copy of both versions of the code. The Travis-CI build logs can be accessed directly via the Travis-CI website using the information provided by the BugSwarm REST API, or by following the BugSwarm website links. Finally, the diff can be retrieved directly using the GitHub API (3-dot diff), or accessed via the BugSwarm website (2-dot diff).

Website to Browse and Select BugSwarm Artifacts.

Another contribution listed by CriticalReview is a website (https://tqrg.github.io/BugSwarm) to browse and select BugSwarm artifacts. The website displays the number of added/removed/modified lines and files, and allows selecting artifacts based on unique commits, unique diffs, non-empty diffs, the presence of failing tests, changes to source code, a manual categorization of bug/non-bug patches, and a high-level categorization of failures.

BugSwarm already provides its own website (http://www.bugswarm.org/dataset/) for browsing and selection based on the same attributes listed by CriticalReview (except for their two categorizations, which are complementary to our own). The BugSwarm website also allows selecting artifacts based on the location of the fix (source files, configuration files, or test files). In addition to the website, BugSwarm provides a REST API to query the BugSwarm database directly, so one is not restricted to the options provided on the website. BugSwarm also provides a classification of artifacts based on runtime exceptions.

Duplicate Diffs.

CriticalReview reports that, even when controlling for unique failing commits, there are duplicate diffs among the artifacts: 198 out of 1,767 artifacts have a duplicate diff [5, Section III-C].

Recently, we discovered that Travis-CI can create a “double build” when a build corresponds to a pull request (PR); see https://docs.travis-ci.com/user/pull-requests/#double-builds-on-pull-requests. Travis-CI creates a build for the PR branch, and another build for the PR branch merged with the base branch. If no changes have been made to the base branch since the PR branch was created, then the diffs between both builds will be the same. This explains CriticalReview's observation. Fortunately, we believe it is feasible to automatically detect these cases, and such detection will be incorporated into the BugSwarm infrastructure to avoid them in future versions of the dataset.
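
Detecting such cases is straightforward in principle: for instance, grouping artifacts by a hash of their diff flags candidate double builds, as in the sketch below (the artifact-to-diff mapping is hypothetical).

```python
import hashlib
from collections import defaultdict

def find_duplicate_diffs(artifacts: dict) -> list:
    """Group artifacts by the hash of their diff; groups with more than one
    member are candidates for the Travis-CI PR "double build" case
    described above. `artifacts` maps an artifact tag to its diff text."""
    groups = defaultdict(list)
    for tag, diff_text in artifacts.items():
        digest = hashlib.sha1(diff_text.encode("utf-8")).hexdigest()
        groups[digest].append(tag)
    return [tags for tags in groups.values() if len(tags) > 1]
```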

6 Discussion on the Role of BugSwarm

We would like to conclude this paper by briefly clarifying a few misconceptions about BugSwarm, and by discussing our vision for the BugSwarm infrastructure and dataset.

(1) BugSwarm is more than a dataset. As described in Section 2, BugSwarm is comprised of an infrastructure to automatically create a large-scale dataset of real-world failures and fixes, a continuously growing dataset, and a REST API and website to navigate and select artifacts from the dataset based on characteristics of interest.

(2) The BugSwarm dataset is not static. One of the main contributions of the BugSwarm infrastructure is that its full automation has enabled the creation of a continuously growing dataset. As discussed in the BugSwarm paper [12], the potential for size and diversity opens new opportunities, but it also presents several challenges. Some of these challenges include data versioning (discussed in CriticalReview), and automated bug classification to increase the usefulness of the dataset.

(3) The BugSwarm dataset is not meant for a single target application. Because of the size and diversity of the BugSwarm dataset, it is unrealistic to believe that all artifacts will be relevant to one application. As a result, BugSwarm facilitates navigating and selecting artifacts based on a set of characteristics via the BugSwarm website or REST API. Thus, it is easy to select artifacts for a given application (e.g., APR or FL) beforehand.

(4) BugSwarm artifacts with specific characteristics can be “grown”. The initial BugSwarm dataset was created without controlling for any particular attribute, such as diff size, patch location, or reason for failure. However, since the publication of the BugSwarm paper [12], targeted mining is now available, and thus it is possible to grow the dataset in specific directions. We believe that allowing for diverse characteristics does not hinder the evaluation of the state of the art. On the contrary, we hope that the existence of artifacts that the state of the art may not be able to handle today will further push advancement.

BugSwarm is a project under active development, and its infrastructure is in the process of being open sourced. We welcome feedback from the community. The BugSwarm dataset is publicly available on DockerHub. The website is also publicly available, and the REST API is available to anyone who requests a token to access the BugSwarm database.

Acknowledgments

We thank Bohan Xiao and Octavio Corona for their help in gathering some of the data discussed in this paper. This work was supported by NSF grant CNS-1629976 and a Microsoft Azure Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or Microsoft.

References

  • [1] C. Cifuentes, C. Hoermann, N. Keynes, L. Li, S. Long, E. Mealy, M. Mounteney, and B. Scholz (2009) BegBunch: Benchmarking for C Bug Detection Tools. In DEFECTS '09: Proceedings of the 2nd International Workshop on Defects in Large Software Systems, pp. 16–20.
  • [2] V. Dallmeier and T. Zimmermann (2007) Extraction of Bug Localization Benchmarks from History. In 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2007), pp. 433–436.
  • [3] H. Do, S. Elbaum, and G. Rothermel (2005) Supporting Controlled Experimentation with Testing Techniques: An Infrastructure and Its Potential Impact. Empirical Software Engineering 10 (4), pp. 405–435.
  • [4] Docker Storage Driver. https://docs.docker.com/storage/storagedriver/ (accessed 2019).
  • [5] T. Durieux and R. Abreu (2019) Critical Review of BugSwarm for Fault Localization and Program Repair. CoRR abs/1905.09375.
  • [6] M. Hutchins, H. Foster, T. Goradia, and T. J. Ostrand (1994) Experiments of the Effectiveness of Dataflow- and Controlflow-Based Test Adequacy Criteria. In Proceedings of the 16th International Conference on Software Engineering, pp. 191–200.
  • [7] Images downloaded by CriticalReview. https://github.com/TQRG/BugSwarm/blob/master/docs/downloaded_images.json (accessed 2019).
  • [8] Compression-rate script used by CriticalReview. https://github.com/TQRG/BugSwarm/blob/master/script/compression_rate.py (accessed 2019).
  • [9] R. Just, D. Jalali, and M. D. Ernst (2014) Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA 2014), pp. 437–440.
  • [10] C. Le Goues, N. Holtschulte, E. K. Smith, Y. Brun, P. T. Devanbu, S. Forrest, and W. Weimer (2015) The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs. IEEE Transactions on Software Engineering 41 (12), pp. 1236–1256.
  • [11] S. Lu, Z. Li, F. Qin, L. Tan, P. Zhou, and Y. Zhou (2005) BugBench: Benchmarks for Evaluating Bug Detection Tools. In Workshop on the Evaluation of Software Defect Detection Tools.
  • [12] D. A. Tomassi, N. Dmeiri, Y. Wang, A. Bhowmick, Y. Liu, P. T. Devanbu, B. Vasilescu, and C. Rubio-González (2019) BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes. In Proceedings of the 41st International Conference on Software Engineering (ICSE), pp. 339–349.