How Often Do Single-Statement Bugs Occur? The ManySStuBs4J Dataset
Program repair is an important but difficult software engineering problem. One way to achieve a "sweet spot" of low false positive rates, while maintaining high enough recall to be usable, is to focus on repairing classes of simple bugs, such as bugs with single-statement fixes or bugs that match a small set of bug templates (Long and Rinard, 2016; Pradel and Sen, 2018). However, it is very difficult to estimate the recall of repair techniques based on templates or on repairing simple bugs, as there are no datasets about how often the associated bugs occur in code. To fill this gap, we provide two versions of the dataset, containing 24412 and 153751 single-statement bug-fix changes mined from 100 popular open-source Java Maven projects and from 1000 popular open-source Java projects respectively, annotated by whether they match any of a set of 16 bug templates inspired by state-of-the-art program repair techniques. We also maintain a repository of Maven dependencies for the 100-project dataset to support tools that require building the projects. We hope that this dataset will prove a resource both for future work in automatic program repair and for future studies in empirical software engineering. In an initial analysis, we find that for both datasets about 33% of the bug fixes match the templates, indicating that a remarkable number of single-statement bugs can be repaired with a relatively small set of templates. Further, we find that SStuBs appear with a frequency of about one bug per 1600-2500 lines of code (as measured by the size of the project's latest version), allowing researchers to make an informed case about the potential impact of improved program repair methods.
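The two summary statistics in the abstract (the share of single-statement fixes that match a template, and the number of lines of code per SStuB) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Project` record, its field names, and the sample numbers are hypothetical and are not taken from the dataset or its schema, and the sketch assumes that an SStuB is counted as a template-matching fix.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Project:
    """Hypothetical per-project summary; not the dataset's actual schema."""
    name: str
    lines_of_code: int           # size of the project's latest version
    single_statement_fixes: int  # all single-statement bug-fix changes
    template_matches: int        # fixes matching at least one bug template


def template_match_share(projects: List[Project]) -> float:
    """Fraction of single-statement fixes that match some template."""
    total = sum(p.single_statement_fixes for p in projects)
    matched = sum(p.template_matches for p in projects)
    return matched / total if total else 0.0


def lines_per_sstub(projects: List[Project]) -> float:
    """Lines of code per template-matching fix, pooled over all projects."""
    loc = sum(p.lines_of_code for p in projects)
    matched = sum(p.template_matches for p in projects)
    return loc / matched if matched else float("inf")


# Made-up numbers, chosen only so the arithmetic is easy to follow.
projects = [
    Project("project-a", lines_of_code=120_000, single_statement_fixes=180, template_matches=60),
    Project("project-b", lines_of_code=80_000, single_statement_fixes=120, template_matches=40),
]

share = template_match_share(projects)
density = lines_per_sstub(projects)
print(f"template match share: {share:.1%}")
print(f"lines of code per SStuB: {density:.0f}")
```

Pooling the counts across projects before dividing, rather than averaging per-project ratios, keeps large projects from being underweighted; whether the paper's own analysis pools in this way is not stated in the abstract.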