Software cannot be seen or touched, but it has a physical existence. With software embedded in many devices today, software failures have caused not only inconvenience but also tragedies, such as the deaths of patients from massive overdoses caused by an avoidable error in a radiation therapy machine. A more recent case is Google’s self-driving cars (controlled by software), which experienced 272 failures in less than a year. These failures would have resulted in at least 13 crashes, potentially killing their human drivers, had the drivers not intervened. Software failures are also the cause of massive economic losses, costing the global economy $41 billion annually. Repairing software faults, however, is becoming an extremely difficult and expensive task, constituting up to 90% of software expenses, due to the increasing complexity and size of software systems. A modern car, for example, has 100 million lines of code, and this number is expected to increase to 200-300 million in the near future. Hence, the critical task of software repair must be automated.
Automated Program Repair (APR) has been identified as a grand challenge in software engineering research. Many APR methods have shown promising results in fixing bugs with minimal or even no human intervention [20, 18, 26, 45]. Despite the many studies introducing APR techniques, much remains to be learned about what makes a particular technique work well (or not) for a specific software system. The effectiveness of APR techniques is likely to be problem dependent, which calls for an analysis of the software characteristics that impact their effectiveness in order to help practitioners select the most appropriate technique for their software system.
Research introducing new APR techniques, and experimental studies investigating the performance of different techniques, are usually based on a carefully selected set of software systems. These works offer little insight into the characteristics of the software systems and how they impact the effectiveness of APR techniques. In addition, the overwhelming majority of published work in APR only describes the benefits of the newly introduced technique and the innovation carried out during development, while just a few mention the limitations or present negative results. There has been discussion of certain limitations of APR techniques, such as patch over-fitting, i.e., a patch generated by a tool that is valid according to the correctness oracle but is still incorrect, and that potentially introduces new bugs that cannot be captured by the correctness oracle. On the other hand, negative results, in terms of why some bugs cannot be repaired, have not been investigated in the literature so far.
In addition, results claiming the superior performance of an APR technique over other techniques on a selected set of software systems may not generalise to untested systems. It is likely that there are software systems where an APR technique excels because it exploits some particular characteristics of the software. Thus, an understanding of the conditions under which an APR technique can be expected to succeed or fail is essential; however, this is rarely included in published studies. This paper addresses this research gap by answering the following research questions:
RQ1 What software features have an impact on the effectiveness of APR techniques? Existing research assessing the effectiveness of APR techniques investigates whether the bugs that they solve are hard or easy, or of high or low priority. This line of work relates to the overall performance of APR techniques, producing insights into how far we have come in addressing the big challenge of automatically fixing bugs. Instead, in this paper, we aim to find out whether particular features of a software system or bug make one technique more effective than others. We achieve this kind of insight by proposing a new method for analysing the effectiveness of APR techniques.
RQ2 How different are existing APR benchmark datasets? Most research in APR uses well-known datasets, such as Defects4J, which can result in techniques being tuned to solve particular problems effectively and, as a result, not generalising well to other problems. In this paper, we aim to show how different these datasets are in terms of the features that have an impact on the effectiveness of existing APR techniques. This allows us to understand whether existing benchmarks are sufficiently different for stress testing the effectiveness of APR techniques, and to identify similar benchmark datasets.
RQ3 How different are existing APR techniques? In this research question we focus on individual techniques, and show how different they are in their effectiveness in producing valid patches. We use the most significant features learned in RQ1 to visualise the footprints of the different APR techniques. A footprint of an APR technique is the area in the reduced instance space of buggy programs where the technique is able to generate a patch. The visualisation of the footprints allows us to assess how different the techniques are, and whether their effectiveness is affected by different features of the software. In addition, we present metrics about the size, purity and density of the footprints, which provide a quantitative way of assessing the differences.
RQ4 How can we select the most suitable APR technique? The final aim of this research is to develop guidelines that help users of APR techniques select the most appropriate technique for their software system. E-APR uses a Support Vector Machine (SVM) to build a machine learning model that selects the best APR technique from a portfolio of APR techniques based on software features. Our approach could be integrated into APR infrastructures such as RepairThemAll (which currently contains 11 repair tools) and the repair software bot Repairnator to maximise the efficiency and effectiveness of repairing bugs.
To answer these research questions, we introduce a new framework which characterises both the strengths and weaknesses of existing APR techniques, using software features extracted from the code and the control flow graph of open source software systems. The framework provides the means for a more objective assessment of existing APR techniques, and helps in understanding and explaining why certain bugs are hard for certain APR techniques. Finally, E-APR gives insights into how an APR technique can be selected to automatically build reliable software systems in a cost-effective way. We apply our framework to a large study of 2,141 bugs from 130 projects, and 23,551 repair attempts. Software repair is challenging for human programmers too. While there are bugs that can be trivially fixed, many of us can remember a bug that took hours, if not days or weeks, to be understood and fixed. The approach we devise will also reveal whether the software systems contain any of these challenging bugs.
2 Related Work on the Effectiveness of APR Techniques
Researchers working in the area of APR have acknowledged that evaluating the quality of patches produced by APR techniques is crucial [25, 35]. To this end, Qi et al. studied the test-suite adequate patches generated by GenProg for C programs, and classified them as plausible (passing all tests), correct and overfitting (plausible and incorrect). They found that most of the reported patches were overfitting.
Other works have studied the ability of APR techniques to repair buggy Java programs. For example, Martinez et al. manually studied the correctness of patches produced by three APR techniques over defects from the Defects4J benchmark. They found that only a small number of bugs (9/47) could be correctly repaired. Ye et al. studied the repairability of bugs from QuixBugs, a dataset of 40 small buggy programs (between 9 and 69 LOC). They found that 15 bugs could be repaired by Nopol and approaches from Astor, which generated in total 64 plausible patches. However, they found that 33 of them were incorrect.
The presence of overfitting patches has motivated researchers to investigate the number of overfitting patches, to detect overfitting patches (e.g., DiffTGen, PatchSim), and to avoid generating such patches (e.g., UnsatGuided, CapGen, Anti-pattern).
Our work extends existing research in analysing the effectiveness of APR techniques by examining what software features impact the repairability of a software system. We characterise a software system using code features (e.g., depth of inheritance tree and method cohesion) and determine the most significant features that have an impact on whether an APR technique can generate a patch.
There has also been some research in characterising patches generated by APR techniques to investigate how these patches differ from the ones generated by human programmers. Wang et al. compared 177 correct patches for Defects4J bugs generated by APR techniques with the patches written by developers. To characterise the bugs, the authors considered 6 metrics: a) patch size, b) number of chunks, c) number of modified files, d) number of modified methods, e) line coverage, and f) branch coverage. They found that automatically generated patches are, on average, syntactically different from the patches written by developers. Patches generated by APR techniques are usually longer, have a higher number of chunks, and have higher line and branch coverage.
Similarly, Smith et al. studied the quality of patches generated by two C program repair approaches (GenProg and TrpAutoRepair). The authors used two metrics that were dynamically computed (i.e., by running the program under repair): a) number of passing and failing test cases, and b) test suite coverage.
Both Wang et al. and Smith et al. focus on analysing the kind of patches generated by APR techniques. The aim of these works is to understand how good the patches are, and how they differ from developer-generated patches. Our work, instead, aims at understanding what kind of software systems and bugs APR techniques are able to repair. This will help explain how and why they work and, as a result, make it possible to select the right technique given a new buggy software system.
In their research, Smith et al.  state that “Automatic repair should be used in the appropriate contexts” and that “Our results suggest that more work is needed to fully understand and characterise test suite quality beyond coverage metrics alone”. The e-APR framework addresses these two research challenges by investigating 184 features, and building a machine learning model that enables the selection of the most suitable APR technique for a given buggy program.
Another related work is the one by Motwani et al., which investigates correlations between the effectiveness of APR techniques and different aspects of bugs, such as bug importance and bug complexity. Results were analysed at a coarse-grained level, with the findings showing weak to moderate correlation between bug importance and the ability of the APR technique to produce a patch. The results also show that APR techniques are effective in repairing easy bugs, as measured by the number of files and lines that have to be changed to fix the bug, while struggling with more complex bugs. This study makes an important step towards understanding where APR techniques work. In this paper, we take this research one step further by providing a more detailed analysis of the effectiveness of different APR techniques. The framework we propose allows us to examine the effectiveness of individual techniques in a visual and numerical way. We measure the footprints of the different APR techniques and whether their results overlap. This helps us understand the strengths and weaknesses of individual techniques, and their similarities, in a more fine-grained way.
3 The e-APR Framework
An overview of the e-APR framework is presented in Figure 1. It involves three main parts: i. APR Feature Learning, which learns significant features that reveal why certain APR problems are hard, ii. APR Footprints Visualisation, where we visualise the footprints of APR techniques and expose their strengths and weaknesses in repairing bugs, and iii. APR Technique Selection, which addresses the problem of accurately selecting the most suitable technique for APR.
Given a buggy software system and a portfolio of APR techniques (APRTs), the main goal of the e-APR framework is to help the software developer select the most effective APR technique. Effectiveness refers to the capability of an APR technique to generate a patch, and in our framework we propose a new objective measure of the overall effectiveness of an APR technique measured by its footprint. A footprint of an APR technique is where good performance can be expected in the software instance space. Understanding and reporting the boundary of the effectiveness of an APR technique is critical for selecting the most suitable technique. Our approach also helps to avoid trial and error application of APR techniques. It is impractical to determine this boundary through exhaustive application of APR techniques on the very large number of possible software instances. To make this process more efficient our approach defines the boundary by generalising APRT effectiveness on a selected diverse software instance space, projected onto a two-dimensional plane for ease of visualization.
This two-dimensional plane is defined by a coordinate system composed of the most significant features of a software system, which we develop and learn in response to RQ1. The most significant features will allow us to visualise the differences between different APRTs and determine how effective an APRT is in producing a patch for a software system with particular features. One of our big challenges is to plot two software instances in the same space in such a way that if the two instances are similar according to some feature (e.g., depth of inheritance tree), they are close in the test instance space, and if they are dissimilar then they are far apart in the instance space. Since we focus on arranging the software instances in a space where the effective APRTs are separated from the non-effective ones, it is natural to represent each software instance as a feature vector that considers a number of properties known to correlate with APRT effectiveness. The footprint of an APRT is then determined by the cluster of software instances where it could generate a patch.
The APRT footprint and the features that explain the effectiveness of the APRT will inform APRT selection, which is the subject of the final research question. E-APR uses the most significant features and the APRT footprints to determine which APRT to use for new buggy software.
3.1 APR Feature Learning
A critical step of e-APR is identifying features of software systems that have an impact on the effectiveness of APR techniques. This research task will provide an answer to the first research question. Features are problem dependent and must be chosen such that the varying complexities of the software instances are exposed, any known structural properties of the software systems are captured, and any known advantages and limitations of the different APR techniques are related to features.
For the purpose of this work, an APR technique is effective if it can generate a patch for a buggy software system, hence successfully repairing it. The generation of a valid patch can be affected by the structure of the particular software system, hence we extract software features that characterise software structure.
While much is known and reported on features that correlate with software quality, we must consider that there may be other unknown features that have an impact on the effectiveness of APR techniques. In addition, it is possible that not all known features are useful for our goal of creating a footprint space that separates the hard and easy software instances. The candidate set of features may contain redundancy, with features measuring aspects of a problem instance that are either similar or not relevant to expose the hardness of the APR task itself. Thus, a small set of relevant features must be selected.
Learning significant features has two steps: first we must define how we will measure the quality of a particular set of features, and once this measure has been established, we can apply an optimisation algorithm to select the set that maximises this measure. In e-APR, a subset of features is considered of high quality if they result in an instance space – as defined by the 2-dimensional projection of the subset of features – with programs that elicit similar performance of repair tools clustered together.
The best subset of features is the one that can best discriminate between easy and hard program instances for APR techniques.
A buggy program instance is considered hard for an APR technique if it cannot be repaired by that technique. E-APR aims at identifying software features that are able to create a clear separation in the program instance space, such that we can clearly see the different clusters of software systems where the techniques are effective. We refer to these clusters as APR footprints.
A common approach to locating significant variables is principal component analysis (PCA). PCA is a technique for extracting the orthogonal dimensions that explain relations between the variables in a dataset. This is achieved by learning linear combinations of the standardised independent variables, with the Principal Components (PCs) calculated in the following way. The first PC is the linear combination of the variables which explains the maximum amount of variance in the dataset. Each subsequent PC is orthogonal to all previously calculated PCs and captures the maximum variance under these conditions. In our work, the subset of variables that have large coefficients (i.e., loadings of the variables), and therefore contribute significantly to the variance of each PC, are identified as the significant features which are selected to explain bugs.
Given n software features, we can have at most n components, which are estimated in decreasing order of the variance (measured through the eigenvalue of each PC) they explain in the dataset. We analyse for each PC the features that are found significant. This shows which dimensions are the main drivers of APR technique effectiveness and helps explain why this is the case. In PCA, usually only the first few components are regarded as important. In our approach, we only retain the first 2 components, which makes visualising the footprints of the algorithms much easier.
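The PCA steps described above can be illustrated with a minimal, dependency-free sketch. It is not the implementation used in the study: the helper names (`standardise`, `covariance`, `dominant_eigenpair`, `pca_2d`) are hypothetical, and power iteration with deflation is simply one convenient way to obtain the two leading eigenvectors of the covariance matrix.

```python
import math
import random

def standardise(rows):
    """Scale every feature (column) to zero mean and unit variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) or 1.0
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(centred):
    """Sample covariance matrix of data that is already centred."""
    n, d = len(centred), len(centred[0])
    return [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
            for a in range(d)]

def dominant_eigenpair(matrix, iterations=1000, seed=0):
    """Power iteration: largest eigenvalue and its unit-length eigenvector."""
    rng = random.Random(seed)
    d = len(matrix)
    v = [rng.random() + 0.1 for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue.
    value = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d))
    return value, v

def pca_2d(rows):
    """Return 2-D projections, the two loading vectors, and explained variance ratios."""
    z = standardise(rows)
    cov = covariance(z)
    d = len(cov)
    total_variance = sum(cov[i][i] for i in range(d))
    components, explained = [], []
    for _ in range(2):
        value, vector = dominant_eigenpair(cov)
        components.append(vector)
        explained.append(value / total_variance)
        # Deflation: remove the component just found before searching for the next.
        cov = [[cov[i][j] - value * vector[i] * vector[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components] for r in z]
    return projected, components, explained
```

Each entry of `components[0]` and `components[1]` is the loading of one feature on that PC, so features with large absolute loadings are the candidates for "significant" features, and `sum(explained)` is the share of variance retained by the two-dimensional projection.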
We use a genetic algorithm to search the space of possible subsets of features, with the classification accuracy on an out-of-sample test set used as the fitness function to guide the search for the optimal subset. Similar approaches have been proposed in the literature for feature subset selection for machine learning, optimisation, and search-based software testing tasks. Certainly, other feature selection methods proposed in the literature would also be suitable for the task at hand.
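As a rough illustration of this search, the sketch below evolves bit masks over the feature set and scores each mask by out-of-sample accuracy. Everything specific in it is an assumption made for illustration: the nearest-centroid classifier stands in for whatever classifier is actually used, and the operators (truncation selection, one-point crossover, single bit-flip mutation) and parameter values are arbitrary choices.

```python
import random

def nearest_centroid_accuracy(train, test, mask):
    """Out-of-sample accuracy of a nearest-centroid classifier on the masked features.

    train and test are lists of (feature_vector, label) pairs.
    """
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    centroids = {}
    for label in {y for _, y in train}:
        members = [x for x, y in train if y == label]
        centroids[label] = [sum(x[j] for x in members) / len(members) for j in cols]

    def predict(x):
        return min(centroids, key=lambda lab: sum(
            (x[j] - c) ** 2 for j, c in zip(cols, centroids[lab])))

    return sum(predict(x) == y for x, y in test) / len(test)

def select_features(train, test, n_features, generations=30, pop_size=20, seed=1):
    """Genetic search over feature subsets (n_features must be at least 2)."""
    rng = random.Random(seed)

    def fitness(mask):
        return nearest_centroid_accuracy(train, test, mask)

    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]          # truncation selection, keeps elites
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)     # one-point crossover
            child = a[:cut] + b[cut:]
            flip = rng.randrange(n_features)       # single bit-flip mutation
            child[flip] = not child[flip]
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return best, fitness(best)
```

Because the fitness is measured on a held-out set, a mask is rewarded only if the selected features separate instances the classifier has not seen, which is the property the feature learning step relies on.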
3.2 APRT Footprints
Once the significant features have been identified through the feature learning procedure, they are used to analyse and visualise the footprints of the APR techniques. As explained in the previous section, in order to facilitate the visualisation of the footprints, we utilise PCA as a dimensionality reduction technique to project the instances to two dimensions, while making sure that we retain as much information as possible. PCA rotates the data to a new coordinate system, with axes defined by linear combinations of the selected features. The new axes are the eigenvectors of the covariance matrix. We retain the two principal eigenvectors, which correspond to the two largest eigenvalues of the covariance matrix. The instance space is then projected on this two-dimensional space. We use the variance explained in the data by the two principal components as a measure of the loss in information due to dimensionality reduction. Following a similar approach to previous work on dimensionality reduction, we accept the new two-dimensional instance space as adequate if most of the variance in the data is explained by the two principal axes. The two principal components are then used to visualise the footprints of the APR technique (APRT). We refer to the hypothetical example in Figure 2 to illustrate this point.
Figure 2 shows the footprint of an APRT. Each point represents one buggy program. If a patch is generated for that particular program, then the performance of the APRT is labelled as ‘GOOD’. Otherwise, the instance is labelled as ‘BAD’, indicating that the APRT was not effective in that particular instance. In this example, the footprint of the APRT is the cluster of the buggy programs represented with dark circles, and as we can see, it is clearly distinguishable from the area where the APRT is expected to not perform well.
If our goal were only to predict the best APR tool for repairing a particular software system, we could use machine learning algorithms to identify the relationship between software features and APR performance. Using machine learning alone on our data, however, does not explain why a particular APR technique works well. Our goal in this paper is much broader than only making predictions, as we aim to visualise the footprints of the different APR approaches and provide insights into the workings of these methods.
Figure 1(b) shows the footprint of one of the most significant features, which in this illustrative example is the depth of inheritance tree (DIT). It is clear that this APRT works well when the depth of inheritance tree is high, and cannot produce patches for software with low values of DIT. The feature footprint explains the performance of the APRT, thus helping understand why the technique works.
Finally, we calculate the relative size of APRT footprints by estimating the area of the hull covering the software instances where the technique is expected to perform well. This is a metric of the relative goodness of the APRT across the software instance space. Following recommendations by Smith-Miles et al., we subtract from this measure the areas where we have evidence that the APRT does not perform as well. The low performance areas may contain program instances that were repaired by the APRT; however, if these instances lie within the hull of other instances labelled as “BAD”, we consider this as evidence that contradicts the good performance of the APRT in those few program instances.
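Formally, given the convex hull $H(S)$ of an area defined by points $S = \{(x_i, y_i),\ i = 1, \dots, \eta\}$, the area is given by

$$A(H(S)) = \frac{1}{2} \left| \sum_{j=1}^{k} \left( x_j y_{j+1} - y_j x_{j+1} \right) \right| \qquad (1)$$

where the subset $\{(x_j, y_j),\ j = 1, \dots, k\}$, $k \le \eta$, with $(x_{k+1}, y_{k+1}) = (x_1, y_1)$, defines the extreme points of $H(S)$, taken in order around the hull. Using Equation 1, we compare the relative size of the footprint of each APRT to determine which APRT has the largest footprint and explore the degree of overlap of the footprints.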
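The hull-area computation can be sketched as follows, assuming the instances have already been projected to two dimensions. The hull is built with Andrew's monotone chain algorithm and its area is obtained with the shoelace formula; both are standard choices made here for illustration, not necessarily the ones used in the study. The function `contradicting_instances` is a hypothetical helper that only counts BAD instances falling inside the GOOD hull; the subtraction of low-performance areas is not implemented.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; extreme points in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula over the ordered extreme points (the hull is closed implicitly)."""
    k = len(vertices)
    twice_area = sum(
        vertices[j][0] * vertices[(j + 1) % k][1]
        - vertices[j][1] * vertices[(j + 1) % k][0]
        for j in range(k))
    return abs(twice_area) / 2.0

def inside(hull, p):
    """True if p lies inside or on the boundary of a counter-clockwise convex hull."""
    k = len(hull)
    return k >= 3 and all(cross(hull[j], hull[(j + 1) % k], p) >= 0 for j in range(k))

def footprint_area(good_points):
    """Area of the convex hull around the instances labelled GOOD."""
    return polygon_area(convex_hull(good_points))

def contradicting_instances(good_points, bad_points):
    """Number of BAD instances that fall inside the hull of the GOOD instances."""
    hull = convex_hull(good_points)
    return sum(inside(hull, p) for p in bad_points)
```

For example, GOOD instances at the corners of a 4 by 3 rectangle give a footprint area of 12, regardless of any further GOOD instances that lie in its interior.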
3.3 APRT Selection
In the final step, E-APR predicts, based on the most significant software features, the most effective APR technique for particular APR problems. We use the two-dimensional space created in the footprints visualisation stage as an input to machine learning algorithms to learn the relationship between the instance features and APR method performance. For this purpose, we can use a variety of machine learning algorithms, such as decision trees, or support vector machines for binary labels (bad/good), or statistical prediction methods, such as regression algorithms or neural networks for continuous labels (e.g., time complexity of the approach). In our approach, we use a support vector machine (SVM) – which produced the best results in a pilot study of different algorithms.
At the end of this process, e-APR produces a model that can be used for algorithm selection in automated program repair. This model can be retrained and extended with more APR tools and features.
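To make the selection step concrete, the following sketch trains one linear classifier per technique on the two-dimensional instance space and recommends the technique with the largest decision value. It is a simplified stand-in, not the model used in the study: the training routine is a Pegasos-style sub-gradient method for a linear SVM without a bias term (so it assumes centred coordinates), a library SVM would normally be used instead, and the rule of picking the highest decision value, as well as all function and tool names, are assumptions made for illustration.

```python
import random

def train_linear_svm(points, labels, lam=0.01, epochs=50, seed=0):
    """Pegasos-style sub-gradient descent on the regularised hinge loss.

    points: 2-D coordinates in the instance space (assumed centred).
    labels: +1 where the technique produced a patch (GOOD), -1 otherwise (BAD).
    Returns the weight vector of a separating line through the origin.
    """
    rng = random.Random(seed)
    w = [0.0, 0.0]
    order = list(range(len(points)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)                      # decaying step size
            x, y = points[i], labels[i]
            margin = y * (w[0] * x[0] + w[1] * x[1])
            w = [(1.0 - eta * lam) * wj for wj in w]   # shrink (regularisation)
            if margin < 1.0:                           # hinge loss is active
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w

def decision_value(w, point):
    """Signed score; positive values predict GOOD."""
    return w[0] * point[0] + w[1] * point[1]

def select_technique(models, point):
    """Recommend the technique whose model gives the largest decision value."""
    return max(models, key=lambda name: decision_value(models[name], point))
```

Adding a further technique to the portfolio then amounts to training one more model on the same instance space and adding it to `models`.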
4 Experimental Design
We implement the e-APR framework described in Section 3, and conduct a set of experiments and analyses to answer the research questions stated in Section 1. In this section, we describe the automated program repair techniques, the benchmark of buggy programs, and the set of software features.
4.1 Automated Program Repair Techniques and Tools
In this paper we focus on one family of repair approaches: test-suite based repair approaches . Approaches from this family aim at repairing bugs exposed by at least one failing test case. The main idea of these approaches is to use failed test cases to localise potential faults and then apply mutations to the source code until the program satisfies all unit test cases. The mutations that are applied to the program code can range from small changes like modification, addition or removal of a single code line  to complex edit operations [26, 17], which are mined from software repositories and used to fix a fault in a different context.
A repair tool that materialises an APRT is implemented in a particular programming language and targets buggy applications written in the same or another programming language. For example, the authors of the GenProg repair technique wrote the GenProg tool in OCaml for repairing programs written in C. However, there exist other implementations of GenProg that are written in, and target, other languages, such as Astor and ARJA. In this paper we chose 11 repair tools capable of repairing Java programs, based on the study by Durieux et al. These are: ARJA, Cardumen, DynaMoth, jGenProg, GenProg-A, jKali, Kali-A, jMutRepair, Nopol, NPEFix, and RSRepair-A. They belong to 3 categories of repair approaches: semantics-based (DynaMoth and Nopol), metaprogramming-based (NPEFix) and generate-and-validate (the other 8 tools).
jGenProg and GenProg
jGenProg is a Java implementation of GenProg. Both techniques use a generate-and-validate method to produce patches using a genetic programming approach. The search space consists of patches that are formed through combinations of removing code, and inserting and replacing code from elsewhere in the program under repair .
Cardumen  synthesises patches using the existing code as a basis, by taking code elements from elsewhere in the program and replacing the variables. Each potential patch is filtered based on location and type compatibility, and the remaining patches are prioritised based on how frequently the selected variables occur together.
jKali and Kali-A are different implementations of Kali in Java. They attempt to come up with candidate patches by removing or skipping statements. Neither jKali nor Kali-A is a ‘repair’ program; instead, they are more useful for identifying weak test suites and under-specified bugs. Since Kali simply removes or skips code, a patch found by Kali is a strong indication that the functionality of the removed code is not specified in the test suite. In addition, if Kali finds a test-suite adequate patch, so can jGenProg or Nopol. The patches found by Kali, however, rarely work beyond the given test suite.
jMutRepair performs an exhaustive search of the code and applies three types of mutation operators to suspicious if conditions: the relational mutation operator, with the values (==, !=, <, >, <=, >=); the logical mutation operator (AND, OR); and the unary mutation operator, which applies negation and positivation.
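The idea of mutating relational operators until the test suite passes can be sketched in a few lines. The tools discussed here operate on Java programs; the sketch below is only an analogy written in Python using the standard `ast` module, and the buggy function, the tests and all helper names are hypothetical. It needs Python 3.9 or later for `ast.unparse`.

```python
import ast

BUGGY_SOURCE = """
def is_adult(age):
    if age > 18:
        return True
    return False
"""

# Hypothetical test suite as (arguments, expected result); the second test exposes the bug.
TESTS = [((17,), False), ((18,), True), ((30,), True)]

RELATIONAL_OPS = [ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE]

class SwapComparison(ast.NodeTransformer):
    """Replace the operator of the first simple comparison used as an if condition."""

    def __init__(self, new_op):
        self.new_op = new_op
        self.done = False

    def visit_If(self, node):
        test = node.test
        if not self.done and isinstance(test, ast.Compare) and len(test.ops) == 1:
            test.ops = [self.new_op()]
            self.done = True
        self.generic_visit(node)
        return node

def passes_all_tests(source, func_name, tests):
    """Run the candidate program against every test; any exception counts as failure."""
    namespace = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        return all(namespace[func_name](*args) == expected for args, expected in tests)
    except Exception:
        return False

def repair(source, func_name, tests):
    """Try each relational operator in turn; return the first test-suite adequate patch."""
    for op in RELATIONAL_OPS:
        tree = SwapComparison(op).visit(ast.parse(source))
        candidate = ast.unparse(ast.fix_missing_locations(tree))
        if passes_all_tests(candidate, func_name, tests):
            return candidate
    return None
```

Calling `repair(BUGGY_SOURCE, "is_adult", TESTS)` returns a variant whose condition is `age >= 18`, the only relational operator that satisfies all three tests. With a weaker test suite, an earlier operator in the list could be accepted instead, which is how overfitting patches arise.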
Nopol focuses on repairing IF conditions, which are amongst the most error-prone elements of Java programs; many one-change commits simply update an IF condition. Nopol has three main steps. First, it locates a fix location for a potential patch using “angelic fix localisation”. This process also involves finding “angelic values”, which are assigned values that can be used at the fix location to make all failing tests pass. Next, Nopol collects runtime data from a test execution, including a snapshot of the program state at candidate fix locations. Then, Nopol translates the angelic values and available variables at the fix location into a Satisfiability Modulo Theories (SMT) problem, and attempts to find a solution, which is then translated into a patch.
RSRepair-A is a Java implementation of the RSRepair program repair tool written for C programs. RSRepair uses a generate-and-validate technique to prepare patches. It takes inspiration from the GenProg tool; however, instead of using genetic programming as its search method, RSRepair uses random search.
ARJA  uses Genetic Programming to modify and mutate suspicious statements in a program by performing three actions: i) deleting the suspicious statement, ii) replacing the suspicious statement, or iii) inserting extra statements before or after the suspicious statement. ARJA reduces the scope of the search and computation time to speed up the fitness process by applying rules that exclude statements that are not related to the problem .
NPEFix  repairs null pointer exceptions at runtime by using two strategies. The first strategy assigns an alternative value (which can be a valid value that is stored in another variable or a random value) for a null dereference. The second strategy skips the execution of the null dereference, by either skipping a single statement or skipping the complete method. All strategies are applicable for any arbitrary objects, including instances of library classes, and instances of domain classes.
In summary, the APR techniques discussed in this section can be broadly categorised based on their high-level repair strategy. For example, jGenProg, ARJA and RSRepair-A use or build upon genetic programming. Other techniques take more unique approaches and are designed to target specific bugs, like NPEFix, which targets null pointer exceptions. Other repair tools can only function if code is structured in a certain way, like Nopol, which only works when IF conditions are present, and will only find a valid patch if the patch involves changing IF conditions. These observations further support our hypothesis that the performance of each technique is likely to be affected by the features of the code. Different repair strategies may favour different code features, and a tool that targets a specific type of bug will perform badly on code that contains a different type of bug.
4.2 Buggy Software and Patches
The automated repair research community has used existing bug benchmarks or created new ones to evaluate repair approaches and tools. Most of the Java approaches were previously evaluated on a single dataset, Defects4J. Durieux et al. is one of the few studies that performs an extensive evaluation of 11 existing APRTs on 4 peer-reviewed Java bug benchmarks: Bears, Bugs.jar, IntroClassJava and QuixBugs. Our analysis is based on the experimental data generated by Durieux et al., which is available at github.com/program-repair/RepairThemAll_experiment.
In total, we consider 2,141 bugs from 130 projects, and 23,551 repair attempts. A repair attempt is the execution of an APRT on a buggy program. The execution of all repair attempts on the 4 benchmarks by the 11 APRTs took 314 days. The patches considered in this study are test-suite adequate patches. With such a patch applied: a) the failing test cases (that exposed the bug) pass, and b) the remaining test cases continue to pass. Previous work has shown that a test-suite adequate patch can pass all tests and yet be incorrect. Such patches are called overfitting patches, and can arise due to the weakness of the test suite used for synthesising the patches. Overfitting detection is not yet mature (i.e., not capable of detecting all overfitting patches), and adopting such techniques could introduce some bias in this work; hence we consider all patches generated by the repair tools executed by RepairThemAll. This means that we did not filter the outputs generated by the APRTs.
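The two conditions for a test-suite adequate patch, and the way repair attempts turn into GOOD/BAD labels, can be written down as a small sketch. The data layout (a mapping from test name to pass/fail, and attempts as tuples) and the function names are assumptions made for illustration and do not reflect the format of the RepairThemAll data.

```python
def is_test_suite_adequate(results_after_patch, failing_before):
    """Check conditions a) and b) for a patched program.

    results_after_patch: dict mapping test name -> True (pass) or False (fail).
    failing_before: set of test names that failed on the buggy program.
    """
    # a) every test that exposed the bug now passes
    fixed = all(results_after_patch[name] for name in failing_before)
    # b) every other test continues to pass
    no_regression = all(passed for name, passed in results_after_patch.items()
                        if name not in failing_before)
    return fixed and no_regression

def label_attempts(attempts):
    """Label each repair attempt GOOD if it produced at least one adequate patch.

    attempts: iterable of (bug_id, tool, number_of_adequate_patches).
    """
    return {(bug, tool): "GOOD" if n_patches > 0 else "BAD"
            for bug, tool, n_patches in attempts}
```

Note that this check says nothing about correctness beyond the test suite, so an overfitting patch is labelled GOOD in exactly the same way as a correct one.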
The source of the bugs in the bug benchmark are diverse: Defects4J and Bugs.jar contains real bugs extracted from software repositories, Bears contains real bugs collected from breaking builds on Travis platforms, IntroClassJava contains buggy subjects from students, and QuixBugs contains buggy implementation of well-known algorithms (such as merge-sort). Our study is the first to analyse how diverse these datasets are.
4.3 Software Features
Features are problem dependent and must be chosen so that the varying complexities of the problem instances are exposed, any known structural properties of the software are captured, and any known advantages and limitations of the different program repair techniques are related to features. The most common measures and metrics used to characterise features of a software system are extracted from code.
Among others, we use object-oriented code metrics based on measurement theory and the expertise of experienced software developers. These metrics are also mapped to the Quality Model for Object-Oriented Design, a comprehensive and empirically validated model for assessing object-oriented design quality attributes such as understandability and reusability, which relates them through mathematical formulas to structural object-oriented design properties such as encapsulation and coupling. The set ranges from simple features, which count the number of methods or lines of code, to more elaborate features that measure the interaction between methods and the depth of the inheritance tree.
|Object oriented features|
|WMC||Weighted methods per class|
|DIT||Depth of inheritance Tree|
|NOC||Number of children|
|CBO||Coupling between object classes|
|RFC||Response for a class|
|LCOM||Lack of cohesion in methods|
|NPM||Number of public methods for a class|
|LCOM3||Lack of cohesion in methods|
|LOC||Lines of code|
|DAM||Data access metric|
|MOA||Measure of aggregation|
|CAM||Cohesion among methods of class|
|CBM||Coupling between methods|
|AMC||Average method complexity|
|Java specific method features|
|AC||Abstract methods count|
|ASMC||Abstract static methods count|
|DAMC||Default abstract methods count|
|DASMC||Default abstract static methods count|
|DMC||Default methods count|
|DSM||Default static methods count|
|GMC||General methods count|
|GSMC||General static methods count|
|PriAMC||Private abstract methods count|
|PriASMC||Private abstract static methods count|
|PriMC||Private methods count|
|PriSMC||Private static methods count|
|ProAMC||Protected abstract methods count|
|ProASMC||Protected abstract static methods count|
|ProMC||Protected methods count|
|ProSMC||Protected static methods count|
|PubAMC||Public abstract methods count|
|PubASMC||Public abstract static methods count|
|PubMC||Public methods count|
|PubSMC||Public static methods count|
|SMC||Static methods count|
|Code Elements Features|
|Usage||Related to usage of e.g. variables and invocations|
|Syntax||Related to syntax of e.g. variable’s identifiers|
|Types||Related to types of e.g. variables, and parameters.|
In addition to code features widely used by software practitioners and researchers, we also consider a set of Code Elements Features, which are manually crafted to target different open challenges of automated program repair. Code Elements Features have recently been used for predicting source code transformations on buggy code and for detecting incorrect patches. These features capture different characteristics of the buggy or patched program, and are grouped into three categories: 1) features related to the Usage of code elements, for example, the feature OUIA indicates whether a statement references a local variable that has not been referenced in other statements before it; 2) features related to the Syntax of code elements, for example, the feature HVSN indicates whether, given a statement that references a variable, there exist other variables in the same scope that have a similar identifier name to that variable; 3) features related to the Types of code elements, for example, the feature VTSV indicates whether, given a statement that references a variable, there exist other variables in the same scope that are type compatible with that variable.
4.4 Feature Extraction
We extract the code features presented in Section 4.3 as follows. For each buggy program considered in this experiment (Section 4.2), we first create a vector in which each dimension corresponds to a particular feature. We then add to that vector one additional dimension for each APRT considered in this experiment. The value of each such dimension is ‘1’ if the corresponding APRT produced a patch and ‘0’ otherwise. Table II shows an example of the features extracted from 4 buggy programs. Each row is a feature vector holding the values extracted for one program. The second to fifth columns show the values of 4 object-oriented features (wmc, dit, npc and cbo). The last two columns indicate whether the buggy program could be repaired by two approaches (Kali and Arja). Since the Object-Oriented and Java-specific method features are calculated at the class level and the Code Elements Features are calculated at the statement level, we compute the average value of each feature over all classes (or statements, respectively) to obtain the final values of the features that characterise the buggy program.
We present the results for each research question, and aim to provide insights into why the different APR techniques work. First, we present the most significant features that impact APRT effectiveness. Second, we investigate the diversity of existing buggy datasets used for APR. Next, we investigate the differences between existing APRTs by analysing their strengths and weaknesses using the most significant features. Finally, we present the results from the SVM model used for APRT selection.
5.1 RQ1. Which software features have an impact on the effectiveness of APR techniques?
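The construction of one such vector can be sketched as follows. This is a minimal illustration, assuming that per-class and per-statement feature values are already available as dictionaries; the function names and all numeric values are hypothetical and do not come from Table II.

```python
from statistics import mean


def average_features(rows):
    """Average each named feature over per-class or per-statement rows."""
    names = rows[0].keys()
    return {name: mean(row[name] for row in rows) for name in names}


def build_vector(class_rows, statement_rows, patched_by, tools):
    """Build the feature vector of one buggy program.

    class_rows: one dict per class (object-oriented and Java-specific features)
    statement_rows: one dict per statement (Code Elements Features)
    patched_by: set of APRT names that produced a patch for this program
    tools: all APRT names, one extra dimension each (1 = patch, 0 = no patch)
    """
    vector = {}
    vector.update(average_features(class_rows))
    vector.update(average_features(statement_rows))
    for tool in tools:
        vector[tool] = 1 if tool in patched_by else 0
    return vector


# Hypothetical values for a program with two classes and three statements.
class_rows = [{"wmc": 4, "dit": 1}, {"wmc": 8, "dit": 3}]
statement_rows = [{"PUIA": 1}, {"PUIA": 0}, {"PUIA": 2}]
vector = build_vector(class_rows, statement_rows, {"Arja"}, ["Kali", "Arja"])
```

Averaging collapses programs of very different sizes into vectors of the same length, at the cost of hiding how feature values are distributed across classes or statements.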
We performed feature learning on the 184 features that were extracted from the 1,282 classes. The aim is to select the best set of features that highlights the strengths and weaknesses of the APR techniques. To account for the randomness in the results, each trial of feature learning was run 10 times on each buggy program for each approach, using different random seeds, and the mean was considered. Out of the 184 features that were part of the study, we identified the following 9 optimal features which best capture the difficulty in generating patches for APR:
(F1) MOA: Measure of Aggregation
(F2) CAM: Cohesion Among Methods
(F3) AMC: Average Method Complexity
(F4) PMC: Private Method Count (listed as PriMC in Table I)
(F5) AECSL: Atomic Expression Comparison Same Left indicates the number of statements with a binary expression that have more than an atomic expression (e.g., variable access). This feature belongs to the Syntax category.
(F6) SPTWNG: Similar Primitive Type With Normal Guard indicates the number of statements that contain a variable (local or global) that is also used in another statement contained inside a guard (i.e., an If condition). This feature belongs to the Usage category.
(F7) CVNI: Compatible Variable Not Included is the number of local primitive type variables within the scope of a statement that involves primitive variables that are not part of that statement. This feature belongs to the Usage category.
(F8) VCTC: Variable Compatible Type in Condition measures the number of variables within an If condition that are compatible with another variable in the scope. This feature belongs to the Types category.
(F9) PUIA: Primitive Used In Assignment measures the number of primitive variables in assignments. This feature belongs to the Types category.
Using these features, we were able to define the footprints of the techniques with an explained variance of 87%. In essence, the answer to the first research question is:
RQ1: The most significant features that have an impact on the effectiveness of APR techniques are the Object-Oriented Features: F1. MOA, F2. CAM, F3. AMC, F4. PMC, and the Code Elements Features: F5. AECSL, F6. SPTWNG, F7. CVNI, F8. VCTC, and F9. PUIA.
5.2 RQ2. How different are existing APR benchmark datasets?
To visualise the results in a meaningful way, we apply PCA as a dimensionality reduction technique on the optimal subset of features. This allows us to analyse the location of the different benchmarks across the software instance space in 2D, which reveals how diverse they are. Two new axes were created, which are linear combinations of the selected set of most significant features. The coordinate system that defines the new software instance space is defined as:
The new coordinates are a combination of the 9 features. CAM, PMC and MOA contribute most to the first coordinate, and SPTWNG, AMC and VCTC contribute most to the second. CVNI, AECSL, and PUIA contribute equally to both coordinates.
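To make the idea of two axes formed as linear combinations of features concrete, the following sketch projects a toy feature matrix onto its first two principal components. It is not the e-APR implementation: it uses power iteration with deflation as a simple stand-in for a library PCA routine, and the three-feature toy data are invented for illustration (the study uses 9 features).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def covariance_matrix(rows):
    """Centre the columns of `rows` and return (centred rows, sample covariance)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    return centred, cov


def leading_component(cov, iterations=200):
    """Power iteration: unit eigenvector and eigenvalue of the largest eigenvalue.

    Starts from a uniform vector, which is assumed not to be orthogonal
    to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [dot(row, v) for row in cov]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v, dot(v, [dot(row, v) for row in cov])


def pca_2d(rows):
    """Project instances onto the first two principal components."""
    centred, cov = covariance_matrix(rows)
    v1, l1 = leading_component(cov)
    d = len(cov)
    # Deflation removes the first component so the next run finds the second.
    deflated = [[cov[i][j] - l1 * v1[i] * v1[j] for j in range(d)]
                for i in range(d)]
    v2, l2 = leading_component(deflated)
    coords = [(dot(r, v1), dot(r, v2)) for r in centred]
    return coords, (v1, v2), (l1, l2)


# Toy data: 5 instances, 3 features (hypothetical values).
rows = [[1, 2, 0], [2, 4, 1], [3, 6, 0], [4, 8, 1], [5, 10, 0]]
coords, (v1, v2), (l1, l2) = pca_2d(rows)
```

The entries of `v1` and `v2` play the role of the feature weights discussed above: a feature with a large absolute weight in `v1` contributes strongly to the first coordinate.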
The dataset footprint presented in Figure 3 shows the reduced feature space with instances labelled according to the dataset they belong to. Each point is a bug from a particular dataset.
We observe that there is a distinctive cluster on the left of Figure 3 composed only of bugs from IntroClassJava. It is clear that this dataset is significantly different from the other datasets. Furthest from this cluster is the footprint of Defects4J, which is on the rightmost side of the graph. This indicates that Defects4J is significantly different from IntroClassJava.
On the other hand, the footprints of Bears, Bugs.jar and QuixBugs overlap to a greater extent. They lie between IntroClassJava and Defects4J and are more spread out than the other datasets. These three benchmark datasets are very similar in terms of the features that impact the effectiveness of APRTs, which means that evaluating APRTs on only these three datasets does not expose them to distinct challenges. Bugs.jar contains some bugs obtained from the same software as the other datasets (e.g., Apache Commons Math), so some bugs may be the same. QuixBugs is a set of buggy implementations of well-known algorithms (e.g., Quicksort), and each buggy program in this dataset is a single class. The other datasets are real buggy programs composed of several classes.
In summary, the answer to the second research question is as follows:
RQ2: IntroClassJava and Defects4J are significantly different from the other benchmark datasets, while Bears, Bugs.jar, and QuixBugs contain bugs with very similar features.
Our finding from this research task can inform researchers who develop new APRTs in the selection of the bug benchmark on which to test their technique. It would not be sufficient to test a new APRT on just Bears, Bugs.jar, and QuixBugs, and a technique that works for Defects4J may not produce good results when repairing IntroClassJava.
5.3 RQ3. How different are existing APR techniques?
This research question is concerned with explaining the similarities and differences of existing techniques for automated program repair. We answer this research question by assessing the footprints of APRTs in the reduced instance spaces presented in Figure 4. The feature learning algorithm reduced the feature space by selecting the most significant features that can best explain the differences between the techniques. The most significant features were discussed in the response to RQ1. These features allow us to explain the kind of bugs that existing APR techniques can repair, which ultimately makes it possible to explain how different they are and why they work.
We plot the footprints of the 11 APRTs in Figure 4. Each point in the reduced instance space represents a buggy program. If an APR technique produced a patch for a particular program, we label the program as good; otherwise, we label it as bad. Each graph in Figure 4 represents the footprint of one of the techniques that we study in this paper. To assess the similarities between techniques, we perform a visual inspection of the footprints, and observe that there are no identical footprints. While some techniques appear more similar than others (for example, jKali is more similar to Arja than to NPEFix), each technique has its unique strengths. This suggests that the effectiveness of APR techniques is context dependent, and there is no approach that can be considered the best in all cases.
All APRTs apart from NPEFix repaired bugs located at the top right of the footprints. These are bugs from the Defects4J benchmark (see Figure 3), which supports the long-held hypothesis that APRTs are being perfected to repair bugs from this dataset. On the other hand, only three approaches, jMutRepair, RSRepair, and GenProgA, are capable of repairing bugs from IntroClassJava, which is the small cluster of buggy programs on the leftmost side of the footprints. In summary, the answer to the third research question is:
RQ3: By studying the overlap between the different APR techniques, we observe greater similarities between the following two groups of tools:
Group 1. jKali, jGenProg, Nopol and Arja, and
Group 2. jMutRepair, RSRepair, GenProgA.
The following APRTs have their own very distinct footprints and are very different from each other and the rest of the techniques: NPEFix, KaliA, Dynamoth, and Cardumen.
5.3.1 Footprint size
Table III shows the area size of the APRT footprints, measured using Equation 1. The size of the footprint is an indication of the overall performance of the APRT. The larger the footprint size, the more diverse bugs an APRT can repair.
While the sizes of the footprints of most techniques are relatively similar, Nopol is the clear winner. The footprint size is not based on the number of programs that a technique was able to repair. Instead, in our approach, the performance of an APRT is measured in terms of the diversity of the features of these programs. An APRT that can repair more diverse bugs is considered better.
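Equation 1 is not reproduced in this section, so the following sketch should be read as a rough stand-in rather than the measure used in Table III. It assumes, purely for illustration, that a footprint can be approximated by the convex hull of the instances labelled good in the 2D instance space, and computes the hull area with the shoelace formula. All function names and coordinates are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    n = len(vertices)
    if n < 3:
        return 0.0
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def footprint_area(instances):
    """Area of the convex hull of instances labelled 'good'.

    instances: iterable of (z1, z2, label) with label 'good' or 'bad'.
    """
    good = [(x, y) for x, y, label in instances if label == "good"]
    return polygon_area(convex_hull(good))


# Hypothetical instance-space coordinates for one APRT.
instances = [(0, 0, "good"), (2, 0, "good"), (2, 2, "good"),
             (0, 2, "good"), (1, 1, "good"), (5, 5, "bad")]
area = footprint_area(instances)
```

The example shows why footprint size reflects diversity rather than count: the interior point at (1, 1) is a repaired program but does not enlarge the area.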
5.3.2 Effect of Significant Software Features
In RQ1, we discovered 9 software features that have an impact on the effectiveness of APRTs. Here, we investigate in what way these significant software features impact the effectiveness of APRTs, and gain further insight into why the techniques work and in what way they are similar or different. Using the same instance space and coordinate system as in Figure 3, we plot the feature footprints in Figure 5. These plots depict how the buggy program instances score in terms of the most significant features. Software metric values are normalised between 0 and 1; blue represents lower values, while yellow represents higher values.
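The text does not state which normalisation is used. A common choice that maps values into the range 0 to 1 is min-max scaling, sketched below under that assumption; the function name is hypothetical.

```python
def min_max_normalise(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1.

    If all values are equal, every value maps to 0.0 to avoid dividing by zero.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

With this scaling, a feature such as AMC can be plotted on the same colour scale as a count such as PMC, even though their raw ranges differ.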
MOA and PriMC. The cluster of software systems where only jMutRepair, RSRepair and GenProgA are effective has a lower measure of aggregation (MOA) and private methods count (PriMC). MOA (as defined in Table I) is the percentage of data declarations in the system whose types are user-defined classes, as opposed to system-defined types such as integers and real numbers. This indicates that, compared to other approaches, it is easier for jMutRepair, RSRepair and GenProgA to repair bugs originating from software systems that have fewer user-declared types and a lower number of private methods.
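Following the description of MOA given here (the share of data declarations whose type is a user-defined class), a minimal calculation could look like the sketch below. The function name and the example class are hypothetical, and metric tools may define or count MOA differently.

```python
def measure_of_aggregation(field_types, user_defined_types):
    """Fraction of declared fields whose type is a user-defined class."""
    if not field_types:
        return 0.0
    user = sum(1 for t in field_types if t in user_defined_types)
    return user / len(field_types)


# Hypothetical class with four fields, two of which are user-defined types.
moa = measure_of_aggregation(["int", "Engine", "double", "Wheel"],
                             {"Engine", "Wheel"})
```

A program made only of primitive fields, as is typical for small student programs, scores 0 under this reading.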
CAM. The second most significant feature is cohesion among methods in a class (CAM), which is a measure of class cohesion. The cluster of software systems where only jMutRepair, RSRepair and GenProgA are effective is high in terms of CAM. High class cohesion is a desirable property and has previously been linked with high software quality. On the other hand, DynaMoth is very effective in repairing bugs from software programs with low cohesion, while Cardumen struggles with such bugs.
AMC. Average Method Complexity is relatively high in the upper right part of the plot, where most APRTs are able to generate patches, which is a surprising but also good result. This indicates that most APRTs are effective at handling complex methods.
Code Elements Features. These metrics capture different characteristics of the buggy parts of the programs. Out of the 143 features, e-APR identified 5 significant Code Elements Features (SPTWNG, VCTC, PUIA, CVNI and AECSL), whose footprints we show in Figure 5. Four of these five features (SPTWNG, VCTC, PUIA and CVNI) have very similar footprints. These four features belong to the Types or Usage categories of Code Elements Features.
As a reminder, SPTWNG (Similar Primitive Type With Normal Guard) indicates the number of statements that contain a variable (local or global) that is also used in another statement contained inside a guard (i.e., an If condition). VCTC (Variable Compatible Type in Condition) measures the number of variables within an If condition that are compatible with another variable in the scope. PUIA (Primitive Used In Assignment) measures the number of primitive variables in assignments. Finally, CVNI (Compatible Variable Not Included) is the number of local primitive type variables within the scope of a statement that involves primitive variables that are not part of that statement.
All these metrics are related to features of variables used in assignments or IF conditions. Most approaches are effective in repairing programs with higher values of these metrics, as indicated by the program instances in the top right corner of the plots. The IntroClassJava dataset, which is located in the leftmost cluster, has lower values of these metrics. We find that jMutRepair, RSRepair and GenProgA are the only effective tools in this case. The last significant metric is AECSL (Atomic Expression Comparison Same Left), which measures the number of statements with a binary expression that have more than an atomic expression. NPEFix appears effective in buggy programs with low AECSL values, while jMutRepair and KaliA appear more effective in programs with high AECSL values.
In summary, the effectiveness of APRTs is impacted by software features, which makes these methods problem dependent, and as such, no technique can be considered the best in all cases. We observe different strengths and weaknesses of existing APRTs, which calls for methods that make it possible to select the most suitable technique given a software system with particular features.
5.4 RQ4. How can we select the most suitable APRT?
To answer this question, the e-APR framework uses a Support Vector Machine to learn a model that can be used to predict the most suitable APRT to repair buggy programs with particular features. The fitted SVM achieved 93.6% accuracy and 75.83% precision, and the results are depicted in the reduced instance space shown in Figure 6.
RQ4: The SVM model can predict the most suitable technique with 93.6% accuracy and 75.83% precision.
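For readers less familiar with the two reported measures, the sketch below computes accuracy and precision for a binary labelling (1 = the technique is predicted to produce a patch, 0 = it is not). The labels are invented for illustration, and the text does not specify how precision was aggregated across techniques, so this binary form is an assumption.

```python
def accuracy(y_true, y_pred):
    """Share of instances whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Share of positive predictions that are actually positive."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0
    return sum(1 for t in predicted_pos if t == positive) / len(predicted_pos)


# Hypothetical true and predicted labels for eight buggy programs.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)
prec = precision(y_true, y_pred)
```

Accuracy can exceed precision when most programs are not repaired by a technique, since correctly predicted negatives raise accuracy but do not affect precision.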
It is clear from the footprint in Figure 6 that the effectiveness of APRTs is problem dependent and that no single technique is the best in all scenarios. The high performance of e-APR in predicting the most suitable APR technique makes it a high priority for us to integrate this approach into existing repair infrastructures such as RepairThemAll or Repairnator. For example, RepairThemAll has 11 automated repair techniques, but it does not offer any capabilities or guidelines for choosing among them. Integrating e-APR with RepairThemAll would make it possible for users to select the most suitable APRT on the fly. Repairnator, on the other hand, is a software bot that automatically repairs broken Travis builds. Given a buggy program that causes a build to fail, Repairnator executes different repair approaches (including jGenProg and Nopol, among others) one by one, and the execution order is hard-coded. By incorporating e-APR, Repairnator could first execute e-APR to obtain the most suitable repair approaches for the buggy program, and execute them accordingly. This would increase the effectiveness of automated program repair in general. In summary, our approach e-APR can help existing APR infrastructures such as RepairThemAll and software bots such as Repairnator to maximise the efficiency and effectiveness of repairing bugs.
In this paper, we introduced e-APR, a novel framework for assessing the strengths and weaknesses of Automated Program Repair (APR) techniques. We identified nine significant software features that have an impact on APRT effectiveness. These features were then used to provide explanations about an APR technique’s behaviour across a range of buggy software systems. We introduced a method for visualising APRT footprints, which reveal strengths and weaknesses of the APR methods in solving APR problems.
We conducted an analysis of 11 different APR techniques applied to 2,141 bugs from 130 projects, constituting in total 23,551 repair attempts. Our approach effectively identified APRT footprints and the features that impact the effectiveness of an automated program repair technique. Using the most significant features, we developed a machine learning model that learns the relationship between software features and APRT effectiveness, which was able to predict the most effective technique with 93.6% accuracy and 75.83% precision. e-APR allows objective assessment of different APRTs by analysing both their strengths and weaknesses, thus providing clear guidelines on when to select an APRT.
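The proposed change to a hard-coded execution order can be sketched in a few lines. The function and the scores below are hypothetical: they do not reflect Repairnator's actual interfaces, and the scores stand in for whatever suitability estimate a predictive model would return.

```python
def order_tools(default_order, predicted_scores):
    """Sort tools by predicted suitability, highest first.

    Tools without a score are treated as 0.0. Python's sort is stable,
    so tools with equal scores keep their position from the default order.
    """
    return sorted(default_order,
                  key=lambda tool: -predicted_scores.get(tool, 0.0))


# Hypothetical default order and model scores for one buggy program.
default_order = ["jGenProg", "Nopol", "NPEFix"]
scores = {"Nopol": 0.9, "jGenProg": 0.4}
execution_order = order_tools(default_order, scores)
```

Keeping the default order as the tie-breaker means the bot behaves exactly as before whenever the model has no preference.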
The authors would like to acknowledge Prof. Kate Smith-Miles and her team working on Matilda, who inspired this work. Some of our experiments were conducted on the Matilda tool (matilda.unimelb.edu.au).
- (2014) Choosing the appropriate forecasting model for predictive parameter control. Evolutionary Computation 22 (2), pp. 319–349.
- (2013) An orchestrated survey of methodologies for automated software test case generation. Journal of Systems and Software 86 (8), pp. 1978–2001.
- (2003) Extensions to metric-based model selection. Journal of Machine Learning Research 3 (Mar), pp. 1209–1227.
- (2009) This Car Runs on Code. [Online; accessed 10-December-2018].
- (1994) A metrics suite for object oriented design. IEEE Transactions on Software Engineering 20 (6), pp. 476–493.
- (2017) Dynamic Patch Generation for Null Pointer Exceptions Using Metaprogramming. In Proceedings of the 24th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER ’17), pp. 349–358.
- (2019) Empirical review of Java program repair tools: a large-scale experiment on 2,141 bugs and 23,551 repair attempts. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 302–313.
- (2016) DynaMoth: Dynamic Code Synthesis for Automatic Program Repair. In International Workshop on Automation of Software Test, pp. 85–91.
- (2016) IntroClassJava: A Benchmark of 297 Small and Buggy Java Programs. Technical Report hal-01272126, University of Lille.
- (1997) My hairiest bug war stories. Communications of the ACM 40 (4), pp. 30–37.
- (2004) Object-oriented design quality models: a survey and comparison. In 2nd International Conference on Informatics and Systems, pp. 1–11.
- (2003) An introduction to variable and feature selection. Journal of Machine Learning Research 3 (Mar), pp. 1157–1182.
- (2016) Google reports self-driving car mistakes: 272 failures and 13 near misses. [Online; accessed 10-December-2018].
- (2011) Principal component analysis. Springer.
- (2014) Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In Proceedings of the 23rd International Symposium on Software Testing and Analysis, pp. 437–440.
- (2008) Lessons learned in software testing. John Wiley & Sons.
- (2013) Automatic patch generation learned from human-written patches. In International Conference on Software Engineering, pp. 802–811.
- (2012) A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each. In International Conference on Software Engineering (ICSE), pp. 3–13.
- (2013) Current challenges in automatic software repair. Software Quality Journal 21 (3), pp. 421–443.
- (2012) GenProg: a generic method for automatic software repair. IEEE Transactions on Software Engineering 38 (1), pp. 54–72.
- (2017) QuixBugs: A Multi-Lingual Program Repair Benchmark Set Based on the Quixey Challenge. In ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, pp. 55–56.
- (2019) Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies. In Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER ’19), Hangzhou, China, pp. 468–478.
- (2018) From start-ups to scale-ups: opportunities and open problems for static and dynamic program analysis. In IEEE International Working Conference on Source Code Analysis and Manipulation, pp. 1–23.
- (2017) Automatic Repair of Real Bugs in Java: A Large-scale Experiment on the Defects4J Dataset. Empirical Software Engineering 22 (4), pp. 1936–1964.
- (2015) Mining software repair models for reasoning on the search space of automated program fixing. Empirical Software Engineering 20 (1), pp. 176–205.
- (2016) ASTOR: A Program Repair Library for Java. In Proceedings of the 25th International Symposium on Software Testing and Analysis, Demonstration Track, pp. 441–444.
- (2018) Ultra-Large Repair Search Space with Automatically Mined Templates: the Cardumen Mode of Astor. In International Symposium on Search-Based Software Engineering, Lecture Notes in Computer Science, vol. 11036, T. E. Colanzi and P. McMinn (Eds.), pp. 65–86.
- (2019) Repairnator patches programs automatically. Ubiquity 2019.
- (2018) Do automated program repair techniques repair hard and important bugs? Empirical Software Engineering 23 (5), pp. 2901–2947.
- (2018) Mapping the effectiveness of automated test suite generation techniques. IEEE Transactions on Reliability 67 (3), pp. 771–785.
- (2014) The Strength of Random Search on Automated Program Repair. In Proceedings of the 36th International Conference on Software Engineering, pp. 254–265.
- (2015) An Analysis of Patch Plausibility and Correctness for Generate-and-Validate Patch Generation Systems. In Proceedings of the 2015 International Symposium on Software Testing and Analysis (ISSTA ’15), pp. 24–36.
- (2018) Bugs.jar: A Large-scale, Diverse Dataset of Real-world Java Bugs. In International Conference on Mining Software Repositories, pp. 10–13.
- (2015) Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE ’15), pp. 532–543.
- (2014) Towards objective measures of algorithm performance across instance space. Computers & Operations Research 45, pp. 12–24.
- (2012) Measuring algorithm footprints in instance space. In 2012 IEEE Congress on Evolutionary Computation, pp. 1–8.
- (2013) University of Cambridge Study: Failure to Adopt Reverse Debugging Costs Global Economy $41 Billion Annually. [Online; accessed 10-December-2018].
- (2016) Anti-patterns in search-based program repair. In ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 727–738.
- (2019) How different is it between machine-generated and developer-provided patches? An empirical study on the correct patches generated by automated program repair techniques.
- (2009) Automatically Finding Patches Using Genetic Programming. In Proceedings of the 31st International Conference on Software Engineering, pp. 364–374.
- (2018) Context-Aware Patch Generation for Better Automated Program Repair. In International Conference on Software Engineering, pp. 1–11.
- (2017) Identifying test-suite-overfitted patches through test case generation. In ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 226–236.
- (2018) Identifying patch correctness in test-based program repair. In International Conference on Software Engineering, pp. 789–799.
- (2017) Nopol: automatic repair of conditional statement bugs in Java programs. IEEE Transactions on Software Engineering 43 (1), pp. 34–55.
- (2019) Automated classification of overfitting patches with statically extracted code features.
- (2019) A Comprehensive Study of Automatic Program Repair on the QuixBugs Benchmark. In International Workshop on Intelligent Bug Fixing (co-located with SANER), pp. 1–10.
- (2019) Learning the relation between code features and code transforms with structured prediction.
- (2018) Alleviating patch overfitting with automatic test generation: a study of feasibility and effectiveness for the Nopol repair system. Empirical Software Engineering.
- (2018) ARJA: Automated Repair of Java Programs via Multi-Objective Genetic Programming. IEEE Transactions on Software Engineering.
- (2019) Support vector machine. In Fundamentals of Image Data Mining, pp. 179–205.