On Reliability of Patch Correctness Assessment

Current state-of-the-art automatic software repair (ASR) techniques rely heavily on incomplete specifications, e.g., test suites, to generate repairs. This, however, may cause ASR tools to generate incorrect repairs that do not generalize. To assess patch correctness, researchers have typically followed two separate approaches: (1) automated annotation, wherein patches are automatically labeled by an independent test suite (ITS) - a patch passing the ITS is regarded as correct or generalizable, and as incorrect otherwise; (2) author annotation, wherein authors of ASR techniques themselves annotate the correctness of patches generated by their own and competing tools. While automated annotation cannot prove that a patch is actually correct, author annotation is prone to subjectivity. This concern has caused an on-going debate on appropriate ways to assess the effectiveness of the numerous ASR techniques proposed recently. To address this concern, we propose to assess the reliability of author and automated annotation of patch correctness. We do this by first constructing a gold set of correctness labels for 189 randomly selected patches generated by 8 state-of-the-art ASR techniques through a user study involving 35 professional developers as independent annotators. By measuring inter-rater agreement as a proxy for annotation quality - as commonly done in the literature - we demonstrate that our constructed gold set is on par with other high-quality gold sets. We then compare the labels generated by author and automated annotation with this gold set to assess the reliability of the two patch assessment methodologies. We subsequently report several findings and highlight implications for future studies.


1. Introduction

Bug fixing is notoriously difficult, time-consuming, and costly (Tassey, 2002; Britton et al., 2013). Hence, effective automatic software repair (ASR) techniques that can help reduce the onerous burden of this task are of tremendous value. Interest in ASR has intensified, as demonstrated by the substantial recent work devoted to the area (Mechtaev et al., 2015, 2016; Xiong et al., 2017a; Long and Rinard, 2015, 2016b; Xuan et al., 2016; Le Goues et al., 2012; Kim et al., 2013; Le et al., 2015; Le et al., 2016a, pear; Chandra et al., 2011), bringing the futuristic idea of ASR closer to reality. ASR techniques can be broadly divided into two main families, heuristics-based and semantics-based approaches, classified by the way they generate and traverse the search space for repairs.

Traditionally, test cases are used as the primary criteria for judging the correctness of machine-generated patches – a patch is deemed correct if it passes all tests used for repair (Le Goues et al., 2012). This assessment methodology, however, has been shown to be ineffective, as there can be multiple patches that pass all tests but are nonetheless incorrect (Qi et al., 2015; Long and Rinard, 2016a). Although the search space of ASR varies depending on the nature of the underlying techniques, it is often huge and contains many plausible repairs, which pass all tests but fail to generalize to the expected behaviours. This problem, often referred to as patch overfitting (Smith et al., 2015; Le et al., 2017b), motivates the need for new methodologies to assess patch correctness. Such methodologies need to rely on additional criteria instead of using only the test suite used for generating repair candidates (aka the repair test suite).

To address this pressing concern, recent works have followed two separate methods for patch correctness assessment:


  • Automated annotation by independent test suite. Independent test suites obtained via an automatic test case generation tool are used to determine the correctness label of a patch – see for example (Smith et al., 2015; Le et al., 2016b). Following this method, a patch is deemed correct or generalizable if it passes both the repair and independent test suites, and incorrect otherwise.

  • Author annotation. Authors of ASR techniques manually check the correctness of patches generated by their own and competing tools – see for example (Xiong et al., 2017b; Liu et al., 2017). Following this method, a patch is deemed correct if the authors perceive semantic equivalence between the generated patch and the original developer patch.

While the former is incomplete, in the sense that it fails to prove that a patch is actually correct, the latter is prone to author bias. These inherent disadvantages have caused an on-going debate as to which method is better for assessing the effectiveness of the various ASR techniques being proposed. Unfortunately, there has been no extensive study that objectively assesses the two patch validation methods and provides insights into how the evaluation of ASR effectiveness should be conducted in the future.

This study is conducted to address this gap in research. We start by creating a gold set of correctness labels for a collection of ASR-generated patches, and subsequently use it to assess the reliability of labels created through author and automated annotation. We study a total of 189 patches generated by 8 popular ASR techniques (ACS (Xiong et al., 2017b), Kali (Qi et al., 2015), GenProg (Le Goues et al., 2012), Nopol (Xuan et al., 2016), S3 (Le et al., 2017a), Angelix (Mechtaev et al., 2016), and Enumerative and CVC4 embedded in JFix (Le et al., pear)). These patches are for buggy versions of 13 real-world projects, of which six are from Defects4J (Just et al., 2014) (Math, Lang, Chart, Closure, Mockito, and Time) and seven are from S3’s dataset (Le et al., 2017a) (JFlex, Fyodor, Natty, Molgenis, RTree, SimpleFlatMapper, GraphHopper). To determine the correctness of each patch, we follow best practice by involving multiple independent annotators in a user study. Our user study involves 35 professional developers; each ASR-generated patch is labeled by five developers, who compare the patch with its corresponding ground truth patch created by the original developer(s) who fixed the bug. By analyzing the created gold set and comparing it with labels generated by three groups of ASR tool authors (Martinez et al., 2017; Liu et al., 2017; Le et al., 2017a) and two automatic test case generation tools, DiffTGen (Xin and Reiss, 2017) and Randoop (Pacheco et al., 2007), we seek to answer three research questions:

  • Can independent annotators agree on patch correctness?

  • How reliable are patch correctness labels generated by author annotation?

  • How reliable are patch correctness labels inferred through automatically generated independent test suite?

In RQ1, by measuring inter-rater agreement as a proxy for annotation quality – as commonly done in the literature (Christopher et al., 2008; Damessie et al., 2017a) – we demonstrate that our gold set is on par with other high-quality gold sets. In the subsequent two RQs, we investigate the strengths and deficiencies of author and automated patch correctness annotation.

We summarize our contributions below:


  • We are the first to investigate the reliability of author and automated annotation for assessing patch correctness. To perform this assessment, we have created a gold set of labelled patches through a user study involving 35 professional developers. With this gold set, we highlight strengths and deficiencies of popular assessment methods employed by existing ASR studies.

  • Based on the implications of our findings, we provide several recommendations for future ASR studies to better deal with patch correctness validation. In particular, we find that automated annotation, despite being less effective than author annotation, can be used to augment author annotation and reduce the cost of manual patch correctness assessment.

The rest of the paper is organized as follows. Section 2 presents more information on various ASR techniques, existing methods used for patch correctness assessment, and best practice in gold set creation. Next, we describe details of our user study to collect a gold set of patch correctness labels in Section 3. Subsequently, we answer RQ1, RQ2, and RQ3 to assess the quality of our gold set, author annotation, and automated annotation in Sections 4, 5, and 6, respectively. Section 7 discusses implications of our findings, our post-study survey, and threats to validity. Section 8 surveys related work. We conclude and briefly describe future work in Section 9.

2. Background

In this section, we first present more information about the automated software repair (ASR) techniques used in our experiments, including GenProg (Le Goues et al., 2012), Kali (Qi et al., 2015), Nopol (Xuan et al., 2016), ACS (Xiong et al., 2017b), S3 (Le et al., 2017a), Angelix (Mechtaev et al., 2016), and Enumerative and CVC4 embedded in JFix (Le et al., pear). We subsequently elaborate on the methods that have been used for assessing patch correctness in ASR research. Finally, we discuss best practice in building gold sets.

ASR techniques: GenProg (Le Goues et al., 2012) is one of the first ASR techniques and sparked much of the interest in ASR. Given a buggy program and a set of test cases, at least one of which is failing, GenProg uses a number of mutation operators, such as statement deletion, insertion, and replacement, to create a large pool of repair candidates. It then uses genetic programming to evolve the buggy program until a candidate passing all tests is found. Kali (Qi et al., 2015) is a naive ASR technique that simply deletes statements identified as potentially buggy. Despite being simple, Kali has been shown to be as effective and efficient as GenProg. Nopol (Xuan et al., 2016) is a recently developed ASR technique that focuses on repairing only defective if-conditions. Nopol attempts to synthesize an if-condition expression that makes all tests pass by using program synthesis. In a similar vein, ACS (Xiong et al., 2017b) also focuses on synthesizing repairs for buggy if-conditions. Like Nopol, ACS uses program synthesis to synthesize repairs; unlike Nopol, ACS attempts to rank the fix candidates using various ranking functions. Angelix (Mechtaev et al., 2016), S3 (Le et al., 2017a), and JFix (Le et al., pear) use symbolic execution to infer specifications and various program synthesis techniques to synthesize repairs conforming to the inferred specifications.

Evaluation of ASR Generated Patches: Traditionally, test cases are used as the sole criterion for judging the correctness of machine-generated patches. Relying on the assumption that a patch that passes the repair test suite is correct, early repair techniques such as GenProg (Le Goues et al., 2012), AE (Weimer et al., 2013a), and RSRepair (Qi et al., 2014a) reported producing many such correct patches. However, recent studies have shown that this assumption does not hold in practice, since many patches that pass the repair test suite are nonetheless incorrect (Qi et al., 2015; Long and Rinard, 2016a). This shows that the repair test suite alone is a weak proxy for assessing patch correctness.
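
To make the overfitting problem concrete, consider the following minimal sketch; the class, condition, and test are hypothetical and invented for illustration, not taken from the studied benchmarks. A deletion-style patch (in the spirit of Kali) passes the weak repair test yet breaks behaviour that the test suite never exercises.

// Hypothetical illustration of an overfitting patch (not from the studied datasets).
public class Discount {
    // Intended behaviour: apply a 10% discount only to orders of 100 or more.
    // The buggy original used "total > 100" and missed the boundary case;
    // the machine-generated "repair" deletes the guard altogether.
    static double apply(double total) {
        // if (total > 100)   // buggy original condition
        // if (total >= 100)  // correct developer fix
        return total * 0.9;   // patched code: discount applied unconditionally
    }

    public static void main(String[] args) {
        // The repair test suite only covers the previously failing boundary case,
        // so the overfitting patch passes it.
        System.out.println(Math.abs(apply(100) - 90.0) < 1e-9); // true: repair test passes
        // Behaviour outside the test suite is broken by the patch:
        System.out.println(apply(50));                          // 45.0 instead of the expected 50.0
    }
}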

Motivated by the above concern, recent works have employed new methods to assess patch correctness: (1) author annotation, in which authors of repair techniques manually check the correctness of patches generated by their own and competing tools, see for example (Xiong et al., 2017b; Le et al., 2017a); (2) automated annotation by an independent test suite (ITS) generated by an automatic test case generation tool, see for example (Smith et al., 2015; Le et al., 2016b). Both methods assume that a reference (correct) implementation of the buggy program, which is used as a basis for comparison, is available. Since most ASR techniques try to fix real buggy versions of real programs, the reference implementations can be found in the version control systems of the corresponding projects.

Early work that uses automated annotation by automatically generated ITS, e.g., (Smith et al., 2015), uses a general-purpose automatic test generation tool such as KLEE (Cadar et al., 2008) to generate an ITS that maximizes the coverage of the reference implementation written in the C programming language. To automatically generate test cases for Java programs, Randoop (Pacheco et al., 2007) can be used to randomly generate sequences of method calls that create and mutate objects, plus an assertion about the result of a final method call. Recently, Xin and Reiss proposed DiffTGen, a test generation tool for Java programs specifically designed to generate tests that can identify incorrect patches generated by ASR tools (Xin and Reiss, 2017). DiffTGen attempts to generate test cases that cover the syntactic and semantic differences between the machine-patched and human-patched programs. If any such test case exposes a difference in the outputs of the two programs, the machine-generated patch is deemed incorrect, since it behaves differently from the corresponding ground truth human-patched program. DiffTGen has been shown to be able to identify incorrect patches produced by various state-of-the-art ASR tools such as GenProg (Le Goues et al., 2012), Kali (Qi et al., 2015), Nopol (Xuan et al., 2016), and HDRepair (Le et al., 2016c).
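
The following is a minimal sketch, in the spirit of such differential test generation, of what a witness test might look like; the class, methods, and values are hypothetical, and this is not DiffTGen's actual output format. The generated input takes its oracle from the human-patched (reference) program, so a machine patch that diverges on that input is exposed as incorrect.

// Hypothetical witness test illustrating ITS-based annotation (not DiffTGen output).
public class WitnessExample {
    // Human-patched (reference) version of the method under repair.
    static int clampHuman(int x) {
        return x < 0 ? 0 : x;
    }

    // Machine-patched version that overfits: correct for the repair test input
    // (x = -1) but wrong for other negative values.
    static int clampMachine(int x) {
        return x == -1 ? 0 : x;
    }

    public static void main(String[] args) {
        int input = -5;                   // input produced by the test generator
        int expected = clampHuman(input); // oracle taken from the reference program
        int actual = clampMachine(input); // behaviour of the machine-patched program
        System.out.println(expected == actual
                ? "no difference exposed"
                : "witness found: machine patch is incorrect");
    }
}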

Best practice in building gold sets: To build a gold set objectively, a common approach is to employ many independent annotators and measure inter-rater agreement as a proxy for annotation quality (Christopher et al., 2008; Dybkjaer et al., 2007). The information retrieval (IR) community, especially through the Text REtrieval Conference (TREC, http://trec.nist.gov/), has employed many annotators through large-scale collaborative efforts to annotate many document corpora for various retrieval tasks. Many past software engineering studies have also involved independent annotators to construct gold sets. Depending on the nature of the task, annotators include non-authors who are undergraduate/graduate students (Rastkar et al., 2010; Gachechiladze et al., 2017; Buse and Weimer, 2010; De Lucia et al., 2014; Zou et al., 2015) or professional developers (Ormandjieva et al., 2007; Treude et al., 2015; Rastkar et al., 2010).

3. User Study

We conducted a user study with 35 professional developers to collect correctness labels of patches. In this study, every developer is required to complete several tasks by judging whether patches generated by ASR tools are semantically equivalent to ground truth human patches.

Patch Dataset. Since the eventual goal of our study is to assess the reliability of author and automated annotations, we need a set of patches that have been labeled before by ASR tool authors and that can be used as input to automated test case generation tools designed for program repair. We find the sets of patches recently released by Liu et al. (Liu et al., 2017), Martinez et al. (Martinez et al., 2017), and Le et al. (Le et al., 2017a) to be suitable. Liu et al. and Martinez et al. label a set of 210 patches generated by ASR tools designed by their research groups (i.e., ACS (Xiong et al., 2017b) and Nopol (Xuan et al., 2016)) and their competitors (i.e., GenProg (Le Goues et al., 2012) and Kali (Qi et al., 2015)). Le et al. label a set of 79 patches generated by their ASR tool (i.e., S3 (Le et al., 2017a)) and its competitors (i.e., Angelix (Mechtaev et al., 2016), and Enumerative and CVC4 embedded in JFix (Le et al., pear)). The authors label these patches by manually comparing them with ground truth patches obtained from the version control systems of the corresponding buggy subject programs. (Since the authors of (Liu et al., 2017) and (Xiong et al., 2017b) overlap, we can use these labels to evaluate the reliability of author labelling.) These patches can be used as input to DiffTGen, a state-of-the-art test generation tool specifically designed to evaluate patch correctness (Xin and Reiss, 2017), and Randoop, a popular general-purpose test case generation tool (Pacheco et al., 2007).

GenProg Kali Nopol ACS S3 Angelix Enum CVC4
Incorrect 14 14 84 4 0 7 6 6
Correct 4 1 6 14 10 2 4 4
Unknown 2 2 5 0 0 0 0 0
Total 20 17 95 18 10 9 10 10
Table 1. Selected Patches and their Author Label

Due to resource constraints, i.e., only 35 professional developers agreed to spend an hour of their time on this user study, we cut the dataset down to 189 patches by randomly selecting patches from the original datasets. Details of the 189 selected patches are shown in Table 1.

Task Design. At the start of the experiment, every participant is required to read a tutorial that briefly explains automated program repair and what they need to do to complete the tasks. Afterwards, they can complete the tasks one-by-one through a web interface.

Figure 2 shows a screenshot of an example task that we give to our user study participants through a web interface. For each task, we provide a ground truth patch taken from the version control system of the corresponding buggy subject program, along with a patch generated by an automated program repair tool. We also provide additional resources, including the full source code files repaired by the patch, a link to the GitHub repository of the project, the outputs of the failing test cases (this information is generated using the Defects4J (Just et al., 2014) info command), and the source code of the failing test cases. Based on this information, participants are asked to evaluate the correctness of the patch by answering the question: Is the generated patch semantically equivalent to the correct patch? To answer this question, participants can choose one of the following options: “Yes”, “No”, or “I don’t know”. Finally, if they wish to, they can provide reasons that explain their decision. Our web interface records participants’ answers and the amount of time they need to complete each task.

Participants and Task Assignment. Thirty-three of the 35 professional developers participating in this study work for two large software development companies (named Company C1 and C2), while the other two work as engineers for an educational institution. Company C1 currently has more than 500 employees and Company C2 has more than 2,000 employees. Both companies have a large number of active projects that expose developers to various business domains and software engineering techniques. All 35 developers work on projects that use Java as the main programming language.

Figure 1. Distribution of participant work experience

Figure 1 shows the distribution of years of work experience of our participants. On average, the participants have 3.5 years of work experience. The two developers from the educational institution are very senior, having worked for 5.5 and 10 years, respectively. The most experienced developer from industry has worked for seven years, while some have worked for only one year. Based on their work experience, we group the participants into two groups: 20 junior developers and 15 senior developers.

We divided the 35 participants into seven groups. The ratio of junior and senior developers for each group was kept approximately the same. Each patch generated by program repair tools is labeled by five participants. Participants in the same group receive the same set of patches to label.

Figure 2. A sample task viewed through our web interface. (1) and (2) are the correct patch and the patch generated by an ASR tool; (3) and (4) are the links to source code files that contain the patches; (5) is the link to the corresponding project’s GitHub repository; (6) and (7) are the output of the failed test cases and their source files; (8) is the question we asked a participant to answer.

4. Assessing Independent Annotators’ Labels

Our user study presented in Section 3 was conducted to build a set of gold standard labels for machine-generated patches, which can then be used to assess the reliability of author and automated annotations. Before using the labels produced by our user study, we first need to ascertain their quality. Agreement among annotators is often used as a measure of quality (Christopher et al., 2008; Damessie et al., 2017b; Scholer et al., 2011). Thus, in this section, we investigate the degree to which the annotators agree with one another. This answers RQ1: Can independent annotators agree on patch correctness?

Methodology. To answer RQ1, we first compute some simple statistics highlighting the number of agreements and disagreements among annotators. We then calculate several well-accepted measures of inter-rater reliability. Finally, we perform some sanity checks to substantiate whether or not annotators are arbitrary in making their decisions.

Results. To recap, our annotators are 35 professional developers who are tasked to annotate 189 machine-generated patches. Each patch is annotated by five professional developers; each provides one of the following labels: incorrect, correct, or unknown. Table 2 summarizes the number of agreements and disagreements among annotators. The number of patches for which all developers agree on the label is 118 (62.4% of all patches); of these, 95 patches are labeled as incorrect and 23 as correct. Moreover, ignoring unknown labels, the number of patches for which the remaining annotators fully agree on their labels is 155 (82.0% of all patches); of these, 132 are labeled as incorrect and 23 as correct. Lastly, for 187 out of 189 patches (98.9% of all patches), there is a majority decision (i.e., most annotators agree on one label); of these, 152 and 35 patches are identified as incorrect and correct, respectively.

All Agree All Agree - Unk Majority Agree
Incorrect 95 132 152
Correct 23 23 35
Total 118 155 187
Table 2. Results of participant annotations. The first column gives the number of patches for which every developer agrees on the label (correct or incorrect). The second column gives the number of patches for which, after ignoring unknown labels, the remaining developers agree on the label. The last column gives the number of patches whose label can be determined by a majority vote among the developers' labels.
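
As a hedged illustration, the sketch below (our own formulation, not the authors' analysis code) shows how the three columns of Table 2 can be derived from the five labels collected per patch.

// Our own sketch of deriving the Table 2 columns from five annotator labels per patch.
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

public class LabelAggregation {
    enum Label { CORRECT, INCORRECT, UNKNOWN }

    // "All Agree": every annotator gives the same correct/incorrect label.
    static Optional<Label> allAgree(List<Label> labels) {
        Label first = labels.get(0);
        boolean same = labels.stream().allMatch(l -> l == first);
        return (same && first != Label.UNKNOWN) ? Optional.of(first) : Optional.empty();
    }

    // "All Agree - Unk": unknown votes are ignored; all remaining votes must agree.
    static Optional<Label> allAgreeIgnoringUnknown(List<Label> labels) {
        List<Label> known = labels.stream()
                .filter(l -> l != Label.UNKNOWN)
                .collect(Collectors.toList());
        return known.isEmpty() ? Optional.empty() : allAgree(known);
    }

    // "Majority Agree": the more frequent of correct/incorrect wins.
    static Optional<Label> majorityAgree(List<Label> labels) {
        long correct = labels.stream().filter(l -> l == Label.CORRECT).count();
        long incorrect = labels.stream().filter(l -> l == Label.INCORRECT).count();
        if (correct == incorrect) return Optional.empty();
        return Optional.of(correct > incorrect ? Label.CORRECT : Label.INCORRECT);
    }

    public static void main(String[] args) {
        List<Label> votes = Arrays.asList(Label.INCORRECT, Label.INCORRECT,
                Label.UNKNOWN, Label.INCORRECT, Label.INCORRECT);
        System.out.println(allAgree(votes));                // Optional.empty
        System.out.println(allAgreeIgnoringUnknown(votes)); // Optional[INCORRECT]
        System.out.println(majorityAgree(votes));           // Optional[INCORRECT]
    }
}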

We also compute several inter-rater reliability scores: mean pairwise Cohen's kappa (Christopher et al., 2008; Cohen, 1960) and Krippendorff's alpha (Krippendorff, 1970). For the former we consider three different ratings (i.e., correct, incorrect, and unknown), while the latter allows us to ignore unknown ratings (Krippendorff's alpha permits a different number of ratings for each data point). Inter-rater reliability scores measure how much homogeneity, or consensus, there is between raters. Their importance hinges on the fact that they represent the extent to which the data collected in the study are correct representations of the variables being measured. A low inter-rater reliability suggests that the rating scale used in the study is defective, that the raters need to be retrained for the rating task, or that the task is highly subjective. The higher the inter-rater reliability, the more reliable the data.
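
As a reminder of the underlying computation, the following is a minimal, hedged sketch of Cohen's kappa for one pair of raters (our own helper, not the paper's analysis script); the reported mean pairwise kappa averages this value over all annotator pairs.

// Our own sketch of pairwise Cohen's kappa: kappa = (po - pe) / (1 - pe),
// where po is the observed agreement and pe the agreement expected by chance.
public class CohenKappa {
    static double kappa(int[] rater1, int[] rater2, int numCategories) {
        int n = rater1.length;
        double agreements = 0;
        double[] count1 = new double[numCategories];
        double[] count2 = new double[numCategories];
        for (int i = 0; i < n; i++) {
            if (rater1[i] == rater2[i]) agreements++;
            count1[rater1[i]]++;
            count2[rater2[i]]++;
        }
        double po = agreements / n;
        double pe = 0;
        for (int c = 0; c < numCategories; c++) {
            pe += (count1[c] / n) * (count2[c] / n);
        }
        return (po - pe) / (1 - pe);
    }

    public static void main(String[] args) {
        // Hypothetical labels for six patches: 0 = incorrect, 1 = correct, 2 = unknown.
        int[] annotatorA = {0, 0, 1, 0, 2, 1};
        int[] annotatorB = {0, 0, 1, 1, 2, 1};
        System.out.println(kappa(annotatorA, annotatorB, 3)); // kappa for this pair
    }
}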

Score Range Interpretation
< 0.00 poor agreement
0.00 - 0.20 slight agreement
0.21 - 0.40 fair agreement
0.41 - 0.60 moderate agreement
0.61 - 0.80 substantial agreement
0.81 - 1.00 almost perfect agreement
Table 3. Interpretation of Inter-Rater Reliability Scores by Landis and Koch (Landis and Koch, 1977).

Table 3 shows the interpretation of reliability score values by Landis and Koch (Landis and Koch, 1977). It is worth noting that there is another interpretation of kappa values by Manning et al. (Christopher et al., 2008), which considers a kappa value falling between 0.67 and 0.8 to demonstrate fair agreement between raters – the second highest level of agreement in their interpretation. This level of inter-rater agreement is typical of popular datasets such as those used for TREC evaluations and medical IR collections (Christopher et al., 2008). (The Text REtrieval Conference (TREC), championed by the US National Institute of Standards and Technology (NIST) since 1992, provides benchmark datasets for various text retrieval tasks – see http://trec.nist.gov/data.html.)

The computed mean pairwise Cohen's kappa and Krippendorff's alpha for our data are 0.691 and 0.734 respectively, which indicate substantial agreement among participants and satisfy the standard normally met by quality benchmark datasets.

Figure 3. Time taken by annotators to decide whether a patch’s label is either known (confirmed as correct or incorrect) or unknown.

To further validate the annotations, we perform two sanity checks to substantiate whether or not annotators are arbitrary in their decisions:


  • First, we expect conscientious annotators to spend more time inspecting patches that are eventually labeled as unknown than other patches. Annotators who label patches as unknown without much thought would likely be making arbitrary decisions. Figure 3 depicts a box plot showing the time participants took on patches labeled as unknown versus other patches. It can be seen that participants took more time on the former set of patches. A Wilcoxon signed-rank test returns a p-value of less than 0.005, indicating a statistically significant difference. Moreover, Cliff's delta, a non-parametric effect size measure (Cliff defines a delta below 0.147 as negligible, between 0.147 and 0.33 as small, between 0.33 and 0.474 as medium, and above 0.474 as large (Cliff, 1993); see the computation sketch after this list), is 0.469 (medium).

  • Second, we expect conscientious annotators to spend more time inspecting difficult patches than easy ones. We consider disagreement among annotators as a proxy for patch difficulty. We compare the time taken by participants on patches for which there is complete agreement to those for which disagreement exists. Figure 4 shows a box plot indicating that participants spent more time on the disagreement cases. A Wilcoxon signed-rank test returns a p-value of less than 0.05, indicating a statistically significant difference. Moreover, Cliff's delta is 0.178 (small).
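
For reference, a minimal sketch of the Cliff's delta computation used above (our own helper with made-up timing values, not the study's analysis script):

// Our own sketch of Cliff's delta: delta = (#{x > y} - #{x < y}) / (m * n),
// computed over all pairs drawn from the two samples being compared.
public class CliffsDelta {
    static double delta(double[] x, double[] y) {
        long greater = 0, less = 0;
        for (double xi : x) {
            for (double yj : y) {
                if (xi > yj) greater++;
                else if (xi < yj) less++;
            }
        }
        return (double) (greater - less) / ((long) x.length * y.length);
    }

    public static void main(String[] args) {
        // Hypothetical completion times (in seconds) for two sets of patches.
        double[] unknownLabelled = {120, 150, 200, 90};
        double[] otherPatches = {60, 80, 100, 70};
        System.out.println(delta(unknownLabelled, otherPatches)); // 0.875: large effect
    }
}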

Figure 4. Time taken by annotators to decide a patch’s label for full-agreement and disagreement cases.

The above results substantiate the quality of our dataset. In the subsequent sections, which answer RQ2 and RQ3, we use two versions of our dataset, ALL-AGREE (see the "All Agree" column in Table 2) and MAJORITY-AGREE (see the "Majority Agree" column in Table 2), to assess the reliability of author and automated annotations.

5. Assessing Author Annotation

A number of studies proposing automated repair approaches evaluate the proposed approaches through manual annotation performed by the authors (Liu et al., 2017; Xiong et al., 2017b; Le et al., 2016d). Author subjectivity may cause bias, which can be a threat to the internal validity of a study. Author bias has been actively discussed, especially in the medical domain, e.g., (Vaccaro et al., 2011). Unfortunately, so far there has been no study that investigates the presence or absence of bias in author annotation and its impact on the validity of the labels in automated program repair. This section describes our effort to fill this gap by answering RQ2: How reliable is author annotation?

Methodology. Recall that our user study makes use of patches released by three research groups, Liu et al. (Liu et al., 2017), Martinez et al. (Martinez et al., 2017), and Le et al. (Le et al., 2017a), who created the program repair tools ACS, Nopol, and S3, respectively. The authors of each tool manually labeled the patches generated by their tool and its competing approaches. To answer RQ2, we compare the labels produced by the three research groups with those produced by our independent annotators, whose quality we validated in Section 4. We consider the ALL-AGREE and MAJORITY-AGREE datasets introduced in Section 4.

Indep Annotators-Authors All Agree Majority Agree
Same Incorrect-Incorrect 82 133
Same Correct-Correct 23 33
Different Incorrect-Correct 6 10
Different Correct-Incorrect 0 2
Different Incorrect-Unknown 7 9
Different Correct-Unknown 0 0
Total 118 187
Table 4. Results of labels by authors compared to independent annotators.

Results. Table 4 shows the detailed comparison between authors' labels and independent annotators' labels. For the ALL-AGREE dataset, authors' labels match independent annotators' labels (Same) for 105 out of 118 patches (89.0%). There are 13 patches for which authors' labels mismatch those of independent annotators (Different). Among these, 6 are identified by independent annotators as incorrect but by authors as correct (Incorrect-Correct). For the other 7 patches, authors' labels are unknown while independent annotators' labels are incorrect (Incorrect-Unknown). For the MAJORITY-AGREE dataset, 88.8% of the labels match. There are 21 mismatches; 10 are Incorrect-Correct cases, 2 are Correct-Incorrect cases, and 9 are Incorrect-Unknown cases. Figure 5 shows an example patch generated by Nopol (Xuan et al., 2016) with mismatched labels: it is labeled as correct by Martinez et al. and as incorrect by the independent annotators.

 1  @@ -115,9 +115,7 @@ public class StopWatch {
 2  public void stop() {
 3      if (this.runningState != STATE_RUNNING && this.runningState != STATE_SUSPENDED) {
 4          throw new IllegalStateException("...");
 5      }
 6  +   if (this.runningState == STATE_RUNNING)  // Developer patch
 7  +   if (-1 == stopTime)                      // Generated patch
 8          stopTime = System.currentTimeMillis();
 9      this.runningState = STATE_STOPPED;
10  }
Figure 5. An example of a patch that has mismatched labels. Martinez et al. identified the patch (at line 7) as correct, while independent annotators identified this patch as incorrect. The ground truth (developer) patch is shown at line 6.

We also compute the inter-rater reliability between authors' labels and the labels in the ALL-AGREE and MAJORITY-AGREE datasets. The Cohen's kappa values are 0.719 and 0.697 for the ALL-AGREE and MAJORITY-AGREE datasets, respectively (the corresponding Krippendorff's alpha values are 0.717 and 0.695). Comparing these scores with Landis and Koch's interpretation in Table 3, there is substantial agreement.

A majority (88.8-89.0%) of patch correctness labels produced by author annotation match those produced by independent annotators. Inter-rater reliability scores indicate a substantial agreement between author and independent annotator labels.

To better characterize the cases where author and independent annotator labels match (Same) and those where they do not (Different), we investigate the time that participants of our user study took to label the two sets of patches. Since the number of mismatches is smaller in the ALL-AGREE dataset, we focus on the MAJORITY-AGREE dataset. Figure 6 depicts a box plot showing the distribution of completion time for the two sets of patches. According to the figure, patches with matching labels took participants less time to label than those whose labels mismatched. A Wilcoxon signed-rank test returns a p-value of less than 0.05, indicating a statistically significant difference; Cliff's delta is 0.278 (small). Since task completion time can be used as a proxy for task difficulty (Wickens, 1991), we consider participant completion time as a proxy for the difficulty of assessing patch correctness. The result suggests that disagreements between authors and independent annotators occur for the more difficult cases.

Figure 6. Participant completion time for patches for which author and independent annotator labels match (Same) and those whose labels mismatch (Different)

6. Assessing Automated Annotation

In this research question, we investigate the reliability of automatically generated independent test suites (ITS) for annotating patch labels. ITS has been used as an objective proxy for patch correctness – a patch is deemed incorrect if it does not pass the ITS, and correct or generalizable otherwise (Smith et al., 2015; Le et al., 2016b). It is unequivocal that patches determined by an ITS to be incorrect are indeed incorrect. However, it is unclear whether an ITS can detect a large proportion of incorrect patches. Moreover, the extent to which correct (generalizable) patches determined by an ITS are indeed correct remains questionable. Thus, to assess the usefulness of ITS, we answer RQ3: How reliable is an automatically generated ITS in determining patch correctness?

Methodology: We employ the recently proposed test case generation tool DiffTGen by Xin and Reiss (Xin and Reiss, 2017) and Randoop (Pacheco et al., 2007) to generate ITS. To generate an ITS using DiffTGen and Randoop, the human-patched program is used as ground truth. For DiffTGen, we use its best configuration reported in (Xin and Reiss, 2017), allowing it to invoke Evosuite (Fraser and Arcuri, 2011) in 30 trials with the search time of each trial limited to 60 seconds. A machine-generated patch is identified as incorrect if there is a test in the DiffTGen-generated ITS that witnesses an output difference between the machine and human patches. For Randoop, we run it on the ground truth program with 30 different seeds, with each run limited to 5 minutes. A machine-generated patch is identified as incorrect if at least one test case in the Randoop-generated ITS exhibits different results on the machine-patched and human-patched (ground truth) programs, e.g., it fails on the machine-patched program but passes on the ground truth program, or vice versa. In this way, we allow both tools to generate multiple test suites. It is, however, worth noting that DiffTGen and Randoop are incomplete in the sense that they do not guarantee to generate test cases that witness incorrect patches.
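
The decision rule just described can be sketched abstractly as follows; this is our own formulation (types and names are hypothetical), not the tooling used in the study. A single behavioural difference between the two program versions on any ITS test suffices to label the patch incorrect, while the absence of such a witness leaves the label undetermined rather than proving correctness.

// Our own abstract sketch of ITS-based labelling (not the study's tooling).
import java.util.List;
import java.util.function.Predicate;

public class ItsLabeler {
    enum Label { INCORRECT, UNDETERMINED }   // an ITS alone can never prove correctness

    // Each test is modelled as a predicate over a program variant,
    // returning true when the test passes on that variant.
    static <P> Label label(List<Predicate<P>> its, P machinePatched, P humanPatched) {
        for (Predicate<P> test : its) {
            boolean passesOnMachine = test.test(machinePatched);
            boolean passesOnHuman = test.test(humanPatched);
            if (passesOnMachine != passesOnHuman) {
                return Label.INCORRECT;       // witness of a behavioural difference
            }
        }
        // No witness found: the patch may still be incorrect (the ITS is incomplete).
        return Label.UNDETERMINED;
    }
}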

We use the test cases generated by the two tools to automatically annotate the 189 patches and compare the resulting labels to those in the ALL-AGREE and MAJORITY-AGREE datasets created by our user study.

Results: Out of the 189 patches in our study, DiffTGen generates test cases that witness 27 incorrect (overfitting) patches. Details of these patches are shown in Table 6. The ALL-AGREE ground truth identifies 17 of these 27 patches as incorrect (the other 10 patches lie outside the ALL-AGREE dataset), while the MAJORITY-AGREE dataset identifies all of them as incorrect. Unfortunately, most of the patches labelled as incorrect in the ALL-AGREE (65 patches) and MAJORITY-AGREE (121 patches) datasets fail to be detected as such by the ITS generated by DiffTGen. Randoop performs similarly to DiffTGen: it identifies 31 patches as incorrect, all of which are also identified as incorrect in the MAJORITY-AGREE dataset. Note that DiffTGen and Randoop combined identify a total of 51 unique patches as incorrect.

In their studies, Smith et al. (Smith et al., 2015) and Le et al. (Le et al., 2016b) assume a patch is incorrect if it does not pass an ITS, and correct or generalizable otherwise. Using the same assumption to generate correctness labels, we compute the inter-rater reliability between the labels automatically produced by the ITS generated by DiffTGen and Randoop and the labels in the ALL-AGREE and MAJORITY-AGREE datasets. As readers may expect, the kappa values are very low, as shown in Table 5; e.g., the Cohen's kappa values when using the DiffTGen-generated ITS for ALL-AGREE and MAJORITY-AGREE are 0.078 and 0.075, respectively (the corresponding Krippendorff's alpha values are -0.32 and -0.336).

All Agree Majority Agree
DiffT Rand Comb DiffT Rand Comb
Cohen’s Kappa 0.078 0.073 0.158 0.075 0.072 0.146
Kripp’s Alpha -0.32 -0.3 -0.057 -0.336 -0.313 -0.097
Table 5. Kappa values when using DiffTGen, Randoop, and their combination to label patches in ALL-AGREE and MAJORITY-AGREE datasets.

The independent test suites generated by DiffTGen and Randoop can label only fewer than a fifth of the incorrect patches in the ALL-AGREE and MAJORITY-AGREE datasets as such.

We now compare the author labels discussed in Section 5 with the ITS labels. Table 6 shows the author labels of the 27 and 31 patches identified as incorrect by DiffTGen and Randoop, respectively. For these patches, the majority of the author labels match the ITS labels. However, there are three patches identified as incorrect by DiffTGen (Math_80 generated by Kali, Chart_3 generated by GenProg, and Math_80_2015 generated by Nopol) for which the author labels are “Unknown”. One patch identified as incorrect by Randoop (Math_73 generated by GenProg) is labelled as correct by the authors.

DiffTGen Randoop Annot Authors

Kali Time_4 Incorrect Incorrect Incorrect Incorrect
Math_32 Incorrect Incorrect Incorrect
Math_2 Incorrect Incorrect Incorrect
Math_80 Incorrect Incorrect Unknown
Math_95 Incorrect Incorrect Incorrect Incorrect
Math_40 Incorrect Incorrect Incorrect
Chart_13 Incorrect Incorrect Incorrect
Chart_26 Incorrect Incorrect Incorrect
Chart_15 Incorrect Incorrect Incorrect Incorrect
Chart_5 Incorrect Incorrect Incorrect Incorrect
GenProg Math_2 Incorrect Incorrect Incorrect
Math_8 Incorrect Incorrect Incorrect
Math_80 Incorrect Incorrect Incorrect
Math_81 Incorrect Incorrect Incorrect
Math_95 Incorrect Incorrect Incorrect Incorrect
Math_40 Incorrect Incorrect Incorrect
Math_73 Incorrect Incorrect Correct
Chart_1 Incorrect Incorrect Incorrect
Chart_3 Incorrect Incorrect Unknown
Chart_5 Incorrect Incorrect Incorrect Incorrect
Chart_15 Incorrect Incorrect Incorrect Incorrect
Nopol Math_33 Incorrect Incorrect Incorrect
Math_73_2017 Incorrect Incorrect Incorrect
Math_80_2017 Incorrect Incorrect Incorrect
Math_80_2015 Incorrect Incorrect Unknown
Math_97 Incorrect Incorrect Incorrect
Math_105 Incorrect Incorrect Incorrect
Time_16 Incorrect Incorrect Incorrect
Time_18 Incorrect Incorrect Incorrect
Chart_13_2017 Incorrect Incorrect Incorrect
Chart_13_2015 Incorrect Incorrect Incorrect
Chart_21_2017 Incorrect Incorrect Incorrect
Chart_21_2015 Incorrect Incorrect Incorrect
Closure_7 Incorrect Incorrect Incorrect
Closure_12 Incorrect Incorrect Incorrect
Closure_14 Incorrect Incorrect Incorrect
Closure_20 Incorrect Incorrect Incorrect
Closure_30 Incorrect Incorrect Incorrect
Closure_33 Incorrect Incorrect Incorrect
Closure_76 Incorrect Incorrect Incorrect
Closure_111 Incorrect Incorrect Incorrect
Closure_115 Incorrect Incorrect Incorrect
Closure_116 Incorrect Incorrect Incorrect
Closure_120 Incorrect Incorrect Incorrect
Closure_124 Incorrect Incorrect Incorrect
Closure_130 Incorrect Incorrect Incorrect
Closure_121 Incorrect Incorrect Incorrect
Mockito_38 Incorrect Incorrect Incorrect
Angelix Lang_30 Incorrect Incorrect Incorrect
CVC4 Lang_30 Incorrect Incorrect Incorrect
Enum Lang_30 Incorrect Incorrect Incorrect
Table 6. Labels by independent annotators (“Annot” column) and authors (“Authors” column) for patches identified as incorrect by the independent test suites (ITS) generated by DiffTGen or Randoop.

Finally, we investigate the difficulty of judging the correctness of patches that the DiffTGen- and Randoop-generated ITSs label as incorrect. To do so, we compare participant completion time for the set of 51 unique patches against the set of remaining patches. Figure 7 shows the time spent by participants labelling these two sets of patches; the distributions are roughly the same, and a Wilcoxon signed-rank test confirms that the difference is not statistically significant. Thus, patches that an ITS successfully labels as incorrect are not necessarily the ones that participants require more time to label manually.

Figure 7. Participant completion time for the 51 unique patches labelled by DiffTGen’s and Randoop’s ITSs as incorrect versus that for other patches.

7. Discussion

In this section, we first provide implications of our findings. We then discuss our post-study survey, in which we asked a number of independent annotators for rationales behind their patch correctness judgements. At the end of this section, we discuss some threats to validity.

7.1. Implications

To recap, we have gained insights into the reliability of patch correctness assessment by authors and by automatically generated independent test suites (ITS); each approach has its own advantages and disadvantages. Based on these insights, we provide several implications as follows.

Authors’ evaluation of patch correctness should be made publicly available to the community.

Liu et al., Martinez et al., and Le et al. released their patch correctness labels publicly (Liu et al., 2017; Martinez et al., 2017; Le et al., 2017a), which we are grateful for. We believe that considerable effort has been made by the authors to ensure the quality of the labels. Still, we notice that for slightly more than 10% of the patches, the authors' labels differ from the ones produced by multiple independent annotators. Thus, we encourage future ASR paper authors to release their datasets for public inspection. The public (including independent annotators) can then provide input on the labels and possibly update labels that may have been incorrectly assigned. Our findings here (e.g., that author annotations are fairly reliable) may not generalize to patches labelled by authors but not released publicly; it is possible that the quality of correctness labels for those patches is lower. Also, as criticized by Monperrus (Monperrus, 2014), the conclusiveness of the evaluation of techniques that keep patches and their correctness labels private is questionable.

Collaborative effort is needed to distribute the expensive cost of ASR evaluation.

In this study, we have evaluated the correctness of 189 automatically generated patches by involving independent annotators. We have shown that the quality of the resulting labels (measured using inter-rater reliability) is on par with high-quality text retrieval benchmarks (Christopher et al., 2008). Unfortunately, evaluation using independent annotators is expensive: to evaluate 189 patches, we needed 35 professional developers, each agreeing to spend up to an hour of their time. This process may not scale, especially considering the large number of new ASR techniques released in the literature every year. Thus, there is a need for a more collaborative effort to distribute the cost of ASR evaluation. One possibility is to organize a competition involving impartial industrial data owners (e.g., software development houses willing to share some of their closed bugs) who are willing to judge the correctness of generated patches. Similar competitions with industrial data owners have been held to advance various fields such as forecasting (http://www.cikm2017.org/CIKM_AnalytiCup_task1.html) and fraud detection (http://research.larc.smu.edu.sg/fdma2012/).

Figure 8. A machine-generated patch labeled by ITS as incorrect but labeled by author annotation as unknown.

Independent test suite (ITS) alone should not be used to evaluate the effectiveness of ASR.

Independent test suites (ITSs) generated by DiffTGen (Xin and Reiss, 2017) and Randoop (Pacheco et al., 2007) have been shown to be ineffective at annotating correctness labels for patches (see Section 6): fewer than a fifth of the incorrect patches are identified as such by the ITSs they generate. Based on the effectiveness of the state-of-the-art test generation tools for automatic repair that we assessed in this study, we believe that ITS alone should not be used for fully automated patch labeling. The subject of ITS generation for program repair is still new, though, and we encourage future studies to improve the quality of automatic test generation tools so that more incorrect patches can be detected. That being said, automated patch annotation may never be a silver bullet; the general problem of patch correctness assessment (judging the equivalence of a developer patch and an automatically generated patch) is a variant of the program equivalence problem, which is undecidable (Sipser, 1996).

Independent test suite, despite being less effective, can be used to augment author annotation.

As shown in Section 6, the ITSs generated by DiffTGen and Randoop identified four patches as incorrect for which the labels generated by author annotation were unknown or correct. An example of such a patch is shown in Figure 8; from the figure, we can see that it is hard to judge whether the patch is correct or incorrect. From this finding, we believe that ITS, despite being less effective than author annotation in identifying incorrect patches, can be used to augment author annotation by helping to resolve at least some ambiguous cases. Authors can run DiffTGen and Randoop to identify clear cases of incorrect patches; the remaining cases can then be judged manually (a sketch of this hybrid workflow is given below). The combination of author and automated annotation via ITS generation can more closely approximate the labels of multiple independent annotators while requiring less cost.
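
A hedged sketch of this hybrid workflow follows; it is our own formulation, and the ITS check below merely stands in for the tooling discussed in Section 6.

// Our own sketch of combining ITS-based filtering with manual author annotation.
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Predicate;

public class HybridAnnotation {
    enum Label { CORRECT, INCORRECT, UNKNOWN }

    // ITS check: true if some test behaves differently on the two program versions.
    static <P> boolean itsFindsWitness(List<Predicate<P>> its, P machinePatched, P humanPatched) {
        return its.stream().anyMatch(t -> t.test(machinePatched) != t.test(humanPatched));
    }

    // Clear-cut incorrect patches are resolved automatically; the rest fall back
    // to manual judgement (e.g., by the tool authors).
    static <P> Label annotate(List<Predicate<P>> its, P machinePatched, P humanPatched,
                              BiFunction<P, P, Label> manualJudgement) {
        if (itsFindsWitness(its, machinePatched, humanPatched)) {
            return Label.INCORRECT;
        }
        return manualJudgement.apply(machinePatched, humanPatched);
    }
}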

7.2. Post-Study Survey

We conducted a post-study survey to investigate why a developer chooses an answer different from the majority. Among the 189 patches, there are several for which the majority, but not all, of the participants agree on patch correctness. Among the participants annotating these patches, we selected 11 who answered differently from the majority and emailed them to get deeper insights into their judgments. In our email, we provided a link to the same web interface used in our user study to allow participants to revisit their decision for the patch in question. Note that we did not inform the participants that their answers differed from the majority. We received replies from 8 of the 11 participants (72.7% response rate).

We found that 5 of the 8 developers changed their correctness labels after looking into the patch again; their revised labels thus became consistent with the majority labels. The remaining three kept their labels: two judged two different patches as incorrect (while the majority labels are correct) and one judged a patch as correct (while the majority label is incorrect). These participants kept their decisions for different reasons: one was unsure of a complex expression involved in the patch, another highlighted a minor difference that others may consider ignorable, and the third viewed the generated and ground truth patches as having similar intentions. An excerpt of the patch in question for the last participant is shown in Figure 9.

Figure 9. An example of a patch discussed in the post-study survey.

7.3. Threats to Validity

Threats to internal validity. These threats relate to potential errors and biases in our study. The following are a few relevant threats that deserve further discussion:


  • There may be errors in the web interface that we provide to participants and in the code for analyzing the collected data. To reduce the possibility of errors in the web interface, we conducted a pilot study with a few graduate students and incorporated their feedback. We also performed a thorough check of our code.

  • Due to constrained resources (only 35 professional developers agreed to devote an hour of their time), we do not use all patches in the original datasets by Liu et al. (Liu et al., 2017), Martinez et al. (Martinez et al., 2017), and Le et al. (Le et al., 2017a). If the whole collection of patches were used, results might differ. To mitigate this threat, we randomly selected the patches included in this study while keeping the ratios of patches coming from different ASR tools approximately the same.

  • The professional developers included in our user study are not the original developers of the buggy code and ground truth patches. Unfortunately, since the original developer patches included in Liu et al.'s study were committed many years ago (the earliest in 2006), it is hard to get the original developers to participate in our study, and they may have forgotten the details of the patches. Additionally, since the patches are small, the task of comparing two patches and judging whether they are equivalent should be manageable for professional developers. Indeed, our respondents were able to provide definite labels for a majority of patches (only 44 out of 750 labels (5.9%) are unknown, while the rest are either incorrect or correct). To improve the reliability of the labels, we ask not one professional developer but five of them; as highlighted in Section 4, there is substantial agreement among participants, satisfying the standard followed by high-quality benchmark datasets. Furthermore, to help developers understand patches, we provide multiple resources including source code files, failing test cases, the GitHub link of the project, etc. A large number of past software engineering studies, e.g., (Gachechiladze et al., 2017; Buse and Weimer, 2010; Baysal et al., 2014; Ko et al., 2014; Daka et al., 2015; Ormandjieva et al., 2007), have also involved third-party labellers (who are not the content creators) to assign labels to data, and the same practice is followed in related areas such as information retrieval (Damessie et al., 2017a; Bailey et al., 2008). We also make the 189 patches and participants' responses available for public inspection (URL omitted for double-blind reviewing but to be made available later).

Threats to external validity. These threats relate to the generalizability of our results. The following are a few relevant threats that deserve further discussion:


  • In this study, we included 189 patches generated by 8 ASR tools to fix buggy code from 13 software projects. We believe this is a substantial number of patches generated by a substantial number of state-of-the-art ASR tools; past empirical studies on ASR, e.g., (Qi et al., 2015), included five tools and 55 patches from 105 bugs. Still, we acknowledge that results may differ if more patches from more projects and more ASR tools are considered.

  • We have included 35 professional developers in our user study. This number is larger than or similar to those considered in much prior work, e.g., (Kevic et al., 2015; Johnson et al., 2013; Rubin and Rinard, 2016). Admittedly, it is possible that results differ for other groups of developers. To reduce this threat, we selected developers from two large IT companies and a large educational institution, and included a mix of junior and senior developers.

Threats to construct validity. These threats relate to the suitability of our evaluation metrics. In this study, we use Krippendorff's alpha and mean pairwise Cohen's kappa to evaluate the reliability of the patch labels from independent annotators. We also use the two metrics to measure agreement between independent annotators' labels and those produced by author and automated annotations. These metrics are widely used in many research areas, e.g., information retrieval (Castillo et al., 2006; Meij, 2011; Amigó et al., 2013) and software engineering (Chaparro et al., 2017; Abdalkareem et al., 2017). Thus, we believe there is little threat to construct validity.

8. Related Work

Program repair. We briefly discuss repair techniques beyond those used in our study (e.g., GenProg (Le Goues et al., 2012), Kali (Qi et al., 2015), Nopol (Xuan et al., 2016), and ACS (Xiong et al., 2017b)), which have been described in Section 2. General program repair techniques can typically be divided into two main branches: heuristics-based and semantics-based repair. Heuristics-based repair techniques heuristically search for repairs, commonly via genetic programming. RSRepair (Qi et al., 2014b) and AE (Weimer et al., 2013b) replace the search strategy in GenProg with random and adaptive search strategies, respectively. PAR (Kim et al., 2013) generates repairs based on repair templates manually learned from human-written patches. Prophet (Long and Rinard, 2016b) and HDRepair (Le et al., 2016c) learn repair models from historical data to rank patches, preferring those that match frequent human fix patterns. Tan et al. propose anti-patterns to prevent heuristics-based repair tools from generating trivial repairs (Tan et al., 2016).

Semantics-based repair techniques, such as SemFix (Nguyen et al., 2013), DirectFix (Mechtaev et al., 2015), and Angelix (Mechtaev et al., 2016), synthesize repairs using symbolic execution and program synthesis. In a similar vein, S3 (Le et al., 2017a) additionally employs various measures of the syntactic and semantic distances between candidate fixes and the original program to rank the search space. Other semantics-based techniques include SPR (Long and Rinard, 2015), which targets defects in if-conditions, and Qlose (D'Antoni et al., 2016), which uses program execution traces as an additional criterion to rank patches and encodes the program repair problem into the program synthesis tool SKETCH (Solar-Lezama et al., 2005). SearchRepair (Ke et al., 2015) lies between heuristics- and semantics-based repair, using semantic search as its underlying mutation approach to produce higher-granularity, high-quality patches; however, it does not yet scale as well as other approaches. Le et al. proposed to combine both search- and semantics-based repair into a single approach (Le, 2016).

Empirical studies on patch correctness assessment. To assess patch correctness, two popular methods have been used: author annotation and automated annotation via independent test suites generated by automatic test generation tools. Qi et al. (Qi et al., 2015) empirically studied patches generated by GenProg (Le Goues et al., 2012), RSRepair (Qi et al., 2014a), and AE (Weimer et al., 2013a); they manually investigated the patches, wrote additional test cases, and reported the results of running the patches against these test cases. The authors of PAR (Kim et al., 2013) performed a user study on the acceptability of patches generated by their tool, employing 89 students and 164 developers to confirm that patches generated by PAR are more acceptable than those generated by GenProg. Monperrus (Monperrus, 2014) discusses the main evaluation criteria of automatic software repair, including understandability, correctness, and completeness, and suggests that repair techniques whose generated patches and correctness labels are kept private, such as PAR, are questionable. To avoid the potential bias of manual human investigation, Smith et al. use the automatic test case generation tool KLEE (Cadar et al., 2008) to generate independent test suites (ITS) that maximize coverage of the ground-truth program to assess machine-generated patches (Smith et al., 2015). Using ITS, they evaluate the effectiveness of GenProg, RSRepair (aka TrpAutoRepair), and AE on the IntroClass dataset (Le Goues et al., 2015) containing thousands of small programs. Our study differs from the studies above in that we objectively assess the reliability of author annotation and automated annotation.

Empirical studies on biases and reliability. A number of empirical studies have analyzed biases and reliability issues that affect how automated software engineering solutions are evaluated. Bird et al. highlighted that only a fraction of bug fixes are labeled in version control systems, which causes a systematic bias in the evaluation of defect prediction tools (Bird et al., 2009). Herzig et al. manually examined 7,000 reports from the issue tracking systems of open source projects and found that 33.8% of all bug reports were misclassified (Herzig et al., 2013). They showed that this misclassification biases defect prediction studies, since a substantial number of files are wrongly marked as defective. The goal of our study is similar to those of the above studies: we want to highlight and reduce bias in the evaluation of existing automated software engineering tools.

9. Conclusion and Future Work

In this paper, to assess the reliability of existing patch correctness assessment methods, we conducted a user study with 35 professional developers to construct a gold set of correctness labels for 189 patches generated by different ASR techniques. By measuring inter-rater agreement, which was found to be substantial and on par with other high-quality benchmarks, we validated the quality of the annotation labels in our gold set. We then compared our gold set with the labels produced by authors (i.e., Liu et al. (Liu et al., 2017), Martinez et al. (Martinez et al., 2017), and Le et al. (Le et al., 2017a)) and by independent test suites generated by DiffTGen (Xin and Reiss, 2017) and Randoop (Pacheco et al., 2007), and reported their strengths and deficiencies. In particular, we found that a majority (88.8-89.0%) of the patch correctness labels generated by authors match those produced by independent annotators. On the other hand, fewer than a fifth of the incorrect patches are labeled as such by the independent test suites (ITSs) generated by DiffTGen and Randoop. DiffTGen and Randoop can, however, generate ITSs that uncover multiple incorrect patches labeled as “unknown” or “correct” by authors. Based on our findings, we recommend that ASR authors release their patch correctness labels for public inspection. We also encourage more collaborative effort to distribute the expensive cost of ASR evaluation, especially through user studies like ours. Finally, we stress that ITSs alone should not be used to judge patch correctness; still, they can be used in conjunction with author annotation to help the latter produce labels that more closely approximate independent annotators’ labels.
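To make the agreement measure concrete, the snippet below illustrates Cohen's kappa (Cohen, 1960) for two annotators and binary correct/incorrect labels on a toy, made-up example; it illustrates the metric only, not the data or the exact coefficient reported above, which may involve more than two raters.

```java
public class InterRaterAgreement {

    /** labelsA[i] and labelsB[i] are the two annotators' labels for patch i (true = correct). */
    static double cohensKappa(boolean[] labelsA, boolean[] labelsB) {
        int n = labelsA.length;
        int agree = 0, aTrue = 0, bTrue = 0;
        for (int i = 0; i < n; i++) {
            if (labelsA[i] == labelsB[i]) agree++;
            if (labelsA[i]) aTrue++;
            if (labelsB[i]) bTrue++;
        }
        double observed = (double) agree / n;
        // Chance agreement: probability both say "correct" plus probability both say "incorrect".
        double pTrue = ((double) aTrue / n) * ((double) bTrue / n);
        double pFalse = ((double) (n - aTrue) / n) * ((double) (n - bTrue) / n);
        double expected = pTrue + pFalse;
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        // Toy example: 10 patches, the annotators disagree on two of them.
        boolean[] a = { true, true, true, false, false, true, false, true, false, true };
        boolean[] b = { true, true, false, false, false, true, false, true, true, true };
        System.out.printf("kappa = %.2f%n", cohensKappa(a, b));
    }
}
```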

In the future, we plan to expand our gold set by recruiting more professional developers and collecting more patches generated by additional ASR techniques through a large-scale collaborative effort among ASR researchers. We also plan to explore the possibility of organizing competitions with industrial data owners (e.g., our two industrial partners whose developers participated in this study) to further ASR research.

References

  • Abdalkareem et al. (2017) Rabe Abdalkareem, Olivier Nourry, Sultan Wehaibi, Suhaib Mujahid, and Emad Shihab. 2017. Why do developers use trivial packages? an empirical case study on npm. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 385–395.
  • Amigó et al. (2013) Enrique Amigó, Julio Gonzalo, and Felisa Verdejo. 2013. A general evaluation measure for document organization tasks. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM, 643–652.
  • Bailey et al. (2008) Peter Bailey, Nick Craswell, Ian Soboroff, Paul Thomas, Arjen P de Vries, and Emine Yilmaz. 2008. Relevance assessment: are judges exchangeable and does it matter. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 667–674.
  • Baysal et al. (2014) Olga Baysal, Reid Holmes, and Michael W Godfrey. 2014. No issue left behind: Reducing information overload in issue tracking. In Proceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 666–677.
  • Bird et al. (2009) Christian Bird, Adrian Bachmann, Eirik Aune, John Duffy, Abraham Bernstein, Vladimir Filkov, and Premkumar T. Devanbu. 2009. Fair and balanced?: bias in bug-fix datasets. In Proceedings of the 7th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2009, Amsterdam, The Netherlands, August 24-28, 2009. 121–130.
  • Britton et al. (2013) Tom Britton, Lisa Jeng, Graham Carver, Paul Cheak, and Tomer Katzenellenbogen. 2013. Reversible Debugging Software. Technical Report. University of Cambridge, Judge Business School.
  • Buse and Weimer (2010) Raymond PL Buse and Westley R Weimer. 2010. Learning a metric for code readability. IEEE Transactions on Software Engineering 36, 4 (2010), 546–558.
  • Cadar et al. (2008) Cristian Cadar, Daniel Dunbar, Dawson R Engler, et al. 2008. KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs.. In Symposium on Operating Systems Design and Implementation (OSDI). 209–224.
  • Carterette and Soboroff (2010) Ben Carterette and Ian Soboroff. 2010. The effect of assessor error on IR system evaluation. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. ACM, 539–546.
  • Castillo et al. (2006) Carlos Castillo, Debora Donato, Luca Becchetti, Paolo Boldi, Stefano Leonardi, Massimo Santini, and Sebastiano Vigna. 2006. A reference collection for web spam. In ACM Sigir Forum, Vol. 40. ACM, 11–24.
  • Chandra et al. (2011) Satish Chandra, Emina Torlak, Shaon Barman, and Rastislav Bodik. 2011. Angelic debugging. In International Conference on Software Engineering (ICSE’11). 121–130.
  • Chaparro et al. (2017) Oscar Chaparro, Jing Lu, Fiorella Zampetti, Laura Moreno, Massimiliano Di Penta, Andrian Marcus, Gabriele Bavota, and Vincent Ng. 2017. Detecting missing information in bug descriptions. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 396–407.
  • Christopher et al. (2008) Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
  • Cliff (1993) Norman Cliff. 1993. Dominance Statistics: Ordinal Analyses to Answer Ordinal Questions. Psychological Bulletin 114, 3 (1993), 494.
  • Cohen (1960) Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement 20, 1 (1960), 37–46.
  • Daka et al. (2015) Ermira Daka, José Campos, Gordon Fraser, Jonathan Dorn, and Westley Weimer. 2015. Modeling readability to improve unit tests. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 107–118.
  • Damessie et al. (2017a) Tadele T Damessie, Thao P Nghiem, Falk Scholer, and J Shane Culpepper. 2017a. Gauging the Quality of Relevance Assessments using Inter-Rater Agreement. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1089–1092.
  • Damessie et al. (2017b) Tadele Tedla Damessie, Thao P. Nghiem, Falk Scholer, and J. Shane Culpepper. 2017b. Gauging the Quality of Relevance Assessments using Inter-Rater Agreement. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017. 1089–1092.
  • D’Antoni et al. (2016) Loris D’Antoni, Roopsha Samanta, and Rishabh Singh. 2016. Qlose: Program repair with quantitative objectives. In International Conference on Computer Aided Verification (CAV). Springer, 383–401.
  • De Lucia et al. (2014) Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, Annibale Panichella, and Sebastiano Panichella. 2014. Labeling source code with information retrieval methods: an empirical study. Empirical Software Engineering 19, 5 (2014), 1383–1420.
  • Dybkjaer et al. (2007) Laila Dybkjaer, Holmer Hemsen, and Wolfgang Minker. 2007. Evaluation of Text and Speech Systems (1st ed.). Springer Publishing Company, Incorporated.
  • Fraser and Arcuri (2011) Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: automatic test suite generation for object-oriented software. In SIGSOFT/FSE’11 19th ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE-19) and ESEC’11: 13th European Software Engineering Conference (ESEC-13), Szeged, Hungary, September 5-9, 2011. 416–419. https://doi.org/10.1145/2025113.2025179
  • Gachechiladze et al. (2017) Daviti Gachechiladze, Filippo Lanubile, Nicole Novielli, and Alexander Serebrenik. 2017. Anger and its direction in Apache Jira developer comments. In Proceedings of the International Conference on Software Engineering (ICSE).
  • Herzig et al. (2013) Kim Herzig, Sascha Just, and Andreas Zeller. 2013. It’s not a bug, it’s a feature: how misclassification impacts bug prediction. In 35th International Conference on Software Engineering, ICSE ’13, San Francisco, CA, USA, May 18-26, 2013. 392–401.
  • Johnson et al. (2013) Brittany Johnson, Yoonki Song, Emerson Murphy-Hill, and Robert Bowdidge. 2013. Why don’t software developers use static analysis tools to find bugs?. In Software Engineering (ICSE), 2013 35th International Conference on. IEEE, 672–681.
  • Just et al. (2014) René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In International Symposium on Software Testing and Analysis (ISSTA ’14). 437–440.
  • Ke et al. (2015) Yalin Ke, Kathryn T. Stolee, Claire Le Goues, and Yuriy Brun. 2015. Repairing Programs with Semantic Code Search. In International Conference on Automated Software Engineering (ASE). 295–306.
  • Kevic et al. (2015) Katja Kevic, Braden M. Walters, Timothy R. Shaffer, Bonita Sharif, David C. Shepherd, and Thomas Fritz. 2015. Tracing software developers’ eyes and interactions for change tasks. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, Bergamo, Italy, August 30 - September 4, 2015. 202–213.
  • Kim et al. (2013) Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013. Automatic patch generation learned from human-written patches. In International Conference on Software Engineering (ICSE ’13). 802–811.
  • Ko et al. (2014) Andrew J Ko, Bryan Dosono, and Neeraja Duriseti. 2014. Thirty years of software problems in the news. In Proceedings of the 7th International Workshop on Cooperative and Human Aspects of Software Engineering. ACM, 32–39.
  • Krippendorff (1970) Klaus Krippendorff. 1970. Estimating the Reliability, Systematic Error, and Random Error of Interval Data. Educational and Psychological Measurement 30, 1 (1970), 61–70.
  • Landis and Koch (1977) J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. biometrics (1977), 159–174.
  • Le (2016) Xuan-Bach D Le. 2016. Towards efficient and effective automatic program repair. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 876–879.
  • Le et al. (2017a) Xuan Bach Dinh Le, Duc Hiep Chu, David Lo, Claire Le Goues, and Willem Visser. 2017a. S3: Syntax- and semantic-guided repair synthesis via programming by example. In Foundations of Software Engineering (FSE). ACM.
  • Le et al. (to appear) Xuan-Bach D Le, Duc-Hiep Chu, David Lo, Claire Le Goues, and Willem Visser. 2017 (to appear). JFIX: Semantics-Based Repair of Java Programs via Symbolic PathFinder. In International Symposium on Software Testing and Analysis (ISSTA’17).
  • Le et al. (2016a) Xuan Bach D. Le, Quang Loc Le, David Lo, and Claire Le Goues. 2016a. Enhancing Automated Program Repair with Deductive Verification. In International Conference on Software Maintenance and Evolution (ICSME). 428–432.
  • Le et al. (2015) Xuan-Bach D Le, Tien-Duy B Le, and David Lo. 2015. Should fixing these failures be delegated to automated program repair?. In International Symposium on Software Reliability Engineering (ISSRE). 427–437.
  • Le et al. (2016b) Xuan-Bach D Le, David Lo, and Claire Le Goues. 2016b. Empirical study on synthesis engines for semantics-based program repair. In International Conference on Software Maintenance and Evolution (ICSME’16). 423–427.
  • Le et al. (2016c) Xuan Bach D Le, David Lo, and Claire Le Goues. 2016c. History driven program repair. In International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 213–224.
  • Le et al. (2017b) Xuan Bach D Le, Ferdian Thung, David Lo, and Claire Le Goues. 2017b. Overfitting in Semantics-based Automated Program Repair. Empirical Software Engineering Journal (2017).
  • Le et al. (2016d) Xuan-Bach D. Le, David Lo, and Claire Le Goues. 2016d. History Driven Program Repair. In IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering, SANER 2016, Suita, Osaka, Japan, March 14-18, 2016 - Volume 1. 213–224.
  • Le Goues et al. (2012) Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. 2012. A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. In International Conference on Software Engineering (ICSE’12). 3–13.
  • Le Goues et al. (2015) Claire Le Goues, Neal Holtschulte, Edward K Smith, Yuriy Brun, Premkumar Devanbu, Stephanie Forrest, and Westley Weimer. 2015. The ManyBugs and IntroClass benchmarks for automated repair of C programs. Transactions on Software Engineering (TSE) 41, 12 (Dec. 2015), 1236–1256.
  • Liu et al. (2017) Xinyuan Liu, Muhan Zeng, Yingfei Xiong, Lu Zhang, and Gang Huang. 2017. Identifying Patch Correctness in Test-Based Automatic Program Repair. arXiv preprint arXiv:1706.09120 (2017).
  • Long and Rinard (2015) Fan Long and Martin Rinard. 2015. Staged Program Repair with Condition Synthesis. In European Software Engineering Conference and International Symposium on Foundations of Software Engineering (ESEC/FSE). 166–178.
  • Long and Rinard (2016a) Fan Long and Martin Rinard. 2016a. An analysis of the search spaces for generate and validate patch generation systems. In International Conference on Software Engineering (ICSE). ACM, 702–713.
  • Long and Rinard (2016b) Fan Long and Martin Rinard. 2016b. Automatic Patch Generation by Learning Correct Code. In Symposium on Principles of Programming Languages (POPL). 298–312.
  • Martinez et al. (2017) Matias Martinez, Thomas Durieux, Romain Sommerard, Jifeng Xuan, and Martin Monperrus. 2017. Automatic repair of real bugs in java: a large-scale experiment on the defects4j dataset. Empirical Software Engineering 22, 4 (2017), 1936–1964. https://doi.org/10.1007/s10664-016-9470-4
  • Mechtaev et al. (2015) Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2015. Directfix: Looking for simple program repairs. In International Conference on Software Engineering (ICSE). IEEE Press, 448–458.
  • Mechtaev et al. (2016) Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2016. Angelix: Scalable multiline program patch synthesis via symbolic analysis. In International Conference on Software Engineering (ICSE). IEEE, 691–701.
  • Meij (2011) Edgar Meij. 2011. Combining concepts and language models for information access. In SIGIR Forum, Vol. 45. 80.
  • Monperrus (2014) Martin Monperrus. 2014. A critical review of automatic patch generation learned from human-written patches: essay on the problem statement and the evaluation of automatic software repair. In Proceedings of the 36th International Conference on Software Engineering. ACM, 234–242.
  • Nguyen et al. (2013) Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. 2013. Semfix: Program repair via semantic analysis. In International Conference on Software Engineering (ICSE). IEEE Press, 772–781.
  • Ormandjieva et al. (2007) Olga Ormandjieva, Ishrar Hussain, and Leila Kosseim. 2007. Toward a text classification system for the quality assessment of software requirements written in natural language. In Fourth international workshop on Software quality assurance: in conjunction with the 6th ESEC/FSE joint meeting. ACM, 39–45.
  • Pacheco et al. (2007) Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. 2007. Feedback-Directed Random Test Generation. In 29th International Conference on Software Engineering (ICSE 2007), Minneapolis, MN, USA, May 20-26, 2007. 75–84. https://doi.org/10.1109/ICSE.2007.37
  • Qi et al. (2014a) Yuhua Qi, Xiaoguang Mao, Yan Lei, Ziying Dai, and Chengsong Wang. 2014a. The strength of random search on automated program repair. In Proceedings of the 36th International Conference on Software Engineering. ACM, 254–265.
  • Qi et al. (2014b) Yuhua Qi, Xiaoguang Mao, Yan Lei, Ziying Dai, and Chengsong Wang. 2014b. The strength of random search on automated program repair. In International Conference on Software Engineering (ICSE). ACM, 254–265.
  • Qi et al. (2015) Zichao Qi, Fan Long, Sara Achour, and Martin Rinard. 2015. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In International Symposium on Software Testing and Analysis. ACM, 24–36.
  • Rastkar et al. (2010) Sarah Rastkar, Gail C Murphy, and Gabriel Murray. 2010. Summarizing software artifacts: a case study of bug reports. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1. ACM, 505–514.
  • Rubin and Rinard (2016) Julia Rubin and Martin Rinard. 2016. The challenges of staying together while moving fast: An exploratory study. In Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 982–993.
  • Scholer et al. (2011) Falk Scholer, Andrew Turpin, and Mark Sanderson. 2011. Quantifying test collection quality based on the consistency of relevance judgements. In Proceeding of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, 2011. 1063–1072.
  • Sipser (1996) Michael Sipser. 1996. Introduction to the Theory of Computation (1st ed.). International Thomson Publishing.
  • Smith et al. (2015) Edward K Smith, Earl T Barr, Claire Le Goues, and Yuriy Brun. 2015. Is the cure worse than the disease? overfitting in automated program repair. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 532–543.
  • Solar-Lezama et al. (2005) Armando Solar-Lezama, Rodric Rabbah, Rastislav Bodík, and Kemal Ebcioğlu. 2005. Programming by sketching for bit-streaming programs. In ACM SIGPLAN Notices. ACM, 281–294.
  • Tan et al. (2016) Shin Hwei Tan, Hiroaki Yoshida, Mukul R Prasad, and Abhik Roychoudhury. 2016. Anti-patterns in search-based program repair. In International Symposium on Foundations of Software Engineering. ACM, 727–738.
  • Tassey (2002) G. Tassey. 2002. The economic impacts of inadequate infrastructure for software testing. Planning Report, NIST (2002).
  • Treude et al. (2015) Christoph Treude, Martin P Robillard, and Barthélémy Dagenais. 2015. Extracting development tasks to navigate software documentation. IEEE Transactions on Software Engineering 41, 6 (2015), 565–581.
  • Vaccaro et al. (2011) Alexander R. Vaccaro, Alpesh Patel, and Charles Fisher. 2011. Author Conflict and Bias in Research: Quantifying the Downgrade in Methodology. Spine 30, 14 (2011).
  • Weimer et al. (2013a) Westley Weimer, Zachary P Fry, and Stephanie Forrest. 2013a. Leveraging program equivalence for adaptive program repair: Models and first results. In Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 356–366.
  • Weimer et al. (2013b) Westley Weimer, Zachary P Fry, and Stephanie Forrest. 2013b. Leveraging program equivalence for adaptive program repair: Models and first results. In International Conference on Automated Software Engineering (ASE). 356–366.
  • Wickens (1991) Christopher D Wickens. 1991. Processing resources and attention. Multiple-task performance 1991 (1991), 3–34.
  • Xin and Reiss (2017) Qi Xin and Steven P Reiss. 2017. Identifying test-suite-overfitted patches through test case generation. In International Symposium on Software Testing and Analysis. ACM, 226–236.
  • Xiong et al. (2017a) Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. 2017a. Precise condition synthesis for program repair. In International Conference on Software Engineering (ICSE). IEEE Press, 416–426.
  • Xiong et al. (2017b) Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. 2017b. Precise condition synthesis for program repair. In International Conference on Software Engineering. IEEE Press, 416–426.
  • Xuan et al. (2016) Jifeng Xuan, Matias Martinez, Favio Demarco, Maxime Clément, Sebastian Lamelas, Thomas Durieux, Daniel Le Berre, and Martin Monperrus. 2016. Nopol: Automatic Repair of Conditional Statement Bugs in Java Programs. Transactions on Software Engineering (2016).
  • Zou et al. (2015) Yanzhen Zou, Ting Ye, Yangyang Lu, John Mylopoulos, and Lu Zhang. 2015. Learning to rank for question-oriented software text retrieval (t). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 1–11.