How to trust auto-generated code patches? A developer survey and empirical assessment of existing program repair tools

08/30/2021 ∙ by Yannic Noller, et al. ∙ National University of Singapore ∙ Association for Computing Machinery

Automated program repair is an emerging technology that seeks to automatically rectify bugs and vulnerabilities using learning, search, and semantic analysis. Trust in automatically generated patches is necessary for achieving greater adoption of program repair. Towards this goal, we survey more than 100 software practitioners to understand the artifacts and setups needed to enhance trust in automatically generated patches. Based on the feedback from the survey on developer preferences, we quantitatively evaluate existing test-suite based program repair tools. We find that they cannot produce high-quality patches within a top-10 ranking and an acceptable time period of 1 hour. The developer feedback from our qualitative study and the observations from our quantitative examination of existing repair tools point to actionable insights to drive program repair research. Specifically, we note that producing repairs within an acceptable time-bound is very much dependent on leveraging an abstract search space representation of a rich enough search space. Moreover, while additional developer inputs are valuable for generating or ranking patches, developers do not seem to be interested in a significant human-in-the-loop interaction.


1. Introduction

Automated program repair technologies (Goues et al., 2019) are receiving increased attention. In recent times, program repair has found its way into the automated fixing of mobile apps in the SapFix project at Facebook (Marginean et al., 2019) and into automated repair bots, as evidenced by the Repairnator project (Urli et al., 2018), and it has found a certain acceptance in companies such as Bloomberg (Kirbas et al., 2021). While all of these are promising, large-scale adoption of program repair, where it is well integrated into our programming environments, remains considerably out of reach. In this article, we reflect on the impediments to the usage of program repair by developers. The adoption of program repair faces many challenges, such as scalability, applicability, and developer acceptability. Much of the research on program repair has focused on scalability to large programs and to large search spaces (Long and Rinard, 2016; Mechtaev et al., 2016; Gao et al., 2019; Marginean et al., 2019). Similarly, there have been various works on generating multi-line fixes (Mechtaev et al., 2016; Gao et al., 2021) and on transplanting patches from one version to another (Shariffdeen et al., 2021), covering various use cases or scenarios of program repair.

Surprisingly, there is very little literature or systematic study, from either academia or industry, on developer trust in program repair. In particular, what changes do we need to bring into the program repair process so that it becomes viable to have conversations on its wide-scale adoption? Part of this trust gulf comes from a lack of specifications: since the intended behavior of the program is not formally documented, it is hard to trust that the automatically generated patches meet this intended behavior. Overall, we seek to examine whether the developers' reluctance to use program repair may partially stem from not relying on automatically generated code. This can have profound implications because of recent developments in AI-based pair programming (e.g., GitHub Copilot, https://copilot.github.com/), which holds out the promise that significant parts of future coding will be accomplished via automated code generation.

In this article, we specifically study the issues involved in enhancing developer trust in automatically generated patches. Towards this goal, we first settle on the research questions related to developer trust in automatically generated patches. These questions are divided into two categories: (a) the expectations of developers from automatic repair technologies, and (b) the possible shortfall of existing program repair technologies with respect to these expectations. To understand the developer expectations from program repair, we outline the following research questions.

  1. To what extent are the developers ready to accept and apply automated program repair (henceforth called APR)?

  2. Can software developers provide additional inputs that would cause higher trust in generated patches? If yes, what kind of inputs can they provide?

  3. What evidence from APR will increase developer trust in the patches produced?

For a comprehensive assessment of the research questions, we engage in both qualitative and quantitative studies. Our assessment of the questions primarily comes in three parts. To understand the developer expectations from program repair, we conduct a detailed survey (with 35 questions) among more than 100 professional software practitioners. Most of our survey respondents are developers, with a few coming from more senior roles such as architects. The survey results amount to both quantitative and qualitative inputs on the developer expectations since we curate and analyze respondents’ comments on topics such as desired evidence from automated repair techniques. Based on the survey findings, we note that developers are largely open-minded in terms of trying out a small number of patches (no more than 10) from automated repair techniques, as long as these patches are produced within a reasonable time, say less than 1 hour. Furthermore, the developers are open to receiving specifications from the program repair method (amounting to evidence of patch correctness). They are also open-minded in terms of providing additional specifications to drive program repair. The most common specifications the developers are ready to give and receive are tests.

Based on the comments received from survey participants, we then conduct a quantitative comparison of certain well-known program repair tools on the widely used ManyBugs benchmark (Le Goues et al., 2015). To understand the possible deficiencies of existing program repair techniques with respect to the developer expectations found in the survey, we formulate the following research questions.

  1. Can existing APR techniques pinpoint high-quality patches in the top-ranking (e.g., among top-10) patches within a tolerable time limit (e.g., 1 hour)?

  2. What is the impact of additional inputs (say, fix locations and additional passing test cases) on the efficacy of APR?

We note that many of the existing papers on program repair use liberal timeout periods to generate repairs, while in our experiments the timeout is strictly capped at one hour. We also restrict ourselves to examining the first few patches, and we examine the impact of fix localization by either providing or withholding the developer fix location. Based on a quantitative comparison of the well-known repair tools Angelix (Mechtaev et al., 2016), CPR (Shariffdeen et al., 2021), GenProg (Le Goues et al., 2012), Prophet (Long and Rinard, 2016), and Fix2Fit (Gao et al., 2019), we conclude that the search space representation has a significant role in deriving plausible/correct patches within an acceptable time period. In other words, an abstract representation of the search space (aided by constraints that are managed efficiently, or by program equivalence relations) is at least as critical as a smart search algorithm for navigating the patch space. We discuss how the tools can be improved to meet developer expectations, either by achieving compilation-free repair or by navigating/suggesting abstract patches with the help of simple constraints (such as interval constraints).

Last but not least, we note that program repair can be seen as automated code generation at a micro-scale. By studying the trust issues in automated repair, we can also obtain an initial understanding of trust enhancement in automatically generated code.

2. Specifications in Program Repair

The goal of APR is to correct buggy programs to satisfy given specifications. In this section, we review these specifications and discuss how they can impact patch quality.

Test Suites as Specification

APR techniques such as GenProg (Le Goues et al., 2012) and Prophet (Long and Rinard, 2016) treat the test suite as the correctness specification. The test suite usually includes a set of passing tests and at least one failing test. The repair goal is to correct the buggy program so that it passes the entire given test suite. Although test suites are widely available, they are usually incomplete specifications that capture only part of the intended program behavior. Hence, an automatically generated patch may overfit the tests, meaning that the patched program may still fail on program inputs outside the given tests. For instance, the following is a buggy implementation that copies characters from the source array src to the destination array dest and returns the number of copied characters. A buffer overflow happens at line 6 when the size of src or dest is less than n. Taking the following three tests (one of which triggers the bug) as specification, a produced patch can make the program pass the given tests (an illustrative overfitting patch is sketched after the test table below). Obviously, such a patched program is still buggy on inputs outside the given tests.

1   int lenStrncpy(char src[], char dest[], int n) {
2       if (src == NULL || dest == NULL)
3           return 0;
4       int index = -1;
5       while (++index < n)
6           dest[index] = src[index]; // buffer overflow
7       return index;
8   }
Type    | src  | dest | n | Output  | Expected Output
Passing | SOF  | COM  | 3 | 3       | 3
Passing | DHT  | APP0 | 3 | 3       | 3
Failing | APP0 | DQT  | 4 | *crash* | 3
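
For illustration, here is one possible overfitting patch, shown as a minimal sketch (our own example; the concrete patch produced by a repair tool may differ). It passes all three tests above by capping the copy at the largest length exercised by the tests, yet it remains wrong for legitimate inputs with larger buffers and n greater than 3.

#include <stddef.h>

// Illustrative overfitting patch: passes the three tests above, but breaks
// the function for valid inputs where more than 3 characters should be copied.
int lenStrncpy_overfit(char src[], char dest[], int n) {
    if (src == NULL || dest == NULL)
        return 0;
    int index = -1;
    while (++index < n && index < 3)   // "&& index < 3" added by the patch
        dest[index] = src[index];
    return index;
}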

Constraints as Specification

Instead of relying on tests, another line of APR research, e.g., ExtractFix (Gao et al., 2021) and CPR (Shariffdeen et al., 2021), takes constraints as correctness specifications. Constraints have the potential to represent a range of inputs or even the whole input space. Driven by constraints, the goal of APR is to patch the program to satisfy the constraints. However, different from test suites, constraints are not always available in practice; for this reason, techniques like Angelix (Mechtaev et al., 2016) and SemFix (Nguyen et al., 2013) take tests as specifications but extract constraints from the tests. Certain existing APR techniques take as input coarse-grained constraints, such as assertions or crash-free constraints. For instance, ExtractFix relies on predefined templates to infer constraints that can completely fix vulnerabilities. For the above example, according to the template for buffer overflows, the inferred constraint requires that every access dest[index] (and src[index]) at line 6 stays within the bounds of the respective buffer. Once the patched program satisfies this constraint, it is guaranteed that the buffer overflow is completely fixed. Guarantees from such fixing of overflows/crashes do not amount to a full functional correctness guarantee of the fixed program.
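
As a minimal sketch of what such a crash-freeness constraint amounts to in this example (the names src_size and dest_size are our own, introduced only for illustration; ExtractFix derives the bounds from its templates rather than from extra parameters), the constraint can be read as an assertion that must hold before every copy at line 6:

#include <assert.h>

// Illustrative only: the inferred crash-freeness constraint as an executable check.
void checked_copy(char src[], int src_size, char dest[], int dest_size, int index) {
    assert(index < dest_size && index < src_size);  // access stays within both buffers
    dest[index] = src[index];
}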

Code Patterns as Specification

Besides test suites and constraints, code patterns can also serve as specifications for repair systems. Specifically, given a buggy program that violates a code pattern, the repair goal is to correct the program to satisfy the rules defined by the code pattern. Code patterns can be manually defined (Tan et al., 2016), derived from static analyzers (van Tonder and Goues, 2018), or automatically mined from large code repositories (Bader et al., 2019; Bavishi et al., 2019). Similar to inferred constraints, code patterns cannot ensure functional correctness.
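
For illustration, consider a minimal sketch of a pattern-guided fix (a generic example of the idea, not taken from any of the cited tools): a rule such as "a pointer returned by malloc must be checked before it is dereferenced" can be encoded as a before/after template, and a violating program is rewritten into the "after" form.

#include <stdlib.h>

// Pattern (sketch): guard dereferences of possibly-NULL allocations.
//   before:  p = malloc(...); ... *p = v;
//   after:   p = malloc(...); if (p == NULL) return error; ... *p = v;
int store_value(int v) {
    int *slot = malloc(sizeof *slot);
    if (slot == NULL)      // guard inserted by the pattern-based fix
        return -1;
    *slot = v;             // original dereference, now guarded
    free(slot);
    return 0;
}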

3. Survey Methodology

Since constructing formal program specifications is notoriously difficult, the specifications used by APR tools cannot ensure patch correctness. Unreliable, overfitting patches cause developers to lose trust in APR tools. This motivates us to survey developers on how APR can be enhanced to gain their trust.

We designed and conducted a survey with software practitioners, specifically to answer the first three research questions (RQ1-3). In June 2021, we distributed a questionnaire to understand how developers envision the usage of automated program repair and what can be provided to increase trust in automatically generated patches. Note that we followed our institutional guidelines and received approval from the Institutional Review Board (IRB) of our organization, prior to administering the survey.

Survey Instrument

We asked in total 35 questions about how trustworthy APR can be deployed in practice. Our questions are structured into six categories:

  1. Usage of APR (RQ1): whether and how developers would engage with APR.

  2. Availability of inputs/specifications (RQ2): what kind of input artifacts developers can provide for APR techniques.

  3. Impact on trust (RQ2): how additional input artifacts would impact the trust in generated patches.

  4. Explanations (RQ3): what kind of evidence/explanation developers expect for auto-generated patches.

  5. Usage of APR side-products (RQ3): what side-products of APR are useful for developers, e.g., for manual bug-fixing.

  6. Background: the role and experience of the participants in the software development process.

C1 provides insights for RQ1, C2 and C3 for RQ2, and C4 and C5 for RQ3. The questions were a combination of open-ended questions like "How would you like to engage with an APR tool?" and closed-ended questions like "Would it increase your trust in auto-generated patches if additional artifacts such as tests/assertions are used during patching?" with Multiple Choice answers or a 5-point Likert scale. The questionnaire itself was created and deployed with Microsoft Forms. The complete list of our questions can be found in Table 1 and in our replication package.

C1: Usage of APR
  Q1.1 Are you willing to review patches that are submitted by APR techniques? (5-Point Likert Scale)
  Q1.2 How many auto-generated patches would you be willing to review before losing trust/interest in the technique? (Selection + Other)
  Q1.3 How much time would you be giving to any APR technique to produce results? (Selection + Other)
  Q1.4 How much time do you spend on average to fix a bug? (Selection + Other)
  Q1.5 Do you trust a patch that has been adopted from another location/application, where a similar patch was already accepted by other developers? (5-Point Likert Scale)
  Q1.6 Would it increase your confidence in automatically generated patches if some kind of additional input (e.g., user-provided test cases) were considered? (5-Point Likert Scale)
  Q1.7 Besides some additional input that is taken into account, what other mechanism do you see to increase the trust in auto-generated patches? (Open-Ended)

C2: Availability of Inputs
  Q2.1 Can you provide additional test cases (i.e., inputs and expected outputs) relevant for the reported bug? (5-Point Likert Scale)
  Q2.2 Can you provide additional assertions as program instrumentation about the correct behavior? (5-Point Likert Scale)
  Q2.3 Can you provide a specification for the correct behavior as logical constraint? (5-Point Likert Scale)
  Q2.4 Would you be fine with classifying auto-generated input/output pairs as incorrect or correct behavior? (5-Point Likert Scale)
  Q2.5 How many of such queries would you answer? (Selection + Other)
  Q2.6 For how long would you be willing to answer such queries? (Selection + Other)
  Q2.7 What other type of input (e.g., specification or artifact) can you provide that might help to generate patches? (Open-Ended)
  Q2.8 Please describe how you would like to engage with an APR tool. For example, shortly describe the dialogue between you (as user of the APR tool) and the APR tool. Which input would you pass to the APR tool? What do you expect from the APR tool? (Open-Ended)

C3: Impact on Trust
  Q3.1 Would it increase your trust in auto-generated patches if additional artifacts such as tests/assertions are used during patching? (5-Point Likert Scale)
  Q3.2 Which of the following additional artifacts will increase your trust? (Multiple Choice)
  Q3.3 What are other additional artifacts that will increase your trust? (Open-Ended)

C4: Explanations for Generated Patches
  Q4.1 Would it increase your trust when the APR technique shows you the code coverage achieved by the executed test cases that are used to construct the repair? (5-Point Likert Scale)
  Q4.2 Would it increase your trust when the APR technique presents the ratio of input space that has been successfully tested by the inputs used to drive the repair? (5-Point Likert Scale)
  Q4.3 What other type of evidence or explanation would you like to come with the patches, so that you can select an automatically generated patch candidate with confidence? (Open-Ended)

C5: Usage of APR Side-Products
  Q5.1 Which of the following information (i.e., potential side-products of APR) would be helpful to validate the patch? (Multiple Choice)
  Q5.2 What other information (i.e., potential side-products of APR) would be helpful to validate the patch? (Open-Ended)
  Q5.3 Which of the following information (i.e., potential side-products of APR) would help you to fix the problem yourself (without using generated patches)? (Multiple Choice)
  Q5.4 What other information (i.e., potential side-products of APR) would help you to fix the problem yourself (without using generated patches)? (Open-Ended)

C6: Background
  Q6.1 What is your (main) role in the software development process? (Selection + Other)
  Q6.2 How long have you worked in software development? (Selection)
  Q6.3 How long have you worked in your current role? (Selection)
  Q6.4 How would you characterize the organization where you are employed for software development related activities? (Selection + Other)
  Q6.5 What is your highest education degree? (Selection + Other)
  Q6.6 What is your primary programming language? (Selection + Other)
  Q6.7 What is your secondary programming language? (Selection + Other)
  Q6.8 How familiar are you with Automated Program Repair? (5-Point Likert Scale)
  Q6.9 Are you applying any Automated Program Repair technique at work? (Yes/No)
  Q6.10 Which Automated Program Repair technique are you applying at work? (Open-Ended)

Table 1. Complete list of questions from the developer survey. In total 35 questions in 6 categories.

Participants

We distributed the survey via two channels: (1) Amazon MTurk, and (2) personalized email invitations to contacts at companies worldwide. As incentives, we offered each MTurk participant 10 USD as compensation, while for every other participant we donated 2 USD to a COVID-19 charity fund. We received 134 responses from MTurk. To filter low-quality and non-genuine responses, we followed known principles (Ehrich, 2020) and used quality-control questions. In particular, we asked the participants to describe their role in software development and name their main activity. In combination with the other open-ended questions, this allowed us to quickly identify non-genuine answers. After this manual post-processing, we ended up with 34 valid responses from MTurk. From our company contacts, we received 81 responses, all of which were genuine. From these 115 valid responses, we selected 103 relevant responses, excluding responses from participants who classified themselves as Project Manager, Product Owner, Data Scientist, or Researcher. Our goal was to include answers from software practitioners who have hands-on experience in software development. Figures 1 and 2 show the roles and experience levels for the final subset of 103 participants.

Figure 1. Responses for Q6.1 What is your (main) role in the software development process?
Figure 2. Responses for Q6.2 How long have you worked in software development?

Analysis

For the questions with a 5-point Likert scale, we analyzed the distribution of negative (1 and 2), neutral (3), and positive (4 and 5) responses. For the Multiple Choice questions, we analyzed which choices were selected most often, while the open-ended "Other" choices were analyzed and mapped to the existing choices or treated as new ones if necessary. For all open-ended questions, we performed qualitative content analysis coding (Schreier, 2012) to summarize the themes and opinions. The first iteration of the analysis and coding was done by one author, followed by a review by the other authors. In the following sections, we discuss the most frequently mentioned responses and indicate in brackets how often each topic was mentioned among the 103 participants. All data and codes are included in our replication package.

4. Survey Results

Figure 3. Results for the questions with the 5-point Likert Scale (103 responses).

4.1. Developer engagement with APR (RQ1)

In this section, we discuss the responses for the questions in category C1 and question Q2.8, which explicitly explored how the participants want to engage with an APR tool. First of all, a strong majority (72% of the responses) indicates that the participants are willing to review auto-generated patches (see Q1.1 in Figure 3). This generally confirms the efforts in the APR community to develop such techniques. Only 7% of the participants are reluctant to apply APR techniques in their work.

Figure 4. Cumulative illustration of the responses for Q1.2 How many auto-generated patches would you be willing to review before losing trust/interest in the technique?

As shown in Figure 4, we note that 72% of the participants want to review only up to 5 patches, while only 22% would review up to 10 patches. Furthermore, 6% mention that it would depend on the specific scenario. At the same time, the participants expect relatively quick results: 63% would not wait longer than one hour, and the majority of those (72% of them) prefer not to wait longer than 30 minutes. The expected time certainly depends on the concrete deployment, e.g., repair can also be deployed in a nightly Continuous Integration (CI) pipeline, but our results indicate that direct support of manual bug fixing requires quick fix suggestions or hints. In fact, 82% of the participants state that they usually spend no more than 2 hours on average to fix a bug, and hence APR techniques need to be fast to provide a benefit for the developer. To increase trust in the generated patches, 80% agree that additional artifacts (e.g., test cases), which are provided as input for APR, are useful (see Q1.6 in Figure 3). As a consistency check, we asked a similar question at a later point (see Q3.1 in Figure 3) and found that 85% agree that additional artifacts can increase trust. The most frequently mentioned other mechanisms to increase trust are the extensive validation of the patches with a test suite and static analysis tools (17/103), the actual manual investigation of the patches (10/103), the reputation of the APR tool itself (9/103), the explanation of patches (8/103), and the provision of additionally generated tests (7/103).

RQ1 – Acceptability of APR: Additional user-provided artifacts like test cases are helpful to increase trust in automatically generated patches. However, our results indicate that full developer trust requires a manual patch review. At the same time, test reports of automated dynamic and static analysis, as well as explanations of the patch, can facilitate the reviewing effort.

The responses for the explicit question about developers’ envisioned engagement with APR tools (Q2.8) can be categorized into four areas: the extent of interaction, the type of input, the expected output, and the expected integration into the development workflow.

Interaction

Most participants (71/103) mention that they prefer a rather low amount of interaction, i.e., after providing the initial input to the APR technique, there will be no further interaction. Only a few responses (6/103) mention the one-time option to provide more test cases or some sort of specification to narrow down the search space when APR runs into a timeout, or the generated fixes are not correct. Only 3 participants envision a high level of interaction, e.g., repeated querying of relevant test cases.

Input

Most participants appear ready to provide failing test cases (22/103) or relevant test cases (20/103). Others mentioned that APR should take a bug report as input (15/103), which can include the stack trace, details of the environment, and execution logs. Some also mentioned that they envision only the provision of the bare minimum, i.e., the program itself or the repository with the source code (11/103).

Output

Besides the generated patches, the most mentioned helpful output from an APR tool is explanations of the fixed issue including its root cause (9/103). This answer is followed by the requirement to present not only one patch but a list of potential patches (8/103). Additionally, some participants mentioned that it would be helpful to produce a comprehensive test report (6/103).

Integration

The most frequently mentioned integration mechanism is to involve APR smoothly in the DevOps pipeline (17/103), e.g., whenever a failing test is detected by the CI pipeline, APR would be triggered to generate appropriate fix suggestions. A developer would then manually review the failed test(s) and the suggested patches. Along with the integration, the participants mentioned that the primary goal of APR should be to save time for the developers (8/103).

RQ1 – Interaction with APR: Developers envision a low amount of interaction with APR, e.g., by only providing initial artifacts like test cases. APR should quickly (within 30 min - 60 min) generate a small number (between 5 and 10) of patches. Moreover, APR needs to be integrated into the existing DevOps pipelines to support the development workflow.

4.2. Availability/Impact of Artifacts (RQ2)

In this section, we look more closely at categories C2 and C3 to investigate which additional artifacts developers can provide and how these artifacts influence trust in APR. We first explore the availability of additional test cases (69% positive), program assertions (71% positive), and logical constraints (59% positive) (see the results for Q2.1, Q2.2, and Q2.3 in Figure 3). Furthermore, 58% of the participants are positive about answering queries to classify generated tests as failing or passing. This can be understood as follows: participants want low interaction (i.e., the tool asking them questions), but if the tool does offer queries, they are ready to answer some of them (most respondents prefer to answer no more than 10 queries). Based on the results for the open-ended question Q2.7, the majority of the participants (70/103) do not see any other additional artifacts (beyond tests/assertions/logical constraints/user queries) that they could provide to APR. The most frequently mentioned responses from the other participants are different forms of requirements specification (7/103), e.g., written in a domain-specific language, execution logs (6/103), documentation of interfaces with data types and expected value ranges (5/103), error stack traces (4/103), relevant source code locations (3/103), and reference solutions (3/103), e.g., existing solutions for similar problems.

RQ2 – Artifact Availability: Software developers can provide additional artifacts like test cases, program assertions, logical constraints, execution logs, and relevant source code locations.

Figure 5. Responses for Q3.2 Which of the following additional artifacts will increase your trust?

Regarding the increase in trust from incorporating additional artifacts to drive repair, 93% of the participants agree that additional test cases are helpful (Figure 5). This is also interesting from the perspective of recent automated repair tools (Yang et al., 2017; Shariffdeen et al., 2021) that perform automated test generation to obtain less overfitting patches. Logical constraints (70%) and program assertions (68%) fare worse in this respect. Although user queries allow more interaction with the APR technique, they would not necessarily increase trust more than the other artifacts, as only 59% agreed on their benefit. Most of the participants (88/103) did not mention a trust gain from other artifacts. However, a notable artifact has been non-functional requirements (3/103) like performance or security aspects, which relates to the concern that auto-generated patches may harm existing performance characteristics or introduce security vulnerabilities.

RQ2 – Impact on Trust: Additional test cases would have a great impact on the trustworthiness of APR. There exists the possibility of automatically generating tests to increase trust in APR.

4.3. Patch Explanation/Evidence (RQ3)

In this section, we explore which patch evidence and APR side-products can support trust in APR (see categories C4 and C5). We first proposed two possible pieces of evidence that could be presented along with the patches: the code coverage achieved by the test cases used to construct the repair, and the ratio of input space that has been successfully tested by the automated patch validation. 76% of the participants agree that code coverage would increase trust, and 71% agree with the input ratio (see Q4.1 and Q4.2 in Figure 3). The majority of the participants (78/103) do not mention other types of evidence that would help to select a patch with confidence. Nevertheless, the most frequently mentioned response is a fix summary (10/103), i.e., an explanation of what has been fixed, including the root cause of the issue, how it has been fixed, and how it can prevent future issues. Other participants mention the success rate in the case of patch transplantation (5/103), and a test report summarizing the patch validation results (3/103). These responses match the observations for RQ1, where we asked how developers want to interact with trustworthy APR and what output they expect.

RQ3 – Patch Evidence: Software developers want to see evidence for the patch's correctness to efficiently select patch candidates. Developers want to see information such as code coverage as well as the ratio of the covered input space.

Figure 6. Responses for Q5.1 Which of the following information (i.e., potential side-products of APR) would be helpful to validate the patch?

A straightforward way to provide explanations and evidence is to expose outputs that APR already creates as side-products. We listed some of them and asked the participants to select which would be helpful to validate the patches (see results in Figure 6). 85% agree that the identified fault and fix locations are helpful to validate the patch, followed by the generated test cases with 79% agreement. In addition, a few participants emphasize the importance of a test report (4/103) and an explanation of the root cause and the fix attempt (4/103).

Figure 7. Responses for Q5.3 Which of the following information (i.e., potential side-products of APR) would help you to fix the problem yourself (without using generated patches)?

Finally, we explore which side-products are most useful for developers, even when APR cannot identify the correct patch. Figure 7 shows that the identified fault and fix locations are of most interest (82%), followed by the generated test cases (75%). Very few participants add that an issue summary (2/103) and the potential results of a data flow analysis (2/103) could be helpful as well.

RQ3 – APR's Side-Products: Our results indicate that side-products of APR like the fault and fix locations and the generated test cases can assist manual patch validation, and hence, enhance trust in APR.

5. Evaluation Methodology

We now investigate to what extent existing APR techniques support the expectations and requirements collected in our survey. Not all aspects of our developer survey can be easily evaluated. For example, evaluating the amount of interaction, the integration into existing workflows, the output format for efficient patch selection, and the patch explanations requires additional case studies and further user experiments. In this evaluation, we focus on a quantitative assessment of the relatively short patching time (30-60 min), the limited number of patches to manually investigate (5 to 10), the handling of additional test cases and logical constraints, and the ability to generate a repair at a provided fix location. We explore whether state-of-the-art repair techniques can produce correct patches under configurations that match these expectations and requirements. Specifically, we aim to answer research questions RQ4 and RQ5.

APR Representatives

In our evaluation, we use the following representative state-of-the-art repair techniques: GenProg (Le Goues et al., 2012), Angelix (Mechtaev et al., 2016), Prophet (Long and Rinard, 2016), Fix2Fit (Gao et al., 2019), and CPR (Shariffdeen et al., 2021). GenProg (Le Goues et al., 2012) is a search-based program repair tool that evolves the buggy program by mutating program statements. It is a well-known representative of the generate-and-validate repair techniques. Angelix (Mechtaev et al., 2016) is a semantic program repair technique that applies symbolic execution to extract constraints, which serve as a specification for subsequent program synthesis. Prophet (Long and Rinard, 2016) combines search-based program repair with machine learning. It learns a code correctness model from open-source software repositories to prioritize and rank the generated patches. Fix2Fit (Gao et al., 2019) combines search-based program repair with fuzzing. It uses grey-box fuzzing to generate additional test inputs that filter out overfitting patches that crash the program. The test generation prioritizes tests that refine an equivalence-class-based patch space representation. CPR (Shariffdeen et al., 2021) combines semantic program repair with concolic test generation to refine abstract patches and discard overfitting patches. It takes a logical constraint as additional user input to reason about the generated tests.

Subject Programs

We use the ManyBugs (Le Goues et al., 2015) benchmark, which consists of 185 defects in 9 open-source projects. For each subject, ManyBugs includes a test suite created by the original developers. Note that all of the studied repair techniques require and/or can incorporate a test suite in their repair process. For our evaluation, we filter the 185 defects down to those that were fixed by the developer at a single fix location. We remove defects from the "Valgrind" and "FBC" subjects because we could not reproduce them. This leaves 60 defects in 6 different open-source projects (Table 2).

Program  | Description              | LOC    | Defects | Tests
LibTIFF  | Image processing library | ~77k   | 7       | 78
lighttpd | Web server               | ~62k   | 2       | 295
PHP      | Interpreter              | ~1046k | 43      | 8471
GMP      | Math library             | ~145k  | 1       | 146
Gzip     | Data compression program | ~491k  | 3       | 12
Python   | Interpreter              | ~407k  | 4       | 355
Table 2. Experiment subjects and their details.

Experimental Configurations and Setup

All tools are configured to run in full-exploration mode, which continues to generate patches even after finding one plausible patch, until the timeout is reached or the search space is fully explored. To study the impact of fix locations and test case variations (see RQ5), we evaluate each tool using different configurations (see Table 3). Note that in each configuration we provide the relevant source file to all techniques; with "developer fix location" we additionally provide the exact source line number.

ID  | Fix Location            | Passing Tests | Timeout
EC1 | tool fault localization | 100%          | 1hr
EC2 | developer fix location  | 100%          | 1hr
EC3 | developer fix location  | 0%            | 1hr
EC4 | developer fix location  | 50%           | 1hr
Table 3. Experiment configurations.
Subject (defects) | Angelix EC1, EC2, EC3, EC4 | Prophet EC1, EC2, EC3, EC4 | GenProg EC1, EC2, EC3, EC4 | Fix2Fit EC1, EC2, EC3, EC4 | CPR EC2, EC3, EC4
LibTIFF (7)   | 3/1, 3/1, 3/1, 3/1 | 1/0, 1/0, 1/0, 1/0 | 5/0, 5/0, 5/0, 5/0  | 5/1, 4/1, 4/1, 4/1   | 4/2, 4/2, 4/2
lighttpd (2)  | -, -, -, -         | 1/0, 0/0, 0/0, 0/0 | 1/0, 1/0, 1/0, 1/0  | 1/0, 1/0, 1/0, 1/0   | -, -, -
PHP (43)      | 0/0, 0/0, 0/0, 0/0 | 0/0, 0/0, 2/1, 3/1 | 0/0, 0/0, 10/1, 0/0 | 8/1, 4/2, 7/2, 5/1   | 5/4, 5/4, 5/4
GMP (1)       | 0/0, 0/0, 0/0, 0/0 | 0/0, 0/0, 0/0, 0/0 | 0/0, 0/0, 0/0, 0/0  | 0/0, 0/0, 0/0, 0/0   | 1/1, 1/1, 1/1
Gzip (3)      | 0/0, 1/0, 1/0, 1/0 | 0/0, 1/1, 1/1, 1/1 | 0/0, 0/0, 0/0, 0/0  | 0/0, 0/0, 0/0, 0/0   | 3/1, 3/1, 3/1
Python (4)    | -, -, -, -         | 0/0, 1/1, 1/1, 1/1 | 0/0, 0/0, 0/0, 0/0  | 0/0, 0/0, 0/0, 0/0   | -, -, -
Overall (60)  | 3/1, 4/1, 4/1, 4/1 | 2/0, 3/2, 5/3, 6/3 | 6/0, 6/0, 16/1, 6/0 | 14/2, 9/3, 12/3, 10/2 | 13/8, 13/8, 13/8
Table 4. Experimental results for the various configurations. Each cell shows plausible/correct, i.e., the number of defects for which the technique identified at least one plausible patch / a correct patch among the top-10, under the given configuration.
Subject  | Angelix EC1, EC2 | Prophet EC1, EC2 | GenProg EC1, EC2 | Fix2Fit EC1, EC2
LibTIFF  | 86, 100          | 24, 93           | 1, 27            | 100, 100
lighttpd | -, -             | 20, 100          | <1, 51           | 100, 100
PHP      | 96, 100          | 22, 96           | <1, 91           | 63, 80
GMP      | 100, 100         | 41, 100          | 5, 100           | -, -
Gzip     | 100, 100         | 6, 100           | 18, 100          | 100, 100
Python   | -, -             | 14, 100          | 1, 100           | -, -
Overall  | 95, 100          | 21, 98           | 4, 78            | 91, 95
Table 5. Average exploration ratio (in %) for EC1 and EC2.

Evaluation Metrics

In order to assess the techniques, we consider eight metrics: (M1) the search space size of the repair tool, (M2) the number of enumerated/explored patches, (M3) the explored ratio with respect to the search space, (M4) the number of non-compilable patches, (M5) the number of non-plausible patches, i.e., patches that have been explored but ruled out because existing or generated test cases are violated, (M6) the number of plausible patches, (M7) the number of correct patches, and (M8) the highest rank of a correct patch. M1-M6 help to analyze the overall search space creation and navigation of each technique. The definition of the search space size (M1) for a defect, as well as the definition of an enumerated/explored patch (M2), varies per tool. We include all experiment protocols in our replication artifact, which describe how to collect these metrics for each tool. M7 and M8 assess the repair outcome, i.e., the identification of the correct patch. We define a patch as correct whenever it is semantically equivalent to the developer patch from our benchmark. To check for the correct patch, we manually investigated only the top-10 ranked patches because our survey concluded that developers would not explore beyond that. Note that not all techniques provide a patch ranking (e.g., Angelix, GenProg, and Fix2Fit). In these cases, we use the order of generation as the ranking.
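
For reference, the exploration ratio reported in Table 5 (metric M3) is simply the explored fraction of the search space expressed as a percentage, i.e., a restatement of the definition above in formula form:

\[ \mathrm{M3} \;=\; \frac{\mathrm{M2}}{\mathrm{M1}} \times 100\% \]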

Hardware

All our experiments were conducted using Docker containers on top of AWS (Amazon Web Services) EC2 instances. We used the c5a.8xlarge instance type, which provides 32 vCPUs and 64 GiB of memory.

Replication

Our replication package contains all experiment logs and subjects, as well as protocols that define the methodology used to analyze the output of each repair tool.

6. Evaluation Results

Table 4 summarizes our evaluation results. For each APR technique, we show its performance under the given experimental configurations (see Table 3). Each cell shows p/c, where p is the number of defects for which the tool generated at least one plausible patch (i.e., M6), and c is the number of defects for which the tool generated a correct patch among the top-10 plausible patches. For example, the LibTIFF project has 7 defects, for which Angelix generated plausible patches for 3 defects and a correct patch for 1 defect under the setup EC1 (i.e., 1-hour timeout, tool fault localization, and all available test cases). Due to limitations in their symbolic execution engine KLEE (Cadar et al., 2008), Angelix and CPR do not support lighttpd and Python, and the corresponding cells are marked with "-". For CPR, we cannot produce results for EC1 because it has no fault localization of its own and hence requires the fix location as an input. Additionally, Table 5 presents the average patch exploration/enumeration ratio of the techniques with respect to the patch space size, computed as the percentage M2/M1 for each defect in each subject.

6.1. APR within realistic boundaries (RQ4)

The numbers in Table 4 show that the overall repair success is comparably low. For example, Fix2Fit can generate plausible patches for 14 defects with EC1, while CPR can generate correct patches for 8 defects given the correct fix location. Compared to previous studies, the number of plausible patches is significantly lower in our experiments, mainly due to the 1-hour timeout. Prior research on program repair has experimented with 10-hour (Mechtaev et al., 2018), 12-hour (Mechtaev et al., 2016; Long and Rinard, 2016), and 24-hour (Gao et al., 2019) timeouts, and determined whether a correct patch can be identified among all generated plausible patches. The focus of these prior experiments was to evaluate the capability to generate a patch, whereas in our work we focus on the performance within a tolerable time limit set by developers.

RQ4 – Repair Success: Current state-of-the-art repair techniques perform poorly with a 1-hour timeout and the top-10 ranking restriction. Most techniques cannot identify any plausible patch for most defects in the ManyBugs benchmark.

In general, the repair success of an APR technique is determined by (1) its search space, (2) the exploration of this search space, and (3) the ranking of the identified patches. In a nutshell, this means: if the correct patch is not in the search space, the technique cannot identify it. If the correct patch is in the search space, but APR does not identify it within the given timeout or other resource limitations, it cannot report it as a plausible patch. If it identifies the patch within the available resources but cannot pinpoint it in the (potentially huge) space of generated patches, the user/developer will not recognize it. With these impediments to repair success in real-world scenarios in mind, we examine the considered repair techniques. Our goal is to identify the concepts in APR that are necessary to meet the developers' expectations, and hence to improve the state-of-the-art approaches.

Search Space

Table 5 shows that Angelix explores almost its complete search space within the 1-hour timeout, while Table 4 shows that it finds plausible patches for only 3 defects and a correct patch for only 1 defect (with EC1). As described in (Mechtaev et al., 2018), the program transformations used by Angelix to build/explore the search space only include the modification of existing side-effect-free integer expressions/conditions and the addition of if-guards. Therefore, we conclude that Angelix's search space is too limited to contain the correct patches. The other techniques, on the other hand, consider larger search spaces. Prophet also considers the insertion of statements and the replacement of function calls. GenProg can insert/remove any available program statement. Fix2Fit uses the search space of f1x (Mechtaev et al., 2018), which combines the search spaces of Angelix and Prophet into a larger search space. CPR uses the same program transformations as Angelix but is amenable to user inputs like additional synthesis components, which enrich its search space.

RQ4 – Search Space: Successful repair techniques need to consider a wide range of program transformations and should be able to take user input into account to enrich the search space.

Search Space Exploration

Prophet and GenProg show low exploration ratios of 21% and 4%, respectively (see EC1 in Table 5), which leads to a low number of plausible patches. In contrast, Fix2Fit fully explores the patch search space for most of the considered defects (except for PHP), which leads to a high chance of finding a plausible patch. CPR (not shown in the table) fully explores its search space in our experiments. Unlike Prophet and GenProg, Fix2Fit and CPR group and abstract patches to explore them efficiently. Fix2Fit groups the patches by their behavior on test inputs and uses this equivalence relation to guide the generation of additional inputs. CPR represents the search space in terms of abstract patches, i.e., patch templates accompanied by constraints. CPR enumerates abstract patches instead of concrete patches, and hence can reason about multiple patches at once to remove or refine them. Prophet and GenProg, however, need to explore and evaluate all concrete patches, which causes a significant slowdown. Reducing the patch validation time is possible if patches can be validated without re-compiling the program for each concrete patch (Durieux et al., 2017; Chen et al., 2021; Wong et al., 2021).
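
To make the notion of an abstract patch more concrete, the following is a minimal sketch in the spirit of the lenStrncpy example from Section 2 (our own illustration of the general idea, not CPR's actual representation): a single template with a symbolic constant stands for a whole family of concrete patches, and an interval constraint over that constant is tightened as tests are evaluated, so many candidates are pruned at once.

#include <stdio.h>

// Illustrative sketch of abstract-patch reasoning. The template
//   while (++index < n && index < ALPHA)
// is represented by an interval constraint on the symbolic constant ALPHA
// instead of by individually compiled candidate programs.
typedef struct { int lo; int hi; } interval;

// A passing test that must copy k characters requires ALPHA >= k.
static void require_at_least(interval *alpha, int k) {
    if (k > alpha->lo) alpha->lo = k;
}

// A crash observed when writing at index k requires ALPHA <= k.
static void require_at_most(interval *alpha, int k) {
    if (k < alpha->hi) alpha->hi = k;
}

int main(void) {
    interval alpha = {1, 100};     // initial family of candidate patches
    require_at_least(&alpha, 3);   // passing tests copy 3 characters
    require_at_most(&alpha, 3);    // failing test crashes when writing index 3
    printf("remaining patches: ALPHA in [%d, %d]\n", alpha.lo, alpha.hi);
    return 0;
}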

RQ4 – Patch Space Exploration: A large/rich search space requires an efficient exploration strategy, which can be achieved by, e.g., using search space abstractions.

Patch Ranking

Although Fix2Fit builds a rich search space and can efficiently explore it, it still cannot produce many correct patches. One reason is that Fix2Fit can identify a correct patch but fails to pinpoint it among the top-10 patches because it only applies a rudimentary patch ranking based on the edit distance between the original and patched program. For instance, Fix2Fit generates the correct patch for defect 865f7b2 in the LibTIFF subject but ranks it below position 10, and hence it is not counted in our evaluation. Furthermore, Fix2Fit's patch refinement and ranking are based on crash-avoidance, which is not well suited for a test-suite repair benchmark such as ManyBugs that does not include many crashing defects. CPR improves on this by leveraging the user-provided logical constraint to reason about additionally generated inputs, while the patch behaviors on these inputs are collected and used to rank the patches. But still, overall, it cannot produce many correct patches within the top-10. We also investigated how many of the correct patches are within the top-5 because 72% of our survey participants strongly favored reviewing only up to 5 patches (see Figure 4). We observed that most correct patches identified within the top-10 are ranked very high, so applying a top-5 threshold makes little difference. Recent work (Xiong et al., 2018; Wong et al., 2021) proposes using test behavior similarity between the original and patched programs to rank plausible patches, which is a promising future direction.

RQ4 – Patch Ranking: Even after the correct patch has been explored, an effective patch ranking remains the last hurdle before it reaches the developer.

6.2. Impact of additional inputs (RQ5)

Providing Fix Location as User input

In Table 4, column EC1 shows the results with the tool's own fault localization, and column EC2 shows the results when repairing only at the developer-provided (correct) fix location. Intuitively, one would expect that, equipped with the developer fix location, the results of each repair technique should improve. However, the results for Angelix and GenProg do not change (except for one more plausible patch with Angelix). From the previous discussion of the search space, we conclude that the program transformations of Angelix are the main limiting factor, so that even the provision of the correct fix location has no impact. For GenProg, we know from the EC3 configuration that there is at least one correct patch in the search space (see Table 4). Therefore, we conclude that GenProg suffers from its inefficient space exploration, so that even the search space reduction obtained by fixing the location has no impact. Prophet, in contrast, can generate two additional correct patches in EC2, and hence benefits from the precise fix location. The exploration ratio in Table 5 shows that Prophet almost fully explores its search space in EC2, indicating a significantly smaller search space. Fix2Fit can generate one more correct patch compared to EC1. Similar to Prophet, Fix2Fit benefits from the precise fix location and can explore more of its search space. Note that CPR is not included in the comparison between EC1 and EC2 because it is not applicable to EC1. However, for EC2, it generates the highest number of correct patches. Besides its efficient patch space abstraction, we attribute this to its ability to incorporate additional user inputs like the fix location and the user-provided logical constraint.

RQ5 – Fix Location: Our results show that the provision of the precise and correct fix location does not necessarily improve the outcome of the state-of-the-art APR techniques due to their limitations in search space construction and exploration. However, being amenable to such additional inputs can significantly improve the repair success, as shown by results from CPR.

Varying Passing Test Cases

To examine the impact of the passing test cases, we consider the differences between columns EC2, EC3, and EC4 in Table 4. In general, more passing test cases can lead to higher-quality patches because they carry information about the correct behavior. In line with this, we observe that more passing test cases lead to fewer plausible patches because the patch validation can remove more overfitting patches. For Angelix, however, we observe no difference due to its limited search space. CPR is also not affected by the varying number of passing test cases: it uses the failing test cases to synthesize the search space and the passing test cases as seed inputs for its input generation, but since CPR always fully explores its search space in our experiments, the variation of the initial seed inputs has no effect within the 1-hour limit. Overall, we observe three different effects: (a) for techniques with a limited search space (e.g., Angelix), passing test cases have very low or no effect; (b) for techniques that suffer from inefficient space exploration strategies (e.g., GenProg and Prophet), having fewer passing test cases can speed up the repair process and lead to more plausible (possibly overfitting) patches; (c) otherwise (e.g., Fix2Fit), variations in the passing test cases can still influence the ranking. Whether more tests are better depends on the APR strategy and its characteristics, as discussed in Section 6.1. Therefore, we suggest that APR techniques incorporate an intelligent test selection or filtering mechanism, which has not yet been studied extensively in the context of APR. Recently, Lou et al. (Lou et al., 2021) suggested applying traditional regression test selection and prioritization to achieve better repair efficiency. Further developing and using such mechanisms represents a promising research direction. Note that in the discussed experiments, the fix location was defined beforehand. However, if APR techniques use a test-based fault localization technique (as in EC1), the test cases have an additional effect on the search space and the repair success.

RQ5 – Test Cases: Variation of passing test cases causes different effects depending on the characteristics of the APR techniques. Overall, one needs an intelligent test selection method.

7. Threats to Validity

External Validity

Although we reached out to different organizations in different countries, we cannot guarantee that our survey results generalize to all software developers. To mitigate this threat, we made all research artifacts publicly available so that other researchers and practitioners can replicate our study. To reduce the risk of developers not participating, and to reduce volunteer bias, we designed the survey for a short completion time (15-20 min) and provided incentives like charity donations and (in the case of MTurk) monetary compensation. In our empirical analysis, we do not cover all available APR tools; instead, we cover the main APR concepts: search-based, semantics-based, and machine-learning-based techniques. With ManyBugs (Le Goues et al., 2015), we have chosen a well-known benchmark of defects in open-source projects. Additionally, it includes many test cases, which are necessary to evaluate the aspects of test case provision.

Construct Validity

In our survey, to encourage candid responses, we did not collect any personally identifying information. Additionally, we applied control questions to filter out non-genuine answers. To mitigate the risk of misinterpreting the collected responses, we performed qualitative analysis coding, for which all codes have been checked and agreed upon by at least two authors. Although we found general agreement across participants for many questions, we consider our results only a first step towards exploring trustworthy APR. The metrics in our quantitative evaluation measure the patch generation progress and capture repair efficiency/effectiveness via variations in configurations (EC1-EC4).

Internal Validity

Our participants could have misunderstood our survey questions, as we could not clarify any particulars due to the nature of online surveys. To mitigate this threat, we performed a small pilot survey with five developers, in which we asked for feedback about the questions, the survey structure, and the completion time. Additionally, there is a general threat that participants could have submitted multiple responses because our survey was completely anonymous. To mitigate the threat of errors in the setup of our empirical experiments, we performed preliminary runs with a subset of the benchmark and manually investigated the results.

8. Related Work

Our related work includes considerations of trust issues (Ryan et al., 2019; Alarcon et al., 2020; Bertram et al., 2020), studies about the human aspects in automated program repair (Cambronero et al., 2019; Tao et al., 2014; Liang et al., 2020; Fry et al., 2012; Kim et al., 2013), user studies about debugging (Parnin and Orso, 2011), and empirical studies about repair techniques (Liu et al., 2020; Kong et al., 2018; Motwani et al., 2020; Wang et al., 2019; Wen et al., 2017; Yang et al., 2020; Martinez et al., 2017; Liu et al., 2021). With regard to human aspects in automated program repair, our survey study contributes novel insights about developers' expectations on their interaction with APR and which mechanisms help to increase trust. With regard to empirical studies, our evaluation contributes a fresh perspective on existing APR techniques.

Trust Aspects in APR

Trust issues in automated program repair emerge from the general trust issues in automation. Lee and See (Lee and See, 2004) discuss that users tend to reject automation techniques whenever they do not trust them. Therefore, for the successful deployment of automated program repair in practice, it will be essential to focus on its human aspects. With respect to this, our presented survey contributes to the knowledge base of how developers want to interact with repair techniques, and what makes them trustworthy.

Existing research on trust issues in APR focuses mainly on the effect of patch provenance, i.e., the source of the patch. Ryan and Alarcon et al. (Ryan et al., 2019; Alarcon et al., 2020) performed user studies in which they asked developers to rate the trustworthiness of patches while varying the source of the patches. Their observations indicate that human-written patches receive a higher degree of trust than machine-generated patches. Bertram et al. (Bertram et al., 2020) conducted an eye-tracking study to investigate the effect of patch provenance. They confirm a difference between human-written and machine-generated patches and observe that the participants prefer human-written patches in terms of readability and coding style. Our study, on the other hand, explores the expectations and requirements of developers for trustworthy APR. The work of Weimer et al. (Weimer et al., 2016) proposed strategies to assess repaired programs to increase human trust. Our study results confirm that an efficient patch assessment is crucial and desired by the developers. We note that (Weimer et al., 2016) focuses on how to assess APR, while we focus on how to enhance/improve APR in general, specifically in terms of trust.

Human Aspects in APR

Other human studies in the APR context focus on how developers interact with APR's output, i.e., the patches. Cambronero et al. (Cambronero et al., 2019) observed developers while they fixed software issues. They infer that developers would benefit from patch explanations and summaries to efficiently select suitable patches. They propose to explain the roles of variables and their relation to the original code, to list the characteristics of patches, and to summarize the effect of the patches on the program. Tao et al. (Tao et al., 2014) explored how machine-generated patches can support the debugging process. They conclude that, compared to debugging with knowledge of only the buggy location, high-quality patches can support the debugging effort, while low-quality patches can actually compromise it. Liang et al. (Liang et al., 2020) concluded that even incorrect patches are helpful if they provide additional knowledge like fault locations. Fry et al. (Fry et al., 2012) explored the understandability and maintainability of machine-generated patches. While their participants labeled machine-generated patches as "slightly" less maintainable than human-written patches, they also observed that augmenting patches with synthesized documentation can reverse this trend. Kim et al. (Kim et al., 2013) proposed the template-based repair technique PAR and evaluated its patch acceptability compared to GenProg. All of these preliminary works explore reactions to the output of APR. While our findings confirm previous hypotheses, e.g., that fault locations are helpful side-products of APR (Liang et al., 2020) or that efficient patch selection is important (Weimer et al., 2016; Liang et al., 2020), our work also considers the input to APR, the interaction with APR during patch generation, and how trust can be established.

Debugging

Parnin and Orso (Parnin and Orso, 2011) investigate the usefulness of automated debugging techniques in practice. They observe that many assumptions made by these techniques do not hold in practice. Although we focus on automated program repair, our research theme is related to theirs (Parnin and Orso, 2011): we strive to understand how developers want to use automated program repair and whether current techniques support these needs.

Empirical Evaluation of APR

The living review article on automated program repair by Martin Monperrus (Monperrus, 2018) lists 43 empirical studies. Most of them are concerned with patch correctness as the basis for comparing the success of APR techniques. Other frequently explored aspects are repair efficiency (Kong et al., 2018; Liu et al., 2020; Martinez et al., 2017; Liu et al., 2021), the impact of fault locations (Liu et al., 2020; Wen et al., 2017; Yang et al., 2020; Liu et al., 2021), and the diversity of bugs (Kong et al., 2018; Liu et al., 2021). Less frequently studied aspects are the impact of the test suite (Kong et al., 2018; Motwani et al., 2020) and its provenance (Motwani et al., 2020; Le et al., 2018), specifically the problem of test-suite overfitting (Le et al., 2018; Liu et al., 2021), and how close the generated patches come to human-written patches (Wang et al., 2019). Our empirical evaluation is not just another assessment of APR technologies: it is specifically tied to the developer expectations collected in our survey. It limits the timeout to 1 hour, inspects only the top-10 patches, and explores various configurations of passing tests as well as the impact of provided fix locations. Together with our survey results, our quantitative evaluation provides the building blocks for creating trustworthy APR techniques, which will need to be validated via future user studies with practitioners.
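To make these developer-informed constraints concrete, the following minimal Python sketch shows one way such an evaluation configuration could be encoded. It is an illustrative assumption, not our actual experiment harness: the names (EvalConfig, meets_developer_expectations) and the extra knobs beyond the 1-hour timeout and top-10 ranking are hypothetical.

```python
# Hypothetical sketch of the evaluation constraints discussed above:
# a 1-hour budget per bug and inspection of only the top-10 ranked patches.
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvalConfig:
    timeout_seconds: int = 3600                # acceptable repair time (1 hour)
    top_k_patches: int = 10                    # developers inspect at most 10 patches
    passing_test_ratios: List[float] = field(  # fraction of passing tests given to the tool
        default_factory=lambda: [0.0, 0.5, 1.0])
    provide_fix_locations: bool = False        # toggle to study user-provided fix locations


def meets_developer_expectations(repair_time_s: float, correct_patch_rank: int,
                                 cfg: EvalConfig) -> bool:
    """A repair run meets the surveyed expectations only if a correct patch
    is produced within the time budget and ranked within the top-k."""
    return (repair_time_s <= cfg.timeout_seconds
            and correct_patch_rank <= cfg.top_k_patches)


if __name__ == "__main__":
    cfg = EvalConfig()
    # e.g., a correct patch found after 50 minutes at rank 7 would be acceptable
    print(meets_developer_expectations(50 * 60, 7, cfg))
```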

9. Discussion

In this paper, we have investigated the issues involved in enhancing developer trust in automatically generated patches. Through a detailed study with more than 100 practitioners, we explore the expectations and tolerance levels of developers with respect to automated program repair tools. We then conduct a quantitative evaluation to show that existing repair tools do not meet developers' expectations in terms of producing high-quality patches in a short time period. Our qualitative and quantitative studies indicate directions that need to be explored to gain developer trust in patches: minimal interaction with repair tools, exchange of artifacts such as generated tests as both inputs and outputs of repair tools, and attention to abstract search space representations over and above search algorithm frameworks.

We note that there is an increasing move towards automated code generation, such as the recently proposed Github Copilot, which raises the question of whether such automatically generated code can be trusted. Developing technologies to support the mixed use of manually written and automatically generated code, where program repair lies at the manual-automatic boundary, could be an enticing research challenge for the community.

Dataset from our work

Our replication package with the survey and experiment artifacts is available upon request by emailing the authors.

References

  • G. M. Alarcon, C. Walter, A. M. Gibson, R. F. Gamble, A. Capiola, S. A. Jessup, and T. J. Ryan (2020) Would you fix this code for me? Effects of repair source and commenting on trust in code repair. Systems 8 (1).
  • J. Bader, A. Scott, M. Pradel, and S. Chandra (2019) Getafix: learning to fix bugs automatically. Proceedings of the ACM on Programming Languages 3 (OOPSLA), pp. 1–27.
  • R. Bavishi, H. Yoshida, and M. R. Prasad (2019) Phoenix: automated data-driven synthesis of repairs for static analysis violations. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 613–624.
  • I. Bertram, J. Hong, Y. Huang, W. Weimer, and Z. Sharafi (2020) Trustworthiness perceptions in code review: an eye-tracking study. In Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), ESEM ’20, New York, NY, USA.
  • C. Cadar, D. Dunbar, and D. Engler (2008) KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI’08, USA, pp. 209–224.
  • J. P. Cambronero, J. Shen, J. Cito, E. Glassman, and M. Rinard (2019) Characterizing developer use of automatically generated patches. In 2019 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 181–185.
  • L. Chen, Y. Ouyang, and L. Zhang (2021) Fast and precise on-the-fly patch validation for all. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pp. 1123–1134.
  • T. Durieux, B. Cornu, L. Seinturier, and M. Monperrus (2017) Dynamic patch generation for null pointer exceptions using metaprogramming. In 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 349–358.
  • K. Ehrich (2020) Mechanical Turk: potential concerns and their solutions. https://www.summitllc.us/blog/mechanical-turk-concerns-and-solutions
  • Z. P. Fry, B. Landau, and W. Weimer (2012) A human study of patch maintainability. In Proceedings of the 2012 International Symposium on Software Testing and Analysis, ISSTA 2012, New York, NY, USA, pp. 177–187.
  • X. Gao, S. Mechtaev, and A. Roychoudhury (2019) Crash-avoiding program repair. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2019, New York, NY, USA, pp. 8–18.
  • X. Gao, B. Wang, G. J. Duck, R. Ji, Y. Xiong, and A. Roychoudhury (2021) Beyond tests: program vulnerability repair via crash constraint extraction. ACM Transactions on Software Engineering and Methodology 30 (2).
  • C. L. Goues, M. Pradel, and A. Roychoudhury (2019) Automated program repair. Communications of the ACM 62 (12), pp. 56–65.
  • D. Kim, J. Nam, J. Song, and S. Kim (2013) Automatic patch generation learned from human-written patches. In Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, pp. 802–811.
  • S. Kirbas, E. Windels, O. McBello, K. Kells, M. Pagano, R. Szalanski, V. Nowack, E. Winter, S. Counsell, D. Bowes, T. Hall, S. Haraldsson, and J. Woodward (2021) On the introduction of automatic program repair in Bloomberg. IEEE Software 38 (04), pp. 43–51.
  • X. Kong, L. Zhang, W. E. Wong, and B. Li (2018) The impacts of techniques, programs and tests on automated program repair: an empirical study. Journal of Systems and Software 137, pp. 480–496.
  • C. Le Goues, N. Holtschulte, E. K. Smith, Y. Brun, P. Devanbu, S. Forrest, and W. Weimer (2015) The ManyBugs and IntroClass benchmarks for automated repair of C programs. IEEE Transactions on Software Engineering 41 (12), pp. 1236–1256.
  • C. Le Goues, T. V. Nguyen, S. Forrest, and W. Weimer (2012) GenProg: a generic method for automatic software repair. IEEE Transactions on Software Engineering 38 (1), pp. 54–72.
  • X. B. D. Le, F. Thung, D. Lo, and C. Le Goues (2018) Overfitting in semantics-based automated program repair. Empirical Software Engineering 23 (5), pp. 3007–3033.
  • J. D. Lee and K. A. See (2004) Trust in automation: designing for appropriate reliance. Human Factors 46 (1), pp. 50–80.
  • J. Liang, R. Ji, J. Jiang, Y. Lou, Y. Xiong, and G. Huang (2020) Interactive patch filtering as debugging aid. arXiv preprint arXiv:2004.08746.
  • K. Liu, L. Li, A. Koyuncu, D. Kim, Z. Liu, J. Klein, and T. F. Bissyandé (2021) A critical review on the evaluation of automated program repair systems. Journal of Systems and Software 171, pp. 110817.
  • K. Liu, S. Wang, A. Koyuncu, K. Kim, T. F. Bissyandé, D. Kim, P. Wu, J. Klein, X. Mao, and Y. L. Traon (2020) On the efficiency of test suite based program repair: a systematic assessment of 16 automated repair systems for Java programs. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE ’20, New York, NY, USA, pp. 615–627.
  • F. Long and M. Rinard (2016) Automatic patch generation by learning correct code. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’16, New York, NY, USA, pp. 298–312.
  • Y. Lou, S. Benton, D. Hao, L. Zhang, and L. Zhang (2021) How does regression test selection affect program repair? An extensive study on 2 million patches. arXiv preprint arXiv:2105.07311.
  • A. Marginean, J. Bader, S. Chandra, M. Harman, Y. Jia, K. Mao, A. Mols, and A. Scott (2019) SapFix: automated end-to-end repair at scale. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 269–278.
  • M. Martinez, T. Durieux, R. Sommerard, J. Xuan, and M. Monperrus (2017) Automatic repair of real bugs in Java: a large-scale experiment on the Defects4J dataset. Empirical Software Engineering 22 (4), pp. 1936–1964.
  • S. Mechtaev, X. Gao, S. H. Tan, and A. Roychoudhury (2018) Test-equivalence analysis for automatic patch generation. ACM Transactions on Software Engineering and Methodology 27 (4).
  • S. Mechtaev, J. Yi, and A. Roychoudhury (2016) Angelix: scalable multiline program patch synthesis via symbolic analysis. In Proceedings of the 38th International Conference on Software Engineering, ICSE ’16, New York, NY, USA, pp. 691–701.
  • M. Monperrus (2018) The living review on automated program repair. Technical Report hal-01956501, HAL/archives-ouvertes.fr.
  • M. Motwani, M. Soto, Y. Brun, R. Just, and C. Le Goues (2020) Quality of automated program repair on real-world defects. IEEE Transactions on Software Engineering, pp. 1.
  • H. D. T. Nguyen, D. Qi, A. Roychoudhury, and S. Chandra (2013) SemFix: program repair via semantic analysis. In International Conference on Software Engineering.
  • C. Parnin and A. Orso (2011) Are automated debugging techniques actually helping programmers? In Proceedings of the 2011 International Symposium on Software Testing and Analysis, ISSTA ’11, New York, NY, USA, pp. 199–209.
  • T. J. Ryan, G. M. Alarcon, C. Walter, R. Gamble, S. A. Jessup, A. Capiola, and M. D. Pfahler (2019) Trust in automated software repair. In HCI for Cybersecurity, Privacy and Trust, A. Moallem (Ed.), Cham, pp. 452–470.
  • M. Schreier (2012) Qualitative content analysis in practice. Sage Publications.
  • R. Shariffdeen, Y. Noller, L. Grunske, and A. Roychoudhury (2021) Concolic program repair. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2021, New York, NY, USA, pp. 390–405.
  • R. S. Shariffdeen, S. H. Tan, M. Gao, and A. Roychoudhury (2021) Automated patch transplantation. ACM Transactions on Software Engineering and Methodology 30 (1).
  • S. H. Tan, H. Yoshida, M. R. Prasad, and A. Roychoudhury (2016) Anti-patterns in search-based program repair. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 727–738.
  • Y. Tao, J. Kim, S. Kim, and C. Xu (2014) Automatically generated patches as debugging aids: a human study. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, New York, NY, USA, pp. 64–74.
  • S. Urli, Z. Yu, L. Seinturier, and M. Monperrus (2018) How to design a program repair bot? Insights from the Repairnator project. In 2018 IEEE/ACM 40th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), pp. 95–104.
  • R. van Tonder and C. L. Goues (2018) Static automated program repair for heap properties. In Proceedings of the 40th International Conference on Software Engineering, pp. 151–162.
  • S. Wang, M. Wen, L. Chen, X. Yi, and X. Mao (2019) How different is it between machine-generated and developer-provided patches? An empirical study on the correct patches generated by automated program repair techniques. In 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 1–12.
  • W. Weimer, S. Forrest, M. Kim, C. Le Goues, and P. Hurley (2016) Trusted software repair for system resiliency. In 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W), pp. 238–241.
  • M. Wen, J. Chen, R. Wu, D. Hao, and S. Cheung (2017) An empirical analysis of the influence of fault space on search-based automated program repair. arXiv preprint arXiv:1707.05172.
  • C. Wong, P. Santiesteban, C. Kästner, and C. Le Goues (2021) VarFix: balancing edit expressiveness and search effectiveness in automated program repair. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2021, New York, NY, USA, pp. 354–366.
  • Y. Xiong, X. Liu, M. Zeng, L. Zhang, and G. Huang (2018) Identifying patch correctness in test-based program repair. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, New York, NY, USA, pp. 789–799.
  • D. Yang, Y. Qi, X. Mao, and Y. Lei (2020) Evaluating the usage of fault localization in automated program repair: an empirical study. Frontiers of Computer Science 15 (1), pp. 151202.
  • J. Yang, A. Zhikhartsev, Y. Liu, and L. Tan (2017) Better test cases for better automated program repair. In Joint Meeting on Foundations of Software Engineering (ESEC-FSE).