Adversarial Patch Generation for Automatic Program Repair

12/21/2020 ∙ by Abdulaziz Alhefdhi, et al. ∙ University of Wollongong ∙ The University of Melbourne

Automatic program repair (APR) has seen growing interest in recent years, with numerous techniques proposed. One notable line of research in APR is search-based techniques, which generate repair candidates via syntactic analyses and search for valid repairs in the generated search space. In this work, we explore an alternative approach inspired by the adversarial notion of bugs and repairs. Our approach leverages the deep learning Generative Adversarial Networks (GANs) architecture to suggest repairs that are as close as possible to human-generated repairs. Preliminary evaluations demonstrate promising results: our approach generates repairs identical to human fixes for 21.2% of the evaluated bugs.


Introduction

Software bugs are costly to detect and rectify [19, 6, 4]. Due to short time-to-market, software programs are often delivered with known or unknown bugs [1]. The number of bugs may far exceed the amount of human resources available to address them. As a result, it can take days or even years for software defects to be repaired [4]. Automated approaches to detecting and rectifying software bugs are thus of tremendous value in reducing these costs.

Automatic program repair (APR) is now an active and exciting research area that engages both academia and the software industry. Real-world defects from large programs have been shown to be efficiently and effectively repaired by state-of-the-art APR techniques [22]. Notably, in 2018, Facebook announced the first industrial-scale APR technique, Getafix [2], which has been developed and is widely used in-house. Getafix was directly inspired by recent research work in APR [11], demonstrating that research in APR has great potential to make a practical and immediate impact.

APR can generally be divided into two main families: search-based and semantics-based approaches. While semantics-based approaches use semantic analyses such as static analysis or symbolic execution [17, 13], search-based approaches often use syntactic analyses to generate and traverse a syntactic search space, using techniques such as genetic programming [22], pattern recognition [11], and machine learning [16]. Semantics-based approaches are typically precise, but are limited by the capability of the underlying semantic analysers (e.g., symbolic execution [17, 13]). Search-based approaches typically generate a large syntactic search space, rendering it difficult to navigate in order to find correct solutions.

Both APR families, however, suffer from the same issue, namely overfitting, in which generated repairs fail to generalise beyond the specifications (e.g., the test suite) used to generate them, hindering the efficiency and effectiveness of APR [12, 20]. The overfitting problem occurs mainly because software specifications such as test suites are known to be incomplete in practice: they do not comprehensively cover all unintended behaviours of a software program. Hence, using these approaches in practice requires additional resources beyond the incomplete specifications to reliably generate and suggest a few high-quality candidate patches for developers to review in a timely manner.
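As a toy illustration (ours, not drawn from the paper or the cited work), consider a buggy function, an incomplete test suite, and a repair that satisfies every test while remaining wrong on almost all untested inputs:

    # A buggy function and an incomplete test suite (illustrative only).
    def abs_value(x):
        return x  # bug: negative inputs are returned unchanged

    def test_abs_value():
        assert abs_value(3) == 3
        assert abs_value(-2) == 2  # the only negative input ever tested

    # An "overfitting" repair: it passes every test above, yet it is still
    # wrong for every negative input other than -2.
    def abs_value_patched(x):
        return 2 if x == -2 else x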

The availability of millions of software projects (e.g., over 17 million on GitHub) gives us an unprecedented opportunity to fix software defects and vulnerabilities. Massive repositories of data about defects, vulnerabilities, and code patches enable us to automatically learn and discover patterns of vulnerabilities and their patches. These empower us to develop Artificial Intelligence/Machine Learning (AI/ML) algorithms that are capable of automatically learning patterns of bug fixes and suggesting fixes for newly discovered bugs.

Traditional machine learning techniques have been utilised for patch ranking (e.g., [15]). Recent work (e.g., [14, 5]) has started to apply breakthroughs in deep learning, particularly neural machine translation (NMT), to APR. These approaches often formulate APR as an NMT problem: translating buggy code into fixed code. Most of them use a sequence-to-sequence translation model which typically consists of two main components: an encoder, which takes as input a buggy code sequence, and a decoder, which generates a corresponding fixed code sequence. Such models often rely on the maximum likelihood estimation principle for training (i.e., maximising the probability of the target ground-truth code sequence conditioned on the source sequence). Hence, fixes generated by these models might not be natural and correct with respect to human-generated fixes. In this article, we present a new approach, which leverages the deep learning Generative Adversarial Networks (GANs) architecture [8], to automatically generate bug fixes.
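As a concrete, deliberately minimal illustration of this NMT formulation, the following PyTorch sketch trains an LSTM encoder-decoder with teacher forcing and cross-entropy, which is precisely the maximum likelihood objective described above; the vocabulary size, dimensions, and random batch are placeholders, not the configuration of any system discussed here.

    import torch
    import torch.nn as nn

    VOCAB, EMB, HID = 5000, 128, 256  # illustrative sizes only

    class Seq2SeqRepair(nn.Module):
        """Encoder-decoder over token ids: buggy sequence -> fixed sequence."""
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, EMB)
            self.encoder = nn.LSTM(EMB, HID, batch_first=True)
            self.decoder = nn.LSTM(EMB, HID, batch_first=True)
            self.out = nn.Linear(HID, VOCAB)

        def forward(self, buggy_ids, fixed_ids):
            # Encode the buggy token sequence into a final hidden state.
            _, state = self.encoder(self.embed(buggy_ids))
            # Teacher forcing: feed the ground-truth prefix, predict the next token.
            dec_out, _ = self.decoder(self.embed(fixed_ids[:, :-1]), state)
            return self.out(dec_out)  # logits over the vocabulary at each step

    model = Seq2SeqRepair()
    loss_fn = nn.CrossEntropyLoss()  # negative log-likelihood == the MLE objective
    buggy = torch.randint(0, VOCAB, (8, 20))  # a batch of tokenised buggy code
    fixed = torch.randint(0, VOCAB, (8, 20))  # the corresponding human fixes
    logits = model(buggy, fixed)
    # Maximise p(fixed | buggy) by minimising cross-entropy on next-token targets.
    loss = loss_fn(logits.reshape(-1, VOCAB), fixed[:, 1:].reshape(-1))
    loss.backward()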

Adversarial Patch Generation

Figure 1: The overall workflow of our approach. In the training phase, the Patch Discriminator is used to help the Patch Generator learn to imitate human fixes. The trained Patch Generator is then used in the production phase to generate patches for buggy code.

Many adversarial scenarios exist in everyday life. For example, a counterfeiter tries to deposit fake money while a bank teller tries to detect it. The counterfeiter tries to produce more realistic money each time, while the bank teller tries to improve their detection skills. These adversarial settings, whether initiated intentionally or not, can be exploited for learning and development purposes. The well-known deep learning architecture of Generative Adversarial Networks (GANs) was introduced in 2014 by Goodfellow et al. [8] to explicitly exploit such settings. This architecture consists of two neural networks (a neural network, loosely inspired by how the human brain and its networks of neurons process information, consists of layers of nodes connected to each other by weights): a generator and a discriminator. The former aims to generate new data examples from existing data, while the latter aims to detect the authenticity of those examples. The two are connected and trained together in a zero-sum game. The generator tries to fool the discriminator into thinking that the data it generates are “real”, i.e., data points from the dataset under experimentation. The discriminator tries to distinguish between “fake” data produced by the generator and real data from the dataset. The adversarial setting of GANs allows the two networks to learn from each other and consequently improve their own performance over time.
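To make this zero-sum training concrete, the following minimal PyTorch sketch shows one GAN training step on toy vector data; the networks, sizes, and data are illustrative stand-ins rather than the models used in this work.

    import torch
    import torch.nn as nn

    DATA_DIM, NOISE_DIM = 32, 16  # illustrative sizes

    # The generator maps random noise to a fake data point; the discriminator
    # outputs the probability that its input is real.
    G = nn.Sequential(nn.Linear(NOISE_DIM, 64), nn.ReLU(), nn.Linear(64, DATA_DIM))
    D = nn.Sequential(nn.Linear(DATA_DIM, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    real = torch.randn(8, DATA_DIM)   # stand-in for real training examples
    noise = torch.randn(8, NOISE_DIM)

    # Discriminator step: label real data 1, generated data 0.
    fake = G(noise).detach()          # detach: do not update G on this step
    d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake), torch.zeros(8, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make D label generated data as real (1).
    g_loss = bce(D(G(noise)), torch.ones(8, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()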

Bugs and patches can be seen as adversaries: bugs are introduced into code, and patches are provided to eliminate those bugs. We thus leverage the ideas behind GANs to develop a system which aims to generate patches that are as close as possible to human-generated fixes. This system consists of two main components: a Patch Generator and a Patch Discriminator (see Figure 1). The Patch Generator takes as input a buggy code fragment (e.g., t.setAutoFlush(false);) and generates a new code fragment in which the bug has (hopefully) been fixed (e.g., t.setAutoFlush(true);). We adapt the Conditional GANs (cGANs) architecture [18] (an extension of GANs) to develop the Patch Generator, since cGANs allow the output (i.e., the patched code) to be conditioned on a given input (i.e., the buggy code).

On the other hand, the Patch Discriminator takes as input a pair of buggy and patched code fragments and determines whether the patched code was provided by human developers (assuming it was correctly confirmed by them) or artificially generated. The key novelty here is that the Patch Generator and Patch Discriminator are trained together in a way that makes them compete with one another (i.e., one's loss is the other's gain). This mechanism enables the Patch Generator to learn to produce fixes that resemble the way human developers fix bugs.

We have implemented the Patch Generator using the encoder-decoder architecture for sequence-to-sequence learning [21]. The encoder and decoder are two Long Short-Term Memory (LSTM) networks [9]. The Patch Discriminator was also implemented using LSTMs, but with a sigmoid activation. Input code fragments are first tokenised (based on the syntax and grammar of the language in which the code was written) into code sequences that serve as input to our model. We also implemented an attention mechanism [3] for enhanced mapping between input and output units. To aid our model in capturing the ground-truth distribution, random noise was introduced by applying dropout [7] to the Patch Generator's layers [10].
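The sketch below illustrates one plausible shape for such a discriminator: an LSTM that reads the buggy and patched token sequences (here simply concatenated with a separator token) and emits a sigmoid probability that the patch is human-written. The separator, sizes, and pairing scheme are assumptions made for illustration, not the exact implementation.

    import torch
    import torch.nn as nn

    VOCAB, EMB, HID, SEP = 5000, 128, 256, 0  # SEP: assumed separator token id

    class PatchDiscriminator(nn.Module):
        """Scores a (buggy, patch) pair: probability the patch is human-written."""
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, EMB)
            self.lstm = nn.LSTM(EMB, HID, batch_first=True)
            self.classify = nn.Sequential(nn.Linear(HID, 1), nn.Sigmoid())

        def forward(self, buggy_ids, patch_ids):
            # Concatenate buggy code, a separator, and the candidate patch.
            sep = torch.full((buggy_ids.size(0), 1), SEP, dtype=torch.long)
            pair = torch.cat([buggy_ids, sep, patch_ids], dim=1)
            _, (h, _) = self.lstm(self.embed(pair))
            return self.classify(h[-1])  # shape (batch, 1), values in (0, 1)

    disc = PatchDiscriminator()
    buggy = torch.randint(1, VOCAB, (8, 20))
    patch = torch.randint(1, VOCAB, (8, 20))
    p_human = disc(buggy, patch)  # ~1.0 => judged a human fix, ~0.0 => generated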

Buggy Code Fragment | Human-Suggested Fix | Model-Generated Fix | Notes
return new HiveQueryResultSet.Builder() | return new HiveQueryResultSet.Builder(null) | return new HiveQueryResultSet.Builder(null) | Identical fix
detectDeadlock(e, "unlock"); | detectDeadlock(dbConn, e, "unlock"); | detectDeadlock(dbConn, e, +++e); | Partial solution: adds the dbConn argument; syntax error
Utilities.clearWorkMap(); | Utilities.clearWorkMap(jconf); | Utilities.clearWorkMap(jc); | Different fix: different variable name
Processor childProcessor = routeContext.createProcessor(this); | Processor childProcessor = this.createChildProcessor(routeContext, true); | Processor childProcessor = this.createChildProcessor(routeContext, false); | Similar fix (solution guide/hint); semantic difference: false instead of true
Table 1: Sample fixes generated by our model compared with the human fixes

To train our model, we used the BigFix dataset [14], which consists of pairs of buggy code and the corresponding human-provided fixed code. We removed duplicate pairs and collected only single-line fixes (as done in previous work [14, 5]). Our final dataset consists of 5,749 buggy-fixed pairs, 500 of which were used for evaluation while the rest were used for training. During training, each buggy code fragment is fed to the Patch Generator, which generates a candidate repair for the buggy code. The Patch Discriminator then examines two pairs. The first pair consists of the buggy code and the code fixed by human developers (collected from the dataset, e.g., t.setAutoFlush(false, true);). The second pair consists of the same buggy code and the code generated by the Patch Generator. The Patch Discriminator is trained to determine which pair contains a human fix and which contains an artificially generated fix.
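The following sketch shows how this data preparation could look, assuming the dataset is available as a list of (buggy, fixed) string pairs; the load_bigfix() helper is hypothetical, while the 500-pair evaluation split mirrors the numbers above.

    import random

    def prepare(pairs, eval_size=500, seed=0):
        """pairs: list of (buggy_code, fixed_code) strings from BigFix."""
        # Keep only single-line fixes, as in prior work [14, 5].
        single = [(b, f) for b, f in pairs if "\n" not in b and "\n" not in f]
        # Remove duplicate pairs while preserving order.
        unique = list(dict.fromkeys(single))
        random.Random(seed).shuffle(unique)
        return unique[eval_size:], unique[:eval_size]  # train, eval

    # train, eval = prepare(load_bigfix())  # load_bigfix() is hypothetical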

During training, the Patch Discriminator passes the gradient of its loss (reflecting the degree to which it correctly classifies an artificial or human fix) to the Patch Generator. The Patch Generator uses this loss (combined with the loss calculated from the difference between the generated fixes and the ground truths) to improve its learning and keeps trying to “fool” the Patch Discriminator by generating fixes as close as possible to human fixes. Once the training phase is finished, we use the trained Patch Generator for bug fixing and discard the Patch Discriminator; its job in our setting is to help improve the Patch Generator in the patch-generation game. In the production phase (see Figure 1), the trained Patch Generator is presented with new buggy code fragments (e.g., Processor childProcessor = routeContext.createProcessor(this);) and suggests patches for fixing the bugs (e.g., Processor childProcessor = this.createChildProcessor(routeContext, false);).
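The sketch below illustrates one generator update under this combined objective. Because generated tokens are discrete, the sketch feeds the discriminator soft token distributions (a common relaxation for discrete GAN outputs) so that the adversarial gradient can flow back into the generator; the stand-in modules, the relaxation, and the weighting factor LAM are illustrative assumptions, not the exact training procedure used in this work.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, EMB, HID, LAM = 1000, 64, 128, 0.5  # LAM: assumed loss weight

    embed = nn.Embedding(VOCAB, EMB)                 # shared token embedding
    generator = nn.LSTM(EMB, HID, batch_first=True)  # stand-in for the seq2seq generator
    to_vocab = nn.Linear(HID, VOCAB)
    disc = nn.LSTM(EMB, HID, batch_first=True)       # stand-in patch discriminator
    disc_head = nn.Sequential(nn.Linear(HID, 1), nn.Sigmoid())
    bce = nn.BCELoss()

    def d_score(buggy_ids, patch_probs):
        # Embed the (hard) buggy tokens and the (soft) generated patch tokens;
        # soft embeddings keep the whole computation differentiable.
        pair = torch.cat([embed(buggy_ids), patch_probs @ embed.weight], dim=1)
        _, (h, _) = disc(pair)
        return disc_head(h[-1])

    buggy = torch.randint(0, VOCAB, (4, 12))
    truth = torch.randint(0, VOCAB, (4, 12))

    hidden, _ = generator(embed(buggy))
    logits = to_vocab(hidden)  # (batch, steps, VOCAB)

    # 1) Reconstruction loss: match the ground-truth human fix.
    rec_loss = F.cross_entropy(logits.reshape(-1, VOCAB), truth.reshape(-1))
    # 2) Adversarial loss: fool the discriminator into predicting "human" (1).
    adv_loss = bce(d_score(buggy, logits.softmax(-1)), torch.ones(4, 1))

    (rec_loss + LAM * adv_loss).backward()  # combined generator update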

Early Results and Future Directions

Our current experimentation with the approach demonstrates promising results. Overall, on the 500 bugs used for evaluation, our model achieves a BLEU-4 score of 0.42 and generates 106 fixes (21.2%) that are identical to the human fixes (i.e., the ground truths). Table 1 presents a few examples of how our model generates fixes compared with the fixes provided by software developers.
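For reference, these two metrics can be computed along the following lines with NLTK; the token lists shown are placeholders for tokenised (ground-truth, generated) fixes.

    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    # Each generated fix is compared against its single ground-truth human fix.
    references = [["t", ".", "setAutoFlush", "(", "true", ")", ";"]]
    hypotheses = [["t", ".", "setAutoFlush", "(", "true", ")", ";"]]

    # BLEU-4: default weights (0.25, 0.25, 0.25, 0.25) over 1..4-grams;
    # smoothing avoids zero scores on short sequences.
    bleu4 = corpus_bleu([[r] for r in references], hypotheses,
                        smoothing_function=SmoothingFunction().method1)

    # Exact-match rate: fraction of fixes identical to the human fix.
    exact = sum(h == r for h, r in zip(hypotheses, references)) / len(references)
    print(f"BLEU-4 = {bleu4:.2f}, exact match = {exact:.1%}")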

The first example in Table 1 demonstrates that our model is able to generate fixes that are exactly the same as the human-generated fixes: the patch correctly adds a parameter (null) to the constructor. The remaining three examples in Table 1 represent cases where the generated fixes provide partial solutions that differ slightly from the human fixes. They offer valuable insights for further improvement and deployment of our approach, which we discuss in detail below.

In the second example, our model correctly suggests the addition of the argument dbConn to fix the bug, but it changes the last argument (from "unlock" to +++e), resulting in a syntax error. Such errors can be detected by employing a program-analysis filter (working in a similar way to a compiler) that checks the syntactic correctness of the suggested patches. This filtering functionality can be added as a post-processing step after the Patch Generator to ensure that syntactically incorrect candidate patches are removed. Alternatively, it can be added to the Patch Discriminator to augment its capability of detecting syntactically incorrect patches (and the Patch Generator will indirectly benefit from this due to the adversarial learning setting we set up).
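A sketch of such a filter for Java fragments is shown below, using the third-party javalang parser; wrapping each statement-level patch in a dummy class and method (so the parser sees a complete compilation unit) is our assumption about how fragments would be checked.

    import javalang  # third-party Java parser: pip install javalang

    def is_syntactically_valid(fragment: str) -> bool:
        """Return True if a statement-level patch parses as Java.

        The fragment is wrapped in a dummy class/method so that the parser
        sees a complete compilation unit rather than a bare statement.
        """
        wrapped = "class _W { void _m() { " + fragment + " } }"
        try:
            javalang.parse.parse(wrapped)
            return True
        except Exception:  # javalang raises tokenisation/parse errors here
            return False

    assert is_syntactically_valid('Utilities.clearWorkMap(jc);')
    assert not is_syntactically_valid('return new Builder(')  # unbalanced: rejected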

The third example demonstrates a case where the generated patch is similar to the human fix in suggesting the addition of an argument to the function clearWorkMap(). However, they differ in the variable used: our model suggested the variable jc, while the human developer used jconf. If jc does not exist, the filtering mechanism above can easily detect and remove this candidate patch. If jc does exist, however, this suggests another direction for improving our work: we need to widen our input scope and capture the context of the surrounding code, especially, in this case, the definition-use dependencies between variables. Encoding this context into the Patch Generator and/or the Patch Discriminator would help improve the correctness of generated patches.

Finally, the last example demonstrates that our model is capable of generating complex fixes that involve multiple correct changes (e.g., adding this, calling the correct method createChildProcessor instead of createProcessor, and passing the correct number of arguments with a correct first argument). Compared to state-of-the-art repair techniques such as [22, 11, 13], this result is impressive, as the fix involves multiple actions/mutations applied together. The model, however, incorrectly suggested the value of the last argument (false instead of true). Although this looks simple, it requires the capability of understanding the semantics of the code and its intended behaviour. Semantics-based approaches (e.g., [17, 13]) rely on underlying semantic analysers (e.g., symbolic execution) to provide this capability. We plan to explore how to combine our approach with semantics-based approaches to better integrate syntactic and semantic reasoning.

In summary, we have presented in this article an AI-powered approach to automatic program repair. Our approach leverages the notions of Generative Adversarial Networks (which naturally match the adversarial setting of bugs and repairs) to develop a new, adversarial patch-generation framework. Our approach is automated in that it does not require hard-coding of bug-fixing rules. In addition, our solution does not require a set of test cases, enabling early repair of and intervention in software bugs (thus avoiding significant cost increases later). Furthermore, our approach does not perform an enumerative search for repairs, but learns to fix a common class of bugs directly from patch examples. Thus, we can offer a fast and scalable solution that can generalise across software applications. The promising results from our preliminary evaluation not only suggest the feasibility of deploying our approach in practice, but also generate insights into future directions for further enhancement.

References

  • [1] J. Anvik, L. Hiew, and G. C. Murphy (2005) Coping with an open bug repository. In Proceedings of the OOPSLA workshop on Eclipse technology eXchange, pp. 35–39. Cited by: INTRODUCTION.
  • [2] J. Bader, A. Scott, M. Pradel, and S. Chandra (2019) Getafix: learning to fix bugs automatically. Proceedings of the ACM on Programming Languages 3 (OOPSLA), pp. 1–27. Cited by: INTRODUCTION.
  • [3] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: Adversarial Patch Generation.
  • [4] M. Böhme and A. Roychoudhury (2014) Corebench: studying complexity of regression errors. In Proceedings of the International Symposium on Software Testing and Analysis, pp. 105–115. Cited by: INTRODUCTION.
  • [5] Z. Chen, S. J. Kommrusch, M. Tufano, L. Pouchet, D. Poshyvanyk, and M. Monperrus (2019) Sequencer: sequence-to-sequence learning for end-to-end program repair. IEEE Transactions on Software Engineering. Cited by: INTRODUCTION, Adversarial Patch Generation.
  • [6] E. Engström and P. Runeson (2010) A qualitative survey of regression testing practices. In International Conference on Product Focused Software Process Improvement, pp. 3–16. Cited by: INTRODUCTION.
  • [7] Y. Gal and Z. Ghahramani (2016) A theoretically grounded application of dropout in recurrent neural networks. Advances in Neural Information Processing Systems 29, pp. 1019–1027. Cited by: Adversarial Patch Generation.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: INTRODUCTION, Adversarial Patch Generation.
  • [9] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: Adversarial Patch Generation.
  • [10] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134. Cited by: Adversarial Patch Generation.
  • [11] X. B. D. Le, D. Lo, and C. Le Goues (2016) History driven program repair. In 23rd International Conference on Software Analysis, Evolution, and Reengineering, Vol. 1, pp. 213–224. Cited by: INTRODUCTION, INTRODUCTION, EARLY RESULTS AND FUTURE DIRECTIONS.
  • [12] X. B. D. Le, F. Thung, D. Lo, and C. Le Goues (2018) Overfitting in semantics-based automated program repair. Empirical Software Engineering 23 (5), pp. 3007–3033. Cited by: INTRODUCTION.
  • [13] X. D. Le, D. Chu, D. Lo, C. Le Goues, and W. Visser (2017) S3: syntax-and semantic-guided repair synthesis via programming by examples. In Proceedings of the 11th Joint Meeting on Foundations of Software Engineering, pp. 593–604. Cited by: INTRODUCTION, EARLY RESULTS AND FUTURE DIRECTIONS.
  • [14] Y. Li, S. Wang, and T. N. Nguyen (2020) DLfix: context-based code transformation learning for automated program repair. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 602–614. Cited by: INTRODUCTION, Adversarial Patch Generation.
  • [15] F. Long, P. Amidon, and M. Rinard (2017) Automatic inference of code transforms for patch generation. In Proceedings of the 11th Joint Meeting on Foundations of Software Engineering, pp. 727–739. Cited by: INTRODUCTION.
  • [16] F. Long and M. Rinard (2016) Automatic patch generation by learning correct code. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 298–312. Cited by: INTRODUCTION.
  • [17] S. Mechtaev, J. Yi, and A. Roychoudhury (2016) Angelix: scalable multiline program patch synthesis via symbolic analysis. In Proceedings of the 38th international conference on software engineering, pp. 691–701. Cited by: INTRODUCTION, EARLY RESULTS AND FUTURE DIRECTIONS.
  • [18] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: Adversarial Patch Generation.
  • [19] A. K. Onoma, W. Tsai, M. Poonawala, and H. Suganuma (1998) Regression testing in an industrial environment. Communications of the ACM 41 (5), pp. 81–86. Cited by: INTRODUCTION.
  • [20] E. K. Smith, E. T. Barr, C. Le Goues, and Y. Brun (2015) Is the cure worse than the disease? overfitting in automated program repair. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering, pp. 532–543. Cited by: INTRODUCTION.
  • [21] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: Adversarial Patch Generation.
  • [22] W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest (2009) Automatically finding patches using genetic programming. In 31st International Conference on Software Engineering, pp. 364–374. Cited by: INTRODUCTION, INTRODUCTION, EARLY RESULTS AND FUTURE DIRECTIONS.