ReCode: Robustness Evaluation of Code Generation Models

12/20/2022
by Shiqi Wang et al.

Code generation models have achieved impressive performance. However, they tend to be brittle: slight edits to a prompt can lead to very different generations. These robustness properties, critical for user experience when deployed in real-life applications, are not well understood. Most existing works on robustness in text or code tasks have focused on classification, while robustness in generation tasks is an uncharted area and to date there is no comprehensive benchmark for robustness in code generation. In this paper, we propose ReCode, a comprehensive robustness evaluation benchmark for code generation models. We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format. They are carefully designed to be natural in real-life coding practice, preserve the original semantic meaning, and thus provide multifaceted assessments of a model's robustness performance. With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt. In addition, we define robustness metrics for code generation models considering the worst-case behavior under each type of perturbation, taking advantage of the fact that executing the generated code can serve as objective evaluation. We demonstrate ReCode on SOTA models using HumanEval and MBPP, as well as function completion tasks derived from them. Interesting observations include: better robustness for CodeGen over InCoder and GPT-J; models are most sensitive to syntax perturbations; more challenging robustness evaluation on MBPP over HumanEval.
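To make the evaluation concrete, here is a minimal, hedged sketch of the two core ingredients. First, a toy semantics-preserving perturbation in the spirit of ReCode's variable-name transformations; a real transformation would parse the code so that only true identifiers are rewritten, while this illustration uses a simple word-boundary regex:

```python
import re

def rename_variable(code: str, old: str, new: str) -> str:
    # Word-boundary match so renaming `x` leaves names like `max` untouched.
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

print(rename_variable("def add(x, y):\n    return x + y", "x", "left"))
# def add(left, y):
#     return left + y
```

Second, a sketch of a worst-case robustness metric, assuming hypothetical `generate` and `passes_tests` helpers (placeholders, not part of any real API): a problem counts as robustly solved only if the generated code passes its unit tests under the original prompt and every perturbed variant.

```python
def worst_case_pass_rate(problems, perturbations, generate, passes_tests):
    """Fraction of problems solved under *all* prompt variants (worst case).

    problems:      iterable of (prompt, unit_tests) pairs
    perturbations: list of prompt -> perturbed-prompt functions
    """
    solved = 0
    for prompt, unit_tests in problems:
        variants = [prompt] + [perturb(prompt) for perturb in perturbations]
        # Worst case: a single failing variant marks the problem unsolved.
        if all(passes_tests(generate(v), unit_tests) for v in variants):
            solved += 1
    return solved / len(problems)
```

Because passing executable unit tests is an objective signal, this worst-case aggregation requires no human judgment of generation quality.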


