A Study on Robustness and Reliability of Large Language Model Code Generation

08/20/2023
by Li Zhong, et al.

Recently, large language models (LLMs) have shown an extraordinary ability to understand natural language and generate programming code. It has become common practice for software engineers to consult LLMs when encountering coding questions. Although efforts have been made to avoid syntax errors and align the generated code with the intended semantics, the reliability and robustness of code generation from LLMs have not yet been thoroughly studied. Executable code is not equivalent to reliable and robust code, especially in the context of real-world software development. API misuse in generated code can lead to severe problems, such as resource leaks and program crashes. Worse still, the users of LLM code generation services are the developers most vulnerable to such seemingly correct code: they are often novice developers unfamiliar with the APIs for which the LLM generates code. They can therefore hardly spot the misuse in code generated by LLMs, which further allows incorrect code to be applied in real-world software. Existing code evaluation benchmarks and datasets focus on crafting small tasks, such as programming questions from coding interviews, which deviate from the problems developers actually bring to LLMs for real-world coding help. To fill this missing piece, in this work we propose RobustAPI, a dataset for evaluating the reliability and robustness of code generated by LLMs. We collect 1208 coding questions from StackOverflow covering 24 representative Java APIs. We summarize the common misuse patterns of these APIs and evaluate current popular LLMs against them. The evaluation results show that even for GPT-4, 62% of the generated code contains API misuses, which would cause unexpected consequences if the code were introduced into real-world software.
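The kind of API misuse the abstract describes can be illustrated with a classic Java resource-leak pattern. The sketch below is hypothetical (not drawn from the RobustAPI dataset): a stream that is closed only on the happy path leaks when an exception is thrown, while try-with-resources guarantees cleanup on every path.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ApiMisuseExample {

    // Misuse: if read() throws, close() is never reached, leaking the stream.
    static int readFirstByteLeaky(InputStream in) throws IOException {
        int b = in.read();
        in.close(); // skipped entirely when read() throws
        return b;
    }

    // Robust: try-with-resources calls close() on both normal and
    // exceptional exits, which is the pattern a reliability checker expects.
    static int readFirstByteSafe(InputStream in) throws IOException {
        try (InputStream s = in) {
            return s.read();
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream data = new ByteArrayInputStream(new byte[] {42});
        System.out.println(readFirstByteSafe(data)); // prints 42
    }
}
```

Both methods return the same value on well-behaved input; the difference only surfaces under failure, which is exactly why novice developers rarely notice it in generated code.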

