Large Language Models Are State-of-the-Art Evaluators of Code Generation

04/27/2023
by Terry Yue Zhuo, et al.

Recent advancements in natural language generation have facilitated the use of large language models (LLMs) to assess the quality of generated text. Although these models have shown promising results in tasks such as machine translation and summarization, their applicability to code generation remains limited without human involvement. The complexity of the programming concepts involved makes it difficult to develop evaluation metrics that align with human judgment. Token-matching metrics such as BLEU correlate weakly with human practitioners on code generation tasks, and relying on human-written test suites to evaluate functional correctness is challenging in low-resource domains. To overcome these obstacles, we propose a new evaluation framework based on GPT-3.5 for code generation assessment. Our framework addresses the limitations of existing approaches by achieving superior correlations with functional correctness and human preferences, without the need for test oracles or references. We evaluate the efficacy of our framework on two tasks and four programming languages, comparing its performance with the state-of-the-art CodeBERTScore metric, which relies on a pre-trained code model. Our results demonstrate that our framework surpasses CodeBERTScore, delivering high levels of accuracy and consistency across programming languages and tasks. We make our evaluation framework and datasets available to the public at <https://github.com/terryyz/llm-code-eval> to encourage further research in the evaluation of code generation.
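The core idea is to have GPT-3.5 judge generated code directly against the task description, without running tests or comparing to reference solutions. The authors' exact prompts and scoring rubric live in the linked repository; the snippet below is only a minimal, hypothetical sketch of the general approach, assuming the `openai` Python client (>=1.0), an `OPENAI_API_KEY` in the environment, and an illustrative 0-4 correctness scale. The function name and prompt wording are made up for illustration, not taken from the paper.

```python
# Minimal sketch of LLM-as-judge code evaluation (illustrative, not the authors' prompts).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """You will be given a programming problem and a candidate solution.
Rate the functional correctness of the solution on a scale from 0 (completely wrong)
to 4 (fully correct). Reply with the integer score only.

Problem:
{problem}

Candidate solution:
{code}
"""

def llm_code_score(problem: str, code: str, model: str = "gpt-3.5-turbo") -> int:
    """Ask the LLM judge to rate a candidate solution and parse its integer score."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judging
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(problem=problem, code=code)}],
    )
    reply = response.choices[0].message.content.strip()
    digits = [ch for ch in reply if ch.isdigit()]
    return int(digits[0]) if digits else 0

if __name__ == "__main__":
    score = llm_code_score(
        "Write a function that returns the sum of a list of integers.",
        "def total(xs):\n    return sum(xs)",
    )
    print(f"LLM judge score: {score}/4")
```

Because no test oracle or reference implementation is needed, a judge like this can be applied in low-resource languages or domains where writing test suites is impractical; the paper measures how well such scores correlate with functional correctness and human preference.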


Related research

Not All Metrics Are Guilty: Improving NLG Evaluation with LLM Paraphrasing (05/24/2023)
CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code (02/10/2023)
KPEval: Towards Fine-grained Semantic-based Evaluation of Keyphrase Extraction and Generation Systems (03/27/2023)
Can ChatGPT Assess Human Personalities? A General Evaluation Framework (03/01/2023)
Discriminating Human-authored from ChatGPT-Generated Code Via Discernable Feature Analysis (06/26/2023)
Execution-based Code Generation using Deep Reinforcement Learning (01/31/2023)
Evaluating and Explaining Large Language Models for Code Using Syntactic Structures (08/07/2023)
