CodeScore: Evaluating Code Generation by Learning Code Execution

01/22/2023
by Yihong Dong, et al.

A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, an important research field in NLP and software engineering. Prevailing CEMs fall into two categories: match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) and execution-based CEMs (e.g., AvgPassRatio and Pass@k), and both suffer from shortcomings. The former measure only differences in surface form, regardless of the functional equivalence of code, while the latter incur huge execution overheads, including collecting expensive test cases, resolving tedious execution dependencies, and enormous execution time. To address these issues, we propose CodeScore, an efficient and effective CEM for code generation that estimates the test-case PassRatio of generated code without executing it. We also present a framework named UniCE for training unified code evaluation models by learning code execution, i.e., learning the PassRatio and Executability of generated code. To learn code execution comprehensively, we construct more than 100 test cases for each task in several popular benchmark datasets, covering MBPP, APPS, and HumanEval. Experimental results show that CodeScore achieves a state-of-the-art correlation with execution-based CEMs: CodeScore is strongly correlated with AvgPassRatio, and binary CodeScore is moderately correlated with Pass@1. In particular, CodeScore eliminates the need for test cases and execution dependencies at inference time, and it reduces execution time by three orders of magnitude compared to AvgPassRatio and Pass@1.
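To make the two execution-based targets concrete, below is a minimal Python sketch (not the paper's implementation) of PassRatio, the fraction of test cases a generated program passes, and of the standard unbiased Pass@k estimator from Chen et al. (2021). The helper names (pass_ratio, pass_at_k) and the toy task are illustrative assumptions; a real evaluator would also sandbox execution and enforce timeouts.

# Sketch of the execution-based metrics CodeScore is trained to
# approximate (illustrative, not the paper's code).
from math import comb


def pass_ratio(code: str, test_cases: list[tuple[str, object]]) -> float:
    """Execute `code`, then check each (expression, expected) test case.

    Returns the fraction of test cases whose evaluated expression equals
    the expected value; any crash counts as a failure.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)  # define the candidate solution
    except Exception:
        return 0.0  # code that does not even execute passes nothing
    passed = 0
    for expression, expected in test_cases:
        try:
            if eval(expression, namespace) == expected:
                passed += 1
        except Exception:
            pass  # runtime error on this test case -> failure
    return passed / len(test_cases)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k samples,
    drawn from n generations of which c are correct, passes all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    candidate = "def add(a, b):\n    return a + b"
    tests = [("add(1, 2)", 3), ("add(-1, 1)", 0), ("add(0, 0)", 1)]  # last test is wrong on purpose
    print(pass_ratio(candidate, tests))  # 0.666..., i.e., 2 of 3 tests pass
    print(pass_at_k(n=10, c=3, k=1))     # 0.3

CodeScore replaces the call to pass_ratio with a learned model that predicts this value directly from the source text and task, which is what removes the test cases, dependencies, and execution time from inference.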



Related research

02/16/2023
LEVER: Learning to Verify Language-to-Code Generation with Execution
The advent of large language models trained on code (code LLMs) has led ...

12/20/2022
Execution-Based Evaluation for Open-Domain Code Generation
To extend the scope of coding queries to more realistic settings, we pro...

05/06/2023
Self-Edit: Fault-Aware Code Editor for Code Generation
Large language models (LLMs) have demonstrated an impressive ability to ...

07/21/2022
CodeT: Code Generation with Generated Tests
The task of generating code solutions for a given programming problem ca...

09/21/2023
Revealing Performance Issues in Server-side WebAssembly Runtimes via Differential Testing
WebAssembly (Wasm) is a bytecode format originally serving as a compilat...

11/09/2021
Test cases as a measurement instrument in experimentation
Background: Test suites are frequently used to quantify relevant softwar...

11/17/2022
Execution-based Evaluation for Data Science Code Generation Models
Code generation models can benefit data scientists' productivity by auto...
