ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

08/03/2023
by Xueying Du, et al.

In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e., class-level code generation. We first manually construct ClassEval, the first class-level code generation benchmark, comprising 100 class-level Python code generation tasks and built with approximately 500 person-hours of effort. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Our main findings are as follows. First, all existing LLMs perform much worse on class-level code generation than on standalone method-level code generation benchmarks such as HumanEval, and method-level coding ability does not equivalently reflect class-level coding ability among LLMs. Second, GPT-4 and GPT-3.5 still exhibit dominant superiority over other LLMs on class-level code generation, and the second-tier models include Instruct-Starcoder, Instruct-Codegen, and Wizardcoder, which perform very similarly. Third, generating the entire class all at once (i.e., the holistic generation strategy) is the best generation strategy only for GPT-4 and GPT-3.5, while method-by-method generation (i.e., the incremental and compositional strategies) is the better approach for the other models, which have limited ability to understand long instructions and to utilize information in the middle of the input. Lastly, we find that models have limited ability to generate method-dependent code, and we discuss the frequent error types in generated classes. Our benchmark is available at https://github.com/FudanSELab/ClassEval.
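The three generation strategies named above differ mainly in how many requests are made and what context each request sees. The following is a minimal sketch of that difference, not the benchmark's actual harness. It assumes that incremental generation feeds previously generated methods back into each request, while compositional generation produces each method independently and then assembles them. The `generate` callable, the prompt wording, and all function names are hypothetical stand-ins for an LLM call.

```python
from typing import Callable, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns generated code.
Generate = Callable[[str], str]


def holistic(skeleton: str, generate: Generate) -> str:
    """Ask for the entire class in a single request."""
    return generate(f"Complete the entire class:\n{skeleton}")


def incremental(header: str, method_specs: List[str], generate: Generate) -> str:
    """Generate one method per request; each request sees the methods
    generated so far (assumed reading of 'incremental')."""
    class_so_far = header
    for spec in method_specs:
        body = generate(
            f"Class so far:\n{class_so_far}\nImplement the next method:\n{spec}"
        )
        class_so_far += "\n" + body
    return class_so_far


def compositional(header: str, method_specs: List[str], generate: Generate) -> str:
    """Generate each method independently from the class header alone,
    then assemble the results (assumed reading of 'compositional')."""
    bodies = [
        generate(f"Class:\n{header}\nImplement the method:\n{spec}")
        for spec in method_specs
    ]
    return "\n".join([header, *bodies])


def make_recording_stub(log: List[str]) -> Generate:
    """Build a fake generator that records every prompt it receives."""
    def stub(prompt: str) -> str:
        log.append(prompt)
        return f"    # generated code {len(log)}"
    return stub


if __name__ == "__main__":
    prompts: List[str] = []
    specs = ["def deposit(self, amount)", "def withdraw(self, amount)"]
    print(incremental("class Account:", specs, make_recording_stub(prompts)))
    print(f"requests made: {len(prompts)}")
```

With the recording stub, the second incremental prompt contains the first generated body, whereas compositional prompts never contain earlier output. That is the kind of cross-method context the abstract's remark about method-dependent code refers to.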

