MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation

08/17/2022
by Federico Cassano, et al.

Large language models have demonstrated the ability to generate both natural language and programming language text. Such models open up the possibility of multi-language code generation: could code generation models generalize knowledge from one language to another? Although contemporary code generation models can generate semantically correct Python code, little is known about their abilities with other languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. Using MultiPL-E, we create the first massively multilingual code generation benchmark by translating two popular Python benchmarks, HumanEval and MBPP, into 18 additional programming languages that span a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex, CodeGen, and InCoder. We find that Codex matches or even exceeds its Python performance on several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
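To make the benchmark-compilation idea concrete, below is a minimal sketch (in Python) of what translating a single unit-test-driven problem into another language might look like. The `Problem` dataclass, `to_ts_literal`, and `compile_problem` names are hypothetical and for illustration only; they are not MultiPL-E's actual interface, and the real translators handle far more of each target language's syntax, type annotations, and standard library.

```python
# Hypothetical sketch, not MultiPL-E's actual interface: compile one
# unit-test-driven benchmark problem into a TypeScript prompt plus tests.
from dataclasses import dataclass


@dataclass
class Problem:
    name: str           # function name the model must implement
    params: list[str]   # parameter names
    docstring: str      # natural-language task description
    tests: list[tuple]  # (args, expected) pairs that become assertions


def to_ts_literal(value) -> str:
    """Render a Python value as TypeScript literal syntax."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return '"' + value.replace('"', '\\"') + '"'
    if isinstance(value, list):
        return "[" + ", ".join(to_ts_literal(v) for v in value) + "]"
    raise ValueError(f"unsupported value: {value!r}")


def compile_problem(p: Problem) -> tuple[str, str]:
    """Return a (prompt, test_suite) pair in TypeScript for one problem."""
    comment = "\n".join("// " + line for line in p.docstring.splitlines())
    prompt = f"{comment}\nfunction {p.name}({', '.join(p.params)}) {{\n"
    test_suite = "\n".join(
        f"console.assert(JSON.stringify({p.name}("
        + ", ".join(to_ts_literal(a) for a in args)
        + f")) === JSON.stringify({to_ts_literal(expected)}));"
        for args, expected in p.tests
    )
    return prompt, test_suite


# Toy HumanEval-style problem translated to TypeScript.
problem = Problem(
    name="has_close_elements",
    params=["numbers", "threshold"],
    docstring="Check if any two numbers in the list are closer than the threshold.",
    tests=[(([1.0, 2.0, 3.9], 0.3), False), (([1.0, 2.8, 3.0], 0.3), True)],
)
prompt, tests = compile_problem(problem)
print(prompt)
print(tests)
```

A model completion would then be appended to the generated prompt and executed against the generated assertions. One compilation step of this kind per target language is what makes the approach scalable and extensible: adding another language means writing one more translator rather than hand-porting every problem.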


