Multi-lingual Evaluation of Code Generation Models

10/26/2022
by Ben Athiwaratkun, et al.

We present MBXP, an execution-based code completion benchmark in 10+ programming languages. This collection of datasets is generated by our conversion framework, which translates the prompts and test cases of the original MBPP dataset into corresponding data in each target language. With this benchmark, we evaluate code generation models in a multi-lingual fashion; in particular, we study the generalization ability of language models on out-of-domain languages, the advantages of large multi-lingual models over mono-lingual ones, the benefits of few-shot prompting, and zero-shot translation abilities. In addition, we use our code generation model to perform large-scale bootstrapping and obtain synthetic canonical solutions in several languages. These solutions can be used for other code-related evaluations, such as insertion-based, summarization, or code translation tasks, for which we demonstrate results and which we release as part of our benchmark.
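As an illustration of the conversion idea, here is a minimal sketch in Python of how an MBPP-style record might be mapped to a JavaScript prompt and test harness. The record fields and the helper name `convert_to_javascript` are illustrative assumptions, not the authors' released MBXP framework.

```python
# Minimal sketch (assumptions, not the released MBXP code) of converting an
# MBPP-style record into a target-language prompt plus executable tests.

def convert_to_javascript(entry: dict) -> dict:
    """Map an MBPP-style record to a JavaScript prompt and test harness."""
    name = entry["entry_point"]            # target-language function name
    args = ", ".join(entry["arg_names"])
    # Function signature in the target language, with the natural-language
    # task description carried over as a doc comment.
    prompt = (
        "/**\n"
        f" * {entry['description']}\n"
        " */\n"
        f"function {name}({args}) {{\n"
    )
    # Each Python-style `assert f(input) == output` becomes an executable
    # JavaScript check; structural equality via JSON serialization.
    tests = "\n".join(
        f"if (JSON.stringify({name}({inp})) !== JSON.stringify({out})) "
        "throw new Error('test failed');"
        for inp, out in entry["test_cases"]
    )
    return {"prompt": prompt, "tests": tests}

# Hypothetical record in the spirit of an MBPP task.
example = {
    "entry_point": "removeOdd",
    "arg_names": ["nums"],
    "description": "Write a function to remove odd numbers from a given list.",
    "test_cases": [("[1, 2, 3]", "[2]"), ("[5, 6, 7]", "[6]")],
}
converted = convert_to_javascript(example)
print(converted["prompt"] + "    // model completes the body here\n}")
print(converted["tests"])
```

Under execution-based evaluation, a sampled completion is judged by whether the concatenated prompt, completion body, and test harness run without error; results are typically aggregated as pass@k over multiple samples.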


