Out of the BLEU: how should we assess quality of the Code Generation models?

08/05/2022
by Mikhail Evtikhiev, et al.

In recent years, researchers have created and introduced a significant number of code generation models. As human evaluation of every new model version is unfeasible, the community has adopted automatic evaluation metrics such as BLEU to approximate the results of human judgement. These metrics originate from the machine translation domain, and it is unclear whether they are applicable to code generation tasks and how well they agree with human evaluation on this task. Two other metrics, CodeBLEU and RUBY, were developed specifically to estimate the similarity of code and to take code properties into account, but there are hardly any studies of their agreement with human evaluation either. Despite all this, minimal differences in metric scores are used to claim that one code generation model is superior to another. In this paper, we present a study of the applicability of six metrics, BLEU, ROUGE-L, METEOR, ChrF, CodeBLEU, and RUBY, to the evaluation of code generation models. We conduct the study on two different code generation datasets and use human annotators to assess the quality of all models run on these datasets. The results indicate that for the CoNaLa dataset of Python one-liners, none of the metrics can correctly emulate human judgement on which model is better with >95% certainty if the difference in model scores is less than 5 points. For the HearthStone dataset, which consists of classes of a particular structure, a difference in model scores of at least 2 points is enough to claim the superiority of one model over the other. Based on our findings, we derive several recommendations on using metrics to estimate model performance on the code generation task.
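To make the setup concrete, here is a rough illustrative sketch, not taken from the paper: it computes corpus-level BLEU and ChrF with the sacrebleu library for two hypothetical code generation models against made-up reference snippets, and runs a simple paired bootstrap to estimate how often one model actually outscores the other. All snippets, model names, and the resampling setup are assumptions for illustration; CodeBLEU and RUBY require dedicated implementations and are not shown.

```python
# Illustrative sketch only: surface-similarity metrics plus a paired bootstrap
# comparison for two hypothetical code generation models. Requires `sacrebleu`.
import random

import sacrebleu

# Made-up reference solutions and model outputs (placeholders, not real data).
references = [
    "x = [int(t) for t in s.split(',')]",
    "with open(path) as f:\n    data = f.read()",
]
model_a = [
    "x = [int(v) for v in s.split(',')]",
    "data = open(path).read()",
]
model_b = [
    "x = list(map(int, s.split(',')))",
    "with open(path) as f:\n    data = f.read()",
]


def corpus_scores(hyps, refs):
    """Corpus-level BLEU and ChrF (both on a 0-100 scale)."""
    bleu = sacrebleu.corpus_bleu(hyps, [refs]).score
    chrf = sacrebleu.corpus_chrf(hyps, [refs]).score
    return bleu, chrf


def paired_bootstrap(hyps_a, hyps_b, refs, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which model A outscores model B on BLEU."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        sampled_refs = [refs[i] for i in idx]
        bleu_a = sacrebleu.corpus_bleu([hyps_a[i] for i in idx], [sampled_refs]).score
        bleu_b = sacrebleu.corpus_bleu([hyps_b[i] for i in idx], [sampled_refs]).score
        wins_a += bleu_a > bleu_b
    return wins_a / n_resamples


print("model A (BLEU, ChrF):", corpus_scores(model_a, references))
print("model B (BLEU, ChrF):", corpus_scores(model_b, references))
print("P(A > B) under paired bootstrap:", paired_bootstrap(model_a, model_b, references))
```

With a realistically sized test set, a win rate close to 0.5 would suggest that a small gap in metric scores does not support claiming one model is better than the other.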

Related research

05/15/2023
NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist
In this study, we analyze NLG automatic metrics based on whether human e...

11/17/2022
Execution-based Evaluation for Data Science Code Generation Models
Code generation models can benefit data scientists' productivity by auto...

09/22/2020
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
Evaluation metrics play a vital role in the growth of an area as it defi...

10/12/2022
Better Smatch = Better Parser? AMR evaluation is not so simple anymore
Recently, astonishing advances have been observed in AMR parsing, as mea...

03/30/2022
Reproducibility Issues for BERT-based Evaluation Metrics
Reproducibility is of utmost concern in machine learning and natural lan...

03/29/2022
Investigating Data Variance in Evaluations of Automatic Machine Translation Metrics
Current practices in metric evaluation focus on one single dataset, e.g....

10/03/2020
Code to Comment "Translation": Data, Metrics, Baselining Evaluation
The relationship of comments to code, and in particular, the task of gen...