CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

02/10/2023
by Shuyan Zhou, et al.

Since the rise of neural models of code that can generate long expressions and statements, rather than only a single next token, one of the major challenges has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an automatic evaluation metric for code generation that builds on BERTScore (Zhang et al., 2020). Instead of measuring exact token matching, as BLEU does, CodeBERTScore computes a soft similarity score between each token in the generated code and each token in the reference code, using the contextual encodings of large pretrained models. Further, instead of encoding only the generated tokens, as in BERTScore, CodeBERTScore also encodes the programmatic context surrounding the generated code. We perform an extensive evaluation of CodeBERTScore across four programming languages and find that it achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher CodeBERTScore is more likely to be preferred by humans and to function correctly when executed. Finally, while CodeBERTScore can be used with a multilingual CodeBERT as its base model, we release five language-specific pretrained models for use with our publicly available code at https://github.com/neulab/code-bert-score. Our language-specific models have been downloaded more than 25,000 times from the Hugging Face Hub.
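To make the idea concrete, below is a minimal sketch of a BERTScore-style soft-matching score for code: encode the candidate and the reference with a pretrained code encoder, compute pairwise cosine similarities between token embeddings, and take greedy max-similarity matches for precision and recall. The model id "neulab/codebert-python" is assumed here as one of the released language-specific checkpoints, and the simple string-concatenation handling of the surrounding context is an illustration only, not the official implementation from the repository above.

import torch
from transformers import AutoTokenizer, AutoModel

# Assumed model id for one of the released language-specific encoders.
MODEL_NAME = "neulab/codebert-python"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def encode(text: str) -> torch.Tensor:
    """Return contextual token embeddings, dropping the special tokens."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden[1:-1]

def soft_f1(candidate: str, reference: str, context: str = "") -> float:
    """Greedy soft-matching F1 between candidate and reference tokens.

    The programmatic context is prepended before encoding so the embeddings
    are context-aware, but only the code tokens themselves are scored.
    (Token boundaries at the concatenation point are approximated.)
    """
    ctx_len = len(tokenizer(context, add_special_tokens=False)["input_ids"])
    cand = encode(context + candidate)[ctx_len:]
    ref = encode(context + reference)[ctx_len:]

    # Cosine similarity between every candidate token and every reference token.
    cand = torch.nn.functional.normalize(cand, dim=-1)
    ref = torch.nn.functional.normalize(ref, dim=-1)
    sim = cand @ ref.T

    # Greedy matching: best reference match per candidate token (precision)
    # and best candidate match per reference token (recall).
    precision = sim.max(dim=1).values.mean().item()
    recall = sim.max(dim=0).values.mean().item()
    return 2 * precision * recall / (precision + recall)

# Example: semantically equivalent code with reordered operands still
# receives a high soft-similarity score despite the exact-match mismatch.
print(soft_f1("return a + b", "return b + a", context="def add(a, b):\n    "))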

