Does BLEU Score Work for Code Migration?

06/12/2019
by Ngoc Tran, et al.

Statistical machine translation (SMT) is a fast-growing sub-field of computational linguistics. To date, the most popular automatic metric for measuring the quality of SMT output is the BiLingual Evaluation Understudy (BLEU) score. Recently, SMT, together with the BLEU metric, has been applied to a software engineering task called code migration. (In)validating the use of the BLEU score could advance the research and development of SMT-based code migration tools, yet no study has confirmed or refuted its suitability for source code. In this paper, we conduct an empirical study to (in)validate BLEU for the code migration task, hypothesizing that it cannot reflect the semantics of source code. We use human judgment as the ground truth for the semantic correctness of migrated code. Our study shows that BLEU does not reflect translation quality: it correlates only weakly with the semantic correctness of translated code, and we provide counter-examples in which BLEU fails to compare the translation quality of SMT-based models. Given BLEU's ineffectiveness for the code migration task, we propose an alternative metric, RUBY, which considers the lexical, syntactic, and semantic representations of source code. RUBY achieves a higher correlation coefficient with the semantic correctness of migrated code (0.775, versus 0.583 for BLEU), and it reliably reflects changes in the translation quality of SMT-based translation models. With these advantages, RUBY can be used to evaluate SMT-based code migration models.
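
To make the contrast concrete, below is a minimal, self-contained Python sketch of the two kinds of metric the abstract discusses. It is not the authors' implementation: the unsmoothed sentence-level BLEU, the use of Python's ast module as a stand-in parser, the difflib-based similarity functions, and the names bleu, ruby_like, lexical_sim, and syntactic_sim are all simplifying assumptions, and a full RUBY would first attempt a program-dependence-graph comparison (the semantic level) before falling back to syntax and then to text.

import ast
import difflib
import math
from collections import Counter
from typing import Optional

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU: geometric mean of modified (clipped)
    n-gram precisions, scaled by a brevity penalty. No smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_grams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_grams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matched = sum(min(k, r_grams[g]) for g, k in c_grams.items())
        precisions.append(matched / max(sum(c_grams.values()), 1))
    if min(precisions) == 0.0:
        return 0.0  # any empty n-gram overlap zeroes the geometric mean
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

def lexical_sim(candidate: str, reference: str) -> float:
    """Character-level similarity: a stand-in for RUBY's lexical (string) level."""
    return difflib.SequenceMatcher(None, candidate, reference).ratio()

def syntactic_sim(candidate: str, reference: str) -> Optional[float]:
    """AST-level similarity, or None when either snippet fails to parse.
    Comparing node-type sequences is a crude stand-in for tree edit distance."""
    try:
        c_nodes = [type(n).__name__ for n in ast.walk(ast.parse(candidate))]
        r_nodes = [type(n).__name__ for n in ast.walk(ast.parse(reference))]
    except SyntaxError:
        return None
    return difflib.SequenceMatcher(None, c_nodes, r_nodes).ratio()

def ruby_like(candidate: str, reference: str) -> float:
    """RUBY-style ensemble: score the translation at the highest-level
    representation that can be built, then fall back. The real metric
    tries a program dependence graph first; that stage is omitted here."""
    score = syntactic_sim(candidate, reference)
    return score if score is not None else lexical_sim(candidate, reference)

if __name__ == "__main__":
    reference = "total = sum(x for x in items if x > 0)"
    candidate = "total = sum(v for v in items if v > 0)"  # same semantics, renamed variable
    print(f"BLEU      = {bleu(candidate, reference):.3f}")
    print(f"RUBY-like = {ruby_like(candidate, reference):.3f}")

On this toy pair every 4-gram contains the renamed token, so the unsmoothed BLEU collapses to 0.0 even though the two lines are semantically identical, while the AST-level score is 1.0. This is the flavor of counter-example the paper raises against BLEU for code.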
