AVATAR: A Parallel Corpus for Java-Python Program Translation

08/26/2021
by Wasi Uddin Ahmad, et al.

Program translation refers to migrating source code from one programming language to another. It has tremendous practical value in software development, as porting software across languages is time-consuming and costly. Automating program translation is of paramount importance in software migration, and researchers have recently explored unsupervised approaches because parallel corpora are unavailable. However, the availability of pre-trained language models for programming languages enables supervised fine-tuning with a small number of labeled examples. In this work, we present a corpus of 8,475 programming problems and their solutions written in two popular languages, Java and Python. We collect the dataset from competitive programming sites, online platforms, and open source repositories. We present several baselines, including models trained from scratch and models pre-trained on large-scale source code collections and fine-tuned on our proposed dataset. Experimental results show that while the models perform relatively well in terms of lexical match, they fall short in generating code that is accurate in terms of syntax and data-flow match.
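
The gap between lexical match and syntactic correctness can be illustrated with a small sketch. This is not the evaluation used in the paper: the clipped n-gram precision below is a simplified stand-in for a lexical metric, the tokenizer is a crude regular expression, and the function names (tokenize, ngram_precision, is_valid_python) are hypothetical. The example shows a Python output that overlaps heavily with its reference yet does not parse.

```python
import ast
import re
from collections import Counter


def tokenize(code):
    # Crude tokenizer (an assumption for illustration): identifiers,
    # integer literals, or any single non-whitespace character.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def ngram_precision(reference, hypothesis, n=2):
    # Clipped n-gram precision: the share of hypothesis n-grams that
    # also occur in the reference, counting each at most as often as
    # it appears in the reference.
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return overlap / total


def is_valid_python(code):
    # Syntax check only: the code parses, which says nothing about
    # whether it behaves like the reference.
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


# Hypothetical reference and model output; the output has one stray ")".
reference = "def add(a, b):\n    return a + b"
hypothesis = "def add(a, b):\n    return a + b)"

lexical = ngram_precision(reference, hypothesis)
parses = is_valid_python(hypothesis)
print(f"bigram precision: {lexical:.3f}, parses: {parses}")
```

A single extra token barely moves the lexical score but makes the program unusable, which is why a lexical metric alone can overstate translation quality.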

research
02/08/2023

Syntax and Domain Aware Model for Unsupervised Program Translation

There is growing interest in software migration as the development of so...
research
10/11/2021

Using Document Similarity Methods to create Parallel Datasets for Code Translation

Translating source code from one programming language to another is a cr...
research
02/21/2023

On ML-Based Program Translation: Perils and Promises

With the advent of new and advanced programming languages, it becomes im...
research
01/01/2022

Cross-Domain Deep Code Search with Few-Shot Meta Learning

Recently, pre-trained programming language models such as CodeBERT have ...
research
03/16/2023

Knowledge Transfer for Pseudo-code Generation from Low Resource Programming Language

Generation of pseudo-code descriptions of legacy source code for softwar...
research
02/07/2023

J-Parallelio – automatic parallelization framework for Java virtual machine code

Manual translation of the algorithms from sequential version to its para...
research
10/13/2021

Leveraging Automated Unit Tests for Unsupervised Code Translation

With little to no parallel data available for programming languages, uns...
