NatGen: Generative pre-training by "Naturalizing" source code

06/15/2022
by Saikat Chakraborty, et al.

Pre-trained generative language models for source code (e.g., PLBART, CodeT5, SPT-Code) have yielded strong results on several tasks in the past few years, including code generation and code translation. These models adopt varying pre-training objectives to learn the statistics of code construction from very large-scale corpora in a self-supervised fashion; the success of pre-trained models largely hinges on these pre-training objectives. This paper proposes a new pre-training objective, "naturalizing" source code, which exploits code's bimodal, dual-channel (formal and natural channels) nature. Unlike natural language, code's bimodal, dual-channel nature allows us to generate semantically equivalent code at scale. We introduce six classes of semantics-preserving transformations to create "un-natural" forms of code, and then train our model to produce the more natural, original programs written by developers. Learning to generate equivalent but more natural code, at scale, over large corpora of open-source code, without explicit manual supervision, helps the model learn both to ingest and to generate code. We fine-tune our model on three generative Software Engineering tasks (code generation, code translation, and code refinement) with limited human-curated labeled data, and achieve state-of-the-art performance rivaling CodeT5. We show that our pre-trained model is especially competitive at zero-shot and few-shot learning, and better at learning code properties (e.g., syntax, data flow).
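To make the idea concrete, below is a minimal sketch, in Python with the standard ast module, of one plausible semantics-preserving transformation: rewriting a simple `for i in range(n)` loop as an equivalent `while` loop to produce an "un-natural" variant of otherwise natural code. The `ForRangeToWhile` class and the specific loop pattern it handles are illustrative assumptions, not the paper's actual transformation suite; the abstract only states that six classes of such transformations are used.

```python
# A minimal sketch (NOT NatGen's actual implementation) of one
# semantics-preserving transformation: for-range loop -> while loop.
import ast


class ForRangeToWhile(ast.NodeTransformer):
    """Rewrite `for <name> in range(<stop>)` as an equivalent while loop."""

    def visit_For(self, node: ast.For):
        self.generic_visit(node)  # handle nested loops first
        # Only handle the simple `for i in range(stop)` pattern.
        if (
            isinstance(node.target, ast.Name)
            and isinstance(node.iter, ast.Call)
            and isinstance(node.iter.func, ast.Name)
            and node.iter.func.id == "range"
            and len(node.iter.args) == 1
            and not node.orelse
        ):
            var, stop = node.target.id, node.iter.args[0]
            init = ast.parse(f"{var} = 0").body[0]          # i = 0
            step = ast.parse(f"{var} += 1").body[0]          # i += 1
            loop = ast.While(
                test=ast.Compare(
                    left=ast.Name(id=var, ctx=ast.Load()),
                    ops=[ast.Lt()],
                    comparators=[stop],
                ),
                body=node.body + [step],
                orelse=[],
            )
            # Replace the single for-statement with two statements.
            return [init, loop]
        return node


natural = "for i in range(10):\n    total += i\n"
tree = ast.fix_missing_locations(ForRangeToWhile().visit(ast.parse(natural)))
unnatural = ast.unparse(tree)  # Python 3.9+
print(unnatural)
# i = 0
# while i < 10:
#     total += i
#     i += 1
```

Each such (un-natural, natural) pair can then serve as a self-supervised input-target example for a sequence-to-sequence denoising objective: the model receives the transformed code and is trained to emit the more natural, developer-written original, analogous to the pre-training setup the abstract describes.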

