CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation

06/14/2022
by   Daoguang Zan, et al.

Code generation is a longstanding challenge that aims to generate a code snippet from a natural language description. Training a code generation model usually requires expensive text-code paired data. Recently, thanks to the success of pre-training techniques, large language models have been trained on large-scale unlabelled code corpora and perform well in code generation. In this paper, we investigate how to leverage an unlabelled code corpus to train a model for library-oriented code generation. This setting matters because programmers commonly reuse third-party libraries, and the huge number of libraries makes text-code paired data even harder to obtain. We observe that library-oriented code snippets are more likely to share similar code sketches. Hence, we present CERT, which works in two steps: a sketcher generates the sketch, then a generator fills in the details of the sketch. Both the sketcher and the generator are continually pre-trained from a base model using unlabelled data. Furthermore, we craft two benchmarks, PandasEval and NumpyEval, to evaluate library-oriented code generation. Experimental results demonstrate the impressive performance of CERT. For example, it surpasses the base model by an absolute 15.67% improvement in terms of pass@1 on PandasEval. Our work is available at https://github.com/microsoft/PyCodeGPT.
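To make the idea of a shared code sketch concrete, here is a minimal illustration in Python. It is not the CERT sketcher: the abstract does not specify how sketches are built, so the function `to_sketch`, the placeholders `<string>` and `<number>`, and the choice to anonymize only string and number literals are all assumptions made for this example.

```python
import io
import tokenize

# Hypothetical placeholder scheme (an assumption, not taken from the paper).
PLACEHOLDERS = {tokenize.STRING: "<string>", tokenize.NUMBER: "<number>"}


def to_sketch(code: str) -> str:
    """Replace string and number literals in `code` with placeholders."""
    # Absolute character offset at which each source line starts.
    line_starts = [0]
    for line in code.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))

    # Collect the character spans of all literals to anonymize.
    spans = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in PLACEHOLDERS:
            begin = line_starts[tok.start[0] - 1] + tok.start[1]
            end = line_starts[tok.end[0] - 1] + tok.end[1]
            spans.append((begin, end, PLACEHOLDERS[tok.type]))

    # Substitute from the end so earlier offsets stay valid.
    for begin, end, placeholder in reversed(spans):
        code = code[:begin] + placeholder + code[end:]
    return code


snippet_a = 'df = df[df["age"] > 30].head(5)\n'
snippet_b = 'df = df[df["price"] > 100].head(10)\n'
print(to_sketch(snippet_a))
print(to_sketch(snippet_a) == to_sketch(snippet_b))
```

Under these assumptions, the two pandas snippets differ only in user-specific details and collapse to the same sketch. That is the kind of regularity a sketcher could learn to predict before a generator fills in the details.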


