CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences

02/14/2022
by Maliheh Izadi, et al.

Code completion is an essential feature of IDEs, yet current autocompleters are restricted to either grammar-based or NLP-based single-token completions. Both approaches have significant drawbacks: grammar-based autocompletion is restricted in dynamically-typed language environments, whereas NLP-based autocompleters struggle to understand the semantics of the programming language and the developer's code context. In this work, we present CodeFill, a language model for autocompletion that combines learned structure and naming information. Using a parallel Transformer architecture and multi-task learning, CodeFill consumes sequences of source code token names and their equivalent AST token types. Uniquely, CodeFill is trained for both single-token and multi-token (statement) prediction, which enables it to learn long-range dependencies among grammatical and naming elements. We train CodeFill on two datasets, consisting of 29M and 425M lines of code, respectively. To make the evaluation more realistic, we develop a method to automatically infer points in the source code at which completion matters. We compare CodeFill against four baselines and two state-of-the-art models, GPT-C and TravTrans+. CodeFill surpasses all baselines in single-token prediction (MRR: 70.9%) and outperforms the state of the art in multi-token prediction (ROUGE-L: 63.7%). We publicly release our source code and datasets.
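The core idea of CodeFill's input representation is to feed the model two aligned sequences: the token values (naming) and the token types (structure). As a rough illustrative sketch only, assuming Python source and using the stdlib tokenizer's lexical categories as a stand-in for the AST token types the paper actually derives:

```python
import io
import tokenize

def parallel_sequences(source: str):
    """Split source code into two aligned sequences: token values
    (naming information) and token-type names (structural information).
    Illustrative only: CodeFill uses AST-derived types, not these
    coarse lexical categories from the stdlib tokenizer."""
    values, types = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Skip layout-only tokens that carry no naming content.
        if tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                        tokenize.DEDENT, tokenize.ENDMARKER):
            continue
        values.append(tok.string)
        types.append(tokenize.tok_name[tok.type])
    return values, types

values, types = parallel_sequences("total = price * qty")
print(values)  # ['total', '=', 'price', '*', 'qty']
print(types)   # ['NAME', 'OP', 'NAME', 'OP', 'NAME']
```

In the paper's setting, the two sequences are consumed by parallel Transformer streams and the model is trained with multi-task objectives over both, so a prediction for a value token is conditioned on the grammatical context as well as the naming context.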

