Maybe Deep Neural Networks are the Best Choice for Modeling Source Code

Statistical language modeling techniques have been applied successfully to source code, yielding a variety of new software development tools, such as tools for code suggestion and for improving readability. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, because new identifier names proliferate. Traditional language models, however, limit the vocabulary to a fixed set of common words. For code, this strong assumption has been shown to have a significant negative effect on predictive performance. Yet open-vocabulary versions of neural network language models for code have not been introduced in the literature. We present a new open-vocabulary neural language model for code that is not limited to a fixed vocabulary of identifier names. We employ a segmentation into subword units, subsequences of tokens chosen based on a compression criterion, following previous work in machine translation. Our network achieves best-in-class performance, outperforming even the state-of-the-art methods of Hellendoorn and Devanbu that are designed specifically to model code. Furthermore, we present a simple method for dynamically adapting the model to a new test project, resulting in increased performance. We showcase our methodology on code corpora in three different programming languages, each of over a billion tokens, hundreds of times larger than in previous work. To our knowledge, this is the largest neural language model for code that has been reported.
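To make the subword segmentation concrete, the following is a minimal sketch of a compression-based merge procedure in the style of byte-pair encoding from machine translation. It is an illustration under stated assumptions, not the authors' implementation: the function names `learn_bpe`, `merge_pair`, and `segment`, the end-of-token marker `</t>`, and the tie-breaking rule are all hypothetical choices made for this example.

```python
from collections import Counter

END = "</t>"  # hypothetical end-of-token marker, chosen for this sketch


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(tokens, num_merges):
    """Learn up to `num_merges` merge operations from a list of code tokens."""
    # Start from single characters; each token is a tuple of symbols plus END.
    vocab = Counter(tuple(tok) + (END,) for tok in tokens)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each token occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic (an assumption of this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(token, merges):
    """Split a (possibly unseen) token into subword units by replaying the merges in order."""
    symbols = tuple(token) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["getName", "getValue", "setName", "setValue"] * 5
    merges = learn_bpe(corpus, 10)
    # An identifier never seen in training still decomposes into known units.
    print(segment("getNameValue", merges))
```

Because every token, including an identifier that never occurred in training, can be decomposed into units from a bounded set, a model over these units has no out-of-vocabulary tokens. The number of merges controls the trade-off between vocabulary size and sequence length.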

research
03/17/2020

Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code

Statistical language modeling techniques have successfully been applied ...
research
04/10/2020

Sequence Model Design for Code Completion in the Modern IDE

Code completion plays a prominent role in modern integrated development ...
research
04/03/2019

Modeling Vocabulary for Big Code Machine Learning

When building machine learning models that operate on source code, sever...
research
04/28/2020

Fast and Memory-Efficient Neural Code Completion

Code completion is one of the most widely used features of modern integr...
research
10/18/2018

Open Vocabulary Learning on Source Code with a Graph-Structured Cache

Machine learning models that take computer program source code as input ...
research
03/21/2018

Exploring the Naturalness of Buggy Code with Recurrent Neural Networks

Statistical language models are powerful tools which have been used for ...
research
11/24/2016

Learning Python Code Suggestion with a Sparse Pointer Network

To enhance developer productivity, all modern integrated development env...
