Learning Python Code Suggestion with a Sparse Pointer Network

11/24/2016
by Avishkar Bhoopchand, et al.

To enhance developer productivity, all modern integrated development environments (IDEs) include code suggestion functionality that proposes likely next tokens at the cursor. While current IDEs work well for statically-typed languages, their reliance on type annotations means they provide much weaker support for dynamically-typed languages. Moreover, suggestion engines in modern IDEs do not propose expressions or multi-statement idiomatic code. Recent work has shown that language models can improve code suggestion systems by learning from software repositories. This paper introduces a neural language model with a sparse pointer network aimed at capturing very long-range dependencies. We release a large-scale code suggestion corpus of 41M lines of Python code crawled from GitHub. On this corpus, we find that standard neural language models perform well at suggesting local phenomena but struggle to refer to identifiers introduced many tokens in the past. By augmenting a neural language model with a pointer network specialized in referring to predefined classes of identifiers, we obtain a much lower perplexity and a five percentage point increase in code suggestion accuracy over an LSTM baseline. This gain stems from identifier predictions that are 13 times more accurate. Furthermore, a qualitative analysis shows that the model indeed captures interesting long-range dependencies, such as referring to a class member defined over 60 tokens in the past.
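The core idea described in the abstract, combining a standard language-model distribution over the vocabulary with a pointer (attention) distribution over previously seen identifiers via a learned gate, can be sketched as follows. This is an illustrative toy example, not the authors' exact architecture: the parameter names (`W_vocab`, `id_memory`, `w_gate`), the dimensions, and the random initialization are all assumptions for demonstration; in the real model these would be trained jointly with an LSTM.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)

vocab_size = 10   # toy global vocabulary
hidden_dim = 8
mem_size = 4      # number of past identifier representations kept in memory

# Toy stand-ins for learned parameters and network states
# (randomly initialized here purely for illustration).
h_t = rng.standard_normal(hidden_dim)                    # current hidden state
W_vocab = rng.standard_normal((vocab_size, hidden_dim))  # output projection
id_memory = rng.standard_normal((mem_size, hidden_dim))  # past identifier reps
w_gate = rng.standard_normal(hidden_dim)                 # gate parameters

# 1) Standard language-model distribution over the global vocabulary.
p_vocab = softmax(W_vocab @ h_t)

# 2) Pointer distribution over previously seen identifiers,
#    computed as attention of the hidden state over the identifier memory.
p_ptr = softmax(id_memory @ h_t)

# 3) A scalar gate decides whether to generate from the vocabulary
#    or to copy one of the stored identifiers.
lam = 1.0 / (1.0 + np.exp(-(w_gate @ h_t)))  # sigmoid

# Final distribution: vocabulary part and pointer part, mixed by the gate.
# (In a full implementation, pointer probabilities would be scattered back
#  onto the vocabulary positions of the corresponding identifier tokens.)
p_final = np.concatenate([lam * p_vocab, (1.0 - lam) * p_ptr])
assert np.isclose(p_final.sum(), 1.0)
```

Restricting the pointer memory to a small, predefined class of identifiers (rather than all past tokens) is what keeps the attention sparse and lets the model reach references dozens of tokens back.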

