Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code

Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported. All datasets, code, and trained models used in this work are publicly available.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/13/2019

Maybe Deep Neural Networks are the Best Choice for Modeling Source Code

Statistical language modeling techniques have successfully been applied ...
research
04/03/2019

Modeling Vocabulary for Big Code Machine Learning

When building machine learning models that operate on source code, sever...
research
11/09/2020

Pointing to Subwords for Generating Function Names in Source Code

We tackle the task of automatically generating a function name from sour...
research
08/25/2023

Investigating the Impact of Vocabulary Difficulty and Code Naturalness on Program Comprehension

Context: Developers spend most of their time comprehending source code d...
research
10/18/2018

Open Vocabulary Learning on Source Code with a Graph-Structured Cache

Machine learning models that take computer program source code as input ...
research
09/05/2023

Language Models for Novelty Detection in System Call Traces

Due to the complexity of modern computer systems, novel and unexpected b...
research
04/16/2020

Deep Generation of Coq Lemma Names Using Elaborated Terms

Coding conventions for naming, spacing, and other essentially stylistic ...

Please sign up or login with your details

Forgot password? Click here to reset