Splitting source code identifiers using Bidirectional LSTM Recurrent Neural Network

05/26/2018
by   Vadim Markovtsev, et al.

Programmers make rich use of natural language in the source code they write through identifiers and comments. Source code identifiers are selected from a pool of tokens which are strongly related to the meaning, naming conventions, and context. These tokens are often combined to produce more precise and obvious designations. Such multi-part identifiers account for 97% of all naming tokens in the Public Git Archive - the largest dataset of Git repositories to date. We introduce a bidirectional LSTM recurrent neural network to detect subtokens in source code identifiers. We trained the network on 41.7 million distinct splittable identifiers collected from 182,014 open source projects in the Public Git Archive, and show that it outperforms several other machine learning models. The proposed network can be used to improve upstream models that are based on source code identifiers, and to improve developer experience by allowing code to be written without switching the keyboard case.
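
One way to picture the task in the abstract is as per-character labeling: an identifier with known subtokens is flattened into a lowercase string, and each character is marked 1 if a new subtoken starts there. The sketch below is illustrative only and is not the authors' code. The function names (`heuristic_split`, `encode`, `decode`) and the regular expression are assumptions made for this example. The network itself is not shown; it would predict the label sequence from the characters alone.

```python
import re

# Assumed heuristic: runs of capitals (acronyms) and capitalized or
# lowercase words form subtokens; underscores and digits are dropped.
_TOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def heuristic_split(identifier):
    """Split a conventionally named identifier into lowercase subtokens."""
    return [m.group(0).lower() for m in _TOKEN_RE.finditer(identifier)]


def encode(subtokens):
    """Turn subtokens into (characters, labels) for sequence labeling.

    A label of 1 marks the first character of every subtoken except the
    first one; all other characters are labeled 0.
    """
    chars = "".join(subtokens)
    labels = [0] * len(chars)
    position = 0
    for token in subtokens[:-1]:
        position += len(token)
        labels[position] = 1
    return chars, labels


def decode(chars, labels):
    """Rebuild subtokens from characters and per-character split labels."""
    tokens = []
    start = 0
    for i, label in enumerate(labels):
        if label and i > start:
            tokens.append(chars[start:i])
            start = i
    if start < len(chars):
        tokens.append(chars[start:])
    return tokens


if __name__ == "__main__":
    parts = heuristic_split("parseHTTPResponse")
    chars, labels = encode(parts)
    print(parts, chars, labels)
    print(decode(chars, labels))
```

Pairs produced by `heuristic_split` followed by `encode` could serve as training examples in this framing, and `decode` would turn predicted labels for an all-lowercase input back into subtokens.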
