Splitting source code identifiers using Bidirectional LSTM Recurrent Neural Network

05/26/2018, by Vadim Markovtsev et al., source{d}

Programmers make rich use of natural language in the source code they write through identifiers and comments. Source code identifiers are selected from a pool of tokens which are strongly related to the meaning, naming conventions, and context. These tokens are often combined to produce more precise and obvious designations. Such multi-part identifiers account for 97% of the tokens in the Public Git Archive, the largest dataset of Git repositories to date. We introduce a bidirectional LSTM recurrent neural network to detect subtokens in source code identifiers. We trained that network on 41.7 million distinct splittable identifiers collected from 182,014 open source projects in the Public Git Archive and show that it outperforms several other machine learning models. The proposed network can be used to improve upstream models which are based on source code identifiers, as well as to improve the developer experience by allowing code to be written without switching the keyboard case.




1 Introduction

The descriptiveness of source code identifiers is critical for readability and maintainability [21]. This property is hard to ensure using single words alone, so it is common practice to concatenate multiple words into a single identifier. Whitespace characters in identifiers are forbidden by most programming languages, so naming conventions [5] such as CamelCase or snake_case specify the concatenation rules. It is possible to apply simple heuristics that backtrack those rules and restore the original words from identifiers. For example, FooBar or foo_bar are trivially disassembled into foo and bar. However, if a compound identifier consists of only lowercase or only uppercase characters, splitting it requires domain knowledge and cannot be easily performed.

According to our estimations, up to 7.5% of identifiers are not splittable by style-driven heuristics; among the rest, 15% contain further splittable parts after the heuristics are applied. This leads to bigger vocabulary sizes, worse performance, and reduced quality of upstream research in the areas of source code analysis and Machine Learning on Source Code (MLonCode). A deep learning-based parser, capable of learning to tokenize identifiers from many training examples, can enhance the quality of research in topics like identifier embeddings [27], deduplication [22], topic modeling [25], and naming suggestions [9].

The main contributions of this paper are:


  • We built the biggest dataset of 49.2 million source code identifiers, extracted from Public Git Archive [26], the 182,014 most popular GitHub repositories.

  • We are the first to apply a recurrent neural network to split "unsplittable" identifiers. We show that a character-level bidirectional LSTM recurrent neural network (RNN) performs better than a character-level bidirectional GRU, a character-level convolutional neural network, gradient boosted decision trees, a statistical dynamic programming model, and an unsmoothed maximum likelihood character-level model.

2 Identifier extraction

This section describes the source code identifier extraction pipeline used to generate the training dataset from the Public Git Archive.

We processed each Git repository with the source{d} engine [7] to determine the main branch and its head revision, took the files from that revision, and identified the programming languages of those files. We extracted identifiers from the files according to the identified language with Babelfish [3] and Pygments [6]. Babelfish is a self-hosted server for universal source code parsing which converts code files into Universal Abstract Syntax Trees. We fall back to Pygments for the languages which are not yet supported by Babelfish. Pygments was developed to highlight source code and relies on regular expressions; it therefore makes mistakes and introduces noise.

We obtained 62.0 million identifiers after removing duplicates. This number was reduced to 49.2 million after manual rule-based filtering of the noisy output from Pygments. We then split the identifiers according to the common naming conventions. For example, FooBarBaz becomes foo bar baz, and method_base turns into method base. The listing of the function which implements these heuristics is provided in appendix A. We kept only the identifiers which consisted of more than one part and obtained 36.1 million distinct subtoken sequences; some identifiers aliased to the same subtoken sequence. The distribution of identifier lengths has a long tail, as seen in Fig. 1, so we set a maximum identifier length threshold of 40 characters. The length threshold further reduced the dataset to 34.9 million unique subtoken sequences. All the models we trained take as input the lowercase strings created by merging each subtoken sequence together, with the corresponding indices of the subtoken boundaries as labels. Figure 2 depicts the head of the frequency distribution of the subtokens.
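As a small illustration of this training representation, the following sketch (not the paper's actual pipeline code; the function name is ours) converts an already-split identifier into the lowercase merged string plus the subtoken boundary indices:

```python
def to_training_pair(subtokens):
    """["Foo", "Bar", "Baz"] -> ("foobarbaz", [3, 6])"""
    merged = "".join(subtokens).lower()
    boundaries, pos = [], 0
    for part in subtokens[:-1]:  # no boundary after the last subtoken
        pos += len(part)
        boundaries.append(pos)
    return merged, boundaries

print(to_training_pair(["Foo", "Bar", "Baz"]))  # → ('foobarbaz', [3, 6])
```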

The raw dataset of 49.2 million identifiers is available for download on GitHub [4]. Currently available datasets of source code identifiers contain less than a million entities and focus on particular programming languages, such as Java [12].

Figure 1: Distribution of identifier lengths
Figure 2: Distribution of most frequent subtokens

3 Baselines

We now describe the models to which we compare our character-level bidirectional LSTM recurrent neural network.

Maximum likelihood character-level model

Probabilistic maximum likelihood language models (ML LM) are typical for Natural Language Processing [28]. Given the sequence of characters c_1 ... c_n representing an identifier, we evaluate for each character c_i the probability that the subsequence c_1 ... c_i is a complete prefix word, and pick the prefix that maximizes that probability. We repeat this procedure from the character following the chosen prefix. For prefixes about which we have no prior knowledge, we slide the tree root forward until a match is found. Similarly to n-gram models [17], our character-level LM makes the Markov assumption that the sequence of characters is a memoryless stochastic process, so we assert that P(c_i | c_1 ... c_{i-1}) = P(c_i | c_{i-k} ... c_{i-1}). We estimate these conditional probabilities using maximum likelihood [23]. We trained two unsmoothed models independently, corresponding to the forward and backward reading directions, and combined them via logical conjunction and disjunction. The tree depth was 11 due to technical limitations: bigger depths require too much operating memory. The implementation was CharStatModel [24].
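In the spirit of the procedure above, here is a toy unsmoothed character-level Markov model. We use order 2 for brevity (the paper's trees go to depth 11), pad word starts with "^" and mark word ends with "$", and greedily peel off the prefix with the highest model probability. The three-word corpus is purely illustrative; a realistic model needs a large corpus and a deeper context.

```python
from collections import defaultdict

ORDER = 2  # context length; the paper uses up to 11

def train(words):
    """Count character transitions and normalize to conditional probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        padded = "^" * ORDER + w + "$"
        for i in range(ORDER, len(padded)):
            counts[padded[i - ORDER:i]][padded[i]] += 1
    probs = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values())
        probs[ctx] = {ch: c / total for ch, c in nxt.items()}
    return probs

def word_prob(probs, s):
    """Unsmoothed probability that s is a complete word."""
    padded = "^" * ORDER + s + "$"
    p = 1.0
    for i in range(ORDER, len(padded)):
        p *= probs.get(padded[i - ORDER:i], {}).get(padded[i], 0.0)
    return p

def split(probs, ident):
    """Greedily take the most probable prefix, then repeat on the remainder."""
    parts, rest = [], ident
    while rest:
        best = max(range(1, len(rest) + 1),
                   key=lambda i: word_prob(probs, rest[:i]))
        parts.append(rest[:best])
        rest = rest[best:]
    return parts

model = train(["foo", "bar", "baz"])
print(split(model, "foobar"))  # → ['foo', 'bar']
```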

Dynamic programming

Inspired by the dynamic programming approach to splitting words [20], we implemented a similar solution based on word frequencies. By hypothesizing that the words are independent from one another, we can model the probability of a sequence of words using frequencies computed on a corpus. We trained on a generic Wikipedia corpus and on the unique subtokens in our identifier dataset, assuming either a Zipf prior or the posterior distribution. Our implementation was based on wordninja [10].
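A minimal sketch of this dynamic programming splitter, in the spirit of wordninja: each vocabulary word gets a cost derived from a Zipf prior over a frequency-ranked word list, and we minimize the total cost over all segmentations. The tiny ranked vocabulary and the out-of-vocabulary penalty of 20 are our illustrative assumptions, not values from the paper.

```python
import math

VOCAB = ["get", "name", "hash", "from", "uid", "id"]  # assumed frequency rank
# Zipf prior: cost grows with rank, so frequent words are preferred.
COST = {w: math.log((i + 1) * math.log(len(VOCAB) + 1))
        for i, w in enumerate(VOCAB)}

def split(s):
    # best[i]: (total cost, length of the last word) for the prefix s[:i]
    best = [(0.0, 0)]
    for i in range(1, len(s) + 1):
        cands = [(best[j][0] + COST.get(s[j:i], float("inf")), i - j)
                 for j in range(i)]
        cands.append((best[i - 1][0] + 20.0, 1))  # out-of-vocabulary character
        best.append(min(cands))
    parts, i = [], len(s)
    while i > 0:  # backtrack through the stored last-word lengths
        parts.append(s[i - best[i][1]:i])
        i -= best[i][1]
    return parts[::-1]

print(split("namehashfromuid"))  # → ['name', 'hash', 'from', 'uid']
```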

The main limitation of the statistical approaches is their inability to predict out-of-vocabulary words, especially method and class names, which represent a substantial portion of the identifiers in the validation set. The only way to compensate for this drawback is to increase the length of the context on which the priors are computed, which simultaneously worsens the data sparsity problem [9] and increases the time and memory requirements.

Gradient boosting on decision trees

We trained gradient boosting on decision trees (GBDT) using XGBoost [13]. The tree input was a 10-character window with "a"-aligned ASCII codes instead of one-hot encoding. We did not choose a larger window to avoid introducing noise, given that the bulk of our identifiers were much shorter than the 40-character limit. The windows were centered at each split point, and we also generated 80% negative samples at random non-split positions. The maximum tree depth was 30 and the number of boosting trees was 50.
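The feature extraction just described can be sketched as follows. The window encoding follows the text (10 characters, "a"-aligned codes); the -1 padding value for out-of-range positions and the deterministic negative sampling are our simplifications, since the paper subsamples 80% of negatives at random.

```python
WINDOW = 10

def window_features(ident, pos):
    """Encode the WINDOW characters around a candidate split position
    as "a"-aligned codes; -1 marks out-of-range or non-letter positions
    (our choice of padding value)."""
    feats = []
    for i in range(pos - WINDOW // 2, pos + WINDOW // 2):
        if 0 <= i < len(ident) and ident[i].isalpha():
            feats.append(ord(ident[i].lower()) - ord("a"))
        else:
            feats.append(-1)
    return feats

def training_samples(ident, boundaries):
    """Label 1 at true split points, 0 elsewhere (no random subsampling here)."""
    return [(window_features(ident, pos), 1 if pos in boundaries else 0)
            for pos in range(1, len(ident))]

print(window_features("foobar", 3))  # → [-1, -1, 5, 14, 14, 1, 0, 17, -1, -1]
```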

Character-level Convolutional Neural Network

We stacked 3 Inception layers [30] with 1-dimensional ReLU kernels spanning 2, 4, 8, and 16 one-hot encoded characters, and 32 dimensionality-reducing ReLU kernels of size 1. Thus the output of each layer was shaped 40 by 32. The last layer was connected to a time-distributed dense layer with sigmoid activation and binary labels. There was no regularization, as the dataset was big enough, and we used the RMSProp optimizer.
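The network input referred to above, a 40-character identifier one-hot encoded per character, can be sketched as below. The exact alphabet is our assumption; the paper does not list it.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789_"  # assumed character set
MAXLEN = 40  # identifier length cap from the dataset preparation

def one_hot(ident):
    """Pad/truncate to MAXLEN positions; unknown chars and padding stay all-zero."""
    matrix = []
    for i in range(MAXLEN):
        row = [0] * len(ALPHABET)
        if i < len(ident):
            j = ALPHABET.find(ident[i])
            if j >= 0:
                row[j] = 1
        matrix.append(row)
    return matrix  # shaped MAXLEN x len(ALPHABET)
```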


4 Character-level bidirectional recurrent neural network

Character-level bidirectional recurrent neural networks (BiRNNs) [29] are a family of models that combine two recurrent networks moving through the characters of a sequence in opposite directions, starting from opposite ends. BiRNNs are an effective solution for sequence modeling, so we tried them for the splitting task. Given that the tokens may be long, we chose LSTM [16] over vanilla RNNs to overcome the vanishing gradients problem. We also compared LSTM to GRU [15], as GRU was shown to perform with similar quality while being faster to train.

Figure 3: BiLSTM network with one layer running on foobar. The vertical dashed line indicates the separation point.

Fig. 3 demonstrates the architecture of a BiLSTM network splitting identifiers. It processes the characters of each identifier in both directions. The scheme contains a single recurrent layer for simplicity; the real network is built with two stacked layers. The second recurrent layer is connected to a time-distributed dense layer with binary outputs and sigmoid activation. An output of 1 means the character is a split point, and 0 means it is not. Sigmoid activation was used instead of softmax because there can be more than one split point per identifier.
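Decoding the per-character sigmoid outputs back into subtokens can be sketched as follows; the 0.5 threshold is our assumption for illustration.

```python
def decode(ident, probs, threshold=0.5):
    """probs[i]: predicted probability that character i starts a new subtoken.
    Positions above the threshold become split points."""
    parts, start = [], 0
    for i in range(1, len(ident)):
        if probs[i] > threshold:
            parts.append(ident[start:i])
            start = i
    parts.append(ident[start:])
    return parts

print(decode("foobar", [0.1, 0.2, 0.1, 0.9, 0.3, 0.2]))  # → ['foo', 'bar']
```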

We trained our BiLSTM network on two NVIDIA GTX 1080 GPUs using Keras [14] with a TensorFlow backend [8]. It took approximately 14 hours to complete 10 epochs. Table 1 lists the hyperparameters we chose using Hyperopt [11]. The training curves are shown in Figure 4.

RNN sequence length: 40
Layer sizes: 256, 256
Batch size: 512
Epochs: 10
Optimizer: Adam [19]
Learning rate: 0.001
Table 1: Network train parameters
Figure 4: Training curves for the BiLSTM

5 Evaluation

We divided the dataset into 80% train and 20% validation and calculated precision, recall, and F1 score for each of the models. Precision is defined as the ratio of correct splitting predictions to the total number of predictions, recall as the ratio of correct splitting predictions to the ground truth number of splits, and the F1 score is the harmonic mean of precision and recall. The results are shown in Fig. 5 and Table 2. The worst models are clearly the statistical ones; however, the conjunction of the character-level ML LMs achieved the highest precision of all at 96.6%. The character-level CNN is close to the top; it has great evaluation speed and can be chosen when run time is important. LSTM performed better than GRU and achieved the highest F1 score, with 95% precision and 96% recall.
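The metric definitions above, computed over sets of predicted and ground-truth split indices, can be sketched as:

```python
def prf1(predicted, truth):
    """Precision, recall and F1 over sets of split-point indices."""
    tp = len(predicted & truth)  # correctly predicted split points
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = prf1({3, 6, 9}, {3, 6, 8})
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.667 0.667 0.667
```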

Model | Precision | Recall | F1
Char. ML LM (disjunction) | 0.563 | 0.936 | 0.703
Char. ML LM (conjunction) | 0.966 | 0.573 | 0.719
Stat. dyn. prog., Wiki | 0.741 | 0.912 | 0.818
Stat. dyn. prog., Zipf | 0.937 | 0.783 | 0.853
Stat. dyn. prog., posterior | 0.931 | 0.892 | 0.911
GBDT | 0.931 | 0.924 | 0.928
Char. CNN | 0.922 | 0.938 | 0.930
Char. BiGRU | 0.945 | 0.955 | 0.949
Char. BiLSTM | 0.947 | 0.958 | 0.952
Table 2: Evaluation results
Figure 5: Model comparison, F1 isocurves are dashed

6 Applications

The presented identifier splitters reduce the number of unique subtokens by 64%. We ran the BiLSTM model on the subtoken sequences from the dataset and generated refined subtoken sequences. The number of unique identifier parts was reduced from 2,940,710 to 1,065,153. Samples of identifiers split by the heuristics and by our model are listed in appendix B. A smaller vocabulary leads to faster training of upstream models, such as identifier embeddings based on the structural co-occurrence scope as a context [2] or topic models of files and projects [25].

It is also possible to use our model to automatically split identifiers typed in the same case without separators. This simplifies and speeds up typing code, provided that the number of splitting errors is low enough. Depending on the naming style, the described algorithm saves the "Shift" or "Shift + Underscore" keystrokes. The reached quality metrics are good enough: assuming that each identifier contains a single split point, our network makes an error with 50% probability only after a series of consecutive identifiers.

7 Conclusion

We created and published a dataset of 49.2 million distinct source code identifiers extracted from Public Git Archive, the largest such dataset to date. We trained several machine learning models on it and showed that the character-level bidirectional LSTM recurrent neural network (BiLSTM) performs best, reaching 95% precision and 96% recall on the validation set. To our knowledge, this is the first time RNNs have been applied to the source code identifier splitting problem. The BiLSTM significantly (by 2 times) reduces the core vocabulary size in upstream problems and is good enough to improve the speed at which people write code.


Appendix A Identifier splitting algorithm, Python 3.4+

import re

NAME_BREAKUP_RE = re.compile(r"[^a-zA-Z]+")
min_split_length = 3

def split(token):
  token = token.strip()
  prev_p = [""]

  def ret(name):
    r = name.lower()
    if len(name) >= min_split_length:
      yield r
      if prev_p[0]:
        yield prev_p[0] + r
        prev_p[0] = ""
    else:
      prev_p[0] = r

  for part in NAME_BREAKUP_RE.split(token):
    if not part:
      continue
    prev = part[0]
    pos = 0
    for i in range(1, len(part)):
      this = part[i]
      if prev.islower() and this.isupper():
        yield from ret(part[pos:i])
        pos = i
      elif prev.isupper() and this.islower():
        if 0 < i - 1 - pos <= min_split_length:
          yield from ret(part[pos:i - 1])
          pos = i - 1
        elif i - 1 > pos:
          yield from ret(part[pos:i])
          pos = i
      prev = this
    last = part[pos:]
    if last:
      yield from ret(last)

Appendix B Examples of identifiers from the dataset processed by heuristics and the BiLSTM model

Input identifier | Output, TokenParser | Output, BiLSTM
OMXBUFFERFLAGCODECCONFIG | omx bufferflag codecconfig | omx buffer flag codec config
metamodelength | metamodelength | meta mode length
rESETTOUCHCONTROLS | r esettouchcontrols | reset touch controls
IDREQUESTRESPONSE | id requestresponse | id request response
%afterfor | afterfor | after for
simpleblogsearch | simpleblogsearch | simple blog search
namehashfromuid | namehash from uid | name hash from uid
GPUSHADERDESCGETCACHEID | gpushaderdesc getcacheid | gpu shader desc get cache id
oneditvaluesilence | oneditvaluesilence | on edit value silence
XGMACTXSENDAPPGOODPKTS | xgmac tx sendappgoodpkts | xgmac tx send app good pkts
closenessthreshold | closenessthreshold | closeness threshold
testwritestartdocument | test writestartdocument | test write start document
dspacehash | dspacehash | d space hash
testfiledate | testfiledate | test file date
ASSOCSTRSHELLEXTENSION | assocstr shellextension | assoc str shell extension