Modeling Vocabulary for Big Code Machine Learning

by Hlib Babii, et al.
Free University of Bozen-Bolzano

When building machine learning models that operate on source code, several decisions have to be made to model source-code vocabulary. These decisions can have a large impact: some can make it impossible to train models at all, while others significantly affect performance, particularly for Neural Language Models. Yet, these decisions are often not fully described. This paper lists important modeling choices for source code vocabulary, and explores their impact on the resulting vocabulary on a large-scale corpus of 14,436 projects. We show that a subset of decisions have decisive characteristics, making it possible to train accurate Neural Language Models quickly on a large corpus of 10,106 projects.




1. Introduction

Many works have taken advantage of the “naturalness” of software (Hindle et al., 2012) to assist a variety of software engineering tasks, including code completion (Raychev et al., 2014), correcting syntax errors (Santos et al., 2018), detecting possibly buggy code (Ray et al., 2016), or API migration (Phan et al., 2017), among many others (Allamanis et al., 2018a). These approaches analyze large amounts of source code (e.g., hundreds to thousands of software projects), which allows them to build predictive models of various source code properties, using a variety of probabilistic or machine learning (ML) models, inspired by Natural Language Processing (NLP) techniques.

However, the use of NLP techniques in this context, even when taking advantage of the increased structure of software, relies on the textual nature of source code, and as such relies on a notion of vocabulary. Thus, a crucial early decision to make when modeling source code for an ML task is how to model software’s vocabulary. This is all the more important because, unlike in natural language, software developers are free to create any identifiers they like, and can make them arbitrarily complex. This introduces the issue that any model that is trained on a large-scale software corpus has to deal with an extremely large vocabulary. Hellendoorn and Devanbu observe this issue first-hand for the task of Language Modeling, showing that a Neural Language Model (NLM) has difficulties scaling beyond as few as a hundred projects, while a more traditional N-gram language model does not have such an issue (Hellendoorn and Devanbu, 2017). Section 2 details how vocabulary issues impact Language Modeling.

Given that NLMs (and Neural approaches in general) are the state-of-the-art in the field of NLP, finding ways to scale them to a larger software corpus is very desirable. An additional reason to scale them is that recent results in NLP (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2018) show that NLMs can be used as upstream tasks in a transfer learning scenario, leading to state-of-the-art improvement in downstream tasks.

Section 3 presents our first contribution: a detailed explanation of the possible modeling choices for source code that we identified. A variety of modeling choices for vocabulary are available, including: which source code elements to include or exclude; whether to filter out infrequent tokens or not; how to handle different natural languages; and how to handle compound tokens. Some of these choices may have a large impact on vocabulary size, directly impacting the feasibility of training neural approaches.

After listing the possible modeling choices for software vocabulary, we present our second contribution: an empirical study of the impact of the modeling choices in practice. Section 4 investigates how the modeling choices affect vocabulary size, number of tokens, and out-of-vocabulary (OOV) rate on a large-scale corpus of 14,436 projects. We find that the choices have a drastic impact on these metrics, leading to variations in vocabulary size of up to three orders of magnitude. Importantly, we find that the most common ways to reduce vocabulary (such as splitting identifiers according to case information) are not enough to obtain a vocabulary of a manageable size; advanced approaches such as adaptations of the Byte-Pair Encoding algorithm (Gage, 1994; Sennrich et al., 2015) are needed to reach this goal.

Following the vocabulary study, we evaluate how these modeling choices impact the training and performance of NLMs. Section 5 presents our third contribution: we find that, with the right set of choices, it is possible to scale NLMs to a large source code corpus: we successfully train several NLMs on a corpus that contains more than 10,106 projects. We evaluate two scenarios: language modeling and code completion. Our results show that our language models are competitive with previous approaches, even at very large scales.

We discuss the modeling choices in light of the results (including the implications) in Section 6. Finally, we discuss the limitations of the study in Section 7 and close the paper in Section 8.

2. Background and Related Work

2.1. Language Modeling

A Language Model (LM) estimates the probabilities of sequences of words based on a training corpus. In NLP, these models are used in tasks such as speech recognition (Creutz et al., 2007) or machine translation (Jean et al., 2014).

N-gram Language Models. Traditional language models are based on n-grams: the probability of a token is computed based on the previous tokens in the sequence. N-gram LMs have shown extensive success in NLP applications. However, n-gram models have two issues. First, they operate on small ranges (the previous tokens), with usually low values of $n$ (usually 3 to 5; 6 for Java (Hellendoorn and Devanbu, 2017)). Increasing $n$ does not scale well if the vocabulary is large: for a vocabulary of size $|V|$, there are $|V|^n$ possible n-grams. Second, they suffer from data sparsity: not all possible n-grams are present in the corpus. Smoothing techniques (Chen and Goodman, 1999) alleviate, but do not eliminate, the issue.
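To give a feel for the scaling argument, the short sketch below counts the possible n-grams for a few vocabulary sizes. The vocabulary sizes and the function name are illustrative assumptions, not values taken from the study.

```python
def possible_ngrams(vocab_size: int, n: int) -> int:
    """Number of distinct n-grams over a vocabulary: vocab_size ** n."""
    return vocab_size ** n


# Illustrative vocabulary sizes (assumptions); n = 3 and n = 6 are the
# orders mentioned in the text.
for vocab_size in (10_000, 100_000, 1_000_000):
    for n in (3, 6):
        count = possible_ngrams(vocab_size, n)
        print(f"vocabulary={vocab_size:>9,}  n={n}  possible n-grams={count:.2e}")
```

Only a tiny fraction of these n-grams can ever be observed in a finite corpus, which is the data sparsity issue that smoothing tries to mitigate.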

Neural Language Models. The state-of-the-art in NLP is made of Neural Language Models (NLM) (Bengio et al., 2003a). NLMs represent words in a continuous vector space, which has attractive properties. In these models, words that are semantically similar are close in vector space (Mikolov et al., 2013), allowing the model to infer relationships between words, even if they do not appear in a specific context during training. This allows these models to better deal with data sparsity. In addition, some neural architectures such as Recurrent Neural Networks (RNN) (Mikolov et al., 2010), Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997; Sundermeyer et al., 2012), or Transformer (Vaswani et al., 2017) are able to model much longer range dependencies: a study of LSTM language models showed that they use context as far as 250 words (Khandelwal et al., 2018).

In addition, NLMs have been shown to be versatile: recent work shows that NLMs can be used as upstream tasks for transfer learning. The intuition behind this is that a model that is trained to predict the next word given a sequence of words has to learn features that are useful for other tasks, such as syntax (Blevins et al., 2018). This property is very attractive since language modeling is an unsupervised task, while the downstream tasks are often supervised tasks. An LSTM LM can be re-purposed for classification tasks by replacing its last layers (performing word predictions) with layers performing classification, before fine-tuning it on the new task (Howard and Ruder, 2018). Similar results have been shown for a variety of additional NLP tasks, including question answering and entailment (Peters et al., 2018), and sequence-to-sequence models for translation or summarization tasks (Ramachandran et al., 2016). Unidirectional (Radford et al., 2018) and bidirectional (Devlin et al., 2018) Transformer models can also be fine-tuned for a variety of downstream tasks, while even larger models show adaptation to downstream tasks with no or extremely little fine-tuning (Radford et al., 2019). Many of these tasks showed state-of-the-art improvements stemming from this pre-training.

Language Models in Software Engineering. Seminal studies have laid the groundwork for the use of language models on source code: Gabel and Su show that software is very repetitive (Gabel and Su, 2010). Hindle et al. compare software to natural language, finding that software is much more repetitive than natural language (Hindle et al., 2012); they build language models of source code, finding applications in code completion. Tu et al. (Tu et al., 2014) find that software is even more repetitive when taking local context into account. Rahman et al. refine those results and find that while some aspects of software are not as repetitive as previously thought (non-syntax elements), others are even more so (API sequences) (Rahman et al., 2019). Allamanis et al. describe the field of probabilistic models of source code (Allamanis et al., 2018a); we cover a subset of these works below.

2.2. Large Vocabularies in Machine Learning

ML models in general, and Language Models in particular, do not deal well with large vocabularies. Since most ML algorithms work on numerical data, text has to be converted to a numerical representation. As part of pre-processing, words are converted to vector representations via one-hot-encoding, producing (sparse) vectors of length equal to the vocabulary size. NLMs convert these to word embeddings, dense word vectors of much smaller dimensions (usually in the hundreds), in their first layer. Given enough training data, words that are close in this vector space are semantically similar, and some arithmetic operations are semantically meaningful (e.g., the closest vector to the sum of "Germany" and "capital" is the vector corresponding to "Berlin" (Mikolov et al., 2013)). For a vocabulary of size $|V|$ and embeddings of size $d$, the embedding layer is represented by a dense matrix of size $|V| \times d$.

This solution is not without issues: first, the vocabulary must be known in advance and will be built based on the training corpus. A new word cannot be one-hot encoded, as the resulting vector would exceed the expected dimensions. (Some NLM fine-tuning approaches allow the addition of new words by adding rows in the embedding matrix; those are initialized with the mean of the embedding matrix.) A common workaround is to have a specific unknown token, and replace any word not previously seen by this token. This is far from ideal, as this amounts to losing information. Second, embeddings are usually computed based on word co-occurrences; deriving meaningful embeddings for rare words is difficult since there is very little data to work with. Third, approaches that generate sequences, such as language models or translation approaches, must output probability distributions that span the entire vocabulary. This is usually done with a Softmax layer; unfortunately, this operation scales linearly with the size of the vocabulary. For an efficient language model implementation, most of the computation can be spent in the Softmax layer even for a small vocabulary of 10,000 words (Bradbury et al., 2016). For larger vocabularies, it can dominate computations; Jozefowicz et al. qualify a Softmax operating on a vocabulary of 800,000 elements as “prohibitively slow” (Jozefowicz et al., 2016). Further, vocabulary size impacts the size of the model as both the embedding layer and the Softmax layer depend on it. This increases memory requirements for the model, which might impact other parameters (e.g., decreasing batch size to have the model fit in memory, slowing down training; or decreasing Back Propagation Through Time, thus shrinking available context). Finally, a very large vocabulary negatively impacts performance (particularly when out-of-vocabulary words are present (Jean et al., 2014)).
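The mechanics discussed in this subsection can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the toy vocabulary, the random (untrained) embeddings, the embedding size of 300, and the assumption that the output layer is untied and takes an input of the same size as the embeddings are all hypothetical.

```python
import math
import random

UNK = "<unk>"

# Toy vocabulary, standing in for one built from a training corpus.
vocab = [UNK, "public", "static", "void", "main"]
word_to_id = {word: i for i, word in enumerate(vocab)}

embedding_size = 3  # toy value; real models use hundreds of dimensions
rng = random.Random(0)
# Dense |V| x d matrix, randomly initialised here rather than trained.
embeddings = [[rng.uniform(-1, 1) for _ in range(embedding_size)] for _ in vocab]


def embed(word):
    """Row lookup (equivalent to one-hot vector times matrix); unseen words map to <unk>."""
    return embeddings[word_to_id.get(word, word_to_id[UNK])]


def softmax(logits):
    """Numerically stable Softmax; the work is linear in the number of logits (|V|)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def vocab_dependent_parameters(vocab_size, embedding_size):
    """Embedding matrix plus an untied output layer with bias (an assumed architecture)."""
    embedding = vocab_size * embedding_size
    output = embedding_size * vocab_size + vocab_size
    return embedding + output


# Vocabulary sizes quoted in the text, with an assumed embedding size of 300.
for vocab_size in (10_000, 800_000):
    print(vocab_size, vocab_dependent_parameters(vocab_size, 300))
```

Any identifier absent from the vocabulary, such as a hypothetical `neverSeenName`, receives exactly the same vector as `<unk>`, which is the information loss described above.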

While the Softmax issue is specific to the Language Modeling task, any Neural Model operating on textual sequences will be affected by the other issues. Further, Neural Models using more complex structures (e.g., trees (Alon et al., 2019) or graphs (Allamanis et al., 2018b)) also need to model the textual aspects of source code.

2.3. Handling Large Vocabularies

Several approaches exist to deal with large vocabularies.

Softmax improvements. The Softmax operation can become a bottleneck even for small vocabularies (Bradbury et al., 2016). Several approaches have been proposed to make the computation of the Softmax more efficient. Goodman proposes to speed up maximum entropy language models by grouping words (via clustering) in classes, and dividing the operation into predicting a class, before predicting a word among a subset (Goodman, 2001). Morin and Bengio adapt this approach to Neural LMs (Morin and Bengio, 2005). Bengio and Senécal propose the use of importance sampling to train a Neural LM (Bengio et al., 2003b), although this approach provides a speedup only during training, not inference. Importance sampling was later applied in Neural Machine Translation for large vocabularies (500,000 words) (Jean et al., 2014). Grave et al. propose an adaptation of the hierarchical Softmax that is efficiently computed on a GPU (Grave et al., 2017).

Approaches with subwords. Another set of approaches focuses on the issues of vocabulary size, and modeling rare or out-of-vocabulary words with subwords. Creutz et al. observe that “morphologically rich” natural languages such as Finnish, Estonian, and Turkish pose issues for language models as their vocabulary can be very large (Creutz et al., 2007). They decompose words into subword units called morphemes to build subword n-gram LMs, leading to improvements in speech recognition. Mikolov et al. compare language models at the character level and the subword level (modeling out-of-vocabulary words as sequences of two or three characters), finding that subword models improved on character models (Mikolov et al., 2012). Sennrich et al. adapt the Byte-Pair Encoding (BPE) algorithm to decompose words into subwords, finding improvements in Neural Machine Translation (Sennrich et al., 2015). Bojanowski et al. propose to represent words as bags of character n-grams to compute more descriptive word embeddings, allowing the computation of word vectors for out-of-vocabulary words (Bojanowski et al., 2017). Kim et al. combine a character-level convolutional neural network with an NLM (Kim et al., 2016). Vania and Lopez compare various subword decompositions (words, morphs, character n-grams, BPE) on several natural languages (Vania and Lopez, 2017).

Large Vocabularies in Software Engineering. While Hindle et al. observe that in general, source code is more repetitive than natural language (Hindle et al., 2012), Hellendoorn and Devanbu notice that NLMs trained on a software corpus would struggle due to vocabulary size (Hellendoorn and Devanbu, 2017). To produce a model that can be trained in a reasonable amount of time, Hellendoorn and Devanbu impose drastic limits: the number of projects to train on is set to 107 (1% of the original corpus (Allamanis and Sutton, 2013)), and furthermore, they replace words which occur fewer than 5 times in the corpus with the “unknown” token. Despite this, the resulting vocabulary size is still rather large, totalling more than 76,000 words. Moreover, the prediction performance of an NLM is significantly hurt when it has to predict words that are out of its vocabulary. In parallel to this work, Karampatsis and Sutton (Karampatsis and Sutton, 2019) also investigate the problem of large vocabularies in source code. Our works are complementary: while we study the impact of various vocabulary choices in depth before training a selection of language models, their work starts with the application of Byte-Pair Encoding and explores language model training in more depth than we do.

2.4. Related work

Several researchers have developed and exploited probabilistic models of source code; Allamanis et al. (Allamanis et al., 2018a) give an overview of this research in a survey. We briefly illustrate the related work dividing it into four parts: constructing language models in software engineering, tackling naming problems, translation approaches, and approaches that aim to model the structure of software systems.

Constructing Language Models in Software Engineering. Allamanis et al. (Allamanis et al., 2014) develop a framework that learns the style of a codebase to suggest revisions for stylistic consistency. Nguyen et al. (Nguyen et al., 2013) develop a statistical semantic language model for source code to incorporate semantic information into code tokens and to model the regularities/patterns of such semantic annotations. Raychev et al. (Raychev et al., 2014) address the problem of synthesizing code completions for programs using APIs. They then learn a probabilistic model from existing data and use it to predict properties (e.g., variable names or type annotations) of unseen programs (Raychev et al., 2015). Tu et al. (Tu et al., 2014) introduce a cache language model that consists of an n-gram and an added “cache” component to exploit local regularities. White et al. (White et al., 2015) motivate deep learning for software language modeling and apply it to code suggestions. Hellendoorn and Devanbu (Hellendoorn and Devanbu, 2017) adapt N-gram models for source code (with deeply nested scopes and changing vocabularies) to create language models with a prediction accuracy that surpasses RNN and LSTM based deep-learning models. Nguyen et al. (Nguyen et al., 2018) present a Deep Neural Network language model that complements the local context of lexical code elements with both syntactic and type contexts. Efstathiou et al. (Efstathiou et al., 2018) release SE-specific word embeddings trained over 15GB of textual data from Stack Overflow posts; they show the model disambiguates polysemous words better thanks to its SE context. Santos et al. (Santos et al., 2018) use language models trained on correct source code to find syntax errors, and compare n-gram and LSTM LMs.

Naming. Several works predict a name for a source code entity given its context. Allamanis et al. (Allamanis et al., 2015) suggest class and method names with a neural probabilistic language model for source code. They later apply a convolutional neural network with attention to do a similar task (Allamanis et al., 2016). Vasilescu et al. (Vasilescu et al., 2017) describe an approach to recover original names from minified JavaScript programs based on statistical machine translation (SMT). Bavishi et al. (Bavishi et al., 2018) accomplish this using a deep learning-based technique. Jaffe et al. (Jaffe et al., 2018) generate meaningful variable names for decompiled code by combining a translation model trained on a parallel corpus with a language model trained on unmodified C code.

Translation approaches. Gu et al. (Gu et al., 2016) propose a deep learning based approach to generate API usage sequences for a given natural language query. They then propose to learn joint semantic representations of bilingual API call sequences from big source code data to support API call migration (Gu et al., 2017). Phan et al. (Phan et al., 2017) use a word2vec model to generate a sequence of C# API elements and related control units that are needed to migrate a given Java code fragment. Yin et al. (Yin et al., 2018) mine pairs of natural language and code from Stack Overflow to support tasks like code synthesis from natural language. Alon et al. (Alon et al., 2018a) present an approach that represents a code snippet as the set of compositional paths in its abstract syntax tree and uses attention to select the relevant paths while decoding to generate natural language sequences from source code snippets. Hu et al. (Hu et al., 2018) propose to use NLP and deep neural networks to automatically generate code comments. Tufano et al. (Tufano et al., 2019) investigate the ability of a Neural Machine Translation model to automatically apply code changes implemented by developers during pull requests.

Structured data beyond sequences. Several approaches integrate aspects of the structure of software systems. Note that in each of these cases, vocabulary still needs to be modeled. Allamanis et al. (Allamanis et al., 2018b) present how to construct graphs from source code and how to scale Gated Graph Neural Networks training to large graphs. Alon et al. (Alon et al., 2019) represent a code snippet as a single fixed-length code vector, which can be used to predict semantic properties of the snippet. Tufano et al. (Tufano et al., 2018) apply deep learning to learn code similarities from different representations. Alon et al. (Alon et al., 2018b) present a general AST path-based representation for learning from programs to predict program properties such as names or expression types. Ben-Nun et al. (Ben-Nun et al., 2018) define an embedding space, inst2vec, based on an Intermediate Representation of the code to recover the semantics of statements based on their context.

3. Modeling Choices

We present a series of modeling choices for source code vocabulary. These choices may be implicitly made by researchers, without evaluating the alternatives, and may not always be documented in their studies. By making them explicit, we hope that researchers will consider them and document them. Moreover, making them explicit allows us to study their impact on the training and performance of language models. We also make explicit some assumptions behind the modeling choices. The choices that we explore are geared towards specific desirable properties:

  • The models should be able to scale to large sizes (thousands of software projects). The training time should not increase much more than linearly as more data is added to the model.

  • To be versatile, models should avoid losing information. A model should be able to represent the original input as much as possible. Aggressive techniques restrict vocabulary drastically (e.g. to a few hundred tokens (Tufano et al., 2019)), but they lose much information.

  • Out-of-vocabulary tokens are not desirable, since they prevent a model from reconstructing these tokens.

  • Depending on the task, some categories of source code elements (e.g. comments) may not be needed.

3.1. Filtering Infrequent Tokens

The most common technique is to filter out uncommon tokens (those occurring less frequently than a threshold $k$). They are replaced by an <unk> token. The advantage of this technique is that it is extremely simple to implement in practice. This modeling choice, however, loses extensive amounts of information; as such, our goal is to avoid it.
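As a minimal sketch of this filtering step (the function names and the toy token stream are assumptions made for illustration), the vocabulary can be built from token counts and rare tokens replaced as follows. The out-of-vocabulary rate is included because it is one of the metrics studied later.

```python
from collections import Counter

UNK = "<unk>"


def build_vocab(train_tokens, min_freq):
    """Keep only tokens that occur at least min_freq times in the training data."""
    counts = Counter(train_tokens)
    return {token for token, count in counts.items() if count >= min_freq}


def replace_rare(tokens, vocab):
    """Replace every token outside the vocabulary with the <unk> placeholder."""
    return [token if token in vocab else UNK for token in tokens]


def oov_rate(tokens, vocab):
    """Fraction of tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(token not in vocab for token in tokens) / len(tokens)


train = "int i = 0 ; int j = 0 ;".split()
vocab = build_vocab(train, min_freq=2)   # drops the identifiers i and j
test_tokens = "int k = 0 ;".split()
print(replace_rare(test_tokens, vocab))
print(oov_rate(test_tokens, vocab))
```

Note how the identifiers, which carry most of the information, are the first tokens to disappear.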

3.2. Natural Language

Developers may comment their source code in another language, use identifiers in another language, or include non-English literals for testing or internationalization purposes. Thus, source code can contain non-English words in identifiers, strings, and comments. As handling multilingual corpora is an NLP research topic in itself, we evaluate the simplifying assumption of limiting a corpus to English.

We adopt a conservative heuristic to determine that a word is non-English: a word is non-English if it contains non-ASCII characters (we tried dictionary-based heuristics, but they had too many false positives). This heuristic still has some false positives: words such as “café” and “naïve” are considered non-English. Non-English words are replaced with a <non-en> placeholder.

This processing filters out non-English words when most of the file is in English, or when the code is in English, but comments and literals may not be. We handle projects that are mostly in another language, and testing/localization files, separately. These files would be either non-informative (full of <non-en>), or could dramatically expand the vocabulary. During pre-processing, we filter out files with more than a threshold of 0.6% of non-English words in code, or more than 1.9% in code and strings. The thresholds were set by inspecting a sample of files with non-English words around these values, and aiming for no false positives as we favor a conservative approach. This processing removes 0.62% of files in the corpus; many of them are concentrated in non-English projects that are entirely removed, with the remainder being localization or testing files.
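The heuristic and the file-level filter can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, a file is assumed to be already tokenized into separate lists of code words and string words, and the two default limits are the 0.6% and 1.9% thresholds quoted above.

```python
NON_EN = "<non-en>"


def is_non_english(word):
    """Conservative heuristic: any non-ASCII character marks the word as non-English."""
    return not word.isascii()


def replace_non_english(words):
    """Replace non-English words with the <non-en> placeholder."""
    return [NON_EN if is_non_english(word) else word for word in words]


def non_english_ratio(words):
    """Fraction of words flagged as non-English."""
    if not words:
        return 0.0
    return sum(is_non_english(word) for word in words) / len(words)


def keep_file(code_words, string_words,
              code_limit=0.006, code_and_strings_limit=0.019):
    """Drop a file if either ratio exceeds its threshold."""
    if non_english_ratio(code_words) > code_limit:
        return False
    return non_english_ratio(code_words + string_words) <= code_and_strings_limit


print(replace_non_english(["größe", "=", "width"]))
print(is_non_english("café"))  # a known false positive of the heuristic
```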

Choices. We consider the following choices:

  • Keep non-English words regardless

  • Replace non-English words with <non-en>. If a token is split (see below), only non-English subtokens will be replaced.

  • In addition to the previous choice, attempt to remove files that contain a large number of non-English tokens.

3.3. Literals

In a programming language, tokens have clearly defined types. Some token types have more importance than others: for instance, some types of literals may be less important for some tasks.

Choices. We consider the following choices:

  • Do not filter any literals.

  • Have a different minimum frequency for each token type. For instance, the minimum frequency could be higher for numbers, and lower for source code identifiers.

  • Replace all literals of a specific type with placeholders.

  • For numbers: only keep “likely frequent” numbers, such as numbers less than 100, replacing others with placeholders.
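The last choice could look like the sketch below. It assumes tokens arrive as (type, text) pairs from some lexer; the type names, the `<num>` placeholder name, and the decision to treat every number that is not a small non-negative decimal integer as infrequent are assumptions made for illustration.

```python
NUM = "<num>"  # hypothetical placeholder name


def _is_small_int(text, limit):
    """True only for non-negative decimal integers below the limit."""
    try:
        return 0 <= int(text) < limit
    except ValueError:
        return False  # floats, hex literals, etc. are treated as infrequent


def filter_numbers(typed_tokens, limit=100):
    """Keep "likely frequent" numbers and replace all others with a placeholder."""
    result = []
    for token_type, text in typed_tokens:
        if token_type == "number" and not _is_small_int(text, limit):
            result.append((token_type, NUM))
        else:
            result.append((token_type, text))
    return result


tokens = [("identifier", "x"), ("number", "42"),
          ("number", "65536"), ("number", "3.14")]
print(filter_numbers(tokens))
```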

3.4. Comments and Strings

Comments and literal strings often contain natural language, rather than source code. Since source code is more repetitive than natural language, one would expect that the contents of comments and strings would be much less repetitive. Moreover, while some tasks (e.g. detecting self-admitted technical debt (da Silva Maldonado et al., 2017)) rely on source code comments, others (such as autocompletion) do not.

Choices. We consider the following choices:

  • Keep string literals and source code comments intact. Each string literal or comment is modeled as a single token. This choice leads to an explosion of possible tokens, as it models entire sentences or paragraphs as unique tokens.

  • Keep string literals and source code comments, but model them as sequences of sub-tokens separated by whitespace. This treats these entities as the sequence of words they likely are, and preserves all the information in the source code.

  • If the loss of comments is acceptable, replace comments with a <comment> placeholder. Strings are processed as above.

  • If the loss of strings is acceptable as well, replace both comments and strings with placeholders (<comment> and <string>).
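The last three choices can be expressed as two switches over a stream of typed tokens. The sketch below is only an illustration: the (type, text) token format, the type names, and the function name are assumptions, not the tooling used in the study.

```python
def model_text_tokens(typed_tokens, keep_comments=True, keep_strings=True):
    """Split comments and strings on whitespace, or replace them with placeholders."""
    result = []
    for token_type, text in typed_tokens:
        if token_type == "comment":
            result.extend(text.split() if keep_comments else ["<comment>"])
        elif token_type == "string":
            result.extend(text.split() if keep_strings else ["<string>"])
        else:
            result.append(text)
    return result


tokens = [("comment", "// returns the sum"),
          ("identifier", "log"),
          ("string", '"bad input"')]
print(model_text_tokens(tokens))
print(model_text_tokens(tokens, keep_comments=False))
print(model_text_tokens(tokens, keep_comments=False, keep_strings=False))
```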

3.5. Whitespace

Some applications (e.g., pretty-printers (Allamanis et al., 2014)) may care about the layout of source code. Others may not, giving importance only to syntactic or semantic aspects (unless code layout is syntactically important, such as in Python). Note that whitespace has a negligible effect on vocabulary, as fewer than a handful of distinct tokens are needed. It does however significantly increase the number of tokens in the final corpus.

Choices. We consider the following choices:

  • Tabs, spaces, and newlines are each modeled as individual tokens.

  • Different tokens are used to represent two tabs, three tabs, etc.

  • Formatting is not important: tabs and newlines are removed.

3.6. Word Splitting and Casing

Word splitting. At 70% of source code (Deissenboeck and Pizka, 2006), identifiers are the bulk of source code and its vocabulary. While new identifiers can be created at will, developers tend to follow conventions when creating them. When an identifier is made of several words, these words are nearly always visually separated to ease reading, either in camelCase or in snake_case (Binkley et al., 2009). Thus, an effective way to reduce vocabulary is to split compound words according to these word delimiters.

To split, or not to split. The decision whether to split compound words or not has important ramifications. First, it introduces additional complexity: the LM can no longer rely on the assumption that source code is a sequence of tokens. Instead, compound words are modeled as sequences of subtokens. Predicting a compound word in a large vocabulary becomes predicting a sequence of subtokens, albeit in a smaller vocabulary. Note that in some cases (e.g., machine translation), techniques such as beam search can be used to keep track of more than one prediction. Second, splitting into subtokens increases the length of the sequences, making it harder to relate the current subtokens to the past context, as that context grows in size.

On the other hand, splitting tokens has advantages: the most obvious one is that the vocabulary can be drastically smaller. The second is that the out-of-vocabulary rate can be significantly reduced as a consequence. A third is that the model may be able to infer relationships (e.g. via embedding) between subwords, even if the composed word is rare, as the subwords are more common than the composed word. Approaches using subtokens have shown that splitting tokens allows a model to suggest neologisms, tokens unseen in the training data (Allamanis et al., 2015).

Word casing. A subsequent decision is whether and how to keep case information. By default, words in different cases (e.g. value, Value, VALUE) will be distinct words for the LM. This could cause the vocabulary to increase by a factor of up to 3, and make it harder for the LM to infer that words are similar. On the other hand, entirely removing case information loses information. Our solution is to encode case information in separator tokens (e.g., <_>, <Upper>, <UPPER>), at the cost of further increasing the size of the sequences. Table 1 provides examples of how we encode compound words. Other encodings could further reduce the number of tokens.

Case encoded in separator tokens
    <w> <Upper> malformed <UPPER> url <Upper> exception </w>
    <w> <UPPER> layout _ <UPPER> inflater _ <UPPER> service </w>
    <upper> tokenbreakingconventions
Case preserving
    <w> Malformed URL Exception </w>
Table 1. Example word splits
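A regular-expression based splitter that produces encodings in the style of Table 1 might look like the sketch below. It is an approximation under assumptions, not the authors' tool: the regular expression, the function names, the rule that only compound words are wrapped in <w> ... </w>, and the treatment of a single capital letter as <Upper> are all choices made for this illustration.

```python
import re

# Alternatives, in order: acronym (capitals not followed by a lowercase letter),
# capitalized word, lowercase word, digits, underscores, anything else.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|[0-9]+|_+|[^A-Za-z0-9_]+")


def _wrap(subtokens, is_compound):
    """Wrap compound words in <w> ... </w> boundary markers."""
    return ["<w>"] + subtokens + ["</w>"] if is_compound else subtokens


def split_preserving_case(identifier):
    """Split on camelCase and snake_case boundaries, keeping the original case."""
    parts = _PART.findall(identifier)
    return _wrap(parts, len(parts) > 1)


def split_encoding_case(identifier):
    """Split, lowercase every part, and encode its case in a separator token."""
    parts = _PART.findall(identifier)
    subtokens = []
    for part in parts:
        if part.isupper() and len(part) > 1:
            subtokens.append("<UPPER>")
        elif part[0].isupper():
            subtokens.append("<Upper>")
        subtokens.append(part.lower())
    return _wrap(subtokens, len(parts) > 1)


print(" ".join(split_encoding_case("MalformedURLException")))
print(" ".join(split_encoding_case("LAYOUT_INFLATER_SERVICE")))
print(" ".join(split_preserving_case("MalformedURLException")))
```

With both encodings, value, Value, and VALUE share the single vocabulary entry value once case is moved into separator tokens, at the price of longer sequences.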

Choices. The following decisions are possible:

  • Keep tokens as is, unsplit.

  • Split tokens into subtokens according to case, and keep the original case of each subtoken.

  • Split tokens into subtokens according to case, and encode case in separator tokens.

3.7. Subword Splitting

Even with word splitting, vocabulary may still grow large. First, natural vocabulary is large: many similar words (plural forms, past tenses, etc.) will be modeled as entirely distinct words. Second, developers may not follow conventions to separate words, or the conventions may not always apply (e.g. package names in Java are in lower case). Finally, identifiers can contain arbitrary numbers or sequences of characters (such as auto-generated identifiers). If we split into subtokens in the first place, why not go even further?

Character models. At the extreme, words are sequences of characters. The vocabulary needed would just be the set of possible characters; the out-of-vocabulary issue vanishes. Unfortunately, this drastically inflates sequence lengths, so a character model is not desirable. However there are interesting intermediate choices.

Numbers. Numbers are responsible for a large proportion of the vocabulary, yet the set of symbols they are built from is very limited. Thus, an alternative to filtering them out is to model them as a sequence of digits.
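As a minimal sketch of this idea (the helper name `split_number` is hypothetical, and only plain decimal integers are handled; other literal forms such as hexadecimal or floating-point numbers would need their own rules):

```python
import re


def split_number(token):
    """Return the digits of a plain decimal integer; leave any other token whole."""
    if re.fullmatch(r"[0-9]+", token):
        return list(token)
    return [token]


# One hundred distinct number tokens collapse to at most ten digit symbols.
numbers = [str(n) for n in range(100, 200)]
digit_vocabulary = {digit for number in numbers for digit in split_number(number)}
```

Here the unsplit numbers contribute 100 vocabulary entries, while `digit_vocabulary` contains only the ten digits.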

Byte-pair encoding. Byte-Pair Encoding (BPE) is an algorithm originally designed for data compression, in which bytes that are not used in the data replace the most frequently occurring byte pairs or sequences (Gage, 1994). This approach has been adapted to build vocabularies in NLP (Sennrich et al., 2015): the most frequently occurring sequences are merged to form new vocabulary words. The only parameter BPE needs is the number of merges ($n$) to perform. BPE starts by splitting all the words into characters. Then, it finds the most common pair of successive items in the corpus (initially characters, then tokens). This pair is merged into a new token, which is added to the vocabulary; all occurrences of the pair are replaced with the new token. The process is repeated $n$ times.
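The merge loop described above can be sketched as follows. This is a simplified illustration, not the implementation used in this paper: the function names, the word-frequency input format, and the lexicographic tie-breaking rule are assumptions.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every occurrence of the adjacent pair with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merges from a {word: frequency} mapping."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(s, best): f for s, f in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

For example, `learn_bpe({"exception": 10, "url": 3}, 20)` stops once both words are single symbols, so `segment("exception", merges)` returns the whole word, while an unseen word such as "urlexception" is still covered by known subsequences.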

BPE has several advantages. First, like a character model, no word is out-of-vocabulary; unknown words at test time are represented by subsequences. Second, it dynamically adapts to the frequency of the sequences: common subsequences will be merged, infrequent ones will not. Common words will be represented by a single token (e.g., exception), while rare ones will be segmented into roots, prefixes and suffixes (as prefixes and suffixes are common). This ensures that each sequence is common enough to have useful embeddings. Finally, it allows for fine-grained control of vocabulary size, by tuning the number of merges BPE performs. A larger vocabulary will have more complete words and fewer segmented ones; a smaller one will produce longer sequences.

Choices. Excluding character models, the choices are whether to apply BPE on split tokens, and the number of merges to apply.

3.8. Discarded choices

We considered stemming (Willett, 2006) to reduce vocabulary size but decided against it because: 1) stemming approaches work in the context of a specific language, while the corpus is multi-lingual to some degree; 2) stemming loses information: it is not always obvious how to recover the original word from its stem; 3) all words would be stemmed, whereas BPE decomposes infrequent words, keeping frequent words intact. Character-based models result in extremely long sequences, even for very common words; subword models outperform them (Mikolov et al., 2012). We do not consider aggressive vocabulary abstraction approaches (e.g., replacing identifiers with placeholders (Tufano et al., 2019)), as this prevents recovering the original identifiers.

4. Vocabulary results

In this section, we present how the vocabulary modeling choices impact vocabulary at scale. We consider the full Allamanis corpus (Allamanis and Sutton, 2013), 14,436 projects, and study:

  • Vocabulary size. How large is the resulting vocabulary?

  • Number of tokens. Several approaches split tokens into subtokens. How does the corpus size grow in response to this?

  • Out-of-vocabulary. We study the impact of replacing rare tokens with <unk>. We report the frequency threshold needed to bring the vocabulary size down to 100K (in line with Hellendoorn and Devanbu’s 76K (Hellendoorn and Devanbu, 2017)), and the resulting percentage of <unk> tokens.

  • Number of projects. How does the vocabulary grow when more projects are considered?
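The out-of-vocabulary measurement can be sketched as follows: given token frequencies, find the smallest minimum-frequency threshold that brings the vocabulary under a target size, and the share of the corpus that would become <unk>. The function name and interface are hypothetical, a non-empty frequency table is assumed, and the <unk> entry itself is not counted toward the target size.

```python
from collections import Counter


def unk_threshold(freqs, max_vocab):
    """Return (threshold, unk_rate): tokens seen fewer than threshold times become <unk>."""
    total = sum(freqs.values())
    by_count = Counter(freqs.values())  # frequency -> number of distinct tokens
    types_left = len(freqs)
    tokens_dropped = 0
    for count in sorted(by_count):
        if types_left <= max_vocab:
            return count, tokens_dropped / total
        # Drop every distinct token that occurs exactly `count` times.
        types_left -= by_count[count]
        tokens_dropped += count * by_count[count]
    # Even the most frequent tokens had to be dropped.
    return max(by_count) + 1, tokens_dropped / total
```

For instance, with frequencies {"a": 5, "b": 3, "c": 1, "d": 1} and a target of 2 entries, the threshold is 3 and 20% of the tokens are replaced by <unk>.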

We cover the most important configurations, with the first part of the comparisons shown in Table 2. Each configuration is compared to a previous configuration, which serves as its baseline.

Configuration | Vocabulary | Vs baseline | Corpus tokens | Vs baseline | 100K filter threshold (OOV) | Baseline
Unsplit corpus
Unsplit <tab> | 11,357,210 | 0.98 | 2,477,820,538 | 0.99 | 200-300 (4%) | Unfiltered <tab>
Unfiltered <tab> | 11,555,212 | 1.00 | 2,448,156,244 | 1.00 | 200-300 (4%) | Unfiltered <tab>
Unsplit /**/ "str" | 11,357,196 | 1.00 | 1,806,747,721 | 0.74 | 200-300 (5.5%) | Unsplit <tab>
Unsplit /**/ "str" | 10,753,203 | 0.95 | 1,238,234,376 | 0.69 | 200-300 (7.5%) | Unsplit /**/ "str"
Unsplit /**/ "str" | 9,499,013 | 0.84 | 1,133,050,827 | 0.63 | 150-200 (6.5%) | Unsplit /**/ "str"
Word splitting (compoundWord → compound <cap> word)
Split /**/ "str" | 1,588,777 | 0.14 | 2,972,812,831 | 1.65 | 45-50 (0.21%) | Unsplit /**/ "str"
Split /**/ "str" | 1,382,189 | 0.13 | 2,245,853,706 | 1.81 | 30-35 (0.21%) | Unsplit /**/ "str"
Split /**/ "str" | 974,606 | 0.10 | 2,087,403,458 | 1.84 | 20-25 (0.15%) | Unsplit /**/ "str"
Splitting numbers (123 → 1 2 3)
Splitnum /**/ "str" | 999,885 | 0.63 | 3,045,857,316 | 1.02 | 35-40 (0.15%) | Split /**/ "str"
Splitnum /**/ "str" | 832,994 | 0.60 | 2,295,315,822 | 1.02 | 25-30 (0.14%) | Split /**/ "str"
Splitnum /**/ "str" | 504,660 | 0.52 | 2,125,058,914 | 1.01 | 20-25 (0.10%) | Split /**/ "str"
ASCII filtering (über → <non-English>)
ASCII /**/ "str" | 978,089 | 0.98 | 3,045,857,316 | 1.00 | 35-40 (0.15%) | Splitnum /**/ "str"
ASCII /**/ "str" | 817,742 | 0.98 | 2,295,315,822 | 1.00 | 25-30 (0.14%) | Splitnum /**/ "str"
ASCII /**/ "str" | 504,431 | 1.00 | 2,125,058,838 | 1.00 | 20-25 (0.15%) | Splitnum /**/ "str"
Keeping case (compoundWord → compound Word)
Case /**/ "str" | 1,231,375 | 1.26 | 2,593,099,484 | 0.85 | 70-75 (0.30%) | ASCII /**/ "str"
Case /**/ "str" | 900,806 | 1.26 | 2,440,806,776 | 0.85 | 55-60 (0.23%) | ASCII /**/ "str"
Case /**/ "str" | 635,517 | 1.26 | 1,783,880,337 | 0.84 | 35-40 (0.20%) | ASCII /**/ "str"
Table 2. Corpus statistics (vocabulary, tokens, out-of-vocabulary (OOV))

4.1. Unsplit models

Full model. Our most complete configuration is “Unsplit full”: it contains all the files in the (de-duplicated (Allamanis, 2018)) corpus, including whitespace, comments, literals, and unsplit tokens. The only pre-processing is that comments and strings are modeled as sequences of words rather than as whole entities (modeling them as whole entities would roughly double the vocabulary). This vocabulary contains in excess of 11.5 million unique words. Reducing the vocabulary to less than 100,000 words requires replacing words that appear fewer than 200 to 300 times in the corpus with <unk>. The <unk> token would then account for 4% of the corpus and would be the 6th most frequent token.

Non-English files. While our heuristic to remove non-English files is conservative, it has little effect: it reduces vocabulary size by roughly 2% (200,000 tokens), and removes roughly 1% of the tokens.

Whitespace. Models that do not need whitespace can reduce the size of the corpus by 25% by removing spaces, tabs, and newlines. However, among non-whitespace tokens, the proportion of <unk> tokens needed to reach a 100K vocabulary is even higher.

Replacing comments and strings with placeholders (/**/, "str"). Both methods reduce vocabulary, by 5% for comments and a further 11% for strings. Removing both makes the vocabulary smaller than 10 million words. However the proportion of <unk> tokens to reach a 100K vocabulary rises to 6.5% (the 5th most common token). It appears source code tokens are more varied than comments and strings. Removing comments reduces corpus size by 30%.

Modeling tokens without splitting them into subtokens is impossible to do at scale without extremely aggressive filtering. <unk> would be one of the most frequent tokens.

4.2. Word splitting

Word splitting. The effect of splitting according to camelCase and snake_case is considerable: the vocabulary shrinks by a factor of 7 to 10, depending on the presence of strings and comments. The decrease is larger for models without strings and comments, which are richer in compound identifiers; the split model reaches a size of less than one million tokens. The flipside is that the number of tokens in the corpus increases considerably, as compound words are now sequences (including tokens encoding case): the corpus grows by 65 to 84%. To reach 100K words, the thresholds are much lower (even if still high, ranging from 20 to 50). The percentage of tokens that are <unk> is also much lower (0.15–0.21%).

Splitting numbers. While the improvement is important, the vocabulary is still extremely large. Splitting numbers into digits yields a considerable decrease in vocabulary, from more than a third to nearly half, at the cost of a very modest increase in the number of tokens (1–2%). Thus splitting numbers (or replacing them with placeholders) is very effective: the smallest configuration hovers just above half a million tokens, a 23-fold improvement over the initial one.

Non-English words. Filtering non-English words by the ASCII encoding heuristic offers very limited improvements: either the heuristic is too conservative, or much of the improvement was already done in the initial filtering of non-English files. (The heuristic is much more effective for BPE.)
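A minimal sketch of such a word-level ASCII heuristic, assuming the <non-English> placeholder shown in Table 2 (the function name is hypothetical and the paper's exact rule may differ):

```python
NON_ENGLISH = "<non-English>"


def filter_non_ascii(tokens):
    """Replace every token containing a non-ASCII character with a placeholder."""
    return [tok if tok.isascii() else NON_ENGLISH for tok in tokens]
```

For example, `filter_non_ascii(["über", "value"])` keeps "value" and replaces "über" with the placeholder. The heuristic is conservative: a non-English word written only with ASCII characters passes through unchanged.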

Keeping case. The benefit of keeping case is that compound words are described by shorter sequences, as case-encoding tokens are no longer needed. Compared to equivalent configurations, it decreases the size of the corpus by 15%, but increases vocabulary by 25%. Keeping case could have increased vocabulary by a factor of anywhere from 1 to 3, so 1.25 is in the lower range of estimates.

Word splitting heuristics are very useful to decrease vocabulary size, at the cost of increasing corpus size. However, the vocabulary is, at best, five times larger than our pre-defined threshold.

4.3. Vocabulary growth

While these results are encouraging, the growth of the vocabulary as projects are added provides another perspective. Figure 1, left compares the growth of the vocabulary size for unsplit configurations, and for the largest of the split corpora. The difference is large and widens significantly as more projects are added.

Figure 1. Growth of vocabulary for the corpus

Figure 1, right shows growth of vocabulary for three split configurations. We also see widening gaps between the configurations. However, none of the curves appear to plateau. There is no indication that the vocabulary will stabilize at some point. Going from 75 to 100% of projects with the best configuration adds nearly 20% new words. As unseen projects are added, out-of-vocabulary words are more likely.
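A growth curve of this kind can be computed by accumulating the set of distinct tokens as projects are added. The sketch below assumes each project is available as an iterable of tokens; the function name is hypothetical.

```python
def vocabulary_growth(projects):
    """Return the cumulative vocabulary size after each project is added."""
    seen = set()
    sizes = []
    for tokens in projects:
        seen.update(tokens)
        sizes.append(len(seen))
    return sizes
```

A curve that plateaus would show successive entries of the returned list converging; the curves in Figure 1 keep increasing instead.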

Vocabulary grows in an apparently linear fashion as new projects are added; word splitting is not enough to get it under control.

4.4. Byte-Pair Encoding

BPE allows us to specify our vocabulary size. Thus, the question is not whether it reduces vocabulary—it will!—, but how much is a good tradeoff between vocabulary size and sequence size.

Quantitative evidence. Corpus sizes for BPE configurations are shown in Table 3. If we compare the model with our smallest vocabulary (ASCII, numbers split, /**/, "str") to vocabularies obtained with BPE, we see that a vocabulary of (slightly more than) 1,000 words grows the corpus size by 22%. Most interestingly, a similar model with 5,000 words (20 times less than the 100K threshold) grows it by 4%. Finally, a model with 10,000 words grows the corpus by only 1%, but reduces the vocabulary by a factor of 50! (Note that including non-ASCII words adds a considerable number of Unicode characters, growing the vocabulary by 5,000 words in each case.)

In addition, since BPE merges based on frequency, the resulting subtokens, no matter their size, are frequent. Depending on the configuration, between 91 and 96% of subtokens occur more than 1,000 times in the corpus, and 97–99% occur more than 100 times. The fact that the items are frequent means that it is much more likely that good embeddings can be computed for them.
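The frequency shares quoted above are the fraction of vocabulary entries whose corpus count exceeds a cutoff. A sketch of that computation, with a hypothetical function name and a toy frequency table:

```python
from collections import Counter


def share_above(freqs, min_count):
    """Fraction of vocabulary entries that occur more than min_count times."""
    frequent = sum(1 for count in freqs.values() if count > min_count)
    return frequent / len(freqs)


toy_freqs = Counter({"exception": 1500, "url": 200, "lib": 50, "appirate": 5})
```

On the toy table, `share_above(toy_freqs, 1000)` is 0.25 and `share_above(toy_freqs, 100)` is 0.5.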

Configuration | Tokens (M) | Vs baseline | Baseline
1K "str"/**/ | 2,600 | 1.22 | ASCII "str"/**/
5K "str"/**/ | 2,209 | 1.04 | ASCII "str"/**/
10K "str"/**/ | 2,153 | 1.01 | ASCII "str"/**/
5K "str"/**/ | 2,396 | 1.04 | ASCII "str"/**/
10K "str"/**/ | 2,328 | 1.01 | ASCII "str"/**/
Case 5K "str"/**/ | 2,173 | 1.12 | Case "str"/**/
Case 10K "str"/**/ | 2,043 | 1.06 | Case "str"/**/
5K "str"/**/ | 3,228 | 1.06 | ASCII "str"/**/
10K "str"/**/ | 3,095 | 1.02 | ASCII "str"/**/
Case 10K "str"/**/ | 2,753 | 1.06 | Case "str"/**/
Case 20K "str"/**/ | 2,647 | 1.02 | Case "str"/**/
Table 3. Vocabulary statistics for BPE variants

Qualitative evidence. We inspected 110 random identifiers longer than 25 characters, along with the splits produced by BPE (the rationale being that longer identifiers are more likely to provide interesting splits). We show some examples in Table 4. While some of the words have optimal splits even at BPE 1K (example a), some are clearly sub-optimal (example b), but are optimal at BPE 5K. The model handles rare words due to typos gracefully (example c), and splits words correctly without case information, with enough merges (example d). Some words have satisfying splits at low BPEs, yet improve as BPE increases (example e). Finally, the model degrades gracefully for non-English words: those are not out-of-vocabulary, just long sequences (example f).

We classified each of the 110 splits into 3 categories: good (reproduces the expected case split), acceptable (one word was split into root and prefix or suffix, such as Grid  ify, or an acronym was not well reconstructed, such as I  BAN), and degraded (one or more words split incorrectly, or in more than 2 parts). We found 7 degraded splits: 2 foreign words, 2 words with typos (Fragement, INCULDED), 1 with rare words (TheImprisonedGourmet), an all-lowercase sequence of 8 words, and a word where the split was unclear (appirate). Of the good splits, 11 were found at BPE 1K (including common words such as exception, configuration, or attribute), 51 at BPE 5K, 28 at BPE 10K, and 8 at BPE 20K. While BPE 1K is too small, 5K is competitive, 10K is optimal, and 20K offers diminishing returns.

Adding back strings, comments, and case. Encouraged by these results, we increased the base vocabulary. We find that adding words found in strings and comments appears to have little impact on BPE 5K and 10K, both of which increase the size of the corpus only slightly, by 1–2%. A vocabulary of 10K words is more than 1,000 times smaller than the initial configuration (11,357,210), at the cost of increasing the number of tokens in the corpus by a factor of 1.7.

However, adding back case has a larger impact, as a relatively large number of words have at least two versions (example g). A second manual inspection of the same splits revealed that more words were decomposed into subwords (e.g., adjusted becomes Adjust ed, or implicitly becomes Implicit ly). Raising the number of merges to 20,000 is necessary, but it increases the corpus by only 2%, for a corpus with strings and comments.

We conclude that as a rule of thumb, a BPE with a 1–2% token increase performs very well. We note that our BPE also includes all numbers and literals: some sequences that were merged were common numbers. An approach that can afford to filter uncommon literals and numbers with a low out-of-vocabulary threshold (e.g., 5), may perform even better in the resulting vocabulary.

BPE shrinks source code vocabulary very effectively. Moreover, most of the vocabulary is frequent, improving embeddings.

Configuration | Token / BPE split
a) Optimal at BPE 1K
BPE 1K | layout inflater service
b) Optimal at 5K
Original | MalformedURLException
BPE 1K | m al for me d url exception
BPE 5, 10, 20K | malformed url exception
c) Effect of typos
BPE 1, 5, 10, 20K | inc ul ded template
d) Splitting without case
Original | cmd_reloadquestconfig
BPE 1K | c m d re load quest config
BPE 5, 10, 20K | cmd reload quest config
e) Continuous improvement
Original | httpclientandroidlib
BPE 1K | http client android li b
BPE 5K | http client android lib
BPE 10K | httpclient android lib
BPE 20K | httpclientandroidlib
f) Handling non-English words
Original | vormerkmedienauflister
BPE 5K | vor mer k medi en au f list er
BPE 20K | vor mer k medi en auf lister
g) Impact of preserving case
Original | alternativeEndpointsAndQueries
BPE 5K | alternative end points and queries
BPE 5K (case) | al tern ative End points And Qu eries
BPE 10K (case) | alternative End points And Queries
Table 4. Examples of Byte-Pair Encoding Splits

5. Training Language Models

In this section, we test whether we can successfully train large language models with our vocabulary choices, reporting on some results concerning training time. We also consider the model’s performance at the language modeling task. This is especially important in light of recent results in NLP that show that the knowledge learned to be able to do unsupervised language modeling effectively can transfer to supervised tasks (see Sect. 2). While there is early evidence that pre-training NLMs can be useful in Software Engineering NLP tasks, particularly for small datasets (Robbes and Janes, 2019), applying the same techniques to source code involves solving vocabulary issues. Finally, we also consider the more concrete code completion scenario, where we compare our models on some of the code completion scenarios of Hellendoorn and Devanbu (Hellendoorn and Devanbu, 2017).

5.1. Methodology

Models. For the NLMs, we use an AWD-LSTM (Merity et al., 2017), a state-of-the-art implementation of the LSTM, with a variety of strategies that improve its regularization capabilities, such as a version of dropout (Srivastava et al., 2014) adapted for LSTMs. The hyper-parameters were manually tuned on a fraction of the training set; we report on 4 configurations (BPE 5K and 10K, with and without strings). All LSTMs have an embedding layer of size 300, 650 hidden units, and 3 LSTM layers. We set a learning rate of , a weight decay factor of ; use the Adam optimizer with parameters 0.7 and 0.99 (Kingma and Ba, 2014). For the n-gram models, we use the implementation of Hellendoorn and Devanbu.

Corpus. Since our focus is to test whether NLMs can scale, we reuse the large-scale corpus of Allamanis and Sutton (Allamanis and Sutton, 2013). We divide the corpus in a training set (10,106 projects), a testing set (2,165 projects), and a validation set (2,165 projects). Recent work by Allamanis (Allamanis, 2018) points out that large-scale code duplication can bias the performance of ML approaches; the models train on source files that they may see during evaluation. We thus use the “dataset errata” of Allamanis to remove clone groups in the entire corpus.

Vocabulary choices. Our initial goal was to have a configuration as close as possible to the one of Hellendoorn and Devanbu. Similarly to them, we omit source code comments. One difference with the original setup lies in the treatment of string literals: Hellendoorn and Devanbu keep strings, but replace all strings longer than 15 characters with the empty string; we have both kinds of models run without strings instead, plus some LSTM variants with strings. (We assumed all strings were kept, and discovered this undocumented behaviour rather late; omitting strings altogether was the choice that had the fewest ramifications.)

Since the best performance for NLMs relies on token splitting, we try 2 BPE configurations, one at 5K and another at 10K. Both vocabularies are built on the training set only. For n-grams, we keep tokens unsplit: splitting them would result in the n-gram model reducing its context window and could thus impact performance.

Language modeling performance. Language modeling performance is our primary metric of interest as it opens up possibilities for transfer learning. We report entropy for all models. However, the entropy depends on vocabulary size and the number of tokens, which vary across configurations. The effects of this choice are hard to predict: while a model operating on subwords has fewer vocabulary words to choose from, it also has to make more predictions. Indeed, one could argue that the subword level is a more accurate reflection of true performance, as the percentage of predictions on syntax tokens (e.g., ;, (, ), …, which are extremely common and “easy to predict” (Rahman et al., 2019)) is lower.

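Mikolov et al. compared disparate models (subword and character models) by converting word-level entropy to character-level entropy: $H_c = H_w \times \frac{|W|}{|C|}$, where $|W|$ and $|C|$ are the numbers of words and characters in the test data. This was possible since none of the models were predicting out-of-vocabulary tokens. Similarly, we convert subtoken-level to word-level entropies: $H_w = H_s \times \frac{|S|}{|W|}$, where $|S|$ is the number of subtokens.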
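As a worked sketch of this kind of conversion, assume the total number of bits spent on the test data is preserved, so that a per-subtoken entropy is rescaled by the ratio of subtoken count to word count. The function names and the example counts below are illustrative assumptions, not values from the paper.

```python
import math


def entropy_bits(probabilities):
    """Average negative log2 probability assigned to the observed items."""
    return -sum(math.log2(p) for p in probabilities) / len(probabilities)


def to_word_level(subtoken_entropy, n_subtokens, n_words):
    """Rescale a per-subtoken entropy to a per-word entropy (total bits preserved)."""
    return subtoken_entropy * n_subtokens / n_words


# Illustrative counts only: 150 subtokens covering 100 original tokens.
word_entropy = to_word_level(2.0, n_subtokens=150, n_words=100)
```

With these illustrative counts, a subtoken-level entropy of 2.0 bits corresponds to 3.0 bits per word. The conversion is only meaningful when no prediction falls back to an out-of-vocabulary token.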

Code completion performance. Although code completion is not our primary focus, we also investigate this task. We report the mean reciprocal rank metric (MRR), similarly to Hellendoorn and Devanbu. MRR is the mean, over all predictions, of the inverse of the rank of the correct choice in a list of predictions: a correct prediction at rank $r$ is scored $1/r$. Similarly to the entropy metric, this metric is provided at the subword level for the NLMs, so a direct comparison with word-level prediction is not possible.
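A minimal sketch of the MRR computation, assuming each prediction is a ranked list of candidates and that a target missing from the list contributes a score of 0 (a common convention, assumed here rather than taken from the paper):

```python
def mean_reciprocal_rank(ranked_predictions, targets):
    """Mean of 1/rank of the correct item; 0 when the item is not in the list."""
    total = 0.0
    for candidates, target in zip(ranked_predictions, targets):
        if target in candidates:
            rank = candidates.index(target) + 1  # ranks start at 1
            total += 1.0 / rank
    return total / len(targets)


predictions = [["a", "b"], ["x", "y", "z"], ["q"]]
expected = ["a", "z", "m"]
mrr = mean_reciprocal_rank(predictions, expected)
```

Here the three predictions score 1, 1/3 and 0, so `mrr` is 4/9, or about 0.44.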

Evaluation scenarios. Hellendoorn and Devanbu have 3 evaluation scenarios, out of which 2 are suitable for NLMs: the static and the dynamic scenario. The static scenario is a cross-project scenario, in which models never train on the test data. The dynamic scenario allows models to train on test data after having seen it, which according to Hellendoorn and Devanbu favors NLMs. As our primary focus is the transfer learning potential of NLMs, we focus on the static scenario.

Fine-tuning a language model on a new project (similarly to (Howard and Ruder, 2018)) could also be interesting, but it would differ from the dynamic scenario, as the NLM would be allowed to see the project multiple times; we reserve this fine-tuning for future work, but provide entropy results for n-gram models in the dynamic scenario.

Use of cache. For n-gram models, Hellendoorn and Devanbu have several cache settings: plain (no cache), cache, and nested caches (various caching following the package structure of software systems). Similarly to before, we do not focus on caches as they do not improve performance in a transfer learning scenario. We evaluate the n-gram model with a cache (the cache-less n-gram did not exhibit good performance).

Training speed. We also report some metrics on the training speed of NLMs: time to complete an epoch, number of projects per minute, and number of files per second. This gives us insight on the ability these models have to quickly adapt to new data. As n-gram models do not have performance issues, we do not report these.

5.2. Results

All the results are presented in Table 5. We note that some of the NLMs have not yet fully converged, and may improve further with additional training.

Configuration | Subtoken entropy | Token entropy | MRR | Min/epoch | Projects/min | Files/s
LSTM ASCII BPE 5K (3 epochs) "str"/**/ | 1.82 | 3.59 | 0.80 | 375 | 26.9 | 52.7
LSTM ASCII BPE 10K (1 epoch) "str"/**/ | 2.19 | 4.22 | 0.78 | 670 | 15.1 | 29.5
LSTM ASCII BPE 5K (1.5 epochs) "str"/**/ | 1.96 | 3.84 | 0.79 | 406 | 24.9 | 48.8
LSTM ASCII BPE 10K (1.5 epochs) "str"/**/ | 2.14 | 4.06 | 0.78 | 709 | 14.2 | 27.9
6-gram with cache, Unsplit corpus "str"/**/ | - | 5.33 | 0.59 | - | - | -
Table 5. Performance statistics for a selection of language models

Training speed. Our fastest models can process an entire epoch of data (10,106 projects, 1,187,620 files) in roughly 6 hours. This represents roughly 27 projects a minute, or more than 50 source code files per second. These metrics were computed on a consumer-grade GPU (Geforce GTX 1080, released in 2016). Since one epoch is so large, the models perform well after one epoch only, even if they can improve with more epochs. Models with strings are slower, as they have to predict more tokens; the increase is approximately linear. Likewise, doubling the vocabulary roughly halves training speed.
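The throughput figures follow from the epoch duration by simple arithmetic. The sketch below recomputes them from the corpus size stated above and the minutes per epoch listed in Table 5; the function name is hypothetical.

```python
PROJECTS = 10_106    # training projects
FILES = 1_187_620    # training files


def throughput(minutes_per_epoch):
    """Return (projects per minute, files per second) for one full epoch."""
    projects_per_min = PROJECTS / minutes_per_epoch
    files_per_sec = FILES / (minutes_per_epoch * 60)
    return projects_per_min, files_per_sec


fast = throughput(375)  # fastest configuration in Table 5
slow = throughput(670)  # BPE 10K configuration without strings
```

The recomputed values agree with the Projects/min and Files/s columns of Table 5 to within rounding.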

Language modeling and completion. We find that with appropriate modeling choices, NLMs can be competitive with n-gram models: the LSTMs have a significantly lower entropy than the n-gram model (1.82–2.19 bits), even when converting it to token-level entropy (3.58–4.22 bits). This is despite the fact that the n-gram model uses a cache, while the LSTMs never update on test data. In addition, two of the LSTMs predict strings as well. Regarding code completion, while a direct comparison is not possible due to the difference in granularities of the predictions, we find that the MRR of the LSTMs is very competitive as well.

Dynamic n-gram models. We also computed entropies for dynamic n-gram models, finding much better token-level entropies of 3.69 bits (no cache), and 2.86 bits (with cache). While our best LSTM still edges out the former, the latter outperforms all of our LSTMs. This is expected, since maximizing project-level information is helpful for completion; our LSTMs in the fully static setting ignore it.

Further training. We note that training our LSTMs for longer periods (8 epochs) allowed them to outperform the dynamic model, all of them achieving entropies between 3.32 bits and 3.67 bits.

Performance of n-grams. The performance of the cached n-gram model is much lower than reported by Hellendoorn and Devanbu. We used their implementation for lexing and parsing; our only change was to convert all strings to empty strings, which should improve performance. We think that two factors cause the difference. The first is the removal of duplicate files (25% in this corpus). Allamanis also observed significant performance drops when evaluating models on a de-duplicated corpus (Allamanis, 2018). The second is the size of the test set, and the number of (unsplit) tokens not encountered in the training set, which is very high: the simple cache is not enough to cover all the cases, as the dynamic scenario results show.

Additional results. We experimented with n-gram models on split corpora. The reduced vocabulary did increase the performance, but when accounting for subtokens, the change was minimal. We tried higher order n-grams to compensate for the longer sequences, but saw little improvement with 7 or 8-grams.

For the LSTMs, we experimented with a neural cache (Grave et al., 2016), which is the neural equivalent of a regular cache model. The neural cache has a fixed-size window, in which the previous activations of the LSTM’s last hidden layer are stored together with the next word at that time step, at test time. Importantly, this cache mechanism is a test-time only addition, and does not require training, unlike other alternatives (Merity et al., 2016). We did observe slight improvements, but less than we hoped. This is however not surprising, since the neural cache’s main benefit is the prediction of out-of-vocabulary tokens, which is not an issue for our model thanks to BPE.
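The mechanism can be sketched in plain Python, following the general description of a neural cache: stored hidden states are compared with the current one by dot product, the matching next words receive probability mass, and the result is interpolated with the model's own distribution. The function names, the use of plain lists as vectors, and the values of the scaling factor `theta` and interpolation weight `lam` are illustrative assumptions, not the settings used in this paper.

```python
import math
from collections import defaultdict


def cache_distribution(history, query, theta=1.0):
    """history: list of (hidden_vector, next_word) pairs from the cache window."""
    scores = defaultdict(float)
    for hidden, next_word in history:
        similarity = sum(a * b for a, b in zip(hidden, query))
        scores[next_word] += math.exp(theta * similarity)
    normalizer = sum(scores.values())
    return {word: s / normalizer for word, s in scores.items()} if normalizer else {}


def interpolate(p_model, p_cache, lam=0.1):
    """Linear interpolation of the model distribution with the cache distribution."""
    words = set(p_model) | set(p_cache)
    return {
        w: (1 - lam) * p_model.get(w, 0.0) + lam * p_cache.get(w, 0.0)
        for w in words
    }
```

Because the cache is built from test-time activations only, no parameter of the language model changes; the window content is simply replaced as the test sequence advances.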

6. Discussion

Vocabulary. With BPE 10K, the vocabulary size of an ASCII model shrinks by a factor of more than 1,000 over the initial configuration, and the out-of-vocabulary issue disappears. On the other hand, the size of the corpus (in number of tokens) is a little less than double (excluding whitespace). Bradbury et al. (Bradbury et al., 2016) showed that for a QRNN language model, the Softmax layer can start to dominate computation costs for vocabularies as small as 10,000 words. For a similar scenario, reducing the vocabulary a thousandfold while doubling corpus size could lead to an execution that is several hundred times faster. More generally, vocabulary has a very large impact on model training and performance, which is why vocabulary modeling choices should be clearly documented.

Training speed. Our models were trained for less than a day on 10,106 projects. While they have not fully converged, they already show excellent performance. Such models are able to process 20 to 50 files per second. While a model may need to see a given file several times to fully integrate it, this is still impressive. Moreover, LSTM variants such as QRNNs (Bradbury et al., 2016) can train faster as they parallelize better on the GPU. This has several practical implications:

  • The slowdown of training larger models with more capacity may be acceptable, yielding potentially higher performance.

  • Since the vocabulary does not grow, it is easier to train on even more data as training time scales linearly with data.

  • Taking a generic language model and fully fine-tuning it on a specific project may not be costly, and may be perhaps measured in minutes, rather than hours.

  • If training is fast, perhaps having dynamic NLMs reacting to context changes is feasible, similar to Hellendoorn and Devanbu’s nested caches.

7. Limitations of this study

Peer-review data. Sharing data at this time is impractical: we work with multiple pre-processed versions of a large corpus, and train large language models. Re-running the scripts is also time-consuming. We will release the data, scripts, and models if the paper is accepted.

Non-exhaustive choices. While we wanted to cover as many vocabulary modeling choices as we could, we cannot guarantee that the choices are exhaustive.

Filtering. Our heuristics to filter non-English files are rather crude: they are conservative, and perhaps better heuristics could lead to further vocabulary reductions. On the other hand, some legitimate English words with latin accents may be filtered out. Techniques to recover the unaccented characters such as character folding might improve the model further.
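One possible form of character folding, sketched with Python's standard `unicodedata` module: decompose each character and drop the combining marks. The function name is hypothetical, and this approach only covers characters that decompose into a base letter plus a mark; letters such as "ø" or "ß" would need an explicit mapping.

```python
import unicodedata


def fold_accents(word):
    """Strip combining accents so that accented Latin words map to ASCII where possible."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, `fold_accents("café")` gives "cafe" and `fold_accents("naïve")` gives "naive", so such words would survive an ASCII filter instead of being discarded.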

Handling Non-English languages. While non-English source code is uncommon in our corpus, non-English comments are more common. Thus a proper handling of other languages would be welcome.

Other tasks. We focus on language modeling. While there is potential to transfer language modeling to other tasks, this potential still has to be realized in future work.

Other architectures. Neural architectures such as QRNNs (Bradbury et al., 2016) or Transformers (Vaswani et al., 2017) are more computationally efficient than LSTMs. Investigating them would be welcome. More importantly, investigating architectures that take full advantage of software’s structure, via trees (Alon et al., 2019) or graphs (Allamanis et al., 2018b; Tufano et al., 2018) would be a much better architectural choice. Finally, caching should still be improved for the code completion task.

Improvements for code completion. Our experiments on the code completion scenario were not the main focus of the paper and could be expanded. A comparison with n-gram and nested caches would be instructive. Improving on the cache is possible: an example would be a hybrid cache that “corrects” a predicted sequence of subtokens if it does not exist in the project, but a similar one does. Another would be adding beam search, if the slowdown is acceptable. In parallel to our work, Karampatsis and Sutton proposed an NLM using BPE, with beam search (Karampatsis and Sutton, 2019).

Improvements on language modeling. We did not explore the entire space of modeling choices. In particular, perhaps a good tradeoff exists for filtering uncommon literals that are very likely to be unique (e.g. a string of random characters). This would also increase the quality of BPE splits for the same number of merges. This may enable us to have acceptable quality for a model keeping case information at less than 20K.

8. Conclusions and future work

While software is more repetitive than natural language, software vocabulary is much more diverse as developers can create new identifiers at will. This is a serious hurdle to make NLMs work at scale. In this paper, we showed how modeling choices for source code vocabulary drastically influence the resulting vocabulary, and that the techniques that allow the vocabulary to be kept under control (such as BPE) are not necessarily intuitive.

We showed how applying a set of modeling choices on a large corpus of 14,436 software projects made it possible to reduce the vocabulary by three orders of magnitude, while less than doubling the amount of tokens to consider. Further, such a vocabulary is not affected by the out-of-vocabulary problem.

As a consequence, we were able to train large NLMs at scale: a model can be trained on 10,106 projects in less than a day, averaging 27 projects per minute and 50 source code files per second. This LM was competitive at language modeling and code suggestion tasks. Moreover, this kind of performance opens the door to the pretraining of large NLMs for software, and the transfer learning possibilities pretrained NLMs enable.

The computational results presented in this paper have been achieved using the Vienna Scientific Cluster (VSC).