
Code Prediction by Feeding Trees to Transformers

In this paper, we describe how to leverage Transformer, a recent neural architecture for learning from sequential data (such as text), for code completion. As in the realm of natural language processing, Transformers surpass the prediction accuracy achievable by RNNs; we provide an experimental confirmation of this over a Python dataset. Furthermore, we show that the way to obtain even better accuracy from Transformers is to expose the syntactic structure of code, which is easily recovered by parsing, to the neural network. This works significantly better than presenting the code as a linear token sequence, which is how Transformers were originally intended to be used. To accomplish this, we propose a novel enhancement to the self-attention mechanism of the Transformer. We enable the mechanism to learn weights—that is, how much to focus on each preceding token in the input—not only on the basis of a token's value, but also on the basis of the spatial relationships, as in their positions in the abstract syntax tree, between each pair of tokens. We provide comprehensive experimental evaluation of our proposal, along with alternative design choices, on a standard Python dataset, as well as on a Python corpus internal to Facebook.





Code Repositories


This repo will contain replication package for the paper "Feeding Trees to Transformers for Code Completion"


1. Introduction

The last several years have witnessed exciting progress in the application of machine learning techniques to developer productivity tools (Allamanis et al., 2018a), and in particular, to code prediction (Hindle et al., 2016; Raychev et al., 2016; Li et al., 2018; Brockschmidt et al., 2019). The idea of code prediction in general is to predict the next code element given previously written code. Code prediction is commonly used in an IDE for auto-complete, where, based on the developer's cursor position and the code already written up to that position, the IDE offers the most likely next tokens (perhaps as a drop-down list to choose from). Auto-complete not only saves the developer from having to type in the next token(s), but is also an effective code learning mechanism: for instance, a developer might not know the name of an API call they need off the top of their head, but is able to choose among the choices shown by an auto-complete tool.

Recent work has shown the promise of code prediction based on machine learning. A common idea here is to use language models trained over large code corpora—treating code as text in a natural language (Hindle et al., 2016; Allamanis et al., 2018a)—to enable highly accurate code prediction. These models have leveraged natural language processing techniques: n-grams (Hindle et al., 2016; Hellendoorn and Devanbu, 2017b), and more recently, deep neural networks such as RNNs (Parvez et al., 2018; Li et al., 2018; Liu et al., 2020).

A different line of work has proposed code prediction based on statistics over the syntactic structure of code, as opposed to seeing code as text. These include probabilistic context-free grammars and probabilistic higher-order grammars (Bielik et al., 2016a; Raychev et al., 2016). This class of models considers code artifacts as abstract syntax trees, and makes its predictions based on information gleaned selectively across paths in the code's AST. Specifically, Raychev et al. (Raychev et al., 2016) learn a decision tree model that uses this information essentially as features.

Researchers in the NLP community have recently developed Transformers, a new neural architecture for even more effective natural language processing (Vaswani et al., 2017). As we discuss later, Transformers promise to overcome some of the limitations of RNNs. We investigated the use of Transformers for code prediction, treating code as textual data, and validated experimentally that Transformers indeed outperform RNNs on the next code token prediction task.

Given this already strong baseline, we consider the question of whether informing the Transformer of code’s syntactic structure can further improve prediction accuracy. Our main result is that a better way to use transformers for code prediction is to expose the syntactic structure of code to the network. The details of how to do this are interesting, as encoding the structure of a program’s abstract syntax tree is not natural for sequence models. We show a range of design choices for communicating the AST structure to the Transformer. We find that the more faithfully we communicate the tree structure to the Transformer, the better the accuracy we obtain!

1.1. Key Results

We report results based on training and evaluating various models for code prediction on the py150 (1) dataset.

  • We show that a neural model based on the Transformer architecture is able to outperform state-of-the-art neural models (e.g. the RNN-based models in (Hellendoorn and Devanbu, 2017a; Karampatsis et al., 2020)) as well as non-neural models (e.g. Deep3 (Raychev et al., 2016)). Measured on the leaf tokens of the ASTs, our best Transformer model improves the mean reciprocal rank (reported as a percentage, see Sec 5) significantly over the prior work: over the RNN model (40.0% vs 55.5%) as well as over the corresponding Deep3 model (43.9% vs 73.6%).

  • We show that a key to obtaining superior performance from the Transformer model is to feed it not just the source token sequence, as is common in NLP tasks, but to make the Transformer aware of the syntactic structure of the code. We show that with more detailed syntactic structure, we get better accuracy (from 65.7% to 74.1% on leaf tokens).

    We provide a preliminary investigation into why the Transformer model that is aware of tree structure works better than one without, by using saliency maps (Simonyan et al., 2014).

  • Our key technical novelty is a novel enhancement to the Transformer’s self-attention mechanism. We enable the mechanism to learn weights—how much to focus on each preceding token in the input—by factoring in the spatial relationship in the abstract syntax tree between each pair of tokens.

  • We also evaluated our trained model on a dataset selected from a Python code repository internal to Facebook, and found the relative benefits to be similar to those on py150. The accuracy on this other corpus indicates that the Transformer model is generalizable to other corpora.


Sec 2 articulates the code prediction problem in a couple of different forms, and introduces a running example. Sec 3 gives an introduction to Transformers, including how they apply to source code. This section also describes how to communicate tree structure to the Transformer. Sec 4 provides a quick recap of previous work, focusing on the models against which we compare ours. Sec 5 describes our datasets and implementation. Sec 6 presents our quantitative results. Sec 6.4 takes a closer look into why our models worked well (or did not). Sec 7 discusses related work in the area of code prediction and in using Transformers. We conclude the paper with directions for future work.

2. Code Prediction

ip = socket.gethostbyname(host)
[port, request_size, num_requests, num_conns] = map(
                string.atoi, sys.argv[2:])
chain = build_request_chain(num_requests, host, request_size)
Figure 1. Running example of Python code. The code snippet (from data/JeremyGrosser/supervisor/src/supervisor/medusa/test/test_11.py) is from the py150 dataset (1).

Consider the Python code fragment in Fig 1. Suppose a developer has written code up to string followed by a dot. At this point, it will be helpful for the IDE to prompt the developer with attribute names that are likely to follow, preferably with atoi ranked at the top, because in this case that is the correct next token.

Our goal is to devise a model such that it takes some code fragment as input and predicts the next token. In this section, we describe two main methods of representing code as inputs to be fed into various models.

2.1. Sequence-based Representation

In NLP, a common way of feeding in information for next token prediction is with a linearized token sequence. The same technique can be applied with source code, where we parse the source code into tokens. To predict ”atoi”, we would look at the tokens: […, ”map”, ”(”, ”string”, ”.”]. This is a natural approach for next token prediction since each input and prediction in the sequence equates to a token in source code, so we can easily evaluate on all tokens.
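This linearization can be illustrated with a minimal sketch using Python's built-in tokenize module (the paper's exact tokenizer is not specified here; this is only illustrative):

```python
# Minimal sketch: linearizing source code into a token sequence for
# next-token prediction, using Python's built-in tokenize module.
import io
import tokenize

def to_token_sequence(source):
    """Return the source token strings, in order, skipping layout tokens."""
    skip = {tokenize.ENCODING, tokenize.NEWLINE, tokenize.NL,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    toks = tokenize.generate_tokens(io.StringIO(source).readline)
    return [t.string for t in toks if t.type not in skip]

tokens = to_token_sequence("x = map(string.atoi, args)")
# The context used to predict "atoi" is every token before it:
context = tokens[:tokens.index("atoi")]  # ['x', '=', 'map', '(', 'string', '.']
```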

2.2. AST-based Representation

An alternative to a source token sequence is the Abstract Syntax Tree (AST), shown in Fig 2 for the code fragment in Fig 1. An AST can better represent the spatial relationships between nodes. For example, in source code, the tokens ip (node 3) and chain (node 41) are separated by 30 tokens, but they are related in the AST via a specific (short) path.

ASTs represent some source tokens explicitly and others implicitly. Tokens corresponding to identifiers, field names, and constants appear explicitly as leaf (terminal) nodes: for instance, ip and host appear as the leaf (terminal) nodes 3 and 11, respectively. Keywords and other syntactic tokens (e.g. =) are implied by the type of internal nodes (e.g. Assign). Accordingly, the prediction task can be separated into:

  • Value prediction: Predicting the values at leaf nodes. For example, given nodes 0-10 of the tree, we want to predict host, which is the value of the leaf node at node 11.

  • Type prediction: Predicting the types at internal nodes. For example, given nodes 0-33 of the tree, we want to predict Attr, which is the type of the internal node at node 34.

Knowing that the type of a node is Attr implies that after the source tokens corresponding to its left child, there will be a token "." (dot) before the (single) token from its right child. Thus, value prediction and type prediction together can simulate the next-token prediction problem, though there would need to be a stack-based controller that calls the right predictor, maintains some state, and emits the predicted source tokens appropriately.
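The Attr case above can be illustrated with a toy emitter (the node encoding and the recursion below are invented for illustration; the paper only outlines such a controller):

```python
# Hypothetical sketch of combining value and type predictions to emit
# source tokens: an internal node of type "AttributeLoad" implies a "."
# token between the tokens of its two children.
def emit(node):
    kind = node[0]
    if kind == "leaf":                      # ("leaf", value)
        return [node[1]]
    if kind == "AttributeLoad":             # ("AttributeLoad", left, right)
        return emit(node[1]) + ["."] + emit(node[2])
    # Other node types would be handled analogously.
    raise ValueError(f"unhandled node type: {kind}")

ast = ("AttributeLoad", ("leaf", "string"), ("leaf", "atoi"))
emit(ast)  # -> ['string', '.', 'atoi']
```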

Figure 2. AST for the example in Fig 1. The leaf (terminal) nodes have values and the interior (non-terminal) nodes have types.
Token value                  ip  socket  gethostbyname  host  map  string  atoi  sys  argv   2  chain
(node id)                     3       7              9    11   24      27    29   33    35  38     41
Previous work  SrcRNN (seq) >10     >10              3     2    7     >10     2    1     1   3    >10
               Deep3  (AST)   5       5              3     1    5       5     5    5     1   6      5
Our work       SrcSeq (seq) >10       1              1     6  >10     >10     1   10     1   1    >10
               DFS    (AST) >10       1              5     1    4       1     1    1     1   1    >10
               DFSud  (AST)   3       1              1     1    1       1     4    1     1   1      1
Table 1. Ranks for the predictions for the leaf nodes listed in Fig 2. >10 means the model did not get the right answer in the top 10 results. DFSud is our most powerful model.

2.3. A preview of results

In this paper, we explore both sequence-based and AST-based representation of code for code prediction, using various models (RNN, Decision Tree, Transformers). Table 1 shows the ranks (lower is better) of predicting the correct leaf node for all the leaf nodes in the AST in Fig 2. It compares two models of previous work and four Transformer-based models (our work). Transformer models generally achieve lower ranks, and in some cases they are the only models that produce the right token in their top-10 predictions. This table also shows (via one example here, but the results carry over) that feeding ASTs to Transformer models brings better results than feeding them source token sequences. The core of our paper is about how to feed ASTs to Transformers.

3. Transformers for Code Prediction

In this section, we explain the four models of our own creation: SrcSeq, RootPath, DFS, and DFSud. All four models use Transformers (Vaswani et al., 2017), a class of deep learning models that have achieved state-of-the-art results (Devlin et al., 2018; Dong et al., 2019; Radford et al., 2019) for a variety of NLP tasks such as language modeling, question answering, and sentence entailment. We discuss how to apply Transformers to next code token prediction, feeding in both sequence-based (SrcSeq) and AST-based (RootPath, DFS, DFSud) inputs.

Figure 3. Schematic of a GPT2 Transformer. The self-attention layer is able to consider all tokens in the input up to the point of prediction. Here the self-attention box depicts the information flow when predicting next token after the ”.”; see Table 3 for where the numbers come from.

3.1. A Primer on Transformer

Transformers belong to a class of deep neural networks that are designed for sequence processing. Transformers eschew the hidden states of earlier generation sequence networks (such as RNNs; see Sec 4) in favor of exposing the entire input sequence simultaneously, relying solely on attention mechanisms. In Transformers, information from any previous location of the sequence can directly affect the encoding of the next token, through a mechanism called self-attention, which greatly improves connectivity in long sequences. Transformers also use multiple heads of these self-attention blocks, called multi-headed attention, which enables the model to simultaneously consider different ways of attending to previous information within one block and also across other blocks.

This section explains self-attention in detail (Figure 3), as it is the crux of the model. The purpose of self-attention is to give higher attention to more relevant tokens in the input sequence. To illustrate this, let's take an example input sequence: ["map", "(", "string", "."], with the target token being "atoi". This input sequence is first fed through the initial embedding layer to give an embedding matrix E. Then, E is used as input to three fully-connected networks (with weight matrices W_Q, W_K, W_V) to create query, key, and value embeddings for the sequence:

Q = E W_Q,  K = E W_K,  V = E W_V

In our example, the sequence length is n = 4. We use the queries Q to "query" the "keys" K to see which token relationships are the most important, by calculating Q K^T. This results in a matrix of size n x n, as seen in Table 2, where n is the length of the input sequence. Each row is then normalized (by the square root of the key dimension d_k) and passed through a softmax layer so all the scores are positive and add up to 1. Table 3 shows an example of the self-attention weights (the rows do not sum to 1 since there are previous tokens in the sequence that are not shown in this table); looking at the last row, we can see that most of the self-attention is given to ".", meaning it has a greater factor in predicting the next token "atoi". Also note that the matrix is a lower triangular matrix: this is because self-attention cannot be applied to tokens that have not been seen yet. Finally, this matrix is multiplied with the value embedding V to weight the token embeddings:

Z = softmax(Q K^T / sqrt(d_k)) V

Z is then fed through a fully-connected network, coupled with skip connections and layer normalizations. This process is repeated num_layer times. Finally, the output of the last layer goes through a classification layer at the end to generate predictions for the next token.

            map        (          string     .
map      q1·k1
(        q2·k1     q2·k2
string   q3·k1     q3·k2     q3·k3
.        q4·k1     q4·k2     q4·k3     q4·k4
Table 2. Matrix for calculating the self-attention "scores" for each token combination in the input sequence for Transformers. We use the queries Q to "query" the "keys" K to see which tokens are the most relevant for the next token prediction. The matrix multiplication is calculated as Q K^T.
            map        (          string     .
map      0.9
(        0.6        0.1
string   0.1        0.1       0.7
.        0.2        0.1       0.2       0.4
Table 3. Example matrix of the numerical self-attention "scores" after taking the softmax over the normalized values in Table 2. Note that the rows listed here do not sum up to exactly 1 since there are previous tokens in the input sequence (not shown in this matrix) that self-attention gives scores to as well.
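The computation described above (embedding, Q/K/V projections, causal masking, and row-wise softmax) can be sketched in a few lines of NumPy; the dimensions and random weights here are stand-ins, not the trained model:

```python
# Minimal NumPy sketch of masked (causal) self-attention.
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                     # sequence length, embedding size
E = rng.normal(size=(n, d))     # embeddings for ["map", "(", "string", "."]
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = E @ W_q, E @ W_k, E @ W_v
scores = (Q @ K.T) / np.sqrt(d)                  # n x n, as in Table 2
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf                           # no attention to future tokens
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax rows, as in Table 3
Z = weights @ V                                  # weighted values fed onward
```

Each row of `weights` sums to 1, and the matrix is lower triangular, matching Table 3.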

For other details, please refer to Vaswani et al. (2017) (especially the multi-head attention part) and, in particular, to GPT-2 (Radford et al., 2019) for a more thorough description.

The next sections discuss various ways of feeding code fragments into this Transformer architecture.

3.2. SrcSeq 

Our first attempt is to apply a Transformer over source token sequences. As a baseline for later models that take more tree information, as well as a straightforward application of Transformer models, we apply a Transformer (GPT-2) over source token sequences:

Y = Transformer(E_src)

where Y is the output of the Transformer to be used for prediction, and E_src represents the embedding of the source tokens. The model does next token prediction by taking all preceding source tokens, up to the point of prediction, as input. As the inputs and outputs are the same as those of the SrcRNN model (introduced in Sec 4), we can do a direct comparison between RNNs and Transformers. As we show in the experiments, this turns out to be an already strong baseline.

The next two subsections discuss how to present the AST to the Transformer.

3.3. Dfs 

One way to present all AST nodes to a Transformer is to linearize them using a pre-order traversal, or depth-first search (DFS). For Fig 2, for node 29, the previous nodes in DFS order would be: […, "Call", "NameLoad", "map", "AttributeLoad", "NameLoad", "string", "Attr"]
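The DFS linearization can be sketched as a pre-order traversal (the nested (type, children) node format below is invented for illustration; py150 stores ASTs as flat JSON node lists, which linearize the same way):

```python
# Sketch of the DFS (pre-order) linearization of an AST fragment.
def dfs_linearize(node):
    label, children = node
    seq = [label]                      # visit the node itself first...
    for child in children:
        seq.extend(dfs_linearize(child))  # ...then its subtrees, left to right
    return seq

ast = ("AttributeLoad",
       [("NameLoad", [("string", [])]),
        ("Attr", [("atoi", [])])])
dfs_linearize(ast)  # -> ['AttributeLoad', 'NameLoad', 'string', 'Attr', 'atoi']
```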

The DFS model simply feeds this sequence to the Transformer:

Y = Transformer(E_ast)

where Y is the output of the Transformer to be used for prediction, and E_ast represents the embedding of the AST nodes. DFS predicts the next node in the AST; thus, it does both value (leaf) prediction and type (internal) prediction.

DFS presents the tree nodes in a pre-determined order, but still does not retain the detailed structural relationships between nodes. For example, consider the sequence of nodes 26-28 in Fig 2. This would be represented as ["NameLoad", "string", "Attr"], the three nodes appearing consecutively in DFS order. Looking at the AST, we can see that the relations between ("NameLoad" & "string") and ("string" & "Attr") are actually quite different: "NameLoad" is one node up from "string", while "string" is two nodes up and one node down from "Attr". This path-based relation between the nodes provides richer information about the actual structure of the tree.

While DFS itself shows only a small improvement over SrcSeq (Table 6), it allows us to augment it with the richer information indicated above, leading to the DFSud model.

3.4. DFSud 

DFSud is an extension to the DFS model that incorporates more tree structure. Specifically, given any two nodes i and j in the AST, we want to capture the shortest path needed to reach one from the other, and communicate this to the Transformer. The path is represented abstractly only in terms of up and down moves:

r_ij = U^u D^d

where u and d are the numbers of up and down moves, respectively, that one node has to make to reach the other. (Code2vec (Alon et al., 2019) used (embeddings of) leaf-to-leaf AST paths to capture information for the purpose of code summarization; by contrast, UD paths specifically retain information on how a pair of tree nodes are situated with respect to each other.) We create a matrix R to contain r_ij for each pair of nodes (i, j), where j comes after i in DFS order. Table 4 (ignoring the qk parts inside the parentheses) shows an example of R in the context of our running example (nodes 25-29 in the AST; node 24 was omitted due to space constraints for the table).
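Computing the up/down relation between two nodes amounts to finding their lowest common ancestor; a sketch, assuming a hypothetical parent map over node ids:

```python
# Sketch of computing the up/down relation U^u D^d between two AST nodes:
# walk one node's path to the root, find the lowest common ancestor, and
# count the remaining hops down to the other node.
def ud_relation(a, b, parent):
    """Return (u, d): hops up from `a` to the LCA, then down to `b`."""
    ancestors_a = {}
    node, hops = a, 0
    while node is not None:
        ancestors_a[node] = hops
        node, hops = parent.get(node), hops + 1
    node, d = b, 0
    while node not in ancestors_a:
        node, d = parent[node], d + 1
    return ancestors_a[node], d

# Toy tree mirroring Fig 2: AttributeLoad(25) -> NameLoad(26) -> string(27),
# and AttributeLoad(25) -> Attr(28).
parent = {26: 25, 27: 26, 28: 25}
ud_relation(27, 26, parent)  # -> (1, 0): one node up from "string" to "NameLoad"
ud_relation(27, 28, parent)  # -> (2, 1): two up and one down from "string" to "Attr"
```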

Notice that this matrix has the same shape (lower triangular) as the matrix in Table 2. We incorporate R in the attention block (after passing it through an embedding layer):

A = (Q K^T) ∘ E(R)

where ∘ is the element-wise product.

Table 4 shows an example of the new self-attention scores. One detail to note here is that the path relations are computed relative to the next token we are predicting.

The rest of the Transformer model is the same as DFS's, with the updated attention calculation:

Y = Transformer(E_ast, E(R))

where Y is the output of the Transformer to be used for prediction, E_ast represents the embedding of the AST nodes, and E(R) represents the embedding of the relations.

Why might adding R help the model do better?

Note that Q K^T provides a way for the model to learn the strength of attention it needs to pay to previous tokens, organized in the order of inputs to the network (this order is implicit in the indices used in the matrix in Table 3). E(R) provides a way for the model to learn the strength of the attention to pay to previous tokens, considering the AST relationship between pairs of nodes as well.

To recap, our key insight is to fortify the self-attention mechanism of the Transformer to enable it to learn weights on the basis of AST relationships between tokens as well.

               25                   26                   27                   28
25   (q25·k25) U0D0
26   (q26·k25) U0D1   (q26·k26) U0D0
27   (q27·k25) U0D2   (q27·k26) U0D1   (q27·k27) U0D0
28   (q28·k25) U0D1   (q28·k26) U1D1   (q28·k27) U2D1   (q28·k28) U0D0
Table 4. Matrix for calculating the self-attention "scores" for DFSud. Matrix R, which contains the up-down path information, is multiplied element-wise with Q K^T from the traditional Transformer. In this example, node 25 represents "AttributeLoad", 26 is "NameLoad", 27 is "string", and 28 is "Attr".
Models       Model Type     Problem                  Input                            Prediction
Deep3        Decision Tree  Value pred & type pred   AST                              AST nodes
SrcRNN       RNN            Next token pred          Source code                      Source code tokens
SrcSeq       Transformer    Next token pred          Source code                      Source code tokens
DFS          Transformer    Value pred & type pred   AST                              AST nodes
DFSud        Transformer    Value pred & type pred   AST + path relations             AST nodes
RootPath     Transformer    Value pred               Leaf nodes + leaf-to-root paths  Leaf nodes
LeafTokens   Transformer    Value pred               Leaf nodes                       Leaf nodes
DFSud+       Transformer    Value pred & type pred   AST + path relations             AST nodes
Table 5. Overview of the models presented in this paper. The first two are models from previous work using an RNN and a Decision Tree; the remainder are models of our own creation that use a Transformer (the last three are exploratory variations). The models differ in the type of prediction task, and in what the model inputs and predicts.

3.5. Variations of Models

In this section, we discuss some alternate models and variations of models we have explored.


RootPath is an AST-based model that feeds tree structure information to the model in a different way than DFS does. RootPath first creates a sequence based on the leaf nodes of the AST. To expose tree structure to the Transformer, it fortifies each leaf node with the path from the leaf node to the root of the AST, obtained by traversing up its ancestors; we call such a path a root-path. For Fig 2, for node 29, the root-paths would be:

([“NameLoad”, “Call”, … “Module”], “map”),
([“NameLoad”, “AttributeLoad”, “Call”, …, “Module”], “string”),
([“Attr”, “AttributeLoad”, “Call”, …, ”Module”], ?)

The root-paths are first fed into a sequence encoder (such as an LSTM), coupled with the leaf node, and the result is fed through the Transformer:

Y = Transformer(E_leaf, E_rootpath)

where Y is the output of the Transformer to be used for prediction, E_leaf represents the embedding of the leaf nodes, and E_rootpath is the embedding of all the root-paths. Since RootPath predicts only leaf nodes, it does only value prediction.
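Extracting a root-path can be sketched as a walk up the ancestor chain (the parent and node_type maps below are hypothetical stand-ins for the AST data structures):

```python
# Sketch of extracting a leaf's root-path: the chain of internal node
# types from the leaf's parent up to the root.
def root_path(leaf, parent, node_type):
    path = []
    node = parent.get(leaf)
    while node is not None:
        path.append(node_type[node])
        node = parent.get(node)
    return path

# Toy fragment: string(27) under NameLoad(26) -> AttributeLoad(25) -> Call(1) -> Module(0)
parent = {27: 26, 26: 25, 25: 1, 1: 0}
node_type = {26: "NameLoad", 25: "AttributeLoad", 1: "Call", 0: "Module"}
root_path(27, parent, node_type)
# -> ['NameLoad', 'AttributeLoad', 'Call', 'Module'], paired with leaf "string"
```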


LeafTokens is a lightweight variation of RootPath, where only the leaf nodes are used. For Fig 2, for node 29, the input sequence would be: […, "map", "string"], and the model would predict "atoi". LeafTokens feeds the leaf nodes of the AST into a Transformer:

Y = Transformer(E_leaf)

where Y is the output of the Transformer to be used for prediction, and E_leaf represents the embedding of the leaf nodes. We compare this model with RootPath to determine the importance of root-path information in next token prediction.


DFSud+ is a variation of DFSud that uses a richer vocabulary for the up-down paths, including some child-index information, as this provides extra information about the tree structure. While DFSud uses only U^u and D^d to describe the relation between two nodes in the AST, DFSud+ expands each down move into three sub-words, D_first, D_last, and D_mid; these describe whether the node is the first child, the last child, or somewhere in between, respectively. For example, in Table 4, the relation for nodes 27 and 28 would expand its down move into the corresponding child-position variant. We chose this minimal extension to limit the possible exponential growth in path vocabulary size; even with this minor extension, our path vocabulary increases from 250 to 100k to cover more than 90% of the vocab (with a long right tail). The rest of the model is the same as DFSud, as described in Sec 3.4. We compare this model with DFSud to examine whether adding in more information (at the expense of enlarging the model) improves MRR.

A high-level overview of the models is presented in Table 5. The next section will cover two previous models from literature.

4. Background on Previous Work

In this section, we recap two different methods for code prediction, representative of recent previous work, against which we compare our work. These are (1) a method based on language models that uses a sequence of source code tokens, and (2) a method based on decision trees (Raychev et al., 2016) that works on ASTs.

4.1. Language Model based prediction

A language model computes the probability of the next word w_t, given some window of preceding words: P(w_t | w_1, …, w_{t−1}). Here we use an RNN to compute a language model; n-grams would be another choice. (The jury seems to be out on which one is necessarily better for the task (Hellendoorn and Devanbu, 2017a; Karampatsis et al., 2020).)

Figure 4. Schematic of an RNN. Here e_t, h_t, and y_t are vectors; W, U, and V are matrices.

Fig 4 shows a Recurrent Neural Network (RNN) operating on some of the tokens from the example in Fig 1. As the name suggests, RNNs consume input tokens recurrently, one per time step, and produce output tokens one per time step as well. The bottom layer of the RNN embeds input tokens into a vector: e_t = emb(x_t), where x_t is the source token seen at the t'th time step. The hidden state is computed as h_t = f(W e_t + U h_{t−1}), using both e_t and the hidden state h_{t−1} from the previous time step. The output y_t is a vector of probabilities of various tokens, computed by using softmax over V h_t; the diagram shows the top-ranked predictions or the ground truth. W, U, and V are the parameters of the network, to be learned during training.

The pertinent point to note is that the hidden state encodes the knowledge of not just the current token, but of last several previous tokens via the propagation of information in previous hidden states. Thus, RNNs implicitly compute a language model over tokens.
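The recurrence can be sketched in NumPy (random stand-in weights, a plain tanh cell, and a toy vocabulary; our actual implementation uses trained LSTM cells):

```python
# Minimal NumPy sketch of an RNN language-model step: embed the token,
# update the hidden state, and produce a next-token distribution.
import numpy as np

rng = np.random.default_rng(0)
vocab, d, h = 10, 8, 16
Emb = rng.normal(size=(vocab, d))     # token embedding table
W = rng.normal(size=(h, d))           # input-to-hidden weights
U = rng.normal(size=(h, h))           # hidden-to-hidden weights
V = rng.normal(size=(vocab, h))       # hidden-to-output weights

def rnn_step(token_id, h_prev):
    e_t = Emb[token_id]
    h_t = np.tanh(W @ e_t + U @ h_prev)       # hidden state update
    logits = V @ h_t
    probs = np.exp(logits - logits.max())
    return h_t, probs / probs.sum()           # next-token distribution

h_t = np.zeros(h)
for tok in [3, 1, 4, 1]:                      # toy token ids
    h_t, probs = rnn_step(tok, h_t)
```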

A limitation of RNNs is the difficulty they have in tracking long-range dependence, even with various proposals to mitigate the problem (e.g. long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997), which we do use in our implementation; attention on top of RNNs (Iyer et al., 2016); and skip-connections between sequence locations (Vinyals et al., 2015)).

In our experiments, we feed the source code tokens into an RNN and call this model SrcRNN .

4.2. Decision Tree based prediction

Raychev et al. (Raychev et al., 2016) presented a system, Deep3, based on a learned decision tree combined with count-based probabilities at the leaves of the decision tree. We provide only a sketch here, highlighting how they use paths on an AST.

switch (Up WriteValue) {
  case Attr: switch (Up Up WriteValue) {
     case AttributeLoad:
        switch (Up Up DownFirst WriteValue) {
           case NameLoad:
              Up PrevDFS WriteValue
           default: ...
Figure 5. Fragment of a TGEN program encoding a decision tree. The bold words are steps that comprise a path in a given AST.

Fig 5 shows part of a learned decision tree, written in the form of a program in a specialized language they call TGEN. Given an AST A and a starting node n, a TGEN program walks certain paths in A starting from n. For example, Up WriteValue (line 1) goes to the parent of n and records the label. If the label is Attr, it walks a different path (line 2) in the vicinity of n. The branch outcomes and observations collected by running this TGEN program on A form a context, which is then used to look up a probability distribution conditioned on that context. For the AST in Fig 2, starting with node 29, the TGEN program will produce a context for which the probabilities of different tokens for node 29 might be: [atoi: 40%, length: 20%, …]. The flexibility of focusing on arbitrary paths in the AST allows the model to condition selectively on nodes farther back in the AST.

A TGEN program is learned—on a specific corpus—by a genetic search procedure that simultaneously selects paths and grows the decision tree from the training data, with an entropy minimization objective. The details are not important for this paper; in this paper, we use their pretrained model (32) as well as their Python dataset (1) for our experiments.

The reader will notice that the notion of UD paths in Section 3.4 is akin to the AST paths expressed in TGEN programs. The paths in TGEN are more general, but at a high level, the idea that certain "spatial" relations between nodes are important is common to both approaches. This, along with the competitive quality of results of the Deep3 model in Table 1, makes it an interesting comparison. We explore this similarity further in Appendix B.2.

5. Implementation and Datasets

5.1. Dataset

We train our models using the py150 dataset (1) used in Raychev et al. (2016). The dataset consists of 150k Python 2 source code files from GitHub repositories, along with their parsed ASTs, split into 100k for training and 50k for evaluation. From the ASTs extracted from the py150 dataset, we modify the ASTs to ensure that the internal nodes only have types and the leaf nodes only have values. For implementation details, please refer to Appendix A.1. To incorporate large trees (greater than 1000 nodes), we deploy a technique adopted from (Al-Rfou et al., 2018), which slices a large tree into shorter segments with a sliding window to maintain part of the previous context. For implementation details, please refer to Appendix A.2.
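The sliding-window slicing can be sketched as follows (the segment length and overlap below are illustrative parameters, not necessarily the ones we used):

```python
# Sketch of sliding-window slicing for long node sequences: sequences longer
# than `max_len` are cut into segments, each keeping `context` nodes of
# overlap with the previous segment so some preceding context is retained.
def slice_sequence(nodes, max_len=1000, context=500):
    if len(nodes) <= max_len:
        return [nodes]
    segments = [nodes[:max_len]]
    step = max_len - context
    for start in range(step, len(nodes) - context, step):
        segments.append(nodes[start:start + max_len])
    return segments

segs = slice_sequence(list(range(2300)), max_len=1000, context=500)
# Segments cover [0:1000], [500:1500], [1000:2000], [1500:2300].
```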

We evaluate our models on two evaluation datasets:

  • py150: We use the evaluation dataset used in Raychev et al. (2016), which consists of 50k Python ASTs. We perform the two modifications listed above before feeding them into our models; after the modifications, there are 16,003,628 leaf nodes and 30,417,894 internal nodes.

  • internal: We also created an evaluation dataset consisting of 5000 Python files from a code repository internal to Facebook. With this dataset, we can evaluate how our trained model can generalize to a different dataset, even if the code comes from disjoint projects. After the modifications, there are 1,669,085 leaf nodes and 3,067,147 internal nodes.

Recent works (Karampatsis et al., 2020; Hellendoorn and Devanbu, 2017a) have divided evaluations into static and dynamic, where in the dynamic evaluations, the model continues to update its parameters during evaluation. This may increase accuracy by having the model adapt to the characteristics of the evaluation dataset. In our experiments, we choose to evaluate statically, while recognizing that evaluating dynamically may improve accuracy.

5.2. Implementation


For the models that use Transformers (RootPath, DFS, SrcSeq, DFSud), we adapt the PyTorch implementation of GPT-2 small (Radford et al., 2019). We do not use positional encoding; refer to Appendix A.3 for the explanation. We use six Transformer blocks, six heads in each block, and a fixed embedding dimension; we borrow the remaining hyperparameters from Radford et al. (2019). We limit the token vocabulary size to 100k, which covers over 90% of the tokens used in the training dataset. For DFSud, we limit the path-relation vocabulary to 250, which covers over 95% of the path relations. For RootPath, we limit the maximum length of the path from leaf node to root to 13, which covers over 90% of the nodes. For any path longer than 13, we keep the nodes closest to the leaf, and truncate the nodes near the root.
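The root-path truncation described above can be sketched as follows; `truncate_root_path` is a hypothetical helper name, assuming a path represented as a list of nodes ordered from leaf to root.

```python
def truncate_root_path(path, max_len=13):
    """Truncate a leaf-to-root path to at most max_len nodes.

    `path` is ordered from the leaf (index 0) up to the root. When the
    path is too long, keep the nodes closest to the leaf and drop the
    nodes nearest the root, as described in Section 5.2.
    """
    return path[:max_len]

# Example: a path of length 15 keeps its 13 leaf-side nodes.
long_path = [f"node{i}" for i in range(15)]  # node0 is the leaf
kept = truncate_root_path(long_path)
assert kept[0] == "node0" and len(kept) == 13
```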


For the SrcRNN model, we adapt the PyTorch example implementation 888 of a word-level LSTM language model. We limit the token vocabulary size to 100k, which covers over 90% of the tokens.


For the Deep3 model, since the authors have shared only the model and not the training algorithm, we used the model pretrained on py150.

We trained all models (except Deep3) on Nvidia Tesla V100 GPUs (four at a time) until the loss converged, with all parameters randomly initialized. We used the Adam optimizer with the learning rate set to 1e-3. For convergence, DFS took 11 epochs, DFSud took 21 epochs, SrcSeq took 9 epochs, and SrcRNN took 9 epochs (each epoch took around 45 minutes to 1 hour).

5.3. Evaluation Task

We evaluate the models on the code prediction tasks that we defined in Sec 2: next token prediction, which pertains to source code tokens taken as a linear sequence; value prediction, which pertains to predicting leaf nodes of the AST; and type prediction, which pertains to predicting internal nodes of the AST.

To measure performance on these tasks, we use mean reciprocal rank (MRR), defined as

MRR = (1/N) · Σ_{i=1}^{N} 1/rank_i

where N is the number of prediction locations and rank_i is the rank of the correct label given by the model for the i-th data point. We present MRR as a percentage, in keeping with prior work (Karampatsis et al., 2020; Hellendoorn and Devanbu, 2017a).

While Acc@1 gives a score only when the correct label is ranked at the top, MRR also gives partial credit when the true label is not at the top but is among the top few predictions. Compared to the hit-or-miss metric (Acc@1), this is closer to the realistic scenario in which completion suggestions are presented to developers. With this practical perspective in mind, and for ease of computation, we only consider ranks within a small cutoff for each location (all predictions beyond the cutoff score 0).
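Concretely, MRR with a rank cutoff can be computed as below. This is a minimal sketch; the cutoff value shown is illustrative, not a value stated in the paper.

```python
def mrr(ranks, cutoff=10):
    """Mean reciprocal rank as a percentage.

    `ranks` holds, for each prediction location, the 1-based rank of
    the correct label; ranks beyond `cutoff` (or None for a miss)
    score 0. The cutoff value here is illustrative.
    """
    total = 0.0
    for r in ranks:
        if r is not None and r <= cutoff:
            total += 1.0 / r
    return 100.0 * total / len(ranks)

# Two exact hits, one rank-2 hit, one miss: (1 + 1 + 0.5 + 0) / 4 = 62.5
assert mrr([1, 1, 2, None]) == 62.5
```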

We share our data processing scripts and model implementations at

6. Evaluation

6.1. Research Questions

At a high level, we want to answer the following research questions.

  1. RQ1: Overall, do Transformer-based models provide better accuracy compared to prior state-of-the-art methods of code prediction?

  2. RQ2: Does the syntactic structure of code help get better accuracy out of Transformers, and if so, by how much?

  3. RQ3: What did the Transformer model variants learn from the code? Did they learn the right things? What can we learn from the learned models?

We describe the experiments to answer the research questions RQ1 and RQ2. We discuss the evaluation of RQ3 in Section 6.4.

For RQ1, recall that prior work (Section 2) works on two different kinds of inputs: all source tokens as in program text, and ASTs of each program unit. To carry out a direct comparison against prior work, we split RQ1 into two specific questions:

  1. RQ1.1: Is the Transformer-based model more accurate than the RNN-based model on the next token prediction problem (Sec 2)?

    To answer this question, we compare the SrcRNN model against the SrcSeq model on the source tokens.

  2. RQ1.2: Are the Transformer-based models more accurate than Deep3 on the value prediction and type prediction problems (Sec 2)?

    To answer this question, we compare the Deep3 model against the AST-based Transformer variants: DFS, DFSud+, DFSud, and RootPath.

For RQ2, we ask two sub-questions:

  1. RQ2.1: Does a Transformer model based on an AST outperform a Transformer model that takes the corresponding source token sequences?

    This question can be answered directly only on tokens that appear both in ASTs and in source token sequences: these are precisely the values at the leaf nodes of the AST. We compare the SrcSeq and DFS models on the terminal value prediction problem.

  2. RQ2.2: Does providing more detailed structural information help with accuracy?

    To answer this question, we compare among the tree-based Transformer models (DFS, DFSud+, DFSud, and RootPath) on the terminal value prediction and internal/type prediction problems.

6.2. Results

Prior work Our work
Applications SrcRNN Deep3 SrcSeq DFS DFSud
Next token prediction 65.7 (58.0) n/a 74.1 (68.1) n/a n/a
Value prediction 36.4 (29.1) 43.9 (40.5) 50.1 (43.4) 58.0 (52.4) 73.6 (71.0)
Type prediction n/a 81.9 (75.8) n/a 89.3 (82.7) 98.7 (97.6)
Table 6. MRR and Acc@1 (in parenthesis) of various prediction tasks for py150.
Prior work Our work
Applications SrcRNN Deep3 SrcSeq DFS DFSud
Next token prediction 57.4 (48.3) n/a 66.8 (60.2) n/a n/a
Value prediction 23.8 (17.7) 36.1 (33.3) 36.5 (30.7) 43.9 (38.8) 58.4 (55.3)
Type prediction n/a 79.9 (73.1) n/a 87.7 (80.2) 98.0 (96.3)
Table 7. MRR and Acc@1 (in parenthesis) of various prediction tasks for the internal dataset.
Applications DFSud RootPath LeafTokens DFSud+
Value prediction 73.6 (71.0) 55.1 (48.4) 41.9 (34.1) 73.3 (70.8)
Type prediction 98.7 (97.6) n/a n/a 97.8 (96.1)
Table 8. MRR and Acc@1 (in parenthesis) of the alternate models and variations of models for py150, compared against the best performing model, DFSud .

Our main evaluation results are reported in Table 6 and Table 7.


For RQ1.1, see the SrcRNN and SrcSeq columns in Table 6 and Table 7. For the py150 dataset, we see a significant improvement in MRR, from 65.7% for SrcRNN to 74.1% for SrcSeq. The same holds on the internal dataset: 57.4% vs 66.8%. (Tables 9 and 11 in Appendix B.1 break down the data for different kinds of next token predictions.) Not surprisingly, Table 6 also shows that predicting identifier and constant tokens (as in value prediction) is more challenging than predicting keyword and punctuation tokens, which form almost 2/3 of all source tokens.

For RQ1.2, we compare the Deep3 model against the DFS and DFSud models. Overall, we found that all the Transformer models (SrcSeq, DFS, DFSud) achieve higher scores than Deep3. Table 6 shows that DFSud achieves the best MRR of 73.6 for leaf node prediction, compared with Deep3’s MRR of 43.9. Similar results can be seen for the internal dataset, as shown in Table 7.


To answer RQ2.1, we compare the value prediction results for SrcSeq against the AST-based models (DFS, DFSud). Table 6 shows that DFS outperforms SrcSeq by 7.9 points, and DFSud significantly outperforms SrcSeq by 23.5 points (73.6% vs 50.1%). These results demonstrate that representing source code as an AST, rather than as linearized source code, provides better results for next value prediction.

For RQ2.2, we compare the results among the AST-based models. First, comparing DFS and DFSud: DFSud provides more detailed structural information, and Table 6 shows significant improvements to accuracy, with 15.6 points higher MRR for value prediction and 9.4 points higher MRR for type prediction than DFS. Similar trends can be seen for the internal dataset in Table 7.

Table 8 shows a significant drop in accuracy from RootPath to LeafTokens (55.1% vs 41.9% for all leaf nodes). This shows that the information captured by the leaf-to-root paths (both their values and their tree structural information) gives a solid boost to accuracy. These results demonstrate that feeding the model more structural information does improve results.

Next, we compare RootPath and DFS. These models are similar in that both take all of the AST nodes as context, but they differ in how they digest that context: RootPath first aggregates the context information for each leaf node before predicting the next leaf node, while DFS captures both leaf and internal nodes in one context. Results show that the performance of the two models is comparable (58.0% vs 55.1% for value prediction in Tables 6 and 8). One drawback of RootPath is that it can only predict leaf nodes, while DFS can predict all nodes in the AST, including internal nodes for type prediction.

Table 8 shows that DFSud+ did not outperform DFSud, which suggests that simply expanding the up-down vocabulary may not be the right way to expose child index information to the model. Areas of exploration include whether a vocabulary size of 100k is too sparse for the models to learn effectively, and whether child indices are inherently not as crucial for code prediction.

6.3. Threats to Validity

SrcRNN Implementation

Our SrcRNN implementation is based on a PyTorch implementation, whereas related papers have generally built off of a TensorFlow implementation. As our hyperparameters were similar to those in recent publications, we expect our implementation to be comparable.


We have not integrated byte-pair encoding (BPE) (Karampatsis et al., 2020) into our RNN model. We expect BPE to benefit both RNN and transformer models, and plan to explore this in future work.

Training Corpus

While larger Python corpora have appeared, py150 is still sizable at ~500MB; we do not expect the larger corpora to reverse our findings.

Python specificity

We have only carried out evaluations on Python, and have not demonstrated that our results would carry over (in trends) to other languages. The Deep3 paper did find that their results roughly carry over (in trends) from Python to JavaScript.

6.4. Model Inspection

(a) DFS
(b) DFSud
Figure 6. Influence of previous nodes in value prediction of the example in Fig 2 by DFS and DFSud. The x-axis is labeled with the input values; the y-axis is labeled with the values to be predicted. Color indicates whether the model’s prediction is correct or wrong.

In this part, we study the influence of each input feature to shed light on the black box of how our models make their predictions. In particular, we study how each input token contributes to the models’ predictions (attribution analysis, this section) and which path relations are learned to be important by DFSud (Appendix B.2). For the latter, we found that local syntactic context is generally important, and that similarities exist with the heavily utilized Deep3 TGEN paths.

We use saliency maps (Simonyan et al., 2014) for attribution analysis; these are constructed by taking the partial derivative of the loss function with respect to the inputs. Fig 6 visualizes the magnitude of the gradient at each input token when the model predicts a particular output. Intuitively, the larger the value for a particular token, the more sensitive the output is to variations at that token.
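To make the construction concrete, the sketch below computes the analytic gradient of a cross-entropy loss with respect to a toy input embedding for a single linear layer; the model shape and the names `softmax` and `saliency` are illustrative assumptions, not the paper's Transformer.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def saliency(W, x, y):
    """|d loss / d x| for loss = -log softmax(W x)[y].

    W is a list of rows (one per output class), x is the input
    embedding, y is the index of the true class. For this linear model
    the gradient is W^T (softmax(Wx) - onehot(y)); a saliency map is
    its elementwise magnitude.
    """
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]
    p = softmax(logits)
    delta = [p_j - (1.0 if j == y else 0.0) for j, p_j in enumerate(p)]
    grad = [sum(delta[j] * W[j][i] for j in range(len(W)))
            for i in range(len(x))]
    return [abs(g) for g in grad]

# With an identity weight matrix, the saliency of each input dimension
# directly reflects the softmax error on the corresponding class.
s = saliency([[1.0, 0.0], [0.0, 1.0]], [2.0, -1.0], y=0)
assert len(s) == 2 and all(v >= 0 for v in s)
```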

Examining the saliency maps for DFS and DFSud, we first observe that the parent node in the AST (the internal node right above the leaf) is generally important for both models. From Fig 6(b), we can see that DFSud is influenced by string when predicting atoi and by request_size when predicting num_requests. It is not shown in the figure, but when predicting 2, DFSud is influenced by the previous occurrences of sys.argv indexed by 0 and 1. Looking at the differences between Fig 6(a) and Fig 6(b), we found that DFSud is influenced by ip while correctly predicting gethostbyname, whereas DFS is not and predicts it incorrectly. Generally, we found that DFSud attributes more weight to terminal values relevant to the value being predicted, while DFS attributes little to values other than non-terminals. This provides evidence that DFSud is more likely to have learned the right features for next value prediction.

On an orthogonal note, we also observe that for many prediction locations, the magnitude of the gradients is very small, suggesting robustness of the model in the sense that it is less sensitive to minor perturbations of the input sequence.

7. Related Work

Due to the vastness of the topic, we focus on two themes of related work.

7.1. Statistical Code Completion

Simply put, the task of code completion is to predict the rest of the code a user is typing. Code completion is widely used by commercial and free integrated development environments (IDEs) to accelerate or ease the process of developing software.

Since Hindle et al. (2016), there has been a rise of statistical learning for the task of code completion, exploiting the naturalness of code (Allamanis et al., 2018a). The learning methods employed have progressed from n-gram models (Nguyen et al., 2013; Hindle et al., 2016) to probabilistic grammars (Allamanis and Sutton, 2014; Bielik et al., 2016a) and decision trees (Raychev et al., 2016). Recently there has been increasing application of deep learning to code completion, especially recurrent neural networks (Liu et al., 2016; Li et al., 2018; Liu et al., 2020) and graph neural networks (Allamanis et al., 2018b; Brockschmidt et al., 2019; Yang and Xiang, 2019).

Among other flavors of code completion, such as those where the program after the prediction location is available (Raychev et al., 2014; Allamanis et al., 2018b; Brockschmidt et al., 2019; Alon et al., 2020), or where the granularity of prediction is smaller (e.g., characters (Bielik et al., 2016b) or subtokens (Karampatsis et al., 2020)) or larger (e.g., sub-ASTs (Alon et al., 2020)), we focus on predicting the next token given only the partial program up to the prediction location.

PHOG (Bielik et al., 2016a), DeepSyn (Raychev et al., 2016) and Deep3 (Raychev et al., 2016) are particularly related, as all of them utilize AST information for code completion. PHOG and DeepSyn use a conditional probabilistic context-aware grammar based on AST walks. Deep3 further enriches the probabilistic model with a decision tree to allow more fine-grained modeling of context-dependent code occurrences.

However, these probabilistic models have been surpassed by deep neural networks, namely LSTMs over serialized ASTs (Liu et al., 2016). Accuracy can be further improved by stacking attention and pointer-network over an LSTM (Li et al., 2018) or by augmenting LSTMs with stacks for which the operations are guided by the AST structure (Liu et al., 2020).

7.2. Transformers

Transformers, popularized by Vaswani et al. (2017), are sequence-to-sequence (seq2seq) neural networks based on layers of multi-head self-attention. Surpassing RNNs, Transformer models (Devlin et al., 2018; Dong et al., 2019; Radford et al., 2019) have become the state-of-the-art natural language models, breaking records on a range of NLP tasks, including sentence entailment, question answering, and language modeling. See Sec 3 for a more thorough introduction to Transformers.

There have been reported applications of Transformer models to code completion. Galois is an open source project that uses GPT-2 (Radford et al., 2019) for code completion. The approach is similar to our SrcSeq model, despite its use of a non-standard tokenizer and a subtoken segmenter. TabNine™ published a blog post (TabNine, 2019) in July 2019 mentioning the use of GPT-2 in their code completion, but revealed no technical detail. To date, we have found no formal investigation of using Transformers for the task of code completion.

There has been a surge of interest since 2019 in extending Transformer models to handle structures beyond sequences, both for NLP (Wang et al., 2019; Ahmed et al., 2019; Nguyen et al., 2020) and for learning source code (Harer et al., 2019; Shiv and Quirk, 2019). Wang et al. (2019) put constraints on self-attention to induce tree structures. Ahmed et al. (2019), Harer et al. (2019) and Nguyen et al. (2020) modify the attention block to mix node representations according to tree structures. Shiv and Quirk (2019) proposed a tree-induced positional encoding. As for learning source code, it has been shown that exploiting tree structure helped code correction (Harer et al., 2019) and code translation (Shiv and Quirk, 2019).

8. Future Work

Handling Out-of-Vocabulary Words

Source code presents a difficulty shared with natural language processing: handling large vocabularies and rare words. The token/word to be predicted in test data may not appear in the training data. This is even more challenging when predicting identifiers, such as method names and variable names, as developers can come up with arbitrary identifier names. Possible mitigations include copying mechanisms (Allamanis et al., 2016; Brockschmidt et al., 2019; Fernandes et al., 2019) and open-vocabulary models (Cvitkovic et al., 2019; Karampatsis et al., 2020).

Exposing Tree Structure even more completely

We saw significant improvement in performance by providing more tree structure (DFS vs DFSud ). Our attempt at DFSud+ , a variation to DFSud that enhances the path relation vocabulary, did not improve performance. This leaves open the possibility that our way of representing AST paths needs to be improved.

Using Semantic Information

Recent work has also shown the promise of using easy-to-compute static analysis information, such as def-use information. While it is harder to obtain such information for dynamic languages, it remains an interesting question how to communicate it to Transformers, and how the result compares to graph neural networks (Li et al., 2016; Allamanis et al., 2018b) that do use it.


  • [1] (2016) 150k python dataset. External Links: Link Cited by: §1.1, Figure 1, §4.2, §5.1.
  • M. Ahmed, M. R. Samee, and R. E. Mercer (2019) You only need attention to traverse trees. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 316–322. External Links: Link Cited by: §7.2.
  • R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones (2018) Character-level language modeling with deeper self-attention. External Links: 1808.04444 Cited by: §A.2, §5.1.
  • M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton (2018a) A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51 (4), pp. 1–37. Cited by: §1, §1, §7.1.
  • M. Allamanis, M. Brockschmidt, and M. Khademi (2018b) Learning to represent programs with graphs. In International Conference on Learning Representations, External Links: Link Cited by: §7.1, §7.1, §8.
  • M. Allamanis, H. Peng, and C. Sutton (2016) A convolutional attention network for extreme summarization of source code. In Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of Machine Learning Research, Vol. 48, New York, New York, USA, pp. 2091–2100. External Links: Link Cited by: §8.
  • M. Allamanis and C. Sutton (2014) Mining idioms from source code. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 472–483. Cited by: §7.1.
  • U. Alon, R. Sadaka, O. Levy, and E. Yahav (2020) Structural language models for any-code generation. External Links: Link Cited by: §7.1.
  • U. Alon, M. Zilberstein, O. Levy, and E. Yahav (2019) Code2vec: learning distributed representations of code. Proceedings of the ACM on Programming Languages 3 (POPL), pp. 1–29. External Links: Link Cited by: footnote 4.
  • P. Bielik, V. Raychev, and M. Vechev (2016a) PHOG: probabilistic model for code. In International Conference on Machine Learning, pp. 2933–2942. Cited by: §1, §7.1, §7.1.
  • P. Bielik, V. Raychev, and M. Vechev (2016b) Program synthesis for character level language modeling. External Links: Link Cited by: §7.1.
  • M. Brockschmidt, M. Allamanis, A. L. Gaunt, and O. Polozov (2019) Generative code modeling with graphs. In International Conference on Learning Representations, External Links: Link Cited by: §1, §7.1, §7.1, §8.
  • M. Cvitkovic, B. Singh, and A. Anandkumar (2019) Open vocabulary learning on source code with a graph-structured cache. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 1475–1485. External Links: Link Cited by: §8.
  • Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov (2019) Transformer-XL: attentive language models beyond a fixed-length context. Florence, Italy, pp. 2978–2988. External Links: Link, Document Cited by: §A.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3, §7.2.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. pp. 13042–13054. External Links: Link Cited by: §3, §7.2.
  • P. Fernandes, M. Allamanis, and M. Brockschmidt (2019) Structured neural summarization. External Links: Link Cited by: §8.
  • J. Harer, C. Reale, and P. Chin (2019) Tree-Transformer: a transformer-based method for correction of tree-structured data. arXiv preprint arXiv:1908.00449. External Links: Link Cited by: §7.2.
  • V. J. Hellendoorn and P. T. Devanbu (2017a) Are deep neural networks the best choice for modeling source code?. Cited by: 1st item, §5.1, §5.3, footnote 6.
  • V. J. Hellendoorn and P. Devanbu (2017b) Are deep neural networks the best choice for modeling source code?. pp. 763–773. Cited by: §1.
  • A. Hindle, E. T. Barr, M. Gabel, Z. Su, and P. Devanbu (2016) On the naturalness of software. Communications of the ACM 59 (5), pp. 122–131. Cited by: §1, §1, §7.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.1.
  • S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer (2016) Summarizing source code using a neural attention model. Berlin, Germany, pp. 2073–2083. External Links: Link, Document Cited by: §4.1.
  • R. Karampatsis, H. Babii, R. Robbes, C. Sutton, and A. Janes (2020) Big code != big vocabulary: open-vocabulary models for source code. Cited by: 1st item, §5.1, §5.3, §6.3, §7.1, §8, footnote 6.
  • J. Li, Y. Wang, M. R. Lyu, and I. King (2018) Code completion with neural attention and pointer networks. pp. 4159–25. External Links: ISBN 9780999241127 Cited by: §1, §1, §7.1, §7.1.
  • Y. Li, R. Zemel, M. Brockschmidt, and D. Tarlow (2016) Gated graph sequence neural networks. Proceedings of ICLR’16 edition. External Links: Link Cited by: §8.
  • C. Liu, X. Wang, R. Shin, J. E. Gonzalez, and D. Song (2016) Neural code completion. External Links: Link Cited by: §7.1, §7.1.
  • F. Liu, L. Zhang, and Z. Jin (2020) Modeling programs hierarchically with stack-augmented lstm. Journal of Systems and Software, pp. 110547. External Links: Link Cited by: §1, §7.1, §7.1.
  • T. T. Nguyen, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen (2013) A statistical semantic language model for source code. New York, NY, USA, pp. 532–542. External Links: ISBN 9781450322379, Link, Document Cited by: §7.1.
  • X. Nguyen, S. Joty, S. Hoi, and R. Socher (2020) Tree-structured attention with hierarchical accumulation. External Links: Link Cited by: §7.2.
  • M. R. Parvez, S. Chakraborty, B. Ray, and K. Chang (2018) Building language models for text with named entities. Melbourne, Australia, pp. 2373–2383. External Links: Link, Document Cited by: §1.
  • [32] (2017) Pretrained probabilistic models for code. External Links: Link Cited by: §4.2.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. External Links: Link Cited by: §A.3, §3.1, §3, §5.2, §7.2, §7.2.
  • V. Raychev, P. Bielik, M. Vechev, and A. Krause (2016) Learning programs from noisy data. SIGPLAN Not. 51 (1), pp. 761–774. External Links: ISSN 0362-1340, Link, Document Cited by: §1, §7.1.
  • V. Raychev, P. Bielik, and M. Vechev (2016) Probabilistic model for code with decision trees. New York, NY, USA, pp. 731–747. External Links: ISBN 9781450344449, Link, Document Cited by: 1st item, §1, §1, §4.2, §4, 1st item, §5.1, §7.1, §7.1.
  • V. Raychev, M. Vechev, and E. Yahav (2014) Code completion with statistical language models. New York, NY, USA, pp. 419–428. External Links: ISBN 9781450327848, Link, Document Cited by: §7.1.
  • V. Shiv and C. Quirk (2019) Novel positional encodings to enable tree-based transformers. pp. 12058–12068. External Links: Link Cited by: §A.3, §7.2.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Deep inside convolutional networks: visualising image classification models and saliency maps. External Links: Link Cited by: 2nd item, §6.4.
  • TabNine (2019) Autocompletion with deep learning. TabNine. External Links: Link Cited by: §7.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. pp. 5998–6008. Cited by: §A.3, §1, §3.1, §3, §7.2.
  • O. Vinyals, M. Fortunato, and N. Jaitly (2015) Pointer networks. External Links: 1506.03134 Cited by: §4.1.
  • Y. Wang, H. Lee, and Y. Chen (2019) Tree transformer: integrating tree structures into self-attention. pp. 1060–1070. External Links: Link Cited by: §7.2.
  • Y. Yang and C. Xiang (2019) Improve language modelling for code completion through learning general token repetition of source code. pp. 667–777. Cited by: §7.1.

Appendix A Implementation Details

a.1. Modifying the AST

For the AST, we want the internal nodes to have only type information and the leaf nodes to have only value information. This way, our model predicts a single piece of information per node (instead of both type and value). However, in the py150 dataset, there are internal and leaf nodes with both type and value information. To accommodate this, we slightly modify the trees to fit our definition of ASTs: for nodes with both type and value information, we take the value information and create a new node (now a leaf node) as the node’s first child. Fig 7 illustrates an example of the modification. This increases the average number of nodes in a tree from 623.4 to 951.9.
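The tree rewrite above can be sketched as follows, assuming nodes are dicts with optional `type`, `value`, and `children` fields; this representation and the helper name `split_type_value` are illustrative, not the py150 schema.

```python
def split_type_value(node):
    """If a node carries both a type and a value, move the value into
    a new leaf node inserted as the node's first child (cf. Fig 7).
    Applies the rewrite recursively to the whole tree."""
    children = node.get("children", [])
    for child in children:
        split_type_value(child)
    if "type" in node and "value" in node:
        leaf = {"value": node.pop("value")}
        node["children"] = [leaf] + children

# A node with both a type and a value gets a new first-child leaf.
tree = {"type": "Call", "value": "print", "children": [{"value": "x"}]}
split_type_value(tree)
assert "value" not in tree
assert tree["children"][0] == {"value": "print"}
```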

a.2. Splitting Large Trees

For neural network models, we need to set a maximum number of tree nodes that the model can take as input. Ideally, we would set this maximum high enough to take in any tree; in practice, this is infeasible due to memory constraints (and, hypothetically, the number of nodes can be arbitrarily large). We choose the maximum context length (number of nodes) to be 1000, inspired by the maximum context length of GPT-2 models, and because it covers > 70% of the training data. For trees with more than 1000 nodes, we deploy a technique adopted from Al-Rfou et al. (2018): given a large tree, we slice it into shorter segments with a sliding window (in our implementation, we used a stride of 500, half the context length). For example, if a tree has 1700 nodes, we would have three new shorter trees: nodes 0-999, nodes 500-1499, and nodes 700-1699. For the last two trees, we take the loss and evaluate only on the nodes that the model has not seen before (1000-1499 and 1500-1699, respectively). In this way, we provide each subsequent shorter segment with some previous context, while increasing the number of training and testing data points by a reasonable amount (in our datasets, it roughly doubled the number). An improvement to this sliding window technique would be to maintain the hidden states at each segment to pass along more context information, as explained in Dai et al. (2019).
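The sliding-window slicing can be sketched as below; `slice_tree` is a hypothetical helper operating on node indices of a serialized tree, a sketch of the technique rather than the authors' exact code.

```python
def slice_tree(n_nodes, context=1000, stride=500):
    """Slice a serialized tree of n_nodes into overlapping segments.

    Each segment spans `context` nodes; consecutive segments overlap
    so part of the previous context is retained. Returns (start, end,
    eval_start) triples, where loss/evaluation applies only to nodes
    in [eval_start, end) not covered by an earlier segment. The final
    window is shifted back so it still spans `context` nodes.
    """
    if n_nodes <= context:
        return [(0, n_nodes, 0)]
    segments = [(0, context, 0)]
    start = stride
    while start + context < n_nodes:
        segments.append((start, start + context, segments[-1][1]))
        start += stride
    # Last window: align to the end of the tree, evaluate the unseen tail.
    segments.append((n_nodes - context, n_nodes, segments[-1][1]))
    return segments

# The 1700-node example from the text: three segments, with evaluation
# restricted to nodes 0-999, 1000-1499, and 1500-1699 respectively.
assert slice_tree(1700) == [(0, 1000, 0), (500, 1500, 1000), (700, 1700, 1500)]
```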

a.3. Why not Positional Encoding?

Some Transformers use positional encoding (Vaswani et al., 2017) or positional embedding (Radford et al., 2019) to provide the model with extra positional information over elements. However, our early trials with LeafSeq suggested that positional embedding hurts rather than helps. Thus, we do not use positional encoding or embedding in any of our models. Recently, Shiv and Quirk (2019) introduced tree structure to Transformer models via positional encoding. However, their relative improvement is small compared to what we see with the tree-relational prior in Section 6.

Figure 7. Example AST and our modification to allow nodes to have either only value or type information.

Appendix B Extra Results

b.1. Extra Evaluation Results

Table 9 and Table 10 show, respectively, the breakdown results for terminal value prediction and non-terminal type prediction at various types of locations over py150.

Table 11 and Table 12 show, respectively, the breakdown results for terminal value prediction and non-terminal type prediction at various types of locations over the internal dataset.

Prior work Our work
Applications SrcRNN Deep3 SrcSeq DFS DFSud
Attribute access 39.3 (31.6) 45.3 (41.7) 55.9 (49.0) 60.5 (54.4) 75.6 (73.3)
Numeric constant 40.6 (29.3) 53.2 (46.4) 55.9 (45.7) 63.5 (53.7) 83.1 (79.0)
Name (variable, module) 38.2 (29.6) 48.9 (45.4) 54.1 (46.5) 66.6 (61.0) 79.8 (77.4)
Function parameter name 57.7 (54.0) 58.1 (56.6) 66.2 (62.8) 67.2 (63.6) 87.1 (84.7)
All values 36.6 (29.1) 43.9 (40.5) 50.1 (43.4) 58.0 (52.4) 73.6 (71.0)
Table 9. MRR and Acc@1 (in parenthesis) of various types of value predictions for py150.
Prior work Our work
Applications Deep3 DFS DFSud
Function call 81.6 (74.2) 88.5 (81.0) 98.7 (97.5)
Assignment 76.5 (66.7) 78.9 (64.3) 98.7 (97.5)
Return 52.8 (40.8) 67.8 (51.8) 97.8 (95.9)
List 59.4 (54.2) 76.0 (65.8) 97.1 (94.7)
Dictionary 66.3 (61.0) 15.0 (9.0) 83.8 (74.3)
Raise 35.0 (27.1) 63.3 (47.6) 97.0 (94.6)
All types 81.9 (75.8) 87.3 (79.6) 98.7 (97.6)
Table 10. MRR and Acc@1 (in parenthesis) of various type predictions for py150.
Prior work Our work
Applications SrcRNN Deep3 SrcSeq DFS DFSud
Attribute access 26.4 (20.9) 38.5 (36.0) 41.0 (35.5) 44.7 (39.9) 59.3 (56.7)
Numeric constant 32.2 (20.3) 46.5 (38.2) 51.7 (40.5) 61.5 (50.4) 84.0 (78.6)
Name (variable, module) 25.0 (17.8) 41.0 (38.2) 39.3 (32.7) 50.7 (45.6) 62.8 (60.1)
Function parameter name 45.5 (42.8) 50.6 (49.0) 54.3 (51.7) 53.3 (49.6) 73.7 (70.7)
All values 23.8 (17.7) 36.1 (33.3) 36.5 (30.7) 43.9 (38.8) 58.4 (55.3)
Table 11. MRR and Acc@1 (in parenthesis) of various types of next token value prediction for internal dataset.
Applications | Prior work: Deep3 | Our work: DFS, DFSud
Function call 78.2 (70.3) 86.0 (77.1) 97.8 (95.9)
Assignment 78.5 (69.1) 79.7 (65.8) 98.7 (97.4)
Return 59.9 (47.8) 72.2 (58.3) 97.6 (95.5)
List 40.8 (33.9) 63.1 (48.7) 94.3 (89.6)
Dictionary 39.8 (31.2) 23.5 (16.7) 81.0 (70.4)
Raise 33.5 (25.8) 59.3 (41.7) 96.4 (93.5)
All types 79.9 (73.1) 87.7 (80.2) 98.0 (96.3)
Table 12. MRR and Acc@1 (in parenthesis) of various types of next token type prediction for internal dataset.

b.2. Inspecting Attention Heads

DFSud learns weights for the various tree-relations between a node and the other nodes in its context as a component of self-attention. In this section, we inspect the learned weights in the DFSud model in order to understand which tree-relations are the most important for the model’s predictions.

There are six attention layers in DFSud, with six attention heads within each layer. Together, they determine the importance of each previous node in the prediction of the next token. We look at the maximally and minimally weighted tree-relations at each attention head. The results are shown in Fig 8. Presumably, the extremely weighted tree-relations are the most salient features for the model’s predictions: the more extreme the weight, the more conspicuous that relation is, relative to the others, for the particular head.
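To make the mechanism concrete, here is a simplified, single-head sketch of how a learned per-relation weight can enter the attention computation: a scalar bias, indexed by the tree-relation between the query and key positions, is added to the content-based logits before the softmax. Names and shapes are illustrative, not the paper's exact formulation.

```python
import math

def relation_biased_attention(scores, relations, rel_bias):
    """Add a learned scalar bias per tree-relation to raw attention scores,
    then softmax over each row (one row per query position).

    scores[i][j]    -- content-based logit between query i and key j
    relations[i][j] -- id of the tree-relation (e.g. an up-down path class)
    rel_bias[r]     -- learned weight for relation id r
    """
    out = []
    for i, row in enumerate(scores):
        logits = [s + rel_bias[relations[i][j]] for j, s in enumerate(row)]
        m = max(logits)                      # stabilize the softmax
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out
```

Inspecting an attention head then amounts to reading off which relation ids receive the largest (or smallest) entries of `rel_bias`.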

For example, we found that several short tree-relations are important across multiple heads: some are particularly up-weighted by certain heads, while others are particularly down-weighted. The frequent presence of short tree-relations suggests the importance of syntactic local context in next-value prediction. The extreme weights assigned to very long paths are at first baffling. However, we found cases where they can be useful, for example, in referring to class names (Fig 9(a)) or to related variable names under similar scopes (Fig 9(b)).

b.3. Comparison to Deep3

As mentioned in Sec 4, Deep3 also relies on values collected by its tree-walk programs (in TGEN; see Fig 5), which are executed over ASTs.

Deep3’s TGEN programs are strictly more expressive than our tree-relations, which are based only on up and down counts. However, for many of the tree walks, we can find corresponding tree-relations that represent the same movement in an AST. For example, the TGEN expression [Up][Up][WRITE_TYPE] corresponds to one of our tree-relations; WRITE is disregarded, as our models naturally have access to the values associated with the destination. We collected the most frequently used TGEN tree-walk expressions when evaluating their model (E13) over the py150 test set. Table 13 lists the most frequent convertible tree-walks and their counts, assuming the node to be predicted is a leaf with a left sibling leaf.
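To make the correspondence concrete, the up-down counts between two nodes can be computed from their root-to-node paths via the lowest common ancestor; a small sketch (helper name hypothetical):

```python
def ud_path(path_a, path_b):
    """Up-down movement between two AST nodes, given their root-to-node
    paths as lists of node ids. Returns (ups, downs): steps from node a
    up to the lowest common ancestor, then down to node b.
    """
    lca = 0
    while lca < len(path_a) and lca < len(path_b) and path_a[lca] == path_b[lca]:
        lca += 1
    return (len(path_a) - lca, len(path_b) - lca)
```

Under this view, a tree walk like [Up][Up] from the predicted node maps to a relation with two up steps and however many down steps are needed to reach the source of the copied value.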

Equivalent Count
Table 13. Top tree-relation-convertible tree-walks used by E13 when predicting values over py150.

We found that some tree-relations are both extremely weighted by many heads in our DFSud and heavily utilized by Deep3. However, some of the potentially useful tree-relations heavily used by Deep3 are not often extremely weighted by DFSud. For example, one such relation, potentially useful for knowing the scope of the value to be predicted, appears only once as a maximal value, in layer 5, head 5 of DFSud (Fig 8(a)).

Figure 8. Maximally (a) or minimally (b) weighted tree-relations and their weights at each attention head in DFSud. Red indicates more extreme values.
Figure 9. Two code excerpts from the py150 evaluation set ((b) is from data/Miserlou/OpenWatch/openwatch/map/). Highlighted tokens are picked by some long tree-relation in the prediction of the underlined tokens.