This repo will contain the replication package for the paper "Feeding Trees to Transformers for Code Completion".
In this paper, we describe how to leverage Transformer, a recent neural architecture for learning from sequential data (such as text), for code completion. As in the realm of natural language processing, Transformers surpass the prediction accuracy achievable by RNNs; we provide an experimental confirmation of this over a Python dataset. Furthermore, we show that the way to obtain even better accuracy from Transformers is to expose the syntactic structure of code, which is easily recovered by parsing, to the neural network. This works significantly better than presenting the code as a linear token sequence, which is how Transformers were originally intended to be used. To accomplish this, we propose a novel enhancement to the self-attention mechanism of the Transformer. We enable the mechanism to learn weights—that is, how much to focus on each preceding token in the input—not only on the basis of a token's value, but also on the basis of the spatial relationship between each pair of tokens, as given by their positions in the abstract syntax tree. We provide a comprehensive experimental evaluation of our proposal, along with alternative design choices, on a standard Python dataset, as well as on a Python corpus internal to Facebook.
The last several years have witnessed exciting progress in the application of machine learning techniques to developer productivity tools (Allamanis et al., 2018a), and in particular, to code prediction (Hindle et al., 2016; Raychev et al., 2016; Li et al., 2018; Brockschmidt et al., 2019). The idea of code prediction in general is to predict the next code element given previously written code. Code prediction is commonly used in an IDE for auto-complete: based on the developer's cursor position and the code already written up to that position, the IDE offers the most likely next tokens (perhaps as a drop-down list to choose from). Auto-complete not only saves the developer from having to type in the next token(s), but is also an effective code learning mechanism: for instance, a developer might not know the name of the API call they need off the top of their head, but is able to choose among the choices shown by an auto-complete tool.
Recent work has shown the promise of code prediction based on machine learning. A common idea here is to use language models trained over large code corpora—treating code as text in a natural language (Hindle et al., 2016; Allamanis et al., 2018a)—to enable highly accurate code prediction. These models have leveraged natural language processing techniques: n-grams (Hindle et al., 2016; Hellendoorn and Devanbu, 2017b), and more recently, deep neural networks such as RNNs (Parvez et al., 2018; Li et al., 2018; Liu et al., 2020).
A different line of work has proposed code prediction based on statistics of the syntactic structure of code, as opposed to seeing code as text. These include probabilistic context-free grammars and probabilistic higher-order grammars (Bielik et al., 2016a; Raychev et al., 2016). This class of models considers code artifacts as abstract syntax trees (ASTs), and makes its predictions based on information gleaned selectively across paths in the code's AST. Specifically, Raychev et al. (Raychev et al., 2016) learn a decision tree model that uses this information essentially as features.
Researchers in the NLP community have recently developed Transformers, a new neural architecture for even more effective natural language processing (Vaswani et al., 2017). As we discuss later, Transformers promise to overcome some of the limitations of RNNs. We investigated the use of Transformers for code prediction, treating code as textual data, and validated experimentally that Transformers indeed outperform RNNs on the next code token prediction task.
Given this already strong baseline, we consider the question of whether informing the Transformer of code’s syntactic structure can further improve prediction accuracy. Our main result is that a better way to use transformers for code prediction is to expose the syntactic structure of code to the network. The details of how to do this are interesting, as encoding the structure of a program’s abstract syntax tree is not natural for sequence models. We show a range of design choices for communicating the AST structure to the Transformer. We find that the more faithfully we communicate the tree structure to the Transformer, the better the accuracy we obtain!
We report results based on training and evaluating various models for code prediction on the py150 dataset (1).
We show that a neural model based on the Transformer architecture is able to outperform state-of-the-art neural models (e.g. RNN-based ones (Hellendoorn and Devanbu, 2017a; Karampatsis et al., 2020)) as well as non-neural models (e.g. Deep3 (Raychev et al., 2016)). Measured on the leaf tokens of the ASTs, our best Transformer model improves the mean reciprocal rank (reported as a percentage, see Sec 5) significantly over the prior work: over the RNN model (40.0% vs 55.5%) as well as over the corresponding Deep3 model (43.9% vs 73.6%).
We show that a key to obtaining superior performance from the Transformer model is not just to feed in the source token sequence, as is common in NLP tasks, but to make the Transformer aware of the syntactic structure of the code. We show that with more detailed syntactic structure, we get better accuracy (from 65.7% to 74.1% on leaf tokens).
We provide a preliminary investigation into why the Transformer model that is aware of tree structure works better than one without, by using saliency maps (Simonyan et al., 2014).
Our key technical contribution is a novel enhancement to the Transformer's self-attention mechanism. We enable the mechanism to learn weights—how much to focus on each preceding token in the input—by factoring in the spatial relationship in the abstract syntax tree between each pair of tokens.
We also evaluated our trained model on a dataset selected from a Python code repository internal to Facebook, and found the relative benefits to be similar to those on py150. The accuracy on this other corpus indicates that the Transformer model is generalizable to other corpora.
Sec 2 articulates the code prediction problem in a couple of different forms, and introduces a running example. Sec 3 gives an introduction to Transformers, including how they would apply to source code; this section also describes how to communicate tree structure to the Transformer. Sec 4 provides a quick recap of previous work, focusing on the methods against which we compare our models. Sec 5 describes our datasets and implementation. Sec 6 presents our quantitative results. Sec 6.4 takes a closer look into why our models worked well (or did not). Sec 7 discusses related work in the area of code prediction and in using Transformers. We conclude the paper with our future work.
Consider the Python code fragment in Fig 1. Suppose a developer has written code up to string followed by a dot. At this point, it will be helpful for the IDE to prompt the developer with attribute names that are likely to follow, preferably with atoi ranked at the top, because in this case that is the correct next token.
Our goal is to devise a model such that it takes some code fragment as input and predicts the next token. In this section, we describe two main methods of representing code as inputs to be fed into various models.
In NLP, a common way of feeding in information for next token prediction is as a linearized token sequence. The same technique can be applied to source code, which we parse into tokens. To predict ”atoi”, we would look at the tokens: […, ”map”, ”(”, ”string”, ”.”]. This is a natural approach for next token prediction since each input and prediction in the sequence corresponds to a token in source code, so we can easily evaluate on all tokens.
An alternative to a source token sequence is the Abstract Syntax Tree (AST), as shown in Fig 2 for the code fragment in Fig 1. An AST can better represent spatial relationships between nodes. For example, in source code, the tokens ip (node 3) and chain (node 41) are separated by 30 tokens, but they are related in the AST via a specific (short) path.
ASTs represent some source tokens explicitly and others implicitly. Tokens corresponding to identifiers, field names, and constants appear explicitly as leaf (terminal) nodes: for instance, ip and host appear as the leaf (terminal) nodes 3 and 11, respectively. Keywords and other syntactic tokens (e.g. =) are implied by the type of internal nodes (e.g. Assign). Accordingly, the prediction task can be separated into:
Value prediction: Predicting the values at leaf nodes. For example, given nodes 0-10 of the tree, we want to predict host, which is the value of the leaf node at node 11.
Type prediction: Predicting the types at internal nodes. For example, given nodes 0-33 of the tree, we want to predict Attr, which is the type of the internal node at node 34.
Knowing that the type of a node is Attr implies that after the source tokens corresponding to its left child, there will be a token ”.” (dot) before the (single) token from its right child. Thus, value prediction and type prediction together can simulate the next token prediction problem, though there would need to be a stack-based controller that calls the right predictor, maintains some state, and emits the predicted source tokens appropriately.
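As an illustration of the leaf/internal split, Python's built-in ast module (whose node vocabulary differs in details from the py150 AST format used in this paper) exhibits the same distinction: identifiers and attribute names appear as values hanging off leaf-level nodes, while keywords and operators such as = and . are implied by node types such as Assign and Attribute. A minimal sketch:

```python
import ast

# Parse a small fragment resembling the running example. In the py150
# format, internal nodes carry types (e.g. Assign, AttributeLoad) and
# leaves carry values (identifiers, attribute names, constants).
tree = ast.parse("host = string.atoi(ip)")

node_types = [type(n).__name__ for n in ast.walk(tree)]
leaf_values = [n.id for n in ast.walk(tree) if isinstance(n, ast.Name)]
leaf_values += [n.attr for n in ast.walk(tree) if isinstance(n, ast.Attribute)]

# The '=' token is implied by the Assign node type; the '.' by Attribute.
print(node_types)   # includes 'Assign', 'Attribute', 'Call', ...
print(leaf_values)  # includes 'host', 'string', 'ip', 'atoi'
```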
In this paper, we explore both sequence-based and AST-based representation of code for code prediction, using various models (RNN, Decision Tree, Transformers). Table 1 shows the ranks (lower is better) of predicting the correct leaf node for all the leaf nodes in the AST in Fig 2. It compares two models of previous work and four Transformer-based models (our work). Transformer models generally achieve lower ranks, and in some cases they are the only models that produce the right token in their top-10 predictions. This table also shows (via one example here, but the results carry over) that feeding ASTs to Transformer models brings better results than feeding them source token sequences. The core of our paper is about how to feed ASTs to Transformers.
In this section, we explain the four models of our own creation: SrcSeq, RootPath, DFS, and DFSud. All four models use Transformers (Vaswani et al., 2017), a class of deep learning models that have achieved state-of-the-art results (Devlin et al., 2018; Dong et al., 2019; Radford et al., 2019) for a variety of NLP tasks such as language modeling, question answering, and sentence entailment. In this section, we discuss how we can apply Transformers to next code token prediction, feeding in both sequence-based (SrcSeq) and AST-based (RootPath, DFS, DFSud) inputs.
Transformers belong to a class of deep neural networks that are designed for sequence processing. Transformers eschew the hidden states of earlier generation sequence networks (such as RNNs, see Sec 4) in favor of exposing the entire input sequence simultaneously, relying solely on attention mechanisms. In Transformers, information from any previous location of the sequence can directly affect the encoding of the next token, through a mechanism called self-attention, which helps greatly improve the connectivity in long sequences. Transformers also use multiple heads of these self-attention blocks, called multi-headed attention, which enables the model to simultaneously consider different ways of attending to previous information within one block and also across other blocks.
This section explains self-attention in detail (Figure 3), as it is the crux of the model. The purpose of self-attention is to give higher attention to more relevant tokens in the input sequence. To illustrate this, let's take an example input sequence: [”map”, ”(”, ”string”, ”.”], with the target token being ”atoi”. This input sequence is first fed through the initial Embedding layer to give an embedding matrix E. Then, this embedding is used as input to three fully-connected networks to create a query, key, and value embedding for the sequence: Q = E W_Q, K = E W_K, V = E W_V.
In our example, Q, K, and V are each n x d matrices, where n is the length of the input sequence and d is the embedding dimension. We use the queries Q to ”query” the keys K to see which token relationships are the most important, by calculating Q K^T. This results in a matrix of size n x n, as seen in Table 2. Each row is then normalized (divided by the square root of d) and passed through a softmax layer so all the scores are positive and each row adds up to 1. Table 3 shows an example of the self-attention weights (the rows shown do not sum to 1 because earlier tokens in the sequence are omitted from the table); looking at the last row, we can see that most of the self-attention is given to ”.”, meaning it is a greater factor in predicting the next token ”atoi”. Also note that the matrix is a lower triangular matrix: self-attention cannot be applied to tokens that have not been seen yet. Finally, this matrix is multiplied with the values V to weight the token embeddings: Z = softmax(Q K^T / sqrt(d)) V.
In our example, Z is an n x d matrix. Z is then fed through a fully-connected network, coupled with skip connections and layer normalizations. This process is repeated num_layer times. Finally, the output of the last layer goes through a classification layer to generate predictions for the next token.
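The masked self-attention computation just described can be sketched in plain Python. The toy dimensions and random values below stand in for learned embeddings and projection weights, and only a single head is shown:

```python
import math, random

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def causal_self_attention(q, k, v, d_k):
    """Masked (causal) scaled dot-product attention for one head.
    q, k, v: n x d_k matrices given as lists of lists."""
    n = len(q)
    out = []
    for i in range(n):
        # Scores only over positions j <= i: later tokens are masked out,
        # which is why the attention matrix is lower triangular.
        scores = [sum(q[i][t] * k[j][t] for t in range(d_k)) / math.sqrt(d_k)
                  for j in range(i + 1)]
        weights = softmax(scores)  # each row sums to 1
        out.append([sum(w * v[j][t] for j, w in enumerate(weights))
                    for t in range(d_k)])
    return out

random.seed(0)
n, d = 4, 8  # e.g. ["map", "(", "string", "."] with toy 8-dim embeddings
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
Z = causal_self_attention(Q, K, V, d)
```

Note that the first output row equals the first value row: with only one visible position, its attention weight is 1.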
For other details, please refer to Vaswani et al. (2017) (especially the multi-head attention part) and, in particular, to GPT-2 (Radford et al., 2019) for a more thorough description.
The next sections discuss various ways of feeding code fragments into this Transformer architecture.
Our first attempt is to apply a Transformer over source token sequences. As a baseline for later models that take more tree information, as well as a straightforward application of Transformer models, we apply a Transformer (GPT-2) over source token sequences: Y = Transformer(E_src), where Y is the output of the Transformer to be used for prediction, and E_src represents the embedding of the source tokens. The model does next token prediction by taking all preceding source tokens, up to the point of prediction, as input. As the inputs and outputs are the same as those of the SrcRNN model (introduced in Sec 4), we can do a direct comparison between RNNs and Transformers. As we show in the experiments, this turns out to be an already strong baseline.
The next two subsections discuss how to present the AST to the Transformer.
One way to present all AST nodes to a Transformer is to linearize them using a pre-order traversal, i.e., a depth-first search (DFS). For Fig 2, for node 29, the previous nodes in DFS order would be: […, “Call”, “NameLoad”, “map”, “AttributeLoad”, “NameLoad”, “string”, “Attr”]
The DFS model simply feeds this sequence to the Transformer: Y = Transformer(E_ast), where Y is the output of the Transformer to be used for prediction, and E_ast represents the embedding of the AST nodes. DFS predicts the next node in the AST; thus, it does both value (leaf) prediction and type (internal) prediction.
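The pre-order linearization that produces this input can be sketched as follows; the nested-tuple tree encoding and node labels are illustrative, not the py150 on-disk format:

```python
def dfs_order(node):
    """Pre-order (DFS) linearization of an AST given as nested tuples:
    (label, [children]). Internal nodes carry types, leaves carry values."""
    label, children = node
    seq = [label]
    for child in children:
        seq.extend(dfs_order(child))
    return seq

# Toy fragment of the running example: string.atoi
ast_fragment = ("AttributeLoad", [
    ("NameLoad", [("string", [])]),
    ("Attr", [("atoi", [])]),
])
print(dfs_order(ast_fragment))
# ['AttributeLoad', 'NameLoad', 'string', 'Attr', 'atoi']
```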
DFS presents the tree nodes in a pre-determined order, but still does not retain detailed structural relationships between nodes. For example, consider the sequence of nodes 26 - 28 in Fig 2. This would be represented as [”NameLoad”, ”string”, ”Attr”], the three nodes appearing consecutively in DFS order. Looking at the AST, we can see that the relations between the pairs (”NameLoad”, ”string”) and (”string”, ”Attr”) are actually quite different: ”NameLoad” is one node up from ”string”, while ”string” is two nodes up and one node down from ”Attr”. This path-based relation between the nodes provides richer information about the actual structure of the tree.
While DFS itself shows only a small improvement over SrcSeq (Table 6), it allows us to augment the input with the richer information indicated above, leading to the DFSud model.
DFSud is an extension of the DFS model that incorporates more tree structure. Specifically, given any two nodes n_i and n_j in the AST, we want to capture the shortest path needed to reach n_j from n_i, and communicate this to the Transformer. The path from n_i to n_j is represented abstractly, only in terms of up and down moves: r_ij = U^u D^d, where u and d are the number of up and down moves, respectively, that node n_i has to make to reach node n_j. (Code2vec (Alon et al., 2019) used (embeddings of) leaf-to-leaf AST paths to capture information for the purpose of code summarization; by contrast, UD paths specifically retain information on how a pair of tree nodes are situated with respect to each other.) We create a matrix R containing r_ij for each pair of nodes (n_i, n_j), where n_j comes after n_i in DFS order. Table 4 (ignoring the qk parts inside the parentheses) shows an example of R in the context of our running example (nodes 25-29 in the AST; node 24 was omitted due to space constraints for the table).
Notice that this matrix has the same shape (lower triangular) as the Q K^T matrix in Table 2. We add R into the attention block (after passing it through an embedding layer): A = softmax((Q K^T ⊙ E(R)) / sqrt(d)) V, where ⊙ is the element-wise product and E(R) is the embedded relation matrix.
Table 4 shows an example of the new self-attention. One detail to note here is that the path relations are taken relative to the next token we are predicting.
The rest of the Transformer model is the same as DFS's, with the updated attention calculation: Y = Transformer(E_ast, E(R)), where Y is the output of the Transformer to be used for prediction, E_ast represents the embedding of the AST nodes, and E(R) represents the embedding of the relations.
Note that Q K^T provides a way for the model to learn the strength of attention it needs to pay to previous tokens, organized in the order of inputs to the network (this order is implicit in the indices used in the matrix in Table 3). Q K^T ⊙ E(R) provides a way for the model to learn the strength of the attention to pay to previous tokens, considering the AST relationship between pairs of nodes as well.
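The up-down relations that populate this matrix can be computed from each node's root-to-node ancestor path via the lowest common ancestor. A minimal sketch, assuming each node is identified by that path (labels and the U{u}D{d} spelling are illustrative):

```python
def ud_relation(path_i, path_j):
    """Compute the U^u D^d relation between two nodes, each given as its
    list of ancestors from the root down to the node itself. u is the
    number of up moves from node i to the lowest common ancestor; d is
    the number of down moves from there to node j."""
    common = 0
    while (common < len(path_i) and common < len(path_j)
           and path_i[common] == path_j[common]):
        common += 1
    u = len(path_i) - common
    d = len(path_j) - common
    return f"U{u}D{d}"

# Illustrative: "string" sits one step below its parent "NameLoad".
print(ud_relation(["root", "AttributeLoad", "NameLoad"],
                  ["root", "AttributeLoad", "NameLoad", "string"]))  # U0D1
```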
To recap, our key insight is to fortify the self-attention mechanism of the Transformer to enable it to learn weights on the basis of AST relationships between tokens as well.
| Model | Architecture | Prediction task | Input representation | Input units |
| Deep3 | Decision Tree | Value pred & type pred | AST | AST nodes |
| SrcRNN | RNN | Next token pred | Source code | Source code tokens |
| SrcSeq | Transformer | Next token pred | Source code | Source code tokens |
| DFS | Transformer | Value pred & type pred | AST | AST nodes |
| DFSud | Transformer | Value pred & type pred | AST + path relations | AST nodes |
| RootPath | Transformer | Value pred | Leaf nodes + leaf-to-root paths | Leaf nodes |
| LeafTokens | Transformer | Value pred | Leaf nodes | Leaf nodes |
| DFSud+ | Transformer | Value pred & type pred | AST + path relations | AST nodes |
In this section, we discuss some alternate models and variations of models we have explored.
RootPath is an AST-based model that feeds tree structure information to the model in a different way than DFS does. RootPath first creates a sequence based on the leaf nodes of the AST. To expose tree structure to the Transformer, it fortifies each leaf node with the path from the leaf node to the root of the AST, obtained by traversing up its ancestors; we call such a path a root-path. For Fig 2, for node 29, the root-path inputs would be:
([“NameLoad”, “Call”, … “Module”], “map”),
([“NameLoad”, “AttributeLoad”, “Call”, …, “Module”], “string”),
([“Attr”, “AttributeLoad”, “Call”, …, ”Module”], ?)
The root-paths are first fed into a sequence encoder (such as an LSTM), coupled with the leaf node, and the result is fed through the Transformer: Y = Transformer(E_leaf, E_rp), where Y is the output of the Transformer to be used for prediction, E_leaf represents the embedding of the leaf nodes, and E_rp is the embedding of all the root-paths. Since RootPath predicts only leaf nodes, it does only value prediction.
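Extracting the root-paths for all leaves can be sketched as follows; as before, the nested-tuple tree encoding is illustrative:

```python
def root_paths(node, ancestors=()):
    """Yield (path-to-root, leaf-value) pairs for every leaf in an AST
    given as nested tuples (label, [children]). Paths are listed from
    the nearest ancestor up to the root."""
    label, children = node
    if not children:  # leaf: emit its ancestors, nearest first
        yield (list(ancestors[::-1]), label)
    else:
        for child in children:
            yield from root_paths(child, ancestors + (label,))

ast_fragment = ("Module", [
    ("AttributeLoad", [
        ("NameLoad", [("string", [])]),
        ("Attr", [("atoi", [])]),
    ]),
])
for path, leaf in root_paths(ast_fragment):
    print(path, leaf)
# ['NameLoad', 'AttributeLoad', 'Module'] string
# ['Attr', 'AttributeLoad', 'Module'] atoi
```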
LeafTokens is a lightweight variation of RootPath, where only the leaf nodes are used. For Fig 2, for node 29, the input sequence would be: […, ”map”, ”string”], and the model would predict ”atoi”. LeafTokens feeds the leaf nodes of the AST into a Transformer: Y = Transformer(E_leaf), where Y is the output of the Transformer to be used for prediction, and E_leaf represents the embedding of the leaf nodes. We compare this model with RootPath to determine the importance of root-path information in next token prediction.
DFSud+ is a variation of DFSud that uses a richer vocabulary for the up-down paths, including some child index information, as this provides extra information about the tree structure. While DFSud uses only U^u and D^d to describe the relation between two nodes in the AST, DFSud+ expands each move into three sub-words, indicating whether the node visited is the first child, the last child, or somewhere in between, respectively. For example, in Table 4, the relation between nodes 27 and 28 would expand accordingly. We chose this minimal extension to limit the possible exponential growth in path vocabulary size; even with this minor extension, our path vocabulary increases from 250 to 100k to cover more than 90% of the vocab (with a long right tail). The rest of the model is the same as DFSud's, as described in Sec 3.4. We compare this model with DFSud to examine whether adding in more information (at the expense of enlarging the model) improves MRR.
A high-level overview of the models is presented in Table 5. The next section will cover two previous models from literature.
In this section, we recap two different methods for code prediction, representative of recent previous work, against which we compare our work. These are (1) a method based on language models that uses a sequence of source code tokens, and (2) a method based on decision trees (Raychev et al., 2016) that works on ASTs.
A language model computes the probability of the next word, given some window of preceding words: P(w_t | w_{t-n+1}, …, w_{t-1}). Here we use an RNN to compute a language model; n-grams would be another choice. (The jury seems to be out on which one is better for the task (Hellendoorn and Devanbu, 2017a; Karampatsis et al., 2020).)
Fig 4 shows a Recurrent Neural Network (RNN) operating on some of the tokens from the example in Fig 1. As the name suggests, RNNs consume input tokens recurrently, one per time step, and produce output tokens one per time step as well. The bottom layer of the RNN embeds input tokens into a vector: e_t = E(x_t), where x_t is the source token seen at the t'th time step. The hidden state is computed as h_t = f(W_e e_t + W_h h_{t-1}), using both e_t and the hidden state from the previous time step. The output is a vector of probabilities of various tokens, computed by applying softmax over W_o h_t; the diagram shows the top-ranked predictions or the ground truth. E, W_e, W_h, and W_o are the parameters of the network, to be learned during training.
The pertinent point to note is that the hidden state encodes knowledge of not just the current token, but of the last several tokens, via the propagation of information through previous hidden states. Thus, RNNs implicitly compute a language model over tokens.
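This recurrence can be sketched with a plain Elman-style step; the weights below are toy values, and the experiments use LSTM cells, whose gating is more involved:

```python
import math

def rnn_step(e_t, h_prev, W_e, W_h, b):
    """One recurrent step: h_t = tanh(W_h h_{t-1} + W_e e_t + b).
    Matrices are lists of rows. A minimal sketch of the recurrence
    described above, not the LSTM used in the experiments."""
    d = len(h_prev)
    return [math.tanh(sum(W_h[i][j] * h_prev[j] for j in range(d))
                      + sum(W_e[i][j] * e_t[j] for j in range(len(e_t)))
                      + b[i])
            for i in range(d)]

# The hidden state carries information from all previous tokens:
h = [0.0, 0.0]
for e in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):  # toy token embeddings
    h = rnn_step(e, h, W_e=[[0.5, -0.2], [0.1, 0.3]],
                 W_h=[[0.4, 0.0], [0.0, 0.4]], b=[0.0, 0.0])
```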
A limitation of RNNs is the difficulty they have in tracking long-range dependence, even with various proposals to mitigate the problem (e.g. long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997), which we do use in our implementation; attention on top of RNNs (Iyer et al., 2016); and skip-connections between sequence locations (Vinyals et al., 2015)).
In our experiments, we feed the source code tokens into an RNN and call this model SrcRNN .
Raychev et al. (Raychev et al., 2016) presented a system, Deep3, based on a learned decision tree combined with count-based probabilities at the leaves of the decision tree. We provide only a sketch here, highlighting how they use paths on an AST.
Fig 5 shows part of a learned decision tree, written in the form of a program in a specialized language they call TGEN. Given an AST A and a starting node n, a TGEN program walks certain paths in A starting from n. For example, Up WriteValue (line 1) goes to the parent of n and records its label. If the label is Attr, it walks a different path (line 2) in the vicinity of n. The branch outcomes and observations collected by running this TGEN program on A form a context, which is then used to look up a probability distribution conditioned on that context. For the AST in Fig 2, starting with node 29, the TGEN program will produce a context for which the probabilities of different tokens for node 29 might be: [atoi: 40%, length: 20%, …]. The flexibility of focusing on arbitrary paths in the AST allows the model to condition selectively on nodes farther back in the AST.
A TGEN program is learned—on a specific corpus—by a genetic search procedure that simultaneously selects paths and grows the decision tree from the training data, with an entropy minimization objective. The details are not important for this paper; in this paper, we use their pretrained model (32) as well as their Python dataset (1) for our experiments.
The reader will notice that the notion of UD paths in Section 3.4 is akin to the AST paths expressed in TGEN programs. The paths in TGEN are more general, but at a high level, the idea that certain ”spatial” relations between nodes are important is common to both approaches. This, along with the competitive quality of results of the Deep3 model in Table 1, makes it an interesting comparison. We explore this similarity further in Appendix B.2.
We train our models using the py150 dataset (1) used in Raychev et al. (2016). The dataset consists of 150k Python 2 source code files from GitHub repositories, along with their parsed ASTs, split into 100k files for training and 50k for evaluation. We modify the ASTs extracted from the py150 dataset to ensure that internal nodes have only types and leaf nodes have only values; for implementation details, please refer to Appendix A.1. To incorporate large trees (greater than 1000 nodes), we deploy a technique adopted from (Al-Rfou et al., 2018), which slices a large tree into shorter segments, using a sliding window to maintain part of the previous context. For implementation details, please refer to Appendix A.2.
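The sliding-window slicing can be sketched as follows; the segment length and overlap below are illustrative values, not the paper's exact settings:

```python
def slice_with_context(seq, max_len, context):
    """Slice a long (linearized) tree into segments of at most max_len
    nodes, repeating the last `context` nodes of the previous segment
    so the model keeps part of its history. A sketch of the
    sliding-window scheme adopted from Al-Rfou et al. (2018)."""
    segments = [seq[:max_len]]
    start = max_len
    while start < len(seq):
        segments.append(seq[start - context:start - context + max_len])
        start += max_len - context
    return segments

# A 2500-node tree with 1000-node windows overlapping by 500:
segs = slice_with_context(list(range(2500)), max_len=1000, context=500)
```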
We evaluate our models on two evaluation datasets:
py150: We use the evaluation dataset used in Raychev et al. (2016), which consists of 50k Python ASTs. We perform the two modifications listed above before feeding them into our models; after the modifications, there are 16,003,628 leaf nodes and 30,417,894 internal nodes.
internal: We also created an evaluation dataset consisting of 5000 Python files from a code repository internal to Facebook. With this dataset, we can evaluate how our trained model can generalize to a different dataset, even if the code comes from disjoint projects. After the modifications, there are 1,669,085 leaf nodes and 3,067,147 internal nodes.
Recent works (Karampatsis et al., 2020; Hellendoorn and Devanbu, 2017a) have divided evaluations into static and dynamic, where in the dynamic evaluations, the model continues to update its parameters during evaluation. This may increase accuracy by having the model adapt to the characteristics of the evaluation dataset. In our experiments, we choose to evaluate statically, and realize that evaluating dynamically may improve accuracy.
For the models that use Transformers (RootPath, DFS, SrcSeq, DFSud), we adapt the PyTorch implementation of GPT-2 small (Radford et al., 2019) from https://github.com/graykode/gpt-2-Pytorch. We do not use positional encoding; refer to Appendix A.3 for the explanation. We use six Transformer blocks, with six heads in each block. We borrow the other hyperparameters from Radford et al. (2019). We limit the token vocabulary size to 100k, which covers over 90% of the tokens used in the training dataset. For DFSud, we limit the path vocabulary to 250, which covers over 95% of the path relations. For RootPath, we limit the maximum length of the path from leaf node to root to 13, which covers over 90% of the nodes; for any path longer than 13, we keep the nodes closest to the leaf and truncate the nodes near the root.
For the SrcRNN model, we adapt the PyTorch example implementation of a word-level LSTM language model from https://github.com/pytorch/examples/tree/master/word_language_model. We limit the token vocabulary size to 100K, which covers over 90% of the tokens.
For the Deep3 model, since the authors have shared only the model and not the training algorithm, we used the model pretrained on py150.
We trained all models (except Deep3) on Nvidia Tesla V100 GPUs (using 4 GPUs at a time) until the loss converged, with all of the parameters randomly initialized. We used the Adam optimizer with the learning rate set to 1e-3. For convergence, DFS took 11 epochs, DFSud took 21 epochs, SrcSeq took 9 epochs, and SrcRNN took 9 epochs (each epoch took around 45 minutes to 1 hour).
We evaluate the models on the code prediction tasks that we defined in Sec 2: next token prediction, which pertains to source code tokens taken as a linear sequence; value prediction, which pertains to predicting leaf nodes of the AST; and type prediction, which pertains to predicting internal nodes of the AST.
To measure performance on these tasks, we use mean reciprocal rank (MRR): MRR = (1/N) * sum_{i=1}^{N} (1 / rank_i), where N is the number of prediction locations and rank_i is the rank of the correct label given by the model for the i'th data point. We present MRR as a percentage, in keeping with prior work (Karampatsis et al., 2020; Hellendoorn and Devanbu, 2017a).
While Acc@1 gives credit only when the correct label is ranked at the top, MRR also gives partial credit when the true label is not at the top but among the top few predictions. Compared to a hit-or-miss metric such as Acc@1, this is closer to the realistic scenario in which completion suggestions are presented to developers. With this practical perspective, and for ease of computation, we only consider rank_i <= 10 for each location (all larger ranks are given a score of 0).
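The metric, with the rank cutoff at 10 as described above, can be computed as:

```python
def mrr_at_10(ranks):
    """Mean reciprocal rank, reported as a percentage. ranks are 1-based
    positions of the correct label; any rank above 10 scores 0."""
    total = sum(1.0 / r for r in ranks if r <= 10)
    return 100.0 * total / len(ranks)

# e.g. correct label ranked 1st, 2nd, and outside the top 10:
print(mrr_at_10([1, 2, 11]))  # (1 + 0.5 + 0) / 3 * 100 = 50.0
```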
We share our data processing scripts and model implementations at https://github.com/facebookresearch/code-prediction-transformer.
At a high level, we want to answer the following research questions.
Overall, do Transformer-based models provide better accuracy compared to prior state-of-the-art methods of code prediction?
Does syntactic structure of code help get better accuracy out of Transformers, and if so, by how much?
What did the Transformer model variants learn from the code? Did they learn the right things? What can we learn from the learned models?
We describe the experiments to answer the research questions RQ1 and RQ2. We discuss the evaluation of RQ3 in Section 6.4.
For RQ1, recall that prior work (Sec 4) works on two different kinds of inputs: all source tokens as in program text, and ASTs of each program unit. To carry out a direct comparison against prior work, we split RQ1 into two specific questions:
Is the Transformer-based model more accurate than the RNN-based model on the next token prediction problem (Sec 2)?
To answer this question, we compare the SrcRNN model against the SrcSeq model on the source tokens.
Are the Transformer-based models more accurate than Deep3, on the value prediction and on the type prediction problems (Sec 2)?
To answer this question, we compare the Deep3 model against the AST-based Transformer variants: DFS, DFSud+, DFSud, and RootPath.
For RQ2, we ask two sub-questions:
Does a Transformer model based on an AST outperform a Transformer model that takes the corresponding source token sequences?
This question can be answered directly only on tokens that appear both in ASTs and source token sequences: these are precisely the values at the leaf nodes of the AST. We compare SrcSeq and DFS models on the terminal value prediction problem.
Does providing more detailed structural information help with accuracy?
To answer this question, we compare among the tree-based Transformer models (DFS, DFSud+, DFSud, and RootPath) on the terminal value prediction and the internal/type prediction problems.
Table 6. MRR (Acc@1 in parentheses) on the py150 dataset. SrcRNN and Deep3 are prior work; SrcSeq, DFS, and DFSud are our models.

| Task | SrcRNN | Deep3 | SrcSeq | DFS | DFSud |
| --- | --- | --- | --- | --- | --- |
| Next token prediction | 65.7 (58.0) | n/a | 74.1 (68.1) | n/a | n/a |
| Value prediction | 36.4 (29.1) | 43.9 (40.5) | 50.1 (43.4) | 58.0 (52.4) | 73.6 (71.0) |
| Type prediction | n/a | 81.9 (75.8) | n/a | 89.3 (82.7) | 98.7 (97.6) |
Table 7. MRR (Acc@1 in parentheses) on the internal Facebook dataset. SrcRNN and Deep3 are prior work; SrcSeq, DFS, and DFSud are our models.

| Task | SrcRNN | Deep3 | SrcSeq | DFS | DFSud |
| --- | --- | --- | --- | --- | --- |
| Next token prediction | 57.4 (48.3) | n/a | 66.8 (60.2) | n/a | n/a |
| Value prediction | 23.8 (17.7) | 36.1 (33.3) | 36.5 (30.7) | 43.9 (38.8) | 58.4 (55.3) |
| Type prediction | n/a | 79.9 (73.1) | n/a | 87.7 (80.2) | 98.0 (96.3) |
Table 8. MRR (Acc@1 in parentheses) on py150 for additional AST-based model variants.

| Task | DFSud | RootPath | LeafTokens | DFSud+ |
| --- | --- | --- | --- | --- |
| Value prediction | 73.6 (71.0) | 55.1 (48.4) | 41.9 (34.1) | 73.3 (70.8) |
| Type prediction | 98.7 (97.6) | n/a | n/a | 97.8 (96.1) |
For RQ1.1, see the SrcRNN and SrcSeq columns in Table 6 and Table 7. For the py150 dataset, we see a significant improvement in MRR, from 65.7% to 74.1% for the SrcRNN and SrcSeq models, respectively. The same holds on the internal dataset: 57.4% vs 66.8%. (Tables 9 and 11 in Appendix B.1 break down the data for different kinds of next token predictions.) Not surprisingly, Table 6 also shows that predicting the identifier and constant tokens (as in value prediction) is more challenging than predicting the keyword and punctuation tokens, which form almost two-thirds of all source tokens.
For RQ1.2, we compare the Deep3 model against the DFS and DFSud models. Overall, we found that all the Transformer models (SrcSeq, DFS, DFSud) achieve higher scores than Deep3. Table 6 shows that DFSud achieves the best MRR of 73.6 for leaf node prediction, compared with Deep3's MRR of 43.9. Similar results can be seen for the internal dataset, as shown in Table 7.
To answer RQ2.1, we compare the value prediction results for SrcSeq against the AST-based models (DFS, DFSud). Table 6 shows that DFS outperforms SrcSeq by 7.9%, and DFSud significantly outperforms SrcSeq by 23.5% (73.6% vs 50.1%). These results demonstrate that representing the source code as an AST, rather than as a linearized token sequence, provides better results for next value prediction.
For RQ2.2, we compare the results amongst the AST-based models. First, comparing DFS and DFSud: DFSud provides more detailed structural information, and Table 6 shows that this significantly improves accuracy, with DFSud achieving 15.6% higher MRR for value prediction and 9.4% higher MRR for type prediction than DFS. Similar trends can be seen for the internal dataset in Table 7.
Table 8 shows a significant drop in accuracy between RootPath and LeafTokens (55.1% vs 41.9% for all leaf nodes). This shows that the information captured by the leaf to root paths (both in terms of its values and tree structural information) gives a solid boost to accuracy. These results demonstrate that feeding the model with more structural information does improve results.
Next, we compare RootPath and DFS. These models are similar in that both take all of the AST nodes as context, but differ in how they digest it. RootPath first aggregates the context information for each leaf node before predicting the next leaf node, while DFS captures both leaf and internal nodes in one context. Results show that the performance of the two models is comparable (58.0% vs 55.1% for value prediction in Tables 6 and 8). One drawback of RootPath is that it can only predict leaf nodes, while DFS can predict all nodes in the AST, including internal nodes for type prediction.
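To make the RootPath contrast concrete, here is a minimal sketch of extracting a leaf-to-root type path from an AST given parent pointers; the dict-based node format and helper name are our own assumptions, not the paper's exact encoding:

```python
def root_path(parents, types, leaf):
    """Return the sequence of internal-node types on the walk from a
    leaf up to the root: the per-leaf context that a RootPath-style
    model aggregates before predicting the next leaf value."""
    path = []
    node = parents[leaf]          # start at the leaf's parent
    while node is not None:
        path.append(types[node])
        node = parents[node]      # climb toward the root (parent of root is None)
    return path

# Tiny AST: Module -> Assign -> leaf value "x"
parents = {"x": "assign", "assign": "module", "module": None}
types = {"assign": "Assign", "module": "Module"}
print(root_path(parents, types, "x"))  # ['Assign', 'Module']
```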
Table 8 shows that DFSud+ did not outperform DFSud, which suggests that simply expanding the up-down vocabulary may not be the right way to expose child index information to the model. Areas of exploration include whether a vocabulary size of 100k is too sparse for the models to learn effectively, or whether child indices are inherently not as crucial for code prediction.
Our SrcRNN implementation is based on a PyTorch implementation (https://github.com/pytorch/examples/blob/master/word_language_model/model.py), whereas related papers have generally built off of a TensorFlow implementation (https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py). As the hyperparameters were similar to those of recent publications, we expect our implementation to be comparable.
We have not integrated byte-pair encoding (BPE) (Karampatsis et al., 2020) into our RNN model. We expect BPE to benefit both RNN and transformer models, and plan to explore this in future work.
While larger Python corpora have appeared, py150 is still sizable at ~500MB; we do not expect the larger corpora to reverse our findings.
In this part, we study the influence of each input feature to shed light on the black box of how our models make their predictions. In particular, we study how each input token contributes to the models' predictions (attribution analysis, this section) and which up-down paths are learned to be important by DFSud (Appendix B.2). For the latter, we found that local syntactic context is generally important, and that similarities exist with the heavily utilized Deep3 TGEN paths.
We use saliency maps (Simonyan et al., 2014) for attribution analysis; these are constructed by taking the partial derivative of the loss function with respect to the inputs. Fig 6 visualizes the magnitude of the gradient at each input token when the model predicts a particular output. Intuitively, the larger the value for a particular token, the more sensitive the output is to variations at that token.
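In practice these gradients come from autograd over the embedding layer; the following self-contained sketch approximates the same quantity with finite differences on a toy differentiable loss (the toy loss is ours, purely for illustration):

```python
def saliency(loss_fn, inputs, eps=1e-5):
    """Approximate |d loss / d x_i| for each input via central
    differences; the magnitude indicates how sensitive the output is
    to variations at that input position."""
    grads = []
    for i in range(len(inputs)):
        up = list(inputs); up[i] += eps
        dn = list(inputs); dn[i] -= eps
        grads.append(abs(loss_fn(up) - loss_fn(dn)) / (2 * eps))
    return grads

# Toy loss that depends strongly on input 0 and weakly on input 2.
toy_loss = lambda x: 3.0 * x[0] + 0.1 * x[2]
print(saliency(toy_loss, [1.0, 1.0, 1.0]))  # ~[3.0, 0.0, 0.1]
```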
Examining the saliency maps for DFS and DFSud, we first observe that the parent node in the AST (the internal node right above the leaf) is generally important for both models. From Fig 5(b), we can see that DFSud is influenced by string when predicting atoi, and by request_size when predicting num_requests. Though not shown in the figure, when predicting 2, DFSud is influenced by the previous occurrences of sys.argv indexed by 0 and 1. Looking at the differences between Fig 5(a) and Fig 5(b), we found that DFSud is influenced by ip while predicting gethostbyname correctly, whereas DFS is not and predicts it wrong. Generally, we found that DFSud attributes more to terminal values relevant to the value being predicted, while DFS attributes little to values other than non-terminals. This provides evidence that DFSud is more likely to have learned the right features for next value prediction.
On an orthogonal note, we also observe that for many prediction locations, the magnitudes of the gradients are very small, suggesting robustness of the model in the sense that it is less sensitive to minor perturbations of the input sequence.
Due to the vastness of the topic, we focus on two themes of related work.
Simply put, the task of code completion is to predict the rest of the code a user is typing. Code completion is widely used by commercial and free integrated development environments (IDEs), such as Visual Studio Code (https://code.visualstudio.com/docs/editor/intellisense), IntelliJ IDEA (https://www.jetbrains.com/help/idea/auto-completing-code.html), and Atom (https://flight-manual.atom.io/using-atom/sections/autocomplete/), to accelerate or ease the process of developing software.
Since Hindle et al. (2016), statistical learning has risen to prominence for the task of code completion, exploiting the naturalness of code (Allamanis et al., 2018a). Learning methods have ranged from n-grams (Nguyen et al., 2013; Hindle et al., 2016) to probabilistic grammars (Allamanis and Sutton, 2014; Bielik et al., 2016a) and decision trees (Raychev et al., 2016). Recently, there has been increasing application of deep learning to code completion, especially recurrent neural networks (Liu et al., 2016; Li et al., 2018; Liu et al., 2020) and graph neural networks (Allamanis et al., 2018b; Brockschmidt et al., 2019; Yang and Xiang, 2019).
Other flavors of code completion exist, such as settings where the program after the prediction location is available (Raychev et al., 2014; Allamanis et al., 2018b; Brockschmidt et al., 2019; Alon et al., 2020), or where the granularity of prediction is smaller (e.g. characters (Bielik et al., 2016b) or subtokens (Karampatsis et al., 2020)) or larger (e.g. sub-ASTs (Alon et al., 2020)). We focus on predicting the next token given only the partial program up to the prediction location.
PHOG (Bielik et al., 2016a), DeepSyn (Raychev et al., 2016), and Deep3 (Raychev et al., 2016) are particularly related, as all of them utilize AST information for code completion. PHOG and DeepSyn use a conditional probabilistic context-aware grammar based on AST walks. Deep3 further enriches the probabilistic model with a decision tree to allow more fine-grained modeling of context-dependent code occurrences.
However, these probabilistic models have been surpassed by deep neural networks, namely LSTMs over serialized ASTs (Liu et al., 2016). Accuracy can be further improved by stacking attention and a pointer network over an LSTM (Li et al., 2018), or by augmenting LSTMs with stacks whose operations are guided by the AST structure (Liu et al., 2020).
Transformers, popularized by Vaswani et al. (2017), are sequence-to-sequence (seq2seq) neural networks based on layers of multi-head self-attention. Surpassing RNNs, Transformer models (Devlin et al., 2018; Dong et al., 2019; Radford et al., 2019) have become the state-of-the-art natural language models, breaking records for a range of NLP tasks, including sentence entailment, question answering, and language modeling. See Sec 3 for a more thorough introduction to Transformers.
There have been reported applications of Transformer models to code completion. Galois (https://github.com/iedmrc/galois-autocompleter) is an open-source project that uses GPT-2 (Radford et al., 2019) for code completion. The approach is similar to our SrcSeq model, apart from its use of a non-standard tokenizer and a subtoken segmenter. TabNine published a blog post (TabNine, 2019) in July 2019 mentioning the use of GPT-2 in their code completion, but revealed no technical details. To date, we have found no formal investigation of using Transformers for the task of code completion.
There has been a surge of interest since 2019 in extending Transformer models beyond sequential structures, both for NLP (Wang et al., 2019; Ahmed et al., 2019; Nguyen et al., 2020) and for learning from source code (Harer et al., 2019; Shiv and Quirk, 2019). Wang et al. (2019) put constraints on self-attention to induce tree structures. Ahmed et al. (2019), Harer et al. (2019), and Nguyen et al. (2020) modify the attention block to mix node representations according to tree structures. Shiv and Quirk (2019) proposed a tree-induced positional encoding. For learning from source code, it has been shown that exploiting tree structure helped code correction (Harer et al., 2019) and code translation (Shiv and Quirk, 2019).
Source code presents a difficulty shared with natural language processing in handling large vocabularies and rare words. The token/word to be predicted in test data may not appear in the training data. This is even more challenging when predicting identifiers, such as method names and variable names, as developers can come up with arbitrary identifier names. Possible mitigations include copying mechanisms (Allamanis et al., 2016; Brockschmidt et al., 2019; Fernandes et al., 2019) and open-vocabulary models (Cvitkovic et al., 2019; Karampatsis et al., 2020).
We saw significant improvement in performance by providing more tree structure (DFS vs DFSud). Our attempt at DFSud+, a variation of DFSud that enlarges the path relation vocabulary, did not improve performance. This leaves open the possibility that our way of representing AST paths can be improved.
Recent work has also shown the promise of using easy-to-compute static analysis information, such as def-use information. While such information is harder to obtain for dynamic languages, it remains an interesting question how to communicate it to Transformers, and how the result compares to graph neural networks (Li et al., 2016; Allamanis et al., 2018b) that do use it.
For the AST, we want the internal AST nodes to carry only type information, and the leaf nodes to carry only value information. This way, our model predicts one piece of information per node (instead of both type and value). However, in the py150 dataset, there are internal and leaf nodes with both type and value information. To accommodate this, we slightly modify the trees to fit our definition of ASTs: for nodes with both type and value information, we take the value information and create a new node (a leaf node) as the node's first child. Fig 7 illustrates an example of the modification. This increases the average number of nodes in a tree from 623.4 to 951.9.
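A minimal sketch of this transformation, assuming a dict-based node format (`type`, `value`, `children`) rather than the exact py150 serialization:

```python
def split_value_nodes(node):
    """If a node carries both a type and a value, move the value into a
    new leaf inserted as the node's first child, so that internal nodes
    carry only types and leaves carry only values. Applied recursively."""
    if "type" in node and "value" in node:
        leaf = {"value": node.pop("value")}  # new first child: a pure value leaf
        node.setdefault("children", []).insert(0, leaf)
    for child in node.get("children", []):
        split_value_nodes(child)
    return node

tree = {"type": "NameLoad", "value": "foo"}
split_value_nodes(tree)
print(tree)  # {'type': 'NameLoad', 'children': [{'value': 'foo'}]}
```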
For neural network models, we need to set a maximum number of nodes in the tree that the model can take as input. Ideally, we would set the maximum high enough to take in a tree of any size; in practice, this is infeasible due to memory constraints (and the number of nodes is in principle unbounded). We choose the maximum context size (number of nodes) to be 1000, inspired by the maximum context used by GPT-2 models and because this covers > 70% of the training data. For trees with more than 1000 nodes, we deploy a technique adopted from Al-Rfou et al. (2018). Given a large tree, we slice it into shorter segments with a sliding window (in our implementation, we used a stride of 500, which is half the context). For example, if a tree has 1700 nodes, we obtain 3 new shorter trees: nodes 0-999, nodes 500-1499, and nodes 1000-1699. For the last two trees, we take the loss and evaluate only on the nodes that the model has not seen before (1000-1499 and 1500-1699, respectively). In this way, we provide each subsequent segment with some previous context, while increasing the number of training and testing datapoints by a reasonable amount (in our datasets, it doubled the number). An improvement to this sliding window technique would be to maintain the hidden states at each segment to pass along more context information, as explained in Dai et al. (2019).
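A sketch of this slicing scheme over a DFS-linearized node sequence; the `(segment, first_scored_index)` return format is our own convention for marking which nodes contribute to the loss, not necessarily the paper's:

```python
def slice_tree(nodes, max_len=1000, stride=500):
    """Slice a long node sequence into overlapping windows of max_len
    with the given stride. Each segment is paired with the index of the
    first node to score within it: 0 for the first segment, stride for
    the rest, so every node is trained/evaluated on exactly once."""
    if len(nodes) <= max_len:
        return [(nodes, 0)]
    segments, start = [], 0
    while start + max_len < len(nodes):
        segments.append((nodes[start:start + max_len], 0 if start == 0 else stride))
        start += stride
    segments.append((nodes[start:], stride))  # final, possibly shorter segment
    return segments

# A 1700-node tree yields segments 0-999, 500-1499, and 1000-1699,
# scored on nodes 0-999, 1000-1499, and 1500-1699 respectively.
```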
Some Transformers use positional encoding (Vaswani et al., 2017) or positional embedding (Radford et al., 2019) to provide the model with extra positional information about elements. However, our early trials with LeafSeq suggested that positional embedding hurts rather than helps. Thus, we do not use positional encoding or embedding in any of our models. Recently, Shiv and Quirk (2019) introduced tree structure to Transformer models via positional encoding. However, their relative improvement is small compared to what we see with the tree-relational prior in Section 6.
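For reference, the sinusoidal positional encoding of Vaswani et al. (2017) mentioned above assigns each position a fixed vector; a minimal sketch (illustrative, not the paper's code):

```python
import math

def positional_encoding(pos, d_model=8):
    """Sinusoidal positional encoding (Vaswani et al., 2017):
    PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same),
    added to the token embeddings before the first attention layer."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

print(positional_encoding(0))  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```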
Table 9. Value prediction on py150 broken down by token kind, MRR (Acc@1 in parentheses). SrcRNN and Deep3 are prior work; SrcSeq, DFS, and DFSud are our models.

| Token kind | SrcRNN | Deep3 | SrcSeq | DFS | DFSud |
| --- | --- | --- | --- | --- | --- |
| Attribute access | 39.3 (31.6) | 45.3 (41.7) | 55.9 (49.0) | 60.5 (54.4) | 75.6 (73.3) |
| Numeric constant | 40.6 (29.3) | 53.2 (46.4) | 55.9 (45.7) | 63.5 (53.7) | 83.1 (79.0) |
| Name (variable, module) | 38.2 (29.6) | 48.9 (45.4) | 54.1 (46.5) | 66.6 (61.0) | 79.8 (77.4) |
| Function parameter name | 57.7 (54.0) | 58.1 (56.6) | 66.2 (62.8) | 67.2 (63.6) | 87.1 (84.7) |
| All values | 36.6 (29.1) | 43.9 (40.5) | 50.1 (43.4) | 58.0 (52.4) | 73.6 (71.0) |
Table 10. Type prediction on py150 broken down by construct, MRR (Acc@1 in parentheses). Deep3 is prior work; DFS and DFSud are our models.

| Construct | Deep3 | DFS | DFSud |
| --- | --- | --- | --- |
| Function call | 81.6 (74.2) | 88.5 (81.0) | 98.7 (97.5) |
| Assignment | 76.5 (66.7) | 78.9 (64.3) | 98.7 (97.5) |
| Return | 52.8 (40.8) | 67.8 (51.8) | 97.8 (95.9) |
| List | 59.4 (54.2) | 76.0 (65.8) | 97.1 (94.7) |
| Dictionary | 66.3 (61.0) | 15.0 (9.0) | 83.8 (74.3) |
| Raise | 35.0 (27.1) | 63.3 (47.6) | 97.0 (94.6) |
| All types | 81.9 (75.8) | 87.3 (79.6) | 98.7 (97.6) |
Table 11. Value prediction on the internal dataset broken down by token kind, MRR (Acc@1 in parentheses). SrcRNN and Deep3 are prior work; SrcSeq, DFS, and DFSud are our models.

| Token kind | SrcRNN | Deep3 | SrcSeq | DFS | DFSud |
| --- | --- | --- | --- | --- | --- |
| Attribute access | 26.4 (20.9) | 38.5 (36.0) | 41.0 (35.5) | 44.7 (39.9) | 59.3 (56.7) |
| Numeric constant | 32.2 (20.3) | 46.5 (38.2) | 51.7 (40.5) | 61.5 (50.4) | 84.0 (78.6) |
| Name (variable, module) | 25.0 (17.8) | 41.0 (38.2) | 39.3 (32.7) | 50.7 (45.6) | 62.8 (60.1) |
| Function parameter name | 45.5 (42.8) | 50.6 (49.0) | 54.3 (51.7) | 53.3 (49.6) | 73.7 (70.7) |
| All values | 23.8 (17.7) | 36.1 (33.3) | 36.5 (30.7) | 43.9 (38.8) | 58.4 (55.3) |
Table 12. Type prediction on the internal dataset broken down by construct, MRR (Acc@1 in parentheses). Deep3 is prior work; DFS and DFSud are our models.

| Construct | Deep3 | DFS | DFSud |
| --- | --- | --- | --- |
| Function call | 78.2 (70.3) | 86.0 (77.1) | 97.8 (95.9) |
| Assignment | 78.5 (69.1) | 79.7 (65.8) | 98.7 (97.4) |
| Return | 59.9 (47.8) | 72.2 (58.3) | 97.6 (95.5) |
| List | 40.8 (33.9) | 63.1 (48.7) | 94.3 (89.6) |
| Dictionary | 39.8 (31.2) | 23.5 (16.7) | 81.0 (70.4) |
| Raise | 33.5 (25.8) | 59.3 (41.7) | 96.4 (93.5) |
| All types | 79.9 (73.1) | 87.7 (80.2) | 98.0 (96.3) |
DFSud learns weights for the various up-down paths between a node and the other nodes in its context as a component of self-attention. In this part, we inspect the learned weights for these paths in the DFSud model in order to understand which paths are most important for the model's prediction.
There are six attention layers, with six attention heads in each layer, in DFSud. All of them collectively determine the importance of each previous node in the prediction of the next token. We look into the maximally and minimally weighted paths at each attention head. The results are shown in Fig 8. Presumably, the extremely weighted paths are the most salient features for the model's prediction. The more extreme the weight, the more conspicuous the path is among other paths for that particular head.
For example, we found that , , , and are important across multiple heads. , and are particularly up-weighted by some heads, while , , and are particularly down-weighted by some heads. The frequent presence of , and suggests the importance of local syntactic context in next value prediction. The extreme weights of very long paths, e.g. , are at first baffling. However, we found cases where they can be useful, for example, in referring to class names ( in Fig 8(a)) or to related variable names under similar scopes ( in Fig 8(b)).
Deep3's TGEN programs are strictly more expressive than our up-down paths, which are based only on up and down counts. However, for many of the tree walks, we can find corresponding paths that represent the same movement in an AST. For example, the TGEN expression [Up][Up][WRITE_TYPE] is similar to our . WRITE is disregarded, as our models naturally have access to the values at the destination. We collected the most frequently used TGEN tree-walk expressions when evaluating their model (E13) over the py150 test set. Table 13 lists the top equivalent paths and their counts, assuming the node to be predicted is a leaf with a left sibling leaf.
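As an illustrative sketch (our own helper, not the paper's implementation), the up-down relation between two nodes can be computed from parent pointers via their lowest common ancestor:

```python
def up_down_path(parents, a, b):
    """Return (ups, downs): the number of up steps from node a to the
    lowest common ancestor of a and b, and the number of down steps
    from that ancestor to b. `parents` maps each node to its parent
    (the root maps to None)."""
    def chain(n):
        path = [n]
        while parents[n] is not None:
            n = parents[n]
            path.append(n)
        return path
    ca, cb = chain(a), chain(b)
    common = set(ca) & set(cb)
    ups = next(i for i, n in enumerate(ca) if n in common)
    downs = next(i for i, n in enumerate(cb) if n in common)
    return ups, downs

# r is the root with children x and y; x has leaves a and b.
parents = {"r": None, "x": "r", "y": "r", "a": "x", "b": "x"}
print(up_down_path(parents, "a", "b"))  # (1, 1): up to x, down to b
print(up_down_path(parents, "a", "y"))  # (2, 1): up to r, down to y
```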
We found that , and are both extremely weighted by many heads in our DFSud and heavily utilized by Deep3. However, some of the potentially useful paths heavily used by Deep3 are not often extremely weighted by DFSud. For example , potentially useful for knowing the scope of the value to be predicted, appears only once as the maximally weighted path, in layer 5, head 5 of DFSud (Fig 7(a)).