
On-the-Fly Adaptation of Source Code Models using Meta-Learning

by   Disha Shrivastava, et al.

The ability to adapt to unseen, local contexts is an important challenge that successful models of source code must overcome. One of the most popular approaches for the adaptation of such models is dynamic evaluation. With dynamic evaluation, when running a model on an unseen file, the model is updated immediately after having observed each token in that file. In this work, we propose instead to frame the problem of context adaptation as a meta-learning problem. We aim to train a base source code model that is best able to learn from information in a file to deliver improved predictions of missing tokens. Unlike dynamic evaluation, this formulation allows us to select more targeted information (support tokens) for adaptation, that is both before and after a target hole in a file. We consider an evaluation setting that we call line-level maintenance, designed to reflect the downstream task of code auto-completion in an IDE. Leveraging recent developments in meta-learning such as first-order MAML and Reptile, we demonstrate improved performance in experiments on a large scale Java GitHub corpus, compared to other adaptation baselines including dynamic evaluation. Moreover, our analysis shows that, compared to a non-adaptive baseline, our approach improves performance on identifiers and literals by 44% and 15%, respectively. Our implementation can be found at:



1 Introduction

The availability of large corpora of open source software code like GitHub and the development of scalable machine learning techniques have created opportunities for the use of deep learning to develop models of source code (Allamanis et al., 2018a). Hindle et al. (2012) first suggested the use of statistical language models for source code, as for natural language. Such models are usually designed to take as input a window of tokens and produce a predictive distribution for what the next token might be. However, modelling source code poses several challenges that are different from those in natural language. First, the size of the vocabulary that a source code model must handle grows substantially, particularly because of identifiers (such as names of classes, methods and variables). According to Karampatsis and Sutton (2019), the out-of-vocabulary (OOV) rate is only 0.32% for the one billion-word benchmark corpus of English, while for the Java GitHub corpus (the dataset used in this work), it is larger than 13%. Second, source code is more localized, i.e., it tends to take repetitive form in local contexts (Tu et al., 2014). For example, a particular identifier is likely to occur multiple times within the same class or file. Third, while often-used software corpora evolve quite fast (with bugs being fixed or new features deployed), the rate of evolution for natural language corpora is much slower (Hellendoorn and Devanbu, 2017). According to Allamanis and Sutton (2013), in the Java GitHub corpus test set, each project introduces on average 56.49 original identifiers (not seen in the training set) per thousand lines of code. There are also coding styles and conventions that are specific to each file and may not necessarily be seen in the training data. Each organization or project may impose its own unique conventions related to code ordering, library and data structure usage, and naming. Additionally, developers can have personal preferences in coding style (e.g., preferring one loop-variable name over another).

Figure 1: Block diagram illustrating our approach for a sample file. (Left) Sample file where the hole target (dark orange) along with the hole window (light orange), and support tokens (dark blue) along with support windows (light blue), are highlighted; (Right) To predict the hole target StandardPropertyManager from its hole window, our model obtains adapted parameters θ_k by performing k steps of gradient update on the support tokens and support windows in its inner loop. This is followed by updating the initial parameters θ in the outer loop during meta-training.

These motivate us to develop models that adapt their parameters to unseen contexts “on the fly”. That is, they are trained to explicitly and efficiently adapt to test files, even if the file contains identifiers and conventions that were unseen at training time.

A popular approach for model adaptation employed for natural language (Mikolov et al., 2010; Krause et al., 2018) and also advocated for source code (Karampatsis and Sutton, 2019) is dynamic evaluation. With dynamic evaluation, we allow updating the parameters of a trained model on tokens in test files, from the first token to the last. To avoid bias and an unrealistically optimistic measure of performance (i.e., cheating), the prediction of a token in a test file is made before the model's parameters are updated on that token. In this work, we argue for a different approach to adaptation that we think is better suited to the workflow of a developer.

To reflect the way a software developer uses auto-completion in an IDE, we consider an evaluation setting that we call line-level maintenance. We imagine a cursor placed before a random token in a given file. We blank out the remainder of the line following the cursor to simulate a developer making an in-progress edit to the file. The task is then to predict the token (or hole target) that follows the cursor. This setting is different from the language modelling setting, where a test file is generated from scratch one token at a time, from top to bottom. Similarly, dynamic evaluation is ill-suited to this setting, as it processes tokens in that same order. Instead, we propose to select targeted information from both before and after the hole as a basis for adaptation.

To formalize the incorporation of targeted information for adaptation, we introduce Targeted Support Set Adaptation (TSSA), which leverages the notion of support windows and support tokens retrieved “on the fly” at test time. Consider the example illustrated in the left part of Figure 1. It presents the specific task of predicting, on line 20, a hole target (dark orange shading) from its hole window (light orange shading), i.e., its preceding tokens. Note that the semicolon at the end of line 20 is ignored as it is part of the blanked-out region (black shading). To improve this prediction, in TSSA we leverage support tokens, which are tokens from around the file that we believe to be particularly influential in defining the nature of the local context. Intuitively, these could be tokens that are unique to the file and hence provide a strong signal for adaptation. Analogous to the hole window, a support window is the window of tokens that precedes a support token. In Figure 1, lines 3 and 160 show the corresponding support windows (light blue shading) and support tokens (dark blue shading).

To best leverage the pairs of support windows/tokens, we propose to frame this problem as meta-learning. Specifically, we propose to train the parameters of a base source code model that first adapts to support tokens from a file before predicting the hole target. This is done using meta-learning methods, such as MAML (Finn et al., 2017) and Reptile (Nichol et al., 2018). These methods are capable of training models that incorporate an inner loop of gradient-based optimization. In our formulation, the inner loop predicts support tokens from support windows and takes multiple gradient steps to update the parameters of the source code model and reduce the loss of its predictions. The updated parameters are then used to predict the hole target from the hole window, and the full process is trained in an outer loop so as to “learn to learn” to minimize the loss on the hole prediction task. Meta-learning thus provides an outer loop update on the initial point of the inner loop optimization, such that a better prediction of the hole target is ultimately achieved.

Via experiments on a large-scale Java GitHub corpus, we demonstrate a significant improvement in performance from TSSA combined with meta-learning, as compared to other adaptive and non-adaptive baselines including dynamic evaluation (Section 4.4). We perform ablation studies that analyse the role of different components of TSSA and how they contribute towards its performance (Section 4.5). We carry out a study where we contrast the performance of a high capacity non-adaptive model with a small capacity model that uses TSSA and meta-learning to identify cases where interesting improvements arise (Section 4.6). A block diagram of our approach can be found in the right part of Figure 1. To summarize, our contributions are as follows:

  • We consider a new setting that we call line-level maintenance for evaluating models for source code in a way that is directly inspired by the way developers operate in an IDE (Section 3.1).

  • We introduce TSSA, which formulates the problem of adaptation to local, unseen context in source code by retrieving targeted information (support tokens) from both before and after the hole in a file (Section 3.2.2).

  • We demonstrate that TSSA can be used to formulate a meta-learning objective that can be successfully optimized by recent meta-learning methods (Section 3.2).

  • We demonstrate experimentally that TSSA combined with meta-learning outperforms other adaptation baselines including dynamic evaluation. Further, via ablations we show that we improve performance on identifiers and literals by about 44% and 15%, respectively.

2 Related Work

There have been numerous efforts in developing models for source code. Hindle et al. (2012) established the naturalness hypothesis, which states that human-written source code has statistical structure arising from its use as a human-to-human communication channel and is amenable to statistical language models. They used an n-gram language model for source code. Nguyen et al. (2013) combined n-gram models with semantic information of tokens, global technical concerns of the source files and the pairwise associations of code tokens. Raychev et al. (2015) and Bichsel et al. (2016) used conditional random fields by posing the problem of predicting properties of source code as structured prediction in a probabilistic graphical model. Some generative models for source code (Maddison and Tarlow, 2014; Raychev et al., 2016; Bielik et al., 2016) generalized probabilistic context-free grammars, thus capturing rich context relevant to programs. The use of deep sequence models like RNNs (White et al., 2015) and LSTMs (Dam et al., 2016) instead of n-grams has shown promising results. Allamanis et al. (2018b) use graph neural networks to learn syntactic and semantic structures over program graphs.

To tackle the specific challenge of local context adaptation (and particularly the out-of-vocabulary (OOV) problem), Tu et al. (2014) combined an n-gram model with the concept of a cache. Later, Hellendoorn and Devanbu (2017) extended this idea to develop nested n-gram models combined with a cache. The components in the cache could then come not only from the current file, but also from other files in the directory or project, leading to significant improvements in performance. This idea could be adapted to our setting by collecting support tokens beyond just the current file. They showed that the nested model is able to beat an off-the-shelf (but non-adaptive) LSTM-based language model.

Follow-up work from Karampatsis and Sutton (2019) has established the current state of the art. They use deep recurrent models based on subword units to solve the out-of-vocabulary (OOV) problem. We follow their approach and adopt a subword-level model as well. Notably, they use dynamic evaluation as a means for adaptation. In their proposed setting, they apply dynamic evaluation by performing updates using information from all the files in a project and carrying over the updated value of parameters from one test file in the project to another during evaluation. However, on average this results in a long chain of adaptation steps before a prediction is made, which may present challenges when deploying in a real IDE (e.g., how to do quality control when the parameters used in the deployed system won’t be known at release time?). In this work, we instead focus on and perform controlled experiments in a single-file setting with a much smaller number of allowed update steps, which is more generally applicable.

Though we build on Karampatsis and Sutton (2019) by using a simple recurrent network architecture for our base source code model, there have been other works trying to improve on that architecture specifically for source code. Bhoopchand et al. (2016) explored the use of a sparse pointer network, while Li et al. (2018) employed a pointer mixture network using the structure of the Abstract Syntax Tree (AST). All these works model source code in a language modelling setup where generation takes place one token at a time based on a preceding window. Our setup is somewhat different because we define the task in terms of predicting a hole at line level in a file where broad context can be present both before and after the hole. Regardless, the focus of our work is to use meta-learning along with our notion of support tokens in a line-level maintenance setting for local adaptation. This idea is agnostic to the actual base model used to predict missing tokens. In fact, our meta-learning formulation can readily be combined with the above-mentioned models, which may lead to further benefits in performance.

3 Methodology

3.1 Line-level Maintenance

As discussed above, the line-level maintenance task is meant to reflect how a developer typically interacts with an auto-complete system in an IDE. The setting is a direct extension of the file-level maintenance task proposed by Hellendoorn and Devanbu (2017), where we are given the training set and the contents of all but one file in a test set software project and asked to generate the held out file in a left-to-right order. This is repeated for each file, so that each file must be predicted given all other files. Hellendoorn and Devanbu (2017) argue that this is infeasible for vanilla neural language models, because it would require training a separate model for each file in the test set. This motivates the need for models that can adapt on the fly.

The line-level maintenance task is both more realistic (developers typically edit files rather than generating them from left to right) and creates the need for stronger forms of adaptation. Carrying the reasoning above through, a vanilla neural language model would require training a separate model for each (remainder of a) line that needs predicting, which is clearly not feasible. Our high-level contribution to this discussion is to show that even in this more extreme case of line-level maintenance, we can build neural language models that adapt on the fly.

More concretely, we refer to a file as a sequence of tokens. As per Karampatsis and Sutton (2019), we represent each token as a list of subtokens. Our task is to predict the first token (called the hole target) in the blanked-out range, which occurs at a particular position in the file. For an example, refer to Figure 6 where the hole target is highlighted in dark orange and the blanked-out range is highlighted in black. Note that we are not allowed to use any token from the blanked-out range.
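As an illustration of the setting described above, the following is a minimal sketch of how a hole could be carved out of a tokenized file. The function name `make_hole`, the per-line token lists, and the `window_size` parameter are assumptions made for this example and are not taken from the paper's implementation.

```python
from typing import List, Tuple


def make_hole(lines: List[List[str]], line_idx: int, tok_idx: int,
              window_size: int) -> Tuple[List[str], str, List[str]]:
    """Return (hole_window, hole_target, visible_tokens) for a cursor position.

    `lines` is a tokenized file, one token list per line. The cursor sits just
    before lines[line_idx][tok_idx]; that token and the rest of its line are
    treated as blanked out and never returned as visible context.
    """
    # Tokens strictly before the cursor, in file order.
    before = [t for line in lines[:line_idx] for t in line] + lines[line_idx][:tok_idx]
    # Tokens on the lines after the edited line remain visible.
    after = [t for line in lines[line_idx + 1:] for t in line]
    hole_target = lines[line_idx][tok_idx]
    # The hole window is the (assumed fixed-length) run of tokens preceding the hole.
    hole_window = before[-window_size:] if window_size > 0 else []
    return hole_window, hole_target, before + after
```

Support tokens would then be drawn only from the returned visible tokens, so that nothing from the blanked-out range leaks into adaptation.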

3.2 Meta-Learning for Adaptation

3.2.1 Base Model

We begin by defining a base model, which is a Seq2Seq (Sutskever et al., 2014) model trained to predict the sequence of subtokens in the hole target from the sequence of subtokens in the hole window using parameters θ. The probability of a hole target given its window can be written as
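A minimal reconstruction of this probability, assuming the standard autoregressive factorization of a Seq2Seq decoder. The symbols t^h (hole target), w^h (hole window) and s_1, ..., s_m (the subtokens of the hole target, including the end-of-token symbol) are notation assumed here for illustration:

```latex
p(t^h \mid w^h; \theta) \;=\; \prod_{j=1}^{m} p\left(s_j \mid s_1, \dots, s_{j-1}, w^h; \theta\right)
```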


During training of the base model, each token in the file is used as a hole target.

3.2.2 Targeted Support Set Adaptation (TSSA)

To adapt the base model to the local file context, we consider regions from the file that potentially provide useful cues for predicting a given hole target. We call this set of tokens and preceding windows the support set, inspired by the usage of the term in few-shot learning (Vinyals et al., 2016). Each element of the support set is a pair consisting of a support window and a support token. The support windows and support tokens can come from anywhere in the file except for the blanked-out remainder of the line following the hole target.

To adapt the model given a support set, we perform k steps of gradient descent over mini-batches of support windows and tokens. In each step, we predict the support token from the corresponding support window using the base model with parameters from the previous step. The support loss at step i and the updated parameters at step i can be written as


where the mini-batch size and the inner learning rate are hyperparameters. We then use the updated parameters from the final step, θ_k, to predict the hole target from its hole window, resulting in the hole loss
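The inner-loop quantities described in this section can be sketched as follows. This is a reconstruction from the prose, not a verbatim copy of the paper's equations: B_i (the i-th mini-batch of support pairs), b (the mini-batch size), α (the inner learning rate), and the superscripts s and h for support and hole are assumed notation.

```latex
\mathcal{L}^{s}_{i}(\theta_{i-1}) = -\frac{1}{b} \sum_{(w^s,\, t^s) \in B_i} \log p\left(t^s \mid w^s; \theta_{i-1}\right),
\qquad
\theta_{i} = \theta_{i-1} - \alpha \, \nabla_{\theta_{i-1}} \mathcal{L}^{s}_{i}(\theta_{i-1}),
\qquad i = 1, \dots, k, \quad \theta_0 = \theta

\mathcal{L}^{h}(\theta_k) = -\log p\left(t^h \mid w^h; \theta_k\right)
```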


3.2.3 Multi-step Meta-Learning

Section 3.2.2 describes how to adapt the base model at test time, and it is possible to apply this adaptation without having explicitly trained parameters that are suitable for it. That is, we can train the base model as in Section 3.2.1 and then apply adaptation at evaluation time. Results in Table 2 show that this improves performance over the base model. However, to make the most of the adaptation step, we would like to learn parameters that are explicitly suitable for adaptation. This motivates casting the problem as meta-learning. Meta-learning is concerned with designing models that can learn to adapt to new settings by learning the parameters of the learning process. Model Agnostic Meta-Learning (MAML) (Finn et al., 2017) is an optimization-based approach where the parameters of the model are explicitly trained such that a small number of gradient steps with a limited amount of training data from a new task will produce good generalization performance on that task.

In preliminary experiments, we found that adaptation stages with more steps (larger k) were crucial for obtaining good performance. This motivated us to pursue meta-learning approaches that scale gracefully with the number of inner loop updates. In one of the computationally cheap variants of MAML called first-order MAML (FOMAML), the second-order updates from differentiating through the gradient update are omitted. Reptile (Nichol et al., 2018) is similar to FOMAML except for a difference in its outer update. For both FOMAML and Reptile, the outer-loop updates can be written purely as a function of the ultimate adapted parameters θ_k, making their memory requirement independent of k.

More precisely, FOMAML prescribes an update for θ that follows the direction of the gradient of the hole loss with respect to θ_k. On the other hand, Reptile uses the difference between the updated parameters and the original parameters to determine the update direction. The outer loop update equations for Reptile and FOMAML during meta-training are written as
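A sketch of the two outer-loop updates, reconstructed from the description above. The outer step size β and the hole-loss symbol are assumed notation, and the ordering here is not meant to imply the paper's equation numbering:

```latex
\text{Reptile:} \quad \theta \leftarrow \theta + \beta \, (\theta_k - \theta)

\text{FOMAML:} \quad \theta \leftarrow \theta - \beta \, \nabla_{\theta_k} \mathcal{L}^{h}(\theta_k)
```

Both expressions depend only on θ and the final adapted parameters θ_k, which is why no intermediate inner-loop state needs to be stored.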


3.2.4 Support Set Selection Strategies

A key novelty in this work is the idea of actively choosing a support set that leads to effective adaptation. This is in contrast to, e.g., few-shot learning, where the support set is defined by the task and cannot be changed.

In source code, identifiers (variable names, function names) are the most difficult to predict (Allamanis and Sutton, 2013) and also the most frequent of all token-types (Broy et al., 2005), making it the most common use-case for auto-complete systems. Thus, our definitions of support tokens are aimed at providing additional context that should help in predicting identifiers. We are motivated by the fact that identifiers are frequently re-used within a file even if they are uncommon across files (or even if they only appear in one file). Further, even when there is not an exact match, it is common for there to be repeated substructure in identifiers (e.g., we might see WidgetRequestBuilder, WidgetRequestHandler and WidgetResponseHandler appear in one file).

With this in mind, we explored the following definitions of support tokens (which contribute towards determining the support sets). In all cases, we ensure that the selection of support sets does not depend on the hole target or the blanked out region following the hole target.

  1. Vocab: We try to capture tokens that are rare in the corpus as support tokens. We take all the tokens from the file, sort them by increasing frequency in the vocabulary (rarest first), and then take the top entries, up to the chosen number of support tokens.

  2. Proj: Here, our aim is to capture support tokens that are relatively common in the current project but rare in the rest of the corpus. We divide each token’s frequency in the project by its frequency in the vocabulary, sort the tokens by this ratio in decreasing order, and then take the top entries, up to the chosen number of support tokens.

  3. Unique: To study if multiple occurrences of the same token in the support set helps, we form a set of tokens in the file. We then take a subset of tokens as part of our support set. Here, each support token in the support set is unique.

  4. Random: We take random tokens from the file as support tokens.
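The four definitions above can be sketched as a single selection routine. This is a hedged illustration only: the function name, the `vocab_freq` and `project_freq` dictionaries, the tie-breaking by file order, and the random subset used for Unique are all assumptions, since the text does not specify them. The caller is expected to pass only tokens outside the hole target and the blanked-out region.

```python
import random
from typing import Dict, List, Optional


def select_support_tokens(file_tokens: List[str], n: int, strategy: str,
                          vocab_freq: Dict[str, int],
                          project_freq: Optional[Dict[str, int]] = None,
                          seed: int = 0) -> List[str]:
    """Pick up to n support tokens from the visible tokens of a file."""
    rng = random.Random(seed)
    if strategy == "vocab":
        # Rarest in the corpus first; sorted() is stable, so ties keep file order.
        ranked = sorted(file_tokens, key=lambda t: vocab_freq.get(t, 0))
    elif strategy == "proj":
        if project_freq is None:
            raise ValueError("proj strategy needs project_freq")
        # High project frequency relative to corpus frequency first.
        ranked = sorted(
            file_tokens,
            key=lambda t: project_freq.get(t, 0) / max(vocab_freq.get(t, 0), 1),
            reverse=True,
        )
    elif strategy == "unique":
        # De-duplicate, then take a random subset (subset rule is an assumption).
        ranked = list(dict.fromkeys(file_tokens))
        rng.shuffle(ranked)
    elif strategy == "random":
        ranked = list(file_tokens)
        rng.shuffle(ranked)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[:n]
```

In a fuller implementation each selected token would be paired with the window of tokens preceding its position, giving the (support window, support token) pairs of the support set.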

Putting all the components together, we arrive at Algorithm 1. Note that the outer-loop update (step 11) is performed only during meta-training.

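Require: θ = base model parameters
Require: number of iterations; k = number of inner-loop updates
1:  for iteration = 1 to number of iterations do
2:     Sample a file
3:     Set θ_0 = θ
4:     Sample a hole target and its corresponding hole window
5:     Retrieve support set from file
6:     for i = 1 to k do
7:        Calculate the support loss using θ_{i-1} and Equation 2
8:        Update θ_{i-1} to θ_i with a gradient step on the support loss
9:     end for
10:    Calculate the hole loss using θ_k
11:    Update θ using Equation 5 or Equation 6 {Only during meta-training}
12:  end for
Algorithm 1: TSSA with Meta-Learning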
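To make the control flow of Algorithm 1 concrete, here is a toy sketch that replaces the Seq2Seq base model with a tiny bigram softmax model (a one-token window). Everything here is an assumption for illustration: the toy model, plain SGD in the inner loop (Section 4.2 reports using Adam), and the names `adapt`, `outer_update`, `alpha` and `beta`.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]


def loss_and_grad(theta, pairs):
    """Mean cross-entropy (nats) and gradient for a toy bigram softmax model.

    theta[w][t] is the logit of token t given a one-token window w;
    pairs is a list of (window, token) index pairs.
    """
    v = len(theta)
    grad = [[0.0] * v for _ in range(v)]
    loss = 0.0
    for w, t in pairs:
        p = softmax(theta[w])
        loss -= math.log(p[t])
        for j in range(v):
            grad[w][j] += (p[j] - (1.0 if j == t else 0.0)) / len(pairs)
    return loss / len(pairs), grad


def adapt(theta, support, k, batch_size, alpha):
    """Inner loop: k SGD steps over mini-batches of support pairs."""
    theta_i = [row[:] for row in theta]  # theta_0 = theta, left untouched
    for i in range(k):
        start = (i * batch_size) % len(support)
        batch = [support[(start + j) % len(support)] for j in range(batch_size)]
        _, g = loss_and_grad(theta_i, batch)
        theta_i = [[p - alpha * d for p, d in zip(pr, gr)]
                   for pr, gr in zip(theta_i, g)]
    return theta_i


def outer_update(theta, theta_k, hole, beta, method):
    """Outer loop: Reptile or first-order MAML update of the initial theta."""
    if method == "reptile":
        return [[p + beta * (q - p) for p, q in zip(pr, qr)]
                for pr, qr in zip(theta, theta_k)]
    # FOMAML: gradient of the hole loss evaluated at theta_k, applied to theta.
    _, g = loss_and_grad(theta_k, [hole])
    return [[p - beta * d for p, d in zip(pr, gr)] for pr, gr in zip(theta, g)]
```

At evaluation time only `adapt` followed by a hole prediction with the adapted parameters would be run; `outer_update` belongs to meta-training.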

4 Experiments and Results

4.1 Dataset and Preprocessing

For our experiments, we work with the Java GitHub Corpus provided by Allamanis and Sutton (2013). It consists of open-source Java repositories for more than 14000 projects. Java is a convenient choice as it is one of the most popular languages for software development and has been widely used in previous works (Hellendoorn and Devanbu, 2017; Karampatsis and Sutton, 2019; Tu et al., 2014). Following Hellendoorn and Devanbu (2017), we focus on a 1% subset of the corpus. The names of the projects in the training, validation and test splits of the dataset were taken from Hellendoorn and Devanbu (2017). Statistics of the data are provided in Table 1. Note that while we show results on Java, our method is otherwise applicable to corpora of any programming language.

Feature          Train     Val      Test
# Projects       107       36       38
# Files          12934     7185     8268
# Lines          2.37M     0.50M    0.75M
# Tokens         15.66M    3.81M    5.31M
# Identifiers    4.68M     1.17M    1.79M
Table 1: Corpus statistics for the 1% split of the dataset. M indicates numbers in millions.

We made use of the lexer provided by Hellendoorn and Devanbu (2017) to tokenize the files, preserving line breaks. Note that the lexer also removes comments in the file. We need to use a Java-specific tokenizer because characters such as the dot or semicolon take a special meaning in Java and are not tokenized as individual tokens by NLP parsers. To get the Java token-types, we made use of Python’s Java-parser. Subword tokenization was performed using the subword text encoder provided by Tensor2Tensor (Vaswani et al., 2018). As in Karampatsis and Sutton (2019), we use a separate vocabulary data split, consisting of a set of 1000 randomly drawn projects (apart from the projects in the 1% split), to build the subword text encoder. In addition, we append an extra end-of-token symbol (EOT) at the end of each Java token. The final size of the subword vocabulary is 5710. To reduce model computation while decoding, we remove hole targets of length greater than or equal to 20 subwords. These constitute only 0.2% of the total number of tokens in training data and 0.1% in validation and test data, so removing them has little effect on the results.

4.2 Model Configurations

All our models are Seq2Seq networks where both encoder and decoder networks are recurrent networks with a single layer of 512 GRU (Cho et al., 2014) hidden units, preceded by a trainable embedding layer of equal size. For adaptation models, we reiterate that the same model structure is used in the inner loop (to predict support tokens from support windows) and outer loop (to predict hole targets from hole windows). To train the base model, we create mini-batches of successive target holes as in standard training of language models, and we train to minimize average token loss. However, during meta-training, it is more difficult to batch across holes because each hole target has its own set of support tokens and its own updated parameters θ_k. We use batches composed of single holes for meta-training and repeatedly iterate across files during training, choosing random target holes to train on. Note, however, that we use mini-batches of support tokens in the inner loop of adaptation. We use the Adam (Kingma and Ba, 2015) optimizer in the inner loop optimization. Moreover, for FOMAML, instead of the constant step size update of Equation 5, we use the update calculated by Adam. An important note is that during evaluation, at the beginning of each inner loop execution, we not only reset the adapted parameters to the (meta-)trained parameters θ, but also reset the state of the Adam optimizer to its value from the end of (meta-)training. The latter step ensures that the statistics for Adam are not carried from one file to another. Details of the best hyperparameter values for all settings can be found in Appendix A.

#Support Tokens   Base model      Dynamic Evaluation   TSSA-1          TSSA-k          TSSA-Reptile    TSSA-FOMAML
[missing]                                              4.499 ± 0.08    3.636 ± 0.07    3.605 ± 0.06    3.530 ± 0.06
[missing]         5.384 ± 0.10    3.912 ± 0.08         4.459 ± 0.08    3.637 ± 0.07    3.562 ± 0.07    3.497 ± 0.06
[missing]                                              4.935 ± 0.09    3.659 ± 0.07    3.559 ± 0.07    3.506 ± 0.06
Table 2: Performance on hole target prediction on test data in terms of token cross-entropy. We also report 95% confidence intervals for each entry. Note that the first two methods are independent of the number of support tokens. Also, the first method column is a non-adaptive method, and subsequent columns are adaptation-based methods.

4.3 Evaluation Setup

Metric: Cross-Entropy. It is the average negative log probability of tokens, as assigned by the model. It rewards accurate predictions with high confidence and also corresponds to the average number of nats required in predicting a token. We evaluate the average under a distribution over hole target tokens where we first sample a file uniformly from the set of all files and then sample a hole target token uniformly from the set of all tokens in the file. This reflects the assumption that a developer opens a random file and then makes an edit at a random position in the file.
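The metric and its sampling distribution can be sketched as follows. The function names and the representation of a file by its token count are assumptions for this example; the natural logarithm is used so that the result is in nats.

```python
import math
import random


def sample_hole_positions(file_lengths, n_samples, seed=0):
    """Sample (file index, token index) pairs: file uniformly, then token uniformly."""
    rng = random.Random(seed)
    positions = []
    for _ in range(n_samples):
        f = rng.randrange(len(file_lengths))
        positions.append((f, rng.randrange(file_lengths[f])))
    return positions


def mean_cross_entropy(target_probs):
    """Average negative log probability (in nats) assigned to the true hole targets."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)
```

Because files are sampled first, a token in a short file carries more weight than a token in a long file, which differs from averaging over all tokens in the corpus.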

Constraint: Number of Updates Per Hole Target. There is a trade-off between accuracy and number of inner loop updates of adaptation. More inner loop updates generally improve cross-entropy but come at the cost of computation time and ultimately latency in a downstream auto-complete application. To control for this, we fix the size of batches and number of updates per hole target prediction across all adaptive methods.

Methods. We wish to demonstrate the success of our proposed approach, TSSA, when combined with meta-learning. For this purpose, we consider two instances of this approach, TSSA-FOMAML and TSSA-Reptile (using FOMAML and Reptile respectively). Before applying meta-learning, both methods are initialized from a pretrained base model. We then also consider comparisons with the following baselines:

Base model: This is the pretrained base model used as is, without any contextual adaptation. This comparison allows us to confirm the benefit of adaptation in general.

TSSA-k: This corresponds to using support set adaptation as in TSSA-FOMAML and TSSA-Reptile, but from the pretrained base model, without any meta-learning. This comparison allows us to measure the specific benefit brought by meta-learning. Otherwise the same number of inner-loop updates is used. We also report results for TSSA-1 (i.e., a single inner-loop update), to highlight the value of multiple updates.

Dynamic Evaluation: We also implement dynamic evaluation in our framework. As mentioned in Section 2, this is a bit different from the setting in Karampatsis and Sutton (2019). Here, 1) the support sets are made of all window/token pairs appearing before the hole target (and none after), and 2) we constrain the inner-loop optimization to order its updates by starting at the beginning of the file and proceeding until the token right before the hole target. Thus, the first inner-loop mini-batch contains tokens at the beginning of the file, while the tokens immediately before the hole target only appear in the last mini-batch. Moreover, the total number of updates depends on the position of the hole target in the file: the later the hole, the more updates are performed.
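The ordering constraint of this baseline can be illustrated with a short sketch. The helper name and the choice of a final partial mini-batch are assumptions; under them the number of updates for a hole preceded by n tokens is the ceiling of n divided by the mini-batch size, which may not match the exact count used in the experiments.

```python
def dynamic_eval_batches(n_before_hole, batch_size):
    """Index mini-batches over the tokens preceding the hole, in file order."""
    return [list(range(start, min(start + batch_size, n_before_hole)))
            for start in range(0, n_before_hole, batch_size)]
```

For example, with five preceding tokens and a mini-batch size of two, the updates use token indices [0, 1], then [2, 3], then [4], so the tokens closest to the hole are seen last.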

The variants of TSSA assume a fixed number of inner-loop updates, unlike dynamic evaluation. To allow for an overall fair comparison, we set k to the average number of updates performed by dynamic evaluation, which was found to be approximately 16 for our test data.

4.4 Performance on Hole Target Prediction

In Table 2, we report the average cross-entropy for test hole targets. In these results, we sample five holes per file to measure test performance. For each method, we select the best values of hyperparameters using the performance on the validation data. As can be seen from the table, both of our meta-learning formulations (TSSA-FOMAML and TSSA-Reptile) show significant improvement in performance, with TSSA-FOMAML giving the best performance in terms of token cross-entropy. Also, methods which involve multiple steps of adaptation (including dynamic evaluation) fare better than single-update or non-adaptive baselines. This highlights the importance of multiple gradient update steps for obtaining better performance during adaptation. TSSA-k outperforms dynamic evaluation, though by a smaller margin than the meta-trained variants.

Figure 2: Variation of hole target cross-entropy values with number of updates and number of support tokens for val data

4.5 Ablation Studies

In this section, we try to draw insights into the workings of our framework by analyzing the role of each component. We use our best performing meta-learned model for all the experiments that follow. In Figure 2, we plot the variation of hole target cross-entropy values with the number of updates and the number of support tokens (as defined in Section 3.2.4), for validation data. As can be seen from the plot, the cross-entropy decreases with more updates and initially goes down with the number of support tokens.

We also experimented with the definition of support tokens where in one case we fixed the number of updates (16), while in the second we fixed the number of support tokens (512). Figure 3 displays the results for validation data. We see that the Vocab definition of support tokens performs best closely followed by Proj. On the other hand, Unique and Random perform worse in both cases. This highlights the fact that how we define support tokens indeed plays a role in performance improvement.

We also see that for a fixed number of updates, the cross-entropy decreases with the number of support tokens only until it reaches a certain point after which it increases. This likely arises from the way we form mini-batches of support tokens where we first shuffle the support tokens and then cycle through them until exhausting the number of updates. Further, the average number of tokens per file is about 819. This suggests that going past this point where each support token has been visited once creates redundancy that is detrimental. This could also explain the early spike for Unique, since the support set size for that setting is much smaller.
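The mini-batching procedure described above (shuffle the support tokens once, then cycle through them until the update budget is spent) can be sketched as follows. This is a minimal illustration and not the released implementation: the function names, the fixed seed, and the choice to wrap around inside a batch are assumptions made for the example.

```python
import random


def support_minibatches(support_tokens, batch_size, num_updates, seed=0):
    """Return one mini-batch of support tokens per inner-loop update.

    Hypothetical helper: shuffles once, then cycles through the shuffled
    list, wrapping around when the end is reached.
    """
    if not support_tokens:
        raise ValueError("need at least one support token")
    order = list(support_tokens)
    random.Random(seed).shuffle(order)  # shuffle once up front
    batches = []
    pos = 0
    for _ in range(num_updates):
        # Take batch_size consecutive tokens, wrapping around if needed.
        batch = [order[(pos + i) % len(order)] for i in range(batch_size)]
        batches.append(batch)
        pos = (pos + batch_size) % len(order)
    return batches


def visits_per_token(num_tokens, batch_size, num_updates):
    """Average number of times each support token is used in an update."""
    return batch_size * num_updates / num_tokens


batches = support_minibatches(list(range(256)), batch_size=20, num_updates=16)
print(len(batches), visits_per_token(256, 20, 16))
```

With a support batch size of 20 and 16 updates (the evaluation values in Table 5), 320 token slots are consumed, so a support set of 256 tokens is partly revisited while a set of 1024 tokens is only partly covered.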

Figure 3: Variation of token cross-entropy for val data with different definition of support tokens. (Left) With fixed number of updates; (Right) With fixed number of support tokens.

Figure 4: Average hole target cross-entropy for each token-type for our TSSA-FOMAML model

Token Type | Base model | TSSA-FOMAML | % Improvement
--- | --- | --- | ---
Identifiers | 13.77 | 7.66 | 44.37
Literals | 6.04 | 5.16 | 14.57

Table 3: Comparison of performance on prediction of identifiers and literals for our model vs. a non-adaptation model

We also analysed how our framework performs with hole targets of different token-types. For ease of visualization, we grouped the token-types from the Java parser into five broad categories (details in Appendix B). As can be seen from Figure 4, identifiers are the most difficult to predict, followed by literals. Note that literals here include string literals, char literals, etc. Table 3 compares the average test token cross-entropy of the non-adaptive base model with that of our meta-learned model for identifiers and literals. As can be seen from the table, we obtain significant reductions in cross-entropy of about 44% for identifiers and 15% for literals. This in turn leads to better overall performance.
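The percentage improvements in Table 3 are relative reductions in cross-entropy with respect to the base model. As a quick check, the short sketch below recomputes them from the tabulated values; the function name is ours and is used only for illustration.

```python
def percent_reduction(base, adapted):
    """Relative reduction in cross-entropy, as a percentage of the base value."""
    return 100.0 * (base - adapted) / base


# Cross-entropy values copied from Table 3.
table3 = {
    "Identifiers": (13.77, 7.66),
    "Literals": (6.04, 5.16),
}

for token_type, (base, adapted) in table3.items():
    print(token_type, round(percent_reduction(base, adapted), 2))
```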

4.6 TSSA vs. Bigger Model

One question is whether the benefits gained from meta-learning are similar to or orthogonal to the benefits that would arise from using larger and more sophisticated models. To study this question, we start from a “small base model” (256 hidden units) and build two models that improve on it in different directions. The first, the “big base model”, increases the model size to 512 hidden units. The second, the “small TSSA-FOMAML” model, leaves the hidden size fixed but employs TSSA-FOMAML. We then compare how individual examples benefit from each kind of modelling improvement.

Figure 5: Improvement due to meta-learning on the small-capacity model vs. improvement due to the big model. TSSA with meta-learning on the small model does better than the big model without meta-learning in 63.9% of cases.

Specifically, we measure the hole target cross-entropy for the small base model, the big base model, and the small TSSA-FOMAML model. In Figure 5 we plot the improvement over the small base model obtained by the higher-capacity model on the x-axis and the improvement obtained by the low-capacity meta-learnt model on the y-axis. Each point corresponds to a different test hole. The diagonal line marks cases where the improvement from both models is equal.

First, we see that the majority (63.9%) of the points are above the line, indicating that applying TSSA-FOMAML improves on more cases than increasing the model size. Second, and perhaps more interestingly, there are many points where the improvement due to increasing model size is near 0, indicating that we have achieved saturation in benefit due to increasing model size in these cases. However, using meta-learning here, even with the small model, often leads to a large improvement in performance. This shows that TSSA-FOMAML can help in adapting even when we reach saturation in terms of model capacity. In Figure 6 we showcase two such sample cases. The portions highlighted in black indicate the blanked-out range. For the left one, we have a string literal as hole target (“column(”). We can see that fragments of it can be found in support tokens (highlighted in blue). The right one has an identifier (WGLOG) as hole target. Somewhere far later in the file, we find a support token that exactly matches the hole target, contributing to a large gain in performance of TSSA-FOMAML as compared to no meta-learning. In neither of these cases does a larger or more sophisticated base model help in harnessing this extra information.
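The per-hole comparison behind Figure 5 can be sketched as below. This is an illustrative reading rather than the authors' analysis script: it assumes that "improvement" means the difference in hole target cross-entropy relative to the small base model, and the numbers are toy values, not results from the paper.

```python
def fraction_above_diagonal(ce_small, ce_big, ce_small_meta):
    """Fraction of holes where the small meta-learned model gains more than
    the big base model, both measured against the small base model.

    Assumption: gain = cross-entropy of small base model minus that of the
    improved model (larger is better).
    """
    above = 0
    for small, big, meta in zip(ce_small, ce_big, ce_small_meta):
        gain_big = small - big    # x-axis value for this hole
        gain_meta = small - meta  # y-axis value for this hole
        if gain_meta > gain_big:
            above += 1
    return above / len(ce_small)


# Toy cross-entropies for four holes (made up for illustration).
ce_small = [10.0, 8.0, 6.0, 4.0]
ce_big = [9.0, 8.0, 3.0, 4.0]
ce_small_meta = [5.0, 6.0, 5.0, 3.5]
print(fraction_above_diagonal(ce_small, ce_big, ce_small_meta))
```

Holes where the gain from the bigger model is near zero but the gain from adaptation is large correspond to the saturation cases discussed above.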

Figure 6: Sample cases illustrating the benefits of meta-learning on low capacity model: (Left) Hole target is string literal with partial match in support tokens; (Right) Hole target is identifier with exact match in support tokens.

5 Conclusions and Future Directions

In this work, we propose a meta-learning formulation for tackling the problem of adaptation to local contexts in source code. It retrieves targeted support tokens from the file, which provide useful cues for predicting a hole target in that file. Our experiments on a large-scale Java GitHub corpus reveal the following: (a) our formulation significantly outperforms all baselines, including a comparable form of dynamic evaluation; (b) more updates in the inner loop, a carefully chosen definition of support tokens, and an optimal number of support tokens play an important role in achieving the best performance in our framework; (c) most of our performance benefits come from reducing the cross-entropy on identifiers and literals; (d) TSSA with meta-learning provides improvements that are orthogonal to improving the base model, by learning from patterns across the file that may not even appear in the inputs of baseline approaches.

We think this work opens the door to a number of future research directions. One is to develop more sophisticated support set selection methods, by considering support tokens from other files in the same project or targeting certain files based on import statements. Another would be to attempt to learn the criteria for building support sets. Also, we think our proposed framework could provide a valuable new benchmark for future meta-learning research in general. For example, most gradient-based meta-learning work has focused on the use of a small number of gradient steps, yet this work makes a good case for studying methods that can support more sophisticated optimization inner loops. Designing inner loops that can learn an optimal (possibly variable) number of gradient steps also offers an interesting avenue for exploration. We release our code for ease of reproduction of our results and to facilitate the above-mentioned research.


Hugo Larochelle would like to acknowledge the support of Canada Research Chairs and CIFAR for research funding. The authors would like to thank Google Cloud for providing the compute resources required for this project. We would also like to thank David Bieber for his continued presence in discussions and useful suggestions on the implementation front, especially with designing TensorFlow dataset loaders and code debugging tips. We would also like to extend our thanks to Charles Sutton for comments on the initial draft of the paper, which helped us improve the writing.



Appendix A Details of Hyperparameter Values

In all settings of our Seq2Seq models, the initial decoder state is set to the last state of the encoder. The first input to the decoder is the last-step output of the encoder. A dense layer with softmax output is used at the decoder. Also, note that both the parameters of the model and the state of Adam are reset after each hole target during evaluation. We use dropout = 0.5 and gradient clipping = 0.25. The embedding layer dimension is equal to the hidden layer dimension = 512. We take both the support and hole window size to be 200. In Table 5 we list the best hyperparameter values for all our settings. Notation for reading Table 5 is provided in Table 4. The last four rows in Table 5 indicate settings for the results in Section 4.6 of the paper, where we train a small model with hidden dimension = embedding dimension = 256. For our experiments, we use NVIDIA P100 and K80 GPUs with 16GB memory each.
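The evaluation protocol in which both the model parameters and the Adam state are reset after each hole target can be sketched as follows. This is a toy illustration under stated assumptions: `evaluate_holes`, `toy_adapt_and_score`, and the dictionary layout of the optimizer state are hypothetical stand-ins and not part of the released code.

```python
import copy


def evaluate_holes(base_params, holes, adapt_and_score):
    """Score each hole starting from a fresh copy of the base parameters.

    adapt_and_score(params, opt_state, hole) is a hypothetical callback that
    may mutate params and opt_state in place and returns a loss for the hole.
    """
    losses = []
    for hole in holes:
        params = copy.deepcopy(base_params)        # reset model parameters
        opt_state = {"step": 0, "m": {}, "v": {}}  # fresh Adam-style state
        losses.append(adapt_and_score(params, opt_state, hole))
    return losses


def toy_adapt_and_score(params, opt_state, hole):
    """Toy stand-in for adaptation: nudges one weight toward the hole value."""
    for _ in range(3):
        params["w"] += 0.5 * (hole - params["w"])
        opt_state["step"] += 1
    return (hole - params["w"]) ** 2


base = {"w": 0.0}
print(evaluate_holes(base, [1.0, 1.0], toy_adapt_and_score), base)
```

Because every hole starts from a deep copy, adaptation on one hole cannot leak into the prediction for the next, which is why identical holes receive identical scores in this sketch.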

Symbol | Meaning
--- | ---
lr | learning rate of Adam optimizer
hbs | hole batch-size
dbs | batch-size of tokens in dynamic evaluation
sbs | support tokens batch-size
#up | number of inner loop updates
snum | number of support tokens
sdef | definition of support tokens
ilr | learning rate of inner update Adam optimizer for FOMAML
olr | learning rate of outer update Adam optimizer for FOMAML
T: | while training/meta-training
E: | while evaluation
Table 4: Notation for terms occurring in Table 5
Model | Hyperparameters
--- | ---
T: Base Model | lr = 1e-4, hbs = 512
T: TSSA-Reptile | = 0.1, ilr = 5e-5, hbs = 1, sbs = 20, sdef = proj, #up = 32, snum = 512
T: TSSA-FOMAML | olr = 1e-5, ilr = 1e-5, hbs = 1, sbs = 20, sdef = vocab, #up = 14, snum = 1024
E: Base Model | hbs = 1
E: Dynamic Evaluation | lr = 1e-3, , hbs = 1, dbs = 20
E: TSSA-1 | lr = 5e-3, hbs = 1, sdef = proj, snum = 256, 512, 1024
E: TSSA-k | lr = 5e-4, hbs = 1, sbs = 20, sdef = vocab, #up = k = 16, snum = 256, 512, 1024
E: TSSA-Reptile | ilr = 5e-4, hbs = 1, sbs = 20, sdef = vocab, #up = 16, snum = 256, 512, 1024
E: TSSA-FOMAML | ilr = 5e-4, hbs = 1, sbs = 20, sdef = vocab, #up = 16, snum = 256, 512, 1024
T: Base Model (small) | lr = 1e-4, hbs = 512
T: TSSA-FOMAML (small) | olr = 1e-5, ilr = 1e-5, hbs = 1, sbs = 20, sdef = vocab, #up = 14, snum = 1024
E: Base Model (small) | hbs = 1
E: TSSA-FOMAML (small) | ilr = 5e-4, hbs = 1, sbs = 20, sdef = vocab, #up = 16, snum = 256

Table 5: Best hyperparameter values for all our settings

Appendix B Categorization of Token Types

Token Category | Java Token-Type
--- | ---
Identifiers | identifier
Keywords | import, break, throws, extends, for, public, return, protected, boolean, package, new, class, void, static, int, this, volatile, synchronized, if, private, final, implements, super, catch, try, throw, else, instanceof, long, abstract, enum, case, byte, char, break, interface, finally
Operators | dot, gt, lt, eq, plus, eqeq, colon, bangeq, ques, ampamp, sub, bang, plusplus, barbar, star, amp, gteq, subsub, bar, ellipsis
Literals | stringliteral, intliteral, charliteral, longliteral, null, false, true
Special Symbols | semi, rparen, lparen, lbrace, rbrace, comma, monkeys_at, rbracket, lbracket

Table 6: Grouping of the Java token-types given by Python’s Java-parser into broad token categories for ease of visualization
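For readers who want to reproduce the per-category analysis, the grouping in Table 6 can be expressed as a simple lookup. The sketch below is illustrative only: the names `TOKEN_CATEGORIES` and `category_of`, the lower-casing of the input, and the "Unknown" fallback for unlisted token-types are our assumptions.

```python
# Token-type lists copied from Table 6 (duplicates removed).
TOKEN_CATEGORIES = {
    "Identifiers": ["identifier"],
    "Keywords": [
        "import", "break", "throws", "extends", "for", "public", "return",
        "protected", "boolean", "package", "new", "class", "void", "static",
        "int", "this", "volatile", "synchronized", "if", "private", "final",
        "implements", "super", "catch", "try", "throw", "else", "instanceof",
        "long", "abstract", "enum", "case", "byte", "char", "interface",
        "finally",
    ],
    "Operators": [
        "dot", "gt", "lt", "eq", "plus", "eqeq", "colon", "bangeq", "ques",
        "ampamp", "sub", "bang", "plusplus", "barbar", "star", "amp", "gteq",
        "subsub", "bar", "ellipsis",
    ],
    "Literals": [
        "stringliteral", "intliteral", "charliteral", "longliteral", "null",
        "false", "true",
    ],
    "Special Symbols": [
        "semi", "rparen", "lparen", "lbrace", "rbrace", "comma", "monkeys_at",
        "rbracket", "lbracket",
    ],
}

# Invert the table into a token-type -> category lookup.
TYPE_TO_CATEGORY = {
    token_type: category
    for category, token_types in TOKEN_CATEGORIES.items()
    for token_type in token_types
}


def category_of(token_type):
    """Map a parser token-type name to its broad category ("Unknown" if unlisted)."""
    return TYPE_TO_CATEGORY.get(token_type.lower(), "Unknown")


print(category_of("stringliteral"), category_of("monkeys_at"))
```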