This paper was written because I have received several emails saying that it is hard to achieve good results when applying token repetition learning techniques. If REP (proposed by me) or Pointer-Mixture (proposed by Jian Li) is applied directly to source code to decide all token repetitions, model performance decreases sharply. Because we use a pre-order traversal of the Abstract Syntax Tree (AST) to generate the token sequence, tokens corresponding to AST grammar are ignored when learning token repetition. Non-grammar tokens come in several kinds: strings, chars, numbers and identifiers. We tried to learn the repetition pattern of each kind and found that only identifiers have the property of token repetition. Identifiers themselves come in several kinds, such as variables, package names, method names, simple types, qualified types and qualified names. Some of these, such as package names, method names, qualified names and qualified types, are unlikely to be repeated, so we ignore them when learning token repetition. This step is crucial, but this implementation trick was not clearly presented in the paper because we thought it was trivial and that too many details might bother readers. We gave the GitHub address of our model in our conference paper, and readers can check the description and implementation in that repository. In this paper, we supplement the already published papers with these important implementation optimization details.
This section describes the implementation optimizations in detail, including the usage of Eclipse JDT classes and some options in calculating accuracy. This optimization was implemented by the second coauthor of the paper, who thought the details might confuse readers who are not familiar with Eclipse JDT. Still, it is our fault that we did not mention these important implementation optimizations in the paper. The identifiers in a Java AST generated by Eclipse JDT are leaf nodes of type SimpleName. In paper [ijseke19-rep], we point out that we split each leaf node into two tokens (the node type as one token and the node content as another) to check whether the syntax is predicted correctly. We predict the content of a leaf node based on its type. This step has no impact on the traditional language model, but it has a great impact on token repetition learning models such as REP and Pointer-Mixture. Based on this step, we use the REP model to decide whether the content of a leaf node of type SimpleName should be a token that already exists in the context. However, considering all leaf nodes of type SimpleName leads to low model performance, so we must filter out the kinds of SimpleName nodes that are unlikely to be repeated. This filtering step is crucial: if unrepeatable tokens far outnumber the tokens that will be repeated, the classifier will assume that no token is repeated, and the REP model degenerates into the traditional language model. We now describe in detail which kinds of SimpleName nodes are considered by the REP model.
| Node Type | Parent Node Type | Extra Node Condition |
|---|---|---|
| SimpleName | MethodInvocation | Node is method name |
| SimpleName | SuperConstructorInvocation | Node is super class |
| SimpleName | SuperMethodInvocation | Node is method name or super class |
We use the rules in Table 1 to filter out nodes that are unlikely to be repeated. Note that nodes whose parent types correspond to ‘go-to’ semantics in the Java language are actually highly repeated. However, the proportion of these nodes is too small, and an isolated REP model should be used to learn the token repetition of these ‘go-to’ related nodes, so we filter them out here as well. Even after this filtering, the remaining nodes still make up 20% of all nodes. If the type of a node and the type of its parent node match any row of the table, that node is filtered; where a row specifies an Extra Node Condition, the node is filtered only if it also meets that condition. In our experiments, method names have a very low probability of being repeated, so they need to be removed. Thus, if the parent of a SimpleName node is a MethodInvocation and the node represents the name of the invoked method, it is filtered. Similarly, code sometimes calls the constructor of a specified parent class: if the parent of a node is a SuperConstructorInvocation and the node represents the specified class, that node is filtered. Note that we have conducted experiments to confirm that the filtered nodes have a very low probability of being repeated. We use only simple syntax information to do this filtering.
Cared Node: after filtering, we call the remaining nodes Cared Nodes. The REP model uses syntax to judge, based on the type of a node, whether it is a Cared Node. If the node being completed is a Cared Node according to its type on the AST, the REP model predicts its content. Likewise, the REP model considers only the Cared Nodes that already exist in the context.
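The filtering described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LeafNode` record, its `role` tag and the function name `is_cared_node` are hypothetical, and only the three rules that survive in Table 1 are encoded (the ‘go-to’ related parent types are not).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LeafNode:
    """Hypothetical record for one AST leaf node."""
    node_type: str        # e.g. "SimpleName"
    parent_type: str      # type of the parent AST node
    role: Optional[str]   # assumed tag: "method_name", "super_class" or None
    content: str          # the identifier text


# (node type, parent node type) -> roles that cause the node to be filtered.
# These three entries mirror the rows of Table 1.
FILTER_RULES = {
    ("SimpleName", "MethodInvocation"): {"method_name"},
    ("SimpleName", "SuperConstructorInvocation"): {"super_class"},
    ("SimpleName", "SuperMethodInvocation"): {"method_name", "super_class"},
}


def is_cared_node(node: LeafNode) -> bool:
    """Return True if the node survives filtering (i.e. is a Cared Node)."""
    if node.node_type != "SimpleName":
        # Only identifiers are candidates for token repetition.
        return False
    filtered_roles = FILTER_RULES.get((node.node_type, node.parent_type))
    if filtered_roles is None:
        # No rule matches this parent type, so the node is kept.
        return True
    # A rule matches: filter the node only if it meets the extra condition.
    return node.role not in filtered_roles
```

For instance, the receiver `x` in `x.foo()` would be kept, while the method name `foo` would be filtered.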
For the position to be code-completed, the REP model considers only the Cared Nodes among a fixed number of previous tokens. These previous tokens are taken as the context, and their number is the context length. For example, if the original token sequence is the one shown in Figure 1 and we consider only the previous 3 tokens as context, then the original context is the one shown in Figure 3.
Here the context length is 3. As REP considers only Cared Nodes, the token in Figure 3 that is not a Cared Node is removed, and only the two remaining tokens are considered. With a context length of 3, the context considered by REP is shown in Figure 4. The whole idea is very simple. The context length is usually set to a small value, for example 25 or 50, meaning that we consider only the previous 25 or 50 tokens when learning token repetition. Here, we must correct the setting in paper [ijseke19-rep]: there we say that we can use at most 600 previous tokens as context. Actually, we use a small number of previous tokens as context.
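The context construction can be sketched as below, assuming each token has already been tagged as cared or not. The `(content, is_cared)` pair representation and the name `rep_context` are illustrative choices, not part of the original implementation.

```python
from typing import List, Sequence, Tuple

# Hypothetical token representation: (content, is_cared_node)
Token = Tuple[str, bool]


def rep_context(tokens: Sequence[Token], position: int,
                context_length: int) -> List[str]:
    """Return the Cared Nodes among the `context_length` tokens
    that precede `position` (the token being completed)."""
    start = max(0, position - context_length)
    window = tokens[start:position]
    # Tokens that are not Cared Nodes are dropped from the window.
    return [content for content, is_cared in window if is_cared]


tokens = [("a", True), ("(", False), ("b", True),
          ("foo", False), ("c", True), ("d", True)]
# With a context length of 3, the window before position 5 is b, foo, c;
# foo is not cared, so REP only sees b and c.
print(rep_context(tokens, position=5, context_length=3))
```

Note that the window is taken over all previous tokens first and filtered afterwards, so the number of Cared Nodes actually seen by REP can be smaller than the context length.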
The LSTM model generates a vector for each token in a token sequence. We index the vectors generated for the tokens in the context used by REP from 0 to m, as shown in Figure 5, and we use the index next for the vector corresponding to the token being predicted.
The probability P(i,next) that the token being predicted is a repetition of the i-th token in context is given by Equation 1.
The normalization factor in Equation 1 is given by Equation 2.
Equations 1 and 2 use a trainable model parameter together with a transposed term. In the training phase, if the token being predicted really is a repetition of the i-th token in context, P(i,next) should be maximized. In paper [seke19-rep], we forgot to add the exponential base in the above equations; this is a mistake, and we correct it here.
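One form consistent with this description is written out below. It is a reconstruction under stated assumptions rather than a quotation of the published equations: the symbols $h_i$ (the LSTM vector of the $i$-th context token), $h_{next}$ (the vector of the token being predicted), $W$ (the model parameter) and $Z$ (the normalization factor) are names chosen here, and the bilinear score $h_i^{\top} W h_{next}$ is an assumed form.

```latex
\begin{align}
P(i, next) &= \frac{e^{\,h_i^{\top} W h_{next}}}{Z}
  && \text{(assumed form of Equation 1)} \\
Z &= \sum_{j=0}^{m} e^{\,h_j^{\top} W h_{next}}
  && \text{(assumed form of Equation 2)}
\end{align}
```

Under this form, the values $P(0, next), \dots, P(m, next)$ sum to 1, and maximizing $P(i, next)$ for the true repeated token is the usual softmax cross-entropy objective.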
To decide whether the token being predicted should be a repetition of some token that already exists in the context, we compute a repetition probability. We first identify the token in context that achieves the maximum probability among P(0,next), P(1,next), …, P(m,next).
The repetition probability is then computed from the vector generated for that token, the one that makes P(i,next) the highest.
In paper [seke19-rep], we again forgot to add the exponential base in the corresponding equation; this is a mistake, and we correct it here. In practice, when training and testing, we use softmax cross entropy to optimize this value. Please see [report-yyx] for implementation details.
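The selection of the most likely repeated token can be sketched in plain Python. The bilinear score `h_i^T W h_next` and all names here (`bilinear_score`, `repetition_distribution`, `most_likely_repetition`) are assumptions made for illustration; the actual implementation is the one in [report-yyx].

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def bilinear_score(h_i: Vector, w: Matrix, h_next: Vector) -> float:
    """Assumed score h_i^T W h_next between a context token and the target."""
    return sum(h_i[r] * w[r][c] * h_next[c]
               for r in range(len(h_i))
               for c in range(len(h_next)))


def repetition_distribution(context: Sequence[Vector], w: Matrix,
                            h_next: Vector) -> List[float]:
    """Softmax over the scores of all context tokens: P(0,next)..P(m,next)."""
    scores = [bilinear_score(h, w, h_next) for h in context]
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    z = sum(exps)        # normalization factor
    return [e / z for e in exps]


def most_likely_repetition(context: Sequence[Vector], w: Matrix,
                           h_next: Vector) -> Tuple[int, float]:
    """Index of the context token with the highest P(i,next), and that value."""
    probs = repetition_distribution(context, w, h_next)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]
```

The sketch stops at picking the best candidate; how the final repeat-or-not decision is derived from that candidate is not reproduced here.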
We can also use separate, isolated REP models to learn token repetition for different kinds of tokens. For example, one REP model can be dedicated to types related to Java templates (generic classes), and another to identifiers related to the ‘go-to’ syntax in Java. This optimization was implemented by the second author of the paper. Because the second author was in the process of graduating, there was a small lapse in our communication, and I did not mention this important optimization in my paper. I therefore add these implementation details here so as not to confuse readers.
To further improve accuracy, we can take only the Java variables in the context into consideration, removing the Cared Nodes that are not variables. This step is easy to achieve because Eclipse JDT can tell us whether an identifier is a Java variable. Here we describe how to identify all variables in source code using Eclipse JDT; if there are better ways to identify all variables in source code, please ignore the following. During pre-processing, we use Eclipse JDT to identify every variable in a function. In detail, Eclipse JDT provides a mechanism called binding resolution. For every ASTNode whose type is SimpleName, we invoke the resolveBinding method provided by Eclipse JDT; if the binding is resolved successfully and the binding type is Variable, we treat this ASTNode as a variable. This step is introduced in my PhD thesis, but not clearly in paper [ijseke19-rep].
For each function in the test set, we predict tokens from start to end. Because code-structure tokens are not predicted by the REP model, taking them into consideration may confuse readers. However, we still take most of them into consideration in our paper. Here, we also give results that exclude irrelevant tokens, to show readers the strong ability of token repetition learning. Note that, when computing accuracy, some works do not count UNK tokens or some meaningless grammar tokens. In our work, although we split a leaf node into two tokens (node type and node content), we consider predicting the node content correctly to be the most important. Thus, when predicting a leaf node, we compute the accuracy of predicting the node content, not the node type. When predicting the node content, we assume that the node type has already been predicted, and we do not compute the accuracy of that prediction. We use top-k accuracy as the evaluation metric; entropy and MRR are no longer taken into consideration. In this corrected version, we still use a training set, a validation set and a test set, in the proportion 60%, 20%, 20% (slightly different from the paper).
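Top-k accuracy over node-content predictions can be computed as in the following sketch. The input format (a ranked candidate list paired with the true content) and the function name are assumptions for illustration; the values of k follow the top1, top3, top6 and top10 columns reported in the result tables.

```python
from typing import Dict, Iterable, Sequence, Tuple


def top_k_accuracy(predictions: Iterable[Tuple[Sequence[str], str]],
                   ks: Sequence[int] = (1, 3, 6, 10)) -> Dict[int, float]:
    """Fraction of predictions whose true content is among the first k
    ranked candidates, for each k in `ks`.

    Each prediction is (ranked_candidates, true_content). Only node-content
    predictions should be passed in; node-type tokens are not counted.
    """
    hits = {k: 0 for k in ks}
    total = 0
    for ranked, truth in predictions:
        total += 1
        for k in ks:
            if truth in ranked[:k]:
                hits[k] += 1
    if total == 0:
        return {k: 0.0 for k in ks}
    return {k: hits[k] / total for k in ks}
```

Restricting the input to Cared Nodes, or to Cared Nodes whose content is unseen, gives the per-category figures in the same way.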
The whole training procedure stops if the top-1 accuracy on the validation set does not exceed its maximum for 10 epochs. The traditional language model and the REP model are trained separately, and the REP model directly uses the results of the fully trained traditional language model. This step needs some engineering work; please check the GitHub address [report-yyx] for details. The initial values of all token embedding parameters are selected uniformly at random between -1.0 and 1.0 (we use the uniform_random initializer in TensorFlow). All other parameters, such as the parameters in the LSTM or in token repetition, are set to 0. There may be some exceptions; please check our implementation [report-yyx]. This setting maximizes the prediction effect for both LSTM and REP. The gradient is clipped to a fixed range (we try not to clip gradients). The 1000 least frequent tokens in the training set are marked as UNK.
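Two pieces of this setup, the UNK marking and the early-stopping rule, can be sketched as follows. The `<UNK>` string, the tie-breaking order among equally rare tokens, the `max_epochs` cap and the `run_epoch` callback are all assumptions made for illustration and are not taken from [report-yyx].

```python
from collections import Counter
from typing import Callable, Iterable, List, Set

UNK = "<UNK>"  # hypothetical placeholder string


def rare_tokens(training_tokens: Iterable[str], n_rare: int = 1000) -> Set[str]:
    """The `n_rare` least frequent tokens in the training set.

    Ties are broken by token text (an assumption) so the result is
    deterministic. If the vocabulary has fewer than `n_rare` tokens,
    every token is returned.
    """
    counts = Counter(training_tokens)
    ordered = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return {token for token, _ in ordered[:n_rare]}


def mark_unk(tokens: Iterable[str], rare: Set[str]) -> List[str]:
    """Replace every rare token with the UNK placeholder."""
    return [UNK if token in rare else token for token in tokens]


def train_with_early_stopping(run_epoch: Callable[[], float],
                              patience: int = 10,
                              max_epochs: int = 1000) -> float:
    """Call `run_epoch` (which trains one epoch and returns validation top-1
    accuracy) until the best accuracy has not been exceeded for `patience`
    consecutive epochs. Returns the best accuracy seen."""
    best = float("-inf")
    stale = 0
    for _ in range(max_epochs):
        accuracy = run_epoch()
        if accuracy > best:
            best, stale = accuracy, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best
```

In the two-stage setup described above, the same stopping rule would be applied first to the traditional language model and then to the REP model trained on top of it.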
For project Log4J, the context length is set to 25, which means that the REP and atten-ptr (the token repetition learning in Pointer-Mixture) models consider only the cared nodes among the previous 25 tokens. Table 2 shows the accuracy. As can be seen, learning token repetition greatly improves the prediction accuracy for variables, especially for unseen variables.
| all nodes | top1 | top3 | top6 | top10 | total number |
| cared nodes | top1 | top3 | top6 | top10 | total number |
| unseen cared nodes | top1 | top3 | top6 | top10 | total number |
| cared nodes | top1 | top3 | top6 | top10 | total number |
Statistical language models have been widely used to capture patterns in source code for the problem of code completion. In [DBLP:conf/icse/HindleBSGD12], source code was parsed into lexical tokens and the n-gram model was applied directly to suggest the next lexical token. In [DBLP:conf/msr/AllamanisS13a], a large-scale experiment was conducted using the n-gram model, and a visualization tool was provided to inspect the performance of the language model for the task of code completion. SLAMC [DBLP:conf/sigsoft/NguyenNNN13] built on the basic n-gram model and improved prediction accuracy by associating code lexical tokens with roles, data types and topics. Cacheca [Tu2014On] improved the basic n-gram model by caching the recently encountered tokens in local files. Decision tree learning was applied to code suggestion, and on that basis a decision tree model integrating the basic n-gram model [DBLP:conf/oopsla/RaychevBV16] was proposed for source code. The work [poplRaychevBVK16] abstracted source code into a DSL and kept sampling and validating on that specially designed DSL until a good code suggestion was obtained. Deep learning techniques such as RNN and LSTM were applied to code generation models [White2015Toward] [dam2016deep] [FSE17] to achieve higher prediction accuracy. The work in [FSE17] confirmed that LSTM significantly outperforms other models for token-level code suggestion. Given a large amount of unstructured code, deep language models such as LSTM or its variants are the state-of-the-art solutions to the problem of code completion. All of the works described above try to solve the general code completion problem, in which every token of code should be predicted and completed based on a context of fixed or variable length.
Many works also address the API completion problem. Common sequences of API calls were captured with per-object n-grams in [DBLP:conf/pldi/RaychevVY14]. In [DBLP:conf/icse/NguyenN15], API usages were trained on graphs. Naive Bayes was integrated into the n-gram model to suggest API patterns. The migration of APIs is studied in [Nguyen2017Exploring]. The completion of fully qualified API names is studied in [phan2018statistical]. On top of general code synthesis problems, API synthesis is also studied in [Nguyen2016T2API, APILearn].