The use of deep learning techniques and related neural machine translation (NMT) models in code understanding has drawn much recent attention. It has been shown that methods using NMT models can achieve much better performance than traditional Information Retrieval (IR) techniques in tasks such as automated code summarization (a1; a3) and program property prediction (a4), thus improving productivity (b2) and reducing software development costs (b3). Unlike natural languages, which are unstructured and noisy, computer programs are highly structured. It is thus critical to encode as much of the structural information as possible in a sequence-to-sequence learning model and to take advantage of the encoded information in training.
Recently, several methods have been proposed to incorporate the structural information in the Abstract Syntax Tree (AST) of a computer program into a sequence-to-sequence learning model. In the Structure-Based Traversal (SBT) method proposed by a5, an AST is represented by a sequence of syntactic tokens that is generated by a depth-first traversal of the AST, with pairs of parentheses used to retain the sub-tree information. The model Code2Seq (a4) uses the concatenation of the token sequences along the paths between pairs of terminal nodes in an AST to represent a computer program. While these methods and models have been shown to be effective in comparison to the most straightforward representation, a “flat” sequence of tokens arranged in the same order as they appear in the source code, the choices of the traversal method and the ordering of the tokens still appear to be arbitrary. For example, for a standard tree traversal method such as depth-first search, there are at least three ways to order the tokens in the sequence: in-order, pre-order, and post-order. We also note that the length of these sequence representations (and thus the input dimension of the learning model) is significantly increased. Due to the use of parentheses for sub-tree information, the length of the representation proposed by a5 is, in the worst case, three times that of the source code. The length of the representation used in the model Code2Seq (a4) is, in the worst case, cubic in the length of the source code.
In this paper, we propose to use the Prüfer sequence of the Abstract Syntax Tree (AST) of a computer program to design a sequential representation scheme that preserves the structural information in an AST. Our representation makes it possible to develop deep-learning models in which signals carried by lexical tokens in the training examples can be exploited automatically and selectively based on their syntactic role and importance. Unlike other recently-proposed approaches, our representation is concise and lossless because an AST can be uniquely reconstructed from its Prüfer sequence. Empirical studies on real-world benchmark datasets, using a sequence-to-sequence learning model we designed for code summarization, showed that our Prüfer-sequence-based representation is indeed highly effective and efficient, and that our model significantly outperforms all the recently-proposed deep-learning models used as baselines in the experiments.
The application of deep learning models in representation learning for computer programs has attracted much recent attention and is widely used for many tasks in computer program comprehension such as automated code summarization (a7; a8; a9), code generation (a10), and code retrieval (a11).
Several approaches have been investigated to make use of syntactic and structural information, explicitly or implicitly, in representation learning from the source code of computer programs. a12 used the relations in the abstract syntax tree as a feature for training a learning model. a13 and a14 used the paths in an AST to identify the context node. a5 proposed a Structure-Based Traversal (SBT) method to represent ASTs as a linear sequence containing syntactic information in a sequence-to-sequence learning model for code summarization. a15 further extended their SBT-based model (a5) by adding another encoder that learns from lexical information in the source code. Another representation (a16) uses the concatenation of the token sequences along the paths between pairs of terminal nodes in an AST. In addition to these efforts to use the structural information in the ASTs directly in a sequence-to-sequence model, a9 explored the possibility of exploiting structural information implicitly with a transformer model enhanced by pairwise semantic relationships of tokens in the model’s attention mechanism. A significant downside of a transformer-based approach is the increase in model complexity (quadratic in the code length), and thus in the sample complexity and the computational complexity.
Abstract Syntax Trees and Prüfer Sequences
Prüfer sequences have been used in the past as a sequential representation of tree structures in stochastic search methods (pru-r1; pru-r2), problems in fuzzy systems (pru-r3; pru-r5), and hierarchical or graph data management (pru-r7; pru-r8). To the best of our knowledge, our work is the first effort in using Prüfer sequences and exploiting their unique properties in (deep) representation learning for structured data.
In this section, we discuss the details of our Prüfer-sequence-based code representation, including its main idea and its construction. We also discuss the advantages of our representation over other sequential representations in terms of effectiveness, efficiency, and flexibility.
ASTs and Sequence to Sequence Models
An abstract syntax tree (AST) is a tree structure that models the abstract syntactic structure of the source code of a computer program. In an AST, the leaves (or terminal nodes) are labeled by tokens that contain user-defined values or reference types such as variable names, whereas internal nodes (or non-terminal nodes) are labeled by tokens that summarize the purpose of code blocks such as conditions, loops, and other flow-control statements. We note that a token labeling an internal node does not have to be from the source code or the specification of the particular programming language. By slightly abusing terminology, we call a token labeling a leaf node a “lexical token” as it contains program-specific information in the source code, and call a token labeling a non-terminal node a “syntactic token” as it contains generic information about the structure and the purpose of a code block. Shown in Fig. 1 is a function in Java and its AST, where the tokens Override, Int, String, mergeErrorIntoOutput, Boolean, and Commands label the terminal nodes and the tokens BasicType, FormalParameter, MethodInvocation, ReturnStatement, ClassCreator, and ReferenceType label the internal nodes.
In a sequence-to-sequence model for code representation learning, tokens (or their word embedding) from the source code have to be represented as a linear sequence and are used as the input of the model. As has been shown by a5 and a16, a linear sequence representation that makes use of structural information encoded in the AST has a great advantage over a simple flat sequence representation where tokens appear in the same order as they appear in the source code (a6).
Prüfer Sequence of an AST
The Prüfer sequence of a node-labeled tree is a sequence of node labels from which the tree can be uniquely reconstructed. The famous proof of Cayley’s formula for the number of labeled trees by Heinz Prüfer (a18; z1) is based on the one-to-one correspondence between the set of labeled trees and the set of such sequences. Given a tree with n nodes labeled by the integers 1, 2, ..., n, its Prüfer sequence is a sequence of n − 2 node labels (i.e., integers) and can be formed by successively removing the leaf with the smallest label and including the label of its parent as the next node label in the Prüfer sequence. The process stops when only two nodes are left in the tree.
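The construction and its inverse can be sketched in a few lines of Python. This is a minimal illustration of the textbook algorithm, assuming the tree is given as an edge list over nodes 1, ..., n with n ≥ 2; the function names `prufer_encode` and `prufer_decode` are hypothetical and are not taken from our implementation.

```python
import heapq
from collections import defaultdict


def prufer_encode(n, edges):
    """Return the Prüfer sequence of a tree on nodes 1..n given as an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Min-heap of current leaves, so the smallest-labeled leaf is removed first.
    leaves = [v for v in range(1, n + 1) if len(adj[v]) == 1]
    heapq.heapify(leaves)
    seq = []
    for _ in range(n - 2):
        leaf = heapq.heappop(leaves)
        neighbor = next(iter(adj[leaf]))  # the leaf's only remaining neighbor
        seq.append(neighbor)
        adj[neighbor].remove(leaf)
        if len(adj[neighbor]) == 1:       # the neighbor has become a leaf
            heapq.heappush(leaves, neighbor)
    return seq


def prufer_decode(seq):
    """Rebuild the edge list of the unique tree whose Prüfer sequence is seq."""
    n = len(seq) + 2
    degree = [1] * (n + 1)                # index 0 is unused
    for v in seq:
        degree[v] += 1                    # a node of degree d appears d - 1 times
    leaves = [v for v in range(1, n + 1) if degree[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for v in seq:
        leaf = heapq.heappop(leaves)
        edges.append((leaf, v))
        degree[v] -= 1
        if degree[v] == 1:
            heapq.heappush(leaves, v)
    # Exactly two nodes remain; they form the last edge.
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges


if __name__ == "__main__":
    toy_edges = [(1, 4), (2, 4), (3, 4), (4, 5), (5, 6)]
    toy_seq = prufer_encode(6, toy_edges)
    print(toy_seq)                        # [4, 4, 4, 5]
    print(sorted(prufer_decode(toy_seq)))
```

In the toy tree, node 4 has degree 4 and appears three times in the sequence, while the leaves 1, 2, 3, and 6 never appear.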
Since ASTs are labeled by syntactic and lexical tokens, we use a fixed mapping to map each token in the given token set to a unique integer and use it as the integer label of the AST node that is labeled by the token. The Prüfer sequence constructed from this integer-labeled AST is then mapped back to a sequence of syntactic tokens, which we call the “syntactic Prüfer sequence” and use as part of the input sequence to our learning model. Fig. 1 illustrates the process of Prüfer sequence generation from the AST of a Java method. The syntactic Prüfer sequence for the Java method in Fig. 1 is
MethodDeclaration, MethodDeclaration, FormalParameter, MethodDeclaration, FormalParameter, MethodDeclaration, ReturnStatement, MethodInvocation, MethodInvocation, ClassCreator, TypeArgument, ReferenceType, ClassCreator, MethodInvocation
Note that in the above sequence, a syntactic token may appear multiple times and at different positions. Also, note that the terminal nodes never appear in the sequence. The significance and relevance of those properties of the syntactic Prüfer sequence will be discussed below and further explored in the next two sections on the design of our learning model and our empirical studies.
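As a rough illustration of these two properties, the sketch below builds a syntactic Prüfer sequence for a small, made-up AST (not the one in Fig. 1). It rests on several assumptions that are ours for illustration only: the AST is a nested `(token, children)` tuple, every node receives a distinct integer id in pre-order so that repeated tokens remain distinguishable, and a removed leaf contributes the token of its only remaining neighbor. The names `number_nodes`, `syntactic_prufer`, and `TOY_AST` are hypothetical.

```python
import heapq
from collections import Counter

# Hypothetical toy AST: (token, [children]); nodes with no children carry lexical tokens.
TOY_AST = ("MethodDeclaration", [
    ("BasicType", [("int", [])]),
    ("FormalParameter", [("ReferenceType", [("String", [])]), ("name", [])]),
    ("ReturnStatement", [("MethodInvocation", [("length", [])])]),
])


def number_nodes(ast):
    """Assign pre-order ids 1..n; return ({id: token}, [(parent_id, child_id), ...])."""
    tokens, edges = {}, []

    def visit(node, parent_id):
        token, children = node
        node_id = len(tokens) + 1
        tokens[node_id] = token
        if parent_id is not None:
            edges.append((parent_id, node_id))
        for child in children:
            visit(child, node_id)

    visit(ast, None)
    return tokens, edges


def syntactic_prufer(ast):
    """Prüfer sequence of the numbered AST, mapped back to node tokens."""
    tokens, edges = number_nodes(ast)
    neighbors = {v: set() for v in tokens}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    leaves = [v for v in tokens if len(neighbors[v]) == 1]
    heapq.heapify(leaves)
    seq = []
    for _ in range(len(tokens) - 2):
        leaf = heapq.heappop(leaves)      # smallest-numbered current leaf
        (nbr,) = neighbors[leaf]          # its single remaining neighbor
        seq.append(tokens[nbr])
        neighbors[nbr].discard(leaf)
        if len(neighbors[nbr]) == 1:
            heapq.heappush(leaves, nbr)
    return seq


if __name__ == "__main__":
    sequence = syntactic_prufer(TOY_AST)
    print(sequence)
    print(Counter(sequence))  # each token's count equals its node degree minus one
```

On this toy input, the 10-node AST yields a sequence of 8 syntactic tokens in which MethodDeclaration and FormalParameter (both of degree 3) appear twice, and none of the lexical tokens int, String, name, or length appears.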
Advantages of Learning with a Prüfer-Sequence Representation
The syntactic Prüfer sequence can be regarded as a “transformed” and “quantified” version of an AST and the corresponding source code where
the frequency with which a syntactic token appears is decided by the degree of the corresponding AST node and quantifies the “importance” of the token (measured by the size of the code block it controls);
the positions of the appearances of syntactic tokens in the Prüfer sequence are decided by the position of the corresponding node in the tree; and
a lexical token labeling a leaf node of an AST never appears in the Prüfer sequence, but its “significance” can be measured by the syntactic importance of the parent of the leaf node.
This is in sharp contrast with all other recently proposed sequential representations of a source code and its AST, where all tokens are treated equally, and their positions only partially capture their roles in the AST.
Supported by observations from our empirical studies (discussed in the second-to-last section), we believe that these are the properties that make it possible (or much easier) for our learning models to exploit information in the training examples that is hard, if not impossible, for other recently-proposed learning models to detect.
As we observed in our experiments, other properties that distinguish our Prüfer-sequence representation from existing representations and play important roles in the performance of our learning model are as follows.
A Unique and Lossless Representation
To the best of our knowledge, all existing sequential representations of source code and its AST (a4; a5) are lossy encodings in the sense that the original AST cannot be uniquely reconstructed. Our Prüfer-sequence representation is a lossless encoding because, given a fixed syntactic-token-to-integer mapping, there is a one-to-one correspondence between the set of ASTs and their syntactic Prüfer sequences. This property may help improve the ability of a learning method to distinguish or detect subtle differences in training examples.
More Concise Input (or Lower Input Dimension)
For an AST with n nodes, the syntactic Prüfer sequence has length n − 2. In comparison, as noted in the introduction, the length of the representation proposed by a5 is, in the worst case, three times that of the source code, while the length of the representation proposed by a16 is, in the worst case, cubic in the length of the source code. In our experiments, we observed that the Prüfer sequence representation has an average length of 100.81 tokens, while the representation based on SBT (a5) needs 193.71 tokens on average to represent the same AST corpus (Table 5), resulting in faster training of our model.
Prüfer-Sequence-Based Learning Model For Code Summarization
To study the effectiveness of the Prüfer-sequence-based representation, we developed a deep-learning model for code summarization. The model maps a Java method to a summary of the method’s purpose in English. The training data are pairs of the source code of a Java method and the developers’ comments. The high-level structure of our model is depicted in Fig. 2. It is a sequence-to-sequence (seq2seq) model (a19) in the encoder-decoder paradigm, where two separate encoders are used to learn from the structural information of an AST and from a structure-aware representation of the lexical tokens from the source code. An attention module, similar to the one used by a15, is used to combine the output of the two encoders into a context vector, which is then used as the input to a standard decoder described in (a19) to output a code summary/comment in English.
We describe the details of the two encoders and their rationale below.
Prüfer Sequence Encoder
The Prüfer Sequence Encoder is designed to learn from the structural information of the ASTs that are losslessly encoded in their syntactic Prüfer sequences. Gated Recurrent Units (GRUs), as discussed by (a21), are used to map the syntactic Prüfer sequence of a computer program to a sequence of hidden states as follows:
In our implementation, lexical tokens labeling the terminal nodes are appended to the syntactic Prüfer sequence as part of the input to the encoder, resulting in an input length of at most . We observed in our initial experiments that the model Hybrid-DeepCom (a15) with its SBT-based encoder replaced by our Prüfer Sequence Encoder already outperformed notably the original Hybrid-DeepCom Model and other baseline models. It turned out that the bottleneck to further improvement of such models is the design of the second encoder, the Source Encoder (a15), that learns from source code tokens directly. This observation in our early investigation motivated us to design our own second encoder, which we call the Context Encoder, that exploits lexical information in a structure-aware way. We discuss our definition of the context of an AST and the design of the Context Encoder in the next subsection.
The context encoder, also consisting of GRUs, is designed to learn from the collection of lexical tokens (i.e., user-defined and program-specific values in the source code) organized in a way that reflects the structural information of the AST.
For each node in an AST, we define its context to be the set of lexical tokens that label the node’s leaf child/children. A node with no leaf child has an empty context. The context of an AST is defined to be the union of the contexts of the AST nodes ordered in the same way as they appear in the syntactic Prüfer sequence. The context of an AST can be calculated from the AST’s Prüfer sequence (See Fig. 1). The context encoder maps the context defined in the above to a sequence of hidden states.
The context of an AST defined in this way is a structure-aware sequence of lexical tokens because (1) the frequency of a lexical token is decided by the degree of the parent of the leaf node that the token labels; and (2) the order in which these tokens appear in the context is the same as the order of the parent nodes in the Prüfer sequence. As observed in our experiments, the use of the context encoder helps boost the performance of our learning model significantly. We believe this is because the context we have designed helps amplify the learning-relevant lexical signals in the source code in a way that cannot be detected by other models (such as those in a15), which simply use all the tokens in the order in which they appear in the source code.
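One possible reading of this definition is sketched below. It assumes that the Prüfer sequence is available as a list of node ids and that a node's context is emitted once for every occurrence of the node in that sequence, which is how we interpret property (1) here. The function `ast_context` and the toy dictionaries are hypothetical and use a made-up 10-node AST rather than the one in Fig. 1.

```python
def ast_context(prufer_ids, children, tokens):
    """Collect lexical tokens of leaf children, following the Prüfer-sequence order.

    prufer_ids: node ids in the order they occur in the Prüfer sequence
    children:   {node_id: [child ids]}; ids missing from the dict are leaves
    tokens:     {node_id: token labeling the node}
    """
    context = []
    for node_id in prufer_ids:
        for child in children.get(node_id, []):
            if not children.get(child):   # the child has no children: a leaf
                context.append(tokens[child])
    return context


# Hypothetical toy AST, numbered in pre-order.
TOKENS = {
    1: "MethodDeclaration", 2: "BasicType", 3: "int", 4: "FormalParameter",
    5: "ReferenceType", 6: "String", 7: "name", 8: "ReturnStatement",
    9: "MethodInvocation", 10: "length",
}
CHILDREN = {1: [2, 4, 8], 2: [3], 4: [5, 7], 5: [6], 8: [9], 9: [10]}
PRUFER_IDS = [2, 1, 5, 4, 4, 1, 8, 9]     # Prüfer sequence of this tree, as node ids

if __name__ == "__main__":
    print(ast_context(PRUFER_IDS, CHILDREN, TOKENS))
    # ['int', 'String', 'name', 'name', 'length']
```

Here the token name is emitted twice because its parent, FormalParameter, occurs twice in the Prüfer sequence, while nodes without leaf children (such as MethodDeclaration) contribute nothing.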
In this section, we first introduce the experimental setup, including the datasets, the baseline models, and the evaluation metrics. We then discuss observations from and analyses of our experimental results on the power, effectiveness, and efficiency of the proposed Prüfer sequence representation and the role it plays for our learning model to significantly outperform recently-proposed learning models.
Dataset and Experiment Setup
We perform our experiments on two Java datasets. Dataset-1 (68469 pairs of Java code and comments) is a popular dataset used in many studies and was collected by a23 from popular repositories on GitHub. Dataset-2 (163316 pairs of Java code and comments) is part of the CodeXGlue dataset developed by Microsoft (codexglue). It is known for its high quality and complexity and is believed to be one of the most challenging datasets for deep learning approaches to program understanding and generation. We split the data 8:1:1 into training, testing, and validation sets. Some statistics of the datasets are shown in Tables 1 and 2.
We tokenized the Java source code and comments with Javalang and NLTK (https://www.nltk.org/), respectively. The vocabulary size for comments, code, the code’s context, and the Prüfer sequence is set to 30000 (a5; a19). The maximum length of the Prüfer sequence is 200, and the maximum length of the code and the code’s context is 500 (a5; a19). The special tokens START and EOS, representing the start and the end of a sequence, are added to the decoder sequence during training. The maximum comment length is set to 30, and out-of-vocabulary tokens are represented by the special token UNK.
Our model uses a one-layer GRU with a 256-dimensional hidden state and 256-dimensional word embeddings. Training runs for at most 60 epochs. The learning rate is set to 0.5 and is decayed at a rate of 0.99, and we clip the gradient norm at 5. The model is implemented in TensorFlow 1.15, and we train it on a single Tesla P100-PCIE-16GB GPU with 25 GB of RAM and 110 GB of disk.
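For concreteness, the optimization settings can be written down as a small configuration sketch. Two details are assumptions made only for this illustration: the decay is applied multiplicatively once per epoch, and clipping is applied to the global norm over all gradients. The names `HPARAMS`, `learning_rate`, and `clip_by_global_norm` are hypothetical, and the code is plain Python rather than the TensorFlow code used in our experiments.

```python
import math

# Settings reported in the text; the dictionary keys are illustrative names.
HPARAMS = {
    "num_layers": 1,
    "hidden_size": 256,
    "embedding_size": 256,
    "max_epochs": 60,
    "initial_lr": 0.5,
    "lr_decay": 0.99,
    "clip_norm": 5.0,
}


def learning_rate(epoch, hp=HPARAMS):
    """Learning rate at a 0-based epoch, assuming one multiplicative decay per epoch."""
    return hp["initial_lr"] * hp["lr_decay"] ** epoch


def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient vectors so that their joint L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm <= clip_norm:
        return grads, norm
    scale = clip_norm / norm
    return [[g * scale for g in vec] for vec in grads], norm


if __name__ == "__main__":
    print(learning_rate(0), learning_rate(59))
    clipped, original_norm = clip_by_global_norm([[3.0, 4.0], [0.0, 12.0]], HPARAMS["clip_norm"])
    print(original_norm, clipped)
```

Under the per-epoch assumption, the learning rate falls from 0.5 to roughly 0.28 over the 60 epochs.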
|Dataset Type||Avg.||Mode||Median||200 Tokens|
|Our Model (Prüfer Encoder + Hu’s Source Encoder)||38.38||29.43||20.13||41.82|
|Our Model (Prüfer Encoder + Context Encoder)||39.67||31.01||21.01||43.45|
|Our Model (Prüfer Encoder + Hu’s Source Encoder)||15.50||3.85||8.9||20.79|
|Our Model (Prüfer Encoder + Context Encoder)||16.15||4.49||9.72||24.73|
In our experiments, we compared the performance of our learning model with the following baseline models to empirically analyze the power, effectiveness, and efficiency of the proposed Prüfer-sequence-based representation.
TL-CodeSum Model (a23). This is an NMT-based code summarization method that uses API knowledge and source code tokens as the input to a sequence-to-sequence model.
Hybrid-DeepCom Model (a15). This sequence-to-sequence model for code summarization is one of the recent models that is designed to exploit structural information in the AST of a computer program. It uses two encoders: a source-token encoder and an SBT encoder. The SBT encoder uses a depth-first-traversal sequence representation of an AST as its input, and the code-token encoder is used to learn from the lexical information of the source code.
Code2Seq Model (a16). This is a deep-learning model for general code representation learning. Code2Seq uses the concatenation of the token sequences along the paths between pairs of terminal nodes in an AST as its input representation.
BFS-Hybrid-DeepCom. This model is based on the Hybrid-DeepCom Model (a15) with the SBT representation of an AST replaced by a sequence representation constructed from a breadth-first-search (BFS) traversal of the AST. We included this model to help verify our claim that all the recently-proposed sequence representations are more or less arbitrary. As observed from our experiments (see the next section), this BFS-based model performs as well as (or even slightly better than) those recently proposed models.
Lexical-Token-Only Model. This basic attention-based seq2seq model has only one encoder that learns from the lexical information in the source code. We used this model to understand the importance of incorporating syntactic information (in the AST) in deep-learning approaches for code summarization. The parameter setting of the model is similar to that of the Hybrid-DeepCom model.
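To make the BFS-based input of the BFS-Hybrid-DeepCom baseline concrete, the following sketch linearizes an AST level by level. It assumes the AST is a nested `(token, children)` tuple, that children are visited left to right, and that every node token is emitted exactly once; the function name `bfs_tokens` and the toy AST are hypothetical.

```python
from collections import deque


def bfs_tokens(ast):
    """Return the node tokens of a (token, [children]) AST in breadth-first order."""
    order = []
    queue = deque([ast])
    while queue:
        token, children = queue.popleft()
        order.append(token)
        queue.extend(children)  # children are enqueued left to right
    return order


# Hypothetical toy AST, used only for illustration.
TOY_AST = ("MethodDeclaration", [
    ("BasicType", [("int", [])]),
    ("FormalParameter", [("ReferenceType", [("String", [])]), ("name", [])]),
    ("ReturnStatement", [("MethodInvocation", [("length", [])])]),
])

if __name__ == "__main__":
    print(bfs_tokens(TOY_AST))
```

Such a sequence contains one token per AST node and no bracketing, so the parent of a given token cannot in general be recovered from the sequence alone.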
To evaluate the effectiveness of different approaches, we use four widely-used machine translation metrics: two BLEU scores, the METEOR score, and ROUGE-L. The BLEU score (a25) is a family of metrics that measure the quality of machine-translated text against human-written text. In this paper, we use the sentence-level BLEU score (S-BLEU) with the smoothing-4 method (a26) and the corpus-level BLEU score (C-BLEU), and compute them with the program “multi-bleu.perl”. METEOR is a recall-oriented evaluation method (a24) that evaluates translation hypotheses by aligning them to reference translations and calculating sentence-level similarity scores. ROUGE-L is one of the four measures of the ROUGE family, where L stands for Longest Common Subsequence. It computes the F-score, defined as the harmonic mean of the recall and precision values obtained from the longest common subsequence of the texts.
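The ROUGE-L computation described above can be sketched as follows. This is an illustrative implementation under simple assumptions (whitespace tokenization, a single reference, and an unweighted harmonic mean of precision and recall); it is not the evaluation script used in our experiments, and the names `lcs_length` and `rouge_l_f1` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(hypothesis, reference):
    """Harmonic mean of LCS-based precision and recall for two sentences."""
    hyp, ref = hypothesis.split(), reference.split()
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    hyp = "returns the length of the string"
    ref = "return the length of a string"
    print(rouge_l_f1(hyp, ref))  # LCS is "the length of string": P = R = 4/6
```

In the example, the longest common subsequence has four tokens and both sentences have six, so precision, recall, and the resulting F-score all equal 2/3.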
Analysis of Experimental Results and Observations
In Tables 3 and 4, we summarize the experimental results on the effectiveness of our Prüfer-based learning model and the baseline models for code summarization on two public Java datasets: a) Dataset-1 (a23) and b) Dataset-2 (codexglue). We note that the performance scores for Dataset-2 are much lower, but this is not a surprise: as we mentioned in the previous section, this dataset is believed to be one of the most challenging datasets for deep learning approaches to program understanding and generation. A detailed discussion can be found on CodeXGlue’s project webpage (https://microsoft.github.io/CodeXGLUE/). Also, note that in Table 4 (for Dataset-2), we do not have a row for the model TL-CodeSum because Dataset-2 does not provide the API knowledge required to train the model. As we can see, the model that uses our Prüfer Sequence Encoder and Hu’s Source Encoder (second-to-last rows in both Tables 3 and 4) already shows a notable improvement over the baseline models. The improvement by our complete Prüfer-sequence-based model (using our Prüfer Sequence Encoder and our Context Encoder) is even more significant, especially on Dataset-2 (last rows in both Tables 3 and 4).
In the rest of this section, we discuss our observations from the experiments and analyze the results to understand the power, effectiveness, efficiency, and the robustness (against the code length) of our Prüfer-sequence-based representation. These observations and their analyses are solid evidence, supporting our belief that our Prüfer-sequence-based representation makes it possible (or much easier) for learning models to exploit information in the training examples that are hard, if not impossible, for other recently-proposed learning models to detect.
Power and Effectiveness of Prüfer-Sequence Representation of AST.
As shown in the second-to-last rows of Tables 3 and 4, the model that combines our Prüfer Sequence Encoder with Hu’s Source Encoder already shows a notable improvement over the baseline models, with the average performance improvement being for Dataset-1 and for Dataset-2. We attribute the performance improvement to the properties of the Prüfer sequence representation discussed in the previous section (Abstract Syntax Trees and Prüfer Sequences): a concise and lossless encoding that quantifies the “importance” of syntactic tokens and preserves their structural roles in describing the source code. This is further supported by the relatively poor performance of the recently-proposed general model, Code2Seq, which uses a lossy encoding with an input length that is cubic in the size of the AST in the worst case. Note that the performance of the Code2Seq Model is worse than that of the model that does not make use of any structural information of the AST (first rows in both tables).
We can also see that the performance of the two versions of the Hybrid-DeepCom Model based on respectively the BFS sequence and the SBT sequence are comparable on both datasets, justifying our claim in the introduction that in such traversal-based sequence representation recently proposed, the order of the tokens in a sequence is largely arbitrary and only partially captures the structure of an AST.
Importance of Structure-Aware Context Sequence of Lexical Tokens.
As shown in the last row in both Tables 3 and 4, the use of the Context Encoder boosted the performance of our model dramatically. The average performance improvement over baseline models is increased from to for Dataset-1 and to for Dataset-2.
Considering that both of the recently-proposed deep-learning models (Code2Seq and Hybrid-DeepCom, the second rows and the third-to-last rows in both tables) are also designed to make use of the structural information of an AST as well as lexical tokens from the source code, the significant performance gain of our model is best explained by the fact that the “context” sequence defined in our model as the input to the Context Encoder is structure-aware. The difference between our context sequence and those used in the Hybrid-DeepCom model and the Code2Seq model is that in our context sequence, the frequency of a lexical token from the source code is decided by the degree of its parent node in the AST, whereas Hybrid-DeepCom and Code2Seq treat tokens from the source code equally regardless of their role and significance in the program.
Efficiency of the Prüfer Sequence Encoding.
The input dimension to a seq-to-seq deep learning model depends on the encoding scheme. The lower the dimension is, the faster it is to complete one epoch of training. There is, of course, a tradeoff among the effectiveness/ability of a model, its input dimension, and the training time. An ideal encoding is the one that preserves as much as possible the structural information and has a short length.
Our experiments confirmed that our Prüfer sequence representation, while encoding the structure of an AST losslessly, is more concise than other representations and indeed requires less training time. Fig. 3 summarizes our observation on the time required to complete different training epochs for three learning models: Our Model, Hybrid-DeepCom, BFS-Hybrid-DeepCom. Among the models, BFS-Hybrid-DeepCom uses the shortest sequence representation (70.21 on average over the training data), and Hybrid-DeepCom has the longest representation (193.91 on average). The average length of our Prüfer sequence representation is 100.81. It is a surprise to observe that BFS-Hybrid-DeepCom (a model we customized from Hybrid-DeepCom using a straightforward and much shorter breadth-first-search-based representation), requires less training time but has a comparable performance with Hybrid-DeepCom that is based on a carefully designed and more sophisticated representation. While our model requires more time to train than BFS-Hybrid-DeepCom (as expected but not by much), the performance gain of our model is significant.
Performance over Source Code of Different Lengths.
We further analyzed the accuracy of the three models when they are trained and tested on source code of different lengths. We observed (Fig. 4) that for all three models, the performance decreases as the code length increases. We note that our Prüfer-sequence-based model performs better than the other two baseline models regardless of the code length. For Java methods with 150 or more tokens, the Prüfer-sequence-based model had a clear edge over the other two methods, suggesting that it is more robust against the increase in code length.
The correlation coefficient between the BLEU score and the code length is the smallest in magnitude for the Prüfer-sequence-based model (-0.037), while for BFS and SBT, it is -0.16 and -0.19, respectively. This indicates that the (negative) correlation between the performance and the code length is much weaker for our model than for the other two models.
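A coefficient of this kind can be computed as in the sketch below, which assumes the Pearson correlation coefficient (the text above does not name the specific coefficient). The function name `pearson` and the sample values are hypothetical and are not taken from our experimental data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up (code length, BLEU) pairs, for illustration only.
    code_lengths = [50, 100, 150, 200]
    bleu_scores = [40.0, 38.0, 39.0, 35.0]
    print(pearson(code_lengths, bleu_scores))
```

A value close to zero, as observed for our model, means that the score changes little with the code length, whereas values closer to -1 indicate a stronger decline.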
|Prüfer Sequence||100.81||8||67.74||88.06%|
In this work, we proposed a concise and effective representation scheme that can be used in sequence-to-sequence models for code representation learning. By encoding structural information of abstract syntax trees of computer programs, our Prüfer-sequence-based representation makes it possible (or much easier) to develop sequence-to-sequence learning models to exploit automatically and selectively lexical and syntactic signals that are hard, if not impossible, for other recently-proposed sequence-to-sequence learning models to detect.
a9 investigated the possibility of incorporating AST information in their transformer-based model and concluded that AST information does not provide any help. Our studies in this paper suggest that it depends on how the (hierarchical) AST information is encoded and used in a learning model. It is a very interesting future work to study how our Prüfer-sequence-based encoding of ASTs can be used in a transformer-based model (such as the one in (a9)) to decrease the model’s complexity and improve its effectiveness.
While the model we developed and the experiments we conducted address the task of code summarization for a particular programming language, no assumptions were made about the programming language, its specification, or the format of the abstract syntax trees. This makes our approach language-independent and potentially applicable to other tasks in program comprehension and to any application domain of sequence-to-sequence learning models where both sequential and hierarchical signals exist in the training data.