†The first two authors contribute equally. §Yanlin Wang and Hongbin Sun are the corresponding authors.
Code summaries are concise natural language descriptions of source code, and they are important for program comprehension and software maintenance. However, it remains a labor-intensive and time-consuming task for developers to document code with good summaries manually.
Over the years, many code summarization methods have been proposed to automatically summarize program subroutines. Traditional approaches such as rule-based and information retrieval-based approaches regard source code as plain text HaiducAM10; HaiducAMM10 without considering the complex grammar rules and syntactic structures exhibited in source code. Recently, abstract syntax trees (ASTs), which carry the syntax and structure information of code, have been widely used to enhance code summarization techniques. For example, HuLXLJ18 propose the structure-based traversal (SBT) method to flatten ASTs and use an LSTM to encode the SBT sequences into vectors. hu2019deep and LeClairJM19 extend this idea by separating the code and AST into two input channels, demonstrating the effectiveness of leveraging AST information. AlonBLY19; AlonZLY19 extract paths from an AST and represent a given code snippet as a set of sampled paths. Other works WanZYXY0Y18; ZhangWZ0WL19; MouLZWJ16 use tree-based models such as Tree-LSTM, Recursive Neural Network (RvNN), and Tree-based CNN to model ASTs and improve code summarization.
We have identified some limitations of the existing AST-based approaches, which lead to a slow training process and/or the loss of AST structural information. We now use an example shown in Fig. LABEL:fig:ast_slicing_ex to illustrate the limitations:
Models that directly encode ASTs with tree-based neural networks suffer from long training times. HybridDrl WanZYXY0Y18 spends 21 hours per epoch on Funcom LeClairJM19. This is because ASTs are usually large and deep due to the complexity of programs, especially when there are nested program structures. For example, our statistics show that the maximal node number/depth of the ASTs of methods in TL-CodeSum HuLXLLJ18 and Funcom are 6,165/74 and 550/32, respectively. Moreover, HybridDrl transforms ASTs into binary trees, leading to deeper trees and more loss of structural information. As shown in Fig. LABEL:fig:ast_slicing_ex:sum, the main semantics of the code in Fig. LABEL:fig:ast_slicing_ex:code are not fully captured by HybridDrl.
Linearization methods that flatten ASTs into sequences HuLXLJ18; AlonBLY19; AlonZLY19, by nature, lose the hierarchical information of ASTs. ASTNN ZhangWZ0WL19 splits an AST into small statement trees to reduce the difficulty of large tree training. However, each subtree contains one statement only and subtrees are later linearized and fed into an RNN, also leading to the loss of hierarchical information. From Fig. LABEL:fig:ast_slicing_ex:sum, we can see that linearization methods Code2seq AlonBLY19, Astattgru LeClairJM19 and ASTNN ZhangWZ0WL19 fail to capture the main semantics, and HDeepcom hu2019deep captures only partial semantics.
To overcome the above limitations, we propose CAST (Code summarization with hierarchical splitting and reconstruction of Abstract Syntax Trees), a novel model. The key idea of our approach is to split an AST (Fig. LABEL:fig:ast_slicing_ex:full_ast) into a set of subtrees (Fig. LABEL:fig:ast_slicing_ex:sliced_AST) at a proper granularity and to learn the representation of the complete AST by aggregating its subtrees' representations learned with tree-based neural models. First, we split a full AST hierarchically using a set of carefully designed rules. Second, we use a tree-based neural model, RvNN, to learn each subtree's representation. Third, we reconstruct the split ASTs and combine all subtrees' representations with another RvNN to capture the full tree's structural and semantic information. Finally, the representation of the complete tree, together with the source code embedding obtained by a vanilla code token encoder, is fed to a Transformer decoder to generate descriptive summaries. Take Fig. LABEL:fig:ast_slicing_ex:code for example again: there are two sub-sentences in the reference summary. The For block (Lines 6, 7, and 13 in Fig. LABEL:fig:ast_slicing_ex:code) corresponds to the first sub-sentence "loop through each of the columns in the given table", and the If block (Lines 8-12) corresponds to the second sub-sentence "migrating each as a resource or relation". The semantics of each block can be easily captured when the large and complex AST is split into five subtrees as shown in Fig. LABEL:fig:ast_slicing_ex:sliced_AST. After splitting, the subtree containing the For block corresponds to the first sub-sentence and the subtree containing the If block corresponds to the second. When we reconstruct the split ASTs according to Fig. LABEL:fig:ast_slicing_ex:structure_tree, it is easier for our approach to generate a summary with more comprehensive semantics.
Our method has two-sided advantages: (1) Tree splitting reduces ASTs to a proper size, allowing effective and affordable training of tree-based neural models. (2) Different from previous work, we not only split trees but also reconstruct the complete AST from the split subtrees. This way, the high-level hierarchical information of ASTs is retained.
We conduct experiments on the TL-CodeSum HuLXLLJ18 and Funcom LeClairJM19 datasets and compare with the state-of-the-art methods. The results show that our model outperforms the previous methods on four widely-used metrics (Bleu-4, Rouge-L, Meteor, and Cider) and significantly decreases the training time compared to HybridDrl. We summarize the main contributions of this paper as follows:
We propose a novel AST representation learning method based on hierarchical tree splitting and reconstruction. The splitting rule specification and the tool implementation are provided for other researchers to use in AST-relevant tasks.
We design a new code summarization approach, CAST, which incorporates the proposed AST representations and code token embeddings to generate code summaries.
We perform extensive experiments, including an ablation study and a human evaluation, on CAST and state-of-the-art methods. The results demonstrate the power of CAST.
2 Related Work
2.1 Source Code Representation
Previous work suggests various representations of source code for follow-up analysis. Allamanis et al. AllamanisBBS15 and Iyer et al. IyerKCZ16 consider source code as plain text and use traditional token-based methods to capture lexical information. Gu et al. GuZZK16 use the Seq2Seq model to learn intermediate vector representations of natural language queries to predict relevant API sequences. Mou et al. MouLZWJ16 propose a tree-based convolutional neural network to learn program representations. Alon et al. AlonZLY19; AlonBLY19 represent a code snippet as a set of compositional paths in the abstract syntax tree. Zhang et al. ZhangWZ0WL19 propose an AST-based Neural Network (ASTNN) that splits each large AST into a sequence of small statement trees and encodes them into vectors by capturing lexical and syntactical knowledge. Shin et al. ShinABP19 represent idioms as AST segments using probabilistic tree substitution grammars for two tasks: idiom mining and code generation. leclair2020improved; wang2021code utilize GNNs to model ASTs. There are also works that utilize ensemble models du2021single or pre-trained models feng2020codebert; guo2021graphcodebert; bui2021infercode to model source code.
2.2 Source Code Summarization
Apart from the works mentioned above, researchers have proposed many approaches to source code summarization over the years. For example, Allamanis et al. AllamanisBBS15 create a neural log-bilinear context model for suggesting method and class names by embedding them in a high-dimensional continuous space. Allamanis et al. AllamanisPS16 also propose a convolutional model for summary generation that uses attention over a sliding window of tokens; it summarizes code snippets into extreme, descriptive, function-name-like summaries.
Neural Machine Translation based models are also widely used for code summarization IyerKCZ16; haije2016automatic; HuLXLJ18; HuLXLLJ18; WanZYXY0Y18; hu2019deep; LeClairJM19; AhmadCRC20; YuHWF020. CodeNN IyerKCZ16 is the first neural approach for code summarization. It is a classical encoder-decoder framework that encodes code into context vectors with an attention mechanism and then generates summaries in the decoder. NCS AhmadCRC20 models code with a Transformer to capture long-range dependencies. HybridDrl WanZYXY0Y18 uses hybrid code representations (with ASTs) and deep reinforcement learning. It encodes the sequential and structural content of code with LSTMs and tree-based LSTMs and uses a hybrid attention layer to obtain an integrated representation. HDeepcom hu2019deep, Astattgru, and Attgru LeClairJM19 are essentially encoder-decoder networks using RNNs with attention. Astattgru and HDeepcom utilize a multi-encoder neural model that encodes both code and AST. Code2seq AlonBLY19 represents a code snippet as a set of AST paths and uses attention to select the relevant paths while decoding. When using neural networks to represent large and deep ASTs, the above works encounter problems such as gradient vanishing and slow training. CAST alleviates these problems by introducing a more efficient AST representation to generate better code summaries.
3 CAST: Code Summarization with AST Splitting and Reconstruction
This section presents the details of our model. The architecture of CAST (Fig. 2) follows the general Seq2Seq framework and includes three major components: an AST encoder, a code token encoder, and a summary decoder. Given an input method, the AST encoder captures the semantic and structural information of its AST. The code token encoder encodes the lexical information of the method. The decoder integrates the multi-channel representations from the two encoders and incorporates a copy mechanism SeeLM17 to generate the code summary.
3.1 AST Encoder
3.1.1 AST Splitting and Reconstruction
Given a code fragment, we build its AST and visit it by preorder traversal. Each time a composite structure (e.g., If or While) is encountered, a placeholder node is inserted. The subtree rooted at this statement is split out to form a next-level tree, whose semantics will eventually be stuffed back into the placeholder. In this way, a large AST is decomposed into a set of small subtrees with the composite structures retained.
Before presenting the formal tree splitting rules, we provide an illustrative example in Fig. LABEL:fig:ast_slicing_ex. The full AST (omitted due to the space limit; it can be found in the Appendix) of the given code snippet (Fig. LABEL:fig:ast_slicing_ex(a)) is shown in Fig. LABEL:fig:ast_slicing_ex(b) and is split into six subtrees in Fig. LABEL:fig:ast_slicing_ex(c). The first subtree is the overview tree, with non-terminal nodes Root, MethSig, and MethBody, and three terminal nodes StatementsBlock (blue), For, and StatementsBlock (yellow) corresponding to the three main segments at Lines 2-5, Lines 6-13, and Line 14 in Fig. LABEL:fig:ast_slicing_ex(a), respectively. The StatementsBlock (blue) node corresponds to the subtree that contains the initialization statements. The For node corresponds to the For subtree, and the StatementsBlock (yellow) node corresponds to the subtree that consists of a return statement. Note that each subtree reveals one level of abstraction, meaning that nested structures are abstracted out. Therefore, the If statement nested in the For loop is split out into its own subtree, leaving a placeholder If node behind.
We give the formal definition of subtrees in Fig. 3. (We present only the top-down skeleton and partial rules due to the space limitation; the full set of rules, the splitting algorithm, and the tool implementation are provided in the Appendix.) The goal is to split a given AST into a subtree set. In general, all subtrees are generated by mapping reduction rules over the input AST (similar to mapping language grammar rules onto a token sequence), and the output collects four kinds of subtrees: the method overview tree, the method signature tree, statements' blocks, and control-structure blocks. The method overview tree provides the big picture of the method, and the signature tree gives the method signature information. To avoid too many scattered simple statements, we combine sequential statements to form a statements' block. We drill down each terminal node in the overview tree to a subtree, either a statements' block or a control-structure block, providing detailed semantics of the nodes in the overview tree. In the same way, subtrees corresponding to nested structures (such as For or If) are split out to form new subtrees. We split out nested blocks one level at a time until there is no nested block. Finally, we obtain a set of these block-level subtrees. Also, a structure tree (e.g., Fig. LABEL:fig:ast_slicing_ex:structure_tree) that represents the ancestor-descendant relationships between the subtrees is maintained.
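The placeholder-based splitting procedure described in this subsection can be sketched as follows. This is a minimal illustration, not the released implementation: the `Node` class, the `COMPOSITE` set, and `split_ast` are our own illustrative names, and real block-forming rules (e.g., grouping sequential statements) are omitted.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

# Assumed set of composite (block-introducing) statement types
COMPOSITE = {"For", "If", "While", "Try", "Switch"}

def split_ast(root):
    """Split an AST into block-level subtrees, one nesting level at a time.

    Each composite statement found during traversal is replaced by a
    placeholder leaf in its enclosing subtree; the composite's own subtree
    is queued for further splitting. Returns subtrees in discovery order.
    """
    subtrees, queue = [], [root]
    while queue:
        tree = queue.pop(0)
        stack = [tree]
        while stack:
            node = stack.pop()
            for i, child in enumerate(node.children):
                if child.label in COMPOSITE:
                    queue.append(child)                       # split out
                    node.children[i] = Node(child.label + "_placeholder")
                else:
                    stack.append(child)                       # keep walking
        subtrees.append(tree)
    return subtrees
```

Note how the queue drives the one-level-at-a-time abstraction: an `If` nested inside a `For` is only split out when the `For` subtree itself is processed.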
3.1.2 AST Encoding
We design a two-phase AST encoder module according to the characteristics of subtrees. In the first phase, a tree-based Recursive Neural Network (RvNN) followed by a max-pooling layer is applied to encode each subtree. In the second phase, we use another RvNN with different parameters to model the hierarchical relationship among the subtrees.
A subtree is defined as $t = (V, E)$, where $V$ is the node set and $E$ is the edge set. The forward propagation of the RvNN to encode the subtree is formulated as:

$$\mathbf{h}_v = \sigma\Big(\mathbf{W}\,\mathbf{x}_v + \sum_{c \in C(v)} \mathbf{U}\,\mathbf{h}_c\Big)$$

where $\mathbf{W}$ and $\mathbf{U}$ are learnable weight matrices, and $\mathbf{h}_v$, $\mathbf{x}_v$, $C(v)$ are the hidden state, token embedding, and child set of the node $v$, respectively. In particular, $\mathbf{h}_v$ equals $\sigma(\mathbf{W}\,\mathbf{x}_v)$ for a leaf node $v$.
Intuitively, this computation is the procedure in which each node in the AST aggregates information from its child nodes. After this bottom-up aggregation, each node has its corresponding hidden state. Finally, the hidden states of all nodes are aggregated into a vector through a dimension-wise max-pooling operation, which is used as the embedding of the whole subtree $t$:

$$\mathbf{e}_t = \mathrm{maxpool}\big(\{\mathbf{h}_v \mid v \in V\}\big)$$
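The bottom-up RvNN pass followed by dimension-wise max-pooling can be sketched in numpy as follows. This is an illustrative single-example sketch under stated assumptions: the `tanh` nonlinearity, the hidden size, and the `(label, children)` tuple encoding of trees are our choices, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # illustrative hidden size
W = 0.1 * rng.standard_normal((d, d))   # input transformation (W)
U = 0.1 * rng.standard_normal((d, d))   # child-state aggregation (U)

def encode_subtree(tree, embed):
    """Bottom-up RvNN pass over a (label, children) tree, then max-pool.

    Returns the pooled subtree embedding and the per-node hidden states.
    """
    states = []
    def rec(node):
        label, children = node
        h = W @ embed(label)
        for child in children:
            h = h + U @ rec(child)      # sum over child hidden states
        h = np.tanh(h)                  # sigma(W x_v + sum_c U h_c)
        states.append(h)
        return h
    rec(tree)
    # dimension-wise max-pooling over all node states -> subtree embedding
    return np.max(np.stack(states), axis=0), states
```

The second-phase RvNN over the structure tree reuses the same recursion with a separate set of parameters, taking subtree embeddings as its leaf inputs.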
After obtaining the embeddings of all subtrees, we further encode the descendant relationships among the subtrees. These relationships are represented in the structure tree $S$ (e.g., Fig. LABEL:fig:ast_slicing_ex:structure_tree), so we apply another RvNN model on $S$:
There are two main advantages of our AST encoder design. First, it enhances the ability to capture the semantic information spread across the multiple subtrees of a program via the first-layer RvNN, because the tree splitting technique produces subtrees that carry semantic information from different modules. In addition, to extract the most salient features of the node vectors, we aggregate all nodes through max-pooling. The second-layer RvNN then further aggregates the information of subtrees according to their relative positions in the hierarchy. Second, tree sizes decrease significantly after splitting, so the gradient vanishing and explosion problems are alleviated. Also, after tree splitting, the depth of each subtree is well controlled, leading to more stable model training.
3.2 Code Token Encoder
The code snippets are the raw data source providing lexical information for the code summarization task. Following AhmadCRC20, we adopt a code token encoder using a Transformer composed of a multi-head self-attention module and a relative position embedding module. In each attention head, the sequence of code token embeddings $(x_1, \ldots, x_n)$ is transformed into output vectors $(z_1, \ldots, z_n)$:

$$z_i = \sum_{j=1}^{n} \alpha_{ij}\big(x_j W^V + a_{ij}^V\big), \qquad \alpha_{ij} = \mathrm{softmax}_j\!\left(\frac{x_i W^Q \big(x_j W^K + a_{ij}^K\big)^{\top}}{\sqrt{d_k}}\right)$$

where $W^Q$, $W^K$, and $W^V$ are trainable matrices for queries, keys, and values; $d_k$ is the dimension of the queries and keys; and $a_{ij}^K$, $a_{ij}^V$ are relative positional representations for positions $i$ and $j$.
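As a concrete illustration of self-attention with relative position representations, the following numpy sketch implements one head in the style of Shaw et al. The clipping distance `k`, the single-head simplification, and all variable names are our own assumptions rather than the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_self_attention(X, Wq, Wk, Wv, Ak, Av, k=4):
    """Single-head self-attention with clipped relative positions.

    X: (n, d) token embeddings; Ak, Av: (2k+1, d) tables of relative-position
    representations for keys and values (the a_ij^K and a_ij^V above).
    """
    n, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # clip the relative distance j - i to [-k, k] and index the tables
    idx = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None], -k, k) + k
    aK, aV = Ak[idx], Av[idx]                        # both (n, n, d)
    scores = (Q @ K.T + np.einsum("id,ijd->ij", Q, aK)) / np.sqrt(d)
    alpha = softmax(scores, axis=-1)                 # rows sum to 1
    return alpha @ V + np.einsum("ij,ijd->id", alpha, aV)
```

Clipping keeps the position tables small while still letting attention distinguish nearby tokens from distant ones.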
3.3 Decoder with Copy Mechanism
Similar to the code token encoder, we adopt the Transformer as the backbone of the decoder. Unlike the original decoder module in VaswaniSPUJGKP17, we need to integrate two encoding sources, from the code and AST encoders. The serial strategy libovicky-etal-2018-input is adopted, which computes the encoder-decoder attention one by one for each input encoder (Fig. 4). In each cross-attention layer, the encodings of the ASTs ($A$, flattened by preorder traversal) or of the code ($C$) are queried by the output $H$ of the preceding summary self-attention layer:

$$\mathrm{CrossAttn}(H, M) = \mathrm{softmax}\!\left(\frac{(H W^Q)(M W^K)^{\top}}{\sqrt{d_k}}\right) M W^V, \qquad M \in \{A, C\}$$

where $W^Q$, $W^K$, and $W^V$ are trainable projection matrices for queries, keys, and values; $A$ contains one vector per subtree (with $n$ the number of subtrees), and $C$ and $H$ contain one vector per code token and summary token, respectively. Following VaswaniSPUJGKP17
, we adopt a multi-head attention mechanism in the self-attention and cross-attention layers of the decoder. After stacking several decoder layers, we add a softmax operator to obtain the generation probability of each summary token.
We further incorporate the copy mechanism SeeLM17 to enable the decoder to copy rare tokens directly from the input code. This is motivated by the fact that many summary tokens (about 28% in the Funcom dataset) are directly copied from the source code (e.g., function names and variable names). Specifically, we learn a copy probability through an attention layer:

$$p^{copy}_{ij} = \frac{\exp\big(c_i^{\top} W_c\, s_j\big)}{\sum_{i'=1}^{l_c} \exp\big(c_{i'}^{\top} W_c\, s_j\big)}$$

where $p^{copy}_{ij}$ is the probability of choosing the $i$-th source code token at the summary position $j$, $c_i$ is the encoding vector of the $i$-th code token, $s_j$ is the decoding vector of the $j$-th summary token, $W_c$ is a learnable projection matrix that maps $s_j$ to the space of $c_i$, and $l_c$ is the code length. The final probability of selecting the token $w$ as the $j$-th summary token is defined as:

$$p(y_j = w) = \gamma_j \sum_{i:\, w_i = w} p^{copy}_{ij} + (1 - \gamma_j)\, p^{gen}_j(w)$$

where $w_i$ is the $i$-th code token and $\gamma_j$ is a learned combination probability computed from the decoder state $s_j$.
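The mixing step of a copy mechanism can be sketched in a few lines. This is a toy single-step illustration with our own helper names; real implementations operate on batched tensors and extend the vocabulary with out-of-vocabulary source tokens, which we omit here.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def copy_generate(code_tokens, copy_scores, p_gen, gamma, vocab):
    """Combine the generation and copy distributions for one decoding step.

    copy_scores: unnormalised attention scores over source positions;
    p_gen: generation distribution over `vocab`; gamma: learned copy weight.
    """
    p_copy = softmax(np.asarray(copy_scores, dtype=float))
    p_final = (1.0 - gamma) * np.asarray([p_gen[w] for w in vocab], dtype=float)
    for i, tok in enumerate(code_tokens):
        if tok in vocab:                 # OOV extension omitted in this sketch
            p_final[vocab.index(tok)] += gamma * p_copy[i]
    return p_final
```

A source token that occurs several times accumulates copy mass from each of its positions, which is what makes repeated identifiers easy to reproduce.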
4 Experimental Setup
4.1 Dataset and Preprocessing
In our experiments, we adopt the public Java datasets TL-CodeSum HuLXLLJ18 and Funcom LeClairJM19, which are widely used in previous studies AhmadCRC20; HuLXLJ18; hu2019deep; HuLXLLJ18; leclair2020improved; LeClairJM19; zhangretrieval20; WeiLLXJ20. The partitioning of the train/validation/test sets follows the original datasets. We split code tokens by camel case and snake case, replace numerals and string literals with the generic tokens <NUM> and <STRING>, and lowercase all tokens. We extract the first sentence of each method's Javadoc description as the ground-truth summary. Code that cannot be parsed by the Antlr parser parr2013definitive is removed. The resulting numbers of code-summary pairs for TL-CodeSum and Funcom are reported in the Appendix.
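A minimal sketch of this token normalization, assuming tokens arrive pre-tagged with their lexical class (an assumption for illustration; the paper's pipeline is Antlr-based and its exact splitting rules may differ):

```python
import re

def normalize_code_tokens(tokens):
    """Split identifiers by camelCase/snake_case, replace literals, lowercase.

    `tokens` is a list of (kind, text) pairs with kind in {'id', 'num', 'str'}.
    """
    out = []
    for kind, text in tokens:
        if kind == "num":
            out.append("<NUM>")
        elif kind == "str":
            out.append("<STRING>")
        else:
            for part in text.split("_"):                      # snake_case
                # uppercase runs (HTTP) or capitalised/lowercase words (Get, get)
                for piece in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z0-9]+", part):
                    out.append(piece.lower())
    return out
```

The regex order matters: trying the all-caps alternative first keeps acronyms such as `HTTP` in one piece instead of splitting them letter by letter.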
4.2 Experiment Settings
We implement our approach based on the open-source project OpenNMT klein2017opennmt. The batch size is set to 128, and the maximum number of epochs is 200/40 for TL-CodeSum/Funcom. We use the AdamW optimizer loshchilov2017decoupled, and we adopt early stopping to alleviate overfitting. The experiments are conducted on a server with 4 NVIDIA Tesla V100 GPUs, and one epoch takes about 10 and 40 minutes for TL-CodeSum and Funcom, respectively. The vocabulary sizes for AST, code, and summary, the detailed hyper-parameter settings, and the training times can be found in the Appendix.
4.3 Evaluation Metrics
Similar to previous work IyerKCZ16; WanZYXY0Y18; zhangretrieval20, we evaluate the performance of our proposed model with four widely-used metrics: BLEU PapineniRWZ02, Meteor BanerjeeL05, Rouge-L lin-2004-rouge, and Cider VedantamZP15. These metrics are prevalent in machine translation, text summarization, and image captioning. Note that we report the scores of BLEU, Meteor (Met. for short), and Rouge-L (Rouge for short) as percentages since they lie in the range [0, 1]. As Cider scores lie in the range [0, 10], we display them as real values. In addition, we notice that related work on code summarization uses different BLEU implementations, such as BLEU-ncs, BLEU-M2, BLEU-CN, and BLEU-FC (named by GrosSDY20), and there are subtle differences in the way these BLEUs are calculated GrosSDY20. We choose the widely used BLEU-CN IyerKCZ16; AlonBLY19; FengGTDFGS0LJZ20; wang2020cocogum as the BLEU metric in this work. A detailed description of the metrics can be found in the Appendix.
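To make the variant issue concrete, here is a minimal sentence-level BLEU-4 sketch with add-one smoothed n-gram precisions. The smoothing choice is one common option and is not guaranteed to match BLEU-CN exactly; differences of exactly this kind are what GrosSDY20 document.

```python
import math
from collections import Counter

def bleu4(candidate, reference):
    """Sentence-level BLEU-4 with add-one smoothed n-gram precisions.

    candidate/reference: lists of tokens; single-reference simplification.
    """
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    log_p = 0.0
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        matched = sum(min(count, ref[g]) for g, count in cand.items())
        log_p += math.log((matched + 1) / (sum(cand.values()) + 1))  # add-one
    brevity = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return brevity * math.exp(log_p / 4)
```

Swapping the smoothing rule or the brevity-penalty handling yields a different "BLEU", which is why scores across papers are not directly comparable.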
5 Experimental Results
5.1 The Effectiveness of CAST
We evaluate the effectiveness of CAST by comparing it to the recent DNN-based code summarization models introduced in Sec. 2.2: CodeNN, HybridDrl, HDeepcom, Attgru, Astattgru, Code2seq, and NCS. To make a fair comparison, we extend ASTNN to CodeAstnn with an additional code token encoder like ours, so that the only difference between CodeAstnn and CAST is the AST representation.
From the results in Table 1, we can see that CAST outperforms all the baselines on both datasets. CodeNN, Code2seq, Attgru, and NCS use only code or AST information. Among them, NCS performs better because it applies a Transformer to capture the long-range dependencies among code tokens. Astattgru and CodeAstnn outperform Attgru because of the additional AST channels. Note that our model outperforms the other baselines even without the copy mechanism or subtree aggregation. This is because we split an AST into block-level subtrees, and each subtree contains relatively complete semantics. In contrast, related work such as ASTNN splits an AST into statement-level subtrees, which represent only single statements and relatively fragmented semantics.
5.2 Comparison of Different AST Representations
We evaluate the performance of different AST representations by comparing CAST with Code2seq, HybridDrl, Astattgru, and CodeAstnn. Table 1 shows that CAST performs the best among them. As linearization-based methods, Astattgru flattens an AST into a sequence and Code2seq samples a set of paths from an AST, so both naturally lose some hierarchical information of ASTs. As a tree-based method, HybridDrl transforms ASTs into binary trees and trains on the full ASTs with tree-based models. This leads to the loss of AST structural information, the gradient vanishing problem, and a slow training process (21 hours per epoch on Funcom; see training time details in Appendix Tables 2 and 3). Both CodeAstnn and CAST perform better than HybridDrl, Code2seq, and Astattgru because they split a large AST into a set of small subtrees, which alleviates the gradient vanishing problem. CAST achieves the best performance, and we further explain this from two aspects: the splitting granularity of ASTs and the AST representation learning.
Regarding the splitting granularity of ASTs, CodeAstnn uses statement-level splitting, leading to subtrees 71% smaller than ours on TL-CodeSum (see dataset statistics in Appendix Tables 5 to 8). Therefore, it may not be able to fully capture the syntactic and semantic information. In terms of AST representation learning, CodeAstnn and CAST both use RvNN and max-pooling to learn the representations of subtrees but differ in how they aggregate them. The former applies an RNN-based model to aggregate the subtrees; it captures only the sequential structure, and convergence worsens as the number of subtrees increases BengioFS93. The latter applies an RvNN to aggregate all subtrees according to their relative positions in the hierarchy, which combines the semantics of subtrees well.
5.3 Ablation Study
To investigate the usefulness of the subtree aggregation (Sec. 3.1.2) and the copy mechanism (Sec. 3.3), we conduct ablation studies on two variants of CAST. The results of the ablation study are given at the bottom of Table 1.
CAST w/o aggregation: a variant without subtree aggregation, which directly uses the subtree vectors obtained by Eq. (2) as the AST representation. Our results show that the performance of this variant drops compared to CAST (except for Met. on Funcom), demonstrating that it is beneficial to reconstruct and aggregate information from subtrees.
CAST w/o copy: a variant without the copy mechanism. Our results show that CAST outperforms this variant, confirming that the copy mechanism can copy tokens (especially out-of-vocabulary ones) from the input code to improve the performance of summarization.
5.4 Human Evaluation
Table 2: Results of the human evaluation (standard deviations in parentheses).
Besides the textual-similarity-based metrics, we also conduct a human evaluation following previous work IyerKCZ16; Liu0T0L19; hu2019deep; WeiLLXJ20 to evaluate the semantic similarity of the summaries generated by CAST, Astattgru, NCS, and CodeAstnn. We randomly choose 50 Java methods from the testing sets (25 from TL-CodeSum and 25 from Funcom) along with their summaries generated by the four approaches. Specifically, we invite 10 volunteers with more than 3 years of software development experience and excellent English ability. Each volunteer is asked to assign a score from 0 to 4 (the higher the better) to each generated summary from three aspects: similarity between the generated summary and the ground-truth summary, naturalness (grammaticality and fluency), and informativeness (the amount of content carried over from the input code to the generated summary, ignoring fluency). Each summary is evaluated by four volunteers, and the final score is their average.
Table 2 shows that CAST outperforms the others in all three aspects. In particular, our approach scores best on informativeness, which means it tends to generate summaries with comprehensive semantics. In addition, we confirm the superiority of our approach using Wilcoxon signed-rank tests wilcoxon1970critical for the human evaluation. The results (see Appendix Table 9) reflect that the improvement of CAST over the other approaches is statistically significant, with all p-values smaller than 0.05 at the 95% confidence level (except for CodeAstnn on naturalness).
6 Threats to Validity
There are three main threats to validity. First, we evaluate and compare our work only on Java datasets; although in principle the model should generalize to other languages, experiments are needed to validate this. Also, the AST splitting algorithm needs to be ported to other languages by implementing a visitor over their ASTs.
Second, in neural network model design there are many orthogonal aspects, such as different token embeddings and the use of beam search or teacher forcing. When showing the generality of CAST, we have conducted the experiments in a controlled way. Future work could run all experiments in an even more controlled way, and the performance of CAST could rise further when combined with other orthogonal techniques.
Third, the summaries in the datasets are collected by extracting the first sentences of Javadoc comments. Although it is a common practice to place a method's summary in the first sentence of its Javadoc, there might still be some mismatched summaries. A higher-quality dataset built with better summary collection techniques is needed in the future.
7 Conclusion
In this paper, we propose CAST, a new model that splits the AST of source code into several subtrees, embeds each subtree, and aggregates the subtrees' information back to form the full AST representation. This representation, along with code token sequence information, is then fed into a decoder to generate code summaries. Experimental results have demonstrated the effectiveness of CAST and confirmed the usefulness of the abstraction technique. We believe our work sheds some light on future research by pointing out that there are better ways to represent source code for intelligent code understanding.