Source code summaries are important for program comprehension and maintenance since developers can quickly understand a piece of code by reading its natural language description. However, documenting code with summaries remains a labor-intensive and time-consuming task. As a result, code summaries are often mismatched, missing, or outdated in many projects [52, 7, 16]. Therefore, automatic generation of code summaries is desirable and many approaches have been proposed over the years [50, 43, 28, 30, 34, 27, 23, 25, 54, 24, 32, 60, 1]. Recently, deep learning (DL) based models are exploited to generate better natural language summaries for code snippets [27, 23, 25, 54, 24, 32, 60, 1]. These models usually adopt a neural machine translation framework to learn the alignment between code and summaries. Some studies also enhance DL-based models by incorporating information retrieval techniques [60, 57]. Generally, the existing neural source code summarization models show promising results on public datasets and claim their superiority over traditional approaches.
However, we notice that in the current code summarization work, there are many important details that could be easily overlooked and important issues that have not received much attention. These details and issues are associated with evaluation metrics, experimental datasets, and experimental settings. In this work, we would like to dive deep into the problem and answer: how to evaluate and compare the source code summarization models more correctly and comprehensively?
To answer the above question, we conduct systematic experiments of five representative source code summarization approaches (including CodeNN, Deepcom, Astattgru, Rencos, and NCS) on three widely used datasets (including TL-CodeSum, Funcom, and CodeSearchNet), under controlled experimental settings. We choose the five approaches with the consideration of representativeness and diversity. CodeNN is one of the first DL-based (RNN) models and utilizes code token sequence. Deepcom captures the syntactic and structural information from AST. Astattgru uses both code token sequence and AST. NCS is the first attempt to replace the previous RNN units with the more advanced Transformer model. Rencos is a representative model that combines information retrieval techniques with the generation model in the code summarization task.
Our experiments can be divided into three major parts. We first conduct an in-depth analysis of the BLEU metric, which is widely used in related work [27, 23, 25, 54, 24, 3, 32, 1, 60, 56, 14, 31, 57] (Section IV-A). Then, we explore the impact of different code pre-processing operations (such as token splitting, replacement, filtering, lowercase) on the performance of code summarization (Section IV-B). Finally, we conduct extensive experiments on the three datasets from three perspectives: corpus sizes, data splitting ways, and duplication ratios (Section IV-C).
Through extensive experimental evaluation, we obtain the following major findings about the current neural code summarization models:
First, we find that there is a wide variety of BLEU metrics used in prior work and they produce rather different results for the same generated summary. At the same time, we notice that many existing studies simply cite the original paper of BLEU without explaining their exact implementation. What’s worse, some software packages used for calculating BLEU are buggy: (1) they may produce a BLEU score greater than 100% (or even 700%), which greatly exaggerates the performance of code summarization models, and (2) the results also differ across package versions. Therefore, some studies may overestimate their model performance or may fail to achieve fair comparisons, even though they are evaluated on the same dataset with the same experimental setting. We further give some suggestions about the BLEU usage in Section IV-A.
Second, we find that different code pre-processing operations can affect the overall performance by a noticeable margin of -18% to +25%. Therefore, code pre-processing should be considered carefully during model evaluation. We also give suggestions on the choice of data pre-processing operations in Section IV-B.
Third, we find that the code summarization approaches perform inconsistently on different datasets. For instance, one approach may perform better than other approaches on one dataset and poorly on another dataset. Furthermore, three dataset attributes (corpus sizes, data splitting ways, and duplication ratios) have an important impact on the performance of code summarization models. For corpus size, as the size of the training set becomes larger, the performance of all models improves. For data splitting ways, all approaches perform worse on the dataset split by project (the same project can only exist in one partition: train, validation, or test set) than on the dataset split by method (randomly split dataset). That is, approaches only tested with datasets split by method may fail to generalize to new projects. For duplication ratios, we find that when the duplication ratio increases, the BLEU scores of all approaches increase, but the ranking among these approaches is not preserved.
We further give some suggestions about evaluation datasets in Section IV-C.
In summary, our findings indicate that in order to evaluate and compare code summarization models more correctly and comprehensively, we need to pay much attention to the implementation of BLEU metrics, the way of data pre-processing, and the usage of datasets.
Several previous surveys and empirical studies on code summarization are related to our work. For example, some surveys [44, 49, 61] provided a taxonomy of code summarization methods and discussed the advantages, limitations, and challenges of existing models from a high-level perspective. Song et al. also provided an overview of the evaluation techniques being used in existing methods. Gros et al. described an analysis of several machine learning approaches originally designed for the task of natural language translation for the code summarization task. They observed that different datasets were used in existing work and different metrics were used to evaluate different approaches. Our work differs from previous work in that we not only observe the inconsistent usage of different BLEU metrics but also conduct dozens of experiments on the five models and explicitly confirm that the inconsistent usage can cause severe problems in evaluating/comparing models. Moreover, we explore factors affecting model evaluation, which have not been systematically studied before, such as dataset size, dataset split methods, data pre-processing operations, etc. Different from the surveys, we provide extensive experiments on various datasets for various findings and corresponding discussions. Finally, we consolidate all findings and propose guidelines for evaluating code summarization models.
The major contributions of this work are as follows:
We conduct an extensive evaluation of five representative neural code summarization models, with different data pre-processing techniques, evaluation metrics, and datasets.
We conclude that many existing code summarization models are not evaluated comprehensively and do not generalize well in new experimental settings. Therefore, more research is needed to further improve code summarization models.
Based on the evaluation results, we give actionable suggestions for evaluating code summarization models from multiple perspectives.
II-A Code Summarization
Code summaries are short natural language descriptions of code snippets that can help developers better understand and maintain source code. However, in many software projects, code summaries are often absent or outdated. This has motivated many researchers to generate code summaries automatically. In the early stage of automatic source code summarization, template-based approaches [50, 19, 20, 13, 47] were widely used. However, a well-designed template requires expert domain knowledge. Therefore, information retrieval (IR) based approaches [19, 20, 13, 47] were proposed. The basic idea of the IR-based approach is to retrieve terms from source code to generate term-based summaries or to retrieve similar source code and use its summary as the target summary. However, the retrieved summaries may not correctly describe the semantics and behavior of code snippets, leading to mismatches between code and summaries.
Recently, Neural Machine Translation (NMT) based models are exploited to generate summaries for code snippets [27, 23, 25, 54, 15, 24, 3, 32, 56, 14, 1, 8, 5, 35, 58, 10, 31, 21, 59]. CodeNN is an early attempt that uses only code token sequences, followed by various approaches that utilize AST [23, 24, 3, 32, 31, 35], API knowledge, type information, global context [5, 21]54, 55], multi-task and dual learning [56, 58, 59], and pre-trained language models.
Hybrid approaches [60, 57] that combine the NMT-based and IR-based methods have been proposed and shown to be promising. For instance, Rencos, proposed by Zhang et al., obtains the two most similar code snippets based on the syntax-level and semantics-level information of the source code, and feeds the original code and the two retrieved code snippets to the model to generate summaries. Re2Com, proposed by Wei et al., is an exemplar-based summary generation method that retrieves a similar code snippet and summary pair from the corpus and then utilizes the seq2seq neural network to modify the summaries.
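II-B BLEU
In short, a BLEU score is a percentage number between 0 and 100 that measures the similarity between one sentence and a set of reference sentences using constituent n-gram precision scores. BLEU typically uses BLEU-1, BLEU-2, BLEU-3, and BLEU-4 (calculated from 1-gram, 2-gram, 3-gram, and 4-gram precisions) to measure the precision. A value of 0 means that the generated sentence has no overlap with the reference, while a value of 100 means perfect overlap with the reference. Mathematically, the n-gram precision $p_n$ is defined as:
$$p_n = \frac{\sum_{\text{n-gram} \in \text{hyp}} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in \text{hyp}} \mathrm{Count}(\text{n-gram})} \qquad (1)$$
where hyp is the generated sentence, $\mathrm{Count}$ is the number of times an n-gram occurs in it, and $\mathrm{Count}_{\mathrm{clip}}$ caps that number at the maximum number of times the n-gram occurs in a reference. BLEU combines all n-gram precision scores using the geometric mean:
$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \qquad (2)$$
where $w_n$ is the uniform weight $1/N$. The straightforward calculation will result in high scores for short sentences or sentences with repeated high-frequency n-grams. Therefore, the Brevity Penalty (BP) is used to scale the score and each n-gram in the reference is limited to be used just once.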
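The definition above can be illustrated with a minimal Python sketch of unsmoothed sentence-level BLEU. It is only an illustration under stated assumptions, not the implementation used by any of the evaluated approaches: the function names are hypothetical, inputs are already tokenized, and the reference length used by the brevity penalty is taken to be the one closest to the hypothesis length.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hypothesis, references, n):
    """Clipped n-gram precision p_n, returned as (numerator, denominator)."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Each hypothesis n-gram is credited at most as often as it occurs in a reference.
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped, sum(hyp_counts.values())


def brevity_penalty(hyp_len, ref_len):
    """Penalize hypotheses that are shorter than the reference."""
    if hyp_len == 0:
        return 0.0
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)


def sentence_bleu(hypothesis, references, max_n=4):
    """Unsmoothed BLEU in [0, 100] with uniform weights 1/max_n."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        numerator, denominator = modified_precision(hypothesis, references, n)
        if numerator == 0:
            return 0.0  # the geometric mean is zero if any p_n is zero
        log_sum += math.log(numerator / denominator) / max_n
    # Assumption: use the reference length closest to the hypothesis length.
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(hypothesis)), length))
    return 100.0 * brevity_penalty(len(hypothesis), ref_len) * math.exp(log_sum)
```

With this sketch, a hypothesis identical to its reference scores 100, while a two-token hypothesis scores 0 because it contains no 3-gram or 4-gram at all.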
The original BLEU was designed for corpus-level calculation. Therefore, it does not need to be smoothed, as $p_4$ is non-zero as long as there is at least one 4-gram match. For sentence-level BLEU, since the generated sentences and references are much shorter, $p_4$ is more likely to be zero when the sentence has no 4-gram or no 4-gram match. Then the geometric mean will be zero even if $p_1$, $p_2$, and $p_3$ are large. In this case, the score correlates poorly with human judgment. Therefore, several smoothing methods have been proposed to mitigate this problem.
There is an interpretation of BLEU scores by Google, which is shown in Table I. We also show the original BLEU scores reported by existing approaches in Table II. These scores vary a lot. Specifically, 19.61 for Astattgru would be interpreted as “hard to get the gist” and 38.17 for Deepcom would be interpreted as “understandable to good translations” according to Table I. However, this interpretation is contrary to previously reported results, where Astattgru is relatively better than Deepcom. To study this issue, we need to explore the difference and comparability of different metrics and experimental settings used in different methods.
Table I: Interpretation of BLEU scores
|BLEU score||Interpretation|
|10-19||Hard to get the gist|
|20-29||The gist is clear, but has significant|
|30-40||Understandable to good translations|
|40-50||High quality translations|
|50-60||Very high quality, adequate, and fluent translations|
|>60||Quality often better than human|
III Experimental Design
TL-CodeSum has 87,136 method-summary pairs crawled from 9,732 Java projects created from 2015 to 2016 with at least 20 stars. The ratio of the training, validation, and test sets is 8:1:1. Since all pairs are shuffled, there can be methods from the same project in the training, validation, and test sets. In addition, there are exact code duplicates among the three partitions.
CodeSearchNet is a well-formatted dataset containing 496,688 Java methods across the training, validation, and test sets. Duplicates are removed and the dataset is split into training, validation, and test sets in proportion with 8:1:1 by project (80% of projects into training, 10% into validation, and 10% into testing) such that code from the same repository can only exist in one partition.
Funcom is a collection of 2.1 million method-summary pairs from 28,945 projects. Auto-generated code and exact duplicates are removed. Then the dataset is split into three parts for training, validation, and testing with the ratio of 9:0.5:0.5 by project.
For a systematic evaluation, we modify some characteristics of the datasets (such as dataset size, deduplication, etc.) and obtain 9 new variants. In total, we experiment on 12 datasets, whose statistics are shown in Table III. In this paper, we use TLC, FCM, and CSN to denote TL-CodeSum, Funcom, and CodeSearchNet, respectively. TLC is the original TL-CodeSum. CSN and FCM are CodeSearchNet and Funcom with source code that cannot be parsed by javalang (https://github.com/c2nes/javalang) filtered out. These datasets mainly differ from each other in corpus sizes, data splitting ways, and duplication ratios. For corpus sizes, we set three magnitudes: small (the same size as TLC), medium (the same size as CSN), and large (the same size as FCM). Detailed descriptions of data splitting way and duplication can be found in Section III-D.
Table III: Statistics of the datasets
|Name||#Training||#Validation||#Test||#Class||#Project||Description|
|TLC||69,708||8,714||8,714||–||9,732||Original TL-CodeSum|
|CSN||454,044||15,299||26,897||136,495||25,596||Filtered CodeSearchNet|
|CSNProject-Medium||454,044||15,299||26,897||136,495||25,596||CSN split by project|
|CSNClass-Medium||448,780||19,716||28,192||136,495||25,596||CSN split by class|
|CSNMethod-Medium||447,019||19,867||29,802||136,495||25,596||CSN split by method|
|CSNMethod-Small||69,708||19,867||29,802||–||–||Subset of CSNMethod-Medium|
|FCM||1,908,694||104,948||104,777||–||28,790||Filtered Funcom|
|FCMProject-Large||1,908,694||104,948||104,777||–||28,790||Split FCM by project|
|FCMMethod-Large||1,908,694||104,948||104,777||–||28,790||Split FCM by method|
|FCMMethod-Medium||454,044||104,948||104,777||–||–||Subset of FCMMethod-Large|
|FCMMethod-Small||69,708||104,948||104,777||–||–||Subset of FCMMethod-Large|
III-B Evaluated Approaches
We describe the code summarization models used in this study:
Deepcom is an SBT-based (Structure-based Traversal) model, which is more capable of learning syntactic and structure information of Java methods.
Astattgru is a multi-encoder neural model that encodes both code and AST to learn lexical and syntactic information of Java methods.
Rencos enhances the neural model with the most similar code snippets retrieved from the training set. Therefore, it leverages both neural and retrieval-based techniques.
III-C Experimental Settings
We use the default hyper-parameter settings provided by each method and adjust the embedding size, hidden size, learning rate, and max epoch empirically to ensure that each model performs well on each dataset. We adopt max epoch 200 for TLC and TLCDedup (others are 40) and early stopping with patience 20 to enable convergence and generalization. In addition, we run each experiment 3 times and display the mean and standard deviation in the form of mean±std. All experiments are conducted on a machine with 252 GB main memory and 4 Tesla V100 32GB GPUs.
We use the implementations provided by each approach: CodeNN (https://github.com/sriniiyer/codenn), Astattgru (https://bit.ly/2MLSxFg), NCS (https://github.com/wasiahmad/NeuralCodeSum), and Rencos (https://github.com/zhangj111/rencos). For Deepcom, we re-implement the method according to the paper description since it is not publicly available (the code for our re-implementation is included in the anonymous link). We have checked the correctness by both reproducing the scores in the original paper and confirming with the authors of Deepcom.
III-D Research Questions
This study investigates three research questions from three aspects: metrics, pre-processing operations, and datasets.
RQ1: How do different evaluation metrics affect the performance of code summarization?
There are several metrics commonly used for various NLP tasks such as machine translation, text summarization, and captioning. These metrics include BLEU, Meteor, Rouge-L, Cider, etc. In RQ1, we only present BLEU as it is the most commonly used metric in the code summarization task. For other RQs, all metrics are calculated (some of the results are put into the Appendix due to space limitations). As stated in Section II-B, BLEU can be calculated at different levels and with different smoothing methods. There are many BLEU variants used in prior work and they could generate different results for the same generated summary. Here, we use the names of BLEU variants defined in prior work and add another BLEU variant: BLEU-DM, which is the Sentence BLEU without smoothing and is based on the implementation of NLTK3.2.4. The meanings of these BLEU variants are:
BLEU-NCS: This is a Sentence BLEU metric. It applies a Laplace-like smoothing by adding 1 to both the numerator and denominator of all $p_n$.
BLEU-RC: This is an unsmoothed Sentence BLEU metric. To avoid the divided-by-zero error, it adds a tiny number to the numerator and a small number to the denominator of $p_n$.
We first train and test the five approaches on TLC and TLCDedup, and measure their generated summaries using different BLEU variants. Then we introduce the differences among the BLEU variants in detail, and summarize the reasons for the differences from three aspects: different calculation levels (sentence-level vs. corpus-level), different smoothing methods, and problematic software implementations. Finally, we analyze the impact of each aspect and provide actionable guidance on the use of BLEU, such as how to choose a smoothing method, what problematic implementations should be avoided, and how to report the BLEU scores more clearly and comprehensively.
RQ2: How do different pre-processing operations affect the performance of code summarization?
There are various code pre-processing operations used in related work, such as token splitting, replacement, lowercase, filtering, etc. We select four operations that are widely used [23, 24, 25, 56, 32, 31, 57, 1, 60] to investigate whether different pre-processing operations would affect performance. The four operations are:
Replacement: replace strings and numbers with the generic symbols <STRING> and <NUM>.
Token splitting: split tokens using camelCase and snake_case.
Filtering: filter the punctuation in code.
Lowercase: lowercase all tokens.
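We define a bit-wise notation to denote different pre-processing combinations, in which each bit indicates whether the corresponding operation is performed or not. For example, a combination with two bits set and two bits unset stands for performing the two corresponding operations and preventing the other two. Then, we evaluate different pre-processing combinations on the TLCDedup dataset in Section IV-B.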
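The four operations and the bit-wise notation can be sketched in Python as follows. This is a simplified illustration rather than the pre-processing code of any evaluated approach: the function names, the regular expressions, and the bit order (replacement, token splitting, filtering, lowercase) are assumptions made for the example, and the input is assumed to be an already tokenized method.

```python
import re
from itertools import product


def replace_literals(tokens):
    """Replace string and number literals with <STRING> and <NUM>."""
    out = []
    for tok in tokens:
        if re.fullmatch(r'"[^"]*"', tok):
            out.append("<STRING>")
        elif re.fullmatch(r"\d+(\.\d+)?", tok):
            out.append("<NUM>")
        else:
            out.append(tok)
    return out


def split_identifiers(tokens):
    """Split camelCase and snake_case tokens into subtokens."""
    out = []
    for tok in tokens:
        spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", tok).replace("_", " ")
        out.extend(spaced.split())
    return out


def filter_punctuation(tokens):
    """Drop tokens that contain no letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]


def lowercase(tokens):
    """Lowercase all tokens."""
    return [tok.lower() for tok in tokens]


# Assumed bit order: replacement, token splitting, filtering, lowercase.
OPERATIONS = [replace_literals, split_identifiers, filter_punctuation, lowercase]


def preprocess(tokens, bits):
    """Apply the operations whose bit is '1', e.g. bits='0100' only splits tokens."""
    for bit, operation in zip(bits, OPERATIONS):
        if bit == "1":
            tokens = operation(tokens)
    return tokens


# All 16 combinations, from '0000' (no operation) to '1111' (all operations).
ALL_COMBINATIONS = ["".join(bits) for bits in product("01", repeat=4)]
```

Iterating over `ALL_COMBINATIONS` and re-training a model on each pre-processed variant of the data corresponds to the 16 settings compared in this research question.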
RQ3: How do different datasets affect the performance?
Many datasets have been used in source code summarization. We first evaluate the performance of different methods on three widely used datasets, which are different in three attributes: corpus size, data splitting methods, and duplication ratio. Then, we study the impact of the three attributes with the extended datasets shown in Table III. The three attributes we consider are as follows:
Data splitting methods: we investigate three data splitting ways: (1) by method: randomly split the dataset after shuffling all the samples, (2) by class: randomly divide the classes into the three partitions such that code from the same class can only exist in one partition, and (3) by project: randomly divide the projects into the three partitions such that code from the same project can only exist in one partition [26, 32].
Corpus sizes: we investigate three magnitudes of training set size: (1) small: the training size of TLC, (2) medium: the training size of CSN, and (3) large: the training size of FCM.
Duplication ratios: Code duplication is common in software development practice. This is often because developers copy and paste code snippets and source files from other projects. According to a large-scale study, more than 50% of files were reused in more than one open-source project. Normally, for evaluating neural network models, the training set should not contain samples in the test set. Thus, ignoring code duplication may result in model performance and generalization ability not being comprehensively evaluated according to the actual practice. Among the three datasets we experimented on, Funcom and CodeSearchNet contain no duplicates because they have been deduplicated, but we find the existence of 20% exact code duplication in TL-CodeSum. Therefore, we conduct experiments on TL-CodeSum with different duplication ratios to study this effect.
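As a rough illustration of how code duplication between partitions can be measured and removed, consider the following Python sketch. It is not the procedure used to build the datasets above: the function names are hypothetical, the duplication ratio is assumed to be the fraction of test samples whose code also occurs in the training set, and two snippets are treated as exact duplicates if they are identical after whitespace normalization.

```python
def normalize(code):
    """Collapse whitespace so that pure formatting differences are ignored."""
    return " ".join(code.split())


def duplication_ratio(train_codes, test_codes):
    """Fraction of test samples whose code also appears in the training set."""
    if not test_codes:
        return 0.0
    train_set = {normalize(code) for code in train_codes}
    duplicates = sum(1 for code in test_codes if normalize(code) in train_set)
    return duplicates / len(test_codes)


def remove_duplicates(train_codes, test_codes):
    """Return the test samples that do not occur in the training set."""
    train_set = {normalize(code) for code in train_codes}
    return [code for code in test_codes if normalize(code) not in train_set]
```

Controlling how many duplicated samples are kept in the test set is one way to obtain variants of a dataset with different duplication ratios.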
IV Experimental Results
IV-A How do different evaluation metrics affect the performance of code summarization? (RQ1)
Notes to Table IV: the table marks each BLEU variant as sentence BLEU or corpus BLEU and indicates its smoothing method, including no smoothing and add-one Laplace smoothing.
We experiment on the five approaches and measure their generated summaries using different BLEU variants. The results are shown in Table IV. We can find that:
The scores of different BLEU variants are different for the same summary. For example, the BLEU scores of Deepcom on TLC vary from 12.14 to 40.18. Astattgru is better than Deepcom in all BLEU variants.
The ranking of models is not consistent using different BLEU variants. For example, the score of Astattgru is higher than that of CodeNN in terms of BLEU-FC but lower than that of CodeNN in other BLEU variants on TLC.
Under the BLEU-FC measure, many existing models (except Rencos) have scores lower than 20 on the TLCDedup dataset. According to the interpretations in Table I, this means that under this experimental setting, the gist of the generated summaries is hard to get.
Next, we elaborate on the differences among the BLEU variants. The mathematical equation of BLEU is shown in Equation (2), which combines all n-gram precision scores using the geometric mean. The BP (Brevity Penalty) is used to scale the score because short sentences, such as single-word outputs, could otherwise have high precision.
BLEU was first designed for measuring a generated corpus; as such, it requires no smoothing, as some sentences would have at least one n-gram match. For sentence-level BLEU, $p_4$ will be zero when the example has no 4-gram match, and thus the geometric mean will be zero even if the lower-order precisions are large. For sentence-level measurement, it usually correlates poorly with human judgment. Therefore, several smoothing methods have been proposed. NLTK (the Natural Language Toolkit, https://github.com/nltk/nltk), which is a popular toolkit with 9.7K stars, implements the corpus-level and sentence-level measures with different smoothing methods and is widely used in evaluating generated summaries [23, 24, 56, 25, 32, 31, 51, 57]. However, there are problematic implementations in different NLTK versions, which make some BLEU variants unusable. We further explain these differences in detail.
IV-A1 Sentence vs. corpus BLEU
The BLEU scores calculated at the sentence level and at the corpus level are different, which is mainly caused by the different strategies for merging all sentences. The corpus-level BLEU treats all sentences as a whole, where the numerator of $p_n$ is the sum of the numerators of all sentences’ $p_n$, and the denominator of $p_n$ is the sum of the denominators of all sentences’ $p_n$. Then the final BLEU score is calculated by the geometric mean of these corpus-level $p_n$. Different from corpus-level BLEU, sentence-level BLEU is calculated by separately calculating the BLEU scores for all sentences, and then the arithmetic average of them is used as sentence-level BLEU. In other words, sentence-level BLEU aggregates the contributions of each sentence equally, while for corpus-level BLEU, the contribution of each sentence is positively correlated with the length of the sentence. Because of the different calculation methods, the scores of the two are not comparable. We thus suggest explicitly reporting at which level BLEU is calculated.
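The two aggregation strategies can be contrasted with a small Python sketch. It is an unsmoothed, single-reference simplification with hypothetical function names, intended only to show where the two calculations diverge; it is not the code of any BLEU variant discussed here.

```python
import math
from collections import Counter


def ngram_stats(hyp, ref, n):
    """Clipped n-gram matches and total n-grams of `hyp` against one reference."""
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Counter intersection keeps the minimum count, which performs the clipping.
    return sum((hyp_counts & ref_counts).values()), sum(hyp_counts.values())


def bleu_from_stats(stats, hyp_len, ref_len):
    """Unsmoothed BLEU in [0, 100] from one (matches, total) pair per n-gram order."""
    if any(matches == 0 for matches, _ in stats):
        return 0.0
    log_p = sum(math.log(matches / total) for matches, total in stats) / len(stats)
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_p)


def sentence_level_bleu(pairs, max_n=4):
    """Arithmetic mean of per-sentence BLEU: every sentence counts equally."""
    scores = [
        bleu_from_stats([ngram_stats(h, r, n) for n in range(1, max_n + 1)], len(h), len(r))
        for h, r in pairs
    ]
    return sum(scores) / len(scores)


def corpus_level_bleu(pairs, max_n=4):
    """Sum numerators and denominators over all sentences before combining."""
    stats = []
    for n in range(1, max_n + 1):
        per_sentence = [ngram_stats(h, r, n) for h, r in pairs]
        stats.append((sum(m for m, _ in per_sentence), sum(t for _, t in per_sentence)))
    return bleu_from_stats(stats, sum(len(h) for h, _ in pairs), sum(len(r) for _, r in pairs))


pairs = [
    ("returns the name of the user".split(), "returns the name of the user".split()),
    ("set value".split(), "closes the stream".split()),
]
sentence_score = sentence_level_bleu(pairs)  # mean of one perfect and one zero score
corpus_score = corpus_level_bleu(pairs)      # dominated by the longer, perfect sentence
```

In this toy example the longer sentence contributes most of the n-gram counts, so the corpus-level score is considerably higher than the sentence-level average of the same outputs.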
IV-A2 Smoothing methods
Smoothing methods decide how to deal with the case where the number of matched n-grams is 0. Since BLEU combines all n-gram precision scores ($p_n$) using the geometric mean, BLEU will be zero as long as any n-gram precision is zero. One may add a small number to $p_n$; however, the geometric mean will then be near zero. Thus, many smoothing methods have been proposed. Chen et al. summarized 7 smoothing methods. Smoothing methods 1-4 replace 0 with a small positive value, which can be a constant or a function of the generated sentence length. Smoothing methods 5-7 average the $(n-1)$-, $n$-, and $(n+1)$-gram matched counts in different ways to obtain the n-gram matched count. We plot the curve of the smoothed n-gram precision under different smoothing methods applied to sentences of varying lengths in Fig. 1 (upper). We can find that the values calculated by different smoothing methods can vary a lot, especially for short sentences, which is the case for code summaries.
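A short Python sketch makes the effect of two simple smoothing strategies concrete. The functions and the example counts are illustrative assumptions and do not reproduce any of the 7 smoothing methods exactly: one strategy replaces a zero numerator with a tiny constant, and the other adds 1 to the numerator and denominator of every n-gram precision.

```python
import math


def geometric_mean(precisions):
    """Geometric mean; zero as soon as any precision is zero."""
    if any(p == 0 for p in precisions):
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))


def smooth_epsilon(counts, eps=1e-9):
    """Replace zero numerators with a tiny constant before dividing."""
    return [(matches if matches > 0 else eps) / total for matches, total in counts]


def smooth_add_one(counts):
    """Add 1 to both the numerator and the denominator of every p_n."""
    return [(matches + 1) / (total + 1) for matches, total in counts]


# Hypothetical (matches, total) counts for 1-grams to 4-grams of a short summary.
counts = [(4, 5), (2, 4), (1, 3), (0, 2)]

unsmoothed = geometric_mean([matches / total for matches, total in counts])
with_epsilon = geometric_mean(smooth_epsilon(counts))
with_add_one = geometric_mean(smooth_add_one(counts))
```

For these counts the unsmoothed score is exactly zero, the tiny-constant variant stays close to zero, and the add-one variant is above 0.5, which shows how strongly the choice of smoothing can change the score of a short sentence.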
IV-A3 Bugs in software packages
We measure the same summaries generated by CodeNN in three BLEU variants (BLEU-DM, BLEU-FC, and BLEU-DC), which are all based on the NLTK implementation (but with different versions). From Table V, we can observe that the scores of BLEU-DM and BLEU-DC are very different under different NLTK versions. This is because method0 and method4 are implemented incorrectly in some versions, and a buggy implementation can cause up to a 97% performance difference for the same metric across versions.
Smoothing method0 bug. method0 (meaning no smoothing) of NLTK3.2.x only combines the non-zero precision values of all n-grams using the geometric mean. For example, BLEU is the geometric mean of $p_1$, $p_2$, and $p_3$ when $p_4 = 0$ and $p_1, p_2, p_3 \neq 0$.
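The consequence of skipping zero precisions can be illustrated with the following Python sketch. It is not NLTK's code: the function names are hypothetical, and the sketch assumes, following the description above, that the remaining non-zero precisions are combined by a plain geometric mean; the actual library may weight the terms differently.

```python
import math


def unsmoothed_score(precisions):
    """Expected behaviour without smoothing: any zero precision gives a zero score."""
    if any(p == 0 for p in precisions):
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))


def nonzero_only_score(precisions):
    """Behaviour described above: zero precisions are silently skipped."""
    nonzero = [p for p in precisions if p > 0]
    if not nonzero:
        return 0.0
    return math.exp(sum(math.log(p) for p in nonzero) / len(nonzero))


# Hypothetical precisions p_1..p_4 of a summary without any 4-gram match.
precisions = [0.8, 0.5, 0.25, 0.0]

expected = unsmoothed_score(precisions)    # 0.0
inflated = nonzero_only_score(precisions)  # cube root of 0.8 * 0.5 * 0.25
```

A summary that should score 0 without smoothing is thus reported with a clearly positive score, which explains why the same unsmoothed variant can differ across versions.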
Smoothing method4 bugs. method4 is implemented problematically in different NLTK versions. We plot the curves of the different smoothing method4 implementations in NLTK in Fig. 1 (bottom), where the correct version is NLTK3.6.x. In NLTK versions 3.2.2 to 3.4.x, the smoothed precision is computed in a way that always inflates the score for sentences of different lengths (Fig. 1). The correct method4 assigns n-grams with 0 matches a smoothed value that depends on the length of the generated sentence and on a geometric sequence starting from 1/2. In NLTK3.5.x, the smoothed value depends on the length of the generated sentence in such a way that it can be assigned a percentage number that is much greater than 100% (even 700%). We have reported this issue (https://github.com/nltk/nltk/issues/2676) and filed a pull request (https://github.com/nltk/nltk/pull/2681) to the NLTK GitHub repository, which has been accepted and merged into the official NLTK library and released in NLTK3.6.x (the revision is shown in Fig. 2). Therefore, NLTK3.6.x should be used when using smoothing method4.
From the above experiments, we can conclude that the BLEU variants used in prior work on code summarization are different from each other and that the differences can threaten the validity of the claimed results. Thus, it is unfair and risky to compare different models without using the same BLEU implementation. For instance, it is unacceptable that researchers ignore the differences among the BLEU variants and directly compare their results with the BLEU scores reported in other papers. We use the correct implementation to calculate BLEU scores in the following experiments.
Notes to Table V: (a) Except for versions 3.2 and 3.2.1, as these versions are buggy with the ZeroDivisionError exception. Please refer to https://github.com/nltk/nltk/issues/1458 for more details. (b) NLTK3.6.x are the versions with the BLEU calculation bug fixed by us.
Summary. The BLEU measure should be described precisely, including calculation level (sentence or corpus) and smoothing method being used. Implementation correctness should be carefully checked before use. Identified buggy ones are: method0 in NLTK3.2.x and method4 from NLTK3.2.2 to NLTK3.5.x.
IV-B The effect of different pre-processing operations (RQ2)
In order to evaluate the individual effect of the four code pre-processing operations and the effect of their combinations, we train and test four models (CodeNN, Astattgru, Rencos, and NCS) under 16 different code pre-processing combinations. Note that Deepcom is not included in this experiment as it does not use source code directly. In the following experiments, we have performed calculations on all metrics. Due to space limitations, we present the scores under BLEU-CN and BLEU-DC for RQ2 and BLEU-CN for RQ3. All findings still hold for other metrics, and the omitted results can be found in the Appendix.
As shown in Table VI, we can observe that for all models, performing identifier splitting is always better than not performing it, while it is not clear whether to perform the other three operations. Then, we conduct the two-sided t-test and the Wilcoxon-Mann-Whitney test to statistically evaluate the difference between using or dropping each operation. The significance signs (*) labelled in Table VI mean that the p-values of the statistical tests at the 95% confidence level are less than 0.05. The results confirm that the improvement achieved by performing identifier splitting is statistically significant, while performing the other three operations does not lead to statistically different results. The detailed statistical test scores can be found in the Appendix. As pointed out in prior work, the OOV (out of vocabulary) ratio is reduced after splitting compound words, and using subtokens allows a model to suggest neologisms, which are unseen in the training data. Many studies [2, 17, 41, 6, 39] have shown that the performance of neural language models can be improved after handling the OOV problem. Therefore, the performance is improved after performing the identifier splitting pre-processing.
Next, we evaluate the effect of different combinations of the four code pre-processing operations and show the result in Table VII and Table VIII. For each model, we mark the top 5 scores in red and the bottom 5 scores in blue. From Table VII we can find that:
Different pre-processing operations can affect the overall performance by a noticeable margin.
One pre-processing combination is in the top 5 for all approaches and is therefore a recommended code pre-processing method.
One pre-processing combination is in the bottom 5 for all approaches and is therefore not recommended.
Generally, the ranking of the different models is consistent under different code pre-processing settings.
Summary. To choose the best pre-processing operations, different combinations should be tested, as different models prefer different pre-processing and the difference can be large (from -18% to +25%). Among them, identifier splitting and the combination that is in the top 5 for all approaches are recommended, while the combination that is in the bottom 5 for all approaches is not recommended.
IV-C How do different datasets affect the performance? (RQ3)
To answer RQ3, we evaluate the five approaches on the three base datasets: TLC, CSN, and FCM. From Table IX, we can find that:
The performance of the same model is different on different datasets.
The ranking among the approaches is not preserved when evaluating them on different datasets. For instance, Rencos outperforms other approaches on TLC but is worse than Astattgru and NCS on the other two datasets. CodeNN performs better than Astattgru on TLC, but Astattgru outperforms CodeNN on the other two datasets.
The average performance of all models on TLC is better than the other two datasets, although TLC is much smaller (about 96% less than FCM and 84% less than CSN).
The average performance of FCM is better than that of CSN.
Summary. To more comprehensively evaluate different models, it is recommended to use multiple datasets, as the ranking of model can be inconsistent on different datasets.
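One simple way to quantify whether a ranking is preserved between two datasets is a rank correlation over the per-model scores. The sketch below computes Kendall's tau (the tau-a variant, assuming no tied scores) with the standard library only. The model names and scores are placeholders invented for illustration; they are not results from Table IX, and the paper does not state that this statistic was used.

```python
from itertools import combinations


def ranking(scores):
    """Return model names sorted from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same models.

    +1 means identical rankings, -1 means fully reversed rankings.
    Assumes no ties in either set of scores.
    """
    models = list(scores_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs


# Placeholder scores (hypothetical, NOT taken from the paper).
dataset_x = {"model_a": 30.0, "model_b": 25.0, "model_c": 20.0}
dataset_y = {"model_a": 18.0, "model_b": 22.0, "model_c": 20.0}

print(ranking(dataset_x))
print(ranking(dataset_y))
print(kendall_tau(dataset_x, dataset_y))
```

A value close to 1 would indicate that one dataset suffices for ranking the models, whereas a low or negative value signals the kind of inconsistency reported above.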
Since many factors make the three datasets different, we further explore the reasons for the above results using the controlled variable method, studying three aspects: corpus size, data splitting way, and duplication ratio.
IV-C1 The impact of different corpus sizes
We evaluate all models on two groups (one group contains CSNMethod-Medium and CSNMethod-Small, the other group contains FCMMethod-Large, FCMMethod-Medium and FCMMethod-Small). Within each group, the test sets are the same; the only difference is the corpus size.
The results are shown in Table X. We can find that the ranking between models is preserved across different corpus sizes. Also, as the size of the training set becomes larger, the performance of the five approaches improves in both groups, which is consistent with the findings of previous work. We can also find that, compared to the other models, the performance of Deepcom does not improve significantly when the size of the training set increases. We suspect that this is due to the high OOV ratio (shown at the bottom of Table X), which affects the scalability of the Deepcom model [22, 29]. Deepcom uses only SBT and represents an AST node as a concatenation of the type and value of the AST node, resulting in a sparse vocabulary. As a result, the OOV ratio remains high even when the training set becomes larger, and Deepcom cannot fully leverage the larger datasets.
Summary. If additional data is available, one can enhance the performance of models by training with more data since the performance improves as the size of the training set becomes larger.
| OOV Ratio of Deepcom | 91.90% | 88.94% | 88.32% | 91.49% | 85.81% |
| OOV Ratio of Others | 63.36% | 53.09% | 48.60% | 60.99% | 34.00% |
IV-C2 The impact of different data splitting ways
In this experiment, we evaluate the five approaches on two groups (one group contains FCMProject-Large and FCMMethod-Large, and the other contains CSNProject-Medium, CSNClass-Medium, and CSNMethod-Medium). The datasets within each group differ only in the data splitting way. From Table XI, we can observe that all approaches perform differently under different data splitting ways, and they all perform better on the dataset split by method than on the one split by project. This is because similar tokens and code patterns are used in the methods from the same project [45, 33, 37]. In addition, when the data splitting ways are different, the rankings between the approaches remain basically unchanged, which indicates that using only one data splitting way does not harm the fairness of the comparison across approaches.
Summary. Different data splitting ways significantly affect the absolute performance of all models. However, the ranking of the models remains basically unchanged. Therefore, if data availability or time is limited, it is also reliable to compare different models under only one data splitting way.
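The difference between the two splitting ways can be sketched as follows. This is a minimal illustration, assuming each sample is a dictionary with hypothetical `project`, `code`, and `summary` fields; the function names and the toy corpus are not taken from the datasets' actual tooling. Splitting by method shuffles individual methods, so methods of one project may appear in both the training and test sets, whereas splitting by project assigns whole projects to one side.

```python
import random


def split_by_method(samples, test_fraction, seed=0):
    """Shuffle individual methods; one project may end up in both sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_by_project(samples, test_fraction, seed=0):
    """Assign whole projects to train or test; no project spans both."""
    projects = sorted({s["project"] for s in samples})
    random.Random(seed).shuffle(projects)
    n_test = max(1, int(len(projects) * test_fraction))
    test_projects = set(projects[:n_test])
    train = [s for s in samples if s["project"] not in test_projects]
    test = [s for s in samples if s["project"] in test_projects]
    return train, test


def shared_projects(train, test):
    """Projects that contribute samples to both the train and test sets."""
    return {s["project"] for s in train} & {s["project"] for s in test}


# Toy corpus (hypothetical): 4 projects with 5 methods each.
samples = [
    {"project": f"p{i}", "code": f"code_{i}_{j}", "summary": f"summary_{i}_{j}"}
    for i in range(4)
    for j in range(5)
]

m_train, m_test = split_by_method(samples, test_fraction=0.25)
p_train, p_test = split_by_project(samples, test_fraction=0.25)

print("shared projects (by method): ", sorted(shared_projects(m_train, m_test)))
print("shared projects (by project):", sorted(shared_projects(p_train, p_test)))
```

Under the project-level split the set of shared projects is empty by construction, which removes the project-specific tokens and code patterns that make the method-level split easier.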
IV-C3 The impact of different duplication ratios
To simulate scenarios with different code duplication ratios, we construct synthetic test sets from TLCDedup by adding random samples from the training set to the test set. Then, we train the five models using the same training set and test them using the synthetic test sets with different duplication ratios (i.e., the test sets with random samples). From the results shown in Fig. 3, we can find that:
The BLEU scores of all approaches increase as the duplication ratio increases.
The score of Rencos increases significantly when the duplication ratio increases. We speculate that this is because the duplicated samples are retrieved by the retrieval module in Rencos. Therefore, retrieval-based models could benefit more from code duplication.
In addition, the ranking of the models is not preserved under different duplication ratios. For instance, CodeNN outperforms Astattgru without duplication but is no better than Astattgru at the other duplication ratios.
Summary. To evaluate the performance of neural code summarization models, it is recommended to use deduplicated datasets so that the generalization ability of the model itself can be tested. However, in real scenarios, duplications are natural. Therefore, we suggest evaluating models under different duplication ratios. Moreover, it is recommended to consider incorporating retrieval techniques to improve the performance especially when code duplications exist.
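The construction of synthetic test sets can be sketched as below. This is only an illustration under stated assumptions: the duplication ratio is taken to mean the fraction of samples in the final test set that also occur in the training set, samples are represented as hashable `(code, summary)` tuples, and the function names and toy data are hypothetical. With n deduplicated test samples and a target ratio r, adding k = r·n/(1−r) training samples yields k/(n+k) = r.

```python
import random


def add_duplicates(train_set, dedup_test_set, ratio, seed=0):
    """Return a test set in which roughly `ratio` of the samples come from training.

    Assumes 0 <= ratio < 1. If the training set has fewer than the
    required number of samples, all of them are used and the target
    ratio is not reached.
    """
    if not 0 <= ratio < 1:
        raise ValueError("ratio must be in [0, 1)")
    n = len(dedup_test_set)
    k = round(ratio * n / (1 - ratio))  # solves k / (n + k) = ratio
    k = min(k, len(train_set))
    duplicates = random.Random(seed).sample(train_set, k)
    return dedup_test_set + duplicates


def duplication_ratio(train_set, test_set):
    """Fraction of test samples that also appear in the training set."""
    train_keys = set(train_set)
    return sum(sample in train_keys for sample in test_set) / len(test_set)


# Toy data (hypothetical): (code, summary) pairs with no initial overlap.
train_set = [(f"train_code_{i}", f"train_summary_{i}") for i in range(100)]
dedup_test_set = [(f"test_code_{i}", f"test_summary_{i}") for i in range(80)]

for target in (0.0, 0.2, 0.5):
    synthetic = add_duplicates(train_set, dedup_test_set, target)
    print(target, len(synthetic), duplication_ratio(train_set, synthetic))
```

Keeping the original deduplicated samples in every synthetic test set means that any change in the score across ratios is attributable to the added duplicates alone.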
V Threats to Validity
We have identified the following main threats to validity:
Programming languages. We only conduct experiments on Java datasets. Although, in principle, the models and experiments are not specifically designed for Java, more evaluations are needed when generalizing our findings to other languages. In the future, we will extend our study to other programming languages.
The quality of summaries. The summaries in all datasets are collected by extracting the first sentences of Javadoc. Although it is common practice to place a method's summary in the first sentence according to the Javadoc guidelines (http://www.oracle.com/technetwork/articles/java/index-137868.html), there might still be some incomplete or mismatched summaries in the datasets.
Data difference. We observe that even when we control all three factors (splitting methods, duplication ratios, and dataset sizes), the performance of the same model still varies greatly between different datasets (the results are given in the Appendix due to space limitations). This indicates that the differences in training data may also be a factor that affects the performance of code summarization. We leave it to future work to study the impact of data differences.
Models evaluated. We covered all representative models with different characteristics: RNN-based and Transformer-based models, single-channel and multi-channel models, and models with and without retrieval techniques. However, our findings may not hold for other models that are outside the scope of our study.
Human evaluation. We use quantitative evaluation metrics to evaluate the code summarization results. Although these metrics are used in almost all related work, qualitative human evaluation can further confirm the validity of our findings. We defer a thorough human evaluation to future work.
In this paper, we conduct an in-depth analysis of recent neural code summarization models. We have investigated several aspects of model evaluation: evaluation metrics, datasets, and code pre-processing operations. Our results show that all these aspects have a large impact on evaluation results. Without a carefully and systematically designed experiment, neural code summarization models cannot be fairly evaluated and compared. Our work also suggests some actionable guidelines, including: (1) using proper (and maybe multiple) code pre-processing operations; (2) selecting and reporting BLEU metrics explicitly (including sentence or corpus level, smoothing method, NLTK version, etc.); and (3) considering the dataset characteristics when evaluating and choosing the best model. We believe our results and findings can be of great help to practitioners and researchers working in this area.
In future work, we will extend our study to programming languages other than Java. We will also explore more important attributes of datasets and investigate better techniques for building a higher-quality parallel corpus. Furthermore, we plan to extend our actionable guidelines to other text generation tasks in software engineering, such as commit message generation.
- [1] (2020) A transformer-based approach for source code summarization. In ACL.
- [2] (2016) A convolutional attention network for extreme summarization of source code. In ICML, JMLR Workshop and Conference Proceedings, Vol. 48, pp. 2091–2100.
- [3] (2019) Code2seq: generating sequences from structured representations of code. In ICLR (Poster).
- [4] (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In IEEvaluation@ACL.
- [5] (2021) Project-level encoding for neural source code summarization of subroutines. In ICPC.
- [6] (2002) Modelling out-of-vocabulary words for robust speech recognition. Ph.D. Thesis, Massachusetts Institute of Technology.
- [7] (2003) Software documentation: how much is enough? In CSMR, pp. 13.
- [8] (2020) TAG: type auxiliary guiding for code comment generation. In ACL.
- [9] (2014) A systematic comparison of smoothing techniques for sentence-level BLEU. In WMT@ACL, pp. 362–367.
- [10] (2018) A neural framework for retrieval and summarization of source code. In ASE.
- [11] (2007) AutoML: evaluating models.
- [12] (2011) Statistics for research. Vol. 512, John Wiley & Sons.
- [13] (2013) Evaluating source code summarization techniques: replication and expansion. In ICPC, pp. 13–22.
- [14] (2020) CodeBERT: A pre-trained model for programming and natural languages. In EMNLP (Findings).
- [15] (2019) Structured neural summarization. In ICLR.
- [16] (2002) The relevance of software documentation, tools and technologies: a survey. In ACM Symposium on Document Engineering, pp. 26–33.
- [17] (2017) Improving neural language models with a continuous cache. In ICLR (Poster).
- [18] (2020) Code to comment "translation": data, metrics, baselining & evaluation. In ASE.
- [19] (2010) Supporting program comprehension with source code summarization. In ICSE, Vol. 2, pp. 223–226.
- [20] (2010) On the use of automated text summarization techniques for summarizing source code. In WCRE, pp. 35–44.
- [21] (2020) Improved automatic summarization of subroutines via attention to file context. In MSR.
- [22] (2017) Are deep neural networks the best choice for modeling source code? In FSE.
- [23] (2018) Deep code comment generation. In ICPC.
- [24] (2019) Deep code comment generation with hybrid lexical and syntactical information. Empirical Software Engineering.
- [25] (2018) Summarizing source code with transferred api knowledge. In IJCAI.
- [26] (2019) CodeSearchNet challenge: evaluating the state of semantic code search. arXiv Preprint.
- [27] Summarizing source code using a neural attention model. In ACL.
- [28] (2002) CCFinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Software Eng. 28 (7), pp. 654–670.
- [29] (2020) Big code != big vocabulary: open-vocabulary models for source code. In ICSE, pp. 1073–1085.
- [30] (2005) An empirical study of code clone genealogies. In ESEC/SIGSOFT FSE, pp. 187–196.
- [31] (2020) Improved code summarization via a graph neural network. In ICPC, pp. 184–195.
- [32] (2019) A neural model for generating natural language summaries of program subroutines. In ICSE.
- [33] (2019) Recommendations for datasets for source code summarization. In NAACL.
- [34] (2006) CP-miner: finding copy-paste and related bugs in large-scale software code. IEEE Trans. Software Eng. 32 (3), pp. 176–192.
- [35] (2021) Improving code summarization with block-wise abstract syntax tree splitting. In ICPC.
- [36] (2004) ROUGE: a package for automatic evaluation of summaries. In ACL.
- [37] (2019) ATOM: commit message generation based on abstract syntax tree and hybrid ranking. arXiv.
- [38] (2017) Déjàvu: a map of code duplicates on github. In OOPSLA.
- [39] (2013) Better word representations with recursive neural networks for morphology. In CoNLL, pp. 104–113.
- [40] On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics 18 (1), pp. 50–60.
- [41] (2017) Pointer sentinel mixture models. In ICLR (Poster).
- [42] (2007) Large-scale code reuse in open source software. In First International Workshop on Emerging Trends in FLOSS Research and Development (FLOSS’07: ICSE Workshops 2007), pp. 7–7.
- [43] (2013) Automatic generation of natural language summaries for java classes. In ICPC, pp. 23–32.
- [44] (2016) Summarizing software artifacts: a literature review. Journal of Computer Science and Technology 31 (5), pp. 883–909.
- [45] (2020) Learning to update natural language comments based on code changes. In ACL.
- [46] (2002) Bleu: a method for automatic evaluation of machine translation. In ACL, pp. 311–318.
- [47] (2014) Improving automated source code summarization via an eye-tracking study of programmers. In ICSE, pp. 390–401.
- [48] (2017) Get to the point: summarization with pointer-generator networks. In ACL.
- [49] (2019) A survey of automatic generation of source code comments: algorithms and techniques. IEEE Access 7, pp. 111411–111428.
- [50] (2010) Towards automatically generating summary comments for java methods. In ASE, pp. 43–52.
- [51] (2020) A human study of comprehension and code summarization. In ICPC, pp. 2–13.
- [52] (1992) Documenting software systems with views. In SIGDOC, pp. 211–219.
- [53] (2015) CIDEr: consensus-based image description evaluation. In CVPR.
- [54] (2018) Improving automatic source code summarization via deep reinforcement learning. In ASE.
- [55] (2020) Reinforcement-learning-guided source code summarization via hierarchical attention. IEEE Transactions on Software Engineering.
- [56] (2019) Code generation as a dual task of code summarization. In NeurIPS, pp. 6559–6569.
- [57] (2020) Retrieve and refine: exemplar-based neural comment generation. In ASE, pp. 349–360.
- [58] (2021) Exploiting method names to improve code summarization: a deliberation multi-task learning approach. In ICPC.
- [59] (2020) Leveraging code generation to improve code retrieval and summarization via dual learning. In The Web Conference.
- [60] (2020) Retrieval-based neural source code summarization. In ICSE.
- [61] (2019) Automatic code summarization: a systematic literature review. arXiv preprint arXiv:1909.04352.