Manually writing comments is very time-consuming, and code comments are often low-quality, missing, or mismatched after the software is upgraded (de2005study; kajko2005survey). To assist developers in writing high-quality comments or fill in absent comments, code comment generation techniques have been proposed, which aim to generate a summary for a given code snippet automatically (moreno2013automatic; eddy2013evaluating; iyer2016summarizing; hu2018deep; zhang2020retrieval; wei2020retrieve).
Most existing code comment generation approaches can be categorized into two orthogonal types: the information retrieval (IR) based approaches (haiduc2010supporting; haiduc2010use; edmund2014mining; wong2015clocom; kamiya2002ccfinder; li2006cp; kim2005empirical; liu2018commitmsg), which leverage the comments of retrieved similar code snippets to generate comments for code snippets, and the neural-based approaches (iyer2016summarizing; hochreiter1997long; hu2018deep; leclair2019neural), which treat the comment generation task as a translation problem and build neural machine translation (NMT) models to generate comments for code snippets. IR-based approaches can directly leverage existing, manually written comments, which may contain rare words or project-specific information that is difficult for NMT to generate (koehn2017rareword). In contrast, the neural-based approaches perform more robustly on general and newly arriving samples thanks to their generalization capability (koehn2017rareword). Therefore, recent studies (zhang2020retrieval; wei2020retrieve) have gradually focused on combining the strengths of the IR-based and neural-based approaches to achieve better performance. Specifically, most of the existing approaches bind IR- and neural-based approaches statically, i.e., each input code sample and the similar code snippet retrieved for it by the IR-based approach are fed to the NMT model to generate comments, regardless of whether the retrieved code snippet is actually similar to the input. In this paper, we refer to these approaches as IR+NMT approaches.
However, despite the tremendous progress of existing IR+NMT approaches, our pilot study reveals that such a combination is not generalizable and can lead to performance degradation. For instance, Figure 1 shows an example in which the comment from the retrieved similar code snippet is a perfect match for the input code sample; thus, there is no need to feed it into the neural-based model. In contrast, Figure 2 shows another example in which a retrieved sample is lexically highly similar to the input sample in code, while their comments are unrelated; feeding such false-positive code snippets into a neural-based model will confuse the model and further degrade its performance.
In this paper, to tackle the issue of existing static binding of IR- and neural-based approaches, we propose a straightforward but effective approach to combine the strengths of the IR-based and neural-based approaches in a dynamic manner. Specifically, given an input code snippet, we first use an IR-based approach to retrieve a similar code snippet from the corpus. Then we use a Cross-Encoder based classifier to select the comment generation method to be used dynamically, i.e., if the retrieved similar code snippet is a true positive, we directly reuse the existing comment from the similar sample retrieved by the IR technique. Otherwise, we pass the input to the neural-based approach to generate its comment.
To evaluate our approach, we conduct experiments on a large-scale dataset provided by LeClair et al. (leclair2019neural), which comes from the Sourcerer repository and contains about 2M code-comment pairs. We employ BLEU (papineni2002bleu), METEOR (banerjee2005meteor), ROUGE-L (lin2004rouge), and CIDER (vedantam2015cider) as evaluation metrics to evaluate predicted comments. The experimental results show that our approach outperforms state-of-the-art baselines on all selected metrics. Specifically, our approach achieves a BLEU score of 25.45, which improves on the state-of-the-art IR-based approach, neural-based approach, and their combination by 41%, 26%, and 7%, respectively.
The main contributions of this paper are as follows:
We propose a straightforward but effective approach to combine the IR-based and neural-based comment generation approaches in a dynamic manner.
We have designed a Cross-Encoder based classifier, which dynamically selects the comment generation method to be used for each input sample.
We conduct extensive experiments on a large-scale dataset to evaluate the performance of our approach. The experiment results show the effectiveness of our approach.
We release the source code of our approach and the dataset of our experiments to help other researchers replicate and extend our study (https://zenodo.org/record/4757011).
The rest of this paper is organized as follows. Section 2 presents the background of this study. Section 3 describes the details of our approach. Section 4 and Section 5 present the experiment setup and results. Section 6 discusses the strengths of our approach and threats to validity. Section 7 reviews related work. Finally, we conclude our work in Section 8.
2.1. Neural Machine Translation
Recent neural-based comment generation approaches (iyer2016summarizing; hu2018deep; leclair2019neural; hu2020deep; zhang2020retrieval) treat comment generation as an end-to-end neural machine translation (NMT) task and leverage the encoder-decoder Sequence-to-Sequence (Seq2Seq) model to learn the translation pattern. Specifically, at each time step $t$, the encoder reads one token $x_t$ from the input code snippet sequence $X = (x_1, \dots, x_n)$ and updates the current hidden state $h_t$:

$$h_t = f(x_t, h_{t-1})$$

where $f$ is a neural unit, e.g., GRU (cho2014gru) or LSTM (hochreiter1997long).

The attention mechanism (bahdanau2014attention) is adopted to focus on the critical part of the input code during decoding. For predicting the target word $y_i$, a context vector $c_i$ is calculated as a weighted sum of all hidden states $h_1, \dots, h_n$:

$$c_i = \sum_{j=1}^{n} \alpha_{ij} h_j$$

The weight $\alpha_{ij}$ of each hidden state $h_j$ is calculated as follows:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j)$$

where $s_{i-1}$ denotes the last hidden state of the decoder and $a$ is an alignment model, e.g., a Multi-Layer Perceptron (MLP) unit (pal1992mlp).

At time step $i$, the hidden state $s_i$ of the decoder is updated by:

$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$

where $y_{i-1}$ is the previously generated token. Then, the decoder generates the target sequence $Y$ by sequentially predicting the conditional probability of a word $y_i$ based on the hidden state $s_i$ and the context vector $c_i$:

$$p(y_i \mid y_1, \dots, y_{i-1}, X) = g(y_{i-1}, s_i, c_i)$$

where $g$ is the generator function, e.g., an MLP layer (pal1992mlp) followed by softmax.

The cross-entropy loss function is used to train the Seq2Seq model, i.e., minimizing the following objective function:

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{L} \log p\left(y_j^{(i)} \mid y_{<j}^{(i)}, X^{(i)}; \theta\right)$$

where $\theta$ denotes the trainable parameters, $N$ is the number of training instances, and $L$ is the length of each target sequence. $y_j^{(i)}$ means the $j$th word in the $i$th instance.
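As a minimal sketch of the attention step described above, the following computes softmax-normalized attention weights and the resulting context vector for one decoding step. The function names and the dot-product alignment are illustrative assumptions; the models discussed in this paper use a learned alignment model such as an MLP.

```python
import math

def attention_context(decoder_state, encoder_states, align):
    """Return (weights, context) for a single decoding step."""
    # Alignment score between the previous decoder state and each encoder state
    scores = [align(decoder_state, h) for h in encoder_states]
    # Softmax over the scores (shifted by the maximum for numerical stability)
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the encoder hidden states
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

def dot_align(s, h):
    # Hypothetical alignment model: a dot product stands in for the MLP
    return sum(a * b for a, b in zip(s, h))

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context([0.5, 0.5], encoder_states, dot_align)
```

In this toy input the third encoder state has the largest alignment score, so it receives the largest weight and dominates the context vector.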
2.2. Semantic Textual Similarity
To better distinguish false-positive samples, like the example shown in Figure 2, we treat determining whether the retrieved results are similar to the input samples as a supervised learning task. The semantic textual similarity (STS) task aims to determine the semantic similarity of a given sentence pair, which is similar to our task. The sentence pair fed to the semantic classifier consists of the input code snippet and the retrieved code snippet. The predicted label is whether the retrieved result is accurate.
Cross-Encoder (devlin2018bert) is one of the state-of-the-art methods for the semantic textual similarity (STS) task. The structure of the Cross-Encoder is shown in Figure 4. For a given sentence pair $(A, B)$, the Cross-Encoder concatenates the two sentences with a special token ([SEP]) and encodes them simultaneously. A pre-trained multi-head attention model (e.g., BERT (devlin2018bert)) is used to encode the concatenated sequence. In the encoding process, the self-attention mechanism allows the two input sentences to perceive each other's information at a fine-grained level. The embedding result is fed into a classifier layer that produces an output value between 0 and 1, indicating the semantic similarity.
In this paper, we use a Cross-Encoder based classifier to identify samples with accurate retrieved results. For the pre-trained model of the Cross-Encoder, we choose CodeBERT (feng2020codebert), which is trained on a large-scale code corpus consisting of Java and five other programming languages (husain2019codesearchnet). Compared with models pre-trained on natural language only, CodeBERT saves the effort of semantic migration from natural language to programming language during fine-tuning.
In this work, we propose a comment generation approach that combines the strengths of the IR- and neural-based comment generation approaches dynamically. The key idea of our approach is straightforward: given an input code snippet, we first use an IR-based approach to retrieve a similar code snippet from the corpus. Then we use a Cross-Encoder based classifier to select the comment generation method to be used dynamically, i.e., if the retrieved similar code snippet is a true positive, we directly use the IR result. Otherwise, we pass the input to the neural-based approach to generate the comment. Unlike existing IR+NMT approaches (zhang2020retrieval; wei2020retrieve), we do not pass the information obtained by the IR-based approach to the neural network model, so that textually similar but semantically dissimilar retrieved results cannot confuse the model.
3.1. Overview of Our Approach
The workflow of our approach is shown in Figure 3. Given an input sample, our approach generates its comment using the following three steps: 1) Comment generation with the IR-based technique (Section 3.2). In this step, our approach extracts the comment from the most similar sample retrieved from the corpus through the IR-based retrieval technique. 2) Evaluation of the retrieved result (Section 3.3). We use a Cross-Encoder based classifier to determine whether the retrieved code snippet is semantically similar to the input. We assume that directly leveraging the existing comment from a true-positive similar sample, which may contain low-frequency words and project-specific information that are hard for NMT to generate (koehn2017rareword; zhang2020retrieval; wei2020retrieve), will be more accurate and informative than the result generated by NMT models. Therefore, when the retrieved code snippet is similar to the input, our approach reuses the comment of the retrieved code snippet. Otherwise, we assume that the comment for the current sample needs to be produced by a generation-based method. 3) Comment generation with the neural-based technique (Section 3.4). For each input sample whose retrieval result is determined to be inaccurate in the previous step, the neural model is used to automatically generate its comment based on the input code snippet and the corresponding AST sequence.
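The three steps above amount to a simple dispatch, which can be sketched as follows. This is an illustration only: `retrieve`, `is_ir_accurate`, and `nmt_generate` are hypothetical callables standing in for the BM25 retriever, the Cross-Encoder based classifier, and the NMT model, and the toy stand-ins below are not part of our implementation.

```python
def generate_comment(code, retrieve, is_ir_accurate, nmt_generate):
    """Pick the IR comment or the NMT comment for one input snippet."""
    # Step 1: retrieve the most similar snippet and its existing comment
    retrieved_code, retrieved_comment = retrieve(code)
    # Step 2: let the classifier decide whether the IR result is reusable
    if is_ir_accurate(code, retrieved_code):
        return retrieved_comment, "IR"
    # Step 3: otherwise fall back to the neural model
    return nmt_generate(code), "NMT"

# Toy stand-ins (hypothetical) to exercise both branches
corpus_entry = ("return this . name", "returns the name")

def toy_retrieve(code):
    return corpus_entry

def toy_classifier(code, retrieved_code):
    return code == retrieved_code

def toy_nmt(code):
    return "generated comment"

reused = generate_comment("return this . name", toy_retrieve, toy_classifier, toy_nmt)
generated = generate_comment("close the stream", toy_retrieve, toy_classifier, toy_nmt)
```

Note that the retrieved snippet is never passed to the neural model; it only influences which of the two outputs is returned.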
3.2. Comment Generation with The IR-based Technique
This step aims to provide an existing comment for each input sample that may be reusable from the retrieved similar code snippet.
To identify the most similar sample for a given sample, in this work, we reuse the retrieval method of Re2Com (wei2020retrieve), which is based on the lexical similarity of code. The retrieval module of Re2Com uses the training set as the corpus. It retrieves the sample whose code snippet has the highest lexical similarity to the input according to BM25, a widely used similarity metric, as implemented in the search engine Lucene (https://lucene.apache.org/). For each term in the given code snippet, its relevance score to the candidate code snippet is calculated based on the term frequency. Then, the BM25 score between the input and candidate code snippet is calculated as a weighted sum of the relevance score of each term, where the weight of each term is calculated based on its inverse document frequency. Finally, the candidate code snippet with the highest BM25 score is selected as the retrieved result. Note that the IR-based approach does not have a training process. We use the BM25 settings from Re2Com to run our experiments.
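The BM25 scoring described above can be sketched in a few lines. This is a simplified, self-contained illustration rather than the Lucene implementation used in our experiments; the parameter values `k1=1.2` and `b=0.75` and the exact IDF form are assumptions of the sketch, and the function names are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.2, b=0.75):
    """Score every tokenized document in `corpus` against `query`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Weight of the term: inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Relevance of the term: saturating, length-normalized term frequency
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores

def retrieve_most_similar(query, corpus):
    """Return the index and score of the highest-scoring document."""
    scores = bm25_scores(query, corpus)
    best = max(range(len(corpus)), key=scores.__getitem__)
    return best, scores[best]

corpus = [
    "public string get name return name".split(),
    "public void set name string name".split(),
    "public int get size return size".split(),
]
query = "public string get user name return name".split()
best_index, best_score = retrieve_most_similar(query, corpus)
```

In this toy corpus the first snippet shares the most terms with the query, so it is the one returned.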
3.3. Evaluate The Retrieved Result with The Cross-Encoder based Classifier
In the previous step, we have provided an existing comment from the retrieved similar code snippet for each input sample. However, as shown in Figure 2, the results of the IR technique can be incorrect. To make a more accurate determination, we use a semantic model to compare the semantics of the input and the retrieved code snippet and predict whether the IR result is accurate and can be directly reused.
To identify samples with accurate IR results, we compare the input with the retrieved code snippet semantically rather than textually, because judging the quality of IR results from text similarity alone is not accurate enough. As shown in Figure 2, the input and the retrieved code snippet are very similar, with only 2-3 tokens different. However, their corresponding comments have only one token in common. In this work, we use the Cross-Encoder model, one of the state-of-the-art methods for the semantic textual similarity (STS) task, for the semantic comparison. Figure 4 shows the structure of the Cross-Encoder. The input to the model is the input code snippet and the retrieved code snippet. The two snippets are concatenated into a sequence through the special token [SEP] provided by BERT (devlin2018bert) and simultaneously passed to a pre-trained multi-layer transformer (vaswani2017attention) network for embedding. We choose CodeBERT (feng2020codebert) as the pre-trained model to save the effort of semantic migration. The embedding result of the two snippets is fed into a linear classifier layer that produces an output value between 0 and 1, indicating the degree of semantic similarity:

$$\hat{y} = \mathrm{sigmoid}(W \cdot E)$$

where $\hat{y}$ is the predicted degree of semantic similarity, $W$ is the weight of the linear layer, and $E$ is the embedding result of the input and retrieved code snippet.

The training process fine-tunes the semantic model on pairs of code snippets so that the model returns 1 if a semantically similar snippet is retrieved and 0 otherwise. We use the classic cross-entropy loss function to fine-tune the model:

$$\mathcal{L} = -\left(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right)$$

where $y$ indicates the golden label of whether the retrieved result is accurate.
The details of how we train the Cross-Encoder based classifier are available in Section 4.2.2.
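To make the scoring and the loss of this section concrete, the sketch below computes a similarity value from a toy embedding and evaluates the cross-entropy loss for both possible labels. It assumes that the value between 0 and 1 is obtained by applying a sigmoid to the output of the linear layer; the weights and the embedding are made-up numbers, not values from the trained model.

```python
import math

def similarity_score(weights, embedding):
    """Linear layer followed by a sigmoid (assumed), yielding a value in (0, 1)."""
    z = sum(w * e for w, e in zip(weights, embedding))
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a golden label in {0, 1} and a predicted score."""
    # Clamp the prediction so that log() never receives 0
    y_pred = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

score = similarity_score([0.8, -0.4, 0.1], [1.5, 0.5, 2.0])
loss_if_accurate = binary_cross_entropy(1, score)
loss_if_inaccurate = binary_cross_entropy(0, score)
```

Because the toy score is above 0.5, the loss is smaller when the golden label marks the retrieved result as accurate than when it marks it as inaccurate.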
3.4. Comment Generation with The Neural-based Technique
In the previous step, we have identified samples with accurate IR results. For the remaining input samples, we use the neural-based approach to generate comments. Specifically, in this step, we first build and train an NMT model on our corpus. Then we feed the samples that were determined to have inaccurate IR results in the previous step to this model to generate comments. This step aims to use the generalization ability of NMT to generate comments for general input samples.
In this step, we use the state-of-the-art neural-based comment generation method, i.e., DeepCom (hu2020deep). DeepCom is an encoder-decoder model with the attention mechanism (bahdanau2014attention). The input of the model contains both code and AST sequences, where the code sequence contains semantic information such as identifier names, and the AST sequence contains structural information. Using semantic and structural information from the input code snippet simultaneously can help the model understand the code more clearly and predict more accurately (hu2020deep). The model uses two encoders to encode the code sequence and the AST sequence, respectively. We follow the model training and tuning processes described in DeepCom (hu2020deep) to re-train the models on our corpus (details are in Section 4.2.1).
4. Experiment Design
We use the FunCom dataset provided by LeClair et al. (leclair2019neural) to conduct our experiments, which has been used in many existing studies (leclair2019neural; wei2020retrieve; haque2021action). The FunCom dataset is collected from the large Sourcerer repository (lopes2010uci), which contains over 50,000 projects and 5.1 million Java methods. LeClair et al. treat the first sentence of the Javadoc of each method as its comment (kramer1999javadoc), use srcML (collard2011srcml) to extract AST sequences from source code, and then serialize them with the SBT method proposed by Hu et al. (hu2018deep). To reduce the vocabulary size, LeClair et al. apply a series of preprocessing steps to the code and comment text: splitting identifiers in code and comments by camel case and underscore, removing non-alpha characters (including symbols) from the text, and converting the text to lowercase. To better simulate the case where only the AST is known, identifiers in the AST sequence are replaced with <OTHER>. To reduce duplicate samples between the training and test sets, LeClair et al. use a heuristic rule (shimonaka2016removeauto) to filter out auto-generated code, which is highly repetitive and too easy for the model to learn and predict. In addition, LeClair et al. divide all the data by project in the dataset building stage: data from 90% of projects are used as the training set, 5% as the validation set, and 5% as the test set. After filtering, the FunCom dataset has about 2M code-comment pairs for training and testing.
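The text preprocessing described above (identifier splitting, removal of non-alpha characters, lowercasing) can be approximated with a few regular expressions. This is a rough sketch, not the FunCom preprocessing script; the function name and the handling of acronyms such as `HTTP` are assumptions.

```python
import re

def preprocess(text):
    """Split identifiers, drop non-alpha characters, and lowercase the text."""
    # Split camel case at a lowercase-or-digit to uppercase boundary: getName -> get Name
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    # Split an acronym from a following capitalized word: HTTPResponse -> HTTP Response
    text = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", text)
    # Replace every run of non-alpha characters (underscores, digits, symbols) with a space
    text = re.sub(r"[^A-Za-z]+", " ", text)
    return text.lower().split()

tokens = preprocess("getUserName(int max_size)")
acronym_tokens = preprocess("parseHTTPResponse")
```

For the first call the result is `['get', 'user', 'name', 'int', 'max', 'size']`, which shows how a single identifier contributes several frequent tokens instead of one rare token.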
To the best of our knowledge, FunCom is the most suitable dataset for our study: it contains a large amount of data and excludes noisy data, thus allowing us to evaluate the model's generalization ability more accurately.
4.2. Experiment Settings
In this work, we train both DeepCom (hu2020deep) and the Cross-Encoder based classifier (devlin2018bert) on the FunCom dataset. Their training details are as follows.
4.2.1. Training Details of DeepCom
We use the default settings of DeepCom for training, i.e., the encoder and decoder use a single-layer Gated Recurrent Unit (GRU) structure (cho2014gru). The dimensions of both the word embeddings and the GRU hidden states are set to 256. In the decoding stage, beam search (wiseman2016beamsearch) is leveraged to obtain more accurate results, with the beam width set to 5. We use the entire FunCom dataset for training and validation. DeepCom is trained on the FunCom training set (19,548,008 samples in total). Following DeepCom, we use a Stochastic Gradient Descent (SGD) based optimizer to train the model, with the initial learning rate set to 0.5 and the learning rate decay factor set to 0.95. In addition, to save GPU memory, we set the batch size to 256. Every 2000 training steps, a checkpoint is saved and validated on the FunCom validation set (104,273 samples in total). After 20 epochs of training (about 150,000 steps), the best parameters are selected from the checkpoint that performs best on the validation set. We trained the model on a Linux server with an NVIDIA RTX 2060S GPU with 8GB memory, which took about 70 hours.
4.2.2. Training Details of Cross-Encoder Based Classifier
We use the Sentence-BERT (reimers2019sentencebert) package to build and train the Cross-Encoder based classifier. In order to save the effort of language semantics migration, we adopt the widely used CodeBERT pre-trained model (feng2020codebert), a 12-layer bidirectional transformer (vaswani2017attention) network.
To label the dataset for training the Cross-Encoder based classifier, we use code-comment pairs from the validation set of FunCom (104,273 samples in total). For each sample, we use the IR-based approach (details are in Section 3.2) to retrieve the most similar code snippet, and the corresponding comment is treated as the IR result. Then we use a trained neural model (i.e., DeepCom) to generate its comment, i.e., the NMT result. The label of the sample is whether the IR result is more accurate. Specifically, we use the sentence_bleu metric in the NLTK (loper2002nltk) package to calculate the similarities of the IR result and the NMT result to the ground truth, respectively. If the score of the IR result is greater than the score of the NMT result, the sample is labeled as positive; otherwise, it is labeled as negative. We further exclude from the positive samples the cases where both methods perform poorly (e.g., both the IR result and the NMT result fail to hit any word in the ground truth comment). Finally, we obtain a triplet for each sample: <Input code snippet, Retrieved code snippet, Is_IR_Result_Better?>. After labeling the data, we take 90% of the triplets (93,846 samples) for training, and the remaining 10% (10,427 samples) are used as a development set for tuning the parameters and testing.
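The labeling rule can be sketched as follows. To keep the example free of external dependencies, a simple unigram overlap stands in for NLTK's sentence_bleu; this substitution, the `min_score` floor that models "both methods perform poorly", and the choice to label such cases as negative are all assumptions of the sketch.

```python
def unigram_overlap(candidate, reference):
    """Stand-in for sentence_bleu: share of candidate tokens found in the reference."""
    cand = candidate.split()
    ref = set(reference.split())
    if not cand:
        return 0.0
    return sum(tok in ref for tok in cand) / len(cand)

def label_sample(ir_comment, nmt_comment, ground_truth,
                 score=unigram_overlap, min_score=0.0):
    """Return 1 if the IR result should be reused, else 0."""
    ir_score = score(ir_comment, ground_truth)
    nmt_score = score(nmt_comment, ground_truth)
    ir_is_better = ir_score > nmt_score
    # Hypothetical floor: an IR result that barely matches is not a positive sample
    ir_is_usable = ir_score > min_score
    return 1 if (ir_is_better and ir_is_usable) else 0

truth = "returns the user name"
positive = label_sample("returns the user name", "gets the name", truth)
negative = label_sample("closes the stream", "returns the user name", truth)
weak_ir = label_sample("the foo", "bar baz", truth, min_score=0.6)
```

Here `positive` is 1, while `negative` and `weak_ir` are both 0: in the last case the IR result beats the NMT result but still matches the ground truth too poorly to count as a positive sample.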
We use the Adam optimizer (kingma2014adam) to train the Cross-Encoder based classifier, with the initial learning rate set to 2e-5 and the learning rate decay factor set to 0.99. We set the batch size to 16 and, every 2000 training steps, save a checkpoint and validate it on the development set. After fine-tuning for 5 epochs (about 55,000 steps), the best parameters are selected from the checkpoint that performs best on the development set. We fine-tuned the model on a Linux server with an NVIDIA Titan RTX GPU with 24GB memory, which took about 3 hours.
4.3.1. Baselines for Evaluating Our Comment Generation Approach
To investigate the performance of our comment generation method, we selected the IR-based approach (details are in Section 3.2), four state-of-the-art neural-based comment generation methods (zhang2020retrieval; leclair2019neural; hu2020deep), and two state-of-the-art IR+NMT methods (zhang2020retrieval; wei2020retrieve) as baselines.
1) Neural-based methods
Rencos NMT module (zhang2020retrieval) is the NMT module of Rencos (zhang2020retrieval), a standard attentional Seq2Seq model where the encoder is a bidirectional LSTM and the decoder is an LSTM. This baseline represents a fundamental way of applying NMT to the code-to-comment problem, i.e., training an NMT model with code as input and comment as output.
attendgru (leclair2019neural) is an attentional Seq2Seq-like model. This baseline predicts only one word at a time. In the encoding process, the model encodes both the code sequence and the output sequence predicted in previous steps. In the decoding process, the model predicts the next most likely word and appends it to the output sequence for the subsequent prediction steps.
ast-attendgru (leclair2019neural) is also an attentional Seq2Seq-like model. This baseline adds the AST as an additional input to improve the prediction performance. LeClair et al. (leclair2019neural) use the traversal method SBT (hu2018deep) to flatten the AST into a sequence and add an additional encoder for the AST sequence.
DeepCom (hu2020deep) is a standard attentional Seq2Seq model, where the encoder and the decoder are both Gated Recurrent Unit (GRU). The inputs to the model are code and AST sequences. As our proposed method takes the prediction results of this baseline as the NMT results, improvement from combining IR results can be directly measured by comparing the performance of our proposed method with this baseline.
2) IR+NMT methods
Rencos (zhang2020retrieval) combines IR-based and neural-based comment generation by feeding the code snippets that are most similar to the input at the semantic level and at the syntactic level, as retrieved by the IR-based approach, into the neural-based approach to generate the comment. Specifically, given an input code snippet, Rencos retrieves its two most similar code snippets, one at the semantic level and one at the syntactic level. Then, the input code snippet and its two similar ones are fed separately into a trained code-to-comment NMT model to generate the comment.
Re2Com (wei2020retrieve) uses additional encoders to encode information from the retrieved sample of IR-based approaches. For a given code snippet, a similar sample with the highest text similarity is retrieved from the corpus. Then Re2Com takes the given code, its AST, code, and comment of the similar sample as input and encodes them by four different encoders. The encoding results are fused by the similarity between the input and the retrieved code and then passed to the decoder to obtain the predicted comment.
4.3.2. Baselines for Evaluating Cross-Encoder Based Classifier
To evaluate the effectiveness of our Cross-Encoder based classifier (details are in Section 3.3) in determining whether IR results are accurate, we adopt two other classification methods as the baselines.
Lexical-level Similarity is a simple method that determines whether the IR result is accurate based on the lexical similarity between the input and the retrieved code. If the similarity is greater than an appropriate threshold, we assume that the IR result is accurate and treat it directly as the output; otherwise, the neural-based approach is used to generate the comment. We follow (gros2020code) and use the sentence_bleu metric in the NLTK (loper2002nltk) package to calculate the lexical similarity. This method does not require training but needs an appropriate threshold at which the dynamic combination of IR- and neural-based approaches achieves optimal performance. To find the optimal threshold, we experiment with threshold values from 0 to 1 at an interval of 0.05. When the threshold value is 0.40, this approach achieves optimal performance on FunCom's validation set. Thus, we use 0.40 as the threshold value in our experiments.
Siamese Network (bromley1993signature) is another state-of-the-art method for the semantic textual similarity (STS) task. It consists of two identical encoders, which share the same model structure and parameters, to encode the two input sentences separately. Then, the distance between the two embeddings is treated as the semantic similarity between the sentence pair. We use an implementation from GitHub (https://github.com/tlatkowski/multihead-siamese-nets) to build a Siamese network, which uses a bidirectional LSTM (Bi-LSTM) (schuster1997bilstm) with a hidden size of 256 as the encoder structure and uses the Manhattan distance between the embedding vectors of the input sentence pair as the similarity. Like the Cross-Encoder, the Siamese network is trained on the labeled dataset described in Section 4.2.2.
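The threshold search used for the lexical-level similarity baseline can be sketched as a simple sweep. In this illustration each validation sample is a hypothetical triple of (lexical similarity, quality of the IR comment, quality of the NMT comment), and the mean per-sample quality stands in for the BLEU score of the combined output; the numbers are made up.

```python
def sweep_thresholds(samples, step=0.05):
    """Return the threshold in [0, 1] with the best mean quality of the combined output."""
    n_steps = round(1 / step)
    best_threshold, best_quality = None, float("-inf")
    for i in range(n_steps + 1):
        threshold = round(i * step, 2)
        # Reuse the IR comment only when the similarity exceeds the threshold
        total = sum(
            ir_quality if similarity > threshold else nmt_quality
            for similarity, ir_quality, nmt_quality in samples
        )
        mean_quality = total / len(samples)
        if mean_quality > best_quality:
            best_threshold, best_quality = threshold, mean_quality
    return best_threshold, best_quality

# Hypothetical validation samples: (similarity, IR quality, NMT quality)
validation = [
    (0.90, 1.0, 0.6),
    (0.70, 0.8, 0.5),
    (0.45, 0.2, 0.5),
    (0.20, 0.1, 0.4),
]
best_threshold, best_quality = sweep_thresholds(validation)
```

On these toy samples the sweep stops improving once the two low-similarity samples are routed to the NMT output, which is the behaviour the baseline relies on.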
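| Type | Approach | BLEU | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | ROUGE-L | CIDER |
|---|---|---|---|---|---|---|---|---|---|
| IR-based | Re2Com Retrieve Module | 18.04 | 32.04 | 17.84 | 14.4 | 12.88 | 15.41 | 30.64 | 1.643 |
| Neural-based | Rencos NMT Module | 19.15 | 34.64 | 20.58 | 15.11 | 12.49 | 18.92 | 39.54 | 2.074 |
| Our Method | | 25.45 (41%/26%) | 43.92 (37%/7%) | 27.08 (51%/19%) | 20.38 (41%/34%) | 17.3 (34%/47%) | 22.03 (42%/10%) | 43.21 (41%/7%) | 2.46 (49%/20%) |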
4.4. Evaluation Metrics
4.4.1. Metrics for Evaluating Generated Comments
In our experiments, we follow Rencos (zhang2020retrieval) and evaluate the performance of different comment generation methods with four common metrics, i.e., BLEU (papineni2002bleu), METEOR (banerjee2005meteor), ROUGE-L (lin2004rouge), and CIDER (vedantam2015cider), which are widely used in machine translation (sutskever2014machine), text summarization (rush2015summarization), and image captioning (you2016image).
BLEU (papineni2002bleu) measures the similarity between the generated comment and the ground truth by the geometric mean of the $n$-gram matching precision scores $p_n$. A brevity penalty $BP$ is used to penalize very short generated sentences.

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where $w_n$ is the uniform weight $1/N$, and $N$ is set to 4 in our paper. We report a composite BLEU score in addition to BLEU1 through BLEU4 in our experiments.
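As an illustration of how the n-gram precisions and the brevity penalty interact, the following is a minimal sentence-level BLEU with uniform weights and a single reference. It is a sketch without smoothing and is not the evaluation script used in our experiments; the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of the 1..max_n-gram precisions."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)

reference = "returns the name of the user".split()
candidate = "returns the name of user".split()
score = bleu(candidate, reference)
```

The candidate drops one word, so its score is reduced both by the missed higher-order n-grams and by the brevity penalty for being shorter than the reference.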
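METEOR (banerjee2005meteor) calculates the similarity score from the unigram precision $P$ and recall $R$, multiplied by a penalty for word order:

$$METEOR = \left(1 - \gamma \cdot frag^{\beta}\right) \cdot \frac{P \cdot R}{\alpha \cdot P + (1 - \alpha) \cdot R}$$

where $frag$ is the fragmentation fraction. $\alpha$, $\beta$, and $\gamma$ are three parameters whose default values are 0.9, 3.0, and 0.5, respectively.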
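ROUGE-L (lin2004rouge) is calculated by the Longest Common Subsequence (LCS) matching F-score. Suppose the lengths of the target sentence ($X$) and the predicted sentence ($Y$) are $m$ and $n$, respectively, and the length of the LCS between them is $LCS(X, Y)$, then:

$$P_{lcs} = \frac{LCS(X, Y)}{n}, \qquad R_{lcs} = \frac{LCS(X, Y)}{m}, \qquad F_{lcs} = \frac{(1 + \beta^2) P_{lcs} R_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

where $F_{lcs}$ is the value of ROUGE-L, $P_{lcs}$ and $R_{lcs}$ denote the LCS precision and recall, respectively, and $\beta = P_{lcs} / R_{lcs}$.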
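The LCS-based F-score can be computed with a standard dynamic program, as in the sketch below. Here $\beta$ is exposed as a parameter, and its default of 1 (a balanced F-score) is an assumption of the sketch rather than the setting of our evaluation script.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, 1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, prediction, beta=1.0):
    """LCS-based F-score between a reference and a predicted token list."""
    lcs = lcs_length(reference, prediction)
    if lcs == 0:
        return 0.0
    precision = lcs / len(prediction)  # LCS length over predicted length n
    recall = lcs / len(reference)      # LCS length over target length m
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

reference = "returns the name of the user".split()
prediction = "returns the user name".split()
score = rouge_l(reference, prediction)
```

Unlike n-gram precision, the LCS rewards words that appear in the same order even when they are not adjacent, so `returns the ... name` still counts as a match of length 3 here.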
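CIDER (vedantam2015cider) examines whether the prediction result has captured the critical information. Given the generated summary $c_i$ and the ground truth $S_i = \{s_{i1}, \dots, s_{iM}\}$, CIDER is calculated from the frequency of $n$-grams and TF-IDF weighting:

$$CIDER_n(c_i, S_i) = \frac{1}{M} \sum_{j=1}^{M} \frac{g^n(c_i) \cdot g^n(s_{ij})}{\lVert g^n(c_i) \rVert \cdot \lVert g^n(s_{ij}) \rVert}, \qquad CIDER(c_i, S_i) = \sum_{n=1}^{N} w_n \, CIDER_n(c_i, S_i)$$

where $N$ is set to 4, $g^n(\cdot)$ denotes the TF-IDF weight vector of all $n$-grams in a sentence, and $M$ represents the number of reference sentences for each sample (in our work, $M = 1$). The final result is calculated by summing the scores for different $n$-grams ($n = 1, \dots, N$) with weight $w_n = 1/N$.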
4.4.2. Metrics for Evaluating Cross-Encoder Based Classifier
To evaluate whether the classifier can accurately distinguish samples with accurate IR results, we use four metrics commonly used in classification problems to verify the performance of our Cross-Encoder based classifier and baselines, i.e., accuracy, precision, recall, and F1-score.

$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}, \qquad Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$

where $TP$/$FP$ denotes the number of positive samples identified by the classifier that are/are not samples with accurate IR results, and $TN$/$FN$ denotes the number of negative samples identified by the classifier that are/are not samples with inaccurate IR results.
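For reference, the four classification metrics can be computed from the confusion counts as in the following sketch, where label 1 means "the IR result is accurate". The function name and the toy labels are illustrative only.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = IR result accurate)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    # Guard against empty denominators when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

golden = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
metrics = classification_metrics(golden, predicted)
```

With two true positives, one false positive, two true negatives, and one false negative, all four metrics evaluate to 2/3 in this toy example.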
4.5. Research Questions
We perform a large-scale comparative study to answer the following three research questions for evaluating our approach.
RQ 1 (Performance): How does our approach compare to the commonly-used and state-of-the-art comment generation baselines?
RQ 2 (Accuracy of classification): What is the accuracy of our Cross-Encoder based classifier?
RQ 3 (Generalizability): Does our approach work with other NMT methods?
In RQ1, we investigate the quality of the comments generated by our proposed approach by comparing it with seven state-of-the-art baselines (details are in Section 4.3.1). In RQ2, we evaluate whether the Cross-Encoder based classifier can effectively distinguish samples with accurate retrieved results by comparing it with two baselines (details are in Section 4.3.2). In RQ3, we explore whether our approach is applicable to other neural comment generation approaches, i.e., whether it can still obtain a significant improvement from dynamically combining them with IR results.
5. Result Analysis
5.1. RQ 1: Our Approach vs. Baselines
Experimental Method. To answer this research question, we compare our approach with comment generation baselines listed in Section 4.3. All baselines are trained on the FunCom training set. We compare generated comments of our approach and other baselines on the FunCom test set by four evaluation metrics described in Section 4.4.1.
Result. Table 1 shows the performance of our method compared to other comment generation baselines. Overall, our approach achieves the best performance on all evaluation metrics. Our approach achieves a 26% improvement on BLEU and a 7%-47% improvement on other metrics compared to DeepCom, the state-of-the-art neural-based approach, and achieves a 7% improvement on BLEU compared to Re2Com, the state-of-the-art IR+NMT approach.
From the table, we can see that the IR-based approach has a similar performance to neural-based approaches. The IR-based approach achieves a BLEU score of 18.04, while neural-based approaches perform slightly better and achieve BLEU scores ranging from 19.15 to 20.11. One possible reason why the neural-based approaches and the IR-based approach perform similarly is that the word distributions in the training and test datasets are different. Some custom identifiers in the test set samples may be rare or even absent from the training set, making it hard for the model to capture their information accurately (karampatsis2020big).
For the two existing combinations of IR-based and neural-based approaches, i.e., Rencos and Re2Com, the table shows that both outperform the IR-based and neural-based approaches. Specifically, Rencos achieves a BLEU score of 19.86 by fusing the prediction results of the input code snippets with those of similar snippets. Re2Com achieves a BLEU score of 23.69 by feeding the code and comments of similar samples into the neural model. Our method achieves a higher BLEU score of 25.45 by dynamically combining IR results and NMT results. In addition, both Rencos and Re2Com fail to significantly improve the METEOR and ROUGE-L scores, whereas our approach achieves a significant improvement.
We have also conducted the Wilcoxon signed-rank test (wilcoxon1963wilcoxon) to compare the performance of our approach and these baselines. The test result suggests that our approach achieves significantly better performance than baseline approaches in BLEU, METEOR, ROUGE-L, and CIDER.
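To make the paired test above concrete, the following is a minimal sketch of a two-sided Wilcoxon signed-rank test on per-sample scores. It is not the authors' evaluation script: the function name `wilcoxon_signed_rank` and the example scores are hypothetical, and the p-value uses the plain normal approximation without tie or continuity correction.

```python
import math


def wilcoxon_signed_rank(scores_a, scores_b):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Simplified sketch: zero differences are dropped, tied absolute
    differences receive average ranks, and no tie or continuity
    correction is applied to the variance.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within tied groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Under the null hypothesis, W+ is approximately normal.
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return w_plus, w_minus, p_value


# Hypothetical per-sample scores for two systems on the same five samples.
ours = [31, 42, 55, 61, 74]
baseline = [30, 40, 52, 57, 69]
w_plus, w_minus, p = wilcoxon_signed_rank(ours, baseline)
print(w_plus, w_minus, round(p, 3))
```

With only a handful of samples the normal approximation is rough, and an exact test would be preferable. On a test set of this size the approximation is the usual choice.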
Our approach significantly outperforms the state-of-the-art comment generation baselines. The improvements on the IR-based approach, neural-based approach, and their combination are 41%, 26%, and 7% in terms of BLEU score, respectively.
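The relative improvements quoted above follow from the BLEU scores reported in this section. The sketch below reproduces two of them. The helper name `relative_improvement` is our own, and DeepCom is left out because its overall BLEU score is not stated explicitly in the text here.

```python
def relative_improvement(ours: float, baseline: float) -> float:
    """Relative gain of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# BLEU scores quoted in this section (FunCom test set).
OURS = 25.45
BASELINES = {
    "IR-based (Re2Com retrieve module)": 18.04,
    "IR+NMT (Re2Com)": 23.69,
}

for name, bleu in BASELINES.items():
    print(f"{name}: +{relative_improvement(OURS, bleu):.0f}%")
# prints:
# IR-based (Re2Com retrieve module): +41%
# IR+NMT (Re2Com): +7%
```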
5.2. RQ 2: Cross-Encoder vs. Other Classification Algorithms
Experimental Method. To answer this research question, we compare the Cross-Encoder based classifier with other classifier baselines listed in Section 4.3.2. Specifically, we apply these approaches on the test set labeled as described in Section 4.2.2 and use accuracy, precision, recall, and F1-score to measure the performance. In addition, we replace the Cross-Encoder based classifier of our approach with other classifier baselines, then use BLEU to measure the quality of the generated comments.
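For reference, the four classification metrics named above can be computed from binary labels as in the following sketch. We assume label 1 means "the IR result is accurate". The function name and the toy labels are illustrative only.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels.

    Label 1 is the positive class (here assumed to mean that the
    retrieved IR result is accurate); label 0 is the negative class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, not data from the paper.
print(classification_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```

Precision matters most for the filtering goal: a false positive means an inaccurate retrieved comment is returned as the final output.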
Result. The performance of each classification method is shown in Table 2. Overall, our approach (the Cross-Encoder based classifier) outperforms the two baselines on all five metrics.
|Approach||Classification Performance||Generated Comments|
The first row of Table 2 shows the performance of the lexical-level similarity method (details are in Section 4.3.2), which achieves an accuracy of 71.3% in inferring whether the IR results are accurate. Its combined results achieve a BLEU score of 24.22, which is better than Re2Com. A significant improvement can thus be achieved even without training a classifier, which further validates that our idea of dynamically combining IR results with NMT results is indeed practical. However, the text-similarity-based approach also suffers from false positives, as shown in Figure 2. To identify such false-positive samples, we use the Cross-Encoder, a semantic-based classifier, to predict more accurately whether the IR results are accurate.
The second row of Table 2 shows the performance of the Siamese network method (details are in Section 4.3.2). We train a Bi-LSTM network with strong expressive capability from scratch to determine semantic similarity. However, the Siamese network does not perform as well as expected; it performs even worse than the lexical-level similarity method described above. One possible reason is that the model focuses on irrelevant features instead of the semantic gap between the two code snippets in a pair, leading to over-fitting and poor performance.
The third row of Table 2 shows the performance of our Cross-Encoder based classifier. Overall, our Cross-Encoder based classifier achieves the best performance on all metrics. The high accuracy (73.6%) and precision (70.2%) validate that it can help achieve our goal of filtering false-positive retrieval results, i.e., results that are textually similar but semantically dissimilar. Besides, the performance of the combined result increases with the accuracy of the classification, which suggests that our comment generation approach can be improved further by better distinguishing samples with accurate IR results.
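The idea of choosing between the IR result and the NMT result with a lexical score can be sketched as follows. This is an illustration under assumptions, not the method of Section 4.3.2: token-set Jaccard similarity stands in for the lexical-level similarity, and the threshold of 0.8, the function names, and the `nmt_generate` callable are all hypothetical.

```python
import re
from typing import Callable


def tokenize(code: str) -> set:
    """Split code into a set of lowercase identifier and number tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code.lower()))


def jaccard(code_a: str, code_b: str) -> float:
    """Token-set Jaccard similarity (a stand-in lexical score)."""
    a, b = tokenize(code_a), tokenize(code_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def generate_comment(input_code: str,
                     retrieved_code: str,
                     retrieved_comment: str,
                     nmt_generate: Callable[[str], str],
                     threshold: float = 0.8) -> str:
    """Reuse the retrieved comment if the snippets look alike, else run NMT."""
    if jaccard(input_code, retrieved_code) >= threshold:
        return retrieved_comment
    return nmt_generate(input_code)


# Hypothetical usage with a stub in place of a real NMT model.
stub_nmt = lambda code: "<comment generated by NMT>"
print(generate_comment("int add(int a, int b) { return a + b; }",
                       "void close(Stream s) { s.close(); }",
                       "closes the stream",
                       stub_nmt))
```

A purely lexical score has a blind spot: because this tokenizer ignores operators, `return a + b;` and `return a - b;` receive a similarity of 1.0 even though they compute different things. This is the kind of false positive that motivates a semantic classifier.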
Our Cross-Encoder based classifier can accurately identify samples with accurate IR results. Besides, our idea of dynamically combining IR-based and neural-based approaches can outperform the state-of-the-art IR+NMT approaches even with the naive textual-similarity algorithm.
5.3. RQ 3: Generalizability
Experimental Method. Different neural models might generate different results, which can affect the generalizability of our approach. To evaluate the generalizability of our approach, we replace the DeepCom in our approach with three other neural-based baseline approaches (listed in Section 4.3). Then we measure the quality of generated comments with BLEU.
| Approach | NMT Only | Combined Result | Improvement |
|---|---|---|---|
| Rencos NMT Module | 19.15 | 24.95 | 5.8 (30%) |
Result. Table 3 shows the performance of other neural-based approaches combined with IR results. Overall, after combining IR results, all three neural methods achieve better performance, with BLEU scores of 24.95-25.34. Specifically, the Rencos NMT module, attendgru, and ast-attendgru achieve relative improvements of 30%, 31%, and 28% from combining IR results, respectively, which are even higher than the relative improvement of DeepCom (26%). These results demonstrate that the performance of our proposed approach remains stable across different neural approaches. Moreover, all the combined results outperform Re2Com, the current state-of-the-art IR+NMT method, which again validates the feasibility of our idea of dynamically combining IR results and NMT results.
The performance of our approach remains stable across different neural-based comment generation approaches.
6.1. Why Our Approach Performs Better?
To investigate why our proposed approach can achieve better performance, we partition the 90,908 samples in the test set into two sets, i.e., samples on which the IR-based approach performs better (IR-better samples) and samples on which the neural-based approach (DeepCom) performs better (NMT-better samples). Overall, there are 31,636 samples (34.8%) where the IR-based approach performs better, and 59,272 samples (65.2%) where the neural-based approach performs better. We then recalculate the performance (based on BLEU) of the four methods on these two sets, i.e., the Re2Com retrieve module (IR-based approach), DeepCom (neural-based approach), Re2Com (IR+NMT approach), and our approach. The results are in Table 4.
From the table, we can see that for IR-better samples, the IR-based approach, i.e., Re2Com retrieve module, can directly leverage existing comments from similar samples in the corpus and achieves 39.55 BLEU score, which is almost twice as large as the score of the neural-based approach, i.e., DeepCom. For NMT-better samples, since no similar sample can be retrieved from the corpus, the IR-based approach performs poorly on these general samples and only achieves 5.25 BLEU score. In contrast, the neural-based approach can infer more accurate results by summarizing the code-to-comment pattern and achieves 19.58 BLEU score. The IR-based approach and the neural-based approach perform similarly on the whole test set, but their performance differs significantly on these two sets of samples. Thus combining the strengths of these two methods can achieve better performance.
| Approach | All | IR-better samples | NMT-better samples |
|---|---|---|---|
| Number of samples | 90,908 | 31,636 (34.8%) | 59,272 (65.2%) |
| Re2Com Retrieve Module | 18.04 | 39.55 | 5.25 |
By feeding information from the retrieved similar sample (code snippet and comment) to the neural model, the IR+NMT approach, i.e., Re2Com, performs better than the neural-based approach, i.e., DeepCom, on IR-better samples and achieves a BLEU score of 39.46. However, on NMT-better samples, Re2Com only achieves a BLEU score of 14.33, which is 27% lower than that of DeepCom. The reason for this degradation is that Re2Com cannot accurately distinguish false-positive samples like the one in Figure 2 and thus incorrectly relies on inaccurate retrieved information; recall that the IR-based approach only achieves a BLEU score of 5.25 on NMT-better samples. Inaccurate retrieved information can therefore degrade the model’s generalization. In contrast, our approach directly determines whether the retrieved result is accurate, which prevents inaccurate retrieved information from misleading the NMT model into generating inaccurate comments. Thus our approach can outperform Re2Com on the NMT-better samples and on the whole test set. Since the Cross-Encoder based classifier cannot perfectly predict whether the IR result is accurate, some samples incorrectly use inaccurate IR results as output or neglect accurate IR results. There is therefore still a gap to the optimal performance of combining IR results and NMT results, i.e., a BLEU score of 39.55 on IR-better samples and 19.58 on NMT-better samples.
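The partition used in this analysis, and the upper bound mentioned at the end of the paragraph above, can be sketched on per-sample scores as follows. The function names and the toy scores are hypothetical. Two simplifications are assumed: ties are assigned to the NMT-better set, and a method's score is taken as the mean of its per-sample scores, which need not coincide with a corpus-level BLEU.

```python
def partition_by_winner(ir_scores, nmt_scores):
    """Return (IR-better indices, NMT-better indices).

    Assumption: ties go to the NMT-better set.
    """
    ir_better, nmt_better = [], []
    for idx, (ir, nmt) in enumerate(zip(ir_scores, nmt_scores)):
        (ir_better if ir > nmt else nmt_better).append(idx)
    return ir_better, nmt_better


def mean(values):
    return sum(values) / len(values)


def oracle_score(ir_scores, nmt_scores):
    """Upper bound: always pick the better of the two outputs per sample."""
    return mean([max(ir, nmt) for ir, nmt in zip(ir_scores, nmt_scores)])


def combined_score(ir_scores, nmt_scores, use_ir):
    """Score when a classifier decides, per sample, whether to use IR."""
    return mean([ir if pick else nmt
                 for ir, nmt, pick in zip(ir_scores, nmt_scores, use_ir)])


# Toy per-sample scores, not data from the paper.
ir = [0.9, 0.1, 0.6, 0.0]
nmt = [0.4, 0.3, 0.5, 0.2]
print(partition_by_winner(ir, nmt))
print(oracle_score(ir, nmt))
print(combined_score(ir, nmt, [True, True, False, False]))
```

The distance between `combined_score` and `oracle_score` corresponds to the remaining gap attributed above to classifier errors.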
6.2. Performance of Our Approach on An Alternative Dataset
To show the generalization of our approach, we further verify the performance of our method on another large-scale dataset, i.e., the DeepCom dataset (hu2020deep). The DeepCom dataset was collected from GitHub’s Java repositories created from 2015 to 2016 and contained 445,812 code-comment pairs for training and 20,000 code-comment pairs for validation and testing.
We re-run our approach and the three baselines on the DeepCom dataset, and the results are shown in Table 5. Overall, all four methods achieve outstanding performance on the DeepCom dataset, which is quite different from their performance on the FunCom dataset. The main reason may be that the two datasets are built from different projects, and more code snippets and comments are reused among the projects in the DeepCom dataset. The IR-based approach, the Re2Com retrieval module, achieves a BLEU score of 55.28 on the test set, which implies that code reuse is more frequent in the projects collected by the DeepCom dataset. Thus the neural model can also predict the samples in the test set more accurately due to the presence of similar samples in the training set. The neural-based approach, DeepCom, achieves a BLEU score of 38.79, which seems good but is still inferior to the naive IR-based method. By feeding the code and comments of retrieved similar samples, the IR+NMT method, Re2Com, achieves a BLEU score of 50.21 on the test set. However, the performance of Re2Com is still worse than that of the naive IR-based method, which implies that it fails to combine the strengths of the IR-based and NMT-based methods on the DeepCom dataset. In contrast, our proposed approach, which dynamically combines the results generated by DeepCom and the IR-based approach, achieves a BLEU score of 57.13 on the test set. It thus successfully combines the strengths of the IR method and the NMT method and achieves the best performance.
|Re2Com Retrieval Module||55.28||65.93||55.27||51.69||49.59|
6.3. Effort Saved Comparing to The Existing Combination
Compared to the existing combinations of IR- and NMT-based comment generation approaches, which use both models to generate a comment for each input sample, our approach dynamically selects the model to be used. To show the effort our method can save, we count the number of samples for which the neural-based approach does not need to be run to generate comments.
Specifically, our Cross-Encoder based classifier identifies 18,912 samples on the FunCom dataset and 12,979 samples on the DeepCom dataset for which the IR results can be used directly. This implies that about 20% and 65% of the samples, respectively, do not need to be fed into the NMT model. Our approach thus saves the redundant effort of NMT prediction for these samples, making it faster than the current IR+NMT approach.
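The two percentages can be checked with a few lines of arithmetic. The FunCom test-set size (90,908) is taken from Section 6.1. For DeepCom we assume a test set of 20,000 samples, which is our reading of the dataset description and is consistent with the reported 65%. The helper name `nmt_skip_ratio` is our own.

```python
def nmt_skip_ratio(ir_selected: int, total: int) -> float:
    """Fraction of samples answered by IR alone, i.e., without running NMT."""
    return ir_selected / total


funcom = nmt_skip_ratio(18_912, 90_908)   # test-set size from Section 6.1
deepcom = nmt_skip_ratio(12_979, 20_000)  # assumed test-set size
print(f"FunCom: {funcom:.1%}, DeepCom: {deepcom:.1%}")
# prints: FunCom: 20.8%, DeepCom: 64.9%
```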
6.4. Threats to Validity
Internal Validity relates to errors in the implementation of the baselines. To mitigate this issue, we directly use the publicly available code of DeepCom (hu2020deep), (ast-)attendgru (leclair2019neural), Re2Com (wei2020retrieve), and Rencos (zhang2020retrieval) to implement the baselines. Our experiments show that these baselines achieve performance comparable to the results reported in their papers.
External Validity is about the quality of our dataset. Different data sources can have significantly different characteristics. Therefore, both our proposed approach and the baselines may perform differently on different datasets. In this paper, we only evaluate our proposed approach and baselines on two widely used datasets, i.e., DeepCom (hu2020deep) and FunCom (leclair2019neural). In our future work, we will experiment with other datasets.
Construct Validity relates to the suitability of our evaluation metrics. We use BLEU, ROUGE-L, METEOR, and CIDER to evaluate the generated comments of our approach and other baselines. These metrics mainly measure the gap between generated comments and ground truth in terms of textual similarity.
7. Related Work
Comment generation. Code comment generation techniques can be divided into three types: manually-crafted templates (sridhara2010towards; moreno2013automatic), IR-based (haiduc2010supporting; haiduc2010use; eddy2013evaluating; wong2015clocom; edmund2014mining), and neural models (iyer2016summarizing; hu2018deep; hu2020deep; leclair2019neural; zhang2020retrieval; wei2020retrieve).
Early studies leveraged manually crafted templates to generate comments automatically. Sridhara et al. (sridhara2010towards) built the Software Word Usage Model (SWUM) to capture the meaning and relationship of terms in the source code, then organized them into readable comments using different predefined templates. Moreno et al. (moreno2013automatic) used heuristic rules to capture critical information from the source code and further used them to generate comments.
Information retrieval (IR) techniques are also widely used in comment generation. One way is to provide extractive summaries of the source code, using IR techniques to extract keywords from the source code and compose them into term-based comments. Haiduc et al. (haiduc2010supporting; haiduc2010use) treated each function of source code as a document and leveraged the Vector Space Model and Latent Semantic Indexing (LSI) to extract relevant terms from source code, then organized the selected terms into comments. Eddy et al. (eddy2013evaluating) took a similar idea and adopted a hierarchical topic model for improvement. Another way is to directly use the existing comment of a similar sample. Since code reuse and cloning are common in software development, similar code snippets that use the same code fragments may be found in large project repositories (e.g., GitHub) or software Q&A sites (e.g., Stack Overflow). Wong et al. (wong2015clocom; edmund2014mining) retrieved the replicated samples from the corpus by code clone detection techniques.
In recent years, more and more researchers have focused on neural-based methods, which train probabilistic models on large-scale source code. Iyer et al. (iyer2016summarizing) treated code-to-comment generation as an end-to-end translation problem and were the first to introduce neural machine translation (NMT) into comment generation. They leveraged an attentional seq2seq model to translate code to comments, which used token embeddings as the encoder and an LSTM layer as the decoder. Other researchers followed this direction. Hu et al. (hu2018deep) argued that treating code as a natural language sequence may lose its syntactic information. They proposed a new structure-based traversal (SBT) method to flatten the AST into a sequence and used it in place of the code as the model input. Later they proposed another hybrid model (hu2020deep) that simultaneously used code and AST sequences for prediction. LeClair et al. (leclair2019neural) also proposed a similar hybrid model but showed that the neural model also works when only the AST sequence is known. The NMT-based method can automatically learn code-to-comment patterns from the corpus, which saves the manual effort of designing features or templates and brings impressive generalization capability. The IR-based method may fail when there are no similar samples in the training set, whereas the NMT-based method can still give more accurate answers in that case.
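To illustrate the structure-based traversal mentioned above, the sketch below flattens a small tree into a bracketed token sequence in which every subtree is wrapped as `( label ... ) label`. This follows our understanding of the SBT idea and is not taken from the original implementation. The `Node` class and the node labels are hypothetical, and a real AST would come from a Java parser.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A minimal AST node: a label plus an ordered list of children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def sbt(node: Node) -> List[str]:
    """Flatten a tree so that each subtree is delimited by its own label."""
    tokens = ["(", node.label]
    for child in node.children:
        tokens.extend(sbt(child))
    tokens.extend([")", node.label])
    return tokens


# A toy tree with made-up labels.
tree = Node("MethodDeclaration", [
    Node("Name"),
    Node("ReturnStatement", [Node("BinaryOperation")]),
])
print(" ".join(sbt(tree)))
```

Because every closing bracket repeats the label of the node it closes, the tree structure can be recovered from the flat sequence, which a plain pre-order traversal does not guarantee.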
IR-based Neural Comment Generation. Neural models have difficulty generating low-frequency tokens (koehn2017rareword). LeClair et al. (leclair2019neural) showed that about 21% of comments in their test set contained low-frequency words (frequency 100). However, only 7% of the results generated by their method contained low-frequency words. The IR-based methods leverage existing comments from similar samples, which may contain low-frequency words and project-specific information. Therefore, researchers have begun to combine IR-based methods with NMT-based methods by feeding information from similar samples (their code only, or their code and comments) to help neural models generate low-frequency words. Zhang et al. (zhang2020retrieval) proposed an approach that fused the decoded results of the input code snippet and its similar code snippets, which were retrieved based on syntactic similarity and semantic similarity. Wei et al. (wei2020retrieve) treated the existing comments of similar code as exemplars, which can serve as reference examples for generating new comments. They introduced additional encoders to encode the code and comments of similar samples, then trained the model jointly. To avoid the disturbance of inaccurate search results, both models decided how much to rely on the retrieved information based on the embedding similarity of the input and retrieved code snippets. The results show that these methods can improve both the quality of the generated comments and the generation of low-frequency words. However, both methods may be confused by false-positive samples like the one in Figure 2. Without supervised learning, the input and retrieved code snippets of this example yield similar embeddings, making the model mistakenly believe that the retrieved result is accurate and wrongly rely on it, which decreases generalization performance.
In our work, we treat determining whether the retrieved result is accurate as a supervised task to distinguish false-positive retrieval results more accurately, and combine the IR-based and NMT-based methods dynamically to prevent the neural model from over-relying on the retrieved information.
In this paper, we propose a dynamic approach to combine the strengths of the IR-based and neural-based comment generation approaches. Specifically, given an input code snippet, we first use an IR-based technique to retrieve a similar code snippet from the corpus. Then we use a Cross-Encoder based classifier to dynamically decide which comment generation method to use, i.e., if the retrieved similar code snippet is a true positive, we directly use the comment produced by the IR-based approach. Otherwise, we feed the input code snippet to the neural-based approach to generate its comment. We have evaluated the effectiveness and generality of our approach on a large-scale Java dataset. The results show that our approach outperforms the state-of-the-art baselines by a significant margin.