ComFormer: Code Comment Generation via Transformer and Fusion Method-based Hybrid Code Representation

07/08/2021 ∙ by Guang Yang, et al.

Developers often write low-quality code comments due to a lack of programming experience, which can reduce the efficiency of developers' program comprehension. Therefore, developers hope that code comment generation tools can be developed to illustrate the functionality and purpose of the code. Recently, researchers mainly model this problem as a neural machine translation problem and tend to use deep learning-based methods. In this study, we propose a novel method ComFormer based on Transformer and fusion method-based hybrid code representation. Moreover, to alleviate the OOV (out-of-vocabulary) problem and speed up model training, we further utilize the Byte-BPE algorithm to split identifiers and the Sim_SBT method to perform AST traversal. We compare ComFormer with seven state-of-the-art baselines from the code comment generation and neural machine translation domains. Comparison results show the competitiveness of ComFormer in terms of three performance measures. Moreover, we perform a human study to verify that ComFormer can generate high-quality comments.


I Introduction

With the increasing complexity and evolutionary frequency of software projects, the importance of program comprehension is also increasing. A recent study by Xia et al. [52] showed that developers spend 59% of their time on program comprehension on average during software development and maintenance. Therefore, high-quality code comments are critical to improving the efficiency of developers' program comprehension [17]. However, developers often write low-quality code comments or do not write code comments at all, due to the limited project development budget, lack of programming experience, or insufficient attention to writing code comments. Although some tools (such as JavaDoc [23] and Doxygen (http://www.doxygen.org)) can assist in generating code comment templates, these tools are still unable to automatically generate content related to the functionality and purpose of the focused code. If developers manually write code comments, it is time-consuming and difficult to guarantee the quality of the written comments. Moreover, existing code comments should be updated automatically with the evolution of the related code [23]. Therefore, it is of great significance to design novel methods that can automatically generate high-quality comments after analyzing the focused code.

Code comment generation (this challenging research problem is also called source code summarization in some previous studies [45][25]) is an active research topic in the current program comprehension research domain. Research achievements on this problem can also benefit other software engineering tasks (such as software maintenance, code search, and code categorization). In the early phase, most of the studies [14][15][39][40] on code comment generation were based on template-based methods or information retrieval-based methods. Recently, most of the studies [21][18][20] started to follow an encoder-decoder framework and achieved promising results.

In this study, we propose a novel method ComFormer via Transformer [44] and fusion method-based hybrid code representation. Our method considers Transformer since this deep learning model can achieve better performance than traditional sequence-to-sequence models in classical natural language processing (NLP) tasks (such as neural machine translation [43][35]) and software engineering tasks [7]. Moreover, our method utilizes the hybrid code representation to effectively learn the semantics of the code, since this representation can extract both lexical-level and syntactic-level information from the code. In the hybrid code representation, we not only consider sequential tokens of source code (i.e., the lexical level of code) but also utilize AST (abstract syntax tree) information via our proposed Sim_SBT method (i.e., the syntactic level of code). Moreover, we consider three different methods to fuse this information. Finally, to alleviate the OOV (out-of-vocabulary) problem, we utilize the byte-level Byte-Pair-Encoding algorithm (Byte-BPE) [46] to split identifiers.

To evaluate the effectiveness of our proposed method ComFormer, we conduct experimental studies on a large-scale code corpus, which contains 485,812 pairs. Each pair includes a Java method and corresponding code comment. This corpus was gathered by Hu et al.  [19]. They performed a set of data cleaning steps to ensure the high quality of this corpus. Until now, this corpus has been widely used as the experimental subject in previous code comment generation studies [19][18][22][47][54][49].

We design empirical studies and perform human studies to verify the effectiveness of our proposed method. We first compare ComFormer with four state-of-the-art baselines from code comment generation (i.e., DeepCom [18], Hybrid-DeepCom [19], Transformer [2], CodePtr [8]) and three baselines from neural machine translation (i.e., seq2seq models [42] with/without attention mechanism [5] and GPT-2 [34]) in terms of three performance measures (i.e., BLEU, METEOR, and ROUGE-L), which are classical measures in previous code comment generation studies. Empirical results show that ComFormer improves the performance when compared with these state-of-the-art baseline methods. Second, after comparing three fusion methods (i.e., Jointly Encoder, Shared Encoder, and Single Encoder) to combine code lexical information and AST syntactic information, we find that ComFormer with the Single Encoder achieves the best performance. Third, we perform a human study to verify the effectiveness of ComFormer. In our human study, we compare the comments generated by ComFormer with the comments generated by Hybrid-DeepCom [19], which has the best performance among the chosen baselines. The results of our human study also show the competitiveness of ComFormer.

To our best knowledge, the main contributions of our study can be summarized as follows:

  • We propose a novel code comment generation method ComFormer based on the Transformer and the fusion method-based hybrid code representation. Instead of the copy mechanism, we mitigate the OOV problem through the Byte-BPE algorithm and vocabulary sharing. Then we propose a simplified version of the SBT algorithm (i.e., Sim_SBT) to traverse the structural information of the AST, which can speed up model training. Finally, we consider three different methods for fusing lexical and syntactical information of the code.

  • We evaluate the performance of our proposed method ComFormer on a large-scale code corpus, which contains 485,812 Java methods and corresponding code comments. The experimental results show that ComFormer is more effective than seven state-of-the-art baselines from both the code comment generation domain and the neural machine translation domain in terms of three performance measures. Moreover, we further conduct a human study to verify the effectiveness of ComFormer.

  • We share our source code, trained models, and used code corpus in the GitHub repository (https://github.com/NTDXYG/ComFormer), which can facilitate the replication of ComFormer and encourage other researchers to make comparisons with ComFormer.

Paper organization. The rest of the paper is organized as follows. Section II presents the background and related work of our study. Section III shows the framework of our proposed method ComFormer and details of key components in ComFormer. Section IV shows the experiment setup. Section V analyzes our empirical results. Section VI performs a discussion on our proposed method ComFormer. Section VII discusses potential threats to the validity of our empirical study. Finally, Section VIII concludes this paper and shows potential future directions for our study.

II Related Work

In the early phase, most studies [39, 41, 40, 48, 30, 31, 1, 15, 14, 11, 37, 36, 51] used template-based or information retrieval-based methods to generate code comments. Recently, most of the studies followed deep learning-based methods (i.e., encoder-decoder framework) and achieved promising results.

Iyer et al. [21] first proposed a method CODE-NN via an attention-based neural network. Allamanis et al. [3] proposed a model in which the encoder uses CNNs and attention mechanisms, and the decoder uses a GRU. The use of convolution operations helps to detect local time-invariant features and long-range topical attention features. Zheng et al. [55] proposed a new attention module called Code Attention, which can utilize the domain features (such as symbols and identifiers) of code segments. Liang and Zhu [26] used Code-RNN to encode the source code into vectors, and then used Code-GRU to decode the vectors into code comments.

Hu et al. [18] proposed a method DeepCom by analyzing abstract syntax trees (ASTs). To better present the structure of ASTs, they proposed a new structure-based traversal (SBT) method. Later, Hu et al. [19] further proposed the method Hybrid-DeepCom, which made several improvements; for example, identifiers satisfying the camel casing naming convention are split into multiple words. Recently, Kang et al. [22] analyzed whether using pre-trained word embeddings can improve the model performance. They surprisingly found that using pre-trained word embeddings based on code2vec [4] or GloVe [33] does not necessarily improve the performance.

LeClair et al. [25] proposed a method ast-attendgru, which combines words from code and code structure. LeClair et al. [24] then used a graph neural network (GNN), which can effectively analyze the AST structure, to generate code comments. Wan et al. [45] proposed the method Hybrid-DRL to alleviate the exposure bias problem. They input the AST structure and sequential content of code segments into a deep reinforcement learning framework (i.e., an actor-critic network). Then, Wang et al. [47] extended the method Hybrid-DRL. They used a hierarchical attention network that considers multiple code features, such as type-augmented ASTs and program control flows.

Ahmad et al. [2] used the Transformer model to generate code comments. The Transformer model is a kind of sequence to sequence model based on multi-head self-attention, which can effectively capture long-range dependencies. Specifically, they proposed to combine self-attention and copy attention as the attention mechanism of the model and analyzed the influence of absolute position and pairwise relationship on the performance of the code comment generation.

Chen et al. [9] proposed a neural framework that allows bidirectional mapping between a code retrieval task and a code comment generation task. Their proposed framework BVAE has two Variational AutoEncoders (VAEs): C-VAE for source code and L-VAE for natural language. Ye et al. [53] exploited the probabilistic correlation between the code comment generation task and the code generation task via dual learning. Wei et al. [49] also utilized the correlation between these two tasks and proposed a dual training framework.

On the other hand, Hu et al. [20] proposed a method TL-CodeSum, which can utilize API knowledge learned in a related task to improve the quality of code comments. Zhang et al. [54] proposed a retrieval-based neural code comment generation method. This method enhances the model with the most similar code segments retrieved from the training set in terms of syntax and semantics. Liu et al. [29] utilized the knowledge of call dependencies between code segments. Zhou et al. [56] proposed a method ContextCC, which uses program analysis to extract context information (i.e., the methods and their dependencies). Haque et al. [16] modeled the file context (i.e., other methods in the same file) of methods and then used an attention mechanism to find words and concepts to generate comments.

Different from the previous studies, ComFormer is designed based on Transformer and fusion method-based hybrid code representation. In this study, we investigate three different methods to fuse lexical-level and syntactic-level code information. Moreover, we utilize the Byte-BPE algorithm to alleviate the OOV problem and use the Sim_SBT method to reduce the size of the sequence generated by the original SBT method [18], which can speed up model training.

III Our Proposed Method ComFormer

Fig. 1 shows the framework of ComFormer. In this figure, we can find that ComFormer consists of three parts: data process part, model part, and comment generation part. Then we show the details of these three parts.

Fig. 1: The framework of our proposed method ComFormer

III-A Data Process Part

In ComFormer, we consider a hybrid representation of code. For this representation, we not only consider sequential tokens of source code (i.e., lexical level of code) but also utilize AST structure information (i.e., syntactical level of code).

III-A1 Constructing Source Code Sequence

We first convert the tokens of the code into sequences. However, many tokens are identifiers (such as class names, method names, and variable names). These identifiers are named according to Java's naming convention (i.e., the camel casing naming convention). Therefore, most of the identifiers are OOV tokens. In our study, we first split these identifiers into multiple words, which helps to alleviate the OOV problem and keep more code information. For example, the variable name "SegmentCopy" can be split into two words: "segment" and "copy". The method name "onDataChanged" can be split into three words: "on", "data", and "changed". The class name "SecureRandom" can be split into two words: "secure" and "random". After splitting the identifiers into multiple words, we then convert all the tokens into lowercase. Finally, we replace specific numbers and strings with "num_" and "str_" tags, respectively.
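The splitting and normalization steps above can be sketched in Python; the regular expression below is an illustrative approximation of the camel-case split, not the paper's exact implementation:

```python
import re

def split_identifier(identifier):
    """Split a camelCase/PascalCase identifier into lowercase words
    (a sketch; the paper may handle further cases such as snake_case)."""
    # Match runs of uppercase (acronyms), capitalized words, or digits.
    words = re.findall(r'[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+', identifier)
    return [w.lower() for w in words]

def normalize_token(token):
    """Replace numeric and string literals with placeholder tags."""
    if re.fullmatch(r'\d+(\.\d+)?', token):
        return 'num_'
    if token.startswith('"') and token.endswith('"'):
        return 'str_'
    return token
```

For instance, `split_identifier("onDataChanged")` yields `["on", "data", "changed"]`, matching the example in the text.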

Second, after performing a more detailed manual analysis of the training set, we find that, after splitting words based only on the camel casing naming convention, there are still a large number of OOV words in the testing set. Most of the current studies [8, 2] alleviated this problem through the copy mechanism by using a pointer network. In our study, we use the Byte-BPE algorithm [46] to further divide the tokens of the code into sub-tokens, and then combine it with vocabulary sharing to alleviate the OOV problem. For example, the word "forgo" exists in the comments of the testing set but appears neither in the comments of the training set nor in the corresponding source code. In this case, neither the camel casing split nor the copy mechanism can solve the problem. However, with the Byte-BPE algorithm, the word "forgotten" is split into "for", "go", "t", "ten", so the Decoder can combine sub-tokens to produce the correct word "forgo".
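To illustrate the sub-token idea, here is a minimal character-level BPE sketch. The paper uses the byte-level variant (Byte-BPE [46]); this toy version only shows the core mechanism of repeatedly merging the most frequent adjacent symbol pair:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {word-as-tuple: freq} vocab."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Apply one merge: fuse every adjacent occurrence of `pair`."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(words, num_merges):
    """Learn `num_merges` merges from a list of words."""
    vocab = {tuple(w): f for w, f in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab
```

On a toy corpus containing "forgotten", "forgot", and "forget", the first merges learned are ('f','o') and ('fo','r'), producing the shared sub-token "for" that can be recombined at decoding time.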

III-A2 Constructing AST Sequence

We first use the javalang tool (https://pypi.org/project/javalang/) to convert the Java code into the corresponding AST. Then we use our proposed Sim_SBT method to generate the traversal sequence of the AST. Since the sequences generated by the SBT method [18] may contain redundant information (i.e., many parentheses between type nodes), the sequences generated by SBT traversal are sometimes longer than the source code sequences, which makes it more difficult for the model to learn syntactic information. To alleviate this problem, we propose a new method Sim_SBT, which can better present the structure of ASTs while keeping the sequences unambiguous.

The AST traversal results of the methods SBT and Sim_SBT are shown in Fig. 2. In this example, the sequence generated by the original SBT method is too long. Our proposed method Sim_SBT adopts a pre-order traversal of the tree, which has the advantage of reducing the length of the sequence. We use a code example to show the sequence generated by our proposed method Sim_SBT in Fig. 3. In this figure, each source code token of a given color corresponds to the token of its AST syntax type. We can find that the AST sequence generated by Sim_SBT is slightly shorter than the source code sequence, which can effectively reduce model training time.
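The difference between the two traversals can be illustrated on a toy AST. The bracket format below follows our reading of the SBT method of Hu et al. [18]; the node names are illustrative:

```python
def sbt(node):
    """Structure-based traversal (after Hu et al.): each subtree is
    wrapped in '(' ... ')' markers with the node type repeated, roughly
    doubling the sequence length."""
    seq = ['(', node['type']]
    for child in node.get('children', []):
        seq += sbt(child)
    seq += [')', node['type']]
    return seq

def sim_sbt(node):
    """Simplified traversal (sketch of Sim_SBT): a plain pre-order
    visit of node types, dropping the bracket/type repetitions."""
    seq = [node['type']]
    for child in node.get('children', []):
        seq += sim_sbt(child)
    return seq

# A tiny AST for a statement like `return x;`
ast = {'type': 'ReturnStatement',
       'children': [{'type': 'MemberReference'}]}
```

Even on this two-node tree, `sbt(ast)` produces 8 tokens while `sim_sbt(ast)` produces 2, which is why the simplified traversal shortens the encoder input.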

Fig. 2: The AST traversal results of the methods SBT and Sim_SBT
Fig. 3: An example of converting an AST to a sequence by using our proposed method Sim_SBT.

III-B Model Part

ComFormer follows the Transformer architecture (i.e., the encoder and the decoder are built using the self-attentive mechanism). Moreover, ComFormer considers three methods for fusing lexical and syntactical information of the code at the Encoder.

III-B1 Encoder Layer

In this section, we first introduce Transformer’s Encoder and then illustrate three different fusion methods at the Encoder.

The Encoder of Transformer does not attempt to compress the entire source sentence into a single context vector. Instead, it produces a sequence of context vectors, one per input token. First, the tokens of the input are passed through a standard embedding layer. Next, since the model contains no recurrence, it has no information about the tokens' order in the sequence. This problem is solved by using another embedding layer (i.e., a positional embedding layer), whose input is not the token itself but the token's position in the sequence. Notice that the input starts with the SOS (start of sequence) token, which is the first token at position 0. The original implementation of Transformer [44] uses fixed static positional encodings and does not learn positional embeddings. Recently, learned positional embeddings have been widely used in modern Transformer architectures (such as BERT [10]). Therefore, our study also uses this positional embedding layer.

The encoded word embeddings are then used as the input to the encoder, which consists of N identical layers. Each layer contains two sub-layers: (a) a multi-head attention mechanism and (b) a feed-forward network.

A multi-head attention mechanism builds upon scaled dot-product attention, which operates on a query Q, a key K, and a value V. The attention for each representation is calculated with the scaled dot-product:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V   (1)

The multi-head attention mechanism obtains h different representations of (Q, K, V), computes scaled dot-product attention for each, concatenates the results, and projects the concatenation with a feed-forward layer:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)   (2)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O   (3)

where W_i^Q, W_i^K, W_i^V, and W^O are learned parameter projection matrices, and h denotes the number of heads in the multi-head attention.

The second component of each layer of the Transformer network is a position-wise feed-forward network:

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2   (4)
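Equations (1)-(3) can be sketched in NumPy as follows; this is a minimal single-sequence illustration (no batching, and the per-head projections are taken as column slices of full-width matrices for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Eq. (1): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo, h):
    """Eqs. (2)-(3): project per head, attend, concatenate, project."""
    d_model = Q.shape[-1]
    d_k = d_model // h
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)
        heads.append(scaled_dot_product_attention(
            Q @ Wq[:, sl], K @ Wk[:, sl], V @ Wv[:, sl]))
    return np.concatenate(heads, axis=-1) @ Wo
```

With zero queries and keys, the softmax weights are uniform, so attending over all-ones values returns all ones; this is an easy sanity check of Eq. (1).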

Next, we illustrate three different methods (i.e., Jointly Encoder, Shared Encoder, and Single Encoder), which can fuse lexical and syntactical information of the code at the Encoder. The structure of these fusion methods can be found in Fig. 4. Specifically, the Jointly Encoder assumes that the AST and the source code are two different levels of input. Therefore, this method sets up one encoder for the source code sequence (i.e., Code Encoder) and another for the AST sequence (i.e., AST Encoder). A Linear layer activated by the Tanh function then fuses their outputs into the final matrix of contextual information. The Shared Encoder considers the effect of having two encoders on the number of model parameters. This method encodes the source code sequence and the AST sequence with shared weights (i.e., using one encoder), concatenates the two output matrices, and passes them through a Linear layer activated by the Tanh function to obtain the final matrix of contextual information. The Single Encoder first concatenates the source code sequence and the AST sequence, and then passes the combined sequence through the word embedding and the encoder; it relies entirely on the positional information encoded in the model to learn lexical and syntactical information.
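The three fusion strategies differ only in where the two input sequences are combined. A shape-level NumPy sketch, with a single Tanh-activated linear layer standing in for a full Transformer encoder (an assumption for illustration), is:

```python
import numpy as np

def encode(X, W):
    """Stand-in for a Transformer encoder: one linear layer + Tanh."""
    return np.tanh(X @ W)

d = 8
code_emb = np.ones((10, d))  # embedded source-code tokens (n_code x d)
ast_emb = np.ones((6, d))    # embedded AST tokens (n_ast x d)
W_enc1, W_enc2, W_fuse = np.eye(d), np.eye(d), np.eye(d)

# Jointly Encoder: two separate encoders, fused with Linear + Tanh.
jointly = np.tanh(np.concatenate([encode(code_emb, W_enc1),
                                  encode(ast_emb, W_enc2)]) @ W_fuse)

# Shared Encoder: one set of encoder weights for both inputs.
shared = np.tanh(np.concatenate([encode(code_emb, W_enc1),
                                 encode(ast_emb, W_enc1)]) @ W_fuse)

# Single Encoder: concatenate the token sequences first, encode once.
single = encode(np.concatenate([code_emb, ast_emb]), W_enc1)
```

All three variants yield a contextual matrix whose length is the sum of the code and AST sequence lengths; they differ in parameter count and in where the interaction between the two modalities happens.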

Fig. 4: Structure of three different fusion methods in the Encoder

III-B2 Decoder Layer

According to the structure of the Transformer, the Decoder is similar to the Encoder. At the beginning, a positional encoding vector is added first, in the same way as in the Encoder.

Next is the masked multi-head attention. The mask hides certain values so that they have no effect when the parameters are updated. The Decoder implements an autoregressive model by means of this mask. The sequence mask makes the Decoder unable to see future information. That is, for a sequence at time step t, the decoded output should only depend on the outputs before time t, not the outputs after t. Therefore, we generate a triangular matrix in which the values of the upper triangle are masked out. By applying this matrix to each sequence, the information after time step t can be hidden.
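The triangular sequence mask described above can be generated as follows (a NumPy sketch; in practice frameworks apply it by adding a large negative value to the masked attention scores before the softmax):

```python
import numpy as np

def causal_mask(t):
    """Lower-triangular boolean mask: position i may attend only to
    positions j <= i, so future tokens stay hidden from the decoder."""
    return np.tril(np.ones((t, t), dtype=bool))

# For t = 4: row i marks which positions step i is allowed to see.
mask = causal_mask(4)
```

Row 0 sees only position 0, row 1 sees positions 0-1, and so on, which is exactly the autoregressive constraint.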

Finally, the combined embeddings are passed through the decoder layers, along with the encoded source, the source mask, and the target mask. Notice that the rest of the layer structure is the same as the Encoder in our method.

III-C Comment Generation Part

Previous studies [18][20] showed that generating text through the maximum probability distribution of the neural network often yields low-quality results. Recently, most studies [12][50] have resorted to Beam Search [12], which can achieve high performance on text generation tasks. Therefore, ComFormer uses the Beam Search algorithm to generate code comments.
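A toy beam search over precomputed per-step token distributions can be sketched as follows. Treating each step's distribution as fixed is an assumption made to keep the sketch self-contained; the real decoder conditions each step on the previously generated prefix:

```python
import heapq
import math

def beam_search(step_probs, beam_width=2):
    """Keep the `beam_width` best partial sequences at each step,
    scored by accumulated negative log-probability (lower is better)."""
    beams = [(0.0, [])]  # (negative log-prob, token sequence)
    for dist in step_probs:
        candidates = []
        for score, seq in beams:
            for tok, p in dist.items():
                candidates.append((score - math.log(p), seq + [tok]))
        beams = heapq.nsmallest(beam_width, candidates)
    return beams[0][1]  # best-scoring sequence
```

Unlike greedy decoding, beam search keeps several hypotheses alive, so an early low-probability token can still lead to the best overall comment.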

IV Experimental Setup

In our empirical study, we want to answer the following three research questions (RQs):

RQ1: Can our proposed method ComFormer outperform state-of-the-art baselines for code comment generation in terms of neural machine translation-based measures?

Motivation. In this RQ, we want to compare the performance of ComFormer with the state-of-the-art baselines from both the code comment generation domain and neural machine translation domain in an automated manner. The main challenge is how to measure the similarity between the comments written by developers and the comments generated by ComFormer and baselines. In this RQ, we consider three performance measures, which have been used in the previous studies on neural machine translation [27][13] and code comment generation [20][45][19].

RQ2: Can hybrid code representation improve the performance of our proposed method ComFormer?

Motivation. In this RQ, we want to show the effectiveness of the fusion method-based hybrid code representation. Therefore, we compare this code representation with methods that only consider code lexical information. Moreover, we compare the performance of different methods for fusing code lexical information and code syntactical information, and then select the best fusion method in this study.

RQ3: Can our proposed method ComFormer outperform state-of-the-art baselines for code comment generation via human study?

Motivation. Evaluating the effectiveness of our proposed method in terms of performance measures has the following disadvantages. First, the quality of the comments written by developers cannot be guaranteed in some cases. Second, evaluation based on word similarity is sometimes inaccurate, since two semantically similar code comments may contain different words. Therefore, it is necessary to evaluate the effectiveness of our proposed method via a human study in a manual way.

IV-A Code Corpus

In our empirical study, we choose the code corpus (available at https://github.com/xing-hu/EMSE-DeepCom) gathered by Hu et al. [19] as our empirical subject, since this code corpus has been widely used in previous studies for code comment generation [19][18][22][47][54][49].

Table I shows the statistical information for code length and comment length. For the above code corpus, 20,000 pairs are selected to construct the testing set and the validation set. Then the remaining 445,812 pairs are used to construct the training set. This setting is consistent with the experimental setting in previous studies (such as Hybrid-DeepCom [19]), which can guarantee a fair comparison with the baselines.

                 Avg     Mode    Median
Code Length      55.79   11      36
Comment Length   10.25   8       9
TABLE I: Statistics of code corpus used in our empirical study

IV-B Performance Measures

In our study, we use the performance measures from neural machine translation research to automatically evaluate the quality between the candidate comments (generated by code comment generation methods) and the reference comments (generated by developers). The chosen performance measures include BLEU, METEOR, and ROUGE-L. These performance measures have been widely used in previous studies for code comment generation [20][45][19]. The details of these performance measures can be found as follows.

BLEU. BLEU (Bilingual Evaluation Understudy) [32] is the earliest measure used to evaluate the performance of neural machine translation models. It compares the degree of overlap of n-grams between the candidate text and the reference text. In practice, n = 1 to 4 is usually taken, and then a weighted average is computed. Unigrams (n = 1) measure word translation accuracy, while higher-order n-grams measure the fluency of the translated sentence.
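The core of BLEU, clipped (modified) n-gram precision, can be sketched as follows; the brevity penalty and smoothing of the full metric are omitted in this sketch:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at
    most as many times as it appears in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0
```

The clipping prevents a degenerate candidate such as "the the the" from scoring perfect unigram precision against a reference that contains "the" only once.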

METEOR. METEOR (Metric for Evaluation of Translation with Explicit Ordering) [6] is based on BLEU with some improvements. METEOR is based on the weighted harmonic mean of unigram precision and unigram recall, and its purpose is to address some inherent defects of BLEU.

ROUGE-L. ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation) [28] calculates the length of the longest common subsequence between the candidate text and the reference text. The longer the length, the higher the score.
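The LCS computation behind ROUGE-L can be sketched with standard dynamic programming. The F-score formulation below is the common one; the paper does not spell out its beta value, so beta = 1.2 here is an assumption:

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y \
                else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-score from LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
```

Because the LCS need not be contiguous, ROUGE-L rewards candidates that preserve the reference's word order even when extra words are interleaved.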

We utilize the implementations provided by the nlg-eval library (https://github.com/Maluuba/nlg-eval), which can ensure the implementation correctness of these performance measures.

IV-C Experimental Settings

Our proposed method ComFormer is implemented with PyTorch 1.6.0. In our study, we choose AdamW as the optimizer and use cross-entropy as the loss function. We set the learning rate to 0.0005 and the number of epochs to 30.

All the experiments were run on a computer with an Intel(R) Xeon(R) Silver 4210 CPU and a GeForce RTX 3090 GPU with 24 GB memory. The running OS is Windows.

V Result Analysis

V-A Result Analysis for RQ1

RQ1: Can our proposed method ComFormer outperform state-of-the-art baselines for code comment generation in terms of neural machine translation-based measures?

Method. In this RQ, we first want to compare our proposed method ComFormer with Hybrid-DeepCom [19]. Hybrid-DeepCom used an AST traversal method to represent the code structure information and used the seq2seq model with the attention mechanism to construct the model. Then, we choose three other state-of-the-art code comment generation methods (i.e., DeepCom [18], CodePtr [8], and Transformer [2]) as our baselines. Later, we choose three baselines from deep learning-based machine translation models. The first two baselines are traditional seq2seq models [42] with/without the attention mechanism [5]. The last baseline is GPT-2 [34]. GPT-2 only uses the decoder of the Transformer and is pre-trained on large-scale corpora, which benefits tasks such as machine translation and text summarization. Finally, to show the competitiveness of our fusion method, we also consider a baseline (i.e., ComFormer without AST), in which the Encoder only considers the code lexical information. Notice that, in this RQ, ComFormer (i.e., ComFormer with AST) uses the Single Encoder as the fusion method.

For these chosen baselines, we re-use the experimental results of three methods (i.e., DeepCom, Hybrid-DeepCom, and CodePtr) due to the same dataset split setting and re-implement the remaining baselines.

Results. The comparison results between ComFormer and the baselines can be found in Table II. Based on Table II, we can find that our proposed method ComFormer outperforms all of the baselines. In terms of BLEU_1/2/3/4, ComFormer improves the performance by at least 6.18%, 9.86%, 12.76%, and 14.85%, respectively. In terms of METEOR, ComFormer improves the performance by at least 8.20%. In terms of ROUGE-L, ComFormer improves the performance by at least 4.87%. Therefore, ComFormer can achieve better performance than the baselines in terms of these performance measures.

METHOD BLEU_1(%) BLEU_2(%) BLEU_3(%) BLEU_4(%) METEOR(%) ROUGE_L(%)
DeepCom 49.023 44.140 38.265 35.216 25.183 52.175
Hybrid-DeepCom 54.056 45.046 40.336 37.397 27.383 54.331
Transformer 55.624 46.295 41.574 38.692 29.056 55.263
CodePtr 59.506 51.107 46.386 43.371 31.382 62.761
Seq2Seq 45.016 40.625 36.162 34.024 23.695 50.462
Seq2Seq with atten 46.526 41.526 37.812 35.041 24.534 51.842
GPT-2 47.915 41.253 37.593 35.301 26.887 53.398
ComFormer without AST 59.090 51.027 46.613 43.801 31.711 60.539
ComFormer with AST 62.790 55.283 51.127 48.437 34.182 63.249
TABLE II: The comparison results between our proposed method ComFormer and baseline methods in terms of BLEU, METEOR and Rouge_L

In addition, four code examples with different lengths are selected from the testing set to compare the results generated by ComFormer and the baselines. The comparison results can be found in Table III. In Case 1, the use of a pointer network in CodePtr and the use of BPE splitting with vocabulary sharing in ComFormer both generate the word "cache" in the comment, which demonstrates the effectiveness of our method in alleviating the OOV problem. We further verify the competitiveness of our method in Case 2, where the word "insectwordcategory" does not appear in the source code and ComFormer still generates the comment correctly, while Hybrid-DeepCom and CodePtr only generate UNK. As we can find from Cases 3 and 4, although the comments generated by the baselines are consistent, the comments generated by ComFormer are better after manual analysis. For example, the comment generated in Case 3 explains the reason for doing this separate step, and the comment generated in Case 4 emphasizes the meaning of the if statement.

Case ID Example
1
private void addCachedLegionMemberEx(LegionMemberEx legionMemberEx) {
    this.allCachedLegionMembers.addMemberEx(legionMemberEx);
}
Hybrid-DeepCom: this method will add a new legion to this container
Codeptr: convenience method to add new member to cache
ComFormer without AST: add legion member to cache
ComFormer with AST: this method will add a new legion member to the cache
Human Written: this method will add a new legion member to the cache
2
public SimpleTextCategory(String sName) {
    super();
    Dictionary = new Dictionary(sName, NUM_);
    Dictionary.AddSubStrings = BOOL_;
    Dictionary.RemoveSubStrings = BOOL_;
    DocumentCount = NUM_;
    Name = sName;
}
Hybrid-DeepCom: creates a new instance of UNK
Codeptr: creates a new instance of UNK
ComFormer without AST: creates a new instance of insectcategory given a category name .
ComFormer with AST: creates a new instance of insectwordcategory
Human Written: creates a new instance of insectwordcategory
3
public void insertNewAddresses(Dao<AddressBook, Integer> addressBookDao,
    Dao<AddressItem, Integer> addressItemDao) throws SQLException {
    if (this.nonDaoAddresses == null) {
        return;
    }
    addressBookDao.assignEmptyForeignCollection(this, STR_);
    for (AddressItem addressItem : nonDaoAddresses) {
        addressItem.setAddressBook(this);
        addressItemDao.create(addressItem);
        addresses.add(addressItem);
    }
}
Hybrid-DeepCom: we need to insert a new contacts into the book.
Codeptr: we have separate ormlite doesn t automatically insert ormlite doesn doesn t automatically insert ormlite doesn doesn t automatically .
ComFormer without AST: insert new contact to database
ComFormer with AST: we have to do this separate step because ormlite doesn t automatically insert children
Human Written: we have to do this separate step because ormlite doesn t automatically insert children
4
public static Class<?> findCommonElementType(Collection collection) {
    if (isEmpty(collection)) {
        return null;
    }
    Class<?> candidate = null;
    for (Object val : collection) {
        if (val != null) {
            if (candidate == null) {
                candidate = val.getClass();
            } else if (candidate != val.getClass()) {
                return null;
            }
        }
    }
    return candidate;
}
Hybrid-DeepCom: finds the common element type for a given collection.
Codeptr: find the common element of the given collection.
ComFormer without AST: find the common element type of the given collection.
ComFormer with AST: find the common element type of the given collection if any.
Human Written: find the common element type of the given collection if any.
TABLE III: Examples of generated comments by ComFormer and other baselines. These examples cover both long and short code snippets.
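The OOV mitigation illustrated by Case 1 relies on splitting compound identifiers such as addCachedLegionMemberEx into subwords, so rare identifiers do not fall out of the vocabulary. ComFormer uses Byte-BPE for this; as a simpler illustration of the same idea, a camelCase splitter can be sketched (the function name and regex below are our own, not from the paper):

```python
import re

def split_camel_case(identifier):
    """Split a compound identifier into lowercase subword tokens.
    Note: a simplified stand-in for the Byte-BPE splitting
    actually used by ComFormer."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", identifier)
    return [p.lower() for p in parts]

print(split_camel_case("addCachedLegionMemberEx"))
# -> ['add', 'cached', 'legion', 'member', 'ex']
```

Each subword ("add", "cached", "member", ...) is far more likely to be in the vocabulary than the full identifier, which is how the word "cache" can surface in a generated comment.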

Summary for RQ1: Our proposed method ComFormer outperforms state-of-the-art baselines from both the code comment generation domain and the neural machine translation domain in terms of three performance measures. Moreover, case analysis shows that the comments generated by ComFormer have better quality.

V-B Result Analysis for RQ2

RQ2: Can hybrid code representation improve the performance of our proposed method ComFormer?

Method. As shown in Fig. 4, we consider three different fusion methods (i.e., Jointly Encoder, Shared Encoder, and Single Encoder) to combine code lexical information and AST syntactical information.
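As a minimal sketch of how the Single Encoder variant can fuse the two modalities, the code token sequence and the AST traversal sequence can be joined into a single input sequence for one encoder (the separator token, token sequences, and function below are hypothetical illustrations, not the paper's implementation):

```python
def single_encoder_input(code_tokens, ast_tokens, sep="<sep>"):
    """Single Encoder fusion: concatenate lexical (code) and
    syntactical (AST traversal) tokens into one sequence fed to a
    single Transformer encoder. The Jointly/Shared Encoder variants
    would instead keep two sequences and use two encoders
    (with separate or tied weights, respectively)."""
    return code_tokens + [sep] + ast_tokens

# Hypothetical toy inputs
code = ["public", "void", "add", "(", ")"]
sbt = ["(", "MethodDeclaration", "(", "SimpleName", ")", ")"]
print(single_encoder_input(code, sbt))
```

The appeal of this design is that the self-attention layers can attend across the lexical/syntactical boundary directly, without an extra fusion layer.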

Results. The comparison results are shown in Table IV. First, all three fusion methods achieve better performance than ComFormer without AST, which means that considering syntactical information from the AST can further improve the performance of ComFormer. Second, among the three fusion methods, Single Encoder achieves the best performance, which means Single Encoder is best suited for this task.

Summary for RQ2: Hybrid code representation can improve the performance of our proposed method ComFormer, and Single Encoder achieves the best performance among the three fusion methods.

METHOD BLEU_4(%) METEOR(%) ROUGE_L(%)
ComFormer without AST 43.801 31.711 60.539
Jointly Encoder 46.301 32.925 63.012
Shared Encoder 44.512 32.052 62.105
Single Encoder 48.437 34.182 63.249
TABLE IV: The comparison results between three different fusion methods
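The ROUGE_L scores in Table IV are based on the longest common subsequence (LCS) between the generated and the reference comment. A minimal stdlib sketch of the ROUGE-L F-score follows, using the balanced F1 variant (the paper's evaluation script may weight precision and recall differently):

```python
def lcs_len(a, b):
    """Dynamic-programming length of the longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l(hypothesis, reference):
    """ROUGE-L F-score over token lists (F1 variant)."""
    lcs = lcs_len(hypothesis, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(hypothesis), lcs / len(reference)
    return 2 * p * r / (p + r)

# Case 4 from Table III: "ComFormer without AST" vs. the human-written comment
hyp = "find the common element type of the given collection .".split()
ref = "find the common element type of the given collection if any .".split()
print(round(rouge_l(hyp, ref), 4))  # -> 0.9091
```

Here the hypothesis misses only "if any", so precision is 1.0, recall is 10/12, and the F-score is 10/11.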

V-C Result Analysis for RQ3

RQ3: Can our proposed method ComFormer outperform state-of-the-art baselines for code comment generation via human study?

Method. In RQ1, the performance comparison is performed automatically in terms of neural machine translation-based performance measures. To further verify the effectiveness of our proposed method, we conduct a human study. We recruit two master students majoring in computer science to perform the manual analysis. Since both students have rich project development experience, the quality of our human study can be guaranteed.

Due to the high cost of manually analyzing all the Java methods in the testing set, we use a common sampling method [38] to randomly select at least MIN Java methods and the generated comments from the testing set. The value of MIN can be determined by the following formula:

MIN = n0 / (1 + (n0 - 1) / N)    (5)

where n0 = (Z^2 × 0.25) / e^2 depends on the selected confidence level and the desired error margin. Z is the confidence level score, e is the error margin, and N is the number of samples in the testing set. For the final manual analysis, we select 377 examples for the error margin e = 0.05 at the 95% confidence level (i.e., Z = 1.96).
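The sampling computation can be sketched as follows (the testing-set size N = 20,000 is our assumption based on the data split of Hu et al. [19]):

```python
import math

def sample_size(N, z=1.96, e=0.05):
    """Minimum sample size for a finite population of size N at
    confidence score z and error margin e. The 0.25 factor comes
    from assuming p = 0.5 (maximum variance)."""
    n0 = (z ** 2) * 0.25 / (e ** 2)            # infinite-population sample size
    return math.ceil(n0 / (1 + (n0 - 1) / N))  # finite-population correction

print(sample_size(20000))  # -> 377
```

With z = 1.96 and e = 0.05, n0 ≈ 384.16, and the finite-population correction brings the required sample down to 377 for a testing set of 20,000 methods.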

For the 377 selected samples, we show the corresponding source code, the comments generated by ComFormer, and the comments generated by Hybrid-DeepCom to the master students. Note that the two master students do not know which method generated each comment, which guarantees a fair comparison.

Three scores are defined as follows.

  • 1 means that there is no connection between the comment and the code, i.e., the comment does not describe the function and meaning of the corresponding code. We use Low to denote this result.

  • 2 means that the comment is partially related to the code, i.e., it describes part of the function and meaning of the corresponding code. We use Medium to denote this result.

  • 3 means that there is a strong connection between the comment and the code, i.e., the comment correctly describes the function and meaning of the corresponding code. We use High to denote this result.

Results. After the human study, we analyze the scoring results of the two master students. The final results are shown in Table V. First, ComFormer generates a significantly higher proportion of high-quality comments than Hybrid-DeepCom. Second, ComFormer generates a much lower proportion of low-quality comments than Hybrid-DeepCom. Finally, ComFormer achieves a higher mean score than Hybrid-DeepCom. These results indicate that ComFormer can significantly outperform the baseline Hybrid-DeepCom.

Summary for RQ3: Our proposed method ComFormer also outperforms the baseline Hybrid-DeepCom in the human study.

Student  Low (ComFormer / Hybrid-DeepCom)  Medium (ComFormer / Hybrid-DeepCom)  High (ComFormer / Hybrid-DeepCom)  Mean (ComFormer / Hybrid-DeepCom)
1        2.39% / 12.20%                    28.12% / 35.28%                      69.49% / 52.52%                    2.67 / 2.40
2        4.51% / 10.34%                    29.90% / 35.01%                      65.59% / 54.65%                    2.61 / 2.46
TABLE V: Manual analysis results on comments generated by ComFormer and Hybrid-DeepCom via human study. Each cell shows ComFormer / Hybrid-DeepCom.
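The mean scores in Table V follow directly from the score distribution: mean = 1·Low + 2·Medium + 3·High. This can be checked against the reported values for student 1:

```python
def mean_score(low, medium, high):
    """Weighted mean of the 1/2/3 quality scores given the
    fraction of comments rated Low, Medium, and High."""
    return 1 * low + 2 * medium + 3 * high

# Student 1 percentages from Table V
print(f"{mean_score(0.0239, 0.2812, 0.6949):.2f}")  # -> 2.67 (ComFormer)
print(f"{mean_score(0.1220, 0.3528, 0.5252):.2f}")  # -> 2.40 (Hybrid-DeepCom)
```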

VI Discussions

In this section, we analyze the impact of code length on the performance of ComFormer and Hybrid-DeepCom. The final results in terms of two performance measures can be found in Fig. 5. As shown in Fig. 5, the longer the source code, the lower the average METEOR and ROUGE_L scores. Both methods obtain higher performance when the code length is between 15 and 50 tokens: when the source code is short, the two methods can learn its full semantics more easily. The performance fluctuates significantly when the number of tokens in the source code exceeds 125, because the corpus contains few code snippets longer than 125 tokens, which limits ComFormer's ability to learn from this kind of code. Overall, ComFormer outperforms Hybrid-DeepCom regardless of code length.

(a) METEOR scores for different code lengths
(b) ROUGE_L scores for different code lengths
Fig. 5: Performance comparison between ComFormer and Hybrid-DeepCom for different code lengths in terms of two performance measures. The blue line denotes ComFormer and the yellow line denotes Hybrid-DeepCom.

VII Threats to Validity

In this section, we mainly discuss potential threats to the validity of our empirical study.

Internal threats. The first internal threat is potential defects in the implementation of our proposed method. To alleviate this threat, we check our code carefully and use mature libraries, such as PyTorch and Transformers (https://github.com/huggingface/transformers). The second internal threat is the implementation correctness of our chosen baseline methods. To alleviate this threat, we re-implement these baselines according to their original descriptions, and our implementations achieve performance similar to that reported in their empirical studies.

External threats. The first external threat is the choice of the corpus. To alleviate this threat, we select the corpus provided by Hu et al. [19] for the following reasons. First, Java is one of the most popular programming languages, and many open-source projects are developed in Java. Second, the quality of this code corpus has been improved by Hu et al. through data preprocessing. Therefore, this code corpus has also been used in previous studies on code comment generation [19][18][22][47][54][49]. In the future, we want to verify the effectiveness of our proposed method on corpora of other programming languages (such as Python and C#) [21].

Construct threats. The construct threat in this study concerns the performance measures used to evaluate our proposed method. To alleviate this threat, we choose three popular performance measures from the neural machine translation domain, which have also been widely used in previous code comment generation studies [20][45][19]. Moreover, we also perform a human study to show the competitiveness of our proposed method.

Conclusion threats. The conclusion threat in our study is that we do not perform cross-validation. The data split on the corpus is consistent with the experimental setting of the previous study on DeepCom [19], which guarantees a fair comparison with the baselines DeepCom, Hybrid-DeepCom, and CodePtr (i.e., model construction and application on the same training set, validation set, and testing set). Cross-validation could evaluate our proposed method more comprehensively, since different splits may result in diverse training, validation, and testing sets. However, this evaluation method is not commonly used in neural machine translation experiments due to its high training cost.

VIII Conclusion and Future Work

High-quality code comments are key to improving the program comprehension efficiency of developers. Inspired by recent advances in deep learning and program semantic learning, we propose a novel method ComFormer based on Transformer and fusion method-based hybrid code representation for code comment generation. In particular, we use the Transformer to automatically translate the target code into a code comment. Moreover, we use a hybrid code representation (i.e., capturing both lexical and syntactical information) to learn code semantics effectively. Both the empirical study and the human study verify the effectiveness of our proposed method ComFormer.

In the future, we first want to evaluate the effectiveness of ComFormer on corpora gathered from other programming languages, such as Python, C#, and SQL. Second, we want to use state-of-the-art deep learning methods to further improve the performance of our proposed method. Finally, we want to design more reasonable performance measures to better evaluate the quality of the code comments generated by ComFormer.

Acknowledgment

This work is supported in part by the National Natural Science Foundation of China (Grant Nos. 61702041 and 61202006) and the Open Project of the Key Laboratory of Safety-Critical Software for Nanjing University of Aeronautics and Astronautics, Ministry of Industry and Information Technology (Grant No. NJ2020022).

References

  • [1] N. J. Abid, N. Dragan, M. L. Collard, and J. I. Maletic (2015) Using stereotypes in the automatic generation of natural language summaries for c++ methods. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 561–565. Cited by: §II.
  • [2] W. U. Ahmad, S. Chakraborty, B. Ray, and K. Chang (2020) A transformer-based approach for source code summarization. arXiv preprint arXiv:2005.00653. Cited by: §I, §II, §III-A1, §V-A.
  • [3] M. Allamanis, H. Peng, and C. Sutton (2016) A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning, pp. 2091–2100. Cited by: §II.
  • [4] U. Alon, M. Zilberstein, O. Levy, and E. Yahav (2019) Code2vec: learning distributed representations of code. Proceedings of the ACM on Programming Languages 3 (POPL), pp. 1–29. Cited by: §II.
  • [5] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, Cited by: §I, §V-A.
  • [6] S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72. Cited by: §IV-B.
  • [7] K. Cao, C. Chen, S. Baltes, C. Treude, and X. Chen (2021) Automated query reformulation for efficient search based on query logs from stack overflow. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pp. 1273–1285. Cited by: §I.
  • [8] N. Chang-An, G. Ji-Dong, T. Ze, L. Chuan-Yi, Z. Yu, and L. Bin (2021) Automatic generation of source code comments model based on pointer-generator network. doi:10.13328/j.cnki.jos.006270. Cited by: §I, §III-A1, §V-A.
  • [9] Q. Chen and M. Zhou (2018) A neural framework for retrieval and summarization of source code. In 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 826–831. Cited by: §II.
  • [10] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §III-B1.
  • [11] B. P. Eddy, J. A. Robinson, N. A. Kraft, and J. C. Carver (2013) Evaluating source code summarization techniques: replication and expansion. In 2013 21st International Conference on Program Comprehension (ICPC), pp. 13–22. Cited by: §II.
  • [12] M. Freitag and Y. Al-Onaizan (2017) Beam search strategies for neural machine translation. arXiv preprint arXiv:1702.01806. Cited by: §III-C.
  • [13] D. Guo, W. Zhou, H. Li, and M. Wang (2018) Hierarchical lstm for sign language translation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §IV.
  • [14] S. Haiduc, J. Aponte, and A. Marcus (2010) Supporting program comprehension with source code summarization. In 2010 acm/ieee 32nd international conference on software engineering, Vol. 2, pp. 223–226. Cited by: §I, §II.
  • [15] S. Haiduc, J. Aponte, L. Moreno, and A. Marcus (2010) On the use of automated text summarization techniques for summarizing source code. In 2010 17th Working Conference on Reverse Engineering, pp. 35–44. Cited by: §I, §II.
  • [16] S. Haque, A. LeClair, L. Wu, and C. McMillan (2020) Improved automatic summarization of subroutines via attention to file context. arXiv preprint arXiv:2004.04881. Cited by: §II.
  • [17] H. He (2019) Understanding source code comments at large-scale. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1217–1219. Cited by: §I.
  • [18] X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin (2018) Deep code comment generation. In 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC), pp. 200–20010. Cited by: §I, §I, §I, §II, §II, §III-A2, §III-C, §IV-A, §V-A, §VII.
  • [19] X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin (2020) Deep code comment generation with hybrid lexical and syntactical information. Empirical Software Engineering 25 (3), pp. 2179–2217. Cited by: §I, §I, §II, §IV-A, §IV-A, §IV-B, §IV, §V-A, §VII, §VII, §VII.
  • [20] X. Hu, G. Li, X. Xia, D. Lo, S. Lu, and Z. Jin (2018) Summarizing source code with transferred api knowledge. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI 2018), Vol. 19, pp. 2269–2275. Cited by: §I, §II, §III-C, §IV-B, §IV, §VII.
  • [21] S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer (2016) Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2073–2083. Cited by: §I, §II, §VII.
  • [22] H. J. Kang, T. F. Bissyandé, and D. Lo (2019) Assessing the generalizability of code2vec token embeddings. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1–12. Cited by: §I, §II, §IV-A, §VII.
  • [23] D. Kramer (1999) API documentation from source code comments: a case study of javadoc. In Proceedings of the 17th annual international conference on Computer documentation, pp. 147–153. Cited by: §I.
  • [24] A. LeClair, S. Haque, L. Wu, and C. McMillan (2020) Improved code summarization via a graph neural network. arXiv preprint arXiv:2004.02843. Cited by: §II.
  • [25] A. LeClair, S. Jiang, and C. McMillan (2019) A neural model for generating natural language summaries of program subroutines. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 795–806. Cited by: §II, footnote 2.
  • [26] Y. Liang and K. Q. Zhu (2018) Automatic generation of text descriptive comments for code blocks. arXiv preprint arXiv:1808.06880. Cited by: §II.
  • [27] C. Lin and F. J. Och (2004) Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pp. 605–612. Cited by: §IV.
  • [28] C. Lin (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §IV-B.
  • [29] B. Liu, T. Wang, X. Zhang, Q. Fan, G. Yin, and J. Deng (2019) A neural-network based code summarization approach by using source code and its call dependencies. In Proceedings of the 11th Asia-Pacific Symposium on Internetware, pp. 1–10. Cited by: §II.
  • [30] P. W. McBurney and C. McMillan (2015) Automatic source code summarization of context for java methods. IEEE Transactions on Software Engineering 42 (2), pp. 103–119. Cited by: §II.
  • [31] L. Moreno, J. Aponte, G. Sridhara, A. Marcus, L. Pollock, and K. Vijay-Shanker (2013) Automatic generation of natural language summaries for java classes. In 2013 21st International Conference on Program Comprehension (ICPC), pp. 23–32. Cited by: §II.
  • [32] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318. Cited by: §IV-B.
  • [33] J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §II.
  • [34] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §I, §V-A.
  • [35] A. Raganato, J. Tiedemann, et al. (2018) An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Cited by: §I.
  • [36] P. Rodeghero, C. Liu, P. W. McBurney, and C. McMillan (2015) An eye-tracking study of java programmers and application to source code summarization. IEEE Transactions on Software Engineering 41 (11), pp. 1038–1054. Cited by: §II.
  • [37] P. Rodeghero, C. McMillan, P. W. McBurney, N. Bosch, and S. D’Mello (2014) Improving automated source code summarization via an eye-tracking study of programmers. In Proceedings of the 36th international conference on Software engineering, pp. 390–401. Cited by: §II.
  • [38] R. Singh and N. S. Mangat (2013) Elements of survey sampling. Vol. 15, Springer Science & Business Media. Cited by: §V-C.
  • [39] G. Sridhara, E. Hill, D. Muppaneni, L. Pollock, and K. Vijay-Shanker (2010) Towards automatically generating summary comments for java methods. In Proceedings of the IEEE/ACM international conference on Automated software engineering, pp. 43–52. Cited by: §I, §II.
  • [40] G. Sridhara, L. Pollock, and K. Vijay-Shanker (2011) Automatically detecting and describing high level actions within methods. In 2011 33rd International Conference on Software Engineering (ICSE), pp. 101–110. Cited by: §I, §II.
  • [41] G. Sridhara, L. Pollock, and K. Vijay-Shanker (2011) Generating parameter comments and integrating with method summaries. In 2011 IEEE 19th International Conference on Program Comprehension, pp. 71–80. Cited by: §II.
  • [42] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §I, §V-A.
  • [43] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalchbrenner, N. Parmar, et al. (2018) Tensor2tensor for neural machine translation. arXiv preprint arXiv:1803.07416. Cited by: §I.
  • [44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §I, §III-B1.
  • [45] Y. Wan, Z. Zhao, M. Yang, G. Xu, H. Ying, J. Wu, and P. S. Yu (2018) Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pp. 397–407. Cited by: §II, §IV-B, §IV, §VII, footnote 2.
  • [46] C. Wang, K. Cho, and J. Gu (2020) Neural machine translation with byte-level subwords. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 9154–9160. Cited by: §I, §III-A1.
  • [47] W. Wang, Y. Zhang, Y. Sui, Y. Wan, Z. Zhao, J. Wu, P. Yu, and G. Xu (2020) Reinforcement-learning-guided source code summarization via hierarchical attention. IEEE Transactions on Software Engineering. Cited by: §I, §II, §IV-A, §VII.
  • [48] X. Wang, L. Pollock, and K. Vijay-Shanker (2017) Automatically generating natural language descriptions for object-related statement sequences. In 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 205–216. Cited by: §II.
  • [49] B. Wei, G. Li, X. Xia, Z. Fu, and Z. Jin (2019) Code generation as a dual task of code summarization. In Advances in Neural Information Processing Systems, pp. 6563–6573. Cited by: §I, §II, §IV-A, §VII.
  • [50] S. Wiseman and A. M. Rush (2016) Sequence-to-sequence learning as beam-search optimization. arXiv preprint arXiv:1606.02960. Cited by: §III-C.
  • [51] E. Wong, T. Liu, and L. Tan (2015) Clocom: mining existing source code for automatic comment generation. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), pp. 380–389. Cited by: §II.
  • [52] X. Xia, L. Bao, D. Lo, Z. Xing, A. E. Hassan, and S. Li (2017) Measuring program comprehension: a large-scale field study with professionals. IEEE Transactions on Software Engineering 44 (10), pp. 951–976. Cited by: §I.
  • [53] W. Ye, R. Xie, J. Zhang, T. Hu, X. Wang, and S. Zhang (2020) Leveraging code generation to improve code retrieval and summarization via dual learning. In Proceedings of The Web Conference 2020, pp. 2309–2319. Cited by: §II.
  • [54] J. Zhang, X. Wang, H. Zhang, H. Sun, and X. Liu (2020) Retrieval-based neural source code summarization. In Proceedings of the 42nd International Conference on Software Engineering. IEEE, Cited by: §I, §II, §IV-A, §VII.
  • [55] W. Zheng, H. Zhou, M. Li, and J. Wu (2017) Code attention: translating code to comments by exploiting domain features. arXiv preprint arXiv:1709.07642. Cited by: §II.
  • [56] Y. Zhou, X. Yan, W. Yang, T. Chen, and Z. Huang (2019) Augmenting java method comments generation with context information based on neural networks. Journal of Systems and Software 156, pp. 328–340. Cited by: §II.