Learning to Generate Code Comments from Class Hierarchies

03/24/2021 ∙ by Jiyang Zhang, et al. ∙ 0

Descriptive code comments are essential for supporting code comprehension and maintenance. We propose the task of automatically generating comments for overriding methods. We formulate a novel framework which accommodates the unique contextual and linguistic reasoning that is required for performing this task. Our approach features: (1) incorporating context from the class hierarchy; (2) conditioning on learned, latent representations of specificity to generate comments that capture the more specialized behavior of the overriding method; and (3) unlikelihood training to discourage predictions which do not conform to invariant characteristics of the comment corresponding to the overridden method. Our experiments show that the proposed approach is able to generate comments for overriding methods of higher quality compared to prevailing comment generation techniques.




1 Introduction

Developers rely on natural language comments to understand key aspects of the source code they accompany, such as implementation, functionality, and usage. Comments are critical in supporting code comprehension and maintenance. However, writing them can be laborious and time-consuming, which is not ideal for fast-development cycles. Unsurprisingly, this has sparked interest in automatic comment generation Iyer et al. (2016); Ahmad et al. (2020); Clement et al. (2020); Yu et al. (2020).

Prior work on automatic comment generation studies various comment types, including class-level comments Haiduc et al. (2010); Moreno et al. (2013) and method-level comments Sridhara et al. (2010); Iyer et al. (2016); Liang and Zhu (2018); Hu et al. (2019); LeClair et al. (2020); Fernandes et al. (2019); Liu et al. (2020b). A large portion of method-level comments correspond to overriding methods. Figure 1 shows an example of method overriding, where the class InfoAccessSyntax inherits characteristics (methods and fields) from its superclass Object, and InfoAccessSyntax.getEncoded() overrides Object.getEncoded(). Tempero et al. (2010) find that half of the projects they inspected have 53% or more of their subclasses overriding at least one method from the superclass.

Automatic comment generation for overriding methods poses unique challenges. In the example in Figure 1, InfoAccessSyntax.getEncoded() adds the functionality to initialize the encoding, i.e., when the encoding is null, the method will compute and use the encoding. Generating a comment for the overriding method, InfoAccessSyntax.getEncoded(), requires reasoning about how the method relates to the overridden method, Object.getEncoded(), and addressing the customization that is done. In this paper, we formulate the task of generating comments for overriding methods and propose an approach, Hierarchy-Aware Seq2Seq, for facilitating the complex contextual and linguistic reasoning that is needed for performing this task.

Hierarchy-Aware Seq2Seq leverages contextual information from multiple components of the class hierarchy. Namely, we incorporate a learned representation of the comment corresponding to the overridden method in the superclass, as this provides a general template that can be adapted. We also encode the class name, which often sheds light on the customization and sometimes appears in the comment, as seen in Figure 1 (infoAccessSyntax in the comment).

Relative to comments for overridden methods, we observe that comments accompanying overriding methods use more specific language to describe the specialized functionality of the overriding method (e.g., “ASN.1” in Figure 1). However, encoder-decoder models naturally prefer tokens with higher frequency in the data during decoding, a known phenomenon in dialog generation Sordoni et al. (2015); Serban et al. (2016); Li et al. (2016) that causes a preference for generic language. To encourage predictions which are more specific in nature, we factor out the notion of specificity Zhang et al. (2018); Ko et al. (2019) in our model and condition on learned representations of specificity when generating comments. Additionally, we capture the intuition that the overriding comment should retain key aspects of the overridden comment that are not affected by the customization done in the overriding method. To do this, we use unlikelihood training Li et al. (2020); Welleck et al. (2020) with synthetically generated sequences that do not conform to invariant characteristics of the overridden comment.

public class Object {
    protected byte[] encoding;
    /** Returns encoded form of the object. */
    public byte[] getEncoded() {
        return encoding;}}
(a) getEncoded() in the superclass, Object.
public class InfoAccessSyntax extends Object {
    /** Returns ASN.1 encoded form of this infoAccessSyntax. */
    @Override public byte[] getEncoded() {
        if (encoding == null) {
            encoding = ASN1.encode(this);}
        return encoding;}}
(b) getEncoded() in the subclass, InfoAccessSyntax.
Figure 1: Class InfoAccessSyntax extends the superclass Object and overrides the getEncoded() method.

For training and evaluation, we construct a parallel corpus consisting of 10,980 Java overriding-overridden method-comment tuples, with the corresponding details regarding class hierarchy. Experimental results show that our approach outperforms multiple baselines, including prevailing models for comment generation Hu et al. (2019) and comment update Panthaplackel et al. (2020), by significant margins. Through an ablation study, we demonstrate the effectiveness of the class hierarchy context, the specificity-factored architecture, and unlikelihood training for conformity, each of which yields a statistically significant improvement in generation quality.

We summarize our main contributions as follows: (1) we formulate the novel task of generating comments for overriding methods; (2) to address this task, we design an approach that leverages aspects of the class hierarchy to gather broader contextual information as well as to learn notions of specificity and conformity, and we show that this approach can outperform multiple baselines; (3) for training and evaluation, we build a large corpus of overriding-overridden method-comment tuples, with their associated class hierarchy information. We will make our code and data publicly available upon publication.

2 Task

When a developer overrides a method, we aim to automatically generate a natural language comment which accurately reflects the method’s behavior, capturing important aspects of the class hierarchy as well as the customization that extends the implementation in the superclass. Concretely, in Figure 1, suppose a developer writes the method body of getEncoded() ($M_{sub}$) in class InfoAccessSyntax ($K_{sub}$) which overrides the parent method, $M_{sup}$, in the superclass, Object ($K_{sup}$). Our task is to generate the comment for $M_{sub}$, $C_{sub}$ (“Returns ASN.1 encoded form of this infoAccessSyntax”), using context provided by the overriding method body, $M_{sub}$, as well as the inputs from the class hierarchy. This includes the name of $K_{sub}$ (InfoAccessSyntax), the name of $K_{sup}$ (Object), the overridden method body ($M_{sup}$), and $C_{sup}$ (i.e., $M_{sup}$’s comment, “Returns encoded form of the object”).
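The task's inputs and target can be pictured as a small record. The following is a minimal sketch, not the authors' data format: the class name `OverrideExample` and all field names are hypothetical, and the values are taken from Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverrideExample:
    """One example for the task; all field names are illustrative assumptions."""
    sub_method: str     # body of the overriding method
    sub_class: str      # name of the subclass
    super_method: str   # body of the overridden method
    super_class: str    # name of the superclass
    super_comment: str  # comment of the overridden method
    sub_comment: str    # target: comment of the overriding method

    def model_inputs(self):
        """Everything the model may read; the target comment is excluded."""
        return (self.sub_method, self.sub_class, self.super_method,
                self.super_class, self.super_comment)


example = OverrideExample(
    sub_method="if (encoding == null) { encoding = ASN1.encode(this); } return encoding;",
    sub_class="InfoAccessSyntax",
    super_method="return encoding;",
    super_class="Object",
    super_comment="Returns encoded form of the object.",
    sub_comment="Returns ASN.1 encoded form of this infoAccessSyntax.",
)
```

Keeping the target out of `model_inputs()` makes the input/output split of the task explicit.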

3 Hierarchy-Aware Seq2Seq

Our approach, Hierarchy-Aware Seq2Seq, leverages context from the class hierarchy to generate the overriding comment, $C_{sub}$. Depicted in Figure 2, this is a Seq2Seq Sutskever et al. (2014) model that decodes using learned representations from three encoders: the method ($M_{sub}$) encoder (Section 3.1), the class name ($K_{sub}$) encoder (Section 3.2), and the superclass comment ($C_{sup}$) encoder (Section 3.3). We also incorporate token-level auxiliary features into each of these encoders in order to further capture patterns pertaining to the class hierarchy, as well as properties of code and comments. Next, we discourage generating generic predictions by additionally injecting tailored representations that capture specificity (Section 3.5). Finally, we use unlikelihood training to discourage the model from making predictions which fail to capture key aspects of $C_{sup}$ that should persist in the generated comment for $M_{sub}$ (Section 3.6).
















Figure 2: The neural architecture of Hierarchy-Aware Seq2Seq. The decoder is conditioned on a specificity level: the level of the target comment at training time, and the maximum specificity level at inference time.

3.1 Method Encoder

We use a bidirectional GRU Cho et al. (2014) encoder to learn a representation for the overriding method body, $M_{sub}$. We concatenate auxiliary features to the embedding vector corresponding to each input token before feeding it into the encoder. We include indicator features for whether a token is a Java keyword or operator, to capture common patterns among these types of tokens. Next, we compute the diff between $M_{sub}$ and the overridden method body, $M_{sup}$, in order to construct features identifying whether a code token is retained, added, deleted, or replaced. By establishing the similarities and differences, we aim to capture how the subclass implementation ($M_{sub}$) relates to that of the superclass ($M_{sup}$). In addition, we add features indicating whether the code tokens overlap with the class names ($K_{sub}$ and $K_{sup}$) or the superclass comment ($C_{sup}$), as they usually contain important tokens such as object names and field names which may appear in the comments. We believe this will help guide the model in determining how to utilize the context provided by other components of the class hierarchy.
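To make the diff-based features concrete, here is a minimal sketch that labels each token of the overriding method relative to the overridden one. It uses Python's standard `difflib.SequenceMatcher` as a stand-in; the paper does not say which diff algorithm was used, and the function name `diff_features` is hypothetical.

```python
import difflib


def diff_features(sub_tokens, super_tokens):
    """Label each overriding-method token as retained, added, or replaced.

    Tokens that exist only in the overridden method have no position in the
    overriding method, so they are returned separately as `deleted`.
    """
    labels = ["retained"] * len(sub_tokens)
    deleted = []
    matcher = difflib.SequenceMatcher(a=super_tokens, b=sub_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            for j in range(j1, j2):
                labels[j] = "added"
        elif tag == "replace":
            for j in range(j1, j2):
                labels[j] = "replaced"
        elif tag == "delete":
            deleted.extend(super_tokens[i1:i2])
    return labels, deleted


# Token sequences for the two getEncoded() bodies in Figure 1.
super_tokens = ["return", "encoding", ";"]
sub_tokens = ["if", "(", "encoding", "==", "null", ")", "{",
              "encoding", "=", "ASN1", ".", "encode", "(", "this", ")", ";", "}",
              "return", "encoding", ";"]
labels, deleted = diff_features(sub_tokens, super_tokens)
```

In this example the trailing `return encoding ;` is labeled as retained and the new null check is labeled as added, which is the signal the features are meant to expose.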

3.2 Class Name Encoder

Because the overriding method ($M_{sub}$) is designed specifically for the subclass ($K_{sub}$), the class name can often provide deeper insight into $M_{sub}$’s functionality. For instance, in Figure 1, it is impossible to generate the correct comment without context from $K_{sub}$ (InfoAccessSyntax). Using a bidirectional GRU encoder, we learn a representation of $K_{sub}$ as a sequence of subtokens (we split class names in the form of CamelCase into subtokens). Similarly, we extract features identifying whether a token in $K_{sub}$ is retained, added, deleted, or replaced compared with the superclass name, $K_{sup}$. Overlap features analogous to those used in the method encoder are included as well.

3.3 Superclass Comment Encoder

We find the hierarchical relationship between the overriding method ($M_{sub}$) and the overridden method ($M_{sup}$) to often hold between their accompanying comments. Namely, similar to how $M_{sub}$ inherits the general structure and functionality from $M_{sup}$, the overriding comment ($C_{sub}$) often follows the general format and shares some amount of content with the superclass comment ($C_{sup}$), as seen in Figure 1. To provide context from the comment in the superclass, we introduce an encoder for learning a representation of $C_{sup}$. Like the other encoders, we concatenate features to the embedding vector corresponding to each token in the input sequence. We include features capturing lexical overlap with $M_{sub}$ and $K_{sub}$ as we expect similar patterns to emerge in $C_{sub}$. We also incorporate features that have been used to characterize comments in prior work (Panthaplackel et al., 2020): whether the token appears more than once, is a stop word, and its part-of-speech tag.

3.4 Decoder

We concatenate the final hidden states of each of the encoders to initialize the GRU decoder. To further inform the decoder of the context from the three inputs, we leverage attention Luong et al. (2015) over the hidden states of all of these encoders. We additionally allow the decoder to copy tokens related to the implementation and class hierarchy from context provided by the inputs through a pointer network Vinyals et al. (2015) over the hidden states of all three encoders.

3.5 Conditioning on Specificity

To test the hypothesis that the overriding comment ($C_{sub}$) contains information specific to the overriding method ($M_{sub}$), we compare the specificity of $C_{sub}$ and the superclass comment ($C_{sup}$) using normalized inverse word frequency (NIWF) Zhang et al. (2018). NIWF is a frequency-based metric that computes the maximum of the Inverse Word Frequency (IWF) of all the tokens in a comment $C$, normalized to 0–1:

$$\mathrm{NIWF}(C) = \max_{w \in C} \frac{\log(1 + |D|)}{1 + f_w}$$

where $|D|$ is the number of comments in the corpus and $f_w$ denotes the number of comments that contain the token $w$. We apply add-one smoothing such that the metric can be computed when the comment contains out-of-vocabulary tokens.
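The metric can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the inverse word frequency takes the form log(1 + number of comments) / (1 + document frequency), which follows the description above and the cited prior work, and the 0–1 normalization is assumed to be min-max scaling over the corpus. All function names are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(corpus):
    """Count, for each token, the number of comments that contain it."""
    freq = Counter()
    for comment in corpus:
        freq.update(set(comment))
    return freq


def raw_niwf(comment, doc_freq, num_comments):
    """Maximum smoothed inverse word frequency over the tokens of one comment."""
    return max(math.log(1 + num_comments) / (1 + doc_freq.get(tok, 0))
               for tok in comment)


def normalized_niwf(corpus):
    """Min-max scale the raw scores to the range 0-1 (assumed normalization)."""
    doc_freq = document_frequencies(corpus)
    raw = [raw_niwf(c, doc_freq, len(corpus)) for c in corpus]
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [0.0] * len(raw)
    return [(r - lo) / (hi - lo) for r in raw]


corpus = [
    "returns encoded form of the object".split(),
    "returns asn 1 encoded form".split(),
    "returns the form".split(),
]
scores = normalized_niwf(corpus)
```

The third comment consists only of tokens that occur in several comments, so it receives the lowest score, while the comments containing rare tokens such as "asn" or "object" receive the highest.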

We found significant differences in the distribution of NIWF between the overriding comments ($C_{sub}$) and the superclass comments ($C_{sup}$): the NIWF metric of first sentence comments (the summary sentences of comments) is 0.249 on average for $C_{sub}$ and 0.174 on average for $C_{sup}$; the NIWF metric of full description comments (the entire main descriptions, including the summary sentences as well as low-level descriptions of the methods’ functionality) is 0.244 on average for $C_{sub}$ and 0.181 on average for $C_{sup}$. In both the first sentence and full description cases, the NIWF of $C_{sub}$ is statistically significantly higher than that of $C_{sup}$ according to paired Wilcoxon signed-rank tests Wilcoxon (1945) at the 95% confidence level.

To encourage the decoder to predict a sequence that is specific and concretely reflects the functionality of the overriding method ($M_{sub}$), we follow prior work in dialog generation Ko et al. (2019) and learn embeddings of specificity jointly with the model. Namely, we discretize NIWF into 5 levels of equal size, and associate an embedding with each level. During training, each comment is assigned to its corresponding level, so the embeddings are trained jointly with the model. At test time, following the intuition that overriding comments should be more specific, we use the level that maximizes specificity, and concatenate its embedding with the decoder input word embedding at each time step. (In our preliminary experiments, we also explored using the specificity level of the superclass comment, $C_{sup}$, at inference time, which is why we apply add-one smoothing when computing NIWF, so that it can be computed on $C_{sup}$ in the test set, which may contain out-of-vocabulary words; however, this resulted in worse performance.)
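The level assignment can be sketched as follows. This is a minimal illustration that assumes "levels of equal size" means each level holds roughly the same number of comments (equal-frequency binning); equal-width binning over the score range would be an equally plausible reading. The function name `assign_levels` is hypothetical.

```python
def assign_levels(scores, num_levels=5):
    """Map each score to a level in 0..num_levels-1 by rank (equal-frequency bins)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    levels = [0] * len(scores)
    for rank, idx in enumerate(order):
        levels[idx] = rank * num_levels // len(scores)
    return levels


# Training time: every comment is tagged with the level of its own score.
train_scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.0]
train_levels = assign_levels(train_scores)

# Inference time: the most specific level is always used.
inference_level = 5 - 1
```

Each level index would then select one learned embedding, which is concatenated with the decoder input at every time step.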

Because specificity alone would encourage the model to generate words of lower frequency, we additionally encourage the model to prefer tokens that are semantically similar to the input. Specifically, we calculate coherence, which measures the similarity between the representations of the overriding comment ($C_{sub}$) and the inputs ($x$); the representation of a sentence ($S$) is computed as the weighted average of all word embeddings in $S$, where the weight of each token is its inverse document frequency (with add-one smoothing):

$$\mathrm{coherence}(C_{sub}, x) = \cos\big(\mathrm{repr}(C_{sub}), \mathrm{repr}(x)\big)$$

where $\mathrm{repr}(S) = \frac{\sum_{w \in S} \mathrm{idf}(w)\, e_w}{\sum_{w \in S} \mathrm{idf}(w)}$, $\cos$ computes the cosine similarity, and $e_w$ is the word embedding of $w$. Similar to specificity, these coherence values are also discretized into 5 levels, and their embeddings are jointly trained with the model. At inference time, we select the maximum level for coherence.
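The coherence computation can be illustrated with plain Python lists as vectors. This is a minimal sketch under assumptions: the toy embeddings and IDF weights are invented for the example, tokens without an embedding are skipped, and the function names are hypothetical.

```python
import math


def sentence_vector(tokens, embeddings, idf):
    """IDF-weighted average of the word embeddings of a token sequence."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in embeddings:
            continue  # assumption: unknown tokens are ignored
        weight = idf.get(tok, 1.0)
        weight_sum += weight
        for k in range(dim):
            total[k] += weight * embeddings[tok][k]
    if weight_sum == 0.0:
        return total
    return [value / weight_sum for value in total]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coherence(comment_tokens, input_tokens, embeddings, idf):
    return cosine(sentence_vector(comment_tokens, embeddings, idf),
                  sentence_vector(input_tokens, embeddings, idf))


# Toy two-dimensional embeddings and IDF weights (illustrative values only).
embeddings = {"encoded": [1.0, 0.0], "form": [1.0, 0.0], "name": [0.0, 1.0]}
idf = {"encoded": 2.0, "form": 1.0, "name": 3.0}
same_topic = coherence(["encoded", "form"], ["encoded"], embeddings, idf)
off_topic = coherence(["name"], ["encoded", "form"], embeddings, idf)
```

A comment that reuses the vocabulary of the input scores close to 1, while an unrelated comment scores close to 0.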

3.6 Unlikelihood Training

We additionally ensure that the model prediction conforms to invariant characteristics of the superclass comment, $C_{sup}$. In other words, the model prediction should be suitable for the overriding method while simultaneously retaining the pertinent content and structure of the comment accompanying the overridden method. To discourage the model from generating sequences that do not preserve the salient information, we incorporate an unlikelihood objective into our training procedure Li et al. (2020); Welleck et al. (2020). Essentially, the model is trained to assign low probability scores to non-conforming sequences which do not contain all tokens that should be retained from $C_{sup}$. For each example in the training set, we synthetically generate a non-conforming instance by randomly removing 10% of the tokens of the overriding comment, $C_{sub}$, choosing among those which also appear in $C_{sup}$ (if the fraction of such tokens is less than 10%, after removing them, we randomly remove remaining tokens until 10% of $C_{sub}$’s tokens are removed). We refer to the examples in our standard training set as positive examples and the synthetically generated non-conforming cases as negative examples.
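The negative-example construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: rounding the 10% budget to the nearest integer with a minimum of one token is an assumption, and the function name `make_negative` is hypothetical.

```python
import random


def make_negative(sub_comment, super_comment, fraction=0.1, rng=None):
    """Drop ~10% of the overriding comment's tokens, shared tokens first."""
    rng = rng or random.Random(0)
    # Assumption: round the removal budget and remove at least one token.
    num_to_remove = max(1, round(fraction * len(sub_comment)))
    super_vocab = set(super_comment)
    shared = [i for i, tok in enumerate(sub_comment) if tok in super_vocab]
    others = [i for i, tok in enumerate(sub_comment) if tok not in super_vocab]
    rng.shuffle(shared)
    rng.shuffle(others)
    # Shared tokens are removed first; other tokens only fill any remaining budget.
    removed = set((shared + others)[:num_to_remove])
    return [tok for i, tok in enumerate(sub_comment) if i not in removed]


sub_comment = "returns asn 1 encoded form of this info access syntax".split()
super_comment = "returns encoded form of the object".split()
negative = make_negative(sub_comment, super_comment)
```

The resulting sequence is a near copy of the positive target that is missing a token it should have kept from the superclass comment, which is what the unlikelihood term penalizes.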

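During training, the standard likelihood loss can be calculated for each positive example:

$$\mathcal{L}_{MLE} = -\sum_{t=1}^{|y^{+}|} \log p\big(y^{+}_t \mid x, y^{+}_{<t}\big)$$

where $x$ is the input contexts, $y^{+}$ is the positive target, and $y^{+}_t$ is the $t$-th token of $y^{+}$. Meanwhile, the unlikelihood loss can be calculated on the corresponding negative example:

$$\mathcal{L}_{UL} = -\sum_{t=1}^{|y^{-}|} \log\big(1 - p(y^{-}_t \mid x, y^{-}_{<t})\big)$$

where $y^{-}$ is the generated negative target. The overall loss function mixes the likelihood and unlikelihood losses:

$$\mathcal{L} = \mathcal{L}_{MLE} + \alpha \mathcal{L}_{UL}$$

where $\alpha$ is the mixing hyper-parameter.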
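As a numeric illustration of how the two terms combine, the sketch below computes the mixed loss from per-token probabilities that a model has assigned to the positive and negative targets. It assumes the common formulation of unlikelihood training (negative log of one minus the token probability); the clamping constant and function names are hypothetical.

```python
import math

EPS = 1e-12  # assumption: clamp probabilities to keep the logarithms finite


def likelihood_loss(token_probs):
    """Negative log-likelihood of the positive target's tokens."""
    return -sum(math.log(max(p, EPS)) for p in token_probs)


def unlikelihood_loss(token_probs):
    """Penalty that grows as the model assigns probability to negative tokens."""
    return -sum(math.log(max(1.0 - p, EPS)) for p in token_probs)


def total_loss(positive_probs, negative_probs, alpha):
    """Mix the two losses with the hyper-parameter alpha."""
    return likelihood_loss(positive_probs) + alpha * unlikelihood_loss(negative_probs)


# Same positive target, but the second model is confident in the negative target.
careful = total_loss([0.9, 0.8], [0.1, 0.2], alpha=1.0)
careless = total_loss([0.9, 0.8], [0.9, 0.8], alpha=1.0)
```

The second configuration yields a larger loss, so gradient descent pushes probability mass away from the non-conforming sequence.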

4 Dataset

                        Train     Valid    Test
#Projects               414       16       41
Before preprocessing    26,206    1,361    1,534
First sentence          9,389     702      463
Full description        10,980    708      519
Table 1: Number of projects, number of examples before preprocessing, and number of examples after preprocessing (in first sentence and full description datasets) in training, validation, and testing sets.

                        Avg. len($C_{sub}$)    Avg. len($C_{sup}$)
First sentence          15.969                 16.685
Full description        37.688                 32.882
Table 2: Average number of subtokens in the overriding comment ($C_{sub}$) and the superclass comment ($C_{sup}$) in first sentence and full description datasets.

Because prior work does not consider the class hierarchy, existing datasets do not include the information needed for our models and task. Therefore, we build a new corpus by mining open-source Java projects for (method, comment) pairs along with information from their class hierarchy. From the same projects that comprise a commonly used comment generation dataset Hu et al. (2018), we extract examples in the form: (($M_{sub}$, $C_{sub}$, $K_{sub}$), ($M_{sup}$, $C_{sup}$, $K_{sup}$)), i.e., the method body, comment, and class name for the subclass and for the superclass. From the Javadoc API documentation accompanying a given method, we derive comments from the main description, which precedes the tags (https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#tag), e.g., @param.

We specifically consider generating the first sentence of the main description, similar to prior work Hu et al. (2019). The first sentence serves as a summary comment of the high-level functionality of the method. We also consider the more challenging task: generating the full description comment, i.e., the entire main description. The full description comment includes both the high-level summary, as well as low-level description of the method’s functionality.

Preprocessing. Similar to what was done in prior work Movshovitz-Attias and Cohen (2013); Panthaplackel et al. (2020), we partition the dataset in such a way that there is no overlap between the projects used for training, validation, and testing. By limiting the number of closely-related examples (i.e., methods from sibling classes) across partitions, this setting allows us to better evaluate a model’s ability to generalize. We filter out comments with non-English words, and remove those with very few words, as we find that these often fail to adequately describe functionality. We also discard trivial examples in which the overriding comment is identical to the superclass comment ($C_{sub}$ = $C_{sup}$), as they lead to unwanted behavior in which the model learns to just copy $C_{sup}$. Finally, we tokenize source code and comments by splitting on spaces and punctuation, and then split tokens of the form camelCase and snake_case into subtokens in order to reduce vocabulary size.
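The tokenization step can be sketched with regular expressions. This is a minimal illustration rather than the authors' tokenizer: the exact splitting rules are assumptions, lowercasing is assumed because the outputs shown in Table 5 are lowercase, and the function names are hypothetical.

```python
import re


def split_subtokens(token):
    """Split a camelCase or snake_case identifier into lowercase subtokens."""
    parts = []
    for piece in token.split("_"):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", piece))
    return [part.lower() for part in parts]


def tokenize(text):
    """Split on spaces and punctuation, then split identifiers into subtokens."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]", text)
    result = []
    for token in tokens:
        subtokens = split_subtokens(token)
        # Punctuation has no subtokens and is kept as a token of its own.
        result.extend(subtokens if subtokens else [token])
    return result


tokens = tokenize("Returns ASN.1 encoded form of this infoAccessSyntax.")
```

Applied to the overriding comment from Figure 1, this produces the same token sequence as the gold comment shown in Table 5.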

Statistics. We show statistics on our corpus before and after preprocessing in Table 1. The number of examples for the first sentence is smaller than that of the full description because there are more instances in which the overriding comment equals the superclass comment ($C_{sub}$ = $C_{sup}$) for the first sentence, which needed to be filtered. This is expected, as first sentences are shorter and more high-level in nature, allowing them to be the same when the overriding and overridden methods ($M_{sub}$ and $M_{sup}$) have the same high-level description. Furthermore, while there are fewer examples in the test sets than in the validation sets, the number of projects from which the test examples come is significantly larger than that of the validation examples. This is an artifact of using the cross-project setting, resulting from differences in the distribution of examples across various projects. Table 2 shows the average number of subtokens in $C_{sub}$ and $C_{sup}$ in our datasets.

5 Experiments

In this section, we describe several baselines, implementation details, evaluation metrics, and experiment setup.

5.1 Baselines

We compare our model with two rule-based baselines (Copy and Class Name Substitution), a generation model without using class hierarchy (Seq2Seq), and two prevailing deep learning methods (DeepCom-hybrid Hu et al. (2019) and Edit Model Panthaplackel et al. (2020)).

Copy. This is a rule-based approach which merely copies the superclass comment ($C_{sup}$) as the prediction for the overriding comment ($C_{sub}$). This resembles the functionality of the Javadoc tool (https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#reusingcomments).

Class Name Substitution. Based on our observations, there are many cases in which the developer obtains the overriding comment ($C_{sub}$) by simply copying the superclass comment ($C_{sup}$) and replacing all occurrences of the parent class name ($K_{sup}$) with that of the child class ($K_{sub}$). We simulate this procedure using rule-based string replacement: $C_{sub}$ = $C_{sup}$.Replace($K_{sup}$, $K_{sub}$). Note that if the parent class name does not appear in $C_{sup}$, then $C_{sub}$ = $C_{sup}$.
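Both rule-based baselines fit in a few lines. The sketch below is illustrative only: it assumes the replacement is applied to the lowercased, subtokenized strings (which is consistent with the outputs in Table 5, but not stated in the text), and the function names are hypothetical.

```python
def copy_baseline(super_comment):
    """Predict the superclass comment unchanged."""
    return super_comment


def class_name_substitution(super_comment, super_class, sub_class):
    """Replace every occurrence of the parent class name with the child's."""
    return super_comment.replace(super_class, sub_class)


super_comment = "returns encoded form of the object ."
copied = copy_baseline(super_comment)
substituted = class_name_substitution(super_comment, "object", "info access syntax")
unchanged = class_name_substitution("gets the value .", "object", "info access syntax")
```

Note that plain substring replacement would also rewrite the class name where it occurs inside a longer word, so a production version would likely match on token boundaries.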

Seq2Seq. In line with prior work that generates a comment using the method body alone Iyer et al. (2016), we consider an approach that generates $C_{sub}$ from $M_{sub}$ without access to class-related information. This approach is a baseline version of the Hierarchy-Aware Seq2Seq model without the class name and superclass comment encoders, without conditioning on specificity, and without unlikelihood training. We also disregard the auxiliary features used in the method encoder as these include attributes of the class hierarchy.

DeepCom-hybrid. This approach Hu et al. (2019) generates a comment for a given Java method by learning the representations of the corresponding abstract syntax tree (AST) and code sequence respectively. They flatten the AST into a sequence using the Structure-based Traversal (SBT) algorithm. The two sequences are encoded with separate long short-term memory (LSTM) networks. An LSTM decoder then uses these learned representations in order to generate a sequence of comment tokens.

Edit Model. We find that developers often produce the overriding comment ($C_{sub}$) by editing the superclass comment ($C_{sup}$); however, these edits are not always as simple as class name substitution and can be more complex. To address this, we include a model which learns to edit $C_{sup}$ in order to produce $C_{sub}$. We adapt a recent comment editing framework that was originally proposed for updating comments that become outdated upon code changes to the corresponding methods Panthaplackel et al. (2020). They first encode the existing comment using a bidirectional GRU encoder and the code edits with another bidirectional GRU encoder. They then use a GRU decoder to generate a sequence of comment edits which are applied to the existing comment. This leads to an updated comment that is consistent with the new version of the method. In our setting, we treat $C_{sup}$ as the “existing comment” and encode the code edits between the overridden method ($M_{sup}$) and the overriding method ($M_{sub}$). We apply the generated comment edit sequence to $C_{sup}$ in order to produce $C_{sub}$, which is expected to be consistent with $M_{sub}$. Additionally, we use the same auxiliary features and reranking mechanism they originally proposed.

                           First sentence                  Full description
Model                      BLEU-4    METEOR    ROUGE-L     BLEU-4    METEOR    ROUGE-L
Copy                       24.228    19.845    43.352      20.623    18.757    40.846
Class Name Substitution    26.764    22.496    45.632      24.215    21.442    42.014
Seq2Seq                    15.389    13.122    29.453      9.003     9.136     24.379
DeepCom-hybrid             26.073    22.256    45.037      21.812    20.779    40.257
Edit Model                 31.755    26.926    46.939      23.734    22.571    44.359
Hierarchy-Aware Seq2Seq    36.408    31.955    53.776      27.461    26.361    48.698
Table 3: Comparison of our Hierarchy-Aware Seq2Seq model with the baseline methods. The differences between each pair of models on each metric are statistically significant.

First sentence                                                  BLEU-4    METEOR    ROUGE-L
(1) Hierarchy-Aware Seq2Seq                                     36.408    31.955    53.776
(2)   -unlikelihood                                             35.754    31.272    53.732
(3)   -unlikelihood, specificity                                34.610    29.962    52.588
(4)   -unlikelihood, specificity, feats                         33.715    29.029    52.688
(5)   -unlikelihood, specificity, class name encoder            33.969    28.984    52.098
(6)   -unlikelihood, specificity, superclass comment encoder    22.694    21.209    41.948

Full description                                                BLEU-4    METEOR    ROUGE-L
(1) Hierarchy-Aware Seq2Seq                                     27.461    26.361    48.698
(2)   -unlikelihood                                             27.307    26.323    48.272
(3)   -unlikelihood, specificity                                23.681    23.247    46.435
(4)   -unlikelihood, specificity, feats                         24.303    23.486    46.722
(5)   -unlikelihood, specificity, class name encoder            23.745    22.887    46.335
(6)   -unlikelihood, specificity, superclass comment encoder    13.890    15.115    33.907
Table 4: Ablation study of our Hierarchy-Aware Seq2Seq model. The differences between the models with the same Greek letter suffix (and only those pairs) are not statistically significant.

public class Object {
    /** Returns encoded form of the object */
    public byte[] getEncoded() {
        return encoding;}}
public class InfoAccessSyntax extends Object {
    /** ? */
    @Override public byte[] getEncoded() {
        if (encoding == null) {
            encoding = ASN1.encode(this);}
        return encoding;}}
Gold: returns asn . 1 encoded form of this info access syntax .
Copy: returns encoded form of the object .
Class Name Substitution: returns encoded form of the info access syntax .
Seq2Seq: gets the encoded value .
DeepCom-hybrid: returns the byte ’s value of the encoding .
Edit Model: returns encoding of the object .
Hierarchy-Aware Seq2Seq: returns asn 1 encoded of the info access syntax .
Table 5: Qualitative analysis of the outputs of our Hierarchy-Aware Seq2Seq model and the baseline models on one example.

6 Results

In this section, we discuss performances of the various models. We first present a quantitative analysis using automatic metrics (Section 6.1), and then present a qualitative analysis by inspecting the models’ outputs (Section 6.2).

6.1 Quantitative Analysis

Metrics. Following prior work in comment generation and code summarization Iyer et al. (2016); Hu et al. (2019); Liang and Zhu (2018); LeClair et al. (2019), we report metrics used to evaluate language generation tasks: BLEU-4 Papineni et al. (2002) (similar to prior work in code-to-language tasks Iyer et al. (2016), we report average sentence-level BLEU-4), METEOR Banerjee and Lavie (2005), and ROUGE-L Lin (2004). We report results as an average across three random restarts (for learned models).

Results. In Table 3, we present results for baselines and our Hierarchy-Aware Seq2Seq model. We conducted statistical significance testing through bootstrap tests Berg-Kirkpatrick et al. (2012) at the 95% confidence level.

We first note that the Copy baseline underperforms the majority of other models (with the exception of Seq2Seq) across metrics for both datasets, demonstrating that generating the comment for an overriding method extends beyond simply repeating the overridden method’s comment.

Both Seq2Seq and DeepCom-hybrid, which do not have access to the class hierarchy, underperform the other three approaches (besides Copy) that do have access to this information, including the rule-based Class Name Substitution baseline. This underlines the importance of leveraging the class hierarchy for generating overriding method comments. Edit Model and Hierarchy-Aware Seq2Seq, the two models which learn to exploit the class hierarchical context, achieve even higher performance than Class Name Substitution. We find Hierarchy-Aware Seq2Seq to yield better performance than Edit Model, which suggests that generating a new comment by leveraging contextual information, specificity, and conformity extracted from the class hierarchy can lead to improved performance over simply learning to adapt the superclass comment ($C_{sup}$) to fit the overriding method ($M_{sub}$).

Recall that Seq2Seq’s core neural composition closely matches that of Hierarchy-Aware Seq2Seq but does not have any information pertaining to the class hierarchy. Hence, utilizing class hierarchy can achieve more than 80% improvement for each metric.
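The relative improvement can be checked directly from the Table 3 scores. The short calculation below uses the reported numbers for Seq2Seq and Hierarchy-Aware Seq2Seq; the variable and function names are only for illustration.

```python
# Scores from Table 3, ordered as (BLEU-4, METEOR, ROUGE-L).
seq2seq = {
    "first sentence": (15.389, 13.122, 29.453),
    "full description": (9.003, 9.136, 24.379),
}
hierarchy_aware = {
    "first sentence": (36.408, 31.955, 53.776),
    "full description": (27.461, 26.361, 48.698),
}


def relative_improvement(baseline, new):
    """Percentage gain of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline


improvements = {
    dataset: [relative_improvement(b, n)
              for b, n in zip(seq2seq[dataset], hierarchy_aware[dataset])]
    for dataset in seq2seq
}
smallest_gain = min(min(values) for values in improvements.values())
```

The smallest gain is on ROUGE-L for first sentences (roughly 83%), and every other metric improves by more than that.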

We observe that while the results are analogous, performance across all metrics is consistently higher for first sentences than for full descriptions, indicating, not surprisingly, that the latter is a more challenging task.

Ablation study. To evaluate the contribution of each of the components of Hierarchy-Aware Seq2Seq, we perform an ablation study, with results shown in Table 4. For both the first sentence and full description tasks, we find that ablating each component deteriorates performance across metrics; this effect is largely more evident on the more challenging full description task. By comparing (1) and (2), we demonstrate the importance of unlikelihood training. It gives statistically significant improvements for all the metrics on the first sentence task. The value of specificity factoring is evident by comparing (2) and (3). Looking at (3) and (4), we observe that ablating features drops performance w.r.t. BLEU-4 and METEOR for the first sentence. Next, comparing (3) and (5), we see that discarding the class name encoder deteriorates BLEU-4 and ROUGE-L for the first sentence task and METEOR on the full description task. Finally, we observe a sharp drop in performance when removing the parent comment encoder (between (3) and (6)), underlining the importance of the context provided by the superclass comment for our proposed task.

6.2 Qualitative Analysis

In Table 5, we show the models’ generated overriding comment, $C_{sub}$ (first sentence), for one overriding method (getEncoded() in InfoAccessSyntax). Copying the superclass comment, $C_{sup}$, results in a generic and inappropriate $C_{sub}$. Class Name Substitution produces the correct class name, but the rest of the generated comment is not specific enough. Without $C_{sup}$ as context, the comments generated by Seq2Seq and DeepCom-hybrid are of low quality and diverge substantially from the gold comment. Edit Model merely rephrases $C_{sup}$, because its architecture is not designed to improve the specificity of $C_{sub}$. Our Hierarchy-Aware Seq2Seq generates a comment that is very close to the gold, with adequate specificity and conformity. We present more examples in Appendix C.

7 Related Work

Most recent comment generation and code summarization approaches entail encoding source code in various forms including the sequence of tokens  Iyer et al. (2016); Allamanis et al. (2016); Ahmad et al. (2020), the AST Liang and Zhu (2018); Alon et al. (2019); Hu et al. (2019); Cai et al. (2020), or the data/control flow graph Fernandes et al. (2019); Liu et al. (2020a), and then decoding natural language output as a sequence of tokens. However, many of these approaches rely on the limited context provided by the body of code alone. In contrast, we additionally use context that is external to the method, as we find this context to be critical for our task of generating comments for overriding methods.

Liu et al. (2019) address the task of comment generation and consider incorporating external context from the method call dependency graph by encoding the method names as a sequence of tokens. Haque et al. (2020) generate comments by encoding the method signatures of all other methods in the same file. Yu et al. (2020) build a class-level contextual graph to relate all methods within the same class to a target method for which a comment is to be generated. However, none of these approaches consider comments accompanying the methods extracted from the broader context. In this work, rather than the context of the call graph or other methods within the same class or file, we use the context provided by the class hierarchy, and we also extract a comment (in the superclass) from this context, which we found to be one of the most critical components of our approach (Section 6.1).

Zhai et al. (2020) propose a rule-based approach for generating new comments by propagating comments from various code elements such as methods and classes. Such a system can generate a comment for the subclass method by simply propagating the superclass comment (resembling our Copy baseline) or doing simple string replacements of class names (resembling our Class Name Substitution baseline). We instead learn how to leverage the superclass comment ($C_{sup}$) in our Hierarchy-Aware Seq2Seq model, which outperforms these baselines. Allamanis et al. (2015) propose a log-bilinear neural language model for generating method names, which incorporates context from sibling methods and the superclass. Iyer et al. (2018) study the reverse problem of generating a method given a natural language query and context of the containing class. This includes class member variables and the return types, names, and parameters of the other methods in the class. We also use the class context (i.e., the class name, $K_{sub}$), and context from another class, specifically, through the superclass.

8 Conclusion

This paper has introduced a new approach to automatic comment generation for software by exploiting the hierarchical class structure common in object-oriented programming languages. We propose a new framework which utilizes this hierarchy for contextual information as well as for learning notions of specificity and conformity, and we show that it outperforms existing state-of-the-art models on the novel task of automatically commenting overriding methods. We also developed a new dataset that allows training and testing such hierarchy-aware comment generation models. Integrating this approach with the growing body of work in machine learning and natural language processing for software development will lead to the emergence of more intelligent, effective software engineering environments.


Acknowledgments

We thank Darko Marinov, Thomas Wei, and the anonymous reviewers for their feedback on this work. This research was partially supported by the University of Texas at Austin Continuing Fellowship, the Bloomberg Data Science Fellowship, a Google Faculty Research Award, and the US National Science Foundation under Grant No. IIS-1850153.

Ethical Considerations

Our dataset has been collected in a manner that is consistent with the licenses provided from the sources (i.e., GitHub repositories).

Our technique is intended for use by software developers. It suggests a comment when a developer overrides a method, which the developer can review, modify, and add to the codebase. If the technique makes a bad suggestion, the developer is expected to detect it and refrain from using that suggestion. We expect our technique to improve developers’ productivity and encourage developers to write better documentation for code. The technique should not be used to automatically add comments to a codebase without developers first reviewing them.

We conducted experiments involving computation time/power, but we have carefully chosen the number of times to repeat the experiments to both ensure reproducibility of our research and avoid consuming excessive energy. We provide details of our computing platform and running time in our appendix.


References

  • Ahmad et al. (2020) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A transformer-based approach for source code summarization. In Annual Meeting of the Association for Computational Linguistics, pages 4998–5007.
  • Allamanis et al. (2015) Miltiadis Allamanis, Earl T Barr, Christian Bird, and Charles Sutton. 2015. Suggesting accurate method and class names. In International Symposium on the Foundations of Software Engineering, pages 38–49.
  • Allamanis et al. (2016) Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning, pages 2091–2100.
  • Alon et al. (2019) Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2019. code2seq: Generating sequences from structured representations of code. In International Conference on Learning Representations.
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
  • Berg-Kirkpatrick et al. (2012) Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein. 2012. An empirical investigation of statistical significance in NLP. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 995–1005.
  • Cai et al. (2020) Ruichu Cai, Zhihao Liang, Boyan Xu, Zijian Li, Yuexing Hao, and Yao Chen. 2020. TAG: Type auxiliary guiding for code comment generation. In Annual Meeting of the Association for Computational Linguistics, pages 291–301.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing, pages 1724–1734.
  • Clement et al. (2020) Colin Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. PyMT5: multi-mode translation of natural language and python code with transformers. In EMNLP, pages 9052–9065. Association for Computational Linguistics.
  • Fernandes et al. (2019) Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Structured neural summarization. In International Conference on Learning Representations.
  • Haiduc et al. (2010) Sonia Haiduc, Jairo Aponte, Laura Moreno, and Andrian Marcus. 2010. On the use of automated text summarization techniques for summarizing source code. In Working Conference on Reverse Engineering, pages 35–44.
  • Haque et al. (2020) Sakib Haque, Alexander LeClair, Lingfei Wu, and Collin McMillan. 2020. Improved automatic summarization of subroutines via attention to file context. In International Working Conference on Mining Software Repositories, pages 300–310.
  • Hu et al. (2018) Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep code comment generation. In International Conference on Program Comprehension, pages 200–210.
  • Hu et al. (2019) Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2019. Deep code comment generation with hybrid lexical and syntactical information. Empirical Software Engineering, pages 1–39.
  • Iyer et al. (2016) Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Annual Meeting of the Association for Computational Linguistics, pages 2073–2083.
  • Iyer et al. (2018) Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping language to code in programmatic context. In Conference on Empirical Methods in Natural Language Processing, pages 1643–1652.
  • Ko et al. (2019) Wei-Jen Ko, Greg Durrett, and Junyi Jessy Li. 2019. Linguistically-informed specificity and semantic plausibility for dialogue generation. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3456–3466.
  • LeClair et al. (2020) Alexander LeClair, Sakib Haque, Lingfei Wu, and Collin McMillan. 2020. Improved code summarization via a graph neural network. In Proceedings of the 28th International Conference on Program Comprehension.
  • LeClair et al. (2019) Alexander LeClair, Siyuan Jiang, and Collin McMillan. 2019. A neural model for generating natural language summaries of program subroutines. In International Conference on Software Engineering, pages 795–806.
  • Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119.
  • Li et al. (2020) Margaret Li, Stephen Roller, Ilia Kulikov, Sean Welleck, Y-Lan Boureau, Kyunghyun Cho, and Jason Weston. 2020. Don’t say that! making inconsistent dialogue unlikely with unlikelihood training. In Annual Meeting of the Association for Computational Linguistics, pages 4715–4728.
  • Liang and Zhu (2018) Yuding Liang and Kenny Qili Zhu. 2018. Automatic generation of text descriptive comments for code blocks. In Conference on Artificial Intelligence, pages 5229–5236.
  • Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
  • Liu et al. (2019) Bohong Liu, Tao Wang, Xunhui Zhang, Qiang Fan, Gang Yin, and Jinsheng Deng. 2019. A neural-network based code summarization approach by using source code and its call dependencies. In Asia-Pacific Symposium on Internetware.
  • Liu et al. (2020a) Shangqing Liu, Y. Chen, Xiaofei Xie, Jing Kai Siow, and Yang Liu. 2020a. Automatic code summarization via multi-dimensional semantic fusing in GNN. ArXiv, abs/2006.05405.
  • Liu et al. (2020b) Zhongxin Liu, Xin Xia, Meng Yan, and Shanping Li. 2020b. Automating just-in-time comment updating. In Automated Software Engineering, page to appear.
  • Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.
  • Moreno et al. (2013) Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori L. Pollock, and K. Vijay-Shanker. 2013. Automatic generation of natural language summaries for Java classes. In International Conference on Program Comprehension, pages 23–32.
  • Movshovitz-Attias and Cohen (2013) Dana Movshovitz-Attias and William W. Cohen. 2013. Natural language models for predicting programming comments. In Annual Meeting of the Association for Computational Linguistics, pages 35–40.
  • Panthaplackel et al. (2020) Sheena Panthaplackel, Pengyu Nie, Milos Gligoric, Junyi Jessy Li, and Raymond J. Mooney. 2020. Learning to update natural language comments based on code changes. In Annual Meeting of the Association for Computational Linguistics, pages 1853–1868.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  • Serban et al. (2016) Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Conference on Artificial Intelligence, pages 3776–3784.
  • Sordoni et al. (2015) Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 196–205.
  • Sridhara et al. (2010) Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K Vijay-Shanker. 2010. Towards automatically generating summary comments for Java methods. In Automated Software Engineering, pages 43–52.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In International Conference on Neural Information Processing Systems, pages 3104–3112.
  • Tempero et al. (2010) Ewan Tempero, Steve Counsell, and James Noble. 2010. An empirical study of overriding in open source Java. In Proceedings of the Thirty-Third Australasian Conference on Computer Science - Volume 102, ACSC ’10, pages 3–12, AUS. Australian Computer Society, Inc.
  • Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700.
  • Welleck et al. (2020) Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2020. Neural text generation with unlikelihood training. In International Conference on Learning Representations.
  • Wilcoxon (1945) Frank Wilcoxon. 1945. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83.
  • Yu et al. (2020) Xiaohan Yu, Quzhe Huang, Zheng Wang, Yansong Feng, and Dongyan Zhao. 2020. Towards context-aware code comment generation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3938–3947.
  • Zhai et al. (2020) Juan Zhai, Xiangzhe Xu, Yu Shi, Minxue Pan, Shiqing Ma, Lei Xu, Weifeng Zhang, Lin Tan, and Xiangyu Zhang. 2020. CPC: Automatically classifying and propagating natural language comments via program analysis. In International Conference on Software Engineering, pages 1359–1371.
  • Zhang et al. (2018) Ruqing Zhang, Jiafeng Guo, Yixing Fan, Yanyan Lan, Jun Xu, and Xueqi Cheng. 2018. Learning to control the specificity in neural response generation. In Annual Meeting of the Association for Computational Linguistics, pages 1108–1117.

Appendix A Incorporating AST Representations

Model                           First sentence              Full description
Hierarchy-Aware Seq2Seq         35.899  31.131  53.548      27.719  26.908  49.352
Hierarchy-Aware Seq2Seq +AST    35.837  30.909  53.749      27.908  26.360  49.935
Table 6: Comparison of incorporating vs. not incorporating AST sequence into our model. The differences between the models with the same Greek letter suffix (and only those pairs) are not statistically significant.

From Table 3, we observe that DeepCom-hybrid outperforms Seq2Seq. Both models only have access to the overriding method; however, DeepCom-hybrid learns a representation for the AST corresponding to this method in addition to its sequence of code tokens. While our model (Hierarchy-Aware Seq2Seq), equipped with context from the class hierarchy but without AST-level information, outperforms DeepCom-hybrid, we are interested in seeing whether using the AST can further boost the performance of our models. For this, we introduce a fourth encoder into the Hierarchy-Aware Seq2Seq model that learns a representation of the method’s AST. Namely, we use DeepCom-hybrid’s SBT algorithm to traverse the method’s AST and form an AST sequence. We then use a two-layer bidirectional GRU encoder to encode this sequence. We present results in Table 6. We find that incorporating the AST does not yield statistically significant differences in performance.
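To make the AST-flattening step concrete, the following is a minimal sketch of a structure-based traversal (SBT) in the style described by Hu et al. (2018): each node is emitted as an opening parenthesis and its label, followed by the traversal of its children, followed by a closing parenthesis and the label again. The tuple-based tree encoding and the node labels are assumptions made for illustration; this is not the authors' implementation, and parsing Java source into an AST is outside the sketch.

```python
from typing import List, Tuple

# Hypothetical tree encoding: (label, [children]). A real pipeline would
# obtain the nodes from a Java parser; that step is not shown here.
Node = Tuple[str, list]


def sbt(node: Node) -> List[str]:
    """Flatten a tree as: "(" label, traversal of each child, ")" label."""
    label, children = node
    tokens = ["(", label]
    for child in children:
        tokens.extend(sbt(child))
    tokens.extend([")", label])
    return tokens


# Toy AST for a method containing a single return statement.
toy_ast = ("MethodDeclaration", [("ReturnStatement", [("SimpleName", [])])])
ast_sequence = sbt(toy_ast)
```

The resulting token list is what a sequence encoder, such as the bidirectional GRU mentioned above, would consume. Repeating the label after the closing parenthesis keeps the sequence unambiguous, so the tree can be recovered from it.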

Model                      Training Time [s]   Testing Time [s]   #Parameters
Copy                       n/a                 6                  n/a
Class Name Substitution    n/a                 6                  n/a
Seq2Seq                    600                 7,269              1.33M
DeepCom-hybrid             14,162              15,812             17.42M
Edit Model                 1,147               9,384              2.19M
Hierarchy-Aware Seq2Seq    919                 8,143              3.33M
Table 7: Average training time, average testing time, and number of parameters of our Hierarchy-Aware Seq2Seq model and baseline models.

Appendix B Implementation and Experiments Details

In this section, we provide, for reproducibility, all implementation and experimental details that have not been covered in the main text.

B.1 Hyper-Parameters

The hyper-parameters of the Hierarchy-Aware Seq2Seq model (Section 3) and the Seq2Seq model (Section 5.1) are described below. We use embedding size . Each encoder is a 2-layer bidirectional GRU with hidden dimension , and each decoder is a 2-layer unidirectional GRU with hidden dimension . Dropout rate is set to . We optimize the negative log likelihood with the Adam optimizer using a learning rate of . In addition, we stop training early if the validation loss does not improve for consecutive epochs. During inference, a beam size of 20 is used. For DeepCom-hybrid Hu et al. (2019) and Edit Model Panthaplackel et al. (2020), we use the hyper-parameters provided in the original papers.
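The early-stopping rule described above can be sketched in a few lines. The class name, the function name, and the `patience` parameter are assumptions introduced for illustration; the number of epochs used in the experiments is not recoverable from the text here, so it is left as an argument rather than hard-coded.

```python
from typing import Sequence


class EarlyStopper:
    """Signals a stop once the validation loss has failed to improve
    for `patience` consecutive epochs (hypothetical helper)."""

    def __init__(self, patience: int) -> None:
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, validation_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop."""
        if validation_loss < self.best_loss:
            self.best_loss = validation_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


def epochs_trained(validation_losses: Sequence[float], patience: int) -> int:
    """Number of epochs that run before early stopping triggers."""
    stopper = EarlyStopper(patience)
    for epoch, loss in enumerate(validation_losses, start=1):
        if stopper.step(loss):
            return epoch
    return len(validation_losses)
```

In a real training loop, `step` would be called once per epoch with the loss computed on the validation set, and the checkpoint with the best validation loss would typically be the one kept.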

We tune all of the hyper-parameters by maximizing BLEU-4 on the validation set. The range of learning rates explored is {0.1, 0.01, 0.001}. The range of beam sizes explored is the integers . The range of dropout rates explored is {0.4, 0.6, 0.7, 0.8}. We also tried two optimization algorithms: SGD and Adam. We apply the same tuning process to the hyper-parameters for the other models.
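As an illustration of this tuning procedure, the sketch below runs an exhaustive grid search over the stated learning rates, dropout rates, and optimizers, keeping the configuration with the highest validation BLEU-4. Whether the authors searched the full grid or tuned each hyper-parameter separately is not stated, so the exhaustive search is an assumption. The `validation_bleu4` callback, which would train a model and score it on the validation set, is hypothetical, and the beam sizes are passed in as an argument because their range is not recoverable from the text.

```python
import itertools
from typing import Callable, Dict, Iterable, Tuple

# Ranges taken from the text above.
LEARNING_RATES = (0.1, 0.01, 0.001)
DROPOUT_RATES = (0.4, 0.6, 0.7, 0.8)
OPTIMIZERS = ("sgd", "adam")


def grid_search(
    validation_bleu4: Callable[[Dict], float],
    beam_sizes: Iterable[int],
) -> Tuple[Dict, float]:
    """Return the configuration with the highest validation BLEU-4."""
    best_config, best_score = None, float("-inf")
    for lr, dropout, optimizer, beam in itertools.product(
        LEARNING_RATES, DROPOUT_RATES, OPTIMIZERS, tuple(beam_sizes)
    ):
        config = {
            "learning_rate": lr,
            "dropout": dropout,
            "optimizer": optimizer,
            "beam_size": beam,
        }
        score = validation_bleu4(config)  # hypothetical train-and-evaluate call
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Because every configuration requires a full training run, the cost of such a search grows with the product of the range sizes, which is one reason to keep each range small.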

B.2 Computing Environment

We trained and tested all models on a supercomputer running the Linux operating system, where each model uses one NVIDIA GTX 1080-TI GPU and every four models share two Intel(R) Xeon(R) CPU E5-2620 CPUs and 128GB of RAM. We used PyTorch 1.2.0 to implement our Hierarchy-Aware Seq2Seq model and the Seq2Seq baseline. We trained and evaluated each model in each evaluation setting three times (each time with a different random initialization of weights) and report average metrics.
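The reporting protocol of three runs with different random initializations, averaged per metric, can be sketched as follows. The `run_once` callback, which would train and evaluate one model for a given seed, and the specific seed values are assumptions for illustration.

```python
import statistics
from typing import Callable, Dict, Sequence


def average_over_seeds(
    run_once: Callable[[int], Dict[str, float]],
    seeds: Sequence[int] = (0, 1, 2),
) -> Dict[str, float]:
    """Run one experiment per seed and average each metric across runs."""
    per_run = [run_once(seed) for seed in seeds]
    metric_names = per_run[0].keys()
    return {
        name: statistics.mean(run[name] for run in per_run)
        for name in metric_names
    }
```

In a PyTorch setting, `run_once` would typically seed the random number generators before building the model, so that each run starts from a different but reproducible initialization.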

B.3 Runtime and Number of Parameters

Table 7 shows our Hierarchy-Aware Seq2Seq model and baselines’ average runtime and number of parameters.

Appendix C Additional Qualitative Analysis Examples

In Table 8, we show predictions for various examples in the test set.

public class Cache {
    /** Empties the cache. */
    public synchronized void clear() {}
}
public class DiskBasedCache {
    /** ? */
    @Override public synchronized void clear() {
        File[] files = mRootDirectory.listFiles();
        if (files != null) {
            for (File file : files) {
                file.delete();
            }
        }
        mTotalSize = 0;
        VolleyLog.d("Cache cleared.");
    }
}
Gold: clears the cache . deletes all cached files from disk .
Copy: empties the cache .
Class Name Substitution: empties the disk based cache .
Seq2Seq: removes all mappings from disk .
DeepCom-hybrid: deletes all cached files from disk .
Edit Model: clears the cache .
Hierarchy-Aware Seq2Seq: empties the cache . deletes all cached files from disk .
public class Reader {
    /** Reset the stream. */
    public void reset() throws java.io.IOException {
    }
}
public class CharArrayReader {
    /** ? */
    @Override public void reset() throws IOException {
        synchronized (lock) {
            if (isClosed()) {
                throw new IOException();
            }
            pos = markedPos != -1 ? markedPos : 0;
        }
    }
}
Gold: resets this reader ’ s position to the last mark ( ) location .
Copy: reset the stream .
Class Name Substitution: empties the disk based cache .
Seq2Seq: removes all of the elements from this deque .
DeepCom-hybrid: overrides the reader to the last marked location ( ) and reset that it can be delegated to # sent out ( ) .
Edit Model: clears the stream .
Hierarchy-Aware Seq2Seq: reset the stream to the tail of this reader .
Table 8: Qualitative analysis of the outputs of our Hierarchy-Aware Seq2Seq model and the baseline models on two test examples.
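The Copy and Class Name Substitution rows in Table 8 can be approximated with simple string manipulation. The sketch below is an assumption about how such baselines might behave, based only on their description as propagating the superclass comment and replacing class names. The function names, the CamelCase splitting rule, and the expectation that comments are lowercased and whitespace-tokenized are all illustrative choices, not the authors' code.

```python
import re
from typing import List


def split_class_name(name: str) -> List[str]:
    """Split a CamelCase class name into lowercase word tokens."""
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [word.lower() for word in words]


def copy_baseline(parent_comment: str) -> str:
    """Propagate the superclass comment unchanged."""
    return parent_comment


def class_name_substitution(
    parent_comment: str, parent_class: str, child_class: str
) -> str:
    """Replace mentions of the superclass name with the subclass name."""
    parent_words = split_class_name(parent_class)
    child_words = split_class_name(child_class)
    tokens = parent_comment.split()
    width = len(parent_words)
    output, i = [], 0
    while i < len(tokens):
        # Match the superclass name as a run of consecutive comment tokens.
        if width and tokens[i:i + width] == parent_words:
            output.extend(child_words)
            i += width
        else:
            output.append(tokens[i])
            i += 1
    return " ".join(output)


substituted = class_name_substitution(
    "empties the cache .", "Cache", "DiskBasedCache"
)
```

On the first example of Table 8, this reproduces the Class Name Substitution output "empties the disk based cache ."; neither baseline can produce the subclass-specific detail about deleting files from disk, which is the gap the learned model targets.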