DeepSumm – Deep Code Summaries using Neural Transformer Architecture

by Vivek Gupta, et al.
Stanford University

Source code summarization is the task of writing short, natural language descriptions of the run-time behavior of source code. Such summaries are extremely useful for software development and maintenance but are expensive to author manually; hence they are written for only a small fraction of the code that is produced, and the task is often ignored. Automatic code documentation could solve this at low cost. It is thus an emerging research field with further applications to program comprehension and software maintenance. Traditional methods often relied on cognitive models built from templates and heuristics, and saw varying degrees of adoption by the developer community. With recent advancements, end-to-end data-driven approaches based on neural techniques have largely overtaken the traditional techniques. Much of the current landscape employs neural translation based architectures with recurrence and attention, whose training is resource and time intensive. In this paper, we employ neural techniques to solve the task of source code summarization, and specifically compare NMT based techniques to the simpler and more appealing Transformer architecture on a dataset of Java methods and comments. We argue that recurrence can be dispensed with in the training procedure. To the best of our knowledge, Transformer based models have not been used for this task before. With more than 2.1m supervised pairs of comments and code, we reduce the training time by more than 50% and achieve a BLEU score of 17.99 on the test set.




1 Introduction

A “summary” of source code is a brief natural language description of a section of source code. One of the most common targets for summarization is the subroutine, more generally known as a method; for example, the one-sentence descriptions of Java methods widely used in automatically formatted documentation such as JavaDocs, as shown below:


    public void render(GameData data) {
        setText(Message.render(data, type.getPattern(), attributes));
    }


    Renders the message and updates the message text

These comments are useful because they help programmers understand the role that the method plays in a program. Empirical studies have repeatedly shown that understanding the role of the methods in a program is a crucial step toward understanding the program’s overall behavior. Even a short summary of a method, such as the one shown above, can tell a programmer a lot about that method and the program as a whole.

A holy grail of software engineering research has long been to generate these summaries automatically. Forward and Lethbridge (2002) pointed out that “software professionals value technologies that improve automation of the documentation process,” and “that documentation tools should seek to better extract knowledge from core resources”, such as source code. Tools such as Resharper, JavaDoc and Doxygen automate the format and presentation of documentation, but still leave programmers with the most labor-intensive effort of writing the text and examples.

The research task of generating such summaries is known as "source code summarization", with much emphasis on the summarization of methods. For several years, significant progress was made with content selection and sentence templates, but these techniques have lately given way to AI based systems trained on large amounts of data.

The inspiration for the vast majority of efforts in AI-based code summarization originates in neural machine translation (NMT) from the natural language processing research community. NMT is typically thought of in terms of sequence-to-sequence (seq2seq) learning. In software engineering research, machine translation can be considered a metaphor for source code summarization: the words and tokens in the body of a method are the source sequence, while the desired natural language summary is the target sequence. This application of NMT to code summarization has shown strong benefits in a variety of applications. Most NMT based methods used for the summarization task rely on an attention based architecture that learns to jointly align and translate, built on either recurrence or convolutions. These underlying architectures have systematic problems, such as exploding or vanishing gradients when modeling long range dependencies. They are computationally complex and not very interpretable. With this project we aim to address these problems using a Transformer based architecture and compare its performance with a traditional recurrent encoder-decoder seq2seq model with attention.

2 Related Work

Early methods - Heuristic based/ Template based methods

Haiduc et al. (2010) are often credited with the first attempt to create textual summaries of code (a class or a method) and were the first to coin the term ’source code summarization’. This early work applied text retrieval (TR) methods such as Latent Semantic Indexing and combined them with structural information of the code to produce effective summaries. In a different line of work, Sridhara et al. (see Moreno et al. (2013)) presented a method to automatically produce descriptive summary comments for methods: based on the method’s signature and body, their comment generator selects the content of the summary and produces the natural language text that summarizes the method. Moreno et al. (2013) proposed a method to automatically produce natural language summaries for classes by using class stereotypes.

As in other research areas related to natural language generation, data-driven techniques have largely supplanted template-based techniques due to a much higher degree of flexibility and reduced human effort in template creation.

End-to-end data driven methods

Iyer et al. (2016) presented an end-to-end natural language generation system called CODE-NN that jointly performs content selection using an attention mechanism and surface realization using Long Short-Term Memory (LSTM) networks. The system generates a summary one word at a time, guided by an attention mechanism over embeddings of the source code and by context from previously generated words provided by an LSTM network. The simplicity of the model allows it to be learned from the training data without the burden of feature engineering (Angeli et al., 2010) or the use of an expensive approximate decoding algorithm (Konstas and Lapata, 2013). Much of our work is based on the initial setup of Iyer et al. (2016), but incorporates recent advancements from state-of-the-art models. Of note is that the attentional encoder-decoder seq2seq model originally described by Bahdanau et al. (2014) is at the core of many of these papers, as it provides strong baseline performance even for many software engineering tasks.

Wei et al. (2019) explored the code summarization task together with the code generation task to produce a model that performs summarization. They exploit the duality between these two tasks and propose a dual training framework to train the two tasks simultaneously. They consider the dualities on probability and attention weights, and design corresponding regularization terms to constrain the duality.

Hu et al. (2018) explored the use of Abstract Syntax Trees (ASTs), which capture the structure and semantics of Java methods. ASTs are converted into sequences before they are fed into a typical encoder-decoder based seq2seq model. The model itself is an off-the-shelf encoder-decoder; the main advancement is the AST-annotated representation called Structure-based Traversal (SBT). SBT is essentially a technique for flattening the AST and ensuring that words in the code are associated with their AST node type. LeClair et al. (2019) further exploited this concept and combined it with an attentional encoder-decoder system, except with two encoders: one for code/text data and one for AST data. LeClair and McMillan (2019) also released the dataset which we use in our work.

3 Approach

In this work we mainly consider two model architectures: first, the Neural Machine Translation based encoder-decoder (seq2seq) architecture with attention, and second, the Transformer architecture. The seq2seq based models provide baseline results to compare and contrast with the Transformer model.

3.1 Neural Machine Translation Architecture

The workhorse of most Neural Machine Translation (NMT) systems is the attentional encoder-decoder architecture (Luong et al., 2015). This architecture originated in work by Bahdanau et al. (2014) and is explained in great detail by highly regarded sources such as Luong et al. (2015) and Sutskever et al. (2014). In this section, we cover only the concepts necessary to understand our approach at a high level.

In an encoder-decoder architecture, there are a minimum of two recurrent neural networks (RNNs). The first, called the encoder, converts an arbitrary-length sequence into a single vector representation of a specified length. The second, called the decoder, converts the vector representation given by the encoder into another arbitrary-length sequence. The encoder is generally a bidirectional RNN, while the decoder is a unidirectional RNN. The sequence input to the encoder is in one language, e.g. English, and the sequence produced by the decoder is in another, e.g. French. For our experiments, we modeled functions as the source sequence and comments as the target sequence to be fed to this recurrent, attention based encoder-decoder architecture. The final hidden states of the encoder are concatenated and, after a linear projection, set as the initial state of the decoder. At each decoding step, we compute a multiplicative attention score between the decoder’s hidden state and each encoder hidden state. We compute a softmax over these scores, multiply the resulting weights with the encoder hidden states, and accumulate them to form the attention output. The attention output is concatenated with the decoder hidden state and sent through a linear layer, and we compute a softmax over the output vocabulary to produce the next word prediction.
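The attention step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the weight matrix `W`, and the toy vectors are assumptions, and the final linear layer and vocabulary softmax are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def attention_step(enc_states, dec_state, W):
    """One decoder step of multiplicative attention.

    enc_states: encoder hidden states, one vector per source token
    dec_state:  current decoder hidden state
    W:          (hypothetical) learned matrix, shape len(dec_state) x len(enc_state)
    """
    # Multiplicative score: dec_state^T W enc_state for every source position.
    scores = [sum(d * p for d, p in zip(dec_state, matvec(W, h))) for h in enc_states]
    weights = softmax(scores)
    # Weighted sum of encoder states (the attention output).
    dim = len(enc_states[0])
    context = [sum(w * h[j] for w, h in zip(weights, enc_states)) for j in range(dim)]
    # Concatenate with the decoder state; a linear layer would follow.
    return weights, context + dec_state

enc_states = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, combined = attention_step(enc_states, [1.0, 0.0], identity)
```

In this toy setup the decoder state matches the first encoder state, so the first source position receives the larger attention weight.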

Figure 1: NMT:Seq2seq architecture

3.2 Our approach - Transformer architecture

Much of our work is based on the seminal work of Vaswani et al. (2017). The architecture is built from many multi-head self-attention blocks. We delve into its details in this section.

3.2.1 Encoder and Decoder Stacks

The encoder is composed of a stack of N identical layers, where N is a tunable parameter. Each encoder layer has a multi-head attention sub-layer followed by a position-wise feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization (Ba et al., 2016), as shown in Figure 2.

Similarly, the decoder is composed of N identical layers. Each decoder layer has the same two sub-layers as an encoder layer, with its self-attention masked so that a position cannot attend to later positions, and between the two is inserted a multi-head attention sub-layer that attends to the output of the final encoder layer.
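As a rough illustration of two building blocks mentioned in this section, the sketch below shows the residual connection followed by layer normalization, and the causal mask used by masked attention in the decoder. It is a simplified sketch under stated assumptions: the helper names are hypothetical, and the learned gain and bias of layer normalization are left out.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

# A sub-layer that returns zeros leaves only the normalized input.
out = add_and_norm([1.0, 2.0, 3.0], lambda v: [0.0] * len(v))
mask = causal_mask(3)
```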

3.2.2 Attention functions

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values. In scaled dot-product attention, the input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all the keys, scale the results down by a factor of $\sqrt{d_k}$, and apply a softmax to obtain the weights on the values. In vectorized form, we compute this by packing all the queries into a matrix Q; the keys and values are likewise packed into the matrices K and V. Since this is computed on the same sequence, we call this function Self-Attention. Thus,

$$\text{Self-Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \qquad (1)$$

Instead of performing a single self-attention function, Vaswani et al. (2017) found it beneficial to have multiple instances of these self-attention functions that operate on multiple (h) projections of Q, K and V. The outputs are then concatenated and once again projected, resulting in the final values.

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O \qquad (2)$$

where each head

$$\text{head}_i = \text{Self-Attention}(Q W_i^Q, K W_i^K, V W_i^V) \qquad (3)$$

Where the projections are parameter matrices

  • $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$

  • $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$

  • $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$

  • $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$

In our experiments, we tune h together with these dimensions.
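The attention functions of this section can be written out in plain Python as a sanity check. This is a minimal sketch, not the model code: matrices are lists of rows, the helper names are assumptions, and the toy projection matrices are chosen only so that the result is easy to verify by hand.

```python
import math

def matmul(a, b):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(row) for row in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), V)

def multi_head(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) projection triples, one per head."""
    outputs = [
        self_attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the head outputs position by position, then project.
    concat = [sum((o[i] for o in outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)

X = [[1.0, 0.0], [0.0, 1.0]]          # two tokens, d_model = 2
I = [[1.0, 0.0], [0.0, 1.0]]
W_o = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]  # averages two identical heads
single = self_attention(X, X, X)
multi = multi_head(X, [(I, I, I), (I, I, I)], W_o)
```

With identity projections and an averaging output matrix, the two-head result equals the single-head result, which makes the concatenate-then-project step easy to inspect.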

The overall flow of tensors is shown in Figure 2.


Figure 2: Transformer: Flow of tensors

3.3 Threats to the approach

We acknowledge the following issues or threats to the approach:

  • The results are strongly biased toward the Java dataset used to train the model. With a different dataset, the results could differ.

  • We were not able to exploit the structural composition of the methods; we relied solely on the resulting tokens and ignored casing. In a production scenario, both are very important considerations for the comments that accompany the code. Our main objective, however, is to compare recurrent and Transformer based sequence transduction.

4 Experiments

4.1 Data

The dataset we use in this paper is based on the dataset provided by LeClair et al. (2019). We used this dataset because it is both the largest and the most recent in source code summarization. LeClair et al. provided the dataset after minimal initial processing that filtered for Java methods with JavaDoc comments in English, and removed methods over 100 words long and comments longer than 13 or shorter than 3 words. The result is a dataset of 2.1m Java methods and associated comments. Furthermore, LeClair et al. provided 3 groups of data:

  • Raw data set: 51m Java methods along with the original source files.

  • Filtered data set: 2.1m Java methods and comments, with unprocessed source code and unprocessed comments.

  • Tokenized data set: 2.1m Java methods and comments. The source code is preprocessed: special characters removed, camel case split, lowercased. Comments are the first line of the JavaDoc, lowercased with special characters removed. Our train, validation and test sets were formed from this group of the data.
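The preprocessing described for the tokenized data set can be approximated as follows. This is a minimal sketch and not the pipeline used by LeClair et al.; the function name `tokenize_method` and the exact regular expressions are assumptions. On the two getter methods shown in Section 5 it reproduces the tokenized inputs listed there.

```python
import re

def tokenize_method(source: str) -> str:
    """Remove special characters, split camel case, and lowercase."""
    # Replace every run of non-alphanumeric characters with a space.
    text = re.sub(r"[^A-Za-z0-9]+", " ", source)
    # Split at lower-to-upper boundaries: "getChild" -> "get Child".
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Split an acronym from a following capitalized word: "XMLParser" -> "XML Parser".
    text = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", text)
    return " ".join(text.lower().split())

first = tokenize_method("public PartVO getChild(){ return child; }")
second = tokenize_method(
    "public double getOxygenConsumptionRate() {"
    " return getValueAsDouble(OXYGEN_CONSUMPTION_RATE); }"
)
```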

Due to the limited resources available for this work, we did not use the entire dataset for every experiment. We randomly selected three training sets from the original dataset, described in the table below, and ran various experiments on them to test and validate both approaches.

Set Label     Training   Validation   Testing
Small Set     100,000    3,000        3,000
Medium Set    1m         5,000        5,000
Large Set     2.1m       10,000       10,000
Table 1: Supervised data set

We target the problem of source code summarization of methods, that is, the automatic generation of natural language descriptions of methods. Specifically, we target summarization of Java methods, with the objective of creating method summaries like those used in JavaDocs. While we limit the scope of the experiments in this paper to a Java dataset, in principle the techniques described here are applicable to any programming language.

4.2 Evaluation method

To evaluate the performance of the model on the train and validation sets, we use perplexity, and to evaluate a model on the test set we use the BLEU score (Papineni et al., 2002).

Perplexity is a measure of how well a probability distribution, or a model such as ours in a train and validation setting, predicts a sample. It is used to compare probabilistic natural language models. A low perplexity indicates that the probability distribution is good at predicting the sample. The benefit of using perplexity is that it is easy to calculate.
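For concreteness, perplexity can be computed as the exponential of the average negative log-likelihood that the model assigns to the reference tokens. The sketch below is illustrative only; the function name and the toy probabilities are assumptions.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to each reference token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that gives every reference token probability 1/4 is as uncertain
# as a uniform choice among four tokens.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8, 0.95])
```

A more confident model (higher probabilities on the reference tokens) yields a lower perplexity.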

To measure the performance of the model we used the BLEU score, or BiLingual Evaluation Understudy (Papineni et al., 2002). BLEU is a measure of the number of matching n-grams between the machine’s output and a reference translation. BLEU has been the traditional standard for MT systems and is taken to reflect the correspondence between a machine’s output and that of a human. Technically, we used the BLEU implementation from PyTorch’s Torchtext (Contributors, 2018).
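To make the n-gram matching concrete, here is a simplified single-reference, sentence-level BLEU with uniform weights and no smoothing. It is a sketch for illustration and is not the Torchtext implementation used for the reported scores, which are computed over the whole test set; the function names are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision gives zero
        log_precisions.append(math.log(clipped / total))
    # Penalize candidates that are shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "get the child part".split()
candidate = "gets the child".split()
unigram_score = bleu(candidate, reference, max_n=1)
perfect = bleu(reference, reference)
```

For the short candidate above, two of three unigrams match and the brevity penalty applies, so the unigram-only score is (2/3)·exp(-1/3).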

4.3 Experimental details

To obtain results quickly, we first performed a screening test on the hyperparameters and model configurations. For this we used the Small dataset in Table 1 and ran many experiments to find the set of hyperparameters most likely to work well. We then screened the resulting parameters further on the Medium dataset in Table 1. Finally, we used these hyperparameters and configurations in the model trained on the Large dataset.

We used grid search over the learning rate, the number of encoder and decoder layers, the number of encoder and decoder heads, and the dimensionality of the model. For the final set of results, we fixed the learning rate to the best stable rate from the experiments we conducted.
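A grid search of this kind can be sketched as an exhaustive loop over the Cartesian product of candidate values. The layer, dimension and head values below are the ones that appear in Table 3; the learning rates and the scoring function are hypothetical placeholders, since a real run would train a model and return its validation metric.

```python
from itertools import product

def grid_search(grid, evaluate):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

GRID = {
    "layers": [1, 2, 3, 4],
    "d_model": [128, 256, 512],
    "heads": [4, 8],
    "learning_rate": [1e-3, 1e-4],  # hypothetical values, not from the paper
}

def toy_score(config):
    """Placeholder for 'train the model and return validation BLEU'."""
    return -abs(config["layers"] - 2) - abs(config["d_model"] - 256) / 256

best_config, best_score = grid_search(GRID, toy_score)
```

Screening on a small data set first, as described above, keeps the cost of each `evaluate` call low while the grid is still large.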

The baselines for the model come from the work of LeClair et al. (2019), namely the ast-attendgru and attendgru models. LeClair et al. exploited the structural composition of the tokens in the methods along with the tokens themselves in the ast-attendgru model, and for attendgru they used a vanilla off-the-shelf attention based seq2seq model. Since we wanted to observe the difference between a recurrent and a Transformer based architecture, deepsumm-attention from section 3.1, trained on the Small dataset, was another baseline. deepsumm-transformer from section 3.2 and Figure 2 was the model under observation for the three classes of our dataset.

4.4 Results

4.4.1 Baseline

Model                  Dataset   BLEU
ast-attendgru          Large     19.6
attendgru              Large     19.4
deepsumm-attention     Small     10.70
deepsumm-transformer   Small     11.86
deepsumm-transformer   Medium    16.95
deepsumm-transformer   Large     17.99
Table 2: Baselines

4.4.2 Detailed results

N: encoder and decoder layers; Dim: input dimension of the token embedding; batch: batch size; h: encoder and decoder heads; Size: number of parameters.

Dataset  N  Dim  batch  h       Epochs          BLEU   Size
Small    3  256  128    8  512  10      33.102  11.43  17.5
         2                              30.937  11.8   16.3
         4                              43.732  9.45   18.7
         1                              32.563  11.86  14.8
         2  128                         35.088  11.63  7.7
         1                              35.854  11.35  7.2
         1  512                         30.47   11.84  31.1
         2                              30.42   11.59  35.3
         2              4               31.186  11.78  16.3
Medium   3  256  128    8  512  5       18.003  16.41  53.5
         2                              17.928  16.8   52.2
         1                              19.415  16.32  50.9
         2              4               18.376  16.95  52.2
Large    3  256  256    8  512  5       16.152  17.91  76.6
         2                              16.431  17.81  75.3
         1                              17.419  17.81  75.3
         2              4               16.539  17.99  75.2
Table 3: deepSumm - Transformer

We expected and observed the training time of the deepsumm Transformer based models to drop by at least half relative to the attention based recurrent architecture. This enabled us to run the experiment on the larger dataset, and is in line with what we learnt from the work of Vaswani et al. (2017). We also hoped for the deepsumm Transformer model to outperform the recurrent baseline in both processing time and the performance metric, and the results obtained were in line with this.

We were, however, not able to come close to the results obtained by LeClair et al. (2019) on the same dataset. We suspect that not having structural composition embedded into the model contributes to the gap. We also strongly suspect that our training time and our exploration of the gradient space via a static learning rate are not optimal and may also have contributed strongly to the shortfall. We were only able to train four models on the Large dataset, each for only 5 epochs.

5 Analysis

We take a look at three examples in this section to examine the generated natural language summaries of the methods. The examples are handpicked for illustration purposes only.

  1. Original Code:

    public PartVO getChild(){
            return child;
    }

    Tokenized input to the model: "public part vo get child return child".

    Reference comment: "get the child part".

    deepSumm prediction: "gets the child".

    Figure 3: Attention distribution for the first example

    We see this example as an acceptable generation of the summary compared to the reference. The weights of the attention distribution show that, to generate "gets", the decoder pays attention to the "get" and "return" tokens from the source sequence, which is justifiable. The same holds for predicting "child".

  2. Original Code:

    public double getOxygenConsumptionRate() {
            return getValueAsDouble(OXYGEN_CONSUMPTION_RATE);
    }

    Tokenized input to the model: "public double get oxygen consumption rate return get value as double oxygen consumption rate".

    Reference comment: "gets the oxygen consumption rate".

    deepSumm prediction: "returns the water consumption rate for this object".

    Figure 4: Attention distribution for the 2nd example

    We see this example as a problematic generation of the summary compared to the reference. Looking at the weights, the model, while decoding the third token, paid attention to "consumption" and predicted "water". We think that greedy decoding paid the price of a faulty search here.

  3. Original Code:

    public void setMaximumColorDepth(String value) {
            this.maximumColorDepth = value;
    }

    Tokenized input to the model: "public void set maximum color depth string value this maximum color depth value".

    Reference comment: "gets the oxygen consumption rate".

    deepSumm prediction: "returns the water consumption rate for this object".

    Recurrent attention(Small model) prediction: "sets the maximum color depth of the color"

    Figure 5: Attention distribution for the 3rd example

    We see this example as a good generation of the summary compared to the reference and compared to the attentive RNN model, which had a poor prediction.
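The greedy search mentioned in the second example picks the single most probable token at every step and never revisits that choice. The sketch below illustrates the mechanism only: `next_token_probs` stands in for the trained decoder, and the toy probability table is invented for illustration, not taken from the model.

```python
def greedy_decode(next_token_probs, start="<s>", end="</s>", max_len=20):
    """Repeatedly append the most probable next token until the end marker."""
    tokens = [start]
    while len(tokens) < max_len:
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)
        tokens.append(best)
        if best == end:
            break
    # Strip the start marker and, if present, the end marker.
    return tokens[1:-1] if tokens[-1] == end else tokens[1:]

# Invented toy distribution that depends only on the previous token.
TOY = {
    "<s>": {"returns": 0.6, "gets": 0.4},
    "returns": {"the": 1.0},
    "gets": {"the": 1.0},
    "the": {"water": 0.55, "oxygen": 0.45},
    "water": {"</s>": 1.0},
    "oxygen": {"</s>": 1.0},
}

summary = greedy_decode(lambda tokens: TOY[tokens[-1]])
```

Because each step commits to the locally best token, a slightly higher probability for a wrong word is enough to derail the output; beam search keeps several candidates alive to reduce this risk.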

6 Conclusion and Future work

With this work, we presented a model that effectively generates natural language summaries from program methods written in Java. Furthermore, we created two contrasting models, one that uses recurrence and another that uses the state-of-the-art Transformer architecture. By running various experiments on both models, we observed first hand the benefits of the Transformer cited by Vaswani et al. (2017). Our Transformer model outperformed the recurrence based model by more than one BLEU point on the same dataset. Training time also improved, empirically dropping by about half on this dataset when compared with recurrence. While we were not able to achieve state-of-the-art results on the task, we think that employing effective structural information along with programming constructs in representing the methods will help, along with a more efficient strategy for exploring the gradient space and training the network for longer. Furthermore, exploring ensemble methods as done by LeClair et al. (2019) would also be a good strategy. We think these are helpful next steps in this line of work.


References

  • G. Angeli, P. Liang, and D. Klein (2010) A simple domain-independent probabilistic approach to generation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, pp. 502–512.
  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv:1607.06450.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.
  • T. Contributors (2018).
  • A. Forward and T. C. Lethbridge (2002) The relevance of software documentation, tools and technologies: a survey. In Proceedings of the 2002 ACM Symposium on Document Engineering, DocEng ’02, New York, NY, USA, pp. 26–33. ISBN 1581135947.
  • S. Haiduc, J. Aponte, L. Moreno, and A. Marcus (2010) On the use of automated text summarization techniques for summarizing source code. In WCRE, G. Antoniol, M. Pinzger, and E. J. Chikofsky (Eds.), pp. 35–44. ISBN 978-0-7695-4123-5.
  • X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin (2018) Deep code comment generation. In Proceedings of the 26th Conference on Program Comprehension, ICPC ’18, New York, NY, USA, pp. 200–210. ISBN 9781450357142.
  • S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer (2016) Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 2073–2083.
  • I. Konstas and M. Lapata (2013) A global model for concept-to-text generation. J. Artif. Int. Res. 48 (1), pp. 305–346. ISSN 1076-9757.
  • A. LeClair, S. Jiang, and C. McMillan (2019) A neural model for generating natural language summaries of program subroutines. arXiv:1902.01954.
  • A. LeClair and C. McMillan (2019) Recommendations for datasets for source code summarization. arXiv:1904.02660.
  • M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv:1508.04025.
  • L. Moreno, J. Aponte, G. Sridhara, A. Marcus, L. L. Pollock, and K. Vijay-Shanker (2013) Automatic generation of natural language summaries for java classes. 2013 21st International Conference on Program Comprehension (ICPC), pp. 23–32.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, USA, pp. 311–318.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. arXiv:1409.3215.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv:1706.03762.
  • B. Wei, G. Li, X. Xia, Z. Fu, and Z. Jin (2019) Code generation as a dual task of code summarization. arXiv:1910.05923.