TranS^3: A Transformer-based Framework for Unifying Code Summarization and Code Search

03/06/2020 ∙ by Wenhua Wang, et al. ∙ University of Technology Sydney

Code summarization and code search have been widely adopted in software development and maintenance. However, few studies have explored the efficacy of unifying them. In this paper, we propose TranS^3, a transformer-based framework to integrate code summarization with code search. Specifically, for code summarization, TranS^3 enables an actor-critic network, where in the actor network, we encode the collected code snippets via transformer- and tree-transformer-based encoders and decode the given code snippet to generate its comment. Meanwhile, we iteratively tune the actor network via the feedback from the critic network to enhance the quality of the generated comments. Furthermore, we import the generated comments into code search to enhance its accuracy. To evaluate the effectiveness of TranS^3, we conduct a set of experimental studies and case studies. The experimental results suggest that TranS^3 can significantly outperform multiple state-of-the-art approaches in both code summarization and code search, and the case study results further strengthen the efficacy of TranS^3 from the developers' points of view.

1. Introduction

Code summarization and code search have become increasingly popular in software development and maintenance (LeClair et al., 2019; Movshovitz-Attias and Cohen, 2013; Sridhara et al., 2010; Lv et al., 2015; Gu et al., 2018; Yao et al., 2019), because they can help developers understand and reuse billions of lines of code from online open-source repositories and thus significantly enhance the software development and maintenance process (Yao et al., 2019). In particular, since much of the software maintenance effort is spent on understanding the maintenance task and related software source code (Lientz and Swanson, 1980), effective and efficient documentation is essential for providing high-level descriptions of program tasks for software maintenance. To this end, code summarization aims to automatically generate natural language comments for documenting code snippets (Moreno and Marcus, 2017). On the other hand, over the years various open-source and industrial software systems have been rapidly developed, and the source code of these systems is typically stored in source code repositories. Such source code can be treated as an important reusable asset for developers because it can help developers understand how others addressed similar problems when completing their program tasks, e.g., testing (Zhang et al., 2018; Zhou et al., 2018; Chen and Zhang, 2018; Lu et al., 2016), fault localization (Li et al., 2019; Zhang et al., 2019, 2017), and program repair (Ghanbari et al., 2019; Lou et al., 2019). Correspondingly, there also arises a strong demand for an efficient search process through a large codebase to find relevant code for helping programming tasks. To this end, code search refers to automatically retrieving relevant code snippets from a large code corpus given natural language queries.

The recent research progress towards code summarization and code search can mainly be categorized into two classes: information-retrieval-based approaches and deep-learning-based approaches. To be specific, the information-retrieval-based approaches derive the natural language clues from source code, compute and rank the similarity scores between them and source code/natural language queries for recommending comments/search results (Wong et al., 2013; Movshovitz-Attias and Cohen, 2013; Lu et al., 2015; Lv et al., 2015). The deep-learning-based approaches use deep neural networks to encode source code/natural language into a hidden space, and utilize neural machine translation models for generating code comments and computing similarity distance to derive search results (Hu et al., 2018b; Chen and Zhou, 2018; Gu et al., 2018; Akbar and Kak, 2019; Yao et al., 2019).

Based on the respective development of code summarization and code search techniques, we infer that developing a unified technique for optimizing both domains simultaneously is not only mutually beneficial but also feasible. In particular, on one hand, since the natural-language-based code comments can reflect program semantics to strengthen the understanding of the programs (LeClair et al., 2019), adopting them in code search can improve the matching process with natural language queries (Yao et al., 2019). Accordingly, injecting code comments for code search is expected to enhance the search results (Scholer et al., 2014). On the other hand, the returned search results can be utilized as an indicator of the accuracy of the generated code comments to guide their optimization process. Moreover, since code summarization and code search can share the same technical basis as mentioned above, it can be inferred that it is feasible to build a framework to unify and advance both the domains. Therefore, it is essential to integrate code summarization with code search.

Although integrating code summarization with code search can be promising, the following challenges remain and may compromise its performance. (1) State-of-the-art code summarization techniques render inferior accuracy. According to the recent advances in code summarization (Wong et al., 2013; Iyer et al., 2016; Hu et al., 2018a), the accuracy of the generated code comments appears to be too low for real-world applicability (around 20% in BLEU-1 on many well-recognized benchmarks). Integrating such code comments might lead to inaccurate matching against natural language queries and further compromise the performance of code search. (2) How to effectively and efficiently integrate code summarization with code search remains challenging. Ideally, the goal of integrating code summarization with code search is to optimize the performance of both domains rather than causing trade-offs. Moreover, such integration is expected to introduce minimum overhead. To this end, it is essential to propose an effective and efficient integration approach.

To tackle the aforementioned problems, in this paper, we propose a framework, namely TranS^3, for optimizing both code summarization and code search based on a recent NLP technique, the transformer (Vaswani et al., 2017). Unlike the traditional CNN-based approaches that suffer from the long-distance dependency problem (Wang et al., 2016a) and RNN-based approaches that suffer from the excessive load imposed by sequential computation (Huang et al., 2013), transformer applies the self-attention mechanism, which can parallelize the computation and preserve the integral textual weights for encoding to achieve the optimal accuracy of text representation (Vaswani et al., 2017).

TranS^3 consists of two components: the code summarization component and the code search component. Specifically, the code summarization component is initialized by preparing a large-scale corpus of annotated <code, comment> pairs to record all the code snippets with their corresponding comments as the training data. Next, we extract the semantic granularity of the training programs for constructing a tree-transformer to encode the source code into hidden vectors. Furthermore, such annotated pair vectors are injected into our deep reinforcement learning model, i.e., the actor-critic framework, for the training process, where the actor network is a formal encoder-decoder model to generate comments given the input code snippets, and the critic network evaluates the accuracy of the generated comments according to the ground truth (the input comments) and gives feedback to the actor network. At last, given the resulting trained actor network and a code snippet, its corresponding comment can be generated. Given a natural language query, the code search component is launched by encoding the natural language query, the generated code comments, and the code snippets into vectors via transformer and tree-transformer respectively. Next, we compute similarity scores between query/code vectors and query/comment vectors for deriving and optimizing their weighted scores. Eventually, we rank all the code snippets according to their weighted scores for recommending the search results. The underlying transformer in TranS^3 can enhance the quality of the generated comments and thus strengthen the code search results by importing the impact from the generated comments. Moreover, since the code search component applies the encoder trained by the code summarization component without incurring an extra training process, its computing overhead can be kept to a minimum.

To evaluate the effectiveness and efficiency of TranS^3, we conduct a set of experiments based on the GitHub dataset in (Barone and Sennrich, 2017), which includes over 120,000 code snippets of Python functions and their corresponding comments. The experimental results suggest that TranS^3 can outperform multiple state-of-the-art approaches in both code summarization and code search, e.g., compared with the selected state-of-the-art approaches, TranS^3 can significantly improve the code summarization accuracy by 47.2% to 141.6% in terms of BLEU-1 and the code search accuracy by 5.1% to 28.8% in terms of MRR. In addition, we also conduct case studies for both code summarization and code search, where the study results further verify the effectiveness of TranS^3.

In summary, the main contributions of this paper are listed as follows:

  • Idea. To the best of our knowledge, we build the first transformer-based framework for integrating code summarization and code search, namely TranS^3, that can optimize the accuracy of both domains.

  • Technique. To precisely represent the source code, we design a transformer-based encoder and a tree-transformer-based encoder for encoding code and comments by injecting the impact from the semantic granularity of well-formed programs.

  • Evaluation. To evaluate TranS^3, we conduct a substantial number of experiments based on real-world benchmarks. The experimental results suggest that TranS^3 can outperform several existing approaches in terms of accuracy of both code summarization and code search. In addition, we also conduct empirical studies with developers. The results suggest that the quality of the generated comments and search results is widely acknowledged by developers.

The remainder of this paper is organized as follows. Section 2 illustrates some preliminary background techniques. Section 3 gives an example to illustrate our motivation for unifying code summarization and code search. Section 4 elaborates the details of our proposed approach. Section 5 demonstrates the experimental and study results and analysis. Section 6 introduces the threats to validity. Section 7 reviews the related work. Section 8 concludes this paper.

2. Background

In this section, we present the preliminary background techniques relevant to TranS^3, including language model, transformer, and reinforcement learning. We begin by introducing basic notations and terminologies. Let $x = (x_1, x_2, \dots, x_{|x|})$ denote the code sequence of one function, where $x_i$ represents a token of the code, e.g., … “def”, “fact”, or “i” in a Python statement “def fact(i):”. Let $y = (y_1, y_2, \dots, y_{|y|})$ denote the sequence of the generated comments, where $|y|$ denotes the sequence length. Let $T$ denote the maximum step of decoding in the encoder-decoder framework. We use notation $y_{l \dots m}$ to represent the comment subsequence $y_l, \dots, y_m$ and $\mathcal{D} = \{(x_N, y_N)\}$ as the training dataset, where $N$ is the size of the training set.

2.1. Language Model

A language model refers to the decoder of neural machine translation, which is usually constructed as the probability distribution over a particular sequence of words. Assuming such a sequence $y_{1 \dots T}$ with its length $T$, the language model defines $p(y_{1 \dots T})$ as its occurrence probability, which is usually computed based on the conditional probability from a window of $n$ predecessor words, known as $n$-gram (Wang et al., 2016b), as shown in Equation 1.

$$p(y_{1 \dots T}) = \prod_{t=1}^{T} p(y_t \mid y_{1 \dots t-1}) \approx \prod_{t=1}^{T} p(y_t \mid y_{t-(n-1) \dots t-1}) \tag{1}$$

While the $n$-gram model can only predict a word based on a fixed number of predecessor words, a neural language model can use predecessor words at a longer distance to predict a word based on deep neural networks which include three layers: an input layer which maps each word to a vector, a recurrent hidden layer which recurrently computes and updates a hidden state after reading $y_t$, and an output layer which estimates the probabilities of the subsequent words given the current hidden state. In particular, the neural network reads individual words from the input sentence, and predicts the subsequent word in turn. For the word $y_t$, the probability of its subsequent word, $y_{t+1}$, can be computed as in Equation 2:

$$p(y_{t+1} \mid y_{1 \dots t}) = g(h_t) \tag{2}$$

where $g$ is a stochastic output layer (e.g., a softmax for discrete outputs) that generates output tokens with the hidden state $h_t$ computed as Equation 3:

$$h_t = f(h_{t-1}, w(y_t)) \tag{3}$$

where $w(y_t)$ denotes the weight of the token $y_t$.
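The n-gram factorization described in this subsection can be illustrated with a toy count-based model. The following is a minimal sketch for n = 2 (a bigram model) without smoothing; the function names, the `<s>` start marker, and the two-sentence corpus are hypothetical and are not part of the paper.

```python
from collections import Counter

def train_bigram(corpus):
    """Count contexts and bigrams over tokenized sentences (hypothetical helper)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence          # "<s>" is an assumed start marker
        unigrams.update(tokens[:-1])         # only tokens that act as a context
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def sequence_probability(sentence, unigrams, bigrams):
    """Chain-rule probability with a one-word history (n = 2), no smoothing."""
    prob = 1.0
    tokens = ["<s>"] + sentence
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0                       # unseen context
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

corpus = [["return", "the", "list"], ["return", "the", "set"]]
unigrams, bigrams = train_bigram(corpus)
p = sequence_probability(["return", "the", "list"], unigrams, bigrams)
```

A neural language model replaces these fixed-window counts with a hidden state that summarizes the whole prefix, which is what allows it to use longer-distance predecessors.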

2.2. Transformer

Many neural machine translation approaches integrate the attention mechanism with sequence transduction models for enhancing the accuracy. However, the encoding networks still face challenges. To be specific, the CNN-based encoding networks are subject to long-distance dependency issues and the RNN-based encoding networks are subject to long computation time. To address such issues, transformer (Vaswani et al., 2017) is proposed to effectively and efficiently improve the sequence representation by adopting the self-attention mechanism only. Many transformer-based models, e.g., BERT (Devlin et al., 2018), ERNIE (Sun et al., 2019), and XLNet (Yang et al., 2019), have been proposed and verified to dramatically enhance the performance of various NLP tasks such as natural language inference, text classification, and retrieval question answering (Bowman et al., 2015; Voorhees, 2001).

Transformer consists of $N$ identical layers, where each layer consists of two sub-layers. The first sub-layer realizes a multi-head self-attention mechanism, and the second sub-layer is a simple, position-wise fully connected feed-forward neural network, as shown in Figure 1. Note that the output of the first sub-layer is input to the second sub-layer and the outputs of both sub-layers need to be normalized prior to the subsequent process.

Figure 1. The Transformer Model Architecture.

2.2.1. Self-attention Mechanism

The attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

The input consists of queries, keys and values of the dimension $d_k$. Accordingly, transformer computes the dot products of the query with all keys, divides each resulting element by $\sqrt{d_k}$, and applies a softmax function to obtain the weights on the values. In practice, we simultaneously compute the attention function on a set of queries which are packed together into a matrix $Q$. In addition, the keys and values are also packed together into matrices $K$ and $V$. Therefore, the matrix of outputs can be computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{4}$$

Instead of implementing a single attention function, transformer adopts a multi-head attention which allows the model to jointly attend to information from different representation subspaces at different positions.

The self-attention mechanism derives the relationships between the current input token and all the other tokens to determine the current token vector for the final input representation. By taking advantage of the overall token weights, such mechanism can dramatically alleviate the long-distance dependency problem caused by the CNN-based transduction models, i.e., compromising the contributions of the long-distance tokens. Moreover, the multi-head self-attention mechanism can parallelize the computation and thus resolve the excessive computing overhead caused by the RNN-based transduction models which sequentially encode the input tokens.
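The scaled dot-product attention described in this subsection can be written in a few lines of NumPy. This is a minimal single-head sketch following Vaswani et al. (2017), not the implementation used in the paper; the function names and the random toy inputs are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return the attention outputs and the attention weights.

    Q: (num_queries, d_k), K: (num_keys, d_k), V: (num_keys, d_v).
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # compatibility of each query with each key
    weights = softmax(scores, axis=-1)     # one weight distribution per query
    return weights @ V, weights            # weighted sum of the values

# Toy inputs (hypothetical): 3 queries, 5 key-value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
out, w = scaled_dot_product_attention(Q, K, V)
```

Because every query row is handled by the same matrix products, all positions are processed at once, which is the parallelism argument made above. Multi-head attention would run several such functions on different learned projections and concatenate the results.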

2.2.2. Position-wise Feed-Forward Neural Network

In addition to multi-head self-attention sub-layers, each of the layers contains a fully connected feed-forward neural network, which is applied to each position separately. Since transformer contains no recurrence or convolution, in order to utilize the order of the sequence, transformer injects “positional encodings” to the input embedding.
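As an illustration of positional encodings, the sketch below implements the sinusoidal variant from Vaswani et al. (2017). The text above does not state which variant TranS^3 uses, so the sinusoidal form, the function name, and the requirement that `d_model` be even are assumptions.

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encodings; assumes d_model is even."""
    positions = np.arange(num_positions)[:, None]      # shape (num_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # dimension indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions use cosine
    return pe

pe = positional_encoding(10, 8)
```

The resulting matrix is added to the token embeddings row by row, so two identical tokens at different positions receive different input vectors.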

Since transformer has been verified to be dramatically effective and efficient in encoding word sequences, we infer that by representing code as a sequence, transformer can also be expected to excel in the encoding efficacy. Therefore, in TranS^3, we adopt transformer as the encoder.

2.3. Reinforcement Learning for Code Summarization

In code summarization, reinforcement learning (RL) (Thrun and Littman, 2005) refers to interacting with the ground truth, learning the optimal policy from the reward signals, and generating texts in the testing phase. It can potentially solve the exposure bias problem introduced by the maximum likelihood approaches which are used to train the RNN model. Specifically, in the inference stage, a typical RNN model generates a sequence iteratively and predicts the next token conditioned on its previously predicted ones, which may never be observed in the training data (Yu et al., 2017). Such a discrepancy between training and inference can accumulate along the sequence and thus become prominent as the sequence length increases. In the reinforcement-learning-based framework, by contrast, the reward, rather than the probability of the generated sequence, is calculated to give feedback for training the model, which alleviates the exposure bias problem. Such a text generation process can be viewed as a Markov Decision Process (MDP). Specifically, in the MDP setting, the state $s_t$ at time $t$ consists of the code snippets $x$ and the predicted words $y_1, \dots, y_t$. The action space is defined as the dictionary $\mathcal{Y}$ from which the words are drawn, i.e., $y_t \in \mathcal{Y}$. Correspondingly, the state transition function is defined as $s_{t+1} = \{s_t, y_{t+1}\}$, where the action (word) $y_{t+1}$ becomes a part of the subsequent state $s_{t+1}$ and the reward $r_{t+1}$ can be derived. The objective of the generation process is to find a policy that iteratively maximizes the expected reward of the generated sentence sampled from the model's policy, as shown in Equation 5,

$$\max_{\theta} \mathcal{L}(\theta) = \max_{\theta} \mathbb{E}_{x \sim \mathcal{D},\ \hat{y} \sim P_{\theta}(\cdot \mid x)} \left[ R(\hat{y}, x) \right] \tag{5}$$

where $\theta$ is the policy parameter to be learned, $\mathcal{D}$ is the training set, $\hat{y}$ denotes the predicted actions/words, and $R$ is the reward function.

To learn the policy, many approaches have been proposed, which are mainly categorized into two classes (Sutton and Barto, 1998): (1) the policy-based approaches (e.g., policy gradients (Williams, 1992)), which optimize the policy directly via policy gradient, and (2) the value-based approaches (e.g., Q-learning (Watkins and Dayan, 1992)), which learn the Q-function, and at each time the agent selects the action with the highest Q-value. It has been verified that the policy-based approaches may suffer from a variance issue and the value-based approaches may suffer from a bias issue (Keneshloo et al., 2018). To address such issues, the Actor-Critic learning approach is proposed (Konda, 2003) to combine the strengths of both policy- and value-based approaches, where the actor chooses an action according to the probability of each action and the critic assigns a value to the chosen action, speeding up the learning process of the original policy-based approaches.

In this paper, we adopt the actor-critic learning model for the code summarization component of TranS^3.

3. Illustrative Example

In this section, we use a sample Python code snippet to illustrate our motivation for unifying code summarization and code search. Figure 2 shows the Python code snippet, the comment generated by our approach and its associated natural language query in our dataset. Traditional code search approaches usually compute the similarity scores of the query vector and the code snippet vectors for recommending and returning the relevant code snippets. On the other hand, provided the comment information, it is plausible to enhance the code search results by enabling an additional mapping process between the query and the comments corresponding to the code snippets. For example, in Figure 2, given the query “get the recursive list of target dependencies”, although the code snippet can provide some information such as “dependencies”, “target”, which might be helpful for being recommended, its efficacy can be compromised due to the disturbing information such as “dicts”, “set”, “pending” in the code snippet. It is expected to enhance the search result by integrating the comment information during the searching process when it has the identical “target”, “dependencies” with the query. To this end, we infer that a better code search result can be expected if high-quality comment information can be integrated in the code search process.

Figure 2. An example Python code snippet and the corresponding query and generated comment.
Figure 3. The Overview of Our Proposed Approach TranS^3. (a) the code summarization part; (b) the code search part.

4. The Approach of TranS^3

We formulate the research problem of integrating code summarization with code search as follows:

  • First, we attempt to find a policy that generates a sequence of words from a dictionary to annotate the code snippets in the corpus as their comments. Next, given a natural language query, we aim to find the code snippets that can satisfy the query with the assistance of the generated comments.

Figure 4. An example of Self-Attention Mechanism.

To address this research problem, we propose TranS^3 with its framework shown in Figure 3, where Figure 3(a) presents the code summarization part and Figure 3(b) presents the code search part.

4.1. Transformer- and Tree-Transformer-based Encoder

In TranS^3, we utilize transformer to build the encoder. Specifically, we develop the transformer-based encoder to encode the comments, the query and each program statement. Moreover, we develop a tree-transformer-based encoder that exploits the semantic granularity information of programs for enhancing the program encoding accuracy.

Transformer-based Encoder. The transformer-based encoder is initialized by embedding the input tokens into vectors via word embedding (Mikolov et al., 2013). Specifically, we tokenize the natural language comments/queries based on their intervals and the code based on a set of symbols, i.e., { . , ” ’ : * () ! - _ (space)}. Next, we apply word embedding to derive each token vector in one input sequence. Furthermore, for each token vector $x_i$, we derive its representation according to the self-attention mechanism as follows: (1) deriving the query vector $q_i$, the key vector $k_i$, and the value vector $v_i$ by multiplying $x_i$ with a randomly-generated matrix, (2) computing the scores of $x_i$ from all the input token vectors by the dot product $q_i \cdot k_j$, where $j \in [1, n]$ and $n$ denotes the number of input tokens, (3) dividing the scores of $x_i$ by $\sqrt{d_k}$, where $d_k$ denotes the dimension number of $k_j$, and normalizing the results by softmax to obtain the weights (contributions) of all the input token vectors, (4) multiplying such weights and their corresponding value vectors to obtain an interim vector space, and (5) summing all the vectors in this space for deriving the final vector of $x_i$, namely $z_i$. As a result, all the token vectors are input to the feed-forward neural network to obtain the final representation vector of the input sequence of natural language comment.
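The tokenization step described above can be sketched with the standard `re` module. This is only an illustration under stated assumptions: the splitting symbols are taken from the set listed in the text with straight quotes substituted for the typographic ones, comments and queries are split on whitespace, and the function names are hypothetical.

```python
import re

# Assumed splitting symbols: . , " ' : * ( ) ! - _ and the space character.
SYMBOLS = ".,\"':*()!-_ "
_CODE_SPLIT = re.compile("[" + re.escape(SYMBOLS) + "]+")

def tokenize_code(statement):
    """Split one code statement on the symbol set, dropping empty pieces."""
    return [tok for tok in _CODE_SPLIT.split(statement) if tok]

def tokenize_text(text):
    """Split a comment or query on whitespace."""
    return text.split()

code_tokens = tokenize_code("def DeepDependencyTargets(target_dicts, roots):")
query_tokens = tokenize_text("get the recursive list of target dependencies")
```

With this symbol set, an identifier such as `target_dicts` is split into `target` and `dicts`, which exposes sub-words that can also occur in comments and queries.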

We use Figure 4 as an example to illustrate how the transformer-based encoder works, where the token vectors of “Software” and “Engineering” are embedded as $x_1$ and $x_2$ respectively. For $x_1$, its corresponding $q_1$, $k_1$, and $v_1$ are derived in the beginning. Next, its scores from all the token vectors, i.e., $x_1$ and $x_2$, can be computed by $q_1 \cdot k_1$ (112) and $q_1 \cdot k_2$ (96). Assuming $d_k$ is 64, by dividing the resulting dot products by $\sqrt{d_k} = 8$ and normalizing, the weights of $x_1$ and $x_2$ can be computed as 0.88 and 0.12. At last, $z_1$ can be derived by $0.88 \cdot v_1 + 0.12 \cdot v_2$.
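The arithmetic of this example can be checked directly. The snippet below takes the two dot products (112 and 96) and the key dimension 64 from the text, scales them, and applies a softmax; the variable names are illustrative only.

```python
import math

scores = [112.0, 96.0]        # dot products of the first token's query with both keys
d_k = 64                      # key dimension assumed in the example

scaled = [s / math.sqrt(d_k) for s in scores]      # 14.0 and 12.0
largest = max(scaled)
exps = [math.exp(s - largest) for s in scaled]     # subtract the max for stability
weights = [e / sum(exps) for e in exps]

rounded = [round(w, 2) for w in weights]           # 0.88 and 0.12
```

The weight gap comes entirely from the difference of 2 between the scaled scores, since softmax depends only on score differences.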

Tree-Transformer-based Code Encoder. It can be observed that well-formed source code can reflect the program semantics through its representations, e.g., the indents of Python. In general, in a well-formed program, the statement with fewer indents tends to indicate more abstracted semantics than the one with longer indents. Therefore, we infer that incorporating the indent-based semantic granularity information for encoding can inject program semantics for program comprehension and thus be promising to enhance the encoding accuracy. Such injection can potentially be advanced when leveraging transformer. In particular, in addition to the original self-attention mechanism which determines the token vector score by only importing the token-level weights, statement-level impacts can be injected by analyzing statement indents, obtaining the semantic hierarchy of the code, and realizing the hierarchical encoding process.

Input : ordered tree (given by its root node, root_node)

Output: vector representation of the tree

function PostorderTraverse(root_node)
     node_list ← [root_node]
     if isLeaf(root_node) then
         return Transformer(node_list)
     else
         for i in range(len(root_node’s children)) do
              node_list.append(PostorderTraverse(root_node’s children[i]))
         return Transformer(node_list)
Algorithm 1 Tree Transformer Encoding Algorithm.

In this paper, we design a tree-transformer-based encoder that incorporates indent-based semantic granularity for encoding programs. Firstly, we construct an ordered tree according to the indent information of a well-formed program. In particular, by reading the program statements in turn, we initialize the tree by building the root node out of the function definition statement. Next, we iteratively label each of the subsequent statements with an indent index $i$ assigned by counting its indents, such that the statements with the same indent index $i$ are constructed as ordered sibling nodes and the preceding statement above such a statement block, with the indent index $i-1$, is constructed as their parent node. Secondly, we encode each node (i.e., each statement) of the tree into a vector by transformer. At last, we build the tree-transformer accordingly to further encode all the vector nodes of the tree for obtaining the code snippet representations. Specifically, we traverse the tree in a post-order manner. Assuming a node $n$ and its parent node $p$, if $n$ is a leaf node, we replace the vector of $p$, namely $v_p$, by the vector list $\{v_p, v_n\}$ and subsequently traverse $p$'s other child nodes; otherwise, we traverse $n$'s child nodes. Next, we encode node $p$ with the updated vector list $\{v_p, v_n\}$ by transformer when it has no child nodes left to traverse. The tree-transformer encoding process is shown as Algorithm 1.
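The tree construction and the recursive encoding can be sketched as follows. This is a minimal illustration, not the authors' implementation: the statement encoder and the sequence encoder are replaced by trivial stand-ins (`toy_embed` and `mean_pool`, both hypothetical), so that only the indent-based tree building and the post-order recursion are shown.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    statement: str
    children: List["Node"] = field(default_factory=list)

def indent_of(line):
    return len(line) - len(line.lstrip())

def build_indent_tree(source):
    """Build an ordered tree whose parent/child links follow the indentation."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    root = Node(lines[0].strip())
    stack = [(indent_of(lines[0]), root)]          # path from the root to the last node
    for ln in lines[1:]:
        node = Node(ln.strip())
        # Pop until the top of the stack is less indented than this line.
        while len(stack) > 1 and stack[-1][0] >= indent_of(ln):
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((indent_of(ln), node))
    return root

def encode_tree(node, embed, combine):
    """Post-order encoding: children are encoded before their parent."""
    node_list = [embed(node.statement)]
    for child in node.children:
        node_list.append(encode_tree(child, embed, combine))
    return combine(node_list)

def toy_embed(statement):
    # Stand-in for the transformer statement encoder (assumption).
    return [float(len(statement)), float(len(statement.split()))]

def mean_pool(vectors):
    # Stand-in for the transformer over a list of vectors (assumption).
    return [sum(col) / len(vectors) for col in zip(*vectors)]

SOURCE = "def f(xs):\n    total = 0\n    for x in xs:\n        total += x\n    return total\n"
tree = build_indent_tree(SOURCE)
vector = encode_tree(tree, toy_embed, mean_pool)
```

In a real model, `toy_embed` and `mean_pool` would both be transformer encoders, and tabs or mixed indentation would need normalizing before `indent_of` is applied.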

Figure 5 illustrates indent-based tree representation of the code snippet given in Figure 2. We use this example to describe how the tree-transformer-based encoder works. Specifically in Figure 2, we construct nodes “Dependencies = set()”, “pending=set(roots)”, “while pending:” and “return list(…)” as siblings because they are assigned with the same indents and one-shorter-indent preceding statement “def DeepDependencyTargets(target_dicts, roots):”, which is constructed as their parent node. Then, we encode all the statement nodes into vectors by transformer respectively. Next, as the root’s child nodes “Dependencies = set()” and “pending=set(roots)” are leaf nodes, we replace the root vector by the vector list of them three. Then, since the root’s child node “while pending:” is not the leaf node, we first encode its child node “if (r in dependencies):” with “continue” by transformer, and then encode the resulting vector with the siblings of “while pending:” and “if (r in dependencies):” together by transformer. At last, we encode the root node with all its child nodes to obtain the final representation of this code snippet.

4.2. Code Summarization

Initialized by collecting code snippets with their associated comments and forming pairs for training the code summarization model, the code summarization component is implemented via reinforcement learning (i.e., the actor-critic framework), where the actor network establishes an encoder-decoder mechanism to derive code comments and the critic network iteratively provides feedback for tuning the actor network. In particular, the actor network leverages a transformer-based or a tree-transformer-based encoder to encode the collected code into hidden space vectors and applies a transformer-based decoder to decode them to natural language comments. Next, by computing the similarity between the generated and the ground-truth comments, the critic network iteratively provides feedback for tuning the actor network. As a result, given a code snippet, its corresponding natural language comment can be generated based on the trained code summarization model.

Figure 5. The Source Code and the Tree Structure.

4.2.1. Actor Network.

The actor network is composed of an encoder and decoder.

Encoder. We construct the tree representation of the source code and establish a tree-transformer, as described in Section 4.1, to encode the source code into hidden space vectors for the code representation.

Decoder. After obtaining the code snippet representations, TranS^3 implements the decoding process for them, i.e., generating comments from the hidden space, to derive their associated natural language comments.

The decoding process is launched by generating an initial decoding state by encoding the given code snippet. At step , state is generated to maintain the source code snippet and the previously generated words , i.e., . Specifically, the previously generated words are encoded into a vector by transformer and subsequently concatenated with state . Our approach predicts the th word by using a softmax function. Let denote the probability distribution of the th word in the state , we can obtain the following equation:

(6)

Next, we update to to generate the next word. This process is iterated till it exceeds the max-step or generates the end-of-sequence (EOS) token for generating the whole comment corresponding to the code snippet.

4.2.2. Critic Network

To enhance the accuracy of the generated code comments, TranS^3 applies a critic network to approximate the value of the generated comments at time $t$ and issue feedback to tune the network iteratively. Unlike the actor network, which outputs a probability distribution, the critic network outputs a single value at each decoding step. To illustrate, given the generated comments and the reward function, the value function is defined to predict the total reward from the state at time $t$, which is formulated as follows,

(7)

where is the max step of decoding and is the representation of code snippet. By applying the reward function, we can obtain an evaluation score (e.g., BLEU) when the generation process of the comment sequences is completed. Such process is terminated when the associated step exceeds or generates the end-of-sequence (EOS) token. For instance, a BLEU-based reward function can be calculated as

(8)

where , and is the generated comment and is the ground truth.
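To make the idea of a BLEU-based reward concrete, the sketch below computes a unigram (BLEU-1 style) score with clipped precision and a brevity penalty. It is not the reward of Equation 8, whose exact form is not recoverable here; the function name and the choice of unigrams only are assumptions.

```python
import math
from collections import Counter

def bleu1_reward(generated, reference):
    """Clipped unigram precision times a brevity penalty (illustrative reward)."""
    if not generated:
        return 0.0
    gen_counts, ref_counts = Counter(generated), Counter(reference)
    # Each generated word is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[word]) for word, count in gen_counts.items())
    precision = clipped / len(generated)
    if len(generated) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(generated))
    return brevity_penalty * precision

reference = "get the recursive list of target dependencies".split()
perfect = bleu1_reward(reference, reference)
partial = bleu1_reward(["get", "the", "list"], reference)
disjoint = bleu1_reward(["open", "a", "file"], reference)
```

A reward of this kind is only available once a full comment has been generated, which is why the critic is used to estimate values at intermediate decoding steps.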

4.2.3. Model Training

For the actor network, the training objective is to minimize the negative expected reward, which is defined as , where is the parameter set of the actor network. Defining policy as the probability of a generated comment, we adopt the policy gradient approach to optimize the policy directly, which is widely used in reinforcement learning.

The critic network attempts to minimize the following loss function,

(9)

where is the target value, is the value predicted by the critic network with its parameter set . Eventually, the training for comment generation is completed after converges.

Denoting all the parameters as , the total loss of our model can be represented as . We employ stochastic gradient descent with the diagonal variant of AdaGrad (Duchi et al., 2011) to tune the parameters of for optimizing the code summarization model.
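The interplay of the two losses can be sketched numerically. The following is a generic advantage-style actor-critic formulation and not necessarily the exact losses of the paper: it assumes a single terminal reward used as the target at every decoding step, a squared-error critic loss, and a total loss that is the plain sum of both; all names are hypothetical.

```python
import numpy as np

def actor_critic_losses(log_probs, values, reward):
    """Return (actor_loss, critic_loss, total_loss) for one generated comment.

    log_probs: log-probability of each generated word under the actor.
    values:    critic's value estimate at each decoding step.
    reward:    terminal reward of the whole comment (e.g., a BLEU score).
    """
    log_probs = np.asarray(log_probs, dtype=float)
    values = np.asarray(values, dtype=float)
    advantage = reward - values                     # how much better than predicted
    # Policy-gradient surrogate: in training, the advantage is treated as a constant.
    actor_loss = -np.sum(log_probs * advantage)
    critic_loss = 0.5 * np.sum(advantage ** 2)      # squared error to the target
    return actor_loss, critic_loss, actor_loss + critic_loss

actor_loss, critic_loss, total_loss = actor_critic_losses(
    log_probs=[-0.1, -0.2], values=[0.5, 0.5], reward=1.0
)
```

Subtracting the critic's estimate from the reward is the usual way the critic reduces the variance of the policy gradient; in an autodiff framework the two losses would be minimized with an optimizer such as AdaGrad.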

4.3. Code Search

Given a natural language query, TranS^3 encodes all the code snippets and the generated comments into vector sets by the tree-transformer-based encoder and the transformer-based encoder, respectively, and encodes the query into a vector by the transformer-based encoder. Next, we compute the similarity scores between the query vector and the vectors in both the code snippet vector set and the generated comment vector set. Finally, we rank all the code snippets and recommend the search results according to a linear combination of the two sets of similarity scores, whose weighting is trained for optimality.

As shown in Figure 3 (b), we encode the code snippets and the generated comments into two vector spaces by the tree-transformer-based encoder and the transformer-based encoder, respectively. We also encode the given natural language query into a vector by the transformer-based encoder. Next, for each code snippet, we compute two similarity scores: one between the query vector and the code snippet vector, and one between the query vector and the vector of its generated comment. Furthermore, we derive the weighted score of each code snippet by linearly combining these two similarity scores, as shown in Equation 10. Eventually, we rank all the code snippets according to their weighted scores for recommending the search results,

(10)

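where the weight of the linear combination is a parameter that ranges from 0 to 1 and is determined after training, and each similarity score is computed by cosine similarity. Specifically, given a query, the training objective is to ensure that its corresponding code snippet receives a higher weighted score than the other code snippets in the dataset.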
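The ranking step can be sketched as follows, assuming the encoders have already produced plain numeric vectors. The function names, the default `weight`, and the choice of which similarity receives `weight` versus `1 - weight` are assumptions of this sketch; the text only states that the two cosine similarities are combined linearly with a weight between 0 and 1.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_snippets(query_vec, code_vecs, comment_vecs, weight=0.5):
    """Return snippet indices sorted by weighted similarity, best first.

    code_vecs[i] and comment_vecs[i] are the vectors of the i-th code
    snippet and of its generated comment, respectively.
    """
    scored = []
    for idx, (code_vec, comment_vec) in enumerate(zip(code_vecs, comment_vecs)):
        score = (weight * cosine(query_vec, code_vec)
                 + (1.0 - weight) * cosine(query_vec, comment_vec))
        scored.append((score, idx))
    scored.sort(key=lambda pair: -pair[0])
    return [idx for _, idx in scored]


query = [1.0, 0.0]
code_vectors = [[0.0, 1.0], [1.0, 0.0]]
comment_vectors = [[1.0, 0.0], [0.0, 1.0]]
# With a high weight on the code similarity, snippet 1 ranks first;
# with a low weight, the comment similarity dominates and snippet 0 wins.
code_heavy = rank_snippets(query, code_vectors, comment_vectors, weight=0.8)
comment_heavy = rank_snippets(query, code_vectors, comment_vectors, weight=0.2)
```

The toy example shows why the weight matters: the same query yields opposite rankings depending on how much the generated comments are trusted relative to the code itself.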

5. Evaluation

We conduct a set of extensive experiments on the effectiveness and efficiency of TranS^3 in terms of both the code summarization and code search components, compared with state-of-the-art approaches.

5.1. Experimental Setups

To evaluate the performance of our proposed approach, we use the dataset presented in (Barone and Sennrich, 2017), where over 120,000 code-comment pairs are collected from various Python projects in GitHub (12), with 50,400 code tokens and 31,350 comment tokens in its vocabulary, respectively. For cross validation, we shuffle the original dataset and use the first 60% for training, 20% for validation, and the rest for testing.
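The 60/20/20 split described above can be reproduced with a short helper such as the one below. The function name and the fixed `seed` are assumptions added so that the sketch is repeatable; the paper does not say whether a seed was fixed.

```python
import random


def split_dataset(pairs, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the pairs and split them into train, validation, and test."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the rest goes to testing
    return train, valid, test


# Toy stand-in for the code-comment pairs.
toy_pairs = [("def f%d(): pass" % i, "comment %d" % i) for i in range(10)]
train_set, valid_set, test_set = split_dataset(toy_pairs)
```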

In our experiments, the word embedding size is set to 1280, the batch size is set to 2048, the layer size is set to 6, and the head number is set to 8. We pretrain both actor network and critic network with 20000 steps each, and train the actor-critic network with 100000 steps simultaneously. For the code search part, the comments in the dataset are utilized for the query. All the experiments in this paper are implemented with Python 3.5, and run on a computer with a 2.8 GHz Intel Core i7 CPU, 64 GB 1600 MHz DDR3 RAM, and a Saturn XT GPU with 24 GB memory running RHEL 7.5.

5.2. Result Analysis

5.2.1. Code summarization

To evaluate the code summarization component of TranS^3, we select several state-of-the-art approaches, i.e., Hybrid-DeepCom (Hu et al., 2019), CoaCor (Yao et al., 2019), and AutoSum (Wan et al., 2018), for performance comparison. In particular, Hybrid-DeepCom (Hu et al., 2019) utilizes the AST sequence obtained by traversing the AST as the code representation and inputs it to a GRU-based NMT model, combining lexical and structural information for code summarization. CoaCor (Yao et al., 2019) utilizes the plain text of source code and an LSTM-based encoder-decoder framework for code summarization. AutoSum (Wan et al., 2018) utilizes a tree-LSTM-based NMT model and inputs the code snippet as plain text to the code-generation model, with reinforcement learning for performance enhancement. Similarly, the evaluation of TranS^3 is also designed to explore the performance of its different components through three variants: the first adopts the transformer-based encoder; the second adopts the tree-transformer-based encoder for source code; and the third, i.e., the complete TranS^3, utilizes the tree-transformer-based encoder for source code and reinforcement learning for further enhancing the code summarization model.

We evaluate the performance of all the approaches based on four widely-used evaluation metrics adopted in neural machine translation and image captioning: BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE (Lin, 2004), and CIDER (Vedantam et al., 2015). In particular, BLEU measures the average n-gram precision on a set of reference sentences, with a penalty for short sentences. METEOR evaluates how well the results capture content from the references via recall, which is computed via stemming and synonymy matching. ROUGE-L takes into account sentence-level structure similarity and identifies the longest co-occurring in-sequence n-grams. CIDER is a consensus-based evaluation protocol for image captioning that evaluates the similarity between the generated comments and the ground truth.
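To make the ROUGE-L description concrete, the sketch below computes the longest common subsequence (LCS) between a generated comment and a reference and turns it into an F-score. The function names and the choice of `beta=1.0` (equal weight on precision and recall) are assumptions of this sketch; published ROUGE-L implementations often use a different `beta`.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score from LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return ((1 + beta ** 2) * precision * recall
            / (recall + beta ** 2 * precision))


generated = "combine two lists in a list".split()
reference = "combine two lists into a new list".split()
score = rouge_l(generated, reference)
```

Unlike n-gram precision, the LCS rewards words that appear in the same order even when they are not adjacent, which is the sentence-level structure the metric is meant to capture.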

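| Approaches | BLEU-1 | METEOR | ROUGE-L | CIDER |
|---|---|---|---|---|
| Hybrid-DeepCom | 15.60 | 6.09 | 14.33 | 51.88 |
| CoaCor | 25.60 | 9.52 | 29.38 | 78.11 |
| AutoSum | 25.27 | 9.29 | 39.13 | 75.01 |
| TranS^3 (transformer-based encoder) | 27.69 | 10.26 | 41.87 | 81.01 |
| TranS^3 (tree-transformer-based encoder) | 32.05 | 11.74 | 45.92 | 84.56 |
| TranS^3 (complete: tree-transformer-based encoder and reinforcement learning) | **37.69** | **13.52** | **51.27** | **87.24** |

Table 1. Code summarization results with different metrics. (Best scores are in boldface.)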

| # | Issue link | Generated comment | Feedback |
|---|---|---|---|
| 1 | https://github.com/mikunit567/GAE/issues/1 | Validate a given xsrf token by retrieving it. | “Yes, this is correct. Validate a retrieved XSRF from the memory cache and then with the token perform an associated action.” |
| 2 | https://github.com/hamzafaisaljarral/scoop/issues/1 | Iterates through the glob nodes. | “Yup you have got that right but for better understanding you have to look into django-shop documentation and look into django-cms documentation as well.” |
| 3 | https://github.com/rumd3x/PSP-POC/issues/1 | Combine two lists in a list. | “The pstats package is used for creating reports from data generated by the Profiles class. The add_callers function is supposed to take a source list, and a target list, and return new_callers by combining the call results of both target and source by adding the call time.” |

Table 2. Sample issues for code summarization case study

Table 1 demonstrates the code summarization results of all the approaches in terms of the selected metrics. While the compared approaches achieve BLEU-1 scores between 15.60 and 25.60, TranS^3 reaches 37.69. In particular, we can obtain the following detailed findings. First, TranS^3 significantly outperforms all the compared approaches in terms of all the evaluated metrics. For instance, the complete TranS^3 outperforms the compared approaches by 47.2% to 141.6% in terms of BLEU-1. Such performance advantages indicate the superiority of the transformer-enabled self-attention mechanism over the mechanisms, including the attention mechanism, that are adopted in the RNN-based approaches, because the self-attention mechanism can effectively capture the impact of the overall text on all the tokens of the input sequences, better reflecting their semantics and thus optimizing the language model weights. Next, we can verify that each component of TranS^3 is effective for enhancing the performance. For instance, by applying the tree-transformer-based encoder, TranS^3 outperforms the variant that only applies the transformer-based encoder by 15.7% in terms of BLEU-1. This verifies that our tree transformer, which identifies and leverages the indent-based program semantic granularity, can effectively strengthen the language model by augmenting the semantic-level information for tokens. Moreover, by applying reinforcement learning, the complete TranS^3 outperforms the variant without reinforcement learning by 17.5% in terms of BLEU-1, which further verifies the strength of reinforcement learning as verified in (Yao et al., 2019; Wan et al., 2018). Note that the performance of certain approaches, e.g., Hybrid-DeepCom, differs dramatically from its original performance in (Hu et al., 2019), mainly because of the training data and programming language differences.
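The relative improvements quoted above follow directly from the BLEU-1 column of Table 1. The short sketch below recomputes them; the helper name `relative_gain` is introduced here for illustration only.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# BLEU-1 scores copied from Table 1.
baseline_bleu1 = {"Hybrid-DeepCom": 15.60, "CoaCor": 25.60, "AutoSum": 25.27}
complete_bleu1 = 37.69

gains = {name: round(relative_gain(complete_bleu1, score), 1)
         for name, score in baseline_bleu1.items()}
# gains["CoaCor"] is 47.2 and gains["Hybrid-DeepCom"] is 141.6,
# matching the lower and upper ends of the range reported in the text.

# Gain from the tree-transformer-based encoder over the plain transformer one.
tree_gain = round(relative_gain(32.05, 27.69), 1)  # 15.7
```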

We also conduct a set of case studies to further evaluate the effectiveness of TranS^3. In particular, we first collect Python projects from GitHub and input them to our TranS^3-trained model for generating their corresponding comments. Next, we send the generated comments to the corresponding developers for their evaluation of the comment quality. In total, we received 24 responses, among which 11 developers confirmed the correctness of the generated comments in summarizing their code snippets. In addition, 5 developers offered detailed explanations of the associated code, which also indicate their support for our generated comments. The remaining responses are unrelated to the correctness of our generated comments. Table 2 presents selected examples of the developer feedback, where the first and second cases indicate that the developers confirm the correctness of our generated comments, while the third case reveals that the developer is supportive of the generated comment though he did not state so directly.

5.2.2. Code search

To evaluate the effectiveness of the code search component of TranS^3, we select several state-of-the-art approaches for comparison. Firstly, for the aforementioned approaches Hybrid-DeepCom, AutoSum, and CoaCor, we utilize the comments generated by those approaches as the input for the code search part of TranS^3. In addition, we also compare with DeepCS (Gu et al., 2018), which utilizes RNNs to encode code and query and computes the distance between the code vector and the query vector for returning the code snippets with the closest vectors. The performance of code search is evaluated in terms of four widely-used metrics: MRR (Mean Reciprocal Rank) (Craswell, 2009), nDCG (normalized Discounted Cumulative Gain) (Wang et al., 2013), and Success Rate@k (Xuan et al., 2016) with k set to 5 and 10, where MRR measures the average reciprocal rank of results given a set of queries, and the reciprocal rank of a query is computed as the inverse of the rank of the first hit result; nDCG considers the ranking of the search results and evaluates the usefulness of a result based on its position in the result list; and Success Rate@k measures the percentage of queries for which at least one correct result exists in the top k ranked results.
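The three metric families can be sketched compactly in Python. The input conventions are assumptions of this sketch: each query is summarized by the 1-based rank of its first correct result (or `None` if no correct result was returned), and nDCG takes the graded relevance of results in ranked order and uses the relevance itself as the gain, which is one of several common variants.

```python
import math


def mrr(first_hit_ranks):
    """Mean reciprocal rank; a query with no hit contributes 0."""
    return sum(1.0 / r for r in first_hit_ranks if r) / len(first_hit_ranks)


def success_rate_at_k(first_hit_ranks, k):
    """Fraction of queries with at least one correct result in the top k."""
    hits = sum(1 for r in first_hit_ranks if r is not None and r <= k)
    return hits / len(first_hit_ranks)


def ndcg(relevances, k=None):
    """Normalized discounted cumulative gain of one ranked result list."""
    ranked = relevances if k is None else relevances[:k]
    dcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ranked))
    ideal = sorted(relevances, reverse=True)[:len(ranked)]
    idcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


# Four toy queries: hits at ranks 1, 2, and 4, and one query with no hit.
ranks = [1, 2, None, 4]
toy_mrr = mrr(ranks)                      # (1 + 0.5 + 0 + 0.25) / 4
toy_sr2 = success_rate_at_k(ranks, 2)     # 2 of 4 queries
toy_sr5 = success_rate_at_k(ranks, 5)     # 3 of 4 queries
```

Because Success Rate@k only asks whether a hit appears within the cutoff, two systems that place the correct snippet at rank 1 and rank 5 score identically on SR@5 while differing on MRR.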

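| Approaches | MRR | nDCG | SR@5 | SR@10 |
|---|---|---|---|---|
| DeepCS | 48.41 | 58.85 | 57.44 | 66.78 |
| CoaCor | 59.33 | 67.51 | **67.05** | 73.58 |
| Hybrid-DeepCom | 50.92 | 59.92 | 60.52 | 68.35 |
| AutoSum | 57.68 | 63.52 | 63.43 | 70.16 |
| TranS^3 (transformer-based encoder) | 58.43 | 65.13 | 63.28 | 70.85 |
| TranS^3 (tree-transformer-based encoder) | 60.57 | 68.43 | 65.16 | 74.13 |
| TranS^3 (complete: tree-transformer-based encoder and reinforcement learning) | **62.37** | **70.62** | 66.95 | **75.21** |

Table 3. Code search accuracy compared with baselines. (Best scores are in boldface.)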

Table 3 shows the code search result comparisons between our proposed approach and the aforementioned baselines, where we can observe that TranS^3 outperforms all the other approaches in terms of MRR, nDCG, and SR@10, and trails only CoaCor on SR@5 by a small margin (66.95 vs. 67.05). Specifically, in terms of MRR, TranS^3 outperforms the other approaches by 5.12% to 28.8%. Compared with the code summarization results, the advantages of TranS^3 over the same approaches on code search shrink dramatically, which can be explained as follows: (1) the code search metrics are naturally subject to less distinguishable results than the code summarization metrics. For Hybrid-DeepCom, AutoSum, and TranS^3, which all utilize the generated comments to strengthen their code search performance, the adopted code summarization metrics are essentially based on word frequency and are generally fine-grained, e.g., BLEU-based metrics, while the code search metrics are generally based on coarse-grained query-wise comparisons. Therefore, the code summarization metrics tend to yield distinguishable results for different techniques because they are likely to reflect even trivial differences between two generated comments. However, the corresponding code search results might not be that distinguishable because the two generated comments might lead to identical code rankings. For instance, suppose two code summarization approaches generate the comments “returns the path of the target dependencies” and “derive a target-dependency list” respectively. While they can be used to represent the same code snippets, they may result in different BLEU scores because they consist of different words. However, if they are used for code search, they can both rank the code snippet of Figure 2 on the top and thus result in an identical score in terms of the code search metrics.
(2) CoaCor (Yao et al., 2019) achieves performance close to that of TranS^3 because its rewarding mechanism utilizes the search accuracy to guide the code annotation generation and search modeling directly. However, we can observe that TranS^3 significantly outperforms CoaCor in terms of code summarization (by 47.2% in BLEU-1). Therefore, to bridge such a performance gap, CoaCor has to pay extra effort for enhancing its modeling process, while TranS^3 can limit its effort to training the model once and for all for optimizing both code summarization and code search.

We also conduct a case study to evaluate the effectiveness of TranS^3. We recruited 5 postgraduate students and 5 developers with some Python background. We designed 15 programming tasks, where each participant is asked to choose 3 tasks for code search using TranS^3 as well as our benchmark. Two example tasks are listed as follows:

  • Task 1: Remove all the files in a directory.

  • Task 2: Sends a message to the admins.

Then, they are asked to evaluate whether the searched code snippets can solve the tasks or are helpful for solving them, by giving a score on a five-point Likert scale (strongly agree is 5 and strongly disagree is 1). For the 10 participants, the average Likert score is 3.167 (with a standard deviation of 1.472), which indicates that, in general, the efficacy of TranS^3 is acceptable.

6. Threats to Validity

There are several threats to validity of our proposed approach and its results, which are presented as follows.

The main threat to internal validity is the potential defects in the implementation of our techniques. To reduce this threat, we adopted a commonly-used benchmark with over 120,000 Python functions for evaluating the effectiveness and efficiency of our proposed approach and several existing approaches for comparison. Moreover, to ensure a fair comparison, we directly downloaded the optimized models of the existing approaches.

The threats to external validity mainly lie in the dataset quality and the evaluation metrics of our experiments. On one hand, the quality of the training data, i.e., the code-comment pairs adopted in our experiment, was not evaluated. Among the over 120,000 Python functions, it is likely that some poor-quality data can taint the training results. However, since (1) all the approaches were evaluated on the identical benchmark, and (2) the adopted evaluation metrics measure the performance of the approaches by word frequency, where the corresponding performance difference among the approaches can indicate their word mapping levels, we can also infer that, given high-quality training data, the performance distribution of all the approaches is likely to remain consistent, where TranS^3 can still outperform the other approaches in terms of the word-frequency metrics. Moreover, the performance of the tree-transformer-based encoder heavily relies on the quality of program forms. However, the experimental results indicate that the tree-transformer-based encoder can achieve better performance than the transformer-based encoder regardless of the quality of the program forms. On the other hand, the word-frequency-based metrics cannot fully reflect the semantic correctness of the approaches. To reduce this threat, we conducted a set of empirical studies in which developers give feedback on the quality of our code summarization and code search results. The positive study results strengthen the validity of the effectiveness and efficiency of TranS^3.

7. Related Work

7.1. Code Summarization

The code summarization techniques can be mainly categorized as information-retrieval-based approaches and deep-learning-based approaches.

Information-retrieval-based approaches. Wong et al. (Wong et al., 2013) proposed AutoComment, which leverages code-description mappings to generate description comments for similar code segments matched in open-source projects. Similarly, they also apply code clone detection techniques to find similar code snippets and extract comments from the similar code snippets (Wong et al., 2015). Movshovitz-Attias et al. (Movshovitz-Attias and Cohen, 2013) predicted comments from Java source files using topic models and n-grams. Haiduc et al. (Haiduc et al., 2010) combined IR techniques, i.e., Vector Space Model and Latent Semantic Indexing, to generate term-based summaries for Java classes and methods.

Deep-learning-based approaches. The deep-learning-based approaches usually leverage Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs) with the attention mechanism. For instance, Iyer et al. (Iyer et al., 2016) proposed CODE-NN, an RNN with an attention mechanism, to produce comments for C# code snippets and SQL queries. Allamanis et al. (Allamanis et al., 2016) proposed an attentional CNN on the input tokens to detect local time-invariant and long-range topical attention features to summarize code snippets into function name-like summaries. Considering the API information, Hu et al. (Hu et al., 2018b) proposed TL-CodeSum to generate summaries by capturing semantics from the source code with the assistance of API knowledge. Chen et al. (Chen and Zhou, 2018) proposed BVAE, which utilizes C-VAE to encode code and L-VAE to encode natural language. In addition to such encoder-decoder-based approaches, Wan et al. (Wan et al., 2018) drew on the insights of deep reinforcement learning to alleviate the exposure bias issue by integrating exploration and exploitation into the whole framework. Hu et al. (Hu et al., 2018a) proposed DeepCom, which takes the AST sequence obtained by traversing the AST as the input of NMT, and they also extended this work by considering hybrid lexical and syntactical information in (Hu et al., 2019). LeClair et al. (LeClair et al., 2019) combined words from code with code structure from the AST, which allows the model to learn code structure independent of the text in code.

7.2. Code Search

Code search techniques also mainly consist of information-retrieval-based approaches and deep-learning-based approaches.

Information-retrieval-based approaches. Hill et al. (Hill et al., 2014) proposed CONQUER, which integrates multiple feedback mechanisms into the search results view. Some approaches proposed to extend the queries. For example, Lu et al. (Lu et al., 2015) proposed to extend queries with synonyms generated from WordNet and then match them with phrases extracted from code identifiers to obtain the search results. Lv et al. (Lv et al., 2015) designed an API understanding component to figure out the potential APIs, expand the query with them, and retrieve relevant code snippets from the codebase. Similarly, Raghothaman et al. (Raghothaman et al., 2016) proposed SWIM, which first suggests an API set for a given query using a natural-language-to-API mapper extracted from search engine clickthrough data, and then uses a synthesizer to generate code with the suggested APIs.

Deep-learning-based approaches. The deep-learning-based approaches usually encode the code snippets and the natural language query into a hidden vector space, and then train a model to make the corresponding code and query vectors more similar in the hidden space. Gu et al. (Gu et al., 2018) proposed DeepCS, which reads code snippets and embeds them into vectors. Then, given a query, it returns the code snippets with the nearest vectors to the query. Luan et al. (Luan et al., 2018) proposed Aroma, which takes a code snippet as input, assembles a list of method bodies that contain the snippet, and clusters and intersects those method bodies to offer code recommendations. Different from the above approaches, Akbar et al. (Akbar and Kak, 2019) presented a framework that incorporates both ordering and semantic relationships between the terms and builds a one-hot encoding model to rank the retrieval results. Chen et al. (Chen and Zhou, 2018) proposed BVAE, which includes C-VAE and L-VAE to encode code and query respectively, based on which semantic vectors for both code and description are generated. Yao et al. (Yao et al., 2019) proposed CoaCor, which designs a rewarding mechanism to guide the code annotation model directly based on how effectively the generated annotation distinguishes the code snippet for code retrieval.

Other approaches. Sivaraman et al. (Sivaraman et al., 2019) proposed ALICE, which integrates active learning and inductive logic programming to incorporate partial user feedback and refine code search patterns. Takuya et al. (Takuya and Masuhara, 2011) proposed Selene, which uses the entire code being edited as the query and recommends code based on an associative search engine. Lemos et al. (Lemos et al., 2007) proposed a test-driven code search and reuse approach, which searches code according to the behavior of the desired feature.

8. Conclusion

In this paper, we propose TranS^3, which is a transformer-based framework to integrate code summarization with code search. Specifically, TranS^3 enables an actor-critic network for code summarization. In the actor network, we build transformer- and tree-transformer-based encoders to encode code snippets and decode the given code snippet to generate its comment. Meanwhile, we utilize the feedback from the critic network to iteratively tune the actor network for enhancing the quality of the generated comments. Furthermore, we import the generated comments to code search for enhancing its accuracy. We conduct a set of experimental studies and case studies to evaluate the effectiveness of TranS^3, where the experimental results suggest that TranS^3 can significantly outperform multiple state-of-the-art approaches in both code summarization and code search, and the study results further strengthen the efficacy of TranS^3 from the developers’ points of view.

References

  • S. A. Akbar and A. C. Kak (2019) SCOR: source code retrieval with semantics and order. In Proceedings of the 16th International Conference on Mining Software Repositories, pp. 1–12. Cited by: §1, §7.2.
  • M. Allamanis, H. Peng, and C. Sutton (2016) A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning, pp. 2091–2100. Cited by: §7.1.
  • S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, Vol. 29, pp. 65–72. Cited by: §5.2.1.
  • A. V. M. Barone and R. Sennrich (2017) A parallel corpus of python functions and documentation strings for automated code documentation and code generation. arXiv preprint arXiv:1707.02275. Cited by: §1, §5.1.
  • S. Bowman, G. Angeli, C. Potts, and C. Manning (2015) A large annotated corpus for learning natural language inference. In EMNLP, Cited by: §2.2.
  • L. Chen and L. Zhang (2018) Speeding up mutation testing via regression test selection: an extensive study. See DBLP:conf/icst/2018, pp. 58–69. External Links: Link, Document Cited by: §1.
  • Q. Chen and M. Zhou (2018) A neural framework for retrieval and summarization of source code. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pp. 826–831. Cited by: §1, §7.1, §7.2.
  • N. Craswell (2009) Mean reciprocal rank. Cited by: §5.2.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.2.
  • J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §4.2.3.
  • A. Ghanbari, S. Benton, and L. Zhang (2019) Practical program repair via bytecode mutation. See DBLP:conf/issta/2019, pp. 19–30. External Links: Link, Document Cited by: §1.
  • [12] Github. https://github.com/. Cited by: §5.1.
  • X. Gu, H. Zhang, and S. Kim (2018) Deep code search. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pp. 933–944. Cited by: §1, §1, §5.2.2, §7.2.
  • S. Haiduc, J. Aponte, L. Moreno, and A. Marcus (2010) On the use of automated text summarization techniques for summarizing source code. In Reverse Engineering, Cited by: §7.1.
  • E. Hill, M. Roldanvega, J. A. Fails, and G. Mallet (2014) NL-based query refinement and contextualized code search results: a user study. Cited by: §7.2.
  • X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin (2018a) Deep code comment generation. In Proceedings of the 26th Conference on Program Comprehension, pp. 200–210. Cited by: §1, §7.1.
  • X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin (2019) Deep code comment generation with hybrid lexical and syntactical information. Empirical Software Engineering, pp. 1–39. Cited by: §5.2.1, §5.2.1, §7.1.
  • X. Hu, G. Li, X. Xia, D. Lo, S. Lu, and Z. Jin (2018b) Summarizing source code with transferred api knowledge. In International Joint Conference on Artificial Intelligence 2018, pp. 2269–2275. Cited by: §1, §7.1.
  • Z. Huang, G. Zweig, M. Levit, B. Dumoulin, B. Oguz, and S. Chang (2013) Accelerating recurrent neural network training via two stage classes and parallelization. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 326–331. Cited by: §1.
  • S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer (2016) Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (1), pp. 2073–2083. Cited by: §1, §7.1.
  • Y. Keneshloo, T. Shi, C. K. Reddy, and N. Ramakrishnan (2018) Deep reinforcement learning for sequence to sequence models. arXiv preprint arXiv:1805.09461. Cited by: §2.3.
  • V. Konda (2003) Actor-critic algorithms. Siam Journal on Control & Optimization 42 (4), pp. 1143–1166. Cited by: §2.3.
  • A. LeClair, S. Jiang, and C. McMillan (2019) A neural model for generating natural language summaries of program subroutines. In Proceedings of the 41st International Conference on Software Engineering, pp. 795–806. Cited by: §1, §1, §7.1.
  • O. A. L. Lemos, S. K. Bajracharya, J. Ossher, R. S. Morla, P. C. Masiero, P. Baldi, and C. V. Lopes (2007) CodeGenie: using test-cases to search and reuse source code.. Ase, pp. 525–526. Cited by: §7.2.
  • X. Li, W. Li, Y. Zhang, and L. Zhang (2019) DeepFL: integrating multiple fault diagnosis dimensions for deep fault localization. See DBLP:conf/issta/2019, pp. 169–180. External Links: Link, Document Cited by: §1.
  • B. P. Lientz and E. B. Swanson (1980) Software maintenance management. IEE Proceedings E Computers and Digital Techniques Transactions on Software Engineering 127 (6). Cited by: §1.
  • C. Y. Lin (2004) Rouge: a package for automatic evaluation of summaries. Text Summarization Branches Out. Cited by: §5.2.1.
  • Y. Lou, J. Chen, L. Zhang, D. Hao, and L. Zhang (2019) History-driven build failure fixing: how far are we?. See DBLP:conf/issta/2019, pp. 43–54. External Links: Link, Document Cited by: §1.
  • M. Lu, X. Sun, S. Wang, D. Lo, and Y. Duan (2015) Query expansion via wordnet for effective code search. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), pp. 545–549. Cited by: §1, §7.2.
  • Y. Lu, Y. Lou, S. Cheng, L. Zhang, D. Hao, Y. Zhou, and L. Zhang (2016) How does regression test prioritization perform in real-world software evolution?. See DBLP:conf/icse/2016, pp. 535–546. External Links: Link, Document Cited by: §1.
  • S. Luan, D. Yang, K. Sen, and S. Chandra (2018) Aroma: code recommendation via structural code search. arXiv preprint arXiv:1812.01158. Cited by: §7.2.
  • F. Lv, H. Zhang, J. Lou, S. Wang, D. Zhang, and J. Zhao (2015) Codehow: effective code search based on api understanding and extended boolean model (e). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 260–270. Cited by: §1, §1, §7.2.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. Computer Science. Cited by: §4.1.
  • L. Moreno and A. Marcus (2017) Automatic software summarization: the state of the art. In 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C), pp. 511–512. Cited by: §1.
  • D. Movshovitz-Attias and W. W. Cohen (2013) Natural language models for predicting programming comments. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2, pp. 35–40. Cited by: §1, §1, §7.1.
  • K. Papineni, S. Roukos, T. Ward, and W. J. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §5.2.1.
  • M. Raghothaman, Y. Wei, and Y. Hamadi (2016) Swim: synthesizing what i mean-code search and idiomatic snippet synthesis. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pp. 357–367. Cited by: §7.2.
  • F. Scholer, H. E. Williams, and A. Turpin (2014) Query association surrogates for web search. Journal of the American Society for Information Science & Technology 55 (7), pp. 637–650. Cited by: §1.
  • A. Sivaraman, T. Zhang, G. Van den Broeck, and M. Kim (2019) Active inductive logic programming for code search. In Proceedings of the 41st International Conference on Software Engineering, pp. 292–303. Cited by: §7.2.
  • G. Sridhara, E. Hill, D. Muppaneni, L. Pollock, and K. Vijay-Shanker (2010) Towards automatically generating summary comments for java methods. In Proceedings of the IEEE/ACM international conference on Automated software engineering, pp. 43–52. Cited by: §1.
  • Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian, and H. Wu (2019) ERNIE: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223. Cited by: §2.2.
  • R. S. Sutton and A. G. Barto (1998) Introduction to reinforcement learning. Vol. 135, MIT press Cambridge. Cited by: §2.3.
  • W. Takuya and H. Masuhara (2011) A spontaneous code recommendation tool based on associative search. In Proceedings of the 3rd International Workshop on Search-Driven Development: Users, Infrastructure, Tools, and Evaluation, pp. 17–20. Cited by: §7.2.
  • S. Thrun and M. L. Littman (2005) Reinforcement learning: an introduction. IEEE Transactions on Neural Networks 16 (1), pp. 285–286. Cited by: §2.3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.2.
  • R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) Cider: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575. Cited by: §5.2.1.
  • E. M. Voorhees (2001) The trec question answering track. Natural Language Engineering 7 (4), pp. 361–378. Cited by: §2.2.
  • Y. Wan, Z. Zhao, M. Yang, G. Xu, H. Ying, J. Wu, and P. S. Yu (2018) Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pp. 397–407. Cited by: §5.2.1, §5.2.1, §7.1.
  • J. Wang, L. Yu, K. R. Lai, and X. Zhang (2016a) Dimensional sentiment analysis using a regional cnn-lstm model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 225–230. Cited by: §1.
  • S. Wang, D. Chollak, D. Movshovitz-Attias, and L. Tan (2016b) Bugram: bug detection with n-gram language models. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, pp. 708–719. Cited by: §2.1.
  • Y. Wang, L. Wang, Y. Li, D. He, T. Y. Liu, and W. Chen (2013) A theoretical analysis of ndcg type ranking measures. Journal of Machine Learning Research 30, pp. 25–54. Cited by: §5.2.2.
  • C. J. Watkins and P. Dayan (1992) Q-learning. Machine learning 8 (3-4), pp. 279–292. Cited by: §2.3.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pp. 5–32. Cited by: §2.3.
  • E. Wong, J. Yang, and L. Tan (2013) Autocomment: mining question and answer sites for automatic comment generation. In Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, pp. 562–567. Cited by: §1, §1, §7.1.
  • E. Wong, T. Liu, and L. Tan (2015) Clocom: mining existing source code for automatic comment generation. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), pp. 380–389. Cited by: §7.1.
  • L. Xuan, Z. Wang, Q. Wang, S. Yan, X. Tao, and M. Hong (2016) Relationship-aware code search for javascript frameworks. In ACM SIGSOFT International Symposium on Foundations of Software Engineering. Cited by: §5.2.2.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §2.2.
  • Z. Yao, J. R. Peddamail, and H. Sun (2019) CoaCor: code annotation for code retrieval with reinforcement learning. In The World Wide Web Conference, pp. 2203–2214. Cited by: §1, §1, §1, §5.2.1, §5.2.1, §5.2.2, §7.2.
  • L. Yu, W. Zhang, J. Wang, and Y. Yu (2017) Seqgan: sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §2.3.
  • M. Zhang, Y. Li, X. Li, L. Chen, Y. Zhang, L. Zhang, and S. Khurshid (2019) An empirical study of boosting spectrum-based fault localization via pagerank. IEEE Transactions on Software Engineering, pp. 1–1. ISSN 2326-3881. Cited by: §1.
  • M. Zhang, X. Li, L. Zhang, and S. Khurshid (2017) Boosting spectrum-based fault localization using pagerank. See DBLP:conf/issta/2017, pp. 261–272. Cited by: §1.
  • M. Zhang, Y. Zhang, L. Zhang, C. Liu, and S. Khurshid (2018) DeepRoad: gan-based metamorphic testing and input validation framework for autonomous driving systems. See DBLP:conf/kbse/2018, pp. 132–142. Cited by: §1.
  • H. Zhou, W. Li, Y. Zhu, Y. Zhang, B. Yu, L. Zhang, and C. Liu (2018) DeepBillboard: systematic physical-world testing of autonomous driving systems. CoRR abs/1812.10812. Cited by: §1.