Improving Stack Overflow question title generation with copying enhanced CodeBERT model and bi-modal information

Context: Stack Overflow is very helpful for software developers who are seeking answers to programming problems. Previous studies have shown that a growing number of questions are of low quality and thus obtain less attention from potential answerers. Gao et al. proposed an LSTM-based model (i.e., BiLSTM-CC) to automatically generate question titles from code snippets to improve question quality. However, using only the code snippets in the question body cannot provide sufficient information for title generation, and LSTMs cannot capture the long-range dependencies between tokens. Objective: We propose CCBERT, a novel deep learning based model that enhances the performance of question title generation by making full use of the bi-modal information of the entire question body. Methods: CCBERT follows the encoder-decoder paradigm, using CodeBERT to encode the question body into hidden representations, a stacked Transformer decoder to generate predicted tokens, and an additional copy attention layer to refine the output distribution. Both the encoder and decoder perform the multi-head self-attention operation to better capture the long-range dependencies. We build a dataset containing more than 120,000 high-quality questions filtered from the data officially published by Stack Overflow to verify the effectiveness of the CCBERT model. Results: CCBERT achieves a better performance on the dataset, outperforming BiLSTM-CC and a multi-purpose pre-trained model (BART) by 14% and 4% on average, respectively. Experiments on both code-only and low-resource datasets also show the superiority of CCBERT, with 25% and 5% performance degradation respectively, compared with 40% and 13.5% for BiLSTM-CC.






1 Introduction

Stack Overflow (SO) is one of the most successful communities for software developers, who can seek answers to programming problems from peers. Users with a good reputation in SO have the right to modify and close other users' inappropriate questions. The open-data policy of SO has been attracting intense research interest [Chakraborty2021HowDD, Rubei2020PostFinderMS, Uddin2020MiningAU]. A recent study by Mondal et al. [mondal2021early] shows that a growing number of open questions in SO remain unanswered, partly because some developers fail to write high-quality questions.

A question in SO generally consists of three components: title, body, and tags. The SO community has given many writing suggestions in the official tutorial, such as making a clear and informative title, introducing the problem before posting any code, and including all relevant tags. Researchers have also made great efforts to help improve question quality from these components. For example, Wang et al. [wang2018users] looked into the influence of crowd-sourced revision on question body quality. Wang et al. [wang2019sotagrec] built a hybrid tag recommender based on convolutional neural networks and collaborative filtering. Recently, Gao et al. [gao2020generating] for the first time proposed an end-to-end approach to generate question titles from given code snippets. Previous studies [arora2015good, calefato2018ask, correa2013fit, yao2013want] on SO have also demonstrated the importance of question titles to the overall quality of questions. Therefore, in this paper, we also focus on automatically generating high-quality question titles.

Gao et al. [gao2020generating] used a Bi-directional Long Short-Term Memory (BiLSTM) network incorporating the copy [Gu2016IncorporatingCM] and coverage [Tu2016ModelingCF] mechanisms, which we refer to as "BiLSTM-CC", to generate titles from code snippets mined in the corresponding question bodies. Despite the encouraging performance they achieved, we argue that LSTMs may lack the ability to parse long-range dependencies. In addition, according to the official suggestions, developers are discouraged from posting only source code as questions. Since a question body usually consists of bi-modal content (i.e., text descriptions and code snippets), the information inferred from code snippets without their surrounding context can be incomplete and misleading.

In this paper, we redefine the task proposed by Gao et al. [gao2020generating] to Title Generation from the Entire Question Body, namely TGEQB. We formulate this task as an abstractive summarization problem, and also propose our CCBERT model, which combines the Copy mechanism [Gu2016IncorporatingCM] to handle rare tokens and the pre-trained CodeBERT [Feng2020CodeBERTAP] model to parse bi-modal content. We follow the encoder-decoder paradigm, and use CodeBERT to encode question bodies into hidden representations, a stacked Transformer decoder to generate predicted tokens, and an additional copy attention layer to refine the output distribution. Both our encoder and decoder perform the multi-head self-attention operation, which helps CCBERT better capture the long-range dependencies than LSTMs.

To verify the effectiveness of our model, we build a large-scale dataset with more than 120,000 high-quality sample questions filtered from the data officially published by Stack Overflow in December 2020, which contains all the historical questions from 2008 to 2020. These samples are further split into Python and Java subsets for separate evaluation. We employ BLEU and ROUGE as the evaluation metrics, and compare CCBERT with four baseline models, i.e., TF-IDF, BiLSTM, BiLSTM-CC, and BART. The experimental results show that CCBERT outperforms all the baseline models on both subsets, performing 14% better than BiLSTM-CC and 4% better than BART on average. In addition, the instance analysis shows that most of our generated titles are readable and semantically relevant to the original ones. We further conduct code-only and low-resource experiments to evaluate the generalization performance and data efficiency of our model. Results show that CCBERT suffers 25% and 5% performance degradation under the two circumstances respectively, compared with 40% and 13.5% for BiLSTM-CC.

The contributions of our work are as follows:

  • We introduce a new task named TGEQB, which is to generate high-quality titles from the entire question bodies containing bi-modal content in order to help improve the question quality.

  • We propose a novel model named CCBERT, which combines the copy mechanism and CodeBERT to handle rare tokens and long-range dependencies in the bi-modal context.

  • We have released our dataset and all relevant source code to facilitate future research and application.

The rest of this paper is organized as follows: Section 2 reveals the motivation of our work. Section 3 introduces the details of our proposed approach. Section 4 describes the basic setup of our experiment, including data collection, baseline models, evaluation metrics, and model settings. Section 5 presents the experimental results. Section 6 introduces the related works. Section 7 discusses threats to the validity of our work. Finally, we conclude this paper and introduce the future work in Section 8.

2 Motivation

Figure 1: The proportion of questions with text descriptions or code snippets
Figure 2: The overlap between title and code snippets/text descriptions

Generally, questions in Stack Overflow contain text descriptions and code snippets in their bodies, and developers should refer to both of them when writing titles. It is challenging for an LSTM-based model like the one in Gao et al.'s work [gao2020generating] to parse an entire long question body. Therefore, in this section, we investigate the necessity of title generation using bi-modal content and the challenge of long-sequence parsing.

Based on the aforementioned data published by SO, we filter out questions that do not meet the conditions described in Section 4.1, and then obtain a collection of 3.2 million questions that we regard as high-quality ones. Then, we split every question body into text descriptions and code snippets. Next, we separately count the questions containing text descriptions/code snippets by year. We draw a line chart in Fig. 1 to show the statistical results, where the x-axis denotes the year and the y-axis denotes the proportion of questions with text descriptions or code snippets. We find that the proportion of questions containing text descriptions remains almost unchanged (nearly 100%) every year, while the proportion of questions containing code snippets has been increasing in recent years, reaching 90% in 2020. Since the evidence shows that a high-quality question is very likely to contain both text descriptions and code snippets, we can neglect neither of them when trying to understand the overall meaning of the question.

Then, we conduct another statistical experiment to figure out the respective impact of text descriptions and code snippets on the writing of question titles. More specifically, we extract the high-quality questions posted in 2019 and 2020 as a representative of the latest situation. Then we tokenize (using our method described in Section 4.1) both the title and the code snippets/text descriptions of every question to find out how many tokens in the title also appear in the code snippets/text descriptions (i.e., the overlap ratio). We draw two histograms in Fig. 2 to represent the distributions of the overlap ratio between titles and code snippets/text descriptions, where the x-axis denotes the overlap ratio and the y-axis denotes the occurrence probability of different overlap ratio intervals. According to the mathematical expectation of these two distributions, a token in a question title has a 23% probability of appearing in the code snippets and a 58% probability of appearing in the text descriptions. Therefore, we conclude that both text and code in question bodies should be considered when writing titles, and text descriptions deserve more consideration.
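The overlap statistic above can be computed in a few lines; the sketch below uses whitespace tokenization for brevity instead of the tokenizer described in Section 4.1, and the sample sentences are illustrative:

```python
def overlap_ratio(title_tokens, body_tokens):
    """Fraction of title tokens that also appear in the body."""
    if not title_tokens:
        return 0.0
    body_set = set(body_tokens)
    hits = sum(1 for tok in title_tokens if tok in body_set)
    return hits / len(title_tokens)

title = "how to insert data to sqlite".split()
text = "i tried to insert data into a sqlite database".split()
print(overlap_ratio(title, text))
```

Averaging this ratio over all sampled questions (separately for code snippets and text descriptions) yields the expectations reported above.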

Figure 3: The length distribution of the entire content and code snippets in the question body

Finally, we find that the entire content of a question body can be very long, which brings a new challenge to our title generation model. We draw two box plots in Fig. 3 to represent the length distributions of the entire question body and the code snippets in our experimental dataset (described in Table 1). We find that questions related to the Python and Java languages share a similar length distribution, where the code snippets occupy less than half of the body content. In addition, over 24% of questions have more than 300 tokens in their bodies, and some questions even have 500 tokens or more. Sequential models like LSTMs struggle to capture the long-range dependencies between tokens, so we apply a Transformer-based approach to tackle this problem.

Figure 4: The three-step framework of our approach for Stack Overflow question title generation

3 Proposed Approach

We describe the details of our approach in this section, including the overall framework and the detailed architecture of our CCBERT model.

3.1 Overview

We aim to help developers ask high-quality questions with a better chance to get answers in Stack Overflow. Considering that it is not enough to generate high-quality titles using only code snippets, we introduce a new title generation task to take into account the bi-modal information of both text descriptions and code snippets in question bodies.

Our approach demonstrated in Fig. 4 shares a similar framework with Gao et al.’s work [gao2020generating], which contains three steps, i.e., data preparation, model training, and title generation. As for data preparation, we first carefully filter the raw data published by SO to make sure that our dataset only contains the high-quality questions. Then, we organize the questions into "<body, title>" pairs. As for model training, we respectively feed the bi-modal question bodies and question titles into the CodeBERT encoder and the Transformer decoder. When our CCBERT model shows its best performance on the validation set, we use it to generate titles on the test set.

3.2 CCBERT

We propose CCBERT, a novel model combining the pre-trained CodeBERT model and the copy mechanism. It is an attentional encoder-decoder system, which can be trained and used in an end-to-end manner. Our model architecture is illustrated in Fig. 5. Formally, given a token sequence $x$ of a question body and a token sequence $y$ of its corresponding question title sampled from the data distribution $Q$, CCBERT learns to generate $y$ based on $x$, and updates its parameters to maximize the log-likelihood of the data.

3.2.1 CodeBERT Encoder

Unlike the general models used for summarization [Gehrmann2018BottomUpAS, Liu2019TextSW, See2017GetTT], our model needs to understand both Natural Language (NL) and Programming Language (PL), which is determined by the characteristics of our dataset. Recently, Feng et al. [Feng2020CodeBERTAP] introduced CodeBERT, which can capture the semantic relationship between NL and PL and produce vector representations that support downstream tasks, such as defect prediction [Pan2021AnES], program repair [Mashhadi2021ApplyingCF], etc. CodeBERT is a stack of multiple Transformer-encoder layers that mainly perform the bidirectional self-attention operation.

During the pre-training period, CodeBERT is trained through two objectives, namely Masked Language Modeling (MLM) and Replaced Token Detection (RTD), on data extracted from GitHub repositories containing source code and code comments that follow the data distribution $P$. We briefly describe these two objectives in the following. First, given a sequence $x$ containing both NL and PL tokens sampled from $P$, a random set of positions (i.e., $m$) of $x$ is selected to be masked out with a special [MASK] token, and we get the masked sequence

$$x^{masked} = \mathrm{REPLACE}(x, m, \mathrm{[MASK]})$$

The MLM objective is to predict the masked tokens; its loss function is formulated as

$$\mathcal{L}_{MLM}(\theta) = \sum_{i \in m} -\log p^{D_1}\!\left(x_i \mid x^{masked}\right)$$

where $\theta$ denotes the parameters of CodeBERT, $x_i$ and $x^{masked}$ denote the $i$-th token of the original sequence and the masked sequence respectively, and $p^{D_1}$ is a discriminator that predicts the probability of every token's presence at a specific position over its vocabulary.

Then, some plausible alternative tokens are placed in the masked positions using a generator $G$, and we get the corrupted sequence $x^{corrupt}$, formulated as follows:

$$x^{corrupt} = \mathrm{REPLACE}\!\left(x, m, G\!\left(x^{masked}\right)\right)$$

The RTD objective is to determine whether a token is replaced or not; its loss function is formulated as follows:

$$\mathcal{L}_{RTD}(\theta) = \sum_{i=1}^{|x|} -\left[\delta(i)\log p^{D_2}\!\left(x^{corrupt}, i\right) + \left(1 - \delta(i)\right)\log\left(1 - p^{D_2}\!\left(x^{corrupt}, i\right)\right)\right]$$

where $p^{D_2}$ is the discriminator that predicts the probability of a specific token being replaced, and $\delta(i)$ is an indicator function ($\delta(i) = 1$ if the $i$-th token is replaced, and $\delta(i) = 0$ otherwise). The final loss function for the pre-training stage of CodeBERT is

$$\mathcal{L}(\theta) = \mathcal{L}_{MLM}(\theta) + \mathcal{L}_{RTD}(\theta)$$
During the fine-tuning period, we use CodeBERT as our encoder and further update its parameters $\theta$ to fit data drawn from the distribution $Q$ instead of $P$. Formally, given a question body containing text descriptions and code snippets, we first turn it into a sequence of tokens (i.e., $x = (x_1, \dots, x_n)$) with a byte pair encoding tokenizer, which is built into the CodeBERT model. Then, we surround $x$ with two special tokens (i.e., $x = (\langle s \rangle, x_1, \dots, x_n, \langle /s \rangle)$) to be consistent with the data format used during pre-training. After the preparation, we feed $x$ to the encoder and get a matrix $H^{enc}$ that consists of the encoded vectors of all input tokens

$$H^{enc} = \mathrm{Encoder}(x) = \left(h_1^{enc}, \dots, h_n^{enc}\right)$$

where $H^{enc} \in \mathbb{R}^{n \times d}$ and each vector $h_i^{enc}$ is a hidden representation of the semantic relationship of a token against the others. $H^{enc}$ is passed to the decoder and the copy attention layer for further operations.

Figure 5: The detailed structure of CCBERT at the $t$-th decoding step

3.2.2 Transformer Decoder

We follow the implementation of the vanilla Transformer decoder [Vaswani2017AttentionIA] by stacking several identical layers that perform the multi-head self-attention mechanism. The input to our decoder is two-fold: one is the hidden vectors $H^{enc}$ provided by the encoder, and the other is a token sequence (i.e., $y = (\langle SOS \rangle, y_1, \dots)$; "SOS" and "EOS" mean "the start of a sequence" and "the end of a sequence" respectively) guiding the generation of the next token. Suppose that we are now going to generate the $t$-th token (i.e., at the $t$-th decoding step).

During training, we use the original title sequence as input in the manner of teacher forcing [Sutskever2014SequenceTS]. Besides, we need a special mask $M$ to prevent tokens at the tail from getting involved in the self-attention operation. This is called masked self-attention, and in this way we can generate tokens at all positions simultaneously. (During inference, we follow the manner of auto-regression and use the sequence generated by our model up to step $t-1$ as input. In this situation, all the previously generated tokens should be considered to predict the next token, so no future positions need to be masked.)

Then, our decoder uses the same embedding layer shared with the encoder to turn the input sequence into a matrix $E^{dec}$ containing the embedding vectors of the input tokens. Finally, we feed $E^{dec}$, $H^{enc}$, and $M$ to the decoder and get a matrix $H^{dec}$ containing the hidden representations of all predicted tokens

$$H^{dec} = \mathrm{Decoder}\!\left(E^{dec}, H^{enc}, M\right) = \left(h_1^{dec}, \dots, h_t^{dec}\right)$$

where $H^{dec} \in \mathbb{R}^{t \times d}$. We take the vector $h_t^{dec}$ as the hidden representation of the token predicted by our decoder at step $t$. Both $H^{enc}$ and $h_t^{dec}$ are passed to the copy attention layer for further operations.
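The training-time mask described above can be illustrated as a lower-triangular 0/1 matrix; a minimal pure-Python sketch (the real model materializes this as a t-by-t tensor):

```python
def subsequent_mask(t):
    # mask[i][j] == 1 lets decoding position i attend to position j;
    # positions after i are blocked so that teacher forcing cannot leak
    # future tokens into the self-attention operation.
    return [[1 if j <= i else 0 for j in range(t)] for i in range(t)]

for row in subsequent_mask(4):
    print(row)
```

During inference the auto-regressive decoder only ever sees already generated tokens, which is why no future positions need to be masked there.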

3.2.3 Copy Attention Layer

According to the previous statistics, we find that most tokens in question titles appear in question bodies. Besides, some rare tokens in question bodies also exist in question titles, such as variable names, class libraries, application frameworks, etc. In this case, we incorporate the copy mechanism to facilitate our model to copy tokens from the question body to the generated title. Following the practice of the pointer-generator network [See2017GetTT], which directly takes the attention distribution as the probability of copying a specific token, we implement our copy attention layer above the encoder and the decoder. Suppose that we are continuing to generate the $t$-th token; the copy attention layer takes the encoder hidden states $H^{enc}$, the embedding vector $e_t$, and the decoder hidden state $h_t^{dec}$ as inputs. Instead of using the cross-attention between the encoder and decoder computed in the Transformer decoder, we calculate the encoder-decoder attention as in traditional sequence-to-sequence networks:

$$a_i^t = v^{T}\tanh\!\left(W_h h_i^{enc} + W_s h_t^{dec} + b_{attn}\right), \qquad a^t = \mathrm{softmax}\!\left(\left(a_1^t, \dots, a_n^t\right)\right)$$

where $W_h$, $W_s$, $b_{attn}$, and $v$ are learnable parameters, $a_i^t$ is the attention score between the decoder hidden state and the hidden state of the $i$-th token in the source input, and $a^t$ is a vector containing the attention scores between the decoder hidden state and all the encoder hidden states.

Then, $a^t$ along with $H^{enc}$ is used to calculate the context vector

$$h_t^{*} = \sum_{i=1}^{n} a_i^t\, h_i^{enc}$$

which is further used to get the probability distribution $P_{vocab}$ over the target vocabulary, and the copy probability $p_{copy}$ that determines whether to copy from the input tokens or not; the calculations are formulated as follows:

$$P_{vocab} = \mathrm{softmax}\!\left(V'\left(V\left[h_t^{dec}; h_t^{*}\right] + b\right) + b'\right)$$

$$p_{copy} = \sigma\!\left(w_{h^*}^{T} h_t^{*} + w_s^{T} h_t^{dec} + w_e^{T} e_t + b_{copy}\right)$$

where matrices $V$, $V'$, vectors $w_{h^*}$, $w_s$, $w_e$, $b$, $b'$, and scalar $b_{copy}$ are learnable parameters.

Finally, we can get the probability distribution of the $t$-th token in the title:

$$P(y_t) = p_{copy} \sum_{i:\, x_i = y_t} a_i^t + \left(1 - p_{copy}\right) P_{vocab}(y_t)$$

The training loss for step $t$ is the negative log-likelihood of the target token $y_t^{*}$:

$$loss_t = -\log P\!\left(y_t^{*}\right)$$

The overall loss for the whole sequence is

$$loss = \frac{1}{T} \sum_{t=1}^{T} loss_t$$

which is used to update the parameters of our CCBERT model through backpropagation.
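The mixing step of the copy mechanism can be sketched in a few lines; the vocabulary, attention scores, and copy probability below are toy numbers rather than outputs of the real model:

```python
def final_distribution(p_vocab, attn, src_tokens, p_copy):
    """Mix the generation distribution with the copy distribution.

    p_vocab:    dict token -> generation probability (sums to 1)
    attn:       attention score for each source position (sums to 1)
    src_tokens: source token at each position
    p_copy:     probability of copying from the source
    """
    # Scale the vocabulary distribution by (1 - p_copy) ...
    final = {tok: (1.0 - p_copy) * p for tok, p in p_vocab.items()}
    # ... and add p_copy times the attention mass of each source token,
    # which lets out-of-vocabulary source tokens receive probability.
    for a_i, tok in zip(attn, src_tokens):
        final[tok] = final.get(tok, 0.0) + p_copy * a_i
    return final

dist = final_distribution(
    p_vocab={"how": 0.6, "insert": 0.3, "data": 0.1},
    attn=[0.7, 0.3],
    src_tokens=["TriangleMesh", "insert"],  # "TriangleMesh" is OOV
    p_copy=0.5,
)
```

Note how the rare identifier "TriangleMesh", absent from the toy vocabulary, still ends up with probability mass, which is exactly what allows the model to emit rare tokens from the question body.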

After training, we use the method of beam search to generate each token in the predicted title. To be specific, a special token <SOS>, along with the encoded vectors of a question body, is fed to the decoder to obtain the first candidate tokens chosen by beam search. Recursively, the generated sequences are fed again to the decoder for further generation. The process ends when it meets the <EOS> token.
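The decoding loop can be sketched as follows; `toy_model` is a hypothetical, hard-coded next-token distribution standing in for the trained decoder:

```python
import math

def beam_search(next_token_probs, beam_size, max_len):
    """Toy beam search. `next_token_probs(seq)` returns a dict mapping
    each candidate next token to its probability given `seq`."""
    beams = [(["<SOS>"], 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in next_token_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # Sequences ending in <EOS> are complete; others stay live.
            (finished if seq[-1] == "<EOS>" else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]

def toy_model(seq):
    if seq[-1] == "<SOS>":
        return {"how": 0.9, "what": 0.1}
    return {"<EOS>": 1.0}

print(beam_search(toy_model, beam_size=2, max_len=5))
```

In the real system the beam size is 10 (Section 4.4) and the scores come from the copy-refined distribution rather than a hard-coded table.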

4 Experimental Setup

In this section, we illustrate the construction procedure of our dataset, the baseline models, the evaluation metrics, and the hyperparameter settings for our CCBERT model.

4.1 Data Preparation

We use the aforementioned data published by Stack Overflow that contains all the historical questions from 2008 to 2020 to build our dataset. Each line of the data is a record having the complete information of a question post, including post time, scoring, tags, body, title, etc.

We extract the questions related to the Python and Java programming languages according to question tags. We split them into the Python subset and Java subset, which are used individually for training and testing to verify the generality of our model in different domains. To avoid the influence of noise tags, we further filter out the records having the tags of other programming languages, including "JavaScript", "C#", "PHP", "HTML", and "C++" (the most popular language tags in SO other than Python and Java), in these subsets. Then, we only reserve the questions that meet the following conditions:

  1. The question is not closed;

  2. The question has an accepted answer;

  3. The question gets more than one vote.

We also follow Gao et al. [gao2020generating] and filter out questions that do not contain interrogative keywords such as "how", "what", "why", "which", or "when" in their titles. Next, we parse the question bodies and turn them into segments composed of text descriptions and code snippets. For clarification, question bodies are organised in HTML format, so we regard the content wrapped in "<code></code>" tags as code snippets, and the rest as text descriptions. Finally, we remove questions that do not include both text descriptions and code snippets in their bodies. Up to this point, there are good reasons to believe the selected questions are high-quality ones in SO.
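A minimal sketch of this bi-modal splitting, using the standard-library `re` module on a simplified body (real posts wrap code in <pre><code> blocks and also need HTML entity unescaping):

```python
import re

def split_body(html_body):
    """Split a question body into text descriptions and code snippets."""
    code_snippets = re.findall(r"<code>(.*?)</code>", html_body, re.DOTALL)
    text = re.sub(r"<code>.*?</code>", " ", html_body, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", " ", text)      # drop remaining HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    return text, code_snippets

body = "<p>My insert fails:</p><code>db.insert(TABLE_NAME)</code>"
text, code = split_body(body)
```

Questions where either `text` or `code` comes back empty are the ones removed in the last filtering step above.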

In addition, we notice that the NLTK tokenizer cannot separate special tokens in code snippets, which leads to an incredibly large vocabulary and exacerbates the out-of-vocabulary problem. To tackle this issue, we design a simple tokenizing algorithm that indiscriminately splits on all special characters in the ASCII charset. In this way, we get a smaller vocabulary but longer sequences in return. We therefore further count the length of the tokenized question bodies and filter out the 5% of questions whose body length exceeds 1,000 tokens or whose title length exceeds 25 tokens. In the end, we obtain more than 60,000 questions for each subset.
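The splitting rule can be sketched with a one-line regular expression; treating every non-alphanumeric ASCII character (even underscores) as its own token is our reading of "split all special characters indiscriminately":

```python
import re

def tokenize(s):
    """Split on every non-alphanumeric ASCII character, keeping each
    separator as its own token; whitespace tokens are discarded."""
    return [t for t in re.split(r"([^A-Za-z0-9])", s)
            if t and not t.isspace()]

print(tokenize("db.insert(TABLE_NAME, null, values)"))
```

Compared with a word-level tokenizer, identifiers such as `db.insert` no longer enter the vocabulary as single opaque tokens, which is what shrinks the vocabulary at the cost of longer sequences.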

   Subset  Training  Validation  Test
   Java    57118     2000        2000
   Python  60458     2000        2000
Table 1: The partition sizes of our dataset

As for data partitioning, considering the target leakage problem between the training set and test set, we sort the questions in chronological order and choose the latest records for validating and testing, and the rest for training. The statistics of our processed dataset are shown in Table 1.

4.2 Comparisons

To demonstrate how competitive CCBERT is, we choose several state-of-the-art models as baselines, which have been widely studied in the field of natural language processing. We briefly introduce the general ideas of these models in the following.


  1. TF-IDF     This method is a classic full-text searching algorithm; its name stands for "Term Frequency-Inverse Document Frequency". TF-IDF is a weighting algorithm for a bag-of-words language model. To be specific, the "bag" contains a list of unique terms sourced from a given corpus. A paragraph can be turned into a vector by counting its in-bag terms' frequency (TF). Considering the fact that the probability of a term's occurrence is often in inverse proportion to its importance, one can divide TF by the term's frequency of appearance across all documents (DF) to get the revised weight of each term. In this way, we can calculate the distance between paragraphs in the vector space. In our experiment, we use Lucene (Apache Lucene computes similarity using TF-IDF by default) to find the most similar question in the training set given a question body.

  2. BiLSTM     Long Short-Term Memory networks (LSTMs) are a special kind of Recurrent Neural Networks (RNNs), with an additional cell state and three carefully designed "gates" to overcome the problem of long-term dependencies. The idea of Bidirectional LSTMs (BiLSTMs) is to duplicate the first recurrent layer in the network, and then provide the input sequence to the first layer and a reversed copy of the input to the second, so that all available information in the past and future of a specific processing step can be considered during training. We stack two BiLSTM layers as the encoder and two LSTM layers as the decoder, along with the attention mechanism introduced by Bahdanau et al. [Bahdanau2015NeuralMT], to build a model as our baseline, which we refer to as "BiLSTM".

  3. BiLSTM-CC    This was the method used by Gao et al. [gao2020generating] to generate question titles from code snippets. It shares the same structure as the above-mentioned BiLSTM model, except that it assembles another two non-trivial mechanisms. One is the copy mechanism we have illustrated above; the other is the coverage mechanism. Tu et al. [Tu2016ModelingCF] first introduced the "coverage" vector, which keeps track of the attention history and further facilitates the attention calculation, so that a neural machine translation system pays more attention to untranslated words. Gao et al. [gao2020generating] took advantage of the coverage penalty to suppress meaningless repetitions during generation. In our experiment, we build the BiLSTM and BiLSTM-CC models with OpenNMT, which is a well-acknowledged framework for building sequence-to-sequence models.

  4. BART     Lewis et al. [Lewis2020BARTDS] proposed the BART model to bridge the gap between the pre-trained Bidirectional encoder (i.e., BERT [Devlin2019BERTPO]) and the pre-trained Auto-Regressive Transformer (i.e., GPT [Radford2019LanguageMA]), which are good at comprehension and generation tasks respectively. BART is pre-trained under the supervision of several denoising objectives, where input text is corrupted by a stochastic noising function and the model is required to reconstruct the original text. BART is particularly effective when fine-tuned for neural machine translation and abstractive text summarization tasks, such as WMT, CNN/DailyMail, XSum, etc. We use the open-source code and pre-trained parameters of BART to verify its performance on our dataset.
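The retrieval idea behind the TF-IDF baseline can be sketched with a toy corpus; the questions, query, and weighting below are illustrative stand-ins for Lucene's internal scoring, not a reimplementation of it:

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Weight each in-vocabulary term by TF times log-inverse-DF."""
    tf = Counter(tokens)
    return {t: c * math.log(n_docs / df[t]) for t, c in tf.items() if t in df}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = [
    "how to insert data into a sqlite database".split(),
    "how to create a custom shape with javafx".split(),
]
df = Counter(t for doc in corpus for t in set(doc))
vectors = [tfidf_vector(doc, df, len(corpus)) for doc in corpus]

query = tfidf_vector("insert rows into my sqlite database".split(),
                     df, len(corpus))
# The title of the nearest training question is returned as the answer.
best = max(range(len(corpus)), key=lambda i: cosine(query, vectors[i]))
```

Terms shared by every document (here "how", "to", "a") receive zero weight, which is the inverse-document-frequency effect described above.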

In addition to the above baselines, we implement an oracle method to show the best performance of an extractive model.


  5. Oracle     The idea of extractive summarization, which is to select primary sentences that best match the target summary, inspires us to explore the possibility of making up a title only using tokens that appear in the question body. However, there are millions of permutations of a title containing tens of tokens, which is a much more complicated situation than selecting and arranging sentences. In addition, tokens arranged in the correct order do not necessarily make sense. So, instead of building another baseline model, we remove the tokens in a question title if they are not in the question body and keep the rest as the "generated" title, to simulate the best performance of an extractive model. Considering that our objective is to maximize the BLEU and ROUGE scores, we follow the work of Liu et al. [Liu2019TextSW] and implement another method based on beam search (with a beam width of 20) to find the permutation that performs best on these two metrics. It turns out that the second method does no better than the first on both metrics due to the limited search space. Therefore, we use the simple method mentioned above as the oracle, indicating the best possible result we can expect from an extractive model.
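The simple oracle amounts to one list comprehension; the example below reuses a lowercased, abbreviated version of the first sample question from the instance analysis:

```python
def oracle_title(title_tokens, body_tokens):
    """Keep only the title tokens that also appear in the question body,
    preserving their original order."""
    body_set = set(body_tokens)
    return [t for t in title_tokens if t in body_set]

kept = oracle_title(
    "how to create such shape using javafx trianglemesh".split(),
    ("i need to create this shape i understand how to create simple "
     "shapes such as a cube trianglemesh mesh").split(),
)
print(" ".join(kept))
```

The tokens "using" and "javafx" are dropped because they never occur in the body, so no extractive method could produce them.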

4.3 Evaluation Metrics

Similar to Gao et al. [gao2020generating], we employ two automated evaluation metrics widely used in text generation tasks to determine the similarity between the generated titles and the original ones.

4.3.1 BLEUS-4

The Bi-Lingual Evaluation Understudy (BLEU) method was first introduced by Papineni et al. [Papineni2002BleuAM] to measure the performance of a translation system. The first step in this method is to compute a modified n-gram precision

$$p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{n\text{-}gram \in C} Count_{clip}(n\text{-}gram)}{\sum_{C' \in \{Candidates\}} \sum_{n\text{-}gram' \in C'} Count(n\text{-}gram')}$$

where $Count_{clip}$ is the total count of each candidate n-gram clipped by its maximum reference count. The next step is to compute a brevity penalty

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases}$$

where $c$ is the length of a candidate translation and $r$ is the length of the effective reference corpus. Then, we can get the BLEU score

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

In our experiment, we choose $N = 4$ with uniform weights $w_n = 1/N$ to have the BLEU-4 score. Besides, we apply a smoothing method introduced by Lin et al. [Lin2004ORANGEAM] to add one to the hit count and the total count for $n \geq 2$. In this way, candidate translations with fewer than $n$ words can still get a positive score. We refer to the smoothed method as BLEUS-4 and use its NLTK implementation in our experiment.
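To make the computation concrete, here is a minimal pure-Python sketch of the smoothed sentence-level score; it implements the add-one smoothing described above directly rather than calling NLTK, so treat it as illustrative:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleus_4(candidate, reference):
    """Smoothed BLEU-4: add one to the clipped hit count and the total
    count for n >= 2, so short candidates still score above zero."""
    log_sum = 0.0
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        hits = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if n >= 2:  # add-one smoothing
            hits, total = hits + 1, total + 1
        if total == 0 or hits == 0:
            return 0.0
        log_sum += 0.25 * math.log(hits / total)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(log_sum)
```

An exact match scores 1.0, and any shared unigram keeps the score positive thanks to the smoothing of the higher-order terms.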

4.3.2 ROUGE

The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) was introduced by Lin [Lin2004ROUGEAP] to measure the quality of machine-generated summaries. It consists of several measures, including ROUGE-N and ROUGE-L, which are used in our experiment. Complementary to BLEU's bias toward precision, ROUGE-N focuses on recall, which is calculated as

$$\text{ROUGE-N} = \frac{\sum_{S \in \{References\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{References\}} \sum_{gram_n \in S} Count(gram_n)}$$

where $Count_{match}(gram_n)$ is the maximum number of n-grams co-occurring in a candidate summary and the reference summaries. ROUGE-L takes advantage of both the Longest Common Subsequence (LCS) and the F-measure to estimate the similarity between two summaries, the candidate summary $Y$ of length $n$ and the reference summary $X$ of length $m$. The calculation is as follows:

$$R_{lcs} = \frac{LCS(X, Y)}{m}, \qquad P_{lcs} = \frac{LCS(X, Y)}{n}, \qquad F_{lcs} = \frac{\left(1 + \beta^2\right) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

and we use $F_{lcs}$ in our experiment. We choose ROUGE-1, ROUGE-2, and ROUGE-L implemented by an open-source library for the evaluation metrics.
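A minimal sketch of ROUGE-L on token lists follows; the fixed β value is an illustrative assumption, whereas the actual evaluation library configures it internally:

```python
def lcs_len(x, y):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj \
                else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-measure between a candidate and a reference summary."""
    lcs = lcs_len(reference, candidate)
    if lcs == 0:
        return 0.0
    r = lcs / len(reference)   # recall
    p = lcs / len(candidate)   # precision
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)
```

Because the LCS need not be contiguous, ROUGE-L rewards titles that preserve the reference word order even when extra tokens are interleaved.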

4.4 Model Settings

We implement our encoder with the pre-trained parameters of CodeBERT-base and keep its initial settings, where the vocabulary size is 50265, the hidden size is 768, the dropout probability is 0.1, and the number of Transformer layers is 12. Accordingly, we build a 12-layer decoder with randomly initialized parameters. Optimization is performed using the adaptive moment estimation (Adam) algorithm. We also apply a linear warm-up strategy to gradually increase the learning rate during the first 10% of training steps. Four NVIDIA GeForce RTX 2080 Ti GPUs are used to train our model, where the number of training epochs is 10 and the batch size is 32. During decoding, the beam size is set to 10. We tune all the hyperparameters on the validation set and report the evaluation results on the test set.

5 Results and Analysis

  Subset  Model      BLEUS-4  ROUGE-1  ROUGE-2  ROUGE-L
  Python  Oracle     51.44    82.40    62.55    81.82
          TF-IDF     10.26    21.88     5.29    21.02
          BiLSTM     15.93    37.15    14.24    36.55
          BiLSTM-CC  18.84    41.49    18.98    40.56
          BART       20.43    44.93    21.77    43.97
          CCBERT     22.02    46.69    22.55    44.86
  Java    Oracle     54.59    83.76    65.14    83.31
          TF-IDF      9.79    19.91     4.44    19.18
          BiLSTM     14.59    32.13    11.95    31.86
          BiLSTM-CC  18.22    38.89    18.09    38.21
          BART       19.32    42.52    19.93    41.66
          CCBERT     20.90    43.06    21.15    41.76
Table 2: The evaluation results on both data subsets.
  Question Body Titles
   I need to create this shape. I understand how to create simple shapes such as a cube, but I don’t understand at all how to create such a shape. How to get the right points for these arrays? Please, help
TriangleMesh mesh = new TriangleMesh();
    //which points should be here
    //which points should be here
    return mesh;
Origin: how to create such shape using javafx trianglemesh
Oracle: how to create such shape trianglemesh
TF-IDF: how update the value in json file using java jackson
BiLSTM: how to create a shape shape
BiLSTM-CC: how to create a shape for a cube
BART: how to create this shape
CCBERT: how to create this shape using trianglemesh
BiLSTM-CC_code: how to get trianglemesh from trianglemesh
CCBERT_code: how to get arrays from trianglemesh
   I am a beginner in mobile application building. I tried to put insert data function in my android studio but those insert function doesn’t work and the input data can’t be inserted…
 I put code in and It doesn’t give error report but when run the emulator and input data, my input can be inserted to sqlite database.
    myDb = new DatabaseHelper(…
public boolean insertData(String…
    SQLiteDatabase db = this.getWritableDatabase(…
    long result = db.insert(TABLE_NAME…
Origin: how to insert data to sqlite through user input
Oracle: to insert data to sqlite input
TF-IDF: listview not show items stored in sqlite database
BiLSTM: how to display data in android
BiLSTM-CC: how to insert data function in android studio
BART: how to insert data in android studio
CCBERT: how to add input data to sqlite database
BiLSTM-CC_code: how to add data to an activity in android
CCBERT_code: how to call sqlite function in android
   Here is an example I ran across while studying the functional interface concept.
interface Sayable{ void say(); }
public class MethodReference {
    public static void saySomething(
    public static void main(String[] args) {
       //Referring static method
       Sayable sayable = MethodReference…
       //Calling interface method
This is printing…My question how it is printing the output when we call the say() method(which is not implemented)
Origin: how the functional interface working in java 8
Oracle: how the functional interface in
TF-IDF: why does lambda translation need static method
BiLSTM: what is the output of the following method
BiLSTM-CC: printing static method when printing interface
BART: what output is say() method in functional interface
CCBERT: how does this functional interface work
BiLSTM-CC_code: why can’t i instantiate an interface in java
CCBERT_code: how to implement static method in java 8
   I want to work with crypto-stock data described here in my spring boot application. The RESTTemplate uses Gson for deserialization. Response data looks like:
{"IOST": {"EUR": 0.01147,…
I have already…problem is that this comes as a single object with key-value pairs insted of as an array. The result should be a list of following objects:
public class Symbol {
    private Long id;
    private String symbol…
Any idea how this can be accomplished this?
Origin: how to deserialize a key-value map to a list
Oracle: how to a key-value to a list
TF-IDF: bigdecimal not keeping actual value when returned
BiLSTM: how to use resttemplate in spring boot application
BiLSTM-CC: how to crypto-stock data in spring boot
BART: how to deserialize crypto-stock data in spring boot
CCBERT: how to deserialize key-value pairs with gson
BiLSTM-CC_code: how to parse json object in java
CCBERT_code: how to use deserialization with gson
Table 3: Examples of our test questions and the automatically generated titles. Specifically, the green color marks the tokens appearing in the original titles, the orange-red color marks a wrong focus, and the gray color marks the code snippets. The models with a subscript in their names are trained on the code-only datasets.

In this section, we demonstrate the effectiveness of our model by conducting experiments to answer the following Research Questions (RQs):

  1. [RQ-1] Does our CCBERT model outperform the baseline models?

  2. [RQ-2] What is the advantage of using the bi-modal information of the entire question bodies?

  3. [RQ-3] How effective is our CCBERT model under low-resource circumstances?

5.1 RQ-1: Does our CCBERT model outperform the baseline models?

Methods: To investigate the performance superiority, we compare CCBERT with the four baselines described in Section 4.2. Table 2 shows the performance of CCBERT and the baselines on the four evaluation metrics. In addition, we present four examples in Table 3 to illustrate the superiority of CCBERT intuitively.

Results: From Table 2 and Table 3, we have the following findings:

(1) The performance rankings are the same on both subsets, where CCBERT outperforms all the baselines, ranging from the retrieval-based model (TF-IDF) to the large-scale pre-trained model (BART). We have also noticed that the Java subset is more difficult for all the models, which is consistent with the results reported by Gao et al. [gao2020generating]. Part of the reason is that the Java subset has a 23% larger vocabulary than the Python subset, so all the models are more likely to encounter difficult rare tokens.

(2) TF-IDF performs worst among all the baseline models and is barely competitive with the others. This is not surprising: questions containing duplicated content on Stack Overflow are likely to be closed, and only a small number of questions are available in our training set for retrieval. Besides, TF-IDF is by nature a bag-of-words model that does not take the overall meaning of the context into account, so it can hardly retrieve appropriate questions. All the samples in Table 3 show that the retrieved questions are totally different from the original ones.
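As a minimal sketch of such a retrieval baseline (a self-contained illustration; the actual baseline implementation may differ), one can index the training question bodies as TF-IDF vectors and return the title of the most cosine-similar body:

```python
import math
from collections import Counter

def build_tfidf_retriever(train_bodies, train_titles):
    """Index training question bodies with TF-IDF vectors; retrieve the
    title of the most cosine-similar body for a new query (bag-of-words,
    so word order and overall context are ignored)."""
    docs = [body.lower().split() for body in train_bodies]
    n_docs = len(docs)
    # document frequency and smoothed inverse document frequency per term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in df}

    def vectorize(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    index = [vectorize(doc) for doc in docs]

    def retrieve(query_body):
        q = vectorize(query_body.lower().split())
        scores = [cosine(q, d) for d in index]
        return train_titles[max(range(n_docs), key=scores.__getitem__)]

    return retrieve
```

Because the scoring is purely lexical, a query whose key terms never appear in any indexed body gets an essentially arbitrary nearest neighbor, which matches the failure mode seen in Table 3.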

(3) BiLSTM-CC and BiLSTM outperform TF-IDF by a large margin, indicating the superiority of neural generative models. Besides, BiLSTM-CC outperforms the vanilla BiLSTM by 18%, 12%, 33%, and 11% in terms of the four metrics on the Python subset, and by 25%, 21%, 51%, and 20% on the Java subset, which proves the effectiveness of the copy and coverage mechanisms. For instance, in the first sample of Table 3, BiLSTM generates a repeated word "shape" due to the lack of the coverage mechanism, and in the second sample, BiLSTM-CC borrows more tokens from the question body, showing the effect of the copy mechanism. Despite the good performance of BiLSTM-CC, our CCBERT model outperforms it by 17%, 13%, 19%, and 11% on the Python subset, and by 15%, 11%, 17%, and 9% on the Java subset, indicating the greater advantage of Transformer-based models and the pre-training strategy. From the generated samples in Table 3, we can see that CCBERT handles long-range dependencies better than BiLSTM and BiLSTM-CC. For instance, in the first sample, CCBERT notices that "this shape" refers to the "TriangleMesh" that appears later in the question body, while BiLSTM-CC tends to focus on the words (i.e., "shape" and "cube") at the beginning of the question body. The same holds for the second and the fourth samples, where unimportant words (i.e., "android studio", "spring boot", and "crypto-stock") at the front of the question body attract more attention from BiLSTM-CC, while CCBERT finds the truly important words (i.e., "sqlite", "deserialization", and "key-value") hidden in the middle of the question body.
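The copy mechanism referred to above mixes the decoder's vocabulary distribution with a copy distribution built from the attention over source tokens. A minimal numpy sketch, following the pointer-generator formulation (the exact parameterization of CCBERT's copy attention layer may differ):

```python
import numpy as np

def copy_enhanced_distribution(p_vocab, attention, src_ids, p_gen):
    """Pointer-generator style output mixing.

    p_vocab:   (V,) generation distribution over the vocabulary
    attention: (L,) attention weights over the L source positions
    src_ids:   (L,) vocabulary id of each source token
    p_gen:     scalar in [0, 1], probability of generating vs. copying
    """
    p_copy = np.zeros_like(p_vocab)
    # scatter-add attention mass onto the vocabulary ids of the source tokens,
    # accumulating weights for tokens that occur at several positions
    np.add.at(p_copy, src_ids, attention)
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy
```

Rare source tokens (e.g. "TriangleMesh") receive probability mass through the copy term even when the generator assigns them almost none, which is why the mechanism helps with out-of-vocabulary identifiers.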

(4) BART is a rather competitive model: CCBERT outperforms it by 8%, 4%, 3.5%, and 2% in terms of the four metrics on the Python subset, and by 8%, 1%, 6%, and 0.2% on the Java subset. According to the samples in Table 3, BART is good at generating clear and readable titles, because it is a generation-oriented model that has been pre-trained on a huge natural language corpus. But we can see from the first and second samples that the titles generated by BART miss the key words (i.e., "TriangleMesh" and "sqlite"); we attribute this problem to BART's inferior understanding of source code. For instance, in the first sample, BART cannot tell that the "shape" at the beginning refers to the "TriangleMesh" object declared in the following code snippet. In the second sample, a major part of the body describes inserting data into the SQLite database, while BART only focuses on the unimportant words "android studio". On the contrary, with the help of the bi-modal pre-trained CodeBERT encoder, our CCBERT model is better at understanding source code and generates titles that are more semantically relevant to the original ones.

(5) The Oracle model has a surprisingly good performance on both subsets, which shows that there is still much room for improvement over current models. In terms of the recall-oriented ROUGE metrics, the excellent performance of the Oracle model indicates that most tokens in a question title come from the corresponding question body. However, our CCBERT model can only identify part of the useful tokens in question bodies, which further leads to a moderate performance on the BLEU metric. From the generated samples in Table 3, we can also see that the titles produced by the Oracle model hardly conform to grammatical norms, which indicates the necessity of applying generative models to this task. As for the reasons for the huge performance gap between our CCBERT model and the Oracle model, we think that the complex long-range bi-modal contexts may not be fully handled by our model, and the personalized writing habits reflected in question titles also make it hard for an automatic model to summarize in the same way as developers do.

Answer to RQ-1: Our CCBERT model outperforms the TF-IDF, BiLSTM, BiLSTM-CC, and BART models in terms of all the automated evaluation metrics on both the Python and Java subsets.

5.2 RQ-2: What is the advantage of using the bi-modal information of the entire question bodies?

Motivation: Although we have illustrated the necessity of using both text descriptions and code snippets to generate high-quality question titles, we would like to quantify the improvement of using the bi-modal information over the code-only setting in Gao et al.'s work [gao2020generating].

Methods: We post-process all question bodies in our dataset to weed out everything other than code snippets. To keep continuity with the previous experimental settings, the new dataset keeps the questions in the same order as before during training and testing. We choose BiLSTM-CC and CCBERT as representatives of our generative models and show their performance in Table 4, where the two groups of rows correspond to the code-only Python and Java subsets, respectively.
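Assuming the question bodies are stored as HTML with `<pre><code>` blocks, as in the official Stack Overflow data dump, the code-only post-processing can be sketched as follows (the function name and the exact cleaning rules are illustrative):

```python
import html
import re

# Match the contents of <pre><code> ... </code></pre> blocks in a question
# body; Stack Overflow stores post bodies as HTML.
CODE_BLOCK = re.compile(r"<pre[^>]*><code>(.*?)</code></pre>", re.DOTALL)

def extract_code_only(body_html):
    """Keep only the code snippets of a question body, dropping all
    natural-language text (used to build the code-only subsets)."""
    snippets = CODE_BLOCK.findall(body_html)
    # unescape HTML entities such as &lt; that appear inside code blocks
    return "\n".join(html.unescape(s).strip() for s in snippets)
```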

  Subset              Model      BLEUS-4  ROUGE-1  ROUGE-2  ROUGE-L
  Python (code-only)  BiLSTM-CC    12.65    28.70     7.97    28.46
                      CCBERT       15.03    38.06    14.95    36.97
  Java (code-only)    BiLSTM-CC    11.45    24.32     6.81    24.86
                      CCBERT       15.37    35.80    14.43    35.10
Table 4: The evaluation results on the code-only subsets.

Results: There is an average 24%–40% decline in the performance of both models when only code snippets are used for training. Specifically, the performance of CCBERT declines by 32%, 18%, 34%, and 18% in terms of the four metrics on the code-only Python subset, and by 26%, 17%, 32%, and 16% on the code-only Java subset; BiLSTM-CC drops by 33%, 31%, 58%, and 30% on the code-only Python subset, and by 37%, 37%, 62%, and 35% on the code-only Java subset. We also find that the performance gap between CCBERT and BiLSTM-CC widens, especially on the ROUGE scores. Such results are expected, because code snippets by themselves cannot offer sufficient context for a question. As shown in Table 3, the overlap ratios between the code snippets and their corresponding titles are very low in all four samples. Therefore, the generated titles may be incomplete or incorrect. For example, in the first sample, both BiLSTM-CC and CCBERT pay attention to "TriangleMesh", but neither of them deduces the word "shape" used in the title. In the second sample, both models fail to tell that the actual purpose of using the "Android activity" and the "SQLite database" is to insert user input data into the database. In the third and fourth samples, although both models manage to produce words (i.e., "instantiate", "implement", "parse", "deserialization") that are not in the code snippets, there are still few overlaps between their generated titles and the original ones. Nevertheless, under the code-only setting and without changing the model structure or hyperparameters, CCBERT retains its superiority over BiLSTM-CC, which also indicates its generalization ability across different settings.

Answer to RQ-2: Applying bi-modal information greatly boosts the performance of both BiLSTM-CC and CCBERT models.

5.3 RQ-3: How effective is our CCBERT model under low-resource circumstances?

Motivation: Data hunger is a common issue in the field of deep learning, which greatly hinders the application of many excellent models. In our situation, we may need a great number of high-quality questions for our model to learn from. Previous experiments have shown that CCBERT handles the setting with around 60,000 training samples well. Since all the questions satisfying our high-quality conditions have already been retained, we carry out an experiment to discuss the effectiveness of our model under low-resource circumstances.

Methods: We first make several copies of both subsets, and then randomly erase a certain amount of questions from the training sets, leaving the validation and test sets untouched. We choose three fractions as the percentage of samples to erase: 1/2, 3/4, and 7/8. This yields three new training sets of size 30,229, 15,115, and 7,558 for Python, and three of size 28,559, 14,280, and 7,140 for Java. Along with the CCBERT model, we also train BiLSTM-CC on these datasets for comparison. Table 5 presents the experimental results on the datasets of different sizes.
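The subsampling step can be sketched as follows; the fixed seed is an illustrative assumption to make the erased subsets reproducible:

```python
import random

def subsample_training_set(examples, keep_fraction, seed=42):
    """Randomly keep a fraction of the training examples; the validation
    and test splits are left untouched (they are handled separately)."""
    rng = random.Random(seed)  # fixed seed so each run erases the same items
    return rng.sample(examples, int(len(examples) * keep_fraction))
```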

  Subset  Model      Fraction  BLEUS-4  ROUGE-1  ROUGE-2  ROUGE-L
  Python  BiLSTM-CC  1/8         16.54    38.07    15.37    37.91
                     1/4         17.71    40.30    16.94    39.81
                     1/2         18.46    41.02    18.84    40.53
                     full        18.84    41.49    18.98    40.56
          CCBERT     1/8         20.96    44.58    21.09    42.85
                     1/4         21.06    44.91    21.68    43.15
                     1/2         21.82    46.30    22.30    44.49
                     full        22.02    46.69    22.55    44.86
  Java    BiLSTM-CC  1/8         15.92    35.88    14.35    35.78
                     1/4         16.91    37.18    15.96    36.87
                     1/2         17.92    38.57    16.34    38.09
                     full        18.22    38.89    18.09    38.21
          CCBERT     1/8         19.56    41.66    19.59    40.29
                     1/4         19.92    42.78    19.73    41.02
                     1/2         20.46    43.05    20.39    41.62
                     full        20.90    43.06    21.15    41.76
Table 5: The performance of CCBERT and BiLSTM-CC on training sets of different sizes (the fraction of the full training set retained).

Results: As expected, both models suffer performance degradation on smaller datasets. To our surprise, even though the amount of data decreases exponentially, the performance declines only steadily. When the data size is reduced by 87.5%, CCBERT drops its scores by only 5% on average, whereas BiLSTM-CC drops by 13.5%. Specifically, the performance of CCBERT declines by 5%, 4%, 7%, and 4% in terms of the BLEUS-4, ROUGE-1, ROUGE-2, and ROUGE-L metrics on the Python subset, and by 7%, 3%, 8%, and 3% on the Java subset; BiLSTM-CC drops by 14%, 9%, 23%, and 7% on the Python subset, and by 14%, 8%, 26%, and 7% on the Java subset. This indicates that our task is not very sensitive to the data volume, and further verifies the existence of a writing pattern shared by those high-quality questions. Once a model learns this pattern, it is able to generalize to unseen samples. At the same time, we cannot say that developers have no personalized writing habits, so a larger amount of data can help eliminate such noise and improve the performance of CCBERT. In fact, facilitated by the pre-trained CodeBERT encoder, CCBERT is better initialized to resist data noise.

Answer to RQ-3: Compared with BiLSTM-CC, our CCBERT model shows significant superiority under low-resource circumstances.

6 Related Work

Since we treat our TGEQB task as an abstractive summarization problem, we take related work in the fields of text summarization and code summarization as references. We briefly introduce the recent literature in this section.

6.1 Text Summarization

There are extractive and abstractive ways for summarization tasks, and both ways have been attracting extensive research interest.

The extractive models focus on selecting sentences or paragraphs from source texts that best match the target summary. The idea of using a hierarchical encoder and an extractor for document summarization was proposed by Cheng et al. [Cheng2016NeuralSB]. Later, researchers proposed a variety of solutions to more specific problems. For example, Zhou et al. [Zhou2018NeuralDS] argued that the sentence scoring and selection steps should not be done separately, so they proposed an integrated model that merges the two steps. Xu et al. [Xu2020DiscourseAwareNE] argued that BERT-based models cannot capture dependencies among discourse units, which leads to unwanted phrases in extracted summaries; to tackle this problem, they proposed to encode rhetorical structure theory trees with a graph convolutional network. Jia et al. [Jia2020NeuralES] argued that BERT-based models neglect the inherent dependencies among reference sentences, and proposed to refine the sentence representations with a redundancy-aware graph attention network. These novel models performed well, but our task requires the model to produce readable titles, and the extractive approach proved unworkable on our dataset (see the Oracle model's performance in Table 3).

In general, abstractive models are not restricted to selecting and rearranging the original text; instead, they generate each word from a given vocabulary. See et al. [See2017GetTT] argued that vanilla attentional sequence-to-sequence models tend to produce inaccurate factual details and duplicate phrases, so they proposed a hybrid generator incorporating both the copy and coverage mechanisms. Gehrmann et al. [Gehrmann2018BottomUpAS] found that abstractive models were poor at content selection; instead of adding extra mechanisms, they proposed a two-stage process that trains an extractor and then uses it as bottom-up attention to guide the generator. Liu et al. [Liu2019TextSW] extended this idea with two-stage fine-tuning on both extraction and generation tasks; in addition, they proposed to use separate optimizers for the encoder and decoder to alleviate the mismatch brought by different objectives. Lewis et al. [Lewis2020BARTDS] proposed BART, a generation-oriented pre-trained model that has achieved excellent performance on abstractive summarization tasks, so we choose it as a baseline to compare with our model. Abstractive summarization is similar to our task in many aspects, but our dataset is more challenging due to the complex bi-modal context and the difficult rare tokens, which is why we adopt the CodeBERT model and the copy mechanism.

6.2 Code Summarization

Code summarization aims to generate readable and meaningful comments that can accurately describe the given programs or subroutines, which is very useful for code search and code comprehension.

One way to deal with source code is to treat it as a sequence. Iyer et al. [Iyer2016SummarizingSC] first proposed to use attentional LSTMs to produce summaries describing code snippets, and they released their training corpus. Hu et al. [Hu2018SummarizingSC] further looked into the possibility of using API knowledge to generate comments that better describe the functionality of source code. Wei et al. [Wei2019RetrieveAR] proposed to use comments of existing similar source code to guide new comment generation. Wei et al. [Wei2019CodeGA] exploited the relations between code summarization and code generation, and proposed a dual framework to train the two tasks simultaneously. The experimental results showed that performance improvements were achieved on both tasks. Hu et al. [Hu2018DeepCC] argued that code tokens should not be processed sequentially, so they proposed an abstract syntax tree-based structural code representation and verified its effectiveness in generating code comments. Ahmad et al. [Ahmad2020ATA] first introduced the Transformer model to this task. They proposed to use a pairwise position encoding to capture the long-range dependencies among code tokens. The above approaches tended to treat source code as text sequences, but they also valued the special information hidden behind the code. Their experiments convinced us that the programming language is a different modality from the natural language.

Another way is to convert the source code into other forms of representation. Wan et al. [Wan2018ImprovingAS] used a sequential encoder as well as a tree-based encoder to capture the overall information of code. They also applied an actor-critic network to overcome the exposure bias issue of the auto-regressive decoder. LeClair et al. [LeClair2019ANM] also used double encoders and incorporated the copy mechanism to preserve important tokens reported by the AST analyzer. They [LeClair2020ImprovedCS] further proposed a graph-based neural architecture that achieved even better performance. The above-mentioned studies show that source code needs specialized transformation for models to extract the semantic information for summarization. Unfortunately, we find that content marked with the "<code>" tag in our filtered SO questions is not always syntactically correct source code. Therefore, we treat the code snippets as a sequence of tokens and use CodeBERT as the encoder, which is code-aware and takes sequences as input.
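As a sketch of how such a bi-modal sequence is assembled for a RoBERTa-style encoder like CodeBERT (the special-token layout follows CodeBERT's NL–PL input format; tokenizer details and the maximum length are illustrative), the text and code segments are concatenated with separator tokens:

```python
def build_bimodal_input(text_tokens, code_tokens, max_len=512):
    """Concatenate natural-language and code tokens in CodeBERT's
    bi-modal format: <s> NL tokens </s> PL tokens </s>."""
    sequence = ["<s>"] + text_tokens + ["</s>"] + code_tokens + ["</s>"]
    if len(sequence) > max_len:
        # truncate the tail but keep a closing separator token
        sequence = sequence[: max_len - 1] + ["</s>"]
    return sequence
```

Because the encoder sees both segments in one sequence, its self-attention can relate a phrase in the text (e.g. "this shape") to an identifier in the code (e.g. "TriangleMesh"), regardless of their distance.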

7 Threats to Validity

In this section, we identify the potential threats that might affect the recurrence of our experiments and the validation of our results.

The threats to internal validity concern two aspects: one is the re-implementation of the baselines, and the other is the design of the CCBERT model. To address the first issue, we rebuild the default development environment and choose the recommended settings for the baseline models. As for the second issue, we have made trade-offs between different techniques. For example, we give up the coverage mechanism because it is incompatible with the parallel decoding fashion of our Transformer decoder, and further experiments show that our model is not troubled by the repetition problem.

The threats to external validity primarily lie in the generalization ability of our model. There may be a deviation between our experiments and real-world situations. To address this issue, we split our dataset into Python and Java subsets for independent evaluation, and deliberately choose the most recently posted questions for testing.

The threats to construct validity mainly relate to the evaluation measures. Since the nature of our task is a sequence generation problem, for which BLEU and ROUGE are the most commonly used metrics, we choose both of them to measure the precision- and recall-oriented performance.

8 Conclusion and Future Work

In this paper, we propose a new task of summarizing question titles from bi-modal context and a novel model named CCBERT to tackle it. CCBERT incorporates the copy mechanism and the CodeBERT model, which together handle rare tokens and capture the long-range dependencies between bi-modal tokens. We build a large-scale dataset with sufficient high-quality questions related to the Python and Java programming languages. The BLEU and ROUGE metrics are used for automated evaluation, and a variety of baseline models are chosen for comparison. Experimental results show that our model outperforms the baselines under different settings. We have also released our dataset and source code for follow-up research. In future work, we plan to refine the model architecture to achieve better performance and apply our approach in more meaningful scenarios.