Learning to Describe Solutions for Bug Reports Based on Developer Discussions

10/08/2021 ∙ by Sheena Panthaplackel, et al. ∙ The University of Texas at Austin

When a software bug is reported, developers engage in a discussion to collaboratively resolve it. While the solution is likely formulated within the discussion, it is often buried in a large amount of text, making it difficult to comprehend, which delays its implementation. To expedite bug resolution, we propose generating a concise natural language description of the solution by synthesizing relevant content within the discussion, which encompasses both natural language and source code. Furthermore, to support generating an informative description during an ongoing discussion, we propose a secondary task of determining when sufficient context about the solution emerges in real-time. We construct a dataset for these tasks with a novel technique for obtaining noisy supervision from repository changes linked to bug reports. We establish baselines for generating solution descriptions, and develop a classifier which makes a prediction following each new utterance on whether or not the necessary context for performing generation is available. Through automated and human evaluation, we find these tasks to form an ideal testbed for complex reasoning in long, bimodal dialogue context.







1 Introduction

Software bugs in open-source projects are reported through issue tracking systems like GitHub Issues. When a bug is reported, a discussion is initiated among developers to collectively resolve it NoyoriGood. The bug resolution process is often strenuous and time-consuming, involving extended deliberations LiuBugSum among multiple participants KavalerPerceived, spanning several months or even longer KikasIssueDynamics. Although a solution often emerges within the discussion AryaInfo, it can easily get lost in a large amount of text LiuBugSum. Wading through a long discussion to determine whether a solution has been recommended, comprehending it, and then implementing it through the necessary code or documentation changes in the code base can be daunting, especially for developers who are not closely following the discussion AryaInfo; TanFirst. This delays implementation, and consequently, the bug persists in the code base, threatening the reliability of the software.

Figure 1: ExoPlayer bug report discussion (https://github.com/google/ExoPlayer/issues/5507) with user-written and system-generated solution descriptions.

As developers scan through the long discussion, it is desirable to have an automated system which can guide them to more easily absorb information relevant towards implementing the solution. In this work, we propose automatically generating a concise natural language description of the solution by synthesizing the relevant content as it emerges in the discussion. For example, as the discussion in Figure 1 progresses, the cause of the bug is identified as the shutter getting closed “when seeking to an unprepared period” and a solution emerges: “suppress closing the shutter in this case, provided the old and new periods belong to the same window.” Our primary task aims to generate a concise description of this solution: Prevent shutter closing for within-window seeks to unprepared periods.

To help quickly mobilize developers for implementation and expedite bug resolution in a real-time setting, the description should be generated as soon as the necessary content emerges in the discussion. In Figure 1, generation should be performed only after utterance #4 is made in the discussion. Since the solution is not formulated until that point, there is insufficient context to reliably generate a description before then. We introduce a secondary task to determine when the necessary context for generating an informative description becomes available in an ongoing discussion. By monitoring progress and later stating an utterance which describes the solution, we envision an intelligent agent which chimes into the discussion to facilitate more efficient bug resolution.

Although the proposed system does not have a “back-and-forth” component, we consider it as a first step towards building dialogue-based AI tools for software development. While there is growing interest in building tools to support software development activities such as code summarization iyer-etal-2016-summarizing; ahmad-etal-2020-transformer, comment updating panthaplackel-etal-2020-learning, and commit message generation loyola-etal-2017-neural, dialogue systems have been largely understudied in this domain.

To develop our proposed system, we build a corpus by mining bug report discussions from GitHub Issues. We develop a novel approach to obtain noisy supervision from commits and pull requests linked to bug reports, which contain the repository changes that resolve these bugs. Namely, we collect natural language descriptions of these changes through their commit messages and pull request titles. We also extract when these changes are made, relative to the time-stamps of utterances in the discussion. We further apply filtering to control noise. Sample data is provided. The full dataset and code will be available upon publication.

Using this data, we establish benchmarks for generating solution descriptions, conditioned on the discussion context. From the long context, a model must learn to tease out and condense information relevant to the solution. Handling long context is critical for tasks with dialogue as input, since the input grows rapidly with the number of participants and interactions. Additionally, the context entails technical text, with natural language and source code often appearing in the same sentence LiUnSupBugSum. So, deducing information from the context to articulate a meaningful description requires complex reasoning. We explore existing generation models including transformer-based VaswaniTransformer encoder-decoder models, as well as the recently proposed PLBART model AhmadPLBART, which was pretrained on large quantities of code and software-related natural language. We evaluate these models through automated metrics and human evaluation.

As a preliminary study on integrating such a generation model into a real-time setting, we design a classifier for determining when to perform generation during an ongoing discussion. Upon every new utterance, the classifier makes a prediction based on whether enough information about the solution has accumulated by that point. By combining the classifier and generation model, we evaluate a pipelined approach for real-time generation, and we demonstrate room for improvement in building a high-performing system.

2 Problem Setting

When a user reports a bug, they state the problem in the title (e.g., “Black screen appears when we seek over an AdGroup”) and initiate a discussion by making the first utterance, which usually elaborates on the problem. Other participants join the discussion at later points in time through subsequent utterances. Throughout the discussion, developers discuss various aspects of the bug, including a potential solution AryaInfo. In this work, we propose the task of generating a concise description of the solution (e.g., “Prevent shutter closing for within-window seeks to unprepared periods”) by synthesizing relevant content within the title and the sequence of utterances. In a real-time setting, the formulation of the solution is likely not immediately available but rather emerges as the discussion progresses and the sequence of utterances grows. So, we propose an additional task for monitoring progress in an ongoing discussion to predict the time step at which the title and the utterances up to that point constitute sufficient context for generating an informative description.

3 Data

Train Valid Test Total
Projects 395 (330) 145 (111) 134 (104) 412 (344)
Examples 9,862 (4,664) 1,232 (599) 1,234 (593) 12,328 (5,856)
   # Commit messages 4,520 (2,355) 410 (234) 386 (189) 5,316 (2,778)
   # PR titles 5,342 (2,309) 822 (365) 848 (404) 7,012 (3,078)
Avg # utterances per discussion 3.9 (4.5) 3.8 (4.4) 4.0 (4.4) 3.9 (4.5)
Avg gold time step 2.9 (3.4) 2.9 (3.4) 3.2 (3.6) 2.9 (3.4)
Avg utterance length (#tokens) 68.4 (75.6) 74.8 (84.3) 70.2 (75.7) 69.2 (76.5)
Avg title length (#tokens) 10.6 (10.6) 11.2 (11.0) 11.5 (11.3) 10.7 (10.7)
Avg description length (#tokens) 9.1 (10.5) 8.9 (9.9) 9.1 (10.1) 9.1 (10.4)

Table 1: Data statistics. In parentheses, we show metrics computed on the filtered subset. (On average, there is 1 utterance after the gold time step, which typically indicates that the bug has been fixed or expresses gratitude.)

For these tasks, we build a corpus by mining issue reports corresponding to open-source projects from GitHub Issues, as done in prior work KavalerPerceived; PanchellaWont. We specifically collect examples from Java projects. Issue reports can entail feature requests as well as bug reports. In this work, we focus on the latter. We identify bug reports by searching for “bug” in the labels assigned to a report and by using a heuristic for identifying bug-related commits.


3.1 Data Collection

A bug report is organized as an event timeline, recording activity from when the report is opened to when it is closed. From comments that are posted on this timeline, we extract utterances which form the discussion corresponding to a bug report, ordered based on their timestamps. We specifically consider bug reports that are linked to source code and documentation changes made in the code repository to resolve the bug NguyenCommit. These changes are made through commits and pull requests, which also appear on the timeline. Changes made in a commit or pull request are described using natural language, in the corresponding commit message loyola-etal-2017-neural; XuCommit or pull request title KononenkoShopify; Zhao2019ImprovingTP respectively. In practice, developers write commit messages and pull request titles after making code changes. However, much like prior work MultiModalChakraborty, we treat them as a proxy for solution descriptions which can drive bug-resolving code changes.

Furthermore, we extract the position of a commit or pull request on the timeline, relative to the utterances in the discussion. We consider this as the point at which a developer acquired enough information about the solution to implement the necessary changes and describe these changes with the corresponding commit message or pull request title. So, if the implementation is done immediately after a given utterance on the timeline, then we take that utterance's position as the “gold” time step for when sufficient context becomes available to generate an informative description of the solution. Each resulting example consists of the title, the utterances up to the gold time step, the gold time step itself, and the description.

In this work, we disregard examples consisting of multiple commit messages and PR titles, so there is at most one example per issue report. However, such examples are available in the data we release and can be useful for supporting generating descriptions at multiple time steps in future work.

3.2 Handling Noise

Due to significant noise in large online code bases like GitHub and StackOverflow, automatically extracted data from these sources is typically filtered both for more effective supervision and for more accurate evaluation panthaplackel2020associating; allamanis2016convolutional; Hu2018DeepCC; Fernandes2018StructuredNS; iyer-etal-2016-summarizing; yao2018staqc; yin2018mining. Upon studying the data, we also deemed filtering to be necessary. First, we apply simple heuristics to reduce noise, which we discuss in more detail in Appendix A. This yields the full dataset, which contains the examples we focus on in this work. Next, we identify three sources of noise that are more difficult to control with simple heuristics and propose techniques to quantify them. We use these to build a filtered subset of the full dataset that is less noisy. This subset is used for more detailed analysis of the models that are discussed in the paper, and we find that training on this subset leads to improved performance (§4.3).

Generic descriptions: Commit messages and pull request titles are sometimes generic (e.g., “fix issue.”) EtemadiCommit. To limit such cases, we compute normalized inverse word frequency (NIWF), which is used in prior work to quantify specificity zhang-etal-2018-learning-control. The filter excludes 1,658 examples in which the reference description’s NIWF score is below 0.116 (10th percentile computed from training data).

Uninformative descriptions: Instead of describing the solution, the commit message or pull request title sometimes essentially re-states the problem (which is usually mentioned in the title of the bug report). To control for this, we compute the percentage of unique, non-stopword tokens in the reference description which also appear in the title. The filtered subset excludes 3,552 additional examples in which this percentage is 50% or more.

Discussions without sufficient context: While enough context is available to a developer to implement a solution at the gold time step, this context may not always be available in the discussion and could instead come from their technical expertise or external resources. Sometimes, the solution is not mentioned within the discussion. For instance, in one discussion (https://github.com/prestodb/presto/issues/14567), only a stack trace and personal exchanges between developers are present. The utterance before the PR, “Or PM me the query that failed,” suggests that an offline conversation occurred. Since relevant content is not available in such cases, it is unreasonable to expect to generate an informative description. We try to identify examples in which there is no useful content for generating the target output by using a previously proposed approach NallapatiSumma for greedily constructing an extractive summary based on a reference abstractive summary. The filtered subset excludes 1,262 more examples for which a summary could not be constructed. After applying all three filtering techniques, we are left with 5,856 examples.
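The three filters above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the NIWF formula (maximum inverse word frequency over a description, min-max normalized across the corpus) is one common formulation and is assumed here, the stopword list is a tiny hypothetical one, and unigram recall stands in for the ROUGE-based objective used when greedily building an extractive summary. All function names are hypothetical; only the thresholds (0.116 and 50%) come from the text.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (assumption; the text does not name one).
STOPWORDS = {"a", "an", "the", "in", "on", "of", "to", "for", "when", "is", "and"}

def niwf_scores(descriptions):
    """Max inverse word frequency per tokenized description, min-max normalized.

    Assumed formulation: IWF(w) = log(1 + N) / (number of descriptions containing w).
    """
    n = len(descriptions)
    doc_freq = Counter(tok for d in descriptions for tok in set(d))
    raw = [max((math.log(1 + n) / doc_freq[t] for t in d), default=0.0)
           for d in descriptions]
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [0.0 for _ in raw]
    return [(r - lo) / (hi - lo) for r in raw]

def title_overlap_percent(description, title):
    """Percent of unique non-stopword description tokens that also occur in the title."""
    desc = {t for t in description if t not in STOPWORDS}
    if not desc:
        return 0.0
    return 100.0 * len(desc & set(title)) / len(desc)

def greedy_extract(reference, sentences):
    """Greedily pick sentences while each pick raises unigram recall of the reference.

    Returns the chosen sentence indices; an empty list means no summary could be built.
    """
    ref = set(reference)
    covered, chosen = set(), []
    while len(chosen) < len(sentences):
        remaining = [i for i in range(len(sentences)) if i not in chosen]
        best = max(remaining, key=lambda i: len((covered | set(sentences[i])) & ref))
        if len((covered | set(sentences[best])) & ref) <= len(covered & ref):
            break  # no remaining sentence adds reference content
        chosen.append(best)
        covered |= set(sentences[best])
    return chosen
```

Under these assumptions, an example would be dropped if its NIWF score falls below 0.116, if its title overlap is 50% or more, or if `greedy_extract` returns an empty list.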

3.3 Preprocessing

Since text in this domain can contain code tokens, we subtokenize (e.g., snake_case → snake case, camelCase → camel case) in the title, utterances, and description. We retain inlined code; however, we remove code blocks and embedded code snippets, as done in prior work tabassum-etal-2020-code; AhmadPLBART. Capturing meaning from large bodies of code often requires reasoning with respect to the abstract syntax tree AlonCode2Seq and data and control flow graphs AllamanisGraph. We also do not use source code files within a project’s repository. We leave it to future work to incorporate large bodies of code. We discard URLs and mentions of GitHub usernames from utterances. From the description, we remove references to issue numbers and pull request numbers.
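The subtokenization step can be illustrated with a small sketch. The regular expression and the lowercasing are assumptions made for illustration (the text only gives the snake_case and camelCase examples), and `subtokenize` is a hypothetical helper that handles a single identifier-like token; any characters that are not letters, digits, or underscores are simply dropped.

```python
import re

def subtokenize(token):
    """Split a snake_case or camelCase token into lowercase subtokens."""
    subtokens = []
    for part in token.split("_"):
        # Runs of capitals (acronyms), capitalized or lowercase words, or digit runs.
        subtokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [s.lower() for s in subtokens]

print(subtokenize("snake_case"))   # ['snake', 'case']
print(subtokenize("camelCase"))    # ['camel', 'case']
```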

3.4 Partitioning

The dataset spans bug reports from April 2011 to July 2020. We partition the dataset based on the timestamp of the commit or pull request associated with a given example. Namely, we require all timestamps in the training set to precede those in the validation set and all timestamps in the validation set to precede those in the test set. Partitioning with respect to time ensures that we are not using models trained on future data to make predictions in the present, more closely resembling the real-world scenario NieTime. Dataset statistics are shown in Table 1.
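A time-based split of this kind can be sketched as below. The 80/10/10 proportions are an assumption that roughly matches the example counts in Table 1, and the `timestamp` field and the function name are hypothetical.

```python
def partition_by_time(examples, valid_frac=0.1, test_frac=0.1):
    """Split examples so that train precedes valid, which precedes test, in time."""
    ordered = sorted(examples, key=lambda ex: ex["timestamp"])
    n = len(ordered)
    n_valid = int(n * valid_frac)
    n_test = int(n * test_frac)
    train = ordered[: n - n_valid - n_test]
    valid = ordered[n - n_valid - n_test : n - n_test]
    test = ordered[n - n_test :]
    return train, valid, test
```

Sorting before slicing is what guarantees that no training example is newer than any validation or test example.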

4 Generating Solution Descriptions

Test set Context 1-gram 2-gram 3-gram 4-gram
Full Title 73.0 88.9 94.0 96.1
Full Utterances 54.7 87.6 95.0 97.6
Full Title + utterances 47.9 82.0 91.2 94.8
Filtered Title 82.3 95.6 98.4 99.4
Filtered Utterances 49.9 87.4 95.1 97.8
Filtered Title + utterances 47.5 86.0 94.5 97.5

Table 2: Percent of novel unigrams, bigrams, trigrams, and 4-grams in the reference description, with respect to the title, the utterances, and title + utterances. The high percentages show that generating solutions is an abstractive task.

We first study the task of generating informative solution descriptions in a static setting, in which we leverage the oracle context from the discussion (i.e., the title and the utterances up to the gold time step). From Table 1, the average length of a single utterance is 70 tokens while the average description length is only 9 tokens. Therefore, this task requires not only effectively selecting content about the solution from the long context (which could span multiple utterances) but also synthesizing this content to produce a concise description. Following see-etal-2017-get, we compute the percent of novel n-grams in the reference description with respect to the input context in Table 2. The high percentages underline the need for an abstractive approach, rather than an extractive one which generates a description by merely copying over utterances or sentences within the discussion. (We observe very low performance with extractive approaches, as shown in Appendix C.) Furthermore, success on this task requires complex, bimodal reasoning over technical content in the discussion, encompassing both natural language and source code.
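The novel n-gram statistic can be computed along the following lines. This is a sketch under the assumption that every n-gram occurrence in the reference is counted (rather than unique n-grams only); the function names are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def percent_novel(reference, context, n):
    """Percent of reference n-grams that never occur in the context."""
    ref = ngrams(reference, n)
    if not ref:
        return 0.0
    ctx = set(ngrams(context, n))
    return 100.0 * sum(g not in ctx for g in ref) / len(ref)

reference = ["prevent", "shutter", "closing"]
context = ["closing", "the", "shutter"]
print(percent_novel(reference, context, 2))  # 100.0
```

In this toy case two of the three reference unigrams appear in the context, but no reference bigram does, which mirrors how novelty rises with n in Table 2.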

4.1 Models

We benchmark various models for this task. To represent the input in neural models, we insert TITLE_START before the title and UTTERANCE_START before each utterance.
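As an illustration of this input representation, the sketch below flattens a title and its utterances into one token sequence with the two special tokens. Whitespace tokenization, the optional `max_tokens` cut-off, and the choice to keep the beginning of the sequence when truncating are assumptions for illustration; `flatten_context` is a hypothetical helper.

```python
def flatten_context(title, utterances, max_tokens=None):
    """Build one token sequence: TITLE_START + title, then UTTERANCE_START + each utterance."""
    tokens = ["TITLE_START"] + title.split()
    for utterance in utterances:
        tokens += ["UTTERANCE_START"] + utterance.split()
    # Optional truncation for models with a fixed input limit.
    return tokens if max_tokens is None else tokens[:max_tokens]
```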

Copy Title: Though the bug report title typically only states a problem, we observe that it sometimes also puts forth a possible solution, so we evaluate how well it can serve as a concise description of the solution.

S2S + Ptr: We consider a transformer encoder-decoder model in which we flatten the context into a single input sequence VaswaniTransformer. Generating the output typically requires incorporating out-of-vocabulary tokens from the input that are specific to a given software project, so we support copying with a pointer generator network VinyalsPointer.

Hier S2S + Ptr: Inspired by hierarchical approaches for dialogue response generation SerbanHRED, we consider a hierarchical variant of the S2S + Ptr model with two separate encoders: one that learns a representation of an individual utterance, and one that learns a representation of the whole discussion. We provide more details in Appendix B.

PLBART: AhmadPLBART recently proposed PLBART, which is pretrained on a large amount of code from GitHub and software-related natural language from StackOverflow, using BART-like lewis-etal-2020-bart training objectives. With fine-tuning, PLBART achieves state-of-the-art performance on many program and language understanding tasks like code summarization/generation. We fine-tune PLBART on our training set and evaluate its ability to comprehend bug report discussions and generate descriptions of solutions. (We focus on PLBART rather than vanilla BART because it achieves higher performance, as shown in Appendix D. Note that PLBART truncates input to 1024 tokens.)

PLBART (F): Since PLBART is pretrained on a large amount of data, we can afford to reduce the fine-tuning data. So we fine-tune on only the filtered subset of the training set (cf. §3.2), to investigate whether fine-tuning on this “less noisy” sample can lead to improved performance.

4.2 Results: Automated Metrics

We compute common text generation metrics, BLEU-4 papineni2002bleu, METEOR BanerjeeEtAL2005, and ROUGE-L lin-2004-rouge. We compute statistical significance with bootstrap tests berg-kirkpatrick-etal-2012-empirical. Results are in Table 3. On the full test set, PLBART outperforms other models by statistically significant margins, demonstrating the value of pretraining on large amounts of data. PLBART (F) underperforms PLBART on the full test set; however, on the filtered subset, PLBART (F) either beats or matches PLBART. We find that there is a large drop in performance across models between the full test set and filtered subset. As demonstrated by the relatively high performance of the naive Copy Title baseline, models can perform well by simply copying or rephrasing the title in many cases on the full test set. However, the filtered subset is designed to remove uninformative reference descriptions that merely re-state the problem. Nonetheless, because critical keywords relevant to the solution are often also in the title, the Copy Title baseline can still achieve reasonable scores on the filtered subset, even beating S2S + Ptr and Hier S2S + Ptr on METEOR. Although automated metrics provide some signal, they emphasize syntactic similarity over semantic similarity. For further evaluation, we conduct human evaluation.
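A paired bootstrap test in the style of the cited work can be sketched as below. It assumes per-example scores that are averaged (corpus-level metrics such as BLEU would need to be recomputed on each resample instead), and the number of resamples and the seed are arbitrary choices; the function name is hypothetical.

```python
import random

def paired_bootstrap_p_value(scores_a, scores_b, num_samples=1000, seed=0):
    """Estimate how often a resampled gain of A over B exceeds twice the observed gain."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = (sum(scores_a) - sum(scores_b)) / n
    exceed = 0
    for _ in range(num_samples):
        # Resample example indices with replacement, keeping the A/B pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if delta > 2 * observed:
            exceed += 1
    return exceed / num_samples
```

A small returned value suggests that the observed gain is unlikely to be an artifact of the particular test sample.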

4.3 Results: Human Evaluation

Users are asked to read through the content in the title and the discussion. For each example, they are shown predictions from the 5 models discussed in Section 4.1, and they must select one or more of the descriptions that are most informative towards resolving the bug. If all candidates are uninformative, then they select a separate option: “All candidates are poor.” There is also another option to indicate that there is insufficient context about the solution (§3.2), making it difficult to evaluate candidate descriptions. They must also write a rationale for their selection. Before starting the annotation task, users must watch a training video in which we walk through seven examples in detail.

Since annotation requires not only technical expertise, but also high cognitive load and time commitment, it is hard to perform human evaluation on a large number of examples with multiple judgments per example. Similar to iyer-etal-2016-summarizing, we resort to having each example annotated by one user to obtain more examples. We recruited 8 graduate students with 3+ years of programming experience and familiarity with Java. Each user annotated 20 examples, leading to annotations for 160 unique examples in the full test set.

Note that these users are not active contributors, thus they will likely select the option pertaining to insufficient context more often than if they were active contributors to these projects who have a deeper understanding of their implementations. However, it is difficult to conduct a user study at a similar scale with contributors. Nonetheless, there are developers aiming to become first-time contributors for a particular project TanFirst. Our study better aligns with this use case.

Test set Model BLEU-4 METEOR ROUGE-L
Full Copy Title 14.4 13.1 24.4
S2S + Ptr 12.6 9.8 25.0
Hier S2S + Ptr 12.4 9.6 24.1
PLBART 16.6 14.5 28.3
PLBART (F) 14.2 12.3 25.1
Filtered Copy Title 10.0 8.3 16.6
S2S + Ptr 10.2 7.5 20.1
Hier S2S + Ptr 9.9 7.4 19.6
PLBART 12.3 9.9 21.1
PLBART (F) 12.3 10.2 21.9

Table 3: Automated metrics for generation. Scores for S2S + Ptr and Hier S2S + Ptr are averaged across three trials. Differences that are not statistically significant are indicated with matching symbols.
Model Full Filtered
Copy Title 8.1 6.0
S2S + Ptr 1.3 1.2
Hier S2S + Ptr 1.3 1.2
PLBART 11.9 10.5
PLBART (F) 33.1 39.5
All Poor 20.0 22.1
Insufficient Context 31.9 25.6

Table 4: Human evaluation results: Percent of annotations for which users selected predictions made by each model. This entails 160 annotations for the full test set, 86 of which correspond to examples in our filtered subset. Differences that are not significant are indicated with matching symbols.

In Table 4, we show that PLBART (F) substantially outperforms all other models, with users selecting its output 33.1% of the time. Even though the title typically only states a problem, users selected it 8.1% of the time. From rationales that users were asked to write, we found that there were cases in which the title not only posed the problem but also offered a solution. Users rarely preferred the output of S2S + Ptr and Hier S2S + Ptr as they usually just rephrased the problem. PLBART also appears to be re-stating the problem in many cases; however, less often than other models.

Though we see similar trends across the full test set and the filtered subset, all models except PLBART (F) tend to perform worse on the filtered subset, as previously observed on automated metrics. Also, the average number of cases with insufficient context is lower for the filtered subset, confirming that we are able to reduce such cases through filtering. We find the results on the filtered data to align better with human judgment. By fine-tuning on the filtered training set, PLBART (F) learns to pick out important information from within the context and generate descriptions which reflect the solution rather than the problem.

Model Title (1-gram) Title (2-gram) Utterances only (1-gram) Utterances only (2-gram)
Copy Title 100.0 100.0 0.0 0.0
S2S + Ptr 64.8 37.1 31.6 25.3
Hier S2S + Ptr 60.3 34.2 38.7 26.1
PLBART 80.8 77.7 31.0 41.4
PLBART (F) 36.9 28.4 52.8 42.3
Reference 32.7 22.2 38.8 25.4

Table 5: Percent of unigrams and bigrams in the prediction (or reference) which appear in the title and in the utterances only (excluding the title), on the CS subset. Lower is better for the title and higher is better for the utterances only.
Model Prediction
Copy Title black screen appears when we seek over an ad group .
S2S + Ptr fix black ads
Hier S2S + Ptr fix seeking in ad tag
PLBART suppress closing shutter when seeking over an ad group
PLBART (F) suppress closing the shutter when seeking to an unprepared period
Reference prevent shutter closing for within - window seeks to unprepared periods

Table 6: Model outputs for the example shown in Figure 1.

Title | PLBART (F) | Reference
(1) Issue with dex: OIDC server is not available at the ’quarkus.oidc.auth-server-url’ URL | fix trailing slash in auth - server url | strip trailing forward slash from oidc url
(2) InvalidDataTypeException: UDATA contains value larger than Integer.MAX_VALUE DDR issue decoding lookswitch | fix bug in byte code dumper when tableswitch instruction precedes tableswitch instruction | fix interpretation of switch instructions in byte code dumper
(3) Worldmap viewport changes when switching between dashboard pages | don ’ t refresh widget grid when worldmap loses viewport | define key prop for map visualization to update map on dimension change
(4) Workaround comments exist in opengrok-indexer/pom.xml file while the related issues are already fixed. | fix jflex - de / jflex # 705 ( comment ) | use jflex 1.8.2
(5) Why subscribe with single action for onNext design to crush if error happened? | 1 . x : fix subscription . subscribe ( ) to return observable . empty ( ) 2 . x : fix subscription . subscribe ( ) to return observable . empty ( ) | fixed sonar findings

Table 7: Output of PLBART (F) for a sample of examples in the test set. (1) https://github.com/quarkusio/quarkus/issues/10227, (2) https://github.com/eclipse-openj9/openj9/issues/9294, (3) https://github.com//Graylog2/graylog2-server/issues/7997, (4) https://github.com//oracle/opengrok/issues/3172, (5) https://github.com/ReactiveX/RxJava/issues/637.

4.4 Analysis

Of the 160 annotated examples, users found 109 to have sufficient context about the solution. We consider this the context-sufficient subset (CS), which we will release for future research. To analyze how models exploit the provided context, we measure the percent of n-grams in the prediction which overlap with the title as well as with the utterances (excluding n-grams already in the title) in Table 5. PLBART (F)’s predictions tend to have less n-gram overlap with the title and more overlap with the utterances. This suggests that this model predicts fewer uninformative descriptions which merely re-state the problem mentioned in the title and instead focuses on other content from the utterances.

In Table 6, we show model outputs for the example in Figure 1. S2S + Ptr and Hier S2S + Ptr essentially rephrase aspects of the problem, which are described in the title. Both PLBART and PLBART (F) capture the solution, with PLBART (F) providing more information. When there is sufficient context, 62.4% of the time, either PLBART or PLBART (F) generates output that is informative towards bug resolution. While this demonstrates that fine-tuning this large, pretrained model on our data can be useful in supporting bug resolution in online discussions to some extent, it also shows that there is opportunity for improvement.

We manually inspected PLBART (F)’s outputs and associated user rationales. We observe that the model tends to perform better when the solution is clearly stated in 1-3 consecutive sentences (Table 7 (1) and (2)). When more complex synthesis is needed, it sometimes stitches together tokens from the input incorrectly (Table 7 (3)). Next, although the model picks up on information in the context, sometimes, it draws content from an elaboration of the problem from within the discussion rather than a formulation of the solution (Table 7 (4)). This demonstrates that it still struggles to disentangle content relevant to the solution from that about the problem. We also find that it sometimes struggles to generate meaningful output when in-lined code is present, highlighting the challenge in bimodal reasoning about code and natural language (Table 7 (5)). Finally, we find problems with repetition and fluency (Table 7 (1)), as commonly seen in the outputs of neural models HoltzmanCurious.

5 Supporting Real-Time Generation

For generated solution descriptions to be useful in resolving bugs, generation must be performed during ongoing discussions. We conduct a preliminary investigation for integrating the generation task into a real-time setting. Namely, because an informative description can be generated only if there is sufficient context about the solution, an agent must wait until this context becomes available. For this, we train a binary classifier to predict the time step at which the necessary context is available. For instance, in Figure 1, the solution is formulated in utterance #4, so the correct behavior is to predict the negative label at each of the first three time steps and the positive label at the fourth. Once the positive label is predicted at a given time step, the description is generated, conditioned on the title and the utterances up to that time step. (Once the positive label is predicted, predictions are not made at later time steps. We leave multiple generation time steps for future work.)

5.1 Classifier

Our approach sequentially processes each new utterance and decides if it adds enough information to propose a solution. We first prepend TITLE_START to the sequence of tokens in the title and encode it with a transformer-based encoder. We consider the contextualized representation of this token as a vector representation of the information available before any utterance is processed. Next, to process an utterance, we prepend UTTERANCE_START to the sequence of utterance tokens. We concatenate the representation from the previous time step to the token embeddings and pass them through the encoder. The contextualized representation of the special token becomes the representation for the current time step. Finally, we pass this representation through 3 linear layers and a sigmoid layer, and then apply softmax to classify whether or not a solution can be formulated at the current step. By feeding in the previous representation, we inform the model of the prior context and evaluate the information added by the new utterance. We weight the positive and negative labels using the inverse of the class proportion to handle class imbalance (1.543 and 0.740 respectively). Additionally, to improve learning, we augment the training data with 12,350 non-bug examples, but apply a lower weight for these examples (0.7). The model is trained to minimize cross entropy loss.
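The class weights can be illustrated with a short sketch. It assumes that each weight is the inverse class proportion divided by the number of classes; the label counts below are hypothetical and were chosen only so that the result reproduces the reported 1.543 and 0.740.

```python
def inverse_proportion_weights(label_counts):
    """Weight each class by total / (num_classes * class_count)."""
    total = sum(label_counts.values())
    num_classes = len(label_counts)
    return {label: total / (num_classes * count)
            for label, count in label_counts.items()}

# Hypothetical counts: roughly 32.4% positive and 67.6% negative time steps.
weights = inverse_proportion_weights({"positive": 324, "negative": 676})
print(round(weights["positive"], 3), round(weights["negative"], 3))  # 1.543 0.74
```

Under this assumption, the reported weights would imply that about a third of the classifier's training time steps carry the positive label.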

In Table 8, we present overall accuracy as well as accuracy for different sets of test examples based on the gold time step. (We compare against simple baselines in Appendix E.) We observe that accuracy deteriorates substantially as the gold time step increases, illustrating the challenge in handling long dialogues, and highlighting room for improvement on this task.

The classifier fails to predict the positive label (before or at the gold time step) in some cases (predicted time step = None). On the examples for which it does predict the positive label, the predicted time step is, on average, 1.704, 1.895, and 1.804 time steps before the gold time step for the full, filtered, and curated test sets respectively. While a model should wait until sufficient context is available before generating, sometimes, the last couple of utterances before the implementation do not add context about the solution but are rather personal exchanges between developers (e.g., “Thanks for the bug report”, “I’ll open a PR”). For this reason, we believe that predicting the positive label slightly before the gold time step is acceptable in certain cases.

Full Filtered CS
45.8 47.7 41.9
None 21.6 23.6 33.6
32.5 28.8 24.5
    62.9 57.2 48.7
    33.4 34.2 30.1
    16.4 19.4 2.1
    15.8 15.8 18.8
    5.6 7.0 6.7

Table 8: % of time ; % of time positive label is not predicted , and % of time overall and for varying .
Full @ 11.3 9.9 19.9
@ 14.2 12.3 25.1

@ 9.5 7.8 16.3
@ 12.3 10.2 21.8

@ 9.4 8.4 16.5
@ 14.2 12.3 25.2

Table 9: Comparing PLBART (F)’s performance with context available at vs. . If = None (i.e., positive label is not predicted before or at ), the predicted description is the empty string. All differences are statistically significant.

5.2 Combined System

Finally, we combine the classifier and PLBART (F) (the best generation model from human evaluation) to build a complete system for deciding when a solution can be proposed and then generating one. In Table 9, we report automated metrics for PLBART (F), comparing model output between using the context up till (the predicted time step of classifier) versus . We observe that across metrics, predictions generated by the same underlying model using the context at achieve higher scores than those made using the context at . This highlights the gap in performance caused by error propagation from the classifier. The generation and classification tasks are inherently interdependent. It is not possible to generate an informative description without sufficient context, and “sufficient” context is defined by whether it can be used to generate an informative description. We leave it to future work to design a more intricate end system that is jointly trained on generation and classification.
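The control flow of such a combined system can be sketched as follows. This is a hedged illustration only: `is_ready` and `generate` are hypothetical callables standing in for the classifier and PLBART (F), and the 1-based step indexing over utterances is an assumption. The empty-string fallback follows the Table 9 caption.

```python
from typing import Callable, List, Optional, Tuple


def describe_when_ready(
    title: str,
    utterances: List[str],
    is_ready: Callable[[str, List[str]], bool],
    generate: Callable[[str, List[str]], str],
) -> Tuple[Optional[int], str]:
    """Query the classifier after each utterance; generate at the first positive."""
    for step in range(1, len(utterances) + 1):
        context = utterances[:step]
        if is_ready(title, context):
            return step, generate(title, context)
    # Positive label never predicted: the description is the empty string.
    return None, ""


# Toy stand-ins (hypothetical) for the classifier and the generation model.
def toy_is_ready(title, context):
    return "fix" in context[-1].lower()


def toy_generate(title, context):
    return context[-1]


step, description = describe_when_ready(
    "Crash on startup",
    ["I see this too.", "We can fix it by adding a null check.", "Thanks!"],
    toy_is_ready,
    toy_generate,
)
```

Because generation only sees the context up to the predicted step, any classifier error (too early, too late, or never) propagates directly into the description.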

6 Related Work

Bug report summarization: To help developers efficiently gather information from bug reports, there is interest in automatic bug report summarization. Approaches for this task are designed to generate holistic summaries of bug reports, with a summary being 25% of the length of the bug report LiuBugSum. We instead aim to generate a concise description that captures a very specific aspect of the bug report. Next, bug report summaries are not widely available, so approaches for bug report summarization rely on unsupervised techniques LiUnSupBugSum; LiuBugSum or supervision from a small amount of data RastkarSum; Jiang2016. On the other hand, our approach for obtaining noisy supervision allows us to train supervised models on a large amount of data. Bug report summarization is a post hoc task, done after the bug has been resolved, to help developers address related bug reports in the future. In contrast, our goal is to help resolve the present bug report, so our system must learn when to perform generation during an ongoing discussion. Approaches for bug report summarization have been predominantly extractive whereas our approach is abstractive. Nonetheless, it would be interesting to evaluate how bug report summarization techniques would fare on our task; however, their implementations are not publicly available.

Commit message generation: Unlike the task of automatically generating commit messages to describe code changes that have already been made loyola-etal-2017-neural; XuCommit, our system aims to generate natural language descriptions that can drive code changes.

Response triggering: The classifier for determining when to generate a description relates to chatbots learning to respond at an appropriate time LiuCustomer in dyadic conversations. The goal is to avoid interrupting a user who splits up an utterance across multiple turns. We instead consider multi-party dialogue in which an agent should wait until a specific type of content emerges in the discussion.

Dialogue + software: We view our work as a step towards building an interactive dialogue agent for streamlining software bug resolution. There has been minimal work in building interactive systems for this domain, with the exception of a few for tasks like query refinement ZhangChatbot and code generation chaurasia-mooney-2017-dialog; yao-etal-2019-model. WoodSpeechActs recently built a software-related dialogue corpus through a “Wizard of Oz” experiment to study the potential of a Q&A assistant during bug fixing. lowe-etal-2015-ubuntu developed a dialogue corpus based on Ubuntu chat logs to study Q&A assistants for technical support. In contrast, our dataset is designed for building a collaborative agent which can participate in multi-party conversations rather than one which answers directed questions.

7 Conclusion

In this work, we present our vision for an automated system which generates a concise description of the solution to a bug report as soon as the necessary context becomes available in an ongoing developer discussion. We collect a large dataset through supervision derived from commits and pull requests. Using this data, we investigate approaches for generating informative solution descriptions. We also conduct a preliminary study on integrating such a generation model into a real-time setting by combining it with a classifier for determining when sufficient context emerges in an ongoing discussion. Through automated and human evaluation, we demonstrate the utility of these models and also highlight their shortcomings, to encourage more research in exploring ways to address these challenges and build an improved end-to-end system.


Appendix A Data Cleaning

We focus on closed bug reports from the top 1,000 Java projects (in terms of number of stars), as a way of identifying well-maintained projects Jarczyk2014GitHubPQ. We require there to be at least two distinct “actors” in the discussion, in which the actor can either be a developer who makes an utterance in the discussion or an actor who implements the solution through a commit or pull request. We discard examples in which the reference description is identical to the title (disregarding stopwords), as these are cases in which either the reference description only states the problem and is uninformative or the title already puts forth a solution (in which case a generated description would not be useful). We remove examples with commits or pull requests which simultaneously address multiple bug reports.
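Two of the filters above (at least two distinct actors, and a reference description that is not just the title) can be sketched as below. This is an illustrative approximation: the stopword list, whitespace tokenization, and function names are assumptions, not the authors' cleaning code.

```python
# Hypothetical, deliberately tiny stopword list; the paper does not specify one.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "to", "for", "is", "and"}


def content_tokens(text):
    """Lowercased whitespace tokens with stopwords removed."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


def keep_example(title, description, actors):
    """Return True if the example passes both filters."""
    # Require at least two distinct actors (utterance authors or implementers).
    if len(set(actors)) < 2:
        return False
    # Discard if the description equals the title, disregarding stopwords.
    return content_tokens(title) != content_tokens(description)
```

For instance, a description "crash in parser" for the title "Crash in the parser" would be discarded, since the two match once stopwords are ignored.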

We mined 141,389 issues (from 770 of the top 1,000 projects). After applying heuristics, we get 35,010 (from 525 projects), which will be released. Of these, we focus on the 12,328 bug-related issues with a single commit message/PR title.

Full Supervised Extractive 0.537 0.536 0.807 0.010 0.767
LexRank 2.252 1.851 2.629 0.061 2.470
(Lead 1) 4.793 6.537 10.077 2.534 8.752
(Lead 3) 3.085 7.955 9.778 2.303 8.687
2.842 5.425 7.426 1.363 6.712
(Lead 1) 4.028 4.453 7.736 1.451 6.889
(Lead 3) 3.189 5.692 8.153 1.504 7.359
(Last sentence) 3.475 3.480 6.089 0.930 5.476
(Last 3 sentences) 3.234 5.082 7.525 1.287 6.787
Retrieval (Title-Title) 6.866 4.497 11.517 1.281 10.748
Retrieval (Title-Desc) 8.763 6.167 15.965 2.426 14.776
Project Retrieval (Title-Title) 7.442 4.709 11.501 1.49 10.943
Project Retrieval (Title-Desc) 9.118 6.299 14.949 2.232 14.089
Copy Title 14.358 13.142 27.361 11.539 24.427
S2S + Ptr 12.583 9.838 27.589 4.258 25.024
Hier S2S + Ptr 12.365 9.564 26.785 3.672 24.084
PLBART 16.551 14.484 31.564 11.549 28.295
PLBART (F) 14.188 12.302 27.443 8.349 25.128

Supervised Extractive 0.711 0.653 1.084 0.005 1.029
LexRank 2.442 1.946 2.843 0.066 2.637
(Lead 1) 4.951 6.207 9.881 1.938 8.553
(Lead 3) 3.055 7.907 9.890 1.875 8.777
2.899 6.045 8.081 1.507 7.346
(Lead 1) 4.406 4.808 8.424 1.507 7.590
(Lead 3) 3.356 6.257 8.894 1.681 8.060
(Last sentence) 3.515 3.961 6.547 1.046 5.868
(Last 3 sentences) 3.345 5.722 8.200 1.460 7.448
Retrieval (Title-Title) 6.117 3.727 9.546 0.711 8.965
Retrieval (Title-Desc) 6.998 4.542 12.082 1.257 11.410
Project Retrieval (Title-Title) 6.646 4.195 9.603 1.273 9.255
Project Retrieval (Title-Desc) 7.593 5.064 11.895 1.638 11.328
Copy Title 9.962 8.291 18.538 4.943 16.641
S2S + Ptr 10.168 7.521 21.846 2.278 20.116
Hier S2S + Ptr 9.893 7.369 21.562 2.131 19.649
PLBART 12.319 9.877 23.419 5.452 21.097
PLBART (F) 12.266 10.218 23.786 5.712 21.857

Table 10: Comparing models in main paper with low-performing baselines for generating solution descriptions. Scores for Supervised Extractive are averaged across three trials.

Appendix B Details of Hier S2S + Ptr Model

We encode using a transformer-based encoder and feed the contextualized representation of its first token (UTTERANCE_START) into the RNN-based discussion encoder to update the discussion state, . When encoding , we also concatenate to embeddings, to help the model relate with the broader context of the discussion. Note that we treat the title as in the discussion. This process continues until is encoded, at which point all accumulated token-level hidden states are fed into a transformer-based decoder to generate the output.

Unlike the S2S + Ptr model which is designed to reason about the full input at once, this approach reasons step-by-step, with self-attention in the utterance encoder only being applied to tokens within the same utterance. Since the input context for this task is often very large, we investigate whether it is useful to break down the encoding process in this way. We also equip this model with a pointer generator network.

Appendix C Additional Generation Baselines

Full mBART base (randomly initialized) 9.978 6.976 17.000 2.498 15.744
mBART large 15.251 12.503 28.522 9.520 26.109
BART base 14.226 11.522 26.957 8.864 24.746
PLBART 16.551 14.484 31.564 11.549 28.295

Filtered Subset
mBART base (randomly initialized) 8.819 6.151 14.870 2.011 13.574
mBART large 11.663 9.233 22.295 5.159 20.458
BART base 10.820 8.583 21.247 5.055 19.537
PLBART 12.319 9.877 23.419 5.452 21.097

Table 11: Comparing performance of BART-based models. Training/fine-tuning is done with our full training set. Differences that are not statistically significant are shown with matching symbols.

For the generation task, we considered additional baselines; however, since they performed much worse than other approaches (by wide, statistically significant margins), we chose to exclude them from the main paper. We briefly describe these baselines below.

c.1 Extractive Baselines

Supervised Extractive: Using a greedy approach for obtaining noisy extractive summaries NallapatiSumma, we train a supervised extractive summarization model, similar to

LexRank: We use LexRank ErkanLexRank, an unsupervised graph-based extractive summarization approach. We extract 1 sentence with threshold 0.1.

(Lead 1): This entails simply taking the first sentence of the first utterance, intended to simulate the Lead-1 baseline that is commonly used in summarization.

(Lead 3): This entails simply taking the first 3 sentences of the first utterance, intended to simulate the Lead-3 baseline that is commonly used in summarization.

: Since some part of the solution is often mentioned within , we copy this utterance.

(Lead 1): Since the length of an utterance is quite different than that of a description (Table 1), we extract only the lead sentence of .

(Lead 3): For the reason stated above, we also apply the Lead-3 baseline to this utterance.

(Last sentence): Rather than extracting the lead sentence, we try extracting the last sentence of .

(Last 3 sentences): Rather than extracting the lead 3 sentences, we try extracting the last 3 sentences of .
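The lead and last-sentence heuristics listed above reduce to slicing a sentence list. The sketch below assumes a naive regex sentence splitter, which is a simplification; the paper does not state which sentence segmenter was used.

```python
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def lead(text, k):
    """First k sentences of an utterance (Lead-1 when k=1, Lead-3 when k=3)."""
    return " ".join(split_sentences(text)[:k])


def last(text, k):
    """Last k sentences of an utterance."""
    return " ".join(split_sentences(text)[-k:])


utterance = (
    "The cache is stale. We should invalidate it on write. "
    "I can open a PR. Thanks!"
)
```

Each baseline applies one of these functions to a single chosen utterance, which is why none of them can combine content from several points in the discussion.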

c.2 Retrieval Baselines

Retrieval (Title-Title): Using TF-IDF, we compute cosine similarity between the test example’s title and titles in the training set, to identify the closest training example, from which we take the description.

Retrieval (Title-Desc): Using TF-IDF, we compute cosine similarity between the test example’s title and descriptions in the training set, to identify the closest training example, from which we take the description.

Project Retrieval (Title-Title): Using TF-IDF, we compute cosine similarity between the test example’s title and titles for the same project in the training set, to identify the closest training example, from which we take the description.

Project Retrieval (Title-Desc): Using TF-IDF, we compute cosine similarity between the test example’s title and descriptions for the same project in the training set, to identify the closest training example, from which we take the description.
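A dependency-free sketch of these retrieval baselines follows. It is an approximation under stated assumptions: whitespace tokenization, raw term counts, and a smoothed IDF are choices made here for illustration, since the paper does not specify its TF-IDF variant. `train_keys` holds either training titles (Title-Title) or training descriptions (Title-Desc).

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def fit_idf(documents):
    """Smoothed inverse document frequency (assumed variant)."""
    n = len(documents)
    df = Counter(tok for doc in documents for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector; tokens unseen at fit time are dropped."""
    counts = Counter(tokenize(text))
    return {tok: c * idf[tok] for tok, c in counts.items() if tok in idf}


def cosine(u, v):
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(test_title, train_keys, train_descriptions):
    """Return the description of the training example whose key is closest."""
    idf = fit_idf(train_keys)
    query = tfidf(test_title, idf)
    scores = [cosine(query, tfidf(key, idf)) for key in train_keys]
    best = max(range(len(scores)), key=scores.__getitem__)
    return train_descriptions[best]


# Toy training data (hypothetical).
train_titles = ["NullPointerException in cache loader", "Typo in README"]
train_descs = ["Add null check to cache loader", "Fix README typo"]
```

The project-level variants would call `retrieve` on the subset of training examples that belong to the same project as the test example.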

c.3 Baseline Results

We present baseline results in Table 10. In addition to the metrics used in the main paper, we report ROUGE-1 and ROUGE-2. All of these baselines substantially underperform models presented in the main paper, especially the Supervised Extractive model. We believe this model performs so poorly due to noise in the supervision and because the extracted summaries are longer and structured differently than the reference descriptions in our dataset. Additionally, there are many examples in which the model does not select a single sentence from the input, resulting in the prediction being the empty string. LexRank also performs poorly in terms of automated metrics against the reference description. This unsupervised approach aims to identify a “centroid” sentence that summarizes the full input context and is not designed to specifically focus on solution-related context.

All baselines that extract a whole utterance or sentences from specific utterances perform poorly, demonstrating the need for content selection from the broader context and content synthesis rather than relying on simple heuristics to produce a description of the solution. We find that the retrieval baselines tend to achieve higher scores, as retrieved descriptions are from the same distribution as the reference descriptions. However, these numbers are still quite low, relative to those in the main paper.

Appendix D BART Models for Generating Descriptions

We leverage PLBART AhmadPLBART, which was pretrained on large amounts of code from GitHub and software-related natural language from StackOverflow. Compared to other pretrained models, fine-tuning PLBART achieves higher performance for various NL+code tasks, including code summarization, code generation, code translation, and code classification. Since our task also requires reasoning about code and technical text, we choose to use PLBART over other pretrained models in our work.

For completeness, we compare against BART-based models which are not pretrained on code or technical text. First, we consider mBART base (multilingual BART) tang2020multilingual, which is the underlying architecture of PLBART. Without pretraining (randomly initializing the same architecture), performance is very low, as shown in Table 11. The publicly released pretrained mBART model, which is pretrained on non-technical natural language, does not use the base architecture but rather large. We also fine-tune this model on our training set but find that it achieves lower performance than PLBART. Finally, we compare against BART base lewis-etal-2020-bart, which is also pretrained on non-technical natural language. Again, this model underperforms PLBART. Because PLBART’s performance is higher, we choose to focus on this model in our work.

Appendix E Classification Baselines

First Second Rand (uni) Rand (dist) RF Ours
Full Overall (1,234) 29.498 26.013 23.318 24.635 21.907 32.523
    (364) 100.000 0.000 49.359 52.656 30.861 62.912
    (321) 0.000 100.000 24.714 26.272 35.826 33.437
    (183) 0.000 0.000 10.565 10.018 11.475 16.393
    (141) 0.000 0.000 5.201 4.728 8.747 15.839
    (225) 0.000 0.000 0.889 1.333 4.296 5.630

Overall (593) 23.777 24.789 21.360 21.417 19.056 28.780
    (141) 100.000 0.000 52.009 52.955 23.404 57.210
    (147) 0.000 100.000 26.757 23.810 39.229 34.240
    (91) 0.000 0.000 9.524 12.088 12.454 19.414
    (80) 0.000 0.000 4.583 5.000 7.500 15.833
    (134) 0.000 0.000 1.244 1.741 3.731 6.965

Overall (109) 23.853 28.440 19.878 22.018 16.820 24.465
    (26) 100.000 0.000 51.282 55.128 17.949 48.718
    (31) 0.000 100.000 24.731 21.505 23.656 30.108
    (16) 0.000 0.000 2.083 12.500 18.750 2.083
    (16) 0.000 0.000 2.083 6.250 4.167 18.750
    (20) 0.000 0.000 0.000 0.000 13.333 6.667

Table 12: Accuracy (i.e., percent of times ) overall and for varying . Differences in overall accuracy that are not statistically significant are shown with matching symbols.

To better evaluate our classifier for determining when sufficient context is available for generating an informative description, we introduce some simple baselines. We observe that there are many cases in which , i.e., the solution is implemented immediately after the first or second utterance. So, we include the First baseline which always predicts a positive label at , and Second which predicts negative at and positive at , if (otherwise it never predicts positive).

Next, we include the Rand (uni) baseline which progresses through the discussion, randomly deciding between the positive and negative label after each utterance, based on a uniform distribution. We additionally include Rand (dist), which instead uses the probability distribution of labels at the example-level estimated from the training and augmentation data (i.e., pos = 0.549, neg = 0.451).
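The two random baselines differ only in the probability of emitting the positive label, so both can be sketched with one function. This is a hedged illustration: the 1-based step indexing over utterances and the function name are assumptions.

```python
import random


def random_baseline(num_utterances, p_positive, seed=None):
    """Walk through the discussion; return the first step labeled positive, else None."""
    rng = random.Random(seed)
    for step in range(1, num_utterances + 1):
        # rng.random() is in [0, 1), so this fires with probability p_positive.
        if rng.random() < p_positive:
            return step
    return None


def rand_uni(num_utterances, seed=None):
    """Uniform choice between the positive and negative label."""
    return random_baseline(num_utterances, 0.5, seed)


def rand_dist(num_utterances, seed=None):
    """Positive with the example-level probability reported in the text."""
    return random_baseline(num_utterances, 0.549, seed)
```

With a per-step positive probability near one half, the chance of staying negative shrinks geometrically with each utterance. This is consistent with the near-zero accuracy of both random baselines on the longest discussions in Table 12.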

Finally, we include a random forest classifier (RF) which makes a classification following each utterance, until the positive label is predicted or . It uses TF-IDF representations of the title and as well as an aggregated representation of . Additionally, it uses the following features: the position , length of , author of (as an index, with ordering dependent on entry into the discussion), frequency of utterances made by the author, the ratio of the length of to the accumulated length of , and the title length. We rely on scikit-learn's random forest classifier implementation with 100 trees and the Gini criterion. To account for class imbalance, the weights are adjusted so they are inversely proportional to class frequencies.
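The hand-crafted features listed above could be computed roughly as follows. This sketch makes several assumptions the text does not settle: lengths are whitespace token counts, author frequency is a raw count of the author's utterances so far, and the accumulated length includes the current utterance (which avoids dividing by zero at the first step). The TF-IDF features are omitted.

```python
def utterance_features(title, utterances, authors, step):
    """Features for the utterance at 1-based position `step` (illustrative only)."""
    current_tokens = utterances[step - 1].split()
    author = authors[step - 1]

    # Author index: order in which authors first entered the discussion.
    entry_order = []
    for name in authors[:step]:
        if name not in entry_order:
            entry_order.append(name)

    # Assumption: accumulated length covers all utterances up to and including this one.
    accumulated = sum(len(u.split()) for u in utterances[:step])

    return {
        "position": step,
        "length": len(current_tokens),
        "author_index": entry_order.index(author),
        "author_frequency": authors[:step].count(author),
        "length_ratio": len(current_tokens) / accumulated if accumulated else 0.0,
        "title_length": len(title.split()),
    }


features = utterance_features(
    "App crashes on start",
    ["It crashes for me too", "Looks like a null config", "Adding a default config fixes it"],
    ["alice", "bob", "alice"],
    step=3,
)
```

These scalar features would be concatenated with the TF-IDF vectors before being passed to the random forest.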

Results are presented in Table 12. Results for the random baselines, random forest classifier, and our classifier are averaged across 3 trials. On the full and filtered test sets, our classifier achieves the highest overall accuracy. On the CS subset, the Second baseline outperforms our classifier, which we attribute to the small sample size with nearly 30% of the examples having . Nonetheless, RF and our classifier, which dynamically make content-driven predictions, manage to outperform other baselines across all 3 test sets for longer discussions, with higher values of . In general, our classifier still outperforms RF, which we attribute to the more complex transformer-based architecture yielding better utterance representations.

Appendix F Human Evaluation Setup

In the user study, users are shown the title of the bug report, all utterances up till (and including) , and the reference description in our dataset for the given example. We choose to provide this as a manual suggestion to help guide users in better understanding a bug report, for a software project with which they have minimal familiarity. However, we state in our instructions that this is merely provided for reference and is not necessarily the exact and only valid answer.

Next, we show them up to 5 model predictions and ask them to “select the one(s) which add(s) the most amount of useful information that will help resolve the bug, beyond just re-stating the problem itself.” We explain that we consider a description to be informative if it provides content that will be useful towards fixing the issue, beyond just rephrasing the problem itself. We also encourage users to select candidates based on content that is informative, rather than focusing on exact phrasing. If all candidates appear to be poor (completely unrelated to resolving the bug, uninformative, incomprehensible, or plain wrong), users are asked to select another option: “All candidates are poor.” If there is no useful information towards resolving the bug in the context and they are unable to evaluate candidate descriptions, they are asked to select another option: “The context does not have any useful information for resolving the bug.” They must also justify their selection by writing a brief rationale.

This is a challenging task, as it requires reading through and reasoning about a large amount of text to evaluate each example. To prepare annotators, we first present a set of training examples and a training video in which we demonstrate how the task should be completed.

Appendix G Analyzing CS Subset

We present automated metrics for the CS subset in Table 13. Results are analogous to the full test set, except that the numbers are generally lower for all models other than for PLBART (F), which achieves consistent performance. PLBART (F) slightly underperforms PLBART on automated metrics overall. However, this is because these metrics are computed against the single reference description, which could diverge from how the solution is formulated in the discussion since the developer could have written an uninformative/generic description. To do more fine-grained analysis, in Figure 2, we plot automated metrics for varying percentages of token overlap between the reference description and (excluding tokens already present in the title which have been used to state the problem). Higher overlap suggests that the reference description draws more content from within the discussion. For higher percentages, PLBART (F) generally achieves higher scores against the reference than PLBART and all other models, indicating that this model is better at gathering information from within the discussion. In Table 14, we supplement the n-gram analysis from Section 4.4.

Appendix H Reproducibility Checklist

h.1 Validation Performance

We report performances on the full validation set. Results for the generation task are in Table 15 and results for classification are in Table 16.

Appendix I Hyperparameters

All neural models were implemented using PyTorch. For S2S + Ptr, Hier S2S + Ptr, and our classifier, we use a batch size of 8, an initial learning rate of 3e-05, and a dropout rate of 0.2. Our transformer models have 4 encoder and decoder layers, 4 heads in multi-head attention, a hidden size of 64, and feedforward hidden size 256. For our classifier, we downweight augmentation data by a factor of 0.7. The number of linear classification layers for this model is 3, which have dimension 128. We use Adam as the optimizer and have a learning rate scheduler with gamma 0.95 which decays after an epoch if the validation loss has not improved. We use early stopping with patience 5 during training.
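The interaction between the learning rate decay and early stopping can be illustrated with a small simulation over a sequence of validation losses. This is a sketch under assumptions: "not improved" is taken to mean no improvement over the best loss so far, and the patience counter resets on every improvement. Neither detail is stated in the text.

```python
def simulate_schedule(val_losses, initial_lr=3e-5, gamma=0.95, patience=5):
    """Return (final learning rate, 0-based epoch at which training stops or None)."""
    lr = initial_lr
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            # Decay after an epoch in which validation loss did not improve.
            lr *= gamma
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return lr, epoch
    return lr, None


final_lr, stopped_at = simulate_schedule(
    [1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99, 0.5]
)
```

In this toy run the loss stops improving after the second epoch, so the rate is decayed five times before early stopping triggers, and the final (better) loss is never reached.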

i.1 Tuning

For S2S + Ptr, Hier S2S + Ptr, and our classifier, hyperparameters are tuned manually. For batch size, we consider {8, 16, 32}, learning rate {1e-03, 1e-04, 3e-05}, dropout {0.1, 0.2, 0.4, 0.5, 0.6}, encoder/decoder layers {2, 4, 6, 8}, number of heads {2, 4, 8}, hidden sizes {32, 64, 128}, feedforward dimensions {128, 256, 512}, data augmentation weight {0.1, 0.3, 0.5, 0.7, 0.9, 1.0}, number of classification layers {1, 2, 3}, classification dimension size {64, 128, 256, 512}. These hyperparameters are tuned on validation data, using accuracy for When2Describe and the text generation metrics mentioned in Section 4.2 for How2Describe. For tuning, we do not do grid search but rather compare performances between models trained with identical configurations, with the exception of a single parameter. Therefore, the number of hyperparameter tuning runs scales linearly. We ran each configuration once.
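To make the linear-versus-exponential contrast concrete, the sketch below counts runs for the search space listed above, using only the number of candidate values per hyperparameter. It assumes one shared base configuration and pools all hyperparameters together, although some apply only to the classifier. The paper does not report the actual number of runs, so these counts are illustrative.

```python
from math import prod

# Number of candidate values per hyperparameter, as listed in the text.
OPTIONS_PER_HYPERPARAMETER = {
    "batch size": 3,
    "learning rate": 3,
    "dropout": 5,
    "encoder/decoder layers": 4,
    "attention heads": 3,
    "hidden size": 3,
    "feedforward dimension": 3,
    "augmentation weight": 6,
    "classification layers": 3,
    "classification dimension": 4,
}


def one_at_a_time_runs(options):
    """One base run, plus one run per alternative value of each hyperparameter."""
    return 1 + sum(n - 1 for n in options.values())


def grid_search_runs(options):
    """Every combination of values."""
    return prod(options.values())


linear_runs = one_at_a_time_runs(OPTIONS_PER_HYPERPARAMETER)
grid_runs = grid_search_runs(OPTIONS_PER_HYPERPARAMETER)
```

Under these assumptions, varying one parameter at a time needs 28 runs, against 349,920 for an exhaustive grid.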

For PLBART fine-tuning, we use the same configurations as the scripts released by AhmadPLBART.

Copy Title 12.6 12.2 22.1
S2S + Ptr 11.6 8.9 23.1
Hier S2S + Ptr 12.0 9.0 22.9
PLBART 14.6 13.2 26.0
PLBART (F) 14.2 12.3 25.1

Table 13: Automated metrics for generation on CS subset. Differences that are not statistically significant are indicated with matching symbols.

i.2 Random Seeds

For the randomly initialized models, random seeds are set according to the timestamp, and we average results across 3 trials. For S2S + Ptr, the seeds were: 1620001129, 1620001158, and 1620004022. For Hier S2S + Ptr, the seeds were: 1620001125, 1620001159, and 1620004024. For our classifier, the seeds were: 1620037490, 1620037278, and 1620128612.

i.3 Model Sizes

We present model sizes for the neural models that we introduce in this paper: the S2S + Ptr model has 3,033,728 parameters, the Hier S2S + Ptr model has 3,095,936, and our classifier has 2,334,530.

Appendix J Statistical Significance Testing

We compute statistical significance using bootstrap tests berg-kirkpatrick-etal-2012-empirical with and 10,000 samples of size 5,000 each.
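A paired bootstrap test in this style can be sketched as follows. This is a simplified illustration, not the authors' script: it assumes per-example scores that are averaged, whereas corpus-level metrics such as BLEU would have to be recomputed on each resample. The default sample count and sample size mirror the numbers in the text; the function name and seed are hypothetical.

```python
import random


def paired_bootstrap_p_value(scores_a, scores_b, num_samples=10_000,
                             sample_size=5_000, seed=0):
    """Fraction of resamples whose mean gain exceeds twice the observed gain."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equally sized, non-empty score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    exceed = 0
    for _ in range(num_samples):
        # Resample paired differences with replacement.
        sample = [diffs[rng.randrange(len(diffs))] for _ in range(sample_size)]
        if sum(sample) / sample_size > 2 * observed:
            exceed += 1
    return exceed / num_samples
```

A small returned value suggests the observed gain of system A over system B is unlikely to be an artifact of the particular test sample. In pure Python the default settings draw 50 million indices, so smaller values are advisable for quick checks.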

j.1 Running Times

We report average training time, inference time, and number of epochs for the various models we consider in this work. The two PLBART models were fine-tuned on NVIDIA DGX GPUs (32 GB) and all other models were trained and evaluated on GeForce GTX Titan GPUs (8 GB). All models used single-GPU training.

Figure 2 (panels: (a) BLEU-4, (c) ROUGE-1, (d) ROUGE-2): Metrics for CS subset, with buckets corresponding to the % of tokens in reference description which also appear in (disregarding title tokens). Bucket 10 corresponds to [0, 10)%, 20 corresponds to [10, 20)%, etc.
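The bucketing used in Figure 2 can be expressed as a short function. This sketch assumes whitespace tokenization and set-based matching. It also assumes that an overlap of exactly 100% is folded into the top bucket, since the caption does not say how that edge case is handled.

```python
def overlap_bucket(reference, context, title):
    """Bucket label (10, 20, ..., 100) for the reference/context token overlap."""
    title_tokens = set(title.lower().split())
    context_tokens = set(context.lower().split())
    # Disregard reference tokens that already appear in the title.
    ref_tokens = [t for t in reference.lower().split() if t not in title_tokens]
    if not ref_tokens:
        return None
    pct = 100.0 * sum(t in context_tokens for t in ref_tokens) / len(ref_tokens)
    # Bucket 10 covers [0, 10), bucket 20 covers [10, 20), and so on.
    return min(int(pct // 10) * 10 + 10, 100)


bucket = overlap_bucket(
    reference="add null check to loader",
    context="we need a null check",
    title="loader crash",
)
```

Here two of the four non-title reference tokens appear in the context (50%), which falls in bucket 60, i.e. [50, 60)%.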
Title only
Model 1 2 3 4 1 2 3 4
Full Copy Title 100.0 100.0 100.0 100.0 0.0 0.0 0.0 0.0
S2S + Ptr 65.6 34.4 39.3 46.5 28.6 24.9 27.0 25.0
Hier S2S + Ptr 60.2 33.9 41.1 50.4 37.4 27.9 28.3 29.2
PLBART 79.3 75.0 72.5 71.7 30.7 34.8 34.6 39.9
PLBART (F) 43.2 37.4 38.3 43.1 47.1 38.1 35.6 37.2
Reference 35.1 30.9 33.5 37.7 34.5 22.2 22.2 25.3

Copy Title 100.0 100.0 100.0 100.0 0.0 0.0 0.0 0.0
S2S + Ptr 64.5 33.8 39.1 38.3 29.4 25.3 23.8 0.0
Hier S2S + Ptr 58.4 33.3 39.3 45.7 40.4 28.4 30.0 29.2
PLBART 76.9 73.4 71.1 70.4 34.0 37.0 36.3 41.2
PLBART (F) 38.4 33.9 35.2 40.7 51.0 40.0 36.6 38.1
Reference 23.7 18.6 18.4 16.3 40.1 22.8 21.4 23.0

Copy Title 100.0 100.0 100.0 100.0 0.0 0.0 0.0 0.0
S2S + Ptr 64.8 37.1 38.5 22.5 31.6 25.3 33.1 25.0
Hier S2S + Ptr 60.3 34.2 37.9 28.3 38.7 26.1 29.2 0.0
PLBART 80.8 77.7 72.8 70.3 31.0 41.4 37.0 50.0
PLBART (F) 36.9 28.4 30.8 34.1 52.8 42.3 39.4 45.0
Reference 32.7 22.2 26.2 35.6 38.8 25.4 23.1 27.1

Table 14: Percent of unigrams, bigrams, trigrams and 4-grams in the prediction (or reference) which appear in the title and in only (excluding the title). Lower is better for the title and higher is better for only.
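The n-gram statistics in Table 14 can be approximated with the following sketch. It assumes whitespace tokenization and counts every n-gram occurrence in the prediction; the exact tokenization and counting conventions used for the table are not stated.

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(prediction, title, context, n):
    """Percent of prediction n-grams found in the title, and in the context only."""
    pred = ngrams(prediction.lower().split(), n)
    if not pred:
        return 0.0, 0.0
    title_set = set(ngrams(title.lower().split(), n))
    # "Context only" excludes n-grams that also occur in the title.
    context_only = set(ngrams(context.lower().split(), n)) - title_set
    in_title = 100.0 * sum(g in title_set for g in pred) / len(pred)
    in_context = 100.0 * sum(g in context_only for g in pred) / len(pred)
    return in_title, in_context


unigram_title, unigram_context = ngram_overlap(
    "add null check", "null pointer crash", "we should add a null check", 1
)
bigram_title, bigram_context = ngram_overlap(
    "add null check", "null pointer crash", "we should add a null check", 2
)
```

A model that merely copies the title scores 100% on the first number and 0% on the second, which matches the Copy Title rows.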

Copy Title 15.223 13.645 28.088 12.322 25.341
S2S + Ptr 12.896 10.241 27.757 4.571 25.921
Hier S2S + Ptr 12.758 10.119 28.722 3.934 25.275
PLBART 16.924 14.979 32.152 12.124 29.623
PLBART (F) 15.059 13.057 29.107 9.111 26.710

Table 15: Scores for generation @ on the 1,232 examples in the full validation set.
Model None
First 37.013 62.987 0.000
Second 26.461 36.526 37.013
Rand (uni) 27.733 45.806 26.461
Rand (dist) 28.003 48.512 23.485
RF 24.188 15.287 60.525
Ours 39.665 40.882 19.453

Table 16: Scores for classification on the 1,232 examples in the full validation set.
Model Train Eval Epoch
Our classifier 2:23:18 0:00:15 14.7
S2S + Ptr 2:56:19 0:01:12 52.0
Hier S2S + Ptr 4:47:34 0:01:22 51.0
PLBART (fine-tuning) 0:32:07 0:00:25 11.0
PLBART (F) (fine-tuning) 0:26:08 0:00:28 15.0

Table 17: Average training time, inference time, and number of epochs (over 3 random restarts) for all models we train. We also report fine-tuning time for PLBART models. Format for time is H:M:S.