CodeExp: Explanatory Code Document Generation

11/25/2022
by Haotian Cui, et al.

Developing models that can automatically generate detailed code explanations can greatly benefit software maintenance and programming education. However, existing code-to-text generation models often produce only high-level summaries that do not capture the implementation-level choices essential for these scenarios. To fill this gap, we propose the code explanation generation task. We first conducted a human study to identify the criteria for high-quality explanatory docstrings for code. Based on these criteria, we collected and refined a large-scale code docstring corpus and formulated automatic evaluation metrics that best match human assessments. Finally, we present a multi-stage fine-tuning strategy and baseline models for the task. Our experiments show that (1) our refined training dataset lets models achieve better performance on the explanation generation task than an unrefined dataset 15x larger, and (2) fine-tuned models can generate well-structured long docstrings comparable to human-written ones. We envision that our training dataset, human-evaluation protocol, recommended metrics, and fine-tuning strategy can boost future code explanation research. The code and annotated data are available at https://github.com/subercui/CodeExp.
