Learning code summarization from a small and local dataset

06/02/2022
by   Toufique Ahmed, et al.

Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of labeled examples, typically drawn from many projects. However, software phenomena can be very project-specific: vocabulary and other phenomena vary substantially from project to project. Thus, training on project-specific data, and testing on the same project, is a promising idea. This hypothesis has to be evaluated carefully, e.g., in a time-series setting, to prevent training-test leakage. We compare several models and training approaches, including same-project training, cross-project training, training a model especially designed to be sample efficient (and thus prima facie well-suited for learning in a limited-sample same-project setting), and a maximalist hybrid approach that fine-tunes first on many projects in many languages and then trains on the same project. We find that the maximalist hybrid setting provides consistent, substantial gains over the state-of-the-art, on many different projects in both Java and Python.
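
To make the time-series point concrete, here is a minimal sketch of a time-ordered split for one project's samples, so that every training example predates every test example. It is an illustration only and not the authors' evaluation code: the `Sample` record, its `timestamp` field, the `time_ordered_split` helper, and the 80/20 ratio are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    """Hypothetical record: one function and its summary from a single project."""
    code: str
    summary: str
    timestamp: int  # assumed: creation time of the function, e.g. Unix seconds


def time_ordered_split(
    samples: List[Sample], train_frac: float = 0.8
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples chronologically: the oldest go to training, the newest to test.

    Unlike a random split, no test sample is older than any training sample,
    so the model is never trained on "future" code from the same project.
    """
    ordered = sorted(samples, key=lambda s: s.timestamp)
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


# Usage with made-up data: ten samples with shuffled timestamps.
samples = [
    Sample(code=f"def f{t}(): pass", summary=f"function {t}", timestamp=t)
    for t in [5, 1, 9, 3, 7, 2, 8, 4, 10, 6]
]
train, test = time_ordered_split(samples)
```

A random shuffle would instead mix older and newer functions across the two sets, which is the kind of training-test leakage the abstract warns about.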

research
06/07/2022

Assessing Project-Level Fine-Tuning of ML4SE Models

Machine Learning for Software Engineering (ML4SE) is an actively growing...
research
07/09/2022

Few-shot training LLMs for project-specific code-summarization

Very large language models (LLMs), such as GPT-3 and Codex have achieved...
research
10/21/2022

Low-Resources Project-Specific Code Summarization

Code summarization generates brief natural language descriptions of sour...
research
08/29/2022

Exploring and Evaluating Personalized Models for Code Generation

Large Transformer models achieved the state-of-the-art status for Natura...
research
06/19/2023

Understanding the Challenges of Deploying Live-Traceability Solutions

Software traceability is the process of establishing and maintaining rel...
research
04/15/2022

Evaluating few shot and Contrastive learning Methods for Code Clone Detection

Context: Code Clone Detection (CCD) is a software engineering task that ...
research
03/16/2023

Measuring Improvement of F_1-Scores in Detection of Self-Admitted Technical Debt

Artificial Intelligence and Machine Learning have witnessed rapid, signi...
