Data Augmentation Approaches for Source Code Models: A Survey

05/31/2023
by   Terry Yue Zhuo, et al.
The increasingly popular adoption of source code models in many critical tasks motivates the development of data augmentation (DA) techniques to enhance training data and improve various capabilities (e.g., robustness and generalizability) of these models. Although a series of DA methods have been proposed and tailored for source code models, the field lacks a comprehensive survey and examination of their effectiveness and implications. This paper fills that gap with a comprehensive and integrative survey of data augmentation for source code, in which we systematically compile and summarize the existing literature to provide an overview of the field. We start by constructing a taxonomy of DA approaches for source code models, followed by a discussion of prominent, methodologically illustrative approaches. Next, we highlight the general strategies and techniques for optimizing DA quality. Subsequently, we underscore techniques that find utility in widely accepted source code scenarios and downstream tasks. Finally, we outline the prevailing challenges and potential opportunities for future research. In essence, this paper endeavors to demystify the existing literature on DA for source code models and to foster further exploration in this area. Complementing this, we maintain a continually updated GitHub repository that hosts a list of up-to-date papers on DA for source code models, accessible at <https://github.com/terryyz/DataAug4Code>.


