Enhancing Code Classification by Mixup-Based Data Augmentation

10/06/2022
by   Zeming Dong, et al.

Deep neural networks (DNNs) have recently been widely applied to programming language understanding. Training a DNN to competitive performance generally requires a large amount of high-quality labeled data, but collecting and labeling such data is time-consuming and labor-intensive. Data augmentation, which enlarges the training set through techniques such as adversarial example generation, is a popular remedy, yet few works apply it to programming language-related tasks. In this paper, we propose MixCode, a Mixup-based data augmentation approach for source code classification. First, we apply multiple code refactoring methods to generate label-consistent code data. Second, we use the Mixup technique to mix the original code with the transformed code, and the mixed samples form the new training data for the model. We evaluate MixCode on two programming languages (Java and Python), two code tasks (problem classification and bug detection), four datasets (JAVA250, Python800, CodRep1, and Refactory), and five model architectures. Experimental results show that MixCode outperforms the standard data augmentation baseline by up to 6.24% in accuracy and up to 26.06% in robustness.
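The mixing step can be illustrated with a minimal sketch of generic Mixup, which linearly interpolates two training samples and their labels. The abstract does not specify how MixCode represents code or selects pairs, so everything here is an assumption for illustration: the function name `mixup`, the toy embedding vectors, the one-hot labels, and the choice of a Beta(alpha, alpha) distribution with alpha = 0.2 for the mixing ratio.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Generic Mixup: interpolate two (vector, one-hot label) pairs.

    Hypothetical helper, not the paper's implementation.
    """
    # Mixing ratio drawn from Beta(alpha, alpha); always lies in [0, 1].
    lam = rng.betavariate(alpha, alpha)
    # Interpolate the inputs element by element.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    # Interpolate the labels with the same ratio, giving a soft label.
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


# Toy data (assumed): an embedding of an original program and an
# embedding of a refactored program, each with a one-hot class label.
x_original = [0.2, 0.8, 0.5]
y_original = [1.0, 0.0]
x_refactored = [0.6, 0.1, 0.9]
y_refactored = [0.0, 1.0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
x_mix, y_mix, lam = mixup(
    x_original, y_original, x_refactored, y_refactored, alpha=0.2, rng=rng
)
print("lambda:", lam)
print("mixed input:", x_mix)
print("mixed label:", y_mix)
```

Because both the inputs and the labels are combined with the same ratio, the mixed label remains a valid probability distribution, and the model is trained on points that lie between the two original samples.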

research
03/13/2023

Boosting Source Code Learning with Data Augmentation: An Empirical Study

The next era of program understanding is being propelled by the use of m...
research
10/06/2022

Enhancing Mixup-Based Graph Learning for Language Processing via Hybrid Pooling

Graph neural networks (GNNs) have recently been popular in natural langu...
research
02/02/2023

How to choose "Good" Samples for Text Data Augmentation

Deep learning-based text classification models need abundant labeled dat...
research
08/22/2020

Self-Competitive Neural Networks

Deep Neural Networks (DNNs) have improved the accuracy of classification...
research
10/04/2020

Reverse Operation based Data Augmentation for Solving Math Word Problems

Automatically solving math word problems is a critical task in the field...
research
04/07/2022

Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation

Code search aims to retrieve the most semantically relevant code snippet...
research
10/27/2022

An Adversarial Active Sampling-based Data Augmentation Framework for Manufacturable Chip Design

Lithography modeling is a crucial problem in chip design to ensure a chi...
