Active Code Learning: Benchmarking Sample-Efficient Training of Code Models

06/02/2023
by Qiang Hu, et al.

The costly human effort required to prepare training data for machine learning (ML) models hinders their practical development and use in software engineering (ML4Code), especially for teams with limited labeling budgets. Training code models efficiently, with less human effort, has therefore become a pressing problem. Active learning addresses this issue by letting developers train a model on a reduced, carefully selected subset of the data while still reaching the desired performance; it has been studied extensively in computer vision and natural language processing, but no prior work examines its effectiveness for code models. In this paper, we bridge this gap by building the first benchmark for this problem, which we call active code learning. Specifically, we collect 11 acquisition functions (the criteria active learning uses to select which data to label) from existing work and adapt them to code-related tasks. We then conduct an empirical study to check whether these acquisition functions retain their effectiveness on code data. The results show that the choice of feature representation strongly affects active learning, and that using the model's output vectors to select data is the best option. For the code summarization task, however, active code learning is ineffective, producing models that fall more than 29.64% short of the expected performance. Finally, we explore future directions for active code learning in an exploratory study: we propose replacing distance calculation methods with evaluation metrics and find a correlation between these evaluation-based distance methods and the resulting code models' performance.
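To make the role of an acquisition function concrete, the sketch below shows one round of an active learning loop that scores the unlabeled pool with an output-vector-based criterion (predictive entropy) and queries labels for the highest-scoring samples. This is only an illustration of the general technique, not the paper's implementation: the entropy criterion, the scikit-learn-style `fit`/`predict_proba` interface, and the `oracle` labeling function are assumptions.

```python
# Minimal active-learning sketch: one selection round driven by an
# output-vector-based acquisition function (max predictive entropy).
# Names and interfaces are illustrative assumptions, not the paper's code.
import numpy as np


def entropy_acquisition(probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` unlabeled samples whose output vectors
    (softmax probabilities) have the highest entropy."""
    eps = 1e-12
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return np.argsort(-entropy)[:budget]


def active_learning_round(model, labeled_x, labeled_y, unlabeled_x, budget, oracle):
    """One round: retrain, score the unlabeled pool, label the selected samples."""
    model.fit(labeled_x, labeled_y)                 # retrain on currently labeled data
    probs = model.predict_proba(unlabeled_x)        # output vectors for the pool
    picked = entropy_acquisition(probs, budget)     # acquisition function chooses samples
    new_x = unlabeled_x[picked]
    new_y = oracle(new_x)                           # human (or simulated) labeling step
    labeled_x = np.concatenate([labeled_x, new_x])
    labeled_y = np.concatenate([labeled_y, new_y])
    unlabeled_x = np.delete(unlabeled_x, picked, axis=0)
    return labeled_x, labeled_y, unlabeled_x
```

Repeating this round until the labeling budget is exhausted yields a model trained on only a fraction of the data, which is the setting the benchmark evaluates across the 11 acquisition functions.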


