CCT5: A Code-Change-Oriented Pre-Trained Model

05/18/2023
by Bo Lin, et al.

Software is constantly changing, requiring developers to perform a number of derived tasks in a timely manner, such as writing a description of the intention behind a code change or identifying defect-prone code changes. Since the cost of dealing with these tasks can account for a large proportion (typically around 70 percent) of total development expenditure, automating such processes would significantly lighten the burden on developers. To achieve this, existing approaches mainly rely on either training deep learning models from scratch or fine-tuning existing pre-trained models on such tasks, both of which have weaknesses. Specifically, the former uses comparatively small-scale labelled data for training, making it difficult to learn and exploit the domain knowledge of programming languages hidden in the vast amount of unlabelled code in the wild; the latter struggles to fully leverage the learned knowledge of the pre-trained model, as existing pre-trained models are designed to encode a single code snippet rather than a code change (i.e., the difference between two code snippets). We propose to pre-train a model specially designed for code changes to better support developers in software maintenance. To this end, we first collect a large-scale dataset containing more than 1.5M pairs of code changes and commit messages. Based on these data, we curate five different tasks for pre-training, which equip the model with diverse domain knowledge about code changes. We then fine-tune the pre-trained model, CCT5, on three widely studied tasks triggered by code changes and two tasks specific to the code review process. Results show that CCT5 outperforms both conventional deep learning approaches and existing pre-trained models on these tasks.
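To make the pre-train-then-fine-tune recipe described in the abstract concrete, the sketch below shows one way to fine-tune a CodeT5-style encoder-decoder on commit message generation, feeding a code change (as a unified diff) to the encoder and training the decoder on the commit message. This is a minimal illustration, not the authors' released code: the Salesforce/codet5-base checkpoint is a stand-in assumption, and CCT5's own checkpoint, tokenizer, and diff encoding may differ.

```python
# Minimal sketch (not the authors' code): one supervised fine-tuning step
# for commit message generation with a CodeT5-style encoder-decoder.
# "Salesforce/codet5-base" is a stand-in checkpoint; CCT5's own weights,
# tokenizer, and input format may differ.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Input: a code change represented as a unified diff; target: its commit message.
diff = (
    "-    return a / b\n"
    "+    if b == 0:\n"
    "+        raise ValueError('division by zero')\n"
    "+    return a / b\n"
)
message = "Guard against division by zero in divide()"

inputs = tokenizer(diff, return_tensors="pt", truncation=True, max_length=512)
labels = tokenizer(message, return_tensors="pt", truncation=True, max_length=64).input_ids

# Cross-entropy loss over the commit message tokens; an optimizer step would follow.
loss = model(**inputs, labels=labels).loss
loss.backward()
```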

Related research

10/31/2022 - CodeEditor: Learning to Edit Source Code with Pre-trained Models
Developers often perform repetitive code editing activities for various ...

01/05/2022 - SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations
Recent years have seen the successful application of large pre-trained m...

02/19/2020 - CodeBERT: A Pre-Trained Model for Programming and Natural Languages
We present CodeBERT, a bimodal pre-trained model for programming languag...

03/17/2022 - CodeReviewer: Pre-Training for Automating Code Review Activities
Code review is an essential part to software development lifecycle since...

08/21/2023 - EALink: An Efficient and Accurate Pre-trained Framework for Issue-Commit Link Recovery
Issue-commit links, as a type of software traceability links, play a vit...

01/18/2022 - Using Pre-Trained Models to Boost Code Review Automation
Code review is a practice widely adopted in open source and industrial p...

10/11/2022 - Extracting Meaningful Attention on Source Code: An Empirical Study of Developer and Neural Model Code Exploration
The high effectiveness of neural models of code, such as OpenAI Codex an...
