Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding

12/04/2021
by Deze Wang, et al.

With the great success of pre-trained models, the pretrain-then-finetune paradigm has been widely adopted for downstream tasks in source code understanding. However, compared with the cost of training a large-scale model from scratch, how to effectively adapt pre-trained models to a new task has not been fully explored. In this paper, we propose an approach that bridges pre-trained models and code-related tasks. We exploit semantic-preserving transformations to enrich the diversity of downstream data and to help pre-trained models learn semantic features that are invariant to these semantically equivalent transformations. Furthermore, we introduce curriculum learning to organize the transformed data in an easy-to-hard order when fine-tuning existing pre-trained models. We apply our approach to a range of pre-trained models, and they significantly outperform state-of-the-art models on source code understanding tasks such as algorithm classification, code clone detection, and code search. Our experiments even show that, without heavy pre-training on code data, the natural-language pre-trained model RoBERTa fine-tuned with our lightweight approach can outperform or rival existing code pre-trained models, such as CodeBERT and GraphCodeBERT, fine-tuned on the above tasks. This finding suggests that there is still considerable room for improvement in code pre-trained models.
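The abstract combines two ideas: semantic-preserving transformations of downstream code data and curriculum-ordered fine-tuning. The sketch below is only illustrative, assuming a simple identifier-renaming transformation and token count as a difficulty proxy; the paper's actual transformations and ordering criterion may differ, and all names in the code are hypothetical.

```python
import builtins
import keyword
import re

# Identifiers we must not rename: Python keywords and builtins.
_RESERVED = set(keyword.kwlist) | set(dir(builtins))


def rename_identifiers(code: str) -> str:
    """Rename user-defined identifiers to generic names (var_0, var_1, ...).

    A naive, regex-based example of a semantic-preserving transformation:
    the surface form changes while program behavior stays the same
    (string literals aside).
    """
    found = [t for t in re.findall(r"\b[A-Za-z_]\w*\b", code) if t not in _RESERVED]
    mapping = {}
    for name in found:  # deterministic: order of first appearance
        if name not in mapping:
            mapping[name] = f"var_{len(mapping)}"
    return re.sub(r"\b[A-Za-z_]\w*\b",
                  lambda m: mapping.get(m.group(0), m.group(0)), code)


def difficulty(code: str) -> int:
    """Difficulty proxy used here: token count (the paper's criterion may differ)."""
    return len(code.split())


def build_curriculum(snippets: list[str]) -> list[str]:
    """Augment each snippet with a renamed variant, then order easy-to-hard."""
    augmented = snippets + [rename_identifiers(s) for s in snippets]
    return sorted(augmented, key=difficulty)


if __name__ == "__main__":
    data = [
        "def add(a, b): return a + b",
        "def mean(xs): total = sum(xs); return total / len(xs)",
    ]
    for example in build_curriculum(data):
        print(example)
```

In this toy setup, the fine-tuning loop would simply consume `build_curriculum(data)` in order, so the model sees shorter, simpler examples before longer, transformed ones.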


Related research

SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations (01/05/2022)
Recent years have seen the successful application of large pre-trained m...

Pre-trained Adversarial Perturbations (10/07/2022)
Self-supervised pre-training has drawn increasing attention in recent ye...

NatGen: Generative pre-training by "Naturalizing" source code (06/15/2022)
Pre-trained Generative Language models (e.g. PLBART, CodeT5, SPT-Code) f...

TRACED: Execution-aware Pre-training for Source Code (06/13/2023)
Most existing pre-trained language models for source code focus on learn...

CodeEditor: Learning to Edit Source Code with Pre-trained Models (10/31/2022)
Developers often perform repetitive code editing activities for various ...

Inadequately Pre-trained Models are Better Feature Extractors (03/09/2022)
Pre-training has been a popular learning paradigm in deep learning era, ...

MathBERT: A Pre-Trained Model for Mathematical Formula Understanding (05/02/2021)
Large-scale pre-trained models like BERT, have obtained a great success ...
