TRACED: Execution-aware Pre-training for Source Code

06/13/2023
by   Yangruibo Ding, et al.
0

Most existing pre-trained language models for source code focus on learning the static code text, typically augmented with static code structures (abstract syntax tree, dependency graphs, etc.). However, program semantics will not be fully exposed before the real execution. Without an understanding of the program execution, statically pre-trained models fail to comprehensively capture the dynamic code properties, such as the branch coverage and the runtime variable values, and they are consequently less effective at code understanding tasks, such as retrieving semantic clones and detecting software vulnerabilities. To close the gap between the static nature of language models and the dynamic characteristics of programs, we introduce TRACED, an execution-aware pre-training strategy for source code. Specifically, we pre-train code language models with a combination of source code, executable inputs, and corresponding execution traces. Our goal is to teach code models the complicated execution logic during the pre-training, enabling the model to statically estimate the dynamic code properties without repeatedly executing code during task-specific fine-tuning. To illustrate the effectiveness of our proposed approach, we fine-tune and evaluate TRACED on three downstream tasks: static execution estimation, clone retrieval, and vulnerability detection. The empirical results show that TRACED relatively improves the statically pre-trained code models by 12.4 complete execution path prediction and by 25.2 predictions. TRACED also significantly outperforms statically pre-trained models in clone retrieval and vulnerability detection across four public benchmarks.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
12/04/2021

Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding

With the great success of pre-trained models, the pretrain-then-finetune...
research
05/08/2023

Code Execution with Pre-trained Language Models

Code execution is a fundamental aspect of programming language semantics...
research
07/25/2023

Predicting Code Coverage without Execution

Code coverage is a widely used metric for quantifying the extent to whic...
research
10/08/2021

Towards Learning (Dis)-Similarity of Source Code from Program Contrasts

Understanding the functional (dis)-similarity of source code is signific...
research
02/05/2023

LExecutor: Learning-Guided Execution

Executing code is essential for various program analysis tasks, e.g., to...
research
06/17/2022

Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection

Code clones are pairs of code snippets that implement similar functional...
research
09/09/2023

FAIR: Flow Type-Aware Pre-Training of Compiler Intermediate Representations

While the majority of existing pre-trained models from code learn source...

Please sign up or login with your details

Forgot password? Click here to reset