What do pre-trained code models know about code?

08/25/2021
by Anjan Karmakar, et al.

Pre-trained models of code built on the transformer architecture have performed well on software engineering (SE) tasks such as predictive code generation and code summarization, among others. However, whether the vector representations from these pre-trained models comprehensively encode characteristics of source code well enough to be applicable to a broad spectrum of downstream tasks remains an open question. One way to investigate this is with diagnostic tasks called probes. In this paper, we construct four probing tasks (probing for surface-level, syntactic, structural, and semantic information) for pre-trained code models. We show how probes can be used to identify whether models are deficient in (understanding) certain code properties, to characterize different model layers, and to gain insight into model sample-efficiency. We probe four models that vary in their expected knowledge of code properties: BERT (pre-trained on English), CodeBERT and CodeBERTa (pre-trained on source code and natural language documentation), and GraphCodeBERT (pre-trained on source code with dataflow). While GraphCodeBERT performs more consistently overall, we find that BERT performs surprisingly well on some code tasks, which calls for further investigation.
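To make the probing methodology concrete, below is a minimal sketch of a diagnostic probe. It is not the authors' exact setup, tasks, or data: it freezes a pre-trained code model (CodeBERT here), extracts fixed-size embeddings for code snippets, and trains a small linear classifier (the "probe") to predict a toy surface-level property (whether the snippet defines a function). The snippets and labels are illustrative placeholders only.

```python
# A minimal probing sketch, assuming a setup similar in spirit to the
# paper's (not the authors' exact tasks or data): freeze a pre-trained
# code model and train a linear probe on its frozen representations.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()  # only the probe is trained; the model stays frozen

def embed(snippets, layer=-1):
    """Mean-pool one layer's hidden states into a vector per snippet.
    Probing layer by layer is how individual layers get characterized."""
    with torch.no_grad():
        enc = tokenizer(snippets, padding=True, truncation=True,
                        return_tensors="pt")
        out = model(**enc, output_hidden_states=True)
        hidden = out.hidden_states[layer]           # (batch, seq, dim)
        mask = enc["attention_mask"].unsqueeze(-1)  # mask out padding
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy dataset (illustrative only; real probes use thousands of
# examples and held-out test splits).
snippets = [
    "def add(a, b): return a + b",
    "def greet(name): print(name)",
    "x = [i * i for i in range(5)]",
    "total = sum(values) / len(values)",
]
labels = [1 if "def" in s else 0 for s in snippets]

probe = LogisticRegression(max_iter=1000).fit(embed(snippets), labels)
print("probe accuracy:", probe.score(embed(snippets), labels))
```

The underlying logic: if a simple classifier can recover a property from frozen embeddings, the representation is taken to encode that property. Comparing probe accuracy across properties, layers, and models is what allows conclusions like GraphCodeBERT's overall consistency or BERT's surprising performance on some code tasks.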

