CodeBERT: A Pre-Trained Model for Programming and Natural Languages

02/19/2020
by Zhangyin Feng, et al.

We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with a Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing and evaluate in a zero-shot setting where the parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.
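For concreteness, the sketch below illustrates the replaced-token-detection idea described in the abstract: a discriminator is trained to flag, token by token, whether each position in an NL-PL sequence holds the original token or a plausible replacement sampled from a generator. This is a minimal sketch under stated assumptions, not the paper's implementation: the model configuration is a hypothetical small one, the generator sampling step is assumed to have already produced corrupted_ids, and the loss weighting is illustrative.

```python
import torch
import torch.nn as nn
from transformers import RobertaConfig, RobertaModel

# Hypothetical small discriminator; the paper's architecture and hyperparameters differ.
config = RobertaConfig(vocab_size=50265, hidden_size=256, num_hidden_layers=4,
                       num_attention_heads=4, intermediate_size=1024)
discriminator = RobertaModel(config)
rtd_head = nn.Linear(config.hidden_size, 1)  # per-token "original vs. replaced" classifier


def rtd_loss(original_ids, corrupted_ids, attention_mask):
    """Replaced token detection: the discriminator reads a corrupted NL-PL sequence
    (some tokens swapped for plausible alternatives sampled from a generator) and
    predicts, for every position, whether the token was replaced."""
    # Label is 1 where the generator substituted a token, 0 where it is unchanged.
    labels = (original_ids != corrupted_ids).float()
    hidden = discriminator(corrupted_ids,
                           attention_mask=attention_mask).last_hidden_state
    logits = rtd_head(hidden).squeeze(-1)               # (batch, seq_len)
    per_token = nn.functional.binary_cross_entropy_with_logits(
        logits, labels, reduction="none")
    # Average only over real (non-padding) tokens.
    return (per_token * attention_mask).sum() / attention_mask.sum()
```

In the ELECTRA-style setup this objective follows, the generators are only used to produce corruptions during pre-training; the discriminator is the model that is kept and fine-tuned on downstream tasks such as code search.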


Related research

09/17/2020 · GraphCodeBERT: Pre-training Code Representations with Data Flow
Pre-trained models for programming language have achieved dramatic empir...

04/18/2022 · Zero-Shot Program Representation Learning
Learning program representations has been the core prerequisite of code ...

04/16/2023 · Automated Program Repair Based on Code Review: How do Pre-trained Transformer Models Perform?
Sequence-to-sequence models have been used to transform erroneous progra...

09/05/2023 · A study on the impact of pre-trained model on Just-In-Time defect prediction
Previous researchers conducting Just-In-Time (JIT) defect prediction tas...

05/18/2023 · CCT5: A Code-Change-Oriented Pre-Trained Model
Software is constantly changing, requiring developers to perform several...

02/01/2023 · CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models
Code generation models based on the pre-training and fine-tuning paradig...

03/10/2023 · Model-Agnostic Syntactical Information for Pre-Trained Programming Language Models
Pre-trained Programming Language Models (PPLMs) achieved many recent sta...
