SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation

08/10/2021
by   Xin Wang, et al.

Code representation learning, which aims to encode the semantics of source code into distributed vectors, plays an important role in recent deep-learning-based models for code intelligence. Recently, many pre-trained language models for source code (e.g., CuBERT and CodeBERT) have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code search, code clone detection, and program translation. Current approaches typically treat the source code as a plain sequence of tokens, or inject structural information (e.g., AST and data-flow) into sequential model pre-training. To further explore the properties of programming languages, this paper proposes SynCoBERT, a syntax-guided multi-modal contrastive pre-training approach for better code representations. Specifically, we design two novel pre-training objectives derived from the symbolic and syntactic properties of source code, i.e., Identifier Prediction (IP) and AST Edge Prediction (TEP), which predict identifiers and edges between two AST nodes, respectively. Meanwhile, to exploit the complementary information in semantically equivalent modalities of the code (i.e., code, comment, AST), we propose a multi-modal contrastive learning strategy that maximizes the mutual information among the different modalities. Extensive experiments on four downstream code intelligence tasks show that SynCoBERT advances the state-of-the-art with the same pre-training corpus and model size.
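
The multi-modal contrastive strategy and the AST Edge Prediction objective can be made concrete with a short sketch. The snippet below is an illustrative PyTorch sketch, not the paper's implementation: it assumes an encoder has already produced fixed-size embeddings for the code, comment, and AST views of each snippet, and it pairs an InfoNCE-style loss with in-batch negatives with a hypothetical bilinear head for scoring AST edges; the names, batch layout, and temperature are assumptions made here for illustration.

import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    # InfoNCE-style loss: the i-th row of `positive` is the positive for the
    # i-th anchor; every other row in the batch serves as a negative.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature      # (B, B) cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

def multimodal_contrastive_loss(code_emb, comment_emb, ast_emb):
    # Maximize agreement among the three semantically equivalent modalities
    # (code, comment, AST) of the same snippet.
    return (info_nce(code_emb, comment_emb)
            + info_nce(code_emb, ast_emb)
            + info_nce(comment_emb, ast_emb)) / 3.0

class EdgePredictionHead(torch.nn.Module):
    # Hypothetical head for AST Edge Prediction (TEP): scores whether an edge
    # exists between the representations of two AST nodes.
    def __init__(self, hidden=768):
        super().__init__()
        self.scorer = torch.nn.Bilinear(hidden, hidden, 1)

    def forward(self, node_a, node_b):
        return torch.sigmoid(self.scorer(node_a, node_b)).squeeze(-1)

# Toy usage with random tensors standing in for encoder outputs:
B, d = 8, 768
loss = multimodal_contrastive_loss(torch.randn(B, d),
                                   torch.randn(B, d),
                                   torch.randn(B, d))
edge_prob = EdgePredictionHead(d)(torch.randn(B, d), torch.randn(B, d))

An Identifier Prediction (IP) head could similarly be realized as a token-level binary classifier over the code sequence; how the individual losses are weighted and combined with the other pre-training objectives is a design choice not covered by this sketch.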


