
Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation

by Ensheng Shi, et al.

Code search aims to retrieve the most semantically relevant code snippet for a given natural language query. Recently, large-scale pre-trained code models such as CodeBERT and GraphCodeBERT have learned generic representations of source code and substantially improved performance on the code search task. However, high-quality sequence-level representations of code snippets have not been sufficiently explored. In this paper, we propose a new approach to code search that combines multimodal contrastive learning with soft data augmentation. Multimodal contrastive learning pulls together the representations of paired code snippets and queries and pushes apart those of unpaired ones. Data augmentation is critical for learning high-quality representations with contrastive learning, but existing work considers only semantic-preserving augmentations of source code. We instead propose soft data augmentation: dynamically masking and replacing some tokens in a code sequence to generate snippets that are similar to the original, but not necessarily semantic-preserving, and using them as positive samples for the paired query. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset covering six programming languages. The experimental results show that our approach significantly outperforms state-of-the-art methods. We also adapt our techniques to several pre-trained models, such as RoBERTa and CodeBERT, and significantly boost their performance on the code search task.
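The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the loss, the masking and replacement rates, or the encoder. The sketch assumes an InfoNCE-style loss over cosine similarities, a common choice for contrastive learning. The names `soft_augment`, `info_nce`, and `MASK`, the probabilities, and the temperature are all hypothetical.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask token; a real tokenizer defines its own


def soft_augment(tokens, vocab, mask_prob=0.1, replace_prob=0.1, rng=None):
    """Return a copy of `tokens` with some tokens masked or randomly replaced.

    The result is similar to the input but not guaranteed to preserve its
    semantics. The probabilities are illustrative assumptions.
    """
    rng = rng or random.Random()
    augmented = []
    for tok in tokens:
        r = rng.random()
        if r < mask_prob:
            augmented.append(MASK)               # mask this token
        elif r < mask_prob + replace_prob:
            augmented.append(rng.choice(vocab))  # swap in a random vocabulary token
        else:
            augmented.append(tok)                # keep the token unchanged
    return augmented


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query_vecs, code_vecs, temperature=0.07):
    """Mean InfoNCE-style loss for a batch of paired embeddings.

    The i-th query is paired with the i-th code vector; every other code
    vector in the batch serves as a negative for that query.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [cosine(q, c) / temperature for c in code_vecs]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])  # -log softmax at the paired index
    return sum(losses) / len(losses)
```

In a training loop built along these lines, each code snippet would be re-augmented on every step with `soft_augment`, encoded, and then scored against its query with `info_nce`. The loss is small when paired vectors are closer to each other than to the other vectors in the batch.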


