CodeRetriever: Unimodal and Bimodal Contrastive Learning

01/26/2022
by   Xiaonan Li, et al.

In this paper, we propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations, specifically for the code search task. For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on documentation and function names. For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs. Both contrastive objectives can fully leverage the large-scale code corpus for pre-training. Experimental results on several public benchmarks (e.g., CodeSearch and CoSQA) demonstrate the effectiveness of CodeRetriever in the zero-shot setting. By fine-tuning with domain- or language-specific downstream data, CodeRetriever achieves new state-of-the-art performance, with significant improvements over existing code pre-trained models. We will make the code, model checkpoints, and constructed datasets publicly available.
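To make the two objectives concrete, below is a minimal PyTorch sketch of how in-batch contrastive (InfoNCE) losses over such pairs are typically combined. The encoder, temperature, batch construction, and the unweighted sum of the two losses are illustrative assumptions, not the paper's exact formulation or released implementation.

import torch
import torch.nn.functional as F

def info_nce(anchor_emb, positive_emb, temperature=0.05):
    """In-batch InfoNCE: each anchor's positive is the same-index row of
    positive_emb; all other rows in the batch serve as negatives."""
    anchor_emb = F.normalize(anchor_emb, dim=-1)
    positive_emb = F.normalize(positive_emb, dim=-1)
    logits = anchor_emb @ positive_emb.T / temperature  # (B, B) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

def coderetriever_step(encoder, code_a, code_b, text, code_for_text):
    """One hypothetical training step combining both objectives.
    code_a / code_b: semantically matched code pairs (unimodal positives,
    e.g., functions paired via shared documentation or similar names).
    text / code_for_text: documentation or in-line comments paired with
    the code they describe (bimodal positives)."""
    # Unimodal objective: code-to-code contrastive loss.
    loss_uni = info_nce(encoder(code_a), encoder(code_b))
    # Bimodal objective: text-to-code contrastive loss.
    loss_bi = info_nce(encoder(text), encoder(code_for_text))
    return loss_uni + loss_bi

Here encoder is assumed to be a shared function/text encoder returning a (batch, dim) embedding tensor; in practice the two losses could also be weighted rather than summed.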


