Retriever and Ranker Framework with Probabilistic Hard Negative Sampling for Code Search

05/08/2023
by Hande Dong, et al.

Pretrained Language Models (PLMs) have emerged as the state-of-the-art paradigm for code search. The paradigm first pretrains the model on search-irrelevant tasks such as masked language modeling, then finetunes it on the search-relevant task. The typical finetuning method employs a dual-encoder architecture that encodes the query and the code into semantic embeddings separately and then computes their similarity from those embeddings. However, because the two inputs are encoded independently, the dual-encoder cannot model token-level interactions between query and code, which limits the model's capabilities. In this paper, we propose a novel approach to address this limitation, introducing a cross-encoder architecture for code search that encodes query and code jointly to model their semantic matching. We further introduce a Retriever-Ranker (RR) framework that cascades the dual-encoder and the cross-encoder to improve the efficiency of evaluation and online serving. Moreover, we present a probabilistic hard negative sampling method to improve the cross-encoder's ability to distinguish hard negative codes, which further enhances the cascaded RR framework. Experiments on four datasets using three code PLMs demonstrate the superiority of our proposed method.
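The cascade and the sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the sampling distribution, so the softmax over retriever scores, the `temperature` parameter, and all function names below are assumptions made for illustration. Embeddings are plain lists of floats, and `cross_score` stands in for any cross-encoder that scores a query and a code snippet jointly.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec, code_vecs, k):
    """Dual-encoder stage: score every code by embedding similarity.

    Returns the indices of the top-k codes and the full score list.
    """
    scores = [dot(query_vec, c) for c in code_vecs]
    order = sorted(range(len(code_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores


def rerank(query, codes, candidates, cross_score):
    """Cross-encoder stage: rescore only the retrieved candidates.

    cross_score(query, code) is a hypothetical joint scoring function.
    """
    return sorted(candidates, key=lambda i: cross_score(query, codes[i]), reverse=True)


def sample_hard_negatives(scores, positive_idx, n, temperature=1.0, rng=random):
    """Draw n negatives with probability proportional to exp(score / temperature).

    Assumption: harder negatives are those the retriever scores highly.
    Sampling is with replacement; the positive is never returned.
    """
    pool = [i for i in range(len(scores)) if i != positive_idx]
    top = max(scores[i] for i in pool)  # subtract the max for numerical stability
    weights = [math.exp((scores[i] - top) / temperature) for i in pool]
    return rng.choices(pool, weights=weights, k=n)
```

In this sketch the expensive joint scoring runs on only `k` candidates instead of the whole code base, which is the efficiency argument for cascading the two encoders. A lower `temperature` concentrates sampling on the highest-scoring (hardest) negatives, while a higher one approaches uniform sampling.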


