On Contrastive Learning of Semantic Similarity for Code to Code Search

05/05/2023
by   Anthony Saieva, et al.

This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs) by incorporating both static and dynamic features and by training on both similar and dissimilar examples. We present the first code search method that encodes dynamic runtime information during training without needing to execute either the corpus under search or the search query at inference time, and the first code search technique that trains on both positive and negative reference samples. To validate the efficacy of our approach, we perform a set of studies demonstrating the capability of enhanced LLMs to perform cross-language code-to-code search. Our evaluation shows that the effectiveness of our approach is consistent across model architectures and programming languages, outperforming the state-of-the-art cross-language search tool by up to 44.7%. Moreover, our ablation studies reveal that even a single positive and a single negative reference sample during training yields substantial performance improvements, demonstrating that both similar and dissimilar references are important parts of code search. Importantly, we show that enhanced, well-crafted, fine-tuned models consistently outperform enhanced larger modern LLMs without fine-tuning, even when enhancing the largest available LLMs, highlighting the importance of open-source models. To ensure the reproducibility and extensibility of our research, we present an open-source implementation of our tool and training procedures, called Cosco.
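The abstract does not give the exact training objective, but training on both positive (similar) and negative (dissimilar) reference samples is characteristic of contrastive learning. The following is a minimal sketch of one common such objective, a triplet margin loss over code embeddings; the function names, margin value, and use of cosine similarity are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    # Contrastive (triplet) objective: push the anchor's embedding to be
    # more similar to the positive (semantically similar code) than to the
    # negative (dissimilar code), by at least `margin`.
    # A loss of 0.0 means the triplet is already well separated.
    return max(0.0, margin - cosine_sim(anchor, positive) + cosine_sim(anchor, negative))

# Illustrative usage with toy 2-D "embeddings":
query_emb = np.array([1.0, 0.0])   # embedding of the search query
similar   = np.array([1.0, 0.0])   # embedding of semantically similar code
dissim    = np.array([0.0, 1.0])   # embedding of dissimilar code
loss = triplet_margin_loss(query_emb, similar, dissim)
```

Minimizing such a loss during fine-tuning is what lets the model rank semantically similar code above dissimilar code at search time, which matches the ablation finding that even one positive and one negative sample help.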


research
04/05/2022

On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages

A recent study by Ahmed and Devanbu reported that using a corpus of code...
research
05/09/2023

StarCoder: may the source be with you!

The BigCode community, an open-scientific collaboration working on the r...
research
05/19/2023

CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search

We consider the clone detection and information retrieval problems for s...
research
06/16/2021

Cross-Language Code Search using Static and Dynamic Analyses

As code search permeates most activities in software development, code-to...
research
05/08/2023

Retriever and Ranker Framework with Probabilistic Hard Negative Sampling for Code Search

Pretrained Language Models (PLMs) have emerged as the state-of-the-art p...
research
02/27/2023

The ROOTS Search Tool: Data Transparency for LLMs

ROOTS is a 1.6TB multilingual text corpus developed for the training of ...
research
09/05/2023

Making Large Language Models Better Reasoners with Alignment

Reasoning is a cognitive process of using evidence to reach a sound conc...
