CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search

05/19/2023
by Nikita Sorokin, et al.

We consider clone detection and information retrieval for source code, two well-known tasks that are important for any programming language. Finding code snippets that behave identically but are written in different programming languages is also an important and interesting problem, yet to the best of our knowledge multilingual clone detection has not been studied in the literature. In this work, we formulate the multilingual clone detection problem and present XCD, a new benchmark dataset produced from the CodeForces submissions dataset. Moreover, we present a novel training procedure, called cross-consistency training (CCT), that we apply to train language models on source code in different programming languages. The resulting CCT-LM model, initialized with GraphCodeBERT and fine-tuned with CCT, achieves a new state of the art, outperforming existing approaches on the POJ-104 clone detection benchmark with 95.67% MAP and on the AdvTest code search benchmark with 47.18% MRR; it also shows the best results on the newly created multilingual clone detection benchmark XCD across all programming languages.
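The two metrics quoted above, MAP and MRR, are standard ranking measures. The following is a minimal sketch of how they are commonly computed. It is not the paper's evaluation script: the function names and toy identifiers are hypothetical, and the exact benchmark variants (for example, any rank cutoff applied on POJ-104) are not specified in this abstract.

```python
from typing import Hashable, Iterable, Sequence


def reciprocal_rank(ranked: Sequence[Hashable], target: Hashable) -> float:
    """Return 1/rank of the single correct item, or 0.0 if it is absent."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate == target:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]], targets: Sequence[Hashable]
) -> float:
    """Average the reciprocal rank over all queries (code search setting)."""
    scores = [reciprocal_rank(r, t) for r, t in zip(rankings, targets)]
    return sum(scores) / len(scores)


def average_precision(
    ranked: Sequence[Hashable], relevant: Iterable[Hashable]
) -> float:
    """Average of precision@k taken at each rank k where a relevant item appears."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in relevant_set:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_set)


def mean_average_precision(
    rankings: Sequence[Sequence[Hashable]],
    relevant_sets: Sequence[Iterable[Hashable]],
) -> float:
    """Average the per-query average precision (clone detection setting)."""
    scores = [average_precision(r, s) for r, s in zip(rankings, relevant_sets)]
    return sum(scores) / len(scores)


# Toy data (hypothetical identifiers, not taken from any benchmark).
mrr = mean_reciprocal_rank([["a", "b", "c"], ["b", "a", "c"]], ["a", "a"])
ap = average_precision(["a", "b", "c"], {"a", "c"})
```

In the toy data, the first query finds its target at rank 1 and the second at rank 2, so the MRR is the mean of 1 and 1/2. MRR suits code search, where each query has one correct snippet; MAP suits clone detection, where each query can have many correct clones.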


research
04/05/2022

On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages

A recent study by Ahmed and Devanbu reported that using a corpus of code...
research
04/03/2022

MSCCD: Grammar Pluggable Clone Detection Based on ANTLR Parser Generation

For various reasons, programming languages continue to multiply and evol...
research
06/13/2022

MetaTPTrans: A Meta Learning Approach for Multilingual Code Representation Learning

Representation learning of source code is essential for applying machine...
research
02/28/2023

Benchmarking Deepart Detection

Deepfake technologies have been blurring the boundaries between the real...
research
05/05/2023

On Contrastive Learning of Semantic Similarity for Code to Code Search

This paper introduces a novel code-to-code search technique that enhance...
research
03/10/2023

Software Vulnerability Prediction Knowledge Transferring Between Programming Languages

Developing automated and smart software vulnerability detection models h...
research
06/11/2022

CodeS: A Distribution Shift Benchmark Dataset for Source Code Learning

Over the past few years, deep learning (DL) has been continuously expand...
