Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection

06/17/2022
by Maksim Zubkov, et al.

Code clones are pairs of code snippets that implement similar functionality. Clone detection is a fundamental branch of automatic source code comprehension, with many applications in refactoring recommendation, plagiarism detection, and code summarization. A particularly interesting case of clone detection is the detection of semantic clones, i.e., code snippets that have the same functionality but differ significantly in implementation. A promising approach to detecting semantic clones is contrastive learning (CL), a machine learning paradigm popular in computer vision but not yet commonly adopted for code processing. Our work aims to evaluate the most popular CL algorithms combined with three source code representations on two tasks. The first task is code clone detection, which we evaluate on the POJ-104 dataset containing implementations of 104 algorithms. The second task is plagiarism detection. To evaluate the models on this task, we introduce CodeTransformator, a tool for transforming source code. We use it to create a dataset that mimics plagiarised code based on competitive programming solutions. We trained nine models for both tasks and compared them with six existing approaches, including traditional tools and modern pre-trained neural models. The results of our evaluation show that the performance of the proposed models varies across tasks, but the graph-based models generally outperform the others. Among CL algorithms, SimCLR and SwAV lead to better results, while MoCo is the most robust approach. Our code and trained models are available at https://doi.org/10.5281/zenodo.6360627 and https://doi.org/10.5281/zenodo.5596345.
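
To make the contrastive learning idea concrete, the following is a minimal sketch of a generic InfoNCE-style loss, the family of objective that SimCLR and MoCo build on. It is not the authors' implementation: the function names, the temperature value, and the 3-dimensional embeddings are all hypothetical and chosen only for illustration. The loss is small when a snippet's embedding is closer to its clone than to non-clones.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for a single anchor embedding.

    The positive is treated as a clone of the anchor and the negatives
    as non-clones. The temperature value is an illustrative assumption.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable negative log-softmax of the positive logit.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]

# Hypothetical embeddings of code snippets (not produced by any real model).
anchor = [1.0, 0.0, 0.0]
clone = [0.9, 0.1, 0.0]
non_clones = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Loss when the true clone is used as the positive.
good = info_nce(anchor, clone, non_clones)
# Loss when a non-clone is wrongly used as the positive.
bad = info_nce(anchor, non_clones[0], [clone, non_clones[1]])
```

In this toy setup, `good` is much smaller than `bad`, which is the signal a contrastive model would use to pull clone embeddings together and push non-clones apart.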


research
04/07/2022

Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation

Code search aims to retrieve the most semantically relevant code snippet...
research
05/08/2023

Code Execution with Pre-trained Language Models

Code execution is a fundamental aspect of programming language semantics...
research
06/13/2023

TRACED: Execution-aware Pre-training for Source Code

Most existing pre-trained language models for source code focus on learn...
research
10/12/2019

Deep Transfer Learning for Source Code Modeling

In recent years, deep learning models have shown great potential in sour...
research
10/11/2022

Extracting Meaningful Attention on Source Code: An Empirical Study of Developer and Neural Model Code Exploration

The high effectiveness of neural models of code, such as OpenAI Codex an...
research
01/11/2023

Predicting Tags For Programming Tasks by Combining Textual And Source Code Data

Competitive programming remains a very popular activity that combines bo...
research
09/06/2020

Self-Supervised Learning for Code Retrieval and Summarization through Semantic-Preserving Program Transformations

Code retrieval and summarization are useful tasks for developers, but it...
