SEED: Semantic Graph based Deep detection for type-4 clone

09/24/2021
by   Zhipeng Xue, et al.

Background: Type-4 clones are pairs of code snippets that implement similar functionality with different syntax, which makes them challenging for existing code clone detection techniques. Previous studies rely heavily on syntactic structures and textual tokens, which cannot precisely represent the semantic information of code and may introduce non-negligible noise into detection models. Aims: To overcome these limitations, we explore an effective semantics-based solution for Type-4 clone detection. We also conducted an empirical study of the characteristics of Type-4 clone pairs and found that not all tokens carry the semantics that Type-4 clone detection requires; operators and API calls emerge as distinct candidates for representing Type-4 code semantics. Method: Building on this finding, we design a novel semantic-graph-based deep detection approach called SEED. For a pair of code snippets, SEED constructs a semantic graph of each snippet from its intermediate representation, which represents code functionality more precisely than representations based on lexical and syntactic analysis. To accommodate the characteristics of Type-4 clones, the semantic graph focuses on operators and API calls instead of all tokens. SEED then generates feature vectors with a deep graph neural network and detects clones based on the similarity between the vectors. Results: Extensive experiments show that our approach significantly outperforms two baseline approaches on two public datasets and one customized dataset. Specifically, SEED outperforms the baseline methods by an average of 25.2 in F1-Score. Conclusions: Our experiments demonstrate that SEED achieves state-of-the-art performance and can be useful for Type-4 clone detection in practice.
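The pipeline described in the Method part (graph over operators and API calls, graph-based embedding, vector similarity) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not SEED's implementation: the three-address instruction format, the function names `build_semantic_graph`, `embed`, and `cosine`, and the untrained sum-aggregation message passing that stands in for the trained graph neural network.

```python
import math
from collections import defaultdict

# Hypothetical three-address IR: (dest, op, args). "op" is an operator or an
# API call name. This format is an illustration, not the IR used by SEED.
def build_semantic_graph(instructions):
    """Return (labels, edges): one node per operator/API call, plus a
    data-flow edge u -> v when node v consumes the value defined by node u."""
    labels, edges, defined_by = [], [], {}
    for dest, op, args in instructions:
        node = len(labels)
        labels.append(op)
        for arg in args:
            if arg in defined_by:          # operand produced by an earlier node
                edges.append((defined_by[arg], node))
        if dest is not None:
            defined_by[dest] = node
    return labels, edges

def embed(labels, edges, vocab, rounds=2):
    """Untrained stand-in for a graph neural network: one-hot node features,
    sum aggregation over predecessors, then sum pooling to one graph vector."""
    index = {name: i for i, name in enumerate(vocab)}
    feats = [[1.0 if index[label] == i else 0.0 for i in range(len(vocab))]
             for label in labels]
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    for _ in range(rounds):
        updated = []
        for v, own in enumerate(feats):
            agg = list(own)
            for u in preds[v]:             # add a damped copy of each predecessor
                agg = [a + 0.5 * b for a, b in zip(agg, feats[u])]
            updated.append(agg)
        feats = updated
    return [sum(column) for column in zip(*feats)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Snippets A and B use different variable names but the same operators and
# API call; snippet C performs a different computation.
snippet_a = [("t1", "load", ["x"]), ("t2", "add", ["t1", "acc"]),
             ("t3", "Math.abs", ["t2"])]
snippet_b = [("v0", "load", ["arr"]), ("v1", "add", ["v0", "s"]),
             ("v2", "Math.abs", ["v1"])]
snippet_c = [("t1", "load", ["s"]), ("t2", "String.length", ["t1"]),
             ("t3", "mul", ["t2", "k"])]

graphs = [build_semantic_graph(s) for s in (snippet_a, snippet_b, snippet_c)]
vocab = sorted({label for labels, _ in graphs for label in labels})
vec_a, vec_b, vec_c = (embed(labels, edges, vocab) for labels, edges in graphs)

sim_ab = cosine(vec_a, vec_b)
sim_ac = cosine(vec_a, vec_c)
print(f"A vs B: {sim_ab:.3f}, A vs C: {sim_ac:.3f}")
```

Because variable names never enter the graph, snippets A and B map to the same vector in this sketch, while C only overlaps with A on the shared `load` node. A real detector would learn the node embeddings and aggregation weights from labeled clone pairs instead of using fixed one-hot features.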


