What You See is What it Means! Semantic Representation Learning of Code based on Visualization and Transfer Learning

02/07/2020
by   Patrick Keller, et al.

Recent successes in training word embeddings for NLP tasks have encouraged a wave of research on representation learning for source code, which builds on similar NLP methods. The overall objective is to produce code embeddings that capture as much of a program's semantics as possible. State-of-the-art approaches invariably rely on a syntactic representation (i.e., raw lexical tokens, abstract syntax trees, or intermediate representation tokens) to generate embeddings, which are criticized in the literature as non-robust or non-generalizable. In this work, we investigate a novel embedding approach based on the intuition that source code has visual patterns of semantics. We further use these patterns to address the outstanding challenge of identifying semantic code clones. We propose the WYSIWIM ("What You See Is What It Means") approach, where visual representations of source code are fed into powerful pre-trained image classification neural networks from the field of computer vision to benefit from the practical advantages of transfer learning. We evaluate the proposed embedding approach on two variations of the task of semantic code clone identification: code clone detection (a binary classification problem) and code classification (a multi-class classification problem). We show with experiments on the BigCloneBench (Java) and Open Judge (C) datasets that, although simple, our WYSIWIM approach performs as effectively as state-of-the-art approaches such as ASTNN or TBCNN. We further explore the influence of different steps in our approach, such as the choice of visual representation or the classification algorithm, and finally discuss the promises and limitations of this research direction.
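The pipeline described in the abstract (render code visually, embed the picture, compare or classify the embeddings) can be sketched in a few lines. This is only a toy illustration under loud assumptions, not the authors' implementation: `render_as_grid` is a hypothetical stand-in for a real visual rendering, and `placeholder_embed` is a hypothetical average-pooling stand-in for the pre-trained image classification network, so the numbers it produces say nothing about the results reported in the paper.

```python
import math

def render_as_grid(code, width=32, height=16):
    """Toy 'visual representation' (assumption): one cell per character
    position, 0.0 for whitespace and 1.0 for anything else, so the grid
    records only the layout of the code, not its tokens."""
    lines = code.expandtabs(4).splitlines()[:height]
    grid = []
    for row in range(height):
        line = lines[row] if row < len(lines) else ""
        padded = line[:width].ljust(width)
        grid.append([0.0 if ch.isspace() else 1.0 for ch in padded])
    return grid

def placeholder_embed(grid, block=4):
    """Stand-in for a pre-trained image network (assumption): average-pool
    non-overlapping block x block regions and flatten them into a vector."""
    height, width = len(grid), len(grid[0])
    features = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            cells = [grid[r][c]
                     for r in range(top, top + block)
                     for c in range(left, left + block)]
            features.append(sum(cells) / len(cells))
    return features

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two hypothetical C snippets that compute the same sum in different ways.
snippet_a = """int sum(int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += i;
    return s;
}
"""
snippet_b = """int total(int n) {
    int acc = 0;
    int i = 0;
    while (i < n) {
        acc += i;
        i++;
    }
    return acc;
}
"""

emb_a = placeholder_embed(render_as_grid(snippet_a))
emb_b = placeholder_embed(render_as_grid(snippet_b))
similarity = cosine_similarity(emb_a, emb_b)
```

In a realistic setting the two stand-ins would be replaced by an actual image rendering of the code and by features taken from a pre-trained vision model, and the clone decision would presumably come from a trained classifier rather than a raw similarity score.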


research
04/01/2021

Modular Tree Network for Source Code Representation Learning

Learning representation for source code is a foundation of many program ...
research
10/11/2019

Evaluating Semantic Representations of Source Code

Learned representations of source code enable various software developer...
research
04/26/2019

Learning Semantic Vector Representations of Source Code via a Siamese Neural Network

The abundance of open-source code, coupled with the success of recent ad...
research
09/24/2021

SEED: Semantic Graph based Deep detection for type-4 clone

Background: Type-4 clones refer to a pair of code snippets with similar ...
research
04/20/2022

Unleashing the Power of Compiler Intermediate Representation to Enhance Neural Program Embeddings

Neural program embeddings have demonstrated considerable promise in a ra...
research
03/09/2021

Mining Program Properties From Neural Networks Trained on Source Code Embeddings

In this paper, we propose a novel approach for mining different program ...
research
05/01/2023

Interpreting Pretrained Source-code Models using Neuron Redundancy Analyses

Neural code intelligence models continue to be 'black boxes' to the huma...
