CV4Code: Sourcecode Understanding via Visual Code Representations

05/11/2022
by Ruibo Shi, et al.

We present CV4Code, a compact and effective computer vision method for sourcecode understanding. Our method leverages both the contextual and the structural information available in a code snippet by treating each snippet as a two-dimensional image, which naturally encodes the context and retains the underlying structural information through an explicit spatial representation. To codify snippets as images, we propose an ASCII codepoint-based image representation that facilitates fast generation of sourcecode images and eliminates the redundancy in the encoding that would arise from an RGB pixel representation. Furthermore, because sourcecode is treated as images, neither lexical analysis (tokenisation) nor syntax tree parsing is required, which makes the proposed method agnostic to any particular programming language and lightweight from the application pipeline point of view. CV4Code can even featurise syntactically incorrect code, which is not possible with methods that depend on the Abstract Syntax Tree (AST). We demonstrate the effectiveness of CV4Code by training Convolutional and Transformer networks to predict the functional task of the sourcecode, i.e. the problem it solves, directly from its two-dimensional representation, and by using an embedding from the latent space to derive a similarity score between two code snippets in a retrieval setup. Experimental results show that our approach achieves state-of-the-art performance in comparison to other methods with the same task and data configurations. For the first time, we show the benefits of treating sourcecode understanding as an image processing task.
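The ASCII codepoint-based representation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `code_to_grid`, the grid size, the padding value of 0, the tab expansion, and the handling of non-ASCII characters are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows the general idea of mapping each character to one cell of a two-dimensional grid, with no tokenisation or parsing involved.

```python
from typing import List


def code_to_grid(source: str, height: int = 8, width: int = 16,
                 pad: int = 0) -> List[List[int]]:
    """Map a code snippet to a fixed-size 2D grid of ASCII codepoints.

    Hypothetical helper: the grid size, padding value and truncation
    policy are illustrative assumptions, not taken from the paper.
    """
    # Expand tabs so indentation occupies a consistent number of cells.
    lines = [line.expandtabs(4) for line in source.splitlines()]
    grid = []
    for row in range(height):
        line = lines[row] if row < len(lines) else ""
        # One cell per character; characters outside ASCII fall back to
        # the padding value (an assumption for this sketch).
        codes = [ord(ch) if ord(ch) < 128 else pad for ch in line[:width]]
        # Right-pad short lines so every row has the same width.
        codes += [pad] * (width - len(codes))
        grid.append(codes)
    return grid


# The snippet is deliberately incomplete: no parser is involved, so
# syntactically invalid code is encoded like any other text.
snippet = "def f(x):\n    return x +"
grid = code_to_grid(snippet)
```

The resulting `grid` is a single-channel "image" whose rows follow the lines of the snippet and whose columns follow character positions, so indentation and layout are preserved spatially. A model such as a convolutional network could then consume it like any other 2D input.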


