A Mocktail of Source Code Representations

06/21/2021
by Dheeraj Vagavolu, et al.

Efficient representation of source code is essential for software engineering tasks such as code search and code clone detection. One technique for representing source code extracts paths from the Abstract Syntax Tree (AST) and uses a learning model to capture program properties. Code2vec is a commonly used path-based approach that uses an attention-based neural network to learn code embeddings, which can then be used for various software engineering tasks. However, this approach uses only ASTs and does not leverage other graph structures such as Control Flow Graphs (CFG) and Program Dependency Graphs (PDG). Similarly, most recent approaches for representing source code still use the AST and do not leverage semantic graph structures. Although an integrated graph approach (the Code Property Graph) exists for representing source code, it has only been explored in the domain of software security, and it does not leverage the paths from the individual graphs. In our work, we extend the path-based approach code2vec to include the semantic graphs, CFG and PDG, along with the AST, a combination that is still largely unexplored in the domain of software engineering. We evaluate our approach on the task of method naming using a custom C dataset of 730K methods collected from 16 C projects on GitHub. In comparison to code2vec, our approach improves the F1 score by 11% on the full dataset and by up to 100% [...] features from the CFG and PDG paths are indeed helpful. We envision that looking at a mocktail of source code representations for various software engineering tasks can lay the foundation for a new line of research and an overhaul of existing research.
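
To make the path-based idea concrete, the following is a minimal sketch of extracting leaf-to-leaf AST paths in the spirit of code2vec. It is an illustration under stated assumptions, not the authors' pipeline: it uses Python's standard `ast` module on Python source rather than C, it covers only the AST (not the CFG or PDG paths the work adds), and the names `terminal_token`, `root_to_leaf_chains`, `path_contexts`, the `max_length` cutoff, and the `^`/`_` separators are all hypothetical choices made for this example.

```python
import ast
from itertools import combinations


def terminal_token(node):
    """Return the token of a terminal node, or None for any other node."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.arg):
        return node.arg
    if isinstance(node, ast.Constant):
        return repr(node.value)
    return None


def root_to_leaf_chains(node, prefix=()):
    """Yield (token, chain) pairs, where chain holds the nodes from root to terminal."""
    chain = prefix + (node,)
    token = terminal_token(node)
    if token is not None:
        # Stop at terminals; their children (e.g. Load/Store) carry no token.
        yield token, chain
        return
    for child in ast.iter_child_nodes(node):
        yield from root_to_leaf_chains(child, chain)


def path_contexts(source, max_length=8):
    """Return (token, path, token) triples for every pair of terminals.

    max_length is an assumed cutoff on the number of nodes in a path.
    """
    leaves = list(root_to_leaf_chains(ast.parse(source)))
    contexts = []
    for (tok_a, chain_a), (tok_b, chain_b) in combinations(leaves, 2):
        # Walk down from the root while both chains share the same node;
        # the last shared node is the lowest common ancestor.
        common = 0
        while chain_a[common] is chain_b[common]:
            common += 1
        up = [type(n).__name__ for n in reversed(chain_a[common:])]
        top = type(chain_a[common - 1]).__name__
        down = [type(n).__name__ for n in chain_b[common:]]
        if len(up) + 1 + len(down) > max_length:
            continue
        # '^' marks a step up the tree, '_' a step down.
        path = "^".join(up + [top]) + "_" + "_".join(down)
        contexts.append((tok_a, path, tok_b))
    return contexts


if __name__ == "__main__":
    for context in path_contexts("def add(a, b):\n    return a + b\n"):
        print(context)
```

In a code2vec-style model, each such triple would be embedded and the triples of a method combined by attention into a single vector. Applying the same extraction to CFG and PDG paths, as the abstract describes, would require building those graphs first, which this sketch does not attempt.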

