Code Representation Learning with Prüfer Sequences

11/14/2021
by Tenzin Jinpa, et al.

An effective and efficient encoding of a program's source code is critical to the success of sequence-to-sequence deep neural network models for program comprehension tasks such as automated code summarization and documentation. A significant challenge is to find a sequential representation that captures the structural and syntactic information in a program and facilitates the training of the learning models. In this paper, we propose using the Prüfer sequence of a program's Abstract Syntax Tree (AST) to design a sequential representation scheme that preserves the structural information in the AST. Our representation makes it possible to develop deep-learning models in which the signals carried by lexical tokens in the training examples can be exploited automatically and selectively, based on their syntactic role and importance. Unlike other recently proposed approaches, our representation is concise and lossless with respect to the structural information of the AST. Empirical studies on real-world benchmark datasets, using a sequence-to-sequence learning model we designed for code summarization, show that our Prüfer-sequence-based representation is highly effective and efficient, significantly outperforming all of the recently proposed deep-learning models we used as baselines.
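
To make the "concise and lossless" property concrete, the sketch below implements the classic textbook Prüfer encoding and decoding of a labeled tree. It is not the paper's representation scheme, whose details are not given in the abstract. The function names `prufer_sequence` and `tree_from_prufer`, the toy tree, and the assumption that AST nodes have already been numbered 0..n-1 are all illustrative choices made here.

```python
import heapq
from collections import defaultdict


def prufer_sequence(edges):
    """Classic Prüfer encoding of a labeled tree given as (u, v) edges.

    Assumes the edges form a tree with comparable node labels.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n = len(adj)

    # Min-heap of current leaves, so the smallest-labeled leaf is removed first.
    leaves = [node for node in adj if len(adj[node]) == 1]
    heapq.heapify(leaves)

    seq = []
    for _ in range(n - 2):
        leaf = heapq.heappop(leaves)
        (neighbor,) = adj[leaf]          # a leaf has exactly one neighbor
        seq.append(neighbor)             # record the neighbor, then cut the leaf
        adj[neighbor].discard(leaf)
        adj[leaf].clear()
        if len(adj[neighbor]) == 1:      # the neighbor may have become a leaf
            heapq.heappush(leaves, neighbor)
    return seq


def tree_from_prufer(seq):
    """Rebuild the edge list from a Prüfer sequence (labels assumed 0..n-1)."""
    n = len(seq) + 2
    degree = [1] * n
    for x in seq:
        degree[x] += 1

    leaves = [i for i in range(n) if degree[i] == 1]
    heapq.heapify(leaves)

    edges = []
    for x in seq:
        leaf = heapq.heappop(leaves)
        edges.append((leaf, x))
        degree[x] -= 1
        if degree[x] == 1:
            heapq.heappush(leaves, x)
    # Exactly two nodes remain; they form the final edge.
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges


# Hypothetical toy AST with nodes numbered 0..5 (0 is the root).
toy_edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5)]
sequence = prufer_sequence(toy_edges)
print(sequence)  # [1, 1, 0, 2]

# Round trip: the undirected edge set is recovered exactly.
rebuilt = tree_from_prufer(sequence)
print({frozenset(e) for e in rebuilt} == {frozenset(e) for e in toy_edges})  # True
```

A tree with n nodes yields a sequence of length n - 2, and the round trip recovers every edge, which is the sense in which such an encoding is both compact and structure-preserving. Note that the classic encoding describes an unrooted tree with interchangeable siblings; how the paper handles the root, child order, and node tokens of an AST is not specified here.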

research
04/01/2021

Modular Tree Network for Source Code Representation Learning

Learning representation for source code is a foundation of many program ...
research
03/14/2023

Implant Global and Local Hierarchy Information to Sequence based Code Representation Models

Source code representation with deep learning techniques is an important...
research
02/09/2021

Demystifying Code Summarization Models

The last decade has witnessed a rapid advance in machine learning models...
research
11/23/2020

Modeling Functional Similarity in Source Code with Graph-Based Siamese Networks

Code clones are duplicate code fragments that share (nearly) similar syn...
research
08/30/2021

CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees

Code summarization aims to generate concise natural language description...
research
08/07/2020

PSCS: A Path-based Neural Model for Semantic Code Search

To obtain code snippets for reuse, programmers prefer to search for rela...
research
05/11/2022

CV4Code: Sourcecode Understanding via Visual Code Representations

We present CV4Code, a compact and effective computer vision method for s...