AstBERT: Enabling Language Model for Code Understanding with Abstract Syntax Tree

01/20/2022
by Rong Liang, et al.

Using pre-trained language models (e.g. BERT) to understand source code has attracted increasing attention in the natural language processing community. However, applying these language models directly to programming language (PL) problems poses several challenges, the most significant of which is a lack of domain knowledge that substantially degrades model performance. To this end, we propose AstBERT, a pre-trained language model that aims to better understand PL using the abstract syntax tree (AST). Specifically, we collect a large amount of source code (both Java and Python) from GitHub and incorporate contextual code knowledge into our model with the help of code parsers, through which the AST information of the source code can be interpreted and integrated. We verify the performance of the proposed model on code information extraction and code search tasks. Experimental results show that our AstBERT model achieves state-of-the-art performance on both downstream tasks (with 96.4 code information extraction task, and 57.12
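To make the idea of AST information obtained through a code parser concrete, here is a minimal sketch using Python's standard `ast` module. It is not the AstBERT pipeline: the abstract does not say which parsers were used or how the AST information is fed to the model, so the function names `ast_node_types` and `identifier_roles` and the coarse role labels are hypothetical, chosen only for illustration.

```python
import ast


def ast_node_types(source: str) -> list[str]:
    """Return the AST node type names in the order ast.walk visits them."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def identifier_roles(source: str) -> dict[str, str]:
    """Map each name in the source to a coarse syntactic role (hypothetical labels)."""
    roles: dict[str, str] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            roles[node.name] = "function_definition"
        elif isinstance(node, ast.arg):
            roles[node.arg] = "parameter"
        elif isinstance(node, ast.Name):
            # Keep an earlier, more specific role if one was already recorded.
            roles.setdefault(node.id, "identifier")
    return roles


snippet = "def add(a, b):\n    return a + b\n"
print(ast_node_types(snippet))
print(identifier_roles(snippet))
```

Structural labels of this kind are one example of what a parser exposes beyond the raw token sequence; how such information is actually combined with the token inputs is described in the full paper, not here.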

research
09/17/2020

GraphCodeBERT: Pre-training Code Representations with Data Flow

Pre-trained models for programming language have achieved dramatic empir...
research
05/26/2021

TreeBERT: A Tree-Based Pre-Trained Model for Programming Language

Source code can be parsed into the abstract syntax tree (AST) based on d...
research
02/14/2022

What Do They Capture? – A Structural Analysis of Pre-Trained Language Models for Source Code

Recently, many pre-trained language models for source code have been pro...
research
03/18/2020

Improving the Robustness to Data Inconsistency between Training and Testing for Code Completion by Hierarchical Language Model

In the field of software engineering, applying language models to the to...
research
06/01/2023

AI Chain on Large Language Model for Unsupervised Control Flow Graph Generation for Statically-Typed Partial Code

Control Flow Graphs (CFGs) are essential for visualizing, understanding ...
research
05/11/2022

CV4Code: Sourcecode Understanding via Visual Code Representations

We present CV4Code, a compact and effective computer vision method for s...
research
03/09/2019

Program Classification Using Gated Graph Attention Neural Network for Online Programming Service

The online programming services, such as GitHub, TopCoder, and EduCoder, h...
