Learning Semantic Program Embeddings with Graph Interval Neural Network

05/18/2020
by Yu Wang, et al.

Learning distributed representations of source code has been a challenging task for machine learning models. Earlier works treated programs as text so that natural language methods could be readily applied. Unfortunately, such approaches do not capitalize on the rich structural information possessed by source code. More recently, Graph Neural Networks (GNNs) have been proposed to learn embeddings of programs from their graph representations. Because their message-passing procedure is homogeneous and expensive, GNNs can suffer from precision issues, especially when dealing with programs rendered into large graphs. In this paper, we present a new graph neural architecture, called Graph Interval Neural Network (GINN), to tackle the weaknesses of existing GNNs. Unlike the standard GNN, GINN generalizes from a curated graph representation obtained through an abstraction method designed to help models learn. In particular, GINN focuses exclusively on intervals for mining the feature representation of a program. Furthermore, GINN operates on a hierarchy of intervals to scale the learning to large graphs. We evaluate GINN on two popular downstream applications: variable misuse prediction and method name prediction. Results show that in both cases GINN outperforms the state-of-the-art models by a comfortable margin. We have also created a neural bug detector based on GINN to catch null pointer dereference bugs in Java code. While learning from the same 9,000 methods extracted from 64 projects, the GINN-based bug detector significantly outperforms the GNN-based bug detector on 13 unseen test projects. Next, we deploy our trained GINN-based bug detector and Facebook Infer to scan the codebases of 20 highly starred projects on GitHub. Through manual inspection, we confirm 38 bugs out of 102 warnings raised by the GINN-based bug detector, compared to 34 bugs out of 129 warnings for Facebook Infer.
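
The abstract does not define "interval" or the "hierarchy of intervals". As a rough illustration only, the sketch below assumes the term refers to the classical control-flow interval from compiler analysis: a single-entry region grown from a header node by repeatedly absorbing nodes whose predecessors all lie inside the region. The function names `interval_partition` and `derived_graph`, the dictionary-based graph encoding, and the toy control-flow graph are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def interval_partition(succ, entry):
    """Split a control-flow graph into intervals (assumed classical definition).

    succ maps each node to a list of successor nodes; entry is the start node.
    Returns a dict mapping each interval header to its member nodes.
    """
    nodes = set(succ) | {m for targets in succ.values() for m in targets}
    preds = defaultdict(set)
    for n, targets in succ.items():
        for m in targets:
            preds[m].add(n)

    intervals = {}
    assigned = set()      # nodes already placed in some interval
    queued = {entry}      # nodes that are (or were) scheduled as headers
    worklist = [entry]
    while worklist:
        header = worklist.pop(0)
        members = [header]
        member_set = {header}
        changed = True
        while changed:
            changed = False
            for n in sorted(nodes - member_set - assigned - queued):
                # Absorb n only if every predecessor is already inside.
                if preds[n] and preds[n] <= member_set:
                    members.append(n)
                    member_set.add(n)
                    changed = True
        intervals[header] = members
        assigned |= member_set
        # A node reachable from this interval but not absorbed starts a new one.
        for n in sorted(nodes - assigned - queued):
            if preds[n] & member_set:
                worklist.append(n)
                queued.add(n)
    return intervals


def derived_graph(succ, intervals):
    """Collapse each interval to one node, giving the next level of the hierarchy."""
    owner = {n: h for h, members in intervals.items() for n in members}
    out = {h: set() for h in intervals}
    for n, targets in succ.items():
        for m in targets:
            if n in owner and m in owner and owner[n] != owner[m]:
                out[owner[n]].add(owner[m])
    return {h: sorted(targets) for h, targets in out.items()}


# Toy graph: node 1 enters a loop headed by node 2; node 3 jumps back to 2.
cfg = {1: [2], 2: [3, 4], 3: [2], 4: []}
first = interval_partition(cfg, 1)
second = interval_partition(derived_graph(cfg, first), 1)
print(first)
print(second)
```

In this toy graph the loop body collapses into one interval, and a second pass over the collapsed graph merges everything into a single interval. Under the stated assumption, this is the sense in which a hierarchy of intervals yields progressively smaller graphs for a model to process.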

06/09/2020

Automatic Code Summarization via Multi-dimensional Semantic Fusing in GNN

Source code summarization aims to generate natural language summaries fr...
07/12/2019

Learning a Static Bug Finder from Data

Static analysis is an effective technique to catch bugs early when they ...
11/01/2017

Learning to Represent Programs with Graphs

Learning tasks on source code (i.e., formal languages) have been conside...
03/09/2019

Program Classification Using Gated Graph Attention Neural Network for Online Programming Service

The online programming services, such as GitHub, TopCoder, and EduCoder, h...
01/28/2022

HEAT: Hyperedge Attention Networks

Learning from structured data is a core machine learning task. Commonly,...
04/21/2022

On Distribution Shift in Learning-based Bug Detectors

Deep learning has recently achieved initial success in program analysis ...