code2vec: Learning Distributed Representations of Code

03/26/2018
by   Uri Alon, et al.

We present a neural model for representing snippets of code as continuous distributed vectors. The main idea is to represent code as a collection of paths in its abstract syntax tree and to aggregate these paths, in a smart and scalable way, into a single fixed-length code vector that can be used to predict semantic properties of the snippet. We demonstrate the effectiveness of our approach by using it to predict a method's name from the vector representation of its body. We evaluate the approach by training a model on a dataset of 14M methods. We show that code vectors trained on this dataset can predict method names from files that were completely unobserved during training. Furthermore, we show that our model learns useful method name vectors that capture semantic similarities, combinations, and analogies. Compared with previous techniques evaluated on the same dataset, our approach obtains a relative improvement of over 75% and is the first to successfully predict method names based on a large, cross-project corpus.
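To make the idea concrete, the following is a minimal sketch of the pipeline the abstract describes: collect paths between leaves of an abstract syntax tree, then aggregate them into one fixed-length vector. It is an illustration under stated assumptions, not the authors' implementation. Python's standard `ast` module is used only for convenience. The helper names (`leaf_trails`, `path_between`, `path_contexts`, `embed`, `code_vector`) are hypothetical. The embeddings are seeded pseudo-random numbers standing in for learned parameters. The softmax-weighted average is an assumed aggregation, since the abstract does not spell out the mechanism.

```python
import ast
import itertools
import math
import random

DIM = 4  # illustrative embedding size, not a value from the paper


def leaf_trails(tree):
    """Return the root-to-leaf node list for every leaf of the AST."""
    trails = []

    def walk(node, trail):
        trail = trail + [node]
        # Ignore Load/Store markers so that identifiers count as leaves.
        children = [c for c in ast.iter_child_nodes(node)
                    if not isinstance(c, ast.expr_context)]
        if not children:
            trails.append(trail)
        for child in children:
            walk(child, trail)

    walk(tree, [])
    return trails


def token(node):
    """Text attached to a leaf: identifier, literal, or the node type name."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.arg):
        return node.arg
    if isinstance(node, ast.Constant):
        return repr(node.value)
    return type(node).__name__


def path_between(trail_a, trail_b):
    """Node types going up from leaf a to the common ancestor, then down to leaf b."""
    k = 0
    while trail_a[k] is trail_b[k]:
        k += 1  # distinct leaves always diverge before either trail ends
    up = [type(n).__name__ for n in reversed(trail_a[k:])]
    top = type(trail_a[k - 1]).__name__
    down = [type(n).__name__ for n in trail_b[k:]]
    return "^".join(up + [top]) + "".join("_" + d for d in down)


def path_contexts(source):
    """All (start token, path, end token) triples between pairs of leaves."""
    trails = leaf_trails(ast.parse(source))
    return [(token(a[-1]), path_between(a, b), token(b[-1]))
            for a, b in itertools.combinations(trails, 2)]


def embed(symbol, dim=DIM):
    """Deterministic pseudo-random vector standing in for a learned embedding."""
    rng = random.Random(symbol)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def code_vector(contexts, dim=DIM):
    """Softmax-weighted average of context vectors (assumed aggregation)."""
    size = 3 * dim
    if not contexts:
        return [0.0] * size, []
    vectors = [embed("tok:" + s, dim) + embed("path:" + p, dim) + embed("tok:" + t, dim)
               for s, p, t in contexts]
    attention = embed("attention", size)  # stand-in for a learned attention vector
    scores = [sum(a * v for a, v in zip(attention, vec)) for vec in vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    vector = [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(size)]
    return vector, weights


contexts = path_contexts("def add(a, b):\n    return a + b\n")
vector, weights = code_vector(contexts)
print(len(contexts), "path contexts ->", len(vector), "dimensional code vector")
```

For the `add` snippet, one of the extracted contexts is `("a", "Name^BinOp_Name", "b")`, which links the two operands through their shared `BinOp` parent. The output vector has the same length however many contexts a snippet produces, which is the property that lets a downstream predictor consume methods of any size.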


