GraphCode2Vec: Generic Code Embedding via Lexical and Program Dependence Analyses

12/02/2021
by Wei Ma, et al.

Code embedding is a keystone in the application of machine learning to several Software Engineering (SE) tasks. To effectively support a plethora of SE tasks, the embedding needs to capture program syntax and semantics in a generic way. To this end, we propose the first self-supervised pre-training approach (called GraphCode2Vec) that produces task-agnostic embeddings of lexical and program dependence features. GraphCode2Vec achieves this via a synergistic combination of code analysis and Graph Neural Networks. GraphCode2Vec is generic: it allows pre-training and is applicable to several SE downstream tasks. We evaluate the effectiveness of GraphCode2Vec on four tasks (method name prediction, solution classification, mutation testing, and overfitted patch classification), and compare it with four similarly generic code embedding baselines (Code2Seq, Code2Vec, CodeBERT, GraphCodeBERT) and seven task-specific, learning-based methods. GraphCode2Vec is more effective than both the generic and the task-specific learning-based baselines, and it is complementary and comparable to GraphCodeBERT (a larger and more complex model). We also demonstrate through a probing and ablation study that GraphCode2Vec learns lexical and program dependence features, and that self-supervised pre-training improves effectiveness.
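The core idea, combining lexical features of statements with message passing over a program dependence graph, can be illustrated with a minimal sketch. This is a hypothetical toy example, not GraphCode2Vec's actual architecture: the graph, the mean-aggregation update, and the pooling step are stand-ins for the paper's learned GNN layers and pre-trained lexical encoder.

```python
import numpy as np

# Toy program dependence graph for a 3-statement snippet (hypothetical):
#   node 0: x = input()
#   node 1: y = x + 1      (data dependence on 0)
#   node 2: print(x, y)    (data dependences on 0 and 1)
edges = [(0, 1), (0, 2), (1, 2)]
num_nodes, dim = 3, 4

# Lexical features: random stand-ins for learned token embeddings
# of each statement (in the real model these come from pre-training).
rng = np.random.default_rng(0)
h = rng.normal(size=(num_nodes, dim))

def propagate(h, edges):
    """One message-passing layer: each node averages its own state
    with the states of its dependence predecessors."""
    out = h.copy()
    for v in range(len(h)):
        preds = [h[u] for (u, w) in edges if w == v]
        if preds:
            out[v] = (h[v] + np.mean(preds, axis=0)) / 2.0
    return out

# Two rounds of propagation spread dependence information across the graph,
# then mean-pooling yields one embedding vector for the whole program.
h = propagate(propagate(h, edges), edges)
embedding = h.mean(axis=0)
print(embedding.shape)  # (4,)
```

In the actual approach, the aggregation weights are learned end-to-end and the lexical encoder is pre-trained in a self-supervised fashion; this sketch only shows how dependence structure and lexical features combine into a single task-agnostic vector.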

