Evaluating the Impact of Source Code Parsers on ML4SE Models

06/17/2022
by Ilya Utkin, et al.

As researchers and practitioners apply machine learning to a growing number of software engineering problems, the approaches they use become more sophisticated. Many modern approaches utilize the internal structure of code in the form of an abstract syntax tree (AST) or its extensions: path-based representations and complex graphs that combine the AST with additional edges. Even though ASTs can be extracted from code with different parsers, the impact of the choice of parser on the final model quality remains unstudied. Moreover, researchers often omit the exact details of how they extract particular code representations. In this work, we evaluate two models, namely Code2Seq and TreeLSTM, on the method name prediction task backed by eight different parsers for the Java language. To unify the process of data preparation with different parsers, we develop SuperParser, a multi-language parser-agnostic library based on PathMiner. SuperParser facilitates the end-to-end creation of datasets suitable for training and evaluating ML models that work with structural information from source code. Our results demonstrate that trees built by different parsers vary in their structure and content. We then analyze how this diversity affects the models' quality and show that the quality gap between the most and least suitable parsers is significant for both models. Finally, we discuss other features of the parsers that researchers and practitioners should take into account when selecting a parser, in addition to its impact on the models' quality. The code of SuperParser is publicly available at https://doi.org/10.5281/zenodo.6366591. We also publish Java-norm, the dataset we use to evaluate the models: https://doi.org/10.5281/zenodo.6366599.
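To make the claim about structural variation concrete, the following is a minimal Python sketch that computes three simple tree statistics (node count, depth, and number of distinct node labels) for two ASTs of the same method. Everything in it is an assumption for illustration: the `Node` class, the helper functions, and both hand-written trees are hypothetical and are not the output of any real parser, nor part of SuperParser or PathMiner.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical, parser-independent AST node: a label plus children."""
    label: str
    children: list[Node] = field(default_factory=list)


def size(node: Node) -> int:
    """Total number of nodes in the tree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def depth(node: Node) -> int:
    """Number of nodes on the longest root-to-leaf path."""
    return 1 + max((depth(child) for child in node.children), default=0)


def labels(node: Node) -> set[str]:
    """Set of distinct node labels used in the tree."""
    found = {node.label}
    for child in node.children:
        found |= labels(child)
    return found


# Two hand-written (hypothetical) trees for `int getX() { return x; }`.
# tree_a imitates a compact AST; tree_b imitates a more verbose,
# grammar-shaped tree with extra intermediate nodes.
tree_a = Node("MethodDeclaration", [
    Node("PrimitiveType"),
    Node("SimpleName"),
    Node("Block", [Node("ReturnStatement", [Node("SimpleName")])]),
])

tree_b = Node("method", [
    Node("type", [Node("primitive")]),
    Node("identifier"),
    Node("body", [Node("block", [Node("statement", [
        Node("return"),
        Node("expression", [Node("primary", [Node("identifier")])]),
    ])])]),
])

if __name__ == "__main__":
    for name, tree in (("tree_a", tree_a), ("tree_b", tree_b)):
        print(name, size(tree), depth(tree), len(labels(tree)))
```

Even for this one-line method, the two illustrative trees differ in size, depth, and label vocabulary, which is the kind of variation a tree-based or path-based model would see in its input.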

