COMEX: A Tool for Generating Customized Source Code Representations

07/10/2023
by Debeshee Das, et al.

Learning effective representations of source code is critical for any Machine Learning for Software Engineering (ML4SE) system. Inspired by natural language processing, large language models (LLMs) like Codex and CodeGen treat code as generic sequences of text and are trained on huge corpora of code data, achieving state-of-the-art performance on several software engineering (SE) tasks. However, valid source code, unlike natural language, follows a strict structure and pattern governed by the underlying grammar of the programming language. Current LLMs do not exploit this property: because they treat code as a sequence of tokens, they overlook key structural and semantic properties that can be extracted from code-views such as the Control Flow Graph (CFG), Data Flow Graph (DFG), and Abstract Syntax Tree (AST). Unfortunately, the process of generating and integrating code-views for every programming language is cumbersome and time-consuming. To overcome this barrier, we propose our tool COMEX, a framework that allows researchers and developers to create and combine multiple code-views which can be used by machine learning (ML) models for various SE tasks. Some salient features of our tool are: (i) it works directly on source code (which need not be compilable), (ii) it currently supports Java and C#, (iii) it can analyze both method-level snippets and program-level snippets by using both intra-procedural and inter-procedural analysis, and (iv) it is easily extendable to other languages as it is built on tree-sitter, a widely used incremental parser that supports over 40 languages. We believe this easy-to-use code-view generation and customization tool will give impetus to research in source code representation learning methods and ML4SE. Tool: https://pypi.org/project/comex; GitHub: https://github.com/IBM/tree-sitter-codeviews; Demo: https://youtu.be/GER6U87FVbU
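
To make the idea of combining code-views concrete, here is a minimal sketch that is not COMEX and does not use its API. It assumes Python's standard `ast` module in place of tree-sitter, works on Python rather than Java or C#, and handles only straight-line assignments (no branches, loops, or augmented assignments). The function name `build_code_views` and the edge layout are hypothetical, chosen for illustration. It builds one node set shared by two edge sets, an AST view and a simple data-flow view, which is the kind of combined graph an ML model could consume.

```python
import ast

# Marker nodes (Load/Store, Add, Eq, ...) are reused as single instances by
# CPython, so they are skipped to keep exactly one graph node per syntax node.
_MARKERS = (ast.expr_context, ast.operator, ast.unaryop, ast.boolop, ast.cmpop)


def build_code_views(source):
    """Build a combined AST + data-flow view of a straight-line snippet.

    Returns (nodes, ast_edges, dfg_edges):
      nodes     -- dict mapping node id to a label
      ast_edges -- list of (parent id, child id)
      dfg_edges -- list of (definition id, use id, variable name)
    """
    tree = ast.parse(source)
    ids = {}
    nodes = {}

    def node_id(node):
        # Assign a stable integer id the first time a syntax node is seen.
        key = id(node)
        if key not in ids:
            ids[key] = len(ids)
            label = type(node).__name__
            if isinstance(node, ast.Name):
                label += "(" + node.id + ")"
            nodes[ids[key]] = label
        return ids[key]

    # View 1: syntactic structure (parent -> child edges).
    ast_edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            if isinstance(child, _MARKERS):
                continue
            ast_edges.append((node_id(parent), node_id(child)))

    # View 2: data flow (last definition of a variable -> each later use).
    # Assumption: statements run top to bottom with no control flow.
    dfg_edges = []
    last_def = {}
    for stmt in tree.body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Reads happen before the statement's own writes take effect.
        for n in names:
            if isinstance(n.ctx, ast.Load) and n.id in last_def:
                dfg_edges.append((last_def[n.id], node_id(n), n.id))
        for n in names:
            if isinstance(n.ctx, ast.Store):
                last_def[n.id] = node_id(n)

    return nodes, ast_edges, dfg_edges
```

Calling `build_code_views("x = 1\ny = x + 2\nz = x * y\n")` returns a shared node table plus the two edge lists; a customized representation then amounts to choosing which edge sets to keep. A real tool additionally needs control-flow edges and inter-procedural analysis, which this sketch leaves out.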


