JEMMA: An Extensible Java Dataset for ML4Code Applications

12/18/2022
by Anjan Karmakar, et al.

Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how best to use the richly structured information in source code. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications: a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks needed to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information, such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results), for 50,000 Java projects from the 50KC dataset, comprising over 1.2 million classes and over 8 million methods. JEMMA is also extensible, allowing users to add new properties and representations to the dataset and to evaluate tasks on them. JEMMA thus serves as a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data. These results show that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.

research
08/10/2021

Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size

This paper presents Megadiff, a dataset of source code diffs. It focuses...
research
01/19/2023

Source Code Metrics for Software Defects Prediction

In current research, there are contrasting results about the applicabili...
research
03/24/2023

PENTACET data – 23 Million Contextual Code Comments and 500,000 SATD comments

Most Self-Admitted Technical Debt (SATD) research utilizes explicit SATD...
research
04/04/2019

Recommendations for Datasets for Source Code Summarization

Source Code Summarization is the task of writing short, natural language...
research
08/19/2019

The Strengths and Behavioral Quirks of Java Bytecode Decompilers

During compilation from Java source code to bytecode, some information i...
research
03/19/2023

Towards a Dataset of Programming Contest Plagiarism in Java

In this paper, we describe and present the first dataset of source code ...
research
05/21/2020

Java Decompiler Diversity and its Application to Meta-decompilation

During compilation from Java source code to bytecode, some information i...
