GraphRepo: Fast Exploration in Software Repository Mining

08/11/2020
by Alex Serban, et al.

Mining and storage of data from software repositories is typically done on a per-project basis, where each project uses a unique combination of data schema, extraction tools, and (intermediate) storage infrastructure. We introduce GraphRepo, a tool that enables a unified approach to extract data from Git repositories, store it, and share it across repository mining projects. GraphRepo uses Neo4j, an ACID-compliant graph database management system, and allows modular plug-in of components for repository extraction (drillers), analysis (miners), and export (mappers). The graph enables a natural way to query the data by removing the need for data normalisation. GraphRepo is built in Python and offers multiple ways to interface with the rich Python ecosystem and with big data solutions. The schema of the graph database is generic and extensible. Using GraphRepo for software repository mining offers several advantages over creating project-specific infrastructure: (i) high performance for short-iteration exploration and scalability to large data sets, (ii) easy distribution of extracted data (e.g., for replication) or sharing of extracted data among projects, and (iii) extensibility and interoperability. A set of benchmarks on four open source projects demonstrates that GraphRepo allows very fast querying of repository data, once extracted and indexed. More information can be found in the project's documentation (available at https://tinyurl.com/grepodoc) and in the project's repository (available at https://tinyurl.com/grrepo). A video demonstration is also available online (https://tinyurl.com/grrepov).
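
The division of labour between drillers, miners, and mappers can be illustrated with a minimal sketch. This is not GraphRepo's API: the `Graph` class is a tiny in-memory stand-in for Neo4j, and every name in it (`drill`, `mine_files_by_developer`, `map_to_rows`, the `AUTHORED` and `TOUCHED` relationships, and the sample commits) is a hypothetical chosen only for illustration. It shows how a question such as "which files has a developer changed?" becomes a walk along edges rather than a join over normalised tables.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny in-memory stand-in for a graph database (hypothetical, not Neo4j)."""
    nodes: dict = field(default_factory=dict)  # node id -> properties
    edges: dict = field(default_factory=lambda: defaultdict(list))  # (source, relation) -> targets

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, rel, dst):
        self.edges[(src, rel)].append(dst)

    def neighbours(self, src, rel):
        return self.edges.get((src, rel), [])


def drill(graph, commits):
    """Driller: turn raw commit records into nodes and relationships."""
    for c in commits:
        graph.add_node(c["author"], kind="Developer")
        graph.add_node(c["hash"], kind="Commit", message=c["message"])
        graph.add_edge(c["author"], "AUTHORED", c["hash"])
        for path in c["files"]:
            graph.add_node(path, kind="File")
            graph.add_edge(c["hash"], "TOUCHED", path)


def mine_files_by_developer(graph, developer):
    """Miner: follow Developer -> Commit -> File edges; no joins are needed."""
    files = set()
    for commit in graph.neighbours(developer, "AUTHORED"):
        files.update(graph.neighbours(commit, "TOUCHED"))
    return sorted(files)


def map_to_rows(developer, files):
    """Mapper: export a query result as flat rows, e.g. for a data frame."""
    return [{"developer": developer, "file": f} for f in files]


# Made-up sample data standing in for the output of a Git extraction step.
commits = [
    {"hash": "a1", "author": "alice", "message": "init", "files": ["README.md", "app.py"]},
    {"hash": "b2", "author": "bob", "message": "fix", "files": ["app.py"]},
    {"hash": "c3", "author": "alice", "message": "docs", "files": ["README.md"]},
]

graph = Graph()
drill(graph, commits)
rows = map_to_rows("alice", mine_files_by_developer(graph, "alice"))
print(rows)
```

In a real deployment the extraction step would read from Git and the graph would live in Neo4j, so the miner would issue a database query instead of walking Python dictionaries; the separation of the three roles is the point of the sketch.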

research
05/03/2022

Tooling for Time- and Space-efficient git Repository Mining

Software projects under version control grow with each commit, accumulat...
research
02/23/2021

The SmartSHARK Repository Mining Data

The SmartSHARK repository mining data is a collection of rich and detail...
research
04/18/2017

HEPData: a repository for high energy physics data

The Durham High Energy Physics Database (HEPData) has been built up over...
research
11/10/2022

Wikidata-lite for Knowledge Extraction and Exploration

Wikidata is the largest collaborative general knowledge graph supported ...
research
08/08/2020

More Effective Software Repository Mining

Background: Data mining and analyzing of public Git software repositorie...
research
10/18/2017

MEDOC: a Python wrapper to load MEDLINE into a local MySQL database

Since the MEDLINE database was released, the number of documents indexed...
research
12/11/2020

DataVault: A Data Storage Infrastructure for the Einstein Toolkit

Data sharing is essential in the numerical simulations research. We intr...