Tooling for Time- and Space-efficient git Repository Mining

05/03/2022
by Fabian Heseding, et al.

Software projects under version control grow with each commit, accumulating up to hundreds of thousands of commits per repository. Especially for such large projects, traversing a repository and extracting data for static source code analysis pose a trade-off between granularity and speed. We showcase the command-line tool pyrepositoryminer, which combines a set of optimization approaches for efficient traversal of and data extraction from git repositories while remaining adaptable to third-party and custom software metrics and data extractions. The tool is written in Python and combines bare repository access, in-memory storage, parallelization, caching, change-based analysis, and optimized communication between the traversal and custom data extraction components. It supports both metrics written in Python and external programs for data extraction. A single-threaded performance evaluation based on a basic mining use case shows a mean speedup of 15.6x compared to other freely available tools across four mid-sized open source projects. Multi-threaded execution distributes the load among cores and thus achieves a mean speedup of up to 86.9x using 12 threads.
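To make two of the listed optimizations concrete, the following is a minimal sketch of change-based analysis combined with caching. It is not pyrepositoryminer's implementation or API. The toy repository `COMMITS`, the helper `blob_id`, the class `CachedLineCount`, and the function `mine` are all hypothetical names introduced for illustration, and a SHA-1 hash of the file content merely stands in for a git blob ID.

```python
import hashlib

# Hypothetical toy "repository": each commit is a snapshot mapping paths to contents.
COMMITS = [
    {"a.py": "x = 1\n", "b.py": "def f():\n    return 2\n"},
    {"a.py": "x = 1\n", "b.py": "def f():\n    return 3\n"},           # b.py changed
    {"a.py": "x = 1\ny = 2\n", "b.py": "def f():\n    return 3\n"},    # a.py changed
    {"a.py": "x = 1\n", "b.py": "def f():\n    return 3\n"},           # a.py reverted
]


def blob_id(content: str) -> str:
    """Hash of the file content; an assumption standing in for a git blob ID."""
    return hashlib.sha1(content.encode("utf-8")).hexdigest()


class CachedLineCount:
    """Line-count metric that is computed at most once per distinct content."""

    def __init__(self) -> None:
        self.cache = {}        # blob ID -> metric value
        self.computations = 0  # number of real (uncached) evaluations

    def __call__(self, key: str, content: str) -> int:
        if key not in self.cache:
            self.computations += 1
            self.cache[key] = len(content.splitlines())
        return self.cache[key]


def mine(commits, metric):
    """Evaluate `metric` per file and commit, skipping work for unchanged files."""
    results = []
    prev_ids, prev_values = {}, {}
    for snapshot in commits:
        ids = {path: blob_id(content) for path, content in snapshot.items()}
        values = {}
        for path, content in snapshot.items():
            if prev_ids.get(path) == ids[path]:
                # Change-based analysis: file is unchanged, carry the value over.
                values[path] = prev_values[path]
            else:
                # Changed file: the cache still avoids recomputing known contents.
                values[path] = metric(ids[path], content)
        results.append(values)
        prev_ids, prev_values = ids, values
    return results


metric = CachedLineCount()
results = mine(COMMITS, metric)
```

In this toy history, eight file versions are visited but the metric is only evaluated for the four distinct contents: unchanged files are skipped by comparing IDs with the previous commit, and the reverted `a.py` in the last commit is served from the cache.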


research
08/11/2020

GraphRepo: Fast Exploration in Software Repository Mining

Mining and storage of data from software repositories is typically done ...
research
03/25/2019

git2net - Mining Time-Stamped Co-Editing Networks from Large git Repositories

Data from software repositories have become an important foundation for ...
research
11/30/2020

Toward a Benchmark Repository for Software Maintenance Tool Evaluations with Humans

To evaluate software maintenance techniques and tools in controlled expe...
research
02/23/2021

The SmartSHARK Repository Mining Data

The SmartSHARK repository mining data is a collection of rich and detail...
research
05/05/2022

Applicability of Software Reliability Growth Models to Open Source Software

Software Reliability Growth Models (SRGMs) are based on underlying assum...
research
05/17/2023

Testing GitHub projects on custom resources using unprivileged Kubernetes runners

GitHub is a popular repository for hosting software projects, both due t...
research
06/02/2021

Meta model application for consistency management of models for avionic systems design

This paper presents the application of a meta model and single underlyin...
