Towards a Change Taxonomy for Machine Learning Systems

by Aaditya Bhatia, et al.

Machine Learning (ML) research publications commonly provide open-source implementations on GitHub, allowing their audience to replicate, validate, or even extend machine learning algorithms, data sets, and metadata. However, thus far little is known about the degree of collaboration activity happening on such ML research repositories, in particular regarding (1) the degree to which such repositories receive contributions from forks, (2) the nature of such contributions (i.e., the types of changes), and (3) the nature of changes that are not contributed back by forks, which might represent missed opportunities. In this paper, we empirically study contributions to 1,346 ML research repositories and their 67,369 forks, both quantitatively and qualitatively (by building on Hindle et al.'s seminal taxonomy of code changes). We found that while ML research repositories are heavily forked, only 9% [...] sent changes to the parent repositories, half of which (52%) [...] the parent repositories. Our qualitative analysis of 539 contributed and 378 local (fork-only) changes extends Hindle et al.'s taxonomy with one new top-level change category related to ML (Data) and 15 new sub-categories, including nine ML-specific ones (input data, output data, program data, sharing, change evaluation, parameter tuning, performance, pre-processing, model training). While the changes that are not contributed back by the forks mostly concern domain-specific customizations and local experimentation (e.g., parameter tuning), the origin ML repositories do miss out on a non-negligible 15.4% [...] changes. The findings in this paper will be useful for practitioners, researchers, toolsmiths, and educators.
