Towards a Change Taxonomy for Machine Learning Systems

03/21/2022
by Aaditya Bhatia, et al.

Machine Learning (ML) research publications commonly provide open-source implementations on GitHub, allowing their audience to replicate, validate, or even extend machine learning algorithms, data sets, and metadata. However, little is known so far about the degree of collaboration activity on such ML research repositories, in particular regarding (1) the degree to which such repositories receive contributions from forks, (2) the nature of such contributions (i.e., the types of changes), and (3) the nature of changes that are not contributed back by forks, which might represent missed opportunities. In this paper, we empirically study contributions to 1,346 ML research repositories and their 67,369 forks, both quantitatively and qualitatively (by building on Hindle et al.'s seminal taxonomy of code changes). We found that while ML research repositories are heavily forked, only 9% [...] sent changes to the parent repositories, half of which (52%) [...] the parent repositories. Our qualitative analysis of 539 contributed and 378 local (fork-only) changes extends Hindle et al.'s taxonomy with one new top-level change category related to ML (Data) and 15 new sub-categories, including nine ML-specific ones (input data, output data, program data, sharing, change evaluation, parameter tuning, performance, pre-processing, model training). While the changes that are not contributed back by the forks mostly concern domain-specific customizations and local experimentation (e.g., parameter tuning), the origin ML repositories do miss out on a non-negligible 15.4% [...] changes. The findings in this paper will be useful for practitioners, researchers, toolsmiths, and educators.


