Enabling collaborative data science development with the Ballet framework

12/14/2020
by Micah J. Smith, et al.

While the open-source model for software development has led to successful large-scale collaborations in building software systems, data science projects are frequently developed by individuals or small groups. We describe challenges to scaling data science collaborations and present a novel ML programming model to address them. We instantiate these ideas in Ballet, a lightweight software framework for collaborative open-source data science and a cloud-based development environment, with a plugin for collaborative feature engineering. Using our framework, collaborators incrementally propose feature definitions to a repository; each proposed definition is subjected to an ML evaluation and can be automatically merged into an executable feature engineering pipeline. We use Ballet to conduct an extensive case study of a real-world income prediction problem, and discuss implications for collaborative projects.
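
The propose, evaluate, and merge loop described in the abstract can be illustrated with a minimal sketch. This is not Ballet's API: the names `Feature`, `FeaturePipeline`, `propose`, and `pearson` are hypothetical, the data is made up for illustration, and the acceptance test (absolute Pearson correlation with the target above a threshold) is a stand-in for whatever ML evaluation the framework actually applies.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, float]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


@dataclass
class Feature:
    """Hypothetical feature definition: a name plus a per-row transform."""
    name: str
    transform: Callable[[Row], float]


@dataclass
class FeaturePipeline:
    """Hypothetical pipeline that grows as proposed features pass evaluation."""
    threshold: float = 0.5
    accepted: List[Feature] = field(default_factory=list)

    def propose(self, feature: Feature, rows: List[Row], target: List[float]) -> bool:
        # Stand-in evaluation: accept if the feature correlates with the target.
        values = [feature.transform(r) for r in rows]
        if abs(pearson(values, target)) >= self.threshold:
            self.accepted.append(feature)  # the "automatic merge" step
            return True
        return False

    def transform(self, rows: List[Row]) -> List[List[float]]:
        # Apply every accepted feature to every row.
        return [[f.transform(r) for f in self.accepted] for r in rows]


# Illustrative data only (not from the case study).
rows = [{"age": 25, "hours": 40}, {"age": 35, "hours": 50},
        {"age": 45, "hours": 20}, {"age": 55, "hours": 60}]
target = [30.0, 60.0, 25.0, 80.0]

pipeline = FeaturePipeline()
hours_ok = pipeline.propose(Feature("hours", lambda r: r["hours"]), rows, target)
const_ok = pipeline.propose(Feature("constant", lambda r: 1.0), rows, target)
```

Here the `hours` feature is accepted and the constant feature is rejected, so `pipeline.transform(rows)` yields a single column. A real evaluation would also need to account for redundancy with already accepted features, which this sketch ignores.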

research
03/29/2021

Meeting in the notebook: a notebook-based environment for micro-submissions in data science collaborations

Developers in data science and other domains frequently use computationa...
research
07/17/2020

A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects

Background: Meeting the growing industry demand for Data Science require...
research
12/15/2022

A Data Source Dependency Analysis Framework for Large Scale Data Science Projects

Dependency hell is a well-known pain point in the development of large s...
research
08/12/2020

The Right Tools for the Job: The Case for Spatial Science Tool-Building

This paper was presented as the 8th annual Transactions in GIS plenary a...
research
06/29/2023

Statistically Enhanced Learning: a feature engineering framework to boost (any) learning algorithms

Feature engineering is of critical importance in the field of Data Scien...
research
10/11/2021

Beyond Desktop Computation: Challenges in Scaling a GPU Infrastructure

Enterprises and labs performing computationally expensive data science a...
research
11/08/2022

Caching and Reproducibility: Making Data Science experiments faster and FAIRer

Small to medium-scale data science experiments often rely on research so...
