SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle

09/06/2019
by   Matthias Boehm, et al.

Machine learning (ML) applications are becoming increasingly common in many domains. ML systems to execute these workloads include numerical computing frameworks and libraries, ML algorithm libraries, and specialized systems for deep neural networks and distributed ML. These systems focus primarily on efficient model training and scoring. However, the data science process is exploratory and deals with underspecified objectives and a wide variety of heterogeneous data sources. Therefore, additional tools are employed for data engineering and debugging, which requires crossing system boundaries, incurs unnecessary manual effort, and lacks optimization across the lifecycle. In this paper, we introduce SystemDS, an open source ML system for the end-to-end data science lifecycle from data integration, cleaning, and preparation, over local, distributed, and federated ML model training, to debugging and serving. To this end, we aim to provide a stack of declarative languages with R-like syntax for the different lifecycle tasks and for users with different expertise. We describe the overall system architecture, explain major design decisions (motivated by lessons learned from Apache SystemML), and discuss key features and research directions. Finally, we provide preliminary results that show the potential of end-to-end lifecycle optimization.

research
03/04/2021

Serverless Model Serving for Data Science

Machine learning (ML) is an important part of modern data science applic...
research
01/07/2020

Vamsa: Tracking Provenance in Data Science Scripts

Machine learning (ML) which was initially adopted for search ranking and...
research
09/04/2023

Is Your Learned Query Optimizer Behaving As You Expect? A Machine Learning Perspective

The current boom of learned query optimizers (LQO) can be explained not ...
research
04/20/2022

fairDMS: Rapid Model Training by Data and Model Reuse

Extracting actionable information from data sources such as the Linac Co...
research
08/13/2021

HPTMT Parallel Operators for High Performance Data Science Data Engineering

Data-intensive applications are becoming commonplace in all science disc...
research
05/28/2020

Parallelizing Machine Learning as a Service for the End-User

As ML applications are becoming ever more pervasive, fully-trained syste...
research
02/15/2023

Frameworks for SNNs: a Review of Data Science-oriented Software and an Expansion of SpykeTorch

Developing effective learning systems for Machine Learning (ML) applicat...
