PipeSim: Trace-driven Simulation of Large-Scale AI Operations Platforms

06/22/2020
by Thomas Rausch, et al.

Operationalizing AI has become a major endeavor in both research and industry. Automated, operationalized pipelines that manage the AI application lifecycle will form a significant part of tomorrow's infrastructure workloads. To optimize the operation of production-grade AI workflow platforms, we can leverage existing scheduling approaches, yet it is challenging to fine-tune operational strategies that achieve application-specific cost-benefit tradeoffs while catering to the specific domain characteristics of machine learning (ML) models, such as accuracy, robustness, or fairness. We present a trace-driven, simulation-based experimentation and analytics environment that allows researchers and engineers to devise and evaluate such operational strategies for large-scale AI workflow systems. We use analytics data from a production-grade AI platform developed at IBM to build a comprehensive simulation model. The model describes the interaction between pipelines and system infrastructure, and how pipeline tasks affect different ML model metrics. We implement the model in a standalone, stochastic, discrete-event simulator and provide a toolkit for running experiments. Synthetic traces are made available for ad-hoc exploration, as well as for statistical analysis of experiments, to test and examine pipeline scheduling, cluster resource allocation, and similar operational mechanisms.
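
To make the idea of a stochastic discrete-event simulation concrete, the following is a minimal sketch in Python. It is not the PipeSim implementation or its API. Everything in it is an illustrative assumption: the function name `simulate`, the exponential interarrival and task-duration distributions, the fixed number of tasks per pipeline, the FIFO scheduling policy, and all default parameter values. Pipelines arrive at random times, each runs a sequence of tasks on a cluster with a fixed number of slots, and the simulator reports how long tasks waited for a free slot.

```python
import heapq
import random
from collections import deque


def simulate(num_pipelines=100, capacity=2, mean_interarrival=5.0,
             mean_task_duration=3.0, tasks_per_pipeline=3, seed=42):
    """Toy discrete-event simulation of pipelines sharing a cluster.

    All parameters and distributions are hypothetical, chosen only for
    illustration; none of them come from the paper.
    """
    rng = random.Random(seed)
    events = []  # min-heap of (time, sequence number, kind, pipeline id)
    seq = 0      # tie-breaker so events at equal times pop in insertion order

    # Pre-generate pipeline arrivals with exponential interarrival times.
    t = 0.0
    for pid in range(num_pipelines):
        t += rng.expovariate(1.0 / mean_interarrival)
        heapq.heappush(events, (t, seq, "arrive", pid))
        seq += 1

    free = capacity      # currently idle cluster slots
    waiting = deque()    # FIFO queue of (pipeline id, time the task was queued)
    remaining = {}       # pipeline id -> number of tasks still to run
    total_wait = 0.0
    completed = 0
    now = 0.0

    def start_task(pid, start_time):
        # Occupy one slot and schedule the task's completion event.
        nonlocal free, seq
        free -= 1
        duration = rng.expovariate(1.0 / mean_task_duration)
        heapq.heappush(events, (start_time + duration, seq, "finish", pid))
        seq += 1

    while events:
        now, _, kind, pid = heapq.heappop(events)
        if kind == "arrive":
            remaining[pid] = tasks_per_pipeline
            waiting.append((pid, now))
        else:  # a task finished: release its slot, queue the next task if any
            free += 1
            remaining[pid] -= 1
            if remaining[pid] > 0:
                waiting.append((pid, now))
            else:
                completed += 1
        # FIFO scheduling: start queued tasks while slots are available.
        while free > 0 and waiting:
            next_pid, queued_at = waiting.popleft()
            total_wait += now - queued_at
            start_task(next_pid, now)

    total_tasks = num_pipelines * tasks_per_pipeline
    return {
        "completed": completed,
        "makespan": now,
        "mean_wait": total_wait / total_tasks,
    }


if __name__ == "__main__":
    for slots in (1, 2, 4):
        print(slots, simulate(capacity=slots))
```

Swapping the FIFO loop for another policy, or replacing the sampled arrivals with timestamps read from a recorded trace, is the kind of experiment such a simulator is meant to support; the sketch only shows the event-queue mechanics.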
