PipeSim: Trace-driven Simulation of Large-Scale AI Operations Platforms

06/22/2020 ∙ by Thomas Rausch, et al. ∙ IBM

Operationalizing AI has become a major endeavor in both research and industry. Automated, operationalized pipelines that manage the AI application lifecycle will form a significant part of tomorrow's infrastructure workloads. To optimize operations of production-grade AI workflow platforms we can leverage existing scheduling approaches, yet it is challenging to fine-tune operational strategies that achieve application-specific cost-benefit tradeoffs while catering to the specific domain characteristics of machine learning (ML) models, such as accuracy, robustness, or fairness. We present a trace-driven simulation-based experimentation and analytics environment that allows researchers and engineers to devise and evaluate such operational strategies for large-scale AI workflow systems. Analytics data from a production-grade AI platform developed at IBM are used to build a comprehensive simulation model. Our simulation model describes the interaction between pipelines and system infrastructure, and how pipeline tasks affect different ML model metrics. We implement the model in a standalone, stochastic, discrete event simulator, and provide a toolkit for running experiments. Synthetic traces are made available for ad-hoc exploration as well as statistical analysis of experiments to test and examine pipeline scheduling, cluster resource allocation, and similar operational mechanisms.







I Introduction

Developing and operating artificial intelligence (AI) applications involves complex workflows that comprise many tasks performed in an increasingly automated fashion. This includes data preprocessing flows, data bias checks, machine learning (ML) model training, vulnerability mitigation algorithms, model compression steps, and potentially many others. Some examples of such workflows are shown in Figure 1. Systems such as ModelOps [10] are used to compose and automate such high-level AI workflows into pipelines that track both build and runtime metrics of deployed ML models. Pipelines are re-executed either automatically due to events such as arrival of new training data, or manually by developers or data scientists [23, 26]. Furthermore, these pipelines can be long running. It is not uncommon for the training of a deep neural network, for example, to take days. Consequently, a production system may have hundreds or thousands of pipelines executing at any given time to build and maintain AI models.

Fig. 1: Prototypical AI pipelines: (1) simple process-train-validate-deploy workflow, (2) extended pipelines with custom steps, (3) hierarchical pipelines with transfer learning

In one of the use cases we have explored as part of this research, a company from the health care domain trains ML models to predict certain health conditions of their patients based on real-time monitoring of bodily functions. The re-training has traditionally been performed manually every four weeks with the newly collected data (which are in the double digit millions). Manual maintenance of these pipelines turned out to be unsustainable, especially as the company is moving to shorter training intervals and training of large numbers of patient-specific models.

AI platforms need to cope with an ever-increasing number of AI pipelines used to manage the AI application lifecycle. Capacity planning becomes more difficult given the heterogeneous workloads and infrastructure required, and the variety of automation rules that trigger pipeline executions. Developing operational strategies, such as optimized scheduling of automated pipelines, therefore plays a major role in the development of AI platforms. Ultimately, our goal is to leverage the runtime data of pipeline executions to build predictive models that allow us to automatically derive optimized scheduling decisions for continuous improvement of AI models. To that end, this paper contributes a trace-driven simulation framework for analysis and experimentation in large-scale AI operations platforms. This paper is an extended version of the OpML’20 paper [17]. Our approach builds on modeling the structural and qualitative aspects of AI pipelines and executing these pipelines in a simulated environment, in order to fine-tune operational strategies in the real system by systematically mutating parameters in an iterative, exploratory process. The simulation model reflects existing high-level AI ops platform architectures such as ModelOps [10] and IBM Watson OpenScale, and considers concepts from AI application workflows as first-class citizens. We implement the model in a stochastic discrete event simulator using existing techniques [6, 14]. To obtain a realistic representation of different execution parameters (e.g., job arrival patterns, task execution times, training data sizes), we apply clustering and sampling techniques to extract statistical distributions from a database of real-world usage events of a large-scale cloud-based AI platform. The database contains millions of data points recorded from several thousand pipeline executions over the course of more than a year. To the best of our knowledge, this is among the first papers to present a comprehensive system model, data acquisition approach, and experimentation environment for large-scale AI operations platforms that span the entire lifecycle of AI models.

Our evaluation shows that the framework provides critical and actionable insights into the performance of AI lifecycle pipelines in terms of system utilization, resource bottlenecks, and the impact of different platform usage patterns. By linking the experimentation environment back to the real system, the approach enables us to evaluate operational strategies, fine-tune pipeline scheduling algorithms in the platform, and perform capacity planning. Furthermore, we show that our approach can simulate and analyze years of data on a standalone machine in reasonable time, and that our modeling approach has good simulation accuracy.

II Related Work

This section discusses related work in the areas of AI lifecycles and operations platforms, system modeling and simulation, as well as job scheduling for model retraining loops.

KeystoneML [19] optimizes pipelines, but the pipelines capture a different level of abstraction. It focuses on details of what we abstract as a black box: training. Workflows become more complex when the end-to-end lifecycle of a model, including deployment in production, has to be considered. ModelHub [15] is a platform for lifecycle management of deep learning models, which uses synthetic datasets simulating a model developer performing various tasks to develop a face recognition model. While their approach is centered around enumerating models with different network architectures and hyperparameters, we focus on the pipeline orchestration layer to derive operational strategies to avoid model staleness and manage model performance over time. The idea of optimizing a workload schedule towards “overall user happiness” in the machine learning space has also been explored for problems like multi-tenant model selection [11, 25]. That work focuses on maximizing aggregate model accuracy, while our work can accommodate an arbitrary set of model attributes including training time, accuracy, and vulnerability scores.

As ML models are based on data, and “data is expected to evolve over time” [7], it is normal for the predictive power of models to degrade and require retraining [26]. [7] present a taxonomy and survey of approaches to detect and repair concept drift. These patterns once encoded as AI pipelines can be simulated and optimized with the platform presented in this paper. Similarly, recent work on trust in AI emphasizes the need for lineage and traceability of models’ quality attributes across the lifecycle [9]. Asserting model quality requires system support on the infrastructure and job scheduling level, and our work provides the foundation to evolve and optimize these operational strategies.

TensorFlow Extended (TFX) [1] is a large-scale machine learning platform that tackles the lifecycle of models from data prep to training and deployment. TFX uses the notion of model freshness in the context of retraining. TFX doesn’t address the modeling of lifecycle attributes or simulation to understand and optimize pipelines.

There is considerable work on modeling and simulating workflows in different domains. This includes seminal work on process mining and simulation of business processes [13], and simulating the operational characteristics of cloud data centers by platforms like CloudSim [3], iCanCloud [16], or recently more platform-specific simulators like All-Spark [12]. This paper continues in the spirit of this literature but our focus is on modeling AI pipelines and concepts they operate on.

III Optimized AI Platform Operations

We have come across an increasing number of teams that automate their model training and deployment pipelines. In the following we discuss the relationship between model metrics and AI pipelines, and the challenges of automating and optimizing the scheduling and execution of pipelines.

III-A ML Model Metrics and AI Pipelines

ML model metrics are essential to the AI application lifecycle, as they drive development and deployment decisions. Several types of metrics have been defined for ML models [8]. We distinguish between static (or build-time) metrics of models, which are inherent properties of the model, and dynamic (or run-time) metrics which are attributes that may change over time and depend on the inferencing data. Static metrics include test accuracy or AUC, ML model size (e.g., number of weights, or bytes), or model robustness (e.g., CLEVER score [24]). Dynamic metrics include inferencing time, scoring confidence, or concept drift [7].

A key metric, central to our approach, is model staleness, which refers to the decrease in predictive performance of an ML model over time. A major goal of AI operationalization, and in particular the inclusion of runtime metrics in the decision mechanism for pipeline execution, is to mitigate negative performance effects in an AI application caused by model staleness. Based on model staleness, we can synthesize additional metrics that can help us make pipeline scheduling decisions. For example, we can see staleness as being inversely proportional to the current potential of a retraining pipeline to improve the model. This potential could be given by a) the current performance of the model, a composite value that aggregates static and dynamic metrics into a single value, and b) the newly labeled data available since the last retraining.

It is important to note that this potential improvement captures both known or measurable quality degradation (e.g., concept drift, a statistically significant deviation of inferencing data from training data) and unknown or unmeasurable performance risks (e.g., an attacker may attempt to steal a model by probing it with test requests [21]). This means that a deployed model is usually associated with a certain risk of degrading or becoming stale over time [2], which is illustrated in Figure 2.

Fig. 2: Illustration of model performance over time, indicating an adversarial attack [18] (left), accuracy drop due to a sudden concept drift [20] (right), and abstract concept drift patterns [7] (bottom).

In order to manage the risks and prevent models from becoming stale, it is crucial for AI operations to employ pipeline automation with triggers that monitor runtime indicators of deployed models. An execution trigger is a set of rules that reason about the pipeline inputs (such as the dataset used for training, to detect changes or drift), previous executions of the pipelines (when the model was last built and deployed), and the performance of the deployed model. If any of the underlying rules is met, an execution of the pipeline is triggered automatically. Figure 3 illustrates the feedback loop.
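As a rough illustration of such an execution trigger, the following sketch models a trigger as a list of rule predicates, any one of which fires the pipeline. The `PipelineState` fields, the class names, and all thresholds are hypothetical and chosen only for this example; they are not taken from the platform described here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PipelineState:
    """Hypothetical snapshot of what a trigger can reason about."""
    new_training_rows: int         # rows that arrived since the last build
    hours_since_last_build: float  # age of the deployed model
    drift_score: float             # runtime drift metric of the deployed model


# A rule is any predicate over the pipeline state.
Rule = Callable[[PipelineState], bool]


@dataclass
class ExecutionTrigger:
    rules: List[Rule]

    def should_execute(self, state: PipelineState) -> bool:
        # The pipeline is triggered as soon as any single rule is met.
        return any(rule(state) for rule in self.rules)


# Illustrative thresholds only (assumptions, not values from the paper).
trigger = ExecutionTrigger(rules=[
    lambda s: s.new_training_rows >= 10_000,
    lambda s: s.hours_since_last_build >= 24 * 28,
    lambda s: s.drift_score > 0.2,
])

fresh = PipelineState(new_training_rows=50, hours_since_last_build=12.0, drift_score=0.05)
drifted = PipelineState(new_training_rows=50, hours_since_last_build=12.0, drift_score=0.35)
print(trigger.should_execute(fresh), trigger.should_execute(drifted))
```

Keeping rules as plain predicates makes it easy to mutate thresholds between simulation runs without touching the trigger logic.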

Fig. 3: Automated AI pipelines with execution triggers, quality gates, and runtime monitoring

The basic unifying characteristic of AI ops pipelines is that they generate or augment AI models. Each task plays a role in either training a model, or enhancing model metrics such as performance or robustness. This concept allows us to reason about a model and its properties, and the pipeline it is generated by, almost interchangeably. We can think of the model as a latent component of a pipeline, whose properties are assigned or changed once its specific tasks are executed.

III-B Optimized Operation of Automated AI Pipelines

A key challenge is to determine the trade-off between the costs associated with pipeline execution and the benefit of (expected) performance improvement [1]. Optimized scheduling and resource sharing for ML pipelines under constraints is an active research problem [11], [1]. Previous research has found that ML platform users often do not use infrastructure resources in a useful way [11]. For example, users would reserve several GPUs for weeks to improve models that already had an accuracy of 0.99. From an AI platform provider’s perspective, the operational challenge is to reconcile user SLAs (encoded as pipeline execution triggers or rules), infrastructure utilization, and fairness. Given the notion of potential improvement, we could conceive a platform whose goal is to optimize the potential improvement of all automated AI pipelines, while maintaining SLAs and balancing infrastructure utilization. Figure 4 conceptualizes a scheduler that deploys pipelines onto limited infrastructure, based on probabilistic parameters (e.g., model staleness), user preferences (e.g., model prioritization), and resource availability. However, developing and testing such schedulers is difficult given the lack of frameworks and simulation tools, which is what we address in this paper.

Fig. 4: An AI pipeline scheduler optimizes overall user satisfaction and resource balancing

III-C Challenges in Operational Research for AI Ops

Recent work on scheduling and multi-tenant resource sharing allows users to utilize training infrastructure in an efficient way [11]. Yet, devising optimal operational strategies is still challenging due to a number of factors:

Cutting Edge: As large-scale AI pipelines are an emerging field, industry and research have not yet converged to a common configuration format for pipelines, hence pipeline definitions are often custom-built and not readily available for analysis or optimization.

Limited Testability: The execution of AI pipelines - including tasks like model training on GPU clusters - is often long-running, resource intensive, and hence costly. Operational research relies on experimentation with quick feedback cycles, yet executing test workloads to evaluate large-scale strategies in the real system may be prohibitively expensive.

Lack of Data: As the field is still evolving, there is currently a lack of datasets about AI pipeline executions, capturing the variety of specialized ML tasks (e.g., data preprocessing, model training, bias and robustness checks, model compression, etc), including the specific effects of each task on the code, data, and model assets, across the entire lifecycle.

IV Experimentation and Analytics for AI Operations Platforms

In this section we introduce PipeSim, an experimentation and analytics framework for AI operations platforms. The system architecture is illustrated in Figure 5. The real system, which consists of compute clusters, pipeline executions, and various lifecycle services (e.g., model compression, model robustness, and data bias checkers), is represented in a modeled system, which feeds into a simulator component. The system model is parameterized with simulation parameters whose distributions are sampled from empirical data, extracted from real-world usage logs of the system.

The operational strategies in the modeled system emulate the strategies that control the real system. For example, a pipeline scheduler would make API calls to the pipeline executor in the real system. In the modeled system, the scheduler operates on the system model, i.e., creates a pipeline entity with current timestamp and feeds it into the simulator.

Fig. 5: System Architecture of PipeSim

The main entry point for users is to define an experiment and its parameters; once it gets executed by the simulator, the user can drill into the results using the exploratory analysis dashboard. The statistical analysis tool analyzes the results of the experiment database and is used to create predictive models, which themselves feed into the operational strategies, to close the feedback loop into the real system.

In the following subsections we discuss the overall approach and the individual components of the system in more detail.

IV-A AI Ops Platforms: Conceptual System Model

As discussed in Section III-A, automated AI ops pipelines integrate the entire lifecycle of an AI model, from training, to deployment, to runtime monitoring. In our definition of pipelines, we distinguish between the build-time and run-time view of the platform – analogous to the two steps in the conceptual ML workflow described by [1]: the training phase and the inference phase.

IV-A1 Build-Time View

At build time, an AI pipeline generates or augments a model (classifier) by operating on data assets and using underlying infrastructure resources (e.g., GPUs). The build-time view of our system model is illustrated in Figure 6.

Fig. 6: Conceptual system model: Build-time view
Pipeline and Tasks

AI pipelines are compositions of tasks that create or operate on machine learning models [19, 1]. Formally, a pipeline is a directed graph (digraph) whose vertices are tasks (i.e., workflow operations) and whose directed edges are task transitions labeled with the input that triggers a transition. To better reason about the control flow, it is useful to also explicitly model decisions and joins (tasks that are only transitioned to if all previous tasks have been executed). However, as this is not the focus of the paper, we omit these definitions. We model different types of tasks and shorten the notation to the first letter of the type. Each task instance holds type-specific variables; for example, we associate the original and the transformed data asset with a data preprocessing task.


A resource represents an infrastructure component required for executing pipeline tasks. We model a generic system comprising (i) a generic data storage, (ii) a training, and (iii) a compute infrastructure, but we allow for custom configuration of resource types. Data stores are abstracted in terms of read and write operations, and can therefore be anything from a cloud-based object store (such as S3), to a relational or NoSQL database. Cloud-based AI platforms, such as Watson OpenScale, perform training of models on a dedicated infrastructure with specialized hardware (GPUs), whereas generic compute tasks, such as data preprocessing, may be performed on general purpose compute hardware (e.g., running Spark or Hadoop). Each compute resource is assumed to have a specified job capacity and a work queue, but we do not make assumptions about individual scheduling mechanisms.


An asset denotes any data artifact that is transferred, processed, and generated by data stores and compute resources. Modeling asset characteristics is critical, as the execution time of pipeline tasks often significantly depends on the size and dimensions of processed data sets, and is necessary to simulate traffic between tasks and data stores. We distinguish between Data Assets and Trained ML Models. This allows us to store ML model specific characteristics, like the number of neurons or layers of neural networks.

Task Executors

Task executors encapsulate the actual system operations performed by a task. For example, a task executor of a training task on a distributed compute cluster is an iterative process of fetching training data from a remote storage (like S3), running an optimization algorithm (e.g., gradient descent) on the data, and finally persisting the model to the storage. We define five system operations: the first two are read and write operations on the data store for an asset, the next two are request and release operations for compute resources, and the last is the task-type-specific execution of a task on a compute resource, which describes how that task interacts with the system resources. A task executor is therefore a sequence of these operations. Typically, the first and last operations in a sequence are read and write operations, respectively.
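A minimal sketch of a task executor as a sequence of system operations might look as follows. The `Op` and `Step` names, the helper `training_executor`, and the particular ordering for a training task are assumptions made for illustration; the actual simulator may structure this differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Op(Enum):
    READ = "read"        # read an asset from the data store
    WRITE = "write"      # write an asset to the data store
    REQUEST = "request"  # request a compute resource
    RELEASE = "release"  # release a compute resource
    EXECUTE = "execute"  # task-type-specific execution on the resource


@dataclass(frozen=True)
class Step:
    op: Op
    target: str  # name of the asset or resource the operation acts on


def training_executor(data_asset: str, model_asset: str, cluster: str) -> List[Step]:
    """Operation sequence for a training task (hypothetical helper)."""
    return [
        Step(Op.READ, data_asset),    # fetch training data
        Step(Op.REQUEST, cluster),    # wait for a free slot on the cluster
        Step(Op.EXECUTE, cluster),    # run the training job
        Step(Op.RELEASE, cluster),    # free the slot
        Step(Op.WRITE, model_asset),  # persist the trained model
    ]


steps = training_executor("training-data", "model", "gpu-cluster")
print([step.op.value for step in steps])
```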

IV-A2 Run-Time View

The outcome of a successful pipeline execution is usually a deployed model that is being served by the platform and used by applications for scoring. At runtime, the deployed model has associated performance indicators that change over time. Some indicators can be measured directly, by inspecting the scoring inputs and outputs (e.g., confidence), whereas other metrics (e.g., bias, or drift) require continuous evaluation of the runtime scoring data against the statistical properties of the historical scorings and training data.

Fig. 7: Run-time view: Drift detection and continuous retraining

The ability to react to scoring events, introspect the historical payloads, and continuously compute performance metrics is a critical piece of functionality that itself requires a significant amount of computational resources. In fact, model performance monitoring tools like drift detectors are themselves ML models, for example based on model explanation classifiers like IME [4], that need to be trained, deployed, and continuously maintained.

Figure 7 illustrates the interaction between model assets and performance monitoring in PipeSim. A detector component monitors a trained classifier, and computes drift and staleness metrics over time. At some point in time, a trigger rule detects that the drift metric exceeds a threshold, and triggers a retraining pipeline which creates a new version of the classifier. Note that the pipeline may employ active learning with humans in the loop to perform data selection and labeling.

IV-B Pipeline and Data Synthesizer

To run an experiment we need to artificially generate data that follow some underlying process we can control. We describe our approach to generate synthetic pipelines and assets that are used as input for a simulation.

IV-B1 Synthetic Pipelines

Because the goal is to simulate the execution of a large number of pipelines, a key element of the experimentation environment is a Pipeline Synthesizer that stochastically generates plausible pipelines. That is, although there should be some randomness involved, the sequence of steps in a synthetic pipeline should be sensible (e.g., a model validation task cannot precede a training task). This process is challenging because the structure of complex AI ops pipelines, i.e., beyond simple train–deploy workflows, is still poorly understood. However, we can make some basic assumptions about pipeline structures based on the prototypical pipelines we have identified by analyzing both commercial and research use cases as presented in Figure 1.

We also recognize that some steps within these pipeline structures may be optional. While a pipeline that generates a model unconditionally requires a training step, not all require a data preprocessing step if they make use of already structured and curated datasets. When generating pipelines, some tasks therefore have a certain (possibly conditional) probability associated with them, that may depend on the state of the pipeline currently being generated.
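The idea of optional tasks with (conditional) inclusion probabilities can be sketched as below. The task names, the probabilities, the treatment of deployment as mandatory, and the rule that compression is only considered for neural networks are all assumptions for illustration, not the synthesizer's actual configuration.

```python
import random
from typing import List

# Hypothetical inclusion probabilities for optional tasks.
P_PREPROCESS = 0.7
P_VALIDATE = 0.8
P_COMPRESS_GIVEN_NEURAL_NET = 0.4


def synthesize_pipeline(rng: random.Random, neural_net: bool) -> List[str]:
    """Generate one plausible task sequence in a fixed, sensible order."""
    tasks: List[str] = []
    if rng.random() < P_PREPROCESS:
        tasks.append("preprocess")  # optional: data may already be curated
    tasks.append("train")           # every model-generating pipeline trains
    if rng.random() < P_VALIDATE:
        tasks.append("validate")    # can only follow training
    # Conditional probability: depends on the state of the pipeline so far.
    if neural_net and rng.random() < P_COMPRESS_GIVEN_NEURAL_NET:
        tasks.append("compress")
    tasks.append("deploy")          # assumed mandatory in this sketch
    return tasks


rng = random.Random(42)
for _ in range(3):
    print(synthesize_pipeline(rng, neural_net=True))
```

Because the order of the `if` blocks is fixed, randomness only affects which tasks appear, never their relative order, so a validation task can never precede training.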

Task characteristics, such as the framework and algorithm used for training, or the number of operations in a preprocessing step, are also synthesized. Such characteristics can be based on frequencies observed in production systems, or configured as experiment parameters. For example, while examining the AI pipeline executions of our platform we found that 63% of training jobs use SparkML, 32% use TensorFlow, 3% PyTorch, 1% Caffe, and 1% use a variety of other frameworks. Given that different frameworks map to different infrastructures (e.g., a Spark cluster) and often correlate with significantly different execution times (see later in Figure 9(b)), we want to easily adapt these percentages to observe the effect on the system.
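One simple way to make these percentages an adjustable experiment parameter is weighted sampling from a configurable table. The sketch below uses the shares reported above; the function name and the `"other"` label are illustrative assumptions.

```python
import random
from collections import Counter
from typing import Dict, List

# Shares reported in the text; "other" stands for the remaining frameworks.
FRAMEWORK_SHARES: Dict[str, float] = {
    "SparkML": 0.63,
    "TensorFlow": 0.32,
    "PyTorch": 0.03,
    "Caffe": 0.01,
    "other": 0.01,
}


def sample_frameworks(n: int, shares: Dict[str, float], seed: int = 0) -> List[str]:
    """Draw n framework labels according to the given shares."""
    rng = random.Random(seed)
    names = list(shares)
    weights = [shares[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_frameworks(10_000, FRAMEWORK_SHARES))
print(counts.most_common())
```

Passing a different `shares` table (for example, one with a larger TensorFlow share) is then enough to study how a shift in framework usage affects the simulated system.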

IV-B2 Synthetic Assets

Key ingredients of AI pipelines are assets, i.e., training data assets and trained models. We describe properties of assets as multivariate random variables, as this allows us to use common sampling methods for synthesizing assets.

For example, we model a data asset as an observation of a three-dimensional random variable whose components are the number of dimensions of the dataset (e.g., number of columns in tabular data), the number of rows or instances, and the disk space in bytes required to store the asset (uncompressed). In Section V-A we describe how we can obtain synthetic data assets by sampling from a multivariate Gaussian mixture fitted on empirical data.

Describing a trained model in this way is not as straightforward, as many of its properties are the result of stochastic processes in the pipeline execution. In general, we say that a model has a set of static and dynamic properties. Static properties include those that are assigned by the pipeline at build-time, such as the prediction type (e.g., binary or multiclass classification, or regression) or the model type (e.g., Linear Regression, Random Forest, or Neural Network). Dynamic properties include metrics we have described in Section III-A, such as model performance and CLEVER score [24].

IV-C Process Simulator

IV-C1 Task Execution

Besides the side effects a task execution has on the properties of an asset, we are mostly interested in simulating the execution duration of a task. Because a task is a sequence of system operations, we can define the execution duration of a task as the sum over the execution durations of the task’s system operations. Similarly, because we define a pipeline as a sequence of steps, we can define the execution duration of an entire pipeline as the sum over task execution durations (the current system model assumes that tasks are not running in parallel). This allows granular modeling of system behavior. For example, the duration of a request operation equates to the time a task has to wait for a resource to become available, and the durations of read and write operations are functions of the asset and the up/download bandwidth and latency associated with the data store, and so on. These functions can be expressed as analytical functions of system properties. However, for functions that are subject to complex processes, such as data preprocessing or model training tasks, we rely on empirical data and statistical modeling of these functions.
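The additive duration model can be sketched in a few lines. The transfer formula used here (latency plus size divided by bandwidth) is a common simplification and an assumption of this example, as are all function names and numbers.

```python
from typing import Dict, List


def transfer_time(size_bytes: float, bandwidth_bps: float, latency_s: float) -> float:
    """Assumed analytical duration of a read or write operation."""
    return latency_s + size_bytes / bandwidth_bps


def task_duration(op_durations: List[float]) -> float:
    """A task's duration is the sum of the durations of its operations."""
    return sum(op_durations)


def pipeline_duration(tasks: Dict[str, List[float]]) -> float:
    """Tasks run sequentially, so the pipeline duration is the sum over tasks."""
    return sum(task_duration(ops) for ops in tasks.values())


# Illustrative numbers only (seconds, bytes, bytes per second).
read = transfer_time(size_bytes=2e9, bandwidth_bps=1e8, latency_s=0.5)
write = transfer_time(size_bytes=5e7, bandwidth_bps=1e8, latency_s=0.5)
tasks = {
    # each list: read, wait for resource, execute, write
    "preprocess": [read, 4.0, 120.0, write],
    "train": [read, 30.0, 3600.0, write],
}
print(pipeline_duration(tasks))
```

In a full simulation the wait and execute entries would not be constants: the wait would come from the resource's work queue and the execute time from a fitted statistical model.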

IV-C2 Pipeline Arrivals

A fundamental parameter of a platform simulation is the average rate at which pipeline executions are triggered, i.e., the average arrival rate. A typical way of simulating arrivals is to model the time between arrivals (interarrivals) as a random variable [14] and to sample from the underlying distribution after every event. It is well known that interarrivals typically follow some exponential or related process. Researchers have found that, for example, TCP traffic is well described by lognormal, Pareto, or Weibull distributions [5]. Our data suggest that pipeline interarrival times follow an exponentiated Weibull process (see Section V-A). However, pipelines are executed at a given time either because they were triggered manually by a user (or other application), or because they were triggered automatically due to a rule. To us as an operator, the former is a random process, whereas the latter is one we have control over. It is therefore useful to simulate the processes that underlie the user-defined rules (e.g., the arrivals of new data, the expected model drift, etc.). Describing these processes is part of the run-time view of the system and is ongoing work. Furthermore, it is important to preserve, to some degree, the complexity of workload arrival patterns, and there are many ways to achieve this. We discuss our approach in more detail in Section V-A.
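A minimal sketch of interarrival sampling from an exponentiated Weibull distribution, using inverse-transform sampling with only the standard library, is shown below. The parameter names `a`, `c`, and `scale` follow the common shape/shape/scale parameterization, and the values in the usage line are placeholders rather than the fitted parameters.

```python
import math
import random
from typing import List


def sample_interarrival(rng: random.Random, a: float, c: float, scale: float) -> float:
    """Inverse-transform sample from an exponentiated Weibull distribution.

    CDF: F(x) = (1 - exp(-(x / scale) ** c)) ** a, for x >= 0.
    """
    u = rng.random()  # uniform in [0, 1)
    # Clamp to avoid log(0) if u ** (1 / a) rounds to exactly 1.0.
    p = min(u ** (1.0 / a), 1.0 - 1e-12)
    return scale * (-math.log(1.0 - p)) ** (1.0 / c)


def arrival_times(n: int, a: float, c: float, scale: float, seed: int = 0) -> List[float]:
    """Accumulate interarrival times into absolute arrival timestamps."""
    rng = random.Random(seed)
    t = 0.0
    times: List[float] = []
    for _ in range(n):
        t += sample_interarrival(rng, a, c, scale)
        times.append(t)
    return times


# Placeholder parameters, in seconds; not the values fitted on the traces.
print(arrival_times(5, a=1.5, c=0.8, scale=60.0))
```

With `a = 1` and `c = 1` the distribution reduces to an exponential with mean `scale`, which gives a quick sanity check for the sampler.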

V AI Platform Simulator

This section presents our effort towards a full implementation of the experimentation and analytics environment as described in the previous section. The current implementation comprises a simulator backed by empirical data from (product name blinded), and an analytics frontend connected to a time series database that contains the simulated system data. In this section we first describe how we acquired the data for statistical modeling of simulation processes. We then describe the implementation of individual components.

V-A Statistical Modeling and Pipeline Simulation

As discussed in Section IV-B, a fundamental requirement for an experimentation environment is to allow meaningful reasoning over the original system, which in turn requires the generated data and the underlying process simulation to reflect properties of that system. While we can make some assumptions about, e.g., the distribution of the number of pipelines per user (where the Pareto principle will likely apply), most data should be based on empirical observations. The analytics database of (product name blinded) with several million rows of user and system events is our primary source of empirical data. We run queries on this database and fit different statistical distributions on the extracted data, which include several thousand pipeline execution traces.

In general, we generate random entities in the simulator by first using scikit-learn or SciPy to fit statistical models on the respective observed data. The generated models or distribution parameters are exported using Python’s serialization to the simulator. During simulation time, we can then sample individuals from the statistical model. In principle, fitting all necessary models that we currently use could also be done on the fly when starting the simulation, but would take on the order of a few minutes. This allows us to plug in the live, updated data sources for grounding simulation parameters.

V-A1 Synthesizing Data Assets

The data processing component of (product name blinded) stores metadata about data assets, i.e., the number of dimensions and instances, as well as the number of bytes it transfers between object storage and execution platform. To generate synthetic data assets we sample from the distribution of rows, columns, and bytes processed. Note that, while the usage data we sample from is mostly about processing of tabular/structured data, our approach generalizes to arbitrary types of data (e.g., images, text, speech).

We filter out data assets with fewer than 50 rows and 2 columns, as they are unlikely to be used for training models. Figure 8 shows two log-transformed density scatter plots of a de-duplicated subsample of observations. The first shows the distribution of columns and rows of data assets. We observe that there are clusters of assets with similar structure. The second shows the dimension of the data (rows × columns) and the corresponding bytes, which reveals an (expected) linear relationship, but also a large variability in values.

Fig. 8: Observations of asset dimensions and size

The model we use to sample a random data asset is a multivariate Gaussian Mixture Model. We use the scikit-learn implementation to fit a mixture with 50 components and a full covariance matrix on the three-column data set. The model is then exported into the simulator as described previously. Because the original data contain many extreme and spread out values which cause too many singletons to form in the mixture, we fit instead on the log-transformed data. During simulation time, we transform the data back and reject out-of-bound values.
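The fit, sample, back-transform, and reject steps can be sketched with scikit-learn as follows. The toy observations, the bounds, the small number of components, and the helper names are assumptions for illustration; the setup described above uses 50 components on real trace data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_asset_model(observations: np.ndarray, n_components: int, seed: int = 0) -> GaussianMixture:
    """Fit a full-covariance Gaussian mixture on log-transformed observations."""
    # A RandomState instance (not an int) makes repeated sample() calls differ.
    model = GaussianMixture(
        n_components=n_components,
        covariance_type="full",
        random_state=np.random.RandomState(seed),
    )
    model.fit(np.log(observations))
    return model


def sample_assets(model, n, lower, upper, rng, max_rounds=100):
    """Sample in log space, transform back, and reject out-of-bound values."""
    accepted, count = [], 0
    for _ in range(max_rounds):
        log_samples, _labels = model.sample(n)
        candidates = np.exp(log_samples)
        # sample() groups rows by component, so shuffle before truncating.
        candidates = candidates[rng.permutation(len(candidates))]
        ok = np.all((candidates >= lower) & (candidates <= upper), axis=1)
        accepted.append(candidates[ok])
        count += int(ok.sum())
        if count >= n:
            return np.vstack(accepted)[:n]
    raise RuntimeError("bounds reject too many samples")


# Toy stand-in for the three-column data set: (columns, rows, bytes).
rng = np.random.RandomState(0)
rows = np.exp(rng.normal(9.0, 1.5, size=500))
cols = np.exp(rng.normal(3.0, 0.8, size=500))
size = rows * cols * 8.0 * np.exp(rng.normal(0.0, 0.3, size=500))
observations = np.column_stack([cols, rows, size])

model = fit_asset_model(observations, n_components=3)
lower = np.array([2.0, 50.0, 1.0])        # hypothetical lower bounds
upper = observations.max(axis=0) * 10.0   # hypothetical upper bounds
assets = sample_assets(model, 100, lower, upper, rng)
print(assets[:3])
```

A fitted model of this kind can be serialized (for example with `pickle`) and loaded by the simulator, so that fitting does not have to be repeated at every simulation start.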

V-A2 Simulating Task Execution

As discussed in Section IV-A, the execution duration of a task typically depends on several factors. In particular, tasks like data preprocessing and model training are not straightforward to express analytically, and we instead define statistical models for them. Figure 9 shows examples of different ways to simulate the compute time of data preprocessing and training tasks. Individual plots are explained in more detail below.

(a) Data preprocessing task
(b) Training task
Fig. 9: Observations of compute time for data preprocessing and training tasks based on other known aspects of the task.
Data Preprocessing

For a data preprocessing task, we can correlate the data asset’s size with the computation time. Figure 9(a) illustrates this relation. The red line is an exponential function with three parameters, fitted on the log-transformed data using SciPy’s implementation of non-linear least squares. During simulation time, we use the size of the synthesized data asset to estimate the compute time from the fitted function and add noise from a log-normal distribution to simulate the long tail around the function. By associating the data asset with the task instance, and given the compute resource, we can express the task’s compute time as a function of the asset’s size, and estimate it as the value of the fitted function plus a random sample from our noise function.

Currently we do not model how a preprocessing task transforms its input data asset into an output data asset, so the output asset is simply taken to be the input asset. There are other ways to simulate the compute time based on the available data. For example, one could sample from the conditional distribution of compute time given the data set size from a fitted Gaussian mixture, or train a different nonlinear regression model and add noise to the estimate. We chose the previously described method as we found it produced good results and was straightforward to implement. Internals of a preprocessing task, e.g., the number and types of operations, will also affect the processing time. Although we do not have data on this at present, we plan to include it in a future iteration.
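The curve-fit-plus-noise approach for preprocessing tasks might look like the sketch below. Everything specific here is an assumption: the functional form `a * exp(b * x) + c`, the choice to log-transform only the asset size, the noise parameters, and the synthetic observations. Only the use of SciPy's non-linear least squares (`curve_fit`) and log-normal noise follows the text.

```python
import numpy as np
from scipy.optimize import curve_fit


def exp_fn(x, a, b, c):
    """Three-parameter exponential (assumed form)."""
    return a * np.exp(b * x) + c


rng = np.random.default_rng(1)

# Synthetic observations: log10 of asset size in bytes vs. compute time in seconds.
log_size = rng.uniform(3.0, 9.0, 500)
seconds = exp_fn(log_size, 0.02, 0.8, 1.0) + rng.lognormal(-2.0, 0.5, 500)

# Non-linear least squares fit of the three parameters.
params, _ = curve_fit(exp_fn, log_size, seconds, p0=(0.1, 0.5, 0.0), maxfev=10_000)


def simulate_preprocessing_time(size_bytes, params, rng, noise_sigma=0.5):
    """Fitted estimate plus log-normal noise for the long tail."""
    estimate = exp_fn(np.log10(size_bytes), *params)
    noise = rng.lognormal(mean=-2.0, sigma=noise_sigma)
    return float(estimate + noise)


t = simulate_preprocessing_time(1e6, params, rng)
```

Because log-normal noise is strictly positive, the simulated time always lies above the fitted curve, so in practice the noise parameters would need to be calibrated against the observed residuals.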


Training

For a training task, we know from the assignment we made during pipeline synthesis which framework is used, and we know that frameworks have vastly different execution duration distributions. For example, 50% of TensorFlow jobs run in under 180 seconds, whereas 50% of Spark ML jobs run in under 10 seconds. Figure 9(b) shows two histograms over the compute time of a subsample of these jobs. For better illustration, we only show values below the 99th percentile. We model the compute time of a training task in a similar fashion as that of a preprocessing task. To estimate the compute time for a given framework, we stratify the observed execution durations by framework and then fit a Gaussian mixture model on each stratum. During simulation time, we then simply pick a random sample from the mixture fitted for the task’s framework. This gives us relatively good fidelity when testing, e.g., the effect of specific frameworks trending (which is something we have observed in the production system, specifically that the number of TensorFlow builds is increasing over time). In our implementation we also model PyTorch and Caffe training jobs.
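The stratified approach can be sketched as follows. The durations are synthetic, with medians chosen to match the 180 s and 10 s figures quoted above. Fitting on log-durations and using three mixture components are assumptions made for this illustration, and `sample_training_time` is a hypothetical helper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Synthetic execution durations in seconds, one stratum per framework.
observed = {
    "tensorflow": rng.lognormal(np.log(180.0), 1.0, 3_000),
    "sparkml": rng.lognormal(np.log(10.0), 1.0, 3_000),
}

# Fit one Gaussian mixture per stratum (here on log-durations).
models = {}
for framework, durations in observed.items():
    gmm = GaussianMixture(n_components=3, random_state=np.random.RandomState(0))
    models[framework] = gmm.fit(np.log(durations).reshape(-1, 1))


def sample_training_time(framework):
    """Draw one training duration for the given framework."""
    point, _ = models[framework].sample(1)
    return float(np.exp(point[0, 0]))


tf_times = [sample_training_time("tensorflow") for _ in range(500)]
spark_times = [sample_training_time("sparkml") for _ in range(500)]
```

Keeping one model per framework makes it easy to change the framework mix in an experiment without refitting anything.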

Model Evaluation

For a model evaluation (validation) task we have no data to correlate, only the raw compute time data, on which we again fit a Gaussian mixture to sample from at simulation time. However, we plan to investigate this further in the next iteration of our simulator. It is reasonable to assume that the time it takes to validate a model can be described by the size of the dataset used for validation and the model size (which will generally be a good indicator of the inferencing time).

Model Compression

Model compression can be used to reduce the size and inferencing time of deep neural networks [22]. We know from the way state-of-the-art compression algorithms operate that model compression requires roughly as much time as training. We can therefore re-use the execution duration we have simulated for training and add Gaussian noise to it to simulate the compression time. Compression affects several model metrics, specifically performance, size, and inferencing time. Table I shows these values from experiments we have performed with GoogleNet and ResNet50 networks on the Food101 training set using Caffe.

Prune   Accuracy (%)           Size (MB)              Inference (ms)
        GoogleNet   ResNet50   GoogleNet   ResNet50   GoogleNet   ResNet50
0 %     80.7        81.3       42.5        91.1       128         223
20 %    80.9        80.9       28.7        83.5       117         200
40 %    80.0        80.8       20.9        65.2       100         169
60 %    77.7        79.5       14.6        41.9       84          141
80 %    69.8        69.8       8.5         8.5        71          72
TABLE I: Effect of model compression on model parameters

Our simulator currently does not change model metrics for compression tasks in a systematic way. However, we can see that the relative changes in model metrics could be described by a regression model. Together with a simulation of the run-time view as described in Section IV-A, we could examine how varying compression levels affect build-time pipeline execution duration compared to run-time inferencing.

V-A3 Simulating Pipeline Arrivals

For simulating pipeline arrivals, we model the interarrival time in seconds as a random variable and sequentially draw from a fitted distribution. We collect the timestamps of training job arrivals (which we use as a proxy for pipeline arrivals) and calculate the interarrivals. It is well known that interarrivals typically follow some exponential or related process. Researchers have found that, for example, TCP traffic is well described by lognormal, Pareto, or Weibull distributions [5]. On the collected data we found that the exponentiated Weibull distribution produces a good fit.

However, given that humans interact with the system, arrivals will differ across weekdays and hours of the day. Figure 10 shows a histogram of average arrivals per hour, grouped by hour of day and weekday. (In this paper we report relative numbers for the arrival rates, as we cannot disclose the absolute numbers by company policy at the present time. Note that these numbers correspond to the real-world metrics we have collected, and are useful here for illustration purposes.) We can leverage these arrival patterns to predict periods of low infrastructure load for scheduling of automated pipelines. To provide a realistic arrival profile for our simulation, we first cluster the calculated interarrivals by hour of day and weekday (resulting in 168 clusters). On each cluster we fit a log-normal, exponentiated Weibull, and Pareto distribution, select the best fit based on the sum of squared errors (SSE), and store the parameters together with the hour of day and weekday. During simulation, we map real timestamps to simulation time, and use that to sample from the respective cluster.

Fig. 10: Average arrivals per hour stratified by hour of day and weekday. The figure also shows the average arrivals per hour aggregated over all values. Error bars show one standard deviation.
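The clustered arrival profile described in Section V-A3 can be sketched as follows. This is an illustration under assumptions: the interarrivals are synthetic, only two of the 168 (weekday, hour) clusters are populated, and the SSE is computed between a histogram density and each candidate's probability density function, which is one common way to do it. The names `best_fit`, `profile`, and `next_interarrival` are hypothetical.

```python
import warnings
from datetime import datetime, timedelta

import numpy as np
from scipy import stats

CANDIDATES = (stats.lognorm, stats.exponweib, stats.pareto)


def best_fit(samples, bins=50):
    """Fit all candidates and return the one with the lowest SSE."""
    density, edges = np.histogram(samples, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    best = None
    for dist in CANDIDATES:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")  # fits may emit numerical warnings
            params = dist.fit(samples, floc=0)
            sse = float(np.sum((density - dist.pdf(centers, *params)) ** 2))
        if np.isfinite(sse) and (best is None or sse < best[0]):
            best = (sse, dist, params)
    return best[1], best[2]


rng = np.random.default_rng(3)

# Synthetic interarrivals for two clusters: Monday 03:00 (quiet), Monday 16:00 (peak).
profile = {}
for weekday, hour, mean_gap in [(0, 3, 300.0), (0, 16, 20.0)]:
    gaps = rng.exponential(mean_gap, 1_000)
    profile[(weekday, hour)] = best_fit(gaps)


def next_interarrival(sim_time, start, rng):
    """Map simulation time to a wall-clock cluster and sample from it."""
    now = start + timedelta(seconds=sim_time)
    dist, params = profile[(now.weekday(), now.hour)]
    return float(dist.rvs(*params, random_state=rng))


gap = next_interarrival(0.0, datetime(2020, 1, 6, 16, 0), rng)
```

A full profile would populate all 168 keys, so that every simulated timestamp maps to a fitted distribution.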

V-B Simulator

At the core of the experimentation environment lies the simulator. It implements the conceptual system model and data synthesizers as described in Section IV as a stochastic, standalone, discrete-event simulator using the simulation framework SimPy (https://simpy.readthedocs.io). We also make use of SciPy and scikit-learn for implementing statistical operations such as distribution fitting and sampling. Synthetic traces are persisted into InfluxDB (https://www.influxdata.com/time-series-platform/influxdb/). We also developed a Python toolkit for analyzing the experiment data (which we also use for the evaluation), and an analytics dashboard using Grafana (https://grafana.com/). We briefly describe how the core concepts are implemented using SimPy.


SimPy provides the concept of shared resources (https://simpy.readthedocs.io/en/latest/topical_guides/resources.html) to model process interactions. A shared resource is a congestion point where processes queue up to use it. We leverage this concept to model our infrastructure. Each compute resource (e.g., a training cluster) has an associated job capacity. When a pipeline task is executed, depending on the task executor implementation, one or more resources for a job will be requested. If the capacity is reached, the job queues up and waits until a resource is available. This abstraction is useful because AI ops platforms plug together many different types of infrastructure in a replaceable way, and therefore generally cannot make assumptions about how system resources behave internally. For example, details of how a specific training cluster technology provisions workers internally should not leak out to the AI ops platform. Instead, resources should provide a common interface that allows the platform to reason about a resource’s capacity in a high-level way. SimPy resources provide exactly that. However, our framework also enables customization of resources, such that resource queuing and scheduling mechanisms can be implemented in more detail.

Pipeline Execution

For simulating pipeline executions as described in Section IV-A, tasks are implemented in plain Python code, and each system operation is implemented as a SimPy event. We give an example of how a training task is simulated in this manner. The training task executor will first attempt to request the shared resource that models the training cluster. The queuing time, if any, is used to simulate the task’s wait time. It then simulates the task execution by generating a timeout event with the value sampled as described in Section V-A. Afterwards, a Trained Model Asset is created, and properties associated with the model (size, performance, CLEVER score, etc.) are materialized. For example, to materialize the performance of the model (e.g., the AUC or F-score), we could sample from the distribution of performance values historically observed for the estimator type. While this is not an accurate estimate of the performance an individual model will have, it will give us an idea of the overall distribution of, e.g., pipelines that may not meet certain quality gates. The created Trained Model Asset is then persisted to the data store and the execution trace is recorded via the logging interface.

VI Evaluation

To assess the feasibility of our prototype, we evaluate three key aspects. First, we show how the simulator results can be used to analyze the effect of different parameters on the system. Second, we examine if the data produced by the simulator reflect the original system under simulation. Third, we test how well the simulator scales even when simulating years of pipeline executions.

Fig. 11: Experiment analytics dashboard showing infrastructure and pipeline execution metrics.

VI-A Exploratory Analysis using the Dashboard

The analytics frontend allows exploratory analysis of experiment results. Figure 11 shows the dashboard populated with data from a sample experiment run. It shows the experiment parameters as well as general statistics about individual task executions and wait time. The graphs show the resource utilization of compute resources, individual task arrivals, and overall wait time of pipelines, which allows us to quickly observe the impact of resource utilization on pipeline wait time. The dashboard also shows the network traffic caused by the execution platform and includes TCP overhead.

This example illustrates how we can analyze the impact of, e.g., arrival peaks on the infrastructure. Around 16:00, a typical peak in pipeline arrivals occurs. The usage of the compute cluster spikes slightly because of preprocessing tasks being scheduled to it. However, because the learning cluster is quickly saturated by long-running training jobs, subsequent jobs have to queue, and post-processing task (like model validation) arrivals are delayed to a later time. This scenario illustrates how the simulator can be used to examine load balancing of heterogeneous compute infrastructure given different system conditions.

Although the simulation synthesizes and logs model metrics for each pipeline, we currently do not have a good way of visualizing these data. We are working on a way to visualize aggregated model metrics (such as overall potential improvement, as described in Section III) in a meaningful way. Queries can, however, be executed, for example, via the InfluxDB web UI, and metrics can be aggregated over pipelines, which are identified by unique IDs generated by the system. This way the lineage of a pipeline can be tracked, and its accuracy over time (currently synthesized values with added Gaussian noise) can be observed.

VI-B Simulation Accuracy

We use the statistical analysis component of PipeSim to evaluate the accuracy of our simulation data. Figure 12 shows the results of several pipeline execution experiments to compare how well the empirical and simulated data agree.

(a) Q–Q plots of task execution duration in seconds
(b) Q–Q plots of interarrivals in seconds
(c) Average arrivals per hour with realistic arrival profile
Fig. 12: Statistical evaluation of simulated processes.

The Q–Q plots in Figure 12(a) plot the quantiles of the empirical and simulated task execution durations against each other. Our preprocessing task simulation slightly overestimates execution duration for short-running tasks, but overall performs well considering the relatively simple statistical model for this complex process. Training tasks, which we simulated separately for each framework, exhibit a very good fit, which we attribute to the performance of Gaussian mixtures given the large amount of data we have for each framework. The plot for the evaluation task shows how extreme outliers are sometimes difficult to simulate correctly.

The Q–Q plots of interarrivals in Figure 12(b) show that both the random and realistic (clustered by weekday and hour of day) arrival profiles slightly overestimate pipeline interarrivals. This is acceptable because the experiment environment takes an interarrival factor parameter that allows us to increase or decrease the average arrivals of pipelines, and control for such errors. Figure 12(c) shows a detailed view of the realistic arrival profile, where we simulated four weeks of pipeline executions. The black line shows average arrivals per hour in our simulation, and the red line shows the values from our real system. We can see that our clustered sampling approach generally performs well in capturing different arrival peaks.
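The quantile comparison behind such Q–Q plots can be computed with NumPy alone. The following sketch uses synthetic data in which the simulated sample is scaled by 5% to mimic a slight overestimate. The helper `qq_points` and the use of a fitted slope as a summary are illustrative assumptions, not part of the described toolkit.

```python
import numpy as np


def qq_points(empirical, simulated, n_quantiles=99):
    """Return matching quantiles of two samples for a Q-Q comparison."""
    probs = np.linspace(0.01, 0.99, n_quantiles)
    return np.quantile(empirical, probs), np.quantile(simulated, probs)


rng = np.random.default_rng(4)
empirical = rng.lognormal(3.0, 1.0, 10_000)
simulated = rng.lognormal(3.0, 1.0, 10_000) * 1.05  # slight overestimate

q_emp, q_sim = qq_points(empirical, simulated)

# Points on the diagonal indicate agreement; a slope above 1 suggests that
# the simulation overestimates, a slope below 1 that it underestimates.
slope = np.polyfit(q_emp, q_sim, 1)[0]
```

A single slope hides where the distributions diverge, so plotting `q_emp` against `q_sim` remains the more informative check, especially in the tails.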

VI-C Simulation Performance

To assess the scalability of the simulator, we run several experiments with an increasing number of pipeline executions and record the (wall clock) execution duration and memory consumption of the executing Python process. The simulation runs in a single thread and is executed on an AMD FX-8350 4.0GHz CPU machine with 8GB RAM running Linux Mint 18.2 and Python 3.6. Our experiment results in Figure 13 reveal a straightforward linear relationship between the number of executed pipelines and the execution duration, and likely polynomial memory usage due to the internal storage of execution traces.

Fig. 13: Simulator performance depending on number of pipeline executions

We simulated the system for up to a year (365 days), with an average interarrival of 44 seconds, which corresponds to about 720 000 pipeline executions. This simulation took on average 517 seconds, or 8.6 minutes, meaning that simulating a pipeline execution takes, on average, around 0.7 ms on the evaluation machine. We ran these experiments several times and observed negligible variance. The maximum memory consumption for this run was roughly 850 MB. While we found that the simulator is overall capable of running extensive simulations on a single machine, we quickly ran into memory issues with InfluxDB when storing more than a few hundred thousand pipelines due to the way it manages indexes for group-by queries internally. In fact, we were unable to complete the last experiments where we simulated over 100 000 executions. While InfluxDB provided an easy-to-use database for rapid prototyping, we conclude that it was overall a poor choice for the experimentation environment moving forward, and we will investigate alternatives to maintain better scalability.

VII Conclusion

Efforts of both research and industry to build platforms for democratizing and operationalizing AI have revealed exciting new opportunities for operations research. The development of such platforms and their operational strategies is challenging due to the complex nature of the AI application lifecycle, as well as the growing need for reconciling build-time and run-time aspects of ML models.

Being able to examine and experiment with the behavior of these systems is a critical requirement for AI platform operators as our understanding of the AI lifecycle continues to grow. To that end, we have presented an experimentation and analytics framework for large-scale AI platforms. Based on our current knowledge of production-grade AI systems, we developed a system and process model to simulate, monitor, and analyze the operation of such platforms, enabling continuous improvement with humans in the loop.

We have shown how to build statistical models from empirical data that accurately reflect the effect of AI pipeline executions and model scorings on the infrastructure as well as metrics of the model itself. Simulating a year’s worth of pipeline executions takes only a few minutes on a single machine, enabling a quick feedback cycle to run experiments. Our evaluation demonstrates that the analytics components can be used to examine the impact of different system parameters and support advanced techniques like capacity planning and resource optimization. For our future work we plan to further extend the current simulator and establish a stronger link with the run-time of the real system. We envision a mode of operation where the simulation automatically reconciles its predictions with the real system, and dynamically adjusts the underlying distributions accordingly, resulting in an increased fidelity of the simulation. Furthermore, we are working on large-scale scheduling optimizations for automated retraining pipelines that we plan to develop and evaluate using our simulator.


We thank the anonymous reviewers who have contributed to the improvement of this paper. We also thank Youssef Mroueh, who provided valuable input on the statistical modeling techniques used in our approach.


  • [1] D. Baylor, E. Breck, H. Cheng, N. Fiedel, C. Y. Foo, Z. Haque, S. Haykal, M. Ispir, V. Jain, L. Koc, et al. (2017) TFX: a TensorFlow-based production-scale machine learning platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1387–1395.
  • [2] A. Bifet, G. de Francisci Morales, J. Read, G. Holmes, and B. Pfahringer (2015) Efficient online evaluation of big data stream classifiers. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 59–68.
  • [3] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. De Rose, and R. Buyya (2011) CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Software: Practice and Experience 41 (1), pp. 23–50.
  • [4] J. Demšar and Z. Bosnić (2018) Detecting concept drift in data streams using model explanation. Expert Systems with Applications 92, pp. 546–559.
  • [5] A. Feldmann (2000) Characteristics of TCP connection arrivals. Self-Similar Network Traffic and Performance Evaluation, pp. 367–399.
  • [6] G. S. Fishman (2013) Discrete-event simulation: modeling, programming, and analysis. Springer Science & Business Media.
  • [7] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia (2014) A survey on concept drift adaptation. ACM Computing Surveys (CSUR) 46 (4), pp. 44.
  • [8] J. Hernández-Orallo, P. Flach, and C. Ferri (2012) A unified view of performance metrics: translating threshold choice into expected classification loss. Journal of Machine Learning Research 13 (Oct), pp. 2813–2869.
  • [9] M. Hind, S. Mehta, A. Mojsilovic, R. Nair, K. N. Ramamurthy, A. Olteanu, and K. R. Varshney (2018) Increasing trust in AI services through supplier’s declarations of conformity. arXiv:1808.07261.
  • [10] W. Hummer, V. Muthusamy, T. Rausch, P. Dube, and K. El Maghraoui (2019-06) ModelOps: cloud-based lifecycle management for reliable and trusted AI. In 2019 IEEE International Conference on Cloud Engineering (IC2E’19).
  • [11] T. Li, J. Zhong, J. Liu, W. Wu, and C. Zhang (2018-01) Ease.ml: towards multi-tenant resource sharing for machine learning workloads. Proc. VLDB Endow. 11 (5), pp. 607–620.
  • [12] J. Lin, J. Zhang, Y. Ding, L. Zhang, and Y. Han (2018) All-spark: using simulation tests directly in production environments to detect system bottlenecks in large-scale systems. In Proceedings of the 19th International Middleware Conference, Middleware ’18, New York, NY, USA, pp. 1–12. ISBN 978-1-4503-5702-9.
  • [13] Y. Liu, H. Zhang, C. Li, and R. J. Jiao (2012-02) Workflow simulation for operational decision support using event graph through process mining. Decision Support Systems 52 (3), pp. 685–697.
  • [14] N. Matloff (2008) Introduction to discrete-event simulation and the SimPy language. Davis, CA. Dept of Computer Science. University of California at Davis. Retrieved on August 2 (2009).
  • [15] H. Miao, A. Li, L. S. Davis, and A. Deshpande (2017) Towards unified data and lifecycle management for deep learning. In IEEE 33rd International Conference on Data Engineering (ICDE), pp. 571–582.
  • [16] A. Núñez, J. L. Vázquez-Poletti, A. C. Caminero, G. G. Castañé, J. Carretero, and I. M. Llorente (2012) ICanCloud: a flexible and scalable cloud infrastructure simulator. Journal of Grid Computing 10 (1), pp. 185–209.
  • [17] T. Rausch, W. Hummer, and V. Muthusamy (2020) An experimentation and analytics framework for large-scale AI operations platforms. In 2020 USENIX Conference on Operational Machine Learning (OpML ’20).
  • [18] T. S. Sethi and M. Kantardzic (2018) Handling adversarial concept drift in streaming data. Expert Systems with Applications 97, pp. 18–40.
  • [19] E. R. Sparks, S. Venkataraman, T. Kaftan, M. J. Franklin, and B. Recht (2017) KeystoneML: optimizing pipelines for large-scale advanced analytics. In Data Engineering (ICDE), 2017 IEEE 33rd International Conference on, pp. 535–546.
  • [20] Y. Sun, Z. Wang, H. Liu, C. Du, and J. Yuan (2016) Online ensemble using adaptive windowing for data streams with concept drift. International Journal of Distributed Sensor Networks 12 (5), pp. 4218973.
  • [21] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart (2016) Stealing machine learning models via prediction APIs. In USENIX Security Symposium, pp. 601–618.
  • [22] D. T. Vooturi, S. Goyal, A. R. Choudhury, Y. Sabharwal, and A. Verma (2017) Efficient inferencing of compressed deep neural networks. CoRR abs/1711.00244.
  • [23] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin (2017) Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology 27 (12), pp. 2591–2600.
  • [24] T. Weng, H. Zhang, P. Chen, J. Yi, D. Su, Y. Gao, C. Hsieh, and L. Daniel (2018) Evaluating the robustness of neural networks: an extreme value theory approach. arXiv preprint arXiv:1801.10578.
  • [25] C. Yu, B. Karlas, J. Zhong, C. Zhang, and J. Liu (2018) Multi-device, multi-tenant model selection with GP-EI. CoRR abs/1803.06561.
  • [26] M. F. Zeager, A. Sridhar, N. Fogal, S. Adams, D. E. Brown, and P. A. Beling (2017) Adversarial learning in credit card fraud detection. In Systems and Information Engineering Design Symposium (SIEDS), 2017, pp. 112–116.