Griffon: Reasoning about Job Anomalies with Unlabeled Data in Cloud-based Platforms

by   Liqun Shao, et al.

Microsoft's internal big data analytics platform is comprised of hundreds of thousands of machines, serving over half a million jobs daily, from thousands of users. The majority of these jobs are recurring and are crucial for the company's operation. Although administrators spend significant effort tuning system performance, some jobs inevitably experience slowdowns, i.e., their execution time degrades over previous runs. Currently, the investigation of such slowdowns is a labor-intensive and error-prone process, which costs Microsoft significant human and machine resources, and negatively impacts several lines of business. In this work, we present Griffin, a system we built and deployed in production last year to automatically discover the root cause of job slowdowns. Existing solutions either rely on labeled data (i.e., resolved incidents with labeled reasons for job slowdowns), which is in most cases non-existent or non-trivial to acquire, or on time-series analysis of individual metrics that does not target specific jobs holistically. In contrast, in Griffin we cast the problem as a regression problem that predicts the runtime of a job, and show how the relative contributions of the features used to train our interpretable model can be exploited to rank the potential causes of job slowdowns. Evaluated over historical incidents, we show that Griffin discovers slowdown causes that are consistent with the ones validated by domain-expert engineers, in a fraction of the time required by them.


1 Introduction

Microsoft operates one of the biggest data lakes worldwide for its big data analytics needs [hydra]. It is comprised of several clusters for a total of over 250k machines and receives approximately half a million jobs daily that process exabytes of data on behalf of thousands of users across the organization. The majority of these jobs are recurring and several of them are critical services for the company. Hence, administrators and users put significant effort in tuning the system and the jobs to optimize their performance. Nevertheless, some jobs inevitably experience slowdowns in their execution time (i.e., they take longer to complete than their previous occurrences) due to either system-induced (e.g., upgrades in the execution environment, network issues, hotspots in the cluster) or user-induced reasons (e.g., changes in job scripts, increase in data consumed).

Such job slowdowns can have a catastrophic impact on the company. In fact, runtime predictability is often considered more important than pure job performance in recurring production jobs [morpheus]. First, several jobs are interdependent, that is, the output of a job might be consumed by multiple other jobs [owl]. Thus, the slowdown of the first job can have a cascading effect on all other dependent jobs, impacting vital services across the company. Second, some business-critical jobs are associated with deadlines in the form of service-level objectives (SLOs). Missing those SLOs can result in substantial financial penalties on the order of millions of dollars.

Despite the importance of promptly resolving such incidents, the current approach remains largely manual. Job slowdowns are signaled either through tickets raised by customers or by missed deadlines (for jobs with SLOs). In either case, a slow, labor-intensive process of error triaging and root-cause analysis must be initiated. In particular, on-call engineers manually investigate causes of job slowdowns by analyzing hundreds of logs and system traces through a complex monitoring dashboard. Despite the existence of detailed metrics, it can sometimes take several hours to resolve an incident. This bottleneck costs millions of dollars in engineering time wasted on investigation and in job SLO violations, and results in degraded user experience.

In this work, we present Griffin, the system we built and have deployed in our production big data analytics infrastructure to automatically discover the main factors causing a job’s runtime deviation through the use of machine learning. Griffin greatly improves the situation described above. First, it helps users find user-induced causes of their job slowdowns and prevents them from raising tickets that are “false alarms” to system administrators. Second, in case of actual infrastructure issues, it directs administrators towards the most probable causes for a job slowdown and allows early elimination of factors unrelated to the slowdown. Third, by observing slowdowns in jobs submitted for testing purposes, administrators can resolve system issues before they affect user jobs.

Existing related works have used detection methods such as classification and clustering to perform analysis of anomalies in cloud computing [agrawal2015survey, modi2013survey]. However, to analyze anomalies, these methods rely on labeled data, e.g., data from existing incidents that associate jobs with their slowdown causes. Such labeled training data in production cloud systems are extremely hard to obtain and can also be erroneous. A few approaches do consider unlabeled data, but rely either on time-series analysis or restrict their focus to machine or VM behavior [cherkasova2009automated, cohen2004correlating, dean2012ubl, gu2009online, tan2010adaptive, tan2012prepare, zhang2013intelligent]. In contrast, we focus on job instances that span several hundreds of machines but only during the lifetime of the job. As a result, existing techniques are frequently not applicable to identify root causes of job slowdowns.

Unlike existing works, Griffin employs an interpretable regression model to predict job runtime and then suggest reasons for runtime deviations. Griffin exploits two characteristics of the available data. First, in our clusters we collect valuable telemetry at various levels of abstraction (at the job, machine, and cluster level). Second, the majority of our jobs are recurrent, i.e., similar jobs run regularly (for example, every day or several times a day), which allows Griffin to leverage this historical data. Then, based on the relative contribution of each metric/feature to the runtime of a job that experienced a slowdown, we emit a list of possible causes for the slowdown, ranked by their importance.

Our contributions are the following:

  1. We present an end-to-end ranking system to identify the root causes of job slowdowns without human-labeled data.

  2. We show how an interpretable regression model can be used to reason about job slowdowns.

  3. We experimentally compare various models in terms of accuracy, scalability in the model size and number of jobs, and generalizability to jobs not seen before by the system.

  4. Griffin is deployed in our clusters and is used by our engineers. Early indications show that slowdown causes generated by Griffin are closely correlated to causes validated by domain experts. At the same time, Griffin reduces investigation time by orders of magnitude compared to the existing manual process.

The rest of the paper is organized as follows. Section 2 provides details on our production environment and the relevance of the problem we focus on. Section 3 gives an overview of Griffin. Section 4 describes our anomaly reasoning algorithm. Section 5 discusses feature engineering and data collection, whereas Section 6 provides details on Griffin’s deployment. Section 7 presents the results of our experimental evaluation. Section 8 discusses related work, and Section 9 provides our concluding remarks.

2 Background on our Environment

In this section, we provide background on the characteristics of our analytics clusters to give a sense of the scale of the problem we target (Section 2.1). Then, we describe the current state of affairs in finding the reasons for a job’s slowdown (Section 2.2).

2.1 Cluster Characteristics

At Microsoft we operate a massive data infrastructure, powering our internal analytics processing. This infrastructure consists of several clusters, each comprised of tens of thousands of machines—Table 1 highlights some details of our clusters’ scale. To make the situation even more challenging, our cluster environments are also heterogeneous, including several generations of machines.

Figure 1: Running tasks in one of Microsoft’s production analytics clusters, comprised of tens of thousands of machines.

Dimension        Description                     Size
Daily Data I/O   Total bytes processed daily     1 EB
Fleet Size       Number of servers in the fleet  250k
Cluster Size     Number of servers per cluster   50k

Table 1: Microsoft cluster environments.

Figure 2: Two occurrences of the same job, broken down by stage. The bottom one takes 90% longer than the top one.

Tens of thousands of users submit hundreds of thousands of jobs to these clusters daily. Each job is a directed-acyclic graph (DAG) of operators (which we term stages), and each stage consists of several tasks [scope]. Each task gets executed in a cluster’s machine (and each machine runs several tasks in parallel). Figure 1 depicts the number of running tasks in one of our clusters over the course of a week. At each moment in time there are between 200k and 300k tasks running.

Given this extreme scale and complexity, job slowdowns are quite common. Manually investigating such slowdowns, as we explain in the following section, is a painful and time-consuming effort.

2.2 Manual Job Slow-down Investigation

We now describe how Microsoft engineers used to approach job slowdowns before Griffin was deployed in our clusters. Figure 2 shows two occurrences of the same job. The top one corresponds to its regular execution, taking 44 mins to complete. The bottom one experiences a slowdown, with a completion time of 88 mins. The figure visualizes how long the various stages of the job take to execute (although several stages might run in parallel, this tool shows the ones in the critical path, as those determine the job’s execution time).

To investigate this slowdown, an engineer will typically start by looking at the visualization tool of Figure 2, trying to detect the stages that seem abnormal. Note, however, that in this example, the “regular” top occurrence is the one that seems to have longer stages. Therefore, this tool is of limited use. Next, the engineer will have to manually combine several other tools and system files to get more information about the job and the system during the time this job was executed. Given the scale of the system and the amount of metrics collected, this process can take a considerable amount of time to complete. Multiply this by the number of slowdowns and one can easily see the significant opportunity in saving engineering time and improving user experience by speeding up this process. Note also that only a few engineers have the knowledge to perform this manual analysis.

3 System Overview

Figure 3: System architecture.

Griffin’s goal is to find the causes for job runtime degradations in our big data analytics clusters. A central requirement is to not rely on labeled data, i.e., there should be no need for existing slowdown instances associated with their causes. Each job in our clusters is associated with a set of telemetry data that we already collect for monitoring and debugging purposes (e.g., number of tasks, size of input data, load of machines the job was executed on—see Section 5 for details), some of which contributed to the job’s slowdown. Instead of finding a subset of the slowdown causes (as a system that relies on labeled data would do), Griffin ranks the causes (i.e., the features) in the order they affected the deviation of the job’s runtime from its expected runtime, and then suggests the top causes to the users. A formal description of the problem and our approach for solving it is presented in Section 4.

Griffin’s architecture is depicted in Figure 3. It consists of two pipelines: the offline training and the online prediction, which we detail below.

Training The offline training process uses various metrics that we collect at the job, machine, and cluster level to generate a model that will be able to predict the runtime of a job given these features. Our anomaly reasoning algorithm will use this model to rank the features that contributed to the job’s slowdown. The training process involves the following steps: (1) data preparation, i.e., collect and clean the data from different sources in the cluster at regular intervals; (2) feature engineering, i.e., extract raw features from the collected data, create new ones, and choose the ones that we will be using to train the model; and (3) model generation, i.e., the creation of the model, including hyper-parameter tuning, training, and model evaluation. The data preparation is detailed in Section 5, whereas the model details are in Section 4.

Figure 4: Griffin’s output for the slowdown of the job depicted in Figure 2.

The generated model is stored in the Tracking Server, which tracks model runs and stores performance metrics and hyper-parameter values (see Section 6).

Prediction The online prediction pipeline provides an API that takes as input the ID of a job that was executed and experienced a slowdown. This API is exposed to the users through a web application. Then the Online Feature Building component gathers the metrics associated with that job and provides the data to the Model Server where the prediction model is deployed. The output is a report with ranked reasons for job slowdown. In addition, the system provides a confidence level in the results (see Section 4.4).

High confidence means that users can rely on the output of the system. On the other hand, low confidence means that the metrics used are not sufficient to explain the job slowdown. The output of the system is useful even in the latter case, because users can rule out these metrics and focus their investigation in other areas.

Figure 4 shows Griffin’s output for the job slowdown of Figure 2. In this case, Griffin suggests with high confidence that the increase in data written by the job is the main reason for its slowdown. As this is a user-induced reason and not a problem with the system, the corresponding ticket can be closed without further investigation.

4 Anomaly Reasoning Algorithm

In this section, we describe our algorithm for reasoning about anomalies. First, we formally define the problem (Section 4.1) and discuss model interpretability (Section 4.2). Then we describe the interpretable tree-based model that Griffin uses for determining the reasons for a job’s slowdown (Section 4.3) and its associated confidence level (Section 4.4).

4.1 Problem Statement

We consider a set of jobs that have already been executed. Hence, we know the runtime of each job. Through our collected metrics, we also know the values of the features that we are interested in (see Section 5.2 for feature selection).

The majority of the jobs submitted in our clusters are analytics jobs that are recurring, i.e., they are submitted at regular intervals (typically hourly, daily, or weekly) [hydra]. (Analytics jobs are executed using Scope, an internal SQL-like distributed query engine that enables processing of petabytes of data per job [scope].) We use the notion of job template to refer to each of these recurring jobs. Jobs belonging to the same template have very similar scripts with minor differences, e.g., to access the latest data.

Figure 5: Example of baseline selection for a job template.

We also define the baseline of a job to be its “expected” runtime, given the runtime of the other jobs that belong to the same template. In practice, we use the mean runtime of the jobs whose runtime falls between the 45th and 55th percentile for that template. A benefit of using a percentile measure is that we avoid outliers. Therefore, “slow jobs” in our training data will not affect the baseline set. Figure 5 shows the runtime distribution for one job template. The data we use for baseline selection falls between the two orange lines.
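The percentile-based baseline can be illustrated with a minimal sketch. It is not Griffin's production code: the function names, the linear-interpolation percentile, the fallback for an empty band, and the sample runtimes are all assumptions made for illustration.

```python
from statistics import mean


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of an already sorted list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def template_baseline(runtimes, low_q=45, high_q=55):
    """Mean runtime of the jobs whose runtime lies between the two percentiles."""
    ordered = sorted(runtimes)
    lo = percentile(ordered, low_q)
    hi = percentile(ordered, high_q)
    band = [r for r in ordered if lo <= r <= hi]
    # Assumption: with very few jobs the band can be empty, so fall back
    # to the midpoint of the two percentile values.
    return mean(band) if band else (lo + hi) / 2.0


# Hypothetical runtimes (minutes) of one job template; 88 and 120 are slow runs.
runtimes = [40, 41, 42, 43, 44, 44, 45, 46, 47, 88, 120]
baseline = template_baseline(runtimes)
```

Because only the middle band of the distribution is averaged, the two slow runs in the sample do not move the baseline, which is the property the text relies on.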

For jobs that belong to templates with no previous occurrences, we use the baseline of jobs with similar characteristics (in data size and performed operations). Similarly, we define the baseline of various features of a job to be their expected value, given the jobs of the corresponding template.

Let $x \in \mathbb{R}^d$ be the $d$-dimensional features, and $y$ be the job runtime. Let $y^b$ and $x^b$ be the baseline of the runtime and the features, respectively. We define the problem as follows: for each job $i$, lacking human-labeled reasons, with features $x_i$ and runtime $y_i$, predict the rank of the different features based on their influence on the deviation of $y_i$ from $y^b_i$.

4.2 Interpretable Model

Consider a machine learning model $f$ that is trained to predict the runtime $y$ of a job using a set of features $x$. That alone would be a standard regression problem. However, in our setting we want to use such a runtime prediction to find the features that contribute the most to a job’s slowdown, that is, to the runtime’s deviation from the job’s baseline $y^b$. To this end, we need an interpretable regression model for the job’s runtime.

We define a regression model to be interpretable if the output of the model can be expressed as the sum of contributions of each of the model’s features:

$$\hat{y} = f(x) = c + \sum_{k=1}^{d} \phi_k(x) \tag{1}$$

where $c$ is a constant and $\phi_k(x)$ is the contribution of feature $k$ to the prediction. Similarly, for a baseline job, let $\hat{y}^b$ be the predicted runtime based on the same model using the baseline features $x^b$. We can decompose the model prediction as:

$$\hat{y}^b = f(x^b) = c + \sum_{k=1}^{d} \phi_k(x^b)$$

where $\phi_k(x^b)$ is the contribution of feature $k$ to the prediction.

In our setting, we can quantify the delta feature contribution of each feature $k$ to the deviation of $\hat{y}$ from $\hat{y}^b$:

$$\Delta\phi_k = \phi_k(x) - \phi_k(x^b), \qquad \hat{y} - \hat{y}^b = \sum_{k=1}^{d} \Delta\phi_k$$

In our case, for the baseline jobs of all templates, the model prediction is very accurate (with a Mean Absolute Ratio Error of 2.2%). Thus, the sum of the delta feature contributions approximately equals the deviation from the baseline job runtime, $\hat{y} - y^b$. If the predicted baseline runtime is not accurate, i.e., the difference between $\hat{y}^b$ and $y^b$ is large, we can raise a flag about our confidence in the model result, as will be discussed in Section 4.4.

Being able to quantify the contributions of each feature to a job’s slowdown allows us to rank the features in order of importance, which is the goal of Griffin.
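The ranking step can be sketched as follows. To keep the example short it uses a linear model as a stand-in for the interpretable model, since its per-feature contributions are simply weight times feature value. The feature names, weights, and values are hypothetical and are not Griffin's actual features.

```python
def linear_contributions(weights, features):
    """Per-feature contributions w_k * x_k of a linear model (the intercept is the constant term)."""
    return {name: weights[name] * features[name] for name in weights}


def rank_slowdown_causes(job_contrib, baseline_contrib):
    """Rank features by delta contribution (job minus baseline), largest first."""
    deltas = {k: job_contrib[k] - baseline_contrib[k] for k in job_contrib}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical weights (minutes of runtime per unit of each feature).
weights = {"data_written_tb": 4.0, "num_tasks_k": 0.5, "queue_time_min": 1.0}
# Hypothetical baseline feature values of the template and values of the slow job.
baseline_x = {"data_written_tb": 5.0, "num_tasks_k": 20.0, "queue_time_min": 2.0}
slow_job_x = {"data_written_tb": 14.0, "num_tasks_k": 22.0, "queue_time_min": 5.0}

ranking = rank_slowdown_causes(
    linear_contributions(weights, slow_job_x),
    linear_contributions(weights, baseline_x),
)
# The deltas sum to the difference between the two predicted runtimes.
predicted_deviation = sum(delta for _, delta in ranking)
```

In this made-up case the increase in data written dominates the ranking, and the deltas add up to the full predicted deviation because the constant term cancels between the job and its baseline.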

Model choice

We considered various model categories to predict the runtime of a job, namely Linear Regression (LR), Random Forest (RF), Gradient Boosted Trees (GBT), and Deep Neural Network (DNN). Our main requirements were that the model be interpretable and that it offers good accuracy.

A linear model can be expressed as $\hat{y} = w_0 + \sum_{k=1}^{d} w_k x_k$, where the constant is $w_0$ and the contribution of feature $k$ is $w_k x_k$. It is trivial to show that it satisfies the interpretability criterion of Eq. 1. However, as we show in Section 7.2, the accuracy is worse than that of the other models. GBT and DNN have acceptable accuracy but their interpretability is hard to establish. Lastly, the RF model exhibited the best accuracy in our experiments and, therefore, is our model of choice in Griffin. In the next section, we describe an appropriate tree interpreter that reformulates a tree-based model to a linear form, so that we can use it to rank feature contributions to job slowdowns.

Note that when training our models, we considered both a global model and per-template models. In the former case, we train a single unified model to predict runtime using jobs of all templates together in the training set. In the latter, we train one model per job template utilizing training data drawn exclusively from jobs that belong to the particular template. In Section 7, we compare the two approaches in terms of accuracy, scalability, and generalizability.

4.3 Interpretable Random Forest

In a Random Forest (RF) model, for each tree, in order to make a prediction, we traverse a path from the root of the tree to a leaf. This path consists of a series of decisions based on the model’s features. Assuming there are $M$ nodes on the path, each node separates the feature space into two, given a feature $x_k$ and a threshold $\theta$: one child node corresponds to $x_k \le \theta$, the other to $x_k > \theta$. In other words, starting from the root node where all the samples reside, a partition based on feature $x_k$ and threshold $\theta$ separates the data samples into the two children, which correspond to smaller feature spaces.

Consider a tree $t$ of the model and a node $m$ that is partitioned from its sibling based on feature $k$. Let $\mu_{t,m}$ be the mean target value for all samples that reside on node $m$. Then the contribution of feature $k$ to the final prediction due to this partitioning is calculated as:

$$\phi_{t,m,k} = \mathbb{1}_{t,m,k} \, (\mu_{t,m} - \mu_{t,m-1})$$

for $m = 2, \dots, M$, where node $m-1$ is $m$’s parent (node 1 being the root). $\mathbb{1}_{t,m,k}$ equals 1 if the partitioning at node $m-1$ involves feature $k$ for tree $t$, and 0 otherwise. The number of samples that reside on each node becomes smaller and smaller as we traverse the path, as the feature space gets smaller. The contribution of feature $k$ to the final prediction can be calculated as the sum of all $\phi_{t,m,k}$:

$$\phi_{t,k} = \sum_{m=2}^{M} \phi_{t,m,k}$$

The prediction of the target value from this tree is $\hat{y}_t = \mu_{t,M}$ and can be expressed using the sum of all features’ contributions along the path:

$$\hat{y}_t = c_t + \sum_{k=1}^{d} \phi_{t,k}$$

where $c_t = \mu_{t,1}$ is the full sample mean. TreeInterpreter [TreeInterpreter] combines the results of all trees in our Random Forest by aggregating the contributions from each tree. Thus, each prediction is decomposed into a sum of contributions from the features, as follows:

$$\hat{y} = \frac{1}{T} \sum_{t=1}^{T} c_t + \sum_{k=1}^{d} \left( \frac{1}{T} \sum_{t=1}^{T} \phi_{t,k} \right)$$

where $T$ is the number of trees, $c_t$ is the full sample mean for each tree, and $d$ is the number of features involved.

Using $c = \frac{1}{T} \sum_{t=1}^{T} c_t$ for the average runtime across the whole training set and $\phi_k(x) = \frac{1}{T} \sum_{t=1}^{T} \phi_{t,k}$ for the contribution of feature $k$ to the predicted runtime, we arrive at the form of Eq. 1, which shows that our Random Forest model meets the interpretability criterion. Therefore, it can be used to detect reasons for job slowdowns in Griffin.
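The path-based decomposition can be sketched on hand-built trees. This is an illustration of the idea only: it does not use the TreeInterpreter library, and the dictionary-based tree representation, node means, thresholds, and feature indices are assumptions chosen so that the arithmetic is easy to follow.

```python
def tree_contributions(node, x, n_features):
    """Walk root to leaf; credit each change in node mean to the feature split on."""
    bias = node["mean"]                  # full-sample mean at the root
    contrib = [0.0] * n_features
    while "feature" in node:             # internal nodes carry a split, leaves do not
        k = node["feature"]
        child = node["left"] if x[k] <= node["threshold"] else node["right"]
        contrib[k] += child["mean"] - node["mean"]
        node = child
    return bias, contrib


def forest_contributions(trees, x, n_features):
    """Average per-tree biases and contributions, as a forest averages tree predictions."""
    results = [tree_contributions(t, x, n_features) for t in trees]
    bias = sum(b for b, _ in results) / len(trees)
    contrib = [sum(c[k] for _, c in results) / len(trees) for k in range(n_features)]
    return bias, contrib


# Two hypothetical trees; feature 0 = input size, feature 1 = queue time.
tree_a = {"mean": 50.0, "feature": 0, "threshold": 10.0,
          "left": {"mean": 40.0},
          "right": {"mean": 70.0, "feature": 1, "threshold": 5.0,
                    "left": {"mean": 65.0}, "right": {"mean": 90.0}}}
tree_b = {"mean": 52.0, "feature": 1, "threshold": 4.0,
          "left": {"mean": 46.0}, "right": {"mean": 64.0}}

bias, contrib = forest_contributions([tree_a, tree_b], x=[12.0, 6.0], n_features=2)
prediction = bias + sum(contrib)         # equals the average of the two leaf means
```

The final line shows the property the section depends on: the constant plus the per-feature contributions reproduces the forest's prediction exactly.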

4.4 Confidence Level

The confidence level shows how reliable the prediction made by our model is for the contribution of each feature to a job’s slowdown. We consider two factors that affect our model confidence: (1) the relative error in predicting the runtime of the job (by comparing the model prediction with the actual runtime of the job); and (2) the confidence intervals estimated by the random forest.

The relative error is defined as follows:

$$e = \frac{|\hat{y} - y|}{y}$$

We use two thresholds, $\epsilon_1$ and $\epsilon_2$, for the relative error, as explained below.

The confidence interval of the random forest method is estimated based on the prediction of each decision tree, $\hat{y}_t$. We take the $p$ and $100 - p$ percentiles of the distribution of $\hat{y}_t$. If the final prediction $\hat{y}$ is within this range, we consider the prediction to have low variance, since the predictions from all trees are consistent.

We define three confidence levels as follows:

High: The prediction $\hat{y}$ is within the range of the $p$ and $100 - p$ percentiles of $\hat{y}_t$, and the relative error is lower than threshold $\epsilon_1$;

Medium: The relative error is between $\epsilon_1$ and $\epsilon_2$;

Low: Other scenarios.

Parameters $p$, $\epsilon_1$ and $\epsilon_2$ are tuned as hyperparameters using validation data. High confidence means our model can reliably predict the slowdown reasons. Low confidence means the reasons are likely to fall outside the metrics we used. In the API, the user is presented with the level of confidence. Low confidence indicates that more investigation will be needed. However, even in this case, as the model has examined many metrics, the designated responsible individual (DRI) can focus their investigation on other areas.
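The three confidence levels can be sketched as a small decision function. The symmetric percentile band, the default threshold values, and the relative-error formula (absolute prediction error divided by the actual runtime) are assumptions made for this illustration; the tuned production values are not given in the text.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of an already sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def confidence_level(tree_predictions, actual_runtime, p=10, eps1=0.1, eps2=0.3):
    """Classify a forest prediction as 'high', 'medium', or 'low' confidence."""
    if not tree_predictions or actual_runtime <= 0:
        raise ValueError("need per-tree predictions and a positive runtime")
    # The forest prediction is the mean of the per-tree predictions.
    prediction = sum(tree_predictions) / len(tree_predictions)
    rel_err = abs(prediction - actual_runtime) / actual_runtime
    ordered = sorted(tree_predictions)
    low, high = percentile(ordered, p), percentile(ordered, 100 - p)
    if low <= prediction <= high and rel_err < eps1:
        return "high"
    if eps1 <= rel_err <= eps2:
        return "medium"
    return "low"


# Hypothetical per-tree runtime predictions (minutes) for one job.
trees = [80.0, 82.0, 84.0, 86.0, 88.0]
levels = [confidence_level(trees, actual) for actual in (85.0, 105.0, 200.0)]
```

With these made-up numbers, the same set of tree predictions yields a different level depending only on how far the actual runtime is from the forest's prediction.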

5 Data Preparation and Feature Engineering

5.1 Data Preparation

In our clusters, we keep hundreds of metrics for monitoring, reporting, and troubleshooting purposes, which result in petabytes of logs and metrics per day. Moreover, the features we are interested in are scattered both physically (in different files across our hundreds of thousands of machines) and logically (we need to process and combine different files to generate features). To perform the required data preparation and extract features out of the data, we use Scope [scope], which provides a SQL-like language and can support our scale.

Feature extraction occurs both during the offline training phase and the online prediction (see Figure 3). Given that data freshness is not an issue for training (we do not need data of the current day), we use data that becomes available daily in our clusters and includes years-worth of historical data. For prediction, we need to collect features only for a single job (or a group of jobs), but latest data is required, as a user might want to debug their job that just finished. Thus, we use different data sources that allow us to access data within minutes from when they are produced (but that do not allow access to historical data, so cannot be used for training).

5.2 Feature Engineering and Selection

In collaboration with domain experts, we selected a subset of the features we collect to train our models in Griffin, based on what could potentially impact the runtime of a job. As already discussed, a job can experience a slowdown compared to its previous occurrences, due to either user-induced or system-induced reasons. User-induced reasons can be captured by metrics collected at the job-level, whereas metrics related to system-induced reasons can be split to either machine-level or cluster-level, as detailed below.


Job-level features These are metrics collected for each job. Griffin currently uses approximately 15 such features, including data read within and across racks, data written, data skewness metrics, job priority, execution DAG features (e.g., number of stages and tasks), and user information.

Machine-level features These are metrics collected at the machines used during the execution of a job, such as CPU load, allocation delays, and I/O reads/writes. Due to the challenges of collecting such features and correlating them with each job, in production Griffin currently uses only a few of them, but we are working on adding more.

Cluster-level features These relate to the cluster environment when a job was executed. Examples of the ones we use are job queuing times, number of failed and revoked vertices, and execution environment version.

Challenges One of the biggest challenges with feature engineering is the correlation between features, which often results in high variance in the model prediction. While certain highly correlated features were removed, feature data was preserved to the greatest extent possible, because correlated features may indicate different problems with a job. For instance, input size and input size per task are highly correlated. However, the former might indicate that the reason for the slowdown was more data, whereas the latter might indicate data skew. Fortunately, Griffin’s tree-based models are inherently robust to correlated features.

6 Tracking and Deployment

Model Tracking Machine learning is an iterative process, and reproducibility and versioning are crucial to productionalize machine learning models. To track model history in Griffin, we use MLflow [MLflow] in our Tracking Server (see Figure 3), which is deployed as an Azure Linux VM. For each model, we track logging parameters, code versions, metrics, and model artifacts.

Model Serving In order to make available to end users a model that we trained and stored in the Tracking Server, we use one of MLflow’s “flavors” (an MLflow flavor is a convention that deployment tools use to understand the model) to build an Azure Machine Learning image out of that model. Then we deploy this image as an Azure Container Instance, using the Azure ML Service [AML], in the Model Server (see Figure 3).

The models that are deployed in the Model Server are made available to end users through a web application, which exposes a scoring API. The web application runs on a Flask web server [Flask].

Model Monitoring It is critical to monitor the model performance and retrain models if they go stale. To ensure this, we have provisioned our pipeline to allow single click retraining. Decoupling data sources from the training pipeline helps to easily refresh our data, retrain, and deploy the model with minimal impact to end users.

7 Evaluation

We now present our experimental evaluation for Griffin. In Section 7.1 we discuss Griffin’s effectiveness in finding the actual causes of job slowdowns. In Section 7.2 we compare different machine learning models for training, whereas Section 7.3 studies the scalability of different models as the number of job templates increases. In Section 7.4 we compare the model performance when training our model with an increasing number of jobs per template. In Section 7.5 we show early evidence that Griffin can be applicable in different domains.

We carried out our experiments on Windows using Python 3.7. We used a machine with eight 2.90 GHz processors and 64 GB RAM for the experiments in Sections 7.1 and 7.2, and a high-memory Virtual Machine (VM) for the scalability experiments of Section 7.3.

The job and feature data is obtained from the Microsoft production clusters, as described in Section 5.1. We use historical job data with different job templates, as described in Section 5.2, over a period of three months.

Figure 6: Runtime distribution for the jobs of different templates.

7.1 Validation Results

Working with domain experts at Microsoft, we picked a set of job templates that are considered important for our production clusters (SLO critical), and trained Griffin based on those. Note that the runtime distribution of the jobs of different templates varies significantly, which poses extra challenges for the runtime prediction based on machine learning models. Figure 6 shows the runtime distributions for five of the templates that we used.

We then randomly picked eight jobs that experienced slowdowns (five from these templates and three from different templates), and compared the causes for slowdowns that were identified by the experts with those suggested by Griffin. For these jobs, Table 2 shows the reasons identified by the experts and Griffin (with their ranking), Griffin’s confidence level, and whether the job belonged to one of the templates used for training the model (in-t). We use $r_i$ to denote the reason that Griffin predicted for a job’s slowdown with rank $i$. For readability, we show only the reasons that were common between Griffin and experts and use variables for the rest.

Job  Griffin’s Predicted List of Ranked Reasons          Engineer Validated    Confidence Level  in-t
1    [Input size, $r_2$, $r_3$]                           Input size            High              Yes
2    [$r_1$, $r_2$, $r_3$, Revocation, $r_5$]             Revocation            Medium            Yes
3    [$r_1$, $r_2$, $r_3$, $r_4$, $r_5$, $r_6$]           Framework issue       Low               Yes
4    [$r_1$, $r_2$, $r_3$, $r_4$, High compute hours]     High compute hours    Medium            Yes
5    [Time skew, $r_2$, $r_3$, $r_4$]                     Time skew             High              Yes
6    [High compute hours, $r_2$, $r_3$]                   High compute hours    High              No
7    [$r_1$, Usable machine count, $r_3$, $r_4$]          Usable machine count  High              No
8    [High compute hours, $r_2$]                          High compute hours    High              No

Table 2: Result subset validated by engineers.

The results in Table 2 show that the reasons generated by Griffin are highly correlated with the reasons manually validated by our domain expert engineers. For job 1, the top predicted reason is the same as the manually validated reason with high confidence. For jobs 2 and 4, our system predicted the validated reason in the top 5 slowdown reasons, which is consistent with the confidence level medium. For job 3, our model does not identify the same reasons as the experts, also consistent with low confidence. Adding more features as planned (e.g., additional machine-level features) will allow us to improve the model’s prediction capability and minimize such low-confidence cases. Jobs 6, 7 and 8 show the robustness of our model: although Griffin was not trained using these job templates, it can still find the correct reasons with high confidence by using knowledge gathered by other similar job templates. Importantly, we observed no misleading predictions, i.e., there were no cases where Griffin predicted wrong slowdown reasons with high confidence. This means that even predictions with low confidence can be useful in ruling out the currently used features from the investigation.

7.2 Picking the Right Model

As discussed in Section 4, we experimented with various categories of models for the job runtime prediction, including Linear Regression (LR), Random Forest (RF), Gradient Boosted Trees (GBT), and Deep Neural Networks (DNN) with two hidden layers without hyperparameter tuning. For each of these categories, we consider both a global model and per-template models.

We use Mean Absolute Ratio Error (MARE) as a metric to evaluate each model’s accuracy. As the runtime distribution varies significantly across different job templates, we normalize the estimation error by the baseline runtime of each job template (see Section 4.1), calculating the average runtime per job template in the training data:


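$$\mathrm{MARE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{y}_i - y_i|}{\bar{y}_i}$$

where $N$ is the number of jobs in the testing data, and $\hat{y}_i$, $y_i$, and $\bar{y}_i$ are the predicted, actual, and baseline runtime of job $i$ from the testing data, respectively.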
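As a concrete illustration, the metric can be computed as the absolute prediction error of each test job divided by the baseline runtime of its template, averaged over all test jobs. The sketch below assumes the baseline is the mean training runtime per template, as stated above; the record layouts and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def template_baselines(train: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average training runtime per job template (the baseline runtime)."""
    runtimes: Dict[str, List[float]] = defaultdict(list)
    for template, runtime in train:
        runtimes[template].append(runtime)
    return {t: sum(v) / len(v) for t, v in runtimes.items()}


def mare(test: List[Tuple[str, float, float]],
         baselines: Dict[str, float]) -> float:
    """Mean absolute error normalized by each job's template baseline.

    Each test row is (template, predicted runtime, actual runtime).
    """
    errors = [abs(pred - actual) / baselines[t] for t, pred, actual in test]
    return sum(errors) / len(errors)


# Toy data: two templates with very different runtime scales.
train = [("A", 100.0), ("A", 120.0), ("B", 10.0), ("B", 10.0)]
test = [("A", 121.0, 110.0), ("B", 9.0, 10.0)]

baselines = template_baselines(train)
score = mare(test, baselines)
print(baselines, score)
```

Both test jobs are off by 10% of their template baseline, so they contribute equally to the score even though their absolute errors differ by an order of magnitude.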

Figure 7: Scalability of global vs. per-template models. (a) Model size (MS) and training time (TT). (b) MARE and single inference time (SIT).

Table 3 shows the MARE scores for the four model categories. Random forest is the most accurate, both as a global model and as per-template models. Given its high accuracy and interpretability (as discussed in Section 4.2), RF is the approach we use in our production Griffin deployments. Moreover, we observe that for RF the global model closely tracks the performance of the per-template models (0.121 vs. 0.116), while allowing us to reason about jobs that we have not sufficiently encountered previously. In the next section, we demonstrate that the global model scales much better than the per-template models with an increasing number of templates. Hence, we use the global model in production.

| Model | LR | RF | GBT | DNN |
|---|---|---|---|---|
| Per-Template Model | 0.186 | 0.116 | 0.124 | 0.146 |
| Global Model | 0.235 | 0.121 | 0.277 | 0.353 |

Table 3: MARE scores on runtime prediction by LR, RF, GBT, DNN with global and per-template models. Lower MARE is better.

Figure 8: Model performance with different training sample sizes

7.3 Scalability of Global vs. Per-template Models

We assess the scalability of the global and per-template models by training them with an increasing number of job templates, as shown in Figure 7. We added job templates incrementally to the training set and repeated each experiment 10 times. We use 5-fold cross-validation for hyperparameter tuning with random grid search. With more job templates, the training time (TT) and the model size (MS) increased for both the global and the per-template models. We observe that the MARE of the global model is better than that of the per-template models when training with a larger number of job templates. This can be attributed for the most part to the larger sample size available to the global model, as a result of the unification of the training set. In contrast, when training the per-template models with 240 job templates, many templates have only a small number of samples and the MARE is high. We also report the prediction time on a single job, i.e., the single inference time (SIT). The SIT did not increase with the model size, which is important for delivering a real-time experience to Griffin’s users.
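The three scalability measurements (TT, MS, SIT) can be collected with a small harness like the one below. This is a sketch under stated assumptions rather than the measurement code used for Figure 7: model size is approximated by the length of the pickled model, and the mean-runtime predictor is a stand-in for whichever regressor is being profiled.

```python
import pickle
import time
from typing import Any, Callable, Dict, Sequence


def profile_model(
    train_fn: Callable[[Sequence[Sequence[float]], Sequence[float]], Any],
    predict_fn: Callable[[Any, Sequence[float]], float],
    X: Sequence[Sequence[float]],
    y: Sequence[float],
) -> Dict[str, float]:
    """Measure training time, serialized model size, and one inference."""
    start = time.perf_counter()
    model = train_fn(X, y)
    training_time = time.perf_counter() - start

    # Assumption: serialized size is used as a proxy for model size.
    model_size = len(pickle.dumps(model))

    start = time.perf_counter()
    predict_fn(model, X[0])  # a single job, mirroring the SIT definition
    single_inference_time = time.perf_counter() - start

    return {
        "TT_s": training_time,
        "MS_bytes": float(model_size),
        "SIT_s": single_inference_time,
    }


# Stand-in model (hypothetical): always predicts the mean training runtime.
def train_mean(X, y):
    return {"mean": sum(y) / len(y)}


def predict_mean(model, x):
    return model["mean"]


stats = profile_model(train_mean, predict_mean,
                      [[1.0], [2.0], [3.0]], [10.0, 20.0, 30.0])
print(stats)
```

Repeating the call while adding templates to `X` and `y` yields the kind of curves the experiment reports; timings from a single call are noisy, which is one reason to repeat each configuration several times.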

Overall, the above experiments demonstrate the scalability of the global model for cloud-scale training. At the other extreme, a template-specific model suffers from a lack of training data and cannot generalize to new (unseen) templates. As part of our future work, we plan to cluster job templates using unsupervised machine learning methods and train “semi-global” models that take into account multiple job templates that share similar characteristics, to strike a balance between the two approaches.

7.4 Varying Size of Training Data

To determine the impact of the number of jobs per template on our model’s performance, we retrain our global model assuming that we only have a limited number of observations for each job template. In particular, we train our model with $n$ observations per job template, for several values of $n$. The MARE and the training time are reported in Figure 8. We used 22 job templates and, for each run of the experiment, a random sample of $n$ observations per template was selected as the training data. Beyond a certain sample size (see Figure 8), the MARE dropped below 20%, and the prediction accuracy continues to improve for larger sample sizes, although less significantly. Note that the training time increased exponentially with the sample size.
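The per-template subsampling step can be sketched as follows. This is an illustrative assumption about how such a sample might be drawn, not the authors' code: the `(template, runtime)` record layout, the parameter name `n`, and the fixed seed are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Job = Tuple[str, float]  # (template id, runtime); hypothetical layout


def sample_per_template(jobs: List[Job], n: int, seed: int = 0) -> List[Job]:
    """Keep at most n randomly chosen observations for each job template."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    by_template: Dict[str, List[Job]] = defaultdict(list)
    for job in jobs:
        by_template[job[0]].append(job)

    sample: List[Job] = []
    for template in sorted(by_template):
        group = by_template[template]
        # Templates with fewer than n observations contribute all of them.
        sample.extend(rng.sample(group, min(n, len(group))))
    return sample


# Toy data: template A has many runs, template B only a handful.
jobs = [("A", float(i)) for i in range(50)] + [("B", float(i)) for i in range(5)]
subset = sample_per_template(jobs, n=10)
counts = {t: sum(1 for job in subset if job[0] == t) for t in ("A", "B")}
print(counts)
```

Training the global model on `subset` for increasing `n`, with a different seed per repetition, gives one MARE and one training time per sample size.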

7.5 Model Generalization

The baseline approach for interpretation described in this paper allows job runtime prediction results to be interpreted and compared to a set of similar jobs. Data in the real world frequently looks similar to our dataset: Gaussian mixture distributions of a target variable are commonly encountered.

This section presents an example of employing Griffin on a dataset from another field. A classic dataset from statistics is the “Auto” dataset of gas mileage, which is represented well by a linear superposition of three Gaussians by region of origin: American, European, and Japanese. Manufacturers might be interested in understanding what factors drive higher gas mileage in one American car relative to other American cars. Here gas mileage is the equivalent of job runtime.

| Feature | Ford Granada | Mean | Baseline | Delta FC (in mpg) |
|---|---|---|---|---|
| Year | 81 | 76 | 75 | 2.07 |
| Horsepower | 88 | 104 | 106 | 1.07 |
| Weight | 3060 | 2978 | 3224 | 0.7 |
| Displacement | 200 | 194 | 221 | 0.04 |
| Acceleration | 17.1 | 15.5 | 16.1 | -0.19 |
| Cylinders | 6 | 5.5 | 5.8 | -1.4 |

Table 4: Contribution of each feature to the high gas mileage of American-made Ford Granada compared to other American cars.

Table 4 summarizes the delta contribution of each feature (FC) to the high gas mileage of American-made Ford Granada compared to other American cars based on our anomaly reasoning algorithm, described in Section 4. We observe that “Year” and “Horsepower” contribute the most to high gas mileage, while “Weight” and “Displacement” make marginal contributions. “Acceleration” and “Cylinders” contribute to low gas mileage.
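The reading of Table 4 given above amounts to sorting features by their delta contribution and splitting them by sign. The sketch below uses the values from Table 4 directly; it only illustrates the ranking step and does not reproduce how the contributions themselves are computed (see Section 4). The function name is an assumption.

```python
from typing import Dict, List, Tuple

# Delta feature contributions (in mpg), copied from Table 4.
DELTA_FC: Dict[str, float] = {
    "Year": 2.07,
    "Horsepower": 1.07,
    "Weight": 0.7,
    "Displacement": 0.04,
    "Acceleration": -0.19,
    "Cylinders": -1.4,
}


def rank_contributions(delta_fc: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort features from the most positive to the most negative delta."""
    return sorted(delta_fc.items(), key=lambda kv: kv[1], reverse=True)


ranked = rank_contributions(DELTA_FC)
raises = [name for name, delta in ranked if delta > 0]  # push mileage up
lowers = [name for name, delta in ranked if delta < 0]  # push mileage down
print(raises, lowers)
```

In the job-slowdown setting, the same ordering over runtime contributions is what produces a ranked list of candidate reasons.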

8 Related Work

Anomaly detection [Chandola2009AnomalyDA] refers to the problem of finding patterns in data that do not conform to expected behavior. In contrast, anomaly reasoning, which is the purpose of this work, encompasses recognizing, interpreting, and reacting to unfamiliar objects or familiar objects appearing in unexpected contexts.

Anomaly reasoning is particularly important for large-scale systems, as it is not possible to manually track all machines and applications at scale. Below we discuss the main categories of works that are related to Griffin.

Interpretable models Building an effective anomaly reasoning system requires both interpretable models and reasoning algorithms. Several efforts have focused on interpreting the results of machine learning models. Their goal is to explain how or why the algorithm produces a specific prediction and to identify interactions between features and estimation results [dovsilovic2018explainable, gunning2017explainable, samek2017explainable]. In Griffin, we use an interpretable random forest model to rank a job’s slowdown reasons given the contribution of various features. Similar methods to interpret a tree model can be generalized to boosting algorithms [welling2016forest].

Anomaly reasoning with labeled data Existing related works have used detection methods such as classification and clustering to analyze anomalies in cloud computing. Detailed surveys of those works can be found in [agrawal2015survey, modi2013survey]. For example, a fault detection and isolation system based on k-nearest neighbors has been proposed to rank machines in order of their anomalous behavior [bhaduri2011detecting]. Other works have used a hybrid of SVM classification and k-medoids clustering to detect network intrusions [chitrakar2012anomaly]. An anomaly-based clustering method has also been suggested to detect failures in general production systems [duan2009fa].

The downside of these approaches is that they require labeled data. Such data is hard to acquire in many settings, including ours. In the context of an infrastructure that has been operating for many years, labeling data requires infrastructure support and, most importantly, training a large number of engineers to add labels when resolving anomalies. Moreover, it is almost impossible to label years’ worth of historical data.

Anomaly reasoning with unlabeled data A few approaches have considered unlabeled data, but they either focus on time-series analysis or are restricted to machine or VM metrics [cherkasova2009automated, cohen2004correlating, dean2012ubl, gu2009online, tan2010adaptive, tan2012prepare, zhang2013intelligent]. For instance, an anomaly detection and reasoning system has been proposed to detect security problems during VM live migration [zhang2013intelligent]. This work is based on time-series data related to resource utilization statistics, e.g., file reads/writes, system calls, and CPU usage. PREdictive Performance Anomaly pREvention (PREPARE) [tan2012prepare] predicts performance anomalies using a 2-dependent Markov model and a classifier based on system-level metrics, such as CPU, memory, and network traffic. Another related work developed an Unsupervised Behavior Learning (UBL) system to capture anomalies and infer their causes [dean2012ubl]. To circumvent the need for labeled data, a Self-Organizing Map (SOM) has been suggested to model the system behavior, with deviations from it used for anomaly detection. Similar to Griffin, those methods estimate the contribution of each attribute to the anomaly and provide information about which system-level metrics to look into. However, they require time-dependent series of data to capture the anomalies. In contrast, we focus on job instances that span several hundreds of machines but only during the lifetime of the job. Thus, our features are neither time series nor machine-centric (although we do employ some system-level data to examine the system’s impact on a particular job’s execution).

Other works in anomaly reasoning aim to pinpoint the faulty components of a system by tracing the system’s activities [aguilera2003performance, mi2012performance, mi2011magnifier, nguyen2011pal]. These methods rely on estimating the change points of time series and on the propagation pattern or the execution graph. However, they require significant domain knowledge and are hard to generalize.

To the best of our knowledge, Griffin is the first anomaly reasoning system to be deployed at this scale in production to identify the causes of job slowdowns in analytics clusters. Unlike existing approaches, it follows a job-centric approach and relies neither on labeled data nor on time-series analysis.

9 Conclusion & Future Work

We presented Griffin, a system that we built and have deployed in production to detect the causes of job slowdowns in Microsoft’s big data analytics clusters, consisting of hundreds of thousands of machines. Griffin does not require labeled data to perform anomaly reasoning. Instead, it uses an interpretable machine learning model to predict the runtime of a job that has experienced a slowdown. Using this model, we can determine the contribution of each feature in the deviation of the job’s runtime compared to previous normal executions of the job (or of jobs with similar characteristics).

Our evaluation results using historical incidents showed that Griffin discovers the same slowdown reasons that were detected by domain expert engineers. We also compared various categories of models and showed that a global (i.e., trained over all jobs) random forest model strikes a good balance between accuracy, training time, model size, and generalization capabilities.

Towards data-driven decisions Griffin is part of our bigger vision towards employing data-driven decisions to optimize various aspects of our systems. Taking Griffin’s capabilities a step further, knowing the job slowdown reasons allows us to automatically tune the system to avoid such slowdowns in the future. This may include both system parameters, such as dynamically setting the number of running tasks per machine, and application parameters, such as the degree of parallelism for each stage of a job. Moreover, such parameter autotuning does not have to be constrained to job slowdowns—we can use it to automatically and dynamically set various parameters in our systems to improve their performance.

Furthermore, although Griffin currently targets our internal analytics clusters, the above techniques can be applied to other environments, such as various public Azure services, including the Azure SQL and HDInsight offerings. Similar data-driven decisions are increasingly applied in various companies [uberds, linkedinds].