Large cloud providers offer a wide range of virtual resources and managed services, and thus serve various use cases and needs. From a user perspective, this has various advantages, such as dynamic scaling of resources based on demand, or simplified management and deployment. Especially for infrequent needs, making use of cloud resources is often favored over running and managing one's own resources. Hence, an increasing number of computing workloads are run in the cloud.
Yet, in the majority of cases, users still need to select appropriate resources for their workloads. Inexperienced users are often overwhelmed by the range of different resources to choose from, while even expert users find this to be difficult. With an increasing number of scientists and data analysts from other domains that require data-processing workloads to be run in the cloud [4, 6, 2], supporting them becomes important.
Multiple solutions have been proposed over the years to support users with fixed constraints, e.g. in terms of runtime or cost. They can be roughly clustered into search-based methods [1, 13, 3] and model-based methods [26, 23, 21, 30]. In general, most methods require historical data about workload executions or benefit from it, yet its collection is costly or often restricted to certain use cases. However, since cloud users usually operate on the same infrastructure and possibly execute similar workloads, there is potential in sharing and exploiting information about their workload executions, most likely in the form of confidentiality-preserving traces of data.
In this paper, we envision a system for collaborative exploitation of workload execution traces as well as data enrichment in order to optimize future workloads. We further argue for an explicit consideration of the attributed graph of task-dependencies of workloads, and strive for computing compact representations of these graphs for downstream clustering tasks. Ideally, computed clusters consist of similar or reoccurring workloads. Knowledge present in individual clusters, e.g. average runtime of workloads, can then be used to make better decisions for new workloads. Our approach to encoding and clustering of workload execution graphs is prototypically evaluated on the Alibaba dataset of cluster traces (https://github.com/alibaba/clusterdata). We are able to demonstrate that even anonymized workload execution graphs bear a predictive value for identifying groups of similar workloads. This underlines the potential of exploiting and enriching shared workload execution traces for predicting performance indicators of workloads, and addresses the problem of limited data via data sharing.
Contributions. The contributions of this paper are:
An idea for a system that enables users to share traces of their workload executions. Exchanged data can be evaluated for general patterns and workload similarities, which can be exploited for optimization of future workloads.
A prototypical implementation for mining of traces from workload execution graphs via graph encoding and graph clustering. These steps form an important part of the optimization process.
An evaluation of our implementation on a publicly available trace dataset. We demonstrate the predictive value of workload execution graph traces, e.g. for finding clusters of presumably similar workloads, and discuss the implications of our findings.
Outline. The remainder of the paper is structured as follows. Section II elaborates on the idea and proposes a system for exploiting workload execution traces for workload optimization, whereas Section III details the encoding and clustering of workload execution graphs. Section IV presents the preliminary results of our trace analysis. Section V discusses requirements for the implementation of our approach in real cluster environments. Section VI discusses the related work. Section VII concludes the paper.
II System Idea
This section elaborates on our envisioned system for workload optimization through data sharing and exploitation of workload traces. It is further illustrated in Figure 1.
II-A Sharing of Confidential Execution Data
Workload optimization through selection of more suitable cloud configurations is often realized with performance models. However, such models usually require a certain amount of historical data or profiling, which is costly to collect. Yet, many cloud users operate on the same infrastructure and potentially even run similar workloads, which opens the possibility of data sharing. As execution data of workloads also encompasses confidential and personal information, it is necessary to share traces of this data in a confidentiality-preserving manner. In the context of recent publications of cluster trace datasets from cloud providers, research was also conducted on this aspect [20]. If cloud providers were to implement interfaces for conveniently fetching execution traces, or if enough users were to organize themselves and gather their respective traces in a centralized place, methods could be employed for exploiting the data and optimizing future workloads.
II-B Envisioned System
Assuming that for each planned or executed workload, its execution graph traces are accessible, cloud users can collect their traces, optionally enrich them with further information describing the respective workloads, and save them to community-managed execution data repositories. With growing data, machine learning (ML) methods can be employed to encode and cluster the various workloads according to certain characteristics and performance indicators. As the traces are confidential by design, and users add further data on a voluntary basis, users retain full control over their data.
Whenever a user attempts to submit a new workload, traces of the prospective execution plan are first retrieved from the respective cloud provider. The information can be used to consult the various trained ML methods in order to exploit insights from similar historical workloads. With this knowledge gain, the desired workload can be optimized according to user-specific objectives and constraints, and eventually submitted.
III Mining of Execution Graph Traces
This section presents our approach to the mining of execution graph traces of workloads for the optimization of future workloads. It encompasses both encoding and clustering steps. The complete approach is generally sketched in Figure 2.
III-A Preliminaries
Data-parallel processing jobs can be modelled as directed acyclic graphs (DAGs) to represent the dependencies between job tasks. A directed, acyclic, and attributed graph $G = (V, E)$ consists of a set of nodes $V$ and a set of edges $E \subseteq V \times V$. Each node $v_i \in V$ has a node feature vector $x_i \in \mathbb{R}^{d}$ of dimension $d$. An edge $e_{ij} = (v_i, v_j) \in E$ describes a directed connection between node $v_i$ and $v_j$. Thus, the node $v_i$ is then called a neighbor of node $v_j$, formally written as $v_i \in \mathcal{N}(v_j)$. The adjacency matrix $A$ of a graph with $n = |V|$ nodes is an $n \times n$ matrix with entries $a_{ij}$ such that $a_{ij} = 1$ if an edge $e_{ij}$ exists, else 0. No cycles exist in the graph.
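As a minimal sketch of this definition, the following Python snippet builds the adjacency matrix of a small attributed graph and checks that it is acyclic. The example job, its edges, and its feature values are hypothetical and only serve as illustration; the acyclicity check uses Kahn's topological sorting, which is one possible choice and not necessarily the one used in our prototype.

```python
from collections import deque


def adjacency_matrix(num_nodes, edges):
    """Return the n x n matrix A with A[i][j] = 1 iff an edge (i, j) exists."""
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        A[i][j] = 1
    return A


def is_acyclic(A):
    """Kahn's algorithm: the graph is acyclic iff every node can be sorted."""
    n = len(A)
    in_degree = [sum(A[i][j] for i in range(n)) for j in range(n)]
    queue = deque(j for j in range(n) if in_degree[j] == 0)
    visited = 0
    while queue:
        i = queue.popleft()
        visited += 1
        for j in range(n):
            if A[i][j]:
                in_degree[j] -= 1
                if in_degree[j] == 0:
                    queue.append(j)
    return visited == n


# Hypothetical job with four tasks: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
# Illustrative node features, one vector of dimension 3 per node.
features = [(100, 0.3, 1), (200, 0.5, 4), (100, 0.2, 2), (50, 0.1, 1)]
A = adjacency_matrix(len(features), edges)
```

Calling `is_acyclic(A)` on this example returns `True`, whereas a graph containing both `(0, 1)` and `(1, 0)` would be rejected.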
III-B Graph Encoding
In order to leverage the DAG of a workload in various prediction models, we first need to compute a reasonable representation. Thus, we compose a graph neural network (GNN) architecture and derive a vector representation for each graph through a final global pooling operation. In a first step, multiple graph convolutional layers are employed. The fundamental idea of our GNN is the exploitation of structural information in the graph, which is realized by passing node features to neighboring nodes and thereby repeatedly computing new node features. This operation is referred to as neighborhood aggregation and commonly defined as

$$x_i^{(k)} = \gamma^{(k)}\Big(x_i^{(k-1)},\ \square_{v_j \in \mathcal{N}(v_i)}\ \phi^{(k)}\big(x_i^{(k-1)}, x_j^{(k-1)}\big)\Big)$$

where $x_i^{(k)}$ is the feature vector of node $v_i$ after $k$ aggregation steps, $\mathcal{N}(v_i)$ is the set of neighbors of $v_i$, $\square$ denotes a differentiable and permutation invariant function, e.g. sum or mean, $k$ denotes the number of hops, and both $\gamma^{(k)}$ and $\phi^{(k)}$ denote differentiable functions, e.g. feed-forward neural networks, which optionally alter the vector dimensionality. When conducting multiple such steps, structural information is effectively exploited and propagated through the graph.
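To make the aggregation step more tangible, the sketch below performs it on plain Python lists. It is a deliberately simplified illustration under assumptions of our choosing: messages are the unchanged neighbor features, the permutation invariant function is the element-wise mean, and the update simply adds the aggregated message to the node's own features. A trained GNN layer would use learned functions instead.

```python
def aggregate_step(features, neighbors):
    """One round of neighborhood aggregation.

    features:  list of per-node feature vectors (lists of floats)
    neighbors: neighbors[i] lists the nodes whose features are passed to node i
    """
    dim = len(features[0])
    updated = []
    for i, own in enumerate(features):
        # Message function (assumed): pass the neighbor features unchanged.
        incoming = [features[j] for j in neighbors[i]]
        if incoming:
            # Permutation invariant aggregation: element-wise mean.
            agg = [sum(vec[l] for vec in incoming) / len(incoming)
                   for l in range(dim)]
        else:
            agg = [0.0] * dim
        # Update function (assumed): add the aggregate to the own features.
        updated.append([own[l] + agg[l] for l in range(dim)])
    return updated


# Hypothetical three-node graph: node 0 feeds node 1, nodes 0 and 1 feed node 2.
features = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
neighbors = {0: [], 1: [0], 2: [0, 1]}
one_hop = aggregate_step(features, neighbors)
two_hop = aggregate_step(one_hop, neighbors)
```

After the second step, the features of node 2 also reflect information that reached node 1 from node 0 in the first step, which is how repeated aggregation propagates structural information over multiple hops.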
Eventually, we seek to compute a fixed-size vector representation for a graph. This can be formally expressed as

$$h_G = \big(h_{G,1}, \dots, h_{G,m}\big), \qquad h_{G,l} = \square_{i=1}^{n}\ x_{i,l}^{(K)}$$

where $h_G \in \mathbb{R}^{m}$ is the final vector representation for a graph $G$ with hidden dimension $m$, $K$ is the number of aggregation steps, $\square$ is a permutation invariant pooling function such as the mean, and the index $l$ is used to access individual elements in the feature vector of the $i$-th node. Since $h_G$ shall be a good representation of the corresponding original graph $G$, it is necessary to design the training of the graph neural network appropriately. Given a set of graph features, e.g. number of nodes or the length of the longest path, chosen via a feature selection process based on their predictive value for some target objective, we argue for employing an additive loss. With each loss term measuring how well computed graph representations can be separated given a concrete graph feature, the optimization process eventually leads to the generation of graph representations which capture the most relevant information of the original graph. As all representation vectors maintain a suitable dimensionality, they can be used for downstream prediction and planning tasks.
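The pooling step can be illustrated with a few lines of Python. The sketch assumes mean pooling as the permutation invariant function and uses made-up node features; it only demonstrates that graphs with different numbers of nodes are mapped to vectors of the same size.

```python
def global_mean_pool(node_features):
    """Average all node feature vectors of one graph element-wise."""
    n = len(node_features)
    dim = len(node_features[0])
    return [sum(x[l] for x in node_features) / n for l in range(dim)]


# Two hypothetical graphs with 2 and 4 nodes and hidden dimension 2.
small_graph = [[1.0, 2.0], [3.0, 4.0]]
large_graph = [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0], [2.0, 3.0]]

h_small = global_mean_pool(small_graph)
h_large = global_mean_pool(large_graph)
```

Both results have the hidden dimension as their length, and reordering the nodes of a graph leaves its representation unchanged.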
III-C Graph Clustering
With the computed graph representations retaining as much of the original information as possible, they can be further clustered into groups of vectors which are presumably similar in terms of their structure and node annotations. This is also a meaningful indicator for identifying recurring or mostly similar batch workloads. In order to make use of clustering techniques, the graph representations first need to be preprocessed. We standardize features by removing the mean and scaling to unit variance, and then scale the graph representations individually to unit norm. The preprocessed vector of a graph $G$ is from here on denoted as $\tilde{h}_G$.
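The two preprocessing steps can be sketched as follows. The snippet is a plain-Python illustration with made-up input vectors; it assumes the population standard deviation for standardization and the Euclidean norm for the per-vector scaling.

```python
import math


def standardize(vectors):
    """Remove the mean of every feature and scale it to unit variance."""
    n, dim = len(vectors), len(vectors[0])
    means = [sum(v[l] for v in vectors) / n for l in range(dim)]
    stds = [math.sqrt(sum((v[l] - means[l]) ** 2 for v in vectors) / n)
            for l in range(dim)]
    # Constant features get a divisor of 1 to avoid a division by zero.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[(v[l] - means[l]) / stds[l] for l in range(dim)] for v in vectors]


def unit_norm(vectors):
    """Scale every vector individually to Euclidean norm 1."""
    scaled = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        scaled.append([x / norm for x in v] if norm > 0 else list(v))
    return scaled


# Hypothetical graph representations with two features each.
representations = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
standardized = standardize(representations)
preprocessed = unit_norm(standardized)
```

The order matters: standardization operates per feature across all graphs, while the subsequent normalization operates per graph across all features.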
To determine the distance between two graph representations, we compute their Euclidean distance

$$d\big(\tilde{h}_{G_1}, \tilde{h}_{G_2}\big) = \big\lVert \tilde{h}_{G_1} - \tilde{h}_{G_2} \big\rVert_2$$

where $\tilde{h}_{G_1}$ and $\tilde{h}_{G_2}$ denote the preprocessed representations of two different graphs $G_1$ and $G_2$, respectively. Computed vector distances are then utilized during clustering. As the number of clusters evolves over time and thus cannot be known a priori, we utilize a density-based clustering technique. Suitable hyperparameters for such a technique can be determined empirically during training.
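The following sketch shows how a density-based technique derives clusters from pairwise Euclidean distances without a predefined number of clusters. It is a compact, quadratic-time variant of the DBSCAN idea written for illustration only; the points, the radius `eps`, and `min_samples` are arbitrary example values.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_samples):
    """Return one label per point; -1 marks outliers (noise)."""
    n = len(points)
    # Neighborhood of every point, including the point itself.
    region = [[j for j in range(n) if euclidean(points[i], points[j]) <= eps]
              for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(region[i]) < min_samples:
            labels[i] = -1  # not a core point, provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(region[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(region[j]) >= min_samples:
                frontier.extend(region[j])  # expand from core points only
    return labels


points = [[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1], [9.0, 9.0]]
labels = dbscan(points, eps=0.3, min_samples=2)
```

In this example, two clusters emerge from the data and the isolated last point is labeled as an outlier.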
III-D Information Enrichment & Optimization
The identified clusters of data points are a first indication of executions of either identical or similar batch workloads. Yet, as more concrete information about the nature of a workload is not present in most cluster traces, either due to cluster providers not being aware of e.g. input dataset characteristics or not being allowed to publish confidential and personal information, an even more precise clustering solely based on cluster traces cannot be achieved. Following our envisioned system, users are ideally able to obtain traces of their own workload executions and voluntarily enrich them with exclusive information about utilized datasets, models, or workload results. In such a scenario, the interpretation of data point clusters would be simplified.
The clustering results can be used to exploit properties common to the majority of data points belonging to a concrete cluster. Depending on the overall objective to optimize, information can be leveraged for future workloads. For example, assuming that the graph representation learning has been optimized for the task of runtime prediction and that clusters of data points have been determined based on the resulting graph representations, the execution graph of a new workload, prior to its actual execution, could be used to identify the most similar cluster and compute the median runtime of its corresponding data points. This runtime estimate can then be used for workload optimization, e.g. by tweaking the associated resource configuration of the workload.
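The runtime estimation example can be sketched in a few lines of Python. The snippet assumes that the most similar cluster is the one with the closest centroid, which is only one of several conceivable similarity criteria, and all representations and runtimes are made up.

```python
import math
from statistics import median


def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[l] for v in vectors) / len(vectors) for l in range(dim)]


def estimate_runtime(new_vec, clusters):
    """clusters maps a cluster id to a list of (representation, runtime) pairs.

    Returns the id of the closest cluster and the median runtime of its members.
    """
    def distance_to(cluster_id):
        center = centroid([vec for vec, _ in clusters[cluster_id]])
        return math.dist(new_vec, center)

    nearest = min(clusters, key=distance_to)
    return nearest, median(runtime for _, runtime in clusters[nearest])


# Hypothetical clusters of graph representations with runtimes in seconds.
clusters = {
    0: [([0.0, 1.0], 120.0), ([0.1, 0.9], 150.0), ([0.0, 0.8], 900.0)],
    1: [([1.0, 0.0], 30.0), ([0.9, 0.1], 40.0)],
}
cluster_id, runtime = estimate_runtime([0.05, 0.95], clusters)
```

Using the median rather than the mean keeps the estimate robust against single long-running executions, such as the 900-second member of the first cluster.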
IV Preliminary Trace Analysis
In this section, we examine the predictive value of workload DAGs using a prototype implementation of our presented approach called Trace-EC. We make use of a comprehensive and publicly available cluster trace dataset and obtain preliminary results. Since, to the best of our knowledge, current cluster providers do not offer users the possibility of independently exporting cluster traces of their own workloads, and thus no investigation in conjunction with additional properties is possible, we solely assess the value of our approach based on the DAGs.
IV-A Dataset
The Alibaba cluster trace dataset (https://github.com/alibaba/clusterdata) consists of trace data of about 4000 machines over a period of 8 days. The required DAG information of workloads, also referred to as jobs, is retrievable from the batch_task table. This table includes more than 14 million tasks of more than 4 million jobs and has eight features, namely task name, job name, start time and end time, status, planned CPU (100 means one core, 200 means two cores, etc.) and planned memory (normalized to a fixed range), as well as instance number (the number of instances required for each task). Since task dependencies are encoded into the task name, these features can be used to construct valid and attributed workload DAGs.
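The sketch below indicates how edges could be derived from task names. It rests on an assumption about the naming scheme that is not spelled out in the text above: a name such as `J3_1_2` is taken to consist of a letter prefix, the task's own number, and the underscore-separated numbers of the tasks it depends on, while names that do not follow this pattern are treated as carrying no dependency information. The function names are hypothetical.

```python
import re

# Assumed pattern: letters, own task number, then "_<parent>" repeated.
TASK_NAME = re.compile(r"^[A-Za-z]+(\d+)((?:_\d+)*)$")


def parse_task_name(name):
    """Return (task_id, parent_ids), or None if no DAG information is found."""
    match = TASK_NAME.match(name)
    if match is None:
        return None
    task_id = int(match.group(1))
    parents = [int(p) for p in match.group(2).split("_") if p]
    return task_id, parents


def build_edges(task_names):
    """Collect directed edges (parent, child) for all tasks of one job."""
    edges = []
    for name in task_names:
        parsed = parse_task_name(name)
        if parsed is None:
            continue
        task_id, parents = parsed
        edges.extend((parent, task_id) for parent in parents)
    return sorted(edges)


# Hypothetical job consisting of four tasks.
job_edges = build_edges(["M1", "M2", "J3_1_2", "R4_3"])
```

The resulting edge list can then be combined with the per-task attributes to form the attributed DAG of a job.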
As the dataset suffers from minor inconsistencies and incompleteness, we employ multiple data cleaning operations. For instance, we exclude all unfinished jobs or jobs with undefined values. We furthermore make sure that the graph of each job complies with the formal definition of DAGs. For the sake of our evaluation, we eventually consider only the jobs that have at least 10 individual tasks to avoid coincidental similarity in the dependency structure of jobs. We also eliminate the jobs with a runtime of more than one hour. The total number of remaining jobs is after data cleaning. For each job, we further extract 20 features from its DAG (e.g. number of nodes) and consider them as candidates for target variables during model training. Each node in a DAG has three node features, i.e. the aforementioned features planned CPU, planned memory, and instance number. Using continuous data stratification based on the runtime variable, we split the dataset into 80% training data and 20% test data.
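One way to stratify a split on a continuous variable is to sort the jobs by runtime, cut them into equally sized bins, and sample the test share from every bin. The sketch below follows this idea; the binning strategy, the number of bins, and the seed are assumptions made for illustration and do not necessarily match our prototype.

```python
import random


def stratified_split(runtimes, test_fraction=0.2, num_bins=10, seed=0):
    """Return (train_indices, test_indices) with similar runtime distributions."""
    rng = random.Random(seed)
    # Sort job indices by runtime so that bins cover consecutive value ranges.
    order = sorted(range(len(runtimes)), key=lambda i: runtimes[i])
    bin_size = max(1, len(order) // num_bins)
    train, test = [], []
    for start in range(0, len(order), bin_size):
        bucket = order[start:start + bin_size]
        rng.shuffle(bucket)
        cut = round(len(bucket) * test_fraction)
        test.extend(bucket[:cut])
        train.extend(bucket[cut:])
    return sorted(train), sorted(test)


# Hypothetical runtimes of 100 jobs in seconds.
runtimes = [float(i) for i in range(100)]
train_idx, test_idx = stratified_split(runtimes)
```

With this procedure, short and long jobs are represented in both subsets in roughly the same proportions.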
IV-B Prototype Pipeline
We implement our envisioned pipeline in a prototypical manner using various established methods. In a first step, we employ a voting mechanism of multiple machine learning models and techniques to select the features that are most predictive of the runtime. Precisely, we use Principal Component Analysis (PCA), Extra-trees, and Recursive Feature Elimination (RFE) with linear regression, and choose the union of the five top-scoring features of each technique as the target variables. They encompass for instance the average node degree and total instance number. For supervised techniques, runtime is defined as the target variable and excluded from the features.
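The union-based selection can be sketched as follows. The per-technique scores are entirely made up, the feature names are illustrative, and `k` is set to 2 instead of 5 to keep the example small; how each technique arrives at its scores is outside the scope of this sketch.

```python
def top_k(scores, k):
    """Names of the k highest-scoring features of one technique."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def select_targets(score_tables, k=5):
    """Union of the top-k features over all techniques."""
    selected = set()
    for scores in score_tables.values():
        selected.update(top_k(scores, k))
    return sorted(selected)


# Hypothetical importance scores per technique and candidate feature.
score_tables = {
    "pca": {"num_nodes": 0.9, "num_edges": 0.3, "longest_path": 0.4,
            "avg_node_degree": 0.7, "total_instance_num": 0.1},
    "extra_trees": {"num_nodes": 0.2, "num_edges": 0.1, "longest_path": 0.3,
                    "avg_node_degree": 0.8, "total_instance_num": 0.9},
    "rfe": {"num_nodes": 0.6, "num_edges": 0.5, "longest_path": 0.8,
            "avg_node_degree": 0.1, "total_instance_num": 0.2},
}
targets = select_targets(score_tables, k=2)
```

Here, `num_edges` is never among the top two features of any technique and is therefore not selected.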
The graph neural network is implemented using four stacked MFConv [9] layers, where the first one maps from the node feature dimension of 3 to the hidden dimension of 64. Each graph convolution layer is followed by a non-saturating ELU activation. The final node representations are averaged graph-wise to obtain graph representations. For regularization during training, we apply dropout with 50% probability on the model output, as well as an L2 penalty term for weight decay.
Using continuous data stratification based on the runtime variable, the provided graph training dataset is split into 75% training data and 25% validation data. The latter is used for early stopping. We use the Adam optimizer and a batch size of 128 graphs during training. We opt for an additive loss to be minimized, where each loss term is a triplet loss [22] term corresponding to an individual target variable. Triplet loss reduces the distance between similar objects and increases the distance between dissimilar ones, based on a target variable. With a MultiSimilarityMiner [27], we make sure to evaluate challenging triplets in the respective triplet loss terms.
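The structure of the additive loss can be illustrated with plain Python. The sketch assumes the common margin-based formulation of the triplet loss with Euclidean distances and an arbitrary margin of 0.1; the triplets are given explicitly here, whereas a miner would select challenging ones during training. Embeddings and target names are made up.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.1):
    """Penalize anchors that are not closer to the positive than to the negative."""
    return max(math.dist(anchor, positive) - math.dist(anchor, negative) + margin, 0.0)


def additive_loss(embeddings, triplets_per_target, margin=0.1):
    """Sum the mean triplet loss of every target variable."""
    total = 0.0
    for triplets in triplets_per_target.values():
        losses = [triplet_loss(embeddings[a], embeddings[p], embeddings[n], margin)
                  for a, p, n in triplets]
        total += sum(losses) / len(losses)
    return total


# Hypothetical graph representations and (anchor, positive, negative) triplets.
embeddings = [[0.0, 0.0], [0.0, 1.0], [3.0, 4.0], [0.0, 0.5]]
triplets_per_target = {
    "num_nodes": [(0, 1, 2)],        # already well separated, loss is zero
    "avg_node_degree": [(0, 2, 3)],  # positive is farther away than negative
}
loss = additive_loss(embeddings, triplets_per_target)
```

Because the terms are summed, a representation only achieves a low loss if it separates the graphs well with respect to every selected target variable.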
For the clustering of graph representations, we employ DBSCAN [11] with a fixed radius and a minimum of two samples required to form a cluster.
IV-C Baseline
In [24], the authors discuss a methodology for identifying recurrent jobs. It is applied to the Alibaba trace dataset for gaining potential insights. They argue that two or more jobs can be assumed to be recurrent when they fulfill two conditions:
The jobs have isomorphic DAGs.
The start times of the jobs fall within periodic time intervals, e.g. 15-minute, 1-hour, or 1-day intervals.
Both conditions were individually shown to be relevant in related works, for instance in scheduler design or for planning of production jobs. We thus utilize this methodology to compute clusters and compare them against our envisioned approach. Bliss [15] is used for identifying isomorphic graphs, and the aforementioned time intervals (expanded for tolerance by 3% of the interval length before and after) are investigated for deriving clusters of jobs.
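The second condition can be sketched as a simple check on the gap between two start times. The snippet is an interpretation for illustration purposes: it only compares two jobs at a time, treats the 3% tolerance symmetrically around each interval length, and leaves out the isomorphism test that a tool such as Bliss would provide. Names and timestamps are hypothetical.

```python
# Candidate periods in seconds.
INTERVALS = {"15min": 900, "1h": 3600, "1d": 86400}


def matching_interval(start_a, start_b, tolerance=0.03):
    """Return the name of the period matching the start time gap, if any."""
    gap = abs(start_b - start_a)
    for name, length in INTERVALS.items():
        if length * (1 - tolerance) <= gap <= length * (1 + tolerance):
            return name
    return None


# Two hypothetical jobs with isomorphic DAGs, started 920 seconds apart.
period = matching_interval(1_000, 1_920)
```

Jobs whose DAGs are isomorphic and whose start times match one of the periods would then be assigned to the same baseline cluster.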
Since the utilized test dataset of 46,562 workload DAGs has no ground truth, i.e. no information about group affiliation of workloads exists, a precise evaluation is hindered. Thus, we make the following approximation to interpret our results. Given a batch workload that is executed recurrently, it can be assumed that most executions exhibit similar characteristics and are thus strongly related. Ideally, the clusters identified by the baseline and our approach reflect this, i.e. clustered executions should either originate from the same workload or be very similar. We thus assess the quality of clusters in two ways. In a first step, we test each clustering individually by predicting the runtime of the most recent execution in each cluster as the average runtime of all remaining cluster members. Consequently, this yields as many predicted values as clusters were formed. The prediction errors are measured and reported in terms of Mean Absolute Error (MAE) and Mean Squared Error (MSE). In a second step, we attempt to compare both clusterings. Since the number of identified clusters differs and no ground truth is given, we do so by extracting the subset of workloads predicted in both clusterings, and comparing the associated prediction results.
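The first evaluation step can be expressed compactly in Python. The sketch below uses made-up clusters of (start time, runtime) pairs and skips clusters with fewer than two members, as no prediction can be made for them.

```python
def evaluate_clusters(clusters):
    """clusters: lists of (start_time, runtime) pairs; returns (MAE, MSE)."""
    errors = []
    for members in clusters:
        if len(members) < 2:
            continue
        ordered = sorted(members)  # chronological order by start time
        *history, latest = ordered
        # Predict the most recent runtime as the mean of all earlier ones.
        predicted = sum(runtime for _, runtime in history) / len(history)
        errors.append(predicted - latest[1])
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse


# Two hypothetical clusters with runtimes in seconds.
clusters = [
    [(0, 100.0), (10, 110.0), (20, 120.0)],
    [(5, 50.0), (15, 40.0)],
]
mae, mse = evaluate_clusters(clusters)
```

In this example, the predictions are 105 and 50 seconds for actual runtimes of 120 and 40 seconds, which yields one error value per cluster.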
Table I

| Metric | Baseline | Trace-EC |
|---|---|---|
| Proportion of outliers | | |
| Avg. cluster size | 18.16 | 42.62 |
Table I summarizes the most important findings of our first evaluation step. It can be seen that although both methods are able to cluster roughly the same number of workloads, Trace-EC tends to form fewer clusters with more members each, whereas the utilized baseline forms more clusters with fewer members on average. Considering the composition of individual clusters, it is worth mentioning that Trace-EC apparently produces well-formed clusters, as the reported MAE is substantially smaller compared to the baseline. The same holds for the reported MSE and variance. Furthermore, the histograms in Figure 3 underline that clusters generated by Trace-EC allow for more accurate predictions and in turn smaller prediction errors. These results are consequently a first indication of superior performance.
For our second evaluation step, we extract the subset of workloads predicted in both methods, which leaves us with 1715 clusters from both clusterings. Again, we compute the relevant evaluation metrics on the subsets and illustrate them in Figure 4. It is confirmed once more that Trace-EC is better suited for downstream optimization tasks, which becomes evident by its MAE of 72.62 outperforming the MAE of 94.71 from the baseline. In conclusion, clusters identified by Trace-EC tend to encompass more similar workloads with respect to desired performance objectives.
V Discussion
Our findings demonstrate that the explicit consideration of workload DAGs through a flexible data-driven approach yields improved prediction performance. Even with normalized attributes, workload DAGs obtained from cluster traces appear to have predictive value. Although the lack of ground truth hinders a complete evaluation, generated clusters tend to encompass similar workloads and can thus be exploited for future workload optimization.
For application in real-world cloud environments and for optimization of actual workloads, we deem two things important. On the one hand, users are still required to enrich their collected traces with further information, e.g. dataset size or chosen algorithm implementation. Without such information, a more fine-grained distinction and thus clustering of workloads is hardly possible. On the other hand, cloud providers need to provide simple ways for users to retrieve confidential traces of their own workloads. This can for instance be achieved by configurable data sinks, such that each workload execution is followed by an automated transfer of anonymized traces to a desired storage, e.g. a community-managed execution data repository.
VI Related Work
Our work motivates the usage of cluster traces in order to collaboratively optimize workloads. This section consequently discusses approaches to collaborative workload optimization, as well as other trace analysis papers and trace-based methods.
VI-A Collaborative Workload Optimization
Multiple solutions have been proposed that either enable the usage of data originating from various contexts, strive to find a configuration that optimizes numerous workloads at the same time, or foster a collaborative approach in a different way.
Micky [13] is a collective optimizer that determines a cloud configuration optimizing as many of the given workloads as possible. In this specific context, operational costs and execution times are regarded as performance objectives, and an optimized solution balances both of them for most workloads.
The Peregrine [14] workload optimization platform follows a collaborative approach by searching for patterns in historical query workloads and employing suitable optimization strategies. Exemplary patterns are periodicity and similarity.
The authors of [7] attempt to optimize workloads in environments with multiple actors by utilizing an experiment graph that allows for reuse of historical operations and their artifacts. Their work is especially suited for machine learning workloads while incurring only negligible overhead.
In our own previous work, we investigate collaborative solutions for distributed dataflows. Bellamy [21] proposes an end-to-end trainable neural network architecture that allows for incorporation of workload execution data originating from different contexts. It can thus be pre-trained and later fine-tuned on new contexts with only little available data. Another work is the C3O system [29, 30, 28], which enables the sharing of runtime data and artifacts. This allows for collaborative exploitation of shared information, which is demonstrated to be effective in case of context-aware predictors.
VI-B Cluster Traces from Cloud Providers
In recent years, various cloud providers published traces of their clusters for further analysis and application. A considerable proportion of works mainly focuses on analyzing workload characteristics and distributions [19, 18, 5, 12, 25], where it is for instance found that the resource and memory consumption of most workloads follows a heavy-tailed distribution, i.e. few workloads consume the majority of resources [19, 25]. At the same time, especially long-running workloads tend towards over-provisioning of resources.
With regard to application, publicly available cluster traces are furthermore used for proposing novel scheduling approaches [16, 31], or utilizing trace information for implementing or evaluating various prediction methods [8, 10].
VII Conclusion
The primary goal of this work is to show how even anonymized execution traces shared with other users can be used to optimize future workloads. To this end, we envision a system that fosters trace data sharing among users, such that users can design and employ methods for detection of patterns, which in turn can be exploited for planning and optimizing workloads in the future, e.g. with regard to runtime or cost. Towards this goal, we implemented an approach for encoding and clustering traces of workload execution graphs, and evaluated it on a publicly available trace dataset. We find that our data-driven solution is able to make use of the predictive value of workload DAGs for performance indicators of interest, which in turn is a prerequisite for workload optimization.
In the future, we plan to leverage our findings and use graph information for optimized resource management for data processing workloads, including for resource allocation and scheduling. Moreover, we want to investigate the potential of traces other than execution graphs.
Acknowledgments
This work has been supported through grants by the German Federal Ministry of Education and Research (BMBF) as BIFOLD (funding mark 01IS18025A) and the German Research Foundation (DFG) as FONDA (DFG Collaborative Research Center 1404).
References
[1] (2017) CherryPick: adaptively unearthing the best cloud configurations for big data analytics. In NSDI.
[2] (2021) Tarema: Adaptive Resource Allocation for Scalable Scientific Workflows in Heterogeneous Clusters. In BigData.
[3] (2020) Finding the right cloud configuration for analytics clusters. In SoCC.
[4] (2013) Parallelization in scientific workflow management systems. CoRR abs/1303.7195.
[5] (2018) Characterizing co-located datacenter workloads: an alibaba case study. In APSys.
[6] (2019) The evolution of the pegasus workflow management software. Comput. Sci. Eng. 21 (4).
[7] (2020) Optimizing machine learning workloads in collaborative environments. In SIGMOD.
[8] (2012) Host load prediction in a google compute cloud with a bayesian model. In SC.
[9] (2015) Convolutional networks on graphs for learning molecular fingerprints. In NIPS.
[10] (2017) Learning from failure across multiple clusters: A trace-driven approach to understanding, predicting, and mitigating job terminations. In ICDCS.
[11] (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD.
[12] (2019) Who limits the resource efficiency of my datacenter: an analysis of alibaba datacenter traces. In IWQoS.
[13] (2018) Micky: A cheaper alternative for selecting cloud instances. In CLOUD.
[14] (2019) Peregrine: workload optimization for cloud query engines. In SoCC.
[15] (2007) Engineering an efficient canonical labeling tool for large and sparse graphs. In ALENEX.
[16] (2018) Leveraging dependency in scheduling and preemption for high throughput in data-parallel clusters. In CLUSTER.
[17] (2020) Understanding the workload characteristics in alibaba: a view from directed acyclic graph analysis. In HPBD&IS.
[18] (2017) Imbalance in the cloud: an analysis on alibaba cluster trace. In BigData.
[19] (2012) Heterogeneity and dynamicity of clouds at scale: google trace analysis. In SoCC.
[20] (2012) Obfuscatory obscanturism: making workload traces of commercially-sensitive systems safe to release. In NOMS.
[21] (2021) Bellamy: reusing performance models for distributed dataflow jobs across contexts. In CLUSTER.
[22] (2015) FaceNet: A unified embedding for face recognition and clustering. In CVPR.
[23] (2019) Quick execution time predictions for spark applications. In CNSM.
[24] (2019) Characterizing and synthesizing task dependencies of data-parallel jobs in alibaba cloud. In SoCC.
[25] (2020) Borg: the next generation. In EuroSys.
[26] (2016) Ernest: efficient performance prediction for large-scale advanced analytics. In NSDI.
[27] (2019) Multi-similarity loss with general pair weighting for deep metric learning. In CVPR.
[28] (2021) Training data reduction for performance models of data analytics jobs in the cloud. In BigData.
[29] (2020) Towards collaborative optimization of cluster configurations for distributed dataflow jobs. In BigData.
[30] (2021) C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds. In IC2E.
[31] (2019) Aladdin: optimized maximum flow management for shared production clusters. In IPDPS.