
DeepPlace: Learning to Place Applications in Multi-Tenant Clusters

by   Subrata Mitra, et al.

Large multi-tenant production clusters often have to handle a variety of jobs and applications with complex resource usage characteristics. Manually creating placement rules that decide which applications should be co-located is non-trivial and leads to sub-optimal scheduling. In this paper, we present DeepPlace, a scheduler that learns to exploit various temporal resource usage patterns of applications using Deep Reinforcement Learning (Deep RL) to reduce resource competition across jobs running on the same machine while at the same time optimizing for overall cluster utilization.





1. Introduction

Today, large production environments often need to handle a wide variety of applications, including but not limited to interactive (user-facing) services, latency-sensitive applications, batch analytics jobs, stream processing, iterative computations, and maintenance services. The standard practice today is to deploy these applications as containers, which are then managed by container orchestration engines such as Docker-Swarm (docker-swarm), YARN (yarn), Mesos (mesos), or Kubernetes (kubernetes). These orchestration engines allocate resources (e.g., CPU and memory) to these jobs according to the estimated resource limits provided by the developers (kubernetes). In a multi-tenant shared cluster, if multiple applications compete for the same shared resources, they slow each other down due to resource contention (maji2015; maji2014; pythia). Thus, to reduce the chances of contention, orchestration engines use developer-specified affinity and anti-affinity (kubernetes; medea_eurosys2018) rules to place applications on different machines. For stateless services, resource estimates can be somewhat aggressive: the resources allocated to each deployed container only need to be enough for it to run smoothly, while load fluctuations are handled through an autoscaling mechanism that increases or decreases the number of deployed containers on the fly. For stateful services, autoscaling is tricky, and reactive migration of containers across machines has high overhead (medea_eurosys2018). Hence, such containers are usually deployed with very conservative estimates, specifying large resource limits so that they can sustain phases with a substantial increase in resource demand. However, periods of such high resource usage are rare and often span only a very short fraction of the life-cycle of the application, leading to resource wastage during the comparatively idle times.

In most real production systems, not all applications need their peak resources at the same time, and not all phases of their execution contend for resources in the same manner (pythia; atc2018; googlesocc).

Figure 1. Examples of resource usage by three production services with several "peaks" and "valleys".

For example, Figure 1 shows the CPU usage characteristics of three production services in our cluster across various phases of their execution. Their temporal resource usage shows several "peaks" and "valleys", some more regular than others. For user-facing services, such temporal resource usage can have daily and seasonal patterns due to fluctuations in user demand (googleworkload) (e.g., some services are mostly used during working hours, while others are mostly used during major holiday seasons). Applications can also have different resource usage patterns across algorithmic phases (opprox; zhang2007; pythia), e.g., between the map and reduce phases in map-reduce jobs. Because of these temporal variations in resource usage, provisioning for the peak using developer-provided limits is inefficient from an overall resource utilization perspective. The variety of applications and the complexity of their temporal resource usage patterns make it infeasible for developers to express the placement logic in terms of the placement rules available in current schedulers, e.g., affinity and anti-affinity rules in Kubernetes (kubantiaffinity).

Borg (borg) partially addresses this problem by packing a mix of high- and low-priority jobs on each machine, so that high-priority jobs can expand during load spikes whereas low-priority jobs can take advantage of the idle periods of the high-priority jobs. However, not all clusters see such a healthy mix of low-priority jobs to effectively fill the valleys of the high-priority jobs.

Along with the temporal usage patterns, some jobs might have dependent succeeding jobs that rely on the completion of the first job. These dependencies can exist within a service or across services. For example, a customer might have a nightly recommendation model builder, after whose completion a service kicks in to generate a new set of recommendations. A job scheduler that is aware of such dependencies can use this information to efficiently schedule the existing jobs while making room for the upcoming jobs. A central scheduler can even discover serendipitous dependencies between jobs coming from completely different developer groups, opening up opportunities for resource alignment among these jobs and leading to improved utilization of the cluster.

In this paper, we introduce an early prototype of DeepPlace, a self-learning scheduler that can opportunistically place containerized applications such that their temporal resource usages are aligned, resource contention is minimized, quality of service is maintained, and overall utilization is improved. DeepPlace uses deep reinforcement learning (Deep RL) to learn hidden patterns from historical data over time and improve its scheduling policy. Essentially, DeepPlace treats the resource usage of the applications as a multivariate time series and learns how these time series can be placed across different machines so that their resource usages are better aligned. We show through some example cases how DeepPlace can take non-trivial decisions by anticipating future placement requests in order to optimize the overall resource usage in the cluster. For stateful services, DeepPlace helps by minimizing the chances of resource contention without being overly conservative. For stateless services, the need for scaling up can be reduced by having a better placement to begin with: with a better placement, a small number of containers might be able to gracefully handle the load up to a certain extent without scaling up. When scaling up does happen, where to place the additional containers is another crucial question that DeepPlace can answer, since new machines are usually not spawned frequently during container scaling.

2. Background and Related Work

Reinforcement Learning. In RL, at a high level, an agent interacts with a system and tries to learn an optimized policy. At each timestep $t$, the agent observes the state of the system $s_t$ and chooses an action $a_t$ that changes the state to $s_{t+1}$ at timestep $t+1$, and the agent receives a reward $r_t$. The agent tries to maximize the received reward, which helps it learn an optimized policy. It is assumed that the state transitions and rewards are stochastic and that the state transition probabilities and rewards depend only on the state of the environment $s_t$ and the action $a_t$ taken by the agent (i.e., they exhibit the Markov property (sutton1998reinforcement)).

The objective is to maximize the expected cumulative discounted reward $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$, where the discount factor $\gamma$ determines how much the future rewards contribute to the total reward. More details on the theoretical background of RL can be found in (sutton1998reinforcement) and (deeprm). Inspired by the recent trends in Deep Reinforcement Learning (DeepRL), in this paper we use deep neural networks (DNNs) as a function approximator for the placement policy that DeepPlace wants to learn. The RL algorithm performs gradient descent on the parameters of this DNN so that it can maximize the expected cumulative discounted reward over the actions the RL-agent takes. The gradients are estimated by observing the trajectories of execution that are obtained by following the policy.
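As a minimal sketch of the cumulative discounted reward described above, the per-timestep returns of one observed trajectory can be computed as follows. The function name `discounted_returns` and the example values are illustrative assumptions and are not taken from the DeepPlace implementation.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards so each return reuses the one after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical trajectory of three penalties (negative rewards).
print(discounted_returns([-1.0, -1.0, -1.0], 0.5))
```

In a policy-gradient method such as REINFORCE, returns of this kind weight the gradient of the log-probability of each action taken along the trajectory.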

RL has been used in a variety of scenarios, including learning complex games (mnih2013playing; gibney2016google; lample2017playing), robotics (mnih2015human; kaelbling1996reinforcement; kober2013reinforcement), and very recently video streaming (neural-abr-2017), routing (boyan1994packet; valadarsky2017learning; mestres2017knowledge), and device placement (mirhoseini-icml-2017). However, the application of RL to self-learning schedulers has not been thoroughly explored.

Scheduling. To the best of our knowledge, the recently proposed DeepRM (deeprm) is the only other self-learning scheduler that also attempts to learn a novel scheduling policy using DeepRL. DeepRM has a very simplified view of the cluster and thus comes with several limitations.

  1. DeepRM assumes that jobs will always take a fixed amount of resources. It does not capture their temporal variations. Users often overestimate resource requirements and there can be a significant difference in resource usage between a peak-load and off-peak loads (Figure 1). Thus, ignoring such temporal variations and using the user-specified resource-limits for placement is wasteful and leads to low utilization.

  2. DeepRM models the resource capacity of the compute cluster as a single monolithic block. It does not have a machine-specific view, and in its scheduling decisions it does not try to choose the set of jobs or services to be run together so as to avoid resource contention.

  3. The single monolithic view of the total resource capacity of the cluster ignores the impacts of resource fragmentation (i.e., the total amount of available resource in the cluster is more than the requirement of a job, but no single machine has that much available resources left.)

Tetris (grandl2015multi) is another heuristic-based cluster scheduler that takes into account multiple resource dimensions as well as the alignment of the machine’s remaining resources with the job’s requirements when packing jobs onto machines.

A large body of work has focused on scheduling data-driven applications, long-running user-facing services, ML services, etc. on multi-tenant commodity clusters, covering various aspects such as fairness of resource sharing (parkes2015beyond; ghodsi2013choosy; joe2013multiresource; popa2012faircloud; ghodsi2011dominant; grandl2015multi), tail-latency optimization (suresh2015c3; ren2015hopper; ferguson2012jockey; leverich_eurosys2014; mace2015retro; bobtail; haque2015few), and how to protect latency-sensitive applications while improving cluster utilization (pythia; paragon; quasar; borg; heracles; bubble_flux; q-clouds; omega; apollo). These are distinct from our work, as none of these schedulers attempts to self-learn the best scheduling strategy by discovering hidden resource usage characteristics and dependencies among applications along multiple resource dimensions. However, some of the proposed techniques (e.g., cycles-per-instruction (cpi2)) can be used with our technique to further fine-tune the reward/penalty design.

3. Design

We now describe the design of DeepPlace and how it operates. DeepPlace observes temporal job behavior to optimize its policy, encoded in a DNN-based policy network, using RL. DeepPlace models the scheduling problem as an RL environment in which the compute cluster is composed of a set of machines on which the application services or jobs are to be scheduled. Each machine has a fixed total physical capacity for each resource dimension (e.g., CPU, memory). For each job or service, DeepPlace observes the time series of its resource usage along every resource dimension. DeepPlace also keeps track of the current placement map, i.e., which services or jobs are running on which machines, as well as the incoming services or jobs that need to be scheduled in the cluster, held as a queue. The purpose of the queue is to incorporate a view of the upcoming jobs in the state representation, allowing the scheduler to learn the arrival patterns and dependencies among the jobs. The complete workflow of DeepPlace is shown in Figure 3.

3.1. State Space Representation

Figure 2. Input space representation of DeepPlace

DeepPlace’s state space representation is inspired by (deeprm). While DeepRM’s representation for scheduling is designed to answer "what job to schedule when", DeepPlace is designed primarily to answer "what job to schedule where". In extreme cases, DeepPlace can delay some scheduling decisions if no suitable placement exists. DeepPlace therefore makes some key improvements in the input-space representation to capture the degree of competition for resources among the jobs sharing the same underlying resources of a machine and the temporal variations in their resource usage.

Figure 3. Workflow of DeepPlace

Figure 2 illustrates the input-space representation.

  1. The state of each machine in the cluster is represented, for each resource dimension, as a 2D matrix (an image) whose height is the number of previous logical timesteps that are tracked and whose width is the number of resource units of the machine.

  2. Within each machine, the vertical direction of the image (i.e., the matrix) represents the time axis and shows the utilization of jobs over the tracked history of previous logical timesteps, and the horizontal direction represents the amount of resource used by each job or service (quantized into units of resources). This type of representation helps the DNN-based RL-agent learn the temporal resource usage characteristics of each job. The history length is a configurable parameter that the user can choose. It should be reasonably large relative to the scheduling time-scale so that the agent can capture significant overlap among applications as well as temporal variations in resource usage. However, a longer history results in a longer convergence time for the RL-agent.

  3. For each machine, the number of pixels in the horizontal direction represents the resource capacity of that machine for the given resource dimension. This width is another configurable parameter that the user can choose depending on the granularity of resource usage that needs to be tracked. A larger width results in a longer convergence time for the RL-agent.

  4. After each machine, there is a column representing the applications or jobs scheduled and waiting to be run in the machine. This representation is important for DeepPlace to take multiple decisions in the same logical timestep. Even if the machine representation is not showing the resource usage of the scheduled application (as time has not proceeded), the column will give an insight to the agent that in the next timestep the application will be running in that machine and hence helps in taking the next action in the same logical timestep.

  5. The pixels of the image representing machine states (i.e., the values in the matrix) are colored differently to denote how much of the available capacity of the machine is being used at what time by which job. To make DeepPlace scalable, we assume that DeepPlace attempts to learn the characteristics of up to a fixed number of types or equivalence classes of applications, and each type of application is represented by a unique number between 0 and 1, both exclusive (this is analogous to a different color of the corresponding pixels in the image). Unused resources are marked in white (a value of 0 in the matrix). DeepPlace uses these colors (i.e., the numbers) to learn which types of applications, when run together, can potentially suffer from resource competition and for how long such competition might last.

  6. There can be multiple instances of the same application type running on the same machine with an overlap in their duration (e.g., two instances of a face-detection service triggered by two different products). These instances can also create resource contention among themselves (e.g., when the application is highly CPU intensive) and therefore need to be distinguished and captured by the RL-agent. We again assign a different color (i.e., a floating-point number) to each instance of the application; these colors are unique but lie within a small range around the color originally assigned to that job type.

  7. DeepPlace captures the state of individual machines and combines these machine-level state representations into a cluster-level state representation to create a holistic input for the policy network. To clearly distinguish between the applications running on different machines, DeepPlace adds a different offset for each machine to the numbers (colors) in that machine’s usage image as well as in the column containing its scheduled jobs. For example, if two instances of the same type of application, with assigned number representation 0.2, are running on two different machines (machines 1 and 2), then in the combined state-space representation these tasks are represented as 1.2 and 2.2, respectively.

  8. Along with the combined representation of the machines, DeepPlace also keeps a waiting-queue in its state-space representation. This queue represents the tasks waiting to be scheduled. By observing the changes in the queue over time, the RL-agent learns some key dynamics about the arrival characteristics of the jobs, which type of and how many jobs come together, and the temporal dependency amongst them, as previously discussed.
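The machine-level and cluster-level encoding described in the list above can be sketched as follows for a single resource dimension. This is only an illustration under stated assumptions: the function names (`machine_image`, `add_machine_offset`, `cluster_state`), the dictionary-based inputs, and the choice to keep unused cells at 0 after the machine offset is applied are hypothetical, and the per-machine scheduled-jobs column and the waiting queue are omitted for brevity.

```python
def machine_image(job_usage, job_color, capacity_units):
    """Build one machine's history image for a single resource dimension.

    job_usage: {job_name: [units used at each past timestep]}
    job_color: {job_name: float strictly between 0 and 1}
    Returns a list of rows (one per timestep), each `capacity_units` wide.
    """
    history = len(next(iter(job_usage.values()))) if job_usage else 0
    image = []
    for t in range(history):
        row = []
        for job, series in job_usage.items():
            row.extend([job_color[job]] * series[t])
        # Demand beyond capacity is clipped; unused capacity stays 0.0 (white).
        row = row[:capacity_units]
        row.extend([0.0] * (capacity_units - len(row)))
        image.append(row)
    return image


def add_machine_offset(image, machine_index):
    """Shift used cells by the machine index, so 0.2 on machine 1 becomes 1.2."""
    return [[cell + machine_index if cell else 0.0 for cell in row] for row in image]


def cluster_state(images):
    """Place the per-machine images (1-based machine index) side by side."""
    shifted = [add_machine_offset(img, m) for m, img in enumerate(images, start=1)]
    return [sum((img[t] for img in shifted), []) for t in range(len(shifted[0]))]


# Two machines running the same two hypothetical job types over two timesteps.
img = machine_image({"a": [2, 0], "b": [1, 1]}, {"a": 0.2, "b": 0.4}, 4)
for row in cluster_state([img, img]):
    print(row)
```

Flattening the resulting rows yields a fixed-length vector that a policy network can consume.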

3.2. Reward/Penalty Design

DeepPlace is driven by negative rewards (penalties), which have the following four components:

Resource contention penalty. To help DeepPlace learn a placement policy that results in better (complementary) resource alignment and avoids resource contention among tasks scheduled on the same machine, we use a modified version of cross-correlation to penalize the RL-agent during its learning. Cross-correlation is calculated between all pairs of tasks running on the same machine, for each resource dimension, from the length of each task and its instantaneous resource demand along that dimension at each point in time. The cross-correlation formula amplifies the effect of two peaks being scheduled together. The penalty for a particular state of the cluster is calculated by summing the cross-correlation of each machine, which includes, across all the resource dimensions (CPU and memory), the cross-correlation of each task with every other task on that machine.
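The exact modified cross-correlation is not reproduced here, so the following is only a sketch of the idea under a simplifying assumption: a plain, unnormalized, zero-lag cross-correlation over the timesteps where two tasks overlap. The function names and the nested list/dictionary layout of `cluster` are hypothetical.

```python
from itertools import combinations


def cross_correlation(demand_a, demand_b):
    """Zero-lag, unnormalized cross-correlation over the overlapping timesteps."""
    return sum(a * b for a, b in zip(demand_a, demand_b))


def contention_penalty(cluster):
    """Sum pairwise cross-correlations per machine and per resource dimension.

    cluster: list of machines; each machine is a list of tasks; each task maps
    a resource name (e.g. "cpu") to its demand time series.
    """
    total = 0.0
    for tasks in cluster:
        for task_a, task_b in combinations(tasks, 2):
            for resource in task_a.keys() & task_b.keys():
                total += cross_correlation(task_a[resource], task_b[resource])
    return -total  # negative reward: more overlapping demand, larger penalty


# Two tasks whose peaks coincide versus two tasks whose peaks alternate.
aligned = [[{"cpu": [4, 0, 4, 0]}, {"cpu": [4, 0, 4, 0]}]]
staggered = [[{"cpu": [4, 0, 4, 0]}, {"cpu": [0, 4, 0, 4]}]]
print(contention_penalty(aligned), contention_penalty(staggered))
```

Because the terms are products of instantaneous demands, coinciding peaks dominate the sum, which matches the amplification effect described above.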

Resource over-utilization penalty. To prevent scheduling more tasks than a machine can handle, there is a high penalty if a machine is not able to meet the resource requirements of the tasks scheduled on it. It is calculated by adding a large negative factor each time a machine is unable to provide the required resources to its running tasks.

Wait-time penalty. To prevent the scheduler from holding jobs for a long time in search of a better placement, we add a penalty proportional to the size of the waiting queue. At each timestep, it is equal to the number of waiting tasks in the queue multiplied by a negative constant.

Under-utilization penalty. Since our goal is to improve the overall utilization of the cluster by helping the scheduler learn to achieve tighter packing and, if possible, to pack onto fewer machines, we add a penalty proportional to the sum of unused resources in the used machines. White pixels in our state representation denote the amount of unused resources at any given time.
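The over-utilization, wait-time, and under-utilization components can be combined per timestep roughly as sketched below. The weights (`over_weight`, `wait_weight`, `idle_weight`), the function name, and the single-resource view are illustrative assumptions; the constants actually used by DeepPlace are not given in the text.

```python
def step_penalty(machine_usage, capacity, queue_length,
                 over_weight=100.0, wait_weight=1.0, idle_weight=0.1):
    """Combine over-utilization, wait-time, and under-utilization penalties.

    machine_usage: resource units demanded on each machine at this timestep.
    capacity: resource units available on every machine.
    queue_length: number of tasks still waiting to be placed.
    """
    penalty = 0.0
    for used in machine_usage:
        if used > capacity:
            # The machine cannot meet the demand of its tasks.
            penalty -= over_weight
        elif used > 0:
            # Idle units count only on machines that are actually in use.
            penalty -= idle_weight * (capacity - used)
    # Every waiting task costs a constant amount at each timestep.
    penalty -= wait_weight * queue_length
    return penalty


# One overloaded machine, one partially used machine, one empty machine.
print(step_penalty([12, 6, 0], capacity=10, queue_length=3))
```

Choosing a large over-utilization weight relative to the other two reflects the "high penalty" described above for placements a machine cannot sustain.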

4. Implementation

We use the modified version of the REINFORCE algorithm described in (deeprm). The policy network consists of a single hidden layer of 20 neurons followed by an output layer with one neuron per action (i.e., per machine under consideration). We use a 36-core CPU server and Python multiprocessing to create multiple workers (batch size + 1 in total), each operating on distinct examples, taking a fixed number of trajectories and accumulating gradients. The last worker combines the gradients of the other workers and sends them to the policy network to update its parameters. This gives a major improvement in training speed. The training time increases significantly as we increase the cluster load. It also depends significantly on the type of applications under consideration (for example, long-running vs. short-running jobs). For the hidden layer we use the ReLU activation function, while for the output layer we use softmax activation. We use the Adam optimizer with a learning rate of 0.001. The number of trajectories taken by each worker is fixed at 20.
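A forward pass through a policy network of the shape described above (one ReLU hidden layer of 20 neurons, softmax over machines) might look like the following dependency-free sketch. The weight initialization, the absence of bias terms, the flattened input vector, and all function names are assumptions made for illustration; training with REINFORCE and Adam is not shown.

```python
import math
import random


def init_policy(n_inputs, n_machines, n_hidden=20, seed=0):
    """Small random weights for the input->hidden and hidden->output layers."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_machines)]
    return w1, w2


def action_probabilities(state, policy):
    """Forward pass: ReLU hidden layer, then softmax over machines."""
    w1, w2 = policy
    hidden = [max(0.0, sum(w * x for w, x in zip(row, state))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(state, policy, rng=random):
    """Sample the machine index for the next job according to the policy."""
    probs = action_probabilities(state, policy)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical flattened state of six cells and a cluster of 10 machines.
policy = init_policy(n_inputs=6, n_machines=10)
print(sample_action([0.2, 0.0, 1.2, 0.0, 2.2, 0.4], policy, random.Random(1)))
```

Sampling from the softmax output, rather than always taking the most probable machine, is what lets a policy-gradient method explore alternative placements during training.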

Figure 4. Convergence of DeepPlace’s training under 50% average load
Figure 5. Comparison of number of machines used
Figure 6. Comparison of over-utilization in the cluster
(a) CPU utilization (b) Memory utilization
Figure 7. Comparison of average resource utilization in the cluster
(a) CPU fragmentation (b) Memory fragmentation
Figure 8. Comparison of resource fragmentation level in the cluster

5. Evaluations

[Workload.] In our evaluation setup, jobs arrive online as a Poisson process. The average job arrival rate is calibrated to create three average cluster load scenarios: 30%, 50% and 80%. In our setup, 50% of the jobs are long running and the other half are short running. Each job has two dimensions of resource requirements: CPU and memory. The capacity of each of these two resources in each machine is denoted by $r$. For each job, the dominant resource is randomly chosen to be either CPU or memory. The usage of the dominant resource is independently chosen from a uniform distribution between $0.3r$ and $0.5r$. The non-dominant resource usage is also independently and uniformly chosen between $0.08r$ and $0.16r$. Thus there is no correlation between the CPU and memory usages. The temporal resource usage of each job varies as a square wave with a period uniformly chosen between $0.2t$ and $0.5t$ and a width of one-fourth of the period, where $t$ denotes the job length. A total of 50 such jobs are used for training and 18 for testing. Our evaluation runs on a cluster of 10 machines.
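A synthetic workload following the distributions above could be generated along these lines. Several details are assumptions because the text does not specify them: usage drops to a `base` level of zero outside the pulse, both resources of a job share the same period, and the function names and dictionary layout are hypothetical.

```python
import random


def square_wave(length, period, peak, base=0.0):
    """Usage is `peak` for the first quarter of every period and `base` otherwise."""
    width = period / 4.0
    return [peak if (step % period) < width else base for step in range(length)]


def make_job(length, capacity, rng):
    """Draw one synthetic job following the distributions described above."""
    dominant = rng.choice(["cpu", "memory"])
    other = "memory" if dominant == "cpu" else "cpu"
    period = rng.uniform(0.2 * length, 0.5 * length)
    return {
        dominant: square_wave(length, period, rng.uniform(0.3, 0.5) * capacity),
        other: square_wave(length, period, rng.uniform(0.08, 0.16) * capacity),
    }


def arrival_times(rate, n_jobs, rng):
    """Poisson arrivals: inter-arrival gaps are exponential with the given rate."""
    now, times = 0.0, []
    for _ in range(n_jobs):
        now += rng.expovariate(rate)
        times.append(now)
    return times


rng = random.Random(0)
job = make_job(length=20, capacity=10.0, rng=rng)
print(job["cpu"][:8], arrival_times(rate=0.5, n_jobs=3, rng=rng))
```

Tuning `rate` against the job lengths and per-job demand is one way to reach a target average cluster load such as 30%, 50%, or 80%.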

[Baselines.] We compare DeepPlace with Tetris (grandl2015multi), which schedules jobs on machines based on how well a job’s resource requirements align with the machine’s available resources, balancing a preference for short jobs against packing in a combined score. We also compare it against the Best Fit heuristic, which allocates a job to the machine that has the fewest units of the job’s dominant resource left.
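The Best Fit baseline described above can be sketched as follows. The feasibility filter (only machines that can currently hold the job are considered), the function name, and the dictionary layout are assumptions added for illustration; the text only states that the machine with the least remaining dominant resource is chosen.

```python
def best_fit(job_demand, dominant, free):
    """Return the index of the feasible machine with the least dominant resource left.

    job_demand: {"cpu": units, "memory": units} requested by the job.
    dominant: name of the job's dominant resource.
    free: list of {"cpu": units, "memory": units} still available per machine.
    Returns None when no machine can host the job.
    """
    feasible = [
        index for index, machine in enumerate(free)
        if all(machine[res] >= need for res, need in job_demand.items())
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda index: free[index][dominant])


# Three hypothetical machines with different amounts of free capacity.
free = [{"cpu": 8, "memory": 8}, {"cpu": 4, "memory": 6}, {"cpu": 2, "memory": 9}]
print(best_fit({"cpu": 3, "memory": 1}, "cpu", free))
```

A heuristic of this kind looks only at the demand presented at placement time, so it has no view of later peaks in a job's usage.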

Note: It is not possible to directly compare DeepPlace with DeepRM (deeprm) because DeepRM only specifies which job to be scheduled next and does not say on which machine it should be scheduled. Thus DeepRM does not have any concepts of competition for resource usage among applications running in the same machine, resource fragmentation and machine-level over-utilization. Thus, a fair comparison with DeepRM with respect to our desired metrics is not possible.

[Learning Progress.] We first show how DeepPlace’s learning converges across multiple iterations in Figure 4. It can be observed that after roughly 1000 iterations, DeepPlace’s policy learning starts to converge and there is no further significant drop in the normalized penalty.

[Improvement in Cluster Utilization.] We measure the average utilization of machines for each resource over the length of the observation period. Since the number of machines that are actively in use varies over time, we normalize by using the maximum number of machines used at any point in time in the denominator.
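One plausible reading of this metric, for a single resource, is sketched below: total resource consumed over the observation period, divided by the capacity of the largest number of machines that were in use at any one timestep. The exact formula is not reproduced here, so the normalization, the function name, and the `usage[t][m]` layout are assumptions.

```python
def average_utilization(usage, capacity):
    """Average utilization of one resource over the observation period.

    usage: usage[t][m] is the resource consumed on machine m at timestep t.
    capacity: resource capacity of a single machine.
    """
    period = len(usage)
    # Largest number of machines in use (non-zero usage) at any timestep.
    max_used = max((sum(1 for u in row if u > 0) for row in usage), default=0)
    if period == 0 or max_used == 0:
        return 0.0
    total = sum(sum(row) for row in usage)
    return total / (period * max_used * capacity)


# Two timesteps on a three-machine cluster; at most two machines are ever used.
print(average_utilization([[5, 5, 0], [10, 0, 0]], capacity=10))
```

Under this reading, a scheduler that packs the same work onto fewer machines obtains a higher score, because the denominator shrinks with the peak machine count.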

[Comparing Scheduling Efficiency.] In Figure 7, we show how DeepPlace optimizes cluster utilization for both CPU and memory. We can see that DeepPlace provides a 68-100% increase in average utilization compared to Tetris across different cluster-load conditions. This is primarily achieved by efficient packing that requires significantly fewer machines than Tetris, as shown in Figure 5. Further, the gap between DeepPlace and Tetris in the number of machines required to accommodate the jobs grows as the cluster load increases. BestFit appears to provide even higher utilization, but only because it packs the jobs into the machines without any knowledge of their peak or future resource usage; as a consequence, BestFit suffers from severe over-utilization of resources, as shown in Figure 6. In contrast, over-utilization due to DeepPlace’s placement decisions is almost negligible. Tetris already includes peak resource usage information in its placement decisions, resulting in no resource over-utilization.

[Improvement in Resource Fragmentation.] At a high level, the fragmentation score of a cluster measures what part of all the available resources in the cluster is concentrated.

The lower the fragmentation score, the higher the ability of the cluster to schedule unanticipated large jobs. Hence, low resource fragmentation in the cluster is a desirable operational property. In Figure 8, we see that DeepPlace provides a 6-13% reduction in resource fragmentation compared to Tetris. DeepPlace’s intelligent placement, which takes both temporal resource usage characteristics and job arrival patterns into account, leaves more room in the machines (i.e., a lower fragmentation score) to accommodate unanticipated large jobs.

6. Discussions

(a) Learned example 1
(b) Learned example 2
Figure 9. Examples of learned placement policies

In this section, we discuss insights and applicability for real deployments.

What did DeepPlace learn? Figure 9 illustrates how DeepPlace achieved better packing that ultimately resulted in higher overall utilization. Figure 9(a) shows how Job1 and Job2 were placed on the same machine because the resource-intensive parts of Job1 finish before the resources are required by Job2. In Figure 9(b), the resource requirements of Job3 and Job4 alternate in such a manner that they do not overlap with each other, and thus the two jobs were placed on the same machine for better packing. All these patterns were learned by DeepPlace on its own without any guiding rules.

Scheduling granularity for effectiveness. DeepPlace decides where to schedule an incoming application so that it can either improve resource utilization or reduce resource contention. However, how often such a placement decision needs to be made depends on the kind of workload the cluster is handling. For a cluster handling short- or medium-duration batch, cron, or interactive applications, frequent placement decisions need to be made and DeepPlace can be very useful. On the other hand, for long-running services, new placement decisions are typically made less frequently, e.g., when the container for an upgraded service is being deployed. However, if auto-scaling is enabled for these services, DeepPlace can suggest where the additional auto-scaled containers should be placed in the cluster.

Cluster size. The size of our input-space representation, as well as of the RL action space, is proportional to the number of machines in the cluster. Hence, the larger the cluster, the more iterations and training examples the policy learning needs to converge.

Bootstrapping learning in deployments. DeepPlace uses historical time-series patterns of resource usage to learn which job should be scheduled on which machine so that, based on their resource usage characteristics, the co-located jobs either improve the overall utilization or avoid aggravating contention by using the same resource at the same time. If DeepPlace starts learning from scratch, it can take a long time before it sees enough examples for its learning to converge. One option to speed up learning is to bootstrap the RL-agent’s policy by replaying the time series of historical resource usage through a simulation.

To conclude, in this paper we show an early design prototype of a self-learning scheduler that can exploit the temporal resource usage patterns and arrival dependencies of the jobs to provide a better placement policy and thus achieve better utilization without requiring any manually crafted rules or heuristics.