In a hybrid cloud (supercloud) (Elkhatib, 2016), CSPs might use each other’s resources to offload peak loads or to move processing closer to the end clients. Our principal contribution is an algorithm and a prototype implementation of a distributed broker that fairly balances the load between independent CSPs.
(Author’s version of a paper accepted to Cross-Cloud 2018 (EuroSys Workshop). The final version is available at ACM via https://doi.org/10.1145/3195870.3195873)
Our notion of fairness stems from the Shapley value, a fairness concept widely used in game theory and economics. Informally, the Shapley value of an agent is equal to her relative contribution to the common good. The goal of our system is, essentially, to reward CSPs that provide resources when these resources are truly needed, i.e., when others indeed use them. The reward consists of priority treatment of loads of such “accepting” CSPs when they, in turn, get overflowed or prefer non-local processing.
However, our method might also be used for balancing the loads and excess capacities directly between the clients of CSPs. Usually, when the load is low, a client downscales its rented resources. However, in some situations downscaling is hard or sub-optimal: a client might have a long-term rental agreement with a CSP; or it might prefer to keep renting resources to get a rebate on the rental rate; or it might have local, bare-metal resources. In such situations, clients could use our methods to form grassroots load-balancing agreements, trading their excess capacities (however, to simplify the presentation, even in such scenarios we refer to the parties trading capacities as CSPs).
Our method is non-monetary, in contrast to monetary spot markets used now to trade excess, short-term CSP capacity by some providers (Kash et al., 2016). A client could use spot instances to dynamically migrate between CSPs (Jia et al., 2016). However, spot markets require the CSP to set and dynamically manipulate the price, which is a non-trivial problem.
Our broker is decentralized. Individual CSPs run their local brokers, which communicate with each other through a coordination layer. When a CSP wants to migrate a task, the broker submits the task to the global queue. When another CSP has free resources, its broker uses our ClusterShare algorithm to choose tasks from the global queue. The cloud API is not exposed outside the CSP: it is only accessed by the broker (and perhaps by other local submission systems: we do not require exclusive access).
The local broker is implemented as a standalone Java application with an embedded HTTP server. The prototype supports executing jobs on a local CPU or through Kubernetes, but it could be extended to support other APIs (e.g., Slurm) by implementing another driver.
The paper is organized as follows. We start by discussing related work in Section 2. We then formalize our resource management model, show how to apply Shapley value to cross-cloud load balancing and propose the load balancing algorithm in Section 3. In Section 4 we describe our prototype implementation. In Section 5 we show results of simulation experiments.
2. Related work
FairShare (Kay and Lauder, 1988) is arguably the most popular approach to fair scheduling. Fairness is based on predefined shares assigned to each user (or a group). A task’s priority is proportional to this share and inversely proportional to the actual (consumed) share.
In this paper, we adapt DirectContr, a heuristic proposed in (Skowron and Rzadca, 2014), to a decentralized environment and to a cloud computing scenario.
The notion of contribution in DirectContr is similar to reputation used in OurGrid (Andrade et al., 2005). However, our method allows sites to choose tasks (rather than being requested to do some). Moreover, our method is based on and tested against a notion of fairness widely accepted in other fields.
We deliberately focus on a single issue: fairness in cross-site scheduling. To construct a viable prototype, we do not address many orthogonal problems. For instance, we use standard container repositories to instantiate a task, while a complete system should use cross-cloud data storage such as (Rafique et al., 2017); or even consider VM migration between perhaps binary-incompatible CSPs. Similarly, we do not consider the problem of different APIs—we assume that each resource is represented by a driver exposing common functionality (which in a complete system requires a cloud orchestration tool (Baur and Domaschka, 2016)).
3. Theory: Model and Algorithms
This section introduces the theoretical motivation, the algorithm and the architecture of ClusterShare.
3.1. Vocabulary and Assumptions
We call a federation a system composed of multiple CSPs that balance the load. We also call an individual CSP an organization (a term common in theoretical works). Each CSP (organization) has a certain number of machines that correspond to physical machines or VMs rented long-term. We assume a machine has a certain number of CPU cores (to simplify our theoretical model, we assume that the CPU is the sole resource; however, it is easy to generalize our approach to multiple resources). Each CSP processes jobs that are initially submitted locally to this CSP (e.g., by the end clients). We consider an on-line problem with non-zero release dates: a job is not known until the moment it is submitted to a CSP. Each job declares the number of CPU cores it requires exclusively (such a declaration is equivalent to, e.g., VM capacity or resource requirements in Kubernetes). Each job will be executed on a single machine, but a single machine can execute multiple jobs at the same time (with no overbooking of the available cores). Jobs have finite duration, but the scheduler does not know a job’s duration until the job completes (a non-clairvoyant problem). Each organization uses a utility function as a performance measure (e.g., the average flow time).
Cooperation of CSPs requires some level of trust. We assume that a CSP does not try to tamper with jobs, i.e., all results are genuine outcomes of job execution. In general, verifying a result may require performing the same computation again. Therefore, it does not make sense to ask an untrusted party to run a job. Similarly, all metadata (e.g., job start and completion times) must be true. An organization might simulate a long execution of a job by delaying the result announcement and supplying false time stamps. This could artificially increase the priority of the organization and it would be hard to detect, as it is difficult to predict the duration of a job based solely on its definition. We also assume that CSPs do not alter broker implementations or the data (these problems are orthogonal to the main issue).
3.2. Fairness based on Shapley Value
Following our earlier theoretical works (Skowron and Rzadca, 2013, 2014), we base our notion of fairness on the Shapley value. The main difference from FairShare (see Section 2) is that share entitlements are not predefined. Instead, they depend on a CSP’s impact on the federation. The aim is to promote organizations which provide resources when they are needed by assigning a higher priority to their jobs. The calculation of target shares is based on the game-theoretical concept of the Shapley value.
3.2.1. Shapley value
A concept from game theory, the Shapley value (Shapley, 1988) can be interpreted as a value that a member brings to the community (a coalition $\mathcal{C}$). The formulation assumes there is a characteristic function $v$ which assigns a value to every subset of possible coalition members, $v: 2^{\mathcal{C}} \to \mathbb{R}$. The Shapley value of organization $O$ is:

$$\phi_O(v) = \sum_{S \subseteq \mathcal{C} \setminus \{O\}} \frac{|S|!\,(|\mathcal{C}| - |S| - 1)!}{|\mathcal{C}|!} \bigl( v(S \cup \{O\}) - v(S) \bigr) \tag{1}$$

Thus, the Shapley value of $O$ is essentially its average marginal contribution to the coalition value: the difference between the value of the characteristic function for a subset including the member and the same subset excluding the member. It has the desirable properties of efficiency, symmetry and linearity, and it assigns $0$ to the members who do not contribute anything to the coalition.
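As a small worked example (the numbers are illustrative assumptions, not taken from the experiments), consider two organizations $A$ and $B$. Write $v$ for the characteristic function and $\phi$ for the Shapley value, and let $v(\emptyset) = 0$, $v(\{A\}) = 1$, $v(\{B\}) = 2$ and $v(\{A,B\}) = 5$. Averaging the marginal contributions over the two possible joining orders gives:

```latex
\begin{align*}
\phi_A &= \tfrac{1}{2}\bigl(v(\{A\}) - v(\emptyset)\bigr) + \tfrac{1}{2}\bigl(v(\{A,B\}) - v(\{B\})\bigr)
        = \tfrac{1}{2}(1 - 0) + \tfrac{1}{2}(5 - 2) = 2, \\
\phi_B &= \tfrac{1}{2}\bigl(v(\{B\}) - v(\emptyset)\bigr) + \tfrac{1}{2}\bigl(v(\{A,B\}) - v(\{A\})\bigr)
        = \tfrac{1}{2}(2 - 0) + \tfrac{1}{2}(5 - 1) = 3.
\end{align*}
```

The two values sum to $v(\{A,B\}) = 5$, which illustrates the efficiency property.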
As an illustration, for simplicity assume that the characteristic function is the difference between the number of completed and submitted jobs. Two events change the Shapley value: submitting a job or completing it. An organization which does not submit or complete any tasks will have zero Shapley value. An organization that has completed more tasks than it (locally) submitted will have a positive Shapley value; and an organization that only submits tasks, but does not accept any, will have a negative Shapley value.
A scheduling algorithm uses the Shapley value as a benchmark. Ideally, the value of an organization’s utility function should be equal to its Shapley value. However, as the problem is discrete, it might not be possible to achieve such a schedule (e.g., an organization does not submit any jobs, but accepts jobs from others). Thus, the goal is to construct a schedule with utilities as close to Shapley values as possible (see (Skowron and Rzadca, 2013) for a more formal discussion).
3.2.2. Utility/characteristic functions
The Shapley value relies on the utility function quantifying the quality of the schedule from the coalition’s perspective.
In our previous work (Skowron and Rzadca, 2013), we proposed a non-manipulable utility function that is the sum of utilities of individual jobs. The utility of a job is computed from the time the job started, the time the job ended, the current time, and the number of processor cores used (the last being an extension we add here, as (Skowron and Rzadca, 2013) considered only sequential jobs). The utility is proportional to the duration and depends on the start time.
The dependence on the start time expresses the time elapsed since the job was run. The motivation was to capture the utility of having the same amount of work done faster. However, in a long-running system, it might be impossible to “make up” for sub-optimal decisions taken at the beginning of the schedule. Therefore, we also test a slightly altered utility, in which the release time neutralizes this effect.
Finally, we also consider a function that sums the surface of executed jobs, i.e., the number of cores multiplied by the execution time (with no reward for executing a job earlier).
While this last function does not adequately express utility, it is reasonable for expressing contribution: the effort of a site that accepts non-local jobs.
3.2.3. DirectContr: Scheduling based on Shapley Value
Calculating the Shapley value for an organization is NP-hard and hard to approximate (Skowron and Rzadca, 2013). We proposed a fast heuristic called DirectContr (Skowron and Rzadca, 2014). Instead of computing the Shapley value from the definition (Eq. 1), the algorithm estimates the contribution of an organization by summing utilities from jobs executed on that organization’s resources. The algorithm works as follows. Each organization submits its jobs to its own queue. Each time a processor becomes available, the algorithm selects the organization that has the highest difference between its contribution and its utility; we call this difference the priority (if there are multiple free processors, the algorithm selects one randomly). Then, it executes the first job from this organization’s queue.
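The selection step described above can be sketched as follows. This is a minimal illustration, not the code of the prototype: the `Org` class, its field names and the random tie-breaking between organizations with equal priority are assumptions made for this example.

```python
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Org:
    """Hypothetical per-organization state kept by the scheduler."""
    name: str
    contribution: float = 0.0  # sum of utilities of jobs run on this org's resources
    utility: float = 0.0       # sum of utilities of this org's own jobs
    queue: deque = field(default_factory=deque)  # waiting jobs, in submission order

    @property
    def priority(self) -> float:
        # Priority is the difference between contribution and utility.
        return self.contribution - self.utility


def pick_next_job(orgs, rng=random):
    """Return (org, job) for the waiting org with the highest priority, or None."""
    waiting = [o for o in orgs if o.queue]
    if not waiting:
        return None
    best = max(o.priority for o in waiting)
    # Tie-breaking between equal priorities is an assumption of this sketch.
    chosen = rng.choice([o for o in waiting if o.priority == best])
    return chosen, chosen.queue.popleft()
```

Organizations with an empty queue are skipped, so a high-priority organization that currently has nothing to run does not block the others.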
By simulation, in (Skowron and Rzadca, 2014) we showed that the “unfairness” of the resulting schedule is relatively close to that of the exact, exponential algorithm, and significantly lower than that of FairShare. This result can be intuitively explained with an example (see Figures 2 and 3). Consider two organizations A and B, each with a single machine. A submits a job at times 0, 2, 4, 6 and 8. B submits two jobs at time 4. Consider the situation at time 4. In DirectContr (Figure 2), the priorities of A and B are equal: jobs were executed on local resources, thus both organizations have 0 contribution. In contrast, in FairShare, B has higher priority: both organizations have the same predefined share but all completed jobs belong to organization A. A FairShare scheduler would decide to start both jobs submitted by B immediately (Figure 3). As a result, jobs A3, A4 and A5 are delayed by 5 time units compared to the DirectContr schedule. Moreover, we could multiply the number of processors or extend the job duration to obtain an arbitrarily large total delay of organization A’s jobs.
3.3. Distributed Scheduling in ClusterShare
ClusterShare adapts DirectContr to the federated cloud infrastructure. An organization (a CSP) is represented in the system by a broker, responsible for tracking local resources, submitting local tasks to the federation, selecting and executing foreign tasks on the local resources.
ClusterShare keeps the state of the system in a coordination layer, a distributed data structure shared across brokers. The coordination layer keeps track of non-local jobs’ life-cycle and execution parameters.
In contrast to DirectContr, scheduling decisions in ClusterShare are distributed: each broker reacts to events independently. This approach has several advantages: (i) the decision to expose local resources might be taken dynamically; (ii) local resource schedulers may pursue custom goals like power efficiency; (iii) resources can be exposed to the federation through existing interfaces.
ClusterShare is event-based. Every broker handles events sequentially in order of their appearance. The following events are handled:
1. a new task is submitted by a local user;
2. a new task is submitted to the federation;
3. a local resource is ready to execute a task;
4. a local resource completed a task;
5. a site left the federation.
When a task is submitted by a local user (event 1), the broker first checks whether the site has enough resources to run the job locally. If there is enough capacity and no other tasks are waiting then the task is delegated to a free machine directly. Otherwise the broker publishes the task in the coordination layer.
When a new task is published in the coordination layer (event 2), the broker propagates the event to every configured resource handler. Some handlers will wait until there is enough capacity to run the job; others will submit a pilot job in order to acquire it.
Once a resource handler is ready to execute a job (event 3), it notifies the broker, which picks a federation job (we show the algorithm below). In our implementation, a local machine is exposed to the federation (and thus event 3 is produced) when two conditions are met: (i) the overall reservation ratio is below a certain threshold (e.g.: 30% of the total CPU power is not reserved by currently executing tasks); (ii) there is at least one machine capable of executing a task. The threshold is used to neutralize uncertainty resulting from monitoring delay. This strategy is especially useful for resources which discard jobs that they cannot accept due to insufficient capacity rather than appending them to a queue (like Kubernetes).
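The two exposure conditions above could be checked as in the sketch below. The `Machine` class, the parameter names and the default values are hypothetical; in particular, reading "capable of executing a task" as "has at least `min_free_cores` unreserved cores" is an assumption of this example.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical view of one local machine as seen by a resource handler."""
    total_cores: int
    reserved_cores: int

    @property
    def free_cores(self) -> int:
        return self.total_cores - self.reserved_cores


def can_expose(machines, max_reserved_ratio=0.7, min_free_cores=1):
    """True when local resources may be offered to the federation (event 3)."""
    total = sum(m.total_cores for m in machines)
    if total == 0:
        return False
    reserved = sum(m.reserved_cores for m in machines)
    # (i) the overall reservation ratio is below the threshold, e.g. 0.7
    #     means that at least 30% of the CPU capacity is not reserved.
    below_threshold = reserved / total < max_reserved_ratio
    # (ii) at least one single machine can host a task on its own.
    has_capable_machine = any(m.free_cores >= min_free_cores for m in machines)
    return below_threshold and has_capable_machine
```

Condition (ii) is checked per machine because a job runs on a single machine: free cores scattered over several machines do not help.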
Continuing with the handling of event 3, the broker first loads all waiting tasks from the coordination layer and filters those that match the offer. Then, a task is chosen according to a strategy based on DirectContr, i.e., first, an organization is picked according to the priority (the difference between the contribution and the organization’s utility); then, the longest waiting task of this organization is chosen. Once the job is selected, the broker registers itself as the provider via the coordination layer and initiates job execution.
When a task ends (event 4), a resource handler notifies the broker. The broker updates the job status in the coordination layer and saves scheduling-relevant parameters such as the start time, the end time and the job definition.
When an organization loses its connection to the federation (event 5), jobs that were being computed by the disconnected organization are treated as if they were just submitted to the federation, to ensure that they will be rescheduled. Jobs submitted by the lost site are removed, apart from those which are already being executed.
In this section we describe how we map the architecture and the algorithm described above to a hybrid cloud infrastructure. Our implementation is composed of three logical layers: (i) the coordination layer responsible for storing the data shared across CSPs; (ii) the application layer implementing the ClusterShare algorithm; and (iii) the external layer abstracting the resources our algorithm manages.
The coordination layer keeps track of non-local jobs’ life-cycle and execution parameters. To increase resilience, we use Apache ZooKeeper (Hunt et al., 2010), a well-known, distributed open-source coordination service. ZooKeeper servers can run on the same machines that organizations use to expose their brokers or they can be deployed on separate machines. Brokers use ZooKeeper clients to communicate with ZooKeeper servers. It is easy to add new CSPs to the hybrid cloud: all a new member has to do is to connect to the already established ZooKeeper ensemble.
The application layer is composed of symmetrical brokers—each broker represents a CSP. A broker manages only the resources of its CSP. Brokers use the coordination layer to communicate across CSP boundaries. A broker also exposes an HTTP interface for accepting jobs from its local users.
Each of a CSP’s resources (e.g., a Kubernetes-managed cluster, or a Slurm-managed cluster) is represented through a resource handler with a common interface.
To test various scheduling algorithms, we abstract any algorithm through an interface with a single method. The method, given a collection of jobs, selects the one with the highest priority (priority calculation depends on the algorithm).
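A single-method abstraction of this kind might look as follows. The prototype itself is written in Java; this Python sketch only illustrates the idea, and the names `SchedulingAlgorithm`, `select`, `Job` and `OldestFirst` are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Collection, Optional


@dataclass(frozen=True)
class Job:
    """Hypothetical minimal job record."""
    job_id: str
    owner: str           # the submitting CSP
    submitted_at: float  # submission time


class SchedulingAlgorithm(ABC):
    """One method: given the waiting jobs, return the one to run next."""

    @abstractmethod
    def select(self, jobs: Collection[Job]) -> Optional[Job]:
        ...


class OldestFirst(SchedulingAlgorithm):
    """Trivial example policy: pick the longest-waiting job."""

    def select(self, jobs: Collection[Job]) -> Optional[Job]:
        return min(jobs, key=lambda j: j.submitted_at, default=None)
```

Because the broker only depends on `select`, policies such as a FairShare or round-robin variant can be swapped in without touching the rest of the broker.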
As ZooKeeper, our coordination layer, is not well-suited for storing large data, a scheduling algorithm periodically replaces the historic scheduling data with a summary that allows priorities to be calculated in the future without resorting to the original release/completion times.
The external layer includes resources managed by ClusterShare brokers and peripheral services (such as a broker’s client interface or container libraries). To instantiate our system, we focus on sharing Kubernetes-managed clusters. The broker uses Kubernetes to start a task, monitor its progress and also monitor the state of the resources (e.g., whether there are free resources to start a foreign task). We use a one-to-one mapping between ClusterShare tasks and Kubernetes job definitions: the container image name, command arguments, and resource requirements are copied directly. ClusterShare does not directly manage the containers, nor the eventual results. We envision a setup where CSPs share access to container image repositories and perhaps file repositories. An image should describe both the actual task and result delivery.
5. Simulation Experiments
To quantitatively evaluate ClusterShare, we performed simulation experiments in which we compared the performance of a number of algorithms.
As our goal was to evaluate the system in steady-state and on workloads lasting days rather than minutes, instead of emulation, we decided to implement a simple, event-based simulator. Our simulator replaces the original external and coordination layers of ClusterShare with fast, in-memory implementations; the broker module is the same as in the ClusterShare system. The simulator also allows us to submit tasks at each site according to a workload (log) and then to compute the performance of the schedule.
To compare how “good” a particular schedule is, for each CSP we compute the total wait time of tasks (the time between a task’s submission and its start). It is an intuitive measure: improved wait time can be the primary reason to federate in a hybrid cloud. However, the wait time might be sensitive to even minor changes in the schedule (for instance, a long task scheduled a unit of time earlier might delay a large number of other tasks). Moreover, rather than using the wait time directly, we compute the unfairness, or the distance between the resulting vector of waiting times and the perfectly fair vector (the Shapley value). This requires that the sum of utilities is constant (i.e., all possible schedules have the same total wait time; only the distribution of the wait time across CSPs changes), because it is interpreted as a characteristic value of a coalition (explained below), which should be fixed for a given log sample. For these reasons, we convert tasks in the logs by, first, replacing a $p$-processor task with $p$ tasks requiring a single processor; and then replacing a task lasting $h$ hours with $h$ tasks each lasting an hour. We stress that this processing is done to emphasize the differences between scheduling algorithm policies: this method only reduces the noise in the observed results.
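The log conversion can be sketched as below. The `Task` record is hypothetical, and two details are assumptions of this example: durations are whole hours, and every unit task keeps the owner and release time of the original task.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    """Hypothetical log entry; times are expressed in hours."""
    owner: str
    release: int
    cores: int
    duration: int


def to_unit_tasks(task: Task) -> List[Task]:
    """Split a task into single-core, one-hour tasks with the same total surface."""
    # cores * duration unit tasks: first one per processor, then one per hour.
    return [Task(task.owner, task.release, 1, 1)
            for _ in range(task.cores * task.duration)]
```

For instance, a 2-core task lasting 3 hours becomes 6 unit tasks, so the total surface (cores times hours) of the log is preserved.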
Given a log, for each tested algorithm we run $2^N - 1$ simulations (where $N$ is the number of CSPs). Thus, there are 5 simulations in which each CSP uses only its local resources; 10 simulations for pairs of collaborating CSPs, …, and one simulation for the grand coalition of 5 CSPs. Each simulation yields a vector (of length $N$) of total wait times. The sum of this vector is the characteristic value of the coalition, since we assumed that total wait time is our characteristic function. We use all vectors to compute the Shapley value for each coalition (Eq. 1). Thus we can evaluate the unfairness of a schedule by comparing total wait times of organizations with their Shapley values.
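Once a characteristic value is available for every non-empty coalition, the exact Shapley values can be computed directly from the definition. The sketch below assumes the coalition values are supplied as a dictionary keyed by `frozenset`; the function names are hypothetical and the simulator itself is not reproduced.

```python
from itertools import combinations
from math import factorial


def all_coalitions(players):
    """Every non-empty subset of players: 2^N - 1 coalitions for N players."""
    return [frozenset(c)
            for k in range(1, len(players) + 1)
            for c in combinations(players, k)]


def shapley_values(players, value):
    """Exact Shapley values; `value` maps each non-empty frozenset to a number."""
    n = len(players)

    def v(coalition):
        # The empty coalition has value 0 by convention.
        return value[coalition] if coalition else 0.0

    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # k is the size of the coalition that p joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (v(s | {p}) - v(s))
        phi[p] = total
    return phi
```

The enumeration is exponential in the number of CSPs, which is acceptable for the 5 CSPs used here but not for large federations.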
We aggregate results across logs as follows. We assign a score to each algorithm based on how it performs in comparison to other algorithms. The score is incremented every time an algorithm is more fair than another algorithm. For instance, if we test 3 algorithms A, B and C, and the simulator generated three schedules (one per algorithm) with respective unfairness of 60, 80 and 60, then we assign 1 point to A and 1 point to C (an alternative, the average deviation from the fair wait time distribution, might vary significantly across different logs).
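The pairwise scoring rule for a single log can be written in a few lines; the function name and the dictionary-based input are assumptions of this sketch.

```python
def score_algorithms(unfairness):
    """Give each algorithm one point per algorithm it strictly beats.

    `unfairness` maps an algorithm name to its unfairness on one log
    (lower means more fair). Ties give no points to either side.
    """
    scores = {name: 0 for name in unfairness}
    for a, ua in unfairness.items():
        for b, ub in unfairness.items():
            if a != b and ua < ub:
                scores[a] += 1
    return scores
```

With the values from the example above, `score_algorithms({"A": 60, "B": 80, "C": 60})` gives 1 point to A, 0 to B and 1 to C; the total score of an algorithm is the sum of such per-log scores.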
We use HPC2N, DAS2 fs0, LPC EGEE and MetaCentrum logs from (Feitelson et al., 2014). From each log, we take 20 randomly-chosen periods of 24 hours; we take all jobs submitted during these 24 hours, keeping their relative release dates. Each job is local to some CSP; as in a log each job has an owner user ID. We map these user IDs onto CSPs (and assign number of processors to each CSP) as follows:
- Scenario 1: CSPs have an equal number of processors; users are randomly assigned to CSPs.
- Scenario 2: the distribution of processors across CSPs follows the Zipf law; users are randomly assigned to CSPs.
- Scenario 3: CSPs have an equal number of processors; users are divided into two categories: ClusterShare users who submit to the broker (as in the previous two scenarios); and users local to each CSP, generating background load.
We compare a few variants of ClusterShare with more classic approaches:
- Original direct contribution (ORIG_DIRECT): implements a distributed version of DirectContr with the original utility function (Section 3.2.2).
- Relative direct contribution (REL_DIRECT): uses the utility adjusted by the release time.
- Simplified direct contribution (SIMPL_DIRECT): uses the simplified utility that sums the surface of executed jobs; this algorithm tests whether the direct contribution algorithm could be simplified without sacrificing fairness.
- FairShare (FAIRSHARE): uses shares proportional to the number of processors a CSP contributes. The algorithm measures the total processing time assigned to each organization (just as the simplified utility does). The CSP with the smallest ratio of utility to share has the highest priority.
- Round robin (ROUND_ROBIN): the implementation is based on LRU cache invalidation. The algorithm selects the CSP which has never been chosen yet or (if all were selected at least once) the one whose most recently started job was started least recently.
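The round-robin rule can be illustrated with a small LRU-style selector. The class and method names are hypothetical, and resolving ties by the order of the candidate list is an assumption of this sketch.

```python
class RoundRobin:
    """Pick the CSP whose most recently started job started least recently."""

    def __init__(self):
        # CSP name -> start time of its most recently started job
        self._last_started = {}

    def select(self, candidates, now):
        # A CSP that was never chosen sorts before all others (-inf);
        # ties are resolved by the order of `candidates`.
        chosen = min(candidates,
                     key=lambda c: self._last_started.get(c, float("-inf")))
        self._last_started[chosen] = now
        return chosen
```

Calling `select(["A", "B", "C"], t)` at times 0, 1, 2 and 3 returns A, B, C and then A again, since by then A’s last job is the one started least recently.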
[Table 1: scores of the algorithms; columns: Algorithm, Scenario 1, Scenario 2, Scenario 3]
We summarize the scores in Table 1. First, while ROUND_ROBIN scored significantly fewer points than the other algorithms, its result is not 0: on some logs and some scenarios, ROUND_ROBIN performs better than seemingly more fair algorithms.
Second, DirectContr produces more fair schedules than FairShare. The scores of SIMPL_DIRECT and REL_DIRECT were better than the scores of FAIRSHARE in every scenario and every analyzed configuration. The SIMPL_DIRECT version of DirectContr unexpectedly achieved the highest total score in Scenarios 1 and 2. Its advantage over REL_DIRECT is not very significant, though. ORIG_DIRECT performed well in some configurations, but its total score is below the score of FAIRSHARE in the first two scenarios.
Third, when the amount of contributed resources changes dynamically (Scenario 3), the number of processors does not correspond to the contribution. As DirectContr does not assume fixed shares (unlike FairShare), its advantage is more visible.
ClusterShare is a prototype system for fair resource sharing in a hybrid cloud. Our main objective was to adapt the DirectContr algorithm to a distributed system and a cloud computing scenario. ClusterShare was designed to be flexible and resilient. There is no need for common federation servers: deployment on the private servers of organizations is possible. The federation can grow or shrink spontaneously without disrupting the system. Each site is responsible for its own resources only, but it picks jobs from the common queue based on the global priority of each site. A container engine guarantees job portability between sites and resources.
We verified performance of our method by simulation. Our results show that methods based on the Shapley value lead to more fair outcome distribution than FairShare.
Acknowledgements: This research has been partly supported by the Polish National Science Center grant Sonata (UMO-2012/07/D/ST6/02440), and project TOTAL that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 677651).
- Andrade et al. (2005) N. Andrade, L. Costa, G. Germóglio, and W. Cirne. Peer-to-peer grid computing with the OurGrid community. In SBRC Proc., 2005.
- Baur and Domaschka (2016) D. Baur and J. Domaschka. Experiences from building a cross-cloud orchestration tool. In CrossCloud, Proc. ACM, 2016.
- Elkhatib (2016) Y. Elkhatib. Mapping cross-cloud systems: Challenges and opportunities. In HotCloud, 2016.
- Feitelson et al. (2014) D. G. Feitelson, D. Tsafrir, and D. Krakov. Experience with using the parallel workloads archive. JPDC, 74(10):2967–2982, 2014.
- Hunt et al. (2010) P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. Zookeeper: Wait-free coordination for internet-scale systems. In USENIX Proc., 2010.
- Jia et al. (2016) Q. Jia, Z. Shen, W. Song, R. van Renesse, and H. Weatherspoon. Smart spot instances for the supercloud. In CrossCloud, Proc. ACM, 2016.
- Kash et al. (2016) I. Kash, Q. Jia, Z. Shen, W. Song, R. van Renesse, and H. Weatherspoon. Economics of a supercloud. In CrossCloud, Proc. ACM, 2016.
- Kay and Lauder (1988) J. Kay and P. Lauder. A fair share scheduler. CACM, 31:44–55, 1988.
- Rafique et al. (2017) A. Rafique, D. Van Landuyt, V. Reniers, and W. Joosen. Towards an adaptive middleware for efficient multi-cloud data storage. In CrossCloud, Proc., pages 4:1–4:6. ACM, 2017.
- Shapley (1988) L. S. Shapley. A value for n-person games, pages 31–40. Cambridge University Press, 1988.
- Skowron and Rzadca (2013) P. Skowron and K. Rzadca. Non-monetary fair scheduling: a cooperative game theory approach. In SPAA Proc. ACM, 2013.
- Skowron and Rzadca (2014) P. Skowron and K. Rzadca. Fair share is not enough: measuring fairness in scheduling with cooperative game theory. In PPAM Proc., volume 8385 of LNCS, pages 38–48. Springer, 2014.