Bag-of-Tasks Scheduling on Related Machines

07/13/2021 ∙ by Anupam Gupta, et al. ∙ Indian Institute of Technology Delhi, Princeton University, Carnegie Mellon University

We consider online scheduling to minimize weighted completion time on related machines, where each job consists of several tasks that can be concurrently executed. A job gets completed when all its component tasks finish. We obtain an O(K^3 log^2 K)-competitive algorithm in the non-clairvoyant setting, where K denotes the number of distinct machine speeds. The analysis is based on dual-fitting on a precedence-constrained LP relaxation that may be of independent interest.




1 Introduction

Minimizing the weighted completion time is a fundamental problem in scheduling. Many algorithms have been developed in both the online and offline settings, and for the cases where machines are identical, related, or unrelated. Most of this work, however, focuses on the setting where each job is a monolithic entity that has to be processed in a sequential manner.

In this work, we consider the online setting with multiple related machines, where each job consists of several tasks. These tasks are independent of each other, and can be executed concurrently on different machines. (Tasks can be preempted and migrated.) A job is said to have completed when all its component tasks finish processing. We consider the non-clairvoyant setting where the algorithm does not know the size of a task up-front, but only when the task finishes processing. Such instances arise in operating system schedulers, where a job and its tasks correspond to a process and its threads that can be executed in parallel. This setting is sometimes called a “bag of tasks” (see e.g. [AC08, MK15, BMP10]).

The bag-of-tasks model can be modeled using precedence constraints. Indeed, each job j is modeled as a star graph, where the tasks correspond to the leaves (and have zero weight), and the root is an auxiliary task with zero processing requirement that carries the job's weight w_j. Hence the root can be processed only after all leaf tasks have completed processing. The goal is to minimize the total weighted completion time. Garg et al. [GGKS19] gave a constant-competitive algorithm for this problem on identical machines, in a more general setting where tasks form arbitrary precedence DAGs.

We extend this result to the setting of related machines, where machine i has speed s_i. By losing a constant factor, we assume that all speeds are powers of some constant c > 1. Let K denote the number of distinct machine speeds. In §2, we show that this problem is strictly more challenging than in the identical-machines setting:

Theorem 1.1 (Lower Bound).

Any online non-clairvoyant algorithm for bags-of-tasks on related machines has competitive ratio Ω(K), where K is the number of distinct machine speeds.

The lower bound arises because we want to process larger tasks on fast machines, but since we have no idea about the sizes of the tasks, we end up clogging the fast machines with small tasks; this issue does not arise when machines are identical. Given the lower bound, we now look for a non-clairvoyant scheduling algorithm whose competitive ratio depends only on K, the number of distinct speeds. This number may be small in many settings, e.g., when we use commodity hardware of a limited number of types (say, CPUs and GPUs). Our main result is a positive answer to this question:

Theorem 1.2 (Upper Bound).

The online non-clairvoyant algorithm of §3 for bags-of-tasks on related machines has a competitive ratio of O(K^3 log^2 K).

Our algorithm uses a greedy strategy. Instead of explicitly building a schedule, it assigns (processing) rates to tasks at each time t. Such a rate assignment is called feasible if, for every k, the rate assigned to any subset of k tasks is at most the total speed of the k fastest machines. Using an argument based on Hall's matching theorem, a schedule exists if and only if such a rate assignment can be found. To assign these rates, each alive task gets a "priority", which is the ratio of the weight of the job containing it to the number of alive tasks of this job. In other words, a task of a job with low weight or with many alive tasks gets a low priority. We assign feasible rates to alive tasks in a "fair" manner, i.e., we cannot increase the rate of a high-priority task by decreasing the rate of a lower-priority task. To efficiently find such feasible rates, we use a water-filling procedure.
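To make the feasibility condition concrete: since the constraint for subsets of size k is tightest for the k largest rates, it suffices to compare sorted prefix sums. The sketch below is our own illustration (function name and the `speedup` parameter are ours, modeling the resource augmentation used later), not code from the paper.

```python
def feasible(rates, speeds, speedup=1.0, eps=1e-9):
    """Check the rate-assignment feasibility condition: for every k, the
    total rate of any k tasks is at most `speedup` times the total speed
    of the k fastest machines. Sorting both lists in decreasing order
    makes the k largest rates the only subset worth checking."""
    r = sorted(rates, reverse=True)
    s = sorted(speeds, reverse=True)
    sum_r = sum_s = 0.0
    for k in range(len(r)):
        sum_r += r[k]
        if k < len(s):
            sum_s += s[k]  # S_k stops growing once machines run out
        if sum_r > speedup * sum_s + eps:
            return False
    return True
```

Note that the check also covers the case of more tasks than machines: beyond k = m, the right-hand side stays at the total speed of all machines.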

The analysis proceeds using the popular dual-fitting approach, but we need new ideas: (i) We adapt the precedence-constrained LP relaxation for completion time in [CS99] to our setting. A naive relaxation would define the completion time of a job as the maximum of the (fractional) completion times of its tasks, where the fractional completion time of a task is the sum, over times t, of the fraction of the task remaining at time t. Instead, we define, for a job j and time t, the quantity U_j^t as the maximum, over all tasks v of j, of the fraction of v which remains to be completed at time t, and the fractional completion time of j as the sum of U_j^t over times t. (See §4 for details.) (ii) Although it is natural to divide the machines into classes based on their speeds, we need a finer partitioning, which drives our setting of dual variables. Indeed, the usual idea of dividing up the job's weight equally among the tasks that are still alive only leads to an O(K + log n)-competitiveness (see §5). To do better, we first preprocess the instance so that distinct machine speeds differ by a constant factor, but the total processing capacity of a slower speed class is far more than that of all faster machines. Now, at each time, we divide the machines into blocks. A constant fraction of the blocks have the property that either the average speed of the machines in the block is close to one of the speed classes, or the total processing capacity of the block is close to that of all the machines of a speed class. It turns out that our dual-fitting approach works for accounting the weight of jobs which get processed by such blocks; proving this constitutes the bulk of the technical analysis. Finally, we show that most jobs (in terms of weight) get processed by such blocks, and hence we are able to bound the overall weighted completion time. We present the proofs in stages, giving intuition for the new components in each of the sections.

1.1 Related Work

Minimizing weighted completion time on parallel machines with precedence constraints admits a constant-factor approximation in the offline setting: Li [Li17] improves on [HSSW97, MQS98]. For related machines the precedence constraints make the problem harder: there is an O(log m / log log m)-approximation [Li17] improving on a prior O(log m) result [CS99], and a super-constant hardness under certain complexity assumptions [BN15]. Here m denotes the number of machines. These results are for offline, and hence clairvoyant, settings, and do not apply to our setting of non-clairvoyant scheduling.

In the setting of parallel machines, there has been recent work on minimizing weighted completion time in DAG scheduling, where each job consists of a set of tasks with precedence constraints between them given by a DAG [RS08, ALLM16]. [GGKS19] generalized this to the non-clairvoyant setting and gave a constant-competitive algorithm. Our algorithm for the related-machines case is based on a similar water-filling rate-assignment idea. Since the machines have different speeds, a set of rates assigned to tasks needs to satisfy a more involved feasibility condition. Consequently, the analysis becomes much harder; this forms the main technical contribution of the paper. Indeed, even for the special case considered in this paper where every DAG is a star, we can show a lower bound of Ω(K) on the competitive ratio of any non-clairvoyant algorithm. In §A we show a lower bound on the competitive ratio of any non-clairvoyant algorithm for DAG scheduling on related machines.

Our problem also has similarities to open shop scheduling. In open shop scheduling, each job consists of several tasks, where each task needs to be processed on a distinct machine for a specified amount of time. However, unlike in our setting, two tasks of the same job cannot be processed simultaneously on different machines. [QS01] considered open shop scheduling in the offline setting for related machines and gave a constant-factor approximation. [CSV09] considered a further generalization of our problem to unrelated machines, where the task sets corresponding to distinct jobs need not be disjoint. They gave a constant-factor approximation algorithm, again offline.

1.2 Paper Organization

In this extended abstract, we first give the algorithm in §3, and the linear program in §4. A simpler proof of O(K + log n)-competitiveness is in §5. We show O(K)-competitiveness for the case of a single job (which corresponds to makespan minimization) in §6, and then give the complete proof for the general case in §7.

2 Problem Statement and the Hardness

Each job j has a weight w_j and consists of n_j tasks. Each task has an associated processing requirement (size). The job completes when all its associated tasks finish processing. We use letters j, j′, etc. to denote jobs, and u, v, etc. to denote tasks.

There are m machines. The goal is to minimize the total weighted completion time of the jobs. We allow task preemption and migration, and different tasks of a job can be processed concurrently on different machines. However, a task itself can be processed on at most one machine at any time. In this extended abstract we consider the special case when all release dates are 0, but our results also extend to the more general setting of arbitrary release dates (see §7.5 for details). Let S_k denote the total speed of the k fastest machines. Since we care about the number of distinct speeds, we assume there are K speed classes, with speeds s_1 > s_2 > … > s_K, and m_i machines of speed s_i for each class i.

Assumption 2.1 (Increasing Capacity Assumption).

For a constant parameter c ≥ 2:

  • (Falling Speeds.) For each i < K, we have s_{i+1} ≤ s_i / c.

  • (Increasing Capacity.) For each i > 1, the total processing capacity of speed class i is at least twice that of all the previous (faster) speed classes, i.e., m_i s_i ≥ 2 · Σ_{i′ < i} m_{i′} s_{i′}.

  • (Speed-up.) The algorithm uses machines that are a constant factor faster than the adversary's machines.

Proposition 2.2.

An arbitrary instance can be transformed into one satisfying Assumption 2.1 by losing a constant factor in the competitive ratio.


(Sketch) For the first part, we round down the speed of each machine to a power of c. This changes the completion time by at most a factor of c. The second (increasing capacity) assumption is not without loss of generality: we greedily find a subset of speed classes, losing a constant factor in the competitive ratio (see details in Appendix B). Finally, the constant-factor speed-up can only change the competitive ratio by a constant factor. ∎
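The first preprocessing step (rounding speeds down to powers of a constant) might look like the following sketch; the function name and the default choice c = 2 are ours, used purely for illustration.

```python
import math

def round_down_speeds(speeds, c=2.0):
    """Round every machine speed down to the nearest power of c.
    Each speed drops by a factor of at most c, so the completion time
    of any schedule (and hence the competitive ratio) changes by at
    most a factor of c."""
    return [c ** math.floor(math.log(s, c)) for s in speeds]
```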

Next we show that any online algorithm has to be Ω(K)-competitive, even for a single job, with the machines satisfying the increasing-capacity Assumption 2.1.

Proposition 2.3.

Any online algorithm is Ω(K)-competitive, even for a single job, under the increasing-capacity Assumption 2.1.


(Sketch) Consider a single job with m tasks, where m is the number of machines. For every speed class i, there are m_i tasks of size s_i; call these tasks B_i. Since there is only one job, the objective is to minimize the makespan. The offline (clairvoyant) makespan is 1, since all tasks can be assigned to machines with matching speeds. However, any online algorithm incurs a makespan of Ω(K). Here is an informal argument, which can be proved even for randomized algorithms against oblivious adversaries: since there is no way to distinguish between the tasks, the algorithm can at best run all the alive tasks at the same speed. The tasks in B_K will be the first to finish, after a constant amount of time, where the increasing capacity assumption is used to bound this time. At this time, the processing done on tasks from B_i for i < K has been very small, and so the tasks in B_{K-1} will require about a constant amount of additional time to finish, and so on: each class contributes a constant amount of time, for a makespan of Ω(K). ∎
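To make the informal argument concrete, the following sketch (our own toy construction, with two speed classes satisfying the increasing-capacity assumption; function names are ours) simulates the equal-rate strategy and compares it against the clairvoyant makespan of 1:

```python
def equal_rate(n_alive, speeds):
    """Largest common rate r such that giving r to each of n_alive tasks
    is feasible: any k tasks must get at most the total speed of the k
    fastest machines."""
    s = sorted(speeds, reverse=True)
    best, prefix = float("inf"), 0.0
    for k in range(1, min(n_alive, len(s)) + 1):
        prefix += s[k - 1]
        best = min(best, prefix / k)
    if n_alive > len(s):
        best = min(best, sum(s) / n_alive)
    return best

def online_makespan(task_sizes, speeds):
    """Event-driven simulation of the best non-clairvoyant strategy for a
    single job: run all alive tasks at the same (maximal feasible) rate.
    `done` tracks the work completed on every alive task so far."""
    t, done = 0.0, 0.0
    alive = sorted(task_sizes)
    while alive:
        r = equal_rate(len(alive), speeds)
        t += (alive[0] - done) / r   # time until the smallest task ends
        done = alive[0]
        while alive and alive[0] == done:
            alive.pop(0)
    return t
```

With one machine of speed 4 and eight of speed 1 (so the slow class has twice the capacity of the fast one), and one matching task per machine, the clairvoyant makespan is 1 while the equal-rate schedule takes noticeably longer; adding more classes grows the gap proportionally to K.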

3 The Scheduling Algorithm

The scheduling algorithm assigns, at each time t, a rate to each unfinished task. The following lemma (whose proof is deferred to the appendix) characterizes rates that correspond to schedules:

Lemma 3.1.

A schedule with the assigned rates is feasible if for every time t and every value of k:

(⋆)  the total rate assigned to any subset of k tasks is at most S_k, the total speed of the k fastest machines.

For each time t, we now specify the rates assigned to each unfinished task. For a job j, let A_j(t) be the set of tasks of j which are alive at time t. Initially all tasks are unfrozen. We raise a parameter τ, starting at zero, at a uniform speed. The values taken by τ will be referred to as moments. For each job j and each task v ∈ A_j(t) that is unfrozen, define its tentative rate at moment τ to be

    R_v(τ) = τ · w_j / |A_j(t)|.
Hence the tentative rates of these unfrozen tasks increase linearly, as long as condition (⋆) is satisfied. However, if (⋆) becomes tight for some subset of alive tasks, i.e., the total rate of the subset equals the corresponding speed bound, we pick a maximal such subset and freeze its tasks, fixing their rates at their current tentative values. (A speed-up factor appears on the right-hand side of the tight constraints because we assume the machines of the algorithm are a constant factor faster.) Now continue the algorithm this way, raising τ and the tentative rates of the remaining unfrozen tasks until another subset gets tight, etc., stopping when all tasks are frozen. This defines the rates of the tasks at time t. By construction, these rates satisfy (⋆).
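A discretized sketch of this freezing procedure follows (the step size, tolerances, and function name are ours; the actual algorithm raises τ continuously, and priorities are assumed positive):

```python
def water_filling(priorities, speeds, speedup=1.0, step=1e-4):
    """Raise tentative rates of unfrozen tasks in proportion to their
    priorities; whenever the k largest rates exhaust `speedup` times the
    total speed of the k fastest machines, freeze that prefix of tasks."""
    s = sorted(speeds, reverse=True)
    m, n = len(s), len(priorities)
    # cap[k-1] = speedup * (total speed of the k fastest machines)
    cap, acc = [], 0.0
    for x in s:
        acc += x
        cap.append(speedup * acc)
    rates = [0.0] * n
    frozen = [False] * n
    while not all(frozen):
        for i in range(n):           # raise tentative rates
            if not frozen[i]:
                rates[i] += step * priorities[i]
        # freeze every task belonging to a tight prefix constraint
        order = sorted(range(n), key=lambda i: -rates[i])
        total = 0.0
        for k, i in enumerate(order, start=1):
            total += rates[i]
            if total >= cap[min(k, m) - 1] - 1e-9:
                for j in order[:k]:
                    frozen[j] = True
    return rates
```

For example, with two equal-priority tasks on machines of speeds 2 and 1, both freeze together at rate 1.5 (the k = 2 constraint binds first); with priorities 2 and 1, the faster-growing task freezes at rate 2 (the k = 1 constraint) and the other at rate 1.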

3.1 Properties of the Rate Assignment

The following claim shows that all alive tasks corresponding to a job get frozen simultaneously.

Lemma 3.2 (Uniform Rates).

For any time t and any job j, all its alive tasks (i.e., those in A_j(t)) freeze at the same moment, and hence get the same rate.


For the sake of contradiction, consider the first moment τ at which some maximal tight set S of tasks contains u but not v, for two tasks u, v ∈ A_j(t) of some job j. Both tasks have been treated identically until this moment, so their tentative rates are equal. Since we maintain feasibility at all moments, the set S ∪ {v} must also be tight at moment τ. But then, by the maximality of the set S, the algorithm should have picked S ∪ {v} instead of S, a contradiction. ∎

For a task v ∈ A_j(t), define w_j/|A_j(t)| to be task v's "share" of the weight of job j at time t. So if task v freezes at moment τ_v, then its rate is τ_v · w_j/|A_j(t)|. Let us relate this share to certain averages of the weight. (Proof in Appendix C.)

Corollary 3.3.

Fix a time . Let be the set of tasks frozen by some moment . For a task ,

  • if is any subset of tasks which freeze either at the same moment as , or after it, then

  • if is any subset of tasks which freeze either at the same moment as , or before it, then

3.2 Defining the Blocks

The rates for tasks alive at any time t are defined by a sequence of freezing steps, in each of which some group of tasks is frozen: we call these groups blocks. By Lemma 3.2, all tasks in A_j(t) belong to the same block. The weight of a block is the total weight of the jobs whose tasks belong to it. Let B_1, B_2, … be the blocks at time t in the order they were frozen, and τ_1 ≤ τ_2 ≤ … be the moments at which they froze. Letting τ_0 = 0, any task in block B_ℓ freezes at a moment in (τ_{ℓ-1}, τ_ℓ].

Each block has an associated set of machines, namely the machines on which the tasks in this block are processed; these machines have consecutive indices, with earlier blocks getting the faster machines. We use M(B) to denote the set of machines associated with a block B. Since the tasks in a block B are processed on M(B) in a pre-emptive manner at time t, the rate assigned to any task of B is at least the slowest speed (and at most the fastest speed) of the machines in M(B).

4 The Analysis and Intuition

We prove the competitiveness by a dual-fitting analysis: we give a primal-dual pair of LPs, use the algorithm above to give a feasible primal, and then exhibit a feasible dual with value within a small factor of the primal cost.

In the primal LP, we have variables x_{v,i,t} for each task v, machine i, and time t, denoting the extent of processing done on task v on machine i during the interval [t, t+1). Further, U_j^t denotes the fraction of job j that remains unfinished at or after time t, and C_j denotes the completion time of job j.


The constraint (2) is based on precedence-constrained LP relaxations for completion time. Indeed, each job can be thought of as a star graph with a zero-size task at the root, preceded by all the actual tasks at the leaves. In our LP, for each time t, we define U_j^t to be the maximum, over all tasks v of j, of the fraction of v that remains (the RHS of (2)), and the completion time of j is at least the sum of the U_j^t values over times t. Since we do not explicitly enforce that a task cannot be processed simultaneously on many machines, the first term is added to avoid a large integrality gap. We show feasibility of this LP relaxation (up to a factor of 2) in §D.

Claim 4.1.

For any schedule, there is a feasible solution to the LP of objective value at most twice the weighted completion time of the schedule.

The linear programming dual has variables α_j, β_v, and γ_j^t corresponding to constraints (4), (3), (2) for every job j and task v, and δ_i^t corresponding to constraints (5) for every machine i and time t:


We now give some intuition about these dual variables. The quantity γ_j^t should be thought of as the contribution (at time t) towards the weighted flow-time of j. Similarly, β_v is the global contribution of task v towards the flow-time of its job j. (In the integral case, β_v would be w_j for the task of j which finishes last; if there are several such tasks, β would be non-zero only for those tasks, and would add up to w_j.) The quantity α_j can be thought of as j's contribution towards the total weighted flow-time, and δ_i^t is roughly the queue size at time t on machine i. Constraint (6) upper bounds α_j in terms of the other dual variables. More intuition about these variables can be found in §4.2.

4.1 Simplifying the dual LP

Before interpreting the dual variables, we rewrite the dual LP and add some additional constraints. Define additional variables γ_{j,v}^t for each job j, task v of j, and time t, such that γ_j^t = Σ_v γ_{j,v}^t. We add a new constraint:


This condition is not a requirement in the dual LP, but we will set the γ_{j,v}^t variables to satisfy it. Assuming this, we set the duals accordingly for all jobs j, tasks v, and times t; feasibility of (9) implies that of (8). Moreover, (6) simplifies to

Observe that we can write the processing done on a task as the sum of its assigned rates. Since this is at least the corresponding term above, we can substitute, and infer that it suffices to verify the following condition for all tasks v and all pairs of times t, t′:


Henceforth, we ensure that our duals (including the γ_{j,v}^t variables) satisfy (9), (10), and (7).

4.2 Interpreting the Duals and the High-Level Proof Idea

We give some intuition about the dual variables, which will be useful for understanding the subsequent analysis. We set the dual variables such that for any job j, the relevant duals sum to (approximately) the weighted completion time of job j. This ensures that the dual objective tracks the total weighted completion time of the jobs. One way of achieving this is as follows: for every time t and task-job pair (v, j), we define variables γ_{j,v}^t such that they add up to w_j if job j is unfinished at time t (i.e., (9) is satisfied with equality). Summed over the lifetime of j, these variables then add up to the weighted completion time of j.

The natural way of defining γ_{j,v}^t is to evenly distribute the weight of j among all the alive tasks at time t, i.e., to set γ_{j,v}^t = w_j/|A_j(t)|. This idea works if we only want to show that the algorithm is O(K + log n)-competitive, but does not seem to generalize if we want a bound depending only on K. The reason for this will be clearer shortly, when we discuss the β variables.

Now we discuss the δ dual variables. We set these variables so that their sum is a constant (less than 1) times the total weighted completion time. This ensures that the objective value of the dual LP is also a constant times the total weighted completion time. A natural idea (ignoring constant factors for now) is to set δ_i^t based on the total weight of U(t), the set of alive jobs at time t, scaled according to the speed class of machine i. Since we have put an extra term in the denominator of δ (and no such term in the definition of γ), ensuring the feasibility of (6) would require a corresponding speed augmentation.

Finally, consider the β dual variables. As (7) suggests, setting β is the same as deciding how to distribute the weight w_j among the tasks of j. Notice, however, that this distribution cannot depend on time (unlike γ, where we were distributing among all the alive tasks at each time t). In the ideal scenario, tasks finishing later should get high β values. Since we are in the non-clairvoyant setting, we may want to set β_v = w_j/n_j for each task v of j. We now argue that this can lead to a problem in satisfying (10).

Consider the setting of a single unit-weight job initially having n tasks, and so we set β_v = 1/n for all tasks v. Now consider a later point in time t when only n′ ≪ n tasks remain, where n′ is close to m_i for some speed class i. At this time t, each of the surviving tasks has γ_{j,v}^t = 1/n′. But look at the RHS of (10), with a machine of speed class i: the rate will be very close to s_i (again, by the increasing capacity assumption), and so both the terms would be about β_v = 1/n. However, 1/n′ could be much larger than 1/n, and so this constraint will not be satisfied. In fact, we can hope to satisfy (10) at time t only if β_v is close to 1/n′, say at least 1/(2n′). When the number of alive tasks drops below n′/2, we need to redistribute the weight of j among these tasks, i.e., we need to increase the β values of the surviving tasks to about the inverse of the number of survivors. Since these halvings can happen for log n steps, we see that (3) is violated by a factor of log n. These ideas can be extended to give an O(K + log n)-competitive algorithm for arbitrary inputs; see §5 for details. To get a better bound, we need a more careful setting of the dual variables, which we talk about in §6 and §7.

5 Analysis I: A Weaker Guarantee

We start with a simpler analysis which yields an O(K + log n)-competitiveness. This argument will not use the increasing capacity assumption from Assumption 2.1; the resulting guarantee is logarithmic in n when K is small, whereas our eventual result will be O(K^3 log^2 K), which can be much smaller when K ≪ log n.

Theorem 5.1.

The scheduling algorithm in §3 is O(K + log n)-competitive.


For each job j, we arrange the tasks in j in descending order of their processing requirements. (This is the opposite of the order in which they finish, since all alive tasks of a job are processed at the same rate.) We partition this sequence of tasks into groups G_1, G_2, … with exponentially increasing cardinalities: G_1 contains the first task, G_2 the next two tasks, and in general G_g has 2^{g-1} tasks. (Assume w.l.o.g. that n_j has the right form, by adding zero-sized tasks to j.) Now we define the dual variables.
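The grouping step can be sketched as follows (the function name is ours; the last group may be truncated when the task count is not of the exact form, matching the zero-padding convention above):

```python
def size_groups(n):
    """Partition task indices 0..n-1 (tasks sorted by decreasing size)
    into groups of cardinalities 1, 2, 4, ...: the g-th group has
    2^(g-1) tasks, giving O(log n) groups in total."""
    groups, start, size = [], 0, 1
    while start < n:
        groups.append(list(range(start, min(start + size, n))))
        start += size
        size *= 2
    return groups
```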

Dual Variables.

  • For a time t and machine i of speed class k, let U(t) denote the set of active (unfinished) jobs at time t, and define

  • For job j and a task v in the g-th group G_g, define

  • In order to define γ_j^t, we first define quantities γ_{j,v}^t for every time t, and then set γ_j^t = Σ_v γ_{j,v}^t. At time t, recall that A_j(t) is the set of alive tasks of job j, and define

    This "spreads" the weight of j equally among its alive tasks.

Having defined the dual variables, we first argue that they are feasible.

Lemma 5.2 (Dual feasibility).

The dual variables defined above always satisfy the constraints (9), (7), and (10) for a suitable constant speed-up factor.


To check feasibility of (7), consider a job j and observe that

because of the definition of the β variables and the fact that there are at most O(log n) distinct groups. Feasibility of (9) also follows easily. It remains to check (10) for a job j, task v, machine i, and times t, t′.

If v is not alive at time t, then the corresponding term is 0, and (10) follows trivially. Otherwise v ∈ A_j(t), and the jobs alive at the later time t′ are also alive at time t. Furthermore, suppose the tasks in A_j(t) belong to block B (defined in §3.1), and let k′ be the speed class of the slowest machines among the associated machines M(B). Let k denote the speed class of machine i (considered in (10)). Two cases arise: the first is when machine i is in class k′ or a slower one, in which case (10) holds because

The second case is when machine i is in a class faster than k′: let F be the set of tasks which are frozen by the moment v freezes. In other words, F contains the tasks in block B and in the blocks before it. Applying the second statement in Corollary 3.3 with this set,

where the last inequality uses the fact that all machines of the faster speed classes are busy processing jobs of F. Therefore,

the last inequality using the definition of the δ variables.

Finally, we show that the dual objective value for this setting of dual variables is close to the primal value. It is easy to check that the α variables sum to the total weighted completion time of the jobs. Moreover,

Since we chose a suitable constant speedup, the dual objective value is at least half of the total weighted completion time (the primal value). This completes the proof of Theorem 5.1. ∎

6 Analysis II: An Improved Guarantee for a Single Job

We want to show that the competitiveness of our algorithm depends only on K, the number of speed classes. To warm up, in this section we consider the special case of a single job; in §7 we consider the general case. As was shown in Proposition 2.3, any algorithm has competitive ratio Ω(K) even in the case of a single job. We give a matching upper bound using dual fitting for an instance with a single job, say of unit weight, when the machines satisfy Assumption 2.1.

Theorem 6.1.

If the machines satisfy Assumption 2.1, the scheduling algorithm in §3 is O(K)-competitive for a single job.

6.1 The Intuition Behind the Improvement

The analysis in §5 incurred the O(log n) factor in the competitive ratio because we divided the execution of the tasks of each job into epochs, where each epoch ended when the number of tasks halved. In each such epoch, we set the γ variables by distributing the job's weight evenly among all tasks alive at the beginning of the epoch. A different way to define epochs would be to let them correspond to the time periods during which the number of alive tasks falls in the range corresponding to a fixed speed class. This would give us only O(K) epochs. There is a problem with this definition: as the number of alive tasks varies within the range of a class i, the rate assigned to the tasks can vary over a wide range. Indeed, there is a transition point in this range such that the rate assigned to the tasks stays close to s_i as long as the number of tasks lies on one side of it; on the other side, the assigned rate may not stay close to any fixed value. However, in the latter regime, the total processing rate assigned to all the tasks stays close to a fixed value.

It turns out that our argument for an epoch (with minor modifications) works as long as one of these two facts holds during the epoch: (i) the total rate assigned to the tasks stays close to the total capacity of some speed class i (even though the number of tasks may be much larger than m_i), or (ii) the actual rate assigned to each task stays close to s_i. Thus we can divide the execution of the job into O(K) epochs, and get an O(K)-competitive algorithm. In this section, we prove this for a single job; we extend it to the case of multiple jobs in §7 (with a slightly worse competitiveness).

6.2 Defining the New Epochs

Before defining the dual variables, we begin with a definition. For each speed class i, define the threshold n_i to be the following:

    n_i := (Σ_{i′ < i} m_{i′} s_{i′}) / s_i.

The parameter n_i is such that the processing capacity of n_i machines of class i equals the combined processing capacity of the machines of the classes before i. The increasing capacity assumption implies n_i ≤ m_i, as formalized below:

Claim 6.2.

Define and . Under the increasing capacity Assumption 2.1 and , for any speed class , we have

  • and so, ,   (b) ,

  • , and   (d) .


Fact (a) follows from the increasing capacity assumption and the definition of the threshold, since . This implies . Proving (b) is equivalent to showing , which follows from the definition of and the fact that for all . The last two statements also follow from the increasing capacity assumption. ∎

Figure 1: Defining breakpoints.

We identify a set of break-points as follows: for each speed class , let denote the first time when alive tasks remain. Similarly, let be the first time when exactly alive tasks remain. Note that . Let be the tasks which finish during , and be those which finish during . Let and denote the cardinality of and respectively. Note that .

Claim 6.3.

For any speed class , we have


The first statement requires that . This is the same as , which follows from Claim 6.2 (a). The second statement requires that , i.e., . But (by Claim 6.2 (d)), hence the proof. ∎

Next we set the duals. Although it is possible to directly argue that the delay incurred by the job in each epoch is at most (a constant times) the optimal objective value, the dual-fitting proof generalizes to arbitrary sets of jobs.

6.3 Setting the Duals

Define the speed-up constant. We set the duals as:

  • Define

  • For machine of class , define

  • Finally, as in §5, we define γ_{j,v}^t for each task v of job j, and then set γ_j^t = Σ_v γ_{j,v}^t. To define these, we consider two cases (we use n_t to denote the number of alive tasks at time t):

    1. for some : Then  

    2. for some : Then  

    Note the asymmetry in the definition. It arises because in the first case, the total speed of the machines processing the tasks is (up to a constant) the capacity of a speed class, whereas in the second case the average speed of such machines is about the speed of a single class.

Lemma 6.4 (Dual feasibility).

The dual variables defined above always satisfy the constraints (7) and (9), and satisfy constraint (10) for the chosen speed-up.


It is easy to check from the definitions of the β and γ variables that the dual constraints (7) and (9) are satisfied. It remains to verify constraint (10) (re-written below) for any task v, machine i, and times t and t′.

((10) repeated)

As in the definition of , there are two cases depending on where lies. First assume that there is class such that . Assume that is alive at time (otherwise is 0), so , where is the number of alive tasks at time . Being alive at this time , we know that will eventually belong to some with , or in some with . So by Claim 6.3, . Moreover, let be a machine of some class , so . Hence, it is enough to verify the following in order to satisfy (10):


Two subcases arise, depending on how and relate—in each we show that just one of the terms on the right is larger than the left.

  • : Since at least tasks are alive at this time, the total speed assigned to all the alive tasks at time is at least . Therefore, . Now using , we get

    where the last inequality follows from the increasing capacity assumption.

  • : The quantity is the total speed of the machines which are busy at time , which is at least Again, using , we get

    because and

Thus, (12) is satisfied in both the above subcases.

Next we consider the case when there is a speed class such that . We can assume that , otherwise is 0; this means . Since , and , the expression (10) follows from showing


Since , we can drop those terms. Again, two cases arise:

  • : By definition,