Response Time Bounds for Typed DAG Parallel Tasks on Heterogeneous Multi-cores

08/07/2018 ∙ by Meiling Han, et al. ∙ NetEase, Inc

Heterogeneous multi-cores utilize the strength of different architectures for executing particular types of workload, and usually offer higher performance and energy efficiency. In this paper, we study the worst-case response time (WCRT) analysis of typed scheduling of parallel DAG tasks on heterogeneous multi-cores, where the workload of each vertex in the DAG is only allowed to execute on a particular type of cores. The only known WCRT bound for this problem is grossly pessimistic and suffers the non-self-sustainability problem. In this paper, we propose two new WCRT bounds. The first new bound has the same time complexity as the existing bound, but is more precise and solves its non-self-sustainability problem. The second new bound explores more detailed task graph structure information to greatly improve the precision, but is computationally more expensive. We prove that the problem of computing the second bound is strongly NP-hard if the number of types in the system is a variable, and develop an efficient algorithm which has polynomial time complexity if the number of types is a constant. Experiments with randomly generated workload show that our proposed new methods are significantly more precise than the existing bound while having good scalability.

1 Introduction

Multi-cores are more and more widely used in real-time systems, to meet rapidly increasing requirements in performance and energy efficiency. To fully utilize the computation capacity of multi-cores, software should be properly parallelized. A representation that can model a wide range of parallel software is the DAG (directed acyclic graph) task model, where each vertex represents a piece of sequential workload and each edge represents the precedence relation between two vertices. Real-time scheduling and analysis of DAG parallel task models have raised many new challenges over traditional real-time scheduling theory with sequential tasks, and have become an increasingly hot research topic in recent years.

Many modern multi-cores adopt heterogeneous architectures. Examples include Zynq-7000 [37] and OMAP1/OMAP2 [33], which integrate CPU and DSP on the same chip, and the Tegra processors [36], which integrate CPU and GPU on the same chip. Heterogeneous multi-cores utilize specialized processing capabilities to handle particular computational tasks, which usually offers higher performance and energy efficiency. For example, [38] showed that a heterogeneous-ISA chip multiprocessor can outperform the best same-ISA homogeneous architecture while also saving energy and reducing the energy-delay product.

In this paper, we consider real-time scheduling of typed DAG tasks on heterogeneous multi-cores, where each vertex is explicitly bound to execute on a particular type of cores. Binding code segments of the program to a certain type of cores is common practice in software development on heterogeneous multi-cores and is supported by mainstream parallel programming frameworks and operating systems. For example, in OpenMP [34] one can use the proc_bind clause to specify the mapping of threads to certain processing cores. In OpenCL [35], one can use the clCreateCommandQueue function to create a command queue on a certain device. In CUDA [10], one can use the cudaSetDevice function to direct subsequent executions to the target device.

The target of this paper is to bound the worst-case response time (WCRT) for typed DAG tasks.

To the best of our knowledge, the only known WCRT bound for the considered problem model was presented in an early work [18] (called OLD-B), which is not only grossly pessimistic, but also suffers from the non-self-sustainability problem. (Under a non-self-sustainable analysis method, a system decided to be schedulable may be decided to be unschedulable when the system parameters become “better”. We discuss this issue in more detail in Section 3.) In this paper we develop two new response time bounds to address these problems:

  • NEW-B-1, which dominates OLD-B in analysis precision with the same time complexity and solves its non-self-sustainability problem.

  • NEW-B-2, which significantly improves the analysis precision by exploring more detailed task graph structure information. NEW-B-2 is more precise, but also more difficult to compute.

    • We prove the problem of computing NEW-B-2 to be strongly NP-hard if the number of types is a variable.

    • We develop an efficient algorithm to compute NEW-B-2 with polynomial time complexity if the number of types is a constant.

Experiments with randomly generated parallel tasks show that the new WCRT bounds proposed in this paper can greatly improve the analysis precision. This paper focuses on analysis of a single typed DAG task, but our results are also meaningful to general system setting with multiple recurrent typed DAG tasks. On one hand, the results of this paper are directly applicable to multiple tasks under scheduling algorithms where a subset of cores are assigned to each individual parallel task (e.g., federated scheduling [29, 3, 5, 6, 30]). On the other hand, the analysis of intra-task interference addressed in this paper is a necessary step towards the analysis for scheduling algorithms where different tasks interfere with each other (e.g., global scheduling [8, 2, 32]).

2 Preliminary

2.1 Task Model

We consider a typed DAG task $G = (V, E, \gamma, c)$ to be executed on a heterogeneous multi-core platform with different types of cores. $S$ is the set of core types (or types for short), and for each $s \in S$ there are $M_s$ cores of this type ($M_s \geq 1$). $V$ and $E$ are the sets of vertices and edges in $G$. Each vertex $v \in V$ represents a code segment to be sequentially executed. Each edge $(u, v) \in E$ represents the precedence relation between vertices $u$ and $v$. The type function $\gamma : V \rightarrow S$ defines the type of each vertex, i.e., $\gamma(v) = s$, where $s \in S$, means that vertex $v$ must be executed on cores of type $s$. The weight function $c : V \rightarrow \mathbb{R}^{+}$ defines the worst-case execution time (WCET) of each vertex, i.e., $v$ executes for at most $c(v)$ time units (on cores of type $\gamma(v)$).

If there is an edge $(u, v) \in E$, $u$ is a predecessor of $v$, and $v$ is a successor of $u$. If there is a path in $G$ from $u$ to $v$, $u$ is an ancestor of $v$ and $v$ is a descendant of $u$. We use $pre(v)$, $suc(v)$, $ans(v)$ and $des(v)$ to denote the set of predecessors, successors, ancestors and descendants of $v$, respectively. Without loss of generality, we assume $G$ has a unique source vertex (which has no predecessor) and a unique sink vertex (which has no successor). (In case $G$ has multiple source/sink vertices, one can add a dummy source/sink vertex to make it compliant with our model.) We use $\pi = \{v_1, \dots, v_p\}$ to denote a path in $G$. A path is a complete path iff its first vertex is the source vertex of $G$ and its last vertex is the sink vertex. We use $vol(G)$ to denote the total WCET of $G$ and $vol_s(G)$ the total WCET of vertices of type $s$:

$$vol(G) = \sum_{v \in V} c(v), \qquad vol_s(G) = \sum_{v \in V \wedge \gamma(v) = s} c(v)$$

The length of a path $\pi$ is denoted by $len(\pi)$ and $len(G)$ represents the length of the longest path in $G$:

$$len(\pi) = \sum_{v \in \pi} c(v), \qquad len(G) = \max_{\pi \in G} len(\pi)$$

      Example 2.1.

Figure 1 illustrates a typed DAG task with two types of vertices (type marked by yellow and type marked by red). The WCET of vertex is annotated by the number next to the vertex. And we can compute that , and . For a path , the length is .
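The quantities defined above can be computed directly from an adjacency-list representation. The following is a minimal sketch on a hypothetical six-vertex typed DAG (it is not the task of Figure 1, whose details are not reproduced here); the names `WCET`, `TYPE`, `SUCC` and the helper functions are illustrative assumptions.

```python
from functools import lru_cache

# Hypothetical typed DAG: v1 is the source, v6 is the sink.
WCET = {"v1": 1, "v2": 3, "v3": 2, "v4": 4, "v5": 2, "v6": 1}
TYPE = {"v1": 1, "v2": 1, "v3": 2, "v4": 2, "v5": 2, "v6": 1}
SUCC = {"v1": ["v2", "v3"], "v2": ["v4", "v5"], "v3": ["v4"],
        "v4": ["v6"], "v5": ["v6"], "v6": []}


def vol(wcet):
    """Total WCET of all vertices."""
    return sum(wcet.values())


def vol_of_type(wcet, vtype, s):
    """Total WCET of the vertices of type s."""
    return sum(c for v, c in wcet.items() if vtype[v] == s)


def path_length(wcet, path):
    """Length of a path: the sum of the WCETs of its vertices."""
    return sum(wcet[v] for v in path)


def longest_path_length(wcet, succ, source):
    """Length of the longest path, by memoized recursion from the source."""
    @lru_cache(maxsize=None)
    def longest_from(v):
        return wcet[v] + max((longest_from(w) for w in succ[v]), default=0)
    return longest_from(source)


print(vol(WCET), vol_of_type(WCET, TYPE, 1), vol_of_type(WCET, TYPE, 2))
print(longest_path_length(WCET, SUCC, "v1"))
```

Because the graph has a unique source, the longest path starting at the source is the longest path of the whole graph, and each vertex and edge is visited once.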

Fig. 1: A typed DAG task with two types.
Fig. 2: Two possible execution sequences of the task in Figure 1.
(a) An execution sequence where each vertex executes for its WCET.
(b) An execution sequence where some vertices execute for less than their WCET.

2.2 Runtime Behavior

A vertex $v$ is eligible for execution when all of its predecessors have finished. Without loss of generality, we assume the source vertex of $G$ is eligible for execution at time $0$. The typed DAG task $G$ is scheduled on the heterogeneous multi-core platform by a work-conserving scheduling algorithm:

      Definition 2.1.

Under a work-conserving scheduling algorithm, an eligible vertex of type $s$ must be executed if there are available cores of type $s$.

We do not put any other constraints to scheduling algorithms except the work-conserving constraint. There are many possible instances of work-conserving scheduling algorithms, e.g., the list scheduling [13] algorithm. The results of this paper are applicable to any work-conserving scheduling algorithm.

Execution Sequence. At runtime, the vertices of $G$ execute at certain times according to the scheduling algorithm. We call a trace describing which vertex executes at which time points an execution sequence of $G$. Given a scheduling algorithm, $G$ may generate different execution sequences. This is because (1) the scheduling algorithm may be nondeterministic (the scheduler may behave differently in the same situation) and (2) each vertex may execute for less than its WCET. For example, Figure 2(a) shows an execution sequence where each vertex executes for its WCET, while Figure 2(b) shows another execution sequence where some vertices execute for less than their WCET but lead to a larger response time. In an execution sequence $\varepsilon$, we use $f_{\varepsilon}(v)$ to denote the finish time of vertex $v$. For simplicity, we omit the subscript and only use $f(v)$ to denote $v$'s finish time when the execution sequence is clear from the context.

Response Time. The response time of $G$ in an execution sequence is the finish time of the sink vertex, and the WCRT of $G$, denoted by $R$, is the maximum among the response times of all possible execution sequences. The target of this paper is to derive safe upper bounds for the WCRT of $G$. Note that the WCRT of $G$ is not necessarily achieved by the execution sequence in which each vertex executes for its WCET (even if there is only one type in the system) [13, 14]. Therefore, one cannot obtain the WCRT of $G$ by simply simulating the execution of $G$ using the WCET, but has to (explicitly or implicitly) analyze all the possible execution sequences of $G$.

2.3 Existing WCRT Bound

To our best knowledge, the only known WCRT upper bound for the considered model was developed in an early work [18]:

      Theorem 2.1 (Old-B).

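The WCRT of $G$ is bounded by:

$$R \le \left(1 - \frac{1}{\max_{s \in S} M_s}\right) \times len(G) + \sum_{s \in S} \frac{vol_s(G)}{M_s} \qquad (1)$$

This bound can be computed in $O(|V| + |E|)$ time [11]. Although OLD-B was originally derived for the list scheduling algorithm [13], it applies to all work-conserving scheduling algorithms. When there is only one type (with $M$ cores), it degrades to the classical response time bound for untyped DAG tasks [14]:

$$R \le len(G) + \frac{vol(G) - len(G)}{M}$$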

3 The First New WCRT Bound

OLD-B is not only pessimistic but also suffers the problem of being non-self-sustainable with respect to processing capacity. More specifically, the value of the WCRT bound in (1) may increase when the number of cores (of some type) increases, as witnessed by the following example.

      Example 3.1.

For the task in Figure 1, we can calculate its , and . Suppose and , we obtain a WCRT bound by OLD-B as . However, if we increase to , the bound is increased to .
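The effect described in this example can be reproduced numerically. The sketch below assumes OLD-B has the form $(1 - 1/\max_s M_s) \cdot len(G) + \sum_s vol_s(G)/M_s$ and uses hypothetical values ($len = 9$, $vol_1 = 5$, $vol_2 = 8$) rather than the values of Figure 1; the function name `old_b` is illustrative.

```python
from fractions import Fraction


def old_b(length, vol_by_type, cores):
    """OLD-B under the assumed form
    (1 - 1/max_s M_s) * len(G) + sum_s vol_s(G) / M_s."""
    m_max = max(cores.values())
    scaled_len = (1 - Fraction(1, m_max)) * length
    return scaled_len + sum(Fraction(vol_by_type[s], cores[s]) for s in cores)


LENGTH = 9                   # hypothetical len(G)
VOL_BY_TYPE = {1: 5, 2: 8}   # hypothetical vol_s(G)

before = old_b(LENGTH, VOL_BY_TYPE, {1: 2, 2: 2})
after = old_b(LENGTH, VOL_BY_TYPE, {1: 2, 2: 8})   # more cores of type 2
print(before, after)   # the bound grows although cores were added
```

In this instance the extra cores shrink the term $vol_2(G)/M_2$, but they also raise $\max_s M_s$ and therefore the coefficient of $len(G)$, which outweighs the gain.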

Note that the actual WCRT of will not increase when more cores are used. The phenomenon shown above is merely the problem of the bound OLD-B itself rather than the system behavior. As pointed out in [1], the self-sustainability property is important in incremental and interactive design process, which is typically used in the design of real-time systems and in the evolutionary development of fielded systems.

In this section we will develop a new WCRT bound, which is not only more precise than OLD-B (with the same time complexity), but also self-sustainable. We start with introducing some useful concepts.

      Definition 3.1.

The scaled graph $G'$ of $G$ has the same topology ($V$ and $E$) and type function $\gamma$ as $G$, but a different weight function $c'$:

$$c'(v) = c(v) \times \left(1 - \frac{1}{M_{\gamma(v)}}\right)$$

      Definition 3.2.

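A critical path $\pi = \{v_1, \dots, v_p\}$ of an execution sequence of $G$ is a complete path of $G$ satisfying the following condition:

$$\forall v_i \in \pi \setminus \{v_1\}: \quad f(v_{i-1}) = \max_{u \in pre(v_i)} f(u)$$

where $f(u)$ is the finish time of $u$ in this execution sequence.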

For example, a complete path is the critical path for the execution sequence shown in Figure 2(a), while a complete path is not a critical path of this execution sequence since the ’s finish time is not the latest among all the predecessors of .

A task may generate (infinitely) many different execution sequences at runtime, and it is in general unknown which complete path in is the critical path that leads to the WCRT. In the following, we assume an arbitrary complete path to be a critical path, and derive upper bounds for the response time of this particular critical path. Then by getting the maximum bound among all possible paths in , we can safely bound the WCRT of .

We divide into segments , , , . For each , we define

and let . We define

  • : the accumulative length of time intervals in during which is executing;

  • : the accumulative length of time intervals in during which is not executing.

Fig. 3: Illustration of , and .

Obviously, . Figure 3 illustrates , and . In general the time intervals counted in or may not be continuous (e.g., in Figure 3). We further define

and we know .

      Lemma 3.1.

Let be a critical path of an arbitrary execution sequence of , then and can be bounded by

(2)
(3)
Proof.

The proof of (2) is trivial. In the following, we focus on the proof of (3). By the definition of critical path, we know all the predecessors of $v_i$ have finished by time $f(v_{i-1})$. Therefore, when $v_i$ is not executing in $[f(v_{i-1}), f(v_i)]$, all the cores of type $\gamma(v_i)$ must be occupied by vertices of type $\gamma(v_i)$ not on the critical path. Since the total workload of vertices of type $s$ that are not on the critical path is at most

$$vol_s(G) - \sum_{v \in \pi \wedge \gamma(v) = s} c(v)$$

and the number of cores of type $s$ is $M_s$, the accumulated length of time intervals during which the vertices of type $s$ on $\pi$ are not executing is bounded by (3). ∎

      Theorem 3.1 (New-B-1).

The WCRT of $G$ is bounded by:

$$R \le len(G') + \sum_{s \in S} \frac{vol_s(G)}{M_s} \qquad (4)$$

where $G'$ is the scaled graph of $G$.

Proof.

By (2), (3) and we have

is the response time of the execution sequence with critical path . Since and have the same topology, is also a complete path in , so is bounded by . Therefore, is bounded by (4). ∎

We can compute $vol_s(G)$ and construct $G'$ based on $G$ in $O(|V|)$ time, and compute $len(G')$ in $O(|V| + |E|)$ time [11]. Therefore, the overall time complexity to compute NEW-B-1 is $O(|V| + |E|)$, which is the same as OLD-B. By comparing the two bounds we can conclude:

      Corollary 3.1.

NEW-B-1 strictly dominates OLD-B.

Finally, we can easily see the bound in (4) is non-increasing with respect to each $M_s$, so we can conclude:

      Corollary 3.2.

NEW-B-1 is self-sustainable with respect to each $M_s$.
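As an illustration, NEW-B-1 can be sketched in a few lines. The sketch assumes the scaled weight is $c'(v) = c(v)(1 - 1/M_{\gamma(v)})$ and the bound is $len(G') + \sum_s vol_s(G)/M_s$, which is what the argument of Lemma 3.1 yields; the graph is a hypothetical six-vertex DAG (not Figure 1), and all identifiers are illustrative.

```python
from fractions import Fraction
from functools import lru_cache

# Hypothetical typed DAG: v1 is the source, v6 is the sink.
WCET = {"v1": 1, "v2": 3, "v3": 2, "v4": 4, "v5": 2, "v6": 1}
TYPE = {"v1": 1, "v2": 1, "v3": 2, "v4": 2, "v5": 2, "v6": 1}
SUCC = {"v1": ["v2", "v3"], "v2": ["v4", "v5"], "v3": ["v4"],
        "v4": ["v6"], "v5": ["v6"], "v6": []}


def new_b_1(wcet, vtype, succ, cores, source):
    """NEW-B-1 under the assumed form len(G') + sum_s vol_s(G) / M_s."""
    # Weight function of the scaled graph G' (assumed definition).
    scaled = {v: Fraction(c) * (1 - Fraction(1, cores[vtype[v]]))
              for v, c in wcet.items()}

    @lru_cache(maxsize=None)
    def longest_from(v):
        # Longest scaled path starting at v.
        return scaled[v] + max((longest_from(w) for w in succ[v]), default=0)

    vol_by_type = {}
    for v, c in wcet.items():
        vol_by_type[vtype[v]] = vol_by_type.get(vtype[v], 0) + c

    return longest_from(source) + sum(
        Fraction(total, cores[s]) for s, total in vol_by_type.items())


print(new_b_1(WCET, TYPE, SUCC, {1: 2, 2: 2}, "v1"))
print(new_b_1(WCET, TYPE, SUCC, {1: 2, 2: 8}, "v1"))
```

On this instance the bound does not grow when cores of type 2 are added, in contrast to the behaviour of OLD-B described in Example 3.1.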

4 The Second New WCRT Bound

Our first new WCRT bound NEW-B-1 is more precise than OLD-B, but still very pessimistic. Its pessimism comes from the step of bounding the time during which the vertices on the critical path are not executing. Intuitively, the bound in (3) is derived assuming that the workload of vertices not on the critical path is all executed in the shaded areas in Figure 3. However, in reality much of this workload may actually be executed outside these shaded areas. Therefore, the total length of the shaded areas is significantly over-estimated in (3).

In this section, we introduce the second new WCRT bound NEW-B-2, which eliminates the workload of vertices that cannot be executed in the shaded areas, and thus reduces the pessimism of this step.

4.1 WCRT Bound

      Definition 4.1.

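For each vertex $v$, $par(v)$ denotes the set of vertices that have the same type as $v$ but are neither ancestors nor descendants of $v$:

$$par(v) = \{u \in V \mid u \neq v \wedge \gamma(u) = \gamma(v) \wedge u \notin ans(v) \wedge u \notin des(v)\}$$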

      Definition 4.2.

Let $\pi$ be a critical path, $ivs(\pi, s)$ is defined as

$$ivs(\pi, s) = \bigcup_{v \in \pi \wedge \gamma(v) = s} par(v) \qquad (5)$$
      Example 4.1.

Assume is a critical path of the task in Figure 1. We have , , , , , and .

Intuitively, is the set of vertices of type that are not on the critical path but can actually interfere with vertices of type on the critical path (i.e., can be executed in the shaded area in Figure 3). Therefore, can be bounded more precisely as stated in the following Lemma.
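The two set definitions can be sketched as follows. This assumes that the set of Definition 4.1 excludes the vertex itself (consistent with the remark above that the interfering vertices are not on the critical path); the graph is a hypothetical six-vertex DAG (not Figure 1), and the names `par` and `ivs` follow the text while the rest are illustrative.

```python
# Hypothetical typed DAG: v1 is the source, v6 is the sink.
TYPE = {"v1": 1, "v2": 1, "v3": 2, "v4": 2, "v5": 2, "v6": 1}
SUCC = {"v1": ["v2", "v3"], "v2": ["v4", "v5"], "v3": ["v4"],
        "v4": ["v6"], "v5": ["v6"], "v6": []}


def descendants(succ, v):
    """All vertices reachable from v (excluding v itself)."""
    seen, stack = set(), list(succ[v])
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(succ[u])
    return seen


def par(succ, vtype, v):
    """Same-type vertices that are neither ancestors nor descendants of v."""
    des = descendants(succ, v)
    anc = {u for u in succ if v in descendants(succ, u)}
    return {u for u in succ
            if u != v and vtype[u] == vtype[v]
            and u not in anc and u not in des}


def ivs(succ, vtype, path, s):
    """Union of par(v) over the vertices v of type s on the path."""
    result = set()
    for v in path:
        if vtype[v] == s:
            result |= par(succ, vtype, v)
    return result


print(par(SUCC, TYPE, "v5"))
print(ivs(SUCC, TYPE, ["v1", "v2", "v4", "v6"], 2))
```

For the path through v2 and v4, vertex v3 has type 2 but is an ancestor of v4, so it is excluded, whereas a bound based on (3) would still charge its WCET.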

      Lemma 4.1.

Let be a critical path of an arbitrary execution sequence of , then is bounded by

(6)
Proof.

To prove the lemma, it is sufficient to prove that at any time instant in when is not executing, all the cores of type must be executing vertices in . We prove this by contradiction. Assuming that at a time instant when is not executing, there exists a core of type which is not executing vertices in , then one of the following two cases must be true:

  • This core is idle at . Since is a critical path, we know all the predecessors of have finished by time , so is eligible for execution at , and thus this core cannot be idle at . Therefore, this case is impossible.

  • This core is executing a vertex at . First we know , and since , by the definition of ivs we know must be a predecessor or a successor of , so we discuss two cases:

    • is a predecessor of . Since is a critical path, we know all the predecessors of have finished by time , so a predecessor of cannot start execution after , which contradicts that is executing at a time instant after .

    • is a successor of . A successor of cannot start execution before , so this is also a contradiction.

    Therefore, this case is also impossible.

In summary, both cases are impossible, so the assumption must be false and the lemma is proved. ∎

      Theorem 4.1 (New-B-2).

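The WCRT of $G$ is bounded by

$$R \le \max_{\pi \in G} R(\pi) \qquad (7)$$

where the maximum is taken over all complete paths $\pi$ of $G$ and

$$R(\pi) = len(\pi) + \sum_{s \in S} \frac{\sum_{v \in ivs(\pi, s)} c(v)}{M_s} \qquad (8)$$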
Proof.

By the same idea as the proof of Theorem 3.1 but using the new bound (6) for instead of (3), we can get

is the response time of the execution sequence with being the critical path. Finally, by getting the maximum bound for all complete paths (assumed to be the critical path), the theorem is proved. ∎

Compared with NEW-B-1, NEW-B-2 bounds the interference suffered by the critical path more precisely, so we have

      Corollary 4.1.

NEW-B-2 strictly dominates NEW-B-1.

The bound in (7) is non-increasing with respect to each $M_s$, so

      Corollary 4.2.

NEW-B-2 is self-sustainable with respect to each $M_s$.
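For small graphs, NEW-B-2 can be evaluated by brute force. The sketch below enumerates every complete path and assumes the per-path bound has the form $len(\pi) + \sum_s (\sum_{v \in ivs(\pi, s)} c(v)) / M_s$; it is exponential in general and is meant only as a reference implementation on a hypothetical six-vertex DAG (not Figure 1). All identifiers are illustrative.

```python
from fractions import Fraction

# Hypothetical typed DAG: v1 is the source, v6 is the sink.
WCET = {"v1": 1, "v2": 3, "v3": 2, "v4": 4, "v5": 2, "v6": 1}
TYPE = {"v1": 1, "v2": 1, "v3": 2, "v4": 2, "v5": 2, "v6": 1}
SUCC = {"v1": ["v2", "v3"], "v2": ["v4", "v5"], "v3": ["v4"],
        "v4": ["v6"], "v5": ["v6"], "v6": []}


def reachable(succ, v):
    """All descendants of v."""
    seen, stack = set(), list(succ[v])
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(succ[u])
    return seen


def complete_paths(succ, source, sink):
    """Yield every path from source to sink (exponentially many in general)."""
    stack = [[source]]
    while stack:
        path = stack.pop()
        if path[-1] == sink:
            yield path
        for w in succ[path[-1]]:
            stack.append(path + [w])


def new_b_2_bruteforce(wcet, vtype, succ, cores, source, sink):
    des = {v: reachable(succ, v) for v in succ}

    def par(v):
        # Same type, neither ancestor nor descendant, and not v itself.
        return {u for u in succ
                if u != v and vtype[u] == vtype[v]
                and u not in des[v] and v not in des[u]}

    best = Fraction(0)
    for path in complete_paths(succ, source, sink):
        bound = Fraction(sum(wcet[v] for v in path))
        for s in cores:
            interfering = set().union(*(par(v) for v in path if vtype[v] == s))
            bound += Fraction(sum(wcet[u] for u in interfering), cores[s])
        best = max(best, bound)
    return best


print(new_b_2_bruteforce(WCET, TYPE, SUCC, {1: 2, 2: 2}, "v1", "v6"))
```

With two cores of each type, NEW-B-1 evaluates to 11 on this graph under the form assumed for (4), so this instance shows a case where the second bound is tighter.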

4.2 Strong NP-Hardness

Fig. 4: The constructed typed DAG task for 3-SAT problem instance , where , , , .

NEW-B-2 requires computing the maximum of $R(\pi)$ among all complete paths in the graph $G$. It is computationally intractable to explicitly enumerate all the paths, the number of which is exponential. Can we develop efficient algorithms of (pseudo-)polynomial complexity to compute this maximum?

Unfortunately, this is impossible unless P = NP.

      Theorem 4.2.

The problem of computing NEW-B-2 is strongly NP-hard.

Proof.

We will prove the theorem by showing that even a simpler problem of verifying whether is larger than a given value is strongly NP-hard, which is proved by a reduction from the 3-SAT problem.

Let be an arbitrary instance of the 3-SAT problem, which has clauses and variables . Each clause , , consists of three literals, and each literal is a variable or the negation of a variable. We construct a typed DAG corresponding to the 3-SAT instance as follows:

  • We first construct vertices of type with . is the source vertex of and is the sink vertex.

  • For each clause , we construct a vertex of type with , as well as two edges and .

  • For each variable , we construct two paths from to :

    • Positive path, which includes a vertex of type if and only if clause includes a literal .

    • Negative path, which includes a vertex if and only if clause includes a literal .

    The WCET of each vertex on these two paths is .

Note that there are in total types in the above constructed DAG. Finally, we set and . The above construction is polynomial as there are no more than vertices in the constructed graph. For illustration, an example of the above construction is given in Figure 4.

In the following we prove that the 3-SAT problem instance is satisfiable if and only if the bound of the above constructed graph is strictly greater than .

First, a complete path that leads to the largest must be one of those traversing . The choice between the positive and negative path between and corresponds to the choice between assigning or to variable in the 3-SAT problem.

Since each vertex is neither an ancestor nor a descendant of any vertex on paths traversing , is included in if and only if the path contains at least one vertex of type . This corresponds to that is satisfied only if it contains at least one literal assigned with value . Therefore, we can conclude that all vertices are included in the corresponding if and only if all clauses contain at least one literal assigned with value , i.e., the 3-SAT problem instance is satisfiable.

Therefore, the second item of RHS of (8) equals if and only if is satisfiable. Moreover, there are at most vertices corresponding to the positive and negative values of the variables along any path traversing , so their total WCET must be in the range . Therefore, the length of any such path must be in the range , and thus in the range . Therefore, is larger than if and only if all vertices are included in the corresponding , i.e., the 3-SAT problem instance is satisfiable. ∎

4.3 Computation Algorithm

The construction in the above strong NP-hardness proof uses different types (where is the number of clauses in 3-SAT). In realistic heterogeneous multi-core platforms, the number of core types is usually not very large. Will the problem of computing remain NP-hard if the number of types is a bounded constant? In the following, we will present an algorithm to compute with complexity , which shows that the problem is actually in P if the number of types is a constant.

We first describe the intuition of our algorithm. Instead of explicitly enumerating all the possible paths, our algorithm will use abstractions to represent paths in the graph searching procedure. More specifically, a path starting from the source vertex of and ending at some vertex is abstractly represented by a tuple , where and are defined in Definition 4.3 and 4.4 in the following. The tuple will be updated when the path is extended from to its successor , and eventually when the path is extended to the sink vertex, is the for this path. The algorithm starts with a single tuple corresponding to the path consisting only the source vertex and repeatedly extends the paths until they all reach the sink vertex, then the maximal among all the kept tuples is the desired bound . The abstraction is compact, so that many different path histories ending with the same vertex can be represented by a single abstraction and the total number of abstractions generated in the computation is polynomially bounded.

      Definition 4.3.

For a path in and a type , we define

and

(9)

Intuitively, is the vertex on path that is the closest to among all the vertices of type .

      Example 4.2.

In the typed task in Figure 1, there are paths from to . For the path , one can derive and . For the path , one can derive and .

      Definition 4.4.

For a path in we define:

(10)

where

(since may be , we let for completeness).

      Lemma 4.2.

For a complete path , it always holds

(11)
Proof.

The proof goes in two steps: (1) Rewrite into a non-recursive form and prove , and (2) Prove .

We first define as follows:

Now we prove by induction:

  • Base case. Both and equal , so the claim holds for the base case with .

  • Inductive step. Suppose , we want to prove . First, we know

    (12)

    For simplicity we let

    so (12) can be rewritten as

By now, we have proved . In the following we prove .

Let be the subsequence of containing all vertices in with type . By the definition of :

(13)

In the following we will prove

(14)

If (14) is true, then by (14) and (13) we have

by which is proved.

In the following, we focus on proving (14). We use LHS and RHS to represent the left-hand side and right-hand side of (14), respectively. In the following, we will prove that both LHS $\le$ RHS and LHS $\ge$ RHS hold.

  1. LHS $\le$ RHS. This is proved by combining the following two claims:

    1. Any counted in LHS is also counted in RHS. If is counted in LHS, then must be in some , and by the definition of , we know must be also in , so we can conclude that all the counted in LHS are also counted in RHS.

    2. Each is counted in LHS at most once. Suppose this is not true, then there exists some such that and for some , so it must be the case that

      By , we know is not a descendant of , and since is a predecessor of , is not a descendant of either. On the other hand, by we know is not an ancestor of , and since is a descendant of , is not an ancestor of . In summary, is neither a descendant nor an ancestor of , and has the same type as , which contradicts .

  2. RHS LHS. It is obvious that each is counted at most once in the RHS, so it suffices to prove that each counted in the RHS is also counted in LHS. Since , must be in some . Suppose is the smallest index such that but , then is counted in item (if then is counted in ).

In summary, we have proved both LHS $\le$ RHS and RHS $\le$ LHS, so (14) is true. ∎
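The path abstraction can be sketched as a dynamic program over a topological order. The sketch is an interpretation, not the authors' algorithm: it assumes the abstraction keeps, for each path prefix, the last vertex of every type, and that the accumulated value grows by $c(w)$ plus the WCET of the vertices in $par(w)$ that are not already in $par$ of the previous same-type vertex, divided by the core count. For each pair (end vertex, last-vertex vector) only the largest accumulated value is kept. The graph is hypothetical (not Figure 1) and all identifiers are illustrative.

```python
from fractions import Fraction

# Hypothetical typed DAG: v1 is the source, v6 is the sink.
WCET = {"v1": 1, "v2": 3, "v3": 2, "v4": 4, "v5": 2, "v6": 1}
TYPE = {"v1": 1, "v2": 1, "v3": 2, "v4": 2, "v5": 2, "v6": 1}
SUCC = {"v1": ["v2", "v3"], "v2": ["v4", "v5"], "v3": ["v4"],
        "v4": ["v6"], "v5": ["v6"], "v6": []}


def topological_order(succ, source):
    order, seen = [], set()

    def visit(v):
        if v not in seen:
            seen.add(v)
            for w in succ[v]:
                visit(w)
            order.append(v)   # post-order; reversed below

    visit(source)
    return order[::-1]


def new_b_2_dp(wcet, vtype, succ, cores, source, sink):
    order = topological_order(succ, source)
    des = {v: set() for v in succ}
    for v in reversed(order):              # descendants, sink first
        for w in succ[v]:
            des[v] |= {w} | des[w]
    par = {v: frozenset(u for u in succ
                        if u != v and vtype[u] == vtype[v]
                        and u not in des[v] and v not in des[u])
           for v in succ}
    types = sorted(cores)

    def vol(vertices):
        return sum(wcet[u] for u in vertices)

    # states[v]: (last vertex of each type on the prefix) -> largest value
    states = {v: {} for v in succ}
    first = tuple(source if vtype[source] == s else None for s in types)
    states[source][first] = (Fraction(wcet[source])
                             + Fraction(vol(par[source]), cores[vtype[source]]))
    for v in order:
        for last, value in states[v].items():
            for w in succ[v]:
                i = types.index(vtype[w])
                prev = last[i]
                already = par[prev] if prev is not None else frozenset()
                new_value = (value + wcet[w]
                             + Fraction(vol(par[w] - already), cores[vtype[w]]))
                new_last = last[:i] + (w,) + last[i + 1:]
                if new_value > states[w].get(new_last, -1):
                    states[w][new_last] = new_value
    return max(states[sink].values())


print(new_b_2_dp(WCET, TYPE, SUCC, {1: 2, 2: 2}, "v1", "v6"))
```

Keeping only the largest value per state is safe under these assumptions because every later increment depends only on the end vertex and the last-vertex vector. The number of such vectors per vertex is at most $(|V| + 1)^{|S|}$, which is polynomial when the number of types is a constant.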

By Lemma 4.2, we know that by using the abstract tuple to extend the path, eventually, we can precisely compute . Therefore, we can use as the abstraction of paths to perform graph searching. All the paths corresponding to the same tuple can be abstractly represented by a single tuple instead of recording each of them individually. Actually, even paths corresponding to different tuples can be merged together during the graph searching procedure by the domination relation among tuples defined as follows:

      Definition 4.5.

Given two tuples and with the same vertex , dominates , denoted by

if both of the following conditions are satisfied:

  1. : either or