Simultaneous Progressing Switching Protocols for Timing Predictable Real-Time Network-on-Chips

09/19/2019 ∙ by Niklas Ueter, et al. ∙ 0

Many-core systems require inter-core communication, and network-on-chips (NoCs) have been demonstrated to provide good scalability. However, not only the distributed structure but also the link switching on the NoCs have imposed a great challenge in the design and analysis for real-time systems. With scalability and flexibility in mind, the existing link switching protocols usually consider each single link to be scheduled independently, e.g., the worm-hole switching protocol. The flexibility of such link-based arbitrations allows each packet to be distributed over multiple routers but also increases the number of possible link states (the number of flits in a buffer) that have to be considered in the worst-case timing analysis. For achieving timing predictability, we propose less flexible switching protocols, called Progressing Switching Protocols (SP^2), in which the links used by a flow either all simultaneously transmit one flit (if it exists) of this flow or none of them transmits any flit of this flow. Such an all-or-nothing property of the SP^2 relates the scheduling behavior on the network to the uniprocessor self-suspension scheduling problem. We provide rigorous proofs which confirm the equivalence of these two problems. Moreover, our approaches are not limited to any specific underlying routing protocols, which are usually constructed for deadlock avoidance instead of timing predictability. We demonstrate the analytical dominance of the fixed-priority SP^2 over some of the existing sufficient schedulability analysis for fixed-priority wormhole switched network-on-chips.


I Introduction

Power dissipation has constrained the performance scaling of single-core systems in the past decade. Instead of increasing the operational frequency of a processor, the chip manufacturers have shifted their design focus towards chips with multiple or many cores that operate at lower voltages and frequencies than their single-core counterparts. In a multi- or many-core system, communication and synchronization of the applications executed on different cores have to be designed efficiently from both hardware and software perspectives.

The communication fabric of a multi- or many-core platform must scale with the number of cores. Otherwise, the computation capacity of the cores may be wasted if they are waiting for communication, synchronization, or memory access. One possible approach to achieve good scalability of the communication is the Network-on-Chip (NoC) architecture, in which a switched network with routers is used to provide the interconnection of the physical cores on a chip. The NoC architecture allows parallel inter-core communication with moderate hardware costs. NoCs are the prevalent choice of interconnection due to their overall good performance and scalability potential, as reported by Kavaldjiev et al. [Kavaldjiev].

The efficiency of a NoC highly depends on many design factors, including topology, routing protocols, flow control, switching arbitration protocols, etc. The currently available multi-core platforms based on NoCs have employed different topologies, e.g., a ring in the Intel Xeon Phi 3120A, a 2-D torus in MPPA Manycores by Kalray, and a 2-D mesh in Tilera TILE-Gx8036. Moreover, many different communication protocols and communication topologies have been proposed and evaluated in the literature. Specifically, flit-based network-on-chips (NoCs) have been proposed with the goal to decrease production cost and increase energy efficiency due to less complex routers and decreased buffer sizes as compared to other approaches.

Real-time system design is concerned with the construction of systems that can be formally verified to satisfy timeliness constraints. Such real-time constraints are prevalent in, e.g., timing-sensitive applications in embedded mobile platforms, automotive, and aerospace applications. To construct a hard real-time system on a NoC, each hard real-time message (defined as an instance of a sporadic/periodic flow) in the NoC has to be successfully transmitted from its source to its destination before its deadline.

The approaches for real-time systems on a NoC apply two general strategies. One is to utilize time-division-multiplexing (TDM) to ensure that the timing constraints are satisfied by constructing the transmission schedule statically with a repetitive table, e.g., in [Goosens2005, Paukovits2008ConceptsOS, Stefan2012, DBLP:journals/tvlsi/KasapakiSSMGS16, DBLP:conf/fpl/Schoeberl07, DBLP:conf/date/MillbergNTJ04, DBLP:journals/jsa/SchoeberlAAACGG15, DBLP:conf/rtns/HardeFBC18]. Another is to apply a priority-based dynamic scheduling strategy in the routers (and switches) to arbitrate the flits in the network, e.g., in [365629, 627905, 708526, 1466499, Shi:RCA:2008, Kashif:2014, Kashif:2016, XIONG:2016, DBLP:journals/corr/IndrusiakBN16, Xiong:2017, Indrusiak:DATE:2018]. The difficulty of the TDM strategy is to construct the TDM schedule and the global clock synchronization, whilst the difficulty of the priority-based scheduling strategy is to validate the schedulability, i.e., whether all messages can meet their deadlines in the worst case.

Specifically, for dynamic scheduling strategies, the wormhole-switched fixed-priority NoC with preemptive virtual channels has recently been studied in a series of papers. The first attempts to tackle the schedulability analysis were in 1994 in [365629] and 1997 in [627905]. Both of them were found to be flawed in 1998 by Kim et al. [708526], whose analysis was later found to be erroneous in 2005 by Lu et al. [1466499]. The series of erroneous analyses continued in [365629, 627905, 708526, 1466499] until a seemingly correct result by Shi and Burns [Shi:RCA:2008] published in 2008. Eight years later, Xiong et al. [XIONG:2016] pointed out the analytical flaw in [Shi:RCA:2008] and disproved the safe bounds in [Kashif:2014, Kashif:2016]. The erroneous patch in [XIONG:2016] was later fixed by the authors in their journal revision [Xiong:2017] in 2017. In the meantime, Indrusiak et al. [DBLP:journals/corr/IndrusiakBN16, Indrusiak:DATE:2018] presented new analyses but admitted that they were not able to provide a formal proof of correctness. They supported their analyses by evaluating concrete cases, i.e., whether there was any observed case which was claimed to meet the deadlines but in fact missed the deadlines. However, such case studies cannot validate the correctness of their analyses, as also stated by Indrusiak et al. [DBLP:journals/corr/IndrusiakBN16, Indrusiak:DATE:2018]. Specifically, among the 10 results [365629, 627905, 708526, 1466499, Shi:RCA:2008, Kashif:2014, Kashif:2016, XIONG:2016, DBLP:journals/corr/IndrusiakBN16, Xiong:2017] published up to 2017, Table VII in [DBLP:journals/corr/IndrusiakBN16] shows that eight of them were already disproved by counterexamples, and two of them are probably safe as no counterexamples have been given. Nikolić et al. [Nikolic2019] published the most recent result for the worst-case response time analysis.

The fact that almost all proposed analyses for the problem discussed in the previous paragraph have been found flawed suggests that the scheduling algorithm and architecture can be potentially too complex to be correctly analyzed. These approaches have adopted the well-known worst-case response time analysis for uniprocessor sporadic real-time tasks under fixed-priority uniprocessor scheduling developed in [lehoczky89, joseph86responsetimes], but they have never shown the connection between worm-hole switching and uniprocessor scheduling.

Another research line to analyze the worst-case response time of wormhole-switched fixed-priority NoC with preemptive virtual channels is to apply Network Calculus and Compositional Performance Analysis (CPA), or their extensions, to analyze the transmission on the links in a compositional manner, e.g., [NC-Wormhole-WRR-2009, MPPA-ERTS-WCTT-2018, NoC-WCTT-RC-NC-WFCS-2016, Wormhole-BP-2015, DBLP:conf/date/TobuschatE17, DBLP:conf/date/RamboE15]. The worst-case response time of a flow is the sum of the worst-case response times on individual links, which can be pessimistic. One exception is the analysis from Giroudot and Mifdaoui [DBLP:conf/rtas/GiroudotM18], which applies the Pay Multiplexing Only Once principle by Schmitt et al. [5755042].

Contributions: In this paper, we revisit the fundamental algorithmic problem of flit-based NoC arbitration protocols with respect to real-time constraints and point out its fundamental algorithmic complexity. In addition, we identify the analytic pessimism of wormhole-switched fixed-priority arbitration protocols: the large degree of uncertainty in the system behaviour makes timeliness properties hard to verify.

This paper intends to answer a few unsolved fundamental questions for real-time networking switching in a NoC and has the following contributions:

  • What is the fundamental difficulty of worst-case timing analysis when flit-based transmissions are handled by switch-based (link-based) scheduling? We will show that the difficulty is mainly due to the explosion of the possible progression states of a message in the transmission path. The existing analyses did not intend to prove the coverage of all possible progression states. Moreover, we will also argue why priority-based scheduling without controlling the number of possible progression states is therefore difficult to analyze and optimize.

  • Is there any equivalence between existing uniprocessor scheduling and NoC switching? Yes, but, to the best of our knowledge, such a protocol has never been designed. For achieving timing predictability, we propose less flexible switching protocols, called Simultaneous Progressing Switching Protocols (SP^2), in which the links used by a message either all simultaneously transmit one flit of this message or none of them transmits any flit of this message. Such an all-or-nothing property of the SP^2 relates the scheduling behavior on the network to the uniprocessor self-suspension scheduling problem. We provide rigorous proofs which confirm the equivalence of these two problems.

  • We demonstrate the analytical dominance of the (work-conserving) fixed-priority version of our approach over the existing sufficient schedulability analyses for fixed-priority wormhole-switched NoCs in [Xiong:2017, Indrusiak:DATE:2018]. Moreover, our approaches are not limited to any specific underlying routing protocol, as routing protocols are usually constructed for deadlock avoidance instead of timing predictability.

II System Model and Problem Definition

Network-on-Chips (NoCs) are characterized by the topology, routing protocol, arbitration, buffering, flow control mechanism, and switching protocol.¹ In this paper, we define a NoC as a collection of cores A, routers V, and links L. Each router is connected to at least one other router by two physically separate links, i.e., an up-link and a down-link. Figure 1 illustrates a 3x3 meshed network with its cores, its routers, and the pairs of links between neighboring routers as well as between each core and its source-router. We assume that all the cores, routers, and links are homogeneous. Therefore, the transmission rate and processing capability are identical.

¹Our notation of flows is equivalent to tasks and our notation of messages is equivalent to jobs in the classical notation of the real-time systems community.

Fig. 1: Exemplary 3x3 mesh NoC. Each application is connected to a source-router where it injects messages.

II-A Switching Mechanisms

With respect to switching, three different switching protocols have been established, namely, circuit switching, store-and-forward switching, and wormhole switching.

II-A1 Circuit Switching

A packet (transmission unit) is forwarded by the routers through dedicated routes that are reserved/allocated until the transmission is finished. Therefore, each transmission can only be preempted during the establishment of a route. An advantage of this approach is that no buffers are required and that any deadlock-free routing can subsequently be used, which allows for optimized and adaptive routing schemes. However, the overhead to establish the routes may render this approach infeasible when small packets are injected frequently.

II-A2 Store-And-Forward Switching

Routers can only forward a packet once it is completely received and stored, which implies that the routers must provide sufficient buffer capacity to store a complete packet. Fortunately, the arbitration protocol is suitable for real-time analysis, since a packet may compete for at most one link at each point in time.

II-A3 Wormhole Switching

Each packet is divided into smaller transmission units, called flits, always including a designated header and a designated tail flit which are used for control and routing. That is, each payload flit follows the output port of the header flit. In fixed-priority wormhole-switched NoCs, each router contains virtual channels, i.e., separate buffers that contain flits of a single packet. Once the tail flit is transmitted and removed from the buffer, the virtual channel can be used for flits of another packet. Furthermore, the highest-priority flit is scheduled to transmit over the link at each router. In this approach, complete packets do not have to be buffered, which allows smaller buffers in the hardware design. On the downside, each packet may be distributed over multiple routers and subsequently compete for multiple links at the same time, which makes the timing analysis complex. Additionally, the limited number of virtual channels and full buffers on the receiving router add additional interference that complicates the analysis.

II-B Messages and Periodic/Sporadic Flows

A periodic (sporadic) flow τ_i generates an infinite sequence of flow instances, called messages, and has the following parameters:

  • T_i is the minimum inter-arrival time or period of the flow τ_i, i.e., for a periodic flow one message is released exactly every T_i time units and for a sporadic flow two subsequent messages are separated by at least T_i.

  • P_i is the static routing path of the flow τ_i, i.e., P_i is the sequence of the links that a message of τ_i has to be transmitted on. We assume that a physical link cannot be used more than once in the static routing path P_i for any τ_i.

  • C_i is the worst-case number of flits of a message of τ_i.

  • D_i is the relative deadline of the flow τ_i. That is, when a message is injected at time t, its absolute deadline is t + D_i. Our protocols are not restricted to any specific relation of the minimum inter-arrival time T_i and the relative deadline D_i. However, our timing analysis will focus on the constrained-deadline cases, where D_i ≤ T_i.
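As a minimal sketch, the flow model above can be captured in code. The field names T, P, C, and D mirror the period, routing path, flit count, and relative deadline parameters; the concrete link labels are purely illustrative and not part of the paper's model:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Flow:
    """A periodic/sporadic flow with the parameters of Section II-B."""
    T: int          # minimum inter-arrival time (period)
    P: List[str]    # static routing path: ordered, duplicate-free link list
    C: int          # worst-case number of flits per message
    D: int          # relative deadline

    def is_constrained_deadline(self) -> bool:
        # the paper's timing analysis focuses on the case D <= T
        return self.D <= self.T

    def absolute_deadline(self, t: int) -> int:
        # a message injected at time t must finish by t + D
        return t + self.D

# hypothetical flow crossing three links of the mesh in Figure 1
flow = Flow(T=100, P=["A1->R1", "R1->R2", "R2->A2"], C=4, D=80)
assert flow.is_constrained_deadline()
assert flow.absolute_deadline(10) == 90
```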

II-C Problem Definition

The scheduler design problem studied in this paper is defined as follows: We are given a NoC, defined as a collection of cores A, routers V, and links L. For a given set F of sporadic or periodic flows on the NoC, the objective is to design a switching mechanism (scheduling algorithm) that can ensure that all messages (instances of the flows) meet their deadlines.

The schedulability test problem studied in this paper is defined as follows: We are given a NoC, defined as a collection of cores A, routers V, and links L. For a given set F of sporadic or periodic flows on the NoC and a switching mechanism, the objective is to validate whether the messages (instances of the flows) can meet their deadlines.

We assume that the cores and routers are synchronized perfectly with respect to time. That is, there is no clock drift in the NoC. Otherwise, the clock drift must be considered carefully. One solution is to introduce additional delays and interferences to pessimistically bound the impact due to clock drifts. Moreover, we assume discrete time, i.e., the NoC operates in the granularity of a fixed time unit and the finest granularity is the flit.

III Existing Analytical Approaches for Worm-Hole Switching

In this section, we will first summarize the existing analytical approaches for the worm-hole switching mechanism in Section III-A. Then, we will explain the mismatch of the existing analyses and the underlying uniprocessor scheduling in Section III-B.

III-A Summary of Existing Analyses

A first analytical approach to determine the worst-case response time of sporadic traffic flows in wormhole-switched fixed-priority network-on-chips was given by Mutka [365629] and Hary and Ozguner [627905]. Both of them are based on the schedulability analysis for uniprocessor sporadic real-time tasks under fixed-priority scheduling developed in [lehoczky89, joseph86responsetimes]. To analyze the worst-case response time of a flow τ_i, they considered the complete path P_i as a single shared resource, i.e., a uniprocessor. This shared resource may not always be available for τ_i, and they modeled the unavailability by only considering the higher-priority flows that use any link in P_i, called direct interference. They concluded that the problem is equivalent to fixed-priority uniprocessor scheduling, which was disproved by Kim et al. [708526], who showed that the flow τ_i can suffer from interference due to a flow τ_j even if P_i and P_j have no intersection, called indirect interference. By extending the notion of interference sets developed by Kim et al. [708526], Lu et al. [1466499] proposed to discriminate between flows that could not interfere with each other to reduce the pessimism of the analysis.

However, both of the approaches in [708526, 1466499] assume that the synchronous release of the first messages of the sporadic real-time flows is the worst case, i.e., similar to the critical instant theorem in classical uniprocessor fixed-priority scheduling proposed by Liu and Layland [liu73scheduling]. This statement was later disproved in 2008 by Shi and Burns [Shi:RCA:2008], where jitter terms were added to model the asynchronous release of the first messages of the sporadic real-time flows. Based on the results of this work, Kashif and Patel proposed a link-based analysis called stage-level analysis [Kashif:2014, Kashif:2016] to achieve a tighter analysis. Both analyses were proved to be unsafe by Xiong et al. [XIONG:2016] using simulations. It was discovered that a flit of a higher-priority flow may induce interference more than once, i.e., on multiple routers, thus rendering the conjectures made by Shi and Burns [Shi:RCA:2008] and Kashif and Patel [Kashif:2014, Kashif:2016] false. This behavior is referred to as multi-point progressive blocking by Indrusiak et al. [Indrusiak:DATE:2018]. The state of the art with respect to fixed-priority wormhole-switched networks-on-chips with infinite buffers is represented by [DBLP:journals/corr/IndrusiakBN16, Xiong:2017]. Unfortunately, the infinite-buffer assumption is infeasible in real systems; thus back-pressure effects that occur due to limited buffer sizes in the routers have to be considered. In the work of Indrusiak et al. [Indrusiak:DATE:2018], the authors incorporate buffer sizes into the worst-case response time analysis. Unfortunately, the authors admit that they are not able to provide a formal proof of correctness of their models and analyses; thus further counterexamples may be found. The fact that almost all proposed analyses have been found to be flawed suggests that the scheduling algorithm and architecture are too complex to be reasonably analyzed. Further evidence for this claim is that in the analyses provided by Indrusiak et al. [Indrusiak:DATE:2018], increased buffer sizes lead to increased worst-case response times. Nikolić et al. [Nikolic2019] presented an improved analysis over the results in [Indrusiak:DATE:2018, Xiong:2017].

Motivated by this, we revisit the fundamental algorithmic problem of packet-based network-on-chip scheduling and identify the analytic pessimism incurred by the complexity of link-based arbitration as harmful both to routing and to verification. All the above results in [365629, 627905, 708526, 1466499, Shi:RCA:2008, Kashif:2014, Kashif:2016, XIONG:2016, DBLP:journals/corr/IndrusiakBN16, Xiong:2017, Indrusiak:DATE:2018, Nikolic2019] assumed that the schedulability analysis is somehow related to a corresponding uniprocessor fixed-priority scheduling problem. However, this has never been formally proved. We will explain the mismatch by using the progression model in Section III-B.

III-B Progression Model

Fig. 2: Progressions of a message which involves 2 cores and 2 routers, i.e., 3 links, when C_i = 2. The numbers associated to an edge indicate the change of the buffers in the vector b. The red dashed path illustrates the beginning of a fastest progression and the blue dashed path illustrates the beginning of a slowest progression.

In this subsection, we detail why the link-based arbitration problem does not match the uniprocessor scheduling model and illustrate the subsequent problems in response-time analyses using uniprocessor scheduling theory. To explain the mismatch, we will focus on the possible buffer states of one instance (i.e., message) of a flow τ_i under analysis. Let b(t) denote the state vector of the number of flits that are buffered in the cores and routers involved in the path P_i at time t. Suppose that there are n links involved in P_i, so b(t) has n+1 elements. Note that the first element in b(t) denotes the number of flits of the whole message of τ_i in the source core to be sent and the last element in b(t) denotes the number of flits that have been received at the destination core.

Recall our assumption in Section II-C that the NoC is fully synchronized in time. Therefore, in each time unit, a buffered flit can be forwarded to the next node (a core or a router). Since the NoC works in discrete time, we can observe the change of the vector b(t) over time when considering only the time units at which the message is sent.

When a flit is sent in a time unit at the j-th link in P_i, then the buffer of the j-th node is reduced by 1 and the buffer of the (j+1)-th node in P_i is increased by 1. For notational brevity, let Δ_j be a vector of n+1 elements in which all the elements are 0 except the j-th element, which is −1, and the (j+1)-th element, which is 1. For example, Δ_1 + Δ_2 implies that the first and the second links send one flit forward in this time unit. Moreover, Δ_1 + Δ_3 implies that the first and the third links send one flit forward in this time unit.

Definition 1 (Progression).

Consider a buffer state b(t) at time t, in which all elements in b(t) are non-negative integers. Suppose x_j is either 0 or 1 for j = 1, …, n and at least one of them is 1. Specifically, when x_j is 1, the j-th link sends one flit forward.

Let b(t+1) be b(t) + Σ_{j=1}^{n} x_j · Δ_j. For a buffer state b(t), for x_j ∈ {0, 1} for j = 1, …, n, and the vectors Δ_j defined above, the change of the buffer state is valid if

  • all elements in b(t+1) are non-negative integers, and

  • the j-th element in b(t) is at least x_j for j = 1, …, n.

If the change of the buffer state is valid, we say that the flow makes a progression in this time unit, i.e., one clock cycle. ∎
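Under the stated discrete-time model, a progression step can be sketched as follows. The function (a sketch using the reconstructed indexing, where link j moves one flit from node j to node j+1) applies a 0/1 vector x to a buffer state b and reports whether the change is valid in the sense of Definition 1:

```python
from typing import List, Tuple

def apply_progression(b: List[int], x: List[int]) -> Tuple[bool, List[int]]:
    """Apply a progression vector x (x[j] in {0,1}; entry j means link j+1
    forwards one flit) to buffer state b, where len(b) == len(x) + 1.
    Returns (valid, new_state); on an invalid change, b is left unchanged."""
    n = len(x)
    assert len(b) == n + 1 and any(x), "need at least one forwarding link"
    nb = list(b)
    for j in range(n):
        if x[j]:
            nb[j] -= 1      # the sending node hands one flit to link j+1
            nb[j + 1] += 1  # the receiving node buffers that flit
    valid = all(v >= 0 for v in nb)  # no buffer may go negative
    return valid, (nb if valid else list(b))

# the state b = (1, 1, 0, 0) from the example below Definition 1:
assert apply_progression([1, 1, 0, 0], [0, 1, 0]) == (True, [1, 0, 1, 0])
# forwarding on link 2 is invalid when node 2 holds no flit:
assert apply_progression([1, 0, 1, 0], [0, 1, 0]) == (False, [1, 0, 1, 0])
```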

In a time unit, a link may or may not be utilized to send one flit of τ_i in the switching mechanism. Therefore, there are 2^n − 1 possible combinations of the sums of the Δ_j vectors. Note that progressions do not have to take place in two consecutive time units. If the message is not sent in the next time unit at all, there is no progression of the message. As an illustrative example, consider C_i = 2 and that the message is sent from its source core via two routers to its destination core, i.e., n = 3, in the NoC illustrated in Figure 1. If one flit is sent in a time unit, we get b = (1, 1, 0, 0). Now, there are three possibilities for the next time unit when the NoC transmits a flit or multiple flits of the message:

  • b(t+1) = (1, 0, 1, 0): That is, the source core does not send any flit but the first router sends a flit to the second router. The progression is due to Δ_2.

  • b(t+1) = (0, 2, 0, 0): That is, the source core sends one flit to the first router but the first router does not send a flit to the second router, i.e., the progression is due to Δ_1.

  • b(t+1) = (0, 1, 1, 0): That is, the source core sends one flit to the first router and the first router sends a flit to the second router, which means that the progression is due to Δ_1 + Δ_2.

The first three levels of the tree illustrated in Figure 2 provide the above example. In each of the above states, the next progression has to be considered. Due to space limitations, we only further illustrate the progressions that are possible in one of these states.

Definition 2 (A Series of Progressions).

A series of progressions is a sequence of progressions defined in Definition 1, one after another, starting from b = (C_i, 0, …, 0) and ending at b = (0, …, 0, C_i). ∎

A safe analysis of the worst-case response time or the schedulability for sending the message should consider all possible series of progressions of b starting from (C_i, 0, …, 0) and ending at (0, …, 0, C_i). If we only account for the number of time units when the message of τ_i is transmitted, it is not difficult to see that the slowest series only sends one flit forward per progression, in which the switching mechanism results in C_i · n iterations of progressions. Moreover, the fastest series sends one flit (if available) forward for all cores and routers involved in the path per progression, in which the switching mechanism results in C_i + n − 1 iterations of progressions.

All the results in [365629, 627905, 708526, 1466499, Shi:RCA:2008, Kashif:2014, Kashif:2016, XIONG:2016, DBLP:journals/corr/IndrusiakBN16, Xiong:2017, Indrusiak:DATE:2018, Nikolic2019] made the assumption that the corresponding uniprocessor scheduling problem can use the length of the fastest series of progressions as the worst-case execution time of the corresponding sporadic task to represent the flow τ_i. This assumption implicitly implies that the flow takes the fastest series of progressions explained above. Such uniprocessor analyses are only valid when the other iterations of progressions are accounted for correctly. However, the fastest series of progressions for τ_i is not always possible in the worst case. To ensure the correctness of the analysis, some additional time units should be included. Many patches have been provided to account for such additional time units after the series of flaws found in [365629, 627905, 708526, 1466499, Shi:RCA:2008, Kashif:2014, Kashif:2016, XIONG:2016].

Informally speaking, the researchers in [365629, 627905, 708526, 1466499, Shi:RCA:2008, Kashif:2014, Kashif:2016, XIONG:2016, DBLP:journals/corr/IndrusiakBN16, Xiong:2017, Indrusiak:DATE:2018, Nikolic2019] have tried to construct their analyses by linking the problem to a corresponding uniprocessor scheduling problem. Most of them were later found flawed, e.g., [365629, 627905, 708526, 1466499, Shi:RCA:2008, Kashif:2014, Kashif:2016, XIONG:2016], or lack a formal proof, e.g., [DBLP:journals/corr/IndrusiakBN16, Indrusiak:DATE:2018, Nikolic2019].² However, none of them has seriously considered all the possible progressions for transmitting a message of τ_i. Whether the worst-case response time analysis for preemptive worm-hole switching is equivalent to any specific form of uniprocessor scheduling problem remains an open question.

²The proofs in [Nikolic2019] did not consider the equivalence of the worst-case response time analysis adopted in uniprocessor systems and the analysis of a NoC. Instead, they emphasized the quantification of different types of interference. Moreover, in many places in the proofs, e.g., the building blocks of Lemmas 3, 4, and 6 in [Nikolic2019], the derivation is based on examples.

We strongly believe that the worst-case response time analysis or the schedulability analysis for wormhole-switched fixed-priority network-on-chips is highly complex, as the timing behavior is not uniprocessor equivalent, as reported in the literature. In a uniprocessor system, if a job is executed for δ time units, the remaining execution time of the job is reduced by δ time units. However, sending the C_i flits of a message can follow many different series of progressions in the NoC.

If we would like to consider the complete path P_i as a single shared resource, like in [365629, 627905, 708526, 1466499, Shi:RCA:2008, Kashif:2014, Kashif:2016, XIONG:2016, DBLP:journals/corr/IndrusiakBN16, Xiong:2017, Indrusiak:DATE:2018, Nikolic2019], and analyze the worst-case behavior by utilizing the corresponding instance of the uniprocessor scheduling problem, the mapping from the series of progressions to the uniprocessor problem must be formally proved.

Please note that we do not claim that such a mapping is impossible; we only state the mismatch. Such mappings are potentially very difficult to achieve precisely due to the large search space. However, safe approximations and upper bounds are also missing in the literature. In both cases, a correct proof should explain how to safely account for the number of iterations in the progressions of the flows and map them to the corresponding execution time in the constructed instance of the uniprocessor scheduling problem.

IV Simultaneous Progressing Switching Protocols

To the best of our knowledge, there is no formal proof to demonstrate the connection between the progressions of the messages on a NoC and the corresponding instance of the uniprocessor scheduling problem. We note that the analytical difficulty is potentially due to the flexibility introduced in the switching mechanism. In the position papers by Wilhelm et al. [DBLP:journals/tcad/WilhelmGRSPF09] and Axer et al. [Axer:2014], the authors state that system properties that are subject to predictability constraints should already be considered and guaranteed from the design. Since the worm-hole switching protocol was not designed with predictability constraints in mind, designing new protocols that can be safely analyzed without losing too much flexibility or efficiency is an alternative.

Instead of proving the complex scenarios in the standard worm-hole switching, we propose another protocol which has only one series of progressions. This less flexible switching protocol, called the Simultaneous Progressing Switching Protocol (SP^2), achieves timing predictability by enforcing that a flow is eligible to transmit on its route if and only if it can be allocated all links in P_i in parallel. In other words, the links used by a flow either all simultaneously transmit one flit of this flow (if it exists) or none of them transmits any flit of this flow. As a result, for a progression of τ_i in a time unit, some links in P_i may be reserved even though there is no flit to be transmitted over them in this time unit, a behaviour similar to processor spinning.

In order to formally define the schedules and analyze the schedulability, we use the following definition.

Definition 3 (Schedules).

A schedule of a link l is a function S_l that maps each time t in the time domain to the flow that is scheduled on the link l at that time. Further, S_l(t) = ⊥ if l is idle at time t. ∎

We use the Iverson bracket [S_l(t) = τ_i] to indicate whether a flow τ_i is scheduled on a link l at time t.³ For convenience, we use S to indicate the ordered set of the schedules S_l of all links. Moreover, we abbreviate our notation to a single value, i.e., S_{P_i}(t) = τ_i, to denote that all links in P_i schedule flow τ_i at time t.

³[S_l(t) = τ_i] is 1 if τ_i is scheduled on l at time t and is 0 otherwise.

Definition 4.

A schedule S implements the SP^2 if, for all t, for each static routing path P_i of every flow τ_i, the following implication holds: if S_l(t) = τ_i for any link l ∈ P_i, then S_{P_i}(t) = τ_i.

In general, the SP^2 can be implemented with different strategies. The focus of this paper is not the implementation or design of scheduling strategies to meet the schedule defined in Definition 4. In the upcoming two subsections, we discuss two potential scheduling strategies that can be used to derive such schedules.
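The all-or-nothing condition of Definition 4 can be checked mechanically on a discrete-time schedule. The following sketch (with hypothetical flow and link names) represents each per-link schedule S_l as a list indexed by time and evaluates the Iverson brackets per time unit:

```python
from typing import Dict, List, Optional

IDLE = None  # stands for S_l(t) = "idle" on link l at time t

def implements_sp2(S: Dict[str, List[Optional[str]]],
                   paths: Dict[str, List[str]]) -> bool:
    """Check Definition 4: whenever any link of a flow's path schedules the
    flow at time t, all links of that path must schedule it at time t."""
    horizon = len(next(iter(S.values())))  # all per-link schedules equal-length
    for t in range(horizon):
        for flow, path in paths.items():
            # Iverson brackets [S_l(t) = flow] for every link l on the path
            scheduled = [int(S[link][t] == flow) for link in path]
            if any(scheduled) and not all(scheduled):
                return False  # violates the all-or-nothing property
    return True

paths = {"f1": ["l1", "l2"]}
ok = {"l1": ["f1", IDLE], "l2": ["f1", IDLE]}   # both links in lock-step
bad = {"l1": ["f1", IDLE], "l2": [IDLE, "f1"]}  # links progress separately
assert implements_sp2(ok, paths)
assert not implements_sp2(bad, paths)
```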

IV-A Links to Gang Scheduling

To meet the deadline of a message of a flow τ_i that arrives at time t, the concept of the SP^2 requires C_i time units in which all the links in P_i are used simultaneously before t + D_i. This is similar to the rigid gang scheduling problem [DBLP:journals/lites/GoossensR16], which can be defined as follows:

  • We are given a set of periodic/sporadic tasks to be executed on the given machines.

  • Each task has to simultaneously use a subset of the given machines as a gang. Whenever the task is executed, all of its required machines must be exclusively allocated to the task.

That is, we can consider each of the links on a flow's routing path as a machine and each flow as a task. The links on the routing path form a gang for the flow, and the execution time is the flow's transmission time.

Finding an optimal schedule for the rigid gang scheduling problem has been shown to be NP-hard in the strong sense even when all the tasks have the same period and the same deadline [Kubale:87:The-complexity-of-scheduling]. Moreover, even special cases, like three machines [BazewiczDell-Olmo:94:Corrigendum-to:-Scheduling] or unit execution time per task [HoogeveenVelde:94:Complexity-of-scheduling], are NP-hard in the strong sense. The rigid gang scheduling problem for implicit-deadline periodic real-time task systems (i.e., the relative deadline of every task equals its period) has recently been studied by Goossens and Richard [DBLP:journals/lites/GoossensR16]. They presented two algorithms, one based on linear programming and the other on a heuristic. Moreover, Harde et al. [DBLP:conf/rtns/HardeFBC18] constructed static schedules by using a constraint-programming or an integer-linear-programming (ILP) approach for harmonic real-time task systems. Another version of gang scheduling is the so-called global gang scheduling problem [DBLP:journals/corr/RichardGK17, DBLP:conf/rtss/Dong017, DBLP:conf/rtss/KatoI09], in which the set of machines used by a gang task is not fixed: a gang task requires a certain number of machines, and these machines can be dynamically relocated at runtime. We note that the global gang scheduling problem is unrelated to the SP^2, since the links used by a flow have to be fixed from the source node to the destination node.
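Concretely, the correspondence between flows and rigid gang tasks described above can be sketched as a data-level mapping. The `Flow` and `GangTask` types and their attribute names are illustrative assumptions of this sketch:

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Flow:
    name: str
    links: FrozenSet[str]  # links of the static routing path
    flits: int             # number of flits, one flit per time unit
    period: int
    deadline: int


@dataclass(frozen=True)
class GangTask:
    name: str
    machines: FrozenSet[str]  # machines that must be held simultaneously
    wcet: int
    period: int
    deadline: int


def flow_to_gang(flow: Flow) -> GangTask:
    """Each link becomes a machine; the whole routing path forms the rigid
    gang, and the execution time equals the flow's transmission time."""
    return GangTask(flow.name, flow.links, flow.flits,
                    flow.period, flow.deadline)
```

For example, a flow with a two-link route and 20 flits maps to a gang task that must hold both machines for 20 time units.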

IV-B Work-Conserving Priority-Based Schedules

Instead of optimizing the scheduling strategies in the routers of a NoC for the SP^2, we focus on work-conserving priority-based SP^2. Such strategies can be dynamic-priority algorithms (i.e., two messages of the same flow may have different priorities) or fixed-priority scheduling algorithms (i.e., all messages of a flow have the same priority).

That is, each message has a priority. When two messages intend to use one link at the same time, the higher-priority message is scheduled and the lower-priority message is suspended. Whenever a message is suspended in one of its links, it is suspended in all of its links.

In the remainder of this paper, we focus on fixed-priority scheduling. We discuss the theoretical benefits of the Simultaneous Progressing Switching Protocols in Section V.

V SP^2 Scheduling Analysis

For a scheduling protocol, a real-time schedulability analysis of a flow set validates whether no flow in the flow set misses its deadline under any valid schedule generated by the protocol. Such a validated flow set is called feasibly schedulable, and infeasible otherwise. Furthermore, a schedulability test for an algorithm is called sufficient if all flow sets that are deemed feasibly schedulable by this test are actually feasibly schedulable. Likewise, the test is called necessary if every flow set that is schedulable by the algorithm is verified as feasibly schedulable by the test. A schedulability test that is both necessary and sufficient is called exact. In the remainder of this paper, we derive a sufficient schedulability test.

For each flow under analysis, we partition all other flows into a direct contention domain and an indirect contention domain. We define a (direct) contention domain of two flows by identifying the set of higher-priority flows that share at least one link and thus potentially directly interfere with each other. Then, we consider the (indirect) contention domain of the flow under analysis, i.e., flows that do not directly share a link with this flow under analysis but interfere with flows in the (direct) contention domain. In the remainder of this paper, we implicitly assume that the flow set is indexed such that a flow has higher priority than a flow if .

In this section, we explain how the schedulability analysis for preemptive fixed-priority Simultaneous Progressing Switching Protocols can be related to the work-conserving fixed-priority preemptive uniprocessor self-suspension scheduling problem. In real-time scheduling theory, self-suspension denotes the property of an executable entity to be exempted from the scheduling decisions for a specified amount of time, the so-called self-suspension time. In our analysis, we use self-suspension to model a flow that is eligible to be scheduled by a work-conserving algorithm at a given time, due to being the highest-priority active flow (released and not yet finished), but is not transmitted due to indirect contention. We prove that this indirect contention can be related to self-suspension behaviour in uniprocessor real-time scheduling theory and thus show the applicability of existing self-suspension-aware schedulability analyses. We formally define and prove this self-suspension-equivalence property of fixed-priority scheduling using the SP^2.

We only briefly introduce the uniprocessor self-suspension problem and refer the reader to the existing literature [Chen2018-suspension-review, ChenECRTS2016-suspension] for further information. In short, self-suspension refers to the exemption of a ready schedulable entity from the scheduling decision for a certain amount of time. This exemption behavior is modeled as dynamic self-suspension and multi-segment self-suspension in the literature. In the former, the suspension pattern can be arbitrary and is only parametrically upper bounded by the total self-suspension time. This flexibility comes at the cost of more pessimism in the timing analyses. In the latter model, an upper bound of the duration and the number of a task’s suspension intervals is known and can thus be used in the timing analyses. In this paper, we consider the dynamic self-suspension model and the corresponding timing analyses.

Fig. 3: An exemplary self-suspension instance for three flows using the SP^2 and fixed-priority scheduling. The empty rectangles are the times allocated to flows but not used for transmission since there is no available flit yet.
Example 1 (Self-Suspension Behaviour).

In Figure 3, three priority-ordered flows transmit 20, 19, and 29 flits, respectively, through a subset of the routers, as illustrated in Figure 1. The first flow has the highest priority and the third flow has the lowest priority. The routes are such that the first and second flows share a link and the second and third flows share a link, whereas the first and third flows do not share any link. In this example, all flows are released synchronously and are scheduled according to the Simultaneous Progressing Switching Protocols using fixed-priority scheduling. Note that an actual transmission of a flit is denoted by a darkened box, whereas the unfilled areas indicate the flows that are granted that link. On their shared link, the first flow precedes the second flow due to its higher priority. Since the third flow does not share any link with the first flow, it can transmit on its links. After the first flow has finished, the second flow attempts to transmit on its links and preempts the third flow on their shared link. Due to the SP^2, the third flow is not eligible to transmit on its other links despite being the highest-priority flow on them. Since the preemption is transparent on those links, this behaviour is similar to the self-suspension property of real-time tasks. ∎
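The self-suspension behaviour of Example 1 can be reproduced with a small discrete-time simulation of fixed-priority SP^2 arbitration. The link names and concrete routes below are illustrative assumptions that only mirror the sharing structure of the example (flows 1 and 2 share a link, flows 2 and 3 share a link, flows 1 and 3 share none); links are granted greedily in priority order, one possible work-conserving realization:

```python
def simulate_sp2(flows, horizon):
    """Simulate fixed-priority SP^2 arbitration in discrete time.

    flows: list of dicts, highest priority first, each with 'name',
    'release', 'flits', and 'links' (the set of links of its route).
    Returns, per flow, the list of time slots in which it transmitted one
    flit simultaneously on all its links (all-or-nothing property).
    """
    progress = {f["name"]: 0 for f in flows}
    slots = {f["name"]: [] for f in flows}
    for t in range(horizon):
        busy = set()  # links already granted in this time slot
        for f in flows:  # decreasing priority
            active = f["release"] <= t and progress[f["name"]] < f["flits"]
            # If any link is taken by a higher-priority flow, the flow
            # transmits nowhere, even on links where it is highest priority.
            if active and not (f["links"] & busy):
                busy |= f["links"]
                progress[f["name"]] += 1
                slots[f["name"]].append(t)
    return slots


# Flit counts 20, 19, and 29 as in Example 1; synchronous release.
flows = [
    {"name": "f1", "release": 0, "flits": 20, "links": {"e1"}},
    {"name": "f2", "release": 0, "flits": 19, "links": {"e1", "e2"}},
    {"name": "f3", "release": 0, "flits": 29, "links": {"e2", "e3"}},
]
slots = simulate_sp2(flows, horizon=60)
```

In this run, f3 transmits during [0, 20), is then suspended while f2 holds the shared link e2 during [20, 39), and resumes at time 39: exactly the self-suspension pattern described in Example 1.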

To formalize the direct contention domain of a flow under analysis, we denote by it the set of higher-priority flows that share at least one link with the flow under analysis. Under the fixed-priority SP^2, each flow has a priority, assumed here to be unique. In our analysis in this section, we assume work-conserving fixed-priority arbitration:

  • a flow transmits further if none of its links is used by any higher-priority flow, and

  • a flow does not transmit further if one of its links is allocated to another higher-priority flow.

Note that we analyze the schedulability of each flow individually under the assumption that the schedulability of all higher-priority flows has already been validated. In order to formally quantify the times at which a flow is eligible to transmit on all links of its routing path in parallel under the Simultaneous Progressing Switching Protocols but is not transmitting, i.e., exhibits self-suspension behaviour, we give the following definition.

Definition 5.

A flow is said to be self-suspended at a time if the following conditions are satisfied:

  • the flow is active, i.e., released and not yet finished at that time;

  • the flow is the highest-priority active flow on all the links of its routing path;

  • the set of links of its routing path that are allocated to the flow is a true non-empty subset of its routing path;

  • the flow is not scheduled on all the links of its routing path, i.e., by the all-or-nothing property it transmits no flit. ∎

Note that we require this subset to be a true non-empty subset, since the self-suspension-like behaviour can only occur due to contention that is transparent to the flow under analysis. In the following, we formally bound the set of flows that can exhibit self-suspension behaviour in order to safely account for the additional interference.

Lemma 1.

Let a flow satisfy the properties and for all then it follows that .

Proof.

This simply comes from the definitions. By the first property, it follows that and . The second property implies that such that , the condition holds, which implies that . ∎

Theorem 1.

In a schedule that is generated by any fixed-priority algorithm using the Simultaneous Progressing Switching Protocols, the set of flows that are -self-suspending is not larger than .

Proof.

We prove this theorem by contraposition, i.e., we show that if then is not -self-suspending. To this end, we analyze the following two cases:

  1. ,

  2. .

In the first case, let and thus by definition . Therefore, is not -self-suspending at any time by definition.

In the second case, assume the existence of a time instant such that is -self-suspended at time and satisfies the properties stated in the second case. Then, is active at time and for all either or . Further, by the properties stated for the second case and the result of Lemma 1, we know that flow . This is a contradiction, because no flow could have been active at time since otherwise the schedule would have been for all and . ∎

For the further analysis, we need to bound the maximal cumulative amount of time that a flow may be self-suspending.

Theorem 2.

Let each higher-priority flow of the flow under analysis be feasibly schedulable using the Simultaneous Progressing Switching Protocols. Then, the cumulative amount of time that the flow under analysis is -self-suspending is at most , where denotes the worst-case response time of flow .

Proof.

We consider the following two cases:

  1. Flow is not -self-suspending at any time, and thus its cumulative self-suspension time is zero and trivially upper bounded.

  2. There exists at least one point in time such that is -self-suspending.

Let where denotes the release of a packet of . By Definition 5, we know that satisfies Prop. 1 – Prop. 4 at time for all . Furthermore, since (Prop. 3) and by the all-or-nothing property of the Simultaneous Progressing Switching Protocols, it must be that .

Since, by assumption, the schedulability of each higher-priority flow has already been validated, is feasibly schedulable, i.e., . Due to the SP^2 property, we know that all that satisfy also satisfy . In conclusion, we have that , which concludes the proof. ∎

Corollary 1.

A sporadic constrained-deadline flow set is fixed-priority schedulable using the Simultaneous Progressing Switching Protocols if, for each flow, the transformed higher-priority flow set:

(1)

is schedulable, where .

The worst-case response time and schedulability of each flow have to be verified under the assumption that the higher-priority flows have already been verified to be schedulable. Based on Corollary 1, any schedulability test that verifies the schedulability of sporadic constrained-deadline self-suspending task sets on uniprocessor systems under preemptive fixed-priority scheduling can be used, e.g., the state-of-the-art tests by Chen et al. [ChenECRTS2016-suspension].
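As a sketch of how such a uniprocessor test can be applied, the following implements one simple sufficient suspension-aware response-time analysis, in which a task's own suspension is added to its response time and the suspension of each higher-priority task is modeled as release jitter. This is a commonly used, pessimistic technique from the self-suspension literature, not the exact test of [ChenECRTS2016-suspension]; the task attributes and their names are illustrative assumptions:

```python
import math


def rta_suspension_jitter(tasks):
    """Sufficient response-time analysis for dynamic self-suspension under
    preemptive fixed-priority scheduling on a uniprocessor.

    tasks: list of dicts, highest priority first, each with
    'C' (worst-case transmission/execution time), 'T' (minimum inter-arrival
    time), 'D' (relative deadline), and 'S' (self-suspension bound).
    Returns a list of response-time bounds, or None as soon as a bound
    exceeds its deadline (the set is then deemed unschedulable by this test).
    """
    bounds = []
    for k, tk in enumerate(tasks):
        resp = tk["C"] + tk["S"]  # initial guess for the fixed point
        while True:
            nxt = tk["C"] + tk["S"] + sum(
                # suspension of a higher-priority task acts like jitter
                math.ceil((resp + tj["S"]) / tj["T"]) * tj["C"]
                for tj in tasks[:k]
            )
            if nxt > tk["D"]:
                return None
            if nxt == resp:
                break  # fixed point reached
            resp = nxt
        bounds.append(resp)
    return bounds
```

For instance, two tasks with parameters (C=2, T=D=10, S=0) and (C=3, T=D=20, S=2) yield response-time bounds 2 and 7 under this test.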

VI Analytical Advantages of SP^2

In this section, we briefly compare our fixed-priority SP^2 and its schedulability analysis with some of the state-of-the-art fixed-priority schedulability analyses for wormhole-switched NoCs with virtual channels proposed by Xiong et al. [Xiong:2017] and Indrusiak et al. [Indrusiak:DATE:2018]. Unfortunately, we were not able to comprehend the analyses presented by Nikolić et al. [Nikolic2019] and are thus unable to compare against their methods analytically. We note that we do not intend to directly compare the state-of-the-art analyses, due to the different protocols and models, but only to exemplify the analytical gains and benefits of the SP^2 compared to one-hop scheduling protocols.

Let be the set of higher-priority flows whose routing path intersects with the routing path of . Let be . This notation follows [Nikolic2019]. In plain words, consists of the flows in , excluding those flows whose higher-priority flows that intersect with them are also in . This means that if is in , then is in and there exists one flow that is not in but in . Therefore, this is exactly the set defined in Theorem 1, i.e.,

According to Eq. (4) and Eq. (5) in [Nikolic2019], the analyses of the worst-case response time for preemptive worm-hole switching from [Xiong:2017, Indrusiak:DATE:2018] can be computed by solving for the minimum value of the following function (the jitter term in Eq. (4) in [Nikolic2019] is removed here since we consider sporadic flows without release jitter):

(2)

where is the interference due to buffering, i.e., backpressure, and

(3)

Now, we consider the following response time analysis from Chen et al. [ChenECRTS2016-suspension] (for consistency, we use the notation from [ChenECRTS2016-suspension] for self-suspending task systems and assume that the higher-priority flows are those in the set considered above):

(4)

where and for a certain binary assignment of for .

According to Corollary 1 and , we have if and if . Moreover, since is not in , we have . If is in , we set to ; otherwise, we set to . In such a setting of , the value in Eq. (4) is always . Together with by definition and the definition in Eq. (3), the analysis in Eq. (4) becomes:

Therefore, the worst-case response time analysis from [Xiong:2017, Indrusiak:DATE:2018] is dominated by our analysis from Corollary 1, obtained by applying the suspension-aware response time analysis from Chen et al. [ChenECRTS2016-suspension].

VII Implementation Considerations

The architectural implementation of the priority-based Simultaneous Progressing Switching Protocols (SP^2) providing the all-or-nothing property requires rethinking previous router designs. In state-of-the-art wormhole switching protocols, the decision at each router is local, i.e., each router simply chooses the highest-priority flow on each outgoing link. In contrast, the all-or-nothing property requires global decision making.

The SP^2 is a general concept, and there can be different possible realizations. One possibility is a centralized arbiter which decides and dispatches the priority information to the routers. However, this may incur high hardware cost. Possible implementations are a subject of future research efforts and beyond the conceptual scope of this paper.

VIII Conclusion

In this paper, we discuss the fundamental difficulty of worst-case timing analysis of flit-based pipelined transmissions over multiple links in parallel in network-on-chips. The space of possible progression states that needs to be covered by an analysis hints at the mismatch with uniprocessor scheduling theory and its assumptions, thus making analyses complex and prone to being optimistic. To that end, we propose the Simultaneous Progressing Switching Protocols (SP^2), in which the links used by a message either all simultaneously transmit one flit of this message or none of them transmits any flit of this message. For this family of protocols, we formally prove the match with uniprocessor scheduling assumptions and theory. Furthermore, we show the relation between the uniprocessor self-suspension scheduling problem and SP^2 scheduling and provide formal proofs to confirm this relation. In addition, we provide a sufficient schedulability analysis for fixed-priority SP^2 scheduling.

We note that the existing link-based switching and the proposed SP^2 are in fact two extreme scenarios with respect to the series of progressions. The SP^2 approach always results in the fastest series of progressions, at the price of forgoing the possibility to send part of a message even when only one link is blocked by another higher-priority message/flow. The link-based switching mechanism (worm-hole protocol) has the flexibility to forward only one flit of a flow at a time unit in the NoC, but it potentially has to consider the slowest series of progressions of the flow in the worst case. It may be possible to design timing-predictable systems with good average-case performance by pruning unnecessary progressions that have to be considered in the protocol. However, this alternative was not considered in this paper.

We strongly believe that a timing-predictable switching protocol in NoCs should be carefully designed so that NoC-based many-core systems can yield predictable performance. This paper provides protocols that can be implemented with different strategies. We believe that our proposals can be a first step towards predictable switching protocols of NoCs. In our future work, we will explore possible design options and their tradeoffs in the schedulability analyses and design complexity/cost.

References