Theoretical Analysis and Evaluation of NoCs with Weighted Round-Robin Arbitration

08/21/2021
by   Sumit K. Mandal, et al.

Fast and accurate performance analysis techniques are essential in early design space exploration and pre-silicon evaluations, including software eco-system development. In particular, on-chip communication continues to play an increasingly important role as many-core processors scale up. This paper presents the first performance analysis technique that targets networks-on-chip (NoCs) that employ weighted round-robin (WRR) arbitration. Besides fairness, WRR arbitration provides flexibility in allocating bandwidth proportionally to the importance of the traffic classes, unlike basic round-robin and priority-based arbitration. The proposed approach first estimates the effective service time of the packets in the queue due to WRR arbitration. Then, it uses the effective service time to compute the average waiting time of the packets. Next, we incorporate a decomposition technique to extend the analytical model to handle NoCs of any size. The proposed approach achieves less than 5% error while executing real applications and 10% error under synthetic traffic with different burstiness levels.

I Introduction

Networks-on-chip continue playing a central role as many-core processors start dominating the server [1, 2] and deep learning [3, 4, 5] markets. As the commercial solutions scale up, the latency, area, and power consumption overheads of NoCs become increasingly crucial. Designers need analytical power-performance models to guide complex design decisions during the architecture development and implementation phases. After that, the same models are required by virtual platforms, commonly used to develop and evaluate the software ecosystem and applications [6]. Hence, there is a strong demand for high-fidelity analytical techniques that accurately model fundamental aspects of industrial designs across all segments ranging from systems-on-chip to client and server systems.

NoCs can be broadly classified in terms of buffer usage as buffered and bufferless architectures [7, 8, 9, 10]. Most early solutions adopted buffered techniques, such as wormhole and virtual-channel switching, where the packets (or their flits) are stored in intermediate routers. Area, latency, and energy consumption of buffers have later led to bufferless architectures, where the intermediate routers forward the incoming flits if they can and deflect them otherwise. Bufferless NoCs save significant buffer area and enable ultra-fast routing decisions, as low as a single cycle [7, 8]. Therefore, many industrial NoCs used in server and client architectures employ bufferless solutions to minimize the communication latency between the cores, last-level caches (LLC), and main memory [11, 2]. These solutions give priority to the packets already in the network to enable predictable and fast communication while stalling the newly generated packets from the processing and storage nodes. However, buffer area savings and low communication latency come at the cost of the early onset of congestion. Indeed, the packets wait longer at the end nodes, and the throughput saturates faster when the NoC load increases. Moreover, all routers in the NoCs remain powered on, increasing the NoC power consumption. Therefore, there is a strong need to address these shortcomings.

Buffered NoCs with virtual channel routers have been used more commonly in academic work and most recent industry standards [12, 13]. Shared buffering resources, such as input and output channels, require arbitrating among different requesters. For example, suppose that packets in different input channels request the same output channel. An arbiter needs to resolve the conflicts and grant access to one of the requesters to meet the performance target. The architectures proposed to date predominantly employ a basic round-robin (RR) arbiter to provide fairness to all requesters [14, 15, 16]. Although the decisions are locally fair, the number of arbitrations a packet goes through grows with its path length. Hence, RR arbitration is globally unfair. More importantly, basic RR cannot provide preference to a particular input, which is typically desired since not all requests are equal. For example, data and acknowledgment packets can have higher priority than new requests to complete outstanding transactions, especially when the network is congested.
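The global unfairness of basic RR can be illustrated with a minimal sketch. It assumes a hypothetical chain of two-input RR arbiters in which every source is saturated (always has a packet ready), and each arbiter merges the upstream traffic with one new local source. The function name and the chain topology are illustrative assumptions, not a configuration taken from this paper.

```python
from fractions import Fraction
from typing import List


def rr_chain_shares(num_arbiters: int) -> List[Fraction]:
    """Output-link bandwidth share of each saturated source on a chain of
    two-input round-robin arbiters (hypothetical topology).

    Sources 0 and 1 meet at the first arbiter. Every later arbiter merges
    the upstream traffic with one additional local source.
    """
    if num_arbiters < 1:
        raise ValueError("at least one arbiter is required")
    # At the first arbiter, the two saturated inputs alternate.
    shares = [Fraction(1, 2), Fraction(1, 2)]
    for _ in range(num_arbiters - 1):
        # Each further arbiter halves everything coming from upstream and
        # gives the other half to its own local source.
        shares = [s / 2 for s in shares] + [Fraction(1, 2)]
    return shares


if __name__ == "__main__":
    for src, share in enumerate(rr_chain_shares(3)):
        print(f"source {src}: {share}")
```

Under these assumptions, the sources farthest from the final output receive a share that halves with every additional arbitration, even though each individual arbiter treats its two inputs equally.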

WRR arbitration provides flexibility in allocating bandwidth proportionally to the importance of the traffic classes, unlike basic round-robin and priority-based arbitration. Each requester has an assigned weight, which is a measure of its importance. A larger weight indicates that the requester is given more preference in arbitration. Due to its generality, WRR arbitration has been employed in several NoC proposals in the literature [17, 18, 19]. Indeed, WRR arbitration enables higher throughput than RR arbitration [18]. Despite its potential, WRR arbitration has not been analyzed theoretically, especially for large-scale NoCs. A large body of literature has proposed performance analysis techniques for buffered and bufferless NoCs since analytical models play a crucial role in fast design space exploration and pre-silicon evaluation [20, 21, 22, 23, 24]. In contrast, no analytical modeling technique has been proposed to date for NoCs with WRR arbitration. A formal analysis is required to understand the behavior of NoCs with WRR arbitration. At the same time, executable performance models are needed to guide many-core processor design and enable virtual platforms for pre-silicon evaluation.

This paper presents a fast, accurate, and scalable performance analysis technique for NoCs with WRR arbitration. To the best of our knowledge, it is the first performance analysis technique for NoCs with weighted round-robin arbitration. Furthermore, the proposed technique supports bursty core traffic observed in real applications, which is typically ignored due to its complexity. It first estimates the effective service time of the packets in the queue due to WRR arbitration. Then, it utilizes the effective service time to obtain the average waiting time of the packets. We also propose a decomposition technique to extend the analytical model to NoCs of any size. Extensive experimental evaluations show that the proposed analytical model has less than 5% error with real applications and 10% error with synthetic traffic having different burstiness levels congesting the NoC.

The major contributions of the work are listed below:

  • A novel performance analysis technique for NoCs that employ WRR arbitration,

  • A decomposition technique to obtain a scalable analytical model for NoCs of any size,

  • Experimental evaluations with multiple NoC configurations with different traffic scenarios showing less than 5% error for real applications.

II Related Work and Novel Contributions

Analytical models are required to estimate the NoC performance for fast design space exploration and pre-silicon evaluation. Multiple prior studies have proposed NoC performance analysis techniques with basic round-robin arbitration [25, 26, 21]. The authors in [25] first construct a contention matrix between multiple flows in the NoC. Then, the average waiting time of the packets corresponding to each flow is computed. A support vector regression-based analytical model for NoCs is proposed in [21]. The analytical model proposed in [26] estimates the mean service time of the flows with RR arbitration. The estimated mean service time is used to find the average waiting time of the flows. However, none of these techniques are applicable in the presence of both bursty traffic and WRR arbitration.

Analytical modeling of round-robin arbitration has also been studied outside the NoC domain [27, 28, 20]. The techniques presented in [27, 28] incorporate a polling model to approximate the effective service time of a queue in the presence of RR arbitration. However, none of these approaches are applicable when the input distribution to the queue is not geometric. A Markov chain-based analytical model is proposed in [20] to account for bursty input traffic. However, the technique is not scalable for a network of queues. Moreover, none of these techniques are applicable for discrete-time queuing systems. Since each transaction in NoC happens at discrete clock cycles, the analytical models need to incorporate discrete-time queuing systems. The major drawbacks of the prior approaches are summarized in Table I.

The basic round-robin arbitration cannot provide fairness when requesters have widely varying data rate requirements and priorities. Therefore, weighted round-robin arbitration, i.e., WRR, has been used in on-chip communication architectures [17, 18]. Qian et al. compute delay bounds for different channels with different weights to assign appropriate weight to each input channel of the NoC [17]. They show that WRR delivers better quality of service than NoCs with strict priority-based arbitration. Authors in [18] propose a WRR-based scheduling policy. The proposed technique assigns larger bandwidth to input channels with higher weights. It achieves higher throughput compared to round-robin arbitration. Although WRR has shown promise, no analytical modeling approach exists for NoCs with WRR to date.

| Research | Approach | WRR | Bursty Traffic | Scalable | Discrete Time |
|---|---|---|---|---|---|
| Boxma et al. [27] | Polling model | No | No | Yes | No |
| Wim et al. [28] | Extended polling model | No | No | Yes | No |
| Wang et al. [20] | Markov chain | No | Yes | No | No |
| Fischer et al. [26] | Heuristic | No | No | Yes | No |
| Vanlerbergee et al. [29] | Moment generating function | No | No | No | Yes |
| This work | Queue decomposition | Yes | Yes | Yes | Yes |

TABLE I: Comparison of prior research and our novel contribution.

This paper presents the first performance analysis technique for NoCs with WRR arbitration. It fills an essential gap since WRR can address the shortcomings of priority-based bufferless NoC architectures and the basic round-robin arbitration. Furthermore, the proposed technique supports bursty traffic observed in real applications, which is typically ignored due to its complexity. Hence, it is a vital step towards comprehending the theoretical underpinnings of NoCs with WRR arbitration and enabling their deployment in industrial designs.

III Background and Overview

Fig. 1: Illustration of weighted round-robin arbitration. ‘A’ is served first, ‘I’ last. In this example it is assumed that no new packets arrived until all prior packets (A-I) have been served.

III-A Weighted Round-Robin Arbitration

This work uses weighted round-robin arbitration in the NoC routers. The basic operating principle of WRR arbitration is illustrated in Figure 1 for three traffic classes. Packets from each class are first written to a dedicated input queue (also known as a channel). Suppose there are $N$ input queues $Q_1, \ldots, Q_N$. The WRR technique assigns a positive integer weight, denoted as $\omega_i$, to $Q_i$ for $i = 1, \ldots, N$. The WRR arbiter serves up to $\omega_i$ consecutive packets from $Q_i$ before moving to the next queue. If $Q_i$ has fewer than $\omega_i$ packets, then WRR serves $Q_i$ until it becomes empty. Then, the WRR arbiter serves the subsequent queues following the same principle. After a cycle is completed, the weights are reset to their initial values, as illustrated in Figure 1, and the same arbitration cycle is repeated. In NoCs with WRR arbitration, whenever two or more requesters compete for the same resource, they are arbitrated following WRR. WRR arbitration can be used both for arbitrating different virtual channels and different ports in the network.
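The operating principle described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that no new packets arrive while the queues drain (the same assumption as in Figure 1). The function name, the packet labels, and the weights in the usage example are hypothetical and are not taken from Figure 1.

```python
from collections import deque
from typing import Deque, List, Sequence


def wrr_service_order(queues: Sequence[Sequence[str]],
                      weights: Sequence[int]) -> List[str]:
    """Return the order in which a WRR arbiter serves the queued packets.

    Assumes no new arrivals: the arbiter simply drains the given queues.
    """
    if len(queues) != len(weights):
        raise ValueError("one weight per queue is required")
    if any(w < 1 for w in weights):
        raise ValueError("weights must be positive integers")

    pending: List[Deque[str]] = [deque(q) for q in queues]
    order: List[str] = []
    while any(pending):  # an empty deque is falsy
        # One arbitration cycle: visit every queue once, in fixed order.
        for queue, weight in zip(pending, weights):
            # Serve up to `weight` consecutive packets, or stop early if
            # the queue becomes empty. The budget is renewed every cycle.
            for _ in range(weight):
                if not queue:
                    break
                order.append(queue.popleft())
    return order


if __name__ == "__main__":
    # Three hypothetical traffic classes with weights 2, 1, and 3.
    queues = [["A", "B", "G"], ["C", "H"], ["D", "E", "F", "I"]]
    print(wrr_service_order(queues, weights=[2, 1, 3]))
```

With all weights set to one, the same function reproduces basic round-robin arbitration.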

III-B Usage of the Proposed Performance Analysis Technique

WRR arbitration is promising for NoCs since it can tailor the communication bandwidth to different traffic classes. Furthermore, it provides end-to-end latency-fairness to different source-destination pairs, unlike basic round-robin and priority arbitration techniques. However, these capabilities come at the expense of a vast design parameter space. An $n \times n$ mesh with $p$-port routers has $n^2 p$ tunable weights, e.g., an 8×8 2D mesh with 5-port routers would have 320 WRR weights. Due to this ample design space, the current practice is limited to assigning two weights to each router (e.g., one weight to local ports and another weight to packets already in the NoC).
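The size of this design space follows from simple arithmetic, sketched below. The helper name and its parameters are illustrative assumptions; the only figures taken from the text are the 8×8 mesh, the 5-port routers, and the two-weights-per-router practice.

```python
def num_wrr_weights(rows: int, cols: int, weights_per_router: int) -> int:
    """Number of tunable WRR weights in a rows x cols mesh when every
    router exposes `weights_per_router` independent weights."""
    if min(rows, cols, weights_per_router) < 1:
        raise ValueError("all arguments must be positive")
    return rows * cols * weights_per_router


if __name__ == "__main__":
    # One weight per port of every 5-port router in an 8x8 mesh.
    print(num_wrr_weights(8, 8, 5))
    # Current practice: only two distinct weights per router.
    print(num_wrr_weights(8, 8, 2))
```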

The benefits of the proposed theoretical analysis are two-fold. First, it can enable accurate pre-silicon evaluations and design space exploration without time-consuming cycle-accurate simulations. Second, it can be used to find the combination of weights that optimizes the performance, i.e., to solve the optimization problem in the vast design space described in the previous paragraph. This paper focuses on constructing the proposed analysis technique and its evaluation against cycle-accurate simulation for two reasons. First, the theoretical analysis is complex, and its evaluation deserves a dedicated treatment on its own. Moreover, its application for solving optimization problems requires demonstration of its fidelity first. Multi-objective optimization of the WRR weights using the proposed model is one of our future research directions.

Fig. 2: The original WRR arbiter with its traffic parameters (on the left) and a transformed WRR (on the right) that comprises fully decomposed queue nodes using our effective service time transformations.

IV Proposed Methodology and Approach

Figure 2 shows a weighted round-robin arbiter with $N$ queues. The packets that belong to traffic class-$i$ are stored in queue $Q_i$. The corresponding average arrival rate of class-$i$ packets is denoted by $\lambda_i$, as summarized in Table II. The burstiness of traffic class-$i$ can be captured by the squared coefficient of variation of its inter-arrival time, denoted by $C_{A_i}$ [30, 17]. Finally, the weight assigned to class-$i$ is denoted as $\omega_i$ for $i = 1, \ldots, N$, as illustrated in Figure 2.

To provide a step-by-step derivation, Section IV-A starts with a particular case of the proposed technique tailored to the basic round-robin arbitration, i.e., all weights are set to one. Then, Section IV-B extends the formulation to weighted round-robin arbitration.

| Symbol | Definition |
|---|---|
| $N$ | Number of traffic classes to an arbiter |
| $\lambda_i$ | Injection rate of class-$i$ |
| $\omega_i$ | Weight assigned to class-$i$ |
| $T$ | Original mean service time of all traffic classes |
| $\hat{T}_i$ | Mean value of effective service time of class-$i$ |
| $\rho_i$, $\hat{\rho}_i$ | ($\lambda_i T$) and ($\lambda_i \hat{T}_i$) server utilizations of class-$i$ |
| $C_{A_i}$ | Squared coeff. of variation of inter-arrival time of class-$i$ |
| $C_S$ | Squared coeff. of variation of orig. service time (all classes) |
| $C_{\hat{S}_i}$ | Squared coeff. of variation of eff. service time of class-$i$ |
| $C_A$ | Sq. coeff. of variation of inter-arrival time (merged traffic) |
| $C_D$ | Squared coeff. of variation of inter-departure time |
| $L_i$ | Mean queue occupancy of class-$i$ |
| $p_{j|i}$ | Probability that queue occupancy of class-$j$ $> 0$ given a mean effective service time of class-$i$ = $\hat{T}_i$ |
| $W_i$ | Average waiting time of class-$i$ |
| $\hat{R}_i$ | Mean effective residual service time of class-$i$ |

TABLE II: List of the important parameters used in this work.

IV-A Analytical Model for Basic Round-Robin Arbitration

In the basic round-robin arbitration, all weights are equal to one, i.e., $\omega_i = 1$ for $i = 1, \ldots, N$. Each traffic class experiences an additional delay until the arbiter serves the head packets in the other queues. The proposed technique has two steps:

  1. The first step is to compute the first two moments of the effective service time for each class-$i$ ($\hat{T}_i$, $C_{\hat{S}_i}$) of the fully decomposed queue nodes illustrated in Figure 2.

  2. In the second step, we use the transformed effective service times to find the total waiting time (including the queuing delay) of each traffic class.

For the clarity of notation and illustration, the derivation below assumes the original two moments of service time are the same across all traffic classes.

IV-A1 Mean effective service time of RR ($\hat{T}_i$)

The effective service time accounts for the delay experienced by the packets at the head of each queue. This delay includes its own service time and the service time of the packets at the head of the other queues since the round-robin arbiter serves them one by one. When all queues have packets, class-$i$ packets are served once per round of $N$ services, i.e., after being served they wait for the other $N-1$ queues to be served before winning the arbitration again. In general, a traffic class will only contribute to the extra service time if it has a packet waiting for service. Thus, assuming packet arrivals of different classes are independent, we can express the mean effective service time as:

(1)

where $p_{j|i}$ denotes the probability that the occupancy of class-$j$ is greater than zero given that the mean effective service period of class-$i$ is $\hat{T}_i$, for $j \neq i$. We approximate $p_{j|i}$ using Little's law as $\lambda_j \hat{T}_i$, following the busy-cycle approach [27]. Little's law gives the expected number of class-$j$ packets that arrive during the period of $\hat{T}_i$, which approximates the probability that the occupancy of class-$j$ packets is greater than zero during this period. To ensure that the probability term does not exceed one, we approximate $p_{j|i}$ as $\min(1, \lambda_j \hat{T}_i)$. Therefore, Equation 1 can be rewritten as:

(2)

We solve Equation 2 through a simple iterative approach described in Algorithm 1. First, Equation 2 is solved by removing the nonlinearity introduced by the $\min$ operation (lines 4 and 5). We take the smaller of the two solutions, since we observed through experiments that the smaller solution is more accurate. The solution is used as the initial estimate of the iterative approach. Then, this estimate is plugged into Equation 2 within an iteration loop to obtain a better estimate (line 7). The iterations continue until the change in the effective service time is within a user-provided convergence threshold. Algorithm 1 typically takes less than ten iterations due to the quadratic nature of Equation 2, which takes only a few microseconds to compute. After the iterations are completed, Algorithm 1 returns the effective service time $\hat{T}_i$.

1 Input: Injection rate of each class ($\lambda_i$), service time ($T$), number of classes ($N$), convergence threshold
2 Output: Effective service time of each class ($\hat{T}_i$)
3 for i = 1:N do
4        Find smaller root of the quadratic equation derived from Equation 2:
5       
6       
7        while  do
8              
9              
10              
11              
12        end while
13       
14 end for
Algorithm 1 Basic round-robin arbitration: Effective service time ($\hat{T}_i$) computation.
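The iterative refinement can be sketched as a fixed-point iteration. The sketch below is not Algorithm 1 itself. It assumes the simplest update consistent with the prose, in which every other class $j$ adds the service time $T$ weighted by $\min(1, \lambda_j \hat{T}_i)$; the exact Equation 2 may contain additional terms. It also starts from the contention-free value $T$ instead of the quadratic root used by Algorithm 1, and the default tolerance is an arbitrary illustrative value.

```python
from typing import List, Sequence


def rr_effective_service_time(rates: Sequence[float],
                              service_time: float,
                              tol: float = 1e-6,
                              max_iter: int = 1000) -> List[float]:
    """Fixed-point estimate of the effective service time of each class.

    ASSUMED update (not taken verbatim from the paper):
        T_hat_i <- T * (1 + sum_{j != i} min(1, lambda_j * T_hat_i))
    """
    estimates: List[float] = []
    for i in range(len(rates)):
        others = [lam for j, lam in enumerate(rates) if j != i]
        estimate = service_time  # initial guess: no contention at all
        for _ in range(max_iter):
            # Probability that each other queue is occupied, clipped at one.
            busy = sum(min(1.0, lam * estimate) for lam in others)
            updated = service_time * (1.0 + busy)
            converged = abs(updated - estimate) <= tol
            estimate = updated
            if converged:
                break
        estimates.append(estimate)
    return estimates


if __name__ == "__main__":
    # Two hypothetical classes, each injecting 0.1 packets/cycle, T = 2 cycles.
    print(rr_effective_service_time([0.1, 0.1], service_time=2.0))
```

Because the assumed update is non-decreasing and bounded by $N T$, the iteration started from $T$ increases monotonically and settles at the smallest fixed point, which mirrors the choice of the smaller root described above.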

IV-A2 Coefficient of variation of effective service time for RR ($C_{\hat{S}_i}$)

This section derives $C_{\hat{S}_i}$ by leveraging the conservation of work principle and the property of equal residual service time of individual classes [27]. We leverage these properties to compute the mean effective residual service time of class-$i$ ($\hat{R}_i$). Then, $\hat{R}_i$ is used to calculate $C_{\hat{S}_i}$ using the relation found in [30]:

(3)

where $\hat{\rho}_i = \lambda_i \hat{T}_i$ is the effective server utilization. In addition, we show that leveraging the conservation of work principle gives our model the capability of handling bursty traffic.

We exploit the conservation of work principle, which states that the same number of packets are served regardless of how the bandwidth is divided between the traffic flows, assuming all packets have the same moments of service time. Due to this fundamental principle, the total queue occupancy is invariant under the arbitration policy. Furthermore, this sum can be computed with near-perfect accuracy using the maximum entropy-based multi-class analytical queuing model given in [30]. One strong aspect of this model is that it can take bursty traffic that conforms to the General Geometric distribution (GGeo). By leveraging this model, we can embed the capability of handling bursty traffic into our analytical model. The total queue occupancy ($L$) can be calculated as:

(4)

using the parameters summarized in Table II. We do not delve into the derivation of this equation for clarity since it is not our major focus. The detailed derivation can be found in the seminal work by Kouvatsos et al. [30].

Next, we derive $L$ using the first two moments of the transformed effective service time. The occupancy of each queue can be expressed using Little's law as $L_i = \lambda_i W_i$, where $W_i$ is the average waiting time of class-$i$. Therefore, we obtain:

$L = \sum_{i=1}^{N} L_i = \sum_{i=1}^{N} \lambda_i W_i$ (5)

The average waiting time consists of two components. The first one is the waiting time in the queue until the packet reaches the head of the queue, i.e., until all other class-$i$ packets leave the server completely. The second component accounts for the waiting time at the head of the queue when the server is busy serving other classes (class-$j$, $j \neq i$). This component is the additional time captured by the effective service time $\hat{T}_i$, i.e., the difference between the effective and original service times: $\hat{T}_i - T$. Furthermore, the waiting time is expressed as a function of the residual time [31]. Hence,

(6)

We leverage the property that the mean effective residual service times of all individual classes are equal [27]; hence we write $\hat{R}_i = \hat{R}$ for all $i$. Thus, we can rewrite the total occupancy by plugging Equation 6 into Equation 5 and then solve for the residual time $\hat{R}$:

(7)

Then, we can compute the coefficient of variation $C_{\hat{S}_i}$ by substituting the residual time $\hat{R}$ from Equation 7 in Equation 3.

IV-A3 Average waiting time of RR ($W_i$)

Finally, we compute the mean waiting time of individual classes by plugging $\hat{T}_i$ and $C_{\hat{S}_i}$ found for RR arbitration into Equation 3 and Equation 6 as:

(8)

IV-B Analytical Model for Weighted Round-Robin Arbitration

This section extends the analytical model constructed in Section IV-A to WRR using the same steps. It first describes the derivation of the first two moments of the effective service time. Then, it uses these moments to find the average waiting time.

IV-B1 Mean effective service time of WRR ($\hat{T}_i$)

In WRR, the weight of each class can be larger than or equal to one, i.e., $\omega_i \geq 1$. To illustrate the generalization, we start with a system of two queues, where $\omega_1 \geq 1$ and $\omega_2 = 1$. Then, we generalize it to $N$ queues with arbitrary weights.

Mean effective service time of class-1 with $\omega_1 \geq 1$ and $\omega_2 = 1$: Since $\omega_1 \geq 1$, a batch of up to $\omega_1$ class-1 packets is served without interruption. Consider the total service time of this batch of class-1 packets. When there is a class-2 packet in $Q_2$, the next batch of class-1 packets will wait. Therefore, we can express $\hat{T}_1$ as:

(9)

Note that this equation generalizes Equation 1 to arbitrary $\omega_1$ for two queues. Using Little's law and the methodology described in Section IV-A, we approximate $\hat{T}_1$ as:

(10)

Mean effective service time of class-2 with $\omega_1 \geq 1$ and $\omega_2 = 1$: The batch size of class-2 packets is one in this case since $\omega_2 = 1$. They need to wait in the queue for a maximum of $\omega_1$ class-1 packets. Therefore, $\hat{T}_2$ is expressed as:

(11)

We can approximate as to obtain:

(12)

To get to the final form of Equation 12 above, we approximate the conditional probabilities based on [27].

Generalization to $N$ queues and arbitrary weights: In general, there are $N$ classes with weights $\omega_1, \ldots, \omega_N$. Consider the class-$i$ packets and the total service time for a batch of $\omega_i$ packets. The first part of the effective service time will be due to the packets of the same class. This effect is captured in our two-queue illustration for class-1 packets in Equation 9. The second part of the effective service time will be due to the packets of other classes. This effect is captured in our two-queue illustration for class-2 packets in Equation 11. By combining them, we can express the generalized effective service time as:

(13)

Note that this equation reduces to Equation 9 for , and to Equation 11 for .

Finally, we can obtain the generalized effective service times by replacing with as follows:

(14)

Due to the non-linearity introduced by the $\min$ operation, we compute $\hat{T}_i$ using the iterative Algorithm 2, just like the basic round-robin case. The quadratic nature of Equation 14 enables fast convergence.

1 Input: Injection rate of each class ($\lambda_i$), WRR weights ($\omega_i$), service time ($T$), number of classes ($N$)
2 Output: Effective service time of each class ($\hat{T}_i$)
3 for i = 1:N do
4        Find smaller root of the quadratic equation derived from Equation 14 for $\hat{T}_i$:
5       
6       
7        while  do
8              
9              
10              
11              
12              
13        end while
14       
15 end for
Algorithm 2 Obtaining effective service time ($\hat{T}_i$) of weighted round-robin

Fig. 3: Illustration of extending the proposed analysis to multiple stages. Only two consecutive stages are shown for clarity. The departure statistics at a given stage become the arrival statistics of the subsequent stage. The blue line denotes that the class which wins the arbitration goes to the next stage.

IV-B2 Coefficient of variation of effective service time of WRR ($C_{\hat{S}_i}$)

Footnote 1: We use the same notation for RR and WRR arbitration since WRR results are generalizations of basic RR expressions.

In the basic round-robin case, the residual service time accounts for the second-order moment of the service time, i.e., $C_{\hat{S}_i}$. However, the mean effective residual service times of different classes are not necessarily equal to each other for arbitrary weights, which is the case in WRR arbitration. The mean effective residual service time of class-$i$ packets is expressed as:

(15)

where $\hat{T}_i$ is found from Algorithm 2 based on Equation 14. Note that the residual service time depends on the first two moments of service time, which are the average service time $\hat{T}_i$ and the coefficient of variation $C_{\hat{S}_i}$, respectively. To approximate $C_{\hat{S}_i}$ for WRR, we leverage its monotonically decreasing behavior as a function of the corresponding weight. This behavior is expected since a larger weight corresponds to fewer arbitration stages, decreasing the effective service time variation. This observation indicates that the coefficient of variation for round-robin arbitration is an upper bound of the coefficient of variation for weighted round-robin arbitration. This upper bound, for class-$i$, is computed using Equation 3. Given this, we first apply a linear approximation that relates $C_{\hat{S}_i}$ under WRR to its RR upper bound through the squared weight $\omega_i^2$ (squared because $C_{\hat{S}_i}$ is the square of the coefficient of variation, as defined in Table II) and a tuning parameter that tightens the approximation. Next, we recall Equation 4 and substitute this approximation to obtain:

(16)

The left-hand side of Equation 16 ($L$) is obtained by applying Equation 4. Therefore, the tuning parameter is the only unknown in Equation 16. Once we obtain it (by solving Equation 16), we compute $C_{\hat{S}_i}$ under WRR from the linear approximation. This expression captures traffic burstiness, addressing one of the significant drawbacks of the prior work.

IV-B3 Average waiting time of WRR ($W_i$)

So far, we have obtained the first two moments of the effective service time, i.e., $\hat{T}_i$ (Equation 14) and $C_{\hat{S}_i}$. We obtain the average waiting time under WRR arbitration by plugging them into Equation 17:

(17)

IV-C Estimation of end-to-end latency

The previous section described the canonical model for WRR arbitration, where all traffic classes go through a single arbiter. The packets in NoCs go through a sequence of WRR arbiters while traveling from their sources towards destinations. For example, an 8×8 2D mesh has 64 routers with at least one arbiter per output queue in each of them. Figure 3(a) illustrates an example with two stages, where the flow that wins the first stage is routed to the second one. The major challenge in modeling multiple stages is obtaining the inter-departure distribution, which becomes the arrival flow at the subsequent stages. For instance, the coefficient of variation of arrival flow 2 in the second stage is equal to the coefficient of variation of the departing flow from stage-1 in Figure 3(a). We first find the squared coefficient of variation of inter-departure time for each traffic class:

(18)

Next, we apply the decomposition technique [32] to obtain the departure distribution at each stage:

(19)

The departure distribution of a stage is the arrival distribution to the next stage. For a given source-destination pair, these calculations are performed at each arbitration stage on the path of the packets.

1 Input: NoC size, injection rates and coefficients of variation of inter-arrival times for all classes, service time, WRR weights
2 Output: Average NoC latency and source–dest. latencies
3 Number of routers ,
4 for s = 1: do
5        for d = 1: do
6              
7              
8               ,
9        end for
10       
11 end for
12 function compute_waiting_time()
13        The number of stages for the flow with source and destination ,
14        for j = 1:S do
15               Waiting time at stage using Equation 17
16               Coefficient of variation using Equation 19
17               Use as the in the next stage
18        end for
19       return
20 end
Algorithm 3 Estimation of end-to-end latency

Algorithm 3 presents a step-by-step procedure to obtain the end-to-end latency of a given NoC with WRR arbitration and deterministic routing. The input to the algorithm is the NoC size, injection rate, and coefficient of variation of inter-arrival time of all classes, service time of the queues, and weights assigned to each arbiter. The output of the algorithm is the latency of each source-destination pair as well as the average latency.

Algorithm 3 first computes the waiting time of each source-destination pair. The function in lines 12–22 describes the procedure for this computation. The procedure starts with finding the number of stages ($S$) for the flow from source $s$ to destination $d$. At each stage, it computes the average waiting time using Equation 17 (line 16). Next, it adds the waiting time at the current stage to the cumulative waiting time found so far (line 17). Then, it computes the coefficient of variation of inter-departure time using Equation 18, which is used as the coefficient of variation of inter-arrival time in the next stage (lines 18–19). Finally, the procedure returns the total waiting time from source $s$ to destination $d$ (line 21). After retrieving the total waiting times, Algorithm 3 adds the free packet delay, i.e., the latency due to the links and router micro-architecture, which can be found using the number of hops and the router pipeline depth [33, 10]. Then, end-to-end latencies and flow rates are accumulated (line 8). Finally, the average latency of the network is obtained by computing the weighted average of the latency of all source-destination pairs (line 11). We note that different WRR arbiters in the network can be assigned different arbitration weights.
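The accumulation performed by this procedure can be sketched as follows. This is a minimal sketch of the bookkeeping only: the per-stage model (Equations 17 to 19) is passed in as a caller-supplied function, and the names `Flow`, `StageModel`, `path_waiting_time`, and `average_latency` are hypothetical, not identifiers from the released source code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Flow:
    rate: float             # injection rate of the source-destination pair
    stages: Sequence[str]   # arbitration stages on the path, in order
    arrival_cv: float       # sq. coeff. of variation of inter-arrival time at the source
    free_delay: float       # contention-free latency (links + router pipeline)


# Caller-supplied stage model: (stage id, arrival cv) -> (waiting time, departure cv).
StageModel = Callable[[str, float], Tuple[float, float]]


def path_waiting_time(flow: Flow, stage_model: StageModel) -> float:
    """Sum the waiting times over all stages of one flow."""
    total = 0.0
    cv = flow.arrival_cv
    for stage in flow.stages:
        # The departure statistics of this stage become the arrival
        # statistics of the next stage.
        wait, cv = stage_model(stage, cv)
        total += wait
    return total


def average_latency(flows: Sequence[Flow], stage_model: StageModel) -> float:
    """Rate-weighted average of the end-to-end latencies of all flows."""
    weighted_sum = 0.0
    total_rate = 0.0
    for flow in flows:
        latency = path_waiting_time(flow, stage_model) + flow.free_delay
        weighted_sum += flow.rate * latency
        total_rate += flow.rate
    if total_rate <= 0.0:
        raise ValueError("total injection rate must be positive")
    return weighted_sum / total_rate
```

Keeping the stage model as a parameter lets the same loop be reused with different arbiters or different weights per stage, in line with the note that arbiters in the network may be assigned different weights.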

V Experimental Results

V-A Experimental Setup

This section presents a thorough evaluation of the proposed analytical modeling approach under various traffic scenarios and NoC configurations. We employ geometric and bursty traffic distributions in addition to real application traces feeding 1×8, 6×6, and 8×8 NoCs (note that the typical size of state-of-the-art industrial NoCs is 6×6 [11, 2]). The synthetic traffic simulations use 100% last-level cache (LLC) hit and 100% LLC miss scenarios, which are commonly used corner cases. Furthermore, applications from the SPEC CPU 2017 [34] and PARSEC [35] benchmark suites demonstrate the effectiveness of the proposed technique in a broader range of scenarios. The latency results of our analytical models are compared against an in-house cycle-accurate simulator, which is calibrated with an industrial simulator [36]. All simulations run for 200K cycles to reach a steady state and have 20K cycles of warm-up. The source code of the simulator and the analytical model are publicly released at [37].

Fig. 4: Verification of the analytical model for basic round-robin with (a) 8×1 and (b) 8×8 NoC.

V-B Results with Basic Round-Robin Arbitration

We first evaluate the proposed analytical model under basic round-robin arbitration. Figure 4 depicts representative average end-to-end latency comparisons between the analytical model and the cycle-accurate simulations for an 8×1 ring and an 8×8 mesh NoC. Packet injection follows a geometric distribution, and the average injection rates for both NoC configurations are increased until the NoC becomes highly congested. On average, the proposed technique incurs only 5% error for the 8×1 ring and 7% error for the 8×8 mesh. We also compare the proposed analytical model with a polling-based model for round-robin arbitration [28]. The polling-based model grossly overestimates the latency for both the 8×1 and 8×8 NoCs. A comprehensive summary for other NoC sizes and traffic patterns is presented in Section V-E. These results demonstrate that the proposed approach is accurate for different NoC, traffic, and WRR parameters.
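To make the reported average error concrete, the short sketch below computes a mean absolute percentage error between model and simulation latencies across a sweep of injection rates. The exact error metric is not spelled out in this section, so this definition is an assumption, and the latency values in the usage note are hypothetical numbers chosen only to exercise the function.

```python
from typing import Sequence


def mean_abs_pct_error(model: Sequence[float], sim: Sequence[float]) -> float:
    """Mean absolute percentage error of the model relative to simulation."""
    if len(model) != len(sim) or len(model) == 0:
        raise ValueError("model and sim must be non-empty and of equal length")
    # Each term is the relative deviation at one injection rate.
    total = sum(abs(m - s) / s for m, s in zip(model, sim))
    return 100.0 * total / len(model)
```

For example, hypothetical simulated latencies of 20, 25, and 40 cycles against model estimates of 19, 26, and 42 cycles give relative deviations of 5%, 4%, and 5%, so `mean_abs_pct_error([19.0, 26.0, 42.0], [20.0, 25.0, 40.0])` averages them to roughly 4.7%.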

V-C Results with Weighted Round-Robin Arbitration

The most important aspect of the proposed technique is its support for WRR arbitration. Figure 5 shows the average end-to-end latency comparison of our analytical model against cycle-accurate simulations with an 8×8 mesh for three different weight configurations. The proposed analytical model incurs on average 8% error when the WRR weights associated with the traffic channels of external input to the NoC and channels connected to internal routers are 1 and 2, respectively. The average error slightly increases to 9% when the arbitration weights of the packets in the NoC increase to 3. This configuration decreases the network congestion by providing higher priority to the packets already in the NoC. Hence, the average waiting time decreases. Since there are no other analysis approaches for WRR, this section does not provide comparisons with the state of the art.
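The effect of the arbitration weights can be illustrated with a toy WRR arbiter. The sketch below assumes one common WRR variant, in which each input may send up to its weight in packets per round before the arbiter moves on to the next input; the arbiter micro-architecture used in the paper may differ, and the function name `wrr_grants` is introduced here only for illustration.

```python
from collections import deque
from typing import Deque, List, Sequence


def wrr_grants(
    queues: Sequence[Deque[int]], weights: Sequence[int], num_grants: int
) -> List[int]:
    """Return the input index granted at each step of a toy WRR arbiter."""
    grants: List[int] = []
    # Keep cycling through the inputs while packets remain and grants are needed.
    while len(grants) < num_grants and any(queues):
        for idx, (queue, weight) in enumerate(zip(queues, weights)):
            for _ in range(weight):
                if not queue or len(grants) == num_grants:
                    break  # input is empty or enough grants were issued
                queue.popleft()
                grants.append(idx)
    return grants
```

With input 0 standing for an external channel of weight 1 and input 1 for an internal channel of weight 3, and both queues backlogged, the internal channel receives three grants for every grant given to the external channel, which is the bandwidth split that favors packets already in the NoC.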

Fig. 5: Verification of the analytical model for weighted round-robin with an 8×8 NoC. [x y] denotes that the WRR weight associated with channels connected to the internal routers is x and the weight associated with traffic channels of external input to the NoC is y.
Topo.
8×1 1.4% 11% 3.6% 7.8% 1.5% 13% 9% 11%
6×6 5.5% 7.2% 7.4% 11% 5.2% 11% 5.9% 12%
8×8 2.6% 7.8% 5.2% 10% 3.5% 11% 4.8% 7.2%
TABLE III: Summary of results for synthetic applications with 100% hit.
Topo.
8×1 4.3% 4.6% 5.3% 5.7% 4.4% 4.6% 4.5% 4.8%
6×6 5.0% 5.4% 5.5% 6.0% 5.0% 5.4% 5.0% 5.8%
8×8 7.1% 8.0% 7.4% 7.6% 7.4% 13% 8.1% 9.0%
TABLE IV: Summary of results for synthetic applications with 100% miss.

V-D Results with Bursty Traffic and Real Applications

This section evaluates the proposed model in the presence of bursty traffic. We define the probability of burstiness as of the general geometric distribution [30]. Increasing denotes increasing burstiness in the traffic. We note that no prior work has considered weighted round-robin arbitration together with bursty traffic. Figure 6 compares the end-to-end average latency results against cycle-accurate simulations for 6×6 and 8×8 meshes when the arbitration weights associated with the packets in the NoC and external packet arrivals are 3 and 1, respectively. When the burst probability is (Figure 6(a)), the average modeling errors for the 6×6 and 8×8 meshes are 7% and 4%, respectively. When the burst probability increases to (Figure 6(b)), the average errors become 6% and 5%, respectively.

Fig. 6: Verification of analytical model for bursty traffic with (a) and .

Finally, we evaluate the proposed analysis technique while running applications from the SPEC CPU2017 [34] and PARSEC [35] benchmark suites. The applications from the SPEC CPU2017 benchmarks show burstiness in the range of . The applications from the PARSEC benchmark are 16-threaded and do not show any burstiness. Figure 7 compares the analysis results against cycle-accurate simulations on an 8×8 NoC. The WRR weights of packets already in the NoC are set to 3, while new packets from cores are given a lower priority with a weight of 1. We observe that our proposed analysis technique consistently achieves a modeling error of less than 5% for these applications. These results demonstrate that the proposed analysis technique can accurately model the end-to-end NoC latency under WRR arbitration and bursty traffic.

V-E Summary of the Evaluation Results

This section summarizes the accuracy of our performance analysis technique systematically for different NoC sizes, WRR weights, and traffic burstiness. To capture the most commonly observed corner cases, we provide results for two extreme cases – 100% LLC hit and 100% LLC miss traffic. In both cases, cores send packets to each LLC at an equal rate.

Table III summarizes the comparisons between analysis and simulations for 100% LLC hit. When and the traffic load is moderate (), the modeling error ranges from 1.4% to 5.5% considering both RR and WRR arbitration. Even when the traffic load increases and congests the NoC (), the modeling error remains below 11% for all configurations. A higher level of burstiness to increases the NoC load. Consequently, the error range increases to 3.6%–9.0% for the moderate traffic load (). Increasing the load further congests the NoC (), pushing the worst-case error only to 13%, which is acceptable since practical systems rarely operate at this load due to congestion control mechanisms.

Finally, Table IV summarizes the modeling error of the proposed technique for 100% LLC miss. In this case, the missed core requests are forwarded to memory controllers (one in the 8×1 NoC and two in the 6×6 and 8×8 NoCs). Hence, the memory controllers become hotspots. Despite the high degree of skewness, the error for moderate NoC loads ranges from 4.3% to 8.1% for all burstiness levels, WRR weight configurations, and NoC sizes. Even when the NoC is pushed to congestion, the modeling error remains under 9.0% for all configurations, except for the 8×8 NoC with WRR weights 3 and 1. The error in this specific case is 13%; the errors for the 8×1 and 6×6 NoCs are 4.6% and 5.4%, respectively.

Fig. 7: Verification against real applications.

V-F Execution Time of the Proposed Analysis Technique

We implemented the proposed technique in C++ to obtain end-to-end latency. The computational complexity of the proposed model is , for NoC () and WRR arbiters with maximum ports. The execution time as a function of NoC size with 3 output ports per router is summarized in Table V. We observe that even for NoC, the analytical model takes only 45.07 ms to execute, while the simulation time of the NoC is in the order of minutes. These results show that the proposed analysis technique is lightweight and provides four orders of magnitude speed-up compared to cycle-accurate simulations.

NoC size
Exe. time (ms) 0.56 1.41 2.25 13.52 45.07
TABLE V: Analysis on execution time of the proposed model.
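As a quick sanity check on the speed-up claim, the arithmetic below derives the simulation time implied by a four-orders-of-magnitude speed-up over the 45.07 ms model run. The simulation time is not reported in this section, so the value is computed from the stated speed-up rather than measured; the variable names are illustrative.

```python
MODEL_TIME_MS = 45.07   # largest entry reported in Table V
SPEEDUP = 1e4           # "four orders of magnitude"

# Simulation time that corresponds to the stated speed-up.
implied_sim_seconds = MODEL_TIME_MS / 1000.0 * SPEEDUP
implied_sim_minutes = implied_sim_seconds / 60.0

print(f"implied simulation time: {implied_sim_seconds:.1f} s "
      f"(~{implied_sim_minutes:.1f} min)")
```

The implied simulation time is about 450 seconds, or roughly seven and a half minutes, which agrees with the statement that cycle-accurate simulation takes on the order of minutes.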

VI Conclusion and Future Work

This paper presented a scalable performance analysis technique for NoCs with WRR arbitration. WRR arbitration is promising since it enables latency fairness and custom bandwidth allocation to meet the requirements of individual traffic flows, unlike basic round-robin and priority-based arbitrations. The proposed technique handles bursty traffic and provides high accuracy for both ring topologies used in client systems and large 2D mesh topologies used in servers. Extensive evaluations show the proposed approach achieves less than 5% error while executing real applications and 10% error under challenging synthetic traffic with different burstiness levels. Hence, it can be used for fast and accurate design space exploration as well as pre-silicon evaluations. One of our future directions is finding the WRR weights to meet individual traffic flows’ bandwidth and latency requirements.

References

  • [1] Mohamed Arafa et al. Cascade Lake: Next Generation Intel Xeon Scalable Processor. IEEE Micro, 39(2):29–36, 2019.
  • [2] Jack Doweck et al. Inside 6th-generation Intel Core: New Microarchitecture Code-named Skylake. IEEE Micro, 37(2):52–62, 2017.
  • [3] Hanbo Sun et al. Reliability-Aware Training and Performance Modeling for Processing-In-Memory Systems. In 26th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 847–852. IEEE, 2021.
  • [4] Sumit K Mandal et al. A Latency-Optimized Reconfigurable NoC for In-Memory Acceleration of DNNs. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 10(3):362–375, 2020.
  • [5] Seyed Morteza Nabavinejad et al. An Overview of Efficient Interconnection Networks for Deep Neural Network Accelerators. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 10(3):268–282, 2020.
  • [6] Ming-Chao Chiang, Tse-Chen Yeh, and Guo-Fu Tseng. A QEMU and SystemC-based Cycle-accurate ISS for Performance Estimation on SoC Development. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(4):593–606, 2011.
  • [7] Thomas Moscibroda and Onur Mutlu. A Case for Bufferless Routing in on-chip Networks. In Proc. of Intl. Symp. on Computer Architecture, pages 196–207, 2009.
  • [8] Bhavya K Daya, Li-shiuan Peh, and Anantha P Chandrakasan. Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling. In Proc. of Design Automation Conf., pages 1–6, 2016.
  • [9] Radu Marculescu et al. Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Syst., 28(1):3–21, 2008.
  • [10] William James Dally and Brian Patrick Towles. Principles and Practices of Interconnection Networks. Elsevier, 2004.
  • [11] Avinash Sodani et al. Knights Landing: Second-generation Intel Xeon Phi Product. IEEE Micro, 36(2):34–46, 2016.
  • [12] Umit Y Ogras and Radu Marculescu. Modeling, Analysis And Optimization Of Network-On-Chip Communication Architectures, volume 184. Springer Science & Business Media, 2013.
  • [13] Andrea Pellegrini et al. The Arm Neoverse N1 Platform: Building Blocks for the Next-gen Cloud-to-Edge Infrastructure SoC. IEEE Micro, 40(2):53–62, 2020.
  • [14] Eung S Shin, Vincent J Mooney III, and George F Riley. Round-robin Arbiter Design and Generation. In Proc. of Intl. Symp. on System Synthesis, pages 243–248, 2002.
  • [15] Yun-Lung Lee, Jer Min Jou, and Yen-Yu Chen. A High-speed and Decentralized Arbiter Design for NoC. In Proc. of Intl. Conf. on Comp. Systs. and Applications, pages 350–353, 2009.
  • [16] Gao Xiaopeng, Zhang Zhe, and Long Xiang. Round Robin Arbiters for Virtual Channel Router. In Proc. of Multiconference on Computational Engineering in Systems Applications, volume 2, pages 1610–1614, 2006.
  • [17] Yue Qian, Zhonghai Lu, and Qiang Dou. QoS Scheduling for NoCs: Strict Priority Queueing versus Weighted Round Robin. In Proc. of Intl. Conf. on Computer Design, pages 52–59, 2010.
  • [18] Jan Heißwolf, Ralf König, and Jürgen Becker. A Scalable NoC Router Design Providing QoS Support using Weighted Round Robin Scheduling. In Proc. of Intl. Symp. on Parallel and Distributed Processing with Applications, pages 625–632, 2012.
  • [19] Cesar Albenes Zeferino and Altamiro Amadeu Susin. SoCIN: A Parametric and Scalable Network-on-Chip. In Symp. on Integrated Circuits and Systems Design, 2003. SBCCI 2003. Proceedings., pages 169–174, 2003.
  • [20] Lan Wang, Geyong Min, Demetres D Kouvatsos, and Xiaolong Jin. Analytical Modeling of an Integrated Priority and WFQ Scheduling Scheme in Multi-service Networks. Computer Communications, 33:S93–S101, 2010.
  • [21] Zhi-Liang Qian et al. A Support Vector Regression (SVR)-based Latency Model for Network-on-Chip (NoC) Architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 35(3):471–484, 2015.
  • [22] Abbas Eslami Kiasari, Zhonghai Lu, and Axel Jantsch. An Analytical Latency Model for Networks-on-Chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21(1):113–123, 2012.
  • [23] Sumit K Mandal et al. Analytical Performance Modeling of NoCs under Priority Arbitration and Bursty Traffic. IEEE Embedded Systems Letters, 2020.
  • [24] Sumit K Mandal et al. Performance Analysis of Priority-aware NoCs with Deflection Routing under Traffic Congestion. In Proc. of Intl. Conf. on Computer-Aided Design, pages 1–9, 2020.
  • [25] Umit Y Ogras, Paul Bogdan, and Radu Marculescu. An Analytical Approach for Network-on-Chip Performance Analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 29(12):2001–2013, 2010.
  • [26] Erik Fischer and Gerhard P Fettweis. An Accurate and Scalable Analytic Model for Round-robin Arbitration in Network-on-Chip. In Proc. of Intl. Symp. on Networks-on-Chip (NoCS), pages 1–8, 2013.
  • [27] Onno J. Boxma and Bernd W. Meister. Waiting-time Approximations in Multi-queue Systems with Cyclic Service. Performance Evaluation, 7(1):59–70, 1987.
  • [28] Wim P Groenendijk and Hanoch Levy. Performance Analysis of Transaction Driven Computer Systems via Queueing Analysis of Polling Models. IEEE Transactions on Computers, 41(4):455–466, 1992.
  • [29] Jasper Vanlerberghe. Analysis and Optimization of Discrete-time Generalized Processor Sharing Queues. PhD thesis, Ghent University, 2018.
  • [30] Demetres D Kouvatsos. Entropy Maximisation and Queueing Network Models. Annals of Operations Research, 48(1):63–126, 1994.
  • [31] Dimitri P Bertsekas, Robert G Gallager, and Pierre Humblet. Data Networks, volume 2. Prentice-Hall International New Jersey, 1992.
  • [32] Guy Pujolle and Wu Ai. A Solution for Multiserver and Multiclass Open Queueing Networks. INFOR: Information Systems and Operational Research, 24(3):221–230, 1986.
  • [33] Umit Y Ogras and Radu Marculescu. “It’s a Small World After All”: NoC Performance Optimization via Long-Range Link Insertion. IEEE Trans. on Very Large Scale Integration (VLSI) Systs., 14(7):693–706, 2006.
  • [34] James Bucek, Klaus-Dieter Lange, and Jóakim v. Kistowski. SPEC CPU2017: Next-Generation Compute Benchmark. In Companion of the 2018 ACM/SPEC Intl. Conf. on Performance Engineering, pages 41–42, 2018.
  • [35] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proc. of Intl. Conf. on Parallel Architectures and Compilation Techniques, pages 72–81, 2008.
  • [36] Umit Y Ogras, Yunus Emre, Jianping Xu, Timothy Kam, and Michael Kishinevsky. Energy-guided Exploration of On-chip Network Design for Exa-scale Computing. In Proc. of Intl. Workshop on System Level Interconnect Prediction, pages 24–31, 2012.
  • [37] Sumit K. Mandal et al. Theoretical Analysis and Evaluation of NoCs with Weighted Round-Robin Arbitration: An Open Source Release. https://github.com/sumitkmandal/WRR_NoC_analytical_model, 2021. Accessed 12 August 2021.