I Introduction
Networks-on-chip (NoCs) continue to play a central role as many-core processors start dominating the server [1, 2] and deep learning [3, 4, 5] markets. As commercial solutions scale up, the latency, area, and power consumption overheads of NoCs become increasingly crucial. Designers need analytical power-performance models to guide complex design decisions during the architecture development and implementation phases. After that, the same models are required by virtual platforms, commonly used to develop and evaluate the software ecosystem and applications [6]. Hence, there is a strong demand for high-fidelity analytical techniques that accurately model fundamental aspects of industrial designs across all segments, ranging from systems-on-chip to client and server systems.
NoCs can be broadly classified in terms of buffer usage as buffered and bufferless architectures [7, 8, 9, 10]. Most early solutions adopted buffered techniques, such as wormhole and virtual-channel switching, where the packets (or their flits) are stored in intermediate routers. The area, latency, and energy consumption of buffers later led to bufferless architectures, where the intermediate routers forward the incoming flits if they can and deflect them otherwise. Bufferless NoCs save significant buffer area and enable ultra-fast routing decisions, as low as a single cycle [7, 8]. Therefore, many industrial NoCs used in server and client architectures employ bufferless solutions to minimize the communication latency between the cores, last-level caches (LLC), and main memory [11, 2]. These solutions give priority to the packets already in the network to enable predictable and fast communication while stalling the newly generated packets from the processing and storage nodes. However, buffer area savings and low communication latency come at the cost of an early onset of congestion. Indeed, the packets wait longer at the end nodes, and the throughput saturates faster when the NoC load increases. Moreover, all routers in the NoC remain powered on, increasing the NoC power consumption. Therefore, there is a strong need to address these shortcomings.
Buffered NoCs with virtual-channel routers have been used more commonly in academic work and the most recent industry standards [12, 13]. Shared buffering resources, such as input and output channels, require arbitration among different requesters. For example, suppose that packets in different input channels request the same output channel. An arbiter needs to resolve the conflict and grant access to one of the requesters to meet the performance target. The architectures proposed to date predominantly employ a basic round-robin (RR) arbiter to provide fairness to all requesters [14, 15, 16]. Although the decisions are locally fair, the number of arbitrations a packet goes through grows with its path length. Hence, RR arbitration is globally unfair. More importantly, basic RR cannot give preference to a particular input, which is typically desired since not all requests are equal. For example, data and acknowledgment packets can have higher priority than new requests to complete outstanding transactions, especially when the network is congested.
Weighted round-robin (WRR) arbitration provides flexibility in allocating bandwidth proportionally to the importance of the traffic classes, unlike basic round-robin and priority-based arbitration. Each requester has an assigned weight, which is a measure of its importance. A larger weight indicates that the requester is given more preference in arbitration. Due to its generality, WRR arbitration has been employed in several NoC proposals in the literature [17, 18, 19]. Indeed, WRR arbitration enables higher throughput than RR arbitration [18]. Despite its potential, WRR arbitration has not been analyzed theoretically, especially for large-scale NoCs. A large body of literature has proposed performance analysis techniques for buffered and bufferless NoCs since analytical models play a crucial role in fast design space exploration and pre-silicon evaluation [20, 21, 22, 23, 24]. In contrast, no analytical modeling technique has been proposed to date for NoCs with WRR arbitration. A formal analysis is required to understand the behavior of NoCs with WRR arbitration. At the same time, executable performance models are needed to guide many-core processor design and enable virtual platforms for pre-silicon evaluation.
This paper presents a fast, accurate, and scalable performance analysis technique for NoCs with WRR arbitration. To the best of our knowledge, it is the first performance analysis technique for NoCs with weighted round-robin arbitration. Furthermore, the proposed technique supports bursty core traffic observed in real applications, which is typically ignored due to its complexity. It first estimates the effective service time of the packets in the queue due to WRR arbitration. Then, it utilizes the effective service time to obtain the average waiting time of the packets. We also propose a decomposition technique to extend the analytical model to NoCs of any size. Extensive experimental evaluations show that the proposed analytical model has less than 5% error with real applications and 10% error with synthetic traffic that has different burstiness levels and congests the NoC.
The major contributions of the work are listed below:

A novel performance analysis technique for NoCs that employ WRR arbitration,

A decomposition technique to obtain a scalable analytical model for NoCs of any size,

Experimental evaluations with multiple NoC configurations and different traffic scenarios, showing less than 5% error for real applications.
II Related Work and Novel Contributions
Analytical models are required to estimate NoC performance for fast design space exploration and pre-silicon evaluation. Multiple prior studies have proposed NoC performance analysis techniques with basic round-robin arbitration [25, 26, 21]. The authors in [25] first construct a contention matrix between multiple flows in the NoC. Then, the average waiting time of the packets corresponding to each flow is computed. A support vector regression-based analytical model for NoCs is proposed in [21]. The analytical model proposed in [26] estimates the mean service time of the flows with RR arbitration. The estimated mean service time is used to find the average waiting time of the flows. However, none of these techniques is applicable in the presence of both bursty traffic and WRR arbitration.
Analytical modeling of round-robin arbitration has also been studied outside the NoC domain [27, 28, 20]. The techniques presented in [27, 28] incorporate a polling model to approximate the effective service time of a queue in the presence of RR arbitration. However, none of these approaches is applicable when the input distribution to the queue is not geometric. A Markov chain-based analytical model is proposed in [20] to account for bursty input traffic. However, the technique is not scalable for a network of queues. Moreover, none of these techniques is applicable to discrete-time queuing systems. Since each transaction in a NoC happens at discrete clock cycles, the analytical models need to incorporate discrete-time queuing systems. The major drawbacks of the prior approaches are summarized in Table I.
The basic round-robin arbitration cannot provide fairness when requesters have widely varying data rate requirements and priorities. Therefore, weighted round-robin arbitration, i.e., WRR, has been used in on-chip communication architectures [17, 18]. Qian et al. compute delay bounds for different channels with different weights to assign an appropriate weight to each input channel of the NoC [17]. They show that WRR delivers better quality of service than strict priority-based arbitration. The authors in [18] propose a WRR-based scheduling policy. The proposed technique assigns larger bandwidth to input channels with higher weights. It achieves higher throughput compared to round-robin arbitration. Although WRR has shown promise, no analytical modeling approach exists for NoCs with WRR to date.
Research | Approach | WRR |  | Scalable | 
Boxma et al. [27] | Polling model | No | No | Yes | No
Wim et al. [28] | Polling model | No | No | Yes | No
Wang et al. [20] | Markov chain | No | Yes | No | No
Fischer et al. [26] | Heuristic | No | No | Yes | No
 |  | No | No | No | Yes
This work |  | Yes | Yes | Yes | Yes
This paper presents the first performance analysis technique for NoCs with WRR arbitration. It fills an essential gap since WRR can address the shortcomings of priority-based bufferless NoC architectures and the basic round-robin arbitration. Furthermore, the proposed technique supports bursty traffic observed in real applications, which is typically ignored due to its complexity. Hence, it is a vital step towards comprehending the theoretical underpinnings of NoCs with WRR arbitration and enabling their deployment in industrial designs.
III Background and Overview
III-A Weighted Round-Robin Arbitration
This work uses weighted round-robin arbitration in the NoC routers. The basic operating principle of WRR arbitration is illustrated in Figure 1 for three traffic classes. Packets from each class are first written to a dedicated input queue (also known as a channel). Suppose there are $N$ input queues $Q_1, \ldots, Q_N$. The WRR technique assigns a positive integer weight, denoted as $\omega_i$, to queue $Q_i$ for $1 \leq i \leq N$. The WRR arbiter serves up to $\omega_i$ consecutive packets from $Q_i$ before moving to the next queue. If $Q_i$ has fewer than $\omega_i$ packets, then WRR serves $Q_i$ until it becomes empty. Then, the WRR arbiter serves the subsequent queues following the same principle. After a cycle is completed, the weights are reset to their initial values, as illustrated in Figure 1, and the same arbitration cycle is repeated. In NoCs with WRR arbitration, whenever two or more requesters compete for the same resource, they are arbitrated following WRR. WRR arbitration can be used for arbitrating both different virtual channels and different ports in the network.
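The arbitration cycle described above can be illustrated with a minimal Python sketch. It is only an illustration under simplifying assumptions: the queues are a static snapshot (no new packets arrive while the arbiter runs), and the function name `wrr_service_order` is hypothetical rather than taken from the paper or its released code.

```python
from collections import deque


def wrr_service_order(queues, weights):
    """Return the order in which a WRR arbiter drains a snapshot of queues.

    queues  : list of lists of packet labels, one list per input queue
    weights : positive integer weight per queue (same length as queues)

    Assumption: no new packets arrive while the snapshot is drained.
    """
    if len(queues) != len(weights):
        raise ValueError("one weight per queue is required")
    if any(w < 1 for w in weights):
        raise ValueError("weights must be positive integers")

    pending = [deque(q) for q in queues]
    order = []
    while any(pending):
        # One arbitration cycle: visit every queue once, in a fixed order.
        for queue, weight in zip(pending, weights):
            # Serve up to `weight` consecutive packets, or stop early
            # if the queue becomes empty.
            for _ in range(weight):
                if not queue:
                    break
                order.append(queue.popleft())
        # The per-queue budgets are implicitly reset for the next cycle.
    return order


if __name__ == "__main__":
    snapshot = [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1"]]
    print(wrr_service_order(snapshot, weights=[2, 1, 1]))
```

With weights 2, 1, and 1, the first cycle serves two packets from the first queue and one from each of the others, so the resulting order is a1, a2, b1, c1, a3, b2, b3.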
III-B Usage of the Proposed Performance Analysis Technique
WRR arbitration is promising for NoCs since it can tailor the communication bandwidth to different traffic classes. Furthermore, it provides end-to-end latency fairness to different source-destination pairs, unlike basic round-robin and priority arbitration techniques. However, these capabilities come at the expense of a vast design parameter space. An $n \times n$ mesh with $p$-port routers has $n^2 p$ tunable weights, e.g., an 8×8 2D mesh with 5-port routers would have 320 WRR weights. Due to this ample design space, the current practice is limited to assigning two weights to each router (e.g., one weight to local ports and another weight to packets already in the NoC).
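To make the size of this design space concrete, the short sketch below counts the tunable weights of a mesh and the number of weight combinations. It assumes one weight per router port, consistent with the 320-weight example above; the cap `max_weight` on each weight and both function names are hypothetical and only serve the illustration.

```python
def num_wrr_weights(rows: int, cols: int, ports_per_router: int) -> int:
    """Number of tunable WRR weights, assuming one weight per router port."""
    if min(rows, cols, ports_per_router) < 1:
        raise ValueError("all arguments must be positive")
    return rows * cols * ports_per_router


def design_space_size(num_weights: int, max_weight: int) -> int:
    """Number of weight assignments if each weight is an integer in
    1..max_weight (max_weight is an assumed, illustrative cap)."""
    return max_weight ** num_weights


if __name__ == "__main__":
    n_weights = num_wrr_weights(8, 8, 5)
    print(n_weights)  # 320 weights for an 8x8 mesh with 5-port routers
    # Even with only two candidate values per weight, the space is huge.
    print(len(str(design_space_size(n_weights, 2))), "decimal digits")
```

Even a binary choice per weight yields 2^320 combinations, which is why exhaustive cycle-accurate simulation of the weight space is impractical.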
The benefits of the proposed theoretical analysis are twofold. First, it can enable accurate pre-silicon evaluations and design space exploration without time-consuming cycle-accurate simulations. Second, it can be used to find the combination of weights that optimizes the performance, i.e., to solve the optimization problem in the vast design space described in the previous paragraph. This paper focuses on constructing the proposed analysis technique and its evaluation against cycle-accurate simulation for two reasons. First, the theoretical analysis is complex, and its evaluation deserves a dedicated treatment on its own. Moreover, its application for solving optimization problems requires demonstration of its fidelity first. Multi-objective optimization of the WRR weights using the proposed model is one of our future research directions.
IV Proposed Methodology and Approach
Figure 2 shows a weighted round-robin arbiter with $N$ queues. The packets that belong to traffic class $i$ are stored in queue $Q_i$. The corresponding average arrival rate of class-$i$ packets is denoted by $\lambda_i$, as summarized in Table II. The burstiness of traffic class $i$ is captured by the squared coefficient of variation of its inter-arrival time, denoted by $C_{A_i}$ [30, 17]. Finally, the weight assigned to class $i$ is denoted as $\omega_i$ for $1 \leq i \leq N$, as illustrated in Figure 2.
To provide a step-by-step derivation, Section IV-A starts with a particular case of the proposed technique tailored to the basic round-robin arbitration, i.e., all weights are set to one. Then, Section IV-B extends the formulation to weighted round-robin arbitration.
Number of traffic classes to an arbiter  
Injection rate of class  
Weight assigned to class  
Original mean service time of all traffic classes  
Mean value of effective service time of class  
, 








Sq. coeff. of variation of interarrival time (merged traffic)  
Squared coeff. of variation of interdeparture time  
Mean queue occupancy of class  



Average waiting time of class  
Mean effective residual service time of class 
IV-A Analytical Model for Basic Round-Robin Arbitration
In the basic round-robin arbitration, all weights are equal to one, i.e., $\omega_i = 1$ for all $i$. Each traffic class experiences an additional delay until the arbiter serves the head packets in the other queues. The proposed technique has two steps:


The first step is to compute the first two moments of the effective service time for each class of the fully decomposed queue nodes illustrated in Figure 2.

In the second step, we use the transformed effective service times to find the total waiting time (including the queuing delay) of each traffic class.
For the clarity of notation and illustration, the derivation below assumes the original two moments of service time are the same across all traffic classes.
IV-A1 Mean effective service time of RR
The effective service time accounts for the delay experienced by the packets at the head of each queue. This delay includes a packet's own service time and the service time of the packets at the head of the other queues, since the round-robin arbiter serves them one by one. When all $N$ queues have packets, a class-$i$ packet is served once in every $N$ served packets, i.e., the class waits for $N-1$ other packets to be served before winning the arbitration again. In general, a traffic class only contributes to the extra service time if it has a packet waiting for service. Thus, assuming the packet arrivals of different classes are independent, we can express the mean effective service time as:
(1) 
where $P(n_j > 0 \mid \hat{T}_i)$ denotes the probability that the occupancy of class $j$ is greater than zero given that the mean effective service period of class $i$ is $\hat{T}_i$, for $j \neq i$. We approximate this probability using Little's law as $\lambda_j \hat{T}_i$, following the busy-cycle approach [27]. Little's law gives the expected number of class-$j$ packets that arrive during the period $\hat{T}_i$, which approximates the probability that the occupancy of class-$j$ packets is greater than zero during this period. To ensure that the probability term does not exceed one, we approximate it as $\min(1, \lambda_j \hat{T}_i)$. Therefore, Equation 1 can be rewritten as:
(2) 
We solve Equation 2 through a simple iterative approach described in Algorithm 1. First, Equation 2 is solved by removing the nonlinearity introduced by the min operation (lines 4 and 5). Of the two resulting solutions, we take the smaller one, since we observed through experiments that it is more accurate. This solution is used as the initial estimate of the iterative approach. Then, this estimate is plugged into Equation 2 within an iteration loop to obtain a better estimate (line 7). The iterations continue until the change in the effective service time is within a user-provided tolerance. Algorithm 1 typically takes fewer than ten iterations due to the quadratic nature of Equation 2, which takes only a few microseconds to compute. After the iterations are completed, Algorithm 1 returns the effective service time.
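The iterative refinement can be sketched in Python as a plain fixed-point iteration. This is a simplified illustration rather than Algorithm 1 itself: it assumes the right-hand side has the form $T + T \sum_{j \neq i} \min(1, \lambda_j \hat{T}_i)$, which is one reading of the description above and may differ from the paper's Equation 2; it starts from the original service time instead of the closed-form initial estimate; and the tolerance value and the function name are assumptions.

```python
def effective_service_time(service_time, other_rates, tol=1e-6, max_iter=1000):
    """Fixed-point iteration for t = T + T * sum_j min(1, rate_j * t).

    service_time : original mean service time T (in cycles)
    other_rates  : arrival rates of the competing classes (packets/cycle)
    tol          : stopping threshold on the change between iterations
                   (assumed value, not the one used in the paper)
    """
    t = service_time  # simple initial estimate: no competing traffic
    for _ in range(max_iter):
        # Each competing class adds one service time, weighted by the
        # (capped) probability that it has a packet waiting.
        t_next = service_time + service_time * sum(
            min(1.0, rate * t) for rate in other_rates
        )
        if abs(t_next - t) <= tol:
            return t_next
        t = t_next
    raise RuntimeError("fixed-point iteration did not converge")


if __name__ == "__main__":
    # Light load: two competing classes at 0.1 packets/cycle each.
    print(effective_service_time(1.0, [0.1, 0.1]))
    # Saturated competitors: every other queue always has a packet.
    print(effective_service_time(1.0, [2.0, 2.0]))
```

Under this assumed form the iterates increase monotonically from $T$ and are bounded by $N T$, so the loop terminates; in the saturated case the result equals the full round-robin period of three service times.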
IV-A2 Coefficient of variation of effective service time for RR
This section derives the coefficient of variation of the effective service time by leveraging the conservation of work principle and the property of equal residual service times of individual classes [27]. We leverage these properties to compute the mean effective residual service time of class $i$. Then, the residual time is used to calculate the coefficient of variation using the relation found in [30]:
(3) 
where is the effective server utilization. In addition, we show that leveraging the conservation of work principle gives our model the capability of handling bursty traffic.
We exploit the conservation of work principle, which states that the same number of packets is served regardless of how the bandwidth is divided between the traffic flows, assuming all packets have the same moments of service time. Due to this fundamental principle, the total queue occupancy is invariant under the arbitration policy. Furthermore, this sum can be computed with near-perfect accuracy using the maximum entropy-based multi-class analytical queuing model given in [30]. One strong aspect of this model is that it can handle bursty traffic that conforms to the Generalized Geometric (GGeo) distribution. By leveraging this model, we can embed the capability of handling bursty traffic into our analytical model. The total queue occupancy can be calculated as:
(4)
using the parameters summarized in Table II. We do not delve into the derivation of this equation for clarity since it is not our major focus. The detailed derivation can be found in the seminal work by Kouvatsos et al. [30].
Next, we express the total queue occupancy using the first two moments of the transformed effective service time. The occupancy of each queue can be expressed using Little's law as $n_i = \lambda_i W_i$, where $W_i$ is the average waiting time of class $i$. Therefore, we obtain:
(5) 
The average waiting time consists of two components. The first one is the waiting time in the queue until the packet reaches the head of the queue, i.e., until all earlier packets of the same class leave the server completely. The second component accounts for the waiting time at the head of the queue when the server is busy serving other classes (class $j$, $j \neq i$). This component is the additional time captured by the effective service time, i.e., the difference between the effective and original service times: $\hat{T}_i - T$. Furthermore, the waiting time is expressed as a function of the residual time [31]. Hence,
(6) 
We leverage the property that the mean effective residual service times of all classes are equal [27]; hence, we use a single residual time for all classes. Thus, we can rewrite the total occupancy by plugging Equation 6 into Equation 5 and then solve for the residual time:
(7) 
Then, we can compute the coefficient of variation by substituting the residual time from Equation 7 in Equation 3.
IV-A3 Average waiting time of RR
IV-B Analytical Model for Weighted Round-Robin Arbitration
This section extends the analytical model constructed in Section IV-A to WRR using the same steps. It first describes the derivation of the first two moments of the effective service time. Then, it uses these moments to find the average waiting time.
IV-B1 Mean effective service time of WRR
In WRR, the weight of each class can be larger than or equal to one, i.e., $\omega_i \geq 1$. To illustrate the generalization, we start with a system of two queues, where $\omega_1 \geq 1$ and $\omega_2 = 1$. Then, we generalize it to $N$ queues with arbitrary weights.
Mean effective service time of class with and : Since , a batch of up to number of class packets are served without interruption. Let be the total service time of class packets. When there is a class packet in , the next batch of class packet will wait. Therefore, we can express as:
(9) 
Note that this equation generalizes Equation 1 to an arbitrary weight for two queues. Using Little's law and the methodology described in Section IV-A, we obtain the approximation:
(10) 
Mean effective service time of class with and : The batch size of class packet is one in this case since . They need to wait in the queue for a maximum of class packets. Therefore, is expressed as:
(11) 
We can approximate as to obtain:
(12) 
To get to the final form of Equation 12 above, we approximate the conditional probabilities based on [27].
Generalization to $N$ queues and arbitrary weights: In general, there are $N$ classes, each with its own weight. Consider the class-$i$ packets and the total service time for a batch of these packets. The first part of the effective service time is due to the packets of the same class. This effect is captured in our two-queue illustration by Equation 9. The second part of the effective service time is due to the packets of other classes. This effect is captured in our two-queue illustration by Equation 11. By combining them, we can express the generalized effective service time as:
(13) 
Note that this equation reduces to Equation 9 for , and to Equation 11 for .
Finally, we can obtain the generalized effective service times by replacing with as follows:
(14) 
Due to the nonlinearity introduced by the min operation, we compute the effective service time using the iterative Algorithm 2, just like the basic round-robin case. The quadratic nature of Equation 14 enables fast convergence.
IV-B2 Coefficient of variation of effective service time of WRR
Footnote 1: We use the same notation for RR and WRR arbitration since WRR results are generalizations of basic RR expressions.
In the basic round-robin case, the residual service time accounts for the second-order moment of the service time. However, the mean effective residual service times of different classes are not necessarily equal to each other for arbitrary weights, which is the case in WRR arbitration. The mean effective residual service time of class-$i$ packets is expressed as:
(15) 
where the mean effective service time is found from Algorithm 2 based on Equation 14. Note that the residual service time depends on the first two moments of the service time, which are the average service time and the coefficient of variation, respectively. To approximate the coefficient of variation for WRR, we leverage its monotonically decreasing behavior as a function of the corresponding weight. This behavior is expected since a larger weight corresponds to fewer arbitration stages, decreasing the effective service time variation. This observation indicates that the coefficient of variation for round-robin arbitration is an upper bound on the coefficient of variation for weighted round-robin arbitration. This upper bound, for class $i$, is computed using Equation 3. Given this, we first apply a linear approximation to calculate the coefficient of variation under WRR; note that we use the squared value since Table II defines the quantity as the square of the coefficient of variation. We add a parameter to tighten the approximation. Next, we recall Equation 4 and substitute the approximated value to obtain:
(16) 
The left hand side of the Equation 16 () is obtained by applying Equation 4. Therefore, is the only unknown in Equation 16. Once we obtain (by solving Equation 16), we compute under WRR as . This expression captures traffic burstiness addressing one of the significant drawbacks of the prior work.
IV-B3 Average waiting time of WRR
IV-C Estimation of end-to-end latency
The previous section described the canonical model for WRR arbitration, where all traffic classes go through a single arbiter. The packets in NoCs go through a sequence of WRR arbiters while traveling from their sources towards their destinations. For example, an 8×8 2D mesh has 64 routers with at least one arbiter per output queue in each of them. Figure 3(a) illustrates an example with two stages, where the flow that wins the first stage is routed to the second one. The major challenge in modeling multiple stages is obtaining the inter-departure distribution, which becomes the arrival flow at the subsequent stages. For instance, the coefficient of variation of arrival flow 2 in the second stage is equal to the coefficient of variation of the departing flow from stage 1 in Figure 3(a). We first find the squared coefficient of variation of inter-departure time for each traffic class:
(18) 
Next, we apply the decomposition technique [32] to obtain the departure distribution at each stage.
(19) 
The departure distribution of one stage is the arrival distribution to the next stage. For a given source-destination pair, these calculations are performed at each arbitration stage on the path of the packets.
Algorithm 3 presents a step-by-step procedure to obtain the end-to-end latency of a given NoC with WRR arbitration and deterministic routing. The inputs to the algorithm are the NoC size, the injection rate and coefficient of variation of inter-arrival time of all classes, the service time of the queues, and the weights assigned to each arbiter. The outputs of the algorithm are the latency of each source-destination pair and the average latency.
Algorithm 3 first computes the waiting time of each source-destination pair. The function in lines 12–22 describes the procedure for this computation. The procedure starts with finding the number of stages for the flow from the source to the destination. At each stage, it computes the average waiting time using Equation 17 (line 16). Next, it adds the waiting time at the current stage to the cumulative waiting time found so far (line 17). Then, it computes the coefficient of variation of inter-departure time using Equation 18, which is used as the coefficient of variation of inter-arrival time in the next stage (lines 18–19). Finally, the procedure returns the total waiting time from the source to the destination (line 21). After retrieving the total waiting times, Algorithm 3 adds the free packet delay, i.e., the latency due to the links and router microarchitecture, which can be found using the number of hops and the router pipeline depth [33, 10]. Then, end-to-end latencies and flow rates are accumulated (line 8). Finally, the average latency of the network is obtained by computing the weighted average of the latency of all source-destination pairs (line 11). We note that different WRR arbiters in the network can be assigned different arbitration weights.
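The control flow of this procedure can be outlined with the following Python sketch. It only mirrors the accumulation structure described above and is not the released implementation: the per-stage models are passed in as callables (`waiting_time`, `departure_scv`, and `free_delay` are hypothetical placeholders standing in for Equation 17, Equation 18, and the hop-based delay), and all names are assumptions made for illustration.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Flow = Tuple[int, int]  # (source, destination)


def end_to_end_latency(
    paths: Dict[Flow, List[Hashable]],   # arbitration stages on each flow's path
    rates: Dict[Flow, float],            # injection rate of each flow
    arrival_scv: Dict[Flow, float],      # sq. coeff. of variation at the source
    waiting_time: Callable[[Hashable, Flow, float], float],   # placeholder model
    departure_scv: Callable[[Hashable, Flow, float], float],  # placeholder model
    free_delay: Callable[[Flow], float],                      # placeholder model
) -> Tuple[Dict[Flow, float], float]:
    """Return per-flow latencies and their rate-weighted average."""
    latency: Dict[Flow, float] = {}
    for flow, stages in paths.items():
        scv = arrival_scv[flow]
        total_wait = 0.0
        for stage in stages:
            # Waiting time at this arbiter, given the current arrival variation.
            total_wait += waiting_time(stage, flow, scv)
            # The departure process of this stage feeds the next stage.
            scv = departure_scv(stage, flow, scv)
        # Add the contention-free delay of links and router pipelines.
        latency[flow] = total_wait + free_delay(flow)

    total_rate = sum(rates[flow] for flow in paths)
    average = sum(rates[flow] * latency[flow] for flow in paths) / total_rate
    return latency, average


if __name__ == "__main__":
    demo_paths = {(0, 1): ["r0", "r1"], (0, 2): ["r0"]}
    lat, avg = end_to_end_latency(
        demo_paths,
        rates={(0, 1): 1.0, (0, 2): 3.0},
        arrival_scv={(0, 1): 1.0, (0, 2): 1.0},
        waiting_time=lambda stage, flow, scv: 2.0 * scv,   # toy stand-in
        departure_scv=lambda stage, flow, scv: scv / 2.0,  # toy stand-in
        free_delay=lambda flow: float(len(demo_paths[flow])),
    )
    print(lat, avg)
```

Passing the stage models as callables keeps the traversal independent of the queuing formulas, so the same loop could be reused with either the RR or the WRR expressions.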
V Experimental Results
V-A Experimental Setup
This section presents a thorough evaluation of the proposed analytical modeling approach under various traffic scenarios and NoC configurations. We employ geometric and bursty traffic distributions in addition to real application traces feeding 1×8, 6×6, and 8×8 NoCs (note that the typical size of state-of-the-art industrial NoCs is 6×6 [11, 2]). The synthetic traffic simulations use 100% last-level cache (LLC) hit and 100% LLC miss scenarios, which are commonly used corner cases. Furthermore, applications from the SPEC CPU 2017 [34] and PARSEC [35] benchmark suites demonstrate the effectiveness of the proposed technique in a broader range of scenarios. The latency results of our analytical models are compared against an in-house cycle-accurate simulator, which is calibrated with an industrial simulator [36]. All simulations run for 200K cycles to reach a steady state and have 20K cycles of warm-up. The source code of the simulator and the analytical model are publicly released at [37].
V-B Results with Basic Round-Robin Arbitration
We first evaluate the proposed analytical model under basic round-robin arbitration. Figure 4 depicts representative average end-to-end latency comparisons between the analytical model and cycle-accurate simulations for an 8×1 ring and an 8×8 mesh NoC. The average injection rates for both NoC configurations, following a geometric distribution, are varied until the NoC becomes highly congested. On average, the proposed technique incurs only 5% error for the 8×1 ring and 7% error for the 8×8 mesh. We also compare the proposed analytical model with a polling-based model for round-robin arbitration [28]. The polling-based model grossly overestimates the latency for both the 8×1 and 8×8 NoCs. A comprehensive summary for other NoC sizes and traffic patterns is presented in Section V-E. These results demonstrate that the proposed approach is accurate for different NoC, traffic, and WRR parameters.
V-C Results with Weighted Round-Robin Arbitration
The most important aspect of the proposed technique is its support for WRR arbitration. Figure 5 shows the average end-to-end latency comparison of our analytical model against cycle-accurate simulations with an 8×8 mesh for three different weight configurations. The proposed analytical model incurs on average 8% error when the WRR weights associated with the traffic channels of external input to the NoC and the channels connected to internal routers are 1 and 2, respectively. The average error slightly increases to 9% when the arbitration weights of the packets in the NoC increase to 3. This configuration decreases the network congestion by providing higher priority to the packets already in the NoC. Hence, the average waiting time decreases. Since there are no other analysis approaches for WRR, this section does not provide comparisons to the state of the art.
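Topo. |  |  |  |  |  |  |  | 
8×1 | 1.4% | 11% | 3.6% | 7.8% | 1.5% | 13% | 9% | 11%
6×6 | 5.5% | 7.2% | 7.4% | 11% | 5.2% | 11% | 5.9% | 12%
8×8 | 2.6% | 7.8% | 5.2% | 10% | 3.5% | 11% | 4.8% | 7.2%

Topo. |  |  |  |  |  |  |  | 
8×1 | 4.3% | 4.6% | 5.3% | 5.7% | 4.4% | 4.6% | 4.5% | 4.8%
6×6 | 5.0% | 5.4% | 5.5% | 6.0% | 5.0% | 5.4% | 5.0% | 5.8%
8×8 | 7.1% | 8.0% | 7.4% | 7.6% | 7.4% | 13% | 8.1% | 9.0%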
V-D Results with Bursty Traffic and Real Applications
This section evaluates the proposed model in the presence of bursty traffic. We define the probability of burstiness as $p_b$ of the generalized geometric distribution [30]. Increasing $p_b$ denotes increasing burstiness in the traffic. We note that no prior work has considered weighted round-robin arbitration together with bursty traffic. Figure 6 compares the end-to-end average latency results against cycle-accurate simulations for 6×6 and 8×8 meshes when the arbitration weights associated with the packets in the NoC and external packet arrivals are 3 and 1, respectively. At the burst probability used in Figure 6(a), the average modeling errors for the 6×6 and 8×8 meshes are 7% and 4%, respectively. When the burst probability increases (Figure 6(b)), the average errors become 6% and 5%, respectively.
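For readers who want to reproduce such traffic, the sketch below samples inter-arrival times from a generalized geometric process. It is an illustration under a stated assumption: it uses a common GGeo parameterization in which the inter-arrival time is zero (the packet joins the current burst) with the burst probability and is geometric otherwise, with the geometric parameter chosen so that the mean injection rate is preserved. The exact parameterization used by the simulator may differ, and the function name is hypothetical.

```python
import math
import random


def ggeo_interarrival(rate: float, p_burst: float, rng: random.Random) -> int:
    """Sample one inter-arrival time (in cycles) of a GGeo arrival process.

    rate    : mean injection rate in packets per cycle
    p_burst : probability that the next packet arrives in the same cycle
    Assumed parameterization: P(X = 0) = p_burst, otherwise X is geometric
    on {1, 2, ...} with parameter sigma = rate * (1 - p_burst), so that
    the mean inter-arrival time stays 1 / rate.
    """
    if not 0.0 <= p_burst < 1.0:
        raise ValueError("p_burst must be in [0, 1)")
    sigma = rate * (1.0 - p_burst)
    if not 0.0 < sigma <= 1.0:
        raise ValueError("rate * (1 - p_burst) must be in (0, 1]")

    if rng.random() < p_burst:
        return 0  # packet belongs to the current burst
    if sigma == 1.0:
        return 1  # degenerate geometric: always the next cycle
    # Inverse-transform sampling of a geometric variable on {1, 2, ...}.
    u = rng.random()
    return int(math.log(1.0 - u) / math.log(1.0 - sigma)) + 1


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [ggeo_interarrival(0.2, 0.3, rng) for _ in range(100_000)]
    print(sum(samples) / len(samples))  # close to 1 / 0.2 = 5 cycles
```

Setting the burst probability to zero reduces the process to plain geometric arrivals, which corresponds to the non-bursty case evaluated earlier.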
Finally, we evaluate the proposed analysis technique while running applications from the SPEC CPU2017 [34] and PARSEC [35] benchmark suites. The applications from the SPEC CPU2017 benchmarks exhibit bursty traffic. The applications from the PARSEC benchmarks are 16-threaded and do not show any burstiness. Figure 7 compares the analysis results against cycle-accurate simulations on an 8×8 NoC. The WRR weights of packets already in the NoC are set to 3, while new packets from cores have a lower priority, set with a weight of 1. We observe that our proposed analysis technique consistently achieves a modeling error of less than 5% for these applications. These results demonstrate that the proposed analysis technique can accurately model the end-to-end NoC latency under WRR arbitration and bursty traffic.
V-E Summary of the Evaluation Results
This section summarizes the accuracy of our performance analysis technique systematically for different NoC sizes, WRR weights, and traffic burstiness. To capture the most commonly observed corner cases, we provide results for two extreme cases – 100% LLC hit and 100% LLC miss traffic. In both cases, cores send packets to each LLC at an equal rate.
Table III summarizes the comparisons between analysis and simulations for 100% LLC hit. At the lower burstiness level and a moderate traffic load, the modeling error ranges from 1.4% to 5.5% considering both RR and WRR arbitration. Even when the traffic load increases and congests the NoC, the modeling error remains below 11% for all configurations. A higher level of burstiness increases the NoC load. Consequently, the error range increases to 3.6%–9.0% for the moderate traffic load. Increasing the load further congests the NoC, pushing the worst-case error only to 13%, which is acceptable since practical systems rarely operate at this load due to congestion control mechanisms.
Finally, Table IV summarizes the modeling error of the proposed technique for 100% LLC miss. In this case, the missed core requests are forwarded to memory controllers (one in the 8×1 NoC and two in the 6×6 and 8×8 NoCs). Hence, the memory controllers become hotspots. Despite the high degree of skewness, the error for moderate NoC loads ranges from 4.3% to 8.1% for all burstiness levels, WRR weight configurations, and NoC sizes. Even when the NoC is pushed to congestion, the modeling error does not exceed 9.0% for all configurations, except for the 8×8 NoC with WRR weights 3 and 1. The error in this specific case is 13%; the errors for the 8×1 and 6×6 NoCs are 4.6% and 5.4%, respectively.
V-F Execution Time of the Proposed Analysis Technique
We implemented the proposed technique in C++ to obtain the end-to-end latency. The computational complexity of the proposed model is , for NoC () and WRR arbiters with maximum ports. The execution time as a function of NoC size, with 3 output ports per router, is summarized in Table V. We observe that even for NoC, the analytical model takes only 45.07 ms to execute, while the simulation time of the NoC is on the order of minutes. These results show that the proposed analysis technique is lightweight and provides four orders of magnitude speedup compared to cycle-accurate simulations.
NoC size  

Exe. time (ms)  0.56  1.41  2.25  13.52  45.07 
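As a quick sanity check of the speedup claim, the Python sketch below converts a simulation time and a model execution time into orders of magnitude. The execution times are the Table V values; the 8-minute simulation time is an assumed stand-in for "on the order of minutes" and is not a figure reported in the paper.

```python
import math

# Execution times reported in Table V (ms), in the order listed.
exe_time_ms = [0.56, 1.41, 2.25, 13.52, 45.07]


def speedup_orders(sim_time_s, model_time_ms):
    """Orders of magnitude between one simulation run and one model run."""
    return math.log10(sim_time_s * 1000.0 / model_time_ms)


# ASSUMPTION: a hypothetical 8-minute cycle-accurate simulation.
orders = speedup_orders(8 * 60, exe_time_ms[-1])
print(f"about {orders:.1f} orders of magnitude")
```

With a simulation time of a few minutes, the ratio to the 45.07 ms model run lands near four orders of magnitude, in line with the claim above.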
VI Conclusion and Future Work
This paper presented a scalable performance analysis technique for NoCs with WRR arbitration. WRR arbitration is promising since it enables latency fairness and custom bandwidth allocation to meet the requirements of individual traffic flows, unlike basic round-robin and priority-based arbitration. The proposed technique handles bursty traffic and provides high accuracy for both ring topologies used in client systems and large 2D mesh topologies used in servers. Extensive evaluations show that the proposed approach achieves less than 5% error while executing real applications and 10% error under challenging synthetic traffic with different burstiness levels. Hence, it can be used for fast and accurate design space exploration as well as pre-silicon evaluations. One of our future directions is finding the WRR weights that meet the bandwidth and latency requirements of individual traffic flows.
References
 [1] Mohamed Arafa et al. Cascade Lake: Next Generation Intel Xeon Scalable Processor. IEEE Micro, 39(2):29–36, 2019.
 [2] Jack Doweck et al. Inside 6th-generation Intel Core: New Microarchitecture Code-named Skylake. IEEE Micro, 37(2):52–62, 2017.
 [3] Hanbo Sun et al. Reliability-Aware Training and Performance Modeling for Processing-In-Memory Systems. In 26th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 847–852. IEEE, 2021.
 [4] Sumit K Mandal et al. A Latency-Optimized Reconfigurable NoC for In-Memory Acceleration of DNNs. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 10(3):362–375, 2020.

 [5] Seyed Morteza Nabavinejad et al. An Overview of Efficient Interconnection Networks for Deep Neural Network Accelerators. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 10(3):268–282, 2020.
 [6] Ming-Chao Chiang, Tse-Chen Yeh, and Guo-Fu Tseng. A QEMU and SystemC-based Cycle-accurate ISS for Performance Estimation on SoC Development. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(4):593–606, 2011.
 [7] Thomas Moscibroda and Onur Mutlu. A Case for Bufferless Routing in On-chip Networks. In Proc. of Intl. Symp. on Computer Architecture, pages 196–207, 2009.
 [8] Bhavya K Daya, Li-Shiuan Peh, and Anantha P Chandrakasan. Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling. In Proc. of Design Automation Conf., pages 1–6, 2016.
 [9] Radu Marculescu et al. Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Syst., 28(1):3–21, 2008.
 [10] William James Dally and Brian Patrick Towles. Principles and Practices of Interconnection Networks. Elsevier, 2004.
 [11] Avinash Sodani et al. Knights Landing: Second-generation Intel Xeon Phi Product. IEEE Micro, 36(2):34–46, 2016.
 [12] Umit Y Ogras and Radu Marculescu. Modeling, Analysis and Optimization of Network-on-Chip Communication Architectures, volume 184. Springer Science & Business Media, 2013.
 [13] Andrea Pellegrini et al. The Arm Neoverse N1 Platform: Building Blocks for the Next-gen Cloud-to-Edge Infrastructure SoC. IEEE Micro, 40(2):53–62, 2020.
 [14] Eung S Shin, Vincent J Mooney III, and George F Riley. Round-robin Arbiter Design and Generation. In Proc. of Intl. Symp. on System Synthesis, pages 243–248, 2002.
 [15] Yun-Lung Lee, Jer Min Jou, and Yen-Yu Chen. A High-speed and Decentralized Arbiter Design for NoC. In Proc. of Intl. Conf. on Comp. Systs. and Applications, pages 350–353, 2009.
 [16] Gao Xiaopeng, Zhang Zhe, and Long Xiang. Round Robin Arbiters for Virtual Channel Router. In Proc. of Multiconference on Computational Engineering in Systems Applications, volume 2, pages 1610–1614, 2006.
 [17] Yue Qian, Zhonghai Lu, and Qiang Dou. QoS Scheduling for NoCs: Strict Priority Queueing versus Weighted Round Robin. In Proc. of Intl. Conf. on Computer Design, pages 52–59, 2010.
 [18] Jan Heißwolf, Ralf König, and Jürgen Becker. A Scalable NoC Router Design Providing QoS Support using Weighted Round Robin Scheduling. In Proc. of Intl. Symp. on Parallel and Distributed Processing with Applications, pages 625–632, 2012.
 [19] Cesar Albenes Zeferino and Altamiro Amadeu Susin. SoCIN: A Parametric and Scalable Network-on-Chip. In Symp. on Integrated Circuits and Systems Design, 2003. SBCCI 2003. Proceedings., pages 169–174, 2003.
 [20] Lan Wang, Geyong Min, Demetres D Kouvatsos, and Xiaolong Jin. Analytical Modeling of an Integrated Priority and WFQ Scheduling Scheme in Multiservice Networks. Computer Communications, 33:S93–S101, 2010.
 [21] Zhi-Liang Qian et al. A Support Vector Regression (SVR)-based Latency Model for Network-on-Chip (NoC) Architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 35(3):471–484, 2015.
 [22] Abbas Eslami Kiasari, Zhonghai Lu, and Axel Jantsch. An Analytical Latency Model for Networks-on-Chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21(1):113–123, 2012.
 [23] Sumit K Mandal et al. Analytical Performance Modeling of NoCs under Priority Arbitration and Bursty Traffic. IEEE Embedded Systems Letters, 2020.
 [24] Sumit K Mandal et al. Performance Analysis of Priority-aware NoCs with Deflection Routing under Traffic Congestion. In Proc. of Intl. Conf. on Computer-Aided Design, pages 1–9, 2020.
 [25] Umit Y Ogras, Paul Bogdan, and Radu Marculescu. An Analytical Approach for Network-on-Chip Performance Analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 29(12):2001–2013, 2010.
 [26] Erik Fischer and Gerhard P Fettweis. An Accurate and Scalable Analytic Model for Round-robin Arbitration in Network-on-Chip. In Proc. of Intl. Symp. on Networks-on-Chip (NoCS), pages 1–8, 2013.
 [27] Onno J. Boxma and Bernd W. Meister. Waiting-time Approximations in Multi-queue Systems with Cyclic Service. Performance Evaluation, 7(1):59–70, 1987.
 [28] Wim P Groenendijk and Hanoch Levy. Performance Analysis of Transaction Driven Computer Systems via Queueing Analysis of Polling Models. IEEE Transactions on Computers, 41(4):455–466, 1992.
 [29] Jasper Vanlerberghe. Analysis and Optimization of Discrete-time Generalized Processor Sharing Queues. PhD thesis, Ghent University, 2018.
 [30] Demetres D Kouvatsos. Entropy Maximisation and Queueing Network Models. Annals of Operations Research, 48(1):63–126, 1994.
 [31] Dimitri P Bertsekas, Robert G Gallager, and Pierre Humblet. Data Networks, volume 2. Prentice-Hall International New Jersey, 1992.
 [32] Guy Pujolle and Wu Ai. A Solution for Multiserver and Multiclass Open Queueing Networks. INFOR: Information Systems and Operational Research, 24(3):221–230, 1986.
 [33] Umit Y Ogras and Radu Marculescu. “It’s a Small World After All”: NoC Performance Optimization via Long-Range Link Insertion. IEEE Trans. on Very Large Scale Integration (VLSI) Systs., 14(7):693–706, 2006.
 [34] James Bucek, Klaus-Dieter Lange, and Jóakim v. Kistowski. SPEC CPU2017: Next-Generation Compute Benchmark. In Companion of the 2018 ACM/SPEC Intl. Conf. on Performance Engineering, pages 41–42, 2018.
 [35] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proc. of Intl. Conf. on Parallel Architectures and Compilation Techniques, pages 72–81, 2008.
 [36] Umit Y Ogras, Yunus Emre, Jianping Xu, Timothy Kam, and Michael Kishinevsky. Energy-guided Exploration of On-chip Network Design for Exascale Computing. In Proc. of Intl. Workshop on System Level Interconnect Prediction, pages 24–31, 2012.

 [37] Sumit K. Mandal et al. Theoretical Analysis and Evaluation of NoCs with Weighted Round-Robin Arbitration: An Open Source Release. https://github.com/sumitkmandal/WRR_NoC_analytical_model, 2021. Accessed 12 August 2021.