1. Introduction
Modern design methodologies in industry involve thorough power, performance, and area evaluations before architectural decisions are frozen. These pre-silicon evaluations are vital for detecting functional bugs and power-performance violations, since post-silicon fixes are costly, if feasible at all. Therefore, a significant amount of resources is dedicated to pre-silicon evaluation using virtual platforms (leupers2011virtual) or full-system simulators (Binkert2011Gem5). NoC simulations play a critical role in these evaluations, as NoCs have become the standard interconnect solution in many-core chip multiprocessors (CMPs) (jeffers2016intel; keltcher2003amd; kahle2005introduction), client CPUs (rotem2015intel), and mobile systems-on-chip (singh2014evolution).
Moreover, there is growing interest in using NoCs in hardware implementations of deep neural networks (choi2017chip). Since the on-chip interconnect is a critical component of multicore architectures, pre-silicon evaluation platforms contain cycle-accurate NoC simulators (agarwal2009garnet; jiang2013detailed). NoC simulations take up a significant portion of the total simulation time, which already limits the scope of pre-silicon evaluation (e.g., simulating even a few seconds of an application can take days). For example, Figure 1 shows that 40%–70% of the total simulation time is spent on the network itself when performing full-system simulation using gem5 (Binkert2011Gem5). Hence, accelerating NoC simulations without sacrificing accuracy can significantly improve both the quality and scope of pre-silicon evaluations.
Several performance analysis approaches have been proposed to enable faster NoC design space exploration (ogras2010analytical; wu2010analytical; qian2015support). Prior techniques have assumed a round-robin arbitration policy in the routers, since the majority of router architectures proposed to date have used round-robin for fairness. In doing so, they miss two critical aspects of industrial priority-based NoCs (jeffers2016intel; rico2017arm; kahle2005introduction). First, routers give priority to the flits already in the network to achieve predictable latency within the interconnect. For example, suppose that the class-1 flits destined to the neighboring router and the class-2 flits destined to the local node in Figure 2 are already in the NoC, while flits of class-3 destined to the neighboring router must wait in the input buffer to be admitted. Consequently, flits in the NoC (class-1 and class-2) experience deterministic service time at the expense of increased waiting time for new flits. Second, flits from different priority classes can be stored in the same queue. For instance, new read/write requests from the core to the tag directories use the same physical and virtual channels as the requests forwarded from the directories to the memory controllers. Moreover, only a fraction of the flits in the high- or low-priority queue compete with the flits in the other queue. For example, suppose the class-2 flits in Figure 2 are delivered to the local node. Then, class-3 flits must compete only with the class-1 flits in the high-priority queue. Analytical models that ignore this traffic split significantly overestimate the latency, as shown in Section 4.1. In contrast, analytical models that ignore priority significantly underestimate the latency.
Thus, prior approaches that do not model priority (ogras2010analytical; wu2010analytical; qian2015support) and simple performance models for priority queues (kiasari2013analytical; jin2009modelling) are inapplicable to priority-based industrial NoCs.
This paper presents a novel NoC performance analysis technique that considers traffic classes with different priorities. The problem is theoretically challenging due to the nontrivial interactions between classes and shared resources. For example, queues can be shared by flits with different priorities, as shown in Figure 2. Similarly, routers may merge different classes coming through separate ports, or act as switches that disjoin flits onto different physical channels. To address these challenges, we propose a two-step approach that consists of an analysis technique followed by an iterative algorithm. The first step establishes that priority-based NoCs can be decomposed into separate queues using two types of traffic splits. Since existing performance analysis techniques cannot model structures with these traffic splits, we develop analytical models for these canonical queuing structures. The second step is a novel iterative algorithm that composes an end-to-end latency model for the queuing network of a given NoC topology and input traffic pattern. The proposed approach is evaluated thoroughly using both the 2D mesh and ring architectures used in industrial NoCs. It achieves 97% accuracy with respect to cycle-accurate simulations for realistic architectures and applications.
The major contributions of this paper are as follows:

- A technique to obtain analytical performance models of priority-based NoCs with multiple traffic classes,
- An algorithm to synthesize the end-to-end latency for each traffic class using the analytical models, and
- Extensive evaluations using an industrial cycle-accurate simulator, real applications, and synthetic traffic.
The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 provides an overview and the background of the proposed work. Section 4 describes the required transformations for two canonical structures and their analytical models. Section 5 describes how these two transformations are used to analyze a complete queuing network. Section 6 presents experimental evaluations, and Section 7 concludes the paper by summarizing the key contributions.
2. Related Work
Performance analysis techniques are useful for exploring the design space (pande2005performance) and speeding up simulations (ogras2010analytical; kiasari2013analytical; wu2010analytical). Indeed, there is continuous interest in applying novel techniques such as machine learning (qian2015support) and network calculus (qian2009analysis) to NoC performance analysis. However, these studies do not consider multiple traffic classes with different priorities. Since state-of-the-art industrial NoC designs (doweck2017inside; jeffers2016intel) use priority-based arbitration with multi-class traffic, it is important to develop performance analysis techniques for this type of architecture.

Kashif et al. have recently presented priority-aware router architectures (kashif2014bounding). However, this work presents analytical models only for the worst-case latency. In practice, analyzing the average latency is important, since using worst-case latency estimates in full-system simulations would lead to inaccurate conclusions. A recent technique proposed an analytical latency model for priority-based NoCs (kiasari2013analytical). This technique, however, assumes that each queue in the network contains a single class of flits.
Several techniques present performance analysis of priority-based queuing networks outside the NoC domain (bertsekas1992data; bolch2006queueing; ikeharaapproximate). Nevertheless, these techniques do not consider multiple traffic classes in the same queue. The work presented in (awan2005analysis) considers multiple traffic classes, but it assumes that high-priority packets preempt lower-priority packets, which is not a valid assumption in the NoC context. A technique that can handle two traffic classes, Empty Buffer Approximation (EBA), has been proposed in (berger2000workload) for a priority-based queuing system. This approach was later extended to multi-class systems (jin2009modelling). However, EBA ignores the residual time caused by low-priority flits on high-priority traffic. Hence, it is impractical to use EBA for priority-aware industrial NoCs.
The aforementioned prior studies assume a continuous-time queuing network model, while the events in synchronous NoCs take place in discrete clock cycles. A discrete-time priority-based queuing system is analyzed in (walraevens2004discrete). This technique forms a Markov chain for a given queuing system, then analyzes this model in the z-domain through probability generating functions (PGFs). PGFs deal with joint probability distributions in which the number of random variables is equal to the number of traffic classes in the queuing system. This approach does not scale to systems with a large number of traffic classes, because the corresponding analysis becomes intractable. For example, an industrial 8×8 NoC would have 64 sources and 64 destinations, resulting in 4096 (64×64) random variables in the PGF. Furthermore, our approach outperforms this technique, as demonstrated in Section 6.4.

In contrast to prior approaches, we propose a scalable and accurate closed-form solution for priority-based queuing networks with multi-class traffic. The proposed technique constructs end-to-end latency models using two canonical structures identified in priority-based NoCs. Unlike prior approaches, our technique scales to any number of traffic classes. To the best of our knowledge, this is the first analytical model for priority-based NoCs that considers both (1) queues shared among multiple priority classes and (2) traffic arbitration dependencies across the queues.
3. Overview and Background
3.1. Proposed Performance Analysis Flow
The primary target of the proposed model is to accelerate virtual platforms (bartolini2010virtual) and full-system simulations (Binkert2011Gem5; patel2011marss; magnusson2002simics) by replacing time-consuming NoC simulations with accurate, lightweight analytical models. At the beginning of the simulation, the proposed technique parses the priority-based NoC topology to construct the analytical models, as shown in Figure 3. The host, such as a virtual platform, maintains a record of the traffic load and destination address for each node. It also periodically (every 10K–100K cycles) sends the traffic injection rates of the requesting nodes, such as cores, to the proposed technique. Then, the proposed technique applies the analytical models (steps 2 and 3 in Figure 3) to compute the end-to-end latency. Whenever there is a new request from an end node, the host system estimates the latency using the proposed model as a function of the source-destination pair. That is, our model replaces the cycle-by-cycle simulation of flits in the NoC.
We note that the requesting nodes, such as the cores, have a parameterized maximum number of outstanding request credits. Hence, the requesters are automatically throttled by these per-node credits, leading to a bounded number of flits in the NoC. Since the traffic injection rates provided by the host already account for this throttling, we do not explicitly model blocking at the interfaces.
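The periodic host-model interaction described above can be sketched as follows. This is a minimal Python illustration: `analytical_latency` is a hypothetical toy stand-in (per-hop service plus an M/M/1-style queuing term), not the paper's actual model, and the pair names are invented for the example.

```python
def analytical_latency(hops, inj_rate, service_time=2):
    """Toy latency model: per-hop service plus an M/M/1-style queuing
    term. A hypothetical stand-in for the paper's model, for
    illustration only."""
    rho = inj_rate * service_time          # per-hop server utilization
    assert rho < 1.0, "model is valid only below saturation"
    wait = rho * service_time / (1.0 - rho)
    return hops * (service_time + wait)

def run_epoch(traffic):
    """One host epoch: the host reports (hop count, injection rate) per
    source-destination pair, and we return an estimated latency for
    each pair instead of simulating flits cycle by cycle."""
    return {pair: analytical_latency(hops, rate)
            for pair, (hops, rate) in traffic.items()}

# Hypothetical traffic snapshot reported by the host for one epoch.
latencies = run_epoch({("core0", "mc0"): (4, 0.1),
                       ("core5", "mc1"): (6, 0.2)})
```

The host would invoke `run_epoch` every 10K–100K cycles with refreshed injection rates, which is what makes the replacement of cycle-accurate NoC simulation cheap.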
3.2. Basic PriorityBased Queuing Models
We assume a discrete-time system in which microarchitectural events, such as writing to a buffer, arbitration, and switch traversal, take an integral number of clock cycles. Therefore, we develop queuing models based on an arrival process that follows a geometric distribution, in contrast to continuous-time models that are based on the Poisson (M for Markovian) arrival assumption. More specifically, we adopt the Geo/G/1 model, in which the inter-arrival time of the incoming flits follows a geometric distribution (denoted by Geo), the service time of the queue follows a general discrete-time distribution (denoted by G), and the queue has one server (the '1' in the Geo/G/1 notation). The proposed technique estimates the end-to-end latency of realistic applications accurately, as we demonstrate in Section 6.5. However, the accuracy is expected to drop if the NoC operates close to its maximum load, since the geometric (similar to Poisson) packet inter-arrival time assumption becomes invalid (ogras2010analytical).

Performance analysis techniques in the literature (bertsekas1992data; jin2009modelling; kiasari2013analytical) discuss basic priority-based networks in which each priority class has a dedicated queue, as illustrated in Figure 4(a). In this architecture, the flits in Q1 have higher priority than the flits in Q2. That is, flits in Q2 are served only when Q1 is empty and the server is ready to serve new flits. Another example with N priority classes is shown in Figure 4(b). The flits in Qi have higher priority than the flits in Qj if i < j. The average waiting time of each priority class is known for continuous-time M/G/1 queues (bertsekas1992data). In the M/G/1 queuing system, flits arrive at the queue following a Poisson distribution (M), and the service time of the queue follows a general distribution (G). In this work, we first derive waiting time expressions for discrete-time Geo/G/1 queues. Then, we employ these models to derive end-to-end NoC latency models.
The average waiting time of flits in a queue can be divided into two parts: (1) the waiting time due to the flits already buffered in the queue, and (2) the waiting time due to a flit that is in the middle of its service, i.e., the residual time. The following lemma expresses the waiting time as a function of the input traffic and NoC parameters.
Lemma 1: Consider a queuing network with N priority classes, as shown in Figure 4(b). Suppose that we are given the injection rates λi, service rates μi, residual times Ri, and server utilizations ρi for i = 1, …, N (see Table 1). Then, the waiting time of class-i flits is given as:
(1) 
Proof: Equation 1 is derived in Appendix A to avoid distorting the flow of the paper.
Shortcomings of the Basic Priority-Based Queuing Models: Although Equation 1 extends the known results from continuous-time to discrete-time queuing systems, it cannot handle a network of queues in which a queue can store more than one priority class. For example, it does not handle the scenario in which both class-1 and class-2 flits share the same queue in Figure 2. To this end, we present a novel technique that handles multiple priority classes in one queue.
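The basic priority-based system of Figure 4(a) is easy to reproduce with a slot-based simulation, which also shows the qualitative behavior Lemma 1 captures: the low-priority class waits longer. The sketch below assumes Bernoulli per-slot arrivals (i.e., geometric inter-arrival times) and a deterministic service time; the parameter values are illustrative.

```python
import random

def simulate_priority_queue(p1, p2, service_time, num_slots, seed=0):
    """Slot-based simulation of two priority classes with dedicated
    queues sharing one server. Each slot, a class-i flit arrives with
    probability p_i. The server always picks a waiting class-1 flit
    first; each service occupies `service_time` slots."""
    rng = random.Random(seed)
    q1, q2 = [], []            # arrival slots of waiting flits
    busy_until = 0             # slot at which the server frees up
    waits = {1: [], 2: []}
    for t in range(num_slots):
        if rng.random() < p1:  # geometric inter-arrivals, class 1
            q1.append(t)
        if rng.random() < p2:  # geometric inter-arrivals, class 2
            q2.append(t)
        if t >= busy_until:    # server idle: serve by strict priority
            if q1:
                waits[1].append(t - q1.pop(0))
                busy_until = t + service_time
            elif q2:
                waits[2].append(t - q2.pop(0))
                busy_until = t + service_time
    return {c: sum(w) / len(w) for c, w in waits.items() if w}

avg_wait = simulate_priority_queue(0.15, 0.15, 2, 200_000)
```

With equal injection rates the high-priority class sees a strictly smaller average wait, matching the intuition behind Equation 1.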
Table 1. Summary of the notation.

λi : Injection rate of class-i flits
Ti : Service time of class-i flits
μi : Service rate of class-i flits (μi = 1/Ti)
Ri : Residual time of class-i flits
ρi : Server utilization of class-i flits (ρi = λi Ti)
CiA : Coefficient of variation of the inter-arrival time of class-i flits
CiS : Coefficient of variation of the service time of class-i flits
CiD : Coefficient of variation of the inter-departure time of class-i flits
Wi : Waiting time of class-i flits
4. Canonical Analytical Models
This section describes two canonical queuing structures observed in priority-based NoCs. We first describe these structures and explain why prior analysis techniques fail on them. Then, we present two novel transformations and the corresponding accurate analysis techniques.
4.1. Transformation 1: Split at the High-Priority Queue
Conceptual Illustration: Consider the structure shown in Figure 5(a). As illustrated in Section 1, flits from traffic classes 1 and 2 are already in the network, while flits from traffic class-3 are waiting in Q to be admitted. Since routers in industrial NoCs give priority to the flits already in the network, class-1 flits have higher priority than the flits waiting in Q. To facilitate the description of the proposed models, we represent this system by the structure shown in Figure 5(b), where μi denotes the service rate of class-i for i = 1, 2, 3. If we use Equation 1 to obtain an analytical model for the waiting time of traffic class-3, the resulting waiting time is highly pessimistic, as shown in Figure 6. The basic priority-based queuing model overestimates the latency since it assumes that each class in the network occupies a separate queue; hence, all class-1 and class-2 flits would have higher priority than the class-3 flits.
Proposed Transformation: The basic priority equations cannot be applied to this system, since the distribution of class-1 flits seen by class-3 flits changes depending on the presence of class-2 traffic. To address this challenge, we propose a novel structural transformation from Figure 5(b) to Figure 5(c). Comparing the structures before and after the transformation reveals the following:


- The top portion (Q with its server) is identical to the original structure, since its arrival and service statistics remain the same due to the higher priority of class-1 over class-3.

- The bottom portion forms a basic priority queue structure, as highlighted by the red dotted box.
The basic priority queue structure is useful, since we have already derived its waiting time model in Equation 1. However, the arrival process at Q must be derived to apply this equation and to ensure the equivalence of the structures before and after the transformation.
We derive the second-order moment of the inter-departure time of class-1 using the decomposition technique presented in (bolch2006queueing). The inter-departure distributions are functions of the inter-arrival distributions of all traffic classes flowing through the same queue and of the service rates of the classes, as illustrated in Figure 7. This technique first calculates the effective coefficient of variation at the input as the weighted sum of the coefficients of variation of the individual classes (Figure 7, Phase 1). Then, it finds the effective coefficient of variation of the inter-departure time using the input coefficient of variation and the coefficient of variation of the service time (Phase 2). In the final phase, the coefficient of variation of the inter-departure time of each individual class is found (Figure 7, Phase 3). By matching the first two moments of the inter-arrival statistics of Q, we ensure that the transformed structure in Figure 5(c) approximates the original system. This decomposition enables us to find the residual time of class-1 as:

(2)
Proposed Analytical Model: The bottom part of the transformed system in Figure 5(c) is the basic priority queue (marked with the dotted red box). Therefore, the higher-priority part of Equation 1 can be used to express the waiting time of class-1 flits as:
(3) 
where the residual time of class-1 flits is found using Equation 2. Subsequently, this result is substituted into the lower-priority portion of Equation 1 to find the waiting time of class-3 flits:
(4) 
We also note that the waiting time of class-2 flits is not affected by this transformation. Hence, we can express it using Equation 1 for the degenerate case of a single traffic class.
Figure 6 shows that the waiting time calculated by the proposed analytical model for traffic class-3 flits is quite accurate with respect to the waiting time obtained from simulation. The average error in the waiting time of traffic class-3 is 2% for the system shown in Figure 5(a) with a deterministic service time of two cycles.
4.2. Transformation 2: Split at the Low-Priority Queue
Conceptual Illustration: Consider the queuing system shown in Figure 8(a). In this system, class-1 flits are waiting in one queue, while class-2 and class-3 flits are waiting in the other. Class-1 and class-3 flits share the same channel and compete for the same output, while class-2 flits are sent to a separate output. Class-1 flits always win the arbitration, since they have higher priority. Similar to the previous transformation, the queuing model in Figure 8(b) is used as an intermediate representation to facilitate the discussion, with the queues of Figure 8(a) relabeled accordingly.
If we ignore the impact of class-1 traffic while modeling the waiting time of class-3, the resulting analytical models are highly optimistic, as shown in Figure 9. Accounting for the impact of class-1 traffic on the low-priority queue is challenging, since only a fraction of its flits (those of class-3) compete with class-1. In other words, the class-2 flits, which go to the local node, are not directly blocked by class-1 flits. Hence, there is a need for a new transformation that can address the split at the low-priority queue.
Proposed Transformation: The high-priority flow (class-1) is not affected by class-2 traffic, since they do not share the same server. Therefore, the waiting time of class-1 flits can be readily obtained using Equation 1 as:
(5) 
Hence, we represent the class-1 queue as a standalone queue, as shown in Figure 8(c). However, the opposite is not true; class-1 flits affect both class-2 (indirectly) and class-3 (directly). Therefore, we represent class-2 and class-3 using a new queue with modified service rate statistics. To ensure that Figure 8(c) closely approximates the original system, we characterize the effect on the service rate of class-3 using a novel analytical model.
Proposed Analytical Model: Both the service time and the residual time of class-3 change due to the interaction with class-1. To quantify these changes, we first set the class-2 injection rate to zero (λ2 = 0) such that the effect of class-2 is isolated. In this case, the waiting time of class-3 flits can be found using Equation 1 as:
(6) 
We can also find this waiting time by using the modified service time and residual time of class-3. The probability that a class-3 flit cannot be served due to class-1 is equal to the server utilization of class-1. Moreover, there is extra utilization due to the residual effect of class-3 on class-1. Hence, the probability that a class-3 flit is delayed due to class-1 flits is:
(7) 
Each time a class-3 flit is blocked by class-1 flits, the extra delay equals the class-1 service time. Since each flit can be blocked multiple consecutive times, the additional busy period experienced while serving class-3 is expressed as:
(8) 
Consequently, the modified service time and utilization of class-3 can be expressed as:
(9) 
Suppose that the modified residual time of class-3 is denoted by R3′. We can plug the modified service time and utilization from Equation 9 and the additional busy period from Equation 8 into the Geo/G/1 model to express the waiting time as:
(10) 
When λ2 is set to zero, this expression should give the class-3 waiting time found in Equation 6. Hence, we can find the following expression for the modified residual time by combining Equation 6 and Equation 10:
(11) 
Since the modified service and residual times are now computed, we can apply the Geo/G/1 queuing model once more to find the waiting times of class-2 and class-3 flits as:
(12) 
Figure 9 shows that the class-3 waiting time calculated using the proposed analytical modeling technique is very close to the simulation results. The modeling error is within 4% with a deterministic service time of 2 cycles.
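The core of the service rate transformation, the inflation of the class-3 service time by consecutive class-1 blockings, can be sketched numerically. This sketch assumes the blocking probability is simply the class-1 utilization (the residual-time correction of Equation 7 is omitted for brevity), and sums the geometric series of consecutive blockings; the function name is ours, not the paper's.

```python
def modified_service_time(t3, rho1, t1):
    """Sketch of the service rate transformation: a class-3 flit is
    blocked by class-1 with probability ~rho1 (residual-time correction
    omitted). Consecutive blockings form a geometric series, so the
    expected additional busy period is rho1 * t1 / (1 - rho1)."""
    assert 0.0 <= rho1 < 1.0, "class-1 utilization must be below 1"
    extra = rho1 * t1 / (1.0 - rho1)   # sum of the geometric series
    return t3 + extra
```

With no class-1 traffic the service time is unchanged, and it grows without bound as class-1 approaches full utilization, which matches the intuition behind Equations 8 and 9.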
5. Generalization for Arbitrary Number of Queues
In this section, we show how the proposed transformations are used to generate analytical models for priority-based NoCs with arbitrary topologies and input traffic. Algorithm 1 describes the model generation technique, which is part of the proposed methodology to be used in a virtual platform. The algorithm takes the injection rates of all traffic classes, the NoC topology, and the routing of individual traffic classes as inputs. Then, it iteratively uses the transformations described in Section 4.1 and Section 4.2 to construct analytical performance models for each traffic class.
First, Algorithm 1 extracts all traffic classes originating from a particular queue, as shown in line 6. Next, the waiting time of each of these classes is computed separately, as each has a different dependency on the other classes due to priority arbitration. At line 8, all classes that have higher priority than the current class are obtained. In lines 11–16, the structural transformation described in Section 4.1 is applied. To this end, the coefficient of variation of the inter-departure time of each higher-priority class is computed. Through the structural transformation, a reference waiting time for the current class is obtained, as depicted in line 17 of the algorithm. At line 18, we compute the modified service time of the current class following the method described in Section 4.2. From these quantities, the residual time is computed (line 19). Using the residual time expressions of all classes in a queue, we obtain the waiting time expression of each class separately, as shown in line 23 of the algorithm.
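The per-class iteration at the heart of Algorithm 1 can be illustrated for a single queue. The snippet below uses the classic continuous-time M/G/1 priority formula purely as a stand-in for the paper's discrete-time Equation 1 (the real algorithm additionally applies the two transformations across queues), and the tuple interface is an assumption made for illustration.

```python
def priority_waits(classes):
    """Per-class mean waiting times for one queue served in strict
    priority order (index 0 = highest priority). Uses the classic
    continuous-time M/G/1 priority formula as a stand-in for the
    paper's discrete-time model. Each class is a tuple
    (lam, t_service, t2_second_moment)."""
    residual = 0.5 * sum(lam * t2 for lam, _, t2 in classes)
    waits, sigma = [], 0.0             # sigma = cumulative utilization
    for lam, t, _ in classes:
        sigma_next = sigma + lam * t   # utilization up to this class
        waits.append(residual / ((1.0 - sigma) * (1.0 - sigma_next)))
        sigma = sigma_next
    return waits

# Three identical classes (rate 0.1, deterministic 2-cycle service).
w = priority_waits([(0.1, 2.0, 4.0)] * 3)
```

Each class sees the cumulative utilization of all higher-priority classes, so the computed waits grow monotonically as priority decreases, mirroring the dependency structure that lines 8–23 of Algorithm 1 resolve.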
Figure 10 illustrates how the proposed approach decomposes a representative example of a priority-based queuing network. Figure 10(a) shows the original network, which consists of three queues. The first queue stores flits of the class-1 and class-2 flows, while the second buffers class-3 and class-4 flits. Flits of class-2 have higher priority than both class-3 and class-4, as denoted by the first port of the switch that connects these flows. Finally, class-5 flits are stored in the third queue. We note that class-5 flits have lower priority than class-3 flits, while they are independent of class-2 and class-4 flits. To solve this queuing system, we first apply the structural transformation to class-1 and class-2 by bypassing the class-2 flits, as shown in Figure 10(b). Next, the service rate transformation is applied to class-3 and class-4 to obtain a modified service time. This transformation allows us to decompose the first two queues, as depicted in Figure 10(c). After that, the structural transformation is applied to class-3, since class-3 flits have higher priority than class-5 flits. Finally, the service rate transformation is performed on class-5 to obtain a fully decomposed system, shown in Figure 10(e).
Automation of the Model Generation Technique: We developed a framework that automatically generates the analytical performance models for NoCs with arbitrary-size 2D mesh and ring topologies. The proposed framework operates in two steps. In the first step, it extracts all architecture-related information of the NoC, including the traffic classes in each queue and the priority relations between classes. In the second step, it uses this architecture information to generate the analytical models.
6. Experimental Evaluations
6.1. Experimental Setup
We applied the proposed analytical models to a widely used priority-based industrial NoC design (jeffers2016intel). We implemented the proposed analytical models in C and observed that, on average, it takes to calculate the latency of a source-to-destination pair. Each router of the NoC contains queues in which tokens wait to be routed. This NoC design has a deterministic service time across all queues. We compared the steady-state average latency values obtained with this approach against an industrial cycle-accurate simulator written in SystemC (Xplore; ogras2012energy). We ran each simulation for 10 million cycles to obtain steady-state latency values, with a warm-up period of 5000 cycles. Average latency values are obtained by averaging the latencies of all flits injected after the warm-up period. Injection rates are swept from to . Beyond , the server utilization becomes greater than one, which is not practical. We show the average latency of flits as a function of the flit injection rate for different NoC topologies. We also present experimental results considering the cache coherency protocol with different hit rates, network topologies, and floorplans. With a decreasing hit rate, the traffic towards the memory controller increases, leading to more congestion in the network.
6.2. Full-System Simulations on gem5
Applications are profiled in the full-system simulator gem5 (Binkert2011Gem5) using the Linux 'perf' tool (de2010new). The 'perf' tool captures the time taken by each function call and its children in the gem5 source, and represents the statistics as a function call graph. From this call graph, we obtain the time taken by the functions related to Garnet2.0, the on-chip interconnect model of gem5. Figure 11 shows the components of Garnet2.0 that take up a significant portion of the total simulation time while running the Streamcluster application on gem5. These components are the router, the network link, and functional writes. The 'other components' shown in Figure 11 consist of the functions not related to network simulation. We observe that functional writes take 50%, and the whole network takes around 60%, of the total simulation time in this case.
Simulation Time: To evaluate the reduction in simulation time achieved by the proposed approach, we first run the Streamcluster application with a 16-core CPU on gem5 in full-system mode using Garnet2.0, a cycle-accurate network simulator. Then, we repeat the same simulation, replacing the cycle-accurate network simulation with the proposed analytical model. The total simulation time is reduced from 12,466 seconds to 4,986 seconds when we replace the cycle-accurate NoC simulations with the proposed analytical models. Hence, we achieve a 2.5× speedup in cycle-accurate full-system simulation with the proposed NoC performance analysis technique.
6.3. Validation on Ring Architectures
This section evaluates the proposed analytical models on a priority-based ring architecture that consists of eight nodes. In this experiment, all nodes inject flits with an equal injection rate, and flits injected from a node go to all other nodes with equal probability. We obtain the latency between each source-destination pair using the proposed analytical models. The simulation and analysis results are compared in Figure 12. The proposed analysis technique has only a 2% error on average. The accuracy is highest at low injection rates and degrades gradually with increasing injection rate, as expected. Even at the highest injection rate, the error is only 5.2%.
6.4. Validation on Mesh Architectures
This section evaluates the proposed analytical models for 6×6 and 8×8 priority-based mesh NoCs with Y-X routing. As described in (jeffers2016intel), a mesh is a combination of horizontal and vertical half rings. The analytical model generation technique for priority-based NoC architectures is applied to the horizontal and vertical rings individually. Then, these latencies, as well as the time it takes to switch from one ring to the other, are used to obtain the latency of each source-destination pair. We first consider uniform random all-to-all traffic, as in Section 6.3. The comparison with the cycle-accurate simulator shows that the proposed analytical models are on average 97% and 96% accurate for the 6×6 and 8×8 mesh, as shown in Figure 13 and Figure 14, respectively. At the highest injection rate, the analytical models show 11% error in both cases.
Comparison to Prior Techniques: We compare the proposed analytical models to the existing priority-aware analytical models in the literature (walraevens2004discrete). Since these techniques do not consider multiple priority traffic classes in the same queue, they fail to accurately estimate the end-to-end latency. For example, Figure 13 and Figure 14 show that they overestimate the NoC latency at high injection rates for the 6×6 and 8×8 mesh networks, respectively. In contrast, the proposed technique estimates the latencies accurately, since it captures the interactions between different classes. Finally, we analyze the impact of using each transformation individually. If we apply only the structural transformation (ST), the latency is severely underestimated at higher injection rates, since contentions are not captured accurately. In contrast, applying only the service rate transformation (RT) overestimates the latency at higher injection rates, as the model becomes pessimistic.
Impact of the Coefficient of Variation: An important parameter in our analytical model is the coefficient of variation of the inter-arrival time. When the inter-arrival time of the incoming flits follows a geometric distribution, an increasing coefficient of variation implies a larger mean inter-arrival time. Hence, the average flit latency is expected to decrease with an increasing coefficient of variation. Indeed, the simulation and analysis results demonstrate this behavior for a 6×6 mesh in Figure 15. We observe that the proposed technique accurately estimates the average latency in comparison to cycle-accurate simulation. On average, the analytical models are 97% accurate with respect to the latency obtained from simulation in this case.
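The link between the coefficient of variation and the mean inter-arrival time follows directly from the moments of the geometric distribution. For an inter-arrival time A with per-cycle injection probability p:

```latex
P(A = k) = (1-p)^{k-1}\, p, \quad k \ge 1, \qquad
\mathbb{E}[A] = \frac{1}{p}, \qquad
\operatorname{Var}[A] = \frac{1-p}{p^{2}}, \qquad
C_A^{2} = \frac{\operatorname{Var}[A]}{\mathbb{E}[A]^{2}} = 1 - p .
```

Hence a larger coefficient of variation corresponds to a smaller injection probability p and therefore a larger mean inter-arrival time 1/p, which explains the decreasing latency trend.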
Evaluation with the Intel^{®} Xeon^{®} Scalable Server Processor Architecture: This section evaluates the proposed analytical model with the floorplan of a variant of the Intel^{®} Xeon^{®} Scalable Server Processor architecture (doweck2017inside). This version of the Xeon server has 26 cores, 26 banks of last-level cache (LLC), and 2 memory controllers.
The cores and LLC banks are distributed over a 6x6 mesh NoC. The comparison of simulation and the proposed analytical models with this floorplan is shown in Figure 16. On average, the accuracy is 98% when all cores send flits to all caches with equal injection rates. Similar to the evaluations on the 6x6 and 8x8 meshes, the state-of-the-art NoC performance analysis technique (walraevens2004discrete, ) significantly overestimates the average latency for this server architecture, as shown in Figure 16. Applying only ST underestimates the average latency, while applying only RT overestimates it.
The NoC latency is a function of the traffic class, since higher-priority classes experience less contention. To demonstrate this, we present the NoC latencies for 9 representative traffic classes of the server architecture described above. Figure 17 shows the latency of each class normalized with respect to the average latency obtained from simulation. Higher-priority classes experience lower latency, as expected. The proposed performance analysis technique achieves 91% accuracy on average for the classes with the lowest priority in the NoC. For the classes with medium and highest priority, the accuracy is 99% on average. Therefore, the proposed technique is reliable for all priority levels.





Table 2. Latency estimation accuracy (%) for different LLC hit rates.
LLC Hit Rate (%) | Address Network | Data Network
100              | 98.8            | 93.9
50               | 97.7            | 98.1
0                | 97.7            | 98.0
Finally, we evaluate the proposed technique with different LLC hit rates. Table 2 shows that the proposed approach achieves over 97% accuracy in estimating the average latency of the address network for all hit rates. Similarly, the latencies in the data network are estimated with 98% or greater accuracy for 0% and 50% hit rates. The accuracy drops to 93.9% for a 100% hit rate, since this scenario leads to the highest level of congestion due to all-to-all traffic behavior.
6.5. Evaluation with Real Applications
In this section, we evaluate the proposed technique with real applications. We use gem5 (Binkert2011Gem5, ) to extract application traces in Full-System (FS) mode. Garnet2.0 (agarwal2009garnet, ) is used as the network simulator in gem5 with the Ruby memory system. Table 3 shows the configuration settings we used for FS simulation in gem5.
Table 3. gem5 configuration used for full-system simulation.
Number of Cores    | 16
Frequency of Cores | 2 GHz
Instruction Set    | x86
Topology           | 4x4 Mesh
Routing Algorithm  | XY deterministic
L1 Cache           |
Memory Size        | 3 GB
Kernel Type        | Linux
Kernel Version     | 3.4.112
We collect traces of six 16-threaded applications from the PARSEC (bienia2008parsec, ) benchmark suite: Blackscholes, Canneal, Swaptions, Bodytrack, Fluidanimate, and Streamcluster. We selected applications that show relatively high network utilization, as discussed in (wettin2014performance, ). The accuracy obtained for these applications is an important indicator of the practicality of the proposed technique, since real applications do not necessarily comply with a known inter-arrival time distribution (bogdan2011non, ), such as the geometric distribution used in this work. The traces are parsed and simulated with our custom in-house simulator with a priority-based router model. For each application, a window of one million cycles with the highest injection rate is chosen for simulation. From the traces of these applications, we obtain the average injection rate of each source-destination pair. These injection rates are fed to our analytical models to obtain the average latency.
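The per-pair injection rates mentioned above can be computed directly from a parsed trace. A minimal sketch, assuming a hypothetical trace format of (cycle, source, destination) records (the actual gem5 trace format differs):

```python
from collections import Counter

def injection_rates(trace, window_cycles):
    """Average flits injected per cycle for each (source, destination)
    pair over a simulation window of window_cycles cycles."""
    counts = Counter((src, dst) for _cycle, src, dst in trace)
    return {pair: n / window_cycles for pair, n in counts.items()}

# Example: core 0 sends 3 flits to LLC bank 5 in a 1,000-cycle window
trace = [(10, 0, 5), (400, 0, 5), (900, 0, 5), (20, 1, 5)]
rates = injection_rates(trace, 1_000)  # {(0, 5): 0.003, (1, 5): 0.001}
```

The resulting dictionary of rates is exactly the shape of input the analytical models consume: one injection rate per source-destination pair.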
Figure 18 shows the comparison of the average latency between the proposed analytical model and the simulation. The x-axes represent the mean absolute percentage error (MAPE) between the average simulation latency ($L_{sim}$) and the average latency obtained from the analytical models ($L_{ana}$). MAPE is defined by the following equation:
(13) $\mathrm{MAPE} = \frac{|L_{sim} - L_{ana}|}{L_{sim}} \times 100$
The y-axes in the plots represent the percentage of source-destination pairs having the corresponding MAPE. From this figure, we observe that the latency obtained from the proposed analytical model is always within 10% of the latency reported by the cycle-accurate simulations. In particular, only 1% of source-destination pairs have a MAPE of 10% for the Canneal application. On average, the analytical models have 3% error in comparison to the latency obtained from simulation for real applications. These results demonstrate that our technique achieves high accuracy for applications which may have arbitrary inter-arrival time distributions.
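The MAPE computation used above is straightforward to implement. A minimal sketch (the latency lists are illustrative placeholders, not measured data):

```python
def mape(latency_sim, latency_ana):
    """Mean absolute percentage error of analytical latencies with
    respect to the simulated latencies, in percent."""
    errors = [abs(s - a) / s * 100.0
              for s, a in zip(latency_sim, latency_ana)]
    return sum(errors) / len(errors)

# Example: two source-destination pairs, simulated vs. analytical latency
print(mape([20.0, 40.0], [19.0, 42.0]))  # 5% and 5% error -> 5.0
```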
We further divide the window of one million cycles into 10 smaller windows of 100,000 cycles each. The average latency comparison for the Streamcluster application in these smaller windows is shown in Figure 19. The largest MAPE between the latencies obtained from simulation and the analytical model, 7%, is observed for window 10. On average, the proposed analytical models are 98% accurate for these 10 windows. This confirms the reliability of the proposed analytical models at an even finer granularity. Finally, we note that the experiments with synthetic traffic in Section 6.3 and Section 6.4 exercise higher injection rates than these applications. Hence, the proposed technique performs well both under real application traces and under heavy traffic scenarios.
Prior work showed that the deviation from the Poisson distribution becomes larger as the network load approaches saturation (ogras2010analytical, ). Similarly, we observe that the geometric distribution assumption is a good approximation until the NoC operates near the saturation point. Therefore, we obtain high accuracy for real application workloads. Since this accuracy can degrade with increasing traffic load, we plan to generalize the proposed models by relaxing the geometric distribution assumption in future work.
7. Conclusion
In this work, we propose an approach to build analytical models for priority-based NoCs with multi-class flits. As we emphasized, no prior work has presented analytical models that simultaneously consider priority arbitration and multi-class flits sharing the same queue. Such a priority-based queuing network is decomposed into independent queues using the novel transformations proposed in this work. We evaluate the accuracy of the proposed approach by computing the end-to-end latency of flits in a realistic industrial platform and using real application benchmarks. Our extensive evaluations show that the proposed technique achieves 97% accuracy compared to cycle-accurate simulations for different network sizes and traffic flows.
References
 [1] N. Agarwal et al. GARNET: A Detailed On-chip Network Model Inside a Full-system Simulator. In 2009 IEEE Intl. Symp. on Performance Analysis of Systems and Software, pages 33–42.
 [2] I. Awan and R. Fretwell. Analysis of Discrete-Time Queues with Space and Service Priorities for Arbitrary Arrival Processes. In Proc. 11th Intl. Conf. on Parallel and Distributed Systems, volume 2, pages 115–119, 2005.
 [3] A. Bartolini et al. A Virtual Platform Environment for Exploring Power, Thermal and Reliability Management Control Strategies in High-Performance Multicores. In Proc. of the Great Lakes Symp. on VLSI, pages 311–316, 2010.
 [4] A. W. Berger and W. Whitt. Workload Bounds in Fluid Models with Priorities. Performance Evaluation, 41(4):249–267, 2000.
 [5] D. P. Bertsekas, R. G. Gallager, and P. Humblet. Data Networks, volume 2. Prentice-Hall International, New Jersey, 1992.
 [6] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proc. of the Intl. Conf. on Parallel Arch. and Compilation Tech., pages 72–81, 2008.
 [7] N. Binkert et al. The gem5 Simulator. SIGARCH Comp. Arch. News, May 2011.
 [8] P. Bogdan and R. Marculescu. Non-stationary Traffic Analysis and its Implications on Multicore Platform Design. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 30(4):508–519, 2011.
 [9] G. Bolch, S. Greiner, H. De Meer, and K. S. Trivedi. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. John Wiley & Sons, 2006.
 [10] W. Choi et al. On-Chip Communication Network for Efficient Training of Deep Convolutional Networks on Heterogeneous Manycore Systems. IEEE Trans. on Computers, 67(5):672–686, 2017.
 [11] A. C. de Melo. The New Linux Perf Tools. In Linux Kongress, volume 18, 2010.
 [12] J. Doweck et al. Inside 6th-generation Intel Core: New Microarchitecture Code-named Skylake. IEEE Micro, 37(2):52–62, 2017.
 [13] S. Ikehara and M. Miyazaki. Approximate Analysis of Queueing Networks with Nonpreemptive Priority Scheduling. In Proc. 11th Int. Teletraffic Congr.
 [14] J. Jeffers, J. Reinders, and A. Sodani. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann, 2016.
 [15] N. Jiang et al. A Detailed and Flexible Cycle-accurate Network-on-chip Simulator. In 2013 IEEE Intl. Symp. on Performance Analysis of Systems and Software (ISPASS), pages 86–96.
 [16] X. Jin and G. Min. Modelling and Analysis of Priority Queueing Systems with Multi-class Self-similar Network Traffic: a Novel and Efficient Queue-decomposition Approach. IEEE Trans. on Communications, 57(5), 2009.
 [17] J. A. Kahle et al. Introduction to the Cell multiprocessor. IBM journal of Research and Development, 49(4.5):589–604, 2005.
 [18] H. Kashif and H. Patel. Bounding Buffer Space Requirements for Real-time Priority-aware Networks. In Asia and South Pacific Design Autom. Conf., pages 113–118, 2014.
 [19] C. N. Keltcher, K. J. McGrath, A. Ahmed, and P. Conway. The AMD Opteron Processor for Multiprocessor Servers. IEEE Micro, 23(2):66–76, 2003.
 [20] A. E. Kiasari, Z. Lu, and A. Jantsch. An Analytical Latency Model for Networks-on-Chip. IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 21(1):113–123, 2013.
 [21] R. Leupers et al. Virtual Manycore platforms: Moving towards 100+ processor cores. In Proc. of DATE, pages 1–6, 2011.
 [22] P. S. Magnusson et al. Simics: A Full System Simulation Platform. Computer, 35(2):50–58, 2002.
 [23] U. Y. Ogras, P. Bogdan, and R. Marculescu. An Analytical Approach for Network-on-Chip Performance Analysis. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 29(12):2001–2013, 2010.
 [24] U. Y. Ogras, Y. Emre, J. Xu, T. Kam, and M. Kishinevsky. Energy-Guided Exploration of On-Chip Network Design for Exa-Scale Computing. In Proc. of Intl. Workshop on System Level Interconnect Prediction, pages 24–31, 2012.
 [25] U. Y. Ogras, M. Kishinevsky, and S. Chatterjee. xPLORE: Communication Fabric Design and Optimization Framework. Developed at Strategic CAD Labs, Intel Corp.
 [26] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh. Performance Evaluation and Design Trade-offs for Network-on-Chip Interconnect Architectures. IEEE Trans. on Computers, 54(8):1025–1040, 2005.
 [27] A. Patel et al. MARSS: a Full System Simulator for Multicore x86 CPUs. In Design Autom. Conf., pages 1050–1055, 2011.
 [28] Y. Qian, Z. Lu, and W. Dou. Analysis of Worst-case Delay Bounds for Best-effort Communication in Wormhole Networks on Chip. In 2009 3rd ACM/IEEE Intl. Symp. on Networks-on-Chip, pages 44–53.
 [29] Z.-L. Qian et al. A Support Vector Regression (SVR)-based Latency Model for Network-on-Chip (NoC) Architectures. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 35(3):471–484, 2015.

 [30] A. Rico et al. ARM HPC Ecosystem and the Reemergence of Vectors. In Proc. of the Computing Frontiers Conf., pages 329–334. ACM, 2017.
 [31] E. Rotem. Intel Architecture, Code Name Skylake Deep Dive: A New Architecture to Manage Power Performance and Energy Efficiency. In Intel Developer Forum, 2015.
 [32] M. P. Singh and M. K. Jain. Evolution of Processor Architecture in Mobile Phones. Intl. Journ. of Computer Applications, 90(4), 2014.
 [33] J. Walraevens. Discretetime Queueing Models with Priorities. PhD thesis, Ghent University, 2004.
 [34] P. Wettin et al. Performance Evaluation of Wireless NoCs in Presence of Irregular Network Routing Strategies. In Proc. of DATE, page 272, 2014.
 [35] Y. Wu et al. Analytical Modelling of Networks in Multicomputer Systems under Bursty and Batch Arrival Traffic. The Journ. of Supercomputing, 51(2):115–130, 2010.
Appendix A
Residual time calculation: The residual time is the delay in serving the next flit due to the remaining service time of the currently processed flit. As illustrated in Figure 20, class-2 flits (low-priority flits) have to wait until the server becomes free. Equations 15 and 16 express the total residual time effect on class-1 and class-2, respectively. Analytical models for residual time have been well studied for continuous-time systems, yet little attention has been given to discrete-time systems [5]. In this section, we derive analytical expressions for the residual time in priority-based queuing systems using discrete-time domain analysis. These equations are derived assuming that the arrival process at each queue follows a geometric distribution. The average residual time for each flit class is evaluated by averaging the area of the residual time triangles shown in Figure 20 over all flits injected during a large time interval $T$. Let $N_1$ and $N_2$ be the total numbers of class-1 and class-2 flits, respectively, injected into the system in time $T$, and let $S_j^{(c)}$ be the service time of the $j$-th flit of class-$c$. When a new service of duration $S_j^{(c)}$ begins, the residual time starts at $S_j^{(c)}-1$ and decays linearly. Taking the time average of the residual time seen by class-1, we obtain Equation 14:
(14) $R_1 = \frac{1}{T}\left[\sum_{j=1}^{N_1} \sum_{r=0}^{S_j^{(1)}-1} r \;+\; \sum_{j=1}^{N_2} \sum_{r=0}^{S_j^{(2)}-1} r\right]$
In the derivation of Equation 14, $r$ is an auxiliary variable that represents the successive residual time values for a particular flit. Multiplying and dividing the first summation by $N_1$ and the second by $N_2$, we obtain:
(15) $R_1 = \frac{N_1}{T}\frac{1}{N_1}\sum_{j=1}^{N_1} \frac{S_j^{(1)}(S_j^{(1)}-1)}{2} + \frac{N_2}{T}\frac{1}{N_2}\sum_{j=1}^{N_2} \frac{S_j^{(2)}(S_j^{(2)}-1)}{2} \stackrel{(a)}{=} \lambda_1 \frac{E[S_1^2]-E[S_1]}{2} + \lambda_2 \frac{E[S_2^2]-E[S_2]}{2} \stackrel{(b)}{=} \bar{R}_1 + \bar{R}_2$
where (a) follows from the fact that $N_c/T$ is the average number of class-$c$ flits injected per time unit, which is $\lambda_c$. Also, $E[S_c]$ and $E[S_c^2]$ denote the first and second order moments of the service time of class-$c$. We obtain (b) because $\lambda_c (E[S_c^2]-E[S_c])/2$ is the residual time of class-$c$ ($\bar{R}_c$).
Similarly, we compute the effective residual time of class-2 ($R_2$). At any cycle, both class-1 and class-2 flits can arrive in the system. If the server is empty at that time, service starts for the class-1 flit, as it has higher priority. Therefore, the portion of the residual time that occurs due to class-1 flits decays linearly from $S_j^{(1)}$ instead of $S_j^{(1)}-1$. Hence, $R_2$ can be written as:
(16) $R_2 = \lambda_1 \frac{E[S_1^2]+E[S_1]}{2} + \lambda_2 \frac{E[S_2^2]-E[S_2]}{2}$
In general, if there are $N$ classes in total, the residual time expression for flits of class-$i$ will be:
(17) $R_i = \sum_{j=1}^{i-1} \lambda_j \frac{E[S_j^2]+E[S_j]}{2} + \sum_{j=i}^{N} \lambda_j \frac{E[S_j^2]-E[S_j]}{2}$
where $C_{A_i}^2 = 1-\lambda_i$ is the squared coefficient of variation of the inter-arrival time for Geo/G/1 queues and $\rho_i = \lambda_i E[S_i]$ is the average utilization of the server for class-$i$ flits.
Average Queuing Time Expressions: The queuing time expressions for flits of traffic class-1 and class-2 can be written as [5]:
(18) $W_1 = \frac{R_1}{1-\rho_1}$
(19) $W_2 = \frac{R_2}{(1-\rho_1)(1-\rho_1-\rho_2)}$
Substituting $R_1$ and $R_2$ from Equations 15 and 16, respectively:
(20) $W_1 = \frac{\lambda_1(E[S_1^2]-E[S_1]) + \lambda_2(E[S_2^2]-E[S_2])}{2(1-\rho_1)}$
(21) $W_2 = \frac{\lambda_1(E[S_1^2]+E[S_1]) + \lambda_2(E[S_2^2]-E[S_2])}{2(1-\rho_1)(1-\rho_1-\rho_2)}$
In general, if there are $N$ classes in total, the queuing time expression for class-$i$ flits will be:
(22) $W_i = \frac{R_i}{\left(1-\sum_{j=1}^{i-1}\rho_j\right)\left(1-\sum_{j=1}^{i}\rho_j\right)}$
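As a sanity check, Equations (17) and (22) can be evaluated directly from the per-class injection rates and the first two moments of the service time. The sketch below follows the reconstruction above; the two-class numbers in the test assume a deterministic 2-cycle service time purely as an example:

```python
def residual_time(i, lam, es, es2):
    """Equation (17): residual time seen by class-i (classes indexed
    from 1, lower index = higher priority). lam, es, es2 are per-class
    lists of injection rate and first/second service-time moments."""
    n = len(lam)
    r = 0.0
    for j in range(1, n + 1):
        sj, sj2 = es[j - 1], es2[j - 1]
        if j < i:   # higher-priority classes: residual decays from S_j
            r += lam[j - 1] * (sj2 + sj) / 2.0
        else:       # class i and lower: residual decays from S_j - 1
            r += lam[j - 1] * (sj2 - sj) / 2.0
    return r

def queuing_time(i, lam, es, es2):
    """Equation (22): mean queuing time of class-i flits."""
    rho = [l * s for l, s in zip(lam, es)]
    hi = sum(rho[: i - 1])  # utilization of strictly higher classes
    return residual_time(i, lam, es, es2) / ((1.0 - hi) * (1.0 - hi - rho[i - 1]))
```

For two classes these expressions reduce exactly to Equations (20) and (21), and the lower-priority class always sees the larger queuing time.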