Modern industrial design methodologies involve thorough power, performance, and area evaluations before architectural decisions are frozen.
These pre-silicon evaluations are vital for detecting functional bugs and power-performance violations, since post-silicon fixes are costly, if feasible at all.
Therefore, significant resources are dedicated to pre-silicon evaluation
using virtual platforms (leupers2011virtual, ) or full-system simulators (Binkert2011Gem5, ).
NoC simulations play a critical role in these evaluations,
as NoCs have become the standard interconnect solution in many-core chip-multiprocessors (CMPs) (jeffers2016intel, ; keltcher2003amd, ; kahle2005introduction, ), client CPUs (rotem2015intel, ), and mobile systems-on-chip (singh2014evolution, ).
Moreover, there is growing interest in using NoCs in hardware implementations of deep neural networks (choi2017chip, ).
Since the on-chip interconnect is a critical component of multicore architectures, pre-silicon evaluation platforms contain cycle-accurate NoC simulators (agarwal2009garnet, ; jiang2013detailed, ). NoC simulations take up a significant portion of the total simulation time, which is already limiting the scope of pre-silicon evaluation (e.g., simulating even a few seconds of applications can take days). For example, Figure 1 shows that 40%-70% of total simulation time is spent on the network itself when performing full-system simulation using gem5 (Binkert2011Gem5, ). Hence, accelerating NoC simulations without sacrificing accuracy can significantly improve both the quality and scope of pre-silicon evaluations.
Several performance analysis approaches have been proposed to enable faster NoC design space exploration (ogras2010analytical, ; wu2010analytical, ; qian2015support, ). Prior techniques have assumed a round-robin arbitration policy in the routers since the majority of router architectures proposed to date have used round-robin for fairness. In doing so, they miss two critical aspects of industrial priority-based NoCs (jeffers2016intel, ; rico2017arm, ; kahle2005introduction, ). First, routers give priority to the flits already in the network to achieve predictable latency within the interconnect. For example, assume that the class-1 flits headed to the neighboring router and the class-2 flits headed to the local node in Figure 2 are already in the NoC, while class-3 flits headed to the neighboring router must wait in the input buffer to be admitted. Consequently, flits in the NoC (class-1 and class-2) experience deterministic service time at the expense of increased waiting time for new flits. Second, flits from different priority classes can be stored in the same queue. For instance, new read/write requests from the core to tag directories use the same physical and virtual channels as the requests forwarded from the directories to the memory controllers. Moreover, only a fraction of the flits in either the high- or low-priority queue competes with the flits in the other queue. For example, suppose the class-2 flits in Figure 2 are delivered to the local node. Then, class-3 flits must compete only with the class-1 flits in the high-priority queue. Analytical models that ignore this traffic split significantly overestimate the latency, as shown in Section 4.1. In contrast, analytical models that ignore priority significantly underestimate the latency.
Thus, prior approaches that do not model priority (ogras2010analytical, ; wu2010analytical, ; qian2015support, ) and simple performance models for the priority queues (kiasari2013analytical, ; jin2009modelling, ) are inapplicable to priority-based industrial NoCs.
This paper presents a novel NoC performance analysis technique that considers traffic classes with different priorities. This problem is theoretically challenging due to the non-trivial interactions between classes and shared resources. For example, queues can be shared by flits with different priorities, as shown in Figure 2. Similarly, routers may merge different classes coming through separate ports, or act as switches that can disjoin flits coming from different physical channels. To address these challenges, we propose a two-step approach that consists of an analysis technique followed by an iterative algorithm. The first step establishes that priority-based NoCs can be decomposed into separate queues using traffic splits of two types. Since existing performance analysis techniques cannot model these structures with traffic splits, we develop analytical models for these canonical queuing structures. The second step involves a novel iterative algorithm that composes an end-to-end latency model for the queuing network of a given NoC topology and input traffic pattern. The proposed approach is evaluated thoroughly using both 2D mesh and ring architectures used in industrial NoCs. It achieves 97% accuracy with respect to cycle-accurate simulations for realistic architectures and applications.
The major contributions of this paper are as follows:
A technique to obtain analytical performance models of priority-based NoCs with multiple traffic classes,
An algorithm to synthesize end-to-end latency for each traffic class using the analytical models,
Extensive evaluations using an industrial cycle-accurate simulator, real applications and synthetic traffic.
The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 provides a brief overview and background of the proposed work. Section 4 describes the required transformations for two canonical structures and their analytical models. Section 5 describes how these two transformations are used to analyze a queuing network. Section 6 presents experimental evaluations, and Section 7 concludes the paper by summarizing the key contributions.
2. Related Work
Indeed, there is continuous interest in applying novel techniques such as machine learning (qian2015support, ) and network calculus (qian2009analysis, ) to NoC performance analysis. However, these studies do not consider multiple traffic classes with different priorities. Since state-of-the-art industrial NoC designs (doweck2017inside, ; jeffers2016intel, ) use priority-based arbitration with multi-class traffic, it is important to develop performance analysis techniques for this type of architecture.
Kashif et al. have recently presented priority-aware router architectures (kashif2014bounding, ). However, this work presents analytical models only for worst-case latency. In practice, analyzing the average latency is important since using worst-case latency estimates in full-system simulations would lead to inaccurate conclusions. A recent technique proposed an analytical latency model for priority-based NoCs (kiasari2013analytical, ). This technique, however, assumes that each queue in the network contains a single class of flits.
Several techniques present performance analysis of priority-based queuing networks outside the NoC domain (bertsekas1992data, ; bolch2006queueing, ; ikeharaapproximate, ). Nevertheless, these techniques do not consider multiple traffic classes in the same queue. The work presented in (awan2005analysis, ) considers multiple traffic classes, but it assumes that high priority packets preempt the lower priority packets. However, this is not a valid assumption in the NoC context. A technique that can handle two traffic classes, Empty Buffer Approximation (EBA), has been proposed in (berger2000workload, ) for a priority-based queuing system. This approach was later extended to multi-class systems (jin2009modelling, ). However, EBA ignores the residual time caused by low priority flits on high priority traffic. Hence, it is impractical to use EBA for priority-aware industrial NoCs.
The aforementioned prior studies assume a continuous-time queuing network model, while the events in synchronous NoCs take place in discrete clock cycles. A discrete-time priority-based queuing system is analyzed in (walraevens2004discrete, ). This technique forms a Markov chain for a given queuing system, then analyzes this model in the z-domain through probability generating functions (PGFs). PGFs deal with joint probability distributions where the number of random variables is equal to the number of traffic classes in the queuing system. This approach does not scale to systems with a large number of traffic classes because the corresponding analysis becomes intractable. For example, an industrial 8×8 NoC would have 64 sources and 64 destinations, which results in 4096 (64×64) variables with the PGF approach. Furthermore, our approach outperforms this technique, as demonstrated in Section 6.4.
In contrast to prior approaches, we propose a scalable and accurate closed form solution for a priority-based queuing network with multi-class traffic. The proposed technique constructs end-to-end latency models using two canonical structures identified for priority-based NoCs. Unlike prior approaches, our technique scales to any number of traffic classes. To the best of our knowledge, this is the first analytical model for priority-based NoCs that considers both (1) shared queues among multiple priority classes and (2) traffic arbitration dependencies across the queues.
3. Overview and Background
3.1. Proposed Performance Analysis Flow
The primary target of the proposed model is to accelerate virtual platforms (bartolini2010virtual, ) and full-system simulations (Binkert2011Gem5, ; patel2011marss, ; magnusson2002simics, ) by replacing time-consuming NoC simulations with accurate lightweight analytical models. At the beginning of the simulation, the proposed technique parses the priority-based NoC topology to construct the analytical models, as shown in Figure 3. The host, such as a virtual platform, maintains a record of the traffic load and the destination address for each node. It also periodically (every 10K-100K cycles) sends the traffic injections of requesting nodes, such as cores, to the proposed technique. Then, the proposed technique applies the analytical models (steps 2 and 3 in Figure 3) to compute the end-to-end latency. Whenever there is a new request from an end node, the host system estimates the latency using the proposed model as a function of the source-destination pair. That is, our model replaces the cycle-by-cycle simulation of flits in NoCs.
We note that the requesting nodes, such as the cores, have a parameterized number of maximum outstanding request credits. Hence, the requesters are automatically throttled by these credits/node, leading to a bounded number of flits in the NoCs. Since the traffic injection rates provided by the host already account for this throttling, we do not explicitly model blocking at the interfaces.
3.2. Basic Priority-Based Queuing Models
We assume a discrete-time system in which micro-architectural events, such as writing to a buffer, arbitration, and switch traversal, take an integral number of clock cycles. Therefore, we develop queuing models based on an arrival process that follows a geometric distribution, in contrast to continuous-time models that are based on the Poisson (M for Markovian) arrival assumption. More specifically, we adopt the Geo/G/1 model, in which the inter-arrival time of the incoming flits to the queue follows a geometric distribution (denoted by Geo), the service time of the queue follows a general discrete-time distribution (denoted by G), and the queue has one server (the ‘1’ in the Geo/G/1 notation). The proposed technique estimates the end-to-end latency for realistic applications accurately, as we demonstrate in Section 6.5. However, the accuracy is expected to drop if the NoC operates close to its maximum load, since the geometric (similar to Poisson) packet inter-arrival time assumption becomes invalid (ogras2010analytical, ).
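As an illustration of the arrival assumption, the following minimal sketch draws one Bernoulli trial per clock cycle, which produces geometrically distributed inter-arrival times with a mean of one over the injection rate. The function name, the seed, and the chosen injection rate are illustrative assumptions and are not part of the proposed framework.

```python
import random

def geometric_interarrivals(injection_rate, num_cycles, seed=0):
    """Inter-arrival times (in cycles) of a Bernoulli arrival process.

    One Bernoulli trial is drawn per clock cycle, so the gaps between
    successive arrivals are geometric with mean 1 / injection_rate.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    gaps = []
    last_arrival = 0
    for cycle in range(1, num_cycles + 1):
        if rng.random() < injection_rate:  # a flit arrives in this cycle
            gaps.append(cycle - last_arrival)
            last_arrival = cycle
    return gaps

# Hypothetical injection rate of 0.2 flits/cycle.
gaps = geometric_interarrivals(injection_rate=0.2, num_cycles=200_000)
mean_gap = sum(gaps) / len(gaps)
print(f"mean inter-arrival time: {mean_gap:.2f} cycles")
```

With an injection rate of 0.2 flits per cycle, the empirical mean should be close to 5 cycles, and no gap can be shorter than one cycle, which is the defining difference from a continuous-time Poisson process.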
Performance analysis techniques in the literature (bertsekas1992data, ; jin2009modelling, ; kiasari2013analytical, ) discuss basic priority-based networks in which each priority class has a dedicated queue, as illustrated in Figure 4(a). In this architecture, the flits in the high-priority queue have priority over the flits in the low-priority queue. That is, flits in the low-priority queue are served only when the high-priority queue is empty and the server is ready to serve new flits. Another example with $N$ priority classes is shown in Figure 4(b). The flits in $Q_i$ have higher priority than the flits in $Q_j$ if $i < j$. The average waiting time of each priority class, $W_i$ for $1 \le i \le N$, is known for continuous-time M/G/1 queues (bertsekas1992data, ). In the M/G/1 queuing system, flits arrive at the queue following a Poisson process (M) and the service time of the queue follows a general distribution (G). In this work, we first derive waiting time expressions for discrete-time Geo/G/1 queues. Then, we employ these models to derive end-to-end NoC latency models.
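For reference, the known continuous-time result mentioned above can be sketched as follows. This is the textbook non-preemptive priority formula for M/G/1 queues, not the discrete-time Geo/G/1 expression derived in this paper. The function name and the example rates are assumptions made for illustration.

```python
def mg1_priority_waiting_times(arrival_rates, service_times, service_cv=None):
    """Mean waiting time per class in a non-preemptive priority M/G/1 queue.

    Classes are ordered from highest (index 0) to lowest priority.
    Textbook continuous-time result:
        W_k = R / ((1 - sigma_{k-1}) * (1 - sigma_k)),
        R   = 0.5 * sum_i lambda_i * E[X_i^2],
        sigma_k = rho_1 + ... + rho_k.
    """
    n = len(arrival_rates)
    if service_cv is None:
        service_cv = [0.0] * n  # deterministic service time by default
    rho = [lam * t for lam, t in zip(arrival_rates, service_times)]
    if sum(rho) >= 1.0:
        raise ValueError("total server utilization must be below one")
    # E[X^2] = T^2 * (1 + C_S^2) for mean T and coefficient of variation C_S.
    residual = 0.5 * sum(
        lam * t * t * (1.0 + cv * cv)
        for lam, t, cv in zip(arrival_rates, service_times, service_cv)
    )
    waits = []
    sigma_prev = 0.0
    for r in rho:
        sigma = sigma_prev + r
        waits.append(residual / ((1.0 - sigma_prev) * (1.0 - sigma)))
        sigma_prev = sigma
    return waits

# Two classes, each injecting 0.1 flits/cycle with a 2-cycle service time.
waits = mg1_priority_waiting_times([0.1, 0.1], [2.0, 2.0])
print(waits)
```

In this example the higher-priority class waits less than the lower-priority class, which is the qualitative behavior that the discrete-time model in Lemma 1 also has to capture.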
The average waiting time of flits in a queue can be divided into two parts: (1) waiting time due to the flits already buffered in the queue, and (2) waiting time due to the flits which are in the middle of their service, i.e., the residual time. The following lemma expresses the waiting time as a function of input traffic and NoC parameters.
Lemma 1: Consider a queuing network with $N$ priority classes as shown in Figure 4(b). Suppose that we are given the injection rates $\lambda_i$, service rates $\mu_i$, residual times $R_i$, and server utilizations $\rho_i$ for $1 \le i \le N$, where $\rho_i = \lambda_i / \mu_i$ (see Table 1). Then, the waiting time of class-$i$ flits, $W_i$, is given as:
Proof: Equation 1 is derived in Appendix A to avoid disrupting the flow of the paper.
Shortcomings of the Basic Priority-Based Queuing Models: Although Equation 1 extends the known results from continuous-time to discrete-time queuing systems, it cannot handle a network of queues in which each queue can store more than one priority class. For example, it does not handle the scenario in which both class-1 and class-2 flits share the high-priority queue in Figure 2. To this end, we present our novel technique that handles multiple priority classes in one queue.
|$\lambda_i$||Injection rate of class-i flits|
|$T_i$||Service time of class-i flits|
|$\mu_i$||Service rate of class-i flits ($=1/T_i$)|
|$R_i$||Residual time of class-i flits|
|$\rho_i$||Server utilization of class-i flits ($=\lambda_i/\mu_i$)|
|$C_S$||Coefficient of variation of service time|
|$W_i$||Waiting time of class-i flits|
4. Canonical Analytical Models
This section describes two canonical queuing structures observed in priority-based NoCs. We first describe these structures and explain why prior analysis techniques fail to analyze them. Then, we present two novel transformations and accurate analysis techniques.
4.1. Transformation 1: Split at High Priority Queue
Conceptual Illustration: Consider the structure shown in Figure 5(a). As illustrated in Section 1, flits from traffic class-1 and class-2 are already in the network, while flits from traffic class-3 are waiting in the low-priority queue to be admitted. Since routers in industrial NoCs give priority to the flits in the network, class-1 flits have higher priority than those in the low-priority queue. To facilitate the description of the proposed models, we represent this system by the structure shown in Figure 5(b). In this figure, $\mu_i$ represents the service rate of class-$i$ for $i = 1, 2, 3$. If we use Equation 1 to obtain an analytical model for the waiting time of traffic class-3, the resulting waiting time will be highly pessimistic, as shown in Figure 6. The basic priority-based queuing model overestimates the latency because it assumes that each class in the network occupies a separate queue and, hence, that all flits in the high-priority queue have priority over those in the low-priority queue.
Proposed Transformation: The basic priority equations cannot be applied to this system since the flit distribution of class-1 as seen by class-3 flits changes depending on the presence of class-2 traffic. To address this challenge, we propose a novel structural transformation from Figure 5(b) to Figure 5(c). Comparison of the structures before and after the transformation reveals:
The top portion (Q with its server) is identical to the original structure, since and remain the same due to higher priority of class-1 over class-3.
The bottom portion (Q and Q) forms a basic priority queue structure, as highlighted by the red dotted box.
The basic priority queue structure is useful since we have already derived its waiting time model in Equation 1. However, the arrival process at Q must be derived to apply this equation and ensure the equivalence of the structures before and after the transformation.
We derive the second-order moment of the inter-departure time of class-1 using the decomposition technique presented in (bolch2006queueing, ). These inter-departure distributions are functions of the inter-arrival distributions of all traffic classes flowing in the same queue and of the service rates of the classes, as illustrated in Figure 7. This technique first calculates the effective coefficient of variation at the input as the weighted sum of the coefficients of variation of the individual classes (Figure 7, Phase 1). Then, it finds the effective coefficient of variation of the inter-departure time using the effective input coefficient of variation and the coefficient of variation of the service time. In the final phase, the coefficient of variation of the inter-departure time of each individual class is found, as illustrated in Figure 7 (Phase 3). By matching the first two moments of the inter-arrival statistics at the transformed queue, we ensure that the transformed structure in Figure 5(c) approximates the original system. This decomposition enables us to find the residual time for class-1 as:
Proposed Analytical Model: The bottom part of the transformed system in Figure 5(c) is the basic priority queue (marked with the dotted red box). Therefore, the higher priority part of Equation 1 can be used to express the waiting time of class-1 flits as:
We also note that the waiting time of class-2 flits is not affected by this transformation. Hence, it follows from Equation 1 for the degenerate single-class case ($N = 1$).
Figure 6 shows that the waiting time calculated by the proposed analytical model for flits of traffic class-3 is quite accurate with respect to the waiting time obtained from the simulation. The average error in waiting time of traffic class-3 is 2% for the system shown in Figure 5(a), with a deterministic service time of two cycles.
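The three-phase decomposition used in this transformation (merge, flow through the server, split) can be sketched with the classical continuous-time approximations for squared coefficients of variation. These are stand-ins chosen for illustration under the assumption of a GI/G/1-style node; the discrete-time expressions used by the proposed models may differ, and all function names are hypothetical.

```python
def merge_cv2(rates, cv2s):
    """Phase 1: effective squared CV at the input as a rate-weighted sum."""
    total = sum(rates)
    return sum(rate / total * cv2 for rate, cv2 in zip(rates, cv2s))

def departure_cv2(utilization, arrival_cv2, service_cv2):
    """Phase 2: squared CV of the merged inter-departure process.

    Classical approximation: rho^2 * C_S^2 + (1 - rho^2) * C_A^2.
    """
    rho2 = utilization ** 2
    return rho2 * service_cv2 + (1.0 - rho2) * arrival_cv2

def split_cv2(fraction, merged_departure_cv2):
    """Phase 3: squared CV of one class after probabilistic splitting."""
    return 1.0 + fraction * (merged_departure_cv2 - 1.0)

# Two classes share a queue; each injects 0.1 flits/cycle with squared CV 1.
rates = [0.1, 0.1]
c_a2 = merge_cv2(rates, [1.0, 1.0])
rho = sum(rates) * 2.0            # deterministic 2-cycle service time
c_d2 = departure_cv2(rho, c_a2, service_cv2=0.0)
c_d2_class1 = split_cv2(rates[0] / sum(rates), c_d2)
print(c_a2, c_d2, c_d2_class1)
```

A deterministic server smooths the stream (the departure variability drops below the arrival variability), while splitting pushes each class back toward a squared coefficient of variation of one.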
4.2. Transformation 2: Split at Low Priority Queue
Conceptual Illustration: Consider the queuing system shown in Figure 8(a). In this system, class-1 flits () are waiting in Q, while class-2 flits () and class-3 flits () are waiting in Q. Class-1 and class-3 flits share the same channel and compete for the same output, while class-2 flits are sent to a separate output. Class-1 flits always win the arbitration since they have higher priority. Similar to the previous transformation, the queuing model in Figure 8(b) is used as an intermediate representation to facilitate the discussion. In this system, Q and Q are represented as Q and Q respectively.
If we ignore the impact of class-1 traffic while modeling the waiting time for class-3, the resulting analytical models will be highly optimistic, as shown in Figure 9. Accounting for the impact of class-1 traffic on class-2 is challenging, since only the fraction of the flits in the low-priority queue that competes with class-1 is blocked. In other words, class-2 flits, which go to the local node, are not directly blocked by class-1 flits. Hence, there is a need for a new transformation that can address the split at the low-priority queue.
Proposed Transformation: The high-priority flow (class-1) is not affected by class-2 traffic since they do not share the same server. Therefore, the waiting time of class-1 flits can be readily obtained using Equation 1 as:
Hence, we represent the class-1 queue as a stand-alone queue, as shown in Figure 8(c). However, the opposite is not true; class-1 flits affect both class-2 (indirectly) and class-3 (directly). Therefore, we represent them using a new queue with modified service rate statistics. To ensure that Figure 8(c) closely approximates the original system, we characterize the effect on the service rate of class-3 using a novel analytical model.
Proposed Analytical Model: Both the service time and residual time of class-3 change due to the interaction with class-1. To quantify these changes, we set $\lambda_2 = 0$ such that the effect of class-2 is isolated. In this case, the waiting time of class-3 flits can be found using Equation 1 as:
The same waiting time can also be found using the modified service time and residual time of class-3. The probability that a class-3 flit cannot be served due to class-1 is equal to the server utilization of class-1. Moreover, there will be extra utilization due to the residual effect of class-3 on class-1, i.e., on the flits in the high-priority queue. Hence, the probability that a class-3 flit is delayed due to class-1 flits is:
Each time a class-3 flit is blocked by class-1 flits, the extra delay equals the class-1 service time. Since each flit can be blocked multiple consecutive times, the additional busy period of serving class-3 is expressed as:
Consequently, the modified service time and utilization of class-3 can be expressed as:
Suppose that the modified residual time of class-3 is denoted by . We can plug , the modified utilization from Equation 9, and the additional busy period from Equation 8 into Geo/G/1 model to express the waiting time as:
Since the modified service time and residual times are computed, we can apply the Geo/G/1 queuing model one more time to find the waiting time of class-2 and class-3 flits as:
Figure 9 shows that the class-3 waiting time calculated using the proposed analytical modeling technique is very close to simulation results. The modeling error is within 4% using a deterministic service time of 2 cycles.
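The argument that a class-3 flit can be blocked several consecutive times can be made concrete with a short worked derivation. It assumes, purely for illustration, that every service attempt of a class-3 flit is blocked independently with the same probability $p$ and that each blocking costs one class-1 service time $T_1$. The symbols $p$, $T_1$, $T_3$, $\Delta T_3$, and $T_3^{*}$ are notation introduced here and need not match the paper's equations.

```latex
\begin{align}
E[\text{number of blockings}] &= \sum_{k=0}^{\infty} k \, p^{k} (1-p) = \frac{p}{1-p}, \\
\Delta T_3 &= T_1 \, \frac{p}{1-p}, \\
T_3^{*} &= T_3 + \Delta T_3 .
\end{align}
```

For example, with $p = 0.2$ and $T_1 = T_3 = 2$ cycles, the additional busy period is $\Delta T_3 = 0.5$ cycles and the modified service time is $T_3^{*} = 2.5$ cycles under these assumptions.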
5. Generalization for Arbitrary Number of Queues
In this section, we show how the proposed transformations are used to generate analytical models for priority-based NoCs with arbitrary topologies and input traffic. Algorithm 1 describes the model generation technique, which is a part of the proposed methodology to be used in a virtual platform. This algorithm takes injection rates for all traffic classes, the NoC topology, and the routing of individual traffic classes. Then, it uses the transformations described in Section 4.1 and Section 4.2 iteratively to construct analytical performance models for each traffic class.
First, Algorithm 1 extracts all traffic classes originating from a particular queue, as shown in line 6. Next, the waiting time for each of these classes is computed separately, as each has a different dependency on other classes due to priority arbitration. At line 8, all classes that have higher priority than the current class are obtained. In lines 11–16, the structural transformation described in Section 4.1 is applied. To this end, the coefficient of variation of the inter-departure time of each higher-priority class is computed. Through the structural transformation, the reference waiting time for the current class is obtained, as depicted in line 17 of the algorithm. At line 18, we compute the modified service time of the current class following the method described in Section 4.2. Using the reference waiting time and the modified service time, the residual time is computed (line 19). Using the residual time expressions for all classes in a queue, we obtain waiting time expressions for each class separately, as shown in line 23 of the algorithm.
Figure 10 illustrates the proposed approach on a representative example of a priority-based network to decompose the system. Figure 10(a) shows the original queuing network. This network consists of three queues: Q, Q, and Q. Q stores flits from class-1 and class-2 flows, while Q buffers class-3 and class-4 flits. Flits of class-2 have higher priority than both class-3 and class-4, as denoted by the first port of the switch that connects these flows. Finally, class-5 flits are stored in Q. We note that class-5 flits have lower priority than that of class-3, while they are independent of class-2 and class-4 flits. To solve this queuing system, we first apply the structural transformation on class-1 and class-2 of Q by bypassing class-2 flits to Q as shown in Figure 10(b). Next, the service rate transformation on class-3 and class-4 is applied to obtain modified service time . This transformation allows us to form the network by decomposing Q and Q, as depicted in Figure 10(c). After that, structural transformation is applied on class-3 as flits of class-3 have higher priority than those of class-5. Finally, service rate transformation is performed on class-5 to achieve a fully decomposed system, which is shown in Figure 10(e).
Automation of Model Generation Technique: We developed a framework to automatically generate the analytical performance model for NoCs with arbitrary size 2D Mesh and ring topologies. The proposed framework operates in two steps. In the first step, we extract all architecture-related information of the NoC. This includes information about the traffic classes in each queue and priority relations between classes. In the second step, the automation framework uses this architecture information to generate analytical models.
6. Experimental Evaluations
6.1. Experimental Setup
We applied the proposed analytical models to a widely used priority-based industrial NoC design (jeffers2016intel, ). We implemented the proposed analytical models in C and observed that on average it takes to calculate latency value per source-to-destination pair. At each router of the NoC, there are queues in which tokens wait to be routed. This NoC design incorporates deterministic service time across all queues. We compared average latency values in the steady state found in this approach against an industrial cycle-accurate simulator written in SystemC (Xplore, ; ogras2012energy, ). We ran each simulation for 10 million cycles to obtain steady state latency values, with a warm-up period of 5000 cycles. Average latency values are obtained by averaging latencies of all flits injected after the warm-up period. Injection rates are swept from to . Beyond , server utilization becomes greater than one, which is not practical. We show the average latency of flits as a function of the flit injection rate for different NoC topologies. We also present experimental results considering the cache coherency protocol with different hit rates, network topologies, and floorplans. With a decreasing hit rate, traffic towards the memory controller increases, leading to more congestion in the network.
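The steady-state averaging described above can be sketched as follows. The record layout (injection cycle, latency) and the function name are assumptions made for illustration; the industrial simulator's actual output format is not specified here.

```python
def steady_state_latency(flits, warmup_cycles=5000):
    """Average latency over flits injected after the warm-up period.

    `flits` is an iterable of (injection_cycle, latency_in_cycles) pairs;
    this record layout is hypothetical.
    """
    samples = [latency for cycle, latency in flits if cycle >= warmup_cycles]
    if not samples:
        raise ValueError("no flits were injected after the warm-up period")
    return sum(samples) / len(samples)

# Toy records: the first two flits fall inside the warm-up period.
records = [(100, 50), (4999, 40), (5000, 10), (9000, 14)]
average = steady_state_latency(records)
print(average)
```

Discarding the warm-up samples keeps the initially empty network from biasing the average toward unrealistically low latencies.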
6.2. Full-System Simulations on gem5
Applications are profiled in the full-system simulator gem5 (Binkert2011Gem5, ) using Linux ‘perf’ tools (de2010new, ). The ‘perf’ tool captures the time taken by each function call and its children in the gem5 source. It represents the statistics through a function call graph. From this call graph, we obtain the time taken by the functions related to Garnet2.0, which is the on-chip interconnect for gem5. Figure 11 shows the components of Garnet2.0 that take up a significant portion of the total simulation time while running the Streamcluster application on gem5. These components are router, network-link, and functional write. The ‘other components’ shown in Figure 11 consist of the functions not related to network simulation. We observe that the functional write takes 50%, and the whole network takes around 60% of the total simulation time in this case.
Simulation Time: To evaluate the decrease in simulation time with the proposed approach, we first run the Streamcluster application with a 16-core CPU on gem5 in full-system mode using Garnet2.0, a cycle-accurate network simulator. Then, we repeat the same simulation by replacing the cycle-accurate simulation with the proposed analytical model. The total simulation time is reduced from 12,466 seconds to 4,986 seconds when we replace the cycle-accurate NoC simulations with the proposed analytical models. Hence, we achieve a 2.5× speedup in cycle-accurate full-system simulation with the proposed NoC performance analysis technique.
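The reported speedup can be cross-checked against the profiling result of Section 6.2 with a few lines of arithmetic. The sketch below assumes that the analytical model adds negligible run time of its own, which is an idealization introduced here.

```python
baseline_seconds = 12_466      # full-system run with cycle-accurate Garnet2.0
analytical_seconds = 4_986     # same run with the analytical NoC model

speedup = baseline_seconds / analytical_seconds
removed_fraction = 1.0 - analytical_seconds / baseline_seconds

# Upper bound on the speedup if the roughly 60% network share reported in
# Section 6.2 were eliminated entirely (Amdahl's law).
network_share = 0.60
amdahl_bound = 1.0 / (1.0 - network_share)

print(f"speedup = {speedup:.2f}x, time removed = {removed_fraction:.1%}, "
      f"bound = {amdahl_bound:.2f}x")
```

The measured speedup of about 2.5× matches the bound implied by a 60% network share, which suggests that nearly all of the network simulation time is removed in this experiment.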
6.3. Validation on Ring Architectures
This section evaluates the proposed analytical models on a priority-based ring architecture that consists of eight nodes. In this experiment, all nodes inject flits with an equal injection rate. Flits injected from a node go to other nodes with equal probability. We obtain the latency between each source-destination pair using the proposed analytical models. The simulation and analysis results are compared in Figure 12. The proposed analysis technique has only 2% error on average. The accuracy is higher at lower injection rates and degrades gradually with increasing injection rates, as expected. However, the error at the highest injection rate is only 5.2%.
6.4. Validation on Mesh Architectures
This section evaluates the proposed analytical model for 6×6 and 8×8 priority-based mesh NoCs with Y-X routing. As described in (jeffers2016intel, ), a mesh is a combination of horizontal and vertical half rings. The analytical model generation technique for priority-based NoC architectures is applied to the horizontal and vertical rings individually. Then, these latencies, as well as the time it takes to switch from one ring to the other, are used to obtain the latency for each source-destination pair. We first consider uniform random all-to-all traffic, as in Section 6.3. The comparison with the cycle-accurate simulator shows that the proposed analytical models are on average 97% and 96% accurate for the 6×6 and 8×8 meshes, as shown in Figure 13 and Figure 14, respectively. At the highest injection rate, the analytical models show 11% error for both cases.
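The composition of a mesh latency from per-ring latencies can be sketched as below. The two callables stand in for the per-ring analytical models, and their signatures, the constant switch latency, and the toy hop-based models in the usage example are hypothetical choices made for illustration.

```python
def mesh_latency(src, dst, vertical_latency, horizontal_latency, switch_latency):
    """End-to-end latency in a mesh with Y-X routing.

    `src` and `dst` are (row, col) tuples. A flit first travels along the
    vertical ring of the source column, then switches to the horizontal
    ring of the destination row.
    """
    (src_row, src_col), (dst_row, dst_col) = src, dst
    total = 0.0
    if src_row != dst_row:
        total += vertical_latency(src_col, src_row, dst_row)
    if src_row != dst_row and src_col != dst_col:
        total += switch_latency  # turn from the vertical to the horizontal ring
    if src_col != dst_col:
        total += horizontal_latency(dst_row, src_col, dst_col)
    return total

# Toy per-ring models: two cycles per hop and no contention.
def toy_vertical(col, row_from, row_to):
    return 2.0 * abs(row_to - row_from)

def toy_horizontal(row, col_from, col_to):
    return 2.0 * abs(col_to - col_from)

latency = mesh_latency((0, 0), (2, 3), toy_vertical, toy_horizontal, 1.0)
print(latency)
```

In an actual flow, the toy callables would be replaced by the waiting-time and service-time estimates that the analytical models produce for each ring.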
Comparison to Prior Techniques: We compare the proposed analytical models to the existing priority-aware analytical models in the literature (walraevens2004discrete, ). Since these techniques do not consider multiple priority traffic classes in the network, they fail to accurately estimate the end-to-end latency. For example, Figure 13 and Figure 14 show that they overestimate the NoC latency at high injection rates for 6×6 and 8×8 mesh networks, respectively. In contrast, since it captures the interactions between different classes, the proposed technique is able to estimate latencies accurately. Finally, we analyze the impact of using each transformation individually. If we apply only the Structural Transformation (ST), then the latency is severely underestimated at higher injection rates, since contentions are not captured accurately. In contrast, applying only the Service Rate Transformation (RT) results in overestimating the latency at higher injection rates as the model becomes pessimistic.
Impact of coefficient of variation: One of the important parameters in our analytical model is the coefficient of variation of the inter-arrival time. When the inter-arrival time between the incoming flits follows a geometric distribution, an increasing coefficient of variation implies a larger mean inter-arrival time. Hence, the average flit latency is expected to decrease with an increasing coefficient of variation. Indeed, the simulation and analysis results demonstrate this behavior for a 6×6 mesh in Figure 15. We observe that the proposed technique accurately estimates the average latency in comparison to cycle-accurate simulation. On average, the analytical models are 97% accurate with respect to the latency obtained from the simulation in this case.
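The link between the coefficient of variation and the mean inter-arrival time follows from the moments of the geometric distribution. In the short derivation below, $A$ denotes the inter-arrival time in cycles, $\lambda$ the per-cycle arrival probability, and $C_A$ the coefficient of variation of $A$; this notation is introduced here for illustration.

```latex
\begin{align}
P(A = k) &= \lambda (1-\lambda)^{k-1}, \quad k = 1, 2, \dots \\
E[A] &= \frac{1}{\lambda}, \qquad \mathrm{Var}[A] = \frac{1-\lambda}{\lambda^{2}}, \\
C_A^{2} &= \frac{\mathrm{Var}[A]}{E[A]^{2}} = 1 - \lambda
\quad\Longrightarrow\quad E[A] = \frac{1}{1 - C_A^{2}} .
\end{align}
```

A larger $C_A$ therefore corresponds to a smaller $\lambda$ and a longer mean inter-arrival time. For instance, $\lambda = 0.2$ gives $C_A^{2} = 0.8$ and $E[A] = 5$ cycles.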
Evaluation with Intel® Xeon® Scalable Server Processor Architecture: This section evaluates the proposed analytical model with the floorplan of a variant of the Intel® Xeon® Scalable Server Processor Architecture (doweck2017inside, ). This version of the Xeon server has 26 cores, 26 banks of the last level cache (LLC), and 2 memory controllers.
The cores and LLC are distributed on a 6×6 mesh NoC. The comparison of simulation and proposed analytical models with this floorplan is shown in Figure 16. On average, the accuracy is 98% when all cores send flits to all caches with equal injection rates. Similar to the evaluations on the 6×6 and 8×8 meshes, the state-of-the-art NoC performance analysis technique (walraevens2004discrete, ) highly overestimates the average latency for this server architecture, as shown in Figure 16. Applying only ST underestimates the average latency and applying only RT overestimates the average latency.
The NoC latency is a function of the traffic class, since higher priority classes experience less contention. To demonstrate the latency for different classes, we present the NoC latencies for 9 representative traffic classes of the server architecture described above. Figure 17 shows the latency of each class normalized with respect to the average latency obtained from the simulation. Higher priority classes experience lower latency, as expected. The proposed performance analysis technique achieves 91% accuracy on average for the classes which have the lowest priority in the NoC. For the classes having medium and highest priority, the accuracy is 99% on average. Therefore, the proposed technique is reliable for all classes with different levels of priority.
Finally, we evaluate the proposed technique with different LLC hit rates. Table 2 shows that the proposed approach achieves over 97% accuracy in estimating the average latency of the address network for all hit rates. Similarly, the latencies in the data network are estimated with 98% or greater accuracy for 0% and 50% hit rates. The accuracy drops to 93.9% for a 100% hit rate, since this scenario leads to the highest level of congestion due to all-to-all traffic behavior.
6.5. Evaluation with Real Applications
This section evaluates the proposed technique with real applications. We use gem5 (Binkert2011Gem5, ) to extract traces of applications in Full-System (FS) mode. Garnet2.0 (agarwal2009garnet, ) is used as the network simulator in gem5 with the Ruby memory system. Table 3 shows the configuration settings we used for FS simulation in gem5.
|Parameter|Value|
|---|---|
|Processor: Number of Cores|16|
|Processor: Frequency of Cores|2 GHz|
|Routing Algorithm|X-Y deterministic|
|Memory Size|3 GB|
We collect traces of six 16-threaded applications from the PARSEC (bienia2008parsec, ) benchmark suite: Blackscholes, Canneal, Swaptions, Bodytrack, Fluidanimate, and Streamcluster. We selected applications that show relatively higher network utilization, as discussed in (wettin2014performance, ). The accuracy obtained for these applications is an important indicator of the practicality of the proposed technique, since real applications do not necessarily comply with a known inter-arrival time distribution (bogdan2011non, ), such as the geometric distribution used in this work. The traces are parsed and simulated through our custom in-house simulator with a priority-based router model. For each application, a window of one million cycles with the highest injection rate is chosen for simulation. From the traces of these applications, we obtain the average injection rate of each source and destination pair. These injection rates are fed to our analytical models to obtain the average latency.
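The trace post-processing described above can be sketched as follows. This is only an illustration under stated assumptions: the trace is taken to be a list of `(cycle, src, dst)` tuples with one entry per injected flit, which is a hypothetical format rather than the gem5/Garnet trace layout, and the busiest window is searched over non-overlapping windows only. Both function names are made up for this example.

```python
from collections import Counter


def busiest_window_start(trace, window):
    """Start cycle of the non-overlapping window with the most injected flits.

    `trace` is a list of (cycle, src, dst) tuples (hypothetical format).
    Ties are resolved in favor of the earliest window.
    """
    per_window = Counter(cycle // window for cycle, _, _ in trace)
    if not per_window:
        raise ValueError("empty trace")
    best_index, _ = max(per_window.items(), key=lambda kv: (kv[1], -kv[0]))
    return best_index * window


def injection_rates(trace, start_cycle, window):
    """Average injection rate (flits/cycle) of each (src, dst) pair in a window."""
    if window <= 0:
        raise ValueError("window must be positive")
    end_cycle = start_cycle + window
    counts = Counter(
        (src, dst)
        for cycle, src, dst in trace
        if start_cycle <= cycle < end_cycle
    )
    return {pair: n / window for pair, n in counts.items()}


if __name__ == "__main__":
    toy_trace = [(0, 0, 1), (1, 0, 1), (5, 2, 3),
                 (12, 0, 1), (13, 0, 1), (14, 2, 3), (15, 0, 1)]
    start = busiest_window_start(toy_trace, window=10)
    print(start, injection_rates(toy_trace, start, window=10))
```

The resulting dictionary of per-pair rates plays the role of the input that is fed to the analytical models; a sliding-window search could replace the non-overlapping one if a finer choice of window is needed.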
Figure 18 shows the comparison of the average latency between the proposed analytical model and the simulation. The x-axes represent the mean absolute percentage error (MAPE) between the average simulation latency ($L_{sim}$) and the average latency obtained from analytical models ($L_{analytical}$). MAPE is defined by the following equation:
$$\mathrm{MAPE} = 100 \times \frac{\left| L_{sim} - L_{analytical} \right|}{L_{sim}}$$
The y-axes in the plots represent the percentage of source-destination pairs having the corresponding MAPE. From this figure, we observe that the latency obtained from the proposed analytical model is always within 10% of the latency reported by the cycle-accurate simulations. In particular, only 1% of the source-destination pairs have a MAPE of 10% for the Canneal application. On average, the analytical models have 3% error in comparison to the latency obtained from the simulation for real applications. These results demonstrate that our technique achieves high accuracy for applications which may have arbitrary inter-arrival time distributions.
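The per-pair error and the distribution plotted on the y-axes can be computed along the following lines. This is a minimal sketch, assuming the error of a pair is 100 × |simulation latency − model latency| / simulation latency and that latencies are stored in dictionaries keyed by `(src, dst)`; the function names and the 1% bin width are illustrative choices.

```python
def mape(sim_latency, model_latency):
    """Absolute percentage error of the model latency relative to simulation."""
    if sim_latency <= 0:
        raise ValueError("simulation latency must be positive")
    return 100.0 * abs(sim_latency - model_latency) / sim_latency


def mape_histogram(sim, model, bin_width=1.0):
    """Percentage of source-destination pairs that fall in each MAPE bin.

    `sim` and `model` map (src, dst) -> average latency. Only pairs present
    in both dictionaries are counted. Keys of the result are the lower
    edges of the bins.
    """
    pairs = sim.keys() & model.keys()
    if not pairs:
        raise ValueError("no common source-destination pairs")
    counts = {}
    for pair in pairs:
        bin_index = int(mape(sim[pair], model[pair]) // bin_width)
        counts[bin_index] = counts.get(bin_index, 0) + 1
    return {
        index * bin_width: 100.0 * n / len(pairs)
        for index, n in sorted(counts.items())
    }


if __name__ == "__main__":
    sim = {(0, 1): 20.0, (0, 2): 40.0}
    model = {(0, 1): 21.0, (0, 2): 40.0}
    print(mape_histogram(sim, model))
```

With the toy inputs above, half of the pairs land in the 0% bin and half in the 5% bin.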
We further divide the window of one million cycles into 10 smaller windows containing 100,000 cycles each. The average latency comparison for the Streamcluster application in these smaller windows is shown in Figure 19. The largest MAPE between the latency obtained from the simulation and the analytical model is 7%, observed for window 10. On average, the proposed analytical models are 98% accurate for these 10 windows. This confirms the reliability of the proposed analytical models at an even more granular level for the application. Finally, we note that the experiments with synthetic traffic shown in Section 6.3 and Section 6.4 exercise higher injection rates than these applications. Hence, the proposed technique performs well both under real application traces and heavy traffic scenarios.
Prior work showed that the deviation from the Poisson distribution becomes larger as the network load approaches saturation (ogras2010analytical, ). Similar to this result, we also observe that the geometric distribution assumption is a good approximation until the NoC operates near its saturation point. Therefore, we obtain high accuracy for real application workloads. Since this accuracy can degrade with increasing traffic load, we plan to generalize the proposed models by relaxing the assumption of a geometric distribution in our future work.
In this work, we propose an approach to build analytical models for priority-based NoCs with multi-class flits. As we emphasized, no prior work has presented analytical models that consider priority arbitration and multi-class flits in a single queue simultaneously. Such a priority-based queuing network is decomposed into independent queues using novel transformations proposed in this work. We evaluate the efficiency of the proposed approach by computing the end-to-end latency of flits in a realistic industrial platform and using real application benchmarks. Our extensive evaluations show that the proposed technique achieves a high accuracy of 97% compared to cycle-accurate simulations for different network sizes and traffic flows.
-  N. Agarwal et al. GARNET: A Detailed on-chip Network Model Inside a Full-system Simulator. In 2009 IEEE intl. symp. on performance analysis of systems and software, pages 33–42.
-  I. Awan and R. Fretwell. Analysis of Discrete-Time Queues with Space and Service Priorities for Arbitrary Arrival Processes. In Parallel and Distributed Systems. Proc. 11th Intl Conf. on, volume 2, pages 115–119, 2005.
-  A. Bartolini et al. A Virtual Platform Environment For Exploring Power, Thermal And Reliability Management Control Strategies In High-Performance Multicores. In Proc. of the Great lakes Symp. on VLSI, pages 311–316, 2010.
-  A. W. Berger and W. Whitt. Workload Bounds in Fluid Models with Priorities. Performance evaluation, 41(4):249–267, 2000.
-  D. P. Bertsekas, R. G. Gallager, and P. Humblet. Data Networks, volume 2. Prentice-Hall International New Jersey, 1992.
-  C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proc. of the Intl. Conf. on Parallel Arch. and Compilation Tech., pages 72–81, 2008.
-  N. Binkert et al. The Gem5 Simulator. SIGARCH Comp. Arch. News, May 2011.
-  P. Bogdan and R. Marculescu. Non-stationary Traffic Analysis and its Implications on Multicore Platform Design. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 30(4):508–519, 2011.
-  G. Bolch, S. Greiner, H. De Meer, and K. S. Trivedi. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. John Wiley & Sons, 2006.
-  W. Choi et al. On-Chip Communication Network for Efficient Training of Deep Convolutional Networks on Heterogeneous Manycore Systems. IEEE Trans. on Computers, 67(5):672–686, 2017.
-  A. C. de Melo. The New Linux Perf Tools. In Linux Kongress, volume 18, 2010.
-  J. Doweck et al. Inside 6th-generation Intel Core: New Microarchitecture Code-named Skylake. IEEE Micro, (2):52–62, 2017.
-  S. Ikehara and M. Miyazaki. Approximate Analysis of Queueing Networks with Nonpreemptive Priority Scheduling. In Proc. 11th Int. Teletraffic Congr.
-  J. Jeffers, J. Reinders, and A. Sodani. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann, 2016.
-  N. Jiang et al. A Detailed and Flexible Cycle-accurate Network-on-chip Simulator. In 2013 IEEE Intl. Symp. on Performance Analysis of Systems and Software (ISPASS), pages 86–96.
-  X. Jin and G. Min. Modelling and Analysis of Priority Queueing Systems with Multi-class Self-similar Network Traffic: a Novel and Efficient Queue-decomposition Approach. IEEE Trans. on Communications, 57(5), 2009.
-  J. A. Kahle et al. Introduction to the Cell multiprocessor. IBM journal of Research and Development, 49(4.5):589–604, 2005.
-  H. Kashif and H. Patel. Bounding Buffer Space Requirements for Real-time Priority-aware Networks. In Asia and South Pacific Design Autom. Conf., pages 113–118, 2014.
-  C. N. Keltcher, K. J. McGrath, A. Ahmed, and P. Conway. The AMD Opteron Processor for Multiprocessor Servers. IEEE Micro, 23(2):66–76, 2003.
-  A. E. Kiasari, Z. Lu, and A. Jantsch. An Analytical Latency Model for Networks-on-Chip. IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 21(1):113–123, 2013.
-  R. Leupers et al. Virtual Manycore platforms: Moving towards 100+ processor cores. In Proc. of DATE, pages 1–6, 2011.
-  P. S. Magnusson et al. Simics: A Full System Simulation Platform. Computer, 35(2):50–58.
-  U. Y. Ogras, P. Bogdan, and R. Marculescu. An Analytical Approach for Network-on-Chip Performance Analysis. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 29(12):2001–2013, 2010.
-  U. Y. Ogras, Y. Emre, J. Xu, T. Kam, and M. Kishinevsky. Energy-Guided Exploration of On-Chip Network Design for Exa-Scale Computing. In Proc. of Intl. Workshop on System Level Interconnect Prediction, pages 24–31, 2012.
-  U. Y. Ogras, M. Kishinevsky, and S. Chatterjee. xPLORE: Communication Fabric Design and Optimization Framework. Developed at Strategic CAD Labs, Intel Corp.
-  P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh. Performance Evaluation and Design Trade-offs for Network-on-Chip Interconnect Architectures. IEEE transactions on Computers, 54(8):1025–1040, 2005.
-  A. Patel et al. MARSS: a Full System Simulator for Multicore x86 CPUs. In Design Autom. Conf., pages 1050–1055, 2011.
-  Y. Qian, Z. Lu, and W. Dou. Analysis of Worst-case Delay Bounds for Best-effort Communication in Wormhole Networks on Chip. In 2009 3rd ACM/IEEE Intl. Symp. on Networks-on-Chip, pages 44–53.
-  Z.-L. Qian et al. A Support Vector Regression (SVR)-based Latency Model for Network-on-Chip (NoC) Architectures. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 35(3):471–484, 2015.
-  A. Rico et al. ARM HPC Ecosystem and the Reemergence of Vectors. In Proc. of the Computing Frontiers Conf., pages 329–334. ACM, 2017.
-  E. Rotem and S. P. Engineer. Intel Architecture, Code Name Skylake Deep Dive: A New Architecture to Manage Power Performance and Energy Efficiency. In Intel Developer Forum, 2015.
-  M. P. Singh and M. K. Jain. Evolution of Processor Architecture in Mobile Phones. Intl. Journ. of Computer Applications, 90(4), 2014.
-  J. Walraevens. Discrete-time Queueing Models with Priorities. PhD thesis, Ghent University, 2004.
-  P. Wettin et al. Performance Evaluation of Wireless NoCs in Presence of Irregular Network Routing Strategies. In Proc. of the conf. on DATE, page 272, 2014.
-  Y. Wu et al. Analytical Modelling of Networks in Multicomputer Systems under Bursty and Batch Arrival Traffic. The Journ. of Supercomputing, 51(2):115–130, 2010.
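Residual time calculation: Residual time is the delay in serving the next flit caused by the remaining service time of the flit currently being processed. As illustrated in Figure 20, class-2 flits (low-priority flits) have to wait until the server becomes free. Equations Appendix A and 16 express the total residual time experienced by class-1 and class-2 flits, respectively. The analytical models for residual time have been well studied for continuous systems, yet little attention has been given to discrete systems. In this section, we derive analytical expressions of residual time for priority-based queuing systems using discrete time domain analysis. These equations are derived assuming that the arrival process at each queue follows a geometric distribution. The average residual time for each class of flit is evaluated by averaging the area of the residual time triangles shown in Figure 20 over all flits injected in a large amount of time. Let us assume that $M_1$ and $M_2$ are the total numbers of flits of class-1 and class-2, respectively, which are injected into the system in $T$ amount of time, and $T_{ij}$ is the service time of the $j$-th flit of class-$i$. When a new service duration of $T_{ij}$ begins, the residual time of class-$i$ ($R_i$) starts at $T_{ij}-1$ and decays linearly. If we take the time average of the residual time, then we obtain Equation Appendix A:
$$\hat{R}_1 = \frac{1}{T}\sum_{j=1}^{M_1}\sum_{k=1}^{T_{1j}-1} k + \frac{1}{T}\sum_{j=1}^{M_2}\sum_{k=1}^{T_{2j}-1} k = \frac{1}{T}\sum_{j=1}^{M_1}\frac{T_{1j}\,(T_{1j}-1)}{2} + \frac{1}{T}\sum_{j=1}^{M_2}\frac{T_{2j}\,(T_{2j}-1)}{2}$$
In the derivation of Equation Appendix A, $k$ is an auxiliary variable that represents different residual time values for a particular flit. Multiplying and dividing the first expression in the summation by $M_1$ and the second expression by $M_2$, we obtain:
$$\hat{R}_1 = \frac{1}{2}\,\frac{M_1}{T}\,\frac{\sum_{j=1}^{M_1}\left(T_{1j}^2 - T_{1j}\right)}{M_1} + \frac{1}{2}\,\frac{M_2}{T}\,\frac{\sum_{j=1}^{M_2}\left(T_{2j}^2 - T_{2j}\right)}{M_2} \overset{(a)}{=} \frac{1}{2}\lambda_1\left(\overline{T_1^2} - \overline{T_1}\right) + \frac{1}{2}\lambda_2\left(\overline{T_2^2} - \overline{T_2}\right) \overset{(b)}{=} R_1 + R_2$$
Where (a) follows from the fact that $M_i/T$ is the average number of injected flits in one time unit, which is $\lambda_i$. Also, $\overline{T_i}$ and $\overline{T_i^2}$ denote the first and second order moments of the service time of class-$i$. We obtain (b) because $R_i$ is the residual time of class-$i$ ($R_i = \frac{1}{2}\lambda_i(\overline{T_i^2} - \overline{T_i})$).
Similarly, we compute the effective residual time of class-2 ($\hat{R}_2$). At any cycle, both class-1 and class-2 flits can arrive in the system. If the server is empty at that time, then service starts for the class-1 flit, as it has higher priority. Therefore, the portion of residual time that occurs due to class-1 flits decays linearly from $T_{1j}$ instead of $T_{1j}-1$. Hence, $\hat{R}_2$ can be written as:
$$\hat{R}_2 = \frac{1}{T}\sum_{j=1}^{M_1}\sum_{k=1}^{T_{1j}} k + \frac{1}{T}\sum_{j=1}^{M_2}\sum_{k=1}^{T_{2j}-1} k = \frac{1}{2}\lambda_1\left(\overline{T_1^2} + \overline{T_1}\right) + \frac{1}{2}\lambda_2\left(\overline{T_2^2} - \overline{T_2}\right) = R_1 + R_2 + \rho_1 \tag{16}$$
In general, if there are N classes in total, the residual time expression for flits of class-$i$ will be:
$$\hat{R}_i = \sum_{k=1}^{N} R_k + \sum_{k=1}^{i-1} \rho_k$$
where $R_k = \frac{1}{2}\lambda_k(\overline{T_k^2} - \overline{T_k})$ for Geo/G/1 queues and $\rho_k = \lambda_k \overline{T_k}$ is the average utilization of the server for class-$k$ flits.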
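The residual time bookkeeping above can be illustrated numerically. The sketch below rests on assumptions made for this example: each class k contributes 0.5 · λ_k · (E[T_k²] − E[T_k]) to the residual time, and a class additionally sees the utilization λ_k · E[T_k] of every class with higher priority, following the "decays from the full service time instead of one cycle less" argument. The function name, argument names, and the closed forms are this sketch's assumptions and are not copied from the equations of the text.

```python
def effective_residual_times(lam, mean_t, mean_t2):
    """Effective residual time seen by each class (index 0 = highest priority).

    lam[k]     : arrival rate of class k in flits/cycle
    mean_t[k]  : first moment of the class-k service time, E[T_k]
    mean_t2[k] : second moment of the class-k service time, E[T_k^2]
    """
    if not (len(lam) == len(mean_t) == len(mean_t2)):
        raise ValueError("all inputs must have one entry per class")
    # Per-class contribution from the area of the discrete residual triangles.
    per_class = [0.5 * l * (m2 - m1) for l, m1, m2 in zip(lam, mean_t, mean_t2)]
    # Server utilization of each class.
    rho = [l * m1 for l, m1 in zip(lam, mean_t)]
    total = sum(per_class)
    result = []
    higher_priority_rho = 0.0
    for k in range(len(lam)):
        # Every class sees all triangles plus one extra cycle per
        # higher-priority service, which adds up to their utilization.
        result.append(total + higher_priority_rho)
        higher_priority_rho += rho[k]
    return result


if __name__ == "__main__":
    # Two classes with a deterministic 2-cycle service time.
    print(effective_residual_times([0.1, 0.2], [2.0, 2.0], [4.0, 4.0]))
```

For the two-class toy input, the second class sees the first class's utilization (0.2) on top of the shared residual term.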
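Average Queuing Time Expressions: The queuing time expressions for tokens with traffic class-1 and class-2 can be written as:
$$W_1 = \frac{R_1 + R_2}{1 - \rho_1}, \qquad W_2 = \frac{R_1 + R_2 + \rho_1 + \rho_1 W_1}{1 - \rho_1 - \rho_2}$$
In general, if there are N classes in total, the queuing time expression for class-$i$ flits will be
$$W_i = \frac{\sum_{k=1}^{N} R_k + \sum_{k=1}^{i-1} \rho_k + \sum_{k=1}^{i-1} \rho_k W_k}{1 - \sum_{k=1}^{i} \rho_k}$$
where $W_i$ is the average queuing time of class-$i$, $R_k$ is the residual time of class-$k$, and $\rho_k$ is the average utilization of the server for class-$k$ flits.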
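A recursive evaluation of the per-class queuing time can be sketched as below. The recursion is an assumption of this example, based on the usual non-preemptive priority argument: a class waits for its effective residual time plus the backlog of higher-priority classes, inflated by the utilization of its own and all higher-priority classes. It may differ in detail from the expressions intended in the text, and the function and argument names are illustrative.

```python
def queuing_times(residual, rho):
    """Average queuing time of each class (index 0 = highest priority).

    residual[i] : effective residual time seen by class i (cycles)
    rho[i]      : server utilization due to class i

    Assumed recursion (sketch, not taken verbatim from the text):
        W_i = (residual_i + sum_{k<i} rho_k * W_k) / (1 - sum_{k<=i} rho_k)
    """
    if len(residual) != len(rho):
        raise ValueError("residual and rho must have the same length")
    waits = []
    cumulative_rho = 0.0
    backlog = 0.0  # sum of rho_k * W_k over higher-priority classes
    for r_i, rho_i in zip(residual, rho):
        cumulative_rho += rho_i
        if cumulative_rho >= 1.0:
            raise ValueError("server is saturated for this class")
        w_i = (r_i + backlog) / (1.0 - cumulative_rho)
        waits.append(w_i)
        backlog += rho_i * w_i
    return waits


if __name__ == "__main__":
    # Effective residual times of 0.3 and 0.5 cycles, utilizations 0.2 and 0.4.
    print(queuing_times([0.3, 0.5], [0.2, 0.4]))
```

In the toy case the low-priority class waits noticeably longer than the high-priority one because it absorbs both the larger residual term and the high-priority backlog.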